A collection of example applications showcasing Runpod Flash - a framework for building production-ready AI applications with distributed GPU and CPU computing.
Flash is a Python framework that lets you run functions on Runpod's Serverless infrastructure with a single decorator. Write code locally, deploy globally—Flash handles provisioning, scaling, and routing automatically.
```python
from runpod_flash import Endpoint, GpuType

@Endpoint(name="image-gen", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, dependencies=["torch", "diffusers"])
async def generate_image(prompt: str) -> bytes:
    # This runs on a cloud GPU, not your laptop
    ...
```

Key features:
- `@Endpoint` decorator: Mark any async function to run on serverless infrastructure
- Auto-scaling: Scale to zero when idle, scale up under load
- Local development: `flash run` starts a local server with hot reload
- One-command deploy: `flash deploy` packages and ships your code
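To make the decorator idea concrete, here is a toy sketch of the general pattern: a decorator factory that records configuration for an async function and returns a wrapper that is still awaitable. This is not Flash's implementation. `toy_endpoint` and `REGISTRY` are hypothetical names invented for this illustration, and nothing here contacts Runpod.

```python
import asyncio
import functools
import inspect

# Hypothetical registry for this sketch; Flash's real internals are not shown in this README.
REGISTRY = {}

def toy_endpoint(name, **config):
    """Toy stand-in for a decorator such as @Endpoint (illustration only)."""
    def decorate(func):
        if not inspect.iscoroutinefunction(func):
            raise TypeError("toy_endpoint expects an async function")
        # Record the configuration under the endpoint name.
        REGISTRY[name] = {"function": func.__name__, **config}

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            # A real framework would dispatch to a remote worker here;
            # this sketch simply runs the function in-process.
            return await func(*args, **kwargs)

        return wrapper
    return decorate

@toy_endpoint("echo", workers=(0, 3))
async def echo(data: dict) -> dict:
    return {"result": data}

result = asyncio.run(echo({"x": 1}))
```

The point of the pattern is that the decorated function keeps its normal async signature, so calling code does not change when the execution target does.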
- Python 3.10+
- uv: Install with `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Runpod account: Sign up here
```bash
# Clone and install
git clone https://github.com/runpod/flash-examples.git
cd flash-examples
uv sync && uv pip install -e .

# Authenticate with Runpod
uv run flash login

# Run all examples locally
uv run flash run
```

Open http://localhost:8888/docs to explore all endpoints.
Using pip, poetry, or conda? See DEVELOPMENT.md for alternative setups.
| Category | Example | Description |
|---|---|---|
| Getting Started | 01_hello_world | Basic GPU worker |
| Getting Started | 02_cpu_worker | CPU-only worker |
| Getting Started | 03_mixed_workers | GPU + CPU pipeline |
| Getting Started | 04_dependencies | Dependency management |
| ML Inference | 01_text_to_speech | Qwen3-TTS model serving |
| Advanced | 05_load_balancer | HTTP routing with load balancer |
| Scaling | 01_autoscaling | Worker autoscaling configuration |
| Data | 01_network_volumes | Persistent storage with network volumes |
More examples coming soon in each category.
```bash
flash login               # Authenticate with Runpod (opens browser)
flash run                 # Run development server (localhost:8888)
flash build               # Build deployment package
flash deploy --env <name> # Build and deploy to environment
flash undeploy <name>     # Delete deployed endpoint
```

See CLI-REFERENCE.md for complete documentation.
The Endpoint class configures functions for execution on Runpod's serverless infrastructure:
Queue-based (one function = one endpoint):
```python
from runpod_flash import Endpoint, GpuType

@Endpoint(name="my-worker", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, workers=(0, 3), dependencies=["torch"])
async def process(data: dict) -> dict:
    import torch
    # this code runs on Runpod GPUs
    return {"result": "processed"}
```

Load-balanced (multiple routes, shared workers):
```python
from runpod_flash import Endpoint

api = Endpoint(name="my-api", cpu="cpu3c-1-2", workers=(1, 3))

@api.get("/health")
async def health():
    return {"status": "ok"}

@api.post("/compute")
async def compute(data: dict) -> dict:
    return {"result": data}
```

Client mode (connect to an existing endpoint):
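```python
from runpod_flash import Endpoint

# Uses top-level await: run inside an async function, a notebook, or `python -m asyncio`
ep = Endpoint(id="ep-abc123")
job = await ep.run({"prompt": "hello"})
await job.wait()
print(job.output)
```

GPU Workers (gpu=):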
| Type | Use Case |
|---|---|
| `GpuType.NVIDIA_GEFORCE_RTX_4090` | RTX 4090 (24GB) |
| `GpuType.NVIDIA_RTX_6000_ADA_GENERATION` | RTX 6000 Ada (48GB) |
| `GpuType.NVIDIA_A100_80GB_PCIe` | A100 (80GB) |
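When choosing among the GPUs listed above, one simple rule is to pick the smallest card whose memory fits the model. The sketch below encodes only the memory sizes from the table; `smallest_gpu_for` is a hypothetical helper, not part of Flash, and it uses plain strings that mirror the `GpuType` member names so that it runs without the library installed.

```python
# Memory sizes copied from the GPU table; keys mirror the GpuType member names.
GPU_MEMORY_GB = {
    "NVIDIA_GEFORCE_RTX_4090": 24,
    "NVIDIA_RTX_6000_ADA_GENERATION": 48,
    "NVIDIA_A100_80GB_PCIe": 80,
}

def smallest_gpu_for(required_gb: float) -> str:
    """Return the listed GPU with the least memory that still fits required_gb."""
    candidates = [
        (memory, name)
        for name, memory in GPU_MEMORY_GB.items()
        if memory >= required_gb
    ]
    if not candidates:
        raise ValueError(f"no listed GPU has {required_gb} GB of memory")
    # min() compares on memory first, so the smallest sufficient card wins.
    return min(candidates)[1]

choice = smallest_gpu_for(40)
```

Memory is only one factor; throughput and price also matter and are not modeled here.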
CPU Workers (cpu=):
| Type | Specs |
|---|---|
| `cpu3g-2-8` | 2 vCPU, 8GB RAM |
| `cpu3c-4-8` | 4 vCPU, 8GB RAM (Compute) |
| `cpu5c-4-16` | 4 vCPU, 16GB RAM (Latest) |
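The CPU type IDs in the table appear to follow a `cpu<generation><flavor>-<vCPUs>-<RAM in GB>` pattern. That reading is inferred from the three rows above and is an assumption, not a documented format. Under that assumption, a small parser (the names `CpuSpec` and `parse_cpu_type` are hypothetical) could look like this:

```python
import re
from typing import NamedTuple

class CpuSpec(NamedTuple):
    generation: int  # assumed meaning of the digit after "cpu"
    flavor: str      # assumed meaning of the letter, e.g. "c" or "g"
    vcpus: int
    ram_gb: int

# Pattern inferred from IDs such as cpu3g-2-8 and cpu5c-4-16 (an assumption).
_CPU_TYPE = re.compile(r"cpu(\d+)([a-z])-(\d+)-(\d+)")

def parse_cpu_type(type_id: str) -> CpuSpec:
    """Split a CPU type ID into its assumed components."""
    match = _CPU_TYPE.fullmatch(type_id)
    if match is None:
        raise ValueError(f"unrecognized CPU type: {type_id!r}")
    generation, flavor, vcpus, ram_gb = match.groups()
    return CpuSpec(int(generation), flavor, int(vcpus), int(ram_gb))

spec = parse_cpu_type("cpu3c-4-8")
```

Here `spec.vcpus` is 4 and `spec.ram_gb` is 8, which matches the corresponding table row.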
Workers automatically scale based on demand:
- `workers=(0, 3)` - Scale from 0 to 3 workers (cost-efficient)
- `workers=(1, 5)` - Keep 1 warm, scale up to 5
- `idle_timeout=5` - Minutes before scaling down
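The effect of a `workers=(min, max)` range can be pictured as clamping demand between two bounds. The sketch below shows that idea only: Flash's actual scaling policy is not described in this README, and `desired_workers` and its `jobs_per_worker` parameter are hypothetical.

```python
import math

def desired_workers(queued_jobs: int, workers: tuple[int, int], jobs_per_worker: int = 1) -> int:
    """Clamp the number of workers needed for the queue into the (min, max) range."""
    min_workers, max_workers = workers
    # Workers needed if each one handles jobs_per_worker jobs at a time.
    needed = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))

idle_cold = desired_workers(0, (0, 3))    # scale-to-zero range, no load
idle_warm = desired_workers(0, (1, 5))    # one worker stays warm
under_load = desired_workers(10, (0, 3))  # capped at the maximum
```

With `(0, 3)` an idle endpoint costs nothing but pays a cold start on the first request, while `(1, 5)` trades a constantly running worker for lower latency.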
See CONTRIBUTING.md for contribution guidelines and DEVELOPMENT.md for development setup.
MIT License - see LICENSE for details.