Flash supports two execution models: queue-based (QB) and load-balanced (LB). This guide covers creating load-balanced endpoints with HTTP routing using the `Endpoint` class.
Queue-Based (`@Endpoint(...)` on a function)
- Requests queued and processed sequentially
- Automatic retry logic on failure
- Higher latency (queuing + processing)
- One function per endpoint
Load-Balanced (`ep = Endpoint(...)` + `@ep.post("/path")`)
- Requests routed directly to available workers
- Direct HTTP execution, no queue
- Lower latency (direct HTTP)
- Multiple routes on a single endpoint
Use Load-Balanced when you need:
- Low latency API endpoints
- Custom HTTP routing (GET, POST, PUT, DELETE, PATCH)
- Multiple routes sharing the same workers
- REST API semantics
Use Queue-Based when you need:
- Automatic retry logic
- Sequential, fault-tolerant processing
- Simple request/response pattern
```python
from runpod_flash import Endpoint, GpuGroup

# create a load-balanced endpoint
api = Endpoint(name="example-api", gpu=GpuGroup.ADA_24, workers=(1, 3))

@api.post("/greet")
async def greet_user(name: str):
    return {"message": f"Hello, {name}!"}

@api.get("/health")
async def health():
    return {"status": "ok"}
```

Key points:
- Create an `Endpoint` instance with a name and compute config
- Use `.get()`, `.post()`, `.put()`, `.delete()`, `.patch()` to register routes
- All routes on the same `Endpoint` share the same workers
- Paths must start with `/`
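Conceptually, each route decorator records a (method, path) pair in the endpoint's route registry while leaving the decorated function unchanged. The following is an illustrative sketch of that general pattern in plain Python; it is not the actual runpod_flash implementation:

```python
# Illustrative sketch of decorator-based route registration.
# NOT the runpod_flash internals -- just the general pattern.

class MiniEndpoint:
    def __init__(self, name: str):
        self.name = name
        self.routes = {}  # (method, path) -> handler

    def _register(self, method: str, path: str):
        if not path.startswith("/"):
            raise ValueError("path must start with '/'")
        def decorator(func):
            self.routes[(method, path)] = func
            return func  # the function itself is returned unchanged
        return decorator

    def get(self, path: str):
        return self._register("GET", path)

    def post(self, path: str):
        return self._register("POST", path)

demo_api = MiniEndpoint("demo")

@demo_api.get("/health")
def health():
    return {"status": "ok"}

print(("GET", "/health") in demo_api.routes)  # True
```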
```python
from runpod_flash import Endpoint

api = Endpoint(name="user-service", cpu="cpu3c-1-2", workers=(1, 5))

@api.get("/users")
def list_users():
    return {"users": []}

@api.post("/users")
async def create_user(name: str, email: str):
    return {"id": 1, "name": name, "email": email}

@api.get("/users/{user_id}")
def get_user(user_id: int):
    return {"id": user_id, "name": "Alice"}

@api.delete("/users/{user_id}")
async def delete_user(user_id: int):
    return {"deleted": True}
```

When deployed, a single endpoint is created with all four HTTP routes registered. FastAPI handles routing to the correct function.
The following paths are reserved and cannot be used:
- `/ping` -- health check endpoint
- `/execute` -- framework endpoint for internal function execution
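A build-time check for these constraints might look like the following. This is a hypothetical sketch that mirrors the documented rules (paths must start with `/` and must not use reserved routes); the actual validation is internal to `flash build`:

```python
# Hypothetical path validation mirroring the documented rules.
RESERVED_PATHS = {"/ping", "/execute"}

def validate_path(path: str) -> None:
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    if path in RESERVED_PATHS:
        raise ValueError(f"path {path!r} is reserved")

validate_path("/users")      # passes silently
# validate_path("/ping")     # would raise ValueError: path '/ping' is reserved
```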
```python
from runpod_flash import Endpoint, GpuGroup

api = Endpoint(name="inference-api", gpu=GpuGroup.ADA_24, workers=(1, 5))

@api.post("/predict")
async def predict(data: dict) -> dict:
    import torch
    model = torch.load("/models/model.pt")
    return {"prediction": model.predict(data)}

@api.get("/health")
async def health():
    return {"status": "ok", "gpu": "available"}
```

```python
from runpod_flash import Endpoint

api = Endpoint(name="data-api", cpu="cpu3c-1-2", workers=(1, 3))

@api.post("/process")
async def process(data: dict) -> dict:
    return {"echo": data}

@api.get("/health")
async def health():
    return {"status": "healthy"}
```

Run locally with `flash run`:
```shell
flash run
# starts a local dev server at http://localhost:8888
# all routes are auto-discovered and registered
```

The dev server exposes your routes at `http://localhost:8888/{endpoint_name}/{path}`.
```python
import pytest
from runpod_flash import Endpoint

api = Endpoint(name="test-api", cpu="cpu3c-1-2")

@api.post("/calculate")
async def calculate(operation: str, a: int, b: int):
    if operation == "add":
        return a + b
    elif operation == "multiply":
        return a * b
    raise ValueError(f"Unknown operation: {operation}")

@pytest.mark.asyncio
async def test_calculate_add():
    result = await calculate("add", 5, 3)
    assert result == 8
```

`flash build` scans your code for `Endpoint` patterns:
- Finds `Endpoint(...)` variable assignments (LB endpoints)
- Finds `@Endpoint(...)` decorator usage (QB endpoints)
- Extracts HTTP routing metadata (method, path) for LB routes
- Creates manifest with route registry
- Validates for conflicts and reserved paths
- Packages everything for deployment
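To make the "route registry" concrete, the metadata collected by the scan might conceptually look like the structure below. This is a purely hypothetical illustration; the real manifest format is an internal detail of `flash build`:

```python
# Hypothetical shape of the route registry in the build manifest,
# showing the kind of metadata the scan collects per LB endpoint.
manifest = {
    "endpoints": [
        {
            "name": "user-service",
            "type": "load-balanced",
            "routes": [
                {"method": "GET", "path": "/users", "handler": "list_users"},
                {"method": "POST", "path": "/users", "handler": "create_user"},
            ],
        }
    ]
}
```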
```shell
# build the project
flash build

# deploy to an environment
flash deploy --env production
```

LB endpoints use subdomain-based URLs:
```shell
# health check
curl https://{endpoint-id}.api.runpod.ai/ping

# call a route
curl -X POST https://{endpoint-id}.api.runpod.ai/users \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -d '{"name": "Alice", "email": "alice@example.com"}'
```

Use `Endpoint(id=...)` to call a deployed LB endpoint:
```python
from runpod_flash import Endpoint

ep = Endpoint(id="your-endpoint-id")

# HTTP calls return raw response data (await from inside an async function)
result = await ep.post("/predict", {"data": [1, 2, 3]})
health = await ep.get("/health")
```

"path must start with '/'"
- Use absolute paths: `/api/endpoint`, not `api/endpoint`
"Duplicate route"
- Two functions with same method and path on same endpoint
- Change path or method to make each route unique
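Duplicate detection comes down to requiring that each (method, path) pair is unique per endpoint. An illustrative sketch of that check (not the SDK's actual code):

```python
# Find duplicate (method, path) pairs among registered routes.
def find_duplicates(routes):
    seen, dupes = set(), []
    for method, path in routes:
        key = (method.upper(), path)
        if key in seen:
            dupes.append(key)
        seen.add(key)
    return dupes

routes = [("POST", "/users"), ("GET", "/users"), ("POST", "/users")]
print(find_duplicates(routes))  # [('POST', '/users')]
```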
"HTTP error from endpoint: 500"
- Function raised an error during execution. Check endpoint logs.
"Connection refused"
- Container not running or uvicorn failed to start. Check container logs.
- Group related routes on the same `Endpoint` instance
- Use descriptive paths like `/api/users/{user_id}`, not `/api/u`
- Test locally with `flash run` before deploying
- Handle errors gracefully with meaningful error messages
- Use CPU endpoints for I/O-bound work to save costs
- Set appropriate `workers` scaling based on expected traffic
- Flash SDK Reference -- complete API reference
- Load Balancer Endpoints (Internal) -- internal architecture
- LoadBalancer Runtime Architecture -- runtime execution details