Flash supports two execution models: queue-based (QB) and load-balanced (LB). This guide covers creating load-balanced endpoints with HTTP routing using the `Endpoint` class.
Queue-Based (`@Endpoint(...)` on a function)
- Requests queued and processed sequentially
- Automatic retry logic on failure
- Higher latency (queuing + processing)
- One function per endpoint
Load-Balanced (`ep = Endpoint(...)` + `@ep.post("/path")`)
- Requests routed directly to available workers
- Direct HTTP execution, no queue
- Lower latency (direct HTTP)
- Multiple routes on a single endpoint
Use Load-Balanced when you need:
- Low latency API endpoints
- Custom HTTP routing (GET, POST, PUT, DELETE, PATCH)
- Multiple routes sharing the same workers
- REST API semantics
Use Queue-Based when you need:
- Automatic retry logic
- Sequential, fault-tolerant processing
- Simple request/response pattern
```python
from runpod_flash import Endpoint, GpuGroup

# create a load-balanced endpoint
api = Endpoint(name="example-api", gpu=GpuGroup.ADA_24, workers=(1, 3))

@api.post("/greet")
async def greet_user(name: str):
    return {"message": f"Hello, {name}!"}

@api.get("/health")
async def health():
    return {"status": "ok"}
```

Key points:
- Create an `Endpoint` instance with a name and compute config
- Use `.get()`, `.post()`, `.put()`, `.delete()`, `.patch()` to register routes
- All routes on the same `Endpoint` share the same workers
- Paths must start with `/`
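Conceptually, each route decorator records a (method, path) pair in the endpoint's route registry while leaving the decorated function unchanged. The following is an illustrative sketch of that general pattern in plain Python; it is not the actual runpod_flash implementation:

```python
# Illustrative sketch of decorator-based route registration.
# NOT the runpod_flash internals -- just the general pattern.

class MiniEndpoint:
    def __init__(self, name: str):
        self.name = name
        self.routes = {}  # (method, path) -> handler

    def _register(self, method: str, path: str):
        if not path.startswith("/"):
            raise ValueError("path must start with '/'")
        def decorator(func):
            self.routes[(method, path)] = func
            return func  # the function itself is returned unchanged
        return decorator

    def get(self, path: str):
        return self._register("GET", path)

    def post(self, path: str):
        return self._register("POST", path)

demo_api = MiniEndpoint("demo")

@demo_api.get("/health")
def health():
    return {"status": "ok"}

print(("GET", "/health") in demo_api.routes)  # True
```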
```python
from runpod_flash import Endpoint

api = Endpoint(name="user-service", cpu="cpu3c-1-2", workers=(1, 5))

@api.get("/users")
def list_users():
    return {"users": []}

@api.post("/users")
async def create_user(name: str, email: str):
    return {"id": 1, "name": name, "email": email}

@api.get("/users/{user_id}")
def get_user(user_id: int):
    return {"id": user_id, "name": "Alice"}

@api.delete("/users/{user_id}")
async def delete_user(user_id: int):
    return {"deleted": True}
```

When deployed, a single endpoint is created with all four HTTP routes registered. FastAPI handles routing to the correct function.
The following paths are reserved and cannot be used:
- `/ping` -- health check endpoint
- `/execute` -- framework endpoint for internal function execution
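A build-time check for these constraints might look like the following. This is a hypothetical sketch that mirrors the documented rules (paths must start with `/` and must not use reserved routes); the actual validation is internal to `flash build`:

```python
# Hypothetical path validation mirroring the documented rules.
RESERVED_PATHS = {"/ping", "/execute"}

def validate_path(path: str) -> None:
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    if path in RESERVED_PATHS:
        raise ValueError(f"path {path!r} is reserved")

validate_path("/users")      # passes silently
# validate_path("/ping")     # would raise ValueError: path '/ping' is reserved
```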
```python
from runpod_flash import Endpoint, GpuGroup

api = Endpoint(name="inference-api", gpu=GpuGroup.ADA_24, workers=(1, 5))

@api.post("/predict")
async def predict(data: dict) -> dict:
    import torch
    model = torch.load("/models/model.pt")
    return {"prediction": model.predict(data)}

@api.get("/health")
async def health():
    return {"status": "ok", "gpu": "available"}
```

```python
from runpod_flash import Endpoint

api = Endpoint(name="data-api", cpu="cpu3c-1-2", workers=(1, 3))

@api.post("/process")
async def process(data: dict) -> dict:
    return {"echo": data}

@api.get("/health")
async def health():
    return {"status": "healthy"}
```

Run locally with `flash run`:
```shell
flash run
# starts a local dev server at http://localhost:8888
# all routes are auto-discovered and registered
```

The dev server exposes your routes at `http://localhost:8888/{endpoint_name}/{path}`.
```python
import pytest
from runpod_flash import Endpoint

api = Endpoint(name="test-api", cpu="cpu3c-1-2")

@api.post("/calculate")
async def calculate(operation: str, a: int, b: int):
    if operation == "add":
        return a + b
    elif operation == "multiply":
        return a * b
    raise ValueError(f"Unknown operation: {operation}")

@pytest.mark.asyncio
async def test_calculate_add():
    result = await calculate("add", 5, 3)
    assert result == 8
```

`flash build` scans your code for `Endpoint` patterns:
- Finds `Endpoint(...)` variable assignments (LB endpoints)
- Finds `@Endpoint(...)` decorator usage (QB endpoints)
- Extracts HTTP routing metadata (method, path) for LB routes
- Creates manifest with route registry
- Validates for conflicts and reserved paths
- Packages everything for deployment
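To make the "route registry" concrete, the metadata collected by the scan might conceptually look like the structure below. This is a purely hypothetical illustration; the real manifest format is an internal detail of `flash build`:

```python
# Hypothetical shape of the route registry in the build manifest,
# showing the kind of metadata the scan collects per LB endpoint.
manifest = {
    "endpoints": [
        {
            "name": "user-service",
            "type": "load-balanced",
            "routes": [
                {"method": "GET", "path": "/users", "handler": "list_users"},
                {"method": "POST", "path": "/users", "handler": "create_user"},
            ],
        }
    ]
}
```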
```shell
# build the project
flash build

# deploy to an environment
flash deploy --env production
```

LB endpoints use subdomain-based URLs:
```shell
# health check
curl https://{endpoint-id}.api.runpod.ai/ping

# call a route
curl -X POST https://{endpoint-id}.api.runpod.ai/users \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -d '{"name": "Alice", "email": "alice@example.com"}'
```

Use `Endpoint(id=...)` to call a deployed LB endpoint:
```python
from runpod_flash import Endpoint

ep = Endpoint(id="your-endpoint-id")

# HTTP calls return raw response data (await from inside an async function)
result = await ep.post("/predict", {"data": [1, 2, 3]})
health = await ep.get("/health")
```

"path must start with '/'"
- Use absolute paths: `/api/endpoint`, not `api/endpoint`
"Duplicate route"
- Two functions with same method and path on same endpoint
- Change path or method to make each route unique
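Duplicate detection comes down to requiring that each (method, path) pair is unique per endpoint. An illustrative sketch of that check (not the SDK's actual code):

```python
# Find duplicate (method, path) pairs among registered routes.
def find_duplicates(routes):
    seen, dupes = set(), []
    for method, path in routes:
        key = (method.upper(), path)
        if key in seen:
            dupes.append(key)
        seen.add(key)
    return dupes

routes = [("POST", "/users"), ("GET", "/users"), ("POST", "/users")]
print(find_duplicates(routes))  # [('POST', '/users')]
```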
"HTTP error from endpoint: 500"
- Function raised an error during execution. Check endpoint logs.
"Connection refused"
- Container not running or uvicorn failed to start. Check container logs.
- Group related routes on the same `Endpoint` instance
- Use descriptive paths like `/api/users/{user_id}`, not `/api/u`
- Test locally with `flash run` before deploying
- Handle errors gracefully with meaningful error messages
- Use CPU endpoints for I/O-bound work to save costs
- Set appropriate `workers` scaling based on expected traffic
- Flash SDK Reference -- complete API reference
- Load Balancer Endpoints (Internal) -- internal architecture
- LoadBalancer Runtime Architecture -- runtime execution details