---
date: '2026-03-03T12:00:00-00:00'
draft: false
title: 'AIBrix v0.6.0 Release: Envoy Sidecar, Combined Routing, Routing Profiles, LoRA Delivery & New APIs'
author: ["The AIBrix Team"]

disableShare: true
hideSummary: true
searchHidden: false
ShowReadingTime: false
ShowWordCount: false
ShowBreadCrumbs: true
ShowPostNavLinks: true
ShowRssButtonInSectionTermList: false
UseHugoToc: true
ShowToc: true
tocopen: true
---

# 🚀 AIBrix v0.6.0 Release

Today we're excited to announce **AIBrix v0.6.0**, a release that expands how you deploy and route inference traffic. Key highlights include:

- **Envoy Sidecar Support** – Run Envoy alongside the gateway-plugin without requiring a separate Envoy Gateway controller, simplifying deployments.
- **Combined Routing Strategy** – Deploy both PD-optimized and combined pods in the same environment and route traffic intelligently based on workload and system load.
- **Routing Profiles** – Define multiple routing behaviors in a single model configuration and select them per request using a header.
- **Improved LoRA Artifact Delivery** – Artifact downloads are now fully handled by the **AIBrix runtime**, with direct credential passing, first-class **Volcengine TOS** support, and non-blocking async downloads.
- **Expanded API Surface**
- OpenAI-compatible audio APIs: `/v1/audio/transcriptions`, `/v1/audio/translations`
- New endpoints: `/v1/classify` and `/v1/rerank`
- **Custom Routing Paths** – Extend model router endpoints through deployment annotations.

Together, these updates make **AIBrix v0.6.0** easier to deploy, easier to observe, and more adaptable for production AI workloads. For the complete list of changes, commit history, and contributor details, see the [**AIBrix v0.6.0 Release Notes**](https://github.com/vllm-project/aibrix/releases/tag/v0.6.0).

## v0.6.0 Highlight Features

### Envoy as a Sidecar: Simplifying Gateway Deployments

This release introduces support for running **Envoy as a sidecar** alongside the AIBrix gateway-plugin. Instead of relying on an external Envoy Gateway controller, operators can now embed Envoy directly within the same pod as the gateway plugin. This approach provides a lighter-weight deployment option and reduces the architectural complexity of the gateway stack.

The new mode is controlled through the **`envoyAsSideCar`** flag in the Helm chart. When enabled, Envoy runs as a sidecar container that shares the lifecycle of the gateway-plugin pod. This removes the hard dependency on Envoy Gateway while giving operators more direct control over Envoy configuration and behavior.
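As a sketch, enabling the sidecar mode might look like the following Helm values fragment. The `envoyAsSideCar` flag is the one introduced in this release; the surrounding values structure is illustrative, so check the chart's `values.yaml` for the exact key path:

```yaml
# Illustrative values fragment; the key nesting is an assumption,
# but envoyAsSideCar is the flag added in v0.6.0.
gatewayPlugin:
  envoyAsSideCar: true
```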

### Flexible Deployment Modes

With this change, AIBrix now supports two mutually exclusive deployment patterns, allowing teams to choose the mode that best fits their infrastructure and operational preferences.

```
+---------------------------+ +-----------------------------+
| Envoy Gateway Mode | | Envoy Sidecar Mode |
| (current default) | | (new option) |
| | | |
| Envoy Gateway Controller | | AIBrix Gateway Plugin Pod |
| + HTTPRoute | | + Envoy Sidecar Container |
+---------------------------+ +-----------------------------+
```


In **Envoy Gateway mode**, Envoy is managed through a separate control-plane component that uses the Kubernetes Gateway API. Resources such as `GatewayClass`, `EnvoyExtensionPolicy`, and `HTTPRoute` define how traffic is routed and processed, following a controller-driven architecture.

In contrast, **Envoy Sidecar mode** runs Envoy directly within the gateway-plugin pod. Envoy receives its configuration from a ConfigMap and is exposed through the gateway-plugin service, eliminating the need for Gateway API controllers. This model simplifies networking and reduces the number of required cluster resources.

### Why This Matters

Supporting both deployment approaches gives operators greater flexibility when running AIBrix in different environments. While the Gateway API model remains ideal for fully managed, controller-driven setups, the sidecar mode enables simpler and more self-contained deployments.

This is especially useful in edge environments, lightweight clusters, or scenarios where running additional controllers is undesirable. By embedding Envoy directly with the gateway plugin, teams can deploy and operate the gateway with fewer moving parts while still benefiting from Envoy’s powerful networking capabilities.

---

## Smarter Request Routing with AIBrix’s Combined Pod Strategy

Modern LLM inference workloads are rarely uniform. Some requests contain long prompts that benefit from specialized execution pipelines, while others are short and interactive. Designing an infrastructure that efficiently handles both cases can be challenging.

In the latest AIBrix release, we introduce the Combined Routing Strategy — a smarter approach to routing requests across different pod types within the same deployment. This feature allows AIBrix to dynamically choose between PD-optimized pods (prefill/decode disaggregated) and combined pods (single-process inference), improving overall performance, flexibility, and resource utilization.

Instead of forcing operators to choose one architecture or maintain separate deployments, AIBrix can now run both approaches together and intelligently decide which pod should handle each request.

### Introducing the Combined Routing Strategy

With this new strategy, AIBrix enables intelligent request routing across both PD and combined pods within the same deployment.

Each pod type serves a specific purpose:

- **PD Pods (Prefill/Decode Disaggregated)** — Designed for workloads where separating prefill and decode stages improves efficiency. These pods are ideal for long prompts or workloads dominated by heavy prompt processing.
- **Combined Pods (Single-Pod Inference)** — Handle the entire request lifecycle within a single process. They provide efficient handling for short prompts and can absorb traffic when PD resources are busy.

The AIBrix gateway now evaluates real-time system conditions and selects the most appropriate target pod automatically.

### How Routing Works

At the core of the Combined Strategy is an updated routing algorithm that evaluates available pods and dynamically selects the best candidate.

```
+---------------------------+
| Client |
+-------------+-------------+
|
Routing Algorithm (Gateway)
|
+-------------------------+------------------------+
| |
▼ ▼
+------------------+ +---------------------+
| PD Pods | | Combined Pods |
| (Prefill/Decode) | | (Single-Pod Logic) |
+------------------+ +---------------------+
▲ ▲
| Selected when PD advantages apply | Selected when PD load is high
| (long prompts, prefill heavy, etc.) | or combined is underutilized
```

The routing decision incorporates several signals, including:

- **Current pod load**
- **Queue metrics**
- **Pod availability**
- **Scoring logic** used to rank candidate pods

This scoring mechanism allows AIBrix to balance traffic fairly and efficiently across both PD and combined pods.
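To make the idea concrete, here is a minimal Python sketch of load-and-affinity scoring in this spirit. The `PodCandidate` fields, weights, and threshold are hypothetical illustrations of the signals listed above, not AIBrix's actual scoring code:

```python
from dataclasses import dataclass

@dataclass
class PodCandidate:
    name: str
    kind: str             # "pd" or "combined"
    active_requests: int  # current pod load
    queue_depth: int      # queue metric

def score(pod: PodCandidate, prompt_len: int, pd_threshold: int = 2048) -> float:
    """Lower score is better. Combines load with a prompt-length affinity:
    PD pods are preferred for long prompts, combined pods for short ones."""
    load = pod.active_requests + 0.5 * pod.queue_depth
    prefers_pd = prompt_len >= pd_threshold
    affinity_penalty = 0.0 if (pod.kind == "pd") == prefers_pd else 10.0
    return load + affinity_penalty

def select_pod(pods: list[PodCandidate], prompt_len: int) -> PodCandidate:
    # Rank all available candidates and pick the best-scoring one.
    return min(pods, key=lambda p: score(p, prompt_len))
```

With this sketch, a long prompt lands on a lightly loaded PD pod, while a short interactive request is steered to a combined pod, and a saturated PD fleet naturally overflows to combined pods as its load term grows.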

### Benefits of the Combined Routing Strategy

By supporting both PD and combined pods in the same deployment, AIBrix can adapt to changing workloads and route requests more efficiently. Key benefits include:

- **Efficient handling of mixed workloads** — Long prompts are routed to PD pods for better prefill/decode performance, while short or interactive requests are served by combined pods for faster responses.
- **Overflow handling during traffic spikes** — Combined pods can absorb requests when PD pods are saturated, helping maintain stable latency.
- **Single deployment flexibility** — Run PD and combined pods together for the same model without managing separate clusters.
- **Dynamic traffic routing** — Requests are routed based on real-time system conditions rather than static configuration.
- **Better resource utilization** — Traffic is distributed intelligently across available pods to maximize GPU usage and throughput.

---

## Routing Profiles: One Deployment, Multiple Routing Behaviors

As inference workloads grow more diverse, different types of traffic often require different routing strategies. Some requests benefit from PD (prefill/decode) routing, others prioritize low latency, while general workloads may only need simple load balancing. Traditionally, supporting these variations required multiple deployments or complex label configurations.

With this release, AIBrix introduces Routing Profiles — a new way to define multiple routing behaviors within a single model configuration and select them dynamically per request.

Instead of spreading routing settings across pod labels or maintaining separate deployments, you can define multiple named routing profiles inside the **model.aibrix.ai/config** annotation. Clients then select the desired routing behavior at request time using the **config-profile** header.

This allows a single model deployment to handle multiple traffic patterns — such as general workloads, PD routing, or low-latency requests — all using the same set of pods.

### Defining Routing Profiles

Routing profiles are defined as structured JSON inside the **model.aibrix.ai/config** annotation. Each profile can configure routing behavior and parameters such as:

- **routingStrategy** (e.g. `random`, `pd`, or `least-latency`)
- Prompt-length bucket ranges used by PD routing (**promptLenBucketMinLength**, **promptLenBucketMaxLength**)
- Whether pods operate in combined prefill/decode mode

You also define a **defaultProfile**, which acts as a fallback if a client does not specify a profile. These are the options currently supported; additional profile configuration options will be added gradually.

Example:

```json
{
  "defaultProfile": "pd",
  "profiles": {
    "default": {
      "routingStrategy": "random",
      "promptLenBucketMinLength": 0,
      "promptLenBucketMaxLength": 4096
    },
    "pd": {
      "routingStrategy": "pd",
      "promptLenBucketMinLength": 0,
      "promptLenBucketMaxLength": 2048
    },
    "low-latency": {
      "routingStrategy": "least-latency",
      "promptLenBucketMinLength": 0,
      "promptLenBucketMaxLength": 2048
    }
  }
}
```

In this example, the `pd` profile is configured as the default. Clients can explicitly choose `default`, `pd`, or `low-latency` depending on their workload. If no profile is provided, AIBrix automatically falls back to the default profile.

### Why Routing Profiles Matter

Routing Profiles simplify routing configuration while making deployments far more flexible. Instead of creating separate deployments for different traffic patterns, operators can define multiple routing behaviors in a single configuration and select them dynamically.

This approach provides several benefits:

- **Single source of truth** for routing configuration
- **Per-request flexibility** without extra deployments
- **Cleaner gateway logic** that is easier to manage
- **Clear separation** between workload types (batch vs. interactive, PD vs. single-pod)

In practice, this means one deployment can support many routing behaviors, allowing AIBrix to adapt to different workload patterns without increasing operational complexity.

---

## Streamlined LoRA Artifact Delivery with AIBrix Runtime

Managing LoRA adapters often requires coordinating credentials, storage access, and runtime behavior. To simplify this, AIBrix now moves LoRA artifact preparation and delivery entirely into the runtime, making it the single source of truth for downloading and preparing model artifacts.

This change simplifies credential handling, centralizes artifact management, and keeps the runtime responsive during downloads.

### Runtime as the Source of Truth

Previously, artifact preparation involved coordination between controllers and the runtime. Now the AIBrix runtime handles artifact validation and downloads directly, while the controller simply passes the required information such as credentials.

This clearer separation reduces operational complexity and simplifies the LoRA adapter lifecycle.

### Simplified Credential Flow

The modeladapter controller now retrieves credentials directly from the referenced Kubernetes Secret and embeds them into the LoRA load request.

The flow is straightforward:

- The controller reads the referenced Kubernetes Secret.
- It converts **secret.Data** into a key-value map.
- It sends the credentials directly to the runtime.

This makes it easier to support IAM-style credentials for S3-compatible storage systems and removes ambiguity around artifact access configuration.
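A minimal sketch of the conversion step, assuming the Secret data arrives base64-encoded as it does from the Kubernetes API; the adapter name, URL, and request shape are hypothetical illustrations:

```python
import base64

def secret_to_credential_map(secret_data: dict[str, str]) -> dict[str, str]:
    # The Kubernetes API returns Secret values base64-encoded; decode
    # them into a plain key-value map before attaching to the request.
    return {k: base64.b64decode(v).decode("utf-8") for k, v in secret_data.items()}

# Hypothetical LoRA load request carrying the decoded credentials.
load_request = {
    "lora_name": "my-adapter",
    "artifact_url": "s3://bucket/adapters/my-adapter",
    "credentials": secret_to_credential_map({
        "AWS_ACCESS_KEY_ID": base64.b64encode(b"AKIAEXAMPLE").decode(),
        "AWS_SECRET_ACCESS_KEY": base64.b64encode(b"example-secret").decode(),
    }),
}
```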

### Non-Blocking Artifact Downloads

Since object storage SDKs often rely on blocking I/O, artifact downloads are executed using:

```python
asyncio.to_thread
```

This runs downloads in a worker thread, keeping the async runtime responsive and allowing concurrent requests to continue while artifacts are being fetched.
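The pattern can be sketched as follows. The `download_artifact` stand-in replaces a real blocking storage-SDK call (e.g. a TOS or S3 `GetObject`), and the function names are illustrative rather than the runtime's actual API:

```python
import asyncio
import os
import tempfile

def download_artifact(url: str, dest: str) -> str:
    # Stand-in for a blocking SDK download; writing a marker file
    # keeps the sketch self-contained.
    with open(dest, "w") as f:
        f.write(f"artifact from {url}")
    return dest

async def load_adapter(url: str) -> str:
    dest = os.path.join(tempfile.gettempdir(), "adapter.bin")
    # asyncio.to_thread runs the blocking download in a worker thread,
    # so the event loop keeps serving concurrent requests meanwhile.
    return await asyncio.to_thread(download_artifact, url, dest)

path = asyncio.run(load_adapter("tos://bucket/adapters/demo"))
```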

### A Simpler, More Reliable Pipeline

With these updates, LoRA artifact delivery becomes more streamlined. The runtime now manages artifact preparation centrally, credentials flow more cleanly, and downloads no longer block the event loop—making LoRA adapter management more reliable and production-ready.

---

## New API Endpoints and Custom Routing Support

We're excited to introduce several new API endpoints, along with enhanced flexibility for custom routing paths in your deployments.

### New Endpoints

The following endpoints are now available:

- **/v1/audio/transcriptions** – Convert audio to text with high-accuracy speech recognition.
- **/v1/audio/translations** – Transcribe and translate audio into your target language.
- **/v1/classify** – Perform text classification tasks with optimized model inference.
- **/v1/rerank** – Improve retrieval quality by reranking candidate results based on relevance.

These additions expand support for speech processing, content classification, and retrieval optimization workflows.

---

## Custom Path Extension Support

You can now extend the model router with custom API paths directly through deployment annotations.

This allows you to expose additional routes beyond the standard API surface without modifying core routing logic.

### Example Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-distill-llama-8b
  namespace: default
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    model.aibrix.ai/port: "8000"
  annotations:
    model.aibrix.ai/model-router-custom-paths: /version,/score
```

### Custom Paths and Why They Matter

Custom routes are defined using the annotation **model.aibrix.ai/model-router-custom-paths**, where multiple paths can be specified as a comma-separated list. Spaces and empty entries are ignored.

Example: `/version,/score`

These paths allow you to expose additional model-specific endpoints such as:

- **/version** — Return model version or metadata.
- **/score** — Implement custom scoring or evaluation logic.
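The annotation parsing described above (comma-separated, with spaces and empty entries ignored) can be sketched in a few lines. This is an illustrative reimplementation, not the router's actual code:

```python
def parse_custom_paths(annotation: str) -> list[str]:
    # Split on commas, trim surrounding whitespace, and drop empty entries.
    return [p.strip() for p in annotation.split(",") if p.strip()]

print(parse_custom_paths("/version, /score,,"))  # ['/version', '/score']
```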

By supporting custom routing paths, AIBrix makes it easier to extend the model API layer. You can integrate speech workflows, improve retrieval pipelines with reranking, add classification capabilities, or expose model-specific functionality—all through a simple, declarative deployment configuration.

---

## Other Improvements

v0.6.0 also expands observability, deployment options, and stability:

- **Observability & metrics:** Gateway metrics collected directly from the gateway layer; granular inference request metrics; new **SGLang gateway metrics dashboard**; Prometheus auth via Kubernetes secrets and query queueing.
- **Deployment & installation:** Gateway plugin can run in **standalone mode** without Kubernetes; simplified **Docker Compose** installation for local and dev setups.
- **StormService & control plane:** Per-role revision tracking; role-level status aggregation; **periodic reconciliation** for ModelAdapter; dynamic discovery provider updates; improved PodGroup and scheduling strategy handling.
- **KVCache framework:** **Block-first layout** support; padding token support in CUDA kernels; **aibrix_pd_reuse_connector** for combined PD reuse workflows.
- **Gateway & routing:** Session affinity routing; external header filters for advanced routing; **least-request routing** for distributed DP API servers.
- **Bug fixes:** RDMA issues in P/D setups (SGLang/vLLM); Redis auth in Helm; divide-by-zero in APA autoscaling; envoy extension policy paths and gateway service configuration; metrics label cardinality panic; CGO build alignment for builder/runtime env.

## Contributors & Community

This v0.6.0 release includes **95 merged PRs**, with **15** from first-time contributors 💫. Thank you to everyone who helped shape this release through code, issues, reviews, and feedback.

Special shout-out to [@Jeffwan](https://github.com/Jeffwan), [@varungup90](https://github.com/varungup90), [@googs1025](https://github.com/googs1025), [@scarlet25151](https://github.com/scarlet25151), [@DwyaneShi](https://github.com/DwyaneShi), and [@nurali-techie](https://github.com/nurali-techie) for their continued work on reliability, gateway improvements, control-plane evolution, and documentation.

We’re excited to welcome the following new contributors to the AIBrix community:

[@sherlockkenan](https://github.com/sherlockkenan), [@dczhu](https://github.com/dczhu), [@sceneryback](https://github.com/sceneryback), [@Deepam02](https://github.com/Deepam02), [@rayne-Li](https://github.com/rayne-Li), [@n0gu-furiosa](https://github.com/n0gu-furiosa), [@cabrinha](https://github.com/cabrinha), [@sanmuny](https://github.com/sanmuny), [@erictanjn](https://github.com/erictanjn), [@fungaren](https://github.com/fungaren), [@paranoidRick](https://github.com/paranoidRick), [@pbillaut](https://github.com/pbillaut), [@alpe](https://github.com/alpe), [@liangdong1201](https://github.com/liangdong1201), [@yahavb](https://github.com/yahavb) 🙌

Your contributions continue to make AIBrix more scalable, production-ready, and welcoming as an open community. We’re excited to see the ecosystem grow—keep them coming!

## Next Steps

If you're running LLMs in production or exploring architectures around serverless, KV cache, or P/D disaggregation, we'd love your feedback and collaboration. Check out the [v0.7.0 roadmap](https://github.com/vllm-project/aibrix/issues/1978), join the discussion, and contribute on GitHub.