From dcc2cb312b73c80f090d6817c707213a4fadf9e7 Mon Sep 17 00:00:00 2001 From: varungupta Date: Tue, 3 Mar 2026 16:56:26 -0800 Subject: [PATCH 1/5] add release blog for v0.6.0 Signed-off-by: varungupta --- content/posts/2026-03-03-v0.6.0-release.md | 376 +++++++++++++++++++++ 1 file changed, 376 insertions(+) create mode 100644 content/posts/2026-03-03-v0.6.0-release.md diff --git a/content/posts/2026-03-03-v0.6.0-release.md b/content/posts/2026-03-03-v0.6.0-release.md new file mode 100644 index 0000000..7f267d2 --- /dev/null +++ b/content/posts/2026-03-03-v0.6.0-release.md @@ -0,0 +1,376 @@ +--- +date: '2026-03-03T12:00:00-00:00' +draft: false +title: 'AIBrix v0.6.0 Release: Envoy Sidecar, Combined Routing, Routing Profiles, LoRA Delivery & New APIs' +author: ["The AIBrix Team"] + +disableShare: true +hideSummary: true +searchHidden: false +ShowReadingTime: false +ShowWordCount: false +ShowBreadCrumbs: true +ShowPostNavLinks: true +ShowRssButtonInSectionTermList: false +UseHugoToc: true +ShowToc: true +tocopen: true +--- + +# 🚀 AIBrix v0.6.0 Release + +Today we're excited to announce **AIBrix v0.6.0**, a release that broadens how you deploy and route inference traffic. This release adds **Envoy as a sidecar** alongside the gateway-plugin for simpler, more flexible gateway deployments without a separate Envoy Gateway controller. The new **Combined Routing Strategy** lets you run both PD-optimized and combined pods in the same deployment and route traffic intelligently based on load and workload. **Routing Profiles** give you multiple routing behaviors in one model config, selectable per request via a header. LoRA artifact delivery is now fully delegated to the Aibrix runtime with direct credential passing, first-class **Volcengine TOS** support, and non-blocking async downloads. 
The API surface grows with **OpenAI-compatible audio** (/v1/audio/transcriptions, /v1/audio/translations), **/v1/classify**, and **/v1/rerank**, plus **custom path extensions** via annotations. Together, v0.6.0 makes AIBrix more deployable, observable, and adaptable for production. + +## v0.6.0 Highlight Features + +### Envoy as a Sidecar: Simplifying Gateway Deployments + +- Adds support for running Envoy as a sidecar alongside the gateway-plugin. +- Introduces a more flexible and lightweight alternative to the existing Envoy Gateway integration. +- Controlled via the **envoyAsSideCar** flag in the Helm chart. +- Allows embedding Envoy directly in the same pod as the gateway plugin instead of relying on an external Envoy Gateway controller. +- Reduces architectural complexity and removes a hard dependency on Envoy Gateway. +- Gives operators more direct control over Envoy configuration and lifecycle. + +### Deployment Modes + +This feature introduces two mutually exclusive deployment modes: + +``` ++---------------------------+ +-----------------------------+ +| Envoy Gateway Mode | | Envoy Sidecar Mode | +| (current default) | | (new option) | +| | | | +| Envoy Gateway Controller | | AIBrix Gateway Plugin Pod | +| + HTTPRoute | | + Envoy Sidecar Container | ++---------------------------+ +-----------------------------+ +``` + +### Envoy Gateway Mode + +- Envoy is managed as a separate control-plane component. +- Uses Kubernetes Gateway API resources. +- Requires resources such as GatewayClass, EnvoyExtensionPolicy, and HTTPRoute. +- Follows a controller-driven architecture. + +### Envoy Sidecar Mode + +- Envoy runs as a sidecar container in the same pod as the gateway plugin. +- Receives static configuration via a ConfigMap. +- Does not require Gateway API controller resources. +- Exposes Envoy directly through the gateway-plugin service. +- Simplifies networking and reduces resource sprawl. + +### Impact + +- Supports a broader range of deployment patterns. 
+- Enables both fully managed Gateway control-plane setups and self-contained sidecar-based deployments. +- Particularly useful for edge environments or scenarios where running separate controllers is not desirable. + +--- + +## Combined Routing Strategy: Smarter Pod Selection + +In this release, AIBrix enhances its routing logic with a **Combined Routing Strategy** that allows the system to intelligently route requests to both PD-optimized pods (prefill/decode disaggregated) and combined pods (non-PD) within the same deployment. + +Previously, AIBrix distinguished mainly between PD-style (prefill/decode) pods and traditional single-pod models, limiting flexibility in how traffic is assigned based on workload characteristics. With this feature, operators can deploy both PD and combined pods for a model and let the gateway routing algorithm dynamically choose the best target based on current load and request patterns. + +### What "Combined Strategy" Means + +The new combined strategy enhances resource utilization and routing efficiency: + +- PD pods continue to serve disaggregated traffic where prefill/decode separation yields performance benefits (for prompt-heavy workloads). +- Combined pods handle requests end-to-end in a single process — ideal when PD pods are under heavy load or for shorter prompt workloads. +- The routing algorithm now includes pod scoring and dynamic selection between PD and combined pods based on load, queue metrics, and defined strategy. This means combined pods can absorb traffic automatically when PD pods are oversubscribed, helping maintain low latency and high throughput. 
+ +### Routing Flow Overview + +Here's a high-level illustration of how traffic routing works under the Combined Strategy: + +``` + +---------------------------+ + | Client | + +-------------+-------------+ + | + ▼ + Routing Algorithm (Gateway) + | + +-------------------------+------------------------+ + | | + ▼ ▼ + +------------------+ +---------------------+ + | PD Pods | | Combined Pods | + | (Prefill/Decode) | | (Single-Pod Logic) | + +------------------+ +---------------------+ + ▲ ▲ + | Selected when PD advantages apply | Selected when PD load is high + | (long prompts, prefill heavy, etc.) | or combined is underutilized +``` + +### Why It Matters + +- Run both PD and combined pod types for the same model in a single deployment — no separate clusters required. +- Dynamically route traffic based on real-time conditions instead of static labels. +- Improve performance for mixed workloads: + - Long, prompt-heavy requests benefit from PD pods. + - Short or interactive requests are efficiently handled by combined pods. +- Maintain stable latency under variable load — combined pods can absorb overflow when PD pods are saturated. +- Achieve better overall resource utilization and throughput. +- Powered by updated routing logic, new metrics, and a pod scoring mechanism for fair and efficient pod selection. + +**Result:** AIBrix moves beyond rigid PD vs. non-PD routing and delivers an adaptive, single-deployment model that automatically selects the most appropriate pod type for each request. + +--- + +## Routing Profiles: One Config, Many Behaviors + +With this release, AIBrix introduces **Routing Profiles** — a flexible way to define multiple routing behaviors in a single model configuration and select them per request, without duplicating deployments. 
+ +Instead of scattering routing-related settings across pod labels or creating separate deployments for different workloads, you can define multiple named profiles inside a single structured config (via the **model.aibrix.ai/config** annotation). Clients then select the desired behavior at request time using the **config-profile** header. + +This enables a single model deployment to serve multiple traffic patterns — for example: + +- Default routing for general workloads +- PD (prefill/decode) routing for disaggregated serving +- Low-latency routing for interactive use cases + +All from the same pods. + +### How It Works + +A routing profile is defined inside the **model.aibrix.ai/config** JSON annotation. Each profile can specify: + +- **routingStrategy** (e.g. random, pd, least-latency) +- Prompt-length bucket bounds for PD routing: + - **promptLenBucketMinLength** + - **promptLenBucketMaxLength** +- Whether a pod is combined for prefill/decode + +You also define a **defaultProfile** that acts as the fallback. + +If a client omits the **config-profile** header or specifies an unknown profile, AIBrix automatically falls back to the configured **defaultProfile**. + +### Example: Defining Profiles on a Pod + +You can define multiple profiles in the pod template (for example, in a StormService) like this: + +```json +{ + "defaultProfile": "pd", + "profiles": { + "default": { + "routingStrategy": "random", + "promptLenBucketMinLength": 0, + "promptLenBucketMaxLength": 4096 + }, + "pd": { + "routingStrategy": "pd", + "promptLenBucketMinLength": 0, + "promptLenBucketMaxLength": 2048 + }, + "low-latency": { + "routingStrategy": "least-latency", + "promptLenBucketMinLength": 0, + "promptLenBucketMaxLength": 2048 + } + } +} +``` + +In this example: + +- The **pd** profile is the default. +- Clients can explicitly select **default**, **pd**, or **low-latency**. +- If no profile is specified, **pd** is used automatically. 
### Why This Matters

Previously, routing behavior was controlled through multiple pod labels. As routing options grew (PD mode, prompt-length buckets, latency-aware routing, etc.), configuration became harder to manage and reason about.

Routing Profiles provide:

- A single source of truth for routing configuration
- Per-request flexibility without extra deployments
- Cleaner gateway logic
- Better separation of workload types (batch vs. interactive, PD vs. single-pod, etc.)
- No pod duplication just to change routing behavior

In short: **one deployment, many behaviors.**

---

## Streamlined LoRA Artifact Delivery with AIBrix Runtime

LoRA artifact delivery is now fully delegated to the AIBrix runtime, making the runtime the single source of truth for preparing and downloading model artifacts.

These updates tighten the LoRA adapter lifecycle end-to-end:

- The controller now passes real credentials directly to the runtime
- The runtime centrally handles artifact preparation and downloads
- First-class support for Volcengine TOS is added
- Async responsiveness is preserved during artifact downloads

Let's break down what's changed.

### Controllers Now Pass Full Kubernetes Secret Data

Previously, credential handling required additional indirection. Now, the modeladapter controller takes a more direct approach:

- The **loraClient** fetches the referenced Kubernetes Secret
- It converts **secret.Data** into a string map
- The credentials are embedded directly into the LoRA load request payload

This makes it straightforward to support IAM-style credentials for S3-compatible storage systems.
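As a rough sketch of that conversion step: the controller itself is written in Go, so the Python below is only an illustrative model, and the payload field names are hypothetical:

```python
# In the Kubernetes client API, Secret data values arrive as raw bytes
# (already base64-decoded). Converting them into a plain string map lets
# the controller embed them directly in the load request payload.
secret_data: dict[str, bytes] = {
    "TOS_ACCESS_KEY": b"AKIDEXAMPLE",   # hypothetical credential values
    "TOS_SECRET_KEY": b"s3cr3t",
}

def to_string_map(data: dict[str, bytes]) -> dict[str, str]:
    return {key: value.decode("utf-8") for key, value in data.items()}

load_request = {
    "lora_name": "my-adapter",          # hypothetical payload fields
    "artifact_url": "tos://bucket/my-adapter",
    "credentials": to_string_map(secret_data),
}
print(load_request["credentials"]["TOS_ACCESS_KEY"])  # AKIDEXAMPLE
```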
### Runtime Behavior

On the runtime side:

- The AIBrix runtime accepts direct key/value credentials
- Optional **additional_config** is supported
- The artifact service prefers direct credentials when present
- It falls back to loading credentials from a secret file only when necessary

This simplifies credential flow and removes ambiguity about where artifact access is configured.

### First-Class tos:// Support for Volcengine TOS

We've added native support for Volcengine Object Storage (TOS).

**What's Included**

- The artifact URL validator now recognizes **tos://** as a valid scheme
- The Python runtime includes a dedicated **TOSArtifactDownloader** implementation

**Required Credential Keys**

The TOS downloader expects the following credential keys:

- **TOS_ACCESS_KEY**
- **TOS_SECRET_KEY**
- **TOS_SESSION_KEY** (optional)

This makes integrating with Volcengine TOS seamless and consistent with other supported object storage backends.

### Async Runtime Remains Responsive

Object storage SDK calls are typically blocking I/O operations, which can stall async runtimes if not handled carefully.

To prevent this, the runtime now wraps artifact download operations using:

```python
asyncio.to_thread
```

This ensures:

- Downloads run in a separate worker thread
- The main event loop remains responsive
- Concurrent requests are not blocked while artifacts are being fetched

The result: better runtime responsiveness without sacrificing compatibility with existing storage SDKs.

### Why This Matters

With these improvements:

- Credential handling is cleaner and more secure
- Artifact preparation logic is centralized in the runtime
- Volcengine TOS works out of the box
- Async performance remains stable under load

Together, these changes make LoRA adapter management more robust, scalable, and production-ready.
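To make the non-blocking download pattern above concrete, here is a minimal self-contained sketch. The downloader class and method names are hypothetical stand-ins, not AIBrix's actual API; only the `asyncio.to_thread` technique comes from the release:

```python
import asyncio
import time

class BlockingDownloader:
    """Stand-in for a storage SDK whose calls block (e.g., a TOS or S3 client)."""
    def download(self, url: str) -> str:
        time.sleep(0.05)  # simulate blocking network I/O
        return "/models/" + url.rsplit("/", 1)[-1]

async def fetch_artifact(downloader: BlockingDownloader, url: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # keeping the event loop free to serve other requests meanwhile.
    return await asyncio.to_thread(downloader.download, url)

async def main() -> list[str]:
    d = BlockingDownloader()
    urls = ["tos://bucket/adapter-a", "tos://bucket/adapter-b"]
    # The two downloads overlap instead of serializing on the event loop.
    return await asyncio.gather(*(fetch_artifact(d, u) for u in urls))

print(asyncio.run(main()))  # ['/models/adapter-a', '/models/adapter-b']
```

Because each blocking `download` call is pushed off the event loop, concurrent adapter loads proceed in parallel without any change to the underlying storage SDK.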
+ +--- + +## New API Endpoints and Custom Routing Support + +We're excited to introduce several new API endpoints, along with enhanced flexibility for custom routing paths in your deployments. + +### New Endpoints + +The following endpoints are now available: + +- **/v1/audio/transcriptions** – Convert audio to text with high-accuracy speech recognition. +- **/v1/audio/translations** – Transcribe and translate audio into your target language. +- **/v1/classify** – Perform text classification tasks with optimized model inference. +- **/v1/rerank** – Improve retrieval quality by reranking candidate results based on relevance. + +These additions expand support for speech processing, content classification, and retrieval optimization workflows. + +--- + +## Custom Path Extension Support + +You can now extend the model router with custom API paths directly through deployment annotations. + +This allows you to expose additional routes beyond the standard API surface without modifying core routing logic. + +### Example Deployment Configuration + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: deepseek-r1-distill-llama-8b + namespace: default + labels: + model.aibrix.ai/name: deepseek-r1-distill-llama-8b + model.aibrix.ai/port: "8000" + annotations: + model.aibrix.ai/model-router-custom-paths: /version,/score +``` + +### How It Works + +- Custom paths are defined using the annotation: **model.aibrix.ai/model-router-custom-paths** +- Multiple paths must be comma-separated. +- Spaces and empty entries are ignored. 
+- Example: `/version,/score` + +This makes it easy to expose additional model-specific endpoints such as: + +- **/version** – For model version metadata +- **/score** – For custom scoring or evaluation logic + +### Why This Matters + +With these updates, you can: + +- Integrate speech workflows directly into your API layer +- Improve retrieval pipelines with built-in reranking +- Add classification capabilities without additional infrastructure +- Extend routing behavior in a clean, declarative way + +These improvements provide greater flexibility while keeping deployments simple and maintainable. + +--- + +## Other Improvements + +v0.6.0 also expands observability, deployment options, and stability: + +- **Observability & metrics:** Gateway metrics collected directly from the gateway layer; granular inference request metrics; new **SGLang gateway metrics dashboard**; Prometheus auth via Kubernetes secrets and query queueing. +- **Deployment & installation:** Gateway plugin can run in **standalone mode** without Kubernetes; simplified **Docker Compose** installation for local and dev setups. +- **StormService & control plane:** Per-role revision tracking; role-level status aggregation; **periodic reconciliation** for ModelAdapter; dynamic discovery provider updates; improved PodGroup and scheduling strategy handling. +- **KVCache framework:** **Block-first layout** support; padding token support in CUDA kernels; **aibrix_pd_reuse_connector** for combined PD reuse workflows. +- **Gateway & routing:** Session affinity routing; external header filters for advanced routing; **least-request routing** for distributed DP API servers. +- **Bug fixes:** RDMA issues in P/D setups (SGLang/vLLM); Redis auth in Helm; divide-by-zero in APA autoscaling; envoy extension policy paths and gateway service configuration; metrics label cardinality panic; CGO build alignment for builder/runtime env. 
+ +For a full list of changes, commit history, and contributor details, see the [**AIBrix v0.6.0 Release Notes**](https://github.com/vllm-project/aibrix/releases/tag/v0.6.0). + +## Contributors & Community + +This v0.6.0 release includes **42 contributions**, with **18** from first-time contributors 💫. A huge thank-you to everyone who helped shape this release through code, issues, reviews, and feedback. + +Special shout-out to [@Jeffwan](https://github.com/Jeffwan), [@varungup90](https://github.com/varungup90), and [@googs1025](https://github.com/googs1025) for their continued work on reliability, gateway improvements, and control-plane evolution. + +We’re excited to welcome the following new contributors to the AIBrix community: + +[@autopear](https://github.com/autopear), [@ronaldosaheki](https://github.com/ronaldosaheki), [@jiangxiaobin96](https://github.com/jiangxiaobin96), [@mayooot](https://github.com/mayooot), [@zyfy29](https://github.com/zyfy29), [@zhengkezhou1](https://github.com/zhengkezhou1), [@tianzhiqiang3](https://github.com/tianzhiqiang3), [@atakli](https://github.com/atakli), [@jwjwjw3](https://github.com/jwjwjw3), [@lx1036](https://github.com/lx1036), [@chethanuk](https://github.com/chethanuk), [@baozixiaoxixi](https://github.com/baozixiaoxixi), [@TylerGillson](https://github.com/TylerGillson), [@omrishiv](https://github.com/omrishiv), [@lex1ng](https://github.com/lex1ng), [@ChenTaoyu-SJTU](https://github.com/ChenTaoyu-SJTU), [@zhenyu-02](https://github.com/zhenyu-02), [@yapple](https://github.com/yapple) 🙌 + +Your contributions continue to make AIBrix more scalable, production-ready, and welcoming as an open community. We’re excited to see the ecosystem grow—keep them coming! + +## Next Steps + +If you're running LLMs in production or exploring architectures around serverless, KV cache, or P/D disaggregation, we'd love your feedback and collaboration. 
Check out the [v0.7.0 roadmap](https://github.com/vllm-project/aibrix/issues/1978), join the discussion, and contribute on GitHub. \ No newline at end of file From c4afdfdae40f3cae40f67cc73976c3cd8ea56dcc Mon Sep 17 00:00:00 2001 From: varungupta Date: Tue, 3 Mar 2026 17:23:24 -0800 Subject: [PATCH 2/5] convert to bullet points Signed-off-by: varungupta --- content/posts/2026-03-03-v0.6.0-release.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/content/posts/2026-03-03-v0.6.0-release.md b/content/posts/2026-03-03-v0.6.0-release.md index 7f267d2..e7a012b 100644 --- a/content/posts/2026-03-03-v0.6.0-release.md +++ b/content/posts/2026-03-03-v0.6.0-release.md @@ -19,7 +19,18 @@ tocopen: true # 🚀 AIBrix v0.6.0 Release -Today we're excited to announce **AIBrix v0.6.0**, a release that broadens how you deploy and route inference traffic. This release adds **Envoy as a sidecar** alongside the gateway-plugin for simpler, more flexible gateway deployments without a separate Envoy Gateway controller. The new **Combined Routing Strategy** lets you run both PD-optimized and combined pods in the same deployment and route traffic intelligently based on load and workload. **Routing Profiles** give you multiple routing behaviors in one model config, selectable per request via a header. LoRA artifact delivery is now fully delegated to the Aibrix runtime with direct credential passing, first-class **Volcengine TOS** support, and non-blocking async downloads. The API surface grows with **OpenAI-compatible audio** (/v1/audio/transcriptions, /v1/audio/translations), **/v1/classify**, and **/v1/rerank**, plus **custom path extensions** via annotations. Together, v0.6.0 makes AIBrix more deployable, observable, and adaptable for production. +Today we're excited to announce **AIBrix v0.6.0**, a release that expands how you deploy and route inference traffic. 
Key highlights include: + +- **Envoy Sidecar Support** – Run Envoy alongside the gateway-plugin without requiring a separate Envoy Gateway controller, simplifying deployments. +- **Combined Routing Strategy** – Deploy both PD-optimized and combined pods in the same environment and route traffic intelligently based on workload and system load. +- **Routing Profiles** – Define multiple routing behaviors in a single model configuration and select them per request using a header. +- **Improved LoRA Artifact Delivery** – Artifact downloads are now fully handled by the **AIBrix runtime**, with direct credential passing, first-class **Volcengine TOS** support, and non-blocking async downloads. +- **Expanded API Surface** + - OpenAI-compatible audio APIs: `/v1/audio/transcriptions`, `/v1/audio/translations` + - New endpoints: `/v1/classify` and `/v1/rerank` +- **Custom Routing Paths** – Extend model router endpoints through deployment annotations. + +Together, these updates make **AIBrix v0.6.0** more deployable, observable, and adaptable for production AI workloads. ## v0.6.0 Highlight Features @@ -146,6 +157,8 @@ A routing profile is defined inside the **model.aibrix.ai/config** JSON annotati - **promptLenBucketMaxLength** - Whether a pod is combined for prefill/decode +These are the options currently supported; additional profile configuration options will be added gradually. + You also define a **defaultProfile** that acts as the fallback. If a client omits the **config-profile** header or specifies an unknown profile, AIBrix automatically falls back to the configured **defaultProfile**. 
From 3f74e22eb35106d2505b174257d8656e2fac702e Mon Sep 17 00:00:00 2001 From: varungupta Date: Tue, 3 Mar 2026 19:10:01 -0800 Subject: [PATCH 3/5] update contributors and community section Signed-off-by: varungupta --- content/posts/2026-03-03-v0.6.0-release.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/posts/2026-03-03-v0.6.0-release.md b/content/posts/2026-03-03-v0.6.0-release.md index e7a012b..d274e7e 100644 --- a/content/posts/2026-03-03-v0.6.0-release.md +++ b/content/posts/2026-03-03-v0.6.0-release.md @@ -374,13 +374,13 @@ For a full list of changes, commit history, and contributor details, see the [** ## Contributors & Community -This v0.6.0 release includes **42 contributions**, with **18** from first-time contributors 💫. A huge thank-you to everyone who helped shape this release through code, issues, reviews, and feedback. +This v0.6.0 release includes **95 merged PRs**, with **15** from first-time contributors 💫. Thank you to everyone who helped shape this release through code, issues, reviews, and feedback. -Special shout-out to [@Jeffwan](https://github.com/Jeffwan), [@varungup90](https://github.com/varungup90), and [@googs1025](https://github.com/googs1025) for their continued work on reliability, gateway improvements, and control-plane evolution. +Special shout-out to [@Jeffwan](https://github.com/Jeffwan), [@varungup90](https://github.com/varungup90), [@googs1025](https://github.com/googs1025), [@scarlet25151](https://github.com/scarlet25151), [@DwyaneShi](https://github.com/DwyaneShi), and [@nurali-techie](https://github.com/nurali-techie) for their continued work on reliability, gateway improvements, control-plane evolution, and documentation. 
We’re excited to welcome the following new contributors to the AIBrix community: -[@autopear](https://github.com/autopear), [@ronaldosaheki](https://github.com/ronaldosaheki), [@jiangxiaobin96](https://github.com/jiangxiaobin96), [@mayooot](https://github.com/mayooot), [@zyfy29](https://github.com/zyfy29), [@zhengkezhou1](https://github.com/zhengkezhou1), [@tianzhiqiang3](https://github.com/tianzhiqiang3), [@atakli](https://github.com/atakli), [@jwjwjw3](https://github.com/jwjwjw3), [@lx1036](https://github.com/lx1036), [@chethanuk](https://github.com/chethanuk), [@baozixiaoxixi](https://github.com/baozixiaoxixi), [@TylerGillson](https://github.com/TylerGillson), [@omrishiv](https://github.com/omrishiv), [@lex1ng](https://github.com/lex1ng), [@ChenTaoyu-SJTU](https://github.com/ChenTaoyu-SJTU), [@zhenyu-02](https://github.com/zhenyu-02), [@yapple](https://github.com/yapple) 🙌 +[@sherlockkenan](https://github.com/sherlockkenan), [@dczhu](https://github.com/dczhu), [@sceneryback](https://github.com/sceneryback), [@Deepam02](https://github.com/Deepam02), [@rayne-Li](https://github.com/rayne-Li), [@n0gu-furiosa](https://github.com/n0gu-furiosa), [@cabrinha](https://github.com/cabrinha), [@sanmuny](https://github.com/sanmuny), [@erictanjn](https://github.com/erictanjn), [@fungaren](https://github.com/fungaren), [@paranoidRick](https://github.com/paranoidRick), [@pbillaut](https://github.com/pbillaut), [@alpe](https://github.com/alpe), [@liangdong1201](https://github.com/liangdong1201), [@yahavb](https://github.com/yahavb) 🙌 Your contributions continue to make AIBrix more scalable, production-ready, and welcoming as an open community. We’re excited to see the ecosystem grow—keep them coming! 
From 4d8c54a0da3415f1ef7cd619da577c19a3324bd9 Mon Sep 17 00:00:00 2001 From: varungupta Date: Wed, 4 Mar 2026 12:42:17 -0800 Subject: [PATCH 4/5] add v0.6.0 release notes as part of summary Signed-off-by: varungupta --- content/posts/2026-03-03-v0.6.0-release.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/content/posts/2026-03-03-v0.6.0-release.md b/content/posts/2026-03-03-v0.6.0-release.md index d274e7e..c66143f 100644 --- a/content/posts/2026-03-03-v0.6.0-release.md +++ b/content/posts/2026-03-03-v0.6.0-release.md @@ -30,7 +30,7 @@ Today we're excited to announce **AIBrix v0.6.0**, a release that expands how yo - New endpoints: `/v1/classify` and `/v1/rerank` - **Custom Routing Paths** – Extend model router endpoints through deployment annotations. -Together, these updates make **AIBrix v0.6.0** more deployable, observable, and adaptable for production AI workloads. +Together, these updates make **AIBrix v0.6.0** easier to deploy, easier to observe, and more adaptable for production AI workloads. For the complete list of changes, commit history, and contributor details, see the [**AIBrix v0.6.0 Release Notes**](https://github.com/vllm-project/aibrix/releases/tag/v0.6.0). ## v0.6.0 Highlight Features @@ -370,8 +370,6 @@ v0.6.0 also expands observability, deployment options, and stability: - **Gateway & routing:** Session affinity routing; external header filters for advanced routing; **least-request routing** for distributed DP API servers. - **Bug fixes:** RDMA issues in P/D setups (SGLang/vLLM); Redis auth in Helm; divide-by-zero in APA autoscaling; envoy extension policy paths and gateway service configuration; metrics label cardinality panic; CGO build alignment for builder/runtime env. -For a full list of changes, commit history, and contributor details, see the [**AIBrix v0.6.0 Release Notes**](https://github.com/vllm-project/aibrix/releases/tag/v0.6.0). 
- ## Contributors & Community This v0.6.0 release includes **95 merged PRs**, with **15** from first-time contributors 💫. Thank you to everyone who helped shape this release through code, issues, reviews, and feedback. From 29b3e8d9b221c3354e3bb56e495a85b889d1272a Mon Sep 17 00:00:00 2001 From: varungupta Date: Wed, 4 Mar 2026 13:39:21 -0800 Subject: [PATCH 5/5] address code review comments Signed-off-by: varungupta --- content/posts/2026-03-03-v0.6.0-release.md | 243 +++++++-------------- 1 file changed, 85 insertions(+), 158 deletions(-) diff --git a/content/posts/2026-03-03-v0.6.0-release.md b/content/posts/2026-03-03-v0.6.0-release.md index c66143f..78ebc9a 100644 --- a/content/posts/2026-03-03-v0.6.0-release.md +++ b/content/posts/2026-03-03-v0.6.0-release.md @@ -36,16 +36,13 @@ Together, these updates make **AIBrix v0.6.0** easier to deploy, easier to obser ### Envoy as a Sidecar: Simplifying Gateway Deployments -- Adds support for running Envoy as a sidecar alongside the gateway-plugin. -- Introduces a more flexible and lightweight alternative to the existing Envoy Gateway integration. -- Controlled via the **envoyAsSideCar** flag in the Helm chart. -- Allows embedding Envoy directly in the same pod as the gateway plugin instead of relying on an external Envoy Gateway controller. -- Reduces architectural complexity and removes a hard dependency on Envoy Gateway. -- Gives operators more direct control over Envoy configuration and lifecycle. +This release introduces support for running **Envoy as a sidecar** alongside the AIBrix gateway-plugin. Instead of relying on an external Envoy Gateway controller, operators can now embed Envoy directly within the same pod as the gateway plugin. This approach provides a lighter-weight deployment option and reduces the architectural complexity of the gateway stack. -### Deployment Modes +The new mode is controlled through the **`envoyAsSideCar`** flag in the Helm chart. 
When enabled, Envoy runs as a sidecar container that shares the lifecycle of the gateway-plugin pod. This removes the hard dependency on Envoy Gateway while giving operators more direct control over Envoy configuration and behavior. -This feature introduces two mutually exclusive deployment modes: +### Flexible Deployment Modes + +With this change, AIBrix now supports two mutually exclusive deployment patterns, allowing teams to choose the model that best fits their infrastructure and operational preferences. ``` +---------------------------+ +-----------------------------+ @@ -57,46 +54,41 @@ This feature introduces two mutually exclusive deployment modes: +---------------------------+ +-----------------------------+ ``` -### Envoy Gateway Mode -- Envoy is managed as a separate control-plane component. -- Uses Kubernetes Gateway API resources. -- Requires resources such as GatewayClass, EnvoyExtensionPolicy, and HTTPRoute. -- Follows a controller-driven architecture. +In **Envoy Gateway mode**, Envoy is managed through a separate control-plane component that uses the Kubernetes Gateway API. Resources such as `GatewayClass`, `EnvoyExtensionPolicy`, and `HTTPRoute` define how traffic is routed and processed, following a controller-driven architecture. -### Envoy Sidecar Mode +In contrast, **Envoy Sidecar mode** runs Envoy directly within the gateway-plugin pod. Envoy receives its configuration from a ConfigMap and is exposed through the gateway-plugin service, eliminating the need for Gateway API controllers. This model simplifies networking and reduces the number of required cluster resources. -- Envoy runs as a sidecar container in the same pod as the gateway plugin. -- Receives static configuration via a ConfigMap. -- Does not require Gateway API controller resources. -- Exposes Envoy directly through the gateway-plugin service. -- Simplifies networking and reduces resource sprawl. 
+### Why This Matters -### Impact +Supporting both deployment approaches gives operators greater flexibility when running AIBrix in different environments. While the Gateway API model remains ideal for fully managed, controller-driven setups, the sidecar mode enables simpler and more self-contained deployments. -- Supports a broader range of deployment patterns. -- Enables both fully managed Gateway control-plane setups and self-contained sidecar-based deployments. -- Particularly useful for edge environments or scenarios where running separate controllers is not desirable. +This is especially useful in edge environments, lightweight clusters, or scenarios where running additional controllers is undesirable. By embedding Envoy directly with the gateway plugin, teams can deploy and operate the gateway with fewer moving parts while still benefiting from Envoy’s powerful networking capabilities. --- -## Combined Routing Strategy: Smarter Pod Selection +## Smarter Request Routing with AIBrix’s Combined Pod Strategy + +Modern LLM inference workloads are rarely uniform. Some requests contain long prompts that benefit from specialized execution pipelines, while others are short and interactive. Designing an infrastructure that efficiently handles both cases can be challenging. + +In the latest AIBrix release, we introduce the Combined Routing Strategy — a smarter approach to routing requests across different pod types within the same deployment. This feature allows AIBrix to dynamically choose between PD-optimized pods (prefill/decode disaggregated) and combined pods (single-process inference), improving overall performance, flexibility, and resource utilization. -In this release, AIBrix enhances its routing logic with a **Combined Routing Strategy** that allows the system to intelligently route requests to both PD-optimized pods (prefill/decode disaggregated) and combined pods (non-PD) within the same deployment. 
+Instead of forcing operators to choose one architecture or maintain separate deployments, AIBrix can now run both approaches together and intelligently decide which pod should handle each request. -Previously, AIBrix distinguished mainly between PD-style (prefill/decode) pods and traditional single-pod models, limiting flexibility in how traffic is assigned based on workload characteristics. With this feature, operators can deploy both PD and combined pods for a model and let the gateway routing algorithm dynamically choose the best target based on current load and request patterns. +### Introducing the Combined Routing Strategy -### What "Combined Strategy" Means +With this new strategy, AIBrix enables intelligent request routing across both PD and combined pods within the same deployment. -The new combined strategy enhances resource utilization and routing efficiency: +Each pod type serves a specific purpose: -- PD pods continue to serve disaggregated traffic where prefill/decode separation yields performance benefits (for prompt-heavy workloads). -- Combined pods handle requests end-to-end in a single process — ideal when PD pods are under heavy load or for shorter prompt workloads. -- The routing algorithm now includes pod scoring and dynamic selection between PD and combined pods based on load, queue metrics, and defined strategy. This means combined pods can absorb traffic automatically when PD pods are oversubscribed, helping maintain low latency and high throughput. +- **PD Pods (Prefill/Decode Disaggregated)** — Designed for workloads where separating prefill and decode stages improves efficiency. These pods are ideal for long prompts or workloads dominated by heavy prompt processing. +- **Combined Pods (Single-Pod Inference)** — Handle the entire request lifecycle within a single process. They provide efficient handling for short prompts and can absorb traffic when PD resources are busy. 
-### Routing Flow Overview

+The AIBrix gateway now evaluates real-time system conditions and selects the most appropriate target pod automatically.

-Here's a high-level illustration of how traffic routing works under the Combined Strategy:
+### How Routing Works
+
+At the core of the Combined Strategy is an updated routing algorithm that evaluates available pods and dynamically selects the best candidate.

```
+---------------------------+
@@ -118,54 +110,48 @@ Here's a high-level illustration of how traffic routing works under the Combined
| (long prompts, prefill heavy, etc.) | or combined is underutilized
```

-### Why It Matters
-
-- Run both PD and combined pod types for the same model in a single deployment — no separate clusters required.
-- Dynamically route traffic based on real-time conditions instead of static labels.
-- Improve performance for mixed workloads:
-  - Long, prompt-heavy requests benefit from PD pods.
-  - Short or interactive requests are efficiently handled by combined pods.
-- Maintain stable latency under variable load — combined pods can absorb overflow when PD pods are saturated.
-- Achieve better overall resource utilization and throughput.
-- Powered by updated routing logic, new metrics, and a pod scoring mechanism for fair and efficient pod selection.
+The routing decision incorporates several signals, including:

-**Result:** AIBrix moves beyond rigid PD vs. non-PD routing and delivers an adaptive, single-deployment model that automatically selects the most appropriate pod type for each request.
+- **Current pod load**
+- **Queue metrics**
+- **Pod availability**

----

+A scoring mechanism combines these signals to rank candidate pods, allowing AIBrix to balance traffic fairly and efficiently across both PD and combined pods.
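+To make the scoring idea concrete, here is a minimal, hypothetical sketch of load-aware selection between PD and combined pods. None of these names come from the AIBrix codebase, and the weights and thresholds are invented for illustration only.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class PodStats:
+    name: str
+    kind: str           # "pd" or "combined"
+    active_requests: int
+    queue_depth: int
+    capacity: int
+
+def score(pod: PodStats) -> float:
+    """Lower is better: utilization plus queue pressure (invented weights)."""
+    utilization = pod.active_requests / max(pod.capacity, 1)
+    return utilization + 0.5 * pod.queue_depth
+
+def select_pod(pods: list[PodStats], prompt_len: int, pd_threshold: int = 2048) -> PodStats:
+    """Prefer PD pods for long prompts, but overflow to combined pods
+    once every PD pod is oversubscribed (score >= 1.0)."""
+    pd = [p for p in pods if p.kind == "pd"]
+    combined = [p for p in pods if p.kind == "combined"]
+    if prompt_len >= pd_threshold and pd:
+        best_pd = min(pd, key=score)
+        if score(best_pd) < 1.0 or not combined:
+            return best_pd
+    candidates = combined or pd
+    return min(candidates, key=score)
+```
+
+In this toy model a long prompt prefers the least-loaded PD pod but spills over to combined pods when the PD pool is saturated, mirroring the overflow behavior described above.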
-## Routing Profiles: One Config, Many Behaviors +### Benefits of the Combined Routing Strategy -With this release, AIBrix introduces **Routing Profiles** — a flexible way to define multiple routing behaviors in a single model configuration and select them per request, without duplicating deployments. +By supporting both PD and combined pods in the same deployment, AIBrix can adapt to changing workloads and route requests more efficiently. Key benefits include: -Instead of scattering routing-related settings across pod labels or creating separate deployments for different workloads, you can define multiple named profiles inside a single structured config (via the **model.aibrix.ai/config** annotation). Clients then select the desired behavior at request time using the **config-profile** header. +- **Efficient handling of mixed workloads** — Long prompts are routed to PD pods for better prefill/decode performance, while short or interactive requests are served by combined pods for faster responses. +- **Overflow handling during traffic spikes** — Combined pods can absorb requests when PD pods are saturated, helping maintain stable latency. +- **Single deployment flexibility** — Run PD and combined pods together for the same model without managing separate clusters. +- **Dynamic traffic routing** — Requests are routed based on real-time system conditions rather than static configuration. +- **Better resource utilization** — Traffic is distributed intelligently across available pods to maximize GPU usage and throughput. -This enables a single model deployment to serve multiple traffic patterns — for example: +--- -- Default routing for general workloads -- PD (prefill/decode) routing for disaggregated serving -- Low-latency routing for interactive use cases +## Routing Profiles: One Deployment, Multiple Routing Behaviors -All from the same pods. +As inference workloads grow more diverse, different types of traffic often require different routing strategies. 
Some requests benefit from PD (prefill/decode) routing, others prioritize low latency, while general workloads may only need simple load balancing. Traditionally, supporting these variations required multiple deployments or complex label configurations. -### How It Works +With this release, AIBrix introduces Routing Profiles — a new way to define multiple routing behaviors within a single model configuration and select them dynamically per request. -A routing profile is defined inside the **model.aibrix.ai/config** JSON annotation. Each profile can specify: +Instead of spreading routing settings across pod labels or maintaining separate deployments, you can define multiple named routing profiles inside the **model.aibrix.ai/config** annotation. Clients then select the desired routing behavior at request time using the **config-profile** header. -- **routingStrategy** (e.g. random, pd, least-latency) -- Prompt-length bucket bounds for PD routing: - - **promptLenBucketMinLength** - - **promptLenBucketMaxLength** -- Whether a pod is combined for prefill/decode +This allows a single model deployment to handle multiple traffic patterns — such as general workloads, PD routing, or low-latency requests — all using the same set of pods. -These are the options currently supported; additional profile configuration options will be added gradually. +### Defining Routing Profiles -You also define a **defaultProfile** that acts as the fallback. +Routing profiles are defined as structured JSON inside the **model.aibrix.ai/config** annotation. Each profile can configure routing behavior and parameters such as: -If a client omits the **config-profile** header or specifies an unknown profile, AIBrix automatically falls back to the configured **defaultProfile**. +- **routingStrategy** (e.g. 
random, pd, or least-latency)
+- Prompt-length bucket ranges used by PD routing (**promptLenBucketMinLength**, **promptLenBucketMaxLength**)
+- Whether pods operate in combined prefill/decode mode

-### Example: Defining Profiles on a Pod
+You also define a **defaultProfile**, which acts as a fallback if a client does not specify a profile. These are the options currently supported; additional profile configuration options will be added gradually.

-You can define multiple profiles in the pod template (for example, in a StormService) like this:
+Example:

```json
{
@@ -190,109 +176,60 @@
}
```

-In this example:
+In this example, the **pd** profile is configured as the default. Clients can explicitly choose **default**, **pd**, or **low-latency** depending on their workload. If no profile is provided, AIBrix automatically falls back to the default profile.

-- The **pd** profile is the default.
-- Clients can explicitly select **default**, **pd**, or **low-latency**.
-- If no profile is specified, **pd** is used automatically.

+### Why Routing Profiles Matter

-### Why This Matters
-
-Previously, routing behavior was controlled through multiple pod labels. As routing options grew (PD mode, prompt-length buckets, latency-aware routing, etc.), configuration became harder to manage and reason about.

+Routing Profiles simplify routing configuration while making deployments far more flexible. Instead of creating separate deployments for different traffic patterns, operators can define multiple routing behaviors in a single configuration and select them dynamically.

-Routing Profiles provide:
+This approach provides several benefits:

-- A single source of truth for routing configuration
-- Per-request flexibility without extra deployments
-- Cleaner gateway logic
-- Better separation of workload types (batch vs. interactive, PD vs. single-pod, etc.)
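+The header fallback behavior can be sketched as a small resolver. This is not AIBrix’s actual gateway code: it assumes a hypothetical top-level `profiles` map next to `defaultProfile`, which may differ from the real annotation schema.
+
+```python
+import json
+
+def resolve_profile(config_json: str, header: str | None) -> dict:
+    """Resolve the routing profile for a request: use the config-profile
+    header when it names a known profile, otherwise fall back to
+    defaultProfile (hypothetical schema for illustration)."""
+    config = json.loads(config_json)
+    profiles = config["profiles"]
+    chosen = header if header in profiles else config["defaultProfile"]
+    return profiles[chosen]
+```
+
+A missing or unknown `config-profile` header thus silently resolves to the configured default, so clients that never set the header keep working unchanged.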
-- No pod duplication just to change routing behavior +- **Single source of truth** for routing configuration +- **Per-request flexibility** without extra deployments +- **Cleaner gateway logic** that is easier to manage +- **Clear separation** between workload types (batch vs. interactive, PD vs. single-pod) -In short: **one deployment, many behaviors.** +In practice, this means one deployment can support many routing behaviors, allowing AIBrix to adapt to different workload patterns without increasing operational complexity. --- -## Streamlined LoRA Artifact Delivery with Aibrix Runtime - -LoRA artifact delivery is now fully delegated to the Aibrix runtime, making the runtime the single source of truth for preparing and downloading model artifacts. - -These updates tighten the LoRA adapter lifecycle end-to-end: - -- The controller now passes real credentials directly to the runtime -- The runtime centrally handles artifact preparation and downloads -- First-class support for Volcengine TOS is added -- Async responsiveness is preserved during artifact downloads - -Let's break down what's changed. - -### Controllers Now Pass Full Kubernetes Secret Data +## Streamlined LoRA Artifact Delivery with AIBrix Runtime -Previously, credential handling required additional indirection. Now, the modeladapter controller takes a more direct approach: +Managing LoRA adapters often requires coordinating credentials, storage access, and runtime behavior. To simplify this, AIBrix now moves LoRA artifact preparation and delivery entirely into the runtime, making it the single source of truth for downloading and preparing model artifacts. -- The **loraClient** fetches the referenced Kubernetes Secret -- It converts **secret.Data** into a string map -- The credentials are embedded directly into the LoRA load request payload +This change simplifies credential handling, centralizes artifact management, and keeps the runtime responsive during downloads. 
-This makes it straightforward to support IAM-style credentials for S3-compatible storage systems.

-### Runtime Behavior
+### Runtime as the Source of Truth

-On the runtime side:
+Previously, artifact preparation involved coordination between controllers and the runtime. Now the AIBrix runtime handles artifact validation and downloads directly, while the controller simply passes the required information, such as credentials.

-- The Aibrix runtime accepts direct key/value credentials
-- Optional **additional_config** is supported
-- The artifact service prefers direct credentials when present
-- It falls back to loading credentials from a secret file only when necessary
+The runtime prefers credentials passed directly in the request, falling back to a secret file only when necessary, and also accepts an optional **additional_config** payload. This clearer separation reduces operational complexity and simplifies the LoRA adapter lifecycle.

-This simplifies credential flow and removes ambiguity about where artifact access is configured.

+### Simplified Credential Flow

-### First-Class tos:// Support for Volcengine TOS
+The modeladapter controller now retrieves credentials directly from the referenced Kubernetes Secret and embeds them into the LoRA load request.

-We've added native support for Volcengine Object Storage (TOS).
+The flow is straightforward:

-**What's Included**
+- The controller reads the referenced Kubernetes Secret.
+- It converts **secret.Data** into a key-value string map.
+- It sends the credentials directly to the runtime in the load request.

+This makes it easier to support IAM-style credentials for S3-compatible storage systems and removes ambiguity around artifact access configuration.
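+The conversion step above can be illustrated with a short sketch. The actual controller is written in Go; this Python version only mirrors the idea, and `build_load_request` is a hypothetical payload shape, not the actual AIBrix API.
+
+```python
+def secret_data_to_string_map(data: dict[str, bytes]) -> dict[str, str]:
+    """Kubernetes clients expose Secret data as raw bytes; flatten to strings."""
+    return {key: value.decode("utf-8") for key, value in data.items()}
+
+def build_load_request(model: str, artifact_url: str, credentials: dict[str, str]) -> dict:
+    # Hypothetical payload shape: credentials ride along with the load request.
+    return {"model": model, "artifactURL": artifact_url, "credentials": credentials}
+```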
-- The artifact URL validator now recognizes **tos://** as a valid scheme
-- The Python runtime includes a dedicated **TOSArtifactDownloader** implementation

+### First-Class Support for Volcengine TOS
+
+The artifact URL validator now accepts **tos://** as a valid scheme, backed by a dedicated **TOSArtifactDownloader** in the Python runtime. The downloader reads **TOS_ACCESS_KEY**, **TOS_SECRET_KEY**, and an optional **TOS_SESSION_KEY** from the supplied credentials, keeping Volcengine TOS consistent with the other supported object storage backends.
+
+### Non-Blocking Artifact Downloads

-**Required Credential Keys**
-
-The TOS downloader expects credentials in the following format:
-
-- **TOS_ACCESS_KEY**
-- **TOS_SECRET_KEY**
-- **TOS_SESSION_KEY** (optional)
-
-This makes integrating with Volcengine TOS seamless and consistent with other supported object storage backends.

-### Async Runtime Remains Responsive

-Object storage SDK calls are typically blocking I/O operations, which can stall async runtimes if not handled carefully.

-To prevent this, the runtime now wraps artifact download operations using:
+Since object storage SDKs often rely on blocking I/O, artifact downloads are executed using:

```python
asyncio.to_thread
```

-This ensures:
-
-- Downloads run in a separate worker thread
-- The main event loop remains responsive
-- Concurrent requests are not blocked while artifacts are being fetched
+This runs downloads in a worker thread, keeping the async runtime responsive and allowing concurrent requests to continue while artifacts are being fetched.

-The result: better runtime responsiveness without sacrificing compatibility with existing storage SDKs.
+### A Simpler, More Reliable Pipeline

-### Why This Matters
-
-With these improvements:
-
-- Credential handling is cleaner and more secure
-- Artifact preparation logic is centralized in the runtime
-- Volcengine TOS works out of the box
-- Async performance remains stable under load
-
-Together, these changes make LoRA adapter management more robust, scalable, and production-ready.
+With these updates, LoRA artifact delivery becomes more streamlined. The runtime now manages artifact preparation centrally, credentials flow more cleanly, Volcengine TOS works out of the box, and downloads no longer block the event loop, making LoRA adapter management more reliable and production-ready.
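+To show why the `asyncio.to_thread` pattern matters, here is a self-contained toy example. `blocking_download` stands in for a real object-storage SDK call and is entirely invented for illustration; it is not the AIBrix downloader.
+
+```python
+import asyncio
+import time
+
+def blocking_download(url: str) -> str:
+    """Stand-in for a blocking object-storage SDK call."""
+    time.sleep(0.1)  # simulated network I/O
+    return f"/models/{url.rsplit('/', 1)[-1]}"
+
+async def download_artifact(url: str) -> str:
+    # Run the blocking call in a worker thread so the event
+    # loop keeps serving other requests in the meantime.
+    return await asyncio.to_thread(blocking_download, url)
+
+async def download_all(urls: list[str]) -> list[str]:
+    # Three 0.1 s downloads finish in roughly 0.1 s, not 0.3 s,
+    # because each blocking call runs in its own thread.
+    return list(await asyncio.gather(*(download_artifact(u) for u in urls)))
+```
+
+The same wrapping technique applies to any blocking SDK, which is why it preserves compatibility with existing storage clients while keeping the event loop free.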
--- @@ -334,28 +271,18 @@ metadata: model.aibrix.ai/model-router-custom-paths: /version,/score ``` -### How It Works +### Custom Paths and Why They Matter -- Custom paths are defined using the annotation: **model.aibrix.ai/model-router-custom-paths** -- Multiple paths must be comma-separated. -- Spaces and empty entries are ignored. -- Example: `/version,/score` +Custom routes are defined using the annotation **model.aibrix.ai/model-router-custom-paths**, where multiple paths can be specified as a comma-separated list. Spaces and empty entries are ignored. -This makes it easy to expose additional model-specific endpoints such as: - -- **/version** – For model version metadata -- **/score** – For custom scoring or evaluation logic - -### Why This Matters +Example: `/version,/score` -With these updates, you can: +These paths allow you to expose additional model-specific endpoints such as: -- Integrate speech workflows directly into your API layer -- Improve retrieval pipelines with built-in reranking -- Add classification capabilities without additional infrastructure -- Extend routing behavior in a clean, declarative way +- **/version** — Return model version or metadata. +- **/score** — Implement custom scoring or evaluation logic. -These improvements provide greater flexibility while keeping deployments simple and maintainable. +By supporting custom routing paths, AIBrix makes it easier to extend the model API layer. You can integrate speech workflows, improve retrieval pipelines with reranking, add classification capabilities, or expose model-specific functionality—all through a simple, declarative deployment configuration. ---
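+The parsing rules above (comma-separated values, with spaces and empty entries ignored) can be sketched in a few lines. The function name is ours for illustration, not taken from the AIBrix router.
+
+```python
+def parse_custom_paths(annotation: str) -> list[str]:
+    """Split a model.aibrix.ai/model-router-custom-paths value into clean
+    paths, dropping surrounding whitespace and empty entries."""
+    return [part.strip() for part in annotation.split(",") if part.strip()]
+```
+
+For the annotation value `/version,/score`, this yields the two routes `/version` and `/score`, exactly as in the example above.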