vLLM Blog

vLLM is a fast and easy-to-use library for LLM inference and serving.

Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM

Christian Pinto (IBM Research Europe - Dublin), Michele Gazzetti (IBM Research Europe - Dublin), Michael Johnston (IBM Research Europe - Dublin), Maximilien Philippe Marie de Bayser (IBM Research - Brazil)

Introduction

Until recently, generative AI infrastructure has been tightly coupled with autoregressive text generation models that produce output token-by-token, typically in the form of natural...