Expanding Docker Model Runner’s Capabilities

Today, we’re excited to announce that Docker Model Runner now integrates the vLLM inference engine and safetensors models, unlocking high-throughput AI inference with the same Docker tooling you already use.

When we first introduced Docker Model Runner, our goal was to make it simple for developers to run and experiment with large language models (LLMs) using Docker. We designed it to integrate multiple inference engines from day one, starting with llama.cpp, to make it easy to get models running anywhere.

Now, we’re taking the next step in that journey. With vLLM integration, you can scale AI workloads from low-end to high-end Nvidia hardware, without ever leaving your Docker workflow.

Why vLLM?

vLLM is a high-throughput, open-source inference engine built to serve large language models efficiently at scale. It’s used across the industry for deploying production-grade LLMs thanks to its focus on throughput, latency, and memory efficiency.

Here’s what makes vLLM stand out:

- PagedAttention, which manages the attention key-value cache in GPU memory far more efficiently than naive allocation
- Continuous batching, which keeps GPUs busy by folding incoming requests into batches on the fly
- Optimized CUDA kernels and broad quantization support for faster, leaner inference
- An OpenAI-compatible API server and tensor-parallel, multi-GPU serving for scaling out

By bringing vLLM to Docker Model Runner, we’re bridging the gap between fast local experimentation and robust production inference.

How vLLM Works

Running vLLM models with Docker Model Runner is as simple as installing the backend and running your model, with no special setup required.

Install Docker Model Runner with vLLM backend:

docker model install-runner --backend vllm --gpu cuda
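
Before running anything, it can be handy to confirm the runner is up and pre-pull the model. Here’s a minimal sketch, assuming the docker model status, pull, and list subcommands available in your Docker CLI version:

# Check that Docker Model Runner is running
docker model status

# Download the safetensors model ahead of time (optional; run will pull it if it's missing)
docker model pull ai/smollm2-vllm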

Once the installation finishes, you’re ready to start using it right away:

docker model run ai/smollm2-vllm "Can you read me?"
Sure, I am ready to read you.
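
If you’d rather chat interactively, you can drop the prompt; this sketch assumes the interactive mode of docker model run behaves the same with the vLLM backend as it does with llama.cpp:

# Open an interactive chat session with the model
docker model run ai/smollm2-vllm

# Confirm the model is stored locally
docker model list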

Or access it via API:

curl --location 'http://localhost:12434/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "ai/smollm2-vllm",
  "messages": [
    {
      "role": "user",
      "content": "Can you read me?"
    }
  ]
}'
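
The request body follows the familiar OpenAI chat completions schema, so standard parameters should carry over. For example, a streaming request with a token cap might look like this (treat stream and max_tokens support as an assumption to verify against your Model Runner version):

curl --location 'http://localhost:12434/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "ai/smollm2-vllm",
  "messages": [
    { "role": "user", "content": "Write a haiku about containers." }
  ],
  "stream": true,
  "max_tokens": 128
}'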

Note that there’s no reference to vLLM in the HTTP request or CLI command.

That’s because Docker Model Runner automatically routes the request to the correct inference engine based on the model you’re using, ensuring a seamless experience whether you’re using llama.cpp or vLLM.
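
You can see this routing in action by running a GGUF-packaged model and a safetensors-packaged model with the exact same command; only the model name changes. (This sketch assumes the GGUF sibling of the example model is published as ai/smollm2.)

# Served by llama.cpp (GGUF)
docker model run ai/smollm2 "Can you read me?"

# Served by vLLM (safetensors)
docker model run ai/smollm2-vllm "Can you read me?"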

Why Multiple Inference Engines?

Until now, developers had to choose between simplicity and performance. You could either run models easily (using simple, portable tools like Docker Model Runner with llama.cpp) or achieve maximum throughput (with frameworks like vLLM).

Docker Model Runner now gives you both.

You can:

- Prototype and iterate locally with llama.cpp on modest hardware.
- Serve the same models at high throughput with vLLM on Nvidia GPUs.
- Use the same consistent Docker commands, CI/CD workflows, and deployment environments throughout.

This flexibility makes Docker Model Runner a first in the industry — no other tool lets you switch between multiple inference engines within a single, portable, containerized workflow.

By unifying these engines under one interface, Docker is making AI truly portable, from laptops to clusters, and everything in between.

Safetensors (vLLM) vs. GGUF (llama.cpp): Choosing the Right Format

With the addition of vLLM, Docker Model Runner is now compatible with the two dominant open-source model formats: Safetensors and GGUF. While Model Runner abstracts away the complexity of setting up each engine, understanding the difference between these formats helps you choose the right tool for your infrastructure.

Docker Model Runner intelligently routes your request: if you pull a GGUF model, it utilizes llama.cpp; if you pull a safetensors model, it leverages the power of vLLM. With Docker Model Runner, both can be pushed and pulled as OCI images to any OCI registry.
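
Because models are stored as OCI artifacts, moving a safetensors model into your own registry looks just like moving an image. A minimal sketch, with registry.example.com standing in for a hypothetical private registry (and assuming the docker model tag and push subcommands in your CLI version):

docker model pull ai/smollm2-vllm

# Retag the model for a private OCI registry
docker model tag ai/smollm2-vllm registry.example.com/team/smollm2-vllm

# Push it; any OCI-compliant registry will do
docker model push registry.example.com/team/smollm2-vllm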

vLLM-compatible models on Docker Hub

vLLM models are packaged in safetensors format. Early safetensors models, including the ai/smollm2-vllm model used in the examples above, are already available in the ai/ namespace on Docker Hub.

Available Now: x86_64 with Nvidia

Our initial release is optimized for and available on systems running the x86_64 architecture with Nvidia GPUs. Our team has dedicated its efforts to creating a rock-solid experience on this platform, and we’re confident you’ll feel the difference.
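
If you’re unsure whether your machine qualifies, a quick sanity check is to confirm that Docker can see the GPU before installing the vLLM backend. This assumes the Nvidia drivers and the NVIDIA Container Toolkit are already installed; the CUDA image tag below is just an example, so substitute whatever recent tag you prefer:

# Driver check on the host
nvidia-smi

# GPU visibility from inside a container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi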

What’s Next?

This launch is just the beginning. Our vLLM roadmap is focused on two key areas: expanding platform access and continuous performance tuning.

Thank you for your support and patience as we grow.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. We need your help to make this project the best it can be, whether that means trying the new vLLM backend, sharing feedback, or contributing to the open-source project.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!