As organizations accelerate the journey to production for large language model (LLM) workloads, the ecosystem of open source tools is growing fast. Two powerful projects—vLLM and llm-d—have recently emerged to tackle the complexity of inference at scale.

This has led to a common question among engineering teams: "Should we use vLLM or llm-d?" While comparing these tools is natural, the strategic answer lies not in choosing one over the other, but in understanding how they work together. It's about recognizing that a high-performance engine needs a championship-winning race strategy to deliver consistent results.

Understanding the ecosystem: Engine versus platform

The primary challenge developers face isn't just scaling—it's navigating the different layers of the AI stack.

When moving from a laptop prototype to a production cluster, it’s easy to assume the inference engine (the software that runs the model) handles everything, from traffic management to scaling. However, monolithic LLM servers weren't originally designed for the dynamic, cloud-native world. Running them in isolation can sometimes lead to inefficient GPU utilization or unpredictable latency—especially as workloads vary in context length and token rate.

To solve this, it helps to look at how these tools complement one another.

vLLM: The high-performance Formula 1 car

Think of vLLM as your Formula 1 car. It is a state-of-the-art, enterprise-grade inference engine designed for raw speed and efficiency.

vLLM provides the horsepower. Its performance edge comes from deep technical innovations like PagedAttention (which manages the KV cache in fixed-size pages, much as an operating system manages virtual memory), speculative decoding, and tensor parallelism. It is the component responsible for executing inference workloads, managing GPU memory on the node, and delivering fast responses.

If you want to serve a model on a single node or a well-tuned multi-GPU cluster, vLLM is the car that gets you to the track. But even the fastest F1 car benefits from a team to help it win the championship.
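To make that concrete, here is a minimal sketch of single-node serving with vLLM's offline Python API. The model name, sampling settings, and GPU count are placeholders; adjust them for your environment.

```python
# Minimal single-node vLLM example (offline batch inference).
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across the GPUs on this node;
# the model name below is only an example.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)

for output in outputs:
    print(output.outputs[0].text)
```

The same engine can also be exposed as an OpenAI-compatible HTTP server (for example, with the `vllm serve` command), which is how an orchestration layer typically consumes it.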

llm-d: The pit crew and race strategist

If vLLM is the car, llm-d is the pit crew, the race strategist, and the telemetry system combined.

llm-d is a cloud-native distributed inference framework designed to orchestrate vLLM. It acknowledges that a single car needs support to manage a long, complex race. llm-d disaggregates the inference process, splitting it into components (such as separate prefill and decode workers) that can be scheduled and scaled independently across a cluster.

To understand why this relationship is useful, let's look at the two phases of LLM generation through our racing lens:

  1. The "prefill" (the formation lap): This is analogous to the formation lap where drivers warm their tires and check systems. In LLMs, this is where the system processes the user's prompt and calculates the initial Key-Value (KV) cache. It is compute-bound: the entire prompt is processed in parallel before the first output token appears.
  2. The "decode" (the race): This is the fast, iterative race itself. The model generates one token at a time, and each new token requires reading the growing KV cache, so this phase is bound by memory bandwidth rather than raw compute. (A client-side sketch right after this list shows the difference in practice.)
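The two phases show up directly in client-side latency. The rough sketch below streams a response from an OpenAI-compatible vLLM endpoint and measures time to first token (dominated by prefill) versus the average gap between subsequent tokens (decode). The URL and model name are assumptions for illustration.

```python
import time
from openai import OpenAI

# Assumed local vLLM endpoint and model name; adjust for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
token_times = []

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize the rules of Formula 1."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # prefill ends when the first token arrives
        token_times.append(now)

if first_token_at is not None:
    ttft = first_token_at - start
    itl = (token_times[-1] - first_token_at) / max(len(token_times) - 1, 1)
    print(f"Time to first token (prefill): {ttft:.3f}s")
    print(f"Mean inter-token latency (decode): {itl:.4f}s")
```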

In a standard setup, the same worker handles both phases for every request, even though they stress the hardware in different ways. llm-d acts as race control: it can split prefill and decode across specialized workers and uses prefix-aware routing to determine which backend handles which request, making sure the "car" is always in the optimal mode for the track ahead.
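As a toy illustration of the routing idea (not llm-d's actual scheduler), a prefix-aware router can hash the shared beginning of a prompt and consistently send matching requests to the same vLLM worker, so that worker's existing KV cache gets reused. The worker names and prefix length below are made up.

```python
import hashlib

# Hypothetical pool of vLLM backends managed by the orchestrator.
WORKERS = ["vllm-worker-0", "vllm-worker-1", "vllm-worker-2"]
PREFIX_CHARS = 32  # deliberately short so the shared system prompt decides the route

def route(prompt: str) -> str:
    """Send prompts that share a prefix to the same worker (warm KV cache)."""
    digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).hexdigest()
    return WORKERS[int(digest, 16) % len(WORKERS)]

system_prompt = "You are a race engineer. Answer concisely.\n"
print(route(system_prompt + "How worn are the tires?"))
print(route(system_prompt + "When should we pit?"))  # lands on the same worker
```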

Better together: Orchestrating the fleet

There is no llm-d without vLLM. They are designed to be teammates. When you pair the engine (vLLM) with the orchestrator (llm-d), you unlock specific integrations that solve complex production hurdles:

  • Independent scaling (disaggregation): You can serve multibillion-parameter LLMs with disaggregated prefill and decode workers. Because llm-d separates these phases, you can scale your "warm-up" resources independently from your "race" resources, optimizing hardware utilization.
  • Expert-parallel scheduling for MoE: For massive Mixture of Experts (MoE) models, llm-d enables expert-parallel scheduling. This allows different "experts" within the model to be distributed across multiple vLLM nodes, so you can run models that are too large for a single GPU setup.
  • KV cache-aware routing: This is the equivalent of a pit crew knowing exactly how worn the tires are. llm-d intelligently reuses cached KV pairs from previous requests (prefix cache reuse). By routing a request to a worker that already holds the matching prompt prefix in its cache, it reduces latency and compute costs.
  • Kubernetes-native elasticity (KEDA & ArgoCD): This is where llm-d shines as a platform. It integrates seamlessly with KEDA (Kubernetes Event-driven Autoscaling) and ArgoCD. This allows the system to dynamically scale the fleet of vLLM "cars" up or down based on real-time demand, enabling high availability without burning budget on idle GPUs.
  • Granular telemetry: llm-d acts as the race engineer, observing per-token metrics like time to first token, KV cache hit rate, and GPU memory pressure.
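Much of that telemetry originates in the engine itself: vLLM's OpenAI-compatible server exposes Prometheus metrics on its /metrics endpoint, which the orchestration layer and standard observability stacks can scrape. The sketch below simply prints a few of them; the URL and metric-name substrings are assumptions and vary by version.

```python
import requests

METRICS_URL = "http://localhost:8000/metrics"  # assumed local vLLM server
KEYWORDS = ("time_to_first_token", "cache", "num_requests")  # illustrative filters

for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
    if not line.startswith("#") and any(key in line for key in KEYWORDS):
        print(line)
```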

Final thoughts

Deploying vLLM on its own is a fantastic way to get started. But as you move toward a globally scalable LLM service, you will likely need more than just the engine.

llm-d does not replace vLLM; it enhances it. It provides the cloud-native control plane that turns a high-performance engine into a winning inference system. By using them together, you can be sure that your AI infrastructure isn't just fast—it's championship-ready.

Ready to get on the track? Dive deeper with this introduction to llm-d or test things out with the 30-day, self-supported OpenShift AI Developer Sandbox.


About the authors

Christopher Nuland is a Principal Technical Marketing Manager for AI at Red Hat and has been with the company for over six years. Before Red Hat, he focused on machine learning and big data analytics for companies in the finance and agriculture sectors. After joining Red Hat, he specialized in cloud-native migrations, metrics-driven transformations, and the deployment and management of modern AI platforms as a Senior Architect for Red Hat’s consulting services, working almost exclusively with Fortune 50 companies until recently moving into his current role. Christopher has spoken worldwide on AI at conferences like IBM Think, KubeCon EU/US, and Red Hat’s Summit events.

Carlos Condado is a Senior Product Marketing Manager for Red Hat AI. He helps organizations navigate the path from AI experimentation to enterprise-scale deployment by guiding the adoption of MLOps practices and integration of AI models into existing hybrid cloud infrastructures. As part of the Red Hat AI team, he works across engineering, product, and go-to-market functions to help shape strategy, messaging, and customer enablement around Red Hat’s open, flexible, and consistent AI portfolio.

With a diverse background spanning data analytics, integration, cybersecurity, and AI, Carlos brings a cross-functional perspective to emerging technologies. He is passionate about technological innovations and helping enterprises unlock the value of their data and gain a competitive advantage through scalable, production-ready AI solutions.
