Every organization piloting generative AI (gen AI) eventually hits the "inference wall." It’s the moment when the excitement of a working prototype meets the cold reality of production. Suddenly, that single model running on a developer’s laptop needs to serve thousands of concurrent users, maintain sub-50ms latency, and somehow not bankrupt the IT budget in cloud costs.
The core challenge for enterprise AI is largely operational: solving the efficiency equation. It is no longer enough to simply run a model; you must run it with predictable, high performance. How do you maximize tokens per dollar? How do you make sure that a sudden spike in traffic doesn't bring your application to a halt?
In this post, we look at 3 practical strategies that help IT leaders and architects solve the inference puzzle:
- Optimized runtimes (vLLM) to treat your inference engine like a high-performance race car
- Model optimization to do more with less using compression and speculators
- Distributed inference (llm-d) to break the "one model, one GPU" limit and scale horizontally
The silent killer of AI return on investment: Inefficient inference
The math of AI inference is unforgiving. Unlike traditional microservices, LLM requests are non-uniform, stateful, and compute-intensive. A single request can occupy a GPU for seconds, not milliseconds. If your infrastructure isn't optimized, you end up over-provisioning expensive hardware just to handle peak loads, leaving GPUs idle most of the time.
To fix this, we need to move beyond simple model serving and adopt a true AI platform approach.
1. Tune the engine with vLLM
Before you add more hardware, you need to maximize what you already have. Running an unoptimized model on a GPU is like driving a Ferrari in first gear: the hardware is capable, but most of its potential goes to waste.
vLLM has emerged as the industry standard for open source inference. It acts as the high-performance engine, using advanced techniques like PagedAttention to manage memory non-contiguously—much like an operating system (OS) manages RAM. This allows the system to batch more requests together without running out of memory, drastically increasing throughput.
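To make this concrete, here is a minimal sketch using vLLM's offline Python API. The model ID and sampling settings are placeholders; in production you would more typically serve an OpenAI-compatible endpoint with vllm serve.

```python
from vllm import LLM, SamplingParams

# Load a model into the vLLM engine. PagedAttention manages the KV cache
# in fixed-size blocks, so many requests can share GPU memory safely.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model ID

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# A batch of concurrent prompts; vLLM's continuous batching schedules them
# together instead of serving one request at a time.
prompts = [
    "Summarize the benefits of quantization in one sentence.",
    "Explain what a KV cache is to a new engineer.",
    "List three ways to reduce LLM serving costs.",
]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```

Because the engine batches these prompts continuously and pages their KV caches, adding more concurrent prompts raises throughput far more gracefully than a one-request-at-a-time runtime would.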
vLLM is designed for raw speed and efficiency on a single node. By standardizing on vLLM within Red Hat OpenShift AI, organizations can swap underlying hardware (NVIDIA, AMD, Google, Intel) without rewriting their serving logic.
High-throughput serving in action
In this video, we see vLLM in action, handling a surge of concurrent requests. By optimizing memory allocation with PagedAttention, the engine maintains high throughput and low latency, processing requests that would choke a standard runtime.
2. Optimize your models (compression and speculation)
Once your engine is tuned, the next step is to optimize the model itself. Huge models are heavy and slow. By reducing their footprint and improving how they generate tokens, you can achieve massive gains in speed and efficiency.
To support this, we’ve launched the Red Hat AI Hugging Face repository, an open library of models that solves the "trust vs. speed" dilemma. Rather than just hosting raw weights, this repository features Red Hat AI validated models: third-party models that have been rigorously tested for performance and accuracy on Red Hat platforms. By accessing validated versions of leading open models like Llama, Mistral, and Qwen, enterprise teams can bypass the uncertainty of "wild" open source models and deploy artifacts pre-certified for high-performance inference on Red Hat AI.
Compress without compromise
Tools like LLM Compressor let you apply techniques such as quantization, converting models from 16-bit to 8-bit or 4-bit precision. This drastically reduces the memory footprint and takes advantage of low-precision hardware, allowing you to fit larger models on smaller, cheaper GPUs. The key is applying these techniques in a near-lossless way, preserving the model's accuracy while shrinking its size.
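As a rough sketch of the workflow, based on the llm-compressor one-shot quantization examples (exact import paths and scheme names can vary between releases, and the model ID is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model ID
SAVE_DIR = "Llama-3.1-8B-Instruct-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer to 8-bit FP8 (weights plus dynamic activations),
# leaving the output head at higher precision to protect accuracy.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# One-shot compression: no retraining pass is required for this scheme.
oneshot(model=model, recipe=recipe)

# Save a compressed checkpoint that vLLM can load directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting checkpoint is roughly half the size of the 16-bit original and can be served with vLLM like any other Hugging Face model directory.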
Compression in action
In this video, we see a breakdown of how LLM Compressor optimizes models for production. We watch as engineers apply quantization algorithms to drastically reduce model size, enabling higher throughput and lower latency on existing hardware.
Accelerate with speculators
Speculative decoding is a game-changer for latency. Instead of waiting for the massive "verifier" model to generate every single token, you use a smaller, faster "draft" model (the speculator) to guess the next few words. The large model then verifies them in parallel. If the guesses are right (and they often are), you get multiple tokens for the "price" of one. Tools like Speculators enable you to train speculative models on your own custom data to match your workloads.
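The mechanics are easier to see in a toy sketch. The following is purely illustrative (it is not vLLM's or Speculators' actual code): draft and verifier are stand-in callables that return the next token for a given prefix, and the verifier's checks, which run sequentially here, happen in a single parallel forward pass in a real engine.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]  # returns the next token for a prefix

def speculative_step(prefix: List[Token], draft: NextToken, verifier: NextToken, k: int = 4) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the verifier agrees with."""
    # 1. The small draft model guesses k tokens sequentially (cheap).
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. The large verifier checks each guess; every match is a "free" token,
    #    and the first mismatch is replaced with the verifier's own token.
    accepted: List[Token] = []
    for guess in drafted:
        target = verifier(prefix + accepted)
        if guess != target:
            accepted.append(target)
            break
        accepted.append(guess)
    return prefix + accepted

# Toy stand-ins: the draft agrees with the verifier most of the time.
canned = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
verifier = lambda prefix: canned[len(prefix) % len(canned)]
draft = lambda prefix: "uh" if len(prefix) % 5 == 0 else canned[len(prefix) % len(canned)]

sequence: List[Token] = []
while len(sequence) < len(canned):
    sequence = speculative_step(sequence, draft, verifier)
print(" ".join(sequence))  # identical to what the verifier alone would produce
```

When the draft agrees with the verifier, several tokens come out of one verifier pass; when it misses, you still get one corrected token, so the output is identical to running the large model alone.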
Speculative decoding in action
In this video, we watch an optimized pipeline using a "speculator" model alongside the main LLM. The draft model rapidly predicts tokens, and the verifier accepts them in batches. We see the "tokens per second" counter jump significantly compared to the baseline, all while maintaining the exact same output quality.
3. Break the single-node barrier with distributed inference using llm-d
The final piece of the puzzle is scale. What happens when your model is too big for one GPU, or your traffic volume exceeds what one server can handle? Traditional load balancers fail here because they don't understand the stateful nature of LLM caches.
Enter llm-d. It is a distributed inference server designed to orchestrate vLLM at scale, introducing 2 critical capabilities—disaggregation and intelligent scheduling.
Disaggregation: Separating prefill from decode
Inference has two distinct phases:
- Prefill: The heavy lifting of processing your prompt (compute-intensive)
- Decode: Generating the response one token at a time (memory-bandwidth intensive)
llm-d separates these workloads. It routes "prefill" tasks to one set of pods and "decode" tasks to another. This allows you to scale them independently. If you have long prompts but short answers (like summarizing documents), you scale up prefill nodes. If you have short prompts but long creative answers, you scale up decode nodes.
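In llm-d this split is expressed through its Kubernetes deployment and gateway configuration; the snippet below is only a conceptual Python sketch (pod names and the round-robin choice are invented) of the core idea that each phase of a request goes to a pool sized for its own bottleneck.

```python
from itertools import cycle

# Prefill is compute-bound (processing the whole prompt at once);
# decode is memory-bandwidth-bound (emitting one token at a time).
# Separate pools let each phase scale independently.
prefill_pods = cycle(["prefill-0", "prefill-1", "prefill-2"])  # sized for long prompts
decode_pods = cycle(["decode-0"])                              # sized for short answers

def route(prompt: str) -> dict:
    """Assign the two phases of a single request to different pods."""
    # Phase 1: a prefill pod processes the prompt and builds the KV cache.
    prefill_pod = next(prefill_pods)
    # Phase 2: a decode pod reuses that KV cache and streams out tokens.
    decode_pod = next(decode_pods)
    return {"prefill": prefill_pod, "decode": decode_pod}

print(route("Summarize this 40-page contract ..."))
# {'prefill': 'prefill-0', 'decode': 'decode-0'}
```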
Intelligent inference scheduling
Beyond disaggregation, llm-d introduces intelligent scheduling that traditional web load balancers cannot provide. Because LLM inference relies heavily on Key-Value (KV) caching, routing requests efficiently is critical to performance.
llm-d implements KV-cache aware routing, which tracks the state of the cache across the cluster. When a request arrives (for example, a follow-up question in a chat), llm-d identifies which pod holds the relevant context in its cache and routes the request specifically to that node. This eliminates the need to re-compute the "prefill" phase, significantly reducing latency and freeing up compute resources for other users.
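The scheduling logic itself lives in llm-d's inference gateway; as a purely illustrative Python sketch (pod names, the prefix length, and the fallback policy are invented here), cache-aware routing amounts to remembering which pod last computed a given prompt prefix.

```python
import hashlib

all_pods = ["vllm-pod-0", "vllm-pod-1", "vllm-pod-2"]
prefix_cache_index = {}  # hash of a prompt prefix -> pod holding its KV cache

def prefix_key(prompt: str, prefix_len: int = 48) -> str:
    """Hash the shared leading part of the prompt (e.g. a common system prompt).
    A real scheduler tracks this per KV-cache block and handles eviction."""
    return hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()

def pick_pod(prompt: str) -> str:
    key = prefix_key(prompt)
    if key in prefix_cache_index:
        # Cache hit: the prefill for this prefix already lives on this pod,
        # so routing there avoids recomputing it and cuts time to first token.
        return prefix_cache_index[key]
    # Cache miss: pick the pod holding the fewest cached prefixes (a stand-in
    # for real load balancing), then remember where this prefix now lives.
    pod = min(all_pods, key=lambda p: sum(v == p for v in prefix_cache_index.values()))
    prefix_cache_index[key] = pod
    return pod

system_prompt = "You are a helpful assistant for the finance team.\n"
print(pick_pod(system_prompt + "Summarize this quarterly report."))  # miss: vllm-pod-0
print(pick_pod(system_prompt + "Now list the top three risks."))     # hit: vllm-pod-0 again
```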
Scaling with llm-d in action
This video highlights the power of KV-cache aware routing. Watch as we send a request to the cluster. llm-d recognizes that a similar prompt (perhaps a system prompt or a shared document) was processed recently. Instead of sending the request to a random node, it intelligently routes it to the specific pod that already has that context cached. The result? The "time to first token" (TTFT) drops significantly because the system doesn't have to recompute what it already knows.
Bringing it all together
Optimizing inference isn't just about buying faster chips. It's about architecture.
- vLLM helps each individual GPU run at maximum efficiency
- Model optimization (LLM Compressor and Speculators) helps you run the most efficient version of your model
- llm-d orchestrates the entire fleet, enabling intelligent scaling and routing that traditional web servers can't touch
By combining these strategies on a unified platform like OpenShift AI, you move from "making AI work" to "making AI scale." You gain the control to run any model, on any accelerator, on any cloud—without the inference wall holding you back.
About the authors
Michael Goin is a lead maintainer of vLLM, the high-performance open-source engine for LLM inference. His contributions span core areas including kernel optimization, system scheduling, new model architectures, and hardware efficiency across GPUs and emerging accelerators. Michael led performance engineering at Neural Magic, which was acquired by Red Hat, and brings experience in inference systems to the open source AI community.
Kyle Sayers joined Red Hat as a Machine Learning Engineer in January 2025 as part of Red Hat's acquisition of Neural Magic. He specializes in efficient algorithm implementations for LLMs, and his passions lie in advancing the capabilities, efficiency, and democratization of machine learning systems.
Carlos Condado is a Senior Product Marketing Manager for Red Hat AI. He helps organizations navigate the path from AI experimentation to enterprise-scale deployment by guiding the adoption of MLOps practices and integration of AI models into existing hybrid cloud infrastructures. As part of the Red Hat AI team, he works across engineering, product, and go-to-market functions to help shape strategy, messaging, and customer enablement around Red Hat’s open, flexible, and consistent AI portfolio.
With a diverse background spanning data analytics, integration, cybersecurity, and AI, Carlos brings a cross-functional perspective to emerging technologies. He is passionate about technological innovations and helping enterprises unlock the value of their data and gain a competitive advantage through scalable, production-ready AI solutions.