Every organization piloting generative AI (gen AI) eventually hits the "inference wall." It’s the moment when the excitement of a working prototype meets the cold reality of production. Suddenly, that single model running on a developer’s laptop needs to serve thousands of concurrent users, maintain sub-50ms latency, and somehow not bankrupt the IT budget in cloud costs.
The core challenge for enterprise AI is largely operational: solving the efficiency equation. It is no longer enough to simply run a model; you must run it with predictable, high performance. How do you maximize tokens per dollar? How do you make sure that a sudden spike in traffic doesn't bring your application to a halt?
In this post, we look at 3 practical strategies that help IT leaders and architects solve the inference puzzle:
- Optimized runtimes (vLLM) to treat your inference engine like a high-performance race car
- Model optimization to do more with less using compression and speculators
- Distributed inference (llm-d) to break the "one model, one GPU" limit and scale horizontally
The silent killer of AI return on investment: Inefficient inference
The math of AI inference is unforgiving. Unlike traditional microservices, LLM requests are non-uniform, stateful, and compute-intensive. A single request can occupy a GPU for seconds, not milliseconds. If your infrastructure isn't optimized, you end up over-provisioning expensive hardware just to handle peak loads, leaving GPUs idle most of the time.
To fix this, we need to move beyond simple model serving and adopt a true AI platform approach.
1. Tune the engine with vLLM
Before you add more hardware, you need to maximize what you already have. Running an unoptimized model on a GPU is like driving a Ferrari in first gear: the hardware is capable, but most of its potential goes to waste.
vLLM has emerged as the industry standard for open source inference. It acts as the high-performance engine, using advanced techniques like PagedAttention to manage memory non-contiguously—much like an operating system (OS) manages RAM. This allows the system to batch more requests together without running out of memory, drastically increasing throughput.
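To make this concrete, here is a minimal sketch using vLLM's offline Python API. The model ID and sampling settings are placeholders; in production you would more typically serve an OpenAI-compatible endpoint with vllm serve.

```python
from vllm import LLM, SamplingParams

# Load a model into the vLLM engine. PagedAttention manages the KV cache
# in fixed-size blocks, so many requests can share GPU memory safely.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model ID

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# A batch of concurrent prompts; vLLM's continuous batching schedules them
# together instead of serving one request at a time.
prompts = [
    "Summarize the benefits of quantization in one sentence.",
    "Explain what a KV cache is to a new engineer.",
    "List three ways to reduce LLM serving costs.",
]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```

Because the engine batches these prompts continuously and pages their KV caches, adding more concurrent prompts raises throughput far more gracefully than a one-request-at-a-time runtime would.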
vLLM is designed for raw speed and efficiency on a single node. By standardizing on vLLM within Red Hat OpenShift AI, organizations can swap underlying hardware (NVIDIA, AMD, Google, Intel) without rewriting their serving logic.
High-throughput serving in action
In this video, we see vLLM in action, handling a surge of concurrent requests. By optimizing memory allocation with PagedAttention, the engine maintains high throughput and low latency, processing requests that would choke a standard runtime.
2. Optimize your models (compression and speculation)
Once your engine is tuned, the next step is to optimize the model itself. Huge models are heavy and slow. By reducing their footprint and improving how they generate tokens, you can achieve massive gains in speed and efficiency.
To support this, we’ve launched the Red Hat AI Hugging Face repository, an open library of models that solves the "trust vs. speed" dilemma. Rather than just hosting raw weights, this repository features Red Hat AI validated models: third-party models that have been rigorously tested for performance and accuracy on Red Hat platforms. By accessing validated versions of leading open models like Llama, Mistral, and Qwen, enterprise teams can bypass the uncertainty of "wild" open source models and deploy artifacts pre-certified for high-performance inference on Red Hat AI.
Compress without compromise
Tools like LLM Compressor let you apply techniques such as quantization, converting models from 16-bit to 8-bit or 4-bit precision. This drastically reduces the memory footprint and takes advantage of low-precision hardware, allowing you to fit larger models on smaller, cheaper GPUs. The key is applying these techniques in a near-lossless way, preserving the model's accuracy while shrinking its size.
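As a rough sketch of the workflow, based on the llm-compressor one-shot quantization examples (exact import paths and scheme names can vary between releases, and the model ID is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model ID
SAVE_DIR = "Llama-3.1-8B-Instruct-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer to 8-bit FP8 (weights plus dynamic activations),
# leaving the output head at higher precision to protect accuracy.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# One-shot compression: no retraining pass is required for this scheme.
oneshot(model=model, recipe=recipe)

# Save a compressed checkpoint that vLLM can load directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting checkpoint is roughly half the size of the 16-bit original and can be served with vLLM like any other Hugging Face model directory.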
Compression in action
In this video, we see a breakdown of how LLM Compressor optimizes models for production. We watch as engineers apply quantization algorithms to drastically reduce model size, enabling higher throughput and lower latency on existing hardware.
Accelerate with speculators
Speculative decoding is a game-changer for latency. Instead of waiting for the massive "verifier" model to generate every single token, you use a smaller, faster "draft" model (the speculator) to guess the next few words. The large model then verifies them in parallel. If the guesses are right (and they often are), you get multiple tokens for the "price" of one. Tools like Speculators enable you to train speculative models on your own custom data to match your workloads.
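The mechanics are easier to see in a toy sketch. The following is purely illustrative (it is not vLLM's or Speculators' actual code): draft and verifier are stand-in callables that return the next token for a given prefix, and the verifier's checks, which run sequentially here, happen in a single parallel forward pass in a real engine.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]  # returns the next token for a prefix

def speculative_step(prefix: List[Token], draft: NextToken, verifier: NextToken, k: int = 4) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the verifier agrees with."""
    # 1. The small draft model guesses k tokens sequentially (cheap).
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. The large verifier checks each guess; every match is a "free" token,
    #    and the first mismatch is replaced with the verifier's own token.
    accepted: List[Token] = []
    for guess in drafted:
        target = verifier(prefix + accepted)
        if guess != target:
            accepted.append(target)
            break
        accepted.append(guess)
    return prefix + accepted

# Toy stand-ins: the draft agrees with the verifier most of the time.
canned = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
verifier = lambda prefix: canned[len(prefix) % len(canned)]
draft = lambda prefix: "uh" if len(prefix) % 5 == 0 else canned[len(prefix) % len(canned)]

sequence: List[Token] = []
while len(sequence) < len(canned):
    sequence = speculative_step(sequence, draft, verifier)
print(" ".join(sequence))  # identical to what the verifier alone would produce
```

When the draft agrees with the verifier, several tokens come out of one verifier pass; when it misses, you still get one corrected token, so the output is identical to running the large model alone.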
Speculative decoding in action
In this video, we watch an optimized pipeline using a "speculator" model alongside the main LLM. The draft model rapidly predicts tokens, and the verifier accepts them in batches. We see the "tokens per second" counter jump significantly compared to the baseline, all while maintaining the exact same output quality.
3. Break the single-node barrier with distributed inference using llm-d
The final piece of the puzzle is scale. What happens when your model is too big for one GPU, or your traffic volume exceeds what one server can handle? Traditional load balancers fail here because they don't understand the stateful nature of LLM caches.
Enter llm-d. It is a distributed inference server designed to orchestrate vLLM at scale, introducing 2 critical capabilities—disaggregation and intelligent scheduling.
Disaggregation: Separating prefill from decode
Inference has two distinct phases:
- Prefill: The heavy lifting of processing your prompt (compute-intensive)
- Decode: Generating the response one token at a time (memory-bandwidth intensive)
llm-d separates these workloads. It routes "prefill" tasks to one set of pods and "decode" tasks to another. This allows you to scale them independently. If you have long prompts but short answers (like summarizing documents), you scale up prefill nodes. If you have short prompts but long creative answers, you scale up decode nodes.
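In llm-d this split is expressed through its Kubernetes deployment and gateway configuration; the snippet below is only a conceptual Python sketch (pod names and the round-robin choice are invented) of the core idea that each phase of a request goes to a pool sized for its own bottleneck.

```python
from itertools import cycle

# Prefill is compute-bound (processing the whole prompt at once);
# decode is memory-bandwidth-bound (emitting one token at a time).
# Separate pools let each phase scale independently.
prefill_pods = cycle(["prefill-0", "prefill-1", "prefill-2"])  # sized for long prompts
decode_pods = cycle(["decode-0"])                              # sized for short answers

def route(prompt: str) -> dict:
    """Assign the two phases of a single request to different pods."""
    # Phase 1: a prefill pod processes the prompt and builds the KV cache.
    prefill_pod = next(prefill_pods)
    # Phase 2: a decode pod reuses that KV cache and streams out tokens.
    decode_pod = next(decode_pods)
    return {"prefill": prefill_pod, "decode": decode_pod}

print(route("Summarize this 40-page contract ..."))
# {'prefill': 'prefill-0', 'decode': 'decode-0'}
```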
Intelligent inference scheduling
Beyond disaggregation, llm-d introduces intelligent scheduling that traditional web load balancers cannot provide. Because LLM inference relies heavily on Key-Value (KV) caching, routing requests efficiently is critical to performance.
llm-d implements KV-cache aware routing, which tracks the state of the cache across the cluster. When a request arrives (for example, a follow-up question in a chat), llm-d identifies which pod holds the relevant context in its cache and routes the request specifically to that node. This eliminates the need to re-compute the "prefill" phase, significantly reducing latency and freeing up compute resources for other users.
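The scheduling logic itself lives in llm-d's inference gateway; as a purely illustrative Python sketch (pod names, the prefix length, and the fallback policy are invented here), cache-aware routing amounts to remembering which pod last computed a given prompt prefix.

```python
import hashlib

all_pods = ["vllm-pod-0", "vllm-pod-1", "vllm-pod-2"]
prefix_cache_index = {}  # hash of a prompt prefix -> pod holding its KV cache

def prefix_key(prompt: str, prefix_len: int = 48) -> str:
    """Hash the shared leading part of the prompt (e.g. a common system prompt).
    A real scheduler tracks this per KV-cache block and handles eviction."""
    return hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()

def pick_pod(prompt: str) -> str:
    key = prefix_key(prompt)
    if key in prefix_cache_index:
        # Cache hit: the prefill for this prefix already lives on this pod,
        # so routing there avoids recomputing it and cuts time to first token.
        return prefix_cache_index[key]
    # Cache miss: pick the pod holding the fewest cached prefixes (a stand-in
    # for real load balancing), then remember where this prefix now lives.
    pod = min(all_pods, key=lambda p: sum(v == p for v in prefix_cache_index.values()))
    prefix_cache_index[key] = pod
    return pod

system_prompt = "You are a helpful assistant for the finance team.\n"
print(pick_pod(system_prompt + "Summarize this quarterly report."))  # miss: vllm-pod-0
print(pick_pod(system_prompt + "Now list the top three risks."))     # hit: vllm-pod-0 again
```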
Scaling with llm-d in action
This video highlights the power of KV-cache aware routing. Watch as we send a request to the cluster. llm-d recognizes that a similar prompt (perhaps a system prompt or a shared document) was processed recently. Instead of sending the request to a random node, it intelligently routes it to the specific pod that already has that context cached. The result? The "time to first token" (TTFT) drops significantly because the system doesn't have to recompute what it already knows.
Bringing it all together
Optimizing inference isn't just about buying faster chips. It's about architecture.
- vLLM helps each individual GPU run at maximum efficiency
- Model optimization (LLM Compressor and Speculators) helps you run the most efficient version of your model
- llm-d orchestrates the entire fleet, enabling intelligent scaling and routing that traditional web servers can't touch
By combining these strategies on a unified platform like OpenShift AI, you move from "making AI work" to "making AI scale." You gain the control to run any model, on any accelerator, on any cloud—without the inference wall holding you back.
About the authors
Michael Goin is a lead maintainer of vLLM, the high-performance open-source engine for LLM inference. His contributions span core areas including kernel optimization, system scheduling, new model architectures, and hardware efficiency across GPUs and emerging accelerators. Michael led performance engineering at Neural Magic, which was acquired by Red Hat, and brings experience in inference systems to the open source AI community.
Kyle Sayers joined Red Hat as a Machine Learning Engineer in January 2025 as part of Red Hat's acquisition of Neural Magic. He specializes in efficient algorithm implementations for LLMs, and his passions lie in advancing the capabilities, efficiency, and democratization of machine learning systems.
Carlos Condado is a Senior Product Marketing Manager for Red Hat AI. He helps organizations navigate the path from AI experimentation to enterprise-scale deployment by guiding the adoption of MLOps practices and integration of AI models into existing hybrid cloud infrastructures. As part of the Red Hat AI team, he works across engineering, product, and go-to-market functions to help shape strategy, messaging, and customer enablement around Red Hat’s open, flexible, and consistent AI portfolio.
With a diverse background spanning data analytics, integration, cybersecurity, and AI, Carlos brings a cross-functional perspective to emerging technologies. He is passionate about technological innovations and helping enterprises unlock the value of their data and gain a competitive advantage through scalable, production-ready AI solutions.