What is llm-d?


llm-d is a Kubernetes-native, open source framework that speeds up distributed large language model (LLM) inference at scale. 

In other words, when an AI model receives complex queries involving large amounts of data, llm-d gives it a framework for processing them faster. 

llm-d was created by Red Hat together with founding contributors Google, NVIDIA, IBM Research, and CoreWeave. Its open source community contributes updates to improve the technology.

How llm-d speeds up inference

LLM prompts can be complex and nonuniform. They typically require extensive computational resources and storage to process large amounts of data. 

llm-d has a modular architecture that can support the increasing resource demands of larger, more sophisticated LLMs, such as reasoning models.

A modular architecture allows the different parts of the AI workload to work either together or separately, depending on the model's needs. This helps the model run inference faster.

Imagine llm-d as a marathon: Each runner is in control of their own pace. You may cross the finish line at a different time than others, but everyone finishes when they're ready. If everyone had to cross the finish line at the same time, you'd be tied to the unique needs of the other runners, like endurance, water breaks, or time spent training. That would make things complicated. 

A modular architecture lets pieces of the inference process work at their own pace to reach the best result as quickly as possible. It makes it easier to fix or update specific processes independently, too.

This approach to processing models allows llm-d to handle the demands of LLM inference at scale. It also empowers users to go beyond single-server deployments and run generative AI (gen AI) inference across the enterprise.
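To make the "work at their own pace" idea concrete, here is a minimal, illustrative Python sketch (not llm-d code): several independent workers each finish whenever they are ready, and nothing forces them to wait for each other. The worker names and timings are invented for illustration.

```python
import asyncio
import random

async def worker(name: str) -> str:
    # Simulate a component that takes a variable amount of time to finish.
    await asyncio.sleep(random.uniform(0.1, 0.5))
    return f"{name} is done"

async def main() -> None:
    # Each task runs independently, like runners pacing themselves.
    tasks = [asyncio.create_task(worker(f"worker-{i}")) for i in range(1, 4)]
    # Results are handled as soon as each task finishes, instead of
    # waiting for the slowest one to cross the finish line.
    for finished in asyncio.as_completed(tasks):
        print(await finished)

asyncio.run(main())
```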

How does distributed inference work?  

The llm-d modular architecture is made up of: 

  • Kubernetes: an open source container-orchestration platform that automates many of the manual processes involved in deploying, managing, and scaling containerized applications.
  • vLLM: an open source inference server that speeds up the outputs of gen AI applications.
  • Inference Gateway (IGW): a Kubernetes Gateway API extension that hosts features like model routing, serving priority, and “smart” load-balancing capabilities. 

This accessible, modular architecture makes llm-d an ideal platform for distributed LLM inference at scale.
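From an application's point of view, this architecture can stay largely invisible: the client talks to a single gateway endpoint, the Inference Gateway routes each request to a suitable vLLM replica, and vLLM answers through its OpenAI-compatible API. The sketch below shows what such a request could look like; the gateway URL and model name are hypothetical placeholders, not values defined by llm-d.

```python
# A minimal sketch, assuming a model already deployed behind an llm-d
# Inference Gateway. The gateway URL and model name are hypothetical
# placeholders; substitute the values from your own deployment.
import json
import urllib.request

GATEWAY_URL = "http://llm-d-gateway.example.com/v1/completions"  # hypothetical endpoint
MODEL_NAME = "example-org/example-llm"                           # hypothetical model

payload = {
    "model": MODEL_NAME,
    "prompt": "Explain distributed inference in one sentence.",
    "max_tokens": 64,
}

request = urllib.request.Request(
    GATEWAY_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# The gateway routes the request to a vLLM replica, which returns an
# OpenAI-compatible JSON response.
with urllib.request.urlopen(request) as response:
    result = json.load(response)

print(result["choices"][0]["text"])
```

Because routing, load balancing, and scaling are handled by Kubernetes and the gateway, client code like this does not need to change as model replicas are added or removed.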

Blog post

What is llm-d and why do we need it?

There is a notable trend underway: a growing number of organizations are beginning to deploy their own large language model (LLM) infrastructure on premises.

The adaptive enterprise: AI-ready and prepared for disruption

This e-book, written by Red Hat Chief Operating Officer and Chief Strategy Officer Michael Ferris, looks at the challenges of AI transformation and technology disruption facing today's IT leaders.

Keep reading

Understanding AI infrastructure: components, benefits, and applications

AI infrastructure combines artificial intelligence and machine learning (AI/ML) technologies to develop and deploy reliable, scalable data solutions. A well-designed infrastructure helps data scientists and developers access data, deploy machine learning algorithms, and manage the hardware's computing resources.

What is MLOps?

Machine learning operations (MLOps) is a set of workflow practices that aims to streamline the deployment and maintenance of machine learning (ML) models. MLOps is an approach that combines DevOps and GitOps principles to integrate ML models into the software development process.

SLM vs. LLM: What are small language models?

Small language models (SLMs) are smaller-scale versions of large language models (LLMs), with more specialized knowledge, faster customization, and greater operating efficiency.

More about AI/ML

Featured product

  • Red Hat AI

    Flexible solutions that speed up the development and deployment of AI solutions across hybrid cloud environments.
