Fast LLM Serving with vLLM and PagedAttention

Anyscale

1 year ago - 32:07

How to Efficiently Serve an LLM?

Ahmed Tremo

11 months ago - 12:13

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Woosuk Kwon & Xiaoxuan Liu, UC Berkeley

PyTorch

9 months ago - 23:33

What Production-Grade LLM Serving Actually Requires (Infrastructure Deep Dive)

Predibase

2 months ago - 5:58

Create your multi-node LLM serving K8s cluster with one click

LMCache Team

4 months ago - 0:31

Simon Mo on vLLM: Easy, Fast, and Cost-Effective LLM Serving for Everyone

AMD Developer Central

3 weeks ago - 18:08

LLM Serving: The 4 Hard Truths No One Tells You

InfoQ

3 weeks ago - 49:59

Scalable and Efficient LLM Serving With the VLLM Production Stack - Junchen Jiang & Yue Zhu

The Linux Foundation

3 weeks ago - 39:36

Introducing Lemonade Server: Local LLM Serving with GPU and NPU Acceleration

AMD Developer Central

11 days ago - 6:55

[MLArchSys 2025] | Runtime Attestation for Secure LLM Serving in Cloud-Native TEE

Jianchang Su

1 month ago - 8:26

MobiSys 25 Teaser - EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

ACMMobiSys

3 weeks ago - 1:31

Simplify Your Open-Source LLM Serving with Anyscale's Aviary: Ray Serve Automation & Autoscaling

AI Insight News

2 years ago - 0:53

[MLArchSys 2025] | SafeKV: Safe KV-Cache Sharing in LLM Serving

kexin.chu2017

1 month ago - 11:27

Lightning Talk: Best Practices for LLM Serving with DRA - Chen Wang & Abhishek Malvankar, IBM

CNCF [Cloud Native Computing Foundation]

1 year ago - 9:37

vLLM vs NanoVLLM ⚡ Fast LLM Inference Battle! Which AI Engine Wins?

Serverwala Cloud Data Centers Pvt Ltd

4 weeks ago - 1:00

E15 | MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving (ICML'24) [in Chinese]

MLSys Singapore

1 year ago - 35:14

LLM Serving (Rust) demo

Fuhai Gao

8 months ago - 5:06

What is vLLM? Efficient AI Inference for Large Language Models

IBM Technology

1 month ago - 4:58

Enabling Cost-Efficient LLM Serving with Ray Serve

Anyscale

1 year ago - 30:28

MLSys'25 - LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

MIT HAN Lab

2 months ago - 11:36

Reducing Prefill Delay for LLM Serving in RAG By Sharing Knowledge

Junchen Jiang

1 year ago - 19:10

Ray Aviary: Open-Source Multi-LLM Serving

John Snow Labs

1 year ago - 19:16

MLSys'25 - QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

MIT HAN Lab

2 months ago - 13:45

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Kaichao You, Tsinghua University

PyTorch

11 days ago - 15:05

Legion Retreat 2024 - Low-Latency, High-Performance LLM Serving and Fine-tuning - Zhihao Jia

Legion Programming System

7 months ago - 30:35

Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems

HotCarbon

1 year ago - 10:47

Efficient LLM Serving on Hybrid Real-time and Best-effort Requests

The Prompt Index

3 months ago - 3:02

PagedAttention: Revolutionizing LLM Inference with Efficient Memory Management - DevConf.CZ 2025

DevConf

4 weeks ago - 28:05

GOSIM CHINA 2024 - Kaichao You: vLLM: Easy, Fast, and Cheap LLM Serving for Everyone

GOSIM Foundation

8 months ago - 31:42

Introducing Ray Aviary | 🦜🔍 Open Source Multi-LLM Serving

Anyscale

2 years ago - 13:33

Mélange - Cost Efficient LLM Serving by Using Mixture of GPUs - Hands on Demo

Fahd Mirza

1 year ago - 10:58

E07 | Fast LLM Serving with vLLM and PagedAttention

MLSys Singapore

1 year ago - 55:36

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Keyur

9 months ago - 8:46

OSDI '24 - dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving

USENIX

10 months ago - 14:34

vLLM Inference Engine [in Kannada] | Easy, Fast, and Cheap LLM Serving with PagedAttention

Charan H U

1 year ago - 15:45

GOSIM CHINA 2024 - Kaichao You: vLLM - Easy, Fast, and Cheap LLM Serving for Everyone

GOSIM Foundation

7 months ago - 30:12

Building a Multi-Cluster Privately Hosted LLM Serving Platform on Ku... Julian Bright & Noah Yoshida

CNCF [Cloud Native Computing Foundation]

1 year ago - 25:48

SGLang: An Efficient Open-Source Framework for Large-Scale LLM Serving - Liangsheng Yin

PyTorch

11 days ago - 19:37

Unlock LLM Speed: VLLM Crushes the Competition!

Red Hat AI

1 month ago - 0:48

vLLM: Fast & Affordable LLM Serving with PagedAttention | UC Berkeley's Open-Source Library

AI Insight News

2 years ago - 2:25

【GOSIM AI Paris 2025】Erwan Gallen & Eldar Kurtic: vLLM: Multi-Accelerator & Quantized LLM Serving

GOSIM Foundation

1 month ago - 21:08

InstCache - A Predictive Cache for LLM Serving

Fahd Mirza

6 months ago - 7:08

Mindvalley AI Summit 2025 | Live Stream

Mindvalley

Streamed 5 hours ago - 4:12:15

Scaling LLM Inference Globally: Novita AI & Vultr in Partnership

Vultr

4 weeks ago - 13:44

R&B song about AnyScale's Aviary, LLM serving library (AI music video) - Sway Ducky

Sway Ducky

1 year ago - 0:55

LitServe - LLM Serving Inference Engine - Install and Test Locally

Fahd Mirza

10 months ago - 10:29

LLM inference optimization: Architecture, KV cache and Flash attention

YanAITalk

10 months ago - 44:06

#431 LLM Serving and Claude Code - Zenn Trends for 2025/7/22

zenncast

3 days ago - 7:02

MLSys Singapore

MLSys Seminar @SG is a special interest group for Machine Learning System researchers and engineers in Singapore. We meet ...

FAST '25 - Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture...

USENIX

3 months ago - 17:17

DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference

PyTorch

Streamed 9 months ago - 32:03

NDSS 2025 - I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving

NDSS Symposium

2 months ago - 16:22

Autellix: An Efficient Serving Engine for LLM Agents as General Programs

Arxiv Papers

5 months ago - 36:28

[vLLM Office Hours #27] Intro to llm-d for Distributed LLM Inference

Neural Magic

1 month ago - 1:19:57

Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services

UCFCompArch

4 months ago - 46:59

What is vLLM & How do I Serve Llama 3.1 With It?

Mosleh

11 months ago - 7:23

Large Language Model Serving - ML Systems Design Interview

TRYEXCEPT

4 months ago - 12:59

vLLM: Easy, Fast, and Cheap LLM Serving, Woosuk Kwon, UC Berkeley

AMD Developer Central

7 months ago - 22:30

[short] Infinite-LLM: Efficient LLM Service for Long Context with Attention and Distributed KVCache

Arxiv Papers

1 year ago - 2:59

How LLM Use Large Context Windows

Fahd Mirza

1 year ago - 3:33

MIT HAN Lab

MIT HAN Lab: Hardware, AI and Neural-nets. Accelerate Deep Learning Computing. Group website: hanlab.mit.edu. TinyML ...

[QA] Autellix: An Efficient Serving Engine for LLM Agents as General Programs

Arxiv Papers

5 months ago - 8:20

Predibase

The highest quality models with the fastest throughput tailored to your use case—served in your cloud or ours. As the first platform ...

Task Scheduling for Decentralized LLM Serving | Dr. Sanjaya Kumar Panda | GenLang 5.0

S.P.I.T. Media

Streamed 2 weeks ago - 4:06:11

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

LuxaK

6 months ago - 13:55

PyTorch

Welcome to the official PyTorch YouTube Channel. Learn about the latest PyTorch tutorials, news, and more. PyTorch is an open ...

DeepLearn 2025, Xia (Ben) Hu

José Luís Reis

20 hours ago - 20:24

Optimize LLM inference with vLLM

Red Hat

3 days ago - 6:13

Advancing efficient ML

Google Research

1 year ago - 12:04