Anyscale
Fast LLM Serving with vLLM and PagedAttention
1 year ago - 32:07
Ahmed Tremo
How to Efficiently Serve an LLM?
11 months ago - 12:13
PyTorch
vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Woosuk Kwon & Xiaoxuan Liu, UC Berkeley
9 months ago - 23:33
Predibase
What Production-Grade LLM Serving Actually Requires (Infrastructure Deep Dive)
2 months ago - 5:58
LMCache Team
Create your multi-node LLM serving K8s cluster with one click
4 months ago - 0:31
AMD Developer Central
Simon Mo on vLLM: Easy, Fast, and Cost-Effective LLM Serving for Everyone
3 weeks ago - 18:08
InfoQ
LLM Serving: The 4 Hard Truths No One Tells You
3 weeks ago - 49:59
The Linux Foundation
Scalable and Efficient LLM Serving With the vLLM Production Stack - Junchen Jiang & Yue Zhu
3 weeks ago - 39:36
AMD Developer Central
Introducing Lemonade Server: Local LLM Serving with GPU and NPU Acceleration
11 days ago - 6:55
Jianchang Su
[MLArchSys 2025] | Runtime Attestation for Secure LLM Serving in Cloud-Native TEE
1 month ago - 8:26
ACMMobiSys
MobiSys 25 Teaser - EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices
3 weeks ago - 1:31
AI Insight News
Simplify Your Open-Source LLM Serving with Anyscale's Aviary: Ray Serve Automation & Autoscaling
2 years ago - 0:53
kexin.chu2017
[MLArchSys 2025] | SafeKV: Safe KV-Cache Sharing in LLM Serving
1 month ago - 11:27
MLSys Singapore
E15 | MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving (ICML'24) [in Chinese]
1 year ago - 35:14
Fuhai Gao
LLM Serving (Rust) demo
8 months ago - 5:06
IBM Technology
What is vLLM? Efficient AI Inference for Large Language Models
1 month ago - 4:58
Anyscale
Enabling Cost-Efficient LLM Serving with Ray Serve
1 year ago - 30:28
MIT HAN Lab
MLSys'25 - LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
2 months ago - 11:36
Junchen Jiang
Reducing Prefill Delay for LLM Serving in RAG By Sharing Knowledge
1 year ago - 19:10
John Snow Labs
Ray Aviary: Open-Source Multi-LLM Serving
1 year ago - 19:16
MIT HAN Lab
MLSys'25 - QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
2 months ago - 13:45
PyTorch
vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Kaichao You, Tsinghua University
11 days ago - 15:05
Legion Programming System
Legion Retreat 2024 - Low-Latency, High-Performance LLM Serving and Fine-tuning - Zhihao Jia
7 months ago - 30:35
HotCarbon
Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
1 year ago - 10:47
The Prompt Index
Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
3 months ago - 3:02
DevConf
PagedAttention: Revolutionizing LLM Inference with Efficient Memory Management - DevConf.CZ 2025
4 weeks ago - 28:05
GOSIM Foundation
GOSIM CHINA 2024 - Kaichao You: vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
8 months ago - 31:42
Anyscale
Introducing Ray Aviary | 🦜🔍 Open Source Multi-LLM Serving
2 years ago - 13:33
Fahd Mirza
Mélange - Cost-Efficient LLM Serving by Using a Mixture of GPUs - Hands-on Demo
1 year ago - 10:58
MLSys Singapore
E07 | Fast LLM Serving with vLLM and PagedAttention
1 year ago - 55:36
Keyur
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
9 months ago - 8:46
USENIX
OSDI '24 - dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
10 months ago - 14:34
Charan H U
vLLM Inference Engine [in Kannada] | Easy, Fast, and Cheap LLM Serving with PagedAttention
1 year ago - 15:45
GOSIM Foundation
GOSIM CHINA 2024 - Kaichao You: vLLM - Easy, Fast, and Cheap LLM Serving for Everyone
7 months ago - 30:12
PyTorch
SGLang: An Efficient Open-Source Framework for Large-Scale LLM Serving - Liangsheng Yin
11 days ago - 19:37
Red Hat AI
Unlock LLM Speed: vLLM Crushes the Competition!
1 month ago - 0:48
AI Insight News
vLLM: Fast & Affordable LLM Serving with PagedAttention | UC Berkeley's Open-Source Library
2 years ago - 2:25
GOSIM Foundation
【GOSIM AI Paris 2025】Erwan Gallen & Eldar Kurtic: vLLM: Multi-Accelerator & Quantized LLM Serving
1 month ago - 21:08
Fahd Mirza
InstCache - A Predictive Cache for LLM Serving
6 months ago - 7:08
Mindvalley
Mindvalley AI Summit 2025 | Live Stream
Streamed 5 hours ago - 4:12:15
Vultr
Scaling LLM Inference Globally: Novita AI & Vultr in Partnership
4 weeks ago - 13:44
Sway Ducky
R&B song about Anyscale's Aviary, an LLM serving library (AI music video)
1 year ago - 0:55
Fahd Mirza
LitServe - LLM Serving Inference Engine - Install and Test Locally
10 months ago - 10:29
YanAITalk
LLM inference optimization: Architecture, KV cache and Flash attention
10 months ago - 44:06
zenncast
#431 LLM Serving and Claude Code - Zenn Trends for 2025/7/22
3 days ago - 7:02
USENIX
FAST '25 - Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture...
3 months ago - 17:17
PyTorch
DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference
Streamed 9 months ago - 32:03
NDSS Symposium
NDSS 2025 - I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving
2 months ago - 16:22
Arxiv Papers
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
5 months ago - 36:28
Neural Magic
[vLLM Office Hours #27] Intro to llm-d for Distributed LLM Inference
1 month ago - 1:19:57
UCFCompArch
Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
4 months ago - 46:59
Mosleh
What is vLLM & How do I Serve Llama 3.1 With It?
11 months ago - 7:23
TRYEXCEPT
Large Language Model Serving - ML Systems Design Interview
4 months ago - 12:59
AMD Developer Central
vLLM: Easy, Fast, and Cheap LLM Serving, Woosuk Kwon, UC Berkeley
7 months ago - 22:30
Arxiv Papers
[short] Infinite-LLM: Efficient LLM Service for Long Context with Attention and Distributed KVCache
1 year ago - 2:59
Fahd Mirza
How LLM Use Large Context Windows
1 year ago - 3:33
Arxiv Papers
[QA] Autellix: An Efficient Serving Engine for LLM Agents as General Programs
5 months ago - 8:20
S.P.I.T. Media
Task Scheduling for Decentralized LLM Serving | Dr. Sanjaya Kumar Panda | GenLang 5.0
Streamed 2 weeks ago - 4:06:11
LuxaK
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
6 months ago - 13:55
Red Hat
Optimize LLM inference with vLLM
3 days ago - 6:13