How can you speed up your LLM inference time?
In this video, we'll optimize token generation time for our Falcon 7B model fine-tuned with QLoRA. We'll explore several model loading techniques (see the sketch below) and use batch inference for faster predictions.
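For reference, here is a rough sketch of 4-bit loading with bitsandbytes in the spirit of the 07:06 chapter. The checkpoint name is a placeholder, and the exact settings used in the video may differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint: substitute your fine-tuned QLoRA model
MODEL_NAME = "tiiuae/falcon-7b"

# 4-bit NF4 quantization via bitsandbytes, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",        # let accelerate place layers on available devices
    trust_remote_code=True,   # Falcon originally shipped custom modeling code
)

# 8-bit loading (08:26 chapter) is analogous:
# BitsAndBytesConfig(load_in_8bit=True)
```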
Discord: discord.gg/UaNPxVD6tv
Prepare for the Machine Learning interview: mlexpert.io/
Subscribe: bit.ly/venelin-subscribe
Lit-Parrot: github.com/Lightning-AI/lit-parrot
00:00 - Introduction
01:05 - Text Tutorial on MLExpert.io
01:26 - Google Colab Setup
03:58 - Training Config Baseline
07:06 - Loading in 4 Bit
08:26 - Loading in 8 Bit
09:40 - torch.compile()
10:25 - Batch Inference
12:00 - Lit-Parrot
16:57 - Conclusion
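Below is a minimal batch inference sketch combining the 09:40 (torch.compile) and 10:25 (batch inference) chapters. The prompts and generation parameters are illustrative, not taken from the video, and the checkpoint name is again a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "tiiuae/falcon-7b"  # placeholder: use your fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Falcon has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = torch.compile(model)  # optional PyTorch 2.x compilation (09:40)

# Illustrative prompts; generating several at once amortizes per-step overhead
prompts = [
    "Explain QLoRA in one sentence:",
    "Why is batched generation faster than one prompt at a time?",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        pad_token_id=tokenizer.pad_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```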
Turtle image by stockgiu
#chatgpt #gpt4 #llms #artificialintelligence #promptengineering #chatbot #transformers #python #pytorch