Venelin Valkov
Faster LLM Inference: Speeding up Falcon 7b (with QLoRA adapter) Prediction Time

How can you speed up your LLM inference time?
In this video, we'll optimize the token generation time for our Falcon 7b model fine-tuned with a QLoRA adapter. We'll explore various model loading techniques and look into batch inference for faster predictions (rough code sketches below).
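
A minimal sketch of the 4-bit loading step, assuming the transformers, peft, and bitsandbytes packages are installed; the adapter path is a placeholder for your own QLoRA checkpoint, not a real repo:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "tiiuae/falcon-7b"
ADAPTER_PATH = "my-falcon-7b-qlora"  # placeholder: path to your fine-tuned QLoRA adapter

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bf16 for speed
    bnb_4bit_use_double_quant=True,
)

# Load the base model in 4 bit, then attach the QLoRA adapter on top
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

Swapping load_in_4bit for load_in_8bit (via BitsAndBytesConfig(load_in_8bit=True)) gives you the 8-bit variant from the comparison.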

Discord: discord.gg/UaNPxVD6tv
Prepare for the Machine Learning interview: mlexpert.io/
Subscribe: bit.ly/venelin-subscribe

Lit-Parrot: github.com/Lightning-AI/lit-parrot

00:00 - Introduction
01:05 - Text Tutorial on MLExpert.io
01:26 - Google Colab Setup
03:58 - Training Config Baseline
07:06 - Loading in 4 Bit
08:26 - Loading in 8 Bit
09:40 - torch.compile()
10:25 - Batch Inference
12:00 - Lit-Parrot
16:57 - Conclusion
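
And a rough sketch of the batch inference idea, reusing the model and tokenizer from the snippet above; the prompts are just made-up examples:

prompts = [
    "What is QLoRA?",
    "Explain 4-bit quantization in one sentence.",
]

tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer ships without a pad token
tokenizer.padding_side = "left"            # left-pad so generation starts from real tokens

# Tokenize all prompts together and generate for the whole batch in one pass
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=64)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)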

Turtle image by stockgiu

#chatgpt #gpt4 #llms #artificialintelligence #promptengineering #chatbot #transformers #python #pytorch
