How can you speed up your LLM inference time?
In this video, we'll optimize token generation time for our Falcon 7B model fine-tuned with QLoRA. We'll explore several model loading techniques (see the sketch below) and use batch inference for faster predictions.
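For reference, here is a rough sketch of 4-bit loading with bitsandbytes in the spirit of the 07:06 chapter. The checkpoint name is a placeholder, and the exact settings used in the video may differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint: substitute your fine-tuned QLoRA model
MODEL_NAME = "tiiuae/falcon-7b"

# 4-bit NF4 quantization via bitsandbytes, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",        # let accelerate place layers on available devices
    trust_remote_code=True,   # Falcon originally shipped custom modeling code
)

# 8-bit loading (08:26 chapter) is analogous:
# BitsAndBytesConfig(load_in_8bit=True)
```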
Discord: discord.gg/UaNPxVD6tv
Prepare for the Machine Learning interview: mlexpert.io/
Subscribe: bit.ly/venelin-subscribe
Lit-Parrot: github.com/Lightning-AI/lit-parrot
00:00 - Introduction
01:05 - Text Tutorial on MLExpert.io
01:26 - Google Colab Setup
03:58 - Training Config Baseline
07:06 - Loading in 4 Bit
08:26 - Loading in 8 Bit
09:40 - torch.compile()
10:25 - Batch Inference
12:00 - Lit-Parrot
16:57 - Conclusion
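Below is a minimal batch inference sketch combining the 09:40 (torch.compile) and 10:25 (batch inference) chapters. The prompts and generation parameters are illustrative, not taken from the video, and the checkpoint name is again a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "tiiuae/falcon-7b"  # placeholder: use your fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Falcon has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = torch.compile(model)  # optional PyTorch 2.x compilation (09:40)

# Illustrative prompts; generating several at once amortizes per-step overhead
prompts = [
    "Explain QLoRA in one sentence:",
    "Why is batched generation faster than one prompt at a time?",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        pad_token_id=tokenizer.pad_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```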
Turtle image by stockgiu
#chatgpt #gpt4 #llms #artificialintelligence #promptengineering #chatbot #transformers #python #pytorch