616 likes, 21,251 views

LLM Tokenizers Explained: BPE Encoding, WordPiece and SentencePiece

In this video we cover three tokenizers commonly used when training large language models: (1) the byte-pair encoding (BPE) tokenizer, (2) the WordPiece tokenizer, and (3) the SentencePiece tokenizer.

References
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
BPE tokenizer paper: arxiv.org/abs/1508.07909
WordPiece tokenizer paper: static.googleusercontent.com/media/research.google…
SentencePiece tokenizer paper: arxiv.org/abs/1808.06226

Related Videos
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Why Language Models Hallucinate: • Why LLMs Hallucinate
Grounding DINO, Open-Set Object Detection: • Object Detection Part 8: Grounding DI...
Detection Transformers (DETR), Object Queries: • Object Detection Part 7: Detection Tr...
Wav2vec2 A Framework for Self-Supervised Learning of Speech Representations - Paper Explained: • Wav2vec2 A Framework for Self-Supervi...
Transformer Self-Attention Mechanism Explained: • Transformer Self-Attention Mechanism ...
How to Fine-tune Large Language Models Like ChatGPT with Low-Rank Adaptation (LoRA): • Low-Rank Adaptation (LoRA) Explained
Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped Query Attention (GQA) Explained: • Multi-Head Attention (MHA), Multi-Que...
LLM Prompt Engineering with Random Sampling: Temperature, Top-k, Top-p: • LLM Prompt Engineering with Random Sa...

Contents
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
00:00 - Intro
00:32 - BPE Encoding
02:16 - WordPiece
03:45 - SentencePiece
04:52 - Outro

Follow Me
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
🐦 Twitter: @datamlistic twitter.com/datamlistic
📸 Instagram: @datamlistic www.instagram.com/datamlistic
📱 TikTok: @datamlistic www.tiktok.com/@datamlistic

Channel Support
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
The best way to support the channel is to share the content. ;)

If you'd like to also support the channel financially, donating the price of a coffee is always warmly welcomed! (completely optional and voluntary)
► Patreon: www.patreon.com/datamlistic
► Bitcoin (BTC): 3C6Pkzyb5CjAUYrJxmpCaaNPVRgRVxxyTq
► Ethereum (ETH): 0x9Ac4eB94386C3e02b96599C05B7a8C71773c9281
► Cardano (ADA): addr1v95rfxlslfzkvd8sr3exkh7st4qmgj4ywf5zcaxgqgdyunsj5juw5
► Tether (USDT): 0xeC261d9b2EE4B6997a6a424067af165BAA4afE1a

#tokenization #llm #wordpiece #sentencepiece
