Tokenization in LLMs
When people start learning about LLMs, this seems to be the most underestimated part, which is why I picked this topic for this blog. It is a fundamental topic that any ML enthusiast should understand well.
Where does tokenization come into play?
Tokenization is a crucial preprocessing step in NLP tasks: it converts human-readable text into numerical input that can be understood and processed by models.
1. The raw text input is received, which can be a sentence, a paragraph, or a document.
2. The text is split into tokens according to the tokenizer’s rules. This process may involve splitting the text into words, subwords, or characters, depending on the tokenizer’s design.
3. Each token is mapped to a unique integer ID from the model’s vocabulary.
4. Each integer ID is mapped to a vector/embedding that is learned during training (see the sketch after this list).
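To make these four steps concrete, here is a minimal sketch using Hugging Face transformers with GPT-2 (any model would work; the variable names are mine):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "Tokenization is underrated"                              # step 1: raw text
tokens = tokenizer.tokenize(text)                                 # step 2: split into tokens
ids = tokenizer.convert_tokens_to_ids(tokens)                     # step 3: map each token to its vocab ID
embeddings = model.get_input_embeddings()(torch.tensor([ids]))    # step 4: look up the learned vectors
print(tokens, ids, embeddings.shape)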
One interesting website is Tiktokenizer, which shows the token split for many popular models.
Now let's dive deeper into the tokenization step (step 2) of the above process, i.e. how we get from a sentence to tokens.
Before splitting a text into subtokens (according to its model, i.e. BPE, WordPiece, or Unigram), the tokenizer performs two steps: normalization and pre-tokenization.
1) The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents. If you're familiar with Unicode normalization (such as NFC or NFKC), this is also something the tokenizer may apply. (See the snippet after this list.)
2) Pre-tokenization splits the text into small entities, like words. Different tokenizers/models do this differently, as the examples below show.
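You can inspect the normalization step directly on a fast tokenizer; for example, bert-base-uncased lowercases the text and strips accents:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllo hôw are ü?"))
# 'hello how are u?'

The pre-tokenizer can be inspected in the same way, which is what the next snippets do.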
from transformers import AutoTokenizer
# BERT: splits on whitespace and punctuation
tokenizer1 = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer1.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")  # note the double space
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
# the double space is dropped, so this is not reversible
# GPT-2: also splits on whitespace and punctuation, but keeps the spaces and
# replaces them with a Ġ symbol, so the original spaces can be recovered when decoding
tokenizer2 = AutoTokenizer.from_pretrained("gpt2")
tokenizer2.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
# [('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)),
#  ('?', (19, 20))]
# T5: splits only on whitespace (SentencePiece), replacing spaces with the ▁ symbol
tokenizer3 = AutoTokenizer.from_pretrained("t5-small")
tokenizer3.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
# [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]
What are the common tokenizers?
1. Byte Pair Encoding (BPE) starts by initializing the vocabulary with individual characters or bytes and then iteratively merges the most frequent pair until a predefined vocabulary size is reached. It is used in models like GPT, GPT-2, RoBERTa, BART, and DeBERTa.
2. WordPiece also starts by initializing the vocab with individual characters and then iteratively merges pairs. The selection criterion for merging is the pair that most increases the likelihood of the training data when added to the vocabulary, i.e. score = P(pair) / (P(first token) × P(second token)). It is used in models like BERT and DistilBERT.
3. Unigram works in the other direction: it starts from a big vocabulary and removes tokens from it until it reaches the desired vocabulary size. There are several options to build that base vocabulary, for example taking the most common substrings in pre-tokenized words. (I won't go into much detail here since BPE and WordPiece are the most common ones.)
Most modern LLM implementations like Llama 2, Gemma, and Mistral use BPE + SentencePiece tokenizers (discussed later). Hence, I would like to focus on BPE for a bit.
Example: we looked at the overview of BPE above; let's walk through an example to understand it better.
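Below is a toy BPE training loop (my own simplified sketch of the algorithm, not any production implementation), run on the classic low/lower/newest/widest corpus with words pre-split into characters and weighted by frequency:

from collections import Counter

def get_pair_counts(corpus):
    # count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # replace every occurrence of the pair with the merged symbol
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in corpus.items()}

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(5):                       # perform 5 merges
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w'), ('n', 'e')]

Each merge adds one new symbol to the vocabulary, and real implementations record the merge order so the exact same merges can be replayed on new text at inference time.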
Now there might be a question: why pairs of bytes and not pairs of characters or words?
Strings (in Python, for instance) are immutable sequences of Unicode code points. (Code points are the numbers assigned by the Unicode Consortium to every character in every writing system, including emojis, across all languages; there are around 150k of them as of now.) Unicode text is stored and processed as binary data (bytes) using encodings like UTF-8, where each code point is represented by 1 to 4 bytes and each byte is a value from 0 to 255.
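A quick illustration (the string itself is just an arbitrary example):

text = "héllo 👋"
print([ord(c) for c in text])        # Unicode code points, one per character
print(list(text.encode("utf-8")))    # UTF-8 bytes, each 0-255: é takes 2 bytes, the emoji takes 4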
There have been efforts to let LLMs consume byte sequences directly, skipping tokenization (which is another step on top of bytes); one such effort is MEGABYTE by Facebook.
Now the iterative merging is done on these bytes rather than on plain text characters. Hence Byte Pair Encoding :)
The more merges we perform, the larger the vocab size grows. In a decoder, vocab size impacts two things: 1. the embedding table, where each token is mapped to a vector, which grows with vocab size, and 2. the final linear layer that produces the logits (softmaxed to get a probability for each token in the vocab), whose size, and the number of dot products involved, also grows with vocab size.
Another reason for not having an arbitrarily large vocab size is that each token would occur less frequently and would not be trained efficiently.
However, having a high enough vocab size helps capture semantics better, since more common subwords end up as a single token. For example, FileNotFoundException as one token represents the semantics better than File, Not, Found, and Exception separately.
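A back-of-the-envelope sketch of the first point, assuming a hypothetical hidden size of 4096 (roughly Llama-scale) and untied embedding/LM-head weights (some models tie them):

d_model = 4096  # assumed hidden size
for vocab_size in (32_000, 128_000):
    embedding_params = vocab_size * d_model   # token -> vector lookup table
    lm_head_params = vocab_size * d_model     # final linear layer producing the logits
    print(vocab_size, f"{(embedding_params + lm_head_params) / 1e9:.2f}B parameters")
# 32000  -> 0.26B parameters
# 128000 -> 1.05B parameters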
Training a tokenizer is different from training a model! It is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. It's deterministic, not probabilistic like the LLM itself: once the rules are learned on a corpus, the same subwords are always picked.
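For instance, here is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library (corpus.txt is a placeholder for your own training data):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()    # split on whitespace/punctuation before BPE
trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # deterministic given the corpus and settings
tokenizer.save("my-tokenizer.json")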
Tiktoken
Tiktoken is a fast BPE tokeniser for use with OpenAI's models (only the inference code is shared, not the training code).
import tiktoken
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")
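Basic usage looks like this (the exact IDs depend on the encoding):

ids = enc.encode("Tokenization is fun!")   # text -> list of integer token IDs
print(ids)
print(enc.decode(ids))                     # round-trips back to the exact original text
print(enc.n_vocab)                         # vocabulary size of this encoding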
OpenAI has also shared some details on GitHub on this topic:
1. The encoder code loads 2 files which are saved after training: a) encoder.json, which is a map from token to integer ID, and b) vocab.bpe, which is just the list of merges, like (token1, token2) -> merged_token.
2. Apart from the naive implementation we discussed above, GPT-2 adds regex-pattern-based rules so that certain merges never happen, like merging letters with punctuation (e.g. "dog" with "."). This also enforces that merges only happen within the resulting splits, which is where the pre-tokenization we learnt about above comes into play.
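This is the pre-tokenization pattern from OpenAI's GPT-2 encoder.py (it needs the third-party regex module for the \p{L} and \p{N} character classes):

import regex as re

gpt2_pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
print(gpt2_pat.findall("I'll pay 100$ for the dog. Deal?"))
# contractions, letters, numbers, and punctuation land in separate chunks,
# so BPE merges never cross these boundaries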
3. There are 50,257 tokens in the encoder.json map: 256 raw byte tokens, 50,000 merges, and 1 special token, <|endoftext|>.
In GPT-4 they also changed the regex pattern a bit.
Llama 3 used the Tiktoken library instead of SentencePiece (which was used in Llama 2), as this helped improve the compression ratio (number of tokens after merging / number of bytes before merging). They also increased the vocab size from 32k to 128k (4x), which leads to higher memory usage; this is partly why Llama 3 is 8B and not 7B (like Llama 2), and they moved to Grouped Query Attention to keep the inference speed the same.
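As a rough illustration of compression ratio using tiktoken's public encodings (not the Llama tokenizers themselves; the exact numbers depend on the text you feed in):

import tiktoken

text = "Tokenization efficiency generally improves as the vocabulary grows." * 20
n_bytes = len(text.encode("utf-8"))
for name in ("gpt2", "cl100k_base", "o200k_base"):     # ~50k, ~100k, ~200k vocabularies
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(text))
    print(name, round(n_tokens / n_bytes, 3))          # tokens per byte: lower means better compression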
SentencePiece
SentencePiece is a tokenization algorithm for the preprocessing of text that can be used with any of the subword algorithms discussed above (like BPE or Unigram). It considers the text as a sequence of Unicode characters (instead of first encoding it to UTF-8 bytes) and replaces spaces with a special character, ▁. Used in conjunction with the Unigram algorithm, it doesn't even require a pre-tokenization step, which is very useful for languages where the space character is not used (like Chinese or Japanese).
SentencePiece tokenization is reversible: decoding is done simply by concatenating the tokens and replacing the ▁s with spaces, which gives back the (normalized) text. The BERT tokenizer, in contrast, removes repeating spaces, so its tokenization is not reversible.
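A quick way to see the difference (the outputs in the comments are what I'd expect; minor details can vary with library versions):

from transformers import AutoTokenizer

text = "Hello, how are  you?"   # note the double space

t5 = AutoTokenizer.from_pretrained("t5-small")
bert = AutoTokenizer.from_pretrained("bert-base-uncased")

print(t5.decode(t5.encode(text), skip_special_tokens=True))
# roughly 'Hello, how are you?': case and punctuation preserved, whitespace normalized
print(bert.decode(bert.encode(text, add_special_tokens=False)))
# 'hello, how are you?': lowercased and the repeated space is gone, so not reversible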
I hope this was informative to at least a few people. Do clap and follow if you found this article helpful.