Finetuning LLMs using LoRA
Before getting to the meat of this blog, i.e. finetuning an LLM, it helps to briefly summarize why we need it in the first place.
Background
Until Transformers came into the picture, RNNs used to rule the world of language. I wrote a dedicated blog on them some time back. We also looked at the Transformer architecture (which is an encoder-decoder architecture) and the encoder-only BERT model in a dedicated blog. LLMs range from encoder-decoder networks (like T5) to decoder-only networks (like ChatGPT), and the latter are what we will discuss in this blog. How ChatGPT was trained will be discussed in some other blog.
We have interacted with OpenAI's ChatGPT model in the langchain blogs. The LLMs we are going to interact with in this blog are from the Dolly family, a set of instruction-following LLMs open-sourced by Databricks for commercial use. There are three model sizes: 3 billion (3b) [which we will be using], 7 billion (7b) and 12 billion (12b) parameters. Below is the code showing how we can use the model for inference
# importing libraries
import torch
from transformers import pipeline
from IPython.display import Markdown
# pipeline from transformers (by huggingface) to load dolly
instruct_pipeline = pipeline(
    model="databricks/dolly-v2-3b",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
prompt = "What are the pricing options of Medium blogging website?"
# pass a list of prompts/questions [,,] if batch prediction is required
output1 = instruct_pipeline(prompt)
Markdown(output1[0]['generated_text'])
> The pricing options of Medium blogging website include SaaS, on-premise, and freemium.
Why LLMs?
LLMs are a type of foundation model. They differ from traditional models in that a foundation model is pre-trained on a large amount of data and can then either be used directly for some use cases or finetuned to adapt to specific ones. With traditional ML models, we had to train a separate model from scratch for each use case, which is expensive and time-consuming.
Large Language Models (LLMs) have a wide variety of applications across customer service, marketing, law, finance, healthcare, education, etc. Adapting a general-purpose model to such domains is where finetuning comes into play.
Alternatives to Finetuning
One alternative to finetuning is In-Context Learning. In-context learning (ICL) is a specific method of prompt engineering where demonstrations of the task are provided to the model as part of the prompt (in natural language). With ICL, you can use off-the-shelf large language models (LLMs) to solve novel tasks without the need for fine-tuning (or model weight update).
However, the result depends heavily on the input prompt, so composing and formatting the prompt to maximize a model's performance on a desired task becomes an empirical art. Fine-tuning, by contrast, retrains a model pre-trained on general domains for a specific task.
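To make in-context learning concrete, here is a minimal sketch of how a few-shot prompt could be assembled. The helper `build_few_shot_prompt`, the sentiment task and the example reviews are all made up for illustration; they are not taken from any library or dataset used in this blog.

```python
def build_few_shot_prompt(task, demonstrations, query):
    """Assemble an in-context learning prompt from a few worked examples."""
    lines = [task, ""]
    for text, label in demonstrations:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The last block is left open so that the model fills in the label.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Hypothetical demonstrations, used only to show the prompt layout.
demos = [
    ("The delivery was quick and the food was still hot.", "positive"),
    ("The app crashed twice before I could pay.", "negative"),
]
few_shot_prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    demos,
    "Support resolved my issue in five minutes.",
)
print(few_shot_prompt)
```

The resulting string can be passed to any text-generation model as-is. No weights are updated; the demonstrations inside the prompt are the only "training signal".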
Efficient Finetuning
LLM finetuning requires labelled data consisting of an instruction, an input/context and a response (the label). The problems with training and finetuning LLMs are:
1. We need more compute to train. As models get larger and larger, we need much bigger GPUs, or multiple GPUs, just to be able to finetune some of them.
2. The file sizes become huge. The T5 XXL checkpoint is around 40 GB in size, not to mention the 20-billion-parameter models that are coming out now.
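As a rough back-of-the-envelope check on these file sizes, the raw weight size is simply the number of parameters times the bytes per parameter. The sketch below assumes T5 XXL has about 11 billion parameters and ignores optimizer state and file-format overhead; `checkpoint_size_gb` is a hypothetical helper, not a library function.

```python
# Bytes needed to store one parameter in common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def checkpoint_size_gb(num_params, dtype):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: T5 XXL has roughly 11 billion parameters.
t5_xxl_fp32 = checkpoint_size_gb(11e9, "fp32")
# A generic 20-billion-parameter model stored in half precision.
model_20b_fp16 = checkpoint_size_gb(20e9, "fp16")

print(f"T5 XXL in fp32: ~{t5_xxl_fp32:.0f} GB")
print(f"20B model in fp16: ~{model_20b_fp16:.0f} GB")
```

Under these assumptions both land in the 40 GB range, which is in line with the checkpoint size quoted above.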
A few best practices for finetuning are: using strong regularization, a small learning rate and few epochs. In general, a neural network (like a CNN for image classification) is not fully finetuned, since that is expensive and might lead to catastrophic forgetting. We finetune just the last layer or the last few layers.
For LLMs, we use a similar approach called Parameter-Efficient Fine-Tuning (PEFT). One of the popular ways of doing PEFT is Low-Rank Adaptation (LoRA), developed primarily at Microsoft. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than full finetuning in model quality on RoBERTa, DeBERTa, GPT-2 and GPT-3, despite having fewer trainable parameters, while offering higher training throughput and no additional inference latency (which is amazing, right!).
LoRA is an improved finetuning method where instead of finetuning all the weights that constitute the weight matrix (W) of the pre-trained large language model, two smaller matrices (A and B) that approximate the update to the matrix are fine-tuned.
W0 + ΔW = W0 + BA, where W0 ∈ R^(d×k), B ∈ R^(d×r), A ∈ R^(r×k) and r << min(d, k)
These matrices constitute the LoRA adapter. Here 'r' is a hyperparameter, the rank of the update (the paper experiments with the values 1, 2, 4, 8 and 64, with 4 or 8 working best most of the time). During training, W0 is frozen and doesn't receive gradient updates, while A and B contain the trainable parameters. Both W0 and ΔW = BA are multiplied with the same input, and their respective output vectors are summed coordinate-wise. A is initialized with a random Gaussian and B with zeros, so ΔW = BA is zero at the beginning of training.
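The forward pass described above can be sketched in a few lines of plain Python. This is only a toy illustration with tiny made-up dimensions, following the LoRA paper's convention that B has shape d×r and A has shape r×k. The helpers `matmul` and `add` are written here for the example and are not part of any library.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


random.seed(0)
d, k, r = 8, 6, 2  # toy sizes; in a real layer d and k are in the thousands

W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen, d x k
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # trainable, r x k, Gaussian init
B = [[0.0] * r for _ in range(d)]                                # trainable, d x r, zero init
x = [[random.gauss(0, 1)] for _ in range(k)]                     # input column vector, k x 1

base_out = matmul(W0, x)            # W0 x
lora_out = matmul(B, matmul(A, x))  # B (A x): the d x k update is never materialized
h = add(base_out, lora_out)         # the two outputs are summed coordinate-wise

full_params = d * k        # parameters touched by full finetuning of this layer
lora_params = r * (d + k)  # parameters trained by LoRA for this layer

print("output unchanged at init:", h == base_out)
print("full:", full_params, "lora:", lora_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one. The saving in trainable parameters grows quickly with layer size, since `r * (d + k)` scales linearly in d and k while `d * k` scales with their product.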
Many of us might also have heard about QLoRA. QLoRA is an even more memory-efficient version of LoRA in which the pretrained model is loaded into GPU memory as quantized 4-bit weights (compared to the 8-bit weights we load for LoRA in this blog) that can be dequantized when needed. Essentially, the assumption is that the weights follow a normal distribution. So each weight is stored as a 4-bit index of its position (quantile bin) in that distribution, and when a higher-precision value is needed for computation (16-bit in the QLoRA paper), the weight is recovered approximately from that position.
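The idea of storing a position in the normal distribution can be illustrated with a small, simplified sketch. This is not the exact NF4 codebook used by QLoRA or bitsandbytes. It only assumes 16 levels placed at evenly spaced quantiles of a standard normal, with one scale per block of weights. All function names and the sample weights are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Quantiles of N(0, 1) at equal-probability bin midpoints, rescaled to [-1, 1]."""
    n = 2 ** bits
    # Midpoints of n equal-probability bins avoid the infinite tails at p=0 and p=1.
    probs = [(i + 0.5) / n for i in range(n)]
    q = [NormalDist().inv_cdf(p) for p in probs]
    top = max(abs(v) for v in q)
    return [v / top for v in q]


def quantize(weights, levels):
    """Return one small integer index per weight plus a single scale for the block."""
    scale = max(abs(w) for w in weights) or 1.0
    idx = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in weights
    ]
    return idx, scale


def dequantize(idx, scale, levels):
    """Recover approximate weights from the stored indices and the block scale."""
    return [levels[i] * scale for i in idx]


levels = normal_quantile_levels()
block = [0.12, -0.40, 0.03, 0.95, -0.77, 0.31]  # made-up weights
idx, scale = quantize(block, levels)
restored = dequantize(idx, scale, levels)

print("indices:", idx)
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

Each index fits in 4 bits, so the block costs 4 bits per weight plus one scale. The levels are denser near zero, where most normally distributed weights lie, which is why this layout loses less precision than evenly spaced levels would.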
Now let’s get into the code part 👨🏼💻
For this blog, we will use the LaMini-instruction dataset as a sample dataset. If we have a custom enterprise QnA dataset, we can use it to finetune the model in the same way. We will do it in a step-by-step manner:
Step 1 — Loading LaMini-instruction dataset using load_dataset from huggingface
Step 2 — Loading Dolly Tokenizer and Model using huggingface (again!)
Step 3 — Data Preparation — Tokenize, split dataset and prepare for batch processing
Step 4 — Configuring LoRA and getting the PEFT model
Step 5 — Training the model and saving
Step 6 — Prediction with the finetuned model
Before that let’s import the necessary packages
# typing helpers for better documentation
from typing import Dict, List
from datasets import Dataset, load_dataset, disable_caching
disable_caching()  # disable the huggingface datasets cache
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
# torch.utils.data.Dataset is not imported here, since it would shadow datasets.Dataset imported above
from IPython.display import Markdown
1. Data Loading
# Dataset Preparation
dataset = load_dataset("MBZUAI/LaMini-instruction" , split = 'train')
small_dataset = dataset.select([i for i in range(200)])
print(small_dataset)
print(small_dataset[0])
# creating templates
prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: {instruction}\n Response:"""
answer_template = """{response}"""
# creating function to add keys in the dictionary for prompt, answer and whole text
def _add_text(rec):
    instruction = rec["instruction"]
    response = rec["response"]
    # check if both exist, else raise an error
    if not instruction:
        raise ValueError(f"Expected an instruction in: {rec}")
    if not response:
        raise ValueError(f"Expected a response in: {rec}")
    rec["prompt"] = prompt_template.format(instruction=instruction)
    rec["answer"] = answer_template.format(response=response)
    rec["text"] = rec["prompt"] + rec["answer"]
    return rec
# running through all samples
small_dataset = small_dataset.map(_add_text)
print(small_dataset[0])
To finetune our LLM, we need to decorate each record of our instruction dataset with a prompt of the form: Instruction: {instruction} Response: {response}
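To see what a decorated record looks like, here is a small standalone sketch that applies the same two templates to a single made-up record. The record itself is hypothetical and not taken from the LaMini-instruction dataset.

```python
# Same templates as in the data loading step above.
prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: {instruction}\n Response:"""
answer_template = """{response}"""

# Hypothetical record, used only to show the final training text.
record = {
    "instruction": "Name three primary colours.",
    "response": "Red, yellow and blue.",
}

prompt = prompt_template.format(instruction=record["instruction"])
text = prompt + answer_template.format(response=record["response"])
print(text)
```

The model is trained on the full `text`, while at inference time only the `prompt` part is supplied and the model is expected to continue after "Response:".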
2. Tokenizer and Model Load
# loading the tokenizer for dolly model. The tokenizer converts raw text into tokens
model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# loading the model using AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # use_cache=False,
    device_map="auto",  # "balanced",
    load_in_8bit=True,
    torch_dtype=torch.float16
)
# resizes input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
model.resize_token_embeddings(len(tokenizer))
> Embedding(50280, 2560)
3. Data Preparation
from functools import partial
import copy
from transformers import DataCollatorForSeq2Seq
MAX_LENGTH = 256
# Function to tokenize the text part of a batch into input IDs and attention masks
def _preprocess_batch(batch: Dict[str, List]):
    model_inputs = tokenizer(batch["text"], max_length=MAX_LENGTH, truncation=True, padding='max_length')
    # for causal LM training the labels are a copy of the input IDs
    model_inputs["labels"] = copy.deepcopy(model_inputs['input_ids'])
    return model_inputs
# no arguments are bound here, so partial simply wraps the function as-is
_preprocessing_function = partial(_preprocess_batch)
# apply the preprocessing function to each batch in the dataset
encoded_small_dataset = small_dataset.map(
    _preprocessing_function,
    batched=True,
    remove_columns=["instruction", "response", "prompt", "answer"],
)
# length check (truncation and padding above already fix every sample at MAX_LENGTH tokens)
processed_dataset = encoded_small_dataset.filter(lambda rec: len(rec["input_ids"]) <= MAX_LENGTH)
# splitting dataset
split_dataset = processed_dataset.train_test_split(test_size=14, seed=0)
print(split_dataset)
# takes a list of samples from a Dataset and collates them into a batch, as a dictionary of PyTorch tensors.
data_collator = DataCollatorForSeq2Seq(
    model=model, tokenizer=tokenizer, max_length=MAX_LENGTH, pad_to_multiple_of=8, padding='max_length')
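In the preprocessing above, the labels are a plain copy of `input_ids`, so the padded positions also contribute to the loss. One possible refinement, sketched below and not part of the original pipeline, is to replace labels at padded positions with -100, the value that PyTorch's cross-entropy loss ignores by default. The helper `mask_padding` and the token IDs are made up for illustration. The attention mask is used rather than the pad token ID because the pad token in this blog is the same as the EOS token, and a genuine EOS should still be learned.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Hypothetical tokenized sample: four real tokens followed by four pad tokens.
input_ids = [5012, 318, 257, 1332, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)
```

Inside `_preprocess_batch`, the same function could be applied per sample to `model_inputs["input_ids"]` and `model_inputs["attention_mask"]` instead of the deep copy.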
4. Configuring LoRA
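# note: newer peft releases rename prepare_model_for_int8_training to prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training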
LORA_R = 256 # 512
LORA_ALPHA = 512 # 1024
LORA_DROPOUT = 0.05
# Define LoRA Config
lora_config = LoraConfig(
    r = LORA_R, # the dimension of the low-rank matrices
    lora_alpha = LORA_ALPHA, # scaling factor for the weight matrices
    lora_dropout = LORA_DROPOUT, # dropout probability of the LoRA layers
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
)
# Prepare int-8 model for training - utility function that prepares a PyTorch model for int8 quantization training. <https://huggingface.co/docs/peft/task_guides/int8-asr>
model = prepare_model_for_int8_training(model)
# initialize the model with the LoRA framework
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
> trainable params: 83886080 || all params: 2858972160 || trainable%: 2.9341342029717423
alpha determines the multiplier applied to the weight changes when they are added to the original weights (scale_multiplier = alpha/rank).
dropout is the probability with which inputs to the LoRA branch are randomly dropped at each training step to deter overfitting.
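The trainable parameter count printed above can be reproduced by hand. The sketch below assumes that dolly-v2-3b has 32 transformer layers and that each `query_key_value` module projects the hidden size of 2560 (visible in the embedding shape above) to 3 × 2560 outputs. The layer count is an assumption about the architecture, not something shown in this blog.

```python
HIDDEN_SIZE = 2560          # matches Embedding(50280, 2560) printed earlier
NUM_LAYERS = 32             # assumption about dolly-v2-3b
QKV_OUT = 3 * HIDDEN_SIZE   # fused query, key and value projection

LORA_R = 256
LORA_ALPHA = 512

# Each adapted layer gets A (r x in_features) and B (out_features x r).
per_layer = LORA_R * (HIDDEN_SIZE + QKV_OUT)
trainable = NUM_LAYERS * per_layer

all_params = 2858972160     # total reported by print_trainable_parameters()
pct = 100 * trainable / all_params
scale = LORA_ALPHA / LORA_R  # multiplier applied to the BA update

print(f"trainable params: {trainable} || trainable%: {pct}")
print(f"scale multiplier: {scale}")
```

Under these assumptions the count comes out to 83,886,080, the same number reported by `print_trainable_parameters()`, and the scale multiplier is 2.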
5. Model training and saving
from transformers import TrainingArguments, Trainer
import bitsandbytes
# define the training arguments first.
EPOCHS = 3
LEARNING_RATE = 1e-4
MODEL_SAVE_FOLDER_NAME = "dolly-3b-lora"
training_args = TrainingArguments(
    output_dir=MODEL_SAVE_FOLDER_NAME,
    overwrite_output_dir=True,
    fp16=True,  # enables fp16 mixed-precision training
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
)
# training the model
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=split_dataset['train'],
    eval_dataset=split_dataset["test"],
    data_collator=data_collator,
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
# only saves the incremental 🤗 PEFT weights (adapter_model.bin) that were trained, meaning it is super efficient to store, transfer, and load.
trainer.model.save_pretrained(MODEL_SAVE_FOLDER_NAME)
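# save via the Trainer as well, which also stores the training arguments (the model is a PEFT model, so the weights written should again be the adapter, not a merged full model)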
trainer.save_model(MODEL_SAVE_FOLDER_NAME)
trainer.model.config.save_pretrained(MODEL_SAVE_FOLDER_NAME)
The model seems to overfit on the training data, which might be because of differences between the train and test sets or the training parameters, but we get the crux of finetuning an LLM.
6. Prediction with the Finetuned Model
# Function to format the response and filter out the instruction from the response.
def postprocess(response):
    messages = response.split("Response:")
    # split always returns at least one element, so check that the marker was actually found
    if len(messages) < 2:
        raise ValueError("Invalid template for prompt. The template should include the term 'Response:'")
    return "".join(messages[1:])
# Prompt for prediction
inference_prompt = "List 5 reasons why someone should learn to cook"
# Inference pipeline with the fine-tuned model
inf_pipeline = pipeline('text-generation', model=trainer.model, tokenizer=tokenizer, max_length=256, trust_remote_code=True)
# Format the prompt using the `prompt_template` and generate response
response = inf_pipeline(prompt_template.format(instruction=inference_prompt))[0]['generated_text']
# postprocess the response
formatted_response = postprocess(response)
formatted_response
We couldn't cover the evaluation of LLMs here; we can discuss that in the next blog.
Usecases —
1. https://blog.gofynd.com/fine-tuning-metas-llama-2-to-power-jio-copilot-part-1-afa527744d36
Hope you learned something new from this blog. Please do share your feedback in the form of responses and claps :)
References
1. Fine-tuning LLMs with PEFT and LoRA — https://www.youtube.com/watch?v=Us5ZFp16PaU