Beginner’s guide to one of the best vision models: CLIP (Contrastive Language-Image Pre-training)

Anirban Sen
6 min read · Aug 19, 2023
Photo by Dan Cristian Pădureț on Unsplash

All of us have seen the amazing capabilities of Stable Diffusion (and DALL-E) in image generation. Another model works in tandem with them and has firmly established its position in computer vision: CLIP (Contrastive Language-Image Pre-training). In Stable Diffusion, the text prompt is encoded using CLIP. In DALL-E, CLIP is used to evaluate the generated images.

What is CLIP?
Contrastive Language-Image Pre-training (CLIP for short) is an open-source, multi-modal (text and images), zero-shot, state-of-the-art model introduced by OpenAI in February 2021.

The model is pre-trained on a simple task: predicting which caption goes with which image and vice versa (hence contrastive). Pre-training uses a huge dataset of 400 million (image, text) pairs collected from the internet (for comparison, ImageNet has about 1.28 million training images). The approach is efficient and scalable, and the resulting model performs really well, even zero-shot, on various tasks such as image classification.

Multi-Modal: Multi-Modal architectures leverage more than one domain to learn a specific task. CLIP combines Natural Language Processing and Computer Vision.

Zero-shot: Zero-shot learning is a way to generalize to unseen labels without having been specifically trained to classify them. For example, standard ImageNet models are trained to recognize 1000 specific classes. CLIP is not bound by this limitation.

The image representations CLIP learns are valuable enough to be used in many tasks, such as image classification, image retrieval, object detection, face recognition, image compression, image generation, OCR and many more.
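To make the retrieval use case concrete, here is a minimal sketch of ranking images against a text query by cosine similarity. The embeddings are random stand-ins, not real CLIP outputs, and the helper names (`l2_normalize`, `rank_images`) are made up for this illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rank_images(query_embedding, image_embeddings, top_k=3):
    """Return indices and scores of the top_k most similar images, best first."""
    q = l2_normalize(query_embedding)
    imgs = l2_normalize(image_embeddings, axis=1)
    scores = imgs @ q                      # cosine similarity per image, shape [n]
    order = np.argsort(-scores)[:top_k]    # highest similarity first
    return order, scores[order]

# Toy stand-ins for real CLIP embeddings (hypothetical values).
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(5, 8))
# Pretend the text query describes image 3: reuse its embedding plus a little noise.
query_embedding = image_embeddings[3] + 0.05 * rng.normal(size=8)

indices, scores = rank_images(query_embedding, image_embeddings)
print(indices, scores)
```

With real CLIP features, `query_embedding` would come from the text encoder and `image_embeddings` from the image encoder, both living in the same embedding space.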

How is it better than other models?

1. Zero-shot classification: Zero-shot CLIP matches the accuracy of the original ResNet-50 on ImageNet without using any of the 1.28 million training examples that ResNet-50 was trained on. Across a 27-dataset evaluation suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.
Zero-shot CLIP is competitive with a fully supervised baseline

2. Robustness to Natural Distribution Shift

Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models

Across the natural distribution shift datasets collected for this comparison, the accuracy of ImageNet models drops well below the expectation set by the ImageNet validation set, while zero-shot CLIP holds up much better.

How does it work?
1. Contrastive pre-training

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N^2 - N incorrect pairings. A symmetric cross-entropy loss is optimized over these similarity scores.

# Step 1 - Get a batch of n sample images and text
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts

# Step 2 - Get backbone models for images and texts
# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer

# Step 3 - extract feature representations of each modality with different dimensions
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# Step 4 - Learn joint multimodal embedding [n, d_e]
# W_i[d_i, d_e] - projection of image features to the embedding space, learnt during training
# W_t[d_t, d_e] - projection of text features to the embedding space, learnt during training
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# Step 5 - Get scaled pairwise cosine similarities/Logits [n, n]
# t - learned temperature parameter
logits = np.dot(I_e, T_e.T) * np.exp(t)

# Step 6 - Get symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
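The pseudocode above can be turned into a small runnable NumPy sketch. The encoders are replaced by random feature matrices, the projection matrices are random rather than learned, and the helper names (`log_softmax`, `clip_loss`) are my own, so treat this as an illustration of the loss only, not as the training code.

```python
import numpy as np

def l2_normalize(x, axis=1):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_features, text_features, W_i, W_t, t):
    """Symmetric cross-entropy over scaled pairwise cosine similarities."""
    I_e = l2_normalize(image_features @ W_i)   # [n, d_e]
    T_e = l2_normalize(text_features @ W_t)    # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)         # [n, n], rows = images, cols = texts
    diag = np.arange(logits.shape[0])          # matching pairs sit on the diagonal
    loss_i = -log_softmax(logits, axis=1)[diag, diag].mean()  # pick the right text per image
    loss_t = -log_softmax(logits, axis=0)[diag, diag].mean()  # pick the right image per text
    return (loss_i + loss_t) / 2

# Random stand-ins for encoder outputs and projections (hypothetical sizes).
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 6, 5, 3
I_f = rng.normal(size=(n, d_i))
T_f = rng.normal(size=(n, d_t))
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))
t = np.log(1 / 0.07)  # log of the inverse temperature; 0.07 is the paper's initial value

loss = clip_loss(I_f, T_f, W_i, W_t, t)
print(loss)
```

Because the two cross-entropy terms are averaged, it does not matter which one is labelled the image loss and which the text loss: swapping the roles of images and texts gives the same value.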

What data augmentations are used?
The authors keep the image transformation simple: a random square crop from resized images is the only data augmentation used during training.
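As a rough illustration of this augmentation, the sketch below takes a random square crop from an image array that is assumed to be resized already. The function name, the 224-pixel crop size and the dummy image are assumptions for the example; the resize step itself is left out.

```python
import numpy as np

def random_square_crop(image, crop_size, rng):
    """Cut a random crop_size x crop_size window out of an [h, w, c] image array."""
    h, w = image.shape[:2]
    if h < crop_size or w < crop_size:
        raise ValueError("image must be at least crop_size on both sides")
    top = rng.integers(0, h - crop_size + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]

rng = np.random.default_rng(0)
image = np.zeros((256, 320, 3), dtype=np.uint8)  # dummy "already resized" image
crop = random_square_crop(image, 224, rng)
print(crop.shape)  # (224, 224, 3)
```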

What are the different models trained and how are they trained?
They train a series of 5 ResNets and 3 Vision Transformers. For the ResNets, they train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. These are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers, they train a ViT-B/32, a ViT-B/16, and a ViT-L/14.

The text encoder is a Transformer with some architecture modifications. The base size is a 63M-parameter, 12-layer model with 8 attention heads. The Transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a vocabulary of about 49k tokens.

They train all models for 32 epochs. They use the Adam optimizer with decoupled weight decay regularization applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule. Initial hyperparameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch.
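To show what a cosine schedule looks like, here is a minimal sketch of a learning rate that decays from a base value to a minimum over training. The function name and all numbers are illustrative, not the paper's hyperparameters, and any warmup phase is left out.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` under a cosine decay from base_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative values only: 100 steps, base learning rate 1e-3.
for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps=100, base_lr=1e-3))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the final step.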

How do we use CLIP?
We can use CLIP directly either as a zero-shot classifier or to get embeddings for a given image or text. The following example gets the embeddings using encode_image and encode_text. We feed 8 example images and their textual descriptions to the model and compare the similarity between the corresponding features.
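import numpy as np
import torch
import clip
from PIL import Image

# Available models:
# ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# `images` (a list of preprocessed image tensors, e.g. preprocess(Image.open(path)))
# and `texts` (a list of description strings) are assumed to be prepared beforehand.
image_input = torch.tensor(np.stack(images)).to(device)
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).to(device)

# Encode both modalities without tracking gradients
with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

# Normalize to unit length so the dot product is the cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# [num_texts, num_images] similarity matrix
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T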

The notebook with the full code example is [here]; it also includes a zero-shot classification example.
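For a feel of how zero-shot classification works once the features are available, here is a minimal NumPy sketch: each image is assigned the class whose text embedding is most similar. The feature values and class names are toy stand-ins, the function name is made up, and the scale of 100 mirrors the example in the OpenAI CLIP repository rather than anything stated above.

```python
import numpy as np

def zero_shot_probs(image_features, text_features, scale=100.0):
    """Softmax over scaled cosine similarities: one row of class probabilities per image."""
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = scale * image_features @ text_features.T   # [num_images, num_classes]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy stand-ins: one text embedding per class prompt, e.g. "This is a photo of a cat".
class_names = ["cat", "dog", "car"]
text_features = np.eye(3)
image_features = np.array([[0.9, 0.1, 0.0],
                           [0.1, 0.2, 0.95]])

probs = zero_shot_probs(image_features, text_features)
predictions = [class_names[i] for i in probs.argmax(axis=-1)]
print(predictions)  # ['cat', 'car']
```

No classifier is trained here: changing the set of classes only means encoding a different list of text prompts.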

Some Interesting Use Cases

Finetuning CLIP

A lot of active research is still going on into robustly finetuning CLIP while avoiding catastrophic forgetting of what it learned during pretraining.