Approximate Nearest Neighbour

Anirban Sen
7 min read · Mar 29, 2024

Finding the nearest neighbour is a widespread problem in today's world. Use cases include visual search, recommendations, finding similar documents, and even LLMs. Essentially, we represent objects as embeddings (vector representations). We discussed word embeddings earlier, where we created them using word2vec. Now, how do we find the most similar words, or more generally the most similar objects? The straightforward solution is k-NN: compute a similarity such as cosine similarity against every embedding and keep the top k. This takes a significant amount of time when there are billions of embeddings.
CosineSimilarity(u, v) = (u · v) / (||u|| · ||v||), where u and v are vectors. If u and v are unit vectors, this is simply their dot product.
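To make the brute-force baseline concrete, here is a minimal NumPy sketch of exact k-NN with cosine similarity (the random data and the function name are purely illustrative):

import numpy as np

## Exact k-NN with cosine similarity (brute force)
def cosine_knn(query, vectors, k=10):
    q = query / np.linalg.norm(query)                              # normalise the query
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)   # normalise every stored vector
    sims = v @ q                                                   # one cosine similarity per stored vector
    top_k = np.argsort(-sims)[:k]                                  # indices of the k most similar vectors
    return top_k, sims[top_k]

## illustrative data: one million random 128-d embeddings and one query
vectors = np.random.rand(1_000_000, 128).astype('float32')
query = np.random.rand(128).astype('float32')
ids, scores = cosine_knn(query, vectors, k=10)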
Imagine YouTube running this algorithm for its 2.6 billion active users, then waiting a few years until you get one recommendation that you may or may not like 😂.

Image source: https://www.linkedin.com/pulse/approximate-nearest-neighbors-ann-mina-ashraf-gamil/

For search to be useful, it needs to be accurate and fast. This is where Approximate Nearest Neighbour (ANN) search comes into play. It compromises a little on accuracy but speeds things up dramatically. For the YouTube example, it is perfectly acceptable to recommend a few slightly off videos as long as the results come back quickly. There are various tools for ANN search, such as Facebook's FAISS and Spotify's ANNOY. We will discuss the algorithms available in FAISS, which are more or less the same ones used across packages.

1. Flat Index

FAISS has two flat indexes: IndexFlatL2 for Euclidean/L2 distance and IndexFlatIP for inner-product similarity (which is equivalent to cosine similarity when the vectors are normalised). A flat index performs an exhaustive search, so it is exactly k-NN. Let's look at how to use a flat index in FAISS.

## Installing
!pip install faiss-cpu --no-cache

## Importing Libraries
import faiss
import shutil
import urllib.request as request
from contextlib import closing
import tarfile
import numpy as np

# first we download the Sift1M dataset
with closing(request.urlopen('ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz')) as r:
    with open('sift.tar.gz', 'wb') as f:
        shutil.copyfileobj(r, f)
# the download leaves us with a tar.gz file, we unzip it
tar = tarfile.open('sift.tar.gz', "r:gz")
tar.extractall()
# now define a function to read the fvecs file format of Sift1M dataset
def read_fvecs(fp):
    a = np.fromfile(fp, dtype='int32')
    d = a[0]  # the first value of each record is the vector dimensionality
    return a.reshape(-1, d + 1)[:, 1:].copy().view('float32')

# data we will search through
xb = read_fvecs('./sift/sift_base.fvecs') # 1M samples
# also get some query vectors to search with
xq = read_fvecs('./sift/sift_query.fvecs')
# take just one query vector (sift_query.fvecs contains many)
xq = xq[0].reshape(1, xq.shape[1])

d = 128 # dimensionality of Sift1M data
k = 10 # number of nearest neighbors to return

index = faiss.IndexFlatIP(d)
index.add(xb)
D, I = index.search(xq, k) # D: similarity scores, I: positions of the k nearest vectors in xb
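One caveat: IndexFlatIP ranks by raw inner product. If you want true cosine similarity, as in the formula earlier, normalise the vectors first. A small sketch (the *_norm and cos_index names are just for illustration):

## Cosine similarity via a flat index: normalise, then use inner product
xb_norm = xb.copy()
xq_norm = xq.copy()
faiss.normalize_L2(xb_norm)    # in-place L2 normalisation
faiss.normalize_L2(xq_norm)
cos_index = faiss.IndexFlatIP(d)
cos_index.add(xb_norm)
D_cos, I_cos = cos_index.search(xq_norm, k)   # D_cos now holds cosine similarities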

2. Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) works by grouping vectors into buckets by processing each vector through a hash function that maximizes hashing collisions — rather than minimizing as is usual with hashing functions.

What does that mean? Imagine we have a Python dictionary. When we create a new key-value pair, we hash the key, and that hash value determines the 'bucket' where the corresponding value is stored. A Python dictionary is an example of a hash table using a typical hashing function, one that minimizes hashing collisions (a collision being two different keys producing the same hash). In a dictionary we want to avoid collisions, since they would map multiple objects to a single key.

For LSH, however, we want to maximize hashing collisions. Why? Because for search, we use LSH to group similar objects together. When we introduce a new query object (or vector), the LSH algorithm can be used to find the closest matching group.
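Before the FAISS version below, here is a minimal NumPy sketch of the random-hyperplane flavour of LSH, just to make the bucketing idea concrete (the hyperplanes, the bit count and the helper name are illustrative, not FAISS internals):

## Conceptual sketch: random-hyperplane LSH
nbits_demo = 16
rng = np.random.default_rng(0)
planes = rng.normal(size=(nbits_demo, d))   # one random hyperplane per hash bit

def lsh_bucket(vec):
    # each bit records which side of a hyperplane the vector falls on;
    # nearby vectors tend to agree on most bits, i.e. they collide
    bits = (planes @ vec > 0).astype(int)
    return ''.join(map(str, bits))

print(lsh_bucket(xq[0]))   # bucket id for our query vector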

## Locality Sensitive Hashing
nbits = d*16 # number of hash bits (higher means finer buckets but larger hashes and slower search)
# initialize index and add vectors
index = faiss.IndexLSH(d, nbits)
index.add(xb)
D, I = index.search(xq, k)

3. Hierarchical Navigable Small World Graphs

Hierarchical Navigable Small World (HNSW) graphs are another, more recent development in search. HNSW-based indexes consistently rank among the highest-performing. HNSW is a further adaptation of navigable small world (NSW) graphs, where an NSW graph is a graph structure containing vertices connected by edges to their nearest neighbors. The 'NSW' part refers to the fact that vertices in these graphs all have a very short average path length to every other vertex, despite not being directly connected.

At a high level, HNSW graphs are built by taking NSW graphs and breaking them apart into multiple layers, with each incremental layer eliminating intermediate connections between vertices. A search starts in the sparse top layer, greedily moves towards the query, and drops down a layer whenever it can get no closer, until it reaches the dense bottom layer.

## Hierarchical Navigable Small World (HNSW) Index
# set HNSW index parameters
M = 64 # number of connections (edges) each vertex will have
ef_search = 32 # size of the candidate list explored during search
ef_construction = 64 # size of the candidate list explored during index construction
# initialize index (d == 128)
index = faiss.IndexHNSWFlat(d, M)
# set efConstruction and efSearch parameters
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = ef_search
# add data to the index and run a search
index.add(xb)
D, I = index.search(xq, k)

4. Inverted File Index

The Inverted File Index (IVF) reduces the search scope through clustering. It is a very popular index because it is easy to use, with high search quality and reasonable search speed. It works on the concept of Voronoi diagrams. To understand Voronoi diagrams, imagine our highly dimensional vectors placed into a 2D space. We then place a few additional points in this space, which become our 'cluster' centroids (Voronoi cells in our case). We extend an equal radius out from each centroid; at some point the circumferences of the cell circles collide with one another, creating the cell edges. Every datapoint now falls within exactly one cell and is assigned to that cell's centroid.

Just as with our other indexes, we introduce a query vector xq. This query vector must land within one of our cells, at which point we restrict our search scope to that single cell. But there is a problem: if our query vector lands near the edge of a cell, there is a good chance that its closest datapoint sits in a neighboring cell. We call this the edge problem. To mitigate this issue and increase search quality, we can increase an index parameter known as nprobe, which sets the number of nearest cells to search.

## Inverted File Index(IVF) Index
nlist = 128 # number of cells/clusters to partition data into
quantizer = faiss.IndexFlatIP(d) # how the vectors will be stored/compared
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb) # we must train the index to cluster into cells
index.add(xb)
index.nprobe = 8 # set how many of nearest cells to search
D, I = index.search(xq, k)
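nprobe is the main recall/speed knob for IVF. A quick way to see the trade-off is to compare the IVF results against an exhaustive flat search over the same data; the loop values and recall calculation below are just a sketch:

## Effect of nprobe: recall measured against an exhaustive flat search
exact = faiss.IndexFlatL2(d)                 # exhaustive baseline used as ground truth
exact.add(xb)
_, I_exact = exact.search(xq, k)
for nprobe in [1, 4, 16, 64]:
    index.nprobe = nprobe
    _, I_ivf = index.search(xq, k)
    recall = len(set(I_ivf[0]) & set(I_exact[0])) / k   # fraction of true neighbours recovered
    print(f"nprobe={nprobe:3d}  recall@{k} = {recall:.2f}")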

These were the basic algorithms. But if we look at all the indexes available in FAISS, many of them are variants of these algorithms combined with scalar or product quantization.

Quantization makes no assumption about the distribution of the data — it looks at each dimension (or group of dimensions) separately and attempts to “bin” each value into one of many discrete buckets. Instead of performing a flat search over all of the original vectors, we can instead perform flat search over the quantized vectors — this can result in reduced memory consumption as well as significant speedup.

Scalar quantization is an important data compression technique that turns floating point values into low-precision integers. For each vector dimension, scalar quantization takes the maximum and minimum value of that particular dimension as seen across the entire database, and uniformly splits that dimension into bins across its entire range.
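As a rough illustration, a uniform 8-bit scalar quantizer for the Sift1M vectors above could look like this in NumPy (the variable names and the 256-bin choice are illustrative; FAISS ships its own implementation as IndexScalarQuantizer):

## Conceptual sketch: uniform 8-bit scalar quantization, one range per dimension
mins = xb.min(axis=0)                                   # per-dimension minimum over the dataset
maxs = xb.max(axis=0)                                   # per-dimension maximum over the dataset
steps = (maxs - mins) / 255.0                           # width of each of the 256 bins
steps[steps == 0] = 1.0                                 # guard against constant dimensions
codes = np.round((xb - mins) / steps).astype('uint8')   # 1 byte per value instead of 4
xb_approx = codes * steps + mins                        # approximate vectors used at search time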

Product quantization (PQ) is another data compression technique that is much more powerful and flexible than scalar quantization. Given a dataset of N total vectors, we first divide each vector into M subvectors (each living in its own subspace). We then run k-means (or another clustering algorithm) over the subvectors of each subspace. This gives us a collection of K centroids per subspace, each with its own unique ID. With all centroids computed, we replace every subvector in the original dataset with the ID of its closest centroid.
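In FAISS, the Sift1M vectors from earlier can be compressed with PQ in a few lines; a small sketch (the m and nbits values here are illustrative) could look like this:

## Product quantization in FAISS (IndexPQ)
m = 8                        # split each 128-d vector into 8 subvectors of 16 dims
nbits = 8                    # 8 bits per subvector => 2^8 = 256 centroids per subspace
pq_index = faiss.IndexPQ(d, m, nbits)
pq_index.train(xb)           # k-means learns the 256 centroids of each subspace
pq_index.add(xb)             # each vector is stored as 8 one-byte centroid IDs
D, I = pq_index.search(xq, k)

In practice, PQ is usually combined with IVF (faiss.IndexIVFPQ) so that only a subset of the quantized vectors is scanned for each query.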

This was just a high-level view of vector search algorithms (which is a vast research topic). I hope this was informative to at least a few people. Do clap and follow if you found this article helpful.

You can also show your appreciation by buying me a coffee

Interesting Reads —
1. Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition by Apple

References —
[1] https://www.pinecone.io/learn/series/faiss/vector-indexes/
[2] https://zilliz.com/learn/scalar-quantization-and-product-quantization
