ML Systems in the Industry

Anirban Sen
8 min read · Dec 11, 2023



Recommendations

In this blog, I intend to summarize recommendation ML systems from the industry. I will keep adding to this as I find interesting reads.

1. Scaling the Instagram Explore recommendations system

  1. Explore is one of the largest recommendation systems on Instagram. It is powered by a multi-stage ranking approach with several well-defined stages, each focusing on different objectives and algorithms: 1. Retrieval, 2. First-stage ranking, 3. Second-stage ranking, and 4. Final reranking.
  2. Retrieval — The retrieval stage consists of multiple candidate retrieval sources, which select hundreds of relevant items from a media pool of billions of items. These candidates are combined and passed to the ranking models.
    a. Candidate sources can be heuristic-based as well as ML-based, and they can be real-time or pre-generated. Instagram uses all of these source types together and mixes them with tunable weights. For example, a real-time heuristic source can be recent media from followed authors, while a real-time ML source can be candidates generated by a Two-Tower network.
    b. Two-Tower networks are trained on user and item features, and the objective is to predict engagement events (such as liking a post) as the dot product of the learnt user and item embeddings. For real-time inference, either a. the freshest user-side features are used to generate the user embedding and an ANN search finds the most similar items, or b. an ANN search finds items most similar to items the user has interacted with. Mixing these sources helps trade off between different engagement types (a minimal sketch follows the figure below).
Two Tower Architecture
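A minimal PyTorch sketch of the two-tower idea described above; the layer sizes, embedding dimension, and feature inputs are illustrative assumptions, not Instagram's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Toy two-tower model: user and item towers map their features into a
    shared embedding space; engagement is scored by the dot product."""
    def __init__(self, user_dim, item_dim, emb_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, user_features, item_features):
        u = F.normalize(self.user_tower(user_features), dim=-1)
        v = F.normalize(self.item_tower(item_features), dim=-1)
        # Dot product of the learnt embeddings approximates P(engagement).
        return (u * v).sum(dim=-1)

# At serving time, item embeddings are precomputed and stored in an ANN index;
# only the user tower runs on the freshest user-side features, and the nearest
# item embeddings are returned as retrieval candidates.
```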
  3. Ranking — Ranking gradually reduces the number of candidates from a few thousand to a few hundred in multiple stages.
    a. The first-stage ranker is a lightweight model that is less precise and less computationally intensive. Here the same Two-Tower network is used, but the objective is to predict the output of the second-stage ranker as the label (similar to knowledge distillation).
    b. The second-stage ranker predicts the probability of different engagement events (click, like, etc.) using a multi-task multi-label (MTML) neural network model. Recommendations are precomputed for some users during off-peak hours. The final score used for ordering the ranked items is a weighted sum: W_click * P(click) + W_like * P(like) - W_see_less * P(see less) + ... (a small illustration follows this list).
  4. Final reranking — Applying certain rules gives much better control over the final recommendations, e.g. a. do not show items from the same author in a sequence, to maintain diversity, and b. filter out or downrank some items based on integrity-related scores.
  5. Hyperparameters such as W_click are tuned using offline tuning (learning these parameters from data) and online Bayesian optimization.
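As a rough illustration of the weighted value-model score from point 3b, here is a tiny example; the event names, probabilities, and weight values are made up, not Instagram's.

```python
# Signed weights for each engagement event; negative weights demote an event.
# In practice these are tuned offline and with online Bayesian optimization.
weights = {"click": 0.4, "like": 1.0, "see_less": -2.5}

# Hypothetical per-item probabilities from the second-stage MTML ranker.
candidates = [
    {"id": "media_1", "probs": {"click": 0.30, "like": 0.10, "see_less": 0.01}},
    {"id": "media_2", "probs": {"click": 0.05, "like": 0.40, "see_less": 0.00}},
]

def final_score(probs):
    # W_click * P(click) + W_like * P(like) - W_see_less * P(see less) + ...
    return sum(weights[event] * probs.get(event, 0.0) for event in weights)

ranked = sorted(candidates, key=lambda c: final_score(c["probs"]), reverse=True)
print([c["id"] for c in ranked])  # ['media_2', 'media_1']
```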

2. Twitter’s Recommendation Algorithm

Source — https://blog.x.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm

Twitter distills roughly 500 million Tweets posted daily down to a handful of top Tweets that show up on your device’s For You timeline.
The recommendation pipeline is made up of three main stages:
1. Fetch the best Tweets from different recommendation sources in a process called candidate sourcing.
2. Rank each Tweet using a machine learning model.
3. Apply heuristics and filters, such as filtering out Tweets from users you’ve blocked, NSFW content, and Tweets you’ve already seen.

  1. Candidate Sourcing — Candidates are pulled from people you follow (In-Network) and from people you don't follow (Out-of-Network).
    In-Network Source — The Real Graph model, which predicts the likelihood of engagement between two users, is used for this source.
    Out-of-Network — Twitter takes two approaches here: 1) Social Graph — estimating what you would find relevant by analyzing the engagements of people you follow, or of people with similar interests, using GraphJet.
    2) Embedding Spaces — Twitter uses SimClusters, which discovers communities anchored by clusters of influential users using a custom matrix factorization algorithm.
Twitter Clusters
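SimClusters relies on a custom matrix factorization over the follow graph; the snippet below is only a generic sketch of the community-discovery idea, using scikit-learn's NMF on a toy follow matrix, not Twitter's algorithm.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy follow matrix: rows are ordinary users, columns are influential users.
follows = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Factorize into k latent communities; each user gets a soft membership vector.
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
user_memberships = nmf.fit_transform(follows)       # users x communities
influencer_memberships = nmf.components_.T          # influencers x communities

# Tweets engaged with by members of a community can then be recommended to
# other members of the same community (the Out-of-Network path above).
print(user_memberships.round(2))
```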

2. Ranking — At this point in the pipeline there are ~1,500 candidates that may be relevant. Ranking is done with a ~48M-parameter neural network that is continuously trained on Tweet interactions to optimize for positive engagement (e.g. Likes, Retweets, and Replies). This ranking mechanism takes thousands of features into account and outputs ten labels to give each Tweet a score, where each label represents the probability of an engagement.
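A toy sketch of what such a multi-label ranker could look like; the label list, layer sizes, and class name are illustrative assumptions, not the real ~48M-parameter model.

```python
import torch
import torch.nn as nn

# Four example labels; the production model predicts ten engagement labels.
ENGAGEMENT_LABELS = ["like", "retweet", "reply", "profile_click"]

class ToyRanker(nn.Module):
    """Shared trunk with one sigmoid output per engagement label."""
    def __init__(self, n_features, n_labels=len(ENGAGEMENT_LABELS)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                                   nn.Linear(256, 64), nn.ReLU())
        self.head = nn.Linear(64, n_labels)

    def forward(self, x):
        # One probability per engagement label for each candidate Tweet; these
        # are then combined into a single score with per-label weights.
        return torch.sigmoid(self.head(self.trunk(x)))

probs = ToyRanker(n_features=32)(torch.randn(3, 32))  # 3 candidates x 4 labels
```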

3. Heuristics, Filters, and Product Features — Visibility filtering, author diversity, feedback-based fatigue, and a few more filters. These work together to create a balanced and diverse feed.
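As an illustration of one such heuristic, here is a generic author-diversity rerank; it is not Twitter's actual implementation.

```python
def enforce_author_diversity(ranked_tweets, min_gap=2):
    """Greedy rerank: avoid placing Tweets from the same author within
    `min_gap` positions of each other; deferred Tweets are retried later."""
    result, deferred = [], []
    for tweet in ranked_tweets:
        if tweet["author"] in {t["author"] for t in result[-min_gap:]}:
            deferred.append(tweet)          # push it further down the feed
            continue
        result.append(tweet)
        # Try to place any previously deferred Tweets that are now allowed.
        still_deferred = []
        for d in deferred:
            if d["author"] in {t["author"] for t in result[-min_gap:]}:
                still_deferred.append(d)
            else:
                result.append(d)
        deferred = still_deferred
    return result + deferred

feed = [{"id": 1, "author": "a"}, {"id": 2, "author": "a"}, {"id": 3, "author": "b"}]
print([t["id"] for t in enforce_author_diversity(feed)])  # [1, 3, 2]
```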

Fraud Detection


In this blog, I intend to summarize fraud detection ML systems from the industry.

1. How we built it: Stripe Radar

  1. Radar is Stripe's fraud prevention solution. It assesses more than 1,000 characteristics of a potential transaction to determine the likelihood that it is fraudulent, and it does so in under 100 ms with a false positive rate (legitimate payments incorrectly blocked) of just 0.1%. This is challenging because only 1 out of every 1,000 payments is fraudulent.
  2. They started with logistic regression; today a ResNeXt-based deep neural network (DNN) model is in production.
    a. Before this, an ensemble "Wide & Deep" model, composed of an XGBoost model (the wide part, for memorization) and a DNN (the deep part, for generalization), was being used.
    b. XGBoost did not scale well because it is not very parallelizable. However, if they simply made the DNN deep enough to replace XGBoost while keeping the same performance, they ran the risk of overfitting, causing the model to memorize random noise in the features.
    c. ResNeXt's architecture adopts a "Network-in-Neuron" strategy: it splits a computation into distinct branches, and the outputs of the branches are summed to produce the block's output. Aggregating branches expands a new dimension of feature representation, which is more effective than increasing depth or width (a rough sketch follows this list).
    d. By removing the XGBoost component of the architecture, they reduced model training time by over 85% (to less than two hours).
  3. As fraud is an ever-changing domain, the Radar team also meets every week to discuss new fraud trends that emerge from research into activity on the dark web. They gather all of this information and ideate features that target the specific contours of each attack. They also ran experiments using additional synthetically generated transaction data from LLMs and got encouraging results.
  4. They also invest a lot in model explainability, as they have to explain why a transaction was marked fraudulent (especially when it was not). They have not clearly mentioned which techniques they use to surface the most important features per case, but most probably it is something like LIME/SHAP.
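Stripe has not published the exact model, so the following is only a generic sketch of the branch-and-aggregate ("split, transform, merge") idea applied to tabular transaction features; the layer sizes and the BranchedBlock name are assumptions.

```python
import torch
import torch.nn as nn

class BranchedBlock(nn.Module):
    """ResNeXt-style block for tabular features: split the computation into
    `cardinality` parallel branches, transform each, sum the outputs, and add
    a residual connection."""
    def __init__(self, dim, cardinality=8, branch_width=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, branch_width), nn.ReLU(),
                          nn.Linear(branch_width, dim))
            for _ in range(cardinality)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Aggregating many cheap branches adds capacity along a new dimension
        # (cardinality) instead of making the network deeper or wider.
        return self.norm(x + sum(branch(x) for branch in self.branches))

model = nn.Sequential(
    nn.Linear(1000, 128),             # ~1,000 transaction characteristics
    BranchedBlock(128),
    BranchedBlock(128),
    nn.Linear(128, 1), nn.Sigmoid(),  # P(transaction is fraudulent)
)
```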

Search Autocomplete


In this blog, I intend to summarize search and autocomplete ML systems from the industry.
The goal of autosuggestion is to predict relevant query completions for incomplete prefix inputs so that users reach their search intent with fewer keystrokes, saving their time and effort.

1. Autosuggestion Services in Web Search

Blog by Microsoft where they summarize the fundamentals and key aspects of modern QAC (query autocompletion) services.

  1. Tries and inverted indexes are popular choices of data structures for efficient storage and lookup of suggestions (a toy trie sketch follows this list).
  2. Improving the relevance of autosuggestion services using —
    a. Personalization — Language-specific (for the prefix "a" it might be useful to show "aadhar card" as a suggestion when Hindi is selected as the language, but not when Korean is), time-sensitive (several events such as sports, politics, and festivals occur at more or less fixed intervals of time), and based on user behavior and search history (for a user whose search history contains "avengers" and "iron man", it may be better to display "mark ruffalo" as the topmost suggestion for the prefix "mark" instead of "mark zuckerberg").
    b. Diversity — Some systems use greedy deduplication, while others use algorithms based on dynamic programming and A* search (instead of showing only aadhar-card completions for the prefix "aa", also show "aaj tak", "aakash digital", etc.).
    c. Freshness — News, forecasts, topics going viral, and so on.
  3. Non-prefix matching (due to misspellings or malformed prefixes) — Language-specific spell-correction mechanisms (gogle -> google) or inverted-index-based non-prefix matching (popular near me -> restaurants popular near me).
  4. Interesting features in modern QAC services — Rich entity information (one often finds thumbnails and some auxiliary information for entity-type suggestions such as personalities), factual answers in the suggestion box (a small calculation or lookup, e.g. the prefix "1 usd in inr" shows the converted amount), and ghosting (autosuggest algorithms estimate the likelihood of the first suggestion being clicked; if this likelihood is greater than some threshold, the search box automatically gets pre-populated with that suggestion).
  5. Zero-input scenario — (a) Yahoo shows the user's past search queries, (b) Bing shows region-wise trending queries, and (c) Google shows queries related to the previous search result.
  6. Evaluation of QAC techniques — Mean reciprocal rank (MRR), success rate at top K (SR@K), α-nDCG, minimum keystroke length (MKS), and e-Saved.
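Referring back to point 1, here is a toy trie that stores the top-scoring completions at every prefix node, so a lookup is a single walk down the prefix; the queries and scores are made up for illustration.

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.completions = []  # best (score, query) pairs passing through here

class SuggestionTrie:
    def __init__(self, top_k=5):
        self.root = TrieNode()
        self.top_k = top_k

    def add(self, query, score):
        node = self.root
        for ch in query:
            node = node.children[ch]
            node.completions.append((score, query))
            node.completions.sort(reverse=True)        # keep best first
            del node.completions[self.top_k:]          # cap memory per node

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []  # fall back to spell correction / non-prefix matching
            node = node.children[ch]
        return [q for _, q in node.completions]

trie = SuggestionTrie()
for query, score in [("aadhar card", 90), ("aaj tak", 70), ("aakash digital", 40)]:
    trie.add(query, score)
print(trie.suggest("aa"))  # ['aadhar card', 'aaj tak', 'aakash digital']
```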

Chat


In this blog, I intend to summarize chat ML systems from the industry.

1. Jio Copilot

Medium blog by Fynd where they walk through the ML model experiments behind Jio Copilot (which probably powers all chatbots of JioMart, TiraBeauty, etc.) —

Following is just a summary -
1. The objective is to build something that can tell you when your JioMart order is out for delivery, provide a smart summary of the last episode of your favourite show on JioCinema, or help you troubleshoot your JioFiber broadband connection.
2. They started with closed-source LLMs, but faced challenges such as high latency, limited API usage quota, and constraints with data transfer policies.
3. They chose Llama 2 because of user privacy, time and cost savings, state-of-the-art research, and faster and better community support.
4. The Llama 2 base models occasionally "hallucinated" product names and price and size details, were inconsistent in format, and had a few other issues on the two tasks they were targeting: intent and entity classification, and chat completion.
5. For the data required to fine-tune the model, they started by generating synthetic data across all product categories from Jio storefronts such as JioMart, TiraBeauty, Netmeds, MilkBasket, JioCinema, and JioFiber, covering a diverse set of scenarios.
6. They fine-tuned the Llama 2 7B and 13B models using PEFT- and QLoRA-based parameter- and memory-efficient techniques (we had done something similar here: https://lnkd.in/dgky-a9X). A generic sketch of such a setup follows this list.
7. To evaluate readiness, they checked for bias and ran content filtering for profanity, hate speech, violence, and other harmful content, and required the model API response time to be under 1 second for the intent-entity model and under 3 seconds for the chat model.
8. The original blog also shares the prompt templates they used to train the models.
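Referring back to point 6, here is a minimal sketch of a QLoRA-style fine-tuning setup using Hugging Face transformers, peft, and bitsandbytes; Fynd's actual hyperparameters and training pipeline are not described in the summary, so the rank, target modules, and other settings below are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-7b-hf"

# Load the base model in 4-bit (QLoRA-style quantization) to save GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach small low-rank adapters (PEFT); only these weights are trained.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# The adapted model is then trained on the synthetic intent/entity and chat
# datasets with a standard causal-LM objective (e.g. transformers.Trainer).
```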
