Machine Learning System Design Guide

By Irpan Abdurahman

This guide is intended for people preparing for ML system design interviews. It covers key concepts across common domains such as regression, classification, NLP, computer vision, and recommendation/ranking/search systems. It also provides a step-by-step framework you can follow to arrive at an end-to-end solution.

Table of Contents


🤖 ML Concepts


📈 Regression

Goal: Predict a continuous scalar value.

🧪 Offline Metrics

🛠️ Typical Models
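As a minimal sketch of the offline metrics above, here is how MSE, MAE, and R² can be computed from scratch (the function name and plain-list inputs are illustrative; in practice you would likely use a library such as scikit-learn):

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R^2 for predicted vs. actual values."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot  # 1.0 means perfect fit, 0.0 means no better than the mean
    return mse, mae, r2
```

Note that R² compares the model against a baseline that always predicts the mean, which is a useful sanity check in interviews.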


🏷️ Classification

Goal: Predict a discrete label/category.

🧪 Offline Metrics

🛠️ Typical Models
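For binary classification, the core offline metrics derive from the confusion matrix. A minimal sketch (assuming 0/1 labels; the function name is illustrative):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many we found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The precision/recall trade-off is usually tuned via the decision threshold, which is worth calling out explicitly in an interview.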


🗣️ Natural Language Processing

Goal: Analyze or generate human language data.

📝 Typical Tasks

📊 Typical Data

🧪 Offline Metrics

🛠️ Typical Models
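A classic way to turn text into features is TF-IDF. As a hedged sketch over pre-tokenized documents (real pipelines would use a library vectorizer and handle normalization, smoothing, and vocabulary pruning):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights per document from tokenized docs.

    docs: list of token lists, e.g. [["a", "b"], ["a", "c"]].
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency (normalized) times inverse document frequency
        weights.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return weights
```

Terms that appear in every document get weight zero, which is exactly the "common words are uninformative" intuition behind IDF.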


🖼️ Computer Vision

Goal: Extract insights from image or video data.

🧪 Offline Metrics

📝 Typical Tasks

📊 Typical Data

🛠️ Typical Models
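For detection tasks, the key offline metric building block is Intersection-over-Union (IoU) between predicted and ground-truth boxes. A minimal sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned boxes (x1, y1, x2, y2)."""
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection typically counts as a true positive when IoU with a ground-truth box exceeds a threshold such as 0.5; metrics like mAP are built on top of this.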


🎯 Recommendation & Ranking

Goal: Suggest relevant items to users by ranking items based on similarity or a scoring function.

📊 Typical Data

🧪 Offline Metrics for Recommendation

🧪 Offline Metrics for Ranking

🛠️ Typical Recommendation Models

🛠️ Typical Ranking Models
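NDCG is one of the most commonly asked-about ranking metrics. A minimal sketch, assuming graded relevance scores listed in the order the model ranked the items:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance scores."""
    # items lower in the ranking are discounted logarithmically
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG: DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0
```

An NDCG of 1.0 means the model's ordering matches the ideal ordering; placing the most relevant item last drags the score down.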


🔍 Retrieval

Goal: Find relevant candidates from a large collection.

🧪 Offline Metrics

🛠️ Typical Models
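Retrieval quality is often measured with recall@k: of all relevant items, what fraction appears in the top-k candidates. A minimal sketch (the function name and input shapes are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items found in the top-k retrieved list.

    retrieved: ranked list of item ids; relevant: set of relevant item ids.
    """
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Retrieval stages are usually tuned for high recall at large k, since a later ranking stage can fix precision but can never recover a candidate that was never retrieved.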


🔄 ML System Lifecycle


🧱 Feature Engineering

Step 1: Feature Selection

Step 2: Create New Features

Step 3: Handling Missing Data

Step 4: Transformation

Step 5: Handling Outliers

Step 6: Handling Imbalanced Data

Step 7: Privacy & Compliance
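Steps 3 and 4 above (missing data and transformation) can be combined in a small sketch: mean imputation followed by z-score standardization. The function name and `None`-as-missing convention are illustrative, not a prescribed API:

```python
def impute_and_standardize(values):
    """Fill missing values (None) with the mean, then z-score standardize."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    var = sum((v - mean) ** 2 for v in filled) / len(filled)
    std = var ** 0.5
    # guard against zero variance (constant feature)
    return [(v - mean) / std if std else 0.0 for v in filled]
```

An important interview point: the imputation statistics and scaling parameters must be fit on the training split only and then reused at inference time, or you introduce leakage.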


🏋️ Training

Data Splits

Train/Dev/Test (prevent leakage between splits and make sure the test set reflects the data the model will see in production).
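One common leakage-safe approach is a time-based split: sort by timestamp so the model never trains on data from the future relative to its evaluation window. A minimal sketch (fraction values are illustrative defaults):

```python
def time_based_split(records, train_frac=0.7, dev_frac=0.15):
    """Split chronologically so future data never leaks into training.

    records: list of (timestamp, example) pairs.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    train_end = int(n * train_frac)
    dev_end = int(n * (train_frac + dev_frac))
    return ordered[:train_end], ordered[train_end:dev_end], ordered[dev_end:]
```

For user-facing systems you may also need to split by user id rather than by time alone, so the same user does not appear in both train and test.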

Training Modes

Loss Function

Techniques

Model Store


⚡ Inference


🚀 Deployment

🔵 Blue-Green Deployment

Safely release a new version (code, model, service) with zero downtime.

🧪 A/B Experiments

Split users into two groups and give them different versions (control vs. test) to measure which one performs better.
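Whether the difference between control and test is statistically significant is usually checked with a two-proportion z-test on the conversion rates. A minimal sketch (function name illustrative; real experiments also need power analysis and multiple-testing care):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z-statistic for comparing conversion rates between control (A) and test (B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

A |z| above roughly 1.96 corresponds to significance at the 5% level for a two-sided test.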

🎰 Bandits

Improve A/B testing by dynamically allocating more traffic to the better-performing version.
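The simplest bandit strategy illustrating this idea is epsilon-greedy: explore a random version with probability epsilon, otherwise send traffic to the current best. A minimal sketch (class and method names are illustrative):

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy bandit: explore with probability epsilon, else exploit."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))      # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Production systems often prefer Thompson sampling or UCB, but the allocate-more-to-the-winner behavior is the same idea.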

🐤 Canary Release

Gradually roll out a new model to a small subset of users before releasing it to the entire user base.

🕵️ Shadow Deployment

Test a new model or feature in a live environment without affecting the user experience.

❄️ Cold Start

Launching a new model or feature with minimal or no historical data.

Model Cold Start: Deploying a new model without sufficient training data.

Feature Cold Start: Introducing a new feature without enough historical data.


📈 Scaling

General System Scaling

ML System Scaling


🧪 Online Metrics


🔍 Monitoring & Updates

Monitoring

Updates


🧠 ML System Design Steps Framework

⏱️ Total Duration: ~40 minutes
Each section includes a time estimate to help pace your response.

🔹 Step 1: Define the Problem (5-7 min)

🎯 Goal:

Understand the business context, define the ML task, and identify success metrics.

✅ Checklist:

🗣️ Sample Questions:

🔹 Step 2: Data Strategy & Labeling (5-7 min)

🎯 Goal:

Identify data sources, define labels, and ensure data quality.

✅ Checklist:

🧠 Tips:

🔹 Step 3: Feature Engineering & Data Processing (5-7 min)

🎯 Goal:

Design effective features based on entities and user interactions.

✅ Checklist:

🔹 Step 4: Modeling & Training Strategy (5-7 min)

🎯 Goal:

Choose appropriate model and training setup.

✅ Checklist:

🔹 Step 5: Evaluation (5 min)

🎯 Goal:

Measure model performance both offline and online.

✅ Offline:

✅ Online:

🔹 Step 6: Deployment & Monitoring (5 min)

🎯 Goal:

Design reliable and scalable deployment.

✅ Checklist:

🔹 Step 7: Wrap-Up & Trade-offs (3–5 min)

🎯 Goal:

Summarize and showcase holistic thinking.

✅ Checklist:

🧩 Bonus Topics (If Time Permits)


🦾 Example Design

Below is a social media feed recommender system designed by following the step-by-step framework above: recSystem


💬 Example Questions & Practice Resources