By Irpan Abdurahman
This guide is intended for people preparing for ML system design interviews. It covers key concepts across common domains like regression, classification, NLP, computer vision, and recommendation/ranking/search systems. It also provides a step-by-step framework one can follow to come up with an end-to-end solution.
Regression
Goal: Predict a continuous scalar value.
🧪 Offline Metrics
- Mean Squared Error (MSE): Penalizes large errors more.
- Mean Absolute Error (MAE): All errors contribute equally.
- R-squared (R²): Measures how well the model explains the variance.
🛠️ Typical Models
- Linear Regression
- Use when: Inputs and output have a linear relationship.
- Tradeoffs: Simple and interpretable, but underperforms with non-linear patterns.
- Decision Trees
- Use when: You need to model complex, non-linear interactions.
- Tradeoffs: Interpretable, but prone to overfitting; mitigate with pruning or ensemble methods (like Random Forest).
- Random Forest
- Use when: Accuracy is critical and data has complex patterns.
- Tradeoffs: Reduces overfitting compared to DT. Less interpretable, more computationally intensive.
- Gradient Boosting (e.g., XGBoost, LightGBM)
- Use when: You want state-of-the-art accuracy on tabular data.
- Tradeoffs: Can overfit, sensitive to hyperparameters.
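A minimal sketch (assuming scikit-learn) that fits a linear baseline and a gradient-boosted model on synthetic tabular data, then reports the three offline metrics above; the dataset and hyperparameters are illustrative only:

```python
# Compare a linear baseline against gradient boosting on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LinearRegression(), GradientBoostingRegressor(random_state=42)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(
        type(model).__name__,
        f"MSE={mean_squared_error(y_test, pred):.1f}",
        f"MAE={mean_absolute_error(y_test, pred):.1f}",
        f"R2={r2_score(y_test, pred):.3f}",
    )
```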
Classification
Goal: Predict a discrete label/category.
🧪 Offline Metrics
- Accuracy: Overall correct predictions.
- Precision: Correct positive predictions / Total predicted positives. (How many predicted positives are actually positive?)
- Recall: Correct positive predictions / Total actual positives. (How many of the actual positives did we predict?)
- F1 Score: Harmonic mean of precision and recall.
- AUC: Area under the curve. The model’s ability to distinguish between classes at different thresholds.
- Useful for choosing a decision threshold, e.g., finding the best cutoff K such that the model predicts Positive when score > K.
- AUC = 1.0 -> perfect model. AUC = 0.5 -> random guess. AUC < 0.5 -> something is wrong.
- ROC AUC vs. Precision-Recall AUC: use ROC when classes are balanced, P-R when imbalanced (see the sketch below).
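A short sketch (assuming scikit-learn) comparing ROC AUC and Precision-Recall AUC on a synthetic imbalanced dataset:

```python
# ROC AUC vs. Precision-Recall AUC on a 95/5 imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

print("ROC AUC:", roc_auc_score(y_te, scores))
print("PR  AUC:", average_precision_score(y_te, scores))  # more informative when imbalanced
```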
🛠️ Typical Models
- Logistic Regression
- Use when: Simple baseline for binary classification.
- Tradeoffs: Fast and interpretable, but may underperform on complex data; cannot capture non-linear decision boundaries.
- Decision Trees: Splits the data based on feature values to classify data.
- Use when: When the data has non-linear patterns.
- Tradeoffs: Prone to overfitting, but easily interpretable.
- Random Forest / Gradient Boosted Trees
- Use when: Tabular data with non-linear relationships.
- Tradeoffs: May need tuning and are less interpretable, but generalize better.
- K-Nearest Neighbors (KNN)
- Use when: You want a simple, non-parametric method; works well with complex decision boundaries.
- Tradeoffs: Computationally expensive, especially with large datasets.
- Support Vector Machines (SVM)
- Use when: High-dimensional data and small datasets.
- Tradeoffs: Can be slow on large datasets, choosing the right kernel can be tricky.
- Neural Networks
- Use when: Large datasets, image/text/audio features.
- Tradeoffs: Requires more data, less interpretable.
Natural Language Processing (NLP)
Goal: Analyze or generate human language data.
📝 Typical Tasks
- Text Classification: Spam detection, sentiment analysis.
- Named Entity Recognition (NER): Detect names, orgs, places.
- Question Answering: Extractive or generative.
- Summarization: Abstractive or extractive.
- Translation, Language Modeling, Text Generation.
- Retrieval, Text Similarity.
- Speech Tasks: ASR, TTS.
📊 Typical Data
- Raw Text: News articles, reviews, transcripts, web crawl.
- Structured Datasets: CSVs with text + labels (e.g., IMDB, SST-2).
- Paired Data: Source-target pairs (e.g., translation, Q&A).
- Multilingual Data: For translation, cross-lingual models.
- Documents: PDFs, HTML, JSON, XML.
- Audio (optional): Paired speech-text for ASR.
🧪 Offline Metrics
- BLEU / ROUGE / METEOR: Text similarity for translation/summarization.
- Accuracy / Precision / Recall / F1: For classification tasks (e.g., sentiment analysis).
- Perplexity: For evaluating language models.
🛠️ Typical Models
- TF-IDF + Logistic Regression / SVM
- Use when: Simple text classification or baseline.
- Tradeoffs: Fast, but lacks semantic understanding.
- RNN / LSTM / GRU
- Use when: Sequential tasks like sentiment analysis or translation.
- Tradeoffs: Struggle with long-range dependencies; sequential computation limits parallelism.
- Transformer-based models (e.g., BERT, GPT)
- Use when: Most modern NLP tasks.
- Tradeoffs: Require significant compute and fine-tuning.
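A minimal sketch of the TF-IDF + Logistic Regression baseline (assuming scikit-learn); the toy corpus and labels are made up for illustration:

```python
# TF-IDF features + logistic regression as a text-classification baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot, waste of time",
         "what a fantastic film", "boring and way too long"]  # toy corpus
labels = [1, 0, 1, 0]  # 1 = positive sentiment

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["an absolutely fantastic, great story"]))  # likely [1] on this toy corpus
```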
Computer Vision
Goal: Extract insights from image or video data.
🧪 Offline Metrics
- Accuracy, Precision/Recall/F1: For classification.
- Intersection over Union (IoU): For object detection.
- PSNR / SSIM: For image reconstruction or enhancement.
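IoU is simple enough to compute by hand; a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
# Intersection over Union for two axis-aligned bounding boxes.
def iou(box_a, box_b):
    # Intersection rectangle (empty if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```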
📝 Typical Tasks
- Classification, Object Detection, Segmentation (Semantic/Instance).
- Image Generation, Captioning, VQA.
- OCR, Pose Estimation, Action Recognition.
📊 Typical Data
- Image formats (JPG, PNG), annotated with labels, boxes, or masks.
- Multimodal pairs: image + caption, VQA.
- Video frames.
🛠️ Typical Models
- CNN (Convolutional Neural Networks)
- Use when: Image classification or object detection.
- Tradeoffs: Efficient and accurate; limited context awareness.
- ResNet, EfficientNet
- Use when: Deeper or more efficient image models.
- Tradeoffs: Better accuracy for the compute budget; ResNet's skip connections enable deeper networks, EfficientNet's compound scaling balances depth/width/resolution.
- YOLO / Faster R-CNN
- Use when: Real-time object detection.
- Tradeoffs: YOLO trades some accuracy for speed; Faster R-CNN is more accurate but slower.
- ViT (Vision Transformers)
- Use when: High-end image recognition with sufficient data.
- Tradeoffs: Require large datasets and compute.
Recommendation & Ranking
Goal: Suggest relevant items to users by ranking items based on similarity or a scoring function.
📊 Typical Data
- User Attributes: Metadata about users (e.g., Demographics, preferences, browsing history).
- Item Attributes: Metadata about items (e.g., product descriptions, genres, categories).
- Labels: Relevance judgments, ranking order, or relevance scores.
- User-Item Interactions: (User, Item, Score) triplets or implicit feedback such as clicks, views, purchases.
🧪 Offline Metrics for Recommendation
- Precision@K: Fraction of recommended items in the top-K list that are relevant. Example: if the system recommends 5 items and 3 are relevant, Precision@5 = 60%.
- Recall@K: Fraction of relevant items that are successfully recommended in the top-K list. Example: if the user has 10 favorite items and 3 appear in the top-5 recommendations, Recall@5 = 30%.
- Coverage: Percentage of catalog items recommended at least once, so every product has a chance to be surfaced.
- Diversity: Ensures that recommended items are varied, preventing repetitive recommendations and improving user experience.
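A minimal sketch implementing Precision@K and Recall@K, reproducing the two worked examples above (item IDs are made up):

```python
# Precision@K and Recall@K for a single user's recommendation list.
def precision_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / len(relevant)

recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e", "x", "y", "z", "p", "q", "r", "s"}  # 10 favorites
print(precision_at_k(recommended, relevant, 5))  # 3/5 = 0.6
print(recall_at_k(recommended, relevant, 5))     # 3/10 = 0.3
```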
🧪 Offline Metrics for Ranking
- MAP (Mean Average Precision): Average precision across queries. Best when multiple relevant items exist and their positions matter.
- NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality by assigning higher importance to relevant items ranked higher. Best when higher rank matters more.
- MRR (Mean Reciprocal Rank): Measures how high the first correct/relevant result appears in a ranked list. Best when the #1 ranked item matters most.
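A minimal sketch of NDCG and MRR built from their definitions (the relevance lists are toy values):

```python
import math

# DCG discounts relevance by log2(rank + 1); NDCG normalizes by the ideal ordering.
def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# MRR: mean of 1/rank of the first relevant result across queries.
def mrr(ranked_relevance_lists):
    total = 0.0
    for rels in ranked_relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance_lists)

print(ndcg([3, 2, 0, 1]))           # relevant items ranked high -> close to 1
print(mrr([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```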
🛠️ Typical Recommendation Models
- Collaborative Filtering (Matrix Factorization)
- Use when: You have rich user-item interaction data. Recommends items based on past user behaviors.
- Tradeoffs: Cold-start problem (no interaction data for new users/items), ignores metadata.
- Content-based Filtering
- Use when: You lack collaborative data or want interpretability. Recommends items based on the attributes of the items or users.
- Tradeoffs: Can’t capture user-item interactions well.
- Hybrid Models
- Use when: You want to combine strengths of collaborative and content methods. Ex: Use CF to see what similar users like and use CBF to re-rank based on your previous taste (attributes of previous likes).
- Tradeoffs: Increased complexity.
- Deep Learning-based (e.g., Two-tower models)
- Use when: Rich features and large-scale data.
- Tradeoffs: More complex and resource-intensive.
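A toy matrix-factorization sketch for collaborative filtering: learn user and item embeddings by SGD on observed (user, item, rating) triplets. The sizes, learning rate, and toy triplets are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 100, 50, 8
U = rng.normal(scale=0.1, size=(n_users, dim))  # user embeddings
V = rng.normal(scale=0.1, size=(n_items, dim))  # item embeddings

triplets = [(0, 3, 5.0), (0, 7, 1.0), (2, 3, 4.0)]  # toy interaction data
lr, reg = 0.05, 0.01
for _ in range(200):
    for u, i, r in triplets:
        err = r - U[u] @ V[i]                     # prediction error
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])    # gradient steps with L2 reg
        V[i] += lr * (err * u_old - reg * V[i])

print(U[0] @ V[3])  # should be close to the observed rating 5.0
```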
🛠️ Typical Ranking Models
- Regression-Based Ranking (Pointwise)
- Use when: You can score each item independently and rank by the predicted score.
- Tradeoffs: May not capture relationships between items as well as more complex models.
- Pairwise Ranking
- Use when: You want to compare pairs of items to determine the better one.
- Tradeoffs: Computationally expensive.
Search & Retrieval
Goal: Find relevant candidates from a large collection.
🧪 Offline Metrics
- Hit Rate (HR): Measures how often relevant items appear in the top-K retrieval.
Example: If a user’s preferred item appears in the top-10 search results, it counts as a hit.
🛠️ Typical Models
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Use when: Document retrieval tasks based on keyword relevance.
- Tradeoffs: Simple but doesn’t capture semantic meaning.
- BM25 (Best Match 25)
- Use when: Information retrieval tasks, especially in large text corpora.
- Tradeoffs: Stronger than TF-IDF thanks to term-frequency saturation and length normalization, but still purely lexical.
- Neural Networks (Deep Embeddings)
- Use when: Semantic understanding of the data is important.
- Tradeoffs: Requires large datasets and computational resources.
- Approximate Nearest Neighbors (ANN)
- Use when: Fast, scalable retrieval in large datasets.
- Tradeoffs: May not always find exact neighbors, but balances speed and accuracy well.
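A self-contained BM25 scoring sketch built directly from the standard formula (k1 and b set to common defaults); a production system would use an inverted index rather than scoring every document:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    docs = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs and cats living together", "pure nonsense"]
print(bm25_scores("cat mat", docs))  # highest score for the first document
```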
Step 1: Feature Selection
- Define key actors: users, items, queries, etc.
- Extract actor-specific features based on tasks.
Step 2: Create New Features
- Cross Features: Combine entities like User-Item (user-video watch history), Query-Doc.
- Statistical / ML Features:
- Polynomial: capture non-linear relationships.
- Binning: convert numerical data into categorical groups.
- Feature Interactions: combine existing features (add, multiply, divide).
- Clustering (e.g., k-means, mean-shift): assign entities to clusters.
- PCA: Dimensionality reduction, visualization.
- SVD: Dimensionality reduction, recommendation, latent semantic analysis (LSA).
Step 3: Handling Missing Data
- Drop: If rare (<5%).
- Imputation:
- Mean/median (numerical), mode (categorical)
- Forward/backward fill (time-series)
- Predictive models (KNN, regression)
- Indicator variable (flag missing values)
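A short sketch (assuming pandas and scikit-learn) combining mean/mode imputation with an indicator variable; the column names and values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "city": ["NY", "SF", np.nan, "NY"]})

# Indicator variable first, so the "missingness" signal is preserved.
df["age_missing"] = df["age"].isna().astype(int)

# Mean imputation for the numerical column, mode for the categorical one.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
print(df)
```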
Step 4: Feature Transformation & Encoding
- Numerical:
- Scaling: StandardScaler(Z-score), MinMax, Robust
- Log transformation
- Categorical:
- One-Hot Encoding (nominal/unordered)
- Ordinal Encoding (ordered)
- Target Encoding (mean target value, watch out for leakage!)
- Embeddings (high-cardinality categories)
- Text:
- Tokenization: Word, subword (BPE, WordPiece).
- Normalization: Lowercasing, stemming, lemmatization, stop-word removal
- Vectorization/Embeddings: TF-IDF (scoring), Word2Vec (similarity), BERT (context understanding), GPT (generative).
- Padding & Truncation: To fixed input size.
- Images:
- Resize to a fixed input size.
- Normalization: Pixel value scaling (e.g., to [0,1], or mean-std standardization).
- Data augmentation (flips, rotations, cropping).
- Conversion: RGB <-> grayscale if needed.
- Batching: Consider padding for varying sizes (segmentation, detection).
- Feature extraction / embeddings (e.g., ResNet, EfficientNet).
- Videos:
- Frame sampling, optical flow analysis.
- Convert video clips to embeddings.
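Pulling together the numerical and categorical transforms above, a short sketch (assuming pandas and scikit-learn) using a single ColumnTransformer; the toy dataframe is illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"price": [10.0, 200.0, 35.0],
                   "category": ["toys", "books", "toys"]})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),                           # z-score scaling
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),  # nominal encoding
])
print(pre.fit_transform(df))
```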
Step 5: Handling Outliers
- Detection: Z-score (>3σ), IQR (>1.5×IQR).
- Treatment: Winsorization (cap at 5th & 95th percentiles), log transform (mitigate extreme values).
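A minimal numpy sketch of IQR-based detection and winsorization on a toy array:

```python
import numpy as np

x = np.array([1.2, 0.9, 1.1, 1.0, 15.0])  # one extreme value

# IQR detection: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Winsorization: cap values at the 5th and 95th percentiles.
lo, hi = np.percentile(x, [5, 95])
x_capped = np.clip(x, lo, hi)

print(outliers)   # only the last point is flagged
print(x_capped)   # the extreme value is pulled toward the 95th percentile
```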
Step 6: Handling Imbalanced Data
- Resampling:
- Oversampling (SMOTE, ADASYN): synthesize or duplicate minority-class samples.
- Undersampling: reduce the number of majority-class samples (can lose information).
- Class Weights: Assign higher loss weights to the underrepresented class.
- Threshold Tuning: Adjust the decision threshold using the precision-recall curve (see the sketch below).
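A short sketch (assuming scikit-learn) combining class weights with threshold tuning on the precision-recall curve; the dataset and imbalance ratio are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the minority class in the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Threshold tuning: pick the cutoff that maximizes F1 along the PR curve.
prec, rec, thresh = precision_recall_curve(y_te, scores)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
print("best threshold:", thresh[np.argmax(f1[:-1])])
```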
Step 7: Privacy & Compliance
- Data minimization, anonymization & hashing.
- Ensure user consent & GDPR compliance.
- Differential privacy (add noise to prevent identification).
- Federated learning (train models without centralizing data).
Data Splits
Train/Dev/Test (make sure no leakage and ensure generalization!).
- Train set: used to train the model — i.e., learn the patterns, weights, or rules.
- Dev set: for model selection and hyperparameter tuning.
- Test set: final evaluation before deployment.
Training Modes
- Offline Training: Train on historical data.
- Online Learning: Continuously update with new data.
- Warm Start: Train with historical data and fine-tune with recent behavior.
Loss Function
- Classification:
- Cross-Entropy (Log-Loss): Measures the difference between predicted probability and actual class.
- Regression:
- Mean Squared Error (MSE): Penalizes large errors more than small ones.
- Mean Absolute Error (MAE): More robust to outliers than MSE.
- Ranking:
- Pairwise Losses: e.g., hinge loss (SVM), triplet loss (embedding learning).
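A minimal PyTorch sketch of the three loss families above; the tensors are toy values:

```python
import torch
import torch.nn as nn

# Classification: cross-entropy over softmax of raw scores.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])
print(nn.CrossEntropyLoss()(logits, target))

# Regression: MSE vs. MAE on the same predictions.
pred, truth = torch.tensor([2.5, 0.0]), torch.tensor([3.0, -0.5])
print(nn.MSELoss()(pred, truth), nn.L1Loss()(pred, truth))

# Pairwise ranking: the positive score should beat the negative by a margin.
s_pos, s_neg = torch.tensor([1.2]), torch.tensor([0.8])
print(nn.MarginRankingLoss(margin=1.0)(s_pos, s_neg, torch.tensor([1.0])))
```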
Techniques
- Cross-Validation: For robustness.
- Hyperparameter Tuning: Grid search, random search, Bayesian.
- Transfer Learning: Start from pretrained models.
- AutoML: Automate feature selection, model search (NAS), tuning.
- Distributed Training: Parallelize with data/model parallelism.
- Regularization: Prevents overfitting by adding constraints to the model or altering the training data.
- Weight Regularization:
- L1 (Lasso): Adds absolute weight penalty (λ * |w|) → Encourages sparsity, good for feature selection.
- L2 (Ridge): Adds squared weight penalty (λ * w²) → Shrinks weights to prevent overfitting.
- ElasticNet: Combines L1 and L2 penalties.
- Dropout: Randomly “drops” neurons during training to prevent co-adaptation and improve generalization.
- Data Augmentation: Especially for image/text/audio. Random transformations (e.g., crop, rotate, synonym replacement) to increase diversity of training data.
- Early Stopping: Monitor validation loss and stop training when it stops improving.
- Batch Normalization: Stabilizes training by normalizing layer inputs. Acts as a mild regularizer.
- Label Smoothing: Instead of hard 0/1 targets, use soft targets (e.g., 0.9 for correct class, 0.1 distributed among others). Helps prevent overconfidence.
- Adversarial Training: Add small perturbations to inputs during training to improve robustness.
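A minimal PyTorch sketch combining several of these techniques: L2 regularization via weight_decay, dropout, batch normalization, and an early-stopping loop. The architecture, sizes, and the stand-in validation losses are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # stabilizes training, mild regularizer
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes activations during training
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty

# Early stopping: track the best validation loss with a patience counter.
# (Training step omitted; each epoch would update `model` via `optimizer`.)
fake_val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]  # stand-in for real dev-set evals
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch, val_loss in enumerate(fake_val_losses):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"early stop at epoch {epoch}")
            break
```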
Model Store
- Object Store (e.g., S3, Blob Storage): stores trained model artifacts (.pt, .pkl, .onnx, etc.). Flexible, but version control and logging are manual.
- MLflow: manages the full ML lifecycle: experiment tracking, model registry (version control), monitoring, CI/CD.
Inference Modes
- Batch Inference: High throughput, pre-computed.
- Real-Time: Low latency, live predictions.
- Hybrid: e.g., Netflix recommendation – batch for candidates, real-time for ranking.
🔵 Blue-Green Deployment
Safely release a new version (code, model, service) with zero downtime.
- Have two identical environments: one live (Blue) and one staging (Green).
- Test in staging (Green). Once ready, switch all users from Blue to Green.
- If something breaks, rollback quickly by switching back to Blue.
🧪 A/B Experiments
Split users into two groups and give them different versions (control vs. test) to measure which one performs better.
- Make sure each group is representative of the user base to avoid bias.
- Measure impact using metrics like CTR, CVR.
- Null Hypothesis (H₀): No significant difference between the control and test groups (e.g., “The new model has the same performance as the current model”).
- If p-value < 0.05, reject the null hypothesis (statistically significant difference).
- If p-value > 0.05, fail to reject the null hypothesis.
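A worked example of the hypothesis test (assuming statsmodels), using hypothetical click counts for a two-proportion z-test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B numbers: clicks out of impressions for control vs. test.
clicks = [500, 600]
impressions = [10_000, 10_000]

stat, p_value = proportions_ztest(clicks, impressions)
print(f"z={stat:.2f}, p={p_value:.4f}")  # here p ≈ 0.002 < 0.05 -> reject H0
```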
🎰 Bandits
Improve A/B testing by dynamically allocating more traffic to the better-performing version.
- Multi-Armed Bandit (MAB): Allocates more traffic to the better-performing version while continuing to explore other options.
- Exploration vs. Exploitation: Balances trying new options (exploration) and focusing on the best-performing option (exploitation).
- Benefits: More efficient use of traffic and reduced risk compared to traditional A/B testing.
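A minimal epsilon-greedy bandit sketch over two hypothetical variants; the true CTRs are made up, and in production the reward would come from live user feedback:

```python
import random

true_ctr = {"control": 0.05, "test": 0.06}   # hypothetical; unknown in practice
counts = {arm: 0 for arm in true_ctr}
values = {arm: 0.0 for arm in true_ctr}      # running mean reward per arm
epsilon = 0.1

for _ in range(10_000):
    if random.random() < epsilon:            # explore a random arm
        arm = random.choice(list(true_ctr))
    else:                                    # exploit the best arm so far
        arm = max(values, key=values.get)
    reward = 1.0 if random.random() < true_ctr[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(counts, values)  # most traffic should flow to the better arm
```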
🐤 Canary Release
Gradually roll out a new model to a small subset of users before releasing it to the entire user base.
- How? A small group of users (canary group) gets the new model. If it performs well, the rollout is expanded.
- Benefits: Controlled testing and minimizes risk before full deployment.
🕵️ Shadow Deployment
Test a new model or feature in a live environment without affecting the user experience.
- How? The new model is deployed but runs in the background. Its predictions are logged for analysis alongside the current model’s output.
- Benefits: Risk-free testing in real-world conditions and allows for early identification of issues.
❄️ Cold Start
Launching a new model or feature with minimal or no historical data.
Model Cold Start: Deploying a new model without sufficient training data.
- Data Augmentation: Use external or synthetic data to improve training.
- Transfer Learning: Use a pre-trained model on similar data.
- Hybrid Models: Combine new models with simpler systems until more data is collected.
Feature Cold Start: Introducing a new feature without enough historical data.
- New User
- Ask onboarding questions (genres, interests)
- Use user metadata (age, location, device)
- Recommend popular/trending items
- New Item
- Use content-based features (title, category, description)
- Embed item with similar existing items
- Boost it slightly so the system can explore how users respond (exploration vs. exploitation)
General System Scaling
- Distributed Servers: Spread servers across different locations to improve reliability and speed.
- CDN (Content Delivery Network): Use nearby servers to deliver content faster to users.
- Load Balancers: Distribute network traffic to multiple servers to keep everything running efficiently.
- Sharding: Split data into smaller chunks to improve database performance.
- Replication: Copy data to multiple places for backup and faster access.
- Caching: Store frequently used data temporarily for faster access.
ML System Scaling
- Data Parallelism: Distribute data across multiple nodes to speed up training (e.g., TensorFlow, PyTorch).
- Model Parallelism: Split the model across multiple machines (for very large models).
- Asynchronous SGD: Each worker updates parameters asynchronously.
- Synchronous SGD: Workers update parameters simultaneously to ensure consistency.
- Distributed Training: Training on multiple machines to speed up large-scale model training.
Online Metrics
- Click-Through Rate (CTR): Fraction of users who click on a recommended item.
- Conversion Rate (CVR): Fraction of users who take a desired action (e.g., purchase, sign-up).
- Engagement Metrics: Time spent, interactions per session, number of likes, CPE (Cost Per Engagement).
- Revenue Impact: Revenue per user, average order value.
- Latency & System Performance: Response time, request throughput.
- User Retention & Churn: How many users return vs. leave.
Monitoring
- Logging: Features, predictions, metrics.
- Metrics:
- SW System Metrics: Server load, response time, uptime, etc.
- ML Metrics: Accuracy, loss, predictions, feature distributions.
- Online & Offline Metric Dashboards: Visualize real-time vs historical metrics for insights.
- Data Distribution Shifts:
- Types of Shifts:
- Covariate Shift: Change in the distribution of input data.
- Label Shift: Change in the distribution of output labels.
- Concept Shift: Change in the relationship between input and output.
- Detection: Use statistical methods or hypothesis testing (e.g., the Kolmogorov-Smirnov test; see the sketch after this list).
- Correction: Adjust model or retrain with fresh data.
- System Failures:
- SW System Failures: Dependency issues, deployment errors, hardware downtime.
- ML System Failures: Data distribution differences (test vs. real-time), feedback loops.
- Edge Cases: Handling invalid or junk inputs.
- Alarms: Set up alarms for failures in data pipelines, low metric performance, or system downtimes.
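A minimal drift-detection sketch (assuming scipy) applying the Kolmogorov-Smirnov test to a feature's training vs. live distributions; the values are made up:

```python
from scipy.stats import ks_2samp

# Hypothetical feature values logged at training time vs. in production.
train_values = [0.5, 0.7, 0.6, 0.55, 0.65, 0.62, 0.58, 0.61]
live_values = [0.9, 1.1, 0.95, 1.05, 1.0, 0.98, 1.02, 0.97]

stat, p_value = ks_2samp(train_values, live_values)
if p_value < 0.05:
    print(f"Possible covariate shift (KS={stat:.2f}, p={p_value:.4f})")
```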
Updates
- Continual Training: Update the model as new data becomes available.
- Model Updates: Fine-tune or retrain from scratch based on new data.
- Frequency: Decide on how often to retrain (daily, weekly, monthly, etc.).
- Auto Update Models: Automate model updates to keep the system up-to-date.
- Active Learning: Use human-in-the-loop systems to improve model performance by selecting uncertain samples for labeling.
⏱️ Total Duration: ~40 minutes
Each section includes a time estimate to help pace your response.
🔹 Step 1: Define the Problem (5-7 min)
🎯 Goal:
Understand the business context, define the ML task, and identify success metrics.
✅ Checklist:
- Lay out assumptions about the system.
- Clarify user needs and product impact.
- Identify the business objective (e.g., engagement, retention, revenue).
- Distinguish functional (what the user needs) vs. non-functional (what the system needs) requirements.
- Ask about:
- Input/output format
- Latency/throughput constraints
- Real-time vs batch
- Expected scale (users, data size)
- Reframe as ML task:
- Classification, regression, ranking, clustering, recommendation, etc.
- Input + Output
- Success metrics
- Offline metric for training
- Online metric for monitoring, CI/CD
🗣️ Sample Questions:
- “Is the goal to increase time spent, engagement, or retention?”
- “Should the model respond in real-time, or is daily batch okay?”
- “What feedback signals are available—explicit or implicit?”
🔹 Step 2: Data Strategy & Labeling (5-7 min)
🎯 Goal:
Identify data sources, define labels, and ensure data quality.
✅ Checklist:
- Available data sources
- Implicit vs explicit labels
- Manual vs automated labeling
- Cold start & feedback loop
- Bias & sampling issues
🧠 Tips:
- Mention position bias, selection bias, existing system bias
- Consider label noise, label delay, skewed distributions
🔹 Step 3: Feature Engineering & Data Processing (5-7 min)
🎯 Goal:
Design effective features based on entities and user interactions.
✅ Checklist:
- Key entities: users, items, sessions
- Feature types:
- User features (demographics, behavior)
- Item features (metadata, embeddings)
- Interaction features (clicks, dwell time)
- Preprocessing:
- Handle missing data, outliers
- One-hot vs embeddings
- Temporal features, normalization
🔹 Step 4: Modeling & Training Strategy (5-7 min)
🎯 Goal:
Choose appropriate model and training setup.
✅ Checklist:
- Start simple: Baseline → heuristic → ML → DL
- Justify model based on:
- Task complexity
- Feature richness
- Scale
- Training details:
- Data splits (train/val/test)
- Regularization, early stopping
- Loss function choice
- Class imbalance solutions
🔹 Step 5: Evaluation (5 min)
🎯 Goal:
Measure model performance both offline and online.
✅ Offline:
- Metrics: Accuracy, Precision@K, Recall@K, AUC, NDCG, MRR
- Segment-based evaluation (new vs returning users)
✅ Online:
- A/B testing setup
- Metrics: CTR, conversion, dwell time, retention
- Tradeoffs: precision vs recall, relevance vs diversity
🔹 Step 6: Deployment & Monitoring (5 min)
🎯 Goal:
Design reliable and scalable deployment.
✅ Checklist:
- Serving: Real-time vs batch
- Framework: TorchServe, TF Serving, ONNX
- Caching, latency optimizations
- Canary rollout, rollback strategy
- Monitoring:
- Drift detection
- Prediction distribution
- Alerting/observability
- Retraining schedule:
- Periodic, incremental
- Versioning and reproducibility
🔹 Step 7: Wrap-Up & Trade-offs (3–5 min)
🎯 Goal:
Summarize and showcase holistic thinking.
✅ Checklist:
- 30-second end-to-end summary
- Key tradeoffs discussed
- Next steps:
- Feedback loop
- Explainability
- Long-term maintenance
- Ethics and fairness considerations
🧩 Bonus Topics (If Time Permits)
- Cold start solutions (heuristics, hybrid systems)
- Multi-objective optimization (relevance, diversity)
- Privacy (federated learning, DP)
- Edge cases and failure handling
Below is a social media feed recommender system designed using the step-by-step framework above:
