By Irpan Abdurahman
This guide is intended for people preparing for ML system design interviews. It covers key concepts across common domains like regression, classification, NLP, computer vision, and recommendation/ranking/search systems. It also provides a step-by-step framework one can follow to come up with an end-to-end solution.
Regression
Goal: Predict a continuous scalar value.
🧪 Offline Metrics
- Mean Squared Error (MSE): Penalizes large errors more.
- Mean Absolute Error (MAE): All errors contribute equally.
- R-squared (R²): Measures how well the model explains the variance.
🛠️ Typical Models
- Linear Regression
- Use when: Inputs and output have a linear relationship.
- Tradeoffs: Simple and interpretable, but underperforms with non-linear patterns.
- Decision Trees
- Use when: You need to model complex, non-linear interactions.
- Tradeoffs: Interpretable, but prone to overfitting; mitigate with pruning or ensemble methods (like Random Forest).
- Random Forest
- Use when: Accuracy is critical and data has complex patterns.
- Tradeoffs: Reduces overfitting compared to DT. Less interpretable, more computationally intensive.
- Gradient Boosting (e.g., XGBoost, LightGBM)
- Use when: You want state-of-the-art accuracy on tabular data.
- Tradeoffs: Can overfit, sensitive to hyperparameters.
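A minimal sketch (assuming scikit-learn) that fits a linear baseline and a gradient-boosted model on synthetic tabular data, then reports the three offline metrics above; the dataset and hyperparameters are illustrative only:

```python
# Compare a linear baseline against gradient boosting on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LinearRegression(), GradientBoostingRegressor(random_state=42)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(
        type(model).__name__,
        f"MSE={mean_squared_error(y_test, pred):.1f}",
        f"MAE={mean_absolute_error(y_test, pred):.1f}",
        f"R2={r2_score(y_test, pred):.3f}",
    )
```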
Classification
Goal: Predict a discrete label/category.
🧪 Offline Metrics
- Accuracy: Overall correct predictions.
- Precision: Correct positive predictions / Total predicted positives. (How many predicted positives are actually positive?)
- Recall: Correct positive predictions / Total actual positives. (How many of the actual positives did we predict?)
- F1 Score: Harmonic mean of precision and recall.
- AUC: Area under the curve. The model’s ability to distinguish between classes at different thresholds.
- Useful for choosing a decision threshold, e.g., finding the best cutoff K such that the model predicts Positive when score > K.
- AUC = 1.0 -> perfect model. AUC = 0.5 -> random guess. AUC < 0.5 -> something is wrong.
- ROC AUC vs. Precision-Recall AUC: use ROC when classes are balanced, P-R when imbalanced (see the sketch below).
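A short sketch (assuming scikit-learn) comparing ROC AUC and Precision-Recall AUC on a synthetic imbalanced dataset:

```python
# ROC AUC vs. Precision-Recall AUC on a 95/5 imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

print("ROC AUC:", roc_auc_score(y_te, scores))
print("PR  AUC:", average_precision_score(y_te, scores))  # more informative when imbalanced
```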
🛠️ Typical Models
- Logistic Regression
- Use when: Simple baseline for binary classification.
- Tradeoffs: Fast and interpretable, but may underperform on complex data; cannot capture non-linear decision boundaries.
- Decision Trees: Splits the data based on feature values to classify data.
- Use when: When the data has non-linear patterns.
- Tradeoffs: Prone to overfitting, but easily interpretable.
- Random Forest / Gradient Boosted Trees
- Use when: Tabular data with non-linear relationships.
- Tradeoffs: May need tuning and are less interpretable, but generalize better.
- K-Nearest Neighbors (KNN)
- Use when: You want a simple, non-parametric method; works well with complex decision boundaries.
- Tradeoffs: Computationally expensive, especially with large datasets.
- Support Vector Machines (SVM)
- Use when: High-dimensional data and small datasets.
- Tradeoffs: Can be slow on large datasets, choosing the right kernel can be tricky.
- Neural Networks
- Use when: Large datasets, image/text/audio features.
- Tradeoffs: Requires more data, less interpretable.
Natural Language Processing (NLP)
Goal: Analyze or generate human language data.
📝 Typical Tasks
- Text Classification: Spam detection, sentiment analysis.
- Named Entity Recognition (NER): Detect names, orgs, places.
- Question Answering: Extractive or generative.
- Summarization: Abstractive or extractive.
- Translation, Language Modeling, Text Generation.
- Retrieval, Text Similarity.
- Speech Tasks: ASR, TTS.
📊 Typical Data
- Raw Text: News articles, reviews, transcripts, web crawl.
- Structured Datasets: CSVs with text + labels (e.g., IMDB, SST-2).
- Paired Data: Source-target pairs (e.g., translation, Q&A).
- Multilingual Data: For translation, cross-lingual models.
- Documents: PDFs, HTML, JSON, XML.
- Audio (optional): Paired speech-text for ASR.
🧪 Offline Metrics
- BLEU / ROUGE / METEOR: Text similarity for translation/summarization.
- Accuracy / Precision / Recall / F1: For classification tasks (e.g., sentiment analysis).
- Perplexity: For evaluating language models.
🛠️ Typical Models
- TF-IDF + Logistic Regression / SVM
- Use when: Simple text classification or baseline.
- Tradeoffs: Fast, but lacks semantic understanding.
- RNN / LSTM / GRU
- Use when: Sequential tasks like sentiment analysis or translation.
- Tradeoffs: Struggle with long-range dependencies; sequential computation limits parallelism.
- Transformer-based models (e.g., BERT, GPT)
- Use when: Most modern NLP tasks.
- Tradeoffs: Require significant compute and fine-tuning.
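A minimal sketch of the TF-IDF + Logistic Regression baseline (assuming scikit-learn); the toy corpus and labels are made up for illustration:

```python
# TF-IDF features + logistic regression as a text-classification baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot, waste of time",
         "what a fantastic film", "boring and way too long"]  # toy corpus
labels = [1, 0, 1, 0]  # 1 = positive sentiment

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["an absolutely fantastic, great story"]))  # likely [1] on this toy corpus
```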
Computer Vision
Goal: Extract insights from image or video data.
🧪 Offline Metrics
- Accuracy, Precision/Recall/F1: For classification.
- Intersection over Union (IoU): For object detection.
- PSNR / SSIM: For image reconstruction or enhancement.
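IoU is simple enough to compute by hand; a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
# Intersection over Union for two axis-aligned bounding boxes.
def iou(box_a, box_b):
    # Intersection rectangle (empty if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```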
📝 Typical Tasks
- Classification, Object Detection, Segmentation (Semantic/Instance).
- Image Generation, Captioning, VQA.
- OCR, Pose Estimation, Action Recognition.
📊 Typical Data
- Image formats (JPG, PNG), annotated with labels, boxes, or masks.
- Multimodal pairs: image + caption, VQA.
- Video frames.
🛠️ Typical Models
- CNN (Convolutional Neural Networks)
- Use when: Image classification or object detection.
- Tradeoffs: Efficient and accurate; limited context awareness.
- ResNet, EfficientNet
- Use when: Deeper or more efficient image models.
- Tradeoffs: Better accuracy for the compute budget; ResNet's skip connections enable deeper networks, EfficientNet's compound scaling balances depth/width/resolution.
- YOLO / Faster R-CNN
- Use when: Real-time object detection.
- Tradeoffs: YOLO trades some accuracy for speed; Faster R-CNN is more accurate but slower.
- ViT (Vision Transformers)
- Use when: High-end image recognition with sufficient data.
- Tradeoffs: Require large datasets and compute.
Recommendation & Ranking
Goal: Suggest relevant items to users by ranking items based on similarity or a scoring function.
📊 Typical Data
- User Attributes: Metadata about users (e.g., Demographics, preferences, browsing history).
- Item Attributes: Metadata about items (e.g., product descriptions, genres, categories).
- Labels: Relevance judgments, ranking order, or relevance scores.
- User-Item Interactions: (User, Item, Score) triplets or implicit feedback such as clicks, views, purchases.
🧪 Offline Metrics for Recommendation
- Precision@K: Fraction of recommended items in the top-K list that are relevant. Example: if the system recommends 5 items and 3 are relevant, Precision@5 = 60%.
- Recall@K: Fraction of relevant items that are successfully recommended in the top-K list. Example: if the user has 10 favorite items and 3 appear in the top-5 recommendations, Recall@5 = 30%.
- Coverage: Percentage of catalog items recommended at least once, so every product has a chance to be surfaced.
- Diversity: Ensures that recommended items are varied, preventing repetitive recommendations and improving user experience.
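A minimal sketch implementing Precision@K and Recall@K, reproducing the two worked examples above (item IDs are made up):

```python
# Precision@K and Recall@K for a single user's recommendation list.
def precision_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / len(relevant)

recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e", "x", "y", "z", "p", "q", "r", "s"}  # 10 favorites
print(precision_at_k(recommended, relevant, 5))  # 3/5 = 0.6
print(recall_at_k(recommended, relevant, 5))     # 3/10 = 0.3
```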
🧪 Offline Metrics for Ranking
- MAP (Mean Average Precision): Average precision across queries. Best when multiple relevant items exist and their positions matter.
- NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality by assigning higher importance to relevant items ranked higher. Best when higher rank matters more.
- MRR (Mean Reciprocal Rank): Measures how high the first correct/relevant result appears in a ranked list. Best when the #1 ranked item matters most.
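A minimal sketch of NDCG and MRR built from their definitions (the relevance lists are toy values):

```python
import math

# DCG discounts relevance by log2(rank + 1); NDCG normalizes by the ideal ordering.
def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# MRR: mean of 1/rank of the first relevant result across queries.
def mrr(ranked_relevance_lists):
    total = 0.0
    for rels in ranked_relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance_lists)

print(ndcg([3, 2, 0, 1]))           # relevant items ranked high -> close to 1
print(mrr([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```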
🛠️ Typical Recommendation Models
- Collaborative Filtering (Matrix Factorization)
- Use when: You have rich user-item interaction data. Recommends items based on past user behaviors.
- Tradeoffs: Cold-start problem (no interaction data for new users/items), ignores metadata.
- Content-based Filtering
- Use when: You lack collaborative data or want interpretability. Recommends items based on the attributes of the items or users.
- Tradeoffs: Can’t capture user-item interactions well.
- Hybrid Models
- Use when: You want to combine strengths of collaborative and content methods. Ex: Use CF to see what similar users like and use CBF to re-rank based on your previous taste (attributes of previous likes).
- Tradeoffs: Increased complexity.
- Deep Learning-based (e.g., Two-tower models)
- Use when: Rich features and large-scale data.
- Tradeoffs: More complex and resource-intensive.
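A toy matrix-factorization sketch for collaborative filtering: learn user and item embeddings by SGD on observed (user, item, rating) triplets. The sizes, learning rate, and toy triplets are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 100, 50, 8
U = rng.normal(scale=0.1, size=(n_users, dim))  # user embeddings
V = rng.normal(scale=0.1, size=(n_items, dim))  # item embeddings

triplets = [(0, 3, 5.0), (0, 7, 1.0), (2, 3, 4.0)]  # toy interaction data
lr, reg = 0.05, 0.01
for _ in range(200):
    for u, i, r in triplets:
        err = r - U[u] @ V[i]                     # prediction error
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])    # gradient steps with L2 reg
        V[i] += lr * (err * u_old - reg * V[i])

print(U[0] @ V[3])  # should be close to the observed rating 5.0
```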
🛠️ Typical Ranking Models
- Regression-Based Ranking (Pointwise)
- Use when: You can score each item independently and rank by the predicted score.
- Tradeoffs: May not capture relationships between items as well as more complex models.
- Pairwise Ranking
- Use when: You want to compare pairs of items to determine the better one.
- Tradeoffs: Computationally expensive.
Search & Retrieval
Goal: Find relevant candidates from a large collection.
🧪 Offline Metrics
- Hit Rate (HR): Measures how often relevant items appear in the top-K retrieval.
Example: If a user’s preferred item appears in the top-10 search results, it counts as a hit.
🛠️ Typical Models
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Use when: Document retrieval tasks based on keyword relevance.
- Tradeoffs: Simple but doesn’t capture semantic meaning.
- BM25 (Best Match 25)
- Use when: Information retrieval tasks, especially in large text corpora.
- Tradeoffs: Stronger than TF-IDF thanks to term-frequency saturation and length normalization, but still purely lexical.
- Neural Networks (Deep Embeddings)
- Use when: Semantic understanding of the data is important.
- Tradeoffs: Requires large datasets and computational resources.
- Approximate Nearest Neighbors (ANN)
- Use when: Fast, scalable retrieval in large datasets.
- Tradeoffs: May not always find exact neighbors, but balances speed and accuracy well.
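A self-contained BM25 scoring sketch built directly from the standard formula (k1 and b set to common defaults); a production system would use an inverted index rather than scoring every document:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    docs = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs and cats living together", "pure nonsense"]
print(bm25_scores("cat mat", docs))  # highest score for the first document
```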
Step 1: Feature Selection
- Define key actors: users, items, queries, etc.
- Extract actor-specific features based on tasks.
Step 2: Create New Features
- Cross Features: Combine entities like User-Item (user-video watch history), Query-Doc.
- Statistical / ML Features:
- Polynomial: capture non-linear relationships.
- Binning: convert numerical data into categorical groups.
- Feature Interactions: combine existing features (add, multiply, divide).
- Clustering (e.g., k-means, mean-shift): assign entities to clusters.
- PCA: Dimensionality reduction, visualization.
- SVD: Dimensionality reduction, recommendation, latent semantic analysis (LSA).
Step 3: Handling Missing Data
- Drop: If rare (<5%).
- Imputation:
- Mean/median (numerical), mode (categorical)
- Forward/backward fill (time-series)
- Predictive models (KNN, regression)
- Indicator variable (flag missing values)
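A short sketch (assuming pandas and scikit-learn) combining mean/mode imputation with an indicator variable; the column names and values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "city": ["NY", "SF", np.nan, "NY"]})

# Indicator variable first, so the "missingness" signal is preserved.
df["age_missing"] = df["age"].isna().astype(int)

# Mean imputation for the numerical column, mode for the categorical one.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
print(df)
```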
Step 4: Feature Transformation & Encoding
- Numerical:
- Scaling: StandardScaler(Z-score), MinMax, Robust
- Log transformation
- Categorical:
- One-Hot Encoding (nominal/unordered)
- Ordinal Encoding (ordered)
- Target Encoding (mean target value, watch out for leakage!)
- Embeddings (high-cardinality categories)
- Text:
- Tokenization: Word, subword (BPE, WordPiece).
- Normalization: Lowercasing, stemming, lemmatization, stop-word removal
- Vectorization/Embeddings: TF-IDF (scoring), Word2Vec (similarity), BERT (context understanding), GPT (generative).
- Padding & Truncation: To fixed input size.
- Images:
- Resize to a fixed input size.
- Normalization: Pixel value scaling (e.g., to [0,1], or mean-std standardization).
- Data augmentation (flips, rotations, cropping).
- Conversion: RGB <-> grayscale if needed.
- Batching: Consider padding for varying sizes (segmentation, detection).
- Feature extraction / embeddings (e.g., ResNet, EfficientNet).
- Videos:
- Frame sampling, optical flow analysis.
- Convert video clips to embeddings.
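Pulling together the numerical and categorical transforms above, a short sketch (assuming pandas and scikit-learn) using a single ColumnTransformer; the toy dataframe is illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"price": [10.0, 200.0, 35.0],
                   "category": ["toys", "books", "toys"]})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),                           # z-score scaling
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),  # nominal encoding
])
print(pre.fit_transform(df))
```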
Step 5: Handling Outliers
- Detection: Z-score (>3σ), IQR (>1.5×IQR).
- Treatment: Winsorization (cap at 5th & 95th percentiles), log transform (mitigate extreme values).
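A minimal numpy sketch of IQR-based detection and winsorization on a toy array:

```python
import numpy as np

x = np.array([1.2, 0.9, 1.1, 1.0, 15.0])  # one extreme value

# IQR detection: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Winsorization: cap values at the 5th and 95th percentiles.
lo, hi = np.percentile(x, [5, 95])
x_capped = np.clip(x, lo, hi)

print(outliers)   # only the last point is flagged
print(x_capped)   # the extreme value is pulled toward the 95th percentile
```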
Step 6: Handling Imbalanced Data
- Resampling:
- Oversampling (SMOTE, ADASYN): synthesize or duplicate minority-class samples.
- Undersampling: reduce the number of majority-class samples (can lose information).
- Class Weights: Assign higher loss weights to the underrepresented class.
- Threshold Tuning: Adjust the decision threshold using the precision-recall curve (see the sketch below).
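A short sketch (assuming scikit-learn) combining class weights with threshold tuning on the precision-recall curve; the dataset and imbalance ratio are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the minority class in the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Threshold tuning: pick the cutoff that maximizes F1 along the PR curve.
prec, rec, thresh = precision_recall_curve(y_te, scores)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
print("best threshold:", thresh[np.argmax(f1[:-1])])
```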
Step 7: Privacy & Compliance
- Data minimization, anonymization & hashing.
- Ensure user consent & GDPR compliance.
- Differential privacy (add noise to prevent identification).
- Federated learning (train models without centralizing data).
Data Splits
Train/Dev/Test (make sure no leakage and ensure generalization!).
- Train set: used to train the model — i.e., learn the patterns, weights, or rules.
- Dev set: for model selection and hyperparameter tuning.
- Test set: final evaluation before deployment.
Training Modes
- Offline Training: Train on historical data.
- Online Learning: Continuously update with new data.
- Warm Start: Train with historical data and fine-tune with recent behavior.
Loss Function
- Classification:
- Cross-Entropy (Log-Loss): Measures the difference between predicted probability and actual class.
- Regression:
- Mean Squared Error (MSE): Penalizes large errors more than small ones.
- Mean Absolute Error (MAE): More robust to outliers than MSE.
- Ranking:
- Pairwise Losses: e.g., hinge loss (SVM), triplet loss (embedding learning).
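A minimal PyTorch sketch of the three loss families above; the tensors are toy values:

```python
import torch
import torch.nn as nn

# Classification: cross-entropy over softmax of raw scores.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])
print(nn.CrossEntropyLoss()(logits, target))

# Regression: MSE vs. MAE on the same predictions.
pred, truth = torch.tensor([2.5, 0.0]), torch.tensor([3.0, -0.5])
print(nn.MSELoss()(pred, truth), nn.L1Loss()(pred, truth))

# Pairwise ranking: the positive score should beat the negative by a margin.
s_pos, s_neg = torch.tensor([1.2]), torch.tensor([0.8])
print(nn.MarginRankingLoss(margin=1.0)(s_pos, s_neg, torch.tensor([1.0])))
```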
Techniques
- Cross-Validation: For robustness.
- Hyperparameter Tuning: Grid search, random search, Bayesian.
- Transfer Learning: Start from pretrained models.
- AutoML: Automate feature selection, model search (NAS), tuning.
- Distributed Training: Parallelize with data/model parallelism.
- Regularization: Prevents overfitting by adding constraints to the model or altering the training data.
- Weight Regularization:
- L1 (Lasso): Adds absolute weight penalty (λ * |w|) → Encourages sparsity, good for feature selection.
- L2 (Ridge): Adds squared weight penalty (λ * w²) → Shrinks weights to prevent overfitting.
- ElasticNet: Combines L1 and L2 penalties.
- Dropout: Randomly “drops” neurons during training to prevent co-adaptation and improve generalization.
- Data Augmentation: Especially for image/text/audio. Random transformations (e.g., crop, rotate, synonym replacement) to increase diversity of training data.
- Early Stopping: Monitor validation loss and stop training when it stops improving.
- Batch Normalization: Stabilizes training by normalizing layer inputs. Acts as a mild regularizer.
- Label Smoothing: Instead of hard 0/1 targets, use soft targets (e.g., 0.9 for correct class, 0.1 distributed among others). Helps prevent overconfidence.
- Adversarial Training: Add small perturbations to inputs during training to improve robustness.
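A minimal PyTorch sketch combining several of these techniques: L2 regularization via weight_decay, dropout, batch normalization, and an early-stopping loop. The architecture, sizes, and the stand-in validation losses are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # stabilizes training, mild regularizer
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes activations during training
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty

# Early stopping: track the best validation loss with a patience counter.
# (Training step omitted; each epoch would update `model` via `optimizer`.)
fake_val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]  # stand-in for real dev-set evals
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch, val_loss in enumerate(fake_val_losses):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"early stop at epoch {epoch}")
            break
```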
Model Store
- Object Store (e.g., S3, Blob Storage): stores trained model artifacts (.pt, .pkl, .onnx, etc.). Flexible, but version control and logging are manual.
- MLflow: manages the full ML lifecycle: experiment tracking, model registry (version control), monitoring, CI/CD.
Inference Modes
- Batch Inference: High throughput, pre-computed.
- Real-Time: Low latency, live predictions.
- Hybrid: e.g., Netflix recommendation – batch for candidates, real-time for ranking.
🔵 Blue-Green Deployment
Safely release a new version (code, model, service) with zero downtime.
- Have two identical environments: one live (Blue) and one staging (Green).
- Test in staging (Green). Once ready, switch all users from Blue to Green.
- If something breaks, rollback quickly by switching back to Blue.
🧪 A/B Experiments
Split users into two groups and give them different versions (control vs. test) to measure which one performs better.
- Make sure each group is representative of the user base to avoid bias.
- Measure impact using metrics like CTR, CVR.
- Null Hypothesis (H₀): No significant difference between the control and test groups (e.g., “The new model has the same performance as the current model”).
- If p-value < 0.05, reject the null hypothesis (statistically significant difference).
- If p-value > 0.05, fail to reject the null hypothesis.
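A worked example of the hypothesis test (assuming statsmodels), using hypothetical click counts for a two-proportion z-test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B numbers: clicks out of impressions for control vs. test.
clicks = [500, 600]
impressions = [10_000, 10_000]

stat, p_value = proportions_ztest(clicks, impressions)
print(f"z={stat:.2f}, p={p_value:.4f}")  # here p ≈ 0.002 < 0.05 -> reject H0
```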
🎰 Bandits
Improve A/B testing by dynamically allocating more traffic to the better-performing version.
- Multi-Armed Bandit (MAB): Allocates more traffic to the better-performing version while continuing to explore other options.
- Exploration vs. Exploitation: Balances trying new options (exploration) and focusing on the best-performing option (exploitation).
- Benefits: More efficient use of traffic and reduced risk compared to traditional A/B testing.
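A minimal epsilon-greedy bandit sketch over two hypothetical variants; the true CTRs are made up, and in production the reward would come from live user feedback:

```python
import random

true_ctr = {"control": 0.05, "test": 0.06}   # hypothetical; unknown in practice
counts = {arm: 0 for arm in true_ctr}
values = {arm: 0.0 for arm in true_ctr}      # running mean reward per arm
epsilon = 0.1

for _ in range(10_000):
    if random.random() < epsilon:            # explore a random arm
        arm = random.choice(list(true_ctr))
    else:                                    # exploit the best arm so far
        arm = max(values, key=values.get)
    reward = 1.0 if random.random() < true_ctr[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(counts, values)  # most traffic should flow to the better arm
```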
🐤 Canary Release
Gradually roll out a new model to a small subset of users before releasing it to the entire user base.
- How? A small group of users (canary group) gets the new model. If it performs well, the rollout is expanded.
- Benefits: Controlled testing and minimizes risk before full deployment.
🕵️ Shadow Deployment
Test a new model or feature in a live environment without affecting the user experience.
- How? The new model is deployed but runs in the background. Its predictions are logged for analysis alongside the current model’s output.
- Benefits: Risk-free testing in real-world conditions and allows for early identification of issues.
❄️ Cold Start
Launching a new model or feature with minimal or no historical data.
Model Cold Start: Deploying a new model without sufficient training data.
- Data Augmentation: Use external or synthetic data to improve training.
- Transfer Learning: Use a pre-trained model on similar data.
- Hybrid Models: Combine new models with simpler systems until more data is collected.
Feature Cold Start: Introducing a new feature without enough historical data.
- New User
- Ask onboarding questions (genres, interests)
- Use user metadata (age, location, device)
- Recommend popular/trending items
- New Item
- Use content-based features (title, category, description)
- Embed item with similar existing items
- Boost it slightly so the system can explore how users respond (exploration vs. exploitation)
General System Scaling
- Distributed Servers: Spread servers across different locations to improve reliability and speed.
- CDN (Content Delivery Network): Use nearby servers to deliver content faster to users.
- Load Balancers: Distribute network traffic to multiple servers to keep everything running efficiently.
- Sharding: Split data into smaller chunks to improve database performance.
- Replication: Copy data to multiple places for backup and faster access.
- Caching: Store frequently used data temporarily for faster access.
ML System Scaling
- Data Parallelism: Distribute data across multiple nodes to speed up training (e.g., TensorFlow, PyTorch).
- Model Parallelism: Split the model across multiple machines (for very large models).
- Asynchronous SGD: Each worker updates parameters asynchronously.
- Synchronous SGD: Workers update parameters simultaneously to ensure consistency.
- Distributed Training: Training on multiple machines to speed up large-scale model training.
Online Metrics
- Click-Through Rate (CTR): Fraction of users who click on a recommended item.
- Conversion Rate (CVR): Fraction of users who take a desired action (e.g., purchase, sign-up).
- Engagement Metrics: Time spent, interactions per session, number of likes, CPE (Cost Per Engagement).
- Revenue Impact: Revenue per user, average order value.
- Latency & System Performance: Response time, request throughput.
- User Retention & Churn: How many users return vs. leave.
Monitoring
- Logging: Features, predictions, metrics.
- Metrics:
- SW System Metrics: Server load, response time, uptime, etc.
- ML Metrics: Accuracy, loss, predictions, feature distributions.
- Online & Offline Metric Dashboards: Visualize real-time vs historical metrics for insights.
- Data Distribution Shifts:
- Types of Shifts:
- Covariate Shift: Change in the distribution of input data.
- Label Shift: Change in the distribution of output labels.
- Concept Shift: Change in the relationship between input and output.
- Detection: Use statistical methods or hypothesis testing (e.g., the Kolmogorov-Smirnov test; see the sketch after this list).
- Correction: Adjust model or retrain with fresh data.
- System Failures:
- SW System Failures: Dependency issues, deployment errors, hardware downtime.
- ML System Failures: Data distribution differences (test vs. real-time), feedback loops.
- Edge Cases: Handling invalid or junk inputs.
- Alarms: Set up alarms for failures in data pipelines, low metric performance, or system downtimes.
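A minimal drift-detection sketch (assuming scipy) applying the Kolmogorov-Smirnov test to a feature's training vs. live distributions; the values are made up:

```python
from scipy.stats import ks_2samp

# Hypothetical feature values logged at training time vs. in production.
train_values = [0.5, 0.7, 0.6, 0.55, 0.65, 0.62, 0.58, 0.61]
live_values = [0.9, 1.1, 0.95, 1.05, 1.0, 0.98, 1.02, 0.97]

stat, p_value = ks_2samp(train_values, live_values)
if p_value < 0.05:
    print(f"Possible covariate shift (KS={stat:.2f}, p={p_value:.4f})")
```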
Updates
- Continual Training: Update the model as new data becomes available.
- Model Updates: Fine-tune or retrain from scratch based on new data.
- Frequency: Decide on how often to retrain (daily, weekly, monthly, etc.).
- Auto Update Models: Automate model updates to keep the system up-to-date.
- Active Learning: Use human-in-the-loop systems to improve model performance by selecting uncertain samples for labeling.
⏱️ Total Duration: ~40 minutes
Each section includes a time estimate to help pace your response.
🔹 Step 1: Define the Problem (5-7 min)
🎯 Goal:
Understand the business context, define the ML task, and identify success metrics.
✅ Checklist:
- Lay out assumptions about the system.
- Clarify user needs and product impact.
- Identify the business objective (e.g., engagement, retention, revenue).
- Distinguish functional (what the user needs) vs. non-functional (what the system needs) requirements.
- Ask about:
- Input/output format
- Latency/throughput constraints
- Real-time vs batch
- Expected scale (users, data size)
- Reframe as ML task:
- Classification, regression, ranking, clustering, recommendation, etc.
- Input + Output
- Success metrics
- Offline metric for training
- Online metric for monitoring, CI/CD
🗣️ Sample Questions:
- “Is the goal to increase time spent, engagement, or retention?”
- “Should the model respond in real-time, or is daily batch okay?”
- “What feedback signals are available—explicit or implicit?”
🔹 Step 2: Data Strategy & Labeling (5-7 min)
🎯 Goal:
Identify data sources, define labels, and ensure data quality.
✅ Checklist:
- Available data sources
- Implicit vs explicit labels
- Manual vs automated labeling
- Cold start & feedback loop
- Bias & sampling issues
🧠 Tips:
- Mention position bias, selection bias, existing system bias
- Consider label noise, label delay, skewed distributions
🔹 Step 3: Feature Engineering & Data Processing (5-7 min)
🎯 Goal:
Design effective features based on entities and user interactions.
✅ Checklist:
- Key entities: users, items, sessions
- Feature types:
- User features (demographics, behavior)
- Item features (metadata, embeddings)
- Interaction features (clicks, dwell time)
- Preprocessing:
- Handle missing data, outliers
- One-hot vs embeddings
- Temporal features, normalization
🔹 Step 4: Modeling & Training Strategy (5-7 min)
🎯 Goal:
Choose appropriate model and training setup.
✅ Checklist:
- Start simple: Baseline → heuristic → ML → DL
- Justify model based on:
- Task complexity
- Feature richness
- Scale
- Training details:
- Data splits (train/val/test)
- Regularization, early stopping
- Loss function choice
- Class imbalance solutions
🔹 Step 5: Evaluation (5 min)
🎯 Goal:
Measure model performance both offline and online.
✅ Offline:
- Metrics: Accuracy, Precision@K, Recall@K, AUC, NDCG, MRR
- Segment-based evaluation (new vs returning users)
✅ Online:
- A/B testing setup
- Metrics: CTR, conversion, dwell time, retention
- Tradeoffs: precision vs recall, relevance vs diversity
🔹 Step 6: Deployment & Monitoring (5 min)
🎯 Goal:
Design reliable and scalable deployment.
✅ Checklist:
- Serving: Real-time vs batch
- Framework: TorchServe, TF Serving, ONNX
- Caching, latency optimizations
- Canary rollout, rollback strategy
- Monitoring:
- Drift detection
- Prediction distribution
- Alerting/observability
- Retraining schedule:
- Periodic, incremental
- Versioning and reproducibility
🔹 Step 7: Wrap-Up & Trade-offs (3–5 min)
🎯 Goal:
Summarize and showcase holistic thinking.
✅ Checklist:
- 30-second end-to-end summary
- Key tradeoffs discussed
- Next steps:
- Feedback loop
- Explainability
- Long-term maintenance
- Ethics and fairness considerations
🧩 Bonus Topics (If Time Permits)
- Cold start solutions (heuristics, hybrid systems)
- Multi-objective optimization (relevance, diversity)
- Privacy (federated learning, DP)
- Edge cases and failure handling
Below is a social media feed recommender system designed using the step-by-step framework above:
