Designing ML Systems
Abbreviations
- ETL: Extract, Transform, Load
- BERT: Bidirectional Encoder Representations from Transformers
Input Features for ML Models
- Static features or batch features: features that change slowly.
- Dynamic features or streaming features: features that change quickly.
The two kinds require different systems to process, i.e. to feed into the ML system. Streaming features can require complex queries with joins and aggregations, so an efficient stream processing engine is needed, e.g. Apache Flink or KSQL; a minimal sketch follows.
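As an illustration, here is a plain-Python sketch of one streaming feature: a rolling count of events inside a time window. The class name, window size, and event shape are illustrative assumptions; engines like Flink or KSQL compute such windowed aggregations at scale, with fault tolerance.

```python
from collections import deque
import time

WINDOW_SECONDS = 30 * 60  # illustrative 30-minute window

class RollingCount:
    """Rolling count of events inside a fixed time window."""
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = deque()  # timestamps currently inside the window

    def update(self, ts):
        """Ingest one event timestamp and return the current feature value."""
        self.events.append(ts)
        # Evict timestamps that have fallen out of the window.
        while self.events and self.events[0] < ts - self.window:
            self.events.popleft()
        return len(self.events)

feature = RollingCount()
now = time.time()
for offset in (0, 10, 20):  # three events, 10 seconds apart
    value = feature.update(now + offset)
print(value)  # -> 3 events in the last 30 minutes
```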
1. Chapter 1: Machine Learning Systems in Production
- What is ML? An approach to learn complex patterns from existing data to make predictions on unseen data.
- When to use ML:
- The task is repetitive.
- The cost of a wrong prediction is low.
- The problem is at scale (millions of predictions/data points).
- Patterns are constantly changing (e.g., spam detection).
- ML in Research vs. Production:
- Research: Focuses on state-of-the-art (SOTA) performance on static benchmark datasets; prioritizes fast training.
- Production: Involves multiple stakeholders (Sales, Product, ML); prioritizes fast inference and low latency; deals with constantly shifting data.
- Key Requirements: Reliability (continuing to perform correctly even when things go wrong), Scalability, Maintainability, and Adaptability.
- Iterative Process: Scoping -> Data Engineering -> Model Development -> Deployment -> Monitoring/Maintenance -> Business Analysis.
2. Chapter 2: Data Engineering Fundamentals
- Data Formats:
- Row-major (CSV): Better for heavy writes.
- Column-major (Parquet): Better for heavy reads of specific features; more compact and efficient for storage (e.g., on S3). See the sketch after this chapter's list.
- Data Models:
- Relational: Organized into tables; usually requires strict schema.
- NoSQL: Includes Document models (schemaless, high locality) and Graph models (priority on relationships).
- Storage & Processing:
- OLTP (Transactional): Low latency, high availability for user actions (e.g., tweeting).
- OLAP (Analytical): Optimized for aggregating data across many rows.
- ETL vs. ELT: Traditionally data is transformed before loading; ELT loads raw data into a "Data Lake" for later processing.
- Dataflow Modes: Databases, Request-driven (REST/RPC), or Event-driven (Kafka/Kinesis real-time transports).
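To make the row-major vs. column-major trade-off concrete, a small pandas sketch, assuming a Parquet engine such as pyarrow is installed. The dataset and file names are toy stand-ins.

```python
import pandas as pd

# Toy dataset with three features.
df = pd.DataFrame({
    "user_id": range(1_000),
    "age": [25] * 1_000,
    "income": [50_000.0] * 1_000,
})

df.to_csv("features.csv", index=False)  # row-major: cheap row appends
df.to_parquet("features.parquet")       # column-major: compact, cheap column reads

# Reading one feature: Parquet can load just that column from disk,
# while a CSV reader has to scan every full row.
ages = pd.read_parquet("features.parquet", columns=["age"])
print(ages.shape)  # (1000, 1)
```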
3. Chapter 3: Training Data
- Sampling Techniques:
- Stratified Sampling: Sampling from each stratum (group) to ensure rare classes are represented.
- Reservoir Sampling: Useful for streaming data; ensures every item has an equal probability of being selected without knowing the total count (sketched after this chapter's list).
- Handling Lack of Labels:
- Weak Supervision: Uses heuristics/labeling functions to programmatically label data (e.g., Snorkel); a toy sketch follows this chapter's list.
- Transfer Learning: Reusing a model developed for one task (base task) as a starting point for another (downstream task).
- Active Learning: Model chooses which unlabeled samples are most useful for a human to label.
- Class Imbalance:
- Challenges: Insufficient signal for minority classes; models may learn simple (wrong) heuristics.
- Handling: Use better metrics (F1, Precision-Recall Curve instead of Accuracy); Resampling (oversampling the minority class / undersampling the majority class); Algorithm-level fixes (Cost-sensitive learning, Focal Loss; a sketch follows below).
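Reservoir sampling (Algorithm R) is short enough to sketch directly. This is the standard textbook implementation, not code from the book.

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep k items such that every element of a stream of
    unknown length n ends up in the sample with probability k/n."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k/(i+1), replacing a
            # uniformly chosen slot.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```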
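A toy illustration of weak supervision: hand-written labeling functions vote on each example and the votes are aggregated by simple majority. Snorkel's real API is richer (it learns to weight and denoise the functions); every name below is made up.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_winner(text):
    return SPAM if "winner" in text.lower() else ABSTAIN

def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_winner, lf_mentions_meeting, lf_many_exclamations]

def weak_label(text):
    """Aggregate the non-abstaining votes by majority."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("You are a WINNER!!! Claim your prize now!"))  # -> 1 (spam)
```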
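One algorithm-level fix, focal loss, sketched in NumPy. The formula follows the standard definition (Lin et al., 2017); the arrays are toy inputs.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).
    p is the predicted probability of class 1, y the 0/1 label."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# A confident, correct prediction (p_t = 0.9) is down-weighted far more
# than a hard one (p_t = 0.1), so training focuses on the hard examples.
print(focal_loss(np.array([0.9, 0.1]), np.array([1, 1])))
```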
4. Chapter 4: Feature Engineering
- Operations:
- Scaling: Normalizing features to similar ranges (crucial for gradient-based models).
- Discretization: Turning continuous features into buckets/categories. Both operations are sketched after this chapter's list.
- Data Leakage: Occurs when info from the future or target labels "leaks" into training features.
- Detection: Look for unusually high correlations or do ablation studies.
- Feature Importance: Using techniques like SHAP or Ablation to understand which features drive predictions.
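A NumPy sketch of both operations on toy values. Note the tie-in with data leakage: the scaling statistics must come from the training split only.

```python
import numpy as np

incomes = np.array([20_000.0, 45_000.0, 80_000.0, 150_000.0])

# Scaling: min-max normalization to [0, 1]. The min/max must be computed
# on the training split only, or test-time information leaks into training.
lo, hi = incomes.min(), incomes.max()
scaled = (incomes - lo) / (hi - lo)

# Discretization: bucket the continuous values at chosen boundaries.
boundaries = np.array([30_000, 60_000, 100_000])
buckets = np.digitize(incomes, boundaries)

print(scaled)   # [0.    0.192 0.462 1.   ] (approximately)
print(buckets)  # [0 1 2 3]
```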
5. Chapter 5: Model Development
- Framing: Deciding whether a problem is classification (binary, multiclass, multilabel) or regression.
- Baselines: Always start with simple heuristics (e.g., most common class), human baselines, or zero-rule baselines to justify ML complexity.
- Ensembling: Combining multiple models (Bagging, Boosting, Stacking) to improve performance; a majority-vote sketch follows this chapter's list.
- Distributed Training: Essential for large models; involves Data Parallelism or Model Parallelism.
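A minimal sketch of the aggregation step behind bagging-style ensembles: majority voting over the predictions of independently trained classifiers. The prediction matrix is a stand-in for real model outputs.

```python
import numpy as np

# Rows: predictions from three independently trained binary classifiers
# on the same four test examples (stand-in values).
model_predictions = np.array([
    [0, 1, 1, 0],  # model A
    [0, 1, 0, 0],  # model B
    [1, 1, 1, 0],  # model C
])

# Majority vote: for 0/1 labels this is the rounded mean across models.
ensemble = (model_predictions.mean(axis=0) > 0.5).astype(int)
print(ensemble)  # -> [0 1 1 0]
```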
6. Chapter 6: Model Deployment
- Online vs. Batch Prediction:
- Online: Generated on-demand via REST/RPC; low latency.
- Batch: Pre-computed and stored for later retrieval; high throughput.
- Inference Optimization:
- Quantization: Reducing the precision of weights (e.g., Float32 to Int8); sketched after this list.
- Pruning: Removing unhelpful neurons/connections.
- Knowledge Distillation: Training a small "student" model to mimic a large "teacher" model.
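A hand-rolled NumPy sketch of symmetric post-training quantization from float32 to int8. Real toolchains (e.g., PyTorch quantization, TensorFlow Lite) also calibrate activations and handle edge cases this sketch ignores.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
# int8 storage is 4x smaller; the round trip loses only a little precision.
print(np.abs(w - dequantize(q, scale)).max())
```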
7. Chapter 7: Why Systems Fail
- Data Distribution Shifts:
- Detection: two-sample tests such as the Kolmogorov-Smirnov (KS) test (applicable only to one-dimensional data) or the Maximum Mean Discrepancy (MMD) test; a KS sketch follows this chapter's list.
- Covariate Shift: \(P(X)\) changes but \(P(Y|X)\) remains the same (input distribution changes).
- Label Shift: \(P(Y)\) changes but \(P(X|Y)\) remains the same.
- Concept Drift: \(P(Y|X)\) changes (the relationship between input and output changes).
- Degenerate Feedback Loops: When a model's predictions influence the data used for its future training (common in recommendation systems).
- Edge Cases: Examples where the model performs significantly worse than average.
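A sketch of drift detection on a single feature with SciPy's two-sample KS test. The synthetic "serving" data is deliberately shifted so the test fires, and the 0.01 threshold is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)  # shifted P(X)

statistic, p_value = stats.ks_2samp(train_feature, serving_feature)
if p_value < 0.01:  # arbitrary alert threshold
    print(f"possible covariate shift (KS={statistic:.3f}, p={p_value:.2g})")
```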