Designing ML Systems
Abbreviations
- ETL: Extract, Transform, Load
- BERT: Bidirectional Encoder Representations from Transformers
Input Features for ML Models
- Static features or batch features: features that change slowly.
- Dynamic features or streaming features: features that change quickly.
The two kinds require different systems to process, i.e. to feed into the ML system. Streaming features can require complex queries with joins and aggregations, so an efficient stream processing engine is needed, e.g. Apache Flink or KSQL; a minimal sketch follows.
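As an illustration, here is a plain-Python sketch of one streaming feature: a rolling count of events inside a time window. The class name, window size, and event shape are illustrative assumptions; engines like Flink or KSQL compute such windowed aggregations at scale, with fault tolerance.

```python
from collections import deque
import time

WINDOW_SECONDS = 30 * 60  # illustrative 30-minute window

class RollingCount:
    """Rolling count of events inside a fixed time window."""
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = deque()  # timestamps currently inside the window

    def update(self, ts):
        """Ingest one event timestamp and return the current feature value."""
        self.events.append(ts)
        # Evict timestamps that have fallen out of the window.
        while self.events and self.events[0] < ts - self.window:
            self.events.popleft()
        return len(self.events)

feature = RollingCount()
now = time.time()
for offset in (0, 10, 20):  # three events, 10 seconds apart
    value = feature.update(now + offset)
print(value)  # -> 3 events in the last 30 minutes
```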
1. Chapter 1: Machine Learning Systems in Production
- What is ML? An approach to learn complex patterns from existing data to make predictions on unseen data.
- When to use ML:
- The task is repetitive.
- The cost of a wrong prediction is low.
- The problem is at scale (millions of predictions/data points).
- Patterns are constantly changing (e.g., spam detection).
- ML in Research vs. Production:
- Research: Focuses on state-of-the-art (SOTA) performance on static benchmark datasets; prioritizes fast training.
- Production: Involves multiple stakeholders (Sales, Product, ML); prioritizes fast inference and low latency; deals with constantly shifting data.
- Key Requirements: Reliability (continuing to perform correctly even when things go wrong), Scalability, Maintainability, and Adaptability.
- Iterative Process: Scoping -> Data Engineering -> Model Development -> Deployment -> Monitoring/Maintenance -> Business Analysis.
2. Chapter 2: Data Engineering Fundamentals
- Data Formats:
- Row-major (CSV): Better for heavy writes.
- Column-major (Parquet): Better for heavy reads of specific features; more compact and efficient for storage (e.g., on S3). See the sketch after this chapter's list.
- Data Models:
- Relational: Organized into tables; usually requires strict schema.
- NoSQL: Includes Document models (schemaless, high locality) and Graph models (priority on relationships).
- Storage & Processing:
- OLTP (Transactional): Low latency, high availability for user actions (e.g., tweeting).
- OLAP (Analytical): Optimized for aggregating data across many rows.
- ETL vs. ELT: Traditionally data is transformed before loading; ELT loads raw data into a "Data Lake" for later processing.
- Dataflow Modes: Databases, Request-driven (REST/RPC), or Event-driven (Kafka/Kinesis real-time transports).
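To make the row-major vs. column-major trade-off concrete, a small pandas sketch, assuming a Parquet engine such as pyarrow is installed. The dataset and file names are toy stand-ins.

```python
import pandas as pd

# Toy dataset with three features.
df = pd.DataFrame({
    "user_id": range(1_000),
    "age": [25] * 1_000,
    "income": [50_000.0] * 1_000,
})

df.to_csv("features.csv", index=False)  # row-major: cheap row appends
df.to_parquet("features.parquet")       # column-major: compact, cheap column reads

# Reading one feature: Parquet can load just that column from disk,
# while a CSV reader has to scan every full row.
ages = pd.read_parquet("features.parquet", columns=["age"])
print(ages.shape)  # (1000, 1)
```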
3. Chapter 3: Training Data
- Sampling Techniques:
- Stratified Sampling: Sampling from each stratum (group) to ensure rare classes are represented.
- Reservoir Sampling: Useful for streaming data; ensures every item has an equal probability of being selected without knowing the total count (sketched after this chapter's list).
- Handling Lack of Labels:
- Weak Supervision: Uses heuristics/labeling functions to programmatically label data (e.g., Snorkel); a toy sketch follows this chapter's list.
- Transfer Learning: Reusing a model developed for one task (base task) as a starting point for another (downstream task).
- Active Learning: Model chooses which unlabeled samples are most useful for a human to label.
- Class Imbalance:
- Challenges: Insufficient signal for minority classes; models may learn simple (wrong) heuristics.
- Handling: Use better metrics (F1, Precision-Recall Curve instead of Accuracy); Resampling (oversampling the minority class / undersampling the majority class); Algorithm-level fixes (Cost-sensitive learning, Focal Loss; a sketch follows below).
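Reservoir sampling (Algorithm R) is short enough to sketch directly. This is the standard textbook implementation, not code from the book.

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep k items such that every element of a stream of
    unknown length n ends up in the sample with probability k/n."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k/(i+1), replacing a
            # uniformly chosen slot.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```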
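A toy illustration of weak supervision: hand-written labeling functions vote on each example and the votes are aggregated by simple majority. Snorkel's real API is richer (it learns to weight and denoise the functions); every name below is made up.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_winner(text):
    return SPAM if "winner" in text.lower() else ABSTAIN

def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_winner, lf_mentions_meeting, lf_many_exclamations]

def weak_label(text):
    """Aggregate the non-abstaining votes by majority."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("You are a WINNER!!! Claim your prize now!"))  # -> 1 (spam)
```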
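One algorithm-level fix, focal loss, sketched in NumPy. The formula follows the standard definition (Lin et al., 2017); the arrays are toy inputs.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).
    p is the predicted probability of class 1, y the 0/1 label."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# A confident, correct prediction (p_t = 0.9) is down-weighted far more
# than a hard one (p_t = 0.1), so training focuses on the hard examples.
print(focal_loss(np.array([0.9, 0.1]), np.array([1, 1])))
```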
4. Chapter 4: Feature Engineering
- Operations:
- Scaling: Normalizing features to similar ranges (crucial for gradient-based models).
- Discretization: Turning continuous features into buckets/categories. Both operations are sketched after this chapter's list.
- Data Leakage: Occurs when info from the future or target labels "leaks" into training features.
- Detection: Look for unusually high correlations or do ablation studies.
- Feature Importance: Using techniques like SHAP or Ablation to understand which features drive predictions.
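A NumPy sketch of both operations on toy values. Note the tie-in with data leakage: the scaling statistics must come from the training split only.

```python
import numpy as np

incomes = np.array([20_000.0, 45_000.0, 80_000.0, 150_000.0])

# Scaling: min-max normalization to [0, 1]. The min/max must be computed
# on the training split only, or test-time information leaks into training.
lo, hi = incomes.min(), incomes.max()
scaled = (incomes - lo) / (hi - lo)

# Discretization: bucket the continuous values at chosen boundaries.
boundaries = np.array([30_000, 60_000, 100_000])
buckets = np.digitize(incomes, boundaries)

print(scaled)   # [0.    0.192 0.462 1.   ] (approximately)
print(buckets)  # [0 1 2 3]
```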
5. Chapter 5: Model Development
- Framing: Deciding whether a problem is classification (binary, multiclass, multilabel) or regression.
- Baselines: Always start with simple heuristics (e.g., most common class), human baselines, or zero-rule baselines to justify ML complexity.
- Ensembling: Combining multiple models (Bagging, Boosting, Stacking) to improve performance; a majority-vote sketch follows this chapter's list.
- Distributed Training: Essential for large models; involves Data Parallelism or Model Parallelism.
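A minimal sketch of the aggregation step behind bagging-style ensembles: majority voting over the predictions of independently trained classifiers. The prediction matrix is a stand-in for real model outputs.

```python
import numpy as np

# Rows: predictions from three independently trained binary classifiers
# on the same four test examples (stand-in values).
model_predictions = np.array([
    [0, 1, 1, 0],  # model A
    [0, 1, 0, 0],  # model B
    [1, 1, 1, 0],  # model C
])

# Majority vote: for 0/1 labels this is the rounded mean across models.
ensemble = (model_predictions.mean(axis=0) > 0.5).astype(int)
print(ensemble)  # -> [0 1 1 0]
```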
6. Chapter 6: Model Deployment
- Online vs. Batch Prediction:
- Online: Generated on-demand via REST/RPC; low latency.
- Batch: Pre-computed and stored for later retrieval; high throughput.
- Inference Optimization:
- Quantization: Reducing the precision of weights (e.g., Float32 to Int8); sketched after this list.
- Pruning: Removing unhelpful neurons/connections.
- Knowledge Distillation: Training a small "student" model to mimic a large "teacher" model.
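A hand-rolled NumPy sketch of symmetric post-training quantization from float32 to int8. Real toolchains (e.g., PyTorch quantization, TensorFlow Lite) also calibrate activations and handle edge cases this sketch ignores.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
# int8 storage is 4x smaller; the round trip loses only a little precision.
print(np.abs(w - dequantize(q, scale)).max())
```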
7. Chapter 7: Why Systems Fail
- Data Distribution Shifts:
- Detection: two-sample tests such as the Kolmogorov-Smirnov (KS) test (applicable only to one-dimensional data) or the Maximum Mean Discrepancy (MMD) test; a KS sketch follows this chapter's list.
- Covariate Shift: \(P(X)\) changes but \(P(Y|X)\) remains the same (input distribution changes).
- Label Shift: \(P(Y)\) changes but \(P(X|Y)\) remains the same.
- Concept Drift: \(P(Y|X)\) changes (the relationship between input and output changes).
- Degenerate Feedback Loops: When a model's predictions influence the data used for its future training (common in recommendation systems).
- Edge Cases: Examples where the model performs significantly worse than average.
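A sketch of drift detection on a single feature with SciPy's two-sample KS test. The synthetic "serving" data is deliberately shifted so the test fires, and the 0.01 threshold is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)  # shifted P(X)

statistic, p_value = stats.ks_2samp(train_feature, serving_feature)
if p_value < 0.01:  # arbitrary alert threshold
    print(f"possible covariate shift (KS={statistic:.3f}, p={p_value:.2g})")
```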