2026-01-27

Designing ML Systems

Table of Contents


Input Features for ML models

ML models take two kinds of input features: batch features, computed from historical data, and streaming features, computed from real-time, in-motion data. Each kind requires a different system to process it, i.e., to feed it into the ML system.

For streaming features, which can require complex queries with joins and aggregations, an efficient stream processing engine is required, e.g., Apache Flink or KSQL.
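
A minimal sketch of the kind of windowed aggregation such an engine computes, written here in plain Python over an in-memory list (the events, the 10-minute window, and the feature definition are illustrative assumptions, not Flink/KSQL APIs):

```python
from collections import deque
from datetime import datetime, timedelta

# Illustrative events: (timestamp, user_id, amount). In production these
# would arrive via a real-time transport such as Kafka.
events = [
    (datetime(2026, 1, 27, 12, 0), "u1", 5.0),
    (datetime(2026, 1, 27, 12, 4), "u1", 3.0),
    (datetime(2026, 1, 27, 12, 15), "u1", 7.0),
]

WINDOW = timedelta(minutes=10)
window = deque()  # events currently inside the sliding window

def txn_sum_last_10_min(event):
    """Streaming feature: sum of a user's transactions over the last 10 minutes."""
    ts, user, _ = event
    window.append(event)
    # Evict events that have fallen out of the 10-minute window.
    while window and window[0][0] < ts - WINDOW:
        window.popleft()
    return sum(amt for t, u, amt in window if u == user)

for e in events:
    print(e[0].time(), "->", txn_sum_last_10_min(e))
```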

1. Chapter 1: Machine Learning Systems in Production

  • What is ML? An approach to learning complex patterns from existing data and using those patterns to make predictions on unseen data.
  • When to use ML:
    • The task is repetitive.
    • The cost of a wrong prediction is low.
    • The problem is at scale (millions of predictions/data points).
    • Patterns are constantly changing (e.g., spam detection).
  • ML in Research vs. Production:
    • Research: Focuses on state-of-the-art (SOTA) performance on static benchmark datasets; prioritizes fast training.
    • Production: Involves multiple stakeholders (Sales, Product, ML); prioritizes fast inference and low latency; deals with constantly shifting data.
  • Key Requirements: Reliability (performing correctly under adversity), Scalability, Maintainability, and Adaptability.
  • Iterative Process: Scoping -> Data Engineering -> Model Development -> Deployment -> Monitoring/Maintenance -> Business Analysis.

2. Chapter 2: Data Engineering Fundamentals

  • Data Formats:
    • Row-major (CSV): Better for heavy writes.
    • Column-major (Parquet): Better for heavy reads of specific features; more compact and efficient for storage/S3 (see the sketch after this list).
  • Data Models:
    • Relational: Organized into tables; usually requires strict schema.
    • NoSQL: Includes Document models (schemaless, high locality) and Graph models (priority on relationships).
  • Storage & Processing:
    • OLTP (Transactional): Low latency, high availability for user actions (e.g., tweeting).
    • OLAP (Analytical): Optimized for aggregating data across many rows.
    • ETL vs. ELT: In ETL, data is transformed before being loaded into storage; ELT loads raw data into a "Data Lake" first and transforms it later.
  • Dataflow Modes: Databases, Request-driven (REST/RPC), or Event-driven (Kafka/Kinesis real-time transports).
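
A small sketch contrasting the two storage formats with pandas (writing Parquet requires pyarrow or fastparquet; the table and file names are made up):

```python
import pandas as pd

# Toy feature table; in practice this could be millions of rows.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [23, 35, 41],
    "clicks": [10, 4, 7],
})

# Row-major: CSV writes one full record per line, which suits heavy writes.
df.to_csv("features.csv", index=False)

# Column-major: Parquet stores each column contiguously, so reading a single
# feature avoids scanning the whole file, and columns compress better.
df.to_parquet("features.parquet")
only_clicks = pd.read_parquet("features.parquet", columns=["clicks"])
print(only_clicks)
```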

3. Chapter 3: Training Data

  • Sampling Techniques:
    • Stratified Sampling: Sampling from each stratum (group) so that rare classes are represented.
    • Reservoir Sampling: Useful for streaming data; ensures every item has an equal probability of being selected without knowing the total count (runnable sketch at the end of this chapter's notes).
  • Handling Lack of Labels:
    • Weak Supervision: Uses heuristics/labeling functions to programmatically label data (e.g., Snorkel).
    • Transfer Learning: Reusing a model developed for one task (base task) as a starting point for another (downstream task).
    • Active Learning: Model chooses which unlabeled samples are most useful for a human to label.
  • Class Imbalance:
    • Challenges: Insufficient signal for minority classes; models may learn simple (wrong) heuristics.
    • Handling: Use better metrics (F1, Precision-Recall Curve instead of Accuracy); Resampling (Oversampling minority/Undersampling majority); Algorithm-level fixes (Cost-sensitive learning, Focal Loss).
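
A minimal sketch of reservoir sampling (the classic Algorithm R), referenced in the sampling bullet above; the stream and reservoir size are illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Every item ends up in the reservoir with probability k/n, where n is the
    total number of items seen, without ever storing the full stream.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a current member with decreasing probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```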

4. Chapter 4: Feature Engineering

  • Operations:
    • Scaling: Normalizing features to similar ranges (crucial for gradient-based models).
    • Discretization: Turning continuous features into buckets/categories (both operations are sketched in code after this list).
  • Data Leakage: Occurs when info from the future or target labels "leaks" into training features.
    • Detection: Look for unusually high correlations or do ablation studies.
  • Feature Importance: Using techniques like SHAP or ablation studies to understand which features drive predictions.
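
A quick sketch of scaling and discretization with NumPy (the feature values and bucket edges are made up):

```python
import numpy as np

ages = np.array([18.0, 25.0, 32.0, 47.0, 63.0])

# Scaling: min-max normalization to [0, 1], so gradient-based models don't
# see wildly different feature ranges.
scaled = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: map the continuous feature into buckets/categories.
edges = [25, 40, 60]                # bucket boundaries (illustrative)
buckets = np.digitize(ages, edges)  # 0: <25, 1: 25-39, 2: 40-59, 3: >=60

print(scaled)   # ~ [0.0, 0.16, 0.31, 0.64, 1.0]
print(buckets)  # [0 1 1 2 3]
```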

5. Chapter 5: Model Development

  • Framing: Deciding whether a problem is classification (binary, multiclass, multilabel) or regression.
  • Baselines: Always start with simple heuristics (e.g., most common class), human baselines, or zero-rule baselines to justify ML complexity (zero-rule sketch after this list).
  • Ensembling: Combining multiple models (Bagging, Boosting, Stacking) to improve performance.
  • Distributed Training: Essential for large models; involves Data Parallelism or Model Parallelism.
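
A minimal zero-rule baseline sketch (always predict the training set's most common class); the labels are made up:

```python
from collections import Counter

def zero_rule_baseline(train_labels):
    """Return a classifier that always predicts the majority class."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda x: majority

train_labels = ["spam", "ham", "ham", "ham", "spam"]
predict = zero_rule_baseline(train_labels)

test_labels = ["ham", "ham", "spam", "ham"]
accuracy = sum(predict(None) == y for y in test_labels) / len(test_labels)
print(f"baseline accuracy: {accuracy:.2f}")  # 0.75; any ML model must beat this
```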

6. Chapter 6: Model Deployment

  • Online vs. Batch Prediction:
    • Online: Generated on demand via REST/RPC; low latency.
    • Batch: Pre-computed and stored for later retrieval; high throughput.
  • Inference Optimization:
    • Quantization: Reducing the precision of weights (e.g., Float32 to Int8); see the toy sketch after this list.
    • Pruning: Removing unhelpful neurons/connections.
    • Knowledge Distillation: Training a small "student" model to mimic a large "teacher" model.
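
A toy sketch of symmetric post-training quantization from float32 to int8 with NumPy (the weights are made up; real frameworks calibrate scales per tensor or per channel):

```python
import numpy as np

weights = np.array([0.82, -1.73, 0.05, 2.41], dtype=np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto the int8 range [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see the precision lost (in exchange for 4x smaller storage).
recovered = q.astype(np.float32) * scale
print(q)                     # int8 representation
print(recovered - weights)   # small rounding error per weight
```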

7. Chapter 7: Why Systems Fail

  • Data Distribution Shifts: Tests for detecting drift include the two-sample Kolmogorov-Smirnov (KS) test (useful only for one-dimensional data) and the Maximum Mean Discrepancy (MMD) test; a KS sketch follows this chapter's notes.
    • Covariate Shift: \(P(X)\) changes but \(P(Y|X)\) remains the same (input distribution changes).
    • Label Shift: \(P(Y)\) changes but \(P(X|Y)\) remains the same.
    • Concept Drift: \(P(Y|X)\) changes (the relationship between input and output changes).
  • Degenerate Feedback Loops: When a model's predictions influence the data used for its future training (common in recommendation systems).
  • Edge Cases: Examples where the model performs significantly worse than average.
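
A quick drift-check sketch using SciPy's two-sample KS test on a single feature (the reference and production samples are simulated):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated 1D feature: training-time reference vs. shifted production data.
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
production = rng.normal(loc=0.5, scale=1.0, size=1000)  # covariate shift in P(X)

stat, p_value = ks_2samp(reference, production)
if p_value < 0.01:
    print(f"drift detected (KS stat={stat:.3f}, p={p_value:.2e})")
else:
    print("no significant drift")
```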
