LLM Engineer's Handbook
Table of Contents
- Transformers:
- LayerNorm rescales activations back to a normal range after they've been blown up by the weight matrices
- GPT-2
- BERT - encoder only, no causal attention masking (attends bidirectionally)
- GPT-3
- RETRO (DeepMind) 2021
- encode sentences using BERT and store in database
- at inference time, fetch matching sentences and attend to them
- Chinchilla 2022
- Semantic Search
- Libraries: FAISS, ScaNN, Haystack, Pinecone, …
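A minimal semantic-search sketch with FAISS, using random vectors as a stand-in for real encoder embeddings (the dimension and index type are arbitrary choices, not from the book):

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 384
doc_vecs = np.random.rand(1000, dim).astype("float32")  # stand-in for real embeddings

index = faiss.IndexFlatIP(dim)   # exact inner-product index
faiss.normalize_L2(doc_vecs)     # normalize so inner product == cosine similarity
index.add(doc_vecs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```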
- Training-serving skew: happens when the features passed at training time and at inference time don't match up
Things that come with scale [#39]:
- PySpark or Ray
- Inference in a different, faster language (C++, Java, Rust)
- Dividing & Sharing work between teams
- Streaming for real-time training
FTI (Feature/Training/Inference) pipeline
Feature pipeline
Tools: Pandas, Polars, Spark, DBT, Flink, Bytewax
Training
Tools: PyTorch, TensorFlow, Scikit-learn, XGBoost, JAX
Inference
Tools: Same as for training
And there's a fourth pipeline: the data pipeline. See diagram at #47
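A minimal sketch of the FTI split, with the feature store and model registry stood in for by plain files (all file names and the toy data are illustrative, not the book's):

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURE_STORE = "features.csv"    # stand-in for a real feature store
MODEL_REGISTRY = "model.joblib"   # stand-in for a real model registry

def feature_pipeline(raw: pd.DataFrame) -> None:
    # compute features once and persist them so training and inference share them
    raw[["x", "y"]].dropna().to_csv(FEATURE_STORE, index=False)

def training_pipeline() -> None:
    # read features, train, push the model to the registry
    features = pd.read_csv(FEATURE_STORE)
    model = LogisticRegression().fit(features[["x"]].values, features["y"].values)
    joblib.dump(model, MODEL_REGISTRY)

def inference_pipeline(x: float) -> int:
    # load the registered model and serve a prediction
    model = joblib.load(MODEL_REGISTRY)
    return int(model.predict([[x]])[0])

feature_pipeline(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [0, 0, 1, 1]}))
training_pipeline()
print(inference_pipeline(3.5))
```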
1. RAG
Retrieval Augmented Generation
To create an embedding we can use various models:
- Word2Vec, GloVe were early methods. Still in use.
- Encoder-only transformers (e.g. BERT, RoBERTa)
- CLIP (for image and text)
Or pick a model from the MTEB (Massive Text Embedding Benchmark) leaderboard
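A minimal embedding sketch with sentence-transformers; the model name is just one common MTEB entry, not a recommendation from the book:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "What is RAG?",
    "Retrieval Augmented Generation combines search with generation.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this model
```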
Vector DBs
E.g. Qdrant
- Use approximate nearest neighbour (ANN) algorithms
- Indexing techniques:
- Hierarchical navigable small world (HNSW): builds a multi-layer proximity graph over the vectors; search starts at the sparse top layer and drills down through denser layers
- random projection
- product quantization (PQ): split vectors into sub-vectors and quantize each sub-vector separately
- locality-sensitive hashing (LSH): hash vectors into buckets so that similar vectors collide
- Similarity measures: cosine similarity, Euclidean distance, dot product
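The three similarity measures in NumPy (for L2-normalized vectors, cosine similarity and dot product give the same ranking):

```python
import numpy as np

a = np.array([0.2, 0.1, 0.9])
b = np.array([0.3, 0.0, 0.8])

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
print(f"dot={dot:.3f} cosine={cosine:.3f} euclidean={euclidean:.3f}")
```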
Retrieval:
Pre-retrieval
- Query rewriting
- Query expansion
- Data indexing optimization (cleaning, metadata, chunking)
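A minimal sketch of one indexing choice, fixed-size chunking with overlap (chunk size and overlap values are arbitrary):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks so ideas aren't cut off at boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("word " * 1000)
print(len(chunks), len(chunks[0]))
```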
Retrieval:
- The Instructor embedding model produces embeddings tuned for the task at hand without fine-tuning the model itself (the task description is given as part of the input).
- Hybrid search: combine vector and keyword-based search (see the fusion sketch after this list)
- Filtered vector search
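A minimal hybrid-search sketch using reciprocal rank fusion to merge a keyword ranking with a vector ranking; RRF is one common fusion method (the book doesn't prescribe it) and k=60 is its usual default:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists, rewarding ids that rank high in any of them."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # from keyword/BM25 search
vector_hits = ["doc1", "doc4", "doc3"]   # from ANN vector search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # doc1 and doc3 rise to the top
```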
Post-retrieval:
- The objective is to avoid too large a context, or a context filled with irrelevant information
- Prompt compression: eliminates useless details
Re-ranking: select the top N results of the vector search using a different model (a cross-encoder [#154]) that outputs matching scores.
Terminology:
- Bi-encoder: used in the vector query step. Texts are encoded separately (e.g. word2vec) and then the similarity between the embeddings is measured.
- Cross-encoder: used in the re-ranking step. It takes a pair of texts at a time and outputs a similarity score. Slower but more accurate. (See the sketch below.)
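A minimal re-ranking sketch with a cross-encoder via sentence-transformers (the ms-marco model name is a common public choice, not necessarily the book's):

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What is retrieval augmented generation?"
candidates = [
    "RAG augments an LLM with documents fetched at inference time.",
    "The weather is nice today.",
]

scores = reranker.predict([(query, doc) for doc in candidates])  # one score per pair
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the relevant document should come first
```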
Streaming
Distributed event streaming platform: Apache Kafka
Or, for simplicity, queues: RabbitMQ
- Streaming engine: Apache Flink
Change data capture (CDC):
- Pull: Last update Timestamp based
- Push: DB triggers
- Log-based: transaction logs
This is the best option. The only con is that the implementation is DB-specific, because the log files don't have a standardized format.
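A minimal sketch of pull-based CDC, polling for rows with a newer updated_at timestamp; the table and column names are made up for illustration:

```python
import sqlite3

def pull_changes(conn: sqlite3.Connection, last_sync: str) -> list[tuple]:
    """Fetch rows changed since the last sync, based on an updated_at timestamp column."""
    cur = conn.execute(
        "SELECT id, body, updated_at FROM articles "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, body TEXT, updated_at TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'RAG post', '2024-01-02T10:00:00')")

changes = pull_changes(conn, last_sync="2024-01-01T00:00:00")
# each batch of changes would then be published to a queue/stream (e.g. RabbitMQ, Kafka)
print(changes)
```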
2. SFT - Supervised Fine-tuning
Create instruction dataset
- A collection of instruction-output pairs, or system-instruction-output triplets (the system prompt is effectively part of the instruction anyway).
In the project followed in the book, we only had blog posts/articles. These are the outputs (answers). The instruction is then something like "Write an article about RAG". To create the instruction dataset, we can use another LLM and ask it to generate, say, 5 instructions corresponding to a given chunk of an article. Alternatively, we can ask it to generate 5 instruction-output pairs directly.
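A minimal sketch of that generation step; the OpenAI client and model name are just one possible backend, and the prompt wording is illustrative:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_pairs(chunk: str, n: int = 5) -> str:
    prompt = (
        f"Based on the following article excerpt, write {n} instruction-output pairs "
        "that could have produced it. Return a JSON list of objects with "
        '"instruction" and "output" keys.\n\n' + chunk
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_pairs("Retrieval Augmented Generation (RAG) combines a retriever with a generator..."))
```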
- Deduplicate data
- Exact
- Fuzzy deduplication (e.g. MinHash; see the sketch at the end of this section) [#213]
- Semantic similarity
- Data decontamination
- Data generation, augmentation
Make sure the training dataset doesn't contain test dataset samples.
For larger models the number of samples can be low (e.g. 70B -> around 1000 samples)
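A minimal fuzzy-deduplication sketch with MinHash + LSH from the datasketch library (the threshold and num_perm values are typical choices, not the book's):

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "rag combines retrieval with text generation",
    "b": "rag combines retrieval with text generation techniques",  # near-duplicate of "a"
    "c": "fine-tuning adapts a pretrained model to a task",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
duplicates = []
for doc_id, text in docs.items():
    m = minhash(text)
    near = lsh.query(m)          # ids of already-indexed near-duplicates
    if near:
        duplicates.append((doc_id, near))
    else:
        lsh.insert(doc_id, m)
print(duplicates)                # "b" is likely flagged as a near-duplicate of "a"
```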
3. Inference Optimization
3.1. Model Optimization
We need to reduce VRAM (Video Random Access Memory) usage.
KV Cache:
For a 7B model it's around 2 GB for 2048 tokens.
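A back-of-the-envelope for that figure, assuming a Llama-2-7B-like config (32 layers, 32 KV heads, head dim 128): fp16 gives roughly 1 GiB per 2048-token sequence, fp32 roughly 2 GiB:

```python
layers, kv_heads, head_dim = 32, 32, 128   # Llama-2-7B-like config (assumption)
seq_len, bytes_per_value = 2048, 2         # 2048 tokens, fp16

# K and V are both cached, per layer, per head, per token
kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
print(kv_cache_bytes / 2**30, "GiB")       # ~1 GiB at fp16, ~2 GiB at fp32
```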
- Continuous batching
- Speculative Decoding
- Optimized attention mechanisms
- PagedAttention
- FlashAttention
3.2. Model Parallelism
- Data
- Pipeline
- Tensor
This is efficient in the context of LLMs because we can easily parallelize across attention heads.
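A toy illustration of the idea: split the query projection head-wise across two simulated devices, let each compute its heads independently, and concatenate (pure NumPy, devices stood in for by array shards):

```python
import numpy as np

d_model, n_heads = 64, 8                   # head_dim = d_model // n_heads = 8
x = np.random.rand(10, d_model)            # 10 tokens
W_q = np.random.rand(d_model, d_model)     # full query projection

# "device 0" holds the columns for the first 4 heads, "device 1" the last 4
shards = np.split(W_q, 2, axis=1)          # each shard: (d_model, d_model // 2)

# each device projects its own heads; no communication until the concat
partial = [x @ shard for shard in shards]
q = np.concatenate(partial, axis=1)

assert np.allclose(q, x @ W_q)             # identical to the unsharded projection
```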