LLM Engineer's Handbook
Table of Contents
- Transformers:
- LayerNorm rescales activations back to a normal range after they've been blown up by the weight matrices
- GPT-2
- BERT - encoder only, no causal attention masking (attends bidirectionally)
- GPT-3
- RETRO (DeepMind) 2021
- encode sentences using BERT and store in database
- at inference time, fetch matching sentences and attend to them
- Chinchilla 2022
- Semantic Search
- Libraries: FAISS, ScaNN, Haystack, Pinecone, …
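A minimal semantic-search sketch with FAISS, using random vectors as a stand-in for real encoder embeddings (the dimension and index type are arbitrary choices, not from the book):

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 384
doc_vecs = np.random.rand(1000, dim).astype("float32")  # stand-in for real embeddings

index = faiss.IndexFlatIP(dim)   # exact inner-product index
faiss.normalize_L2(doc_vecs)     # normalize so inner product == cosine similarity
index.add(doc_vecs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```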
- Training-serving skew: happens when the features passed at training time and at inference time don't match up
Things that come with scale [#39]:
- PySpark or Ray
- Inference in a different, faster language (C++, Java, Rust)
- Dividing & Sharing work between teams
- Streaming for real-time training
FTI (Feature/Training/Inference) pipeline
Feature pipeline
Tools: Pandas, Polars, Spark, DBT, Flink, Bytewax
Training
Tools: PyTorch, TensorFlow, Scikit-learn, XGBoost, JAX
Inference
Tools: Same as for training
And there's a fourth pipeline: the data pipeline. See diagram at #47
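A minimal sketch of the FTI split, with the feature store and model registry stood in for by plain files (all file names and the toy data are illustrative, not the book's):

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURE_STORE = "features.csv"    # stand-in for a real feature store
MODEL_REGISTRY = "model.joblib"   # stand-in for a real model registry

def feature_pipeline(raw: pd.DataFrame) -> None:
    # compute features once and persist them so training and inference share them
    raw[["x", "y"]].dropna().to_csv(FEATURE_STORE, index=False)

def training_pipeline() -> None:
    # read features, train, push the model to the registry
    features = pd.read_csv(FEATURE_STORE)
    model = LogisticRegression().fit(features[["x"]].values, features["y"].values)
    joblib.dump(model, MODEL_REGISTRY)

def inference_pipeline(x: float) -> int:
    # load the registered model and serve a prediction
    model = joblib.load(MODEL_REGISTRY)
    return int(model.predict([[x]])[0])

feature_pipeline(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [0, 0, 1, 1]}))
training_pipeline()
print(inference_pipeline(3.5))
```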
1. RAG
Retrieval Augmented Generation
To create an embedding we can use various models:
- Word2Vec, GloVe were early methods. Still in use.
- Encoder-only transformers (e.g. BERT, RoBERTa)
- CLIP (for image and text)
Or pick a model from the MTEB (Massive Text Embedding Benchmark) leaderboard
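A minimal embedding sketch with sentence-transformers; the model name is just one common MTEB entry, not a recommendation from the book:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "What is RAG?",
    "Retrieval Augmented Generation combines search with generation.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this model
```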
Vector DBs
E.g. Qdrant
- Use approximate nearest neighbour (ANN) algorithms
- Indexing techniques:
- Hierarchical navigable small world (HNSW): builds a multi-layer proximity graph over the vectors; search starts at the sparse top layer and drills down through denser layers
- random projection
- product quantization (PQ): split vectors into sub-vectors and quantize each sub-vector separately
- locality-sensitive hashing (LSH): hash vectors into buckets so that similar vectors collide
- Similarity measures: cosine similarity, Euclidean distance, dot product
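The three similarity measures in NumPy (for L2-normalized vectors, cosine similarity and dot product give the same ranking):

```python
import numpy as np

a = np.array([0.2, 0.1, 0.9])
b = np.array([0.3, 0.0, 0.8])

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
print(f"dot={dot:.3f} cosine={cosine:.3f} euclidean={euclidean:.3f}")
```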
Retrieval:
Pre-retrieval
- Query rewriting
- Query expansion
- Data indexing optimization (cleaning, metadata, chunking)
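A minimal sketch of one indexing choice, fixed-size chunking with overlap (chunk size and overlap values are arbitrary):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks so ideas aren't cut off at boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("word " * 1000)
print(len(chunks), len(chunks[0]))
```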
Retrieval:
- The Instructor embedding model produces embeddings tuned for the task at hand without fine-tuning the model itself (the task description is given as part of the input).
- Hybrid search: combine vector and keyword-based search (see the fusion sketch after this list)
- Filtered vector search
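A minimal hybrid-search sketch using reciprocal rank fusion to merge a keyword ranking with a vector ranking; RRF is one common fusion method (the book doesn't prescribe it) and k=60 is its usual default:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists, rewarding ids that rank high in any of them."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # from keyword/BM25 search
vector_hits = ["doc1", "doc4", "doc3"]   # from ANN vector search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # doc1 and doc3 rise to the top
```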
Post-retrieval:
- The objective is to avoid too large a context, or a context filled with irrelevant information
- Prompt compression: eliminates useless details
Re-ranking: select the top N results of the vector search using a different model (a cross-encoder [#154]) that outputs matching scores.
Terminology:
- Bi-encoder: used in the vector query step. Texts are encoded separately (e.g. word2vec) and then the similarity between the embeddings is measured.
- Cross-encoder: used in the re-ranking step. It takes a pair of texts at a time and outputs a similarity score. Slower but more accurate. (See the sketch below.)
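A minimal re-ranking sketch with a cross-encoder via sentence-transformers (the ms-marco model name is a common public choice, not necessarily the book's):

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What is retrieval augmented generation?"
candidates = [
    "RAG augments an LLM with documents fetched at inference time.",
    "The weather is nice today.",
]

scores = reranker.predict([(query, doc) for doc in candidates])  # one score per pair
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the relevant document should come first
```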
Streaming
Distributed event streaming platform: Apache Kafka
Or, for simplicity, queues: RabbitMQ
- Streaming engine: Apache Flink
Change data capture (CDC):
- Pull: Last update Timestamp based
- Push: DB triggers
- Log-based: transaction logs
This is the best option. The only con is that the implementation is DB-specific, because the log files don't have a standardized format.
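A minimal sketch of pull-based CDC, polling for rows with a newer updated_at timestamp; the table and column names are made up for illustration:

```python
import sqlite3

def pull_changes(conn: sqlite3.Connection, last_sync: str) -> list[tuple]:
    """Fetch rows changed since the last sync, based on an updated_at timestamp column."""
    cur = conn.execute(
        "SELECT id, body, updated_at FROM articles "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, body TEXT, updated_at TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'RAG post', '2024-01-02T10:00:00')")

changes = pull_changes(conn, last_sync="2024-01-01T00:00:00")
# each batch of changes would then be published to a queue/stream (e.g. RabbitMQ, Kafka)
print(changes)
```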
2. SFT - Supervised Fine-tuning
Create instruction dataset
- A collection of instruction-output pairs, or system-instruction-output triplets (the system prompt is effectively part of the instruction anyway).
In the project followed in the book, we only had blog posts/articles. These are the outputs (answers). The instruction is then something like "Write an article about RAG". To create the instruction dataset, we can use another LLM and ask it to generate, say, 5 instructions corresponding to a given chunk of an article. Alternatively, we can ask it to generate 5 instruction-output pairs directly.
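A minimal sketch of that generation step; the OpenAI client and model name are just one possible backend, and the prompt wording is illustrative:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_pairs(chunk: str, n: int = 5) -> str:
    prompt = (
        f"Based on the following article excerpt, write {n} instruction-output pairs "
        "that could have produced it. Return a JSON list of objects with "
        '"instruction" and "output" keys.\n\n' + chunk
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_pairs("Retrieval Augmented Generation (RAG) combines a retriever with a generator..."))
```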
- Deduplicate data
- Exact
- Fuzzy deduplication (e.g. MinHash; see the sketch at the end of this section) [#213]
- Semantic similarity
- Data decontamination
- Data generation, augmentation
Make sure the training dataset doesn't contain test dataset samples.
For larger models the number of samples can be low (e.g. 70B -> around 1000 samples)
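A minimal fuzzy-deduplication sketch with MinHash + LSH from the datasketch library (the threshold and num_perm values are typical choices, not the book's):

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "rag combines retrieval with text generation",
    "b": "rag combines retrieval with text generation techniques",  # near-duplicate of "a"
    "c": "fine-tuning adapts a pretrained model to a task",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
duplicates = []
for doc_id, text in docs.items():
    m = minhash(text)
    near = lsh.query(m)          # ids of already-indexed near-duplicates
    if near:
        duplicates.append((doc_id, near))
    else:
        lsh.insert(doc_id, m)
print(duplicates)                # "b" is likely flagged as a near-duplicate of "a"
```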
3. Inference Optimization
3.1. Model Optimization
We need to reduce VRAM (Video Random Access Memory) usage.
KV Cache:
For a 7B model it's around 2 GB for 2048 tokens.
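A back-of-the-envelope for that figure, assuming a Llama-2-7B-like config (32 layers, 32 KV heads, head dim 128): fp16 gives roughly 1 GiB per 2048-token sequence, fp32 roughly 2 GiB:

```python
layers, kv_heads, head_dim = 32, 32, 128   # Llama-2-7B-like config (assumption)
seq_len, bytes_per_value = 2048, 2         # 2048 tokens, fp16

# K and V are both cached, per layer, per head, per token
kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
print(kv_cache_bytes / 2**30, "GiB")       # ~1 GiB at fp16, ~2 GiB at fp32
```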
- Continuous batching
- Speculative Decoding
- Optimized attention mechanisms
- PagedAttention
- FlashAttention
3.2. Model Parallelism
- Data
- Pipeline
- Tensor
This is efficient in the context of LLMs because we can easily parallelize across attention heads.
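A toy illustration of the idea: split the query projection head-wise across two simulated devices, let each compute its heads independently, and concatenate (pure NumPy, devices stood in for by array shards):

```python
import numpy as np

d_model, n_heads = 64, 8                   # head_dim = d_model // n_heads = 8
x = np.random.rand(10, d_model)            # 10 tokens
W_q = np.random.rand(d_model, d_model)     # full query projection

# "device 0" holds the columns for the first 4 heads, "device 1" the last 4
shards = np.split(W_q, 2, axis=1)          # each shard: (d_model, d_model // 2)

# each device projects its own heads; no communication until the concat
partial = [x @ shard for shard in shards]
q = np.concatenate(partial, axis=1)

assert np.allclose(q, x @ W_q)             # identical to the unsharded projection
```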