2026-01-27

LLM Engineer Handbook

Things that come with scale [#39]:

FTI pipeline (Feature, Training, Inference)

And there's a fourth pipeline: the data pipeline. See diagram at #47

1. RAG

Retrieval-Augmented Generation

To create an embedding we can use various models:

  1. Word2Vec and GloVe were early methods; still in use.
  2. Encoder-only transformers (e.g. BERT, RoBERTa)
  3. CLIP (for image and text)

Or find something from the MTEB (Massive Text Embedding Benchmark) leaderboard
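
A minimal embedding sketch (mine, not from the book), assuming sentence-transformers and the MTEB-listed all-MiniLM-L6-v2 checkpoint:

  from sentence_transformers import SentenceTransformer
  import numpy as np

  # Assumed model; any encoder from the MTEB leaderboard works the same way.
  model = SentenceTransformer("all-MiniLM-L6-v2")

  texts = ["What is RAG?", "Retrieval-Augmented Generation, explained"]
  embeddings = model.encode(texts)  # shape (2, 384) for this model

  # Cosine similarity between the two sentences.
  a, b = embeddings
  print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))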

Vector DBs

E.g. Qdrant

  • Use approximate nearest neighbour (ANN) algorithms
  • Indexing techniques:
    1. Hierarchical navigable small world (HNSW): builds a multi-layer proximity graph over the vectors; search enters at a sparse top layer and greedily descends through denser layers toward the query
    2. Random projection: reduce dimensionality by projecting vectors onto a set of random hyperplanes
    3. Product quantization (PQ): split each vector into sub-vectors and quantize each against a small codebook
    4. Locality-sensitive hashing (LSH): hash vectors into buckets so that similar vectors land in the same bucket
  • Similarity measures: cosine similarity, Euclidean distance, dot product
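
A minimal ANN sketch with Qdrant's Python client (qdrant-client), using an in-memory instance; the collection name, vectors, and payloads are made up for illustration:

  from qdrant_client import QdrantClient
  from qdrant_client.models import Distance, VectorParams, PointStruct

  client = QdrantClient(":memory:")  # a real deployment would pass a URL

  client.create_collection(
      collection_name="articles",  # hypothetical name
      vectors_config=VectorParams(size=4, distance=Distance.COSINE),
  )

  client.upsert(
      collection_name="articles",
      points=[
          PointStruct(id=1, vector=[0.9, 0.1, 0.1, 0.1], payload={"topic": "rag"}),
          PointStruct(id=2, vector=[0.1, 0.9, 0.1, 0.1], payload={"topic": "sft"}),
      ],
  )

  # ANN search (Qdrant uses an HNSW index by default).
  hits = client.search(
      collection_name="articles",
      query_vector=[0.8, 0.2, 0.1, 0.1],
      limit=1,
  )
  print(hits[0].payload)  # {'topic': 'rag'}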

Retrieval pipeline:

Pre-retrieval

  • Query rewriting
  • Query expansion (sketch below)
  • Data indexing optimization (cleaning, metadata, chunking)
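
A minimal query-expansion sketch (my own, not the book's), assuming the openai package; the gpt-4o-mini model and the prompt wording are assumptions:

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set

  def expand_query(query: str, n: int = 3) -> list[str]:
      # Ask the model for paraphrases; each variant is searched separately
      # and the result sets are merged downstream.
      response = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{
              "role": "user",
              "content": f"Rewrite this search query {n} ways, one per line: {query}",
          }],
      )
      return response.choices[0].message.content.splitlines()

  print(expand_query("vector db indexing"))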

Retrieval:

  • The Instructor embedding model produces embeddings tuned to the task at hand, via a task instruction given alongside the input, without fine-tuning the model.
  • Hybrid search: combine vector search with keyword-based search
  • Filtered vector search (sketch below)
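
Continuing the Qdrant sketch above (reuses its client and "articles" collection); the payload field and value are made up:

  from qdrant_client.models import Filter, FieldCondition, MatchValue

  hits = client.search(
      collection_name="articles",
      query_vector=[0.8, 0.2, 0.1, 0.1],
      query_filter=Filter(
          must=[FieldCondition(key="topic", match=MatchValue(value="sft"))]
      ),
      limit=3,
  )
  print([h.payload for h in hits])  # only points whose payload matches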

Post-retrieval:

  • The objective is to avoid an overly large context, or a context filled with irrelevant information
  • Prompt compression: eliminates useless details
  • Re-ranking: select the top N results of the vector search using a different model (a cross-encoder [#154]) that assigns matching scores.

    Terminology:

    • Bi-encoder: the vector query step uses a bi-encoder, i.e. the texts are encoded separately (e.g. word2vec) and then similarity between the resulting vectors is measured.
    • Cross-encoder: the re-ranking step uses a cross-encoder. It takes a pair of texts at a time and outputs a similarity score. Slower but more accurate (sketch below).
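
A minimal re-ranking sketch with sentence-transformers' CrossEncoder; the ms-marco-MiniLM-L-6-v2 checkpoint and the example texts are my choices:

  from sentence_transformers import CrossEncoder

  reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

  query = "how does HNSW indexing work?"
  candidates = [  # e.g. top-k results from the bi-encoder stage
      "HNSW builds a multi-layer proximity graph for ANN search.",
      "RabbitMQ is a message queue.",
  ]

  # One (query, document) pair per forward pass: slower but more accurate.
  scores = reranker.predict([(query, doc) for doc in candidates])
  reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
  print(reranked[0])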

Streaming

  1. Distributed event streaming platform: Apache Kafka

    Or, for simplicity, a queue: RabbitMQ

  2. Streaming engine: Apache Flink

Change data capture:

  1. Pull: poll a last-updated timestamp column (sketch below)
  2. Push: DB triggers
  3. Logs: transaction logs

    This is the best option. The only con is that the implementation is DB-specific, because transaction log formats are not standardized.
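
A minimal sketch of the timestamp-pull approach, using the standard-library sqlite3 for illustration; the table and column names are hypothetical:

  import sqlite3

  # Hypothetical source table: documents(id, body, updated_at).
  conn = sqlite3.connect("app.db")

  def pull_changes(last_seen: str) -> list:
      # Poll rows modified since the last sync; the caller stores the
      # max(updated_at) it processed as the next watermark.
      return conn.execute(
          "SELECT id, body, updated_at FROM documents WHERE updated_at > ?",
          (last_seen,),
      ).fetchall()

  changes = pull_changes("2026-01-01T00:00:00")
  # Compared with log-based CDC: deletes are invisible here, and a row
  # updated twice between polls is only seen once.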

2. SFT - Supervised Fine-tuning

  • Create instruction dataset

    • A collection of instruction-output pairs, or system-instruction-output triplets. The system prompt is effectively part of the instruction anyway.

    In the project followed in the book, we had only blog posts/articles. These are the outputs (answers). The instruction is then something like "Write an article about RAG". To create the instruction dataset, we can use another LLM and ask it to generate, say, 5 instructions corresponding to a given chunk of an article. Additionally, we can ask it to generate 5 instruction-output pairs directly (sketch after this list).

  • Deduplicate data
    • Exact
    • Fuzzy deduplication (e.g. MinHash) [#213] (sketch at the end of this section)
    • Semantic similarity
  • Data decontamination
  • Data generation, augmentation
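
A minimal sketch of the instruction-generation step described above, assuming the openai package; the model name and prompt wording are mine, not the book's:

  from openai import OpenAI

  client = OpenAI()

  def generate_pairs(chunk: str, n: int = 5) -> str:
      # Ask a teacher LLM to invent instructions whose answer is the chunk,
      # or full instruction-output pairs grounded in it.
      prompt = (
          f"Generate {n} instruction-output pairs for fine-tuning, "
          f"based on this article excerpt:\n\n{chunk}"
      )
      response = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "user", "content": prompt}],
      )
      return response.choices[0].message.content

  print(generate_pairs("RAG combines retrieval with generation..."))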

Make sure the training dataset doesn't contain test dataset samples.

For larger models the number of samples can be low (e.g. 70B -> around 1000 samples)
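
A minimal fuzzy-deduplication sketch with MinHash, assuming the datasketch package and a Jaccard threshold of 0.7 (both my choices, not from the book):

  from datasketch import MinHash, MinHashLSH

  def signature(text: str) -> MinHash:
      m = MinHash(num_perm=128)
      for token in text.lower().split():
          m.update(token.encode("utf8"))
      return m

  docs = {
      "a": "retrieval augmented generation explained in depth",
      "b": "the retrieval augmented generation explained in depth",  # near-dup of "a"
      "c": "how to configure apache kafka",
  }

  # LSH index: estimated Jaccard similarity above the threshold is
  # (probabilistically) treated as a duplicate.
  lsh = MinHashLSH(threshold=0.7, num_perm=128)
  for key, text in docs.items():
      sig = signature(text)
      if lsh.query(sig):  # a near-duplicate is already indexed
          print(f"dropping {key}")
      else:
          lsh.insert(key, sig)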

3. Inference Optimization

3.1. Model Optimization

We need to reduce VRAM (Video Random Access Memory) usage.

  • KV Cache:

    For a 7B model it's around 2 GB for 2048 tokens (worked arithmetic after this list).

  • Continuous batching
  • Speculative Decoding
  • Optimized attention mechanisms
    • PagedAttention
    • FlashAttention
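
Worked arithmetic behind the KV-cache bullet above, assuming Llama-2-7B shapes (32 layers, 32 heads, head dim 128); the formula is standard, the exact numbers are my derivation, not the book's:

  # KV cache = 2 (K and V) * layers * heads * head_dim * seq_len * batch * bytes
  layers, heads, head_dim = 32, 32, 128  # Llama-2-7B shapes (assumed)
  seq_len, batch = 2048, 1
  bytes_fp16 = 2

  size = 2 * layers * heads * head_dim * seq_len * batch * bytes_fp16
  print(size / 2**30)  # 1.0 GiB at fp16; fp32 doubles it to ~2 GiB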

3.2. Model Parallelism

  • Data: replicate the model, split batches across devices
  • Pipeline: split consecutive layers across devices
  • Tensor: split individual weight matrices/operations across devices

    This is efficient in the context of LLMs because attention heads are computed independently, so the computation parallelizes naturally across heads (sketch below).
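
A toy numpy sketch (mine) of why tensor parallelism maps well onto attention: each head reads only its own slice of Q/K/V, so heads can be computed on different devices and concatenated; shapes and values are illustrative:

  import numpy as np

  seq, heads, head_dim = 4, 2, 8
  q = np.random.randn(heads, seq, head_dim)
  k = np.random.randn(heads, seq, head_dim)
  v = np.random.randn(heads, seq, head_dim)

  def softmax(x):
      e = np.exp(x - x.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def head_attention(qh, kh, vh):
      # Scaled dot-product attention for one head; no cross-head data
      # dependency, so each head could run on a separate GPU.
      scores = qh @ kh.T / np.sqrt(head_dim)
      return softmax(scores) @ vh

  # "Device 0" computes head 0, "device 1" computes head 1, then concat.
  out = np.concatenate(
      [head_attention(q[h], k[h], v[h]) for h in range(heads)], axis=-1
  )
  print(out.shape)  # (4, 16) == (seq, heads * head_dim)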

3.3. Model Quantization

