2026-04-23

LLM Post-training

LLM post-training takes a base model and improves it for conversation, reasoning, and domain tasks using imitation/supervised learning and reinforcement learning.

1. Mid-Training

To improve a model's capability within a domain, we can train it on a domain-specific dataset. But if we train exclusively on domain-specific data, the general reasoning capabilities start to degrade. So a mixture is used: domain-specific data (15-30%, or even 50%) with the rest being general data.
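As a toy illustration, the mixture can be built by a sampler that draws a fixed fraction of each batch from the domain corpus. The function and the dataset names here are made up for the sketch:

```python
import random

def mix_datasets(domain_data, general_data, domain_fraction=0.3, n_samples=10, seed=0):
    """Sample a training mixture: a fixed fraction from the domain corpus,
    the rest from the general corpus (sampling with replacement, for brevity)."""
    rng = random.Random(seed)
    n_domain = round(n_samples * domain_fraction)
    mixture = (rng.choices(domain_data, k=n_domain)
               + rng.choices(general_data, k=n_samples - n_domain))
    rng.shuffle(mixture)
    return mixture

domain = [f"medical_doc_{i}" for i in range(100)]
general = [f"web_doc_{i}" for i in range(100)]
batch = mix_datasets(domain, general, domain_fraction=0.3, n_samples=10)
```

Real pipelines mix at the token level and tune the fraction empirically; the idea is the same.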

The initial layers of an LLM are largely responsible for general knowledge and reasoning. So sometimes, for mid-training, only the later layers are updated.
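A minimal sketch of selecting only the later layers for updates, assuming a parameter naming scheme like `layers.<idx>.*` with an `lm_head` output head (these names are illustrative, not any specific framework's API):

```python
def trainable_param_names(param_names, n_layers, n_trainable_last):
    """Select parameters to update during mid-training: only the last
    `n_trainable_last` transformer blocks plus the output head are tuned;
    earlier layers (general knowledge) stay frozen."""
    cutoff = n_layers - n_trainable_last
    trainable = set()
    for name in param_names:
        if name.startswith("layers."):
            if int(name.split(".")[1]) >= cutoff:
                trainable.add(name)
        elif name == "lm_head.weight":  # tuning the head too is a choice, not a rule
            trainable.add(name)
    return trainable
```

In a real framework this would translate to setting `requires_grad` (or the equivalent) per parameter.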

Mid-training (also called continued pre-training) is technically also supervised fine-tuning, but the term SFT usually refers to the later stage that targets conversation and formatting, while mid-training happens before SFT and targets domain knowledge.

2. Supervised Fine Tuning

The model trained so far has only raw text-completion ability. To make it fit for conversation, instruction following, and other downstream tasks, supervised fine-tuning is done on a curated instruction-response dataset.

A consistent format should be used during SFT, and the same format should be used during inference. Initially plain-text markup was used, but now special tokens mark the start and end of roles like system, user, and assistant.
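A minimal sketch of such a format, using ChatML-style markers (the exact special tokens differ across model families; `<|im_start|>`/`<|im_end|>` here are just one common convention):

```python
def render_chat(messages, start="<|im_start|>", end="<|im_end|>"):
    """Render a conversation with role markers, ending with a generation
    prompt so the model completes as the assistant."""
    parts = [f"{start}{m['role']}\n{m['content']}{end}" for m in messages]
    parts.append(f"{start}assistant\n")  # model completes from here
    return "\n".join(parts)

chat = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is mid-training?"},
])
```

The same renderer must be applied at inference time, or the model sees inputs unlike anything it was fine-tuned on.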

During SFT, to prevent memorization and catastrophic forgetting, the learning rate is kept 10 to 100 times lower than in pre-training, and training runs for only 1 to 3 epochs.
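These choices can be summarized as a small config sketch (the base learning rate is an illustrative value, not from any specific run):

```python
pretrain_peak_lr = 3e-4  # illustrative pre-training peak learning rate

sft_config = {
    "learning_rate": pretrain_peak_lr / 100,  # 10-100x lower than pre-training
    "num_epochs": 2,                          # 1-3 epochs is typical
}
```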

2.1. Synthetic Data Generation

For SFT, collecting a high-quality instruction-response dataset can be a bottleneck. So a bigger "teacher" model can be used to generate a synthetic dataset of task-specific instruction-response pairs from a few human-written "seed" tasks. This involves:

  1. Task Generation: Sample tasks from the pool and generate new, diverse instructions/tasks
  2. Input/Output Generation: Take a task and generate its output
  3. Filtering: Remove similar or low-quality pairs. Alternatively, instead of discarding similar pairs and wasting generations, loosening the similarity threshold and focusing on local diversity within a training batch gives better, more cost-effective results.
  4. Pool update: Add the synthetic tasks back to the pool for iterative improvement
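The loop above can be sketched roughly as follows. `teacher_generate` is a placeholder for a call to the larger teacher model, and the word-overlap filter is a crude stand-in for the ROUGE or embedding similarity used in practice:

```python
import random

def teacher_generate(prompt):
    """Placeholder for a call to a larger 'teacher' model (hypothetical)."""
    return f"response to: {prompt}"

def too_similar(task, pool, threshold=0.7):
    """Crude near-duplicate check via word-level Jaccard overlap."""
    words = set(task.split())
    for other in pool:
        other_words = set(other.split())
        if words and len(words & other_words) / len(words | other_words) > threshold:
            return True
    return False

def self_instruct_round(pool, n_new=4, seed=0):
    rng = random.Random(seed)
    new_pairs = []
    for _ in range(n_new):
        seeds = rng.sample(pool, k=min(3, len(pool)))          # sample from the pool
        task = teacher_generate("a new task similar to: " + "; ".join(seeds))
        if too_similar(task, pool):                            # filter near-duplicates
            continue
        output = teacher_generate(task)                        # generate the output
        new_pairs.append((task, output))
        pool.append(task)                                      # pool update
    return new_pairs
```

Iterating this round grows the pool, so later rounds draw on a more diverse set of tasks.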

3. Reinforcement Learning from Human Feedback

3.1. SFT vs RLHF

RLHF is done to align model output with human preferences. It is an important phase, distinct from SFT, because:

  1. SFT is a form of imitation learning, and the model can't do better than the examples that humans provide.
  2. SFT also promotes copying the surface structure of responses instead of learning the rules behind them, because the SFT loss is based on maximum likelihood estimation.
  3. It is difficult to specify a high-quality answer for nuanced objectives like helpfulness and harmlessness. It is easier to judge whether one answer is better than another.

3.2. PPO

RLHF addresses these issues by optimizing directly for human preference. There are three stages:

  1. SFT: To establish a baseline
  2. Reward Model training: Based on human rankings of multiple completions for the same prompt, a reward model is trained to predict the human choice.

    Based on the ranking data, a reward model \(r_{\phi}\) can be trained to assign a reward value to each answer by optimizing the following objective (the Bradley-Terry preference model):

    \begin{align*} \mathcal L(\phi) = - \mathbb E_{(x, y_a, y_b) \sim D} \left[ \log P(y_{a} \succ y_{b}| x) \right] \\ \textrm{where, } P(y_{a} \succ y_{b} | x) = \sigma(r_{\phi}(x, y_{a}) - r_{\phi}(x, y_{b})) \ . \end{align*}
  3. Policy Optimization: The SFT model is then fine-tuned with PPO, using rewards from the reward model.
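Per preference pair, the reward-model objective above reduces to a negative log-sigmoid of the reward margin. A minimal sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for one preference pair:
    -log sigma(r_phi(x, y_a) - r_phi(x, y_b))."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

When the two rewards are equal the loss is log 2; it shrinks as the chosen answer's reward pulls ahead of the rejected one.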

Using PPO requires multiple models: a) the policy model being trained, b) a reference model to compute the KL penalty, c) the reward model, and d) a critic model. This is computationally expensive.

3.3. Direct Preference Optimization (DPO)

Instead of training a reward model, we can do direct policy optimization. The KL-constrained RLHF objective can be mathematically manipulated into an alternative but equivalent loss function that uses just the reference model and the policy model:

\begin{align*} \mathcal L_{DPO}(\theta) = - \mathbb E_{(x, y_a, y_b) \sim D} \left[ \log \sigma \left( \beta \log {\frac{\pi_{\theta}(y_{a}|x)} {\pi_{ref}(y_{a}|x)}} - \beta \log {\frac{\pi_{\theta}(y_{b}|x)} {\pi_{ref}(y_{b}|x)}} \right) \right]\ . \end{align*}

But the lack of a reward model means this is "offline" in nature: it can only learn from a pre-collected dataset of preferences, i.e. it cannot explore new responses that might be better than those in the training set.

Note that DPO learns the preferences, but it is not RLHF because the DPO update is not reinforcement learning.
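The per-pair DPO loss follows directly from the formula above, given log-probabilities of each response (summed over its tokens) under the policy and reference models:

```python
import math

def dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """DPO loss for one preference pair (y_a preferred over y_b).
    Inputs are response log-probabilities summed over tokens, under the
    current policy (logp_*) and the frozen reference model (ref_logp_*)."""
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When both log-ratios are equal the loss is log 2; raising the preferred response's probability relative to the reference (or lowering the rejected one's) drives the loss down.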

3.4. Group Relative Policy Optimization (GRPO)

GRPO removes the memory bottleneck of PPO by getting rid of the critic model. For policy optimization, the critic provides the advantage value: the amount by which a candidate response is better than the average response. In GRPO, it is computed not by a critic model but from the average reward of the group of responses generated for the same prompt. Specifically, the advantage is given by normalizing with respect to the mean and standard deviation of the group's rewards:

\begin{align*} A_{i} = \frac {r_{i} - \mu_{r}} {\sigma_{r}} \ . \end{align*}

In short, GRPO generates a set of responses for the same prompt, scores them with a reward model, and tunes the model to favor responses that score above the group average.
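The group-relative advantage is cheap to compute. A minimal sketch:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each response's reward by the
    mean and standard deviation of its group (responses to the same prompt)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

Responses scoring above the group mean get positive advantage (pushed up by the policy update), those below get negative advantage.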

4. Reinforcement Learning with Verifiable Rewards (RLVR)

In math and code, "preference" is a poor proxy for correctness; instead, the model should be rewarded for correctness directly. There are two main approaches to RLVR:

  1. Outcome Reward Model: Look only at the final outcome and reward it, even if the intermediate process is incorrect
  2. Process Reward Model: Look at the process and reward each correct step. This evaluates each token or line, providing a dense reward that is much better for credit assignment.

To use RLVR, we first sample some candidate solutions for a task; the verifier then assigns a reward to each solution, and the model is updated to maximize the total reward.
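A toy outcome-style verifier for this loop might look as follows (real verifiers execute generated code against unit tests, or check math answers symbolically rather than by string match):

```python
def outcome_reward(candidate, reference):
    """Toy outcome verifier: 1.0 if the final answer matches the reference
    exactly after trimming whitespace, else 0.0."""
    return 1.0 if candidate.strip() == reference.strip() else 0.0

# Score a batch of sampled candidate answers against the known answer "42".
rewards = [outcome_reward(c, "42") for c in ["42", "41", " 42"]]
```

The policy update (e.g. GRPO) then reinforces whichever samples earned reward.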

5. Case study - DeepSeek-R1

To see all of the above in action, let's see what DeepSeek-R1 did. DeepSeek-R1's training consisted of four stages [Source: pdf @ github.com/deepseek-ai]:

  1. Cold-Start SFT: The model was trained on a few thousand high-quality, long-form chain-of-thought examples to establish basic readability and formatting.
  2. Reasoning-Oriented RL: GRPO was used to improve logical reasoning.
  3. Rejection Sampling SFT: The reasoning model was used to generate 800,000 synthetic samples. The samples were filtered for correctness and mixed with general conversational data for another SFT phase.
  4. Diverse RL: The final RL phase focused on human preference as well as reasoning.
