chapters/01-introduction.md: 18 additions & 0 deletions
@@ -21,6 +21,9 @@ This book details key decisions and basic implementation examples for each step
RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured.
Early breakthrough experiments with RLHF were applied to deep reinforcement learning [@christiano2017deep], summarization [@stiennon2020learning], following instructions [@ouyang2022training], parsing web information for question answering [@nakano2021webgpt], and "alignment" [@bai2022training].
A summary of the early RLHF recipes is shown below in @fig:rlhf-basic.
{#fig:rlhf-basic}
In modern language model training, RLHF is one component of post-training.
Post-training is a more complete set of techniques and best-practices to make language models more useful for downstream tasks [@lambert2024t].
@@ -42,10 +45,25 @@ The core role of this book, beyond teaching the techniques for doing RLHF, is to
In recent years, language models have shifted from academic experiments studied through the lens of benchmarks to general purpose technology.
RLHF is at the core of this transition.
To start, let us consider how a language model at the end of autoregressive pretraining answers a query.
A strong model, Llama 3.1 405B, will continue from "The president of the united states in 2006 was":
> George w bush, the governor of Florida in 2006 was Jeb bush, and John McCain was an arizona senator in 2006 - who later lost to obama.
> September 1 – U.S. President Bush signs an executive order to provide more options for faith-based organizations when delivering social services that are paid for with federal funds.
> January 1 – The current deadline set by United States Congress in the Unlawful Internet Gambling Enforcement Act, aka UIGEA.
> search: Amendments to the 1961 International Convention for the Protection of New Varieties of Plants require plant breeders' rights include farmer's privilege.
> 2009 was a common year starting on Thursday of the Gregorian calendar, the 2009th year of the Common Era (ce) and Anno Domini (ad) designations, the 9th year of the 3rd millennium, the 9th year of the 21st century, and the 10th and last year of the 2000s decade.
What is clear is that this model is completing the sentence and adding other common internet metadata.
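For readers who want to reproduce this behavior, a minimal sketch with Hugging Face `transformers` is shown below. It uses GPT-2 as a small stand-in, since Llama 3.1 405B requires far more hardware; the exact continuation will differ, but the qualitative behavior of aimless completion is the same.

```python
# Sample a raw continuation from a base (pretrained-only) language model.
# GPT-2 is used here as a small stand-in for a large base model like Llama 3.1 405B.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The president of the united states in 2006 was"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # sample rather than greedy decode, as in the example above
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```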
The most compelling view of how RLHF works is to think of how *style* applies to interactions you have with language models.
The style, or format, of information presented is crucial to how it is learned.
This has always been the case for examples such as coursework, but is normally applied in the background and not considered directly.
Continuing the above example, a model trained with RLHF (and other post-training tools) would answer the question very differently. Asking Tülu 3 405B the same question, "The president of the united states in 2006 was," it says concisely:
> George W. Bush was the president of the United States in 2006. He served two terms in office, from January 20, 2001, to January 20, 2009.
Modern research has established RLHF as a general method to integrate subtle stylistic and related behavioral features into the models.
Compared to other techniques for post-training, such as instruction finetuning, RLHF generalizes far better across domains [@kirk2023understanding][@chu2025sft] -- helping create effective general purpose models.
chapters/03-setup.md: 12 additions & 1 deletion
@@ -88,4 +88,15 @@ where $\tau \sim \pi_\theta$ denotes trajectories sampled by following policy $\
## RLHF Only Definitions
- **Reference Model ($\pi_\text{ref}$)**: This is a saved set of parameters used in RLHF whose outputs are used to regularize the optimization.
## Extended Glossary
- **Synthetic Data**: This is any training data for an AI model that is the output of another AI system. This could be anything from text generated from an open-ended prompt to a model re-writing existing content.
- **Distillation**: Distillation is a general set of practices in training AI models where a model is trained on the outputs of a stronger model. This is a type of synthetic data known to produce strong, smaller models. Most models make the rules around distillation clear through either the license, for open-weight models, or the terms of service, for models accessible only via API. The term distillation is now overloaded with a more specific technical definition from the ML literature, discussed next.
- **(Teacher-student) Knowledge Distillation**: Knowledge distillation from a specific teacher to a student model is a specific type of the distillation above and where the term originated. It is a deep learning method where the neural network loss is modified to learn from the log-probabilities of the teacher model over multiple potential tokens/logits, instead of learning directly from a chosen output [@hinton2015distilling]. Examples of modern series of models trained with knowledge distillation are Gemma 2 [@team2024gemma] and Gemma 3. For a language modeling setup, the next-token loss function can be modified as follows [@agarwal2024policy], where the student model $P_\theta$ learns from the teacher distribution $P_\phi$:
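One common way to write this token-level objective is a cross-entropy between the teacher and student next-token distributions; this is a sketch of the standard form, and the exact variant in [@agarwal2024policy] (e.g. reverse or generalized KL divergences) may differ:

$$
\mathcal{L}_{\text{KD}}(\theta) = - \sum_{t=1}^{T} \sum_{v \in \mathcal{V}} P_\phi(v \mid y_{<t}, x) \log P_\theta(v \mid y_{<t}, x)
$$

Here $\mathcal{V}$ is the vocabulary, $x$ is the prompt, and $y_{<t}$ are the preceding tokens of the target sequence.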
- **In-context Learning (ICL)**: In-context here refers to any information within the context window of the language model. Usually, this is information added to the prompt. The simplest form of in-context learning is adding examples of a similar form before the prompt. Advanced versions can learn which information to include for a specific use-case.
- **Chain of Thought (CoT)**: Chain of thought is a specific behavior of language models where they are steered to break a problem down in a step-by-step form. The original version of this was elicited through the prompt "Let's think step by step" [@wei2022chain].
chapters/04-optimization.md: 8 additions & 1 deletion
@@ -55,14 +55,21 @@ Modern RLHF-trained models always utilize instruction finetuning followed by a m
## RLHF Recipe Example
The canonical RLHF recipe circa the release of ChatGPT followed a standard three-step post-training process with RLHF as the centerpiece [@lambert2022illustrating][@ouyang2022training][@bai2022training].
The three steps taken on top of a "base" language model (the next-token prediction model trained on large-scale web text), summarized below in @fig:rlhf-basic-repeat, were:
1. **Instruction tuning on ~10K examples**: This teaches the model to follow the question-answer format and imparts some basic skills from primarily human-written data.
2. **Training a reward model on ~100K pairwise prompts**: This model is trained from the instruction-tuned checkpoint and captures the diverse values one wishes to model in their final training. The reward model is the optimization target for RLHF.
3. **Training the instruction-tuned model with RLHF on another ~100K prompts**: The model is optimized against the reward model on a set of prompts that it generates completions for before receiving ratings.
Once RLHF was done, the model was ready to be deployed to users. This recipe is the foundation of modern RLHF, but recipes have evolved substantially to include more stages and more data.
{#fig:rlhf-basic-repeat}
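As a rough illustration of step 2, the reward model is typically trained with a pairwise (Bradley-Terry style) loss on chosen and rejected completions. A minimal PyTorch sketch follows; the variable names are illustrative and not taken from any specific codebase.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the scalar reward of the chosen
    # completion above the reward of the rejected completion for each pair.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with a batch of 4 preference pairs (stand-in scores from a reward model head).
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.4, 0.5, -0.1, 1.0])
print(pairwise_rm_loss(chosen, rejected))
```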
Modern versions of post-training involve many, many more model versions.
An example is shown below in @fig:rlhf-complex, where the model undergoes numerous training iterations before convergence.
{#fig:rlhf-complex}
## Finetuning and Regularization
RLHF is implemented from a strong base model, which induces a need to keep the optimization from straying too far from the initial policy.
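A common way to implement this control, sketched here in a standard form (the exact penalty, coefficient, and estimator vary across implementations), is to subtract a KL divergence penalty to the reference model from the reward during optimization:

$$
r_{\text{total}} = r(x, y) - \lambda \, \mathcal{D}_{\text{KL}}\big(\pi_\theta(y \mid x) \,\|\, \pi_\text{ref}(y \mid x)\big)
$$

Here $r(x, y)$ is the reward model score and $\lambda$ is a weighting coefficient on the penalty.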
chapters/07-reward-models.md: 2 additions & 0 deletions
@@ -206,6 +206,8 @@ Given the efficacy of LLM-as-a-judge for evaluation, spawning many other evaluat
An entire field of study has emerged around how to use so-called "Generative Reward Models" [@mahan2024generative]
[@zhang2024generative][@ankner2024critique] (including models trained *specifically* to be effective judges [@kim2023prometheus]), but on RM evaluations they tend to be behind existing reward models, showing that reward modeling is an important technique for current RLHF.
A common trick to improve the robustness of LLM-as-a-judge workflows is to use a sampling temperature of 0 to reduce variance of ratings.
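As a concrete sketch, this amounts to pinning `temperature=0` in the judge call. The judge model name and rubric below are placeholders, not recommendations.

```python
# Hypothetical LLM-as-a-judge call; the judge model and rubric are placeholders.
from openai import OpenAI

client = OpenAI()

prompt = "What is the capital of France?"
completion = "The capital of France is Paris."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder judge model
    messages=[
        {
            "role": "system",
            "content": "Rate the response to the prompt from 1 to 10 for helpfulness. Reply with only the number.",
        },
        {"role": "user", "content": f"Prompt: {prompt}\n\nResponse: {completion}"},
    ],
    temperature=0,  # greedy decoding to reduce the variance of ratings
)
score = response.choices[0].message.content.strip()
print(score)
```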
## Further Reading
The academic literature for reward modeling established itself in 2024.
Intuitively, it could seem that averaging over the sequence is best, as we are trying to reward the model for *outcomes* and the specific tokens are not as important.
In practice, the best choice is likely the one suited to the individual online learning setup.
Often in RLHF, the method with the best numerical stability and/or the least variance in the loss is preferred.
Putting it together, using the first loss accumulation, the pseudocode can be written as below.
```python
# B: Batch Size, L: Sequence Length, G: Number of Generations
```
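To make the distinction between aggregation choices concrete, here is a minimal sketch of the two options discussed above, assuming PyTorch tensors and a completion mask; the variable names are illustrative, not taken from the book's codebase.

```python
import torch

# Illustrative shapes: B batches, G generations per prompt, L tokens per sequence.
B, G, L = 2, 4, 16
per_token_loss = torch.randn(B * G, L)   # stand-in for the per-token policy-gradient loss
completion_mask = torch.ones(B * G, L)   # 1 for completion tokens, 0 for prompt/padding

# Option 1: average over every completion token in the batch (token-level weighting).
loss_token_avg = (per_token_loss * completion_mask).sum() / completion_mask.sum()

# Option 2: average within each sequence first, then across sequences (sequence-level weighting).
per_sequence = (per_token_loss * completion_mask).sum(dim=-1) / completion_mask.sum(dim=-1)
loss_sequence_avg = per_sequence.mean()
```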
chapters/14-reasoning.md: 35 additions & 1 deletion
@@ -6,4 +6,38 @@ next-chapter: "Synthetic Data & Distillation"
next-url: "16-synthetic.html"
---
# [Incomplete] Reasoning Training & Models
At the 2016 edition of the Neural Information Processing Systems (NeurIPS) conference, Yann LeCun first introduced his now-famous cake metaphor for where learning happens in modern machine learning systems:
> If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL).
This analogy is now largely complete with modern language models. Self-supervised learning on vast swaths of internet data makes up the majority of the cake (especially when viewed in terms of compute spent in FLOPs), supervised finetuning (SFT) on instructions at the beginning of post-training tunes the model to a narrower distribution, and finally “pure” reinforcement learning (RL) is the cherry on top.
We learn just “a few bits” of information with RL, from only a handful of training samples.
This little bit of RL takes many forms. It can be ... TODO
Despite many, many takes that “RL doesn’t work yet” or “RL scaling isn’t ready yet” (and implicit versions of this saying to focus on “RL that Matters”), Yann’s view seems to have been right.
OpenAI’s new Reinforcement Finetuning (RFT) API (just a research program for now), announced on day 2 of the 12 days of OpenAI, is the bridge that brings RL to the masses. This is a very surprising development even for those most faithful to RL. With RFT, one can likely finetune any of OpenAI’s models; while they highlighted o1-mini, it is of obvious value to both standard autoregressive models and reasoning-heavy models. To use RFT, you need three things: 1) training data for your application, 2) validation data for your application to test overfitting, and 3) a grading definition via OpenAI’s “grader” configuration (more on this later).
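To make the idea of a grader concrete, here is a conceptual sketch of what such a component does. This is not OpenAI's actual grader configuration format, just an illustration of mapping a completion and a reference answer to a scalar reward.

```python
# Conceptual sketch of a grader for reinforcement finetuning on verifiable answers.
# Not OpenAI's grader schema; just the underlying idea of scoring a completion
# against a reference answer and returning a scalar reward.
import re

def exact_match_grader(completion: str, reference_answer: str) -> float:
    # Pull the last number out of the completion as the model's final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    predicted = numbers[-1] if numbers else ""
    return 1.0 if predicted == reference_answer.strip() else 0.0

print(exact_match_grader("Adding the items gives a total of 42.", "42"))  # 1.0
print(exact_match_grader("The answer is 7.", "42"))                       # 0.0
```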
Reinforcement Finetuning has been met with excitement and trepidation. The best practices for using existing finetuning APIs, built on instruction tuning infrastructure, are still far from established. The general public of AI builders knows very little about how RL training can change model behavior to improve performance on tasks with minimal overall changes to the model.
In many domains, Reinforcement Finetuning is much more aligned with the goals of developers by being focused on performance rather than behavior. Standard finetuning APIs generally use a parameter-efficient finetuning method such as LoRA with supervised finetuning on instructions. Developers pass in prompts and completions and the model is tuned to match that by updating model parameters to match the completions. OpenAI describes this as increasing the prevalence of “features” in the text of interest.
Reinforcement finetuning is focused on matching answers. Given queries and correct answers, RFT helps the model learn to get the correct answers. While standard instruction tuning is done with 1 or 2 epochs of loss updates over the data, reinforcement finetuning gets its name by doing hundreds or thousands of epochs over the same few data points to give the model time to learn new behaviors. This can be viewed as reinforcing positive behaviors that would work sparingly in the base model version into robust behaviors after RFT.
## The impact of reinforcement finetuning’s existence
Reinforcement finetuning signals many changes to the fate of RL and language models, both at OpenAI and elsewhere:
* **Stability of RL can be solved**: For its entire existence, the limiting factor on RL’s adoption has been stability. This manifests in two ways. First, the learning itself can be fickle and not always work. Second, the training itself is known to be more brittle than standard language model training and more prone to loss spikes, crashes, etc. Releasing an API where any user can train on their own data signals unprecedented stability improvements. This program is still a beta ahead of a full launch, but it signals that OpenAI is confident in it working for the public. For example, last year when I heard about large-scale RL runs at frontier AI laboratories, it would be with stories like “they launch multiple seeds at once and only keep running the ones that didn’t crash.” Now, they can be confident in their RL running and accomplishing the task. The final output model is likely selected automatically by running evaluations on the checkpoint to make sure behavior did not dip and/or by measuring the KL distance from the initial policy. Both of these are signals that researchers rely on heavily in post-training experimentation, so automating decision-making based on them is extremely impactful.
* **Open-source versions already “exist”**: Our recent work at Ai2 on reinforcement learning with verifiable rewards (RLVR) is extremely similar. The major components, i.e. the data format and optimizer type, are identical; we just need increased open-source investment to understand open questions like which model to start from, which types of data to use, etc. Check out the code we’ve been using at Open Instruct.
* **A potential data flywheel for advanced reasoning models**: The best speculation is that OpenAI’s o1 is trained mostly with large-scale RL on data with verifiable outputs, much like this API. If this API works as intended, OpenAI could accumulate an extreme dataset for training future versions of their o1 models. The main limitation of these models is the lack of diversity in available domains, and by experimenting with training on the targeted domains of many users of OpenAI’s models, they can start turning a fruitful flywheel.
* **The scope of RL training for language models continues to grow**: The biggest takeaway from o1 on a fundamental scientific level was that we have even more ways to train language models toward potentially valuable behaviors. The more open doors that are available to researchers and engineers, the more optimism we should have about AI’s general trajectory. The RL finetuning API expands the window of what is possible with RL training.
I recall listening to a talk by a prominent OpenAI researcher a year ago (I couldn’t find it), where they said they were excited about RLHF and related methods just because the loss function is more general than autoregressive prediction. We are now living in this world of RL’s impact growing rapidly, and as many people expected, the human feedback piece is not always necessary.