chapters/01-introduction.md (3 additions & 3 deletions)
@@ -16,7 +16,7 @@ RLHF became most known through the release of ChatGPT and the subsequent rapid d
The basic pipeline for RLHF involves three steps.
First, a language model that can follow user questions must be trained (see Chapter 9).
Second, human preference data must be collected for the training of a reward model of human preferences (see Chapter 7).
-Finally, the language model can be optimized with a RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapter 3 and 11).
+Finally, the language model can be optimized with an RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapter 3 and 11).
This book details key decisions and basic implementation examples for each step in this process.
RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured.
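As a rough illustration of the three-step pipeline this hunk describes, here is a minimal, self-contained Python sketch. Every function and name below is a stub invented for illustration, not code from the book or any library:

```python
# A minimal sketch of the three-step RLHF pipeline: SFT, reward modeling, RL optimization.
# All functions are illustrative stubs, not APIs from the book or any library.

def train_sft(model: dict, instruction_data: list) -> dict:
    # Step 1: instruction/supervised finetuning so the model can follow user questions.
    return {**model, "stage": "sft"}

def train_reward_model(model: dict, preference_data: list) -> dict:
    # Step 2: fit a reward model on human preference data (chosen vs. rejected completions).
    return {"base": model["name"], "stage": "reward-model"}

def rl_optimize(policy: dict, reward_model: dict, prompts: list) -> dict:
    # Step 3: sample generations, score them with the reward model, and update the policy
    # with an RL optimizer of choice (the update itself is stubbed out here).
    for prompt in prompts:
        completions = [f"{prompt} [sample {i}]" for i in range(4)]
        scores = [float(len(c)) for c in completions]  # stand-in for reward model scores
        _ = max(zip(scores, completions))              # stand-in for the actual RL update
    return {**policy, "stage": "rlhf"}

policy = train_sft({"name": "base-lm"}, instruction_data=[])
rm = train_reward_model(policy, preference_data=[])
policy = rl_optimize(policy, rm, prompts=["Write me a short poem."])
print(policy)
```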
@@ -27,7 +27,7 @@ Post-training is a more complete set of techniques and best-practices to make la
Post-training can be summarized as using three optimization methods:
1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for base of instruction following abilities. This is largely about learning *features* in language.
-2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
+2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
3. Reinforcement Finetuning (RFT). The newest type of post-training that boosts performance on verifiable domains.
This book focuses on the second area, **preference finetuning**, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning.
@@ -39,7 +39,7 @@ The foundations of RLHF involve far more than preferences alone and this book pr
The biggest question around RLHF, yet one that is still hard to answer, is "What does RLHF training offer models?"
The core role of this book, beyond teaching the techniques for doing RLHF, is to distill intuition as to *why* RLHF is crucial to modern AI models.
-In recent years, language models shifted from academic experiments studied in the purview of benchmarks to general purpose technology.
+In recent years, language models have shifted from academic experiments studied in the purview of benchmarks to general purpose technology.
RLHF is at the core of this transition.
The most compelling view of how RLHF works is to think of how *style* applies to interactions you have with language models.
chapters/03-setup.md (1 addition & 1 deletion)
@@ -27,7 +27,7 @@ In practice, one uses a cross-entropy loss with respect to each next-token predi
Implementing a language model can take many forms.
Modern LMs, including ChatGPT, Claude, Gemini, etc., most often use **decoder-only Transformers**[@Vaswani2017AttentionIA].
-The core innovation of the Transform was heavily utilizing the **self-attention**[@Bahdanau2014NeuralMT] mechanism to allow the model to directly attend to concepts in context and learn complex mappings.
+The core innovation of the Transformer was heavily utilizing the **self-attention**[@Bahdanau2014NeuralMT] mechanism to allow the model to directly attend to concepts in context and learn complex mappings.
Throughout this book, particularly when covering reward models in Chapter 7, we will discuss adding new heads or modifying a language modeling (LM) head of the transformer.
The LM head is a final linear projection layer that maps from the models internal embedding space to the tokenizer space (a.k.a. vocabulary).
Different heads can be used to re-use the internals of the model and fine-tune it to output differently shaped quantities.
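Since this hunk touches the discussion of the LM head and alternative heads, here is a minimal PyTorch sketch of the idea. The dimensions and variable names are made up for illustration; this is not the book's code:

```python
# Sketch: the LM head projects hidden states to the vocabulary, while a reward-model
# head projects the same internals to a single scalar per sequence.
import torch
import torch.nn as nn

hidden_size, vocab_size = 64, 1000

lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # embedding space -> vocabulary
reward_head = nn.Linear(hidden_size, 1, bias=False)       # embedding space -> scalar score

# Stand-in for the final hidden states of a decoder-only Transformer:
# shape (batch, sequence_length, hidden_size).
hidden_states = torch.randn(2, 16, hidden_size)

logits = lm_head(hidden_states)                 # (2, 16, 1000): next-token distribution per position
rewards = reward_head(hidden_states[:, -1, :])  # (2, 1): one score per sequence, as in a reward model
print(logits.shape, rewards.shape)
```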
chapters/04-optimization.md (1 addition & 1 deletion)
@@ -30,7 +30,7 @@ There are multiple core changes from the standard RL setup to that of RLHF:
1. Switching from a reward function to a reward model. In RLHF, a learned model of human preferences, $r_\theta(s_t, a_t)$ (or any other classification model) is used instead of an environmental reward function. This gives the designer a substantial increase in the flexibility of the approach and control over the final results.
2. No state transitions exist. In RLHF, the initial states for the domain are prompts sampled from a training dataset and the "action" is the completion to said prompt. During standard practices, this action does not impact the next state and is only scored by the reward model.
-3. Response level rewards. Often referred to as a Bandits Problem, RLHF attribution of reward is done for an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained manner.
+3. Response level rewards. Often referred to as a bandit problem, RLHF attribution of reward is done for an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained manner.
Given the single-turn nature of the problem, the optimization can be re-written without the time horizon and discount factor (and the reward models):
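The hunk ends just before the equation it introduces. For orientation, a standard way this single-turn (bandit) objective is written in the RLHF literature, using the reward model $r_\theta$ from point 1 above with prompt $x$ and completion $y$, is the common KL-regularized form below; it is not necessarily the exact expression that follows in the chapter:

$$ \max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \left[ r_\theta(x, y) \right] \;-\; \beta\, \mathcal{D}_{\mathrm{KL}}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) $$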
The core of reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, is designed to optimize machine learning models in domains where specifically designing a reward function is hard.
-Consider an example, how do you decide which of these two poems is better (Context: *On February 26th, 2025, I asked both Claude 3.7 Sonnet and ChatGPT with GPT-4o to "Write me a short poem about a optimistic goldfish."*):
+Consider an example, how do you decide which of these two poems is better (Context: *On February 26th, 2025, I asked both Claude 3.7 Sonnet and ChatGPT with GPT-4o to "Write me a short poem about an optimistic goldfish."*):
Example 1:
> The Optimistic Goldfish
@@ -64,8 +64,8 @@ Together, each of these areas brings specific assumptions at what a preference i
In practice, RLHF methods are motivated and studied from the perspective of empirical alignment -- maximizing model performance on specific skills instead of measuring the calibration to specific values.
Still, the origins of value alignment for RLHF methods continue to be studied through research on methods to solve for ``pluralistic alignment'' across populations, such as position papers [@conitzer2024social], [@mishra2023ai], new datasets [@kirk2024prism], and personalization methods [@poddar2024personalizing].
-The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that do often not apply in practice.
-The specifics of obtaining data for RLHF is discussed further in Chapter 6 and using it for reward modeling in Chapter 7.
+The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that often do not apply in practice.
+The specifics of obtaining data for RLHF are discussed further in Chapter 6 and using it for reward modeling in Chapter 7.
For an extended version of this chapter, see [@lambert2023entangled].
Early language models were only trained to predict the next tokens in a sequence and were not adapted to any specific tasks.
-Around the release of GPT-3 [@brown2020language], language models were still primarily used via in-context learning where examples where shown to the model and then it was asked to complete a similar task.
+Around the release of GPT-3 [@brown2020language], language models were still primarily used via in-context learning where examples were shown to the model and then it was asked to complete a similar task.
This was the combination of two trends -- historically in the natural language processing (NLP) literature, models were trained for a specific task.
Here, as seen with one example where bigger models generalize better, multiple results showed how standardizing the approach of task data can enable dramatically different downstream performance.
chapters/10-rejection-sampling.md (3 additions & 3 deletions)
@@ -194,8 +194,8 @@ The core hyperparameters for performing this training are very intuitive:
-**Sampling parameters**: Rejection sampling is directly dependent on the completions received from the model. Common settings for RS include temperatures above zero, e.g. between 0.7 and 1.0, with other modifications to parameters such as top-p or top-k sampling.
-**Completions per prompt**: Successful implementations of rejection sampling have included 10 to 30 or more completions for each prompt. Using too few completions will make training biased and or noisy.
-**Instruction tuning details**: No clear training details for the instruction tuning during RS have been released. It is likely that they use slightly different settings than the initial instruction tuning phase of the model.
--**Heterogenous model generations**: Some implementations of rejection sampling include generations from multiple models rather than just the current model that is going to be trained. Best practices on how to do this are not established.
--**Reward model training**: The reward model used will heavily impact the final result. For more resources on reward model training, see the [relevant chapter](https://rhlfbook.com/reward-models.html).
+-**Heterogeneous model generations**: Some implementations of rejection sampling include generations from multiple models rather than just the current model that is going to be trained. Best practices on how to do this are not established.
+-**Reward model training**: The reward model used will heavily impact the final result. For more resources on reward model training, see the [relevant chapter](https://rlhfbook.com/c/07-reward-models.html).
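As a compact restatement of the sampling-related hyperparameters in the list above, a hypothetical configuration object might look like the following. The field names and defaults are illustrative only, not taken from any specific implementation:

```python
# Illustrative rejection-sampling settings, mirroring the hyperparameters listed above.
from dataclasses import dataclass

@dataclass
class RejectionSamplingConfig:
    temperature: float = 0.8          # above zero, commonly ~0.7-1.0
    top_p: float = 0.95               # optional nucleus-sampling modification
    completions_per_prompt: int = 16  # reported implementations use ~10-30 or more

print(RejectionSamplingConfig())
```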
#### Implementation Tricks
@@ -205,7 +205,7 @@ The core hyperparameters for performing this training are very intuitive:
Best-of-N (BoN) sampling is often included as a baseline relative to RLHF methods.
It is important to remember that BoN *does not* modify the underlying model, but is a sampling technique.
-For this matter, comparisons for BoN sampling to online training methods, such as PPO, is still valid in some contexts.
+For this matter, comparisons for BoN sampling to online training methods, such as PPO, are still valid in some contexts.
For example, you can still measure the KL distance when running BoN sampling relative to any other policy.
Here, we will show that when using simple BoN sampling over one prompt, both selection criteria shown above are equivalent.
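To make the BoN baseline concrete, here is a minimal Python sketch. The `generate` and `score` functions are placeholders standing in for a real policy and reward model; nothing here comes from the book's code:

```python
# Best-of-N sampling: the policy itself is never updated; we draw N completions
# and keep the one the reward model scores highest.
import random

def generate(prompt: str) -> str:
    # Placeholder for sampling one completion from the unchanged policy.
    return f"{prompt} [completion {random.randint(0, 9999)}]"

def score(prompt: str, completion: str) -> float:
    # Placeholder for a learned reward model's scalar score.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    completions = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: score(prompt, c))

print(best_of_n("Write me a short poem about an optimistic goldfish.", n=8))
```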
chapters/13-cai.md (3 additions & 3 deletions)
@@ -11,7 +11,7 @@ next-url: "14-reasoning.html"
RL from AI Feedback (RLAIF) is a larger set of techniques for using AI to augment or generate feedback data, including pairwise preferences [@lee2023rlaif][@sharma2024critical][@castricato2024suppressing].
There are many motivations to using RLAIF to either entirely replace human feedback or augment it.
AI models are far cheaper than humans, with a single piece of human preference data costing on the order of $1 or higher (or even above $10 per prompt), AI feedback with a frontier AI model, such as GPT-4o costs less than $0.01.
-This cost differences opens the market of experimentation with RLHF methods to an entire population of people previously priced out.
+This cost difference opens the market of experimentation with RLHF methods to an entire population of people previously priced out.
Other than price, AI feedback introduces different *tradeoffs* on performance than human feedback, which are still being investigated.
The peak performance for AI feedback is at least in the same ballpark of human data on skill-based evaluations, but it is not studied if human data allows finer control of the models in real-world product settings or for newer training methods such as character training.
@@ -28,10 +28,10 @@ Results in many academic results showing how one can substitute AI preference da
## Constitutional AI
-The method of Constitutional AI (CAI), which Anthropic uses extensively in their Claude models, is earliest, large-scale use of synthetic data for RLHF training.
+The method of Constitutional AI (CAI), which Anthropic uses extensively in their Claude models, is the earliest, large-scale use of synthetic data for RLHF training.
Constitutional AI has two uses of synthetic data:
-1. Critiques of instruction-tune data to follow a set of principles like “Is the answer encouraging violence” or “Is the answer truthful.” When the model generates answers to questions, it checks the answer against the list of principles in the constitution, refining the answer over time. Then, they fine-tune the model on this resulting dataset.
+1. Critiques of instruction-tuned data to follow a set of principles like “Is the answer encouraging violence” or “Is the answer truthful.” When the model generates answers to questions, it checks the answer against the list of principles in the constitution, refining the answer over time. Then, they fine-tune the model on this resulting dataset.
2. Generates pairwise preference data by using a language model to answer which completion was better, given the context of a random principle from the constitution (similar to this paper for principle-guided reward models). Then, RLHF proceeds as normal with synthetic data, hence the RLAIF name.
Largely, CAI is known for the second half above, the preference data, but the methods introduced for instruction data are used in general data filtering and synthetic data generation methods across post-training.
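As an illustration of the second use described above (RLAIF-style pairwise preference data), a minimal sketch might look like the following. The `ai_judge` function is a hypothetical stand-in for a call to a language model, and the two principles are taken from the examples in this hunk; none of this is Anthropic's implementation:

```python
# Sketch of CAI-style synthetic preference labeling: pick a random principle from the
# constitution and ask an AI judge which completion better satisfies it.
import random

CONSTITUTION = [
    "Is the answer encouraging violence?",
    "Is the answer truthful?",
]

def ai_judge(principle: str, prompt: str, completion_a: str, completion_b: str) -> str:
    # Placeholder: a real implementation would prompt a language model with the
    # principle and both completions, then parse its preferred choice ("A" or "B").
    return random.choice(["A", "B"])

def label_preference(prompt: str, completion_a: str, completion_b: str) -> dict:
    principle = random.choice(CONSTITUTION)
    choice = ai_judge(principle, prompt, completion_a, completion_b)
    chosen, rejected = (completion_a, completion_b) if choice == "A" else (completion_b, completion_a)
    return {"prompt": prompt, "principle": principle, "chosen": chosen, "rejected": rejected}

print(label_preference("How should I handle a disagreement?", "Completion A ...", "Completion B ..."))
```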