
Commit 252c6a2

natolambert and claude authored
Fix typos across book chapters -- claude bigger (#71)
Co-authored-by: Claude <noreply@anthropic.com>
1 parent ae6d4a2 commit 252c6a2

9 files changed (+18, -18 lines changed)


chapters/01-introduction.md

Lines changed: 3 additions & 3 deletions
@@ -16,7 +16,7 @@ RLHF became most known through the release of ChatGPT and the subsequent rapid d
16    The basic pipeline for RLHF involves three steps.
17    First, a language model that can follow user questions must be trained (see Chapter 9).
18    Second, human preference data must be collected for the training of a reward model of human preferences (see Chapter 7).
19 -  Finally, the language model can be optimized with a RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapter 3 and 11).
19 +  Finally, the language model can be optimized with an RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapter 3 and 11).
20    This book details key decisions and basic implementation examples for each step in this process.
21
22    RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured.
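
To make the three steps in this hunk concrete, here is a minimal sketch of how they fit together; the callables (`sft_train`, `train_reward_model`, `rl_step`) and the `policy.generate` / `reward_model.score` interfaces are illustrative placeholders, not code from the book.

```python
from typing import Callable, Iterable, List

def rlhf_pipeline(
    sft_train: Callable,           # step 1: instruction finetuning (Chapter 9)
    train_reward_model: Callable,  # step 2: reward model from preference data (Chapter 7)
    rl_step: Callable,             # step 3: one update of an RL optimizer (Chapters 3 and 11)
    base_model,
    instruction_data,
    preference_data,
    prompt_batches: Iterable[List[str]],
):
    """Wire the three RLHF steps together; each callable is supplied by the user."""
    policy = sft_train(base_model, instruction_data)
    reward_model = train_reward_model(policy, preference_data)
    for prompts in prompt_batches:
        completions = [policy.generate(p) for p in prompts]            # sample generations
        rewards = [reward_model.score(p, c) for p, c in zip(prompts, completions)]
        policy = rl_step(policy, prompts, completions, rewards)        # rate and optimize
    return policy
```
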
@@ -27,7 +27,7 @@ Post-training is a more complete set of techniques and best-practices to make la
27    Post-training can be summarized as using three optimization methods:
28
29    1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for base of instruction following abilities. This is largely about learning *features* in language.
30 -  2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
30 +  2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
31    3. Reinforcement Finetuning (RFT). The newest type of post-training that boosts performance on verifiable domains.
32
33    This book focuses on the second area, **preference finetuning**, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning.
@@ -39,7 +39,7 @@ The foundations of RLHF involve far more than preferences alone and this book pr
39
40    The biggest question around RLHF, yet one that is still hard to answer, is "What does RLHF training offer models?"
41    The core role of this book, beyond teaching the techniques for doing RLHF, is to distill intuition as to *why* RLHF is crucial to modern AI models.
42 -  In recent years, language models shifted from academic experiments studied in the purview of benchmarks to general purpose technology.
42 +  In recent years, language models have shifted from academic experiments studied in the purview of benchmarks to general purpose technology.
43    RLHF is at the core of this transition.
44
45    The most compelling view of how RLHF works is to think of how *style* applies to interactions you have with language models.

chapters/03-setup.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ In practice, one uses a cross-entropy loss with respect to each next-token predi
27
28    Implementing a language model can take many forms.
29    Modern LMs, including ChatGPT, Claude, Gemini, etc., most often use **decoder-only Transformers** [@Vaswani2017AttentionIA].
30 -  The core innovation of the Transform was heavily utilizing the **self-attention** [@Bahdanau2014NeuralMT] mechanism to allow the model to directly attend to concepts in context and learn complex mappings.
30 +  The core innovation of the Transformer was heavily utilizing the **self-attention** [@Bahdanau2014NeuralMT] mechanism to allow the model to directly attend to concepts in context and learn complex mappings.
31    Throughout this book, particularly when covering reward models in Chapter 7, we will discuss adding new heads or modifying a language modeling (LM) head of the transformer.
32    The LM head is a final linear projection layer that maps from the models internal embedding space to the tokenizer space (a.k.a. vocabulary).
33    Different heads can be used to re-use the internals of the model and fine-tune it to output differently shaped quantities.
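
To illustrate the point about heads in this hunk, a minimal PyTorch sketch (not code from the book) of the LM head as a linear projection to the vocabulary, next to a scalar head of the kind used for reward models in Chapter 7; the `hidden_size` and `vocab_size` values are illustrative.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 4096, 32000  # illustrative sizes, not tied to any specific model

# LM head: projects the model's internal embedding space to the tokenizer vocabulary.
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Alternative head for reward modeling: same internals, one scalar output per sequence.
reward_head = nn.Linear(hidden_size, 1, bias=False)

final_hidden = torch.randn(2, 7, hidden_size)      # (batch, seq_len, hidden) from the transformer
next_token_logits = lm_head(final_hidden)          # (2, 7, vocab_size)
reward = reward_head(final_hidden[:, -1, :])       # (2, 1): score read from the last token's state
```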

chapters/04-optimization.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ There are multiple core changes from the standard RL setup to that of RLHF:
30
31    1. Switching from a reward function to a reward model. In RLHF, a learned model of human preferences, $r_\theta(s_t, a_t)$ (or any other classification model) is used instead of an environmental reward function. This gives the designer a substantial increase in the flexibility of the approach and control over the final results.
32    2. No state transitions exist. In RLHF, the initial states for the domain are prompts sampled from a training dataset and the "action" is the completion to said prompt. During standard practices, this action does not impact the next state and is only scored by the reward model.
33 -  3. Response level rewards. Often referred to as a Bandits Problem, RLHF attribution of reward is done for an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained manner.
33 +  3. Response level rewards. Often referred to as a bandit problem, RLHF attribution of reward is done for an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained manner.
34
35    Given the single-turn nature of the problem, the optimization can be re-written without the time horizon and discount factor (and the reward models):
36    $$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[r_\theta(s_t, a_t) \right].$$ {#eq:rl_opt_int}
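
A small illustration of the bandit framing in this hunk: a Monte Carlo estimate of $J(\pi)$ treats each prompt as the state, samples one whole completion as the single action, and scores it with the reward model. This is a hedged sketch; `generate` and `score` are assumed callables rather than APIs from the book.

```python
from typing import Callable, List

def estimate_objective(
    prompts: List[str],
    generate: Callable[[str], str],       # one sampled completion per prompt (the "action")
    score: Callable[[str, str], float],   # sequence-level reward r_theta(prompt, completion)
) -> float:
    """Monte Carlo estimate of J(pi): mean reward over whole completions, no horizon or discounting."""
    rewards = [score(p, generate(p)) for p in prompts]
    return sum(rewards) / len(rewards)
```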

chapters/05-preferences.md

Lines changed: 3 additions & 3 deletions
@@ -9,7 +9,7 @@ next-url: "06-preference-data.html"
9     # The Nature of Preferences
10
11    The core of reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, is designed to optimize machine learning models in domains where specifically designing a reward function is hard.
12 -  Consider an example, how do you decide which of these two poems is better (Context: *On February 26th, 2025, I asked both Claude 3.7 Sonnet and ChatGPT with GPT-4o to "Write me a short poem about a optimistic goldfish."*):
12 +  Consider an example, how do you decide which of these two poems is better (Context: *On February 26th, 2025, I asked both Claude 3.7 Sonnet and ChatGPT with GPT-4o to "Write me a short poem about an optimistic goldfish."*):
13
14    Example 1:
15    > The Optimistic Goldfish
@@ -64,8 +64,8 @@ Together, each of these areas brings specific assumptions at what a preference i
64    In practice, RLHF methods are motivated and studied from the perspective of empirical alignment -- maximizing model performance on specific skills instead of measuring the calibration to specific values.
65    Still, the origins of value alignment for RLHF methods continue to be studied through research on methods to solve for ``pluralistic alignment'' across populations, such as position papers [@conitzer2024social], [@mishra2023ai], new datasets [@kirk2024prism], and personalization methods [@poddar2024personalizing].
66
67 -  The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that do often not apply in practice.
68 -  The specifics of obtaining data for RLHF is discussed further in Chapter 6 and using it for reward modeling in Chapter 7.
67 +  The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that often do not apply in practice.
68 +  The specifics of obtaining data for RLHF are discussed further in Chapter 6 and using it for reward modeling in Chapter 7.
69    For an extended version of this chapter, see [@lambert2023entangled].
70
71    ## The path to optimizing preferences

chapters/06-preference-data.md

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ An example interaction of this form is shown below for an earlier version of Cha
42    ![Example preference data collection interface.](images/chatgpt-ab-test.jpeg){#fig:preference-chatgpt .center}
43
44    This style of interface is used extensively across the industry, such as for *evaluation* of models given the same format.
45 -  A popular public option to see engage with models in this way is ChatBotArena [@chiang2024chatbot]:
45 +  A popular public option to engage with models in this way is ChatBotArena [@chiang2024chatbot]:
46
47    ![Example preference data collection interface.](images/chatbotarena.png){#fig:chatbotarena .center}
48

chapters/09-instruction-tuning.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ next-url: "10-rejection-sampling.html"
9     # Instruction Finetuning
10
11    Early language models were only trained to predict the next tokens in a sequence and were not adapted to any specific tasks.
12 -  Around the release of GPT-3 [@brown2020language], language models were still primarily used via in-context learning where examples where shown to the model and then it was asked to complete a similar task.
12 +  Around the release of GPT-3 [@brown2020language], language models were still primarily used via in-context learning where examples were shown to the model and then it was asked to complete a similar task.
13
14    This was the combination of two trends -- historically in the natural language processing (NLP) literature, models were trained for a specific task.
15    Here, as seen with one example where bigger models generalize better, multiple results showed how standardizing the approach of task data can enable dramatically different downstream performance.

chapters/10-rejection-sampling.md

Lines changed: 3 additions & 3 deletions
@@ -194,8 +194,8 @@ The core hyperparameters for performing this training are very intuitive:
194   - **Sampling parameters**: Rejection sampling is directly dependent on the completions received from the model. Common settings for RS include temperatures above zero, e.g. between 0.7 and 1.0, with other modifications to parameters such as top-p or top-k sampling.
195   - **Completions per prompt**: Successful implementations of rejection sampling have included 10 to 30 or more completions for each prompt. Using too few completions will make training biased and or noisy.
196   - **Instruction tuning details**: No clear training details for the instruction tuning during RS have been released. It is likely that they use slightly different settings than the initial instruction tuning phase of the model.
197 - - **Heterogenous model generations**: Some implementations of rejection sampling include generations from multiple models rather than just the current model that is going to be trained. Best practices on how to do this are not established.
198 - - **Reward model training**: The reward model used will heavily impact the final result. For more resources on reward model training, see the [relevant chapter](https://rhlfbook.com/reward-models.html).
197 + - **Heterogeneous model generations**: Some implementations of rejection sampling include generations from multiple models rather than just the current model that is going to be trained. Best practices on how to do this are not established.
198 + - **Reward model training**: The reward model used will heavily impact the final result. For more resources on reward model training, see the [relevant chapter](https://rlhfbook.com/c/07-reward-models.html).
199
200   #### Implementation Tricks
201
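A compact sketch of the loop these hyperparameters configure (not the book's implementation; `generate` and `score` are assumed callables): sample several completions per prompt at a moderate temperature, score them with the reward model, and keep the best completion as new instruction-tuning data.

```python
from typing import Callable, Dict, List

def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, float], str],   # e.g. wraps model generation at a given temperature
    score: Callable[[str, str], float],      # reward model score for (prompt, completion)
    completions_per_prompt: int = 16,        # commonly 10 to 30 or more
    temperature: float = 0.8,                # commonly 0.7 to 1.0
) -> List[Dict[str, str]]:
    """Keep the highest-reward completion per prompt for a new round of instruction tuning."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt, temperature) for _ in range(completions_per_prompt)]
        best = max(candidates, key=lambda c: score(prompt, c))
        kept.append({"prompt": prompt, "completion": best})
    return kept
```
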
@@ -205,7 +205,7 @@ The core hyperparameters for performing this training are very intuitive:
205
206   Best-of-N (BoN) sampling is often included as a baseline relative to RLHF methods.
207   It is important to remember that BoN *does not* modify the underlying model, but is a sampling technique.
208 - For this matter, comparisons for BoN sampling to online training methods, such as PPO, is still valid in some contexts.
208 + For this matter, comparisons for BoN sampling to online training methods, such as PPO, are still valid in some contexts.
209   For example, you can still measure the KL distance when running BoN sampling relative to any other policy.
210
211   Here, we will show that when using simple BoN sampling over one prompt, both selection criteria shown above are equivalent.
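
As a small companion to this hunk, a sketch of BoN at inference time (illustrative helper signatures, not from the book): the policy's parameters are never updated, only its samples are filtered by a selection criterion.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    sample: Callable[[str], str],        # draws one completion from the unchanged policy
    score: Callable[[str, str], float],  # reward model or other selection criterion
    n: int = 8,
) -> str:
    """Best-of-N sampling: return the highest-scoring of n samples; no training step is involved."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```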

chapters/12-direct-alignment.md

Lines changed: 2 additions & 2 deletions
@@ -136,7 +136,7 @@ $$ \pi^*(y|x) = \pi(y|x) = \frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left(\frac{1}
136
137   To start, recall from Chapter 7 on Reward Modeling and Chapter 6 on Preference Data that a Bradley-Terry model of human preferences is formed as:
138
139 - $$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(y_1 \mid x)\right)}{\exp\left(r^*(x,y_1)\right) + \exp\left(r^*(x, y_2)\right)} $$ {#eq:bradley_terry_dpo}
139 + $$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x,y_1)\right)}{\exp\left(r^*(x,y_1)\right) + \exp\left(r^*(x, y_2)\right)} $$ {#eq:bradley_terry_dpo}
140
141   By manipulating @eq:dpo_opt_policy by taking the logarithm of both sides and performing some algebra, one can obtain the DPO reward as follows:
142
@@ -152,7 +152,7 @@ By decomposing the exponential expressions from $e^{a+b}$ to $e^a e^b$ and then
152   $$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)}
153   {\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right) + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right)} $$ {#eq:dpo_loss_deriv1}
154
155 - Then, multiple the numerator and denominator by $\exp\left(-\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)$ to obtain:
155 + Then, multiply the numerator and denominator by $\exp\left(-\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)$ to obtain:
156
157   $$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)} $$ {#eq:dpo_loss_deriv2}
158
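
Since @eq:dpo_loss_deriv2 is exactly a sigmoid of the difference of $\beta$-scaled log-ratios, a minimal PyTorch sketch of the loss this algebra implies is shown below; it takes sequence-level log-probabilities as inputs and is an illustration, not the book's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log pi(y_1 | x) under the model being trained
    policy_rejected_logp: torch.Tensor,  # log pi(y_2 | x)
    ref_chosen_logp: torch.Tensor,       # log pi_ref(y_1 | x) under the frozen reference model
    ref_rejected_logp: torch.Tensor,     # log pi_ref(y_2 | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Difference of beta-scaled log-ratios, matching the exponent in eq. dpo_loss_deriv2.
    logits = beta * (policy_chosen_logp - ref_chosen_logp) \
           - beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximizing p(y_1 > y_2 | x) = sigmoid(logits) means minimizing -log sigmoid(logits).
    return -F.logsigmoid(logits).mean()
```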

chapters/13-cai.md

Lines changed: 3 additions & 3 deletions
@@ -11,7 +11,7 @@ next-url: "14-reasoning.html"
11    RL from AI Feedback (RLAIF) is a larger set of techniques for using AI to augment or generate feedback data, including pairwise preferences [@lee2023rlaif] [@sharma2024critical] [@castricato2024suppressing].
12    There are many motivations to using RLAIF to either entirely replace human feedback or augment it.
13    AI models are far cheaper than humans, with a single piece of human preference data costing on the order of $1 or higher (or even above $10 per prompt), AI feedback with a frontier AI model, such as GPT-4o costs less than $0.01.
14 -  This cost differences opens the market of experimentation with RLHF methods to an entire population of people previously priced out.
14 +  This cost difference opens the market of experimentation with RLHF methods to an entire population of people previously priced out.
15    Other than price, AI feedback introduces different *tradeoffs* on performance than human feedback, which are still being investigated.
16    The peak performance for AI feedback is at least in the same ballpark of human data on skill-based evaluations, but it is not studied if human data allows finer control of the models in real-world product settings or for newer training methods such as character training.
17
@@ -28,10 +28,10 @@ Results in many academic results showing how one can substitute AI preference da
28
29    ## Constitutional AI
30
31 -  The method of Constitutional AI (CAI), which Anthropic uses extensively in their Claude models, is earliest, large-scale use of synthetic data for RLHF training.
31 +  The method of Constitutional AI (CAI), which Anthropic uses extensively in their Claude models, is the earliest, large-scale use of synthetic data for RLHF training.
32    Constitutional AI has two uses of synthetic data:
33
34 -  1. Critiques of instruction-tune data to follow a set of principles like “Is the answer encouraging violence” or “Is the answer truthful.” When the model generates answers to questions, it checks the answer against the list of principles in the constitution, refining the answer over time. Then, they fine-tune the model on this resulting dataset.
34 +  1. Critiques of instruction-tuned data to follow a set of principles like “Is the answer encouraging violence” or “Is the answer truthful.” When the model generates answers to questions, it checks the answer against the list of principles in the constitution, refining the answer over time. Then, they fine-tune the model on this resulting dataset.
35    2. Generates pairwise preference data by using a language model to answer which completion was better, given the context of a random principle from the constitution (similar to this paper for principle-guided reward models). Then, RLHF proceeds as normal with synthetic data, hence the RLAIF name.
36
37    Largely, CAI is known for the second half above, the preference data, but the methods introduced for instruction data are used in general data filtering and synthetic data generation methods across post-training.
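
For the second use of synthetic data in this hunk (AI pairwise preference labels conditioned on a random principle), a hedged sketch of the labeling loop follows; the `judge` callable stands in for any language model call, and the prompt template is illustrative rather than Anthropic's.

```python
import random
from typing import Callable, Dict, List, Tuple

def cai_preference_labels(
    pairs: List[Tuple[str, str, str]],   # (prompt, completion_a, completion_b)
    constitution: List[str],             # list of written principles
    judge: Callable[[str], str],         # any LM call expected to answer "A" or "B"
) -> List[Dict[str, str]]:
    """Label each pair with an AI judgment conditioned on a randomly drawn principle."""
    labeled = []
    for prompt, a, b in pairs:
        principle = random.choice(constitution)
        question = (
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\n"
            f"Response A: {a}\nResponse B: {b}\n"
            "Which response better follows the principle? Answer A or B."
        )
        choice = judge(question).strip().upper()
        chosen, rejected = (a, b) if choice.startswith("A") else (b, a)
        labeled.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return labeled
```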

0 commit comments
