chapters/01-introduction.md (3 additions & 3 deletions)
@@ -16,7 +16,7 @@ RLHF became most known through the release of ChatGPT and the subsequent rapid d
The basic pipeline for RLHF involves three steps.
First, a language model that can follow user questions must be trained (see Chapter 9).
Second, human preference data must be collected for the training of a reward model of human preferences (see Chapter 7).
-Finally, the language model can be optimized with a RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapter 3 and 11).
+Finally, the language model can be optimized with an RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapter 3 and 11).
This book details key decisions and basic implementation examples for each step in this process.
RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured.
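As a rough illustration of the three-step pipeline this hunk describes, here is a minimal, self-contained Python sketch. Every function and name below is a stub invented for illustration, not code from the book or any library:

```python
# A minimal sketch of the three-step RLHF pipeline: SFT, reward modeling, RL optimization.
# All functions are illustrative stubs, not APIs from the book or any library.

def train_sft(model: dict, instruction_data: list) -> dict:
    # Step 1: instruction/supervised finetuning so the model can follow user questions.
    return {**model, "stage": "sft"}

def train_reward_model(model: dict, preference_data: list) -> dict:
    # Step 2: fit a reward model on human preference data (chosen vs. rejected completions).
    return {"base": model["name"], "stage": "reward-model"}

def rl_optimize(policy: dict, reward_model: dict, prompts: list) -> dict:
    # Step 3: sample generations, score them with the reward model, and update the policy
    # with an RL optimizer of choice (the update itself is stubbed out here).
    for prompt in prompts:
        completions = [f"{prompt} [sample {i}]" for i in range(4)]
        scores = [float(len(c)) for c in completions]  # stand-in for reward model scores
        _ = max(zip(scores, completions))              # stand-in for the actual RL update
    return {**policy, "stage": "rlhf"}

policy = train_sft({"name": "base-lm"}, instruction_data=[])
rm = train_reward_model(policy, preference_data=[])
policy = rl_optimize(policy, rm, prompts=["Write me a short poem."])
print(policy)
```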
@@ -27,7 +27,7 @@ Post-training is a more complete set of techniques and best-practices to make la
Post-training can be summarized as using three optimization methods:
1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for base of instruction following abilities. This is largely about learning *features* in language.
-2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
+2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
3. Reinforcement Finetuning (RFT). The newest type of post-training that boosts performance on verifiable domains.
This book focuses on the second area, **preference finetuning**, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning.
@@ -39,7 +39,7 @@ The foundations of RLHF involve far more than preferences alone and this book pr
The biggest question around RLHF, yet one that is still hard to answer, is "What does RLHF training offer models?"
The core role of this book, beyond teaching the techniques for doing RLHF, is to distill intuition as to *why* RLHF is crucial to modern AI models.
-In recent years, language models shifted from academic experiments studied in the purview of benchmarks to general purpose technology.
+In recent years, language models have shifted from academic experiments studied in the purview of benchmarks to general purpose technology.
RLHF is at the core of this transition.
The most compelling view of how RLHF works is to think of how *style* applies to interactions you have with language models.
chapters/03-setup.md (1 addition & 1 deletion)
@@ -27,7 +27,7 @@ In practice, one uses a cross-entropy loss with respect to each next-token predi
Implementing a language model can take many forms.
Modern LMs, including ChatGPT, Claude, Gemini, etc., most often use **decoder-only Transformers**[@Vaswani2017AttentionIA].
-The core innovation of the Transform was heavily utilizing the **self-attention**[@Bahdanau2014NeuralMT] mechanism to allow the model to directly attend to concepts in context and learn complex mappings.
+The core innovation of the Transformer was heavily utilizing the **self-attention**[@Bahdanau2014NeuralMT] mechanism to allow the model to directly attend to concepts in context and learn complex mappings.
Throughout this book, particularly when covering reward models in Chapter 7, we will discuss adding new heads or modifying a language modeling (LM) head of the transformer.
The LM head is a final linear projection layer that maps from the models internal embedding space to the tokenizer space (a.k.a. vocabulary).
Different heads can be used to re-use the internals of the model and fine-tune it to output differently shaped quantities.
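Since this hunk touches the discussion of the LM head and alternative heads, here is a minimal PyTorch sketch of the idea. The dimensions and variable names are made up for illustration; this is not the book's code:

```python
# Sketch: the LM head projects hidden states to the vocabulary, while a reward-model
# head projects the same internals to a single scalar per sequence.
import torch
import torch.nn as nn

hidden_size, vocab_size = 64, 1000

lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # embedding space -> vocabulary
reward_head = nn.Linear(hidden_size, 1, bias=False)       # embedding space -> scalar score

# Stand-in for the final hidden states of a decoder-only Transformer:
# shape (batch, sequence_length, hidden_size).
hidden_states = torch.randn(2, 16, hidden_size)

logits = lm_head(hidden_states)                 # (2, 16, 1000): next-token distribution per position
rewards = reward_head(hidden_states[:, -1, :])  # (2, 1): one score per sequence, as in a reward model
print(logits.shape, rewards.shape)
```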
chapters/04-optimization.md (1 addition & 1 deletion)
@@ -30,7 +30,7 @@ There are multiple core changes from the standard RL setup to that of RLHF:
1. Switching from a reward function to a reward model. In RLHF, a learned model of human preferences, $r_\theta(s_t, a_t)$ (or any other classification model) is used instead of an environmental reward function. This gives the designer a substantial increase in the flexibility of the approach and control over the final results.
2. No state transitions exist. In RLHF, the initial states for the domain are prompts sampled from a training dataset and the "action" is the completion to said prompt. During standard practices, this action does not impact the next state and is only scored by the reward model.
-3. Response level rewards. Often referred to as a Bandits Problem, RLHF attribution of reward is done for an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained manner.
+3. Response level rewards. Often referred to as a bandit problem, RLHF attribution of reward is done for an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained manner.
Given the single-turn nature of the problem, the optimization can be re-written without the time horizon and discount factor (and the reward models):
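The hunk ends just before the equation it introduces. For orientation, a standard way this single-turn (bandit) objective is written in the RLHF literature, using the reward model $r_\theta$ from point 1 above with prompt $x$ and completion $y$, is the common KL-regularized form below; it is not necessarily the exact expression that follows in the chapter:

$$ \max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \left[ r_\theta(x, y) \right] \;-\; \beta\, \mathcal{D}_{\mathrm{KL}}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) $$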
The core of reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, is designed to optimize machine learning models in domains where specifically designing a reward function is hard.
-Consider an example, how do you decide which of these two poems is better (Context: *On February 26th, 2025, I asked both Claude 3.7 Sonnet and ChatGPT with GPT-4o to "Write me a short poem about a optimistic goldfish."*):
+Consider an example, how do you decide which of these two poems is better (Context: *On February 26th, 2025, I asked both Claude 3.7 Sonnet and ChatGPT with GPT-4o to "Write me a short poem about an optimistic goldfish."*):
Example 1:
> The Optimistic Goldfish
@@ -64,8 +64,8 @@ Together, each of these areas brings specific assumptions at what a preference i
In practice, RLHF methods are motivated and studied from the perspective of empirical alignment -- maximizing model performance on specific skills instead of measuring the calibration to specific values.
Still, the origins of value alignment for RLHF methods continue to be studied through research on methods to solve for ``pluralistic alignment'' across populations, such as position papers [@conitzer2024social], [@mishra2023ai], new datasets [@kirk2024prism], and personalization methods [@poddar2024personalizing].
-The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that do often not apply in practice.
-The specifics of obtaining data for RLHF is discussed further in Chapter 6 and using it for reward modeling in Chapter 7.
+The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that often do not apply in practice.
+The specifics of obtaining data for RLHF are discussed further in Chapter 6 and using it for reward modeling in Chapter 7.
For an extended version of this chapter, see [@lambert2023entangled].
Early language models were only trained to predict the next tokens in a sequence and were not adapted to any specific tasks.
-Around the release of GPT-3 [@brown2020language], language models were still primarily used via in-context learning where examples where shown to the model and then it was asked to complete a similar task.
+Around the release of GPT-3 [@brown2020language], language models were still primarily used via in-context learning where examples were shown to the model and then it was asked to complete a similar task.
This was the combination of two trends -- historically in the natural language processing (NLP) literature, models were trained for a specific task.
Here, as seen with one example where bigger models generalize better, multiple results showed how standardizing the approach of task data can enable dramatically different downstream performance.
chapters/10-rejection-sampling.md (3 additions & 3 deletions)
@@ -194,8 +194,8 @@ The core hyperparameters for performing this training are very intuitive:
-**Sampling parameters**: Rejection sampling is directly dependent on the completions received from the model. Common settings for RS include temperatures above zero, e.g. between 0.7 and 1.0, with other modifications to parameters such as top-p or top-k sampling.
-**Completions per prompt**: Successful implementations of rejection sampling have included 10 to 30 or more completions for each prompt. Using too few completions will make training biased and or noisy.
-**Instruction tuning details**: No clear training details for the instruction tuning during RS have been released. It is likely that they use slightly different settings than the initial instruction tuning phase of the model.
--**Heterogenous model generations**: Some implementations of rejection sampling include generations from multiple models rather than just the current model that is going to be trained. Best practices on how to do this are not established.
--**Reward model training**: The reward model used will heavily impact the final result. For more resources on reward model training, see the [relevant chapter](https://rhlfbook.com/reward-models.html).
+-**Heterogeneous model generations**: Some implementations of rejection sampling include generations from multiple models rather than just the current model that is going to be trained. Best practices on how to do this are not established.
+-**Reward model training**: The reward model used will heavily impact the final result. For more resources on reward model training, see the [relevant chapter](https://rlhfbook.com/c/07-reward-models.html).
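As a compact restatement of the sampling-related hyperparameters in the list above, a hypothetical configuration object might look like the following. The field names and defaults are illustrative only, not taken from any specific implementation:

```python
# Illustrative rejection-sampling settings, mirroring the hyperparameters listed above.
from dataclasses import dataclass

@dataclass
class RejectionSamplingConfig:
    temperature: float = 0.8          # above zero, commonly ~0.7-1.0
    top_p: float = 0.95               # optional nucleus-sampling modification
    completions_per_prompt: int = 16  # reported implementations use ~10-30 or more

print(RejectionSamplingConfig())
```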
#### Implementation Tricks
@@ -205,7 +205,7 @@ The core hyperparameters for performing this training are very intuitive:
Best-of-N (BoN) sampling is often included as a baseline relative to RLHF methods.
It is important to remember that BoN *does not* modify the underlying model, but is a sampling technique.
-For this matter, comparisons for BoN sampling to online training methods, such as PPO, is still valid in some contexts.
+For this matter, comparisons for BoN sampling to online training methods, such as PPO, are still valid in some contexts.
For example, you can still measure the KL distance when running BoN sampling relative to any other policy.
Here, we will show that when using simple BoN sampling over one prompt, both selection criteria shown above are equivalent.
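To make the BoN baseline concrete, here is a minimal Python sketch. The `generate` and `score` functions are placeholders standing in for a real policy and reward model; nothing here comes from the book's code:

```python
# Best-of-N sampling: the policy itself is never updated; we draw N completions
# and keep the one the reward model scores highest.
import random

def generate(prompt: str) -> str:
    # Placeholder for sampling one completion from the unchanged policy.
    return f"{prompt} [completion {random.randint(0, 9999)}]"

def score(prompt: str, completion: str) -> float:
    # Placeholder for a learned reward model's scalar score.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    completions = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: score(prompt, c))

print(best_of_n("Write me a short poem about an optimistic goldfish.", n=8))
```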
chapters/13-cai.md (3 additions & 3 deletions)
@@ -11,7 +11,7 @@ next-url: "14-reasoning.html"
RL from AI Feedback (RLAIF) is a larger set of techniques for using AI to augment or generate feedback data, including pairwise preferences [@lee2023rlaif][@sharma2024critical][@castricato2024suppressing].
There are many motivations to using RLAIF to either entirely replace human feedback or augment it.
AI models are far cheaper than humans, with a single piece of human preference data costing on the order of $1 or higher (or even above $10 per prompt), AI feedback with a frontier AI model, such as GPT-4o costs less than $0.01.
-This cost differences opens the market of experimentation with RLHF methods to an entire population of people previously priced out.
+This cost difference opens the market of experimentation with RLHF methods to an entire population of people previously priced out.
Other than price, AI feedback introduces different *tradeoffs* on performance than human feedback, which are still being investigated.
The peak performance for AI feedback is at least in the same ballpark of human data on skill-based evaluations, but it is not studied if human data allows finer control of the models in real-world product settings or for newer training methods such as character training.
@@ -28,10 +28,10 @@ Results in many academic results showing how one can substitute AI preference da
## Constitutional AI
-The method of Constitutional AI (CAI), which Anthropic uses extensively in their Claude models, is earliest, large-scale use of synthetic data for RLHF training.
+The method of Constitutional AI (CAI), which Anthropic uses extensively in their Claude models, is the earliest, large-scale use of synthetic data for RLHF training.
Constitutional AI has two uses of synthetic data:
-1. Critiques of instruction-tune data to follow a set of principles like “Is the answer encouraging violence” or “Is the answer truthful.” When the model generates answers to questions, it checks the answer against the list of principles in the constitution, refining the answer over time. Then, they fine-tune the model on this resulting dataset.
+1. Critiques of instruction-tuned data to follow a set of principles like “Is the answer encouraging violence” or “Is the answer truthful.” When the model generates answers to questions, it checks the answer against the list of principles in the constitution, refining the answer over time. Then, they fine-tune the model on this resulting dataset.
2. Generates pairwise preference data by using a language model to answer which completion was better, given the context of a random principle from the constitution (similar to this paper for principle-guided reward models). Then, RLHF proceeds as normal with synthetic data, hence the RLAIF name.
Largely, CAI is known for the second half above, the preference data, but the methods introduced for instruction data are used in general data filtering and synthetic data generation methods across post-training.
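As an illustration of the second use described above (RLAIF-style pairwise preference data), a minimal sketch might look like the following. The `ai_judge` function is a hypothetical stand-in for a call to a language model, and the two principles are taken from the examples in this hunk; none of this is Anthropic's implementation:

```python
# Sketch of CAI-style synthetic preference labeling: pick a random principle from the
# constitution and ask an AI judge which completion better satisfies it.
import random

CONSTITUTION = [
    "Is the answer encouraging violence?",
    "Is the answer truthful?",
]

def ai_judge(principle: str, prompt: str, completion_a: str, completion_b: str) -> str:
    # Placeholder: a real implementation would prompt a language model with the
    # principle and both completions, then parse its preferred choice ("A" or "B").
    return random.choice(["A", "B"])

def label_preference(prompt: str, completion_a: str, completion_b: str) -> dict:
    principle = random.choice(CONSTITUTION)
    choice = ai_judge(principle, prompt, completion_a, completion_b)
    chosen, rejected = (completion_a, completion_b) if choice == "A" else (completion_b, completion_a)
    return {"prompt": prompt, "principle": principle, "chosen": chosen, "rejected": rejected}

print(label_preference("How should I handle a disagreement?", "Completion A ...", "Completion B ..."))
```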