This is a work-in-progress textbook covering the fundamentals of Reinforcement Learning from Human Feedback (RLHF).
-The code is licensed with the MIT license, but the content for the book found in `chapters/` is licensed under the [Creative Commons Non-Commerical Attribution License](https://creativecommons.org/licenses/by-nc/4.0/deed.en), CC BY-NC 4.0.
+The code is licensed with the MIT license, but the content for the book found in `chapters/` is licensed under the [Creative Commons Non-Commercial Attribution License](https://creativecommons.org/licenses/by-nc/4.0/deed.en), CC BY-NC 4.0.
This is meant for people with a basic ML and/or software background.
### Citation
@@ -197,7 +197,7 @@ For more information, check the [Second] section.
...
```
-Or, with al alternative name:
+Or, with an alternative name:
```md
For more information, check [this](#second) section.
@@ -397,7 +397,7 @@ custom styles, etc, and modify the Makefile file accordingly.
Output files are generated using [pandoc templates](https://pandoc.org/MANUAL.html#templates). All
templates are located under the `templates/` folder, and may be modified as you will. Some basic
-format templates are already included on this repository, ion case you need something to start
+format templates are already included on this repository, in case you need something to start
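As a minimal sketch of the template workflow this hunk describes (the chapter, template, and output paths below are illustrative only; the repository's Makefile remains the real entry point), a single chapter can be rendered against a custom pandoc template like so:

```python
import subprocess

# Render one chapter with a custom pandoc template.
# The template and output file names are hypothetical; point them at
# whatever actually lives in templates/ and your build directory.
subprocess.run(
    [
        "pandoc",
        "chapters/01-introduction.md",
        "--from", "markdown",
        "--to", "html",
        "--template", "templates/html.template",
        "-o", "build/01-introduction.html",
    ],
    check=True,
)
```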
`chapters/01-introduction.md` (1 addition, 1 deletion)
@@ -27,7 +27,7 @@ Post-training is a more complete set of techniques and best-practices to make la
Post-training can be summarized as using three optimization methods:
1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for base of instruction following abilities. This is largely about learning *features* in language.
-2. Preference Finetuning (PreFT),where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
+2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
3. Reinforcement Finetuning (RFT). The newest type of post-training that boosts performance on verifiable domains.
This book focuses on the second area, **preference finetuning**, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning.
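To make the three objectives listed in this hunk concrete, here is a rough sketch of the loss each one optimizes; the toy tensors and the DPO-style direct form of preference finetuning are assumptions for illustration, not the book's reference implementation:

```python
import torch
import torch.nn.functional as F

vocab, batch, seq = 32, 4, 10
logits = torch.randn(batch, seq, vocab)          # policy logits over demonstration tokens
targets = torch.randint(0, vocab, (batch, seq))  # demonstration (instruction) tokens

# 1. Instruction / supervised finetuning: next-token cross-entropy on demonstrations.
sft_loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# 2. Preference finetuning (shown in its DPO-style direct form): widen the implicit
#    reward margin between chosen and rejected responses via a logistic loss.
beta = 0.1
chosen_logratio = torch.randn(batch)    # log pi(y_w|x) - log pi_ref(y_w|x), summed over tokens
rejected_logratio = torch.randn(batch)  # log pi(y_l|x) - log pi_ref(y_l|x)
pref_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# 3. Reinforcement finetuning: REINFORCE-style policy gradient against a
#    verifiable 0/1 reward, e.g. an automatic answer checker.
sample_logprob = torch.randn(batch)                 # log-prob of each sampled completion
verifiable_reward = torch.tensor([1., 0., 1., 1.])  # did the completion verify?
rft_loss = -(verifiable_reward * sample_logprob).mean()
```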
`chapters/12-direct-alignment.md` (4 additions, 4 deletions)
@@ -11,7 +11,7 @@ next-url: "13-cai.html"
Direct Alignment Algorithms (DAAs) allow one to update models to solve the same RLHF objective without ever training an intermediate reward model or using reinforcement learning optimizers.
The most prominent DAA and one that catalyzed an entire academic movement of aligning language models is Direct Preference Optimization (DPO) [@rafailov2024direct].
At its core, DPO is using gradient ascent to solve the same constrained RLHF objective.
-Since its release in May of 2023, after a brief delay where the community figured out the right data and hyperparameters to use DPO with (specifically, surprisingly low learning rates), many popular models have used DPO or its variants, from Zephyr-$\beta$ kickstarting it in October of 2024[@tunstall2023zephyr], Llama 3 Instruct [@dubey2024llama], Tülu 2 [@ivison2023camels] and 3 [@lambert2024t], Nemotron 4 340B [@adler2024nemotron], and others.
+Since its release in May of 2023, after a brief delay where the community figured out the right data and hyperparameters to use DPO with (specifically, surprisingly low learning rates), many popular models have used DPO or its variants, from Zephyr-$\beta$ kickstarting it in October of 2023[@tunstall2023zephyr], Llama 3 Instruct [@dubey2024llama], Tülu 2 [@ivison2023camels] and 3 [@lambert2024t], Nemotron 4 340B [@adler2024nemotron], and others.
Technically, Sequence Likelihood Calibration (SLiC-HF) was released first [@zhao2023slic], but it did not catch on due to a combination of luck and effectiveness.
The most impactful part of DPO and DAAs is lowering the barrier of entry to experimenting with language model post-training.
@@ -32,7 +32,7 @@ This relies on the implicit reward for DPO training that replaces using an exter
-This comes from deriving the Bradley-Terry reward with respect to an optimal policy (shown in @eq:dpo_opt_policy), as shown in TODO BT model.
+This comes from deriving the Bradley-Terry reward with respect to an optimal policy (shown in @eq:dpo_opt_policy), as shown in the Bradley-Terry model section.
Essentially, the implicit reward model shows "the probability of human preference data in terms of the optimal policy rather than the reward model."
By decomposing the exponential expressions from $e^{a+b}$ to $e^a e^b$ and then cancelling out the terms $e^{\log(Z(x))}$, this simplifies to:
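The equation this hunk leads into sits outside the diff, but for reference, the cancellation being described follows the standard DPO derivation (written here in the DPO paper's notation, which may differ cosmetically from the book's). The implicit reward with respect to the optimal policy is

$$ r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x), $$

and substituting it into the Bradley-Terry probability lets the partition-function factors cancel from numerator and denominator:

$$ p(y_w \succ y_l \mid x) = \frac{e^{r(x, y_w)}}{e^{r(x, y_w)} + e^{r(x, y_l)}} = \sigma\!\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right). $$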
@@ -201,7 +201,7 @@ Some variants to DPO attempt to either improve the learning signal by making sma
{#fig:dpo_issue .center}
One of the core issues *apparent* in DPO is that the optimization drives only to increase the margin between the probability of the chosen and rejected responses.
-Numerically, the model reduces the probabiltiy of both the chosen and rejected responses, but the *rejected response is reduced by a greater extent* as shown in @fig:dpo_issue.
+Numerically, the model reduces the probability of both the chosen and rejected responses, but the *rejected response is reduced by a greater extent* as shown in @fig:dpo_issue.
Intuitively, it is not clear how this generalizes, but work has posited that it increases the probability of unaddressed for behaviors [@razin2024unintentional] [@ren2024learning].
Simple methods, such as Cal-DPO [@xiao2024cal], adjust the optimization so that this **preference displacement** does not occur.
In practice, the exact impact of this is not well known, but points are a potential reason why online methods can outperform vanilla DPO.
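As a companion to this hunk, a minimal sketch of the vanilla DPO loss shows why the displacement can happen: only the margin between the two implicit rewards enters the objective, so nothing anchors the absolute log-probability of either response (function and argument names below are illustrative, not the book's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Vanilla DPO on summed per-sequence log-probs (illustrative sketch)."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_reward - rejected_reward
    # The loss only pushes the margin up; both log-probs can fall together as
    # long as the rejected one falls faster, which is the displacement above.
    loss = -F.logsigmoid(margin).mean()
    return loss, chosen_reward.detach(), rejected_reward.detach()
```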