
Commit d90f639

natolambert and Claude authored
Fix typos across repository (#70)
Co-authored-by: Claude <noreply@anthropic.com>
1 parent d244030 commit d90f639

File tree

- README.md
- chapters/01-introduction.md
- chapters/06-preference-data.md
- chapters/12-direct-alignment.md

4 files changed: +8 -9 lines


README.md

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ Built on [**Pandoc book template**](https://github.com/wikiti/pandoc-book-templa
[![Content License](https://img.shields.io/badge/license-CC--BY--NC--SA--4.0-lightgrey)](https://github.com/natolambert/rlhf-book/blob/main/LICENSE-Content.md)

This is a work-in-progress textbook covering the fundamentals of Reinforcement Learning from Human Feedback (RLHF).
-The code is licensed with the MIT license, but the content for the book found in `chapters/` is licensed under the [Creative Commons Non-Commerical Attribution License](https://creativecommons.org/licenses/by-nc/4.0/deed.en), CC BY-NC 4.0.
+The code is licensed with the MIT license, but the content for the book found in `chapters/` is licensed under the [Creative Commons Non-Commercial Attribution License](https://creativecommons.org/licenses/by-nc/4.0/deed.en), CC BY-NC 4.0.
This is meant for people with a basic ML and/or software background.

### Citation
@@ -197,7 +197,7 @@ For more information, check the [Second] section.
...
```

-Or, with al alternative name:
+Or, with an alternative name:

```md
For more information, check [this](#second) section.
@@ -397,7 +397,7 @@ custom styles, etc, and modify the Makefile file accordingly.

Output files are generated using [pandoc templates](https://pandoc.org/MANUAL.html#templates). All
templates are located under the `templates/` folder, and may be modified as you will. Some basic
-format templates are already included on this repository, ion case you need something to start
+format templates are already included on this repository, in case you need something to start
with.

## References

chapters/01-introduction.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ Post-training is a more complete set of techniques and best-practices to make la
Post-training can be summarized as using three optimization methods:

1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for base of instruction following abilities. This is largely about learning *features* in language.
-2. Preference Finetuning (PreFT),where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
+2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
3. Reinforcement Finetuning (RFT). The newest type of post-training that boosts performance on verifiable domains.

This book focuses on the second area, **preference finetuning**, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning.

chapters/06-preference-data.md

Lines changed: 0 additions & 1 deletion
@@ -75,7 +75,6 @@ Table: An example 5-wise Likert scale between two responses, A and B. {#tbl:like
Some early RLHF for language modeling works uses an 8-step Likert scale with levels of preference between the two responses [@bai2022training].
An even scale removes the possibility of ties:

-Here's a markdown table formatted as an 8-point Likert scale:

| A$>>>$B | | | A$>$B | B$>$A | | | B$>>>$A |
|:-------:|:-----:|:-----:|:-----:|:------:|:-----:|:-----:|:-------:|

chapters/12-direct-alignment.md

Lines changed: 4 additions & 4 deletions
@@ -11,7 +11,7 @@ next-url: "13-cai.html"
Direct Alignment Algorithms (DAAs) allow one to update models to solve the same RLHF objective without ever training an intermediate reward model or using reinforcement learning optimizers.
The most prominent DAA and one that catalyzed an entire academic movement of aligning language models is Direct Preference Optimization (DPO) [@rafailov2024direct].
At its core, DPO is using gradient ascent to solve the same constrained RLHF objective.
-Since its release in May of 2023, after a brief delay where the community figured out the right data and hyperparameters to use DPO with (specifically, surprisingly low learning rates), many popular models have used DPO or its variants, from Zephyr-$\beta$ kickstarting it in October of 2024 [@tunstall2023zephyr], Llama 3 Instruct [@dubey2024llama], Tülu 2 [@ivison2023camels] and 3 [@lambert2024t], Nemotron 4 340B [@adler2024nemotron], and others.
+Since its release in May of 2023, after a brief delay where the community figured out the right data and hyperparameters to use DPO with (specifically, surprisingly low learning rates), many popular models have used DPO or its variants, from Zephyr-$\beta$ kickstarting it in October of 2023 [@tunstall2023zephyr], Llama 3 Instruct [@dubey2024llama], Tülu 2 [@ivison2023camels] and 3 [@lambert2024t], Nemotron 4 340B [@adler2024nemotron], and others.
Technically, Sequence Likelihood Calibration (SLiC-HF) was released first [@zhao2023slic], but it did not catch on due to a combination of luck and effectiveness.

The most impactful part of DPO and DAAs is lowering the barrier of entry to experimenting with language model post-training.
@@ -32,7 +32,7 @@ This relies on the implicit reward for DPO training that replaces using an exter

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$ {#eq:dpo_reward}

-This comes from deriving the Bradley-Terry reward with respect to an optimal policy (shown in @eq:dpo_opt_policy), as shown in TODO BT model.
+This comes from deriving the Bradley-Terry reward with respect to an optimal policy (shown in @eq:dpo_opt_policy), as shown in the Bradley-Terry model section.
Essentially, the implicit reward model shows "the probability of human preference data in terms of the optimal policy rather than the reward model."

Let us consider the loss shown in @eq:dpo_core.
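To make the implicit reward in the hunk above concrete, here is a minimal PyTorch-style sketch (illustrative only, not code from the book or this commit) that computes $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ from summed per-token log-probabilities and feeds the chosen/rejected difference into the standard DPO loss. The tensor names and the default `beta` are assumptions.

```python
# Minimal sketch of the DPO implicit reward and loss (illustrative, not the book's code).
# Inputs are per-sequence sums of token log-probabilities, shape (batch,),
# from the trained policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_implicit_rewards_and_loss(
    policy_logps_chosen: torch.Tensor,
    policy_logps_rejected: torch.Tensor,
    ref_logps_chosen: torch.Tensor,
    ref_logps_rejected: torch.Tensor,
    beta: float = 0.1,  # assumed value; commonly around 0.1 in DPO setups
):
    # Implicit rewards: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    reward_chosen = beta * (policy_logps_chosen - ref_logps_chosen)
    reward_rejected = beta * (policy_logps_rejected - ref_logps_rejected)
    # DPO loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return reward_chosen, reward_rejected, loss
```

Note that the loss depends only on the margin between the two implicit rewards, which is the property the preference-displacement discussion later in this diff hinges on.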
@@ -145,7 +145,7 @@ $$r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \b
We then can substitute the reward into the Bradley-Terry equation shown in @eq:bradley_terry_dpo to obtain:

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x)\right)}
-{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x)\right) + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} + \beta \log Z(x)\right)} $$ {#eq:eq:dpo_loss_deriv0}
+{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x)\right) + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} + \beta \log Z(x)\right)} $$ {#eq:dpo_loss_deriv0}

By decomposing the exponential expressions from $e^{a+b}$ to $e^a e^b$ and then cancelling out the terms $e^{\log(Z(x))}$, this simplifies to:
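For reference, the cancellation described in the final context line above (the continuation falls outside this hunk) yields the familiar sigmoid form of the Bradley-Terry probability; this is a reconstruction of that step, and the equation label the book uses is not visible in the diff:

$$p^*(y_1 \succ y_2 \mid x) = \sigma\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right),$$

where $\sigma(z) = 1/(1+e^{-z})$ is the logistic function.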
@@ -201,7 +201,7 @@ Some variants to DPO attempt to either improve the learning signal by making sma
![Sketch of preference displacement in DPO.](images/dpo_displacement.png){#fig:dpo_issue .center}

One of the core issues *apparent* in DPO is that the optimization drives only to increase the margin between the probability of the chosen and rejected responses.
-Numerically, the model reduces the probabiltiy of both the chosen and rejected responses, but the *rejected response is reduced by a greater extent* as shown in @fig:dpo_issue.
+Numerically, the model reduces the probability of both the chosen and rejected responses, but the *rejected response is reduced by a greater extent* as shown in @fig:dpo_issue.
Intuitively, it is not clear how this generalizes, but work has posited that it increases the probability of unaddressed for behaviors [@razin2024unintentional] [@ren2024learning].
Simple methods, such as Cal-DPO [@xiao2024cal], adjust the optimization so that this **preference displacement** does not occur.
In practice, the exact impact of this is not well known, but points are a potential reason why online methods can outperform vanilla DPO.
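To illustrate the displacement effect described in this hunk, here is a tiny numeric sketch with hypothetical log-probabilities (not figures from the book): both the chosen and rejected sequence log-probabilities fall relative to initialization, yet the implicit-reward margin grows, so the DPO loss still decreases.

```python
# Toy illustration of preference displacement (hypothetical numbers).
# Both chosen and rejected log-probs drop, but the rejected one drops more,
# so the DPO margin grows and the loss still goes down.
import math

BETA = 0.1
REF_CHOSEN, REF_REJECTED = -50.0, -50.0          # reference log-probs (assumed)
before = {"chosen": -50.0, "rejected": -50.0}    # policy log-probs at initialization
after = {"chosen": -55.0, "rejected": -70.0}     # both lower; rejected lowered far more

def dpo_loss(policy_logps):
    margin = BETA * ((policy_logps["chosen"] - REF_CHOSEN)
                     - (policy_logps["rejected"] - REF_REJECTED))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

print(dpo_loss(before))  # ~0.693: zero margin at the start
print(dpo_loss(after))   # ~0.201: loss improves even though both log-probs decreased
```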
