This is a work-in-progress textbook covering the fundamentals of Reinforcement Learning from Human Feedback (RLHF).
-The code is licensed with the MIT license, but the content for the book found in `chapters/` is licensed under the [Creative Commons Non-Commerical Attribution License](https://creativecommons.org/licenses/by-nc/4.0/deed.en), CC BY-NC 4.0.
+The code is licensed with the MIT license, but the content for the book found in `chapters/` is licensed under the [Creative Commons Non-Commercial Attribution License](https://creativecommons.org/licenses/by-nc/4.0/deed.en), CC BY-NC 4.0.
This is meant for people with a basic ML and/or software background.
### Citation
@@ -197,7 +197,7 @@ For more information, check the [Second] section.
...
```
-Or, with al alternative name:
+Or, with an alternative name:
```md
For more information, check [this](#second) section.
@@ -397,7 +397,7 @@ custom styles, etc, and modify the Makefile file accordingly.
Output files are generated using [pandoc templates](https://pandoc.org/MANUAL.html#templates). All
templates are located under the `templates/` folder, and may be modified as you will. Some basic
-format templates are already included on this repository, ion case you need something to start
+format templates are already included on this repository, in case you need something to start
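As a minimal sketch of the template workflow this hunk describes (the chapter, template, and output paths below are illustrative only; the repository's Makefile remains the real entry point), a single chapter can be rendered against a custom pandoc template like so:

```python
import subprocess

# Render one chapter with a custom pandoc template.
# The template and output file names are hypothetical; point them at
# whatever actually lives in templates/ and your build directory.
subprocess.run(
    [
        "pandoc",
        "chapters/01-introduction.md",
        "--from", "markdown",
        "--to", "html",
        "--template", "templates/html.template",
        "-o", "build/01-introduction.html",
    ],
    check=True,
)
```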
`chapters/01-introduction.md` (1 addition, 1 deletion)
@@ -27,7 +27,7 @@ Post-training is a more complete set of techniques and best-practices to make la
Post-training can be summarized as using three optimization methods:
1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for base of instruction following abilities. This is largely about learning *features* in language.
-2. Preference Finetuning (PreFT),where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
+2. Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time). This is largely about *style* of language and subtle human preferences that are hard to quantify.
3. Reinforcement Finetuning (RFT). The newest type of post-training that boosts performance on verifiable domains.
This book focuses on the second area, **preference finetuning**, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning.
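To make the three objectives listed in this hunk concrete, here is a rough sketch of the loss each one optimizes; the toy tensors and the DPO-style direct form of preference finetuning are assumptions for illustration, not the book's reference implementation:

```python
import torch
import torch.nn.functional as F

vocab, batch, seq = 32, 4, 10
logits = torch.randn(batch, seq, vocab)          # policy logits over demonstration tokens
targets = torch.randint(0, vocab, (batch, seq))  # demonstration (instruction) tokens

# 1. Instruction / supervised finetuning: next-token cross-entropy on demonstrations.
sft_loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# 2. Preference finetuning (shown in its DPO-style direct form): widen the implicit
#    reward margin between chosen and rejected responses via a logistic loss.
beta = 0.1
chosen_logratio = torch.randn(batch)    # log pi(y_w|x) - log pi_ref(y_w|x), summed over tokens
rejected_logratio = torch.randn(batch)  # log pi(y_l|x) - log pi_ref(y_l|x)
pref_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# 3. Reinforcement finetuning: REINFORCE-style policy gradient against a
#    verifiable 0/1 reward, e.g. an automatic answer checker.
sample_logprob = torch.randn(batch)                 # log-prob of each sampled completion
verifiable_reward = torch.tensor([1., 0., 1., 1.])  # did the completion verify?
rft_loss = -(verifiable_reward * sample_logprob).mean()
```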
`chapters/12-direct-alignment.md` (4 additions, 4 deletions)
@@ -11,7 +11,7 @@ next-url: "13-cai.html"
Direct Alignment Algorithms (DAAs) allow one to update models to solve the same RLHF objective without ever training an intermediate reward model or using reinforcement learning optimizers.
The most prominent DAA and one that catalyzed an entire academic movement of aligning language models is Direct Preference Optimization (DPO) [@rafailov2024direct].
At its core, DPO is using gradient ascent to solve the same constrained RLHF objective.
-Since its release in May of 2023, after a brief delay where the community figured out the right data and hyperparameters to use DPO with (specifically, surprisingly low learning rates), many popular models have used DPO or its variants, from Zephyr-$\beta$ kickstarting it in October of 2024[@tunstall2023zephyr], Llama 3 Instruct [@dubey2024llama], Tülu 2 [@ivison2023camels] and 3 [@lambert2024t], Nemotron 4 340B [@adler2024nemotron], and others.
+Since its release in May of 2023, after a brief delay where the community figured out the right data and hyperparameters to use DPO with (specifically, surprisingly low learning rates), many popular models have used DPO or its variants, from Zephyr-$\beta$ kickstarting it in October of 2023[@tunstall2023zephyr], Llama 3 Instruct [@dubey2024llama], Tülu 2 [@ivison2023camels] and 3 [@lambert2024t], Nemotron 4 340B [@adler2024nemotron], and others.
Technically, Sequence Likelihood Calibration (SLiC-HF) was released first [@zhao2023slic], but it did not catch on due to a combination of luck and effectiveness.
The most impactful part of DPO and DAAs is lowering the barrier of entry to experimenting with language model post-training.
@@ -32,7 +32,7 @@ This relies on the implicit reward for DPO training that replaces using an exter
-This comes from deriving the Bradley-Terry reward with respect to an optimal policy (shown in @eq:dpo_opt_policy), as shown in TODO BT model.
+This comes from deriving the Bradley-Terry reward with respect to an optimal policy (shown in @eq:dpo_opt_policy), as shown in the Bradley-Terry model section.
Essentially, the implicit reward model shows "the probability of human preference data in terms of the optimal policy rather than the reward model."
By decomposing the exponential expressions from $e^{a+b}$ to $e^a e^b$ and then cancelling out the terms $e^{\log(Z(x))}$, this simplifies to:
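The equation this hunk leads into sits outside the diff, but for reference, the cancellation being described follows the standard DPO derivation (written here in the DPO paper's notation, which may differ cosmetically from the book's). The implicit reward with respect to the optimal policy is

$$ r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x), $$

and substituting it into the Bradley-Terry probability lets the partition-function factors cancel from numerator and denominator:

$$ p(y_w \succ y_l \mid x) = \frac{e^{r(x, y_w)}}{e^{r(x, y_w)} + e^{r(x, y_l)}} = \sigma\!\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right). $$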
@@ -201,7 +201,7 @@ Some variants to DPO attempt to either improve the learning signal by making sma
{#fig:dpo_issue .center}
One of the core issues *apparent* in DPO is that the optimization drives only to increase the margin between the probability of the chosen and rejected responses.
-Numerically, the model reduces the probabiltiy of both the chosen and rejected responses, but the *rejected response is reduced by a greater extent* as shown in @fig:dpo_issue.
+Numerically, the model reduces the probability of both the chosen and rejected responses, but the *rejected response is reduced by a greater extent* as shown in @fig:dpo_issue.
Intuitively, it is not clear how this generalizes, but work has posited that it increases the probability of unaddressed for behaviors [@razin2024unintentional] [@ren2024learning].
Simple methods, such as Cal-DPO [@xiao2024cal], adjust the optimization so that this **preference displacement** does not occur.
In practice, the exact impact of this is not well known, but points are a potential reason why online methods can outperform vanilla DPO.
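As a companion to this hunk, a minimal sketch of the vanilla DPO loss shows why the displacement can happen: only the margin between the two implicit rewards enters the objective, so nothing anchors the absolute log-probability of either response (function and argument names below are illustrative, not the book's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Vanilla DPO on summed per-sequence log-probs (illustrative sketch)."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_reward - rejected_reward
    # The loss only pushes the margin up; both log-probs can fall together as
    # long as the rejected one falls faster, which is the displacement above.
    loss = -F.logsigmoid(margin).mean()
    return loss, chosen_reward.detach(), rejected_reward.detach()
```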