chapters/07-reward-models.md
58 lines changed: 58 additions & 0 deletions
@@ -26,6 +26,8 @@ Later in this section we also compare these to Outcome Reward Models (ORMs), Pro
*Throughout this chapter, we use $x$ to denote prompts and $y$ to denote completions. This notation is common in the language model literature, where methods operate on full prompt-completion pairs rather than individual tokens.*
{#fig:rm-role-in-rlhf}
## Training Reward Models
The canonical implementation of a reward model is derived from the Bradley-Terry model of preference [@BradleyTerry].
These are equivalent by letting $\Delta = r_{\theta}(y_c \mid x) - r_{\theta}(y_r \mid x)$ and using $\sigma(\Delta) = \frac{1}{1 + e^{-\Delta}}$, which implies $-\log\sigma(\Delta) = \log(1 + e^{-\Delta}) = \log\left(1 + e^{r_{\theta}(y_r \mid x) - r_{\theta}(y_c \mid x)}\right)$.
They both appear in the RLHF literature.
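
A minimal PyTorch sketch of this pairwise loss, assuming `chosen` and `rejected` hold the scalar rewards $r_{\theta}(y_c \mid x)$ and $r_{\theta}(y_r \mid x)$ for a batch of preference pairs (variable names and values are illustrative, not from the chapter):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected) == log(1 + exp(r_rejected - r_chosen))
    return -F.logsigmoid(chosen - rejected).mean()

# Illustrative scalar rewards for three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.8, -0.5])
loss = bradley_terry_loss(chosen, rejected)
```
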
{#fig:pref_rm_training}
## Architecture
The most common way reward models are implemented is through an abstraction similar to Transformers' `AutoModelForSequenceClassification`, which appends a small linear head to the language model that performs classification between two outcomes -- chosen and rejected.
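
A hedged sketch of this setup with Hugging Face Transformers, using a single-logit head (`num_labels=1`); the base model name is a placeholder, and pooling details vary by model class:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "example-org/base-llm"  # placeholder base model name
tokenizer = AutoTokenizer.from_pretrained(base)
# A single output logit acts as the scalar reward head on top of the LM,
# typically pooled from the final token's hidden state.
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

inputs = tokenizer("PROMPT TEXT" + "COMPLETION TEXT", return_tensors="pt")
reward = reward_model(**inputs).logits[0, 0]  # scalar score for this prompt-completion pair
```
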
@@ -287,6 +291,10 @@ The important intuition here is that an ORM will output a probability of correct
This can be a noisy process, as the updates and loss propagate per token depending on outcomes and attention mappings.
<!-- On the other hand, this process is more computationally intensive. [@cobbe2021gsm8k] posits a few potential benefits to these models, such as (1) implementation of ORMs often being done with both the standard next-token language modelling loss and the reward modelling loss above in @eq:orm_loss and (2) the ORM design as a token-level loss outperforms completion-level loss calculation used in standard RMs. -->
{#fig:orm_inference}

{#fig:orm_training}
These models have remained in use, but are less supported in open-source RLHF tools.
For example, the same type of ORM was used in the seminal work *Let's Verify Step by Step* [@lightman2023let], but without the language modeling prediction piece of the loss.
Then, the final loss is a cross-entropy loss on every token predicting whether the final answer is correct.
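
A rough sketch of this per-token loss, assuming a single outcome label per completion that is broadcast to every completion token (tensor names, shapes, and the masking scheme are assumptions rather than a reference implementation):

```python
import torch
import torch.nn.functional as F

def orm_loss(per_token_logits: torch.Tensor,
             is_correct: torch.Tensor,
             completion_mask: torch.Tensor) -> torch.Tensor:
    """Per-token cross-entropy against a broadcast outcome label.

    per_token_logits: [batch, seq_len] raw scores from the ORM head
    is_correct:       [batch] 0/1 outcome label for each completion
    completion_mask:  [batch, seq_len] 1.0 on completion tokens, 0.0 elsewhere
    """
    targets = is_correct[:, None].expand_as(per_token_logits).float()
    loss = F.binary_cross_entropy_with_logits(per_token_logits, targets, reduction="none")
    return (loss * completion_mask).sum() / completion_mask.sum()
```
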
@@ -323,6 +331,8 @@ Traditionally PRMs are trained with a language modeling head that outputs a toke
These predictions tend to be -1 for incorrect, 0 for neutral, and 1 for correct.
These labels do not necessarily indicate whether the model is on the right path overall, but whether the individual step is correct.
{#fig:prm_training_inference}
An example construction of a PRM is shown below.
```python
@@ -394,6 +404,54 @@ Some notes, given the above table has a lot of edge cases.
- In both preference tuning and reasoning training, value functions often use a discount factor of 1, which makes the value function even closer to an outcome reward model, but with a different training loss.
- A process reward model can be supervised by doing rollouts from an intermediate state and collecting outcome data. This blends multiple ideas, but if the *loss* is computed on per-reasoning-step labels, it is best referred to as a PRM.
**ORM vs. Value Function: The key distinction.**
ORMs and value functions can appear similar since both produce per-token outputs with the same head architecture, but they differ in *what they predict* and *where targets come from*:
- **ORMs** predict an immediate, token-local quantity: $p(\text{correct}_t)$ or $r_t$. Targets come from *offline labels* (a verifier or dataset marking tokens/sequences as correct or incorrect).
- **Value functions** predict the expected *remaining* return: $V(s_t) = \mathbb{E}[\sum_{k \geq t} \gamma^{k-t} r_k \mid s_t]$. Targets are typically *computed from on-policy rollouts* under the current policy $\pi_\theta$, and change as the policy changes (technically, value functions can also be off-policy, but this is not established for work in language modeling).
If you define a dense token reward $r_t = \mathbb{1}[\text{token is correct}]$ and use $\gamma = 1$, then an ORM is learning $r_t$ (or $p(r_t = 1)$) while the value head is learning the remaining-sum $\sum_{k \geq t} r_k$.
They can share the same base model and head dimensions, but the *semantics and supervision pipeline* differ: ORMs are trained offline from fixed labels, while value functions are trained on-policy and used to compute advantages $A_t = \hat{R}_t - V_t$ for policy gradients.
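
A toy numerical illustration of the difference in targets, with hypothetical per-token labels; in practice value targets are computed from on-policy rollouts (often with bootstrapping or GAE) rather than from a fixed label vector:

```python
import torch

# Hypothetical per-token correctness labels for one completion (offline ORM supervision).
r = torch.tensor([0., 1., 1., 0., 1.])

orm_targets = r  # the ORM head is trained against the token-local labels themselves

# Value targets with gamma = 1: the remaining sum of rewards from each position,
# i.e. target_t = sum_{k >= t} r_k.
value_targets = torch.flip(torch.cumsum(torch.flip(r, dims=[0]), dim=0), dims=[0])
# value_targets == tensor([3., 3., 2., 1., 1.])
```
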
### Inference Differences
The models handle data differently at inference time, i.e., once they have been trained, in order to support the suite of tasks that RMs are used for.
**Bradley-Terry RM (Preference Model):**
- *Input:* prompt $x$ + candidate completion $y$
- *Output:* single scalar $r_\theta(x, y)$ from EOS hidden state
- *Usage:* rerank $k$ completions and pick the top-1 (best-of-N sampling, as sketched after this list); or provide a terminal reward for RLHF
- *Aggregation:* Not needed with scalar outputs
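
A minimal best-of-N reranking sketch under the assumptions above (a sequence-classification style RM with a single logit and its tokenizer); the function and variable names are illustrative:

```python
import torch

def best_of_n(reward_model, tokenizer, prompt: str, completions: list[str]) -> str:
    """Score each candidate with the scalar RM and return the highest-scoring one."""
    scores = []
    for y in completions:
        inputs = tokenizer(prompt + y, return_tensors="pt")
        with torch.no_grad():
            scores.append(reward_model(**inputs).logits[0, 0].item())
    return completions[int(torch.tensor(scores).argmax())]
```
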
**Outcome RM:**
- *Input:* prompt $x$ + completion $y$
- *Output:* per-token probabilities $p_t \approx P(\text{correct at token } t)$ over completion tokens
- *Usage:* score finished candidates; aggregate via mean, min (tail risk), or the product of probabilities (equivalently $\sum_t \log p_t$), as sketched after this list
- *Aggregation choices:* mean correctness, minimum $p_t$, average over last $m$ tokens, or threshold flagging if any $p_t < \tau$
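
A small sketch of these aggregation choices over hypothetical per-token probabilities:

```python
import torch

# Per-token correctness probabilities from the ORM for one candidate (hypothetical values).
p = torch.tensor([0.9, 0.8, 0.95, 0.6])

mean_score = p.mean()            # average correctness
min_score = p.min()              # tail risk: worst token
logprod_score = p.log().sum()    # product of probabilities, in log space
flagged = bool((p < 0.7).any())  # threshold flagging with tau = 0.7
```
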
**Process RM:**
- *Input:* prompt $x$ + reasoning trace with step boundaries
- *Output:* scores at step boundaries (e.g., class logits for correct/neutral/incorrect)
- *Usage:* score completed chain-of-thought; or guide search/decoding by pruning low-scoring branches
- *Aggregation:* over steps (not tokens) -- mean step score, minimum (fail-fast), or weighted sum favoring later steps
**Value Function:**
- *Input:* prompt $x$ + current prefix $y_{\leq t}$ (a state)
- *Output:* $V_t$ at each token position in the completion (expected remaining return from state $t$)
- *Usage:* compute per-token advantages $A_t = \hat{R}_t - V_t$ during RL training; the values at each step serve as baselines
- *Aggregation:* typically take $V$ at the last generated token; interpretation differs from "probability of correctness"
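
A toy sketch of this advantage computation with hypothetical returns and value predictions, using plain Monte Carlo returns and $\gamma = 1$ (GAE, discussed in the policy gradients chapter, is the common refinement):

```python
import torch

# Hypothetical per-token returns (Monte Carlo, gamma = 1) and value-head predictions.
returns = torch.tensor([3.0, 3.0, 2.0, 1.0, 1.0])  # R_hat_t
values = torch.tensor([2.5, 2.8, 2.2, 0.9, 0.7])   # V_t

advantages = returns - values  # A_t = R_hat_t - V_t, used to weight the policy gradient
```
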
In summary, the way to understand the different models is:
- **RM:** "How good is this whole answer?" → scalar value
- **ORM:** "Which parts look correct?" → per-token correctness
- **PRM:** "Are the reasoning steps sound?" → per-step scores
- **Value:** "How much reward remains from here?" → baseline for RL advantages
## Generative Reward Modeling
Given the cost of preference data, a large research area emerged that uses existing language models as a judge of human preferences or in other evaluation settings [@zheng2023judging].
chapters/11-policy-gradients.md
2 lines changed: 2 additions & 0 deletions
@@ -418,6 +418,8 @@ Generalized Advantage Estimation (GAE) is considered the state-of-the-art and ca
A value function can also be learned with Monte Carlo estimates from the rollouts used to update the policy.
PPO has two losses -- one to learn the value function and another to use that value function to update the policy.
{#fig:value_fn_training}
A simple example implementation of a value network loss is shown below.
This directory contains source files for generating diagrams. These are **mockups** intended for iteration with coding tools, to be refined by a professional artist.
## Directory Structure
```
diagrams/
├── specs/ # YAML specifications for each diagram type