
Commit 0075120

natolambert and claude authored
Add equation labels to chapter 11 for consistent numbering (#217)
Co-authored-by: Claude Opus 4.5 <[email protected]>
1 parent 5a1f43b commit 0075120

1 file changed: +10 -10 lines changed


chapters/11-policy-gradients.md

Lines changed: 10 additions & 10 deletions
@@ -61,13 +61,13 @@ In RLHF this typically means sampling prompts $x_i$ from a dataset and generatin
 
 $$
 \hat{J}(\theta) = \frac{1}{B}\sum_{i=1}^{B} R(x_i, y_i),
-$$
+$$ {#eq:empirical_batch_estimate}
 
 or, in an MDP view with per-step rewards,
 
 $$
 \hat{J}(\theta) = \frac{1}{B}\sum_{i=1}^{B} \sum_{t=0}^{T_i} \gamma^t r_{i,t}.
-$$
+$$ {#eq:empirical_mdp_estimate}
 
 The core of policy gradient algorithms is computing the gradient with respect to the finite-time expected return over the current policy.
 With this expected return, $J$, the parameter update can be computed as follows, where $\alpha$ is the learning rate:
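
For reference, the batch estimates labeled above (`eq:empirical_batch_estimate` and `eq:empirical_mdp_estimate`) can be computed with a short sketch like the one below. The function and variable names are illustrative only, not from the chapter:

```python
import torch

def empirical_return(rewards: list[torch.Tensor], gamma: float = 1.0) -> torch.Tensor:
    """Monte Carlo estimate of J(theta) from a batch of sampled trajectories.

    rewards[i] holds the per-step rewards r_{i,t} for sample i; in the
    bandit-style RLHF view each tensor has a single terminal reward R(x_i, y_i).
    """
    per_sample = []
    for r in rewards:
        discounts = gamma ** torch.arange(len(r), dtype=r.dtype)  # gamma^t
        per_sample.append((discounts * r).sum())                  # sum_t gamma^t r_{i,t}
    return torch.stack(per_sample).mean()                         # average over the batch B
```
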
@@ -120,7 +120,7 @@ Back to the derivation, expanding the log probability of the trajectory:
 
 $$
 \log p_\theta (\tau) = \log p(s_0) + \sum_{t=0}^\infty \log \pi_\theta(a_t|s_t) + \sum_{t=0}^\infty \log p(s_{t+1}|s_t, a_t)
-$$
+$$ {#eq:trajectory_log_prob}
 
 Now, if we take the gradient of the above, we get:
 
@@ -131,12 +131,12 @@ Now, if we take the gradient of the above, we get:
 Therefore, the gradient of the log probability of the trajectory simplifies to:
 $$
 \nabla_\theta \log p_\theta (\tau) = \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t)
-$$
+$$ {#eq:trajectory_log_grad}
 
 Substituting this back in @eq:policy_gradient_expectation, we get:
 $$
 \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right]
-$$
+$$ {#eq:policy_gradient_returns}
 
 Quite often, people use a more general formulation of the policy gradient:
 $$
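
The returns form labeled above (`eq:policy_gradient_returns`) is the classic REINFORCE estimator. A minimal sketch of the corresponding surrogate loss, with placeholder tensor names rather than the chapter's code:

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient matches E[ sum_t grad log pi(a_t|s_t) * R(tau) ].

    logprobs: (B, T) log pi_theta(a_t|s_t) for each sampled action/token
    returns:  (B,)   total return R(tau) for each trajectory in the batch
    """
    # Broadcast the trajectory return over the time dimension; minimizing the
    # negative surrogate performs ascent on the policy gradient.
    return -(logprobs * returns.unsqueeze(1)).sum(dim=1).mean()
```
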
@@ -278,7 +278,7 @@ Here, $\pi_\theta(a|s)$ is the current policy being optimized and $\pi_{\theta_{
 The ratio between these two policies emerges from *importance sampling*, which allows us to reuse data collected under an old policy to estimate gradients for a new policy.
 
 Recall from the advantage formulation of the policy gradient (@eq:advantage_policy_gradient) that we have:
-$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi_\theta}(s_t, a_t) \right].$$
+$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi_\theta}(s_t, a_t) \right].$$ {#eq:advantage_policy_gradient_recall}
 
 This expectation is taken over trajectories sampled from $\pi_\theta$, but in practice we want to take multiple gradient steps on a batch of data that was collected from a fixed policy $\pi_{\theta_{\text{old}}}$.
 To correct for this distribution mismatch, we multiply by the importance weight $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$, which reweights samples to account for how much more or less likely they are under the current policy versus the data-collection policy.
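
A sketch of the importance-weighting step described in this hunk (not the chapter's code; names are placeholders): the ratio is formed in log space for numerical stability and multiplied against the advantages. The clipped PPO objective builds on this same ratio.

```python
import torch

def importance_weighted_loss(new_logprobs: torch.Tensor,
                             old_logprobs: torch.Tensor,
                             advantages: torch.Tensor) -> torch.Tensor:
    """Unclipped surrogate: E[ (pi_theta / pi_theta_old) * A ].

    new_logprobs: (B, T) log-probs under the policy being optimized
    old_logprobs: (B, T) log-probs under the data-collection policy (no gradient)
    advantages:   (B, T) advantage estimates A(s_t, a_t)
    """
    ratio = torch.exp(new_logprobs - old_logprobs.detach())
    return -(ratio * advantages).mean()
```
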
@@ -673,7 +673,7 @@ Given per-token losses $\ell_{i,t}$ for sample $i$ at token $t$, with completion
 
 **Strategy 1: Per-sequence normalization** (standard GRPO; also used in some PPO implementations)
 
-$$L = \frac{1}{B} \sum_{i=1}^{B} \frac{1}{|a_i|} \sum_{t=1}^{|a_i|} \ell_{i,t}$$
+$$L = \frac{1}{B} \sum_{i=1}^{B} \frac{1}{|a_i|} \sum_{t=1}^{|a_i|} \ell_{i,t}$$ {#eq:loss_per_sequence}
 
 Each sequence contributes equally to the batch loss, regardless of length. In code:
 
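
The chapter's code for this strategy is only partially visible in the next hunk's context line; a self-contained sketch of the per-sequence average matching `eq:loss_per_sequence` (the denominator and final mean are assumptions):

```python
# Per-sequence normalization: mean over each completion's tokens, then over the batch.
# per_token_loss, completion_mask: (B, T) tensors; mask is 1 on completion tokens.
sequence_loss = ((per_token_loss * completion_mask).sum(dim=1) /
                 completion_mask.sum(dim=1)).mean()
```
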
@@ -685,7 +685,7 @@ sequence_loss = ((per_token_loss * completion_mask).sum(dim=1) / \
 
 **Strategy 2: Per-token normalization** (DAPO [@yu2025dapo])
 
-$$L = \frac{\sum_{i=1}^{B} \sum_{t=1}^{|a_i|} \ell_{i,t}}{\sum_{i=1}^{B} |a_i|}$$
+$$L = \frac{\sum_{i=1}^{B} \sum_{t=1}^{|a_i|} \ell_{i,t}}{\sum_{i=1}^{B} |a_i|}$$ {#eq:loss_per_token}
 
 Each token contributes equally; longer sequences have proportionally more influence on the gradient. In code:
 
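
Likewise, a sketch of the per-token pooling described by `eq:loss_per_token` (the chapter's own `token_loss` line is truncated in this diff, so the completion here is an assumption):

```python
# Per-token normalization: pool all completion tokens across the batch.
token_loss = ((per_token_loss * completion_mask).sum() /
              completion_mask.sum())
```
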
@@ -697,7 +697,7 @@ token_loss = ((per_token_loss * completion_mask).sum() / \
 
 **Strategy 3: Fixed-length normalization** (Dr. GRPO [@liu2025understanding])
 
-$$L = \frac{1}{B} \sum_{i=1}^{B} \frac{1}{L_{\max}} \sum_{t=1}^{|a_i|} \ell_{i,t}$$
+$$L = \frac{1}{B} \sum_{i=1}^{B} \frac{1}{L_{\max}} \sum_{t=1}^{|a_i|} \ell_{i,t}$$ {#eq:loss_fixed_length}
 
 Normalizes by max sequence length $L_{\max}$, equalizing the per-token scale across sequences while still letting longer sequences contribute more total gradient because they contain more active tokens.
 
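
And a sketch of the fixed-length variant in `eq:loss_fixed_length`, assuming a hypothetical `max_completion_length` constant standing in for $L_{\max}$:

```python
# Fixed-length normalization: divide every sequence's summed loss by L_max.
fixed_length_loss = ((per_token_loss * completion_mask).sum(dim=1) /
                     max_completion_length).mean()
```
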
@@ -964,7 +964,7 @@ The DeepSeekMath paper describes some implementation details of GRPO that differ
 For example, the KL penalty within the RLHF optimization (recall the KL penalty is also used when training reasoning models on verifiable rewards without a reward model) is applied directly in the loss update rather than to the reward function.
 Where the standard KL penalty application for RLHF is applied as $r=r_\theta - \beta \mathcal{D}_{\text{KL}}$, the GRPO implementation is along the lines of:
 
-$$ L = L_{\text{policy gradient}} + \beta * \mathcal{D}_{\text{KL}} $$
+$$ L = L_{\text{policy gradient}} + \beta * \mathcal{D}_{\text{KL}} $$ {#eq:grpo_loss_kl}
 
 Though, there are multiple ways to implement this.
 Traditionally, the KL distance is computed with respect to each token in the completion to a prompt $s$.
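
A minimal sketch of the loss-side KL penalty labeled as `eq:grpo_loss_kl`, assuming per-token policy-gradient losses and per-token KL estimates have already been computed (all names are placeholders, not the chapter's implementation):

```python
import torch

def grpo_loss_with_kl(pg_loss_per_token: torch.Tensor,
                      per_token_kl: torch.Tensor,
                      completion_mask: torch.Tensor,
                      beta: float) -> torch.Tensor:
    """L = L_policy_gradient + beta * D_KL, applied per token and then averaged."""
    per_token_loss = pg_loss_per_token + beta * per_token_kl
    return ((per_token_loss * completion_mask).sum(dim=1) /
            completion_mask.sum(dim=1)).mean()
```
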
