Quite often, people use a more general formulation of the policy gradient:
$$
g = \mathbb{E}\left[ \sum_{t=0}^{\infty} \Psi_t \nabla_\theta \log \pi_\theta(a_t|s_t) \right]
$$
Here, $\pi_\theta(a|s)$ is the current policy being optimized and $\pi_{\theta_{\text{old}}}(a|s)$ is the policy that was used to collect the data.
The ratio between these two policies emerges from *importance sampling*, which allows us to reuse data collected under an old policy to estimate gradients for a new policy.
Recall from the advantage formulation of the policy gradient (@eq:advantage_policy_gradient) that we have:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, A_t \right]
$$
This expectation is taken over trajectories sampled from $\pi_\theta$, but in practice we want to take multiple gradient steps on a batch of data that was collected from a fixed policy $\pi_{\theta_{\text{old}}}$.
To correct for this distribution mismatch, we multiply by the importance weight $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$, which reweights samples to account for how much more or less likely they are under the current policy versus the data-collection policy.
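To make the correction concrete, here is a minimal sketch in PyTorch-style code (the tensor names `logprobs`, `old_logprobs`, and `advantages` are illustrative, not from any particular library): the importance weight is just the exponentiated difference of log-probabilities, and it rescales each sample's contribution to the surrogate loss.

```python
import torch

def importance_weighted_pg_loss(logprobs, old_logprobs, advantages):
    """Policy-gradient surrogate reweighted by pi_theta / pi_theta_old."""
    # Ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log-space
    # for numerical stability; old_logprobs carries no gradient.
    ratio = torch.exp(logprobs - old_logprobs.detach())

    # Negate because optimizers minimize: maximizing the reweighted,
    # advantage-weighted objective is minimizing its negation.
    return -(ratio * advantages).mean()
```

PPO additionally clips this ratio to keep the update close to the data-collection policy; the reweighting itself is the importance-sampling correction described above.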
Given per-token losses $\ell_{i,t}$ for sample $i$ at token $t$, with completion length $L_i$, the strategies below differ in how these losses are aggregated into a single scalar loss.
**Strategy 1: Per-sequence normalization** (standard GRPO; also used in some PPO implementations)
Normalizes by max sequence length $L_{\max}$, equalizing the per-token scale across sequences while still letting longer sequences contribute more total gradient because they contain more active tokens.
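As an illustrative sketch (not tied to any specific implementation), the two behaviors described here can be contrasted directly: normalizing each sequence by its own token count versus normalizing every sequence by the same constant $L_{\max}$. The tensor names and shapes below are assumptions for the example.

```python
import torch

def aggregate_per_token_losses(per_token_loss, mask, max_len):
    """per_token_loss, mask: (batch, seq_len); mask is 1 on completion tokens."""
    # Per-sequence normalization: divide each sequence by its own token count,
    # then average across sequences, so every sequence contributes equally
    # regardless of length.
    per_sequence = ((per_token_loss * mask).sum(-1) / mask.sum(-1)).mean()

    # Constant (max-length) normalization: divide every sequence by the same
    # L_max, so the per-token scale is equal across sequences, but longer
    # sequences still contribute more total gradient because they contain
    # more active tokens.
    max_length = ((per_token_loss * mask).sum(-1) / max_len).mean()

    return per_sequence, max_length
```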
The DeepSeekMath paper describes some implementation details of GRPO that differ from standard implementations of PPO for RLHF.
For example, the KL penalty within the RLHF optimization (recall the KL penalty is also used when training reasoning models on verifiable rewards without a reward model) is applied directly in the loss update rather than to the reward function.
Whereas the standard RLHF recipe applies the KL penalty to the reward as $r = r_\theta - \beta \mathcal{D}_{\text{KL}}$, the GRPO implementation applies it directly in the loss, along the lines of:
$$ L = L_{\text{policy gradient}} + \beta \, \mathcal{D}_{\text{KL}} $$
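A rough sketch of the two placements, using illustrative tensor names and a naive log-ratio in place of the lower-variance KL estimator that GRPO implementations typically use:

```python
import torch

def kl_in_reward(reward, logprobs, ref_logprobs, beta):
    # Standard RLHF: fold the KL penalty into the reward signal
    # before advantages are computed.
    return reward - beta * (logprobs - ref_logprobs).sum(-1)

def kl_in_loss(pg_loss, logprobs, ref_logprobs, beta):
    # GRPO-style: leave the reward untouched and add the KL penalty
    # directly to the loss being minimized.
    kl = (logprobs - ref_logprobs).mean()
    return pg_loss + beta * kl
```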