-
Notifications
You must be signed in to change notification settings - Fork 67
Open
Description
as 'aprx_imm_reg' here is computed for every action and put to buffer without being summed up, I have no idea why
'aprx_imm_reg *= legal_action_mask / n_actions_to_smpl '
I think it is because I could not understand the formula here(v~(I) = * p(a) * |A(I)), and I failed find corresponding part in your paper,
"""
Last state values are the average, not the sum of all samples of that state since we add
v~(I) = * p(a) * |A(I)|. Since we sample multiple actions on each traverser node, we have to average over
their returns like: v~(I) * Sum_a=0_N (v~(I|a) * p(a) * ||A(I)|| / N).
"""
is there any reference for it?
thanks a lot
Metadata
Metadata
Assignees
Labels
No labels