Merged
Conversation
Signed-off-by: Maanu Grover <maanug@nvidia.com>
ananthsub
reviewed
Mar 18, 2025
ananthsub
reviewed
Mar 19, 2025
from nemo.tron.state import GlobalState

def get_batch(data_iterator, cfg: ConfigContainer):
Collaborator
It would be good to add a type hint and docs for the return value.
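A minimal sketch of what the requested annotation and docstring could look like. The field names in `BATCH_KEYS`, the `Tensor` placeholder, and the loosely typed `cfg` are assumptions made for illustration; the diff only shows that the function returns `batch.values()`. Returning an explicit tuple instead of a dict view is a design choice in this sketch, not something the PR confirms.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical stand-in: the real batch entries are presumably torch.Tensor objects.
Tensor = Any

# Hypothetical field order; the actual keys are not visible in the diff.
BATCH_KEYS = ("tokens", "labels", "loss_mask", "attention_mask", "position_ids")


def get_batch(data_iterator: Iterator[Dict[str, Tensor]], cfg: Any) -> Tuple[Tensor, ...]:
    """Fetch the next batch and return its fields in a fixed order.

    Args:
        data_iterator: Iterator yielding dicts that map field names to tensors.
        cfg: Configuration container (unused in this sketch).

    Returns:
        A tuple ``(tokens, labels, loss_mask, attention_mask, position_ids)``.
    """
    batch = next(data_iterator)
    # Index by key so the order callers unpack does not depend on dict insertion order.
    return tuple(batch[key] for key in BATCH_KEYS)
```

With the return value documented this way, a caller can unpack the result without reading the function body to learn the field order.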
    return batch.values()

def forward_step(state: GlobalState, data_iterator: Iterable, model: GPTModel):
Collaborator
Same here: a return type annotation would be helpful.
nemo/tron/utils/train_utils.py
Outdated
    torch.distributed.all_reduce(values, group=tracker[name]["avg_group"], op=torch.distributed.ReduceOp.AVG)

def maybe_inject_state(forward_step_func: Callable, state: GlobalState) -> Callable:
Collaborator
Please define the type hint for forward_step_func in a types.py file, since it will serve as additional documentation.
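One way the suggested alias might look, as a sketch only. The alias names `ForwardStepFunc` and `LossFunc`, the assumed return shape (an output tensor plus a loss callable), and the parameter-count check inside `maybe_inject_state` are all assumptions for illustration; the diff shows only the signature of `maybe_inject_state`, not its body.

```python
import inspect
from functools import partial
from typing import Any, Callable, Tuple

# Hypothetical placeholders so the sketch runs without NeMo or torch installed.
GlobalState = Any
Tensor = Any

# Aliases that could live in a types.py module (names are hypothetical).
# Assumed contract: a forward step returns an output tensor and a loss callable.
LossFunc = Callable[..., Any]
ForwardStepFunc = Callable[..., Tuple[Tensor, LossFunc]]


def maybe_inject_state(forward_step_func: ForwardStepFunc, state: GlobalState) -> ForwardStepFunc:
    """Bind ``state`` as the first argument if the function takes three parameters.

    Functions that take only ``(data_iterator, model)`` are returned unchanged.
    """
    num_params = len(inspect.signature(forward_step_func).parameters)
    if num_params == 3:
        # Assumed convention: the three-parameter form is (state, data_iterator, model).
        return partial(forward_step_func, state)
    return forward_step_func
```

A named alias gives reviewers and users one place to read the expected call signature, which is the documentation benefit the comment asks for.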
ericharper
reviewed
Mar 19, 2025
ananthsub
approved these changes
Mar 20, 2025
hemildesai
reviewed
Mar 20, 2025
hemildesai
pushed a commit
that referenced
this pull request
Mar 20, 2025
* pretrain loss func
* get batch and forward
* add rerun functionality to loss
* formatting
* injection of state
* remove globalstate singleton functionality
* update example
* missing copyright
* fix for latest mcore
* syntax
* move assertion
* refactor for eval
* move to avoid circular import
* fix
* unused
* cache num fw args in train and eval
* docstring fix
* remove duplicate
---------
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
hemildesai
pushed a commit
that referenced
this pull request
Apr 15, 2025
jiemingz
pushed a commit
that referenced
this pull request
Jul 10, 2025
Important
The "Update branch" button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs in various areas.
Additional Information