LLM Forward Step by maanug-nv · Pull Request #12673 · NVIDIA-NeMo/NeMo

maanug-nv · 2025-03-18T21:33:29Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

ananthsub · 2025-03-19T00:46:31Z

nemo/tron/llm/gpt.py

+from nemo.tron.state import GlobalState
+
+
+def get_batch(data_iterator, cfg: ConfigContainer):


would be good to add typehint + docs for the return value
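One way to address this comment is sketched below. It is only an illustration: the diff shows nothing of `get_batch` beyond `return batch.values()`, so the batch key names, the `Tensor` placeholder, and the fixed-order tuple return are all assumptions made here, not the PR's actual code.

```python
from typing import Any, Dict, Iterator, Tuple

# Placeholder for torch.Tensor so the sketch runs without PyTorch (assumption).
Tensor = Any

# Hypothetical field names; the real batch keys are not visible in the diff.
BATCH_KEYS = ("tokens", "labels", "loss_mask", "attention_mask", "position_ids")


def get_batch(
    data_iterator: Iterator[Dict[str, Tensor]], cfg: Any
) -> Tuple[Tensor, Tensor, Tensor, Tensor, Tensor]:
    """Fetch the next batch and return its fields in a fixed order.

    Args:
        data_iterator: Iterator yielding dicts keyed by the names in BATCH_KEYS.
        cfg: Run configuration (a ConfigContainer in the PR); unused in this sketch.

    Returns:
        A tuple (tokens, labels, loss_mask, attention_mask, position_ids).
    """
    batch = next(data_iterator)
    # Index by key instead of relying on dict ordering via batch.values().
    return tuple(batch[key] for key in BATCH_KEYS)
```

Returning a tuple built from explicit keys keeps the documented order stable even if the upstream dataset changes how it constructs the dict.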

ananthsub · 2025-03-19T00:47:57Z

nemo/tron/llm/gpt.py

+    return batch.values()
+
+
+def forward_step(state: GlobalState, data_iterator: Iterable, model: GPTModel):


same here, the return type will be helpful
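A possible shape for the annotated signature is sketched below. It assumes the Megatron-style convention in which a forward step returns the model output together with a loss callable already bound to the loss mask; the diff does not show the return statement, so this convention, the `loss_func` helper, and the list-based stand-ins for tensors are assumptions for illustration only.

```python
from functools import partial
from typing import Any, Callable, Iterable, Tuple

# Placeholder for torch.Tensor so the sketch runs without PyTorch (assumption).
Tensor = Any


def loss_func(loss_mask: Tensor, output_tensor: Tensor) -> float:
    """Hypothetical masked mean over plain Python lists."""
    total = sum(m * o for m, o in zip(loss_mask, output_tensor))
    return total / max(sum(loss_mask), 1)


def forward_step(
    state: Any, data_iterator: Iterable, model: Callable[[Tensor], Tensor]
) -> Tuple[Tensor, Callable[[Tensor], float]]:
    """Run one forward pass.

    Returns:
        A pair (output_tensor, loss_fn), where loss_fn takes the output tensor
        and computes the loss using the mask captured from this batch.
    """
    # iter() is a no-op when data_iterator is already an iterator.
    tokens, loss_mask = next(iter(data_iterator))
    output_tensor = model(tokens)
    return output_tensor, partial(loss_func, loss_mask)
```

With the return type spelled out, a caller can see at a glance that the second element still needs to be invoked with the output tensor to obtain a loss value.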

nemo/tron/api.py

ananthsub · 2025-03-19T00:57:59Z

nemo/tron/utils/train_utils.py

            torch.distributed.all_reduce(values, group=tracker[name]["avg_group"], op=torch.distributed.ReduceOp.AVG)
+
+
+def maybe_inject_state(forward_step_func: Callable, state: GlobalState) -> Callable:


in some types.py file please define the typehint for forward_step_func since this will serve as additional docs
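A sketch of what such a type definition could look like follows, together with one plausible body for `maybe_inject_state`. Only the signature of `maybe_inject_state` appears in the diff; the alias names, the argument-counting approach via `inspect.signature`, and the placeholder types are assumptions, loosely motivated by the commit titles "injection of state" and "cache num fw args in train and eval".

```python
import inspect
from functools import partial
from typing import Any, Callable, Iterable, Tuple, Union

# Placeholders so the sketch runs without NeMo or PyTorch (assumptions).
Tensor = Any
GlobalState = Any

# Hypothetical aliases that a types.py module might export.
LossFn = Callable[[Tensor], Any]
ForwardStepOutput = Tuple[Tensor, LossFn]
ForwardStepWithState = Callable[[GlobalState, Iterable, Any], ForwardStepOutput]
ForwardStepNoState = Callable[[Iterable, Any], ForwardStepOutput]
ForwardStepFunc = Union[ForwardStepWithState, ForwardStepNoState]


def maybe_inject_state(
    forward_step_func: ForwardStepFunc, state: GlobalState
) -> ForwardStepNoState:
    """Bind state as the first argument when the function expects it."""
    num_args = len(inspect.signature(forward_step_func).parameters)
    if num_args == 3:
        # (state, data_iterator, model): pre-bind the state.
        return partial(forward_step_func, state)
    if num_args == 2:
        # (data_iterator, model): nothing to inject.
        return forward_step_func
    raise ValueError(f"forward_step_func takes {num_args} arguments; expected 2 or 3")
```

Named aliases like these double as documentation: a user writing a custom forward step can read the two accepted call shapes directly from the type.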

* pretrain loss func
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* get batch and forward
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* add rerun functionality to loss
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* formatting
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* injection of state
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* remove globalstate singleton functionality
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* update example
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* missing copyright
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* fix for latest mcore
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* syntax
  Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
  Signed-off-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>
* move assertion
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* refactor for eval
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* move to avoid circular import
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* fix
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* unused
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* cache num fw args in train and eval
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* docstring fix
  Signed-off-by: Maanu Grover <maanug@nvidia.com>
* remove duplicate
  Signed-off-by: Maanu Grover <maanug@nvidia.com>

---------

Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>

maanug-nv added 8 commits March 12, 2025 16:54

pretrain loss func

b32914f

Signed-off-by: Maanu Grover <maanug@nvidia.com>

get batch and forward

f3c03dd

Signed-off-by: Maanu Grover <maanug@nvidia.com>

add rerun functionality to loss

d397d34

Signed-off-by: Maanu Grover <maanug@nvidia.com>

formatting

7d21d7e

Signed-off-by: Maanu Grover <maanug@nvidia.com>

injection of state

27515de

Signed-off-by: Maanu Grover <maanug@nvidia.com>

remove globalstate singleton functionality

2181140

Signed-off-by: Maanu Grover <maanug@nvidia.com>

update example

46ef694

Signed-off-by: Maanu Grover <maanug@nvidia.com>

missing copyright

82bf9f6

Signed-off-by: Maanu Grover <maanug@nvidia.com>

ananthsub reviewed Mar 18, 2025

nemo/tron/train.py (outdated, resolved)

nemo/tron/train.py (outdated, resolved)

maanug-nv and others added 6 commits March 18, 2025 15:52

fix for latest mcore

75c5fe3

Signed-off-by: Maanu Grover <maanug@nvidia.com>

syntax

080901c

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com> Signed-off-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>

move assertion

6f085c9

Signed-off-by: Maanu Grover <maanug@nvidia.com>

refactor for eval

686d6f9

Signed-off-by: Maanu Grover <maanug@nvidia.com>

move to avoid circular import

b7ac969

Signed-off-by: Maanu Grover <maanug@nvidia.com>

fix

71894a1

Signed-off-by: Maanu Grover <maanug@nvidia.com>

maanug-nv marked this pull request as ready for review March 19, 2025 00:46

ananthsub reviewed Mar 19, 2025

maanug-nv requested review from ananthsub and hemildesai March 19, 2025 02:27

ericharper reviewed Mar 19, 2025

nemo/tron/config.py (outdated, resolved)

maanug-nv added 4 commits March 19, 2025 19:49

unused

fb13862

Signed-off-by: Maanu Grover <maanug@nvidia.com>

cache num fw args in train and eval

354b5a3

Signed-off-by: Maanu Grover <maanug@nvidia.com>

docstring fix

b31d7f9

Signed-off-by: Maanu Grover <maanug@nvidia.com>

remove duplicate

430741f

Signed-off-by: Maanu Grover <maanug@nvidia.com>

ananthsub approved these changes Mar 20, 2025

hemildesai reviewed Mar 20, 2025

nemo/tron/utils/train_utils.py (resolved)

maanug-nv merged commit d1d1f7c into mlm-pretrain-loop Mar 20, 2025
11 checks passed

maanug-nv deleted the maanug/loss-and-fwd branch March 20, 2025 21:10

maanug-nv merged 18 commits into mlm-pretrain-loop from maanug/loss-and-fwd

4 participants
