Replies: 1 comment
Great setup! A few optimization paths to explore:

**Why higher GBS slows down:**

**Suggested changes:**

**Quick benchmark:**

We run large-scale NeMo training at Revolution AI; the B200 FP8 path is key for 20B+ models. Happy to help debug further!
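Since the FP8 path comes up above: a minimal sketch of the relevant fields in a NeMo-style `megatron_gpt_config.yaml` (field names taken from NeMo 1.x Megatron GPT configs; values are illustrative, so verify against the config shipped with your NeMo version):

```yaml
model:
  # Transformer Engine FP8 settings (as exposed in NeMo's Megatron GPT config)
  transformer_engine: true
  fp8: true                  # enable FP8 for linear layers via Transformer Engine
  fp8_e4m3: false
  fp8_hybrid: true           # e4m3 forward / e5m2 backward, the common default
  fp8_margin: 0
  fp8_amax_history_len: 1024
  fp8_amax_compute_algo: max
```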
Hi!
I am working on a pre-training job for GPT-OSS 20B (25B tokens, trained in 3 phases) and would appreciate your feedback on optimizing the recipe to cut our current 2-week timeline on 8x B200 GPUs. We are currently running at a GBS of 16, which gives that 2-week projection. Any setting higher than GBS 256 triggers an immediate OOM error, and raising GBS to our target of 256 actually increases the projected training time to 3 weeks.
Current Recipe Hyperparameters:
Any idea what the best approach would be to increase GBS and accelerate training? I think that is the bottleneck, but I'm not sure; maybe increasing TP?
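For intuition on how these knobs interact: in Megatron/NeMo-style training, GBS = MBS × gradient-accumulation steps × data-parallel size, and DP shrinks as TP×PP grows. A minimal sketch of that arithmetic (pure Python; the function name and parameters are illustrative, not actual recipe fields):

```python
def batch_layout(num_gpus, tp, pp, mbs, gbs):
    """Derive data-parallel size and gradient-accumulation steps.

    Raising GBS with a fixed micro batch size only adds accumulation
    steps (more compute per optimizer step, same per-GPU activation
    memory), so an OOM at high GBS usually points at MBS or sequence
    length growing too -- worth checking before reaching for more TP.
    """
    assert num_gpus % (tp * pp) == 0, "TP * PP must divide the GPU count"
    dp = num_gpus // (tp * pp)
    assert gbs % (mbs * dp) == 0, "GBS must be divisible by MBS * DP"
    grad_accum = gbs // (mbs * dp)
    return dp, grad_accum

# 8x B200, no model parallelism: GBS 256 with MBS 1 -> 32 accumulation steps
print(batch_layout(num_gpus=8, tp=1, pp=1, mbs=1, gbs=256))  # (8, 32)
# TP=2 halves DP, doubling accumulation for the same GBS
print(batch_layout(num_gpus=8, tp=2, pp=1, mbs=1, gbs=256))  # (4, 64)
```

Note the trade-off this exposes: more TP does not raise the reachable GBS by itself, it only shifts work from accumulation to communication, which can explain a longer projected runtime at the same GBS.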