Multinode projection with different parallelization strategies when single node is benchmarked #492
base: main
Conversation
araina-amd commented on Jan 14, 2026
- Multinode scaling projection from baseline to target node count
- Automatic config reduction for single-node benchmarking (PP and EP rescaling)
- Integration with pipeline simulation for accurate baseline calculation
- Per-layer communication estimation (TP AllReduce, MoE All-to-All)
- Detailed communication breakdown with message sizes
- Support for overlapped gradient all-reduce (default enabled)
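To make the list above concrete, here is a minimal sketch of the projection idea; the function and variable names are illustrative assumptions, not the PR's actual API. The premise is that a single node is benchmarked with a reduced (rescaled PP/EP) configuration, the communication measured at that reduced scale is subtracted, and the communication estimated for the target node count is added back.

```python
def project_iteration_time_ms(benchmarked_ms: float,
                              comm_reduced_ms: float,
                              comm_target_ms: float) -> float:
    """Hypothetical projection: swap the communication cost of the reduced
    single-node run for the cost estimated at the target node count,
    assuming the per-GPU compute portion stays the same."""
    compute_ms = benchmarked_ms - comm_reduced_ms
    return compute_ms + comm_target_ms

# Toy numbers, purely illustrative: 5000 ms benchmarked on one node, of which
# 400 ms is down-scaled communication; 900 ms of communication is estimated
# for the target node count -> 5500 ms projected iteration time.
print(project_iteration_time_ms(5000.0, 400.0, 900.0))
```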
Let's separate the style/formatting changes from the actual changes and make it into two PRs (if formatting is actually necessary).

As for the actual changes, there are several key things missing from the code. Let's discuss it offline.
…nce_projection
- Delete primus/core/projection/multinode_projection/ directory
- All multinode projection functionality is now in performance_projection/projection.py
- Communication calculation, hardware config loading, and projection logic consolidated
Force-pushed from 53259c7 to 23ea7c6
if protocol == "simple":
    # Simple protocol: one packet, add header
    node_lat = args.write_latency + args.write_resp + args.write_latency
    num_packets = 1
| """ | ||
| if protocol == "simple": | ||
| pod_lat = args.pod_lat * 3 | ||
| num_packets = 1 |
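Read together, the two fragments above model the "simple" protocol the same way at two levels: a single packet whose latency is a three-term sum, built from write latencies within a node and from three times the pod latency across pods. A toy evaluation with assumed values (microseconds chosen purely for illustration):

```python
# Assumed example values in microseconds; not taken from the PR or any hardware config.
write_latency, write_resp, pod_lat = 0.7, 0.5, 2.0

node_lat = write_latency + write_resp + write_latency  # 1.9 us, mirrors the first fragment
pod_lat_total = pod_lat * 3                            # 6.0 us, mirrors the second fragment
num_packets = 1                                        # the simple protocol sends one packet
print(node_lat, pod_lat_total, num_packets)
```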
intra_node_fanout, inter_node_fanout = get_max_fanout(args)
msg_size_per_peer = ceil(msg_size / gpus)
gpus_per_node = min(gpus, args.node_size)
num_nodes = ceil(gpus / gpus_per_node)
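As a quick worked example of the sizing logic in the fragment above (values assumed for illustration only):

```python
from math import ceil

# Assumed values, not from the PR: a 64 MiB all-to-all message across 16 GPUs
# on nodes of 8 GPUs each.
msg_size, gpus, node_size = 64 * 2**20, 16, 8

msg_size_per_peer = ceil(msg_size / gpus)   # 4 MiB sent to each peer
gpus_per_node = min(gpus, node_size)        # 8 GPUs stay within a node
num_nodes = ceil(gpus / gpus_per_node)      # 2 nodes participate
print(msg_size_per_peer, gpus_per_node, num_nodes)
```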
# Model parameters
hidden_size = model_config.hidden_size
num_layers = model_config.num_layers
num_experts = model_config.num_experts
…he benchmarked time.
…y accounted in the pipeline simulation model.
a2a_combine = cm.alltoall(coll_args, dispatch_size, ep, groups=['ep'])

# Forward: dispatch + combine, Backward: same
fwd_time = (a2a_dispatch + a2a_combine) / 1000  # Convert to ms
these two numbers are identical. why don't we have a unified variable?
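For reference, one way the duplication could be collapsed, reusing the names from the diff above (a sketch, not the PR's code):

```python
# Forward and backward issue the same dispatch + combine all-to-alls, so the
# cost can be computed once and shared.
a2a_time_ms = (a2a_dispatch + a2a_combine) / 1000  # convert to ms, as in the diff
fwd_time = bwd_time = a2a_time_ms
```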
comm_ops.append({
    'type': 'MoE All-to-All',
    'time_fwd_ms': fwd_time,
same here.
| print(f" Forward time: {forward_time:.2f} ms") | ||
| print(f" Backward time: {backward_time:.2f} ms") | ||
| print(f" Forward time (compute only): {forward_time:.2f} ms") |
it's not "compute only" -- the benchmarked layer contains down-scaled all-to-alls.
LGTM in general, I left some comments in the code as well as below:
…a_parallel_size to use PROJECTION_NNODES, fixed wgrad double-counting (set to 0.0), removed wgrad additions for IO layers, and added zero-bubble scheduler support with 50/50 B/W split when enable_zero_bubble=True.
…d _run_pipeline_simulation_megatron_zb() to use the actual Megatron zero-bubble scheduler (ILP-based) instead of the simple heuristic scheduler. Add custom_hardware_example.yaml for hardware configuration. Plus fixing some prints.
Usage: bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml --target-nodes 6
Projection accuracy for DeepSeek V2 Lite:
- PP=3, EP=8 (3 nodes): Projected 6628 ms vs Measured 6468 ms = +2.5% error
- PP=1, EP=16 (2 nodes): Projected 5337 ms vs Measured 5276 ms = +1.2% error
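The quoted error percentages follow from the relative error of the projection against the measured time; a quick check of the numbers above:

```python
# Relative projection error = (projected - measured) / measured.
for projected_ms, measured_ms in [(6628, 6468), (5337, 5276)]:
    print(f"{(projected_ms - measured_ms) / measured_ms * 100:+.1f}%")  # +2.5%, +1.2%
```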
    target_grad_ar = target_breakdown.get('gradient_allreduce', 0)
    grad_ar_msg = f"{target_grad_ar:.3f} ms (overlapped - not in critical path)"
else:
    target_grad_ar = 0
)

# Calculate speedup
speedup = benchmarked_time_ms / projected_time_ms if projected_time_ms > 0 else 0
ideal_speedup = dp_target / min_dp if min_dp > 0 else dp_target
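To illustrate the two ratios in the fragment above with assumed numbers (not from the PR), and assuming benchmarked_time_ms is the reduced single-node iteration time while projected_time_ms is the multinode projection:

```python
# Assumed numbers: single-node run at DP=2 takes 6000 ms per iteration; the
# projected 12-way data-parallel target takes 1300 ms per iteration.
benchmarked_time_ms, projected_time_ms = 6000.0, 1300.0
min_dp, dp_target = 2, 12

speedup = benchmarked_time_ms / projected_time_ms if projected_time_ms > 0 else 0
ideal_speedup = dp_target / min_dp if min_dp > 0 else dp_target
print(f"{speedup:.1f}x realized vs {ideal_speedup:.1f}x ideal")  # 4.6x realized vs 6.0x ideal
```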
@@ -0,0 +1,659 @@
import numpy as np
from math import ceil
from typing import Tuple