Multinode projection with different parallelization strategies when single node is benchmarked #492
base: main
Conversation
araina-amd commented on Jan 14, 2026
- Multinode scaling projection from baseline to target node count
- Automatic config reduction for single-node benchmarking (PP and EP rescaling)
- Integration with pipeline simulation for accurate baseline calculation
- Per-layer communication estimation (TP AllReduce, MoE All-to-All)
- Detailed communication breakdown with message sizes
- Support for overlapped gradient all-reduce (default enabled)
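To make the list above concrete, here is a minimal sketch of the projection idea; the function and variable names are illustrative assumptions, not the PR's actual API. The premise is that a single node is benchmarked with a reduced (rescaled PP/EP) configuration, the communication measured at that reduced scale is subtracted, and the communication estimated for the target node count is added back.

```python
def project_iteration_time_ms(benchmarked_ms: float,
                              comm_reduced_ms: float,
                              comm_target_ms: float) -> float:
    """Hypothetical projection: swap the communication cost of the reduced
    single-node run for the cost estimated at the target node count,
    assuming the per-GPU compute portion stays the same."""
    compute_ms = benchmarked_ms - comm_reduced_ms
    return compute_ms + comm_target_ms

# Toy numbers, purely illustrative: 5000 ms benchmarked on one node, of which
# 400 ms is down-scaled communication; 900 ms of communication is estimated
# for the target node count -> 5500 ms projected iteration time.
print(project_iteration_time_ms(5000.0, 400.0, 900.0))
```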
Let's separate the style/formatting changes from the actual changes and make it into two PRs (if formatting is actually necessary).

As for the actual changes, there are several key things missing from the code. Let's discuss it offline.
…nce_projection
- Delete primus/core/projection/multinode_projection/ directory
- All multinode projection functionality is now in performance_projection/projection.py
- Communication calculation, hardware config loading, and projection logic consolidated
Force-pushed from 53259c7 to 23ea7c6
if protocol == "simple":
    # Simple protocol: one packet, add header
    node_lat = args.write_latency + args.write_resp + args.write_latency
    num_packets = 1
| """ | ||
| if protocol == "simple": | ||
| pod_lat = args.pod_lat * 3 | ||
| num_packets = 1 |
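Read together, the two fragments above model the "simple" protocol the same way at two levels: a single packet whose latency is a three-term sum, built from write latencies within a node and from three times the pod latency across pods. A toy evaluation with assumed values (microseconds chosen purely for illustration):

```python
# Assumed example values in microseconds; not taken from the PR or any hardware config.
write_latency, write_resp, pod_lat = 0.7, 0.5, 2.0

node_lat = write_latency + write_resp + write_latency  # 1.9 us, mirrors the first fragment
pod_lat_total = pod_lat * 3                            # 6.0 us, mirrors the second fragment
num_packets = 1                                        # the simple protocol sends one packet
print(node_lat, pod_lat_total, num_packets)
```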
intra_node_fanout, inter_node_fanout = get_max_fanout(args)
msg_size_per_peer = ceil(msg_size / gpus)
gpus_per_node = min(gpus, args.node_size)
num_nodes = ceil(gpus / gpus_per_node)
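As a quick worked example of the sizing logic in the fragment above (values assumed for illustration only):

```python
from math import ceil

# Assumed values, not from the PR: a 64 MiB all-to-all message across 16 GPUs
# on nodes of 8 GPUs each.
msg_size, gpus, node_size = 64 * 2**20, 16, 8

msg_size_per_peer = ceil(msg_size / gpus)   # 4 MiB sent to each peer
gpus_per_node = min(gpus, node_size)        # 8 GPUs stay within a node
num_nodes = ceil(gpus / gpus_per_node)      # 2 nodes participate
print(msg_size_per_peer, gpus_per_node, num_nodes)
```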
# Model parameters
hidden_size = model_config.hidden_size
num_layers = model_config.num_layers
num_experts = model_config.num_experts
…he benchmarked time.
…y accounted in the pipeline simulation model.
a2a_combine = cm.alltoall(coll_args, dispatch_size, ep, groups=['ep'])

# Forward: dispatch + combine, Backward: same
fwd_time = (a2a_dispatch + a2a_combine) / 1000  # Convert to ms
these two numbers are identical. why don't we have a unified variable?
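For reference, one way the duplication could be collapsed, reusing the names from the diff above (a sketch, not the PR's code):

```python
# Forward and backward issue the same dispatch + combine all-to-alls, so the
# cost can be computed once and shared.
a2a_time_ms = (a2a_dispatch + a2a_combine) / 1000  # convert to ms, as in the diff
fwd_time = bwd_time = a2a_time_ms
```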
comm_ops.append({
    'type': 'MoE All-to-All',
    'time_fwd_ms': fwd_time,
same here.
| print(f" Forward time: {forward_time:.2f} ms") | ||
| print(f" Backward time: {backward_time:.2f} ms") | ||
| print(f" Forward time (compute only): {forward_time:.2f} ms") |
it's not "compute only" -- the benchmarked layer contains down-scaled all-to-alls.
LGTM in general, I left some comments in the code as well as below:
…a_parallel_size to use PROJECTION_NNODES, fixed wgrad double-counting (set to 0.0), removed wgrad additions for IO layers, and added zero-bubble scheduler support with 50/50 B/W split when enable_zero_bubble=True.
…d _run_pipeline_simulation_megatron_zb() to use the actual Megatron zero-bubble scheduler (ILP-based) instead of the simple heuristic scheduler. Add custom_hardware_example.yaml for hardware configuration. Plus fixing some prints.
Usage: bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml --target-nodes 6
Projection accuracy for DeepSeek V2 Lite:
- PP=3, EP=8 (3 nodes): Projected 6628 ms vs Measured 6468 ms = +2.5% error
- PP=1, EP=16 (2 nodes): Projected 5337 ms vs Measured 5276 ms = +1.2% error
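The quoted error percentages follow from the relative error of the projection against the measured time; a quick check of the numbers above:

```python
# Relative projection error = (projected - measured) / measured.
for projected_ms, measured_ms in [(6628, 6468), (5337, 5276)]:
    print(f"{(projected_ms - measured_ms) / measured_ms * 100:+.1f}%")  # +2.5%, +1.2%
```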
    target_grad_ar = target_breakdown.get('gradient_allreduce', 0)
    grad_ar_msg = f"{target_grad_ar:.3f} ms (overlapped - not in critical path)"
else:
    target_grad_ar = 0
)

# Calculate speedup
speedup = benchmarked_time_ms / projected_time_ms if projected_time_ms > 0 else 0
ideal_speedup = dp_target / min_dp if min_dp > 0 else dp_target
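To illustrate the two ratios in the fragment above with assumed numbers (not from the PR), and assuming benchmarked_time_ms is the reduced single-node iteration time while projected_time_ms is the multinode projection:

```python
# Assumed numbers: single-node run at DP=2 takes 6000 ms per iteration; the
# projected 12-way data-parallel target takes 1300 ms per iteration.
benchmarked_time_ms, projected_time_ms = 6000.0, 1300.0
min_dp, dp_target = 2, 12

speedup = benchmarked_time_ms / projected_time_ms if projected_time_ms > 0 else 0
ideal_speedup = dp_target / min_dp if min_dp > 0 else dp_target
print(f"{speedup:.1f}x realized vs {ideal_speedup:.1f}x ideal")  # 4.6x realized vs 6.0x ideal
```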
@@ -0,0 +1,659 @@
import numpy as np
from math import ceil
from typing import Tuple