
Commit a79ed64

fix cli distributed launch without rank info (#3714)
1 parent: 2dd639d

File tree: 3 files changed (+6, -5 lines)

docs/en/cli_usage.md

Lines changed: 2 additions & 2 deletions
@@ -107,14 +107,14 @@ paddleformers-cli export examples/config/run_export.yaml
 #### 6.1. Method 1
 
 ```bash
-NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
+NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} RANK={rank} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
 ```
 
 #### 6.2. Method 2 (mpirun)
 
 First, write a script, such as `scripts/train_96_gpus.sh`, with the following content:
 ```bash
-NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
+NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} RANK={rank} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
 ```
 
 Then:
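
In a multi-node run, `NNODES`, `MASTER_ADDR`, and `MASTER_PORT` are identical on every node, while `RANK` is the one value that must differ per node, which is why omitting it broke the launch. Below is a minimal two-node sketch of the fixed command; the address and port are illustrative placeholders, not values from the docs:

```bash
# Node 0 (hosts the master endpoint); values are illustrative:
NNODES=2 MASTER_ADDR=10.0.0.1 MASTER_PORT=8080 RANK=0 \
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    paddleformers-cli train examples/config/sft_full.yaml

# Node 1: same NNODES/MASTER_ADDR/MASTER_PORT, but RANK=1.
NNODES=2 MASTER_ADDR=10.0.0.1 MASTER_PORT=8080 RANK=1 \
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    paddleformers-cli train examples/config/sft_full.yaml
```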

docs/zh/cli_usage.md

Lines changed: 2 additions & 2 deletions
@@ -92,15 +92,15 @@ paddleformers-cli export examples/config/run_export.yaml
 #### Method 1
 
 ```shell
-NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
+NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} RANK={rank} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
 ```
 
 #### Method 2 (mpirun)
 
 First, write a script, e.g. `scripts/train_96_gpus.sh`, containing:
 
 ```shell
-NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
+NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} RANK={rank} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
 ```
 
 Then:

paddleformers/cli/cli.py

Lines changed: 2 additions & 1 deletion
@@ -71,6 +71,7 @@ def main():
     distributed_funcs = ["train", "export"]
     paddleformers_dist_log = os.getenv("PADDLEFORMERS_DIST_LOG", "paddleformers_dist_log")
     nnodes = os.getenv("NNODES", "1")
+    rank = os.getenv("RANK", "0")
     master_ip = os.getenv("MASTER_ADDR", "127.0.0.1")
     master_port = os.getenv("MASTER_PORT", "8080")
     current_device = detect_device()
@@ -154,7 +155,7 @@ def main():
     command = (
         f"python -m paddle.distributed.launch --log_dir {paddleformers_dist_log} "
         f"--{current_device}s {visible_cards} --master {master_ip}:{master_port} "
-        f"--nnodes {nnodes} {launcher.__file__} {args_to_pass}"
+        f"--nnodes {nnodes} --rank {rank} --run_mode=collective {launcher.__file__} {args_to_pass}"
     )
     command = shlex.split(command)
     process = subprocess.Popen(
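
For context, a hedged sketch of the command string the patched code would assemble on a GPU machine follows. The concrete values, launcher path, and trailing argument are hypothetical stand-ins for `{launcher.__file__}` and `{args_to_pass}`, and `RANK` falls back to `"0"` when the variable is unset:

```bash
# Hypothetical expansion of the f-string above for node 0 of a 2-node
# GPU job ({current_device} resolves to "gpu", so the flag is --gpus):
python -m paddle.distributed.launch --log_dir paddleformers_dist_log \
    --gpus 0,1,2,3,4,5,6,7 --master 10.0.0.1:8080 \
    --nnodes 2 --rank 0 --run_mode=collective \
    /path/to/launcher.py examples/config/sft_full.yaml
```

Passing `--rank` explicitly, together with `--run_mode=collective`, gives `paddle.distributed.launch` the node index that was previously missing, which is the failure mode the commit title describes.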
