33 commits
c1d30ba  fleet args update addition. (Feiye0979, Jan 14, 2026)
48c60d6  fleet args update addition. (Feiye0979, Jan 14, 2026)
a89b73e  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 14, 2026)
450c60d  fleet args update addition. pre-commit check (Feiye0979, Jan 14, 2026)
713abf2  fleet args update addition. pre-commit check (Feiye0979, Jan 14, 2026)
39280cb  fleet args update addition. (Feiye0979, Jan 14, 2026)
7da9049  fleet args update addition. change accuracy (Feiye0979, Jan 14, 2026)
8021b13  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 14, 2026)
53ffc2e  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 15, 2026)
cbfff76  fleet args update addition. update & fix cases (Feiye0979, Jan 15, 2026)
06989ac  Merge branch 'PaddlePaddle:develop' into develop_unify_args_add (Feiye0979, Jan 15, 2026)
1637bad  fleet args update addition. fix cases (Feiye0979, Jan 15, 2026)
b5fda65  fleet args update addition. fix cases (Feiye0979, Jan 15, 2026)
d174da2  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 15, 2026)
8fa1ca3  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 15, 2026)
e90c9ff  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 19, 2026)
215cfb7  fleet args update addition. (Feiye0979, Jan 19, 2026)
0e18c9d  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 20, 2026)
3cc1405  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 20, 2026)
eec342c  fleet args update addition. fix accuracy. (Feiye0979, Jan 20, 2026)
1ba0633  fleet args update addition. fix accuracy. (Feiye0979, Jan 20, 2026)
81e6e05  fix ci copy. (Feiye0979, Jan 20, 2026)
7576b43  fleet args update addition. fix ci case. (Feiye0979, Jan 20, 2026)
b05f55f  fleet args update addition. fix ci case. (Feiye0979, Jan 20, 2026)
487662d  fleet args update addition. fix ci case. (Feiye0979, Jan 21, 2026)
8332c7b  fleet args update addition. fix ci case. (Feiye0979, Jan 21, 2026)
c5e0369  Merge remote-tracking branch 'origin/develop' into develop_unify_args… (Feiye0979, Jan 21, 2026)
319d8db  fleet args update addition. fix ci case. (Feiye0979, Jan 21, 2026)
180cf8b  fleet args update addition. fix ci case. (Feiye0979, Jan 21, 2026)
88ab74f  fleet args update addition. (Feiye0979, Jan 21, 2026)
730ad62  fleet args update addition. fix ci case. (Feiye0979, Jan 21, 2026)
1cfdede  fleet args update addition. testing fuse_rms_norm. (Feiye0979, Jan 21, 2026)
49e1944  fleet args update addition. testing revert accuracy. (Feiye0979, Jan 21, 2026)
6 changes: 3 additions & 3 deletions docs/zh/dpo_and_lora_guide.md
@@ -77,7 +77,7 @@ mix_strategy: concat

### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-PT
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -135,7 +135,7 @@ mix_strategy: concat

### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-PT
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8

@@ -187,7 +187,7 @@ load_checkpoint_format: flex_checkpoint

`model_name_or_path`: the local model path or the name of the corresponding HuggingFace repository, e.g. `baidu/ERNIE-4.5-0.3B-PT`; a model that has already been through SFT is recommended

-`attn_impl`: the attention mask implementation used by the model. `flashmask` is recommended; it is a core optimization technique built on FlashAttention.
+`_attn_implementation`: the attention mask implementation used by the model. `flashmask` is recommended; it is a core optimization technique built on FlashAttention.

`lora`: Bool; whether to train with LoRA. Defaults to `False`.
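For reference, the options described above map onto the model block of a training config like the one earlier in this diff. A minimal sketch (values such as the LoRA rank are illustrative and should be adapted to your setup):

```yaml
### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-PT   # local path or HuggingFace repo name
_attn_implementation: flashmask               # FlashAttention-based attention mask implementation
lora: true                                    # enable LoRA training (defaults to false)
lora_rank: 8                                  # illustrative rank
```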

4 changes: 2 additions & 2 deletions docs/zh/pt_and_cpt_guide.md
@@ -59,7 +59,7 @@ mix_strategy: concat

### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-Base-PT
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -108,7 +108,7 @@ load_checkpoint_format: flex_checkpoint

`model_name_or_path`: the local model path or the name of the corresponding HuggingFace repository, e.g. `baidu/ERNIE-4.5-0.3B-Base-PT`

-`attn_impl`: the attention mask implementation used by the model. `flashmask` is recommended; it is a core optimization technique built on FlashAttention.
+`_attn_implementation`: the attention mask implementation used by the model. `flashmask` is recommended; it is a core optimization technique built on FlashAttention.

`stage`: related to the training type; set it to `PT` for pre-training

6 changes: 3 additions & 3 deletions docs/zh/sft_and_lora_guide.md
@@ -67,7 +67,7 @@ mix_strategy: concat

### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-Base-PT
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -124,7 +124,7 @@ mix_strategy: concat

### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-Base-PT
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8

@@ -175,7 +175,7 @@ load_checkpoint_format: flex_checkpoint

`model_name_or_path`: the local model path or the name of the corresponding HuggingFace repository, e.g. `baidu/ERNIE-4.5-0.3B-Base-PT`

-`attn_impl`: the attention mask implementation used by the model. `flashmask` is recommended; it is a core optimization technique built on FlashAttention.
+`_attn_implementation`: the attention mask implementation used by the model. `flashmask` is recommended; it is a core optimization technique built on FlashAttention.

`lora`: Bool; whether to train with LoRA. Defaults to `False`.

2 changes: 1 addition & 1 deletion docs/zh/training_arguments.md
@@ -283,7 +283,7 @@
--expert_model_parallel_size
The degree of parallelism for expert parallelism. (`int`, optional)

---aux_loss_alpha
+--router_aux_loss_coef
Weight coefficient of the auxiliary loss for MoE models. (`float`, optional, defaults to 0.0001)

--expert_max_capacity
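For illustration, the `expert_model_parallel_size` and `router_aux_loss_coef` arguments above can be set in a training YAML alongside the other options. This is a hypothetical excerpt with example values only, not a complete or recommended configuration:

```yaml
### distributed / MoE (example values only)
expert_model_parallel_size: 2      # parallel degree of expert parallelism
router_aux_loss_coef: 0.0001       # auxiliary-loss weight coefficient for MoE routing
```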
2 changes: 1 addition & 1 deletion examples/best_practices/DeepSeek-V3/SFT-Practice.md
@@ -80,4 +80,4 @@ mpirun bash run_dsv3_4k.sh
* In MoE models, load imbalance across experts can also trigger OOM errors, so introducing AuxLoss and the auxiliary-loss-free mechanism in a sensible way is essential. Key points learned from the experiments:
* Isolate the gate computation: e_score_correction_bias should only take part in computing the gating weights and must not be passed on to the downstream FFN modules (see the sketch below).
* Adapt the AuxLoss computation: under parallel strategies such as SP or Subbatch, mind the actual value of seq_len so that the loss is computed correctly.
-* Adjust the configuration: some settings provided by Hugging Face (e.g. aux_loss_alpha) need targeted tuning for the specific training scenario.
+* Adjust the configuration: some settings provided by Hugging Face (e.g. router_aux_loss_coef) need targeted tuning for the specific training scenario.
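To make the gate-isolation note concrete, below is a minimal, illustrative routing sketch. It is not the repository's actual implementation; the function name, tensor shapes, and the sigmoid scoring are assumptions. The point it shows: e_score_correction_bias only influences which experts are selected, while the combine weights passed on to the expert FFNs come from the unbiased scores.

```python
import paddle


def route_tokens(gate_logits, e_score_correction_bias, top_k):
    """Illustrative MoE routing: the correction bias steers expert selection
    only; the gating weights handed to the expert FFNs stay bias-free."""
    scores = paddle.nn.functional.sigmoid(gate_logits)         # [num_tokens, num_experts]
    biased = scores + e_score_correction_bias                  # bias used for selection only
    _, expert_idx = paddle.topk(biased, k=top_k, axis=-1)      # experts chosen per token
    weights = paddle.take_along_axis(scores, expert_idx, -1)   # unbiased scores of the chosen experts
    weights = weights / weights.sum(axis=-1, keepdim=True)     # normalized combine weights
    return expert_idx, weights
```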
4 changes: 1 addition & 3 deletions examples/best_practices/DeepSeek-V3/dsv3_128k_config.yaml
@@ -75,10 +75,8 @@ sharding: stage1
bf16: true
amp_master_grad: true
fp16_opt_level: O2
-use_flash_attention: true
-use_attn_mask_startend_row_indices: true
-using_fake_gate: false
+moe_router_force_load_balancing: false
pre_alloc_memory: 60
tensorwise_offload_optimizer: true
fuse_rms_norm: true
moe_subbatch_token_num_before_dispatch: 1024
4 changes: 1 addition & 3 deletions examples/best_practices/DeepSeek-V3/dsv3_32k_config.yaml
@@ -75,10 +75,8 @@ sharding: stage1
bf16: true
amp_master_grad: true
fp16_opt_level: O2
-use_flash_attention: true
-use_attn_mask_startend_row_indices: true
-using_fake_gate: false
+moe_router_force_load_balancing: false
pre_alloc_memory: 60
tensorwise_offload_optimizer: true
fuse_rms_norm: true
moe_subbatch_token_num_before_dispatch: 0
4 changes: 1 addition & 3 deletions examples/best_practices/DeepSeek-V3/dsv3_4k_config.yaml
@@ -75,10 +75,8 @@ sharding: stage1
bf16: true
amp_master_grad: true
fp16_opt_level: O2
-use_flash_attention: true
-use_attn_mask_startend_row_indices: true
-using_fake_gate: false
+moe_router_force_load_balancing: false
pre_alloc_memory: 60
tensorwise_offload_optimizer: true
fuse_rms_norm: true
moe_subbatch_token_num_before_dispatch: 0
@@ -9,8 +9,8 @@
"AutoModel": "DeepseekV2ModelFast",
"AutoModelForCausalLM": "DeepseekV2ForCausalLM"
},
"aux_loss_alpha": 0.0001,
"aux_loss_free_gamma": 0.0,
"router_aux_loss_coef": 0.0001,
"moe_router_bias_update_rate": 0.0,
"bos_token_id": 0,
"eos_token_id": 1,
"ep_size": 1,
@@ -61,8 +61,6 @@
"v_head_dim": 128,
"vocab_size": 129280,
"using_flex_token": true,
"fuse_rms_norm": true,
"fuse_attention_ffn": true,
"apply_rope_fusion": true,
"token_drop_steps": 0,
"recompute_fwd_gate_up": true,
@@ -23,7 +23,6 @@ expert_model_parallel_size: 2
sharding: "stage1"
virtual_pipeline_model_parallel_size: 1
sequence_parallel: 0
-use_flash_attention: true
max_seq_len: 4097
learning_rate: 0.000022
min_lr: 0.00000073333
@@ -48,8 +47,6 @@ distributed_dataloader: 1
unified_checkpoint: true
save_total_limit: 2
skip_profile_timer: false
-fuse_rms_norm: true
-fuse_attention_ffn: true
apply_rope_fusion: true
save_sharded_model: false
load_sharded_model: false
@@ -58,7 +55,7 @@ unified_checkpoint_config: "ignore_merge_optimizer"
offload_optim: true
reorder_pipeline_priority: true
num_nextn_predict_layers: 1
-using_fake_gate: false
+moe_router_force_load_balancing: false
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
pre_alloc_memory: 61
@@ -11,7 +11,7 @@ random_shuffle: false

### model
model_name_or_path: baidu/ERNIE-4.5-VL-28B-A3B-Thinking
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -55,7 +55,6 @@ recompute_num_layers: 1
recompute_modules: ["loss_fn"]
recompute_use_reentrant: true

-use_flash_attention: true
sequence_parallel: true
pp_seg_method: layer:Ernie4_5_DecoderLayer|ErnieDecoderLayer|EmptyLayer
offload_queue: true
@@ -11,7 +11,7 @@ random_shuffle: false

### model
model_name_or_path: baidu/ERNIE-4.5-VL-28B-A3B-Thinking
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -55,7 +55,6 @@ recompute_num_layers: 1
recompute_modules: ["loss_fn"]
recompute_use_reentrant: true

-use_flash_attention: true
sequence_parallel: true
pp_seg_method: layer:Ernie4_5_DecoderLayer|ErnieDecoderLayer|EmptyLayer
offload_queue: true
@@ -11,7 +11,7 @@ random_shuffle: false

### model
model_name_or_path: baidu/ERNIE-4.5-VL-28B-A3B-Thinking
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 32

@@ -57,7 +57,6 @@ recompute_num_layers: 1
recompute_modules: ["loss_fn"]
recompute_use_reentrant: true

-use_flash_attention: true
sequence_parallel: true
pp_seg_method: layer:Ernie4_5_DecoderLayer|ErnieDecoderLayer|EmptyLayer
offload_queue: true
6 changes: 3 additions & 3 deletions examples/best_practices/PaddleOCR-VL/README.md
@@ -134,7 +134,7 @@ template: paddleocr_vl

### model
model_name_or_path: PaddlePaddle/PaddleOCR-VL
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -207,7 +207,7 @@ template: paddleocr_vl

### model
model_name_or_path: PaddlePaddle/PaddleOCR-VL
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8

@@ -728,7 +728,7 @@ CUDA_VISIBLE_DEVICES=0 paddleformers-cli train examples/best_practices/PaddleOCR
per_device_train_batch_size=2 \
per_device_eval_batch_size=2 \
gradient_accumulation_steps=32 \
-attn_impl=sdpa \
+_attn_implementation=sdpa \
pre_alloc_memory=18 \
device=iluvatar_gpu
```
@@ -15,7 +15,7 @@ template: paddleocr_vl

### model
model_name_or_path: PaddlePaddle/PaddleOCR-VL
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -15,7 +15,7 @@ template: paddleocr_vl

### model
model_name_or_path: PaddlePaddle/PaddleOCR-VL
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8

@@ -218,7 +218,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -188,7 +188,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -444,7 +444,7 @@ template: qwen2_vl

### model
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8
lora_alpha: 32
@@ -267,7 +267,7 @@ mix_strategy: concat

### model
model_name_or_path: Qwen/Qwen3-0.6B
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
@@ -408,7 +408,7 @@ mix_strategy: concat
### model
model_name_or_path: ./checkpoints/paddleformers_qwen3_0p6b_sft_ckpts_emoji/
-attn_impl: flashmask
+_attn_implementation: flashmask
### finetuning
# base
2 changes: 1 addition & 1 deletion examples/config/dpo/full.yaml
@@ -13,7 +13,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B-Base
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
2 changes: 1 addition & 1 deletion examples/config/dpo/full_function_call.yaml
@@ -14,7 +14,7 @@ split_multi_turn: False

### model
model_name_or_path: Qwen/Qwen3-0.6B-Base
-attn_impl: flashmask
+_attn_implementation: flashmask
use_fused_head_and_loss_fn: false
loss_subbatch_sequence_length: 8192

2 changes: 1 addition & 1 deletion examples/config/dpo/full_tp_pp.yaml
@@ -14,7 +14,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B-Base
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
2 changes: 1 addition & 1 deletion examples/config/dpo/full_tp_pp_ep.yaml
@@ -14,7 +14,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B-Base
-attn_impl: flashmask
+_attn_implementation: flashmask

### finetuning
# base
2 changes: 1 addition & 1 deletion examples/config/dpo/lora.yaml
@@ -13,7 +13,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B-Base
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8

2 changes: 1 addition & 1 deletion examples/config/dpo/lora_tp_pp.yaml
@@ -13,7 +13,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B-Base
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8

2 changes: 1 addition & 1 deletion examples/config/dpo/lora_tp_pp_ep.yaml
@@ -13,7 +13,7 @@ template: qwen3

### model
model_name_or_path: Qwen/Qwen3-0.6B-Base
-attn_impl: flashmask
+_attn_implementation: flashmask
lora: true
lora_rank: 8

@@ -13,7 +13,7 @@ template: ernie_nothink

### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-PT
-attn_impl: eager
+_attn_implementation: eager

### finetuning
# base
@@ -13,7 +13,7 @@ template: ernie_nothink

### model
model_name_or_path: baidu/ERNIE-4.5-0.3B-PT
-attn_impl: eager
+_attn_implementation: eager
lora: true
lora_rank: 8
