Status: Open
Labels: bug, stale, triage
Description
What happened?
```
[11/25/25 07:56:53] INFO [rank-0] Building PEFT model... train.py:462
[11/25/25 07:57:16] INFO [rank-0]                       torch_utils.py:288
    Model Parameters Summary:
    🔢 Total parameters: 31,362,594,816
    🔗 Embedding parameters: 311,164,928
    🎯 Trainable parameters: 830,472,192
    🔒 Frozen parameters: 30,532,122,624 (97.35%)
[11/25/25 07:57:17] INFO [rank-0] PROF: Torch Profiler disabled! torch_profiler_utils.py:164
WARNING [rank-0] MFU logging is not supported for PEFT. Skipping MFU callbacks.
...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens:
{'pad_token_id': 151643}.
  0%|          | 0/717 [00:00<?, ?it/s]
[rank7]:[E1125 08:25:19.475037007 ProcessGroupNCCL.cpp:685] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
[rank7]:[E1125 08:25:19.475360022 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 7] failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37877, last completed work: 37875
[rank7]:[E1125 08:25:19.475389244 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank7]:[E1125 08:25:19.475545658 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 7] First PG on this rank to signal dumping.
[rank5]:[E1125 08:25:19.489563588 ProcessGroupNCCL.cpp:685] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank5]:[E1125 08:25:19.489847310 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37877, last completed work: 37875
[rank5]:[E1125 08:25:19.489872137 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank5]:[E1125 08:25:19.490011797 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 5] First PG on this rank to signal dumping.
[rank6]:[E1125 08:25:19.537363073 ProcessGroupNCCL.cpp:685] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=77890080, NumelOut=623120640, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank6]:[E1125 08:25:19.537665205 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 6] failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37876, last completed work: 37875
[rank3]:[E1125 08:25:19.537666978 ProcessGroupNCCL.cpp:685] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=77890080, NumelOut=623120640, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank6]:[E1125 08:25:19.537705651 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank6]:[E1125 08:25:19.537834038 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 6] First PG on this rank to signal dumping.
```
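The error messages above point to FlightRecorder for capturing the failing collective's stack trace. A minimal relaunch sketch based on those messages — the buffer size of 2000 entries is an arbitrary example value, and `NCCL_DEBUG=INFO` is an optional extra for correlating ranks:

```shell
# Enable FlightRecorder so the next watchdog timeout dumps the failing
# collective's stack trace (any non-zero buffer size works; 2000 is arbitrary).
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
# Optional: verbose NCCL logging to see what each rank was doing at the hang.
export NCCL_DEBUG=INFO

echo "TORCH_NCCL_TRACE_BUFFER_SIZE=$TORCH_NCCL_TRACE_BUFFER_SIZE"
# Then relaunch training, e.g.:
# oumi distributed torchrun -m oumi train -c configs/recipes/qwen3_coder/train/train.yaml
```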
Steps to reproduce the bug
```shell
oumi distributed torchrun -m oumi train -c configs/recipes/qwen3_coder/train/train.yaml
```
```yaml
# LoRA config for Qwen3 30B A3B (MoE model with 3B activated params).
#
# Requirements:
#   - Log into WandB (`wandb login`) or disable `enable_wandb`
#
# Usage:
#   oumi distributed torchrun -m oumi train -c oumi://configs/recipes/qwen3/sft/30b_a3b_lora/train.yaml
#
# See Also:
#   - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
#   - Config class: oumi.core.configs.TrainingConfig
#   - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
#   - Other training configs: configs/**/*train.yaml

model:
  model_name: "/data0/swh/models/Qwen3-Coder-30B-A3B-Instruct"
  model_max_length: 8192
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  trust_remote_code: True

data:
  train:
    datasets:
      - dataset_name: /data0/swh/codes/grpo_exe/multiground_v8_251118.json # 51,760 examples
        # target_col: "prompt"

training:
  trainer_type: "TRL_SFT"
  use_peft: True
  save_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8
  max_grad_norm: null
  enable_gradient_checkpointing: True
  gradient_checkpointing_kwargs:
    use_reentrant: False
  ddp_find_unused_parameters: False
  optimizer: "adamw_torch_fused"
  learning_rate: 3.0e-04
  lr_scheduler_type: "cosine"
  warmup_steps: 100
  weight_decay: 0.01
  compile: False
  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32
  logging_steps: 100
  empty_device_cache_steps: 50
  output_dir: "/data0/swh/models/oumi-instruct-lora2"
  include_performance_metrics: True
  enable_wandb: False

peft:
  lora_r: 16
  lora_alpha: 32
  lora_target_modules:
    - "gate_proj"
    - "up_proj"
    - "down_proj"
    # - "gate"
    # - "q_proj"
    # - "k_proj"
    # - "v_proj"

fsdp:
  enable_fsdp: True
  forward_prefetch: True
  sharding_strategy: "FULL_SHARD"
  auto_wrap_policy: "TRANSFORMER_BASED_WRAP"
  transformer_layer_cls: "Qwen3MoeDecoderLayer"
```
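As a stopgap while debugging, the 600000 ms watchdog timeout seen in the log can be raised at process-group initialization. A generic PyTorch sketch, not oumi-specific (oumi may expose its own setting for this; the 2-hour value is an arbitrary example):

```python
from datetime import timedelta

# Hypothetical stopgap: raise the NCCL watchdog timeout from the default
# 10 minutes (the 600000 ms seen in the log) to 2 hours.
NCCL_TIMEOUT = timedelta(hours=2)

# Inside whatever code initializes distributed training (generic PyTorch):
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=NCCL_TIMEOUT)

print(int(NCCL_TIMEOUT.total_seconds()))  # 7200
```

Note this only hides the symptom; the mismatched `NumelIn` values across ranks at the same `SeqNum` still need investigation.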
System Info
no