
[Bug] NCCL (NVIDIA Collective Communications Library) timeout error when training Qwen3-Coder-30B-A3B-Instruct #2055

@shiwanghua

Description

What happened?

[11/25/25 07:56:53] INFO     [rank-0] Building PEFT model...                                                                                                                                                               train.py:462
[11/25/25 07:57:16] INFO     [rank-0]                                                                                                                                                                                torch_utils.py:288
                             Model Parameters Summary:                                                                                                                                                                                 
                             🔢 Total     parameters: 31,362,594,816                                                                                                                                                                   
                             🔗 Embedding parameters: 311,164,928                                                                                                                                                                      
                             🎯 Trainable parameters: 830,472,192                                                                                                                                                                      
                             🔒 Frozen    parameters: 30,532,122,624 (97.35%)                                                                                                                                                          
                                                                                                                                                                                                                                       
[11/25/25 07:57:17] INFO     [rank-0] PROF: Torch Profiler disabled!                                                                                                                                        torch_profiler_utils.py:164
                    WARNING  [rank-0] MFU logging is not supported for PEFT. Skipping MFU callbacks.                       
...

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: 
{'pad_token_id': 151643}.
  0%|          | 0/717 [00:00<?, ?it/s]
[rank7]:[E1125 08:25:19.475037007 ProcessGroupNCCL.cpp:685] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
[rank7]:[E1125 08:25:19.475360022 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 7]  failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37877, last completed work: 37875
[rank7]:[E1125 08:25:19.475389244 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank7]:[E1125 08:25:19.475545658 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 7] First PG on this rank to signal dumping.
[rank5]:[E1125 08:25:19.489563588 ProcessGroupNCCL.cpp:685] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank5]:[E1125 08:25:19.489847310 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 5]  failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37877, last completed work: 37875
[rank5]:[E1125 08:25:19.489872137 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank5]:[E1125 08:25:19.490011797 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 5] First PG on this rank to signal dumping.
[rank6]:[E1125 08:25:19.537363073 ProcessGroupNCCL.cpp:685] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=77890080, NumelOut=623120640, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank6]:[E1125 08:25:19.537665205 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 6]  failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37876, last completed work: 37875
[rank3]:[E1125 08:25:19.537666978 ProcessGroupNCCL.cpp:685] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=77890080, NumelOut=623120640, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank6]:[E1125 08:25:19.537705651 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank6]:[E1125 08:25:19.537834038 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 6] First PG on this rank to signal dumping.
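The watchdog itself suggests the next diagnostic step: enabling the FlightRecorder trace buffer so the failed collective's stack trace gets dumped instead of "not found". A possible pre-launch environment for the next run, using standard PyTorch/NCCL environment variables (these are not oumi-specific settings):

```shell
# Record recent collectives so the watchdog can dump traces on timeout;
# the error log above names this exact variable.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000

# Verbose NCCL logging, to see which collective hangs and on which rank.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
```

With these set before `oumi distributed torchrun ...`, a repeat of the timeout should produce a flight-recorder dump and per-collective NCCL logs that pinpoint where rank 6 (the one stuck at `last enqueued work: 37876`) diverged from the others.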

Steps to reproduce the bug

oumi distributed torchrun -m oumi train -c configs/recipes/qwen3_coder/train/train.yaml

# LoRA config for Qwen3 30B A3B (MoE model with 3B activated params).
#
# Requirements:
#   - Log into WandB (`wandb login`) or disable `enable_wandb`
#
# Usage:
#   oumi distributed torchrun -m oumi train -c oumi://configs/recipes/qwen3/sft/30b_a3b_lora/train.yaml
#
# See Also:
#   - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
#   - Config class: oumi.core.configs.TrainingConfig
#   - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
#   - Other training configs: configs/**/*train.yaml

model:
  model_name: "/data0/swh/models/Qwen3-Coder-30B-A3B-Instruct"
  model_max_length: 8192
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  trust_remote_code: True

data:
  train:
    datasets:
      - dataset_name: /data0/swh/codes/grpo_exe/multiground_v8_251118.json # 51,760 examples
    # target_col: "prompt"

training:
  trainer_type: "TRL_SFT"
  use_peft: True
  save_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8
  max_grad_norm: null

  enable_gradient_checkpointing: True
  gradient_checkpointing_kwargs:
    use_reentrant: False
  ddp_find_unused_parameters: False
  optimizer: "adamw_torch_fused"
  learning_rate: 3.0e-04
  lr_scheduler_type: "cosine"
  warmup_steps: 100
  weight_decay: 0.01
  compile: False

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  logging_steps: 100
  empty_device_cache_steps: 50
  output_dir: "/data0/swh/models/oumi-instruct-lora2"
  include_performance_metrics: True
  enable_wandb: False

peft:
  lora_r: 16
  lora_alpha: 32
  lora_target_modules:
    - "gate_proj"
    - "up_proj"
    - "down_proj"
    # - "gate"
    # - "q_proj"
    # - "k_proj"
    # - "v_proj"

fsdp:
  enable_fsdp: True
  forward_prefetch: True
  sharding_strategy: "FULL_SHARD"
  auto_wrap_policy: "TRANSFORMER_BASED_WRAP"
  transformer_layer_cls: "Qwen3MoeDecoderLayer"
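For what it's worth, the tensor sizes in the watchdog output are internally consistent with this FULL_SHARD setup on 8 GPUs; for an all-gather, `NumelOut / NumelIn` equals the number of participating ranks. A quick check (my own arithmetic, not from the report):

```python
# Sanity check on the watchdog log; no torch required.
# Sizes reported by ranks 3 and 6 (a large FSDP parameter all-gather):
large_in, large_out = 77_890_080, 623_120_640
# Sizes reported by ranks 5 and 7 (a small all-gather):
small_in, small_out = 4_096, 32_768

world_size = large_out // large_in
assert small_out // small_in == world_size
print(world_size)  # 8, matching ranks 0-7 in the log
```

Notably, ranks 5 and 7 timed out on the *small* all-gather while ranks 3 and 6 timed out on the *large* one at the same `SeqNum=37876`, which suggests the ranks desynchronized (different collectives enqueued at the same sequence number) rather than a plain slow link.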

System Info

no

Labels

bug (Something isn't working) · stale · triage (This issue needs review by the core team)
