
[Bug] NCCL (NVIDIA Collective Communications Library) timeout error when training Qwen3-Coder-30B-A3B-Instruct #2055

@shiwanghua

Description

What happened?

[11/25/25 07:56:53] INFO     [rank-0] Building PEFT model...                                                                                                                                                               train.py:462
[11/25/25 07:57:16] INFO     [rank-0]                                                                                                                                                                                torch_utils.py:288
                             Model Parameters Summary:                                                                                                                                                                                 
                             🔢 Total     parameters: 31,362,594,816                                                                                                                                                                   
                             🔗 Embedding parameters: 311,164,928                                                                                                                                                                      
                             🎯 Trainable parameters: 830,472,192                                                                                                                                                                      
                             🔒 Frozen    parameters: 30,532,122,624 (97.35%)                                                                                                                                                          
                                                                                                                                                                                                                                       
[11/25/25 07:57:17] INFO     [rank-0] PROF: Torch Profiler disabled!                                                                                                                                        torch_profiler_utils.py:164
                    WARNING  [rank-0] MFU logging is not supported for PEFT. Skipping MFU callbacks.                       
...

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: 
{'pad_token_id': 151643}.
  0%|          | 0/717 [00:00<?, ?it/s]
[rank7]:[E1125 08:25:19.475037007 ProcessGroupNCCL.cpp:685] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
[rank7]:[E1125 08:25:19.475360022 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 7]  failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37877, last completed work: 37875
[rank7]:[E1125 08:25:19.475389244 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank7]:[E1125 08:25:19.475545658 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 7] First PG on this rank to signal dumping.
[rank5]:[E1125 08:25:19.489563588 ProcessGroupNCCL.cpp:685] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank5]:[E1125 08:25:19.489847310 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 5]  failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37877, last completed work: 37875
[rank5]:[E1125 08:25:19.489872137 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank5]:[E1125 08:25:19.490011797 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 5] First PG on this rank to signal dumping.
[rank6]:[E1125 08:25:19.537363073 ProcessGroupNCCL.cpp:685] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=77890080, NumelOut=623120640, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank6]:[E1125 08:25:19.537665205 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 6]  failure detected by watchdog at work sequence id: 37876 PG status: last enqueued work: 37876, last completed work: 37875
[rank3]:[E1125 08:25:19.537666978 ProcessGroupNCCL.cpp:685] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=37876, OpType=_ALLGATHER_BASE, NumelIn=77890080, NumelOut=623120640, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank6]:[E1125 08:25:19.537705651 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank6]:[E1125 08:25:19.537834038 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 6] First PG on this rank to signal dumping.
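The watchdog itself suggests the next diagnostic step: enabling the FlightRecorder trace buffer so the failed collective's stack trace gets dumped instead of "not found". A possible pre-launch environment for the next run, using standard PyTorch/NCCL environment variables (these are not oumi-specific settings):

```shell
# Record recent collectives so the watchdog can dump traces on timeout;
# the error log above names this exact variable.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000

# Verbose NCCL logging, to see which collective hangs and on which rank.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
```

With these set before `oumi distributed torchrun ...`, a repeat of the timeout should produce a flight-recorder dump and per-collective NCCL logs that pinpoint where rank 6 (the one stuck at `last enqueued work: 37876`) diverged from the others.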

Steps to reproduce the bug

oumi distributed torchrun -m oumi train -c configs/recipes/qwen3_coder/train/train.yaml

# LoRA config for Qwen3 30B A3B (MoE model with 3B activated params).
#
# Requirements:
#   - Log into WandB (`wandb login`) or disable `enable_wandb`
#
# Usage:
#   oumi distributed torchrun -m oumi train -c oumi://configs/recipes/qwen3/sft/30b_a3b_lora/train.yaml
#
# See Also:
#   - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
#   - Config class: oumi.core.configs.TrainingConfig
#   - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
#   - Other training configs: configs/**/*train.yaml

model:
  model_name: "/data0/swh/models/Qwen3-Coder-30B-A3B-Instruct"
  model_max_length: 8192
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  trust_remote_code: True

data:
  train:
    datasets:
      - dataset_name: /data0/swh/codes/grpo_exe/multiground_v8_251118.json # 51,760 examples
    # target_col: "prompt"

training:
  trainer_type: "TRL_SFT"
  use_peft: True
  save_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8
  max_grad_norm: null

  enable_gradient_checkpointing: True
  gradient_checkpointing_kwargs:
    use_reentrant: False
  ddp_find_unused_parameters: False
  optimizer: "adamw_torch_fused"
  learning_rate: 3.0e-04
  lr_scheduler_type: "cosine"
  warmup_steps: 100
  weight_decay: 0.01
  compile: False

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  logging_steps: 100
  empty_device_cache_steps: 50
  output_dir: "/data0/swh/models/oumi-instruct-lora2"
  include_performance_metrics: True
  enable_wandb: False

peft:
  lora_r: 16
  lora_alpha: 32
  lora_target_modules:
    - "gate_proj"
    - "up_proj"
    - "down_proj"
    # - "gate"
    # - "q_proj"
    # - "k_proj"
    # - "v_proj"

fsdp:
  enable_fsdp: True
  forward_prefetch: True
  sharding_strategy: "FULL_SHARD"
  auto_wrap_policy: "TRANSFORMER_BASED_WRAP"
  transformer_layer_cls: "Qwen3MoeDecoderLayer"
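For what it's worth, the tensor sizes in the watchdog output are internally consistent with this FULL_SHARD setup on 8 GPUs; for an all-gather, `NumelOut / NumelIn` equals the number of participating ranks. A quick check (my own arithmetic, not from the report):

```python
# Sanity check on the watchdog log; no torch required.
# Sizes reported by ranks 3 and 6 (a large FSDP parameter all-gather):
large_in, large_out = 77_890_080, 623_120_640
# Sizes reported by ranks 5 and 7 (a small all-gather):
small_in, small_out = 4_096, 32_768

world_size = large_out // large_in
assert small_out // small_in == world_size
print(world_size)  # 8, matching ranks 0-7 in the log
```

Notably, ranks 5 and 7 timed out on the *small* all-gather while ranks 3 and 6 timed out on the *large* one at the same `SeqNum=37876`, which suggests the ranks desynchronized (different collectives enqueued at the same sequence number) rather than a plain slow link.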

System Info

no

Labels

bug (Something isn't working) · stale · triage (This issue needs review by the core team)
