Fix training step handling with gradient accumulation #1470

Open
firefighter-eric wants to merge 1 commit into modelscope:main from firefighter-eric:fix/grad-accum-logger-step

@firefighter-eric (Contributor) commented May 30, 2026

Summary

This PR fixes training logger and tqdm progress handling when gradient_accumulation_steps > 1.

The training loop runs once per micro-batch inside accelerator.accumulate(...), but a real optimizer step only happens when gradients are synchronized. Previously, both model_logger.on_step_end(...) and tqdm(dataloader) advanced on every micro-batch. As a result, ModelLogger.num_steps, save_steps, checkpoint names, W&B/TensorBoard/SwanLab steps, and the progress bar could all be based on micro-batch count instead of real optimizer update count.

The discrepancy is most visible when two runs reach the same effective total batch size through different gradient_accumulation_steps values. Those configurations perform the same number of optimizer updates, so they should report the same logger steps and tqdm progress. This PR makes both the logger and the tqdm progress bar depend on accelerator.sync_gradients, so they advance only at real optimizer-step boundaries instead of micro-batch boundaries.
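To make the consistency argument concrete, here is a minimal arithmetic sketch. The helper name `count_steps`, the dataset size of 64 samples, and the single-device setting are illustrative assumptions, not values from this PR.

```python
import math


def count_steps(num_samples: int, micro_batch_size: int, grad_accum_steps: int) -> tuple[int, int]:
    """Return (micro_batches, optimizer_steps) for one epoch on one device.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    micro_batches = math.ceil(num_samples / micro_batch_size)
    # A trailing partial accumulation group is assumed to still end in an update.
    optimizer_steps = math.ceil(micro_batches / grad_accum_steps)
    return micro_batches, optimizer_steps


# Both configurations use an effective batch size of 8 samples per update.
config_a = count_steps(num_samples=64, micro_batch_size=8, grad_accum_steps=1)
config_b = count_steps(num_samples=64, micro_batch_size=2, grad_accum_steps=4)

print(config_a)  # (8, 8)
print(config_b)  # (32, 8)
```

Counting by micro-batch would report 8 steps for the first configuration and 32 for the second, while counting by optimizer update reports 8 for both.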

For example, with gradient_accumulation_steps=4, save_steps=10 could previously trigger after 10 micro-batches, when as few as 2 optimizer updates had been applied.
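The save_steps example can be checked with a small simulation. This is a sketch under simplifying assumptions: `first_save_micro_batch` is a hypothetical helper, a sync boundary is modeled as every `grad_accum_steps`-th micro-batch, and the extra boundary at the end of a dataloader is ignored.

```python
def first_save_micro_batch(save_steps: int, grad_accum_steps: int, count_micro_batches: bool) -> int:
    """Return the 1-based micro-batch index at which the first save would fire.

    count_micro_batches=True models the old behavior (logger advances on every
    micro-batch); False models the new behavior (logger advances only at a
    sync boundary). Hypothetical helper, not code from the PR.
    """
    logger_steps = 0
    micro_batch = 0
    while True:
        micro_batch += 1
        is_sync = micro_batch % grad_accum_steps == 0
        if count_micro_batches or is_sync:
            logger_steps += 1
            if logger_steps % save_steps == 0:
                return micro_batch


old = first_save_micro_batch(save_steps=10, grad_accum_steps=4, count_micro_batches=True)
new = first_save_micro_batch(save_steps=10, grad_accum_steps=4, count_micro_batches=False)

print(old)  # 10
print(new)  # 40
```

Under these assumptions, the first save moves from micro-batch 10 (2 completed updates) to micro-batch 40 (10 completed updates).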

This change updates the training loop so that:

  • model_logger.on_step_end(...) is called only when accelerator.sync_gradients is true
  • the tqdm total is based on optimizer steps, not micro-batches
  • the tqdm progress bar is updated only when accelerator.sync_gradients is true

As a result, model save steps, logged metric steps, checkpoint names, and tqdm progress now align with real optimizer steps under gradient accumulation.
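The three changes listed above can be sketched without any training dependencies. `FakeAccelerator` is a hypothetical stand-in that mimics only the `sync_gradients` flag, and the two counters stand in for `model_logger.on_step_end(...)` and a manual progress-bar update; the actual code in diffsynth/diffusion/runner.py may be structured differently. The stand-in also assumes that the last micro-batch of the dataloader forces a sync.

```python
import math


class FakeAccelerator:
    """Hypothetical stand-in that mimics only the sync_gradients flag."""

    def __init__(self, grad_accum_steps: int, num_micro_batches: int):
        self.grad_accum_steps = grad_accum_steps
        self.num_micro_batches = num_micro_batches
        self.sync_gradients = False
        self._seen = 0

    def begin_micro_batch(self) -> None:
        self._seen += 1
        # Assumption: sync on every N-th micro-batch and on the final one.
        self.sync_gradients = (
            self._seen % self.grad_accum_steps == 0
            or self._seen == self.num_micro_batches
        )


def run_epoch(num_micro_batches: int, grad_accum_steps: int) -> tuple[int, int, int]:
    """Return (logger_steps, progress, progress_total) after one simulated epoch."""
    accelerator = FakeAccelerator(grad_accum_steps, num_micro_batches)
    # Progress total counts optimizer steps, not micro-batches.
    progress_total = math.ceil(num_micro_batches / grad_accum_steps)
    logger_steps = 0
    progress = 0
    for _ in range(num_micro_batches):
        accelerator.begin_micro_batch()
        # Forward and backward passes would run here on every micro-batch.
        if accelerator.sync_gradients:
            logger_steps += 1  # stands in for model_logger.on_step_end(...)
            progress += 1      # stands in for a manual progress-bar update
    return logger_steps, progress, progress_total


print(run_epoch(num_micro_batches=10, grad_accum_steps=4))  # (3, 3, 3)
```

Gating both counters on the same flag keeps the logger step, the save trigger, and the progress bar in agreement by construction.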

Validation

  • Ran python3 -m py_compile diffsynth/diffusion/runner.py
  • Ran a minimal training-loop check with 10 micro-batches and gradient_accumulation_steps=4; logger step count and progress bar total were 3, matching actual optimizer steps
  • Ran FLUX.2 Klein 4B LoRA training via diff101/train/flux2_klein_4b_lora.sh with --gradient_accumulation_steps 4; training completed successfully, tqdm showed 2/2 for 5 micro-batches, and W&B history logged 2 loss points, matching optimizer-step boundaries

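Chinese description (translated)

This PR fixes the logger step and tqdm progress advancing per micro-batch when gradient_accumulation_steps > 1.

The training loop runs once per micro-batch inside accelerator.accumulate(...), but a real optimizer step only happens when gradients are synchronized. Previously, both model_logger.on_step_end(...) and tqdm(dataloader) advanced after every micro-batch, so ModelLogger.num_steps, save_steps, checkpoint file names, W&B/TensorBoard/SwanLab steps, and the progress bar could all be computed from the micro-batch count rather than the real optimizer update count.

The problem is most visible when the effective total batch size is the same but is reached with different gradient_accumulation_steps values. Those configurations should have consistent logger steps and tqdm progress because they perform the same number of optimizer updates. This PR makes both the logger and tqdm depend on accelerator.sync_gradients, so they advance only at real optimizer-step boundaries rather than micro-batch boundaries.

For example, with gradient_accumulation_steps=4, save_steps=10 could previously trigger a save after 10 micro-batches, when fewer optimizer updates had actually happened.

With this change, the training loop:

  • calls model_logger.on_step_end(...) only when accelerator.sync_gradients is true
  • computes the tqdm total from optimizer steps, not micro-batches
  • updates the tqdm progress bar only when accelerator.sync_gradients is true

As a result, with gradient accumulation enabled, model save steps, logged metric steps, checkpoint file names, and tqdm progress all align with real optimizer steps.

Validation

  • Ran python3 -m py_compile diffsynth/diffusion/runner.py
  • Ran a minimal training-loop test with 10 micro-batches and gradient_accumulation_steps=4; the logger step count and progress bar total were both 3, matching the real optimizer step count
  • Ran FLUX.2 Klein 4B LoRA training via diff101/train/flux2_klein_4b_lora.sh with --gradient_accumulation_steps 4; training completed successfully, tqdm showed 2/2 for 5 micro-batches, and W&B history recorded 2 loss points, matching real optimizer-step boundaries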

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request modifies the training runner to only trigger step-end logging when gradients are synchronized. Feedback suggests accumulating and averaging the detached loss across all micro-batches during gradient accumulation to prevent noisy and inaccurate loss curves, rather than only logging the final micro-batch's loss.
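The averaging suggested in this review could look roughly like the following. `LossAverager` is a hypothetical helper, the loss values are made up for illustration, and plain floats stand in for detached loss tensors; this is a sketch of the idea, not the reviewer's or the author's implementation.

```python
class LossAverager:
    """Hypothetical helper: collects per-micro-batch losses until a sync boundary."""

    def __init__(self) -> None:
        self._losses: list[float] = []

    def add(self, loss: float) -> None:
        # In real training this would receive a detached scalar loss.
        self._losses.append(float(loss))

    def pop_mean(self) -> float:
        """Return the mean of the collected losses and reset the buffer."""
        mean = sum(self._losses) / len(self._losses)
        self._losses.clear()
        return mean


averager = LossAverager()
logged = []
micro_losses = [1.0, 2.0, 3.0, 6.0, 4.0, 4.0]  # illustrative values only
grad_accum_steps = 4

for i, loss in enumerate(micro_losses, start=1):
    averager.add(loss)
    # Assumption: sync on every N-th micro-batch and on the final one.
    is_sync = i % grad_accum_steps == 0 or i == len(micro_losses)
    if is_sync:
        logged.append(averager.pop_mean())

print(logged)  # [3.0, 4.0]
```

Logging only the last micro-batch of each group would have recorded 6.0 and 4.0 here, which shows how a single noisy micro-batch can distort the curve.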

Comment thread diffsynth/diffusion/runner.py Outdated
@firefighter-eric force-pushed the fix/grad-accum-logger-step branch from 39908b6 to 71d471b on May 30, 2026 16:41
@firefighter-eric changed the title from "Fix training step logging with gradient accumulation" to "Fix training step handling with gradient accumulation" on May 30, 2026