Fix training step handling with gradient accumulation #1470
Open
firefighter-eric wants to merge 1 commit into
Code Review
This pull request modifies the training runner to only trigger step-end logging when gradients are synchronized. Feedback suggests accumulating and averaging the detached loss across all micro-batches during gradient accumulation to prevent noisy and inaccurate loss curves, rather than only logging the final micro-batch's loss.
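The review feedback above can be illustrated with a minimal, dependency-free sketch. `MicroBatchLossAverager` and `logged_losses` are hypothetical names, not part of the PR or of any library, and plain floats stand in for detached loss values (with PyTorch one would typically pass something like `loss.detach().item()`). The sketch assumes gradients synchronize on every `accumulation_steps`-th micro-batch and on the last micro-batch of the dataloader.

```python
class MicroBatchLossAverager:
    """Hypothetical helper: averages losses over one accumulation window."""

    def __init__(self):
        self._total = 0.0
        self._count = 0

    def add(self, loss_value):
        # Store a plain float so no autograd graph would be kept alive.
        self._total += float(loss_value)
        self._count += 1

    def pop_mean(self):
        # Return the window mean and reset for the next window.
        if self._count == 0:
            raise ValueError("no micro-batch losses recorded")
        mean = self._total / self._count
        self._total, self._count = 0.0, 0
        return mean


def logged_losses(micro_batch_losses, accumulation_steps):
    """Return the loss values that would be logged, one per optimizer step."""
    averager = MicroBatchLossAverager()
    logged = []
    last_index = len(micro_batch_losses) - 1
    for i, loss in enumerate(micro_batch_losses):
        averager.add(loss)
        # Assumed sync rule: every N-th micro-batch, plus the final one.
        sync_gradients = (i + 1) % accumulation_steps == 0 or i == last_index
        if sync_gradients:
            logged.append(averager.pop_mean())
    return logged
```

Logging only the final micro-batch's loss would record `4.0` and `10.0` for the losses `[1.0, 2.0, 3.0, 4.0, 10.0]` with an accumulation of 4, whereas the averaged version records `2.5` and `10.0`.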
Force-pushed from 39908b6 to 71d471b.
Summary
This PR fixes training logger and tqdm progress handling when `gradient_accumulation_steps > 1`.

The training loop runs once per micro-batch inside `accelerator.accumulate(...)`, but a real optimizer step happens only when gradients are synchronized. Previously, both `model_logger.on_step_end(...)` and `tqdm(dataloader)` advanced on every micro-batch. As a result, `ModelLogger.num_steps`, `save_steps`, checkpoint names, W&B/TensorBoard/SwanLab steps, and the progress bar could all be based on the micro-batch count instead of the real optimizer update count.

This is especially visible when the same effective total batch size is achieved with different `gradient_accumulation_steps` values. Those configurations perform the same number of optimizer updates, so they should have consistent logger steps and tqdm progress. This PR makes both the logger and tqdm progress depend on `accelerator.sync_gradients`, so they advance only at real optimizer-step boundaries instead of micro-batch boundaries.

For example, with `gradient_accumulation_steps=4`, `save_steps=10` could previously trigger after 10 micro-batches even though fewer optimizer updates had actually happened.

This change updates the training loop so that:

- `model_logger.on_step_end(...)` is called only when `accelerator.sync_gradients` is true
- the tqdm progress bar is updated only when `accelerator.sync_gradients` is true

As a result, model save steps, logged metric steps, checkpoint names, and tqdm progress now align with real optimizer steps under gradient accumulation.
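The difference between the old and new behavior can be sketched with a small simulation. This is not the code in `diffsynth/diffusion/runner.py`: `simulate_logger` is a hypothetical function, the logger is reduced to a counter, and the sync rule (every `accumulation_steps`-th micro-batch, plus the last micro-batch of the dataloader) is an assumption about how gradient synchronization is scheduled.

```python
def simulate_logger(num_micro_batches, accumulation_steps, save_steps, gate_on_sync):
    """Return (logger_steps, micro_batch_counts_at_each_save).

    gate_on_sync=False mimics the old behavior (logger advances per micro-batch);
    gate_on_sync=True mimics the new behavior (logger advances per optimizer step).
    """
    logger_steps = 0
    saves = []
    for i in range(num_micro_batches):
        # Assumed sync rule: every N-th micro-batch, plus the final one.
        sync_gradients = (
            (i + 1) % accumulation_steps == 0 or i == num_micro_batches - 1
        )
        if gate_on_sync and not sync_gradients:
            continue  # still accumulating: no logger step, no progress update
        logger_steps += 1
        if logger_steps % save_steps == 0:
            saves.append(i + 1)  # micro-batches processed when the save fires
    return logger_steps, saves


old = simulate_logger(40, accumulation_steps=4, save_steps=10, gate_on_sync=False)
new = simulate_logger(40, accumulation_steps=4, save_steps=10, gate_on_sync=True)
```

Under these assumptions, the ungated run saves after 10, 20, 30, and 40 micro-batches, while the gated run saves once, after 40 micro-batches, which corresponds to 10 optimizer updates.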
Validation
- `python3 -m py_compile diffsynth/diffusion/runner.py`
- `gradient_accumulation_steps=4`: logger step count and progress bar total were 3, matching actual optimizer steps
- FLUX.2 Klein 4B LoRA training via `diff101/train/flux2_klein_4b_lora.sh` with `--gradient_accumulation_steps 4`: training completed successfully, tqdm showed `2/2` for 5 micro-batches, and W&B history logged 2 loss points, matching optimizer-step boundaries

Chinese description (中文说明)
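This PR fixes the logger step and tqdm progress advancing per micro-batch when `gradient_accumulation_steps > 1`.

The training loop runs per micro-batch inside `accelerator.accumulate(...)`, but a real optimizer step happens only when gradients are synchronized. Previously, `model_logger.on_step_end(...)` and `tqdm(dataloader)` both advanced after every micro-batch, so `ModelLogger.num_steps`, `save_steps`, checkpoint file names, W&B/TensorBoard/SwanLab steps, and the progress bar could all be computed from the micro-batch count rather than the real optimizer update count.

The problem is especially visible when the effective total batch size is the same but different `gradient_accumulation_steps` combinations are used. These should have consistent logger steps and tqdm progress, because the number of real optimizer updates is the same. This PR makes both the logger and tqdm depend on `accelerator.sync_gradients`, so they advance only at real optimizer-step boundaries, not at micro-batch boundaries.

For example, with `gradient_accumulation_steps=4`, `save_steps=10` could previously trigger a save after 10 micro-batches, at which point fewer optimizer updates had actually happened.

With this change, the training loop:

- calls `model_logger.on_step_end(...)` only when `accelerator.sync_gradients` is true
- updates the tqdm progress bar only when `accelerator.sync_gradients` is true

As a result, with gradient accumulation enabled, model save steps, logged metric steps, checkpoint file names, and tqdm progress all align with real optimizer steps.

Validation

- `python3 -m py_compile diffsynth/diffusion/runner.py`
- `gradient_accumulation_steps=4`: logger step count and progress bar total were both 3, matching the real optimizer step count
- FLUX.2 Klein 4B LoRA training via `diff101/train/flux2_klein_4b_lora.sh` with `--gradient_accumulation_steps 4`: training completed successfully, tqdm showed `2/2` for 5 micro-batches, and W&B history recorded 2 loss points, matching real optimizer-step boundaries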
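The `2/2` figure in the validation run, and the claim that configurations with the same effective batch size should report the same progress, can be checked with a short arithmetic sketch. The function names and the 64-sample dataset are hypothetical, and the sketch assumes that a trailing partial accumulation window still produces one optimizer update.

```python
import math


def micro_batches_per_epoch(num_samples, micro_batch_size):
    # Number of dataloader iterations, keeping a final partial batch.
    return math.ceil(num_samples / micro_batch_size)


def optimizer_steps(num_micro_batches, accumulation_steps):
    # Assumption: a trailing partial window still triggers one update.
    return math.ceil(num_micro_batches / accumulation_steps)


# Validation run from the PR description: 5 micro-batches, accumulation of 4.
validation_total = optimizer_steps(5, 4)

# Same effective batch size (8) reached two ways on a hypothetical 64-sample set.
steps_no_accumulation = optimizer_steps(micro_batches_per_epoch(64, 8), 1)
steps_with_accumulation = optimizer_steps(micro_batches_per_epoch(64, 2), 4)
```

Here `validation_total` is 2, in line with the reported tqdm total, and both effective-batch-size-8 configurations yield 8 optimizer steps even though one of them iterates over 32 micro-batches.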