device=auto is finalized after kpar reset, leading to inconsistent PW/GPU parallel setup and corrupted direct cubes #7514

Description

@Stardust0831

Describe the bug

Summary

In ABACUS v3.11.0-beta.4, a PW/GPU Gamma-only calculation with multiple MPI ranks behaves differently depending on whether device gpu is written explicitly or device is omitted.

The omitted-device case uses the default device=auto. It can resolve to GPU after the kpar reset has already run, leaving KPAR=1 with multiple ranks in one k-point pool. In the tested build this path may finish without an error but write an invalid charge-density cube.

This looks like an input parameter finalization order issue rather than a physical-model issue.

Environment where I observed it

  • ABACUS module/build: abacus/v3.11.0-beta4-sm70-avx512
  • Basis: basis_type pw
  • Device: GPU build, GPUs visible
  • MPI: reproduced with mpirun -np 2
  • K points: Gamma-only
  • Direct grid output: out_chg and out_pot

Minimal reproducer

I prepared a small H2O test case: one H2O molecule in an 18 Angstrom cubic box, Gamma-only, PW basis, with an expected valence electron count of 8 e.

Attachment: abacus_device_auto_kpar_h2o.tar.gz

The attached reproducer folder contains:

  • INPUT.gpu: explicit device gpu
  • INPUT.nodevice: no device line, therefore default device=auto
  • INPUT.cpu: explicit device cpu, used as a control
  • STRU
  • KPT
  • analyze_cube.py: no-dependency cube electron-count checker
  • run_h2o_device_auto_kpar_reproducer.slurm: runs the small matrix and writes a summary
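
The attached analyze_cube.py is not reproduced here. As a minimal sketch of what such a no-dependency check can look like, the function below integrates the volumetric data of a cube file. It assumes the common Gaussian cube layout (two comment lines, one line with the atom count and origin, three lines with grid counts and voxel step vectors, one line per atom, then the grid values) and positive grid counts, so that lengths are in Bohr. The name integrate_cube is hypothetical and not taken from the attachment.

```python
def integrate_cube(text):
    """Return the integral of the volumetric data in a Gaussian-style cube file."""
    lines = text.splitlines()
    # Lines 0-1 are comments; line 2 starts with the atom count.
    natoms = abs(int(lines[2].split()[0]))
    # Lines 3-5: grid point count along each axis, then the voxel step vector.
    counts, axes = [], []
    for line in lines[3:6]:
        fields = line.split()
        counts.append(abs(int(fields[0])))
        axes.append([float(x) for x in fields[1:4]])
    # Voxel volume is |det| of the three step vectors.
    a, b, c = axes
    det = (a[0] * (b[1] * c[2] - b[2] * c[1])
           - a[1] * (b[0] * c[2] - b[2] * c[0])
           + a[2] * (b[0] * c[1] - b[1] * c[0]))
    # Everything after the atom lines is whitespace-separated grid data.
    values = [float(v) for line in lines[6 + natoms:] for v in line.split()]
    expected = counts[0] * counts[1] * counts[2]
    if len(values) != expected:
        raise ValueError(f"expected {expected} grid values, found {len(values)}")
    return sum(values) * abs(det)
```

For a charge-density cube, the returned value is the electron count, which for this test case should be close to 8 e.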

Observed behavior

| Case | Observation |
| --- | --- |
| explicit device gpu, np=1 | Runs normally; chg.cube integrates to about 7.99995 e. |
| explicit device gpu, np=2 | Fails early with "nks == 0, some processor have no k points!"; no misleading cube is produced. |
| omitted device, np=1 | Runs normally; chg.cube integrates to about 7.99995 e. |
| omitted device, np=2 | Finishes without a clear fatal error, but writes a bad chg.cube; observed integral was about 6.43463 e, with an abnormal total energy. |
| explicit device cpu, np=2 | Used as a control path; expected to conserve electron count. |

This is dangerous because the omitted-device case can produce a file that looks like a normal direct cube but is numerically wrong.

Expected behavior

The omitted-device case should not silently enter a different unsupported PW/GPU parallel path.

At least one of the following should happen:

  • device=auto should be finalized before kpar and related parallel defaults are reset, so omitted device and explicit device gpu use the same finalized device state.
  • If PW/GPU with multiple MPI ranks inside one k-point pool is not supported, ABACUS should fail fast before writing direct cube outputs.
  • If KPAR > nkstot, ABACUS should fail fast with a clear diagnostic.

Likely cause

From reading the current source, the relevant flow appears to be:

  1. ABACUS reads all explicit INPUT items.
  2. ABACUS executes reset_value hooks in input item registration order.
  3. kpar is reset before device is finalized.
  4. The default device value is auto.
  5. device=auto is later resolved to gpu when GPU is available.

The kpar reset has PW/GPU-specific logic similar to:

// Paraphrase of the PW/GPU branch in the kpar reset hook, not verbatim source.
if (device == "gpu" && basis_type == "pw") {
    kpar = NPROC / bndpar;
}

Therefore:

  • With explicit device gpu, kpar reset sees gpu, sets KPAR=NPROC/bndpar, and the Gamma-only np=2 case fails early because one pool has no k points.
  • With omitted device, kpar reset sees raw/default auto, so the PW/GPU branch is skipped and KPAR stays 1. Later device=auto resolves to GPU, leaving multiple ranks in one k-point pool.
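
The two cases above can be illustrated with a small Python model of the suspected ordering. This is a sketch of the reported flow only, not ABACUS code: the function names, the params dictionary, and the gpu_visible flag are hypothetical, and the only logic taken from the source is the PW/GPU kpar branch quoted earlier.

```python
def resolve_device(device, gpu_visible):
    """Resolve 'auto' to a concrete device string; explicit values pass through."""
    if device == "auto":
        return "gpu" if gpu_visible else "cpu"
    return device


def run_hooks_in_registration_order(user_input, nproc, gpu_visible):
    """Model of the reported flow: the kpar hook runs before device is resolved."""
    params = {"device": "auto", "basis_type": "pw", "kpar": 1, "bndpar": 1}
    params.update(user_input)
    # kpar reset hook: sees the raw device string, which may still be "auto".
    if params["device"] == "gpu" and params["basis_type"] == "pw":
        params["kpar"] = nproc // params["bndpar"]
    # device is finalized only afterwards.
    params["device"] = resolve_device(params["device"], gpu_visible)
    return params


explicit = run_hooks_in_registration_order({"device": "gpu"}, nproc=2, gpu_visible=True)
omitted = run_hooks_in_registration_order({}, nproc=2, gpu_visible=True)
print(explicit["device"], explicit["kpar"])  # prints: gpu 2
print(omitted["device"], omitted["kpar"])    # prints: gpu 1
```

Both runs end with the same device string but a different KPAR, which is the inconsistency described in this report.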

The GPU PW FFT implementation also appears to assume poolnproc == 1. In release builds, an assertion may not protect this path, because assert is compiled out when NDEBUG is defined, so unsupported pool-internal multi-rank GPU PW execution can continue and corrupt direct grid outputs.

Impact

This affects workflows that trust ABACUS direct cube outputs, especially:

  • electron density surfaces from out_chg
  • Gamma-only large cells where users naturally run multiple MPI ranks with visible GPUs

The most concerning part is the silent bad-output mode for omitted device.

Suggested fix

A robust fix would be to separate raw input reading from finalized primitive input state:

  1. Read all explicit user INPUT values.
  2. Finalize primitive inputs that other reset hooks depend on, at least:
    • canonical basis_type
    • final device, including auto -> cpu/gpu
  3. Run derived default/reset hooks such as kpar, ks_solver, and other parallel defaults against finalized values.
  4. Validate unsupported combinations before any direct cube is written.

This early device finalization should only resolve the string/state. It should not initialize the GPU context before the existing MPI broadcast/runtime initialization stage.
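
One possible way to express steps 2 and 3 is to declare which parameters each reset hook depends on and run the hooks in dependency order rather than registration order. The Python sketch below models that idea with the standard-library graphlib module (Python 3.9 or later). All names in it (HOOKS, DEPENDS_ON, finalize_input, the env dictionary) are hypothetical, and it only resolves the device string; it does not touch any GPU context.

```python
from graphlib import TopologicalSorter


def resolve_device(params, env):
    """Resolve the device string only; no GPU context is created here."""
    if params["device"] == "auto":
        params["device"] = "gpu" if env["gpu_visible"] else "cpu"


def reset_kpar(params, env):
    """Derived default, meant to run against the finalized device value."""
    if params["device"] == "gpu" and params["basis_type"] == "pw":
        params["kpar"] = env["nproc"] // params["bndpar"]


# Registration order deliberately lists kpar first, as in the reported flow.
HOOKS = {"kpar": reset_kpar, "device": resolve_device}
# Each entry names the parameters that must be final before its hook runs.
DEPENDS_ON = {"kpar": {"device"}, "device": set()}


def finalize_input(user_input, nproc, gpu_visible):
    params = {"device": "auto", "basis_type": "pw", "kpar": 1, "bndpar": 1}
    params.update(user_input)
    env = {"nproc": nproc, "gpu_visible": gpu_visible}
    # static_order() yields every hook after all of its dependencies.
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        HOOKS[name](params, env)
    return params
```

With this ordering, finalize_input({}, nproc=2, gpu_visible=True) and finalize_input({"device": "gpu"}, nproc=2, gpu_visible=True) return the same state, so the omitted-device and explicit-gpu cases would take the same path.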

Suggested fail-fast checks:

  • KPAR > nkstot
  • basis_type=pw && device=gpu && poolnproc > 1, unless distributed GPU PW FFT is actually supported
  • direct cube output requested on an unsupported PW/GPU parallel layout
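
As a rough illustration of the first two checks, the Python sketch below validates a finalized layout and raises before any output would be written. The function name, the exception class, and the choice to pass pool_nproc in as an argument are assumptions made for this example; how ABACUS actually computes the number of ranks per pool is not modeled. Running such a check before the output stage would also cover the third item.

```python
class UnsupportedLayout(ValueError):
    """Raised when the finalized parallel layout is known to be unsupported."""


def validate_parallel_layout(basis_type, device, kpar, nkstot, pool_nproc):
    """Fail fast on layouts that would otherwise run and write bad cubes."""
    if kpar > nkstot:
        raise UnsupportedLayout(
            f"KPAR={kpar} exceeds the number of k points nkstot={nkstot}"
        )
    if basis_type == "pw" and device == "gpu" and pool_nproc > 1:
        raise UnsupportedLayout(
            f"PW/GPU with {pool_nproc} MPI ranks in one k-point pool is not supported"
        )
```

For the Gamma-only np=2 cases in this report, kpar=2 with nkstot=1 trips the first check, and kpar=1 with pool_nproc=2 on GPU trips the second, so neither case could reach cube output.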

I would be happy to help implement the fix. If the maintainers agree on the preferred strategy for input-parameter priority/dependency handling, I can prepare a PR for the corresponding finalization order, dependency ordering, or fail-fast checks.

Regression tests that would cover it

For a Gamma-only H2O PW test:

  • omitted device, GPU visible, np=2
  • explicit device gpu, same input, np=2
  • explicit device cpu, same input, np=2
  • compare charge cube electron count with the expected 8 e
  • ensure unsupported cases fail before writing a credible-looking cube

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Metadata

    Labels: Bugs, GPU & DCU & HPC, Input&Output
