Describe the bug
Summary
In ABACUS v3.11.0-beta.4, a PW/GPU Gamma-only calculation with multiple MPI ranks behaves differently depending on whether device gpu is written explicitly or device is omitted.
The omitted-device case uses the default device=auto. It can resolve to GPU after the kpar reset has already run, leaving KPAR=1 with multiple ranks in one k-point pool. In the tested build this path may finish without an error but write an invalid charge-density cube.
This looks like an input parameter finalization order issue rather than a physical-model issue.
Environment where I observed it
- ABACUS module/build: abacus/v3.11.0-beta4-sm70-avx512
- Basis: basis_type pw
- Device: GPU build, GPUs visible
- MPI: reproduced with mpirun -np 2
- K points: Gamma-only
- Direct grid output: out_chg and out_pot
Minimal reproducer
I prepared a small H2O test case: one H2O molecule in an 18 Angstrom cubic box, Gamma-only, PW basis, with expected valence electron count of 8 e.
abacus_device_auto_kpar_h2o.tar.gz
The attached reproducer folder contains:
- INPUT.gpu: explicit device gpu
- INPUT.nodevice: no device line, therefore default device=auto
- INPUT.cpu: explicit device cpu control
- STRU
- KPT
- analyze_cube.py: no-dependency cube electron-count checker
- run_h2o_device_auto_kpar_reproducer.slurm: runs the small matrix and writes a summary
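For readers without the attachment, the electron-count check can be approximated as below. This is a minimal sketch and not the attached analyze_cube.py. It assumes the standard Gaussian cube layout (two comment lines, one atom-count line, three grid lines, one record per atom, then grid values), positive grid counts meaning Bohr units, and a density stored in e/Bohr^3. The function name integrate_cube is hypothetical.

```python
import math


def integrate_cube(text):
    """Integrate a Gaussian-cube scalar field over the whole grid."""
    lines = text.splitlines()
    # Lines 1-2 are comments. Line 3 holds the atom count and grid origin.
    natoms = int(lines[2].split()[0])
    if natoms < 0:
        raise ValueError("orbital-style cube headers are not handled in this sketch")
    # Lines 4-6 each hold a grid count followed by one voxel edge vector.
    counts, vectors = [], []
    for line in lines[3:6]:
        fields = line.split()
        counts.append(int(fields[0]))
        vectors.append([float(x) for x in fields[1:4]])
    if min(counts) <= 0:
        raise ValueError("expected positive grid counts (assumed Bohr units)")
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = vectors
    # Voxel volume is the absolute triple product |a . (b x c)|.
    voxel_volume = abs(
        ax * (by * cz - bz * cy)
        - ay * (bx * cz - bz * cx)
        + az * (bx * cy - by * cx)
    )
    # Atom records follow the header; everything after them is grid data.
    values = [float(v) for line in lines[6 + natoms:] for v in line.split()]
    expected = counts[0] * counts[1] * counts[2]
    if len(values) != expected:
        raise ValueError(f"expected {expected} grid values, found {len(values)}")
    return math.fsum(values) * voxel_volume
```

Calling integrate_cube(open("chg.cube").read()) on a healthy run of this test case should give a value close to 8 under the assumptions above.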
Observed behavior
| Case | Observation |
| --- | --- |
| explicit device gpu, np=1 | Runs normally; chg.cube integrates to about 7.99995 e. |
| explicit device gpu, np=2 | Fails early with nks == 0, some processor have no k points!; no misleading cube is produced. |
| omitted device, np=1 | Runs normally; chg.cube integrates to about 7.99995 e. |
| omitted device, np=2 | Finishes without a clear fatal error, but writes a bad chg.cube; observed integral was about 6.43463 e, with an abnormal total energy. |
| explicit device cpu, np=2 | Used as a control path; expected to conserve electron count. |
This is dangerous because the omitted-device case can produce a file that looks like a normal direct cube but is numerically wrong.
Expected behavior
The omitted-device case should not silently enter a different unsupported PW/GPU parallel path.
At least one of the following should happen:
- device=auto should be finalized before kpar and related parallel defaults are reset, so omitted device and explicit device gpu use the same finalized device state.
- If PW/GPU with multiple MPI ranks inside one k-point pool is not supported, ABACUS should fail fast before writing direct cube outputs.
- If KPAR > nkstot, ABACUS should fail fast with a clear diagnostic.
Likely cause
From reading the current source, the relevant flow appears to be:
- ABACUS reads all explicit INPUT items.
- ABACUS executes reset_value hooks in input item registration order.
- kpar is reset before device is finalized.
- The default device value is auto.
- device=auto is later resolved to gpu when GPU is available.
The kpar reset has PW/GPU-specific logic similar to:
    if (device == "gpu" && basis_type == "pw") {
        kpar = NPROC / bndpar;
    }
Therefore:
- With explicit device gpu, kpar reset sees gpu, sets KPAR=NPROC/bndpar, and the Gamma-only np=2 case fails early because one pool has no k points.
- With omitted device, kpar reset sees raw/default auto, so the PW/GPU branch is skipped and KPAR stays 1. Later device=auto resolves to GPU, leaving multiple ranks in one k-point pool.
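The two paths can be modeled with a small toy to make the ordering effect concrete. This is a simplified sketch of the flow as I understand it, not ABACUS code: the function names, the params dictionary, and the ranks_per_pool field are all hypothetical, and only the PW/GPU branch quoted above is reproduced.

```python
def resolve_device(device, gpu_available):
    """Turn 'auto' into a concrete device string; leave explicit values alone."""
    if device == "auto":
        return "gpu" if gpu_available else "cpu"
    return device


def reset_kpar_hook(params, nproc):
    """Model of the PW/GPU branch in the kpar reset."""
    if params["device"] == "gpu" and params["basis_type"] == "pw":
        params["kpar"] = nproc // params["bndpar"]


def finalize_current_order(user_input, nproc, gpu_available):
    """Apply the hooks in the order described above: kpar first, device second."""
    params = {"device": "auto", "basis_type": "pw", "bndpar": 1, "kpar": 1}
    params.update(user_input)
    reset_kpar_hook(params, nproc)  # still sees the raw string "auto"
    params["device"] = resolve_device(params["device"], gpu_available)
    params["ranks_per_pool"] = nproc // params["kpar"]
    return params


explicit = finalize_current_order({"device": "gpu"}, nproc=2, gpu_available=True)
omitted = finalize_current_order({}, nproc=2, gpu_available=True)
print(explicit["kpar"], explicit["ranks_per_pool"])
print(omitted["device"], omitted["kpar"], omitted["ranks_per_pool"])
```

In this model both runs end with device gpu, but only the explicit case gets kpar=2; the omitted case keeps kpar=1 and therefore two ranks in one pool.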
The GPU PW FFT implementation also appears to assume poolnproc == 1. In release builds, an assertion may not protect this path, so unsupported pool-internal multi-rank GPU PW execution can continue and corrupt direct grid outputs.
Impact
This affects workflows that trust ABACUS direct cube outputs, especially:
- electron density surfaces from out_chg
- Gamma-only large cells where users naturally run multiple MPI ranks with visible GPUs
The most concerning part is the silent bad-output mode for omitted device.
Suggested fix
A robust fix would be to separate raw input reading from finalized primitive input state:
- Read all explicit user INPUT values.
- Finalize primitive inputs that other reset hooks depend on, at least:
  - canonical basis_type
  - final device, including auto -> cpu/gpu
- Run derived default/reset hooks such as kpar, ks_solver, and other parallel defaults against finalized values.
- Validate unsupported combinations before any direct cube is written.
This early device finalization should only resolve the string/state. It should not initialize the GPU context before the existing MPI broadcast/runtime initialization stage.
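One possible shape for the dependency-ordering variant is sketched below. It is only an illustration under stated assumptions: the DEPENDS_ON table, the hook functions, and finalize_in_dependency_order are hypothetical names, the dependency edges are my guess at what kpar needs, and graphlib requires Python 3.9 or newer. The device hook only rewrites a string and touches no GPU runtime.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: input name -> inputs that must be final first.
DEPENDS_ON = {
    "basis_type": set(),
    "device": set(),
    "bndpar": set(),
    "kpar": {"basis_type", "device", "bndpar"},
}


def device_hook(params, nproc, gpu_available):
    # Resolve the string only; no GPU context is created here.
    if params["device"] == "auto":
        params["device"] = "gpu" if gpu_available else "cpu"


def kpar_hook(params, nproc, gpu_available):
    # Same PW/GPU branch as the current kpar reset, now run on final values.
    if params["device"] == "gpu" and params["basis_type"] == "pw":
        params["kpar"] = nproc // params["bndpar"]


HOOKS = {"device": device_hook, "kpar": kpar_hook}


def finalize_in_dependency_order(user_input, nproc, gpu_available):
    params = {"device": "auto", "basis_type": "pw", "bndpar": 1, "kpar": 1}
    params.update(user_input)
    # static_order() yields every name after all of its dependencies.
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        hook = HOOKS.get(name)
        if hook is not None:
            hook(params, nproc, gpu_available)
    return params
```

With this ordering, an omitted device and an explicit device gpu reach the kpar hook in the same state, so both take the same branch and can be rejected by the same check.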
Suggested fail-fast checks:
- KPAR > nkstot
- basis_type=pw && device=gpu && poolnproc > 1, unless distributed GPU PW FFT is actually supported
- direct cube output requested on an unsupported PW/GPU parallel layout
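The three checks could be expressed roughly as follows. This is a sketch of the validation logic only, written in Python for brevity; check_parallel_layout and its arguments are hypothetical, poolnproc is assumed to be NPROC / KPAR, and the message texts are placeholders rather than existing ABACUS diagnostics.

```python
def check_parallel_layout(basis_type, device, kpar, nkstot, nproc, cube_output_requested):
    """Return a list of fatal diagnostics; an empty list means the layout is accepted."""
    errors = []
    if kpar > nkstot:
        errors.append(f"KPAR ({kpar}) exceeds the number of k points ({nkstot})")
    poolnproc = nproc // kpar  # ranks per k-point pool (assumed definition)
    if basis_type == "pw" and device == "gpu" and poolnproc > 1:
        errors.append(f"PW/GPU with {poolnproc} ranks in one k-point pool is not supported")
        if cube_output_requested:
            errors.append("direct cube output requested on an unsupported PW/GPU layout")
    return errors


# Omitted-device case from this report after resolution: gpu, KPAR=1, np=2, Gamma-only.
for message in check_parallel_layout("pw", "gpu", 1, 1, 2, True):
    print("ERROR:", message)
```

Returning all diagnostics at once, instead of stopping at the first, is a design choice made here so that a user sees every reason the layout was rejected.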
I would be happy to help implement the fix. If the maintainers agree on the preferred strategy for input-parameter priority/dependency handling, I can prepare a PR for the corresponding finalization order, dependency ordering, or fail-fast checks.
Regression tests that would cover it
For a Gamma-only H2O PW test:
- omitted device, GPU visible, np=2
- explicit device gpu, same input, np=2
- explicit device cpu, same input, np=2
- compare charge cube electron count with the expected 8 e
- ensure unsupported cases fail before writing a credible-looking cube
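The electron-count comparison could be as small as the helper below. The name electron_count_ok and the relative tolerance of 1e-3 are assumptions chosen for illustration, not an existing ABACUS test threshold; the two sample values are the integrals observed in this report.

```python
import math


def electron_count_ok(integral, expected, rel_tol=1e-3):
    """True when the cube integral matches the expected electron count."""
    return math.isclose(integral, expected, rel_tol=rel_tol)


print(electron_count_ok(7.99995, 8.0))  # healthy run from the table above
print(electron_count_ok(6.43463, 8.0))  # omitted device, np=2
```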
Task list for Issue attackers (only for developers)