`device=auto` is finalized after `kpar` reset, leading to inconsistent PW/GPU parallel setup and corrupted direct cubes #7514

### Describe the bug

## Summary

In ABACUS v3.11.0-beta.4, a PW/GPU Gamma-only calculation with multiple MPI ranks behaves differently depending on whether `device gpu` is written explicitly or `device` is omitted.

The omitted-device case uses the default `device=auto`, which can resolve to GPU only after the `kpar` reset has already run. That leaves `KPAR=1` with multiple ranks in one k-point pool. In the tested build this path can finish without an error but write an invalid charge-density cube.

This looks like an input parameter finalization order issue rather than a physical-model issue.

## Environment where I observed it

- ABACUS module/build: `abacus/v3.11.0-beta4-sm70-avx512`
- Basis: `basis_type pw`
- Device: GPU build, GPUs visible
- MPI: reproduced with `mpirun -np 2`
- K points: Gamma-only
- Direct grid output: `out_chg` and `out_pot`

## Minimal reproducer

I prepared a small H2O test case: one H2O molecule in an 18 Angstrom cubic box, Gamma-only, PW basis, with an expected valence electron count of `8 e`.

[abacus_device_auto_kpar_h2o.tar.gz](https://github.com/user-attachments/files/29279952/abacus_device_auto_kpar_h2o.tar.gz)

The attached reproducer folder contains:

- `INPUT.gpu`: explicit `device gpu`
- `INPUT.nodevice`: no `device` line, therefore default `device=auto`
- `INPUT.cpu`: explicit `device cpu` control
- `STRU`
- `KPT`
- `analyze_cube.py`: no-dependency cube electron-count checker
- `run_h2o_device_auto_kpar_reproducer.slurm`: runs the small matrix and writes a summary
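
For readers who do not want to download the archive, the electron-count check can be sketched as follows. This is not the attached `analyze_cube.py`, only a minimal stand-in under stated assumptions: the file follows the standard Gaussian cube layout (two comment lines, an atom-count line, three grid lines, atom lines, then the data), the grid counts are positive so lengths are in Bohr, and the values are a density in e/Bohr^3. The function name `cube_electron_count` is hypothetical.

```python
def cube_electron_count(text):
    """Integrate a cube scalar field: sum of grid values times voxel volume."""
    lines = text.splitlines()

    # Line 3: atom count and origin. A negative atom count signals an extra
    # orbital header line, which this sketch does not handle.
    natoms = int(lines[2].split()[0])
    if natoms < 0:
        raise ValueError("cube files with an orbital header are not supported")

    # Lines 4-6: number of grid points and the voxel vector along each axis.
    counts, axes = [], []
    for line in lines[3:6]:
        fields = line.split()
        counts.append(int(fields[0]))
        axes.append([float(x) for x in fields[1:4]])
    if min(counts) <= 0:
        raise ValueError("non-positive grid counts are not supported")

    # Voxel volume is |a . (b x c)| for the three voxel vectors.
    a, b, c = axes
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    voxel_volume = abs(a[0] * cross[0] + a[1] * cross[1] + a[2] * cross[2])

    # Everything after the atom lines is volumetric data.
    values = [float(v) for line in lines[6 + natoms:] for v in line.split()]
    expected = counts[0] * counts[1] * counts[2]
    if len(values) != expected:
        raise ValueError(f"expected {expected} grid values, found {len(values)}")

    return sum(values) * voxel_volume
```

Typical use would be `cube_electron_count(open("chg.cube").read())`, with the result compared against the expected `8 e`.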

## Observed behavior

| Case | Observation |
| --- | --- |
| explicit `device gpu`, `np=1` | Runs normally; `chg.cube` integrates to about `7.99995 e`. |
| explicit `device gpu`, `np=2` | Fails early with `nks == 0, some processor have no k points!`; no misleading cube is produced. |
| omitted `device`, `np=1` | Runs normally; `chg.cube` integrates to about `7.99995 e`. |
| omitted `device`, `np=2` | Finishes without a clear fatal error, but writes a bad `chg.cube`; observed integral was about `6.43463 e`, with an abnormal total energy. |
| explicit `device cpu`, `np=2` | Used as a control path; expected to conserve electron count. |

This is dangerous because the omitted-device case can produce a file that looks like a normal direct cube but is numerically wrong.

## Expected behavior

The omitted-device case should not silently enter a different unsupported PW/GPU parallel path.

At least one of the following should happen:

- `device=auto` should be finalized before `kpar` and related parallel defaults are reset, so omitted `device` and explicit `device gpu` use the same finalized device state.
- If PW/GPU with multiple MPI ranks inside one k-point pool is not supported, ABACUS should fail fast before writing direct cube outputs.
- If `KPAR > nkstot`, ABACUS should fail fast with a clear diagnostic.

## Likely cause

From reading the current source, the relevant flow appears to be:

1. ABACUS reads all explicit INPUT items.
2. ABACUS executes `reset_value` hooks in input item registration order.
3. `kpar` is reset before `device` is finalized.
4. The default `device` value is `auto`.
5. `device=auto` is later resolved to `gpu` when GPU is available.

The `kpar` reset has PW/GPU-specific logic similar to:

```cpp
if (device == "gpu" && basis_type == "pw") {
    kpar = NPROC / bndpar;
}
```

Therefore:

- With explicit `device gpu`, the `kpar` reset sees `gpu` and sets `KPAR=NPROC/bndpar`. For `np=2` that gives `KPAR=2` (assuming `bndpar=1`), which exceeds the single Gamma k point, so the run fails early because one pool has no k points.
- With omitted `device`, the `kpar` reset sees the raw default `auto`, so the PW/GPU branch is skipped and `KPAR` stays 1. Later `device=auto` resolves to GPU, leaving multiple ranks in one k-point pool.
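
The ordering effect can be reproduced in a toy model. The sketch below is Python, not ABACUS source: the function names, the `gpu_available` flag, and the defaults `kpar=1` and `bndpar=1` are assumptions made for illustration, and only the `kpar` branch mirrors the snippet quoted above.

```python
def reset_kpar(params, nproc):
    """Toy version of the PW/GPU branch in the kpar reset hook."""
    if params["device"] == "gpu" and params["basis_type"] == "pw":
        params["kpar"] = nproc // params["bndpar"]


def resolve_device(params, gpu_available):
    """Toy version of the later auto -> cpu/gpu resolution."""
    if params["device"] == "auto":
        params["device"] = "gpu" if gpu_available else "cpu"


def finalize_as_reported(user_input, nproc, gpu_available):
    # Assumed defaults; explicit user values override them.
    params = {"basis_type": "pw", "device": "auto", "kpar": 1, "bndpar": 1}
    params.update(user_input)
    reset_kpar(params, nproc)              # runs first and may still see "auto"
    resolve_device(params, gpu_available)  # runs afterwards
    return params


explicit = finalize_as_reported({"device": "gpu"}, nproc=2, gpu_available=True)
omitted = finalize_as_reported({}, nproc=2, gpu_available=True)
print(explicit["device"], explicit["kpar"])  # gpu 2
print(omitted["device"], omitted["kpar"])    # gpu 1
```

Both runs end with `device == "gpu"`, but only the explicit one has had its `kpar` adjusted, which matches the divergence described in the two bullets.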

The GPU PW FFT implementation also appears to assume `poolnproc == 1`.  In release builds, an assertion may not protect this path, so unsupported pool-internal multi-rank GPU PW execution can continue and corrupt direct grid outputs.

## Impact

This affects workflows that trust ABACUS direct cube outputs, especially:

- electron density surfaces from `out_chg`
- Gamma-only large cells where users naturally run multiple MPI ranks with visible GPUs


The most concerning part is the silent bad-output mode for omitted `device`.

## Suggested fix

A robust fix would be to separate raw input reading from finalized primitive input state:

1. Read all explicit user INPUT values.
2. Finalize primitive inputs that other reset hooks depend on, at least:
   - canonical `basis_type`
   - final `device`, including `auto -> cpu/gpu`
3. Run derived default/reset hooks such as `kpar`, `ks_solver`, and other parallel defaults against finalized values.
4. Validate unsupported combinations before any direct cube is written.

This early device finalization should only resolve the string/state.  It should not initialize the GPU context before the existing MPI broadcast/runtime initialization stage.
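
A minimal sketch of that two-phase order, again as a Python toy model rather than a proposed patch. All names, the `gpu_available` flag, and the defaults `kpar=1` and `bndpar=1` are hypothetical; phase 1 only rewrites the string and touches no GPU context.

```python
def finalize_primitives(params, gpu_available):
    """Phase 1: resolve values that later hooks read (string/state only)."""
    if params["device"] == "auto":
        params["device"] = "gpu" if gpu_available else "cpu"


def reset_derived(params, nproc):
    """Phase 2: derived defaults now see the finalized device."""
    if params["device"] == "gpu" and params["basis_type"] == "pw":
        params["kpar"] = nproc // params["bndpar"]


def finalize_inputs(user_input, nproc, gpu_available):
    # Assumed defaults; explicit user values override them.
    params = {"basis_type": "pw", "device": "auto", "kpar": 1, "bndpar": 1}
    params.update(user_input)
    finalize_primitives(params, gpu_available)
    reset_derived(params, nproc)
    return params


explicit = finalize_inputs({"device": "gpu"}, nproc=2, gpu_available=True)
omitted = finalize_inputs({}, nproc=2, gpu_available=True)
print(explicit == omitted)  # True
```

With this order, omitted `device` and explicit `device gpu` reach the same finalized state, so both would hit the same downstream validation.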

Suggested fail-fast checks:

- `KPAR > nkstot`
- `basis_type=pw && device=gpu && poolnproc > 1`, unless distributed GPU PW FFT is actually supported
- direct cube output requested on an unsupported PW/GPU parallel layout
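
These checks could be expressed roughly as below. This is a hedged Python sketch, not ABACUS code: `UnsupportedLayoutError` and `validate_parallel_layout` are hypothetical names, and the pool size is assumed to be `nproc // kpar`. The third check is covered implicitly, since the function is meant to run before any output is written.

```python
class UnsupportedLayoutError(ValueError):
    """Raised when the parallel layout should be rejected up front."""


def validate_parallel_layout(basis_type, device, nproc, kpar, nkstot):
    # Check 1: more k-point pools than k points leaves some pool empty.
    if kpar > nkstot:
        raise UnsupportedLayoutError(
            f"KPAR={kpar} exceeds the number of k points ({nkstot})"
        )

    # Check 2: PW on GPU with several ranks sharing one k-point pool.
    pool_nproc = nproc // kpar
    if basis_type == "pw" and device == "gpu" and pool_nproc > 1:
        raise UnsupportedLayoutError(
            f"PW/GPU with {pool_nproc} ranks in one k-point pool is not supported"
        )
```

For the Gamma-only `np=2` GPU case, either choice of `KPAR` (1 or 2) is rejected by one of the two checks, so no cube would be written.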

I would be happy to help implement the fix.  If the maintainers agree on the preferred strategy for input-parameter priority/dependency handling, I can prepare a PR for the corresponding finalization order, dependency ordering, or fail-fast checks.

## Regression tests that would cover it

For a Gamma-only H2O PW test:

- omitted `device`, GPU visible, `np=2`
- explicit `device gpu`, same input, `np=2`
- explicit `device cpu`, same input, `np=2`
- compare charge cube electron count with the expected `8 e`
- ensure unsupported cases fail before writing a credible-looking cube
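
The electron-count comparison in such a test could be as simple as the sketch below. The function name and the tolerance of `1e-3 e` are assumptions chosen for illustration; the two sample values are the integrals reported in the table above.

```python
def electron_count_ok(integrated, expected=8.0, tol=1e-3):
    """Return True if the integrated charge is within tol of the expected count."""
    return abs(integrated - expected) <= tol


print(electron_count_ok(7.99995))  # True  (np=1 runs)
print(electron_count_ok(6.43463))  # False (omitted device, np=2)
```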

### Task list for Issue attackers (only for developers)

- [ ] Verify the issue is not a duplicate.
- [ ] Describe the bug.
- [ ] Steps to reproduce.
- [ ] Expected behavior.
- [ ] Error message.
- [ ] Environment details.
- [ ] Additional context.
- [ ] Assign a priority level (low, medium, high, urgent).
- [ ] Assign the issue to a team member.
- [ ] Label the issue with relevant tags.
- [ ] Identify possible related issues.
- [ ] Create a unit test or automated test to reproduce the bug (if applicable).
- [ ] Fix the bug.
- [ ] Test the fix.
- [ ] Update documentation (if necessary).
- [ ] Close the issue and inform the reporter (if applicable).

