Skip to content

feat(data): add dpdata format conversion#5565

Open
njzjz wants to merge 1 commit into
deepmodeling:masterfrom
njzjz:feat/dpdata-auto-conversion
Open

feat(data): add dpdata format conversion#5565
njzjz wants to merge 1 commit into
deepmodeling:masterfrom
njzjz:feat/dpdata-auto-conversion

Conversation

@njzjz

@njzjz njzjz commented Jun 20, 2026

Copy link
Copy Markdown
Member

Summary

  • add training/validation data format and out_format options for dpdata-backed conversion
  • default converted output to lmdb and cache converted datasets under /home/jzzeng/codes/deepmd-kit/.deepmd_dpdata_cache
  • route converted LMDB data through TF/JAX/PT/PT-expt-compatible data loaders; report a clear Paddle error for unsupported LMDB
  • add conversion/cache tests and make dpdata>=1.0.1 a runtime dependency

Closes #5237

Tests

  • ruff check .
  • ruff format --check .
  • pytest source/tests/common/test_data_system_conversion.py -q
  • pytest source/tests/common/dpmodel/test_lmdb_data.py::TestLmdbDataReader::test_is_lmdb -q
  • pytest source/tests/tf/test_dp_test.py::TestDPTestEner::test_1frame -q
  • srun --gres=gpu:1 dp train input.json with extxyz input, default LMDB conversion, no --skip-neighbor-stat
  • srun --gres=gpu:1 dp --pt train input.json with extxyz input, default LMDB conversion, no --skip-neighbor-stat
  • srun --gres=gpu:1 dp --jax train input.json with extxyz input, default LMDB conversion, no --skip-neighbor-stat (environment used CPU JAX fallback because CUDA jaxlib is unavailable)
  • srun --gres=gpu:1 dp --pt-expt train input.json verified conversion and neighbor statistics; this environment then hits the existing pt-expt tensor serialization error during model construction

Summary by CodeRabbit

  • New Features

    • Added automatic dataset format conversion, enabling users to specify input formats (e.g., .extxyz) with automatic conversion to LMDB or other formats
    • Introduced new configuration options format and out_format/output_format for training and validation datasets to enable flexible data preprocessing
    • Improved LMDB dataset support with intelligent routing and caching during model training
  • Tests

    • Added comprehensive test suite for dataset format conversion functionality

Comment on lines +710 to +714
from deepmd.dpmodel.utils.lmdb_data import (
LmdbDataReader,
SameNlocBatchSampler,
compute_block_targets,
)
finally:
try:
lock_path.unlink()
except FileNotFoundError:
Comment on lines +1327 to +1329
from deepmd.dpmodel.utils.lmdb_data import (
is_lmdb,
)
Comment on lines +1391 to +1393
from deepmd.dpmodel.utils.lmdb_data import (
is_lmdb,
)
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds automatic dpdata-based format conversion (e.g., extxyz → LMDB/HDF5) for DeePMD training and validation datasets. Introduces LmdbDataSystem as a DeepmdDataSystem-compatible adapter, adds format/out_format config schema fields, wires format-aware routing through all three training entrypoints (PT, PD, PT-expt), promotes dpdata to a core runtime dependency, and adds unit tests covering conversion, caching, and LmdbDataSystem batching.

Changes

Automatic dpdata Format Conversion for Training/Validation Datasets

Layer / File(s) Summary
Config schema: format and out_format fields
deepmd/utils/argcheck.py
Adds format and out_format (alias output_format, default "lmdb") arguments to training_data_args() and validation_data_args() with dpdata conversion doc strings.
dpdata promoted to core runtime dependency
pyproject.toml
Moves dpdata from optional test extras to top-level [project].dependencies at >=1.0.1, removing the older >=0.2.7 test-only entry.
LmdbDataSystem adapter
deepmd/utils/data_system.py (lines 3–43, 686–942)
Introduces LmdbDataSystem wrapping LmdbDataReader and SameNlocBatchSampler, adds module-level conversion cache constants, and new stdlib imports.
dpdata conversion helpers and execution
deepmd/utils/data_system.py (lines 1086–1277)
Adds format-detection helpers, rglob input expansion, deterministic cache-path computation, mtime-based freshness checks, and _convert_system_by_dpdata with temp-file, lock-file concurrency control, and in-memory cache update.
process_systems and get_data with fmt/out_fmt
deepmd/utils/data_system.py (lines 1279–1408)
Extends process_systems API with fmt/out_fmt parameters routing through _convert_system_by_dpdata; updates get_data to read format fields from config and return LmdbDataSystem on single-LMDB resolution.
PT entrypoint: format-aware LMDB routing and neighbor-stat refactor
deepmd/pt/entrypoints/main.py
_make_dp_loader_set() returns LmdbDataset when resolved system is LMDB with format=None; adds _get_neighbor_stat_data_from_params() helper unifying LMDB and standard paths for multi-task and single-task neighbor-stat computation.
PD entrypoint: format-aware routing with LMDB guards
deepmd/pd/entrypoints/main.py
Imports is_lmdb, passes format/out_format into process_systems, and raises NotImplementedError when training or validation resolves to a single LMDB path.
PT-expt entrypoint: conditional LMDB bypass
deepmd/pt_expt/entrypoints/main.py
_get_neighbor_stat_data and _build_data_system skip direct _detect_lmdb_path when format is set, defer to process_systems with fmt/out_fmt, and route post-conversion single-LMDB results to LmdbDataSystem/make_neighbor_stat_data.
Unit tests for conversion and LmdbDataSystem
source/tests/common/test_data_system_conversion.py
Adds TestDpdataFormatConversion with fake dpdata stubs, minimal HDF5/LMDB fixture writers, and four test methods covering LMDB default output, cwd-scoped cache isolation, explicit HDF5 conversion, and get_data with format="auto" producing LmdbDataSystem with correct batch shapes.

Sequence Diagram(s)

sequenceDiagram
  participant User as User Config
  participant Entrypoint as Training Entrypoint (pt/pd/pt_expt)
  participant ProcessSystems as process_systems()
  participant Converter as _convert_system_by_dpdata()
  participant Cache as _DPDATA_CONVERSION_CACHE / .deepmd_dpdata_cache
  participant DataSystem as LmdbDataSystem / DeepmdDataSystem

  User->>Entrypoint: training_data with format="extxyz", out_format="lmdb"
  Entrypoint->>ProcessSystems: process_systems(systems, fmt="extxyz", out_fmt="lmdb")
  ProcessSystems->>Converter: _convert_system_by_dpdata(path, fmt, out_fmt)
  Converter->>Cache: check freshness (mtime vs. cache)
  alt cache fresh
    Cache-->>Converter: return cached LMDB path
  else cache stale or missing
    Converter->>Converter: acquire .lock file
    Converter->>Converter: dpdata.MultiSystems → temp output
    Converter->>Cache: move to cache path, update _DPDATA_CONVERSION_CACHE
    Converter->>Converter: release lock
  end
  Converter-->>ProcessSystems: [lmdb_path]
  ProcessSystems-->>Entrypoint: [lmdb_path]
  Entrypoint->>DataSystem: LmdbDataSystem(lmdb_path, type_map, ...)
  DataSystem-->>Entrypoint: batches (type, natoms_vec, coord, box, ...)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • deepmodeling/deepmd-kit#5283: Introduced the LMDB reader utilities (is_lmdb, LmdbDataReader, SameNlocBatchSampler) that this PR's new LmdbDataSystem adapter and entrypoint routing directly depend on.
  • deepmodeling/deepmd-kit#5408: Added pt_expt LMDB scalar-path support that this PR extends by conditionally bypassing _detect_lmdb_path and routing through process_systems with format parameters.
  • deepmodeling/deepmd-kit#5419: Adds FullValidator LMDB evaluation support via LmdbTestDataNlocView, which consumes the LmdbDataset objects produced through the process_systems/format-driven routing introduced in this PR.

Suggested labels

new feature, Python

Suggested reviewers

  • wanghan-iapcm
  • iProzd
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.24% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(data): add dpdata format conversion' accurately and concisely summarizes the main change: adding automatic format conversion functionality using dpdata.
Linked Issues check ✅ Passed All coding objectives from issue #5237 are met: automatic format detection via dpdata integration [#5237], format conversion support in training/validation data configs [#5237], and seamless conversion to DeepMD-compatible formats [#5237].
Out of Scope Changes check ✅ Passed All changes are within scope: format conversion integration, updated data configs, entrypoint modifications, new tests, and dpdata dependency elevation all directly support the automatic format conversion objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
deepmd/utils/data_system.py (4)

1147-1162: ⚖️ Poor tradeoff

Recursive mtime scan may be slow for large source directories.

_source_mtime walks the entire source directory tree to find the latest modification time. For datasets with many files, this could add noticeable latency on every cache freshness check. Consider caching the computed mtime or using a faster heuristic (e.g., only checking top-level directory mtime plus a sample of files).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 1147 - 1162, The `_source_mtime`
function performs a full recursive directory walk using source.rglob("*") to
find the latest modification time across all files, which becomes inefficient
for large source directories. To improve performance, implement a caching
mechanism to store previously computed mtimes so that repeated calls for the
same source directory do not re-scan the entire tree, or alternatively replace
the full recursive scan with a faster heuristic that only examines the top-level
directory mtime and a representative sample of files rather than traversing
every single file.

786-796: 💤 Low value

Mixed-type detection scans all frames at initialization.

_detect_mixed_type iterates through every frame in the LMDB dataset comparing atom types, which could be slow for very large datasets (thousands of frames). Consider caching this property in the LMDB metadata during conversion, or adding a sampling heuristic for large datasets.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 786 - 796, The _detect_mixed_type
method iterates through all frames in the dataset to check for mixed atom types,
which is inefficient for large datasets. Implement caching by storing the
detection result as an instance variable after the first call, and consider
adding a sampling heuristic for datasets with many frames such that for very
large datasets (e.g., more than a configurable threshold), only a sample of
frames are checked instead of all frames. Update the method to return the cached
result on subsequent calls and use the sampling strategy to limit iterations
while still maintaining reasonable confidence in the mixed-type detection.

1395-1408: 💤 Low value

Single-LMDB fast-path only; consider documenting multi-LMDB limitation.

The LMDB routing only handles the case where systems resolves to exactly one LMDB path. If multiple LMDB paths are provided (or conversion produces multiple systems), they fall through to DeepmdDataSystem. Consider adding a log warning or updating docstring to clarify this behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 1395 - 1408, The code currently
only provides optimized handling for a single LMDB system through
LmdbDataSystem, while multiple LMDB systems silently fall through to
DeepmdDataSystem. Add a log warning message when multiple LMDB paths are
detected (when len(systems) > 1 and all are LMDB) to alert users that they will
be handled through the standard DeepmdDataSystem path rather than the optimized
LmdbDataSystem, and update the function's docstring to document this single-LMDB
fast-path behavior and clarify what happens with multiple LMDB inputs.

1219-1277: ⚖️ Poor tradeoff

Stale lock files may persist after process crashes.

If a process crashes after creating the lock file (line 1242) but before the finally block runs (e.g., SIGKILL), the .lock file will remain. Subsequent processes will wait 5 minutes before timing out. Consider adding stale-lock detection using the PID written to the lock file, or a timestamp-based staleness check.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 1219 - 1277, The
_convert_system_by_dpdata function creates lock files to coordinate between
processes, but if a process crashes after creating the lock file but before the
finally block executes, the lock file persists causing other processes to wait 5
minutes before timing out. Add stale lock detection logic in the except
FileExistsError block before calling _wait_for_conversion. Read the PID from the
existing lock file and check if that process is still running using
platform-appropriate methods (e.g., os.kill with signal 0 on Unix, or process
existence checks). If the process is not running or if the lock file is older
than a reasonable threshold (e.g., 10 minutes), remove the stale lock file and
retry the lock acquisition instead of waiting. This prevents indefinite hangs on
stale locks from crashed processes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pd/entrypoints/main.py`:
- Around line 123-144: The current LMDB validation checks for both
training_systems and validation_systems only reject LMDB when the result is a
single system (len(...) == 1), but the error messages indicate that Paddle does
not support LMDB data in general. Remove the len(...) == 1 condition from both
the training_systems check (around line 123) and the validation_systems check
(around line 139) so that any LMDB dataset is rejected regardless of whether
it's a single system or multiple systems in the list. This ensures that any
LMDB-resolved dataset triggers the NotImplementedError with a clear message,
preventing less clear failures downstream.

In `@deepmd/pt_expt/entrypoints/main.py`:
- Around line 118-132: The current code in _get_neighbor_stat_data only
validates the single-LMDB case with `if len(systems) == 1 and
is_lmdb(systems[0])`, but when format-based conversion produces multiple LMDB
paths, this check is skipped and execution falls through to get_data() instead
of raising an appropriate error for list-form LMDB systems. Add validation
guards in both _get_neighbor_stat_data and _build_data_system functions to
ensure that after process_systems() is called, if any LMDB systems are returned,
they are validated to not be in list form (similar to what _detect_lmdb_path
does), and raise a clear error before reaching the fallback get_data() or
DeepmdDataSystem paths.

In `@deepmd/pt/entrypoints/main.py`:
- Around line 197-204: Add a validation guard before the existing condition that
checks `len(systems) == 1 and is_lmdb(systems[0])` to prevent multiple LMDB
paths from being passed to DpLoaderSet. The guard should use
`isinstance(systems, list)` combined with `any(isinstance(s, str) and is_lmdb(s)
for s in systems)` to detect when systems is a list containing LMDB paths and
raise a clear ValueError message explaining that LMDB datasets must be passed as
a scalar string rather than as a list.

In `@deepmd/utils/data_system.py`:
- Around line 848-852: Add a defensive check at the beginning of the
`_stack_frames` method to guard against empty frames lists. Before accessing
`frames[0]` at line 864, add validation to check if the frames list is empty and
handle this edge case appropriately, such as raising a more informative error or
returning early. This will prevent IndexError when the sampler yields an empty
batch due to malformed LMDB data, since both `_load_set` and `get_batch` call
this method with frames lists derived from sampler indices.

---

Nitpick comments:
In `@deepmd/utils/data_system.py`:
- Around line 1147-1162: The `_source_mtime` function performs a full recursive
directory walk using source.rglob("*") to find the latest modification time
across all files, which becomes inefficient for large source directories. To
improve performance, implement a caching mechanism to store previously computed
mtimes so that repeated calls for the same source directory do not re-scan the
entire tree, or alternatively replace the full recursive scan with a faster
heuristic that only examines the top-level directory mtime and a representative
sample of files rather than traversing every single file.
- Around line 786-796: The _detect_mixed_type method iterates through all frames
in the dataset to check for mixed atom types, which is inefficient for large
datasets. Implement caching by storing the detection result as an instance
variable after the first call, and consider adding a sampling heuristic for
datasets with many frames such that for very large datasets (e.g., more than a
configurable threshold), only a sample of frames are checked instead of all
frames. Update the method to return the cached result on subsequent calls and
use the sampling strategy to limit iterations while still maintaining reasonable
confidence in the mixed-type detection.
- Around line 1395-1408: The code currently only provides optimized handling for
a single LMDB system through LmdbDataSystem, while multiple LMDB systems
silently fall through to DeepmdDataSystem. Add a log warning message when
multiple LMDB paths are detected (when len(systems) > 1 and all are LMDB) to
alert users that they will be handled through the standard DeepmdDataSystem path
rather than the optimized LmdbDataSystem, and update the function's docstring to
document this single-LMDB fast-path behavior and clarify what happens with
multiple LMDB inputs.
- Around line 1219-1277: The _convert_system_by_dpdata function creates lock
files to coordinate between processes, but if a process crashes after creating
the lock file but before the finally block executes, the lock file persists
causing other processes to wait 5 minutes before timing out. Add stale lock
detection logic in the except FileExistsError block before calling
_wait_for_conversion. Read the PID from the existing lock file and check if that
process is still running using platform-appropriate methods (e.g., os.kill with
signal 0 on Unix, or process existence checks). If the process is not running or
if the lock file is older than a reasonable threshold (e.g., 10 minutes), remove
the stale lock file and retry the lock acquisition instead of waiting. This
prevents indefinite hangs on stale locks from crashed processes.
🪄 Autofix (Beta)

❌ Autofix failed (check again to retry)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 54161e1f-c01b-4956-b3a7-a6ab0d33cb04

📥 Commits

Reviewing files that changed from the base of the PR and between 4b6506d and 37b3b76.

📒 Files selected for processing (7)
  • deepmd/pd/entrypoints/main.py
  • deepmd/pt/entrypoints/main.py
  • deepmd/pt_expt/entrypoints/main.py
  • deepmd/utils/argcheck.py
  • deepmd/utils/data_system.py
  • pyproject.toml
  • source/tests/common/test_data_system_conversion.py

Comment on lines +123 to +144
if len(training_systems) == 1 and is_lmdb(training_systems[0]):
raise NotImplementedError(
"Paddle backend does not support LMDB training data yet. "
"Set training_data.out_format to 'deepmd/hdf5' when using "
"training_data.format for automatic conversion."
)
if validation_systems is not None:
val_patterns = validation_dataset_params.get("rglob_patterns", None)
validation_systems = process_systems(validation_systems, val_patterns)
validation_systems = process_systems(
validation_systems,
val_patterns,
fmt=validation_dataset_params.get("format", None),
out_fmt=validation_dataset_params.get(
"out_format", validation_dataset_params.get("output_format", None)
),
)
if len(validation_systems) == 1 and is_lmdb(validation_systems[0]):
raise NotImplementedError(
"Paddle backend does not support LMDB validation data yet. "
"Set validation_data.out_format to 'deepmd/hdf5' when using "
"validation_data.format for automatic conversion."
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Block any LMDB-resolved dataset for Paddle, not only single-path LMDB.

Line 123 and Line 139 only reject len(...) == 1 LMDB results, but the message says Paddle does not support LMDB data yet. LMDB lists can bypass this guard and fail later in less clear ways.

Suggested fix
-        if len(training_systems) == 1 and is_lmdb(training_systems[0]):
+        if any(is_lmdb(path) for path in training_systems):
             raise NotImplementedError(
                 "Paddle backend does not support LMDB training data yet. "
                 "Set training_data.out_format to 'deepmd/hdf5' when using "
                 "training_data.format for automatic conversion."
             )
@@
-            if len(validation_systems) == 1 and is_lmdb(validation_systems[0]):
+            if any(is_lmdb(path) for path in validation_systems):
                 raise NotImplementedError(
                     "Paddle backend does not support LMDB validation data yet. "
                     "Set validation_data.out_format to 'deepmd/hdf5' when using "
                     "validation_data.format for automatic conversion."
                 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pd/entrypoints/main.py` around lines 123 - 144, The current LMDB
validation checks for both training_systems and validation_systems only reject
LMDB when the result is a single system (len(...) == 1), but the error messages
indicate that Paddle does not support LMDB data in general. Remove the len(...)
== 1 condition from both the training_systems check (around line 123) and the
validation_systems check (around line 139) so that any LMDB dataset is rejected
regardless of whether it's a single system or multiple systems in the list. This
ensures that any LMDB-resolved dataset triggers the NotImplementedError with a
clear message, preventing less clear failures downstream.

Comment on lines +118 to 132
systems = process_systems(
dataset_params["systems"],
patterns=dataset_params.get("rglob_patterns", None),
fmt=dataset_params.get("format", None),
out_fmt=dataset_params.get(
"out_format", dataset_params.get("output_format", None)
),
)
if len(systems) == 1 and is_lmdb(systems[0]):
from deepmd.dpmodel.utils.lmdb_data import (
make_neighbor_stat_data,
)

return make_neighbor_stat_data(systems[0], type_map)
return get_data(dataset_params, 0, type_map, None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-apply scalar-LMDB validation after format-based conversion.

_detect_lmdb_path explicitly disallows LMDB in list-form systems, but when format is set the code skips that check and only handles the single-LMDB case. If process_systems(...) returns multiple LMDB paths, this now falls through to generic paths (get_data / DeepmdDataSystem) instead of raising the intended clear error.

Suggested fix pattern
     systems = process_systems(
         dataset_params["systems"],
         patterns=dataset_params.get("rglob_patterns", None),
         fmt=dataset_params.get("format", None),
         out_fmt=dataset_params.get(
             "out_format", dataset_params.get("output_format", None)
         ),
     )
+    lmdb_systems = [s for s in systems if is_lmdb(s)]
+    if lmdb_systems and not (len(systems) == 1 and len(lmdb_systems) == 1):
+        raise ValueError(
+            "LMDB datasets must be passed as a scalar 'systems' string "
+            "(e.g. 'systems': '/path/to/data.lmdb'); list-form systems "
+            "with LMDB paths are not supported."
+        )
     if len(systems) == 1 and is_lmdb(systems[0]):
         ...

Apply the same guard in both _get_neighbor_stat_data and _build_data_system.

Also applies to: 168-175

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt_expt/entrypoints/main.py` around lines 118 - 132, The current code
in _get_neighbor_stat_data only validates the single-LMDB case with `if
len(systems) == 1 and is_lmdb(systems[0])`, but when format-based conversion
produces multiple LMDB paths, this check is skipped and execution falls through
to get_data() instead of raising an appropriate error for list-form LMDB
systems. Add validation guards in both _get_neighbor_stat_data and
_build_data_system functions to ensure that after process_systems() is called,
if any LMDB systems are returned, they are validated to not be in list form
(similar to what _detect_lmdb_path does), and raise a clear error before
reaching the fallback get_data() or DeepmdDataSystem paths.

Comment on lines +197 to 204
if len(systems) == 1 and is_lmdb(systems[0]):
return LmdbDataset(
systems[0],
model_params_single["type_map"],
dataset_params["batch_size"],
auto_prob_style=dataset_params.get("auto_prob", None),
)
return DpLoaderSet(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect PT data loader implementations/usages for LMDB awareness.
fd -t f --extension py | xargs rg -n -C3 '\bclass\s+DpLoaderSet\b|\bclass\s+LmdbDataset\b|\bis_lmdb\s*\('

# Check whether tests cover multi-system conversion to LMDB in PT paths.
rg -n -C3 'format|out_format|output_format|lmdb|DpLoaderSet|LmdbDataset' source/tests deepmd/pt

Repository: deepmodeling/deepmd-kit

Length of output: 50380


🏁 Script executed:

# Check DpLoaderSet implementation in PT
head -150 deepmd/pt/utils/dataloader.py | tail -100

# Check pt_expt's validation that rejects multi-LMDB
sed -n '80,95p' deepmd/pt_expt/entrypoints/main.py

# Verify if pt should have same check as pt_expt
diff -u <(sed -n '80,95p' deepmd/pt_expt/entrypoints/main.py) <(sed -n '190,205p' deepmd/pt/entrypoints/main.py) || true

Repository: deepmodeling/deepmd-kit

Length of output: 4467


🏁 Script executed:

# Check DeepmdDataSetForLoader to see if it handles LMDB
rg -A 20 "class DeepmdDataSetForLoader" deepmd/pt/

# Check if there's any LMDB handling in DpLoaderSet or construct_dataset
rg -B 5 -A 10 "def construct_dataset" deepmd/pt/utils/dataloader.py

# Check what process_systems actually returns for multi-LMDB cases
rg -B 3 -A 10 "def process_systems" deepmd/utils/data_system.py | head -60

Repository: deepmodeling/deepmd-kit

Length of output: 2444


🏁 Script executed:

# Check if DeepmdData can handle LMDB paths
rg -B 5 -A 25 "class DeepmdData" deepmd/pt/utils/dataset.py | head -50

# Check if DeepmdData/DeepmdDataSystem supports LMDB directly
rg "LMDB\|lmdb" deepmd/pt/utils/dataset.py

Repository: deepmodeling/deepmd-kit

Length of output: 1015


🏁 Script executed:

# Check DeepmdData implementation - is it from pt or common
rg -l "class DeepmdData" deepmd/

# Find where DeepmdData is defined
fd -t f -name "*.py" | xargs rg "^class DeepmdData" | head -5

# Check if DeepmdData supports LMDB
rg -B 2 -A 20 "def __init__" deepmd/utils/data.py | grep -A 20 "class DeepmdData" | head -40

Repository: deepmodeling/deepmd-kit

Length of output: 642


🏁 Script executed:

# Check DeepmdData __init__ to see what it accepts
sed -n '/^class DeepmdData/,/^class /p' deepmd/utils/data.py | head -80

# Check if there's any LMDB support in DeepmdData
rg "LMDB|lmdb" deepmd/utils/data.py

# Check if process_systems can return LMDB paths and how DpLoaderSet handles them
sed -n '1330,1360p' deepmd/utils/data_system.py

Repository: deepmodeling/deepmd-kit

Length of output: 4291


🏁 Script executed:

# Verify DeepmdData cannot handle LMDB by checking what happens if we pass LMDB path
rg -B 5 -A 30 "def __init__" deepmd/utils/data.py | grep -A 30 "class DeepmdData" | tail -40

# See what error DeepmdData would raise for LMDB
rg "is_dir\|set_prefix" deepmd/utils/data.py | head -10

# Confirm pt_expt validation function and compare intent
sed -n '75,100p' deepmd/pt_expt/entrypoints/main.py

Repository: deepmodeling/deepmd-kit

Length of output: 1077


🏁 Script executed:

# Check what error DeepmdData actually raises when given an LMDB path
rg -A 5 "is_dir" deepmd/utils/data.py | head -20

# Verify if there are tests for multi-LMDB conversion to see current behavior
rg -l "process_systems.*lmdb\|LMDB.*process_systems" source/tests/ | head -5

# Check if pt/entrypoints/main.py should have the _detect_lmdb_path guard like pt_expt
sed -n '190,210p' deepmd/pt/entrypoints/main.py

Repository: deepmodeling/deepmd-kit

Length of output: 1179


Add LMDB validation guard to prevent silent failures when multi-system conversion produces LMDB paths.

The check at line 197 only routes the single-LMDB case to LmdbDataset. If process_systems(...) returns multiple LMDB paths (e.g., from multi-input conversion with default LMDB output), they fall through to DpLoaderSet, which passes them to DeepmdData. Since DeepmdData expects set.* subdirectories and will fail with a confusing "No set.* found" error when given LMDB paths, add an explicit guard matching the one in deepmd/pt_expt/entrypoints/main.py lines 84-89:

if isinstance(systems, list) and any(isinstance(s, str) and is_lmdb(s) for s in systems):
    raise ValueError(
        "LMDB datasets must be passed as a scalar 'systems' string "
        "(e.g. 'systems': '/path/to/data.lmdb'); list-form systems "
        "with LMDB paths are not supported."
    )

Place this check before the len(systems) == 1 condition to catch and clarify the error early.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt/entrypoints/main.py` around lines 197 - 204, Add a validation guard
before the existing condition that checks `len(systems) == 1 and
is_lmdb(systems[0])` to prevent multiple LMDB paths from being passed to
DpLoaderSet. The guard should use `isinstance(systems, list)` combined with
`any(isinstance(s, str) and is_lmdb(s) for s in systems)` to detect when systems
is a list containing LMDB paths and raise a clear ValueError message explaining
that LMDB datasets must be passed as a scalar string rather than as a list.

Comment on lines +848 to +852
def _load_set(self, set_name: str) -> dict[str, Any]:
"""Load one same-nloc LMDB group for legacy neighbor-stat code."""
indices = self._nloc_set_indices[str(set_name)]
frames = [self._reader[int(idx)] for idx in indices]
return self._stack_frames(frames)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add guard for empty frames list in _stack_frames.

_load_set and get_batch both call _stack_frames(frames) where frames comes from sampler indices. If the sampler ever yields an empty batch (edge case with malformed LMDB), frames[0] at line 864 would raise IndexError.

🛡️ Suggested defensive check
     def _stack_frames(self, frames: list[dict[str, Any]]) -> dict[str, Any]:
+        if not frames:
+            raise ValueError("Cannot stack an empty list of frames")
         out: dict[str, Any] = {}

Also applies to: 917-921

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 848 - 852, Add a defensive check at
the beginning of the `_stack_frames` method to guard against empty frames lists.
Before accessing `frames[0]` at line 864, add validation to check if the frames
list is empty and handle this edge case appropriately, such as raising a more
informative error or returning early. This will prevent IndexError when the
sampler yields an empty batch due to malformed LMDB data, since both `_load_set`
and `get_batch` call this method with frames lists derived from sampler indices.

@codecov

codecov Bot commented Jun 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 67.30159% with 103 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.13%. Comparing base (4b6506d) to head (37b3b76).

Files with missing lines Patch % Lines
deepmd/utils/data_system.py 67.61% 91 Missing ⚠️
deepmd/pt_expt/entrypoints/main.py 25.00% 6 Missing ⚠️
deepmd/pt/entrypoints/main.py 73.33% 4 Missing ⚠️
deepmd/pd/entrypoints/main.py 71.42% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5565      +/-   ##
==========================================
- Coverage   82.17%   82.13%   -0.04%     
==========================================
  Files         898      898              
  Lines      103576   103869     +293     
  Branches     4432     4433       +1     
==========================================
+ Hits        85117    85317     +200     
- Misses      17063    17159      +96     
+ Partials     1396     1393       -3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

An unexpected error occurred while generating fixes: Not Found - https://docs.github.com/rest/git/refs#get-a-reference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automatic Format Conversion in dp using dpdata

2 participants