feat(data): add dpdata format conversion by njzjz · Pull Request #5565 · deepmodeling/deepmd-kit

njzjz · 2026-06-20T16:01:43Z

Summary

add training/validation data format and out_format options for dpdata-backed conversion
default converted output to lmdb and cache converted datasets under /home/jzzeng/codes/deepmd-kit/.deepmd_dpdata_cache
route converted LMDB data through TF/JAX/PT/PT-expt-compatible data loaders; report a clear Paddle error for unsupported LMDB
add conversion/cache tests and make dpdata>=1.0.1 a runtime dependency

Tests

ruff check .
ruff format --check .
pytest source/tests/common/test_data_system_conversion.py -q
pytest source/tests/common/dpmodel/test_lmdb_data.py::TestLmdbDataReader::test_is_lmdb -q
pytest source/tests/tf/test_dp_test.py::TestDPTestEner::test_1frame -q
srun --gres=gpu:1 dp train input.json with extxyz input, default LMDB conversion, no --skip-neighbor-stat
srun --gres=gpu:1 dp --pt train input.json with extxyz input, default LMDB conversion, no --skip-neighbor-stat
srun --gres=gpu:1 dp --jax train input.json with extxyz input, default LMDB conversion, no --skip-neighbor-stat (environment used CPU JAX fallback because CUDA jaxlib is unavailable)
srun --gres=gpu:1 dp --pt-expt train input.json verified conversion and neighbor statistics; this environment then hits the existing pt-expt tensor serialization error during model construction

Summary by CodeRabbit

New Features
- Added automatic dataset format conversion, enabling users to specify input formats (e.g., .extxyz) with automatic conversion to LMDB or other formats
- Introduced new configuration options format and out_format/output_format for training and validation datasets to enable flexible data preprocessing
- Improved LMDB dataset support with intelligent routing and caching during model training
Tests
- Added comprehensive test suite for dataset format conversion functionality

+        from deepmd.dpmodel.utils.lmdb_data import (
+            LmdbDataReader,
+            SameNlocBatchSampler,
+            compute_block_targets,
+        )


+                finally:
+                    try:
+                        lock_path.unlink()
+                    except FileNotFoundError:


+    from deepmd.dpmodel.utils.lmdb_data import (
+        is_lmdb,
+    )


+    from deepmd.dpmodel.utils.lmdb_data import (
+        is_lmdb,
+    )


coderabbitai · 2026-06-20T16:09:06Z

📝 Walkthrough

Walkthrough

Adds automatic dpdata-based format conversion (e.g., extxyz → LMDB/HDF5) for DeePMD training and validation datasets. Introduces LmdbDataSystem as a DeepmdDataSystem-compatible adapter, adds format/out_format config schema fields, wires format-aware routing through all three training entrypoints (PT, PD, PT-expt), promotes dpdata to a core runtime dependency, and adds unit tests covering conversion, caching, and LmdbDataSystem batching.

Changes

Automatic dpdata Format Conversion for Training/Validation Datasets

Layer / File(s)	Summary
Config schema: format and out_format fields `deepmd/utils/argcheck.py`	Adds `format` and `out_format` (alias `output_format`, default `"lmdb"`) arguments to `training_data_args()` and `validation_data_args()` with dpdata conversion doc strings.
dpdata promoted to core runtime dependency `pyproject.toml`	Moves `dpdata` from optional test extras to top-level `[project].dependencies` at `>=1.0.1`, removing the older `>=0.2.7` test-only entry.
LmdbDataSystem adapter `deepmd/utils/data_system.py` (lines 3–43, 686–942)	Introduces `LmdbDataSystem` wrapping `LmdbDataReader` and `SameNlocBatchSampler`, adds module-level conversion cache constants, and new stdlib imports.
dpdata conversion helpers and execution `deepmd/utils/data_system.py` (lines 1086–1277)	Adds format-detection helpers, rglob input expansion, deterministic cache-path computation, mtime-based freshness checks, and `_convert_system_by_dpdata` with temp-file, lock-file concurrency control, and in-memory cache update.
process_systems and get_data with fmt/out_fmt `deepmd/utils/data_system.py` (lines 1279–1408)	Extends `process_systems` API with `fmt`/`out_fmt` parameters routing through `_convert_system_by_dpdata`; updates `get_data` to read format fields from config and return `LmdbDataSystem` on single-LMDB resolution.
PT entrypoint: format-aware LMDB routing and neighbor-stat refactor `deepmd/pt/entrypoints/main.py`	`_make_dp_loader_set()` returns `LmdbDataset` when resolved system is LMDB with `format=None`; adds `_get_neighbor_stat_data_from_params()` helper unifying LMDB and standard paths for multi-task and single-task neighbor-stat computation.
PD entrypoint: format-aware routing with LMDB guards `deepmd/pd/entrypoints/main.py`	Imports `is_lmdb`, passes `format`/`out_format` into `process_systems`, and raises `NotImplementedError` when training or validation resolves to a single LMDB path.
PT-expt entrypoint: conditional LMDB bypass `deepmd/pt_expt/entrypoints/main.py`	`_get_neighbor_stat_data` and `_build_data_system` skip direct `_detect_lmdb_path` when `format` is set, defer to `process_systems` with `fmt`/`out_fmt`, and route post-conversion single-LMDB results to `LmdbDataSystem`/`make_neighbor_stat_data`.
Unit tests for conversion and LmdbDataSystem `source/tests/common/test_data_system_conversion.py`	Adds `TestDpdataFormatConversion` with fake dpdata stubs, minimal HDF5/LMDB fixture writers, and four test methods covering LMDB default output, cwd-scoped cache isolation, explicit HDF5 conversion, and `get_data` with `format="auto"` producing `LmdbDataSystem` with correct batch shapes.

Sequence Diagram(s)

sequenceDiagram
  participant User as User Config
  participant Entrypoint as Training Entrypoint (pt/pd/pt_expt)
  participant ProcessSystems as process_systems()
  participant Converter as _convert_system_by_dpdata()
  participant Cache as _DPDATA_CONVERSION_CACHE / .deepmd_dpdata_cache
  participant DataSystem as LmdbDataSystem / DeepmdDataSystem

  User->>Entrypoint: training_data with format="extxyz", out_format="lmdb"
  Entrypoint->>ProcessSystems: process_systems(systems, fmt="extxyz", out_fmt="lmdb")
  ProcessSystems->>Converter: _convert_system_by_dpdata(path, fmt, out_fmt)
  Converter->>Cache: check freshness (mtime vs. cache)
  alt cache fresh
    Cache-->>Converter: return cached LMDB path
  else cache stale or missing
    Converter->>Converter: acquire .lock file
    Converter->>Converter: dpdata.MultiSystems → temp output
    Converter->>Cache: move to cache path, update _DPDATA_CONVERSION_CACHE
    Converter->>Converter: release lock
  end
  Converter-->>ProcessSystems: [lmdb_path]
  ProcessSystems-->>Entrypoint: [lmdb_path]
  Entrypoint->>DataSystem: LmdbDataSystem(lmdb_path, type_map, ...)
  DataSystem-->>Entrypoint: batches (type, natoms_vec, coord, box, ...)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

deepmodeling/deepmd-kit#5283: Introduced the LMDB reader utilities (is_lmdb, LmdbDataReader, SameNlocBatchSampler) that this PR's new LmdbDataSystem adapter and entrypoint routing directly depend on.
deepmodeling/deepmd-kit#5408: Added pt_expt LMDB scalar-path support that this PR extends by conditionally bypassing _detect_lmdb_path and routing through process_systems with format parameters.
deepmodeling/deepmd-kit#5419: Adds FullValidator LMDB evaluation support via LmdbTestDataNlocView, which consumes the LmdbDataset objects produced through the process_systems/format-driven routing introduced in this PR.

Suggested labels

new feature, Python

Suggested reviewers

wanghan-iapcm
iProzd

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.24% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(data): add dpdata format conversion' accurately and concisely summarizes the main change: adding automatic format conversion functionality using dpdata.
Linked Issues check	✅ Passed	All coding objectives from issue `#5237` are met: automatic format detection via dpdata integration [`#5237`], format conversion support in training/validation data configs [`#5237`], and seamless conversion to DeepMD-compatible formats [`#5237`].
Out of Scope Changes check	✅ Passed	All changes are within scope: format conversion integration, updated data configs, entrypoint modifications, new tests, and dpdata dependency elevation all directly support the automatic format conversion objective.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (4)

deepmd/utils/data_system.py (4)
1147-1162: ⚖️ Poor tradeoff

Recursive mtime scan may be slow for large source directories.

_source_mtime walks the entire source directory tree to find the latest modification time. For datasets with many files, this could add noticeable latency on every cache freshness check. Consider caching the computed mtime or using a faster heuristic (e.g., only checking top-level directory mtime plus a sample of files).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 1147 - 1162, The `_source_mtime`
function performs a full recursive directory walk using source.rglob("*") to
find the latest modification time across all files, which becomes inefficient
for large source directories. To improve performance, implement a caching
mechanism to store previously computed mtimes so that repeated calls for the
same source directory do not re-scan the entire tree, or alternatively replace
the full recursive scan with a faster heuristic that only examines the top-level
directory mtime and a representative sample of files rather than traversing
every single file.
786-796: 💤 Low value

Mixed-type detection scans all frames at initialization.

_detect_mixed_type iterates through every frame in the LMDB dataset comparing atom types, which could be slow for very large datasets (thousands of frames). Consider caching this property in the LMDB metadata during conversion, or adding a sampling heuristic for large datasets.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 786 - 796, The _detect_mixed_type
method iterates through all frames in the dataset to check for mixed atom types,
which is inefficient for large datasets. Implement caching by storing the
detection result as an instance variable after the first call, and consider
adding a sampling heuristic for datasets with many frames such that for very
large datasets (e.g., more than a configurable threshold), only a sample of
frames are checked instead of all frames. Update the method to return the cached
result on subsequent calls and use the sampling strategy to limit iterations
while still maintaining reasonable confidence in the mixed-type detection.
1395-1408: 💤 Low value

Single-LMDB fast-path only; consider documenting multi-LMDB limitation.

The LMDB routing only handles the case where systems resolves to exactly one LMDB path. If multiple LMDB paths are provided (or conversion produces multiple systems), they fall through to DeepmdDataSystem. Consider adding a log warning or updating docstring to clarify this behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 1395 - 1408, The code currently
only provides optimized handling for a single LMDB system through
LmdbDataSystem, while multiple LMDB systems silently fall through to
DeepmdDataSystem. Add a log warning message when multiple LMDB paths are
detected (when len(systems) > 1 and all are LMDB) to alert users that they will
be handled through the standard DeepmdDataSystem path rather than the optimized
LmdbDataSystem, and update the function's docstring to document this single-LMDB
fast-path behavior and clarify what happens with multiple LMDB inputs.
1219-1277: ⚖️ Poor tradeoff

Stale lock files may persist after process crashes.

If a process crashes after creating the lock file (line 1242) but before the finally block runs (e.g., SIGKILL), the .lock file will remain. Subsequent processes will wait 5 minutes before timing out. Consider adding stale-lock detection using the PID written to the lock file, or a timestamp-based staleness check.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/data_system.py` around lines 1219 - 1277, The
_convert_system_by_dpdata function creates lock files to coordinate between
processes, but if a process crashes after creating the lock file but before the
finally block executes, the lock file persists causing other processes to wait 5
minutes before timing out. Add stale lock detection logic in the except
FileExistsError block before calling _wait_for_conversion. Read the PID from the
existing lock file and check if that process is still running using
platform-appropriate methods (e.g., os.kill with signal 0 on Unix, or process
existence checks). If the process is not running or if the lock file is older
than a reasonable threshold (e.g., 10 minutes), remove the stale lock file and
retry the lock acquisition instead of waiting. This prevents indefinite hangs on
stale locks from crashed processes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pd/entrypoints/main.py`:
- Around line 123-144: The current LMDB validation checks for both
training_systems and validation_systems only reject LMDB when the result is a
single system (len(...) == 1), but the error messages indicate that Paddle does
not support LMDB data in general. Remove the len(...) == 1 condition from both
the training_systems check (around line 123) and the validation_systems check
(around line 139) so that any LMDB dataset is rejected regardless of whether
it's a single system or multiple systems in the list. This ensures that any
LMDB-resolved dataset triggers the NotImplementedError with a clear message,
preventing less clear failures downstream.

In `@deepmd/pt_expt/entrypoints/main.py`:
- Around line 118-132: The current code in _get_neighbor_stat_data only
validates the single-LMDB case with `if len(systems) == 1 and
is_lmdb(systems[0])`, but when format-based conversion produces multiple LMDB
paths, this check is skipped and execution falls through to get_data() instead
of raising an appropriate error for list-form LMDB systems. Add validation
guards in both _get_neighbor_stat_data and _build_data_system functions to
ensure that after process_systems() is called, if any LMDB systems are returned,
they are validated to not be in list form (similar to what _detect_lmdb_path
does), and raise a clear error before reaching the fallback get_data() or
DeepmdDataSystem paths.

In `@deepmd/pt/entrypoints/main.py`:
- Around line 197-204: Add a validation guard before the existing condition that
checks `len(systems) == 1 and is_lmdb(systems[0])` to prevent multiple LMDB
paths from being passed to DpLoaderSet. The guard should use
`isinstance(systems, list)` combined with `any(isinstance(s, str) and is_lmdb(s)
for s in systems)` to detect when systems is a list containing LMDB paths and
raise a clear ValueError message explaining that LMDB datasets must be passed as
a scalar string rather than as a list.

In `@deepmd/utils/data_system.py`:
- Around line 848-852: Add a defensive check at the beginning of the
`_stack_frames` method to guard against empty frames lists. Before accessing
`frames[0]` at line 864, add validation to check if the frames list is empty and
handle this edge case appropriately, such as raising a more informative error or
returning early. This will prevent IndexError when the sampler yields an empty
batch due to malformed LMDB data, since both `_load_set` and `get_batch` call
this method with frames lists derived from sampler indices.

---

Nitpick comments:
In `@deepmd/utils/data_system.py`:
- Around line 1147-1162: The `_source_mtime` function performs a full recursive
directory walk using source.rglob("*") to find the latest modification time
across all files, which becomes inefficient for large source directories. To
improve performance, implement a caching mechanism to store previously computed
mtimes so that repeated calls for the same source directory do not re-scan the
entire tree, or alternatively replace the full recursive scan with a faster
heuristic that only examines the top-level directory mtime and a representative
sample of files rather than traversing every single file.
- Around line 786-796: The _detect_mixed_type method iterates through all frames
in the dataset to check for mixed atom types, which is inefficient for large
datasets. Implement caching by storing the detection result as an instance
variable after the first call, and consider adding a sampling heuristic for
datasets with many frames such that for very large datasets (e.g., more than a
configurable threshold), only a sample of frames are checked instead of all
frames. Update the method to return the cached result on subsequent calls and
use the sampling strategy to limit iterations while still maintaining reasonable
confidence in the mixed-type detection.
- Around line 1395-1408: The code currently only provides optimized handling for
a single LMDB system through LmdbDataSystem, while multiple LMDB systems
silently fall through to DeepmdDataSystem. Add a log warning message when
multiple LMDB paths are detected (when len(systems) > 1 and all are LMDB) to
alert users that they will be handled through the standard DeepmdDataSystem path
rather than the optimized LmdbDataSystem, and update the function's docstring to
document this single-LMDB fast-path behavior and clarify what happens with
multiple LMDB inputs.
- Around line 1219-1277: The _convert_system_by_dpdata function creates lock
files to coordinate between processes, but if a process crashes after creating
the lock file but before the finally block executes, the lock file persists
causing other processes to wait 5 minutes before timing out. Add stale lock
detection logic in the except FileExistsError block before calling
_wait_for_conversion. Read the PID from the existing lock file and check if that
process is still running using platform-appropriate methods (e.g., os.kill with
signal 0 on Unix, or process existence checks). If the process is not running or
if the lock file is older than a reasonable threshold (e.g., 10 minutes), remove
the stale lock file and retry the lock acquisition instead of waiting. This
prevents indefinite hangs on stale locks from crashed processes.

🪄 Autofix (Beta)

❌ Autofix failed (check again to retry)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 54161e1f-c01b-4956-b3a7-a6ab0d33cb04

📥 Commits

Reviewing files that changed from the base of the PR and between 4b6506d and 37b3b76.

📒 Files selected for processing (7)

deepmd/pd/entrypoints/main.py
deepmd/pt/entrypoints/main.py
deepmd/pt_expt/entrypoints/main.py
deepmd/utils/argcheck.py
deepmd/utils/data_system.py
pyproject.toml
source/tests/common/test_data_system_conversion.py

coderabbitai · 2026-06-20T16:09:09Z

+        if len(training_systems) == 1 and is_lmdb(training_systems[0]):
+            raise NotImplementedError(
+                "Paddle backend does not support LMDB training data yet. "
+                "Set training_data.out_format to 'deepmd/hdf5' when using "
+                "training_data.format for automatic conversion."
+            )
        if validation_systems is not None:
            val_patterns = validation_dataset_params.get("rglob_patterns", None)
-            validation_systems = process_systems(validation_systems, val_patterns)
+            validation_systems = process_systems(
+                validation_systems,
+                val_patterns,
+                fmt=validation_dataset_params.get("format", None),
+                out_fmt=validation_dataset_params.get(
+                    "out_format", validation_dataset_params.get("output_format", None)
+                ),
+            )
+            if len(validation_systems) == 1 and is_lmdb(validation_systems[0]):
+                raise NotImplementedError(
+                    "Paddle backend does not support LMDB validation data yet. "
+                    "Set validation_data.out_format to 'deepmd/hdf5' when using "
+                    "validation_data.format for automatic conversion."
+                )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Block any LMDB-resolved dataset for Paddle, not only single-path LMDB.

Line 123 and Line 139 only reject len(...) == 1 LMDB results, but the message says Paddle does not support LMDB data yet. LMDB lists can bypass this guard and fail later in less clear ways.

Suggested fix

- if len(training_systems) == 1 and is_lmdb(training_systems[0]): + if any(is_lmdb(path) for path in training_systems): raise NotImplementedError( "Paddle backend does not support LMDB training data yet. " "Set training_data.out_format to 'deepmd/hdf5' when using " "training_data.format for automatic conversion." ) @@ - if len(validation_systems) == 1 and is_lmdb(validation_systems[0]): + if any(is_lmdb(path) for path in validation_systems): raise NotImplementedError( "Paddle backend does not support LMDB validation data yet. " "Set validation_data.out_format to 'deepmd/hdf5' when using " "validation_data.format for automatic conversion." )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pd/entrypoints/main.py` around lines 123 - 144, The current LMDB validation checks for both training_systems and validation_systems only reject LMDB when the result is a single system (len(...) == 1), but the error messages indicate that Paddle does not support LMDB data in general. Remove the len(...) == 1 condition from both the training_systems check (around line 123) and the validation_systems check (around line 139) so that any LMDB dataset is rejected regardless of whether it's a single system or multiple systems in the list. This ensures that any LMDB-resolved dataset triggers the NotImplementedError with a clear message, preventing less clear failures downstream.

coderabbitai · 2026-06-20T16:09:09Z

+    systems = process_systems(
+        dataset_params["systems"],
+        patterns=dataset_params.get("rglob_patterns", None),
+        fmt=dataset_params.get("format", None),
+        out_fmt=dataset_params.get(
+            "out_format", dataset_params.get("output_format", None)
+        ),
+    )
+    if len(systems) == 1 and is_lmdb(systems[0]):
+        from deepmd.dpmodel.utils.lmdb_data import (
+            make_neighbor_stat_data,
+        )
+
+        return make_neighbor_stat_data(systems[0], type_map)
    return get_data(dataset_params, 0, type_map, None)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-apply scalar-LMDB validation after format-based conversion.

_detect_lmdb_path explicitly disallows LMDB in list-form systems, but when format is set the code skips that check and only handles the single-LMDB case. If process_systems(...) returns multiple LMDB paths, this now falls through to generic paths (get_data / DeepmdDataSystem) instead of raising the intended clear error.

Suggested fix pattern

systems = process_systems( dataset_params["systems"], patterns=dataset_params.get("rglob_patterns", None), fmt=dataset_params.get("format", None), out_fmt=dataset_params.get( "out_format", dataset_params.get("output_format", None) ), ) + lmdb_systems = [s for s in systems if is_lmdb(s)] + if lmdb_systems and not (len(systems) == 1 and len(lmdb_systems) == 1): + raise ValueError( + "LMDB datasets must be passed as a scalar 'systems' string " + "(e.g. 'systems': '/path/to/data.lmdb'); list-form systems " + "with LMDB paths are not supported." + ) if len(systems) == 1 and is_lmdb(systems[0]): ...

Apply the same guard in both _get_neighbor_stat_data and _build_data_system.

Also applies to: 168-175

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/entrypoints/main.py` around lines 118 - 132, The current code in _get_neighbor_stat_data only validates the single-LMDB case with `if len(systems) == 1 and is_lmdb(systems[0])`, but when format-based conversion produces multiple LMDB paths, this check is skipped and execution falls through to get_data() instead of raising an appropriate error for list-form LMDB systems. Add validation guards in both _get_neighbor_stat_data and _build_data_system functions to ensure that after process_systems() is called, if any LMDB systems are returned, they are validated to not be in list form (similar to what _detect_lmdb_path does), and raise a clear error before reaching the fallback get_data() or DeepmdDataSystem paths.

coderabbitai · 2026-06-20T16:09:09Z

+            if len(systems) == 1 and is_lmdb(systems[0]):
+                return LmdbDataset(
+                    systems[0],
+                    model_params_single["type_map"],
+                    dataset_params["batch_size"],
+                    auto_prob_style=dataset_params.get("auto_prob", None),
+                )
            return DpLoaderSet(


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect PT data loader implementations/usages for LMDB awareness. fd -t f --extension py | xargs rg -n -C3 '\bclass\s+DpLoaderSet\b|\bclass\s+LmdbDataset\b|\bis_lmdb\s*\(' # Check whether tests cover multi-system conversion to LMDB in PT paths. rg -n -C3 'format|out_format|output_format|lmdb|DpLoaderSet|LmdbDataset' source/tests deepmd/pt

Repository: deepmodeling/deepmd-kit

Length of output: 50380

🏁 Script executed:

# Check DpLoaderSet implementation in PT head -150 deepmd/pt/utils/dataloader.py | tail -100 # Check pt_expt's validation that rejects multi-LMDB sed -n '80,95p' deepmd/pt_expt/entrypoints/main.py # Verify if pt should have same check as pt_expt diff -u <(sed -n '80,95p' deepmd/pt_expt/entrypoints/main.py) <(sed -n '190,205p' deepmd/pt/entrypoints/main.py) || true

Repository: deepmodeling/deepmd-kit

Length of output: 4467

🏁 Script executed:

# Check DeepmdDataSetForLoader to see if it handles LMDB rg -A 20 "class DeepmdDataSetForLoader" deepmd/pt/ # Check if there's any LMDB handling in DpLoaderSet or construct_dataset rg -B 5 -A 10 "def construct_dataset" deepmd/pt/utils/dataloader.py # Check what process_systems actually returns for multi-LMDB cases rg -B 3 -A 10 "def process_systems" deepmd/utils/data_system.py | head -60

Repository: deepmodeling/deepmd-kit

Length of output: 2444

🏁 Script executed:

# Check if DeepmdData can handle LMDB paths rg -B 5 -A 25 "class DeepmdData" deepmd/pt/utils/dataset.py | head -50 # Check if DeepmdData/DeepmdDataSystem supports LMDB directly rg "LMDB\|lmdb" deepmd/pt/utils/dataset.py

Repository: deepmodeling/deepmd-kit

Length of output: 1015

🏁 Script executed:

# Check DeepmdData implementation - is it from pt or common rg -l "class DeepmdData" deepmd/ # Find where DeepmdData is defined fd -t f -name "*.py" | xargs rg "^class DeepmdData" | head -5 # Check if DeepmdData supports LMDB rg -B 2 -A 20 "def __init__" deepmd/utils/data.py | grep -A 20 "class DeepmdData" | head -40

Repository: deepmodeling/deepmd-kit

Length of output: 642

🏁 Script executed:

# Check DeepmdData __init__ to see what it accepts sed -n '/^class DeepmdData/,/^class /p' deepmd/utils/data.py | head -80 # Check if there's any LMDB support in DeepmdData rg "LMDB|lmdb" deepmd/utils/data.py # Check if process_systems can return LMDB paths and how DpLoaderSet handles them sed -n '1330,1360p' deepmd/utils/data_system.py

Repository: deepmodeling/deepmd-kit

Length of output: 4291

🏁 Script executed:

# Verify DeepmdData cannot handle LMDB by checking what happens if we pass LMDB path rg -B 5 -A 30 "def __init__" deepmd/utils/data.py | grep -A 30 "class DeepmdData" | tail -40 # See what error DeepmdData would raise for LMDB rg "is_dir\|set_prefix" deepmd/utils/data.py | head -10 # Confirm pt_expt validation function and compare intent sed -n '75,100p' deepmd/pt_expt/entrypoints/main.py

Repository: deepmodeling/deepmd-kit

Length of output: 1077

🏁 Script executed:

# Check what error DeepmdData actually raises when given an LMDB path rg -A 5 "is_dir" deepmd/utils/data.py | head -20 # Verify if there are tests for multi-LMDB conversion to see current behavior rg -l "process_systems.*lmdb\|LMDB.*process_systems" source/tests/ | head -5 # Check if pt/entrypoints/main.py should have the _detect_lmdb_path guard like pt_expt sed -n '190,210p' deepmd/pt/entrypoints/main.py

Repository: deepmodeling/deepmd-kit

Length of output: 1179

Add LMDB validation guard to prevent silent failures when multi-system conversion produces LMDB paths.

The check at line 197 only routes the single-LMDB case to LmdbDataset. If process_systems(...) returns multiple LMDB paths (e.g., from multi-input conversion with default LMDB output), they fall through to DpLoaderSet, which passes them to DeepmdData. Since DeepmdData expects set.* subdirectories and will fail with a confusing "No set.* found" error when given LMDB paths, add an explicit guard matching the one in deepmd/pt_expt/entrypoints/main.py lines 84-89:

if isinstance(systems, list) and any(isinstance(s, str) and is_lmdb(s) for s in systems): raise ValueError( "LMDB datasets must be passed as a scalar 'systems' string " "(e.g. 'systems': '/path/to/data.lmdb'); list-form systems " "with LMDB paths are not supported." )

Place this check before the len(systems) == 1 condition to catch and clarify the error early.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt/entrypoints/main.py` around lines 197 - 204, Add a validation guard before the existing condition that checks `len(systems) == 1 and is_lmdb(systems[0])` to prevent multiple LMDB paths from being passed to DpLoaderSet. The guard should use `isinstance(systems, list)` combined with `any(isinstance(s, str) and is_lmdb(s) for s in systems)` to detect when systems is a list containing LMDB paths and raise a clear ValueError message explaining that LMDB datasets must be passed as a scalar string rather than as a list.

coderabbitai · 2026-06-20T16:09:09Z

+    def _load_set(self, set_name: str) -> dict[str, Any]:
+        """Load one same-nloc LMDB group for legacy neighbor-stat code."""
+        indices = self._nloc_set_indices[str(set_name)]
+        frames = [self._reader[int(idx)] for idx in indices]
+        return self._stack_frames(frames)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add guard for empty frames list in _stack_frames.

_load_set and get_batch both call _stack_frames(frames) where frames comes from sampler indices. If the sampler ever yields an empty batch (edge case with malformed LMDB), frames[0] at line 864 would raise IndexError.

🛡️ Suggested defensive check

def _stack_frames(self, frames: list[dict[str, Any]]) -> dict[str, Any]: + if not frames: + raise ValueError("Cannot stack an empty list of frames") out: dict[str, Any] = {}

Also applies to: 917-921

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/utils/data_system.py` around lines 848 - 852, Add a defensive check at the beginning of the `_stack_frames` method to guard against empty frames lists. Before accessing `frames[0]` at line 864, add validation to check if the frames list is empty and handle this edge case appropriately, such as raising a more informative error or returning early. This will prevent IndexError when the sampler yields an empty batch due to malformed LMDB data, since both `_load_set` and `get_batch` call this method with frames lists derived from sampler indices.

codecov · 2026-06-20T17:04:27Z

Codecov Report

❌ Patch coverage is 67.30159% with 103 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.13%. Comparing base (4b6506d) to head (37b3b76).

Files with missing lines	Patch %	Lines
deepmd/utils/data_system.py	67.61%	91 Missing ⚠️
deepmd/pt_expt/entrypoints/main.py	25.00%	6 Missing ⚠️
deepmd/pt/entrypoints/main.py	73.33%	4 Missing ⚠️
deepmd/pd/entrypoints/main.py	71.42%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5565      +/-   ##
==========================================
- Coverage   82.17%   82.13%   -0.04%     
==========================================
  Files         898      898              
  Lines      103576   103869     +293     
  Branches     4432     4433       +1     
==========================================
+ Hits        85117    85317     +200     
- Misses      17063    17159      +96     
+ Partials     1396     1393       -3

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

coderabbitai · 2026-06-21T05:59:30Z

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

An unexpected error occurred while generating fixes: Not Found - https://docs.github.com/rest/git/refs#get-a-reference

feat(data): add dpdata format conversion

37b3b76

dosubot Bot added enhancement new feature labels Jun 20, 2026

github-actions Bot added the Python label Jun 20, 2026

github-advanced-security AI found potential problems Jun 20, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(data): add dpdata format conversion#5565

feat(data): add dpdata format conversion#5565
njzjz wants to merge 1 commit into
deepmodeling:masterfrom
njzjz:feat/dpdata-auto-conversion

njzjz commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 20, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment •

edited

Loading

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

codecov Bot commented Jun 20, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

njzjz commented Jun 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 20, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

njzjz commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot left a comment •

edited

Loading

codecov Bot commented Jun 20, 2026 •

edited

Loading

coderabbitai Bot commented Jun 21, 2026 •

edited

Loading