feat(data): add dpdata format conversion#5565
Conversation
| from deepmd.dpmodel.utils.lmdb_data import ( | ||
| LmdbDataReader, | ||
| SameNlocBatchSampler, | ||
| compute_block_targets, | ||
| ) |
| finally: | ||
| try: | ||
| lock_path.unlink() | ||
| except FileNotFoundError: |
| from deepmd.dpmodel.utils.lmdb_data import ( | ||
| is_lmdb, | ||
| ) |
| from deepmd.dpmodel.utils.lmdb_data import ( | ||
| is_lmdb, | ||
| ) |
📝 WalkthroughWalkthroughAdds automatic dpdata-based format conversion (e.g., extxyz → LMDB/HDF5) for DeePMD training and validation datasets. Introduces ChangesAutomatic dpdata Format Conversion for Training/Validation Datasets
Sequence Diagram(s)sequenceDiagram
participant User as User Config
participant Entrypoint as Training Entrypoint (pt/pd/pt_expt)
participant ProcessSystems as process_systems()
participant Converter as _convert_system_by_dpdata()
participant Cache as _DPDATA_CONVERSION_CACHE / .deepmd_dpdata_cache
participant DataSystem as LmdbDataSystem / DeepmdDataSystem
User->>Entrypoint: training_data with format="extxyz", out_format="lmdb"
Entrypoint->>ProcessSystems: process_systems(systems, fmt="extxyz", out_fmt="lmdb")
ProcessSystems->>Converter: _convert_system_by_dpdata(path, fmt, out_fmt)
Converter->>Cache: check freshness (mtime vs. cache)
alt cache fresh
Cache-->>Converter: return cached LMDB path
else cache stale or missing
Converter->>Converter: acquire .lock file
Converter->>Converter: dpdata.MultiSystems → temp output
Converter->>Cache: move to cache path, update _DPDATA_CONVERSION_CACHE
Converter->>Converter: release lock
end
Converter-->>ProcessSystems: [lmdb_path]
ProcessSystems-->>Entrypoint: [lmdb_path]
Entrypoint->>DataSystem: LmdbDataSystem(lmdb_path, type_map, ...)
DataSystem-->>Entrypoint: batches (type, natoms_vec, coord, box, ...)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (4)
deepmd/utils/data_system.py (4)
1147-1162: ⚖️ Poor tradeoffRecursive mtime scan may be slow for large source directories.
_source_mtimewalks the entire source directory tree to find the latest modification time. For datasets with many files, this could add noticeable latency on every cache freshness check. Consider caching the computed mtime or using a faster heuristic (e.g., only checking top-level directory mtime plus a sample of files).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/utils/data_system.py` around lines 1147 - 1162, The `_source_mtime` function performs a full recursive directory walk using source.rglob("*") to find the latest modification time across all files, which becomes inefficient for large source directories. To improve performance, implement a caching mechanism to store previously computed mtimes so that repeated calls for the same source directory do not re-scan the entire tree, or alternatively replace the full recursive scan with a faster heuristic that only examines the top-level directory mtime and a representative sample of files rather than traversing every single file.
786-796: 💤 Low valueMixed-type detection scans all frames at initialization.
_detect_mixed_typeiterates through every frame in the LMDB dataset comparing atom types, which could be slow for very large datasets (thousands of frames). Consider caching this property in the LMDB metadata during conversion, or adding a sampling heuristic for large datasets.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/utils/data_system.py` around lines 786 - 796, The _detect_mixed_type method iterates through all frames in the dataset to check for mixed atom types, which is inefficient for large datasets. Implement caching by storing the detection result as an instance variable after the first call, and consider adding a sampling heuristic for datasets with many frames such that for very large datasets (e.g., more than a configurable threshold), only a sample of frames are checked instead of all frames. Update the method to return the cached result on subsequent calls and use the sampling strategy to limit iterations while still maintaining reasonable confidence in the mixed-type detection.
1395-1408: 💤 Low valueSingle-LMDB fast-path only; consider documenting multi-LMDB limitation.
The LMDB routing only handles the case where
systemsresolves to exactly one LMDB path. If multiple LMDB paths are provided (or conversion produces multiple systems), they fall through toDeepmdDataSystem. Consider adding a log warning or updating docstring to clarify this behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/utils/data_system.py` around lines 1395 - 1408, The code currently only provides optimized handling for a single LMDB system through LmdbDataSystem, while multiple LMDB systems silently fall through to DeepmdDataSystem. Add a log warning message when multiple LMDB paths are detected (when len(systems) > 1 and all are LMDB) to alert users that they will be handled through the standard DeepmdDataSystem path rather than the optimized LmdbDataSystem, and update the function's docstring to document this single-LMDB fast-path behavior and clarify what happens with multiple LMDB inputs.
1219-1277: ⚖️ Poor tradeoffStale lock files may persist after process crashes.
If a process crashes after creating the lock file (line 1242) but before the
finallyblock runs (e.g., SIGKILL), the.lockfile will remain. Subsequent processes will wait 5 minutes before timing out. Consider adding stale-lock detection using the PID written to the lock file, or a timestamp-based staleness check.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/utils/data_system.py` around lines 1219 - 1277, The _convert_system_by_dpdata function creates lock files to coordinate between processes, but if a process crashes after creating the lock file but before the finally block executes, the lock file persists causing other processes to wait 5 minutes before timing out. Add stale lock detection logic in the except FileExistsError block before calling _wait_for_conversion. Read the PID from the existing lock file and check if that process is still running using platform-appropriate methods (e.g., os.kill with signal 0 on Unix, or process existence checks). If the process is not running or if the lock file is older than a reasonable threshold (e.g., 10 minutes), remove the stale lock file and retry the lock acquisition instead of waiting. This prevents indefinite hangs on stale locks from crashed processes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepmd/pd/entrypoints/main.py`:
- Around line 123-144: The current LMDB validation checks for both
training_systems and validation_systems only reject LMDB when the result is a
single system (len(...) == 1), but the error messages indicate that Paddle does
not support LMDB data in general. Remove the len(...) == 1 condition from both
the training_systems check (around line 123) and the validation_systems check
(around line 139) so that any LMDB dataset is rejected regardless of whether
it's a single system or multiple systems in the list. This ensures that any
LMDB-resolved dataset triggers the NotImplementedError with a clear message,
preventing less clear failures downstream.
In `@deepmd/pt_expt/entrypoints/main.py`:
- Around line 118-132: The current code in _get_neighbor_stat_data only
validates the single-LMDB case with `if len(systems) == 1 and
is_lmdb(systems[0])`, but when format-based conversion produces multiple LMDB
paths, this check is skipped and execution falls through to get_data() instead
of raising an appropriate error for list-form LMDB systems. Add validation
guards in both _get_neighbor_stat_data and _build_data_system functions to
ensure that after process_systems() is called, if any LMDB systems are returned,
they are validated to not be in list form (similar to what _detect_lmdb_path
does), and raise a clear error before reaching the fallback get_data() or
DeepmdDataSystem paths.
In `@deepmd/pt/entrypoints/main.py`:
- Around line 197-204: Add a validation guard before the existing condition that
checks `len(systems) == 1 and is_lmdb(systems[0])` to prevent multiple LMDB
paths from being passed to DpLoaderSet. The guard should use
`isinstance(systems, list)` combined with `any(isinstance(s, str) and is_lmdb(s)
for s in systems)` to detect when systems is a list containing LMDB paths and
raise a clear ValueError message explaining that LMDB datasets must be passed as
a scalar string rather than as a list.
In `@deepmd/utils/data_system.py`:
- Around line 848-852: Add a defensive check at the beginning of the
`_stack_frames` method to guard against empty frames lists. Before accessing
`frames[0]` at line 864, add validation to check if the frames list is empty and
handle this edge case appropriately, such as raising a more informative error or
returning early. This will prevent IndexError when the sampler yields an empty
batch due to malformed LMDB data, since both `_load_set` and `get_batch` call
this method with frames lists derived from sampler indices.
---
Nitpick comments:
In `@deepmd/utils/data_system.py`:
- Around line 1147-1162: The `_source_mtime` function performs a full recursive
directory walk using source.rglob("*") to find the latest modification time
across all files, which becomes inefficient for large source directories. To
improve performance, implement a caching mechanism to store previously computed
mtimes so that repeated calls for the same source directory do not re-scan the
entire tree, or alternatively replace the full recursive scan with a faster
heuristic that only examines the top-level directory mtime and a representative
sample of files rather than traversing every single file.
- Around line 786-796: The _detect_mixed_type method iterates through all frames
in the dataset to check for mixed atom types, which is inefficient for large
datasets. Implement caching by storing the detection result as an instance
variable after the first call, and consider adding a sampling heuristic for
datasets with many frames such that for very large datasets (e.g., more than a
configurable threshold), only a sample of frames are checked instead of all
frames. Update the method to return the cached result on subsequent calls and
use the sampling strategy to limit iterations while still maintaining reasonable
confidence in the mixed-type detection.
- Around line 1395-1408: The code currently only provides optimized handling for
a single LMDB system through LmdbDataSystem, while multiple LMDB systems
silently fall through to DeepmdDataSystem. Add a log warning message when
multiple LMDB paths are detected (when len(systems) > 1 and all are LMDB) to
alert users that they will be handled through the standard DeepmdDataSystem path
rather than the optimized LmdbDataSystem, and update the function's docstring to
document this single-LMDB fast-path behavior and clarify what happens with
multiple LMDB inputs.
- Around line 1219-1277: The _convert_system_by_dpdata function creates lock
files to coordinate between processes, but if a process crashes after creating
the lock file but before the finally block executes, the lock file persists
causing other processes to wait 5 minutes before timing out. Add stale lock
detection logic in the except FileExistsError block before calling
_wait_for_conversion. Read the PID from the existing lock file and check if that
process is still running using platform-appropriate methods (e.g., os.kill with
signal 0 on Unix, or process existence checks). If the process is not running or
if the lock file is older than a reasonable threshold (e.g., 10 minutes), remove
the stale lock file and retry the lock acquisition instead of waiting. This
prevents indefinite hangs on stale locks from crashed processes.
🪄 Autofix (Beta)
❌ Autofix failed (check again to retry)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 54161e1f-c01b-4956-b3a7-a6ab0d33cb04
📒 Files selected for processing (7)
deepmd/pd/entrypoints/main.pydeepmd/pt/entrypoints/main.pydeepmd/pt_expt/entrypoints/main.pydeepmd/utils/argcheck.pydeepmd/utils/data_system.pypyproject.tomlsource/tests/common/test_data_system_conversion.py
| if len(training_systems) == 1 and is_lmdb(training_systems[0]): | ||
| raise NotImplementedError( | ||
| "Paddle backend does not support LMDB training data yet. " | ||
| "Set training_data.out_format to 'deepmd/hdf5' when using " | ||
| "training_data.format for automatic conversion." | ||
| ) | ||
| if validation_systems is not None: | ||
| val_patterns = validation_dataset_params.get("rglob_patterns", None) | ||
| validation_systems = process_systems(validation_systems, val_patterns) | ||
| validation_systems = process_systems( | ||
| validation_systems, | ||
| val_patterns, | ||
| fmt=validation_dataset_params.get("format", None), | ||
| out_fmt=validation_dataset_params.get( | ||
| "out_format", validation_dataset_params.get("output_format", None) | ||
| ), | ||
| ) | ||
| if len(validation_systems) == 1 and is_lmdb(validation_systems[0]): | ||
| raise NotImplementedError( | ||
| "Paddle backend does not support LMDB validation data yet. " | ||
| "Set validation_data.out_format to 'deepmd/hdf5' when using " | ||
| "validation_data.format for automatic conversion." | ||
| ) |
There was a problem hiding this comment.
Block any LMDB-resolved dataset for Paddle, not only single-path LMDB.
Line 123 and Line 139 only reject len(...) == 1 LMDB results, but the message says Paddle does not support LMDB data yet. LMDB lists can bypass this guard and fail later in less clear ways.
Suggested fix
- if len(training_systems) == 1 and is_lmdb(training_systems[0]):
+ if any(is_lmdb(path) for path in training_systems):
raise NotImplementedError(
"Paddle backend does not support LMDB training data yet. "
"Set training_data.out_format to 'deepmd/hdf5' when using "
"training_data.format for automatic conversion."
)
@@
- if len(validation_systems) == 1 and is_lmdb(validation_systems[0]):
+ if any(is_lmdb(path) for path in validation_systems):
raise NotImplementedError(
"Paddle backend does not support LMDB validation data yet. "
"Set validation_data.out_format to 'deepmd/hdf5' when using "
"validation_data.format for automatic conversion."
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pd/entrypoints/main.py` around lines 123 - 144, The current LMDB
validation checks for both training_systems and validation_systems only reject
LMDB when the result is a single system (len(...) == 1), but the error messages
indicate that Paddle does not support LMDB data in general. Remove the len(...)
== 1 condition from both the training_systems check (around line 123) and the
validation_systems check (around line 139) so that any LMDB dataset is rejected
regardless of whether it's a single system or multiple systems in the list. This
ensures that any LMDB-resolved dataset triggers the NotImplementedError with a
clear message, preventing less clear failures downstream.
| systems = process_systems( | ||
| dataset_params["systems"], | ||
| patterns=dataset_params.get("rglob_patterns", None), | ||
| fmt=dataset_params.get("format", None), | ||
| out_fmt=dataset_params.get( | ||
| "out_format", dataset_params.get("output_format", None) | ||
| ), | ||
| ) | ||
| if len(systems) == 1 and is_lmdb(systems[0]): | ||
| from deepmd.dpmodel.utils.lmdb_data import ( | ||
| make_neighbor_stat_data, | ||
| ) | ||
|
|
||
| return make_neighbor_stat_data(systems[0], type_map) | ||
| return get_data(dataset_params, 0, type_map, None) |
There was a problem hiding this comment.
Re-apply scalar-LMDB validation after format-based conversion.
_detect_lmdb_path explicitly disallows LMDB in list-form systems, but when format is set the code skips that check and only handles the single-LMDB case. If process_systems(...) returns multiple LMDB paths, this now falls through to generic paths (get_data / DeepmdDataSystem) instead of raising the intended clear error.
Suggested fix pattern
systems = process_systems(
dataset_params["systems"],
patterns=dataset_params.get("rglob_patterns", None),
fmt=dataset_params.get("format", None),
out_fmt=dataset_params.get(
"out_format", dataset_params.get("output_format", None)
),
)
+ lmdb_systems = [s for s in systems if is_lmdb(s)]
+ if lmdb_systems and not (len(systems) == 1 and len(lmdb_systems) == 1):
+ raise ValueError(
+ "LMDB datasets must be passed as a scalar 'systems' string "
+ "(e.g. 'systems': '/path/to/data.lmdb'); list-form systems "
+ "with LMDB paths are not supported."
+ )
if len(systems) == 1 and is_lmdb(systems[0]):
...Apply the same guard in both _get_neighbor_stat_data and _build_data_system.
Also applies to: 168-175
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt_expt/entrypoints/main.py` around lines 118 - 132, The current code
in _get_neighbor_stat_data only validates the single-LMDB case with `if
len(systems) == 1 and is_lmdb(systems[0])`, but when format-based conversion
produces multiple LMDB paths, this check is skipped and execution falls through
to get_data() instead of raising an appropriate error for list-form LMDB
systems. Add validation guards in both _get_neighbor_stat_data and
_build_data_system functions to ensure that after process_systems() is called,
if any LMDB systems are returned, they are validated to not be in list form
(similar to what _detect_lmdb_path does), and raise a clear error before
reaching the fallback get_data() or DeepmdDataSystem paths.
| if len(systems) == 1 and is_lmdb(systems[0]): | ||
| return LmdbDataset( | ||
| systems[0], | ||
| model_params_single["type_map"], | ||
| dataset_params["batch_size"], | ||
| auto_prob_style=dataset_params.get("auto_prob", None), | ||
| ) | ||
| return DpLoaderSet( |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect PT data loader implementations/usages for LMDB awareness.
fd -t f --extension py | xargs rg -n -C3 '\bclass\s+DpLoaderSet\b|\bclass\s+LmdbDataset\b|\bis_lmdb\s*\('
# Check whether tests cover multi-system conversion to LMDB in PT paths.
rg -n -C3 'format|out_format|output_format|lmdb|DpLoaderSet|LmdbDataset' source/tests deepmd/ptRepository: deepmodeling/deepmd-kit
Length of output: 50380
🏁 Script executed:
# Check DpLoaderSet implementation in PT
head -150 deepmd/pt/utils/dataloader.py | tail -100
# Check pt_expt's validation that rejects multi-LMDB
sed -n '80,95p' deepmd/pt_expt/entrypoints/main.py
# Verify if pt should have same check as pt_expt
diff -u <(sed -n '80,95p' deepmd/pt_expt/entrypoints/main.py) <(sed -n '190,205p' deepmd/pt/entrypoints/main.py) || trueRepository: deepmodeling/deepmd-kit
Length of output: 4467
🏁 Script executed:
# Check DeepmdDataSetForLoader to see if it handles LMDB
rg -A 20 "class DeepmdDataSetForLoader" deepmd/pt/
# Check if there's any LMDB handling in DpLoaderSet or construct_dataset
rg -B 5 -A 10 "def construct_dataset" deepmd/pt/utils/dataloader.py
# Check what process_systems actually returns for multi-LMDB cases
rg -B 3 -A 10 "def process_systems" deepmd/utils/data_system.py | head -60Repository: deepmodeling/deepmd-kit
Length of output: 2444
🏁 Script executed:
# Check if DeepmdData can handle LMDB paths
rg -B 5 -A 25 "class DeepmdData" deepmd/pt/utils/dataset.py | head -50
# Check if DeepmdData/DeepmdDataSystem supports LMDB directly
rg "LMDB\|lmdb" deepmd/pt/utils/dataset.pyRepository: deepmodeling/deepmd-kit
Length of output: 1015
🏁 Script executed:
# Check DeepmdData implementation - is it from pt or common
rg -l "class DeepmdData" deepmd/
# Find where DeepmdData is defined
fd -t f -name "*.py" | xargs rg "^class DeepmdData" | head -5
# Check if DeepmdData supports LMDB
rg -B 2 -A 20 "def __init__" deepmd/utils/data.py | grep -A 20 "class DeepmdData" | head -40Repository: deepmodeling/deepmd-kit
Length of output: 642
🏁 Script executed:
# Check DeepmdData __init__ to see what it accepts
sed -n '/^class DeepmdData/,/^class /p' deepmd/utils/data.py | head -80
# Check if there's any LMDB support in DeepmdData
rg "LMDB|lmdb" deepmd/utils/data.py
# Check if process_systems can return LMDB paths and how DpLoaderSet handles them
sed -n '1330,1360p' deepmd/utils/data_system.pyRepository: deepmodeling/deepmd-kit
Length of output: 4291
🏁 Script executed:
# Verify DeepmdData cannot handle LMDB by checking what happens if we pass LMDB path
rg -B 5 -A 30 "def __init__" deepmd/utils/data.py | grep -A 30 "class DeepmdData" | tail -40
# See what error DeepmdData would raise for LMDB
rg "is_dir\|set_prefix" deepmd/utils/data.py | head -10
# Confirm pt_expt validation function and compare intent
sed -n '75,100p' deepmd/pt_expt/entrypoints/main.pyRepository: deepmodeling/deepmd-kit
Length of output: 1077
🏁 Script executed:
# Check what error DeepmdData actually raises when given an LMDB path
rg -A 5 "is_dir" deepmd/utils/data.py | head -20
# Verify if there are tests for multi-LMDB conversion to see current behavior
rg -l "process_systems.*lmdb\|LMDB.*process_systems" source/tests/ | head -5
# Check if pt/entrypoints/main.py should have the _detect_lmdb_path guard like pt_expt
sed -n '190,210p' deepmd/pt/entrypoints/main.pyRepository: deepmodeling/deepmd-kit
Length of output: 1179
Add LMDB validation guard to prevent silent failures when multi-system conversion produces LMDB paths.
The check at line 197 only routes the single-LMDB case to LmdbDataset. If process_systems(...) returns multiple LMDB paths (e.g., from multi-input conversion with default LMDB output), they fall through to DpLoaderSet, which passes them to DeepmdData. Since DeepmdData expects set.* subdirectories and will fail with a confusing "No set.* found" error when given LMDB paths, add an explicit guard matching the one in deepmd/pt_expt/entrypoints/main.py lines 84-89:
if isinstance(systems, list) and any(isinstance(s, str) and is_lmdb(s) for s in systems):
raise ValueError(
"LMDB datasets must be passed as a scalar 'systems' string "
"(e.g. 'systems': '/path/to/data.lmdb'); list-form systems "
"with LMDB paths are not supported."
)Place this check before the len(systems) == 1 condition to catch and clarify the error early.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt/entrypoints/main.py` around lines 197 - 204, Add a validation guard
before the existing condition that checks `len(systems) == 1 and
is_lmdb(systems[0])` to prevent multiple LMDB paths from being passed to
DpLoaderSet. The guard should use `isinstance(systems, list)` combined with
`any(isinstance(s, str) and is_lmdb(s) for s in systems)` to detect when systems
is a list containing LMDB paths and raise a clear ValueError message explaining
that LMDB datasets must be passed as a scalar string rather than as a list.
| def _load_set(self, set_name: str) -> dict[str, Any]: | ||
| """Load one same-nloc LMDB group for legacy neighbor-stat code.""" | ||
| indices = self._nloc_set_indices[str(set_name)] | ||
| frames = [self._reader[int(idx)] for idx in indices] | ||
| return self._stack_frames(frames) |
There was a problem hiding this comment.
Add guard for empty frames list in _stack_frames.
_load_set and get_batch both call _stack_frames(frames) where frames comes from sampler indices. If the sampler ever yields an empty batch (edge case with malformed LMDB), frames[0] at line 864 would raise IndexError.
🛡️ Suggested defensive check
def _stack_frames(self, frames: list[dict[str, Any]]) -> dict[str, Any]:
+ if not frames:
+ raise ValueError("Cannot stack an empty list of frames")
out: dict[str, Any] = {}Also applies to: 917-921
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/utils/data_system.py` around lines 848 - 852, Add a defensive check at
the beginning of the `_stack_frames` method to guard against empty frames lists.
Before accessing `frames[0]` at line 864, add validation to check if the frames
list is empty and handle this edge case appropriately, such as raising a more
informative error or returning early. This will prevent IndexError when the
sampler yields an empty batch due to malformed LMDB data, since both `_load_set`
and `get_batch` call this method with frames lists derived from sampler indices.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5565 +/- ##
==========================================
- Coverage 82.17% 82.13% -0.04%
==========================================
Files 898 898
Lines 103576 103869 +293
Branches 4432 4433 +1
==========================================
+ Hits 85117 85317 +200
- Misses 17063 17159 +96
+ Partials 1396 1393 -3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. An unexpected error occurred while generating fixes: Not Found - https://docs.github.com/rest/git/refs#get-a-reference |
Summary
formatandout_formatoptions for dpdata-backed conversionlmdband cache converted datasets under/home/jzzeng/codes/deepmd-kit/.deepmd_dpdata_cachedpdata>=1.0.1a runtime dependencyCloses #5237
Tests
ruff check .ruff format --check .pytest source/tests/common/test_data_system_conversion.py -qpytest source/tests/common/dpmodel/test_lmdb_data.py::TestLmdbDataReader::test_is_lmdb -qpytest source/tests/tf/test_dp_test.py::TestDPTestEner::test_1frame -qsrun --gres=gpu:1 dp train input.jsonwith extxyz input, default LMDB conversion, no--skip-neighbor-statsrun --gres=gpu:1 dp --pt train input.jsonwith extxyz input, default LMDB conversion, no--skip-neighbor-statsrun --gres=gpu:1 dp --jax train input.jsonwith extxyz input, default LMDB conversion, no--skip-neighbor-stat(environment used CPU JAX fallback because CUDA jaxlib is unavailable)srun --gres=gpu:1 dp --pt-expt train input.jsonverified conversion and neighbor statistics; this environment then hits the existing pt-expt tensor serialization error during model constructionSummary by CodeRabbit
New Features
.extxyz) with automatic conversion to LMDB or other formatsformatandout_format/output_formatfor training and validation datasets to enable flexible data preprocessingTests