[SYSTEMDS-3949] Add native Delta Lake frame read/write via Delta Kernel #2515

Open
Baunsgaard wants to merge 3 commits into apache:main from Baunsgaard:delta-frame-io

Conversation

@Baunsgaard

Contributor

Extend the native Delta Lake support (#2511) from matrices to frames, reading and writing Delta Lake tables through the Spark-free Delta Kernel library on the single-node CP path. DML read/write with format="delta" now works for frames, discovering schema, column names, and dimensions directly from the table.

Stacked on #2511 and should merge after it. Append/overwrite semantics, distributed execution, and time travel remain out of scope.

@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 86.63594% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.70%. Comparing base (384a8dc) to head (f26b5b6).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...che/sysds/runtime/io/FrameReaderDeltaParallel.java | 80.22% | 20 Missing and 15 partials ⚠️ |
| .../org/apache/sysds/runtime/io/FrameReaderDelta.java | 90.32% | 0 Missing and 15 partials ⚠️ |
| .../org/apache/sysds/runtime/io/FrameWriterDelta.java | 89.47% | 4 Missing and 4 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2515      +/-   ##
============================================
+ Coverage     71.56%   71.70%   +0.14%     
- Complexity    49110    49463     +353     
============================================
  Files          1575     1583       +8     
  Lines        189793   190943    +1150     
  Branches      37235    37451     +216     
============================================
+ Hits         135816   136908    +1092     
- Misses        43480    43482       +2     
- Partials      10497    10553      +56     


Extend the native Delta Lake support from matrices to frames, reading and
writing Delta Lake tables through the Spark-free Delta Kernel library on the
single-node CP path. DML read/write with format="delta" now works for
frames, discovering schema, column names, and dimensions directly from the
table.

- Add FrameReaderDelta, FrameReaderDeltaParallel and FrameWriterDelta
- Wire DELTA into the frame reader and writer factories
- Refresh cached frame metadata and schema after a Delta read
- Broaden Delta frame component IO coverage

Stacked on the matrix Delta support; append/overwrite semantics,
distributed execution, and time travel remain out of scope.

The native Delta read decode is CPU-bound and parallelizes per data
file, so a table written as one large file cannot use more than one
reader thread. Size data files toward roughly one file per expected
parallel reader, capped by the configured target and floored to avoid
tiny-file proliferation. This materially improves parallel-read
throughput for both matrix and frame tables.

- Add the sysds.io.delta.writer.adaptivefilesize config (default true)
  plus adaptiveWriterTargetFileSize/createWriteEngine helpers in
  DeltaKernelUtils, and document the target file size as an upper bound
- Wire FrameWriterDelta and WriterDelta to size files from the block's
  estimated bytes (dense double footprint for matrices)
- Use the configurable DELTA_WRITER_BATCH_SIZE in FrameWriterDelta
  instead of a hardcoded batch size, matching the matrix writer
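
The sizing rule described in this commit can be sketched as follows. This is a minimal illustration, not the code in DeltaKernelUtils: the class name, method signatures, and the 16 MiB floor are assumptions made for the example, since the commit message does not state the actual floor value or helper signature.

```java
// Hypothetical sketch of "one file per expected reader, capped and floored".
final class AdaptiveFileSizeSketch {
    // Assumed floor; the real value is not given in the commit message.
    static final long FLOOR_BYTES = 16L * 1024 * 1024;

    /**
     * Returns the target data-file size in bytes. With the adaptive flag off,
     * or when there is nothing to split, the configured target is used as is.
     * Otherwise the estimate is divided across the expected readers, raised to
     * the floor, and finally capped by the configured target (an upper bound).
     */
    static long adaptiveTargetFileSize(long estimatedBytes, int expectedReaders,
            long configuredTarget, boolean adaptive) {
        if (!adaptive || expectedReaders <= 1 || estimatedBytes <= 0)
            return configuredTarget;
        // Ceiling division so the last file is not forced to be a tiny remainder.
        long perReader = (estimatedBytes + expectedReaders - 1) / expectedReaders;
        long floored = Math.max(perReader, FLOOR_BYTES);
        return Math.min(floored, configuredTarget);
    }

    /** Dense double footprint of a matrix block: 8 bytes per cell. */
    static long denseDoubleBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), 8L);
    }
}
```

Under these assumptions, a 256 MiB block with 8 expected readers and a 128 MiB configured target yields 32 MiB files, while a very small block is held at the floor and a very large one at the configured target.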

The parallel frame reader's metadata-direct path wrote each data file's
rows into shared per-column arrays at a fixed offset without bounding the
row count. If a table's per-file numRecords statistic under-counts the
actual rows (possible for externally written Delta tables), a file's rows
could overrun its slice into the next file's region while that region is
being written concurrently.
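
The hazard and the fail-fast check can be illustrated with a small sketch. The class, method, and parameter names below are hypothetical and do not come from FrameReaderDeltaParallel; the sketch only assumes that each file owns the slice `[offset, offset + expectedRows)` of a shared column array, with `expectedRows` taken from the file's numRecords statistic.

```java
import java.io.IOException;

// Hypothetical sketch of a per-file row-count overflow guard.
final class RowCountGuardSketch {
    /**
     * Appends one decoded batch to a file's slice of the shared column array.
     * 'written' is the number of rows this file has already stored. Returns
     * the updated count, or throws before copying if the batch would spill
     * past the slice into the region owned by the next file.
     */
    static int appendBatch(double[] shared, int offset, int expectedRows,
            int written, double[] batch, String file) throws IOException {
        if (batch.length > expectedRows - written)
            throw new IOException("Delta data file " + file + " holds more rows than its"
                + " numRecords statistic (" + expectedRows + ") reports");
        System.arraycopy(batch, 0, shared, offset + written, batch.length);
        return written + batch.length;
    }
}
```

Because the check runs before the copy, a file with an under-counted statistic never touches a neighboring slice; the read fails with a message instead.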

- Add the per-file row-count overflow guard in FrameReaderDeltaParallel
  .readDirect, matching the matrix reader: fail fast with a clear message
  instead of risking overlapping concurrent writes or an array overrun
- Reuse DeltaKernelUtils.typeCode/T_* in FrameReaderDelta instead of a
  forked R_* table and instanceof cascade, keeping the frame and matrix
  type dispatch in lockstep; drop the now-unused type imports
- Extract awaitFileTasks in FrameReaderDeltaParallel to share the pool
  lifecycle across both read paths and restore the interrupt flag when a
  parallel read is cancelled
- Add a unit test covering the adaptive target-file-size flag on/off and
  the floor/cap clamp boundaries
- Clarify the adaptive-size javadoc floor wording, the createWriteEngine
  batch-size comment, and rename opaque locals (names2, bcs/bss)
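
A shared pool-lifecycle helper of the kind the awaitFileTasks item describes might look like the sketch below. It is an assumption-laden illustration rather than the extracted method itself: the signature, the exception wrapping, and the messages are invented for the example, and only standard java.util.concurrent calls are used.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: run one task per data file and own the pool lifecycle.
final class AwaitFileTasksSketch {
    static void awaitFileTasks(List<Callable<Void>> fileTasks, int threads) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            // invokeAll blocks until every task has completed or been cancelled.
            for (Future<Void> f : pool.invokeAll(fileTasks))
                f.get(); // surfaces a task failure as ExecutionException
        } catch (InterruptedException e) {
            // Restore the flag so callers can still observe the cancellation.
            Thread.currentThread().interrupt();
            throw new IOException("Parallel Delta read interrupted", e);
        } catch (ExecutionException e) {
            throw new IOException("Parallel Delta read failed", e.getCause());
        } finally {
            pool.shutdownNow(); // always release the worker threads
        }
    }
}
```

Keeping the pool creation, the wait, and the shutdown in one place means both read paths get the same failure and cancellation behavior.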

Projects

Status: In Progress
