Skip to content

Implement copy-on-write for symlinked demo cache files#90

Open
t0mdavid-m wants to merge 1 commit into
developfrom
claude/festive-cerf-qeZlk
Open

Implement copy-on-write for symlinked demo cache files#90
t0mdavid-m wants to merge 1 commit into
developfrom
claude/festive-cerf-qeZlk

Conversation

@t0mdavid-m
Copy link
Copy Markdown
Member

@t0mdavid-m t0mdavid-m commented May 29, 2026

Summary

Fixes a critical bug where reprocessing data in a loaded demo workspace would overwrite the read-only ground truth files. When a demo workspace is materialized on Linux via symlinks (pointing to example-data/), any in-place writes (e.g., sqlite3 database updates, file overwrites) would follow the symlinks and corrupt the committed demo data.

This PR implements copy-on-write semantics: FileManager now detects symlinked cache entries and materializes them as independent real files in the workspace before any write operation, ensuring the ground truth is never modified.

Key Changes

  • Added _materialize_if_symlink() method to FileManager that:

    • Detects symlinked cache files
    • Copies the symlink target's content (for cache.db, preserving demo index rows)
    • Atomically replaces the symlink with a real file using os.replace()
    • Leaves the ground truth untouched
  • Integrated materialization at all write points:

    • _connect_to_sql(): Materializes cache.db before sqlite3 connection (prevents in-place database writes from following the link)
    • _store_data(): Materializes result files (.pkl.gz, .pq) before writing via gzip, pandas, or polars
    • store_file(): Materializes target path before copying user files
  • Added comprehensive test suite (test_filemanager_symlink_cow.py):

    • Simulates the demo workspace symlink structure
    • Verifies cache.db is materialized on first connection while preserving demo index
    • Confirms reprocessing diverges workspace files without modifying ground truth
    • Tests all write paths (pickle, pandas, polars, binary file)

Implementation Details

  • Uses Path.is_symlink() to detect symlinks (works on all platforms, no-op on Windows)
  • Atomic replacement via os.replace() prevents partial writes or race conditions
  • preserve_content=True for cache.db (existing rows must survive); False for result files (about to be fully overwritten)
  • Temporary file pattern (*.materialize.tmp) avoids conflicts during atomic swap

https://claude.ai/code/session_01KhR5v5QbEGkBDFrqKTcngZ

Summary by CodeRabbit

  • Bug Fixes

    • Workspace cache operations now materialize symlinked paths into independent files, preventing unintended mutations of read-only demo data during reprocessing.
  • Tests

    • Added tests verifying copy-on-write behavior for symlinked cache artifacts.

Review Change Stack

Loading a demo workspace (online mode, Linux) materializes it by
symlinking every committed demo file back to the read-only source under
example-data/. Because writes follow symlinks, reprocessing data wrote
through those links and overwrote the committed ground truth: guaranteed
via cache.db (sqlite writes the db in place on every store) and per-dataset
result files whenever the reprocessed dataset id collided with a demo's.

Make FileManager copy-on-write so cache writes never traverse a symlink:
- materialize cache.db to an independent copy (preserving the demo's
  existing index rows) before sqlite3.connect
- replace symlinked result files with real workspace files before
  overwriting them in _store_data (polars/pandas/pickle) and store_file

Load-time symlinks are kept for read efficiency; the workspace diverges
from the ground truth on first write. Adds a regression test that fails
against the previous behaviour (the write resolved into example-data).

https://claude.ai/code/session_01KhR5v5QbEGkBDFrqKTcngZ
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec8f7485-d543-4f8c-a0b3-64031690b959

📥 Commits

Reviewing files that changed from the base of the PR and between a89427d and ceedab7.

📒 Files selected for processing (2)
  • src/workflow/FileManager.py
  • tests/test_filemanager_symlink_cow.py

📝 Walkthrough

Walkthrough

FileManager introduces symlink-aware copy-on-write behavior to prevent writes from mutating read-only demo ground-truth data. A new _materialize_if_symlink method replaces symlinked paths with independent real files before writes, and this mechanism is integrated across database connections, data storage, and file storage operations with corresponding test coverage.

Changes

FileManager symlink copy-on-write

Layer / File(s) Summary
Symlink materialization mechanism
src/workflow/FileManager.py
_materialize_if_symlink method detects and replaces symlinks with independent real files using atomic replacement, optionally preserving original content. An os import is added to support atomic operations.
Apply materialization to writes
src/workflow/FileManager.py
Database connection (_connect_to_sql), data storage output for Polars, Pandas, and pickle (_store_data), and file storage (store_file) all call _materialize_if_symlink before writes to prevent following symlinks into demo data.
Copy-on-write behavior verification
tests/test_filemanager_symlink_cow.py
Test suite constructs ground-truth demo cache with symlinked workspace and verifies that cache.db materialization preserves content without altering ground truth, and that reprocessing datasets creates independent workspace files with updated content while leaving all ground-truth files unchanged.

🐰 A symlink walks into FileManager's den,
But CoW says "not today, my friend!"
Duplicate, diverge, and all is well—
Ground truth sleeps safe beneath its shell! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Implement copy-on-write for symlinked demo cache files' accurately and concisely summarizes the main change: adding copy-on-write semantics to FileManager to prevent symlinked cache files from being mutated.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/festive-cerf-qeZlk

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants