Implement copy-on-write for symlinked demo cache files #90
Loading a demo workspace (online mode, Linux) materializes it by symlinking every committed demo file back to the read-only source under example-data/. Because writes follow symlinks, reprocessing data wrote through those links and overwrote the committed ground truth: guaranteed via cache.db (sqlite writes the db in place on every store) and per-dataset result files whenever the reprocessed dataset id collided with a demo's.

Make FileManager copy-on-write so cache writes never traverse a symlink:
- materialize cache.db to an independent copy (preserving the demo's existing index rows) before sqlite3.connect
- replace symlinked result files with real workspace files before overwriting them in _store_data (polars/pandas/pickle) and store_file

Load-time symlinks are kept for read efficiency; the workspace diverges from the ground truth on first write.

Adds a regression test that fails against the previous behaviour (the write resolved into example-data).

https://claude.ai/code/session_01KhR5v5QbEGkBDFrqKTcngZ
Walkthrough
FileManager introduces symlink-aware copy-on-write behavior to prevent writes from mutating read-only demo ground-truth data. A new …
Changes: FileManager symlink copy-on-write
Pre-merge checks: 4 passed, 1 failed (1 warning)
Summary
Fixes a critical bug where reprocessing data in a loaded demo workspace would overwrite the read-only ground truth files. When a demo workspace is materialized on Linux via symlinks (pointing to example-data/), any in-place writes (e.g., sqlite3 database updates, file overwrites) would follow the symlinks and corrupt the committed demo data.

This PR implements copy-on-write semantics: FileManager now detects symlinked cache entries and materializes them as independent real files in the workspace before any write operation, ensuring the ground truth is never modified.
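The failure mode can be reproduced in isolation with a few lines of standard-library Python. This is only an illustrative sketch: the directory and file names are stand-ins chosen here, not the project's actual layout, and it assumes a platform where creating symlinks is permitted (e.g. Linux).

```python
import tempfile
from pathlib import Path


def write_through_symlink_demo() -> tuple[str, str]:
    """Return (source content, workspace content) after writing via a symlink."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # Hypothetical stand-in for a committed ground-truth file.
        source = root / "example-data" / "result.txt"
        source.parent.mkdir()
        source.write_text("ground truth")

        # Hypothetical stand-in for a workspace entry created at load time.
        link = root / "workspace" / "result.txt"
        link.parent.mkdir()
        link.symlink_to(source)

        # Opening the link for writing follows it, so the source is overwritten.
        link.write_text("reprocessed")
        return source.read_text(), link.read_text()


print(write_through_symlink_demo())  # ('reprocessed', 'reprocessed')
```

Both paths report the new content because only one real file exists; the workspace entry is just a name pointing at it.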
Key Changes
Added _materialize_if_symlink() method to FileManager that:
- optionally preserves the existing content (for cache.db, preserving demo index rows)
- swaps the real file into place with os.replace()

Integrated materialization at all write points:
- _connect_to_sql(): Materializes cache.db before sqlite3 connection (prevents in-place database writes from following the link)
- _store_data(): Materializes result files (.pkl.gz, .pq) before writing via gzip, pandas, or polars
- store_file(): Materializes target path before copying user files

Added comprehensive test suite (test_filemanager_symlink_cow.py):
- cache.db is materialized on first connection while preserving demo index

Implementation Details
- Uses Path.is_symlink() to detect symlinks (works on all platforms, no-op on Windows)
- os.replace() prevents partial writes or race conditions
- preserve_content=True for cache.db (existing rows must survive); False for result files (about to be fully overwritten)
- A temporary file (*.materialize.tmp) avoids conflicts during atomic swap
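The points above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the PR's actual code: the free function `materialize_if_symlink`, its return value, and the demo table `idx` are hypothetical; only `Path.is_symlink()`, `os.replace()`, the `preserve_content` flag, and the `.materialize.tmp` suffix come from the description.

```python
import os
import shutil
import sqlite3
import tempfile
from pathlib import Path


def materialize_if_symlink(path: Path, preserve_content: bool) -> bool:
    """Replace a symlink at `path` with a real file; return True if replaced."""
    if not path.is_symlink():
        return False
    tmp = path.with_name(path.name + ".materialize.tmp")
    if preserve_content:
        shutil.copyfile(path, tmp)  # follows the link and copies the target's bytes
    else:
        tmp.write_bytes(b"")  # empty placeholder; the caller overwrites it next
    os.replace(tmp, path)  # renames over the link itself, not over its target
    return True


def count_rows(db: Path) -> int:
    con = sqlite3.connect(db)
    try:
        return con.execute("SELECT COUNT(*) FROM idx").fetchone()[0]
    finally:
        con.close()


def demo() -> tuple[int, int, bool]:
    """Return (workspace rows, source rows, workspace path is still a symlink)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "example-data" / "cache.db"  # hypothetical ground truth
        source.parent.mkdir()
        con = sqlite3.connect(source)
        con.execute("CREATE TABLE idx (id TEXT)")  # hypothetical index table
        con.execute("INSERT INTO idx VALUES ('demo')")
        con.commit()
        con.close()

        link = root / "workspace" / "cache.db"
        link.parent.mkdir()
        link.symlink_to(source)

        # Materialize before connecting so sqlite never writes through the link.
        materialize_if_symlink(link, preserve_content=True)
        con = sqlite3.connect(link)
        con.execute("INSERT INTO idx VALUES ('new')")
        con.commit()
        con.close()
        return count_rows(link), count_rows(source), link.is_symlink()


print(demo())  # (2, 1, False)
```

The workspace copy keeps the demo row and gains the new one, while the source database still holds only its original row. Writing the temporary file next to the target keeps the `os.replace()` call on a single filesystem, which is what allows the swap to be atomic.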