Add a 10K slice of YFCC to test_data #1199

Merged
magdalendobson merged 28 commits into main from users/magdalen/better_dataset_in_test_data on Jun 26, 2026

@magdalendobson

Contributor

Recently we've had a lot of discussions about how the smaller datasets in test_data aren't always suitable for catching regressions or bugs, and that we could use a still-small but more substantial dataset for our integration tests. This PR adds a 10K slice of the YFCC dataset, with 100 queries. The dataset is converted to float32 so that we don't have to register new benchmarks to use it. I curated a set of filters with varying match rates, added groundtruth for the native euclidean as well as cosine and inner product, and constructed a runbook that will force the provider to recycle slots. This should set us up to do nontrivial tests on large portions of the codebase.
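
To illustrate why separate groundtruth files are useful for each metric, here is a minimal brute-force sketch in plain Python. It is not the tooling used to produce the files in this PR; the function names and the tiny 2-D vectors are hypothetical and exist only for illustration.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product normalized by both vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def brute_force_knn(base, query, k, metric):
    """Return the ids of the k best base vectors for one query (hypothetical helper)."""
    if metric == "l2":
        # Smaller distance is better.
        scored = [(l2_squared(v, query), i) for i, v in enumerate(base)]
    elif metric == "ip":
        # Larger inner product is better, so negate before the ascending sort.
        scored = [(-inner_product(v, query), i) for i, v in enumerate(base)]
    elif metric == "cosine":
        # Larger similarity is better, so negate here as well.
        scored = [(-cosine_similarity(v, query), i) for i, v in enumerate(base)]
    else:
        raise ValueError(f"unknown metric: {metric}")
    scored.sort()
    return [i for _, i in scored[:k]]


# Toy data, not taken from YFCC.
base = [[1.0, 0.0], [10.0, 1.0], [0.6, 0.8], [0.0, 3.0]]
query = [1.0, 1.0]

top2 = {m: brute_force_knn(base, query, 2, m) for m in ("l2", "ip", "cosine")}
print(top2)
```

On this toy input the three metrics return three different top-2 lists, which is why a recall check against Euclidean groundtruth would not be meaningful for a cosine or inner product index.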

@magdalendobson magdalendobson marked this pull request as ready for review June 23, 2026 23:59
@magdalendobson magdalendobson requested review from a team and Copilot June 23, 2026 23:59

Copilot AI left a comment

Contributor

Pull request overview

Adds a new, moderately sized YFCC-derived dataset slice under test_data/ intended to support more realistic integration/regression testing scenarios (including filtered search and streaming-style update patterns).

Changes:

  • Add a 10K-vector base set and a 100-query set (LFS-tracked binaries).
  • Add multiple ground-truth artifacts (Euclidean/cosine/IP + filtered range results) and step-based runbook ground truth.
  • Add dataset documentation (README.md) describing provenance, metrics, filters, and the streaming runbook intent.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test_data/yfcc/README.md | Documents dataset provenance, metrics/groundtruth, filter match-rate stats, and runbook intent. |
| test_data/yfcc/yfcc_10k.fbin | 10K-vector float32 base dataset (LFS). |
| test_data/yfcc/yfcc_query_100.fbin | 100-query float32 query set (LFS). |
| test_data/yfcc/yfcc_metadata.json | Query/vector metadata used for filtered search (LFS). |
| test_data/yfcc/yfcc_query_filters.json | Curated filter definitions for filtered-search testing (LFS). |
| test_data/yfcc/yfcc_runbook.yaml | Streaming runbook configuration for exercising insert/delete/replace + slot recycling (LFS). |
| test_data/yfcc/groundtruth.bin | Ground truth for the native Euclidean metric (LFS). |
| test_data/yfcc/groundtruth_cosine.bin | Ground truth for cosine metric testing (LFS). |
| test_data/yfcc/groundtruth_ip.bin | Ground truth for inner product metric testing (LFS). |
| test_data/yfcc/groundtruth_filtered.rangeres | Ground truth for filtered/range-style testing (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step2.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step4.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step6.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step8.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step10.gt10 | Runbook step ground truth snapshot (LFS). |

Comment thread test_data/yfcc/README.md Outdated
Comment thread test_data/yfcc/yfcc_runbook.yaml Outdated
Comment thread test_data/yfcc/yfcc_query_filters.json
@codecov-commenter

codecov-commenter commented Jun 24, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.84%. Comparing base (0449d4d) to head (d2e5558).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1199      +/-   ##
==========================================
- Coverage   90.84%   90.84%   -0.01%     
==========================================
  Files         488      488              
  Lines       93233    93305      +72     
==========================================
+ Hits        84697    84759      +62     
- Misses       8536     8546      +10     
| Flag | Coverage Δ |
| --- | --- |
| miri | 90.84% <100.00%> (-0.01%) ⬇️ |
| unittests | 90.80% <100.00%> (-0.01%) ⬇️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| diskann-benchmark/src/main.rs | 91.53% <100.00%> (+0.17%) ⬆️ |

... and 8 files with indirect coverage changes

@hildebrandmw

Contributor

Awesome! This will be great! One quick question for you: can we also use YFCC to test i8 data unmodified? Or will we have to do something like shift all u8 values by -128 and only use the L2 distance?

@harsha-simhadri

Contributor

Would it make sense to add an entry to diskann-benchmark to illustrate how to use this dataset and runbook?

@hildebrandmw

Contributor

Also note that checking out this branch yields the following warning:

Encountered 1 file(s) that should have been pointers, but weren't:
        test_data/yfcc/README.md

@magdalendobson

Contributor Author

> Awesome! This will be great! One quick question for you: can we also use YFCC to test i8 data unmodified? Or will we have to do something like shift all u8 values by -128 and only use the L2 distance?

I think we discussed dividing by two earlier. Did you want me to add a converted version myself?
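
The trade-off between the two conversions mentioned here can be sketched as follows. This is only an illustration under the assumption that the source vectors are u8 (as the question implies); the helper names and sample values are hypothetical and do not come from the repository. Shifting by -128 keeps every pairwise difference intact, so L2 distances are unchanged, but inner products change. Integer halving keeps values non-negative and within the i8 range, so inner products are roughly the original divided by four, at the cost of dropping the lowest bit.

```python
def shift_u8_to_i8(v):
    """Map u8 values [0, 255] to i8 values [-128, 127] by subtracting 128."""
    return [x - 128 for x in v]


def halve_u8_to_i8(v):
    """Map u8 values [0, 255] to [0, 127] by integer division (drops the low bit)."""
    return [x // 2 for x in v]


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical u8 vectors, not taken from YFCC.
a = [200, 10, 128]
b = [0, 255, 64]

a_shift, b_shift = shift_u8_to_i8(a), shift_u8_to_i8(b)
a_half, b_half = halve_u8_to_i8(a), halve_u8_to_i8(b)

# Shifting preserves L2 exactly, because the offset cancels in every difference.
print(l2_squared(a, b), l2_squared(a_shift, b_shift))

# Shifting changes the inner product (here it even flips sign).
print(inner_product(a, b), inner_product(a_shift, b_shift))

# Halving gives an inner product close to a quarter of the original.
print(inner_product(a, b) / 4, inner_product(a_half, b_half))
```

This is why the shifted variant is only a faithful stand-in under L2, while the halved variant stays usable for inner product and cosine up to rounding.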

@magdalendobson

Contributor Author

> Would it make sense to add an entry to diskann-benchmark to illustrate how to use this dataset and runbook?

Yeah, I can do this, maybe for just a couple of examples for now. The intent was to use these in integration tests more after big changes to the integration test framework land with inmem 2.0--after discussion with Mark it seemed like it would be wasted work to wire them all up now.

@hildebrandmw

Contributor

> > Awesome! This will be great! One quick question for you: can we also use YFCC to test i8 data unmodified? Or will we have to do something like shift all u8 values by -128 and only use the L2 distance?
>
> I think we discussed dividing by two earlier. Did you want me to add a converted version myself?

Nah, I can do that.

> > Would it make sense to add an entry to diskann-benchmark to illustrate how to use this dataset and runbook?
>
> Yeah, I can do this, maybe for just a couple of examples for now. The intent was to use these in integration tests more after big changes to the integration test framework land with inmem 2.0--after discussion with Mark it seemed like it would be wasted work to wire them all up now.

And agreed here.

@magdalendobson

Contributor Author

Ok, two integration tests are added and the warning is resolved now. I needed to update .gitattributes to exclude .md and .yaml files. Interestingly this error didn't seem to pop up last time I added a runbook--maybe we never found it?

Magdalen Manohar added 3 commits June 26, 2026 15:23
@magdalendobson magdalendobson merged commit d657d21 into main Jun 26, 2026
24 checks passed
@magdalendobson magdalendobson deleted the users/magdalen/better_dataset_in_test_data branch June 26, 2026 17:07
5 participants