Add a 10K slice of YFCC to test_data by magdalendobson · Pull Request #1199 · microsoft/DiskANN

magdalendobson · 2026-06-23T23:45:22Z

Recently we've had a lot of discussions about how the smaller datasets in test_data aren't always suitable for catching regressions or bugs, and that we could use a still-small but more substantial dataset for our integration tests. This PR adds a 10K slice of the YFCC dataset, with 100 queries. The dataset is converted to float32 so that we don't have to register new benchmarks to use it. I curated a set of filters with varying match rates, added groundtruth for the native euclidean as well as cosine and inner product, and constructed a runbook that will force the provider to recycle slots. This should set us up to do nontrivial tests on large portions of the codebase.
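For readers unfamiliar with per-metric ground truth, here is a minimal sketch of how exact nearest neighbors can differ across Euclidean, inner product, and cosine for the same data. It is not the script used to generate the files in this PR; the function names and the toy vectors are hypothetical, and it uses plain Python lists rather than the repository's tooling.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance (smaller is closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Inner product (larger is closer)."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity (larger is closer); 0.0 if either vector is zero."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm > 0 else 0.0


def brute_force_topk(base, query, k, metric):
    """Return the ids of the k exact nearest neighbors under `metric`."""
    if metric == "l2":
        scored = [(squared_l2(v, query), i) for i, v in enumerate(base)]
    elif metric == "ip":
        # Negate similarities so that an ascending sort puts the best first.
        scored = [(-inner_product(v, query), i) for i, v in enumerate(base)]
    elif metric == "cosine":
        scored = [(-cosine_similarity(v, query), i) for i, v in enumerate(base)]
    else:
        raise ValueError(f"unknown metric: {metric}")
    scored.sort()
    return [i for _, i in scored[:k]]


# Toy data (hypothetical): the three metrics disagree on the top-2 neighbors.
base = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [0.9, 0.1], [5.0, 1.0]]
query = [1.0, 0.2]

for metric in ("l2", "ip", "cosine"):
    print(metric, brute_force_topk(base, query, 2, metric))
```

Because the rankings can diverge like this, a test that exercises a cosine or inner-product index needs ground truth computed under that metric rather than the Euclidean one.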

Copilot

Pull request overview

Adds a new, moderately sized YFCC-derived dataset slice under test_data/ intended to support more realistic integration/regression testing scenarios (including filtered search and streaming-style update patterns).

Changes:

- Add a 10K-vector base set and a 100-query set (LFS-tracked binaries).
- Add multiple ground-truth artifacts (Euclidean/cosine/IP + filtered range results) and step-based runbook ground truth.
- Add dataset documentation (README.md) describing provenance, metrics, filters, and the streaming runbook intent.
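To illustrate what "recycling slots" means in a streaming workload, the sketch below models a fixed-capacity store that hands freed internal slots back out on later inserts. It is a toy model only: `SlotAllocator` and its methods are hypothetical names, and nothing here reflects how the DiskANN provider is actually implemented.

```python
class SlotAllocator:
    """Toy model: external ids map to internal slots, and freed slots are reused."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free = []        # slots released by deletes, reused last-in first-out
        self.next_fresh = 0   # next never-used slot
        self.slot_of = {}     # external id -> internal slot

    def insert(self, external_id):
        if external_id in self.slot_of:
            raise KeyError(f"id {external_id} already present")
        if self.free:
            slot = self.free.pop()       # recycle a previously deleted slot
        elif self.next_fresh < self.capacity:
            slot = self.next_fresh       # take a fresh slot
            self.next_fresh += 1
        else:
            raise RuntimeError("no free slots")
        self.slot_of[external_id] = slot
        return slot

    def delete(self, external_id):
        slot = self.slot_of.pop(external_id)
        self.free.append(slot)
        return slot


# With capacity 3, a fourth insert can only succeed by reusing a deleted slot.
alloc = SlotAllocator(capacity=3)
first_slots = [alloc.insert(i) for i in (10, 11, 12)]
freed = alloc.delete(11)
reused = alloc.insert(13)
print(first_slots, freed, reused)
```

A runbook that inserts more points over time than the index can hold at once, with deletes in between, pushes an implementation down the recycling path rather than the fresh-slot path.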

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test_data/yfcc/README.md | Documents dataset provenance, metrics/groundtruth, filter match-rate stats, and runbook intent. |
| test_data/yfcc/yfcc_10k.fbin | 10K-vector float32 base dataset (LFS). |
| test_data/yfcc/yfcc_query_100.fbin | 100-query float32 query set (LFS). |
| test_data/yfcc/yfcc_metadata.json | Query/vector metadata used for filtered search (LFS). |
| test_data/yfcc/yfcc_query_filters.json | Curated filter definitions for filtered-search testing (LFS). |
| test_data/yfcc/yfcc_runbook.yaml | Streaming runbook configuration for exercising insert/delete/replace + slot recycling (LFS). |
| test_data/yfcc/groundtruth.bin | Ground truth for the native Euclidean metric (LFS). |
| test_data/yfcc/groundtruth_cosine.bin | Ground truth for cosine metric testing (LFS). |
| test_data/yfcc/groundtruth_ip.bin | Ground truth for inner product metric testing (LFS). |
| test_data/yfcc/groundtruth_filtered.rangeres | Ground truth for filtered/range-style testing (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step2.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step4.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step6.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step8.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step10.gt10 | Runbook step ground truth snapshot (LFS). |
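As a rough guide to what the `.fbin` files contain, the sketch below reads and writes the layout commonly used for `.fbin` in ANN benchmark datasets: a little-endian `uint32` point count, a `uint32` dimension, then row-major `float32` values. The PR itself does not spell out the layout, so treat this as an assumption; `write_fbin` and `read_fbin` are hypothetical helper names, not functions from the repository.

```python
import io
import struct


def write_fbin(stream, vectors):
    """Write vectors as: uint32 count, uint32 dim, then row-major float32 data."""
    n = len(vectors)
    dim = len(vectors[0]) if n else 0
    stream.write(struct.pack("<II", n, dim))
    for v in vectors:
        if len(v) != dim:
            raise ValueError("all vectors must have the same dimension")
        stream.write(struct.pack(f"<{dim}f", *v))


def read_fbin(stream):
    """Read the layout produced by write_fbin and return (count, dim, vectors)."""
    header = stream.read(8)
    if len(header) != 8:
        raise ValueError("truncated header")
    n, dim = struct.unpack("<II", header)
    vectors = []
    for _ in range(n):
        row = stream.read(4 * dim)
        if len(row) != 4 * dim:
            raise ValueError("truncated vector data")
        vectors.append(list(struct.unpack(f"<{dim}f", row)))
    return n, dim, vectors


# Round trip through an in-memory buffer. Values in the u8 range 0..255 are
# exactly representable in float32, so widening u8 data to float32 is lossless.
buf = io.BytesIO()
write_fbin(buf, [[0.0, 255.0, 17.0], [1.5, 2.0, 3.25]])
buf.seek(0)
n, dim, vectors = read_fbin(buf)
print(n, dim, vectors)
```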

codecov-commenter · 2026-06-24T00:15:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.84%. Comparing base (0449d4d) to head (d2e5558).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1199      +/-   ##
==========================================
- Coverage   90.84%   90.84%   -0.01%     
==========================================
  Files         488      488              
  Lines       93233    93305      +72     
==========================================
+ Hits        84697    84759      +62     
- Misses       8536     8546      +10

| Flag | Coverage Δ | |
| --- | --- | --- |
| miri | `90.84% <100.00%> (-0.01%)` | ⬇️ |
| unittests | `90.80% <100.00%> (-0.01%)` | ⬇️ |

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| diskann-benchmark/src/main.rs | `91.53% <100.00%> (+0.17%)` | ⬆️ |

... and 8 files with indirect coverage changes

hildebrandmw · 2026-06-24T21:51:12Z

Awesome! This will be great! One quick question for you: can we also use YFCC to test i8 data unmodified? Or will we have to do something like shift all u8 values by -128 and only use the L2 distance?

harsha-simhadri · 2026-06-25T21:07:00Z

Would it make sense to add an entry to diskann-benchmark to illustrate how to use this dataset and runbook?

hildebrandmw · 2026-06-25T22:34:25Z

Also note that checking out this branch yields the following warning:

Encountered 1 file(s) that should have been pointers, but weren't:
        test_data/yfcc/README.md
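For context on this warning: Git LFS prints it when a path matches an LFS-tracked pattern in `.gitattributes` but the committed blob is the real file content instead of an LFS pointer. A pointer is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`. The check below is a minimal sketch of that distinction; `is_lfs_pointer` is a hypothetical helper, and the 1024-byte bound is taken from my reading of the pointer spec rather than from anything in this PR.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1\n"


def is_lfs_pointer(blob: bytes) -> bool:
    """Return True if `blob` looks like a Git LFS pointer file.

    Pointer files are short text files that begin with the spec version line;
    anything else committed under an LFS-tracked pattern triggers the warning.
    """
    return len(blob) < 1024 and blob.startswith(LFS_POINTER_PREFIX)


# A well-formed pointer (the oid here is a placeholder, not a real object id).
pointer = (
    b"version https://git-lfs.github.com/spec/v1\n"
    b"oid sha256:" + b"0" * 64 + b"\n"
    b"size 12345\n"
)
# Plain Markdown committed directly, as happened with the README.
plain_readme = b"# YFCC 10K slice\n"

print(is_lfs_pointer(pointer), is_lfs_pointer(plain_readme))
```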

magdalendobson · 2026-06-26T13:22:02Z

> Awesome! This will be great! One quick question for you: can we also use YFCC to test i8 data unmodified? Or will we have to do something like shift all u8 values by -128 and only use the L2 distance?

I think we discussed dividing by two earlier. Did you want me to add a converted version myself?
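The two options mentioned here behave differently, which a small sketch can make concrete. Shifting every u8 value by -128 is a translation, so it leaves L2 distances unchanged but alters inner products (and therefore cosine). Halving keeps values non-negative and roughly preserves direction, at the cost of the lowest bit. The helper names are hypothetical, and floor division is an assumed rounding choice; the thread does not say which conversion was ultimately used.

```python
def shift_u8_to_i8(v):
    """Map 0..255 to -128..127 by translation; L2 distances are unchanged."""
    return [x - 128 for x in v]


def halve_u8_to_i8(v):
    """Map 0..255 to 0..127 by floor division; drops the lowest bit."""
    return [x // 2 for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


a = [0, 200, 255]
b = [10, 100, 250]

# Translation preserves L2 exactly ...
print(squared_l2(a, b), squared_l2(shift_u8_to_i8(a), shift_u8_to_i8(b)))
# ... but changes the inner product, so IP/cosine ground truth no longer applies.
print(inner_product(a, b), inner_product(shift_u8_to_i8(a), shift_u8_to_i8(b)))
# Halving scales the inner product by roughly 1/4, up to rounding loss.
print(inner_product(a, b) / 4, inner_product(halve_u8_to_i8(a), halve_u8_to_i8(b)))
```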

magdalendobson · 2026-06-26T13:23:27Z

> Would it make sense to add an entry to diskann-benchmark to illustrate how to use this dataset and runbook?

Yeah, I can do this, maybe for just a couple of examples for now. The intent was to use these in integration tests more after big changes to the integration test framework land with inmem 2.0--after discussion with Mark it seemed like it would be wasted work to wire them all up now.

hildebrandmw · 2026-06-26T14:36:17Z

> > Awesome! This will be great! One quick question for you: can we also use YFCC to test i8 data unmodified? Or will we have to do something like shift all u8 values by -128 and only use the L2 distance?
>
> I think we discussed dividing by two earlier. Did you want me to add a converted version myself?

Nah, I can do that.

> > Would it make sense to add an entry to diskann-benchmark to illustrate how to use this dataset and runbook?
>
> Yeah, I can do this, maybe for just a couple of examples for now. The intent was to use these in integration tests more after big changes to the integration test framework land with inmem 2.0--after discussion with Mark it seemed like it would be wasted work to wire them all up now.

And agreed here.

magdalendobson · 2026-06-26T14:47:36Z

Ok, two integration tests are added and the warning is resolved now. I needed to update .gitattributes to exclude .md and .yaml files. Interestingly this error didn't seem to pop up last time I added a runbook--maybe we never found it?

Magdalen Manohar added 21 commits May 14, 2026 17:56

finish up recall computation patch

97b36ef

Merge branch 'main' of github.com:microsoft/DiskANN

07b3671

Merge branch 'main' of github.com:microsoft/DiskANN

eac3ffb

fix conflict

17780f8

fix conflict

43eb517

Merge branch 'main' of github.com:microsoft/DiskANN

1d3a52b

Merge branch 'main' of github.com:microsoft/DiskANN

17eac62

Merge branch 'main' of github.com:microsoft/DiskANN

0ab4baa

Merge branch 'main' of github.com:microsoft/DiskANN

4ddca60

fix conflict

54ee01b

Merge branch 'main' of github.com:microsoft/DiskANN

9e1743f

Merge branch 'main' of github.com:microsoft/DiskANN

93504e6

Merge branch 'main' of github.com:microsoft/DiskANN

b7c27ce

Merge branch 'main' of github.com:microsoft/DiskANN

1dafc55

Merge branch 'main' of github.com:microsoft/DiskANN

33285e2

Merge branch 'main' of github.com:microsoft/DiskANN

824bdb3

add yfcc small

d5496d2

revert accidental change

bdeb695

remove readme

f114664

add description file

974bbe8

re-add readme

d00f3b1

magdalendobson marked this pull request as ready for review June 23, 2026 23:59

magdalendobson requested review from a team and Copilot June 23, 2026 23:59

Copilot started reviewing on behalf of magdalendobson June 24, 2026 00:00

Copilot AI reviewed Jun 24, 2026

Comment thread test_data/yfcc/README.md Outdated

Comment thread test_data/yfcc/yfcc_runbook.yaml Outdated

Comment thread test_data/yfcc/yfcc_query_filters.json

Magdalen Manohar added 3 commits June 26, 2026 13:55

Merge branch 'main' of github.com:microsoft/DiskANN into users/magdal…

e4a904d

…en/better_dataset_in_test_data

add two examples and corresponding integration tests

6c177b6

add runbook in plain text

fea7c13

update gitattributes to remove LSF warning for .md and .yaml files

ccd1394

Magdalen Manohar added 3 commits June 26, 2026 15:23

re-add example_runbook.yaml as plaintext

c2404ee

Merge branch 'users/magdalen/better_dataset_in_test_data' of github.c…

833fa15

…om:microsoft/DiskANN into users/magdalen/better_dataset_in_test_data

added license info

d2e5558

hildebrandmw approved these changes Jun 26, 2026

harsha-simhadri approved these changes Jun 26, 2026

magdalendobson merged commit d657d21 into main Jun 26, 2026
24 checks passed

magdalendobson deleted the users/magdalen/better_dataset_in_test_data branch June 26, 2026 17:07

magdalendobson merged 28 commits into main from users/magdalen/better_dataset_in_test_data

5 participants