Add a 10K slice of YFCC to test_data #1199
Pull request overview
Adds a new, moderately sized YFCC-derived dataset slice under test_data/ intended to support more realistic integration/regression testing scenarios (including filtered search and streaming-style update patterns).
Changes:
- Add a 10K-vector base set and a 100-query set (LFS-tracked binaries).
- Add multiple ground-truth artifacts (Euclidean/cosine/IP + filtered range results) and step-based runbook ground truth.
- Add dataset documentation (README.md) describing provenance, metrics, filters, and the streaming runbook intent.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test_data/yfcc/README.md | Documents dataset provenance, metrics/groundtruth, filter match-rate stats, and runbook intent. |
| test_data/yfcc/yfcc_10k.fbin | 10K-vector float32 base dataset (LFS). |
| test_data/yfcc/yfcc_query_100.fbin | 100-query float32 query set (LFS). |
| test_data/yfcc/yfcc_metadata.json | Query/vector metadata used for filtered search (LFS). |
| test_data/yfcc/yfcc_query_filters.json | Curated filter definitions for filtered-search testing (LFS). |
| test_data/yfcc/yfcc_runbook.yaml | Streaming runbook configuration for exercising insert/delete/replace + slot recycling (LFS). |
| test_data/yfcc/groundtruth.bin | Ground truth for the native Euclidean metric (LFS). |
| test_data/yfcc/groundtruth_cosine.bin | Ground truth for cosine metric testing (LFS). |
| test_data/yfcc/groundtruth_ip.bin | Ground truth for inner product metric testing (LFS). |
| test_data/yfcc/groundtruth_filtered.rangeres | Ground truth for filtered/range-style testing (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step2.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step4.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step6.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step8.gt10 | Runbook step ground truth snapshot (LFS). |
| test_data/yfcc/yfcc_runbook_gt/step10.gt10 | Runbook step ground truth snapshot (LFS). |
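For anyone who wants to inspect the new binaries locally, here is a minimal Python sketch of a reader. It assumes the layout commonly used for `.fbin` files: two little-endian unsigned 32-bit integers (point count, dimension) followed by row-major float32 values. This PR does not spell the layout out, so treat that as an assumption. `read_fbin` is a hypothetical helper name, not something in the repository. Since the files are LFS-tracked, the sketch also rejects an unfetched Git LFS pointer, which is a small text file beginning with `version`.

```python
import struct
import sys
from array import array


def read_fbin(path):
    """Read a float32 vector file.

    Assumed layout (not confirmed by this PR): uint32 count, uint32 dim,
    then count * dim little-endian float32 values in row-major order.
    Returns (count, dim, flat_values).
    """
    with open(path, "rb") as f:
        header = f.read(8)
        if header == b"version ":
            # Git LFS pointer files are text that starts with "version ".
            raise ValueError("looks like a Git LFS pointer; fetch the real file first")
        if len(header) != 8:
            raise ValueError("file too short for an 8-byte header")
        count, dim = struct.unpack("<II", header)
        payload = f.read(count * dim * 4)

    if len(payload) != count * dim * 4:
        raise ValueError("payload size does not match the header")

    values = array("f")
    values.frombytes(payload)
    if sys.byteorder == "big":
        # The file is assumed little-endian; swap on big-endian hosts.
        values.byteswap()
    return count, dim, values


def vector_at(values, dim, index):
    """Return the index-th vector as a list of floats."""
    return list(values[index * dim:(index + 1) * dim])
```

Under the same assumption, `count, dim, values = read_fbin("test_data/yfcc/yfcc_10k.fbin")` should report 10,000 points for the base set, and `vector_at(values, dim, 0)` returns the first vector.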
Codecov Report
✅ All modified and coverable lines are covered by tests.
Coverage diff (main vs. #1199):
| Metric | main | #1199 | +/- |
|---|---|---|---|
| Coverage | 90.84% | 90.84% | -0.01% |
| Files | 488 | 488 | |
| Lines | 93233 | 93305 | +72 |
| Hits | 84697 | 84759 | +62 |
| Misses | 8536 | 8546 | +10 |
Awesome! This will be great! One quick question for you: can we also use YFCC to test
Would it make sense to add an entry to
Also note that checking out this branch yields the following warning:
I think we discussed dividing by two earlier. Did you want me to add a converted version myself?
Yeah, I can do this, maybe for just a couple of examples for now. The intent was to use these in integration tests more after big changes to the integration test framework land with inmem 2.0--after discussion with Mark it seemed like it would be wasted work to wire them all up now.
Nah, I can do that.
And agreed here.
Ok, two integration tests are added and the warning is resolved now. I needed to update .gitattributes to exclude .md and .yaml files. Interestingly this error didn't seem to pop up last time I added a runbook--maybe we never found it?
…om:microsoft/DiskANN into users/magdalen/better_dataset_in_test_data
Recently we've had a lot of discussions about how the smaller datasets in test_data aren't always suitable for catching regressions or bugs, and how a still-small but more substantial dataset would serve our integration tests better. This PR adds a 10K slice of the YFCC dataset, with 100 queries. The dataset is converted to float32 so that we don't have to register new benchmarks to use it. I curated a set of filters with varying match rates, added groundtruth for the native Euclidean metric as well as for cosine and inner product, and constructed a runbook that forces the provider to recycle slots. This should set us up to run nontrivial tests on large portions of the codebase.
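To make the three groundtruth flavors concrete, below is a small brute-force sketch in plain Python. It is not the tool that produced the files in this PR; `brute_force_topk` and the metric names `"l2"`, `"ip"`, and `"cosine"` are hypothetical, and treating a zero-norm vector as similarity 0.0 is an assumption made only for the example. It shows why the three groundtruth files can differ: Euclidean ranks by smallest distance, while inner product and cosine rank by largest similarity, and inner product is sensitive to vector length.

```python
import heapq
import math


def l2_squared(a, b):
    """Squared Euclidean distance (smaller is closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Dot product (larger is closer)."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity (larger is closer); 0.0 for zero-norm input (assumption)."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm > 0.0 else 0.0


def brute_force_topk(base, query, k, metric):
    """Return the ids of the k nearest base vectors under the given metric.

    Every metric is mapped to a cost where lower is better, so a single
    nsmallest call handles all three. Ties are broken by lower id.
    """
    if metric == "l2":
        cost = lambda v: l2_squared(v, query)
    elif metric == "ip":
        cost = lambda v: -inner_product(v, query)
    elif metric == "cosine":
        cost = lambda v: -cosine_similarity(v, query)
    else:
        raise ValueError(f"unknown metric: {metric}")
    ranked = heapq.nsmallest(k, ((cost(v), i) for i, v in enumerate(base)))
    return [i for _, i in ranked]
```

With a toy base set such as `[[1.0, 0.0], [0.0, 1.0], [3.0, 1.0], [0.9, 0.1]]` and query `[1.0, 0.0]`, the long vector at id 2 ranks first under inner product but drops out of the top two under Euclidean and cosine, which is the kind of disagreement separate groundtruth files are meant to capture.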