[analytics-engine] Strip AE-unsupported fields from test data and exclude the ITs doomed by it#5541
Conversation
…overage The analytics-engine cannot read nested/geo_point/geo_shape fields, so a handful of test datasets need an _ae sibling that drops those fields from the mapping and the matching keys from each bulk doc. Add TestUtils.AnalyticsIndexConfig#resolveDatasetPath to transparently route a dataset/mapping path to its _ae variant when the config is enabled and the sibling exists on disk; normal CI is byte-for-byte identical otherwise. Ships the _ae variants for datatypes, deep_nested, merge_test_1/2, and nested_simple, plus AnalyticsIndexConfigVariantTests covering the resolution logic. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…load The analytics-engine (DataFusion) backend cannot read nested, geo_point, geo_shape, binary, or alias fields. Rather than maintain per-index _ae dataset variants, strip those fields uniformly at load time on the AE route: - createIndexByRestClient recursively removes unsupported-typed properties from the mapping (including object sub-properties) and reports the top-level fields dropped. - loadIndex passes that set to a new loadDataByRestClient overload, which strips the matching keys from each bulk source doc so mapping and data agree. Gated on tests.analytics.parquet_indices; byte-for-byte no-op off the AE route. Replaces the earlier filename-_ae hook + hand-authored variant files. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Add an exhaustive, fail-loud verifier IT (AnalyticsUnsupportedFieldStripVerifyIT) that loads every Index enum dataset through the real loadIndex harness on the AE route and asserts no nested/geo_point/geo_shape/binary/alias field survives in any live mapping — converting the "silent expected:<1> but was:<0>" risk into a direct, attributable failure. Skip CalciteAliasFieldAggregationIT on the AE route: it creates its index via a raw inline PUT (bypassing the load-path strip) and its queries reference alias fields directly, so alias aggregation is inherently AE-incompatible — there is nothing to salvage by stripping. Matches the existing Assume.assumeFalse idiom. Full calcite/remote PPL sweep on the AE route now has zero ingestion failures attributable to the five unsupported field types. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
PR Reviewer Guide 🔍(Review updated until commit f649962)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to f649962 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit ab4e7c6
Suggestions up to commit efdcc7e
Suggestions up to commit d076417
Suggestions up to commit 13d1081
Suggestions up to commit 812f87b
|
812f87b to
13d1081
Compare
|
Persistent review updated to latest commit 13d1081 |
…eserve untouched bulk docs
binary maps to VARBINARY in the analytics-engine schema builder (same as ip)
and is a registered parquet field type, so it scans fine - drop it from the
unsupported-field strip set, which is now {nested, geo_point, geo_shape, alias}.
Stripping binary needlessly mutated fixtures and risked masking a real
binary-handling bug instead of letting the assertion surface it.
Make UNSUPPORTED_FIELD_TYPES the single source of truth (public) and have
AnalyticsUnsupportedFieldStripVerifyIT import it instead of re-listing, so the
stripper and the verifier cannot drift. Add a byte-identity guard to
stripBulkFields: a bulk doc line is only re-serialized when it actually carried
a dropped key, so untouched docs pass through verbatim.
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
13d1081 to
d076417
Compare
|
Persistent review updated to latest commit d076417 |
The unsupported-field strip removes nested/geo_point/geo_shape/alias columns from test data+mappings on the AE route, so PPL tests whose queries reference such a column can never pass - the data is gone. Physically exclude them in integ-test/build.gradle (gated on tests.analytics.parquet_indices), centralized with the existing AE excludes and grouped comments, rather than letting them fail. Verified field-by-field against the index mappings; only doomed work is removed: - GeoPointFormatsIT (whole class - every test reads a geo_point field). - DataTypeIT.test_nonnumeric_data_types (nested_value + geo_point_value). - DataTypeIT.test_alias_data_type (where alias_col, type=alias). - SystemFunctionIT.typeof_opensearch_types (typeof(geo_point_value)). Cleared as NOT doomed (kept running): GeoIpFunctionsIT (geoip() over text ip/name), StatsCommandIT.testStatsNested* (nested function calls, not a nested field), ConvertCommandIT/TrendlineCommandIT/ExplainIT alias usages (result aliases, not alias-typed fields). Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
|
Persistent review updated to latest commit efdcc7e |
Guard raw-document seeding in init() with isIndexExist: the parquet/composite store is append-only on a same-_id PUT, so re-seeding test/test1 on every @before inflated their row counts across the class. Pin SELECT-* and fields- column order with an explicit | fields, since the analytics-engine route returns columns in storage (alphabetical) order rather than mapping order. Branch the tests that exercise analytics-route-incompatible behavior behind isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double literal arithmetic, multi-index integer/long schema merge, and the implicit search-filter syntax. Depends on opensearch-project#5541, which strips analytics-engine-unsupported field types at load and unblocks index creation for this class. Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips). v2 route: 46/46 pass. Signed-off-by: Kai Huang <ahkcs@amazon.com>
These two files come from opensearch-project#5541 (the stacked dependency) and currently fail spotlessCheck. Reformatting them here keeps this PR's CI green; the commit is expected to drop out cleanly once opensearch-project#5541 lands its own formatting fix and this branch is rebased onto main. Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Failing CIs are unrelated to this PR |
dai-chen
left a comment
There was a problem hiding this comment.
Is all our IT CI broken? Just wonder how to verify these changes..
…loud, stronger verifier
Resolves the correctness holes raised in review.
Blocking 1 — path-aware mapping+source strip. stripUnsupportedMappingFields
now returns exact dropped PATHS (e.g. [location, point]) instead of only
top-level names; stripBulkFields removes those paths recursively from _source,
descending through objects AND arrays of objects, preserving siblings
(location.city etc.). Adds unit coverage for a complex_geo shape and arrays of
objects.
Blocking 2 — fail loud on bulk item failures. loadDataByRestClient now parses
the _bulk response and throws on errors=true (naming the index + first item
errors), for ALL test bulk loads. Silent partial ingestion no longer slips
through a 200 status.
Blocking 3 — verifier proves the invariant. AnalyticsUnsupportedFieldStripVerifyIT
deletes each index first so the real load path always runs (not short-circuited
by isIndexExist), then per dataset checks: clean live mapping, no stripped path
left in sampled _source, and doc-count == source-doc count (tolerating a
cluster-side count error on system indices with a transparent logged skip).
NB1 — GeoPointFormatsIT exclude moved under the parsed analyticsEnabled gate
(was gated on mere property presence, so =false wrongly excluded it).
NB2 — out-of-scope skip (join) now refuses to skip if the mapping also declares
any unsupported type, so a mixed failure can't be hidden.
binary is now conditional: kept when store:true (engine reads VARBINARY), but
stripped when store:false because the parquet store can't create it
("Unable to derive source for [X] with store disabled", verified on the AE
cluster). Confirmed end-to-end: the verifier passes over all Index datasets on a
force-routed AE cluster.
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
|
Persistent review updated to latest commit ab4e7c6 |
we should fix the CIs besides this PR but I verified locally of Run AnalyticsUnsupportedFieldStripVerifyIT on a force-routed AE cluster with -Dtests.analytics.parquet_indices=true. It loads every Index dataset through the real load path and asserts per dataset: clean mapping, no stripped path left in _source, doc count == source doc count. It passes, and fails loudly/attributably if any dataset still ingests an unsupported field, so it's the durable proof artifact in the PR. Also: unit tests 7/7 (nested paths + arrays of objects), and legacy route with no parquet flag 123 pass / 0 fail so nothing regresses when AE is off. |
| * | ||
| * @return the dropped field paths (empty when disabled or nothing matched) | ||
| */ | ||
| static Set<List<String>> stripUnsupportedMappingFields(JSONObject jsonObject) { |
There was a problem hiding this comment.
I may miss it somewhere. Could you point me where we gate this by AE enabled setting?
There was a problem hiding this comment.
the call site is intentionally unconditional, but the actual gate is inside AnalyticsIndexConfig
I thought our CI is failing on dependency after version bump. But it seems due to this single test: |
yea fixed in this PR: #5545 |
…in build.gradle Per review: keep all AE-route test exclusions in one place. Move the inline Assume.assumeFalse(isAnalyticsParquetIndicesEnabled()) guard out of CalciteAliasFieldAggregationIT into the gated analyticsEnabled exclude block in integ-test/build.gradle, alongside the other doomed PPL ITs. The rationale (raw-PUT alias index can't be created on the AE route) is carried into both the build.gradle comment and a short note left at the test's init(). Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
|
Persistent review updated to latest commit f649962 |
Guard raw-document seeding in init() with isIndexExist: the parquet/composite store is append-only on a same-_id PUT, so re-seeding test/test1 on every @before inflated their row counts across the class. Pin SELECT-* and fields- column order with an explicit | fields, since the analytics-engine route returns columns in storage (alphabetical) order rather than mapping order. Branch the tests that exercise analytics-route-incompatible behavior behind isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double literal arithmetic, multi-index integer/long schema merge, and the implicit search-filter syntax. Depends on opensearch-project#5541, which strips analytics-engine-unsupported field types at load and unblocks index creation for this class. Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips). v2 route: 46/46 pass. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Guard raw-document seeding in init() with isIndexExist: the parquet/composite store is append-only on a same-_id PUT, so re-seeding test/test1 on every @before inflated their row counts across the class. Pin SELECT-* and fields- column order with an explicit | fields, since the analytics-engine route returns columns in storage (alphabetical) order rather than mapping order. Branch the tests that exercise analytics-route-incompatible behavior behind isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double literal arithmetic, multi-index integer/long schema merge, and the implicit search-filter syntax. Depends on #5541, which strips analytics-engine-unsupported field types at load and unblocks index creation for this class. Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips). v2 route: 46/46 pass. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Problem
On the analytics-engine route (
-Dtests.analytics.parquet_indices=true), a test dataset whose mapping contains a field type the parquet/composite store can't handle (nested,geo_point,geo_shape,alias, or a store-disabledbinary) fails index creation / ingestion. The failure surfaces later as an unrelatedexpected:<1> but was:<0>in a downstream IT rather than at the offending index/field — masking real assertions.Fix
1. Strip unsupported-typed fields at load time — gated on
tests.analytics.parquet_indices; a byte-for-byte no-op off the AE route.createIndexByRestClientremoves unsupported{nested, geo_point, geo_shape, alias}properties recursively and reports the exact dropped paths (e.g.[location, point]), not just top-level names.loadIndexpasses those paths toloadDataByRestClient, which strips each path from every bulk source doc recursively — through objects and arrays of objects — preserving siblings (location.cityetc.), so mapping and data agree. Untouched docs stay byte-for-byte verbatim.loadDataByRestClientnow fails loudly on bulkerrors=true(names the index + first item errors), for all loads — silent partial ingestion no longer slips through a 200.UNSUPPORTED_FIELD_TYPESis the single source of truth, imported by the verifier so the two can't drift.binaryis conditional: kept whenstore: true(engine reads it asVARBINARY), but stripped whenstore: false— the parquet store can't create a store-disabled binary field (Unable to derive source for [X] with store disabled, confirmed on the AE cluster).2. Physically exclude PPL ITs doomed by the strip — a test whose query references a stripped field can never pass once that field is gone, so excluding it is correct, not a workaround. Centralized in
integ-test/build.gradle, gated on the parsedanalyticsEnabledboolean (so=falseexcludes nothing), method-level wherever possible so the rest of each class still runs. Verified field-by-field against the index mappings:GeoPointFormatsITgeo_pointfieldDataTypeIT.test_nonnumeric_data_typesnested_value+geo_point_valueDataTypeIT.test_alias_data_typewhere alias_col(type: alias)SystemFunctionIT.typeof_opensearch_typestypeof(geo_point_value)Kept running (verified not doomed):
GeoIpFunctionsIT(geoip()overtextip/name),StatsCommandIT.testStatsNested*(nested function calls, not a nested field),ConvertCommandIT/TrendlineCommandIT/ExplainITalias usages (query result aliases, notalias-typed fields).Verification
Verified locally — CI's
integTestjobs run the legacy/default route (notests.analytics.parquet_indices), so they never exercise this code; the AE route needs a force-routed 9-plugin cluster CI doesn't spin up.Goal achieved (AE route, force-routed cluster,
_explainconfirmedLogicalTableScan(table=[[opensearch= real AE, not Calcite fallback):AnalyticsUnsupportedFieldStripVerifyITpasses over everyIndexdataset — per dataset it forces a clean reload, then asserts: clean live mapping, no stripped path left in sampled_source, and doc-count == source-doc count. (1 dataset,DATASOURCES, logs a transparent skip of the count leg only — its count query touches a system index; legs 1–3 still ran.)binary store:falsecase, which the conditional strip now handles.Unit (pure-logic, no cluster) —
AnalyticsFieldStripTests7/7: recursive path strip,complex_geo-shape (location.pointremoved, siblings kept, deepnested_locations.primary.office), arrays of objects, byte-identity of untouched docs,binarystore-conditional, no-op-when-disabled.No regression when AE is off (the v2/v3 path): strip code early-returns on
!isEnabled()(mapping/data byte-identical); excludes are gated on the parsedanalyticsEnabled. Confirmed empirically — ranStatsCommandIT/DataTypeIT/FieldsCommandIT/SearchCommandITlegacy (no parquet flag): 123 pass / 0 fail, and the new bulk-error check never tripped.Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.