[analytics-engine] Strip AE-unsupported fields from test data and exclude the ITs doomed by it by mengweieric · Pull Request #5541 · opensearch-project/sql

mengweieric · 2026-06-11T17:53:17Z

Problem

On the analytics-engine route (-Dtests.analytics.parquet_indices=true), a test dataset whose mapping contains a field type the parquet/composite store can't handle (nested, geo_point, geo_shape, alias, or a store-disabled binary) fails index creation / ingestion. The failure surfaces later as an unrelated expected:<1> but was:<0> in a downstream IT rather than at the offending index/field — masking real assertions.

Fix

1. Strip unsupported-typed fields at load time — gated on tests.analytics.parquet_indices; a byte-for-byte no-op off the AE route.

createIndexByRestClient removes unsupported {nested, geo_point, geo_shape, alias} properties recursively and reports the exact dropped paths (e.g. [location, point]), not just top-level names.
loadIndex passes those paths to loadDataByRestClient, which strips each path from every bulk source doc recursively — through objects and arrays of objects — preserving siblings (location.city etc.), so mapping and data agree. Untouched docs stay byte-for-byte verbatim.
loadDataByRestClient now fails loudly on bulk errors=true (names the index + first item errors), for all loads — silent partial ingestion no longer slips through a 200.
The trigger is the mapping's field types, not the test's identity, so any IT — existing or newly added — loading such a dataset is handled out of the box. No per-index variant files. UNSUPPORTED_FIELD_TYPES is the single source of truth, imported by the verifier so the two can't drift.

binary is conditional: kept when store: true (engine reads it as VARBINARY), but stripped when store: false — the parquet store can't create a store-disabled binary field (Unable to derive source for [X] with store disabled, confirmed on the AE cluster).

2. Physically exclude PPL ITs doomed by the strip — a test whose query references a stripped field can never pass once that field is gone, so excluding it is correct, not a workaround. Centralized in integ-test/build.gradle, gated on the parsed analyticsEnabled boolean (so =false excludes nothing), method-level wherever possible so the rest of each class still runs. Verified field-by-field against the index mappings:

Excluded	Reason	Level
`GeoPointFormatsIT`	every test reads a `geo_point` field	class
`DataTypeIT.test_nonnumeric_data_types`	asserts `nested_value` + `geo_point_value`	method
`DataTypeIT.test_alias_data_type`	`where alias_col` (`type: alias`)	method
`SystemFunctionIT.typeof_opensearch_types`	`typeof(geo_point_value)`	method

Kept running (verified not doomed): GeoIpFunctionsIT (geoip() over text ip/name), StatsCommandIT.testStatsNested* (nested function calls, not a nested field), ConvertCommandIT/TrendlineCommandIT/ExplainIT alias usages (query result aliases, not alias-typed fields).

Verification

Verified locally — CI's integTest jobs run the legacy/default route (no tests.analytics.parquet_indices), so they never exercise this code; the AE route needs a force-routed 9-plugin cluster CI doesn't spin up.

Goal achieved (AE route, force-routed cluster, _explain confirmed LogicalTableScan(table=[[opensearch = real AE, not Calcite fallback):

AnalyticsUnsupportedFieldStripVerifyIT passes over every Index dataset — per dataset it forces a clean reload, then asserts: clean live mapping, no stripped path left in sampled _source, and doc-count == source-doc count. (1 dataset, DATASOURCES, logs a transparent skip of the count leg only — its count query touches a system index; legs 1–3 still ran.)
This is the fix proving itself: the verifier caught the binary store:false case, which the conditional strip now handles.

Unit (pure-logic, no cluster) — AnalyticsFieldStripTests 7/7: recursive path strip, complex_geo-shape (location.point removed, siblings kept, deep nested_locations.primary.office), arrays of objects, byte-identity of untouched docs, binary store-conditional, no-op-when-disabled.

No regression when AE is off (the v2/v3 path): strip code early-returns on !isEnabled() (mapping/data byte-identical); excludes are gated on the parsed analyticsEnabled. Confirmed empirically — ran StatsCommandIT/DataTypeIT/FieldsCommandIT/SearchCommandIT legacy (no parquet flag): 123 pass / 0 fail, and the new bulk-error check never tripped.

Check List

New functionality includes testing.
Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…overage The analytics-engine cannot read nested/geo_point/geo_shape fields, so a handful of test datasets need an _ae sibling that drops those fields from the mapping and the matching keys from each bulk doc. Add TestUtils.AnalyticsIndexConfig#resolveDatasetPath to transparently route a dataset/mapping path to its _ae variant when the config is enabled and the sibling exists on disk; normal CI is byte-for-byte identical otherwise. Ships the _ae variants for datatypes, deep_nested, merge_test_1/2, and nested_simple, plus AnalyticsIndexConfigVariantTests covering the resolution logic. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

…load The analytics-engine (DataFusion) backend cannot read nested, geo_point, geo_shape, binary, or alias fields. Rather than maintain per-index _ae dataset variants, strip those fields uniformly at load time on the AE route: - createIndexByRestClient recursively removes unsupported-typed properties from the mapping (including object sub-properties) and reports the top-level fields dropped. - loadIndex passes that set to a new loadDataByRestClient overload, which strips the matching keys from each bulk source doc so mapping and data agree. Gated on tests.analytics.parquet_indices; byte-for-byte no-op off the AE route. Replaces the earlier filename-_ae hook + hand-authored variant files. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

Add an exhaustive, fail-loud verifier IT (AnalyticsUnsupportedFieldStripVerifyIT) that loads every Index enum dataset through the real loadIndex harness on the AE route and asserts no nested/geo_point/geo_shape/binary/alias field survives in any live mapping — converting the "silent expected:<1> but was:<0>" risk into a direct, attributable failure. Skip CalciteAliasFieldAggregationIT on the AE route: it creates its index via a raw inline PUT (bypassing the load-path strip) and its queries reference alias fields directly, so alias aggregation is inherently AE-incompatible — there is nothing to salvage by stripping. Matches the existing Assume.assumeFalse idiom. Full calcite/remote PPL sweep on the AE route now has zero ingestion failures attributable to the five unsupported field types. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

github-actions · 2026-06-11T17:54:47Z

PR Reviewer Guide 🔍

(Review updated until commit `f649962`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Possible Issue The `liveDocCountOrNull` method catches all exceptions and returns null, which causes the test to skip doc-count verification even when the failure is unrelated to the strip logic (e.g., a genuine ingestion problem). This masks real issues. The method should only catch and return null for specific, known-safe exceptions (like shard-level count failures on system indices), not all exceptions. private Integer liveDocCountOrNull(String indexName) { String query = String.format("source=%s \| stats count() as c", indexName); try { JSONObject result = executeQuery(query); return result.getJSONArray("datarows").getJSONArray(0).getInt(0); } catch (Exception e) { // The count query can fail for cluster-side reasons unrelated to the strip (e.g. it touches a // parquet-backed system index that can't serve a shard-level count). Don't let that mask the // strip proof — return null so the caller records a transparent skip. LOG.info("count query failed for [{}]: {}", indexName, rootMessage(e)); return null; } Possible Issue The `stripBulkFields` method splits on newline with limit -1 but then iterates `i < lines.length - 1` when appending newlines. If the input ends with a newline, `split("\n", -1)` produces a trailing empty string, and the loop skips appending a newline after the second-to-last line, corrupting the output. The condition should be `i < lines.length` (always append newline) or the split should use a positive limit. if (i < lines.length - 1) { out.append('\n'); } } Possible Issue The `failIfBulkHadItemErrors` method iterates over `action.keySet()` (the bulk operation names like "index", "create") but does not break after processing the first key. If a single item somehow has multiple operation keys (malformed response), the method would add duplicate error entries for the same doc. Add a break after processing the first key. for (String op : action.keySet()) { JSONObject result = action.optJSONObject(op); if (result != null && result.has("error")) { firstErrors.add("doc#" + i + ": " + result.get("error")); } } }

github-actions · 2026-06-11T17:55:16Z

PR Code Suggestions ✨

Latest suggestions up to f649962

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Use `.equals()` for Integer comparison The comparison `actualDocs != expectedDocs` uses reference inequality on `Integer` objects. For values outside the cached range (-128 to 127), this may fail even when the numeric values are equal. Use `.equals()` or unbox to primitive `int` for correct numeric comparison. integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [157-172] Integer actualDocs = liveDocCountOrNull(name); if (actualDocs == null) { skippedDocCount.add(index.name() + " (count query errored — see Legs 1-3 + bulk check)"); -} else if (actualDocs != expectedDocs) { +} else if (!actualDocs.equals(expectedDocs)) { failures.add( "[" + index.name() + " -> " + name + "] ingested " + actualDocs + " docs but dataset" + " has " + expectedDocs + " (partial/failed ingestion)"); } Suggestion importance[1-10]: 8 __ Why: Using `!=` for `Integer` comparison can fail for values outside the cached range (-128 to 127) due to reference inequality. This is a potential bug that could cause incorrect test failures. Using `.equals()` ensures correct numeric comparison.	Medium
General	Skip recursion after field removal After removing a field from `properties`, the method continues to recurse into nested properties. If the removed field had nested properties, the recursion should be skipped since the entire subtree is dropped. Add an early continue after removal to avoid processing a removed field's children. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [232-248] private static void collectAndRemoveUnsupported( JSONObject properties, List<String> prefix, Set<List<String>> dropped) { for (String field : properties.keySet().toArray(new String[0])) { JSONObject def = properties.optJSONObject(field); if (def == null) { continue; } List<String> path = new ArrayList<>(prefix); path.add(field); if (isUnsupportedForAeStore(def)) { properties.remove(field); dropped.add(List.copyOf(path)); - } else if (def.has("properties")) { + continue; + } + if (def.has("properties")) { collectAndRemoveUnsupported(def.getJSONObject("properties"), path, dropped); } } } Suggestion importance[1-10]: 3 __ Why: The suggestion is technically correct but has minimal impact. The current code uses `else if` which already prevents recursion after removal. Adding an explicit `continue` after removal would make the logic clearer but doesn't fix a bug since the `else if` already handles this case.	Low

Previous suggestions

Suggestions up to commit ab4e7c6

Category	Suggestion	Impact
General	Avoid loading entire file into memory The method loads the entire dataset file into memory as a single string, which can cause OutOfMemoryError for large test datasets. Use a streaming approach with BufferedReader to process lines incrementally instead of loading the entire file at once. integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [320-340] private static int sourceDocCount(Index index) throws IOException { String dataSet = index.getDataSet(); if (dataSet == null \|\| dataSet.isEmpty()) { return 0; } - String body = new String(Files.readAllBytes(Paths.get(getResourceFilePath(dataSet)))); int docs = 0; - for (String line : body.split("\n", -1)) { - String trimmed = line.trim(); - if (trimmed.isEmpty() \|\| trimmed.charAt(0) != '{') { - continue; - } - JSONObject obj = new JSONObject(trimmed); - boolean isActionLine = - obj.has("index") \|\| obj.has("create") \|\| obj.has("update") \|\| obj.has("delete"); - if (!isActionLine) { - docs++; + try (BufferedReader reader = Files.newBufferedReader(Paths.get(getResourceFilePath(dataSet)))) { + String line; + while ((line = reader.readLine()) != null) { + String trimmed = line.trim(); + if (trimmed.isEmpty() \|\| trimmed.charAt(0) != '{') { + continue; + } + JSONObject obj = new JSONObject(trimmed); + boolean isActionLine = + obj.has("index") \|\| obj.has("create") \|\| obj.has("update") \|\| obj.has("delete"); + if (!isActionLine) { + docs++; + } } } return docs; } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies a memory efficiency issue where `sourceDocCount` loads entire dataset files into memory. Using `BufferedReader` for line-by-line processing is a valid improvement for large files, though test datasets are typically small, limiting the practical impact.	Low
	Add file size validation check Loading the entire dataset file into memory can cause OutOfMemoryError for large test datasets. Consider processing the file in chunks or streaming the data to avoid memory issues with large bulk payloads. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [435-447] public static void loadDataByRestClient( RestClient client, String indexName, String dataSetFilePath, Set<List<String>> droppedPaths) throws IOException { Path path = Paths.get(getResourceFilePath(dataSetFilePath)); + long fileSize = Files.size(path); + if (fileSize > 100 * 1024 * 1024) { // 100MB threshold + throw new IllegalArgumentException("Dataset file too large: " + fileSize + " bytes. Consider splitting the dataset."); + } String body = new String(Files.readAllBytes(path)); body = AnalyticsIndexConfig.stripBulkFields(body, droppedPaths); Request request = new Request( "POST", "/" + indexName + "/_bulk?" + AnalyticsIndexConfig.bulkLoadRefreshParam()); request.setJsonEntity(body); Response response = performRequest(client, request); failIfBulkHadItemErrors(indexName, response); } Suggestion importance[1-10]: 5 __ Why: Adding a file size check before loading is a reasonable safeguard against `OutOfMemoryError` with large datasets. However, the hardcoded 100MB threshold is arbitrary and may not suit all scenarios. The suggestion provides defensive validation but doesn't fundamentally solve the memory issue for files near the limit.	Low
	Prevent stack overflow in recursion The recursive method lacks depth limit protection, which could cause StackOverflowError on deeply nested JSON structures or circular references. Add a maximum recursion depth check to prevent stack exhaustion. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [302-325] +private static final int MAX_RECURSION_DEPTH = 100; + private static boolean removePath(Object node, List<String> path, int idx) { + if (idx >= MAX_RECURSION_DEPTH) { + throw new IllegalStateException("Maximum recursion depth exceeded while removing path"); + } String part = path.get(idx); boolean last = idx == path.size() - 1; boolean removed = false; if (node instanceof JSONObject) { JSONObject obj = (JSONObject) node; if (!obj.has(part)) { return false; } if (last) { obj.remove(part); return true; } removed = removePath(obj.get(part), path, idx + 1); } else if (node instanceof JSONArray) { ... } return removed; } Suggestion importance[1-10]: 4 __ Why: While adding recursion depth protection is a defensive practice, the `removePath` method's recursion depth is bounded by the path length (typically shallow in JSON mappings). The risk of `StackOverflowError` is low in practice, making this a minor defensive improvement rather than addressing a likely issue.	Low

Suggestions up to commit efdcc7e

Category	Suggestion	Impact
General	Use explicit list for iteration The recursive call to `removeIfUnsupported` modifies the `sub` object while iterating over its keys. Converting `keySet()` to an array avoids `ConcurrentModificationException`, but this pattern is fragile. Use an explicit list copy for clarity and maintainability. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [208-225] private static boolean removeIfUnsupported(JSONObject properties, String field) { JSONObject def = properties.optJSONObject(field); if (def == null) { return false; } String type = def.optString("type", null); if (type != null && UNSUPPORTED_FIELD_TYPES.contains(type)) { properties.remove(field); return true; } if (def.has("properties")) { JSONObject sub = def.getJSONObject("properties"); - for (String child : sub.keySet().toArray(new String[0])) { + List<String> childKeys = new ArrayList<>(sub.keySet()); + for (String child : childKeys) { removeIfUnsupported(sub, child); } } return false; } Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies that using `new ArrayList<>(sub.keySet())` is clearer than `sub.keySet().toArray(new String[0])` for avoiding `ConcurrentModificationException` during iteration. This improves code readability and maintainability, though both approaches are functionally correct.	Low
	Remove redundant null check The loop condition checks `r.getCause() != r` but also has a redundant `if` `(r.getCause() == null) break;` inside. The inner check is unnecessary since the loop condition already handles null causes. Remove the redundant break to simplify the logic. integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [168-179] private static boolean isMissingDatasetFile(Throwable t) { - for (Throwable r = t; r != null && r.getCause() != r; r = r.getCause()) { + for (Throwable r = t; r != null && r.getCause() != null && r.getCause() != r; r = r.getCause()) { if (r instanceof java.nio.file.NoSuchFileException \|\| r instanceof java.io.FileNotFoundException) { return true; - } - if (r.getCause() == null) { - break; } } return false; } Suggestion importance[1-10]: 4 __ Why: The suggestion correctly identifies a redundant null check in the loop. However, the improved code changes the loop condition rather than just removing the inner break statement, which is a valid but different approach. The impact is minor as both versions work correctly.	Low
	Optimize JSON parsing logic The method creates a `JSONObject` from every line that looks like JSON, even action lines that won't be modified. This is inefficient and could throw exceptions on malformed JSON. Parse the JSON only after confirming it's a source line that needs modification. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [238-272] static String stripBulkFields(String bulkBody, Set<String> droppedFields) { if (!isEnabled() \|\| droppedFields.isEmpty()) { return bulkBody; } String[] lines = bulkBody.split("\n", -1); StringBuilder out = new StringBuilder(bulkBody.length()); for (int i = 0; i < lines.length; i++) { String line = lines[i]; String trimmed = line.trim(); - if (!trimmed.isEmpty() && trimmed.charAt(0) == '{') { + if (!trimmed.isEmpty() && trimmed.charAt(0) == '{' && i % 2 == 1) { JSONObject doc = new JSONObject(trimmed); - ... + boolean removedAny = false; + for (String f : droppedFields) { + if (doc.has(f)) { + doc.remove(f); + removedAny = true; + } + } + if (removedAny) { + line = doc.toString(); + } } out.append(line); if (i < lines.length - 1) { out.append('\n'); } } return out.toString(); } Suggestion importance[1-10]: 3 __ Why: The suggestion to use `i % 2 == 1` to identify source lines is fragile and incorrect. Bulk format alternates action/source lines, but the current code correctly checks for action keys (`index`, `create`, etc.) which is more robust than relying on line position, especially if the bulk data has empty lines or formatting variations.	Low

Suggestions up to commit d076417

Category	Suggestion	Impact
General	Fix redundant loop termination logic The loop condition checks `r.getCause() != r` but then has a redundant break when `r.getCause() == null`. The loop will naturally terminate when `r.getCause()` is null because the next iteration's condition `r != null` will still be true but `r.getCause()` `!= r` becomes meaningless. Remove the redundant null check and break statement. integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [168-179] private static boolean isMissingDatasetFile(Throwable t) { - for (Throwable r = t; r != null && r.getCause() != r; r = r.getCause()) { + for (Throwable r = t; r != null && r.getCause() != null && r.getCause() != r; r = r.getCause()) { if (r instanceof java.nio.file.NoSuchFileException \|\| r instanceof java.io.FileNotFoundException) { return true; - } - if (r.getCause() == null) { - break; } } return false; } Suggestion importance[1-10]: 4 __ Why: The suggestion correctly identifies redundant logic in the loop. However, the current code is functionally correct and the redundancy doesn't cause bugs. The improvement is minor - it simplifies the loop condition by removing the redundant `break` statement, making the code slightly cleaner.	Low

Suggestions up to commit 13d1081

Category	Suggestion	Impact
Possible issue	Verify array bounds before assertion The test assumes `lines[3]` contains the second document, but `split("\n", -1)` with a trailing newline creates an empty final element. This causes an off-by-one error where `lines[3]` is actually empty, not the second document. Verify the array indices match the actual split result. integ-test/src/test/java/org/opensearch/sql/legacy/AnalyticsFieldStripTests.java [109-127] @Test public void bulkStrip_leavesUntouchedSourceLinesByteForByte() { enable(); // doc1 carries a dropped key (gets rewritten); doc2 does not (must pass through verbatim). String docWithDrop = "{\"keep_text\":\"x\",\"nested_value\":[{\"a\":1}]}"; String docNoDrop = "{\"keep_text\":\"y\",\"age\": 30,\"z\":1}"; String bulk = "{\"index\":{\"_id\":\"1\"}}\n" + docWithDrop + "\n" + "{\"index\":{\"_id\":\"2\"}}\n" + docNoDrop + "\n"; String out = AnalyticsIndexConfig.stripBulkFields(bulk, Set.of("nested_value")); String[] lines = out.split("\n", -1); // The doc that had no dropped key is byte-for-byte identical (odd spacing/key order preserved). + assertTrue("Expected at least 5 lines", lines.length >= 5); assertEquals(docNoDrop, lines[3]); // The doc that had a dropped key lost it. assertFalse(new JSONObject(lines[1]).has("nested_value")); } Suggestion importance[1-10]: 3 __ Why: The suggestion identifies a potential off-by-one issue with array indexing after `split("\n", -1)` with a trailing newline. However, examining the test code shows the bulk string ends with `"\n"`, and `split("\n", -1)` will create 5 elements (lines[0-4]), so `lines[3]` is valid. The suggestion adds a bounds check which is defensive but not strictly necessary given the controlled test input.	Low

Suggestions up to commit 812f87b

Category Suggestion Impact

Possible issue

Verify array bounds after split operation

The test assumes lines[3] contains the second document, but split("\n", -1) with a
trailing newline creates an empty string at lines[4]. The assertion
assertEquals(docNoDrop, lines[3]) will fail because the actual document is at a
different index. Verify the split behavior matches the bulk format structure.

integ-test/src/test/java/org/opensearch/sql/legacy/AnalyticsFieldStripTests.java [109-127]

 @Test
 public void bulkStrip_leavesUntouchedSourceLinesByteForByte() {
   enable();
   // doc1 carries a dropped key (gets rewritten); doc2 does not (must pass through verbatim).
   String docWithDrop = "{\"keep_text\":\"x\",\"nested_value\":[{\"a\":1}]}";
   String docNoDrop = "{\"keep_text\":\"y\",\"age\":  30,\"z\":1}";
   String bulk =
       "{\"index\":{\"_id\":\"1\"}}\n"
           + docWithDrop
           + "\n"
           + "{\"index\":{\"_id\":\"2\"}}\n"
           + docNoDrop
           + "\n";
   String out = AnalyticsIndexConfig.stripBulkFields(bulk, Set.of("nested_value"));
   String[] lines = out.split("\n", -1);
+  // Verify we have the expected number of lines (4 content lines + 1 trailing empty)
+  assertEquals(5, lines.length);
   // The doc that had no dropped key is byte-for-byte identical (odd spacing/key order preserved).
   assertEquals(docNoDrop, lines[3]);
   // The doc that had a dropped key lost it.
   assertFalse(new JSONObject(lines[1]).has("nested_value"));
 }

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that split("\n", -1) with a trailing newline creates an empty string at the end of the array. Adding a verification of the expected array length (5 elements) before accessing lines[3] improves test robustness and makes the test's assumptions explicit, preventing potential ArrayIndexOutOfBoundsException.

Medium

General

Handle malformed JSON gracefully

Creating a JSONObject from every line that starts with { can throw JSONException for
malformed JSON, causing the entire bulk load to fail. Add exception handling to skip
or log malformed lines instead of propagating the exception, ensuring robustness
during data ingestion.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [238-272]

 static String stripBulkFields(String bulkBody, Set<String> droppedFields) {
   if (!isEnabled() || droppedFields.isEmpty()) {
     return bulkBody;
   }
   String[] lines = bulkBody.split("\n", -1);
   StringBuilder out = new StringBuilder(bulkBody.length());
   for (int i = 0; i < lines.length; i++) {
     String line = lines[i];
     String trimmed = line.trim();
     if (!trimmed.isEmpty() && trimmed.charAt(0) == '{') {
-      JSONObject doc = new JSONObject(trimmed);
-      ...
+      try {
+        JSONObject doc = new JSONObject(trimmed);
+        ...
+      } catch (org.json.JSONException e) {
+        // Skip malformed JSON lines to prevent bulk load failure
+        out.append(line);
+        if (i < lines.length - 1) {
+          out.append('\n');
+        }
+        continue;
+      }
     }
     out.append(line);
     if (i < lines.length - 1) {
       out.append('\n');
     }
   }
   return out.toString();
 }

Suggestion importance[1-10]: 7

__

Why: Adding exception handling for JSONException when parsing bulk lines improves robustness. However, the suggestion asks to verify/ensure proper error handling rather than fixing a confirmed bug. The current code assumes well-formed test fixtures, but graceful handling of malformed JSON would prevent complete failure during data ingestion.

Medium

github-actions · 2026-06-11T18:01:21Z

Persistent review updated to latest commit 13d1081

…eserve untouched bulk docs binary maps to VARBINARY in the analytics-engine schema builder (same as ip) and is a registered parquet field type, so it scans fine - drop it from the unsupported-field strip set, which is now {nested, geo_point, geo_shape, alias}. Stripping binary needlessly mutated fixtures and risked masking a real binary-handling bug instead of letting the assertion surface it. Make UNSUPPORTED_FIELD_TYPES the single source of truth (public) and have AnalyticsUnsupportedFieldStripVerifyIT import it instead of re-listing, so the stripper and the verifier cannot drift. Add a byte-identity guard to stripBulkFields: a bulk doc line is only re-serialized when it actually carried a dropped key, so untouched docs pass through verbatim. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

github-actions · 2026-06-11T18:19:57Z

Persistent review updated to latest commit d076417

The unsupported-field strip removes nested/geo_point/geo_shape/alias columns from test data+mappings on the AE route, so PPL tests whose queries reference such a column can never pass - the data is gone. Physically exclude them in integ-test/build.gradle (gated on tests.analytics.parquet_indices), centralized with the existing AE excludes and grouped comments, rather than letting them fail. Verified field-by-field against the index mappings; only doomed work is removed: - GeoPointFormatsIT (whole class - every test reads a geo_point field). - DataTypeIT.test_nonnumeric_data_types (nested_value + geo_point_value). - DataTypeIT.test_alias_data_type (where alias_col, type=alias). - SystemFunctionIT.typeof_opensearch_types (typeof(geo_point_value)). Cleared as NOT doomed (kept running): GeoIpFunctionsIT (geoip() over text ip/name), StatsCommandIT.testStatsNested* (nested function calls, not a nested field), ConvertCommandIT/TrendlineCommandIT/ExplainIT alias usages (result aliases, not alias-typed fields). Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

github-actions · 2026-06-11T18:49:54Z

Persistent review updated to latest commit efdcc7e

Guard raw-document seeding in init() with isIndexExist: the parquet/composite store is append-only on a same-_id PUT, so re-seeding test/test1 on every @before inflated their row counts across the class. Pin SELECT-* and fields- column order with an explicit | fields, since the analytics-engine route returns columns in storage (alphabetical) order rather than mapping order. Branch the tests that exercise analytics-route-incompatible behavior behind isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double literal arithmetic, multi-index integer/long schema merge, and the implicit search-filter syntax. Depends on opensearch-project#5541, which strips analytics-engine-unsupported field types at load and unblocks index creation for this class. Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips). v2 route: 46/46 pass. Signed-off-by: Kai Huang <ahkcs@amazon.com>

These two files come from opensearch-project#5541 (the stacked dependency) and currently fail spotlessCheck. Reformatting them here keeps this PR's CI green; the commit is expected to drop out cleanly once opensearch-project#5541 lands its own formatting fix and this branch is rebased onto main. Signed-off-by: Kai Huang <ahkcs@amazon.com>

mengweieric · 2026-06-11T19:39:17Z

Failing CIs are unrelated to this PR

dai-chen

Is all our IT CI broken? Just wonder how to verify these changes..

…loud, stronger verifier Resolves the correctness holes raised in review. Blocking 1 — path-aware mapping+source strip. stripUnsupportedMappingFields now returns exact dropped PATHS (e.g. [location, point]) instead of only top-level names; stripBulkFields removes those paths recursively from _source, descending through objects AND arrays of objects, preserving siblings (location.city etc.). Adds unit coverage for a complex_geo shape and arrays of objects. Blocking 2 — fail loud on bulk item failures. loadDataByRestClient now parses the _bulk response and throws on errors=true (naming the index + first item errors), for ALL test bulk loads. Silent partial ingestion no longer slips through a 200 status. Blocking 3 — verifier proves the invariant. AnalyticsUnsupportedFieldStripVerifyIT deletes each index first so the real load path always runs (not short-circuited by isIndexExist), then per dataset checks: clean live mapping, no stripped path left in sampled _source, and doc-count == source-doc count (tolerating a cluster-side count error on system indices with a transparent logged skip). NB1 — GeoPointFormatsIT exclude moved under the parsed analyticsEnabled gate (was gated on mere property presence, so =false wrongly excluded it). NB2 — out-of-scope skip (join) now refuses to skip if the mapping also declares any unsupported type, so a mixed failure can't be hidden. binary is now conditional: kept when store:true (engine reads VARBINARY), but stripped when store:false because the parquet store can't create it ("Unable to derive source for [X] with store disabled", verified on the AE cluster). Confirmed end-to-end: the verifier passes over all Index datasets on a force-routed AE cluster. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

github-actions · 2026-06-11T22:34:17Z

Persistent review updated to latest commit ab4e7c6

mengweieric · 2026-06-11T22:49:51Z

Is all our IT CI broken? Just wonder how to verify these changes..

we should fix the CIs besides this PR but I verified locally of

Run AnalyticsUnsupportedFieldStripVerifyIT on a force-routed AE cluster with -Dtests.analytics.parquet_indices=true. It loads every Index dataset through the real load path and asserts per dataset: clean mapping, no stripped path left in _source, doc count == source doc count. It passes, and fails loudly/attributably if any dataset still ingests an unsupported field, so it's the durable proof artifact in the PR.

Also: unit tests 7/7 (nested paths + arrays of objects), and legacy route with no parquet flag 123 pass / 0 fail so nothing regresses when AE is off.

dai-chen · 2026-06-11T23:32:37Z

+     *
+     * @return the dropped field paths (empty when disabled or nothing matched)
+     */
+    static Set<List<String>> stripUnsupportedMappingFields(JSONObject jsonObject) {


I may miss it somewhere. Could you point me where we gate this by AE enabled setting?

the call site is intentionally unconditional, but the actual gate is inside AnalyticsIndexConfig

dai-chen · 2026-06-11T23:39:50Z

Is all our IT CI broken? Just wonder how to verify these changes..

we should fix the CIs besides this PR but I verified locally of

Run AnalyticsUnsupportedFieldStripVerifyIT on a force-routed AE cluster with -Dtests.analytics.parquet_indices=true. It loads every Index dataset through the real load path and asserts per dataset: clean mapping, no stripped path left in _source, doc count == source doc count. It passes, and fails loudly/attributably if any dataset still ingests an unsupported field, so it's the durable proof artifact in the PR.

Also: unit tests 7/7 (nested paths + arrays of objects), and legacy route with no parquet flag 123 pass / 0 fail so nothing regresses when AE is off.

I thought our CI is failing on dependency after version bump. But it seems due to this single test:

2026-06-11T22:45:47.0923183Z Tests with failures:
2026-06-11T22:45:47.0924323Z  - org.opensearch.sql.sql.QueryValidationIT.testAliasToKeywordMultiFieldFailsWithBadRequest

mengweieric · 2026-06-12T00:10:00Z

Is all our IT CI broken? Just wonder how to verify these changes..

we should fix the CIs besides this PR but I verified locally of
Run AnalyticsUnsupportedFieldStripVerifyIT on a force-routed AE cluster with -Dtests.analytics.parquet_indices=true. It loads every Index dataset through the real load path and asserts per dataset: clean mapping, no stripped path left in _source, doc count == source doc count. It passes, and fails loudly/attributably if any dataset still ingests an unsupported field, so it's the durable proof artifact in the PR.
Also: unit tests 7/7 (nested paths + arrays of objects), and legacy route with no parquet flag 123 pass / 0 fail so nothing regresses when AE is off.

I thought our CI is failing on dependency after version bump. But it seems due to this single test:
2026-06-11T22:45:47.0923183Z Tests with failures:
2026-06-11T22:45:47.0924323Z  - org.opensearch.sql.sql.QueryValidationIT.testAliasToKeywordMultiFieldFailsWithBadRequest

yea fixed in this PR: #5545

…in build.gradle Per review: keep all AE-route test exclusions in one place. Move the inline Assume.assumeFalse(isAnalyticsParquetIndicesEnabled()) guard out of CalciteAliasFieldAggregationIT into the gated analyticsEnabled exclude block in integ-test/build.gradle, alongside the other doomed PPL ITs. The rationale (raw-PUT alias index can't be created on the AE route) is carried into both the build.gradle comment and a short note left at the test's init(). Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

github-actions · 2026-06-12T00:16:05Z

Persistent review updated to latest commit f649962

Guard raw-document seeding in init() with isIndexExist: the parquet/composite store is append-only on a same-_id PUT, so re-seeding test/test1 on every @before inflated their row counts across the class. Pin SELECT-* and fields- column order with an explicit | fields, since the analytics-engine route returns columns in storage (alphabetical) order rather than mapping order. Branch the tests that exercise analytics-route-incompatible behavior behind isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double literal arithmetic, multi-index integer/long schema merge, and the implicit search-filter syntax. Depends on opensearch-project#5541, which strips analytics-engine-unsupported field types at load and unblocks index creation for this class. Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips). v2 route: 46/46 pass. Signed-off-by: Kai Huang <ahkcs@amazon.com>

Guard raw-document seeding in init() with isIndexExist: the parquet/composite store is append-only on a same-_id PUT, so re-seeding test/test1 on every @before inflated their row counts across the class. Pin SELECT-* and fields- column order with an explicit | fields, since the analytics-engine route returns columns in storage (alphabetical) order rather than mapping order. Branch the tests that exercise analytics-route-incompatible behavior behind isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double literal arithmetic, multi-index integer/long schema merge, and the implicit search-filter syntax. Depends on #5541, which strips analytics-engine-unsupported field types at load and unblocks index creation for this class. Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips). v2 route: 46/46 pass. Signed-off-by: Kai Huang <ahkcs@amazon.com>

mengweieric added 3 commits June 11, 2026 09:14

mengweieric force-pushed the ae-it-dataset-variants branch from 812f87b to 13d1081 Compare June 11, 2026 17:59

mengweieric added PPL Piped processing language bugFix labels Jun 11, 2026

mengweieric force-pushed the ae-it-dataset-variants branch from 13d1081 to d076417 Compare June 11, 2026 18:18

ahkcs mentioned this pull request Jun 11, 2026

Bring CalcitePPLBasicIT to parity on the analytics-engine route #5542

Merged

2 tasks

mengweieric changed the title ~~[analytics-engine] Strip AE-unsupported fields from test datasets at load~~ [analytics-engine] Strip AE-unsupported fields from test data and exclude the ITs doomed by it Jun 11, 2026

mengweieric marked this pull request as ready for review June 11, 2026 19:26

mengweieric requested review from LantaoJin, Swiddis, acarbonetto, anirudha, dai-chen, joshuali925, noCharger, penghuo, ps48, vamsimanohar and ykmr1224 as code owners June 11, 2026 19:26

mengweieric requested review from RyanL1997, ahkcs, qianheng-aws, songkant-aws and yuancu as code owners June 11, 2026 19:26

ahkcs approved these changes Jun 11, 2026

View reviewed changes

dai-chen reviewed Jun 11, 2026

View reviewed changes

Comment thread integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteAliasFieldAggregationIT.java Outdated

dai-chen reviewed Jun 11, 2026

View reviewed changes

Swiddis merged commit 89d1040 into opensearch-project:main Jun 12, 2026
40 of 41 checks passed

ahkcs mentioned this pull request Jun 12, 2026

Bring CalciteWhereCommandIT to parity on the analytics-engine route #5546

Open

2 tasks

Conversation

mengweieric commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Verification

Check List

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit f649962)

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

mengweieric commented Jun 11, 2026

Uh oh!

dai-chen left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

mengweieric commented Jun 11, 2026

Uh oh!

Uh oh!

dai-chen Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

mengweieric Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

dai-chen commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mengweieric commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mengweieric commented Jun 11, 2026 •

edited

Loading

github-actions Bot commented Jun 11, 2026 •

edited

Loading

(Review updated until commit `f649962`)

github-actions Bot commented Jun 11, 2026 •

edited

Loading

dai-chen commented Jun 11, 2026 •

edited

Loading