Skip to content

[analytics-engine] Strip AE-unsupported fields from test data and exclude the ITs doomed by it#5541

Merged
Swiddis merged 7 commits into
opensearch-project:mainfrom
mengweieric:ae-it-dataset-variants
Jun 12, 2026
Merged

[analytics-engine] Strip AE-unsupported fields from test data and exclude the ITs doomed by it#5541
Swiddis merged 7 commits into
opensearch-project:mainfrom
mengweieric:ae-it-dataset-variants

Conversation

@mengweieric

@mengweieric mengweieric commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Problem

On the analytics-engine route (-Dtests.analytics.parquet_indices=true), a test dataset whose mapping contains a field type the parquet/composite store can't handle (nested, geo_point, geo_shape, alias, or a store-disabled binary) fails index creation / ingestion. The failure surfaces later as an unrelated expected:<1> but was:<0> in a downstream IT rather than at the offending index/field — masking real assertions.

Fix

1. Strip unsupported-typed fields at load time — gated on tests.analytics.parquet_indices; a byte-for-byte no-op off the AE route.

  • createIndexByRestClient removes unsupported {nested, geo_point, geo_shape, alias} properties recursively and reports the exact dropped paths (e.g. [location, point]), not just top-level names.
  • loadIndex passes those paths to loadDataByRestClient, which strips each path from every bulk source doc recursively — through objects and arrays of objects — preserving siblings (location.city etc.), so mapping and data agree. Untouched docs stay byte-for-byte verbatim.
  • loadDataByRestClient now fails loudly on bulk errors=true (names the index + first item errors), for all loads — silent partial ingestion no longer slips through a 200.
  • The trigger is the mapping's field types, not the test's identity, so any IT — existing or newly added — loading such a dataset is handled out of the box. No per-index variant files. UNSUPPORTED_FIELD_TYPES is the single source of truth, imported by the verifier so the two can't drift.

binary is conditional: kept when store: true (engine reads it as VARBINARY), but stripped when store: false — the parquet store can't create a store-disabled binary field (Unable to derive source for [X] with store disabled, confirmed on the AE cluster).

2. Physically exclude PPL ITs doomed by the strip — a test whose query references a stripped field can never pass once that field is gone, so excluding it is correct, not a workaround. Centralized in integ-test/build.gradle, gated on the parsed analyticsEnabled boolean (so =false excludes nothing), method-level wherever possible so the rest of each class still runs. Verified field-by-field against the index mappings:

Excluded Reason Level
GeoPointFormatsIT every test reads a geo_point field class
DataTypeIT.test_nonnumeric_data_types asserts nested_value + geo_point_value method
DataTypeIT.test_alias_data_type where alias_col (type: alias) method
SystemFunctionIT.typeof_opensearch_types typeof(geo_point_value) method

Kept running (verified not doomed): GeoIpFunctionsIT (geoip() over text ip/name), StatsCommandIT.testStatsNested* (nested function calls, not a nested field), ConvertCommandIT/TrendlineCommandIT/ExplainIT alias usages (query result aliases, not alias-typed fields).

Verification

Verified locally — CI's integTest jobs run the legacy/default route (no tests.analytics.parquet_indices), so they never exercise this code; the AE route needs a force-routed 9-plugin cluster CI doesn't spin up.

Goal achieved (AE route, force-routed cluster, _explain confirmed LogicalTableScan(table=[[opensearch = real AE, not Calcite fallback):

  • AnalyticsUnsupportedFieldStripVerifyIT passes over every Index dataset — per dataset it forces a clean reload, then asserts: clean live mapping, no stripped path left in sampled _source, and doc-count == source-doc count. (1 dataset, DATASOURCES, logs a transparent skip of the count leg only — its count query touches a system index; legs 1–3 still ran.)
  • This is the fix proving itself: the verifier caught the binary store:false case, which the conditional strip now handles.

Unit (pure-logic, no cluster) — AnalyticsFieldStripTests 7/7: recursive path strip, complex_geo-shape (location.point removed, siblings kept, deep nested_locations.primary.office), arrays of objects, byte-identity of untouched docs, binary store-conditional, no-op-when-disabled.

No regression when AE is off (the v2/v3 path): strip code early-returns on !isEnabled() (mapping/data byte-identical); excludes are gated on the parsed analyticsEnabled. Confirmed empirically — ran StatsCommandIT/DataTypeIT/FieldsCommandIT/SearchCommandIT legacy (no parquet flag): 123 pass / 0 fail, and the new bulk-error check never tripped.

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…overage

The analytics-engine cannot read nested/geo_point/geo_shape fields, so a handful
of test datasets need an _ae sibling that drops those fields from the mapping and
the matching keys from each bulk doc. Add TestUtils.AnalyticsIndexConfig#resolveDatasetPath
to transparently route a dataset/mapping path to its _ae variant when the config is
enabled and the sibling exists on disk; normal CI is byte-for-byte identical otherwise.

Ships the _ae variants for datatypes, deep_nested, merge_test_1/2, and nested_simple,
plus AnalyticsIndexConfigVariantTests covering the resolution logic.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…load

The analytics-engine (DataFusion) backend cannot read nested, geo_point,
geo_shape, binary, or alias fields. Rather than maintain per-index _ae dataset
variants, strip those fields uniformly at load time on the AE route:

- createIndexByRestClient recursively removes unsupported-typed properties from
  the mapping (including object sub-properties) and reports the top-level fields
  dropped.
- loadIndex passes that set to a new loadDataByRestClient overload, which strips
  the matching keys from each bulk source doc so mapping and data agree.

Gated on tests.analytics.parquet_indices; byte-for-byte no-op off the AE route.
Replaces the earlier filename-_ae hook + hand-authored variant files.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Add an exhaustive, fail-loud verifier IT (AnalyticsUnsupportedFieldStripVerifyIT)
that loads every Index enum dataset through the real loadIndex harness on the AE
route and asserts no nested/geo_point/geo_shape/binary/alias field survives in any
live mapping — converting the "silent expected:<1> but was:<0>" risk into a direct,
attributable failure.

Skip CalciteAliasFieldAggregationIT on the AE route: it creates its index via a raw
inline PUT (bypassing the load-path strip) and its queries reference alias fields
directly, so alias aggregation is inherently AE-incompatible — there is nothing to
salvage by stripping. Matches the existing Assume.assumeFalse idiom.

Full calcite/remote PPL sweep on the AE route now has zero ingestion failures
attributable to the five unsupported field types.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit f649962)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The liveDocCountOrNull method catches all exceptions and returns null, which causes the test to skip doc-count verification even when the failure is unrelated to the strip logic (e.g., a genuine ingestion problem). This masks real issues. The method should only catch and return null for specific, known-safe exceptions (like shard-level count failures on system indices), not all exceptions.

private Integer liveDocCountOrNull(String indexName) {
  String query = String.format("source=%s | stats count() as c", indexName);
  try {
    JSONObject result = executeQuery(query);
    return result.getJSONArray("datarows").getJSONArray(0).getInt(0);
  } catch (Exception e) {
    // The count query can fail for cluster-side reasons unrelated to the strip (e.g. it touches a
    // parquet-backed system index that can't serve a shard-level count). Don't let that mask the
    // strip proof — return null so the caller records a transparent skip.
    LOG.info("count query failed for [{}]: {}", indexName, rootMessage(e));
    return null;
  }
Possible Issue

The stripBulkFields method splits on newline with limit -1 but then iterates i < lines.length - 1 when appending newlines. If the input ends with a newline, split("\n", -1) produces a trailing empty string, and the loop skips appending a newline after the second-to-last line, corrupting the output. The condition should be i < lines.length (always append newline) or the split should use a positive limit.

  if (i < lines.length - 1) {
    out.append('\n');
  }
}
Possible Issue

The failIfBulkHadItemErrors method iterates over action.keySet() (the bulk operation names like "index", "create") but does not break after processing the first key. If a single item somehow has multiple operation keys (malformed response), the method would add duplicate error entries for the same doc. Add a break after processing the first key.

  for (String op : action.keySet()) {
    JSONObject result = action.optJSONObject(op);
    if (result != null && result.has("error")) {
      firstErrors.add("doc#" + i + ": " + result.get("error"));
    }
  }
}

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to f649962

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Use .equals() for Integer comparison

The comparison actualDocs != expectedDocs uses reference inequality on Integer
objects. For values outside the cached range (-128 to 127), this may fail even when
the numeric values are equal. Use .equals() or unbox to primitive int for correct
numeric comparison.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [157-172]

 Integer actualDocs = liveDocCountOrNull(name);
 if (actualDocs == null) {
   skippedDocCount.add(index.name() + " (count query errored — see Legs 1-3 + bulk check)");
-} else if (actualDocs != expectedDocs) {
+} else if (!actualDocs.equals(expectedDocs)) {
   failures.add(
       "["
           + index.name()
           + " -> "
           + name
           + "] ingested "
           + actualDocs
           + " docs but dataset"
           + " has "
           + expectedDocs
           + " (partial/failed ingestion)");
 }
Suggestion importance[1-10]: 8

__

Why: Using != for Integer comparison can fail for values outside the cached range (-128 to 127) due to reference inequality. This is a potential bug that could cause incorrect test failures. Using .equals() ensures correct numeric comparison.

Medium
General
Skip recursion after field removal

After removing a field from properties, the method continues to recurse into nested
properties. If the removed field had nested properties, the recursion should be
skipped since the entire subtree is dropped. Add an early continue after removal to
avoid processing a removed field's children.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [232-248]

 private static void collectAndRemoveUnsupported(
     JSONObject properties, List<String> prefix, Set<List<String>> dropped) {
   for (String field : properties.keySet().toArray(new String[0])) {
     JSONObject def = properties.optJSONObject(field);
     if (def == null) {
       continue;
     }
     List<String> path = new ArrayList<>(prefix);
     path.add(field);
     if (isUnsupportedForAeStore(def)) {
       properties.remove(field);
       dropped.add(List.copyOf(path));
-    } else if (def.has("properties")) {
+      continue;
+    }
+    if (def.has("properties")) {
       collectAndRemoveUnsupported(def.getJSONObject("properties"), path, dropped);
     }
   }
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion is technically correct but has minimal impact. The current code uses else if which already prevents recursion after removal. Adding an explicit continue after removal would make the logic clearer but doesn't fix a bug since the else if already handles this case.

Low

Previous suggestions

Suggestions up to commit ab4e7c6
CategorySuggestion                                                                                                                                    Impact
General
Avoid loading entire file into memory

The method loads the entire dataset file into memory as a single string, which can
cause OutOfMemoryError for large test datasets. Use a streaming approach with
BufferedReader to process lines incrementally instead of loading the entire file at
once.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [320-340]

 private static int sourceDocCount(Index index) throws IOException {
   String dataSet = index.getDataSet();
   if (dataSet == null || dataSet.isEmpty()) {
     return 0;
   }
-  String body = new String(Files.readAllBytes(Paths.get(getResourceFilePath(dataSet))));
   int docs = 0;
-  for (String line : body.split("\n", -1)) {
-    String trimmed = line.trim();
-    if (trimmed.isEmpty() || trimmed.charAt(0) != '{') {
-      continue;
-    }
-    JSONObject obj = new JSONObject(trimmed);
-    boolean isActionLine =
-        obj.has("index") || obj.has("create") || obj.has("update") || obj.has("delete");
-    if (!isActionLine) {
-      docs++;
+  try (BufferedReader reader = Files.newBufferedReader(Paths.get(getResourceFilePath(dataSet)))) {
+    String line;
+    while ((line = reader.readLine()) != null) {
+      String trimmed = line.trim();
+      if (trimmed.isEmpty() || trimmed.charAt(0) != '{') {
+        continue;
+      }
+      JSONObject obj = new JSONObject(trimmed);
+      boolean isActionLine =
+          obj.has("index") || obj.has("create") || obj.has("update") || obj.has("delete");
+      if (!isActionLine) {
+        docs++;
+      }
     }
   }
   return docs;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a memory efficiency issue where sourceDocCount loads entire dataset files into memory. Using BufferedReader for line-by-line processing is a valid improvement for large files, though test datasets are typically small, limiting the practical impact.

Low
Add file size validation check

Loading the entire dataset file into memory can cause OutOfMemoryError for large
test datasets. Consider processing the file in chunks or streaming the data to avoid
memory issues with large bulk payloads.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [435-447]

 public static void loadDataByRestClient(
     RestClient client, String indexName, String dataSetFilePath, Set<List<String>> droppedPaths)
     throws IOException {
   Path path = Paths.get(getResourceFilePath(dataSetFilePath));
+  long fileSize = Files.size(path);
+  if (fileSize > 100 * 1024 * 1024) { // 100MB threshold
+    throw new IllegalArgumentException("Dataset file too large: " + fileSize + " bytes. Consider splitting the dataset.");
+  }
   String body = new String(Files.readAllBytes(path));
   body = AnalyticsIndexConfig.stripBulkFields(body, droppedPaths);
   Request request =
       new Request(
           "POST", "/" + indexName + "/_bulk?" + AnalyticsIndexConfig.bulkLoadRefreshParam());
   request.setJsonEntity(body);
   Response response = performRequest(client, request);
   failIfBulkHadItemErrors(indexName, response);
 }
Suggestion importance[1-10]: 5

__

Why: Adding a file size check before loading is a reasonable safeguard against OutOfMemoryError with large datasets. However, the hardcoded 100MB threshold is arbitrary and may not suit all scenarios. The suggestion provides defensive validation but doesn't fundamentally solve the memory issue for files near the limit.

Low
Prevent stack overflow in recursion

The recursive method lacks depth limit protection, which could cause
StackOverflowError on deeply nested JSON structures or circular references. Add a
maximum recursion depth check to prevent stack exhaustion.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [302-325]

+private static final int MAX_RECURSION_DEPTH = 100;
+
 private static boolean removePath(Object node, List<String> path, int idx) {
+  if (idx >= MAX_RECURSION_DEPTH) {
+    throw new IllegalStateException("Maximum recursion depth exceeded while removing path");
+  }
   String part = path.get(idx);
   boolean last = idx == path.size() - 1;
   boolean removed = false;
   if (node instanceof JSONObject) {
     JSONObject obj = (JSONObject) node;
     if (!obj.has(part)) {
       return false;
     }
     if (last) {
       obj.remove(part);
       return true;
     }
     removed = removePath(obj.get(part), path, idx + 1);
   } else if (node instanceof JSONArray) {
     ...
   }
   return removed;
 }
Suggestion importance[1-10]: 4

__

Why: While adding recursion depth protection is a defensive practice, the removePath method's recursion depth is bounded by the path length (typically shallow in JSON mappings). The risk of StackOverflowError is low in practice, making this a minor defensive improvement rather than addressing a likely issue.

Low
Suggestions up to commit efdcc7e
CategorySuggestion                                                                                                                                    Impact
General
Use explicit list for iteration

The recursive call to removeIfUnsupported modifies the sub object while iterating
over its keys. Converting keySet() to an array avoids
ConcurrentModificationException, but this pattern is fragile. Use an explicit list
copy for clarity and maintainability.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [208-225]

 private static boolean removeIfUnsupported(JSONObject properties, String field) {
   JSONObject def = properties.optJSONObject(field);
   if (def == null) {
     return false;
   }
   String type = def.optString("type", null);
   if (type != null && UNSUPPORTED_FIELD_TYPES.contains(type)) {
     properties.remove(field);
     return true;
   }
   if (def.has("properties")) {
     JSONObject sub = def.getJSONObject("properties");
-    for (String child : sub.keySet().toArray(new String[0])) {
+    List<String> childKeys = new ArrayList<>(sub.keySet());
+    for (String child : childKeys) {
       removeIfUnsupported(sub, child);
     }
   }
   return false;
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that using new ArrayList<>(sub.keySet()) is clearer than sub.keySet().toArray(new String[0]) for avoiding ConcurrentModificationException during iteration. This improves code readability and maintainability, though both approaches are functionally correct.

Low
Remove redundant null check

The loop condition checks r.getCause() != r but also has a redundant if
(r.getCause() == null) break; inside. The inner check is unnecessary since the loop
condition already handles null causes. Remove the redundant break to simplify the
logic.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [168-179]

 private static boolean isMissingDatasetFile(Throwable t) {
-  for (Throwable r = t; r != null && r.getCause() != r; r = r.getCause()) {
+  for (Throwable r = t; r != null && r.getCause() != null && r.getCause() != r; r = r.getCause()) {
     if (r instanceof java.nio.file.NoSuchFileException
         || r instanceof java.io.FileNotFoundException) {
       return true;
-    }
-    if (r.getCause() == null) {
-      break;
     }
   }
   return false;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies a redundant null check in the loop. However, the improved code changes the loop condition rather than just removing the inner break statement, which is a valid but different approach. The impact is minor as both versions work correctly.

Low
Optimize JSON parsing logic

The method creates a JSONObject from every line that looks like JSON, even action
lines that won't be modified. This is inefficient and could throw exceptions on
malformed JSON. Parse the JSON only after confirming it's a source line that needs
modification.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [238-272]

 static String stripBulkFields(String bulkBody, Set<String> droppedFields) {
   if (!isEnabled() || droppedFields.isEmpty()) {
     return bulkBody;
   }
   String[] lines = bulkBody.split("\n", -1);
   StringBuilder out = new StringBuilder(bulkBody.length());
   for (int i = 0; i < lines.length; i++) {
     String line = lines[i];
     String trimmed = line.trim();
-    if (!trimmed.isEmpty() && trimmed.charAt(0) == '{') {
+    if (!trimmed.isEmpty() && trimmed.charAt(0) == '{' && i % 2 == 1) {
       JSONObject doc = new JSONObject(trimmed);
-      ...
+      boolean removedAny = false;
+      for (String f : droppedFields) {
+        if (doc.has(f)) {
+          doc.remove(f);
+          removedAny = true;
+        }
+      }
+      if (removedAny) {
+        line = doc.toString();
+      }
     }
     out.append(line);
     if (i < lines.length - 1) {
       out.append('\n');
     }
   }
   return out.toString();
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion to use i % 2 == 1 to identify source lines is fragile and incorrect. Bulk format alternates action/source lines, but the current code correctly checks for action keys (index, create, etc.) which is more robust than relying on line position, especially if the bulk data has empty lines or formatting variations.

Low
Suggestions up to commit d076417
CategorySuggestion                                                                                                                                    Impact
General
Fix redundant loop termination logic

The loop condition checks r.getCause() != r but then has a redundant break when
r.getCause() == null. The loop will naturally terminate when r.getCause() is null
because the next iteration's condition r != null will still be true but r.getCause()
!= r becomes meaningless. Remove the redundant null check and break statement.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/AnalyticsUnsupportedFieldStripVerifyIT.java [168-179]

 private static boolean isMissingDatasetFile(Throwable t) {
-  for (Throwable r = t; r != null && r.getCause() != r; r = r.getCause()) {
+  for (Throwable r = t; r != null && r.getCause() != null && r.getCause() != r; r = r.getCause()) {
     if (r instanceof java.nio.file.NoSuchFileException
         || r instanceof java.io.FileNotFoundException) {
       return true;
-    }
-    if (r.getCause() == null) {
-      break;
     }
   }
   return false;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies redundant logic in the loop. However, the current code is functionally correct and the redundancy doesn't cause bugs. The improvement is minor - it simplifies the loop condition by removing the redundant break statement, making the code slightly cleaner.

Low
Suggestions up to commit 13d1081
CategorySuggestion                                                                                                                                    Impact
Possible issue
Verify array bounds before assertion

The test assumes lines[3] contains the second document, but split("\n", -1) with a
trailing newline creates an empty final element. This causes an off-by-one error
where lines[3] is actually empty, not the second document. Verify the array indices
match the actual split result.

integ-test/src/test/java/org/opensearch/sql/legacy/AnalyticsFieldStripTests.java [109-127]

 @Test
 public void bulkStrip_leavesUntouchedSourceLinesByteForByte() {
   enable();
   // doc1 carries a dropped key (gets rewritten); doc2 does not (must pass through verbatim).
   String docWithDrop = "{\"keep_text\":\"x\",\"nested_value\":[{\"a\":1}]}";
   String docNoDrop = "{\"keep_text\":\"y\",\"age\":  30,\"z\":1}";
   String bulk =
       "{\"index\":{\"_id\":\"1\"}}\n"
           + docWithDrop
           + "\n"
           + "{\"index\":{\"_id\":\"2\"}}\n"
           + docNoDrop
           + "\n";
   String out = AnalyticsIndexConfig.stripBulkFields(bulk, Set.of("nested_value"));
   String[] lines = out.split("\n", -1);
   // The doc that had no dropped key is byte-for-byte identical (odd spacing/key order preserved).
+  assertTrue("Expected at least 5 lines", lines.length >= 5);
   assertEquals(docNoDrop, lines[3]);
   // The doc that had a dropped key lost it.
   assertFalse(new JSONObject(lines[1]).has("nested_value"));
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion identifies a potential off-by-one issue with array indexing after split("\n", -1) with a trailing newline. However, examining the test code shows the bulk string ends with "\n", and split("\n", -1) will create 5 elements (lines[0-4]), so lines[3] is valid. The suggestion adds a bounds check which is defensive but not strictly necessary given the controlled test input.

Low
Suggestions up to commit 812f87b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Verify array bounds after split operation

The test assumes lines[3] contains the second document, but split("\n", -1) with a
trailing newline creates an empty string at lines[4]. The assertion
assertEquals(docNoDrop, lines[3]) will fail because the actual document is at a
different index. Verify the split behavior matches the bulk format structure.

integ-test/src/test/java/org/opensearch/sql/legacy/AnalyticsFieldStripTests.java [109-127]

 @Test
 public void bulkStrip_leavesUntouchedSourceLinesByteForByte() {
   enable();
   // doc1 carries a dropped key (gets rewritten); doc2 does not (must pass through verbatim).
   String docWithDrop = "{\"keep_text\":\"x\",\"nested_value\":[{\"a\":1}]}";
   String docNoDrop = "{\"keep_text\":\"y\",\"age\":  30,\"z\":1}";
   String bulk =
       "{\"index\":{\"_id\":\"1\"}}\n"
           + docWithDrop
           + "\n"
           + "{\"index\":{\"_id\":\"2\"}}\n"
           + docNoDrop
           + "\n";
   String out = AnalyticsIndexConfig.stripBulkFields(bulk, Set.of("nested_value"));
   String[] lines = out.split("\n", -1);
+  // Verify we have the expected number of lines (4 content lines + 1 trailing empty)
+  assertEquals(5, lines.length);
   // The doc that had no dropped key is byte-for-byte identical (odd spacing/key order preserved).
   assertEquals(docNoDrop, lines[3]);
   // The doc that had a dropped key lost it.
   assertFalse(new JSONObject(lines[1]).has("nested_value"));
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that split("\n", -1) with a trailing newline creates an empty string at the end of the array. Adding a verification of the expected array length (5 elements) before accessing lines[3] improves test robustness and makes the test's assumptions explicit, preventing potential ArrayIndexOutOfBoundsException.

Medium
General
Handle malformed JSON gracefully

Creating a JSONObject from every line that starts with { can throw JSONException for
malformed JSON, causing the entire bulk load to fail. Add exception handling to skip
or log malformed lines instead of propagating the exception, ensuring robustness
during data ingestion.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [238-272]

 static String stripBulkFields(String bulkBody, Set<String> droppedFields) {
   if (!isEnabled() || droppedFields.isEmpty()) {
     return bulkBody;
   }
   String[] lines = bulkBody.split("\n", -1);
   StringBuilder out = new StringBuilder(bulkBody.length());
   for (int i = 0; i < lines.length; i++) {
     String line = lines[i];
     String trimmed = line.trim();
     if (!trimmed.isEmpty() && trimmed.charAt(0) == '{') {
-      JSONObject doc = new JSONObject(trimmed);
-      ...
+      try {
+        JSONObject doc = new JSONObject(trimmed);
+        ...
+      } catch (org.json.JSONException e) {
+        // Skip malformed JSON lines to prevent bulk load failure
+        out.append(line);
+        if (i < lines.length - 1) {
+          out.append('\n');
+        }
+        continue;
+      }
     }
     out.append(line);
     if (i < lines.length - 1) {
       out.append('\n');
     }
   }
   return out.toString();
 }
Suggestion importance[1-10]: 7

__

Why: Adding exception handling for JSONException when parsing bulk lines improves robustness. However, the suggestion asks to verify/ensure proper error handling rather than fixing a confirmed bug. The current code assumes well-formed test fixtures, but graceful handling of malformed JSON would prevent complete failure during data ingestion.

Medium

@mengweieric mengweieric force-pushed the ae-it-dataset-variants branch from 812f87b to 13d1081 Compare June 11, 2026 17:59
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 13d1081

@mengweieric mengweieric added PPL Piped processing language bugFix labels Jun 11, 2026
…eserve untouched bulk docs

binary maps to VARBINARY in the analytics-engine schema builder (same as ip)
and is a registered parquet field type, so it scans fine - drop it from the
unsupported-field strip set, which is now {nested, geo_point, geo_shape, alias}.
Stripping binary needlessly mutated fixtures and risked masking a real
binary-handling bug instead of letting the assertion surface it.

Make UNSUPPORTED_FIELD_TYPES the single source of truth (public) and have
AnalyticsUnsupportedFieldStripVerifyIT import it instead of re-listing, so the
stripper and the verifier cannot drift. Add a byte-identity guard to
stripBulkFields: a bulk doc line is only re-serialized when it actually carried
a dropped key, so untouched docs pass through verbatim.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@mengweieric mengweieric force-pushed the ae-it-dataset-variants branch from 13d1081 to d076417 Compare June 11, 2026 18:18
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit d076417

The unsupported-field strip removes nested/geo_point/geo_shape/alias columns
from test data+mappings on the AE route, so PPL tests whose queries reference
such a column can never pass - the data is gone. Physically exclude them in
integ-test/build.gradle (gated on tests.analytics.parquet_indices), centralized
with the existing AE excludes and grouped comments, rather than letting them
fail.

Verified field-by-field against the index mappings; only doomed work is removed:
- GeoPointFormatsIT (whole class - every test reads a geo_point field).
- DataTypeIT.test_nonnumeric_data_types (nested_value + geo_point_value).
- DataTypeIT.test_alias_data_type (where alias_col, type=alias).
- SystemFunctionIT.typeof_opensearch_types (typeof(geo_point_value)).

Cleared as NOT doomed (kept running): GeoIpFunctionsIT (geoip() over text
ip/name), StatsCommandIT.testStatsNested* (nested function calls, not a nested
field), ConvertCommandIT/TrendlineCommandIT/ExplainIT alias usages (result
aliases, not alias-typed fields).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit efdcc7e

ahkcs added a commit to ahkcs/sql that referenced this pull request Jun 11, 2026
Guard raw-document seeding in init() with isIndexExist: the parquet/composite
store is append-only on a same-_id PUT, so re-seeding test/test1 on every
@before inflated their row counts across the class. Pin SELECT-* and fields-
column order with an explicit | fields, since the analytics-engine route returns
columns in storage (alphabetical) order rather than mapping order. Branch the
tests that exercise analytics-route-incompatible behavior behind
isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double
literal arithmetic, multi-index integer/long schema merge, and the implicit
search-filter syntax.

Depends on opensearch-project#5541, which strips analytics-engine-unsupported field types at load
and unblocks index creation for this class.

Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips).
v2 route: 46/46 pass.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
ahkcs added a commit to ahkcs/sql that referenced this pull request Jun 11, 2026
These two files come from opensearch-project#5541 (the stacked dependency) and currently fail
spotlessCheck. Reformatting them here keeps this PR's CI green; the commit is
expected to drop out cleanly once opensearch-project#5541 lands its own formatting fix and this
branch is rebased onto main.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@mengweieric mengweieric changed the title [analytics-engine] Strip AE-unsupported fields from test datasets at load [analytics-engine] Strip AE-unsupported fields from test data and exclude the ITs doomed by it Jun 11, 2026
@mengweieric mengweieric marked this pull request as ready for review June 11, 2026 19:26
@mengweieric

Copy link
Copy Markdown
Collaborator Author

Failing CIs are unrelated to this PR

@dai-chen dai-chen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is all our IT CI broken? Just wonder how to verify these changes..

…loud, stronger verifier

Resolves the correctness holes raised in review.

Blocking 1 — path-aware mapping+source strip. stripUnsupportedMappingFields
now returns exact dropped PATHS (e.g. [location, point]) instead of only
top-level names; stripBulkFields removes those paths recursively from _source,
descending through objects AND arrays of objects, preserving siblings
(location.city etc.). Adds unit coverage for a complex_geo shape and arrays of
objects.

Blocking 2 — fail loud on bulk item failures. loadDataByRestClient now parses
the _bulk response and throws on errors=true (naming the index + first item
errors), for ALL test bulk loads. Silent partial ingestion no longer slips
through a 200 status.

Blocking 3 — verifier proves the invariant. AnalyticsUnsupportedFieldStripVerifyIT
deletes each index first so the real load path always runs (not short-circuited
by isIndexExist), then per dataset checks: clean live mapping, no stripped path
left in sampled _source, and doc-count == source-doc count (tolerating a
cluster-side count error on system indices with a transparent logged skip).

NB1 — GeoPointFormatsIT exclude moved under the parsed analyticsEnabled gate
(was gated on mere property presence, so =false wrongly excluded it).

NB2 — out-of-scope skip (join) now refuses to skip if the mapping also declares
any unsupported type, so a mixed failure can't be hidden.

binary is now conditional: kept when store:true (engine reads VARBINARY), but
stripped when store:false because the parquet store can't create it
("Unable to derive source for [X] with store disabled", verified on the AE
cluster). Confirmed end-to-end: the verifier passes over all Index datasets on a
force-routed AE cluster.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit ab4e7c6

@mengweieric

Copy link
Copy Markdown
Collaborator Author

Is all our IT CI broken? Just wonder how to verify these changes..

we should fix the CIs besides this PR but I verified locally of

Run AnalyticsUnsupportedFieldStripVerifyIT on a force-routed AE cluster with -Dtests.analytics.parquet_indices=true. It loads every Index dataset through the real load path and asserts per dataset: clean mapping, no stripped path left in _source, doc count == source doc count. It passes, and fails loudly/attributably if any dataset still ingests an unsupported field, so it's the durable proof artifact in the PR.

Also: unit tests 7/7 (nested paths + arrays of objects), and legacy route with no parquet flag 123 pass / 0 fail so nothing regresses when AE is off.

*
* @return the dropped field paths (empty when disabled or nothing matched)
*/
static Set<List<String>> stripUnsupportedMappingFields(JSONObject jsonObject) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may miss it somewhere. Could you point me where we gate this by AE enabled setting?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the call site is intentionally unconditional, but the actual gate is inside AnalyticsIndexConfig

@dai-chen

dai-chen commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Is all our IT CI broken? Just wonder how to verify these changes..

we should fix the CIs besides this PR but I verified locally of

Run AnalyticsUnsupportedFieldStripVerifyIT on a force-routed AE cluster with -Dtests.analytics.parquet_indices=true. It loads every Index dataset through the real load path and asserts per dataset: clean mapping, no stripped path left in _source, doc count == source doc count. It passes, and fails loudly/attributably if any dataset still ingests an unsupported field, so it's the durable proof artifact in the PR.

Also: unit tests 7/7 (nested paths + arrays of objects), and legacy route with no parquet flag 123 pass / 0 fail so nothing regresses when AE is off.

I thought our CI is failing on dependency after version bump. But it seems due to this single test:

2026-06-11T22:45:47.0923183Z Tests with failures:
2026-06-11T22:45:47.0924323Z  - org.opensearch.sql.sql.QueryValidationIT.testAliasToKeywordMultiFieldFailsWithBadRequest

@mengweieric

Copy link
Copy Markdown
Collaborator Author

Is all our IT CI broken? Just wonder how to verify these changes..

we should fix the CIs besides this PR but I verified locally of
Run AnalyticsUnsupportedFieldStripVerifyIT on a force-routed AE cluster with -Dtests.analytics.parquet_indices=true. It loads every Index dataset through the real load path and asserts per dataset: clean mapping, no stripped path left in _source, doc count == source doc count. It passes, and fails loudly/attributably if any dataset still ingests an unsupported field, so it's the durable proof artifact in the PR.
Also: unit tests 7/7 (nested paths + arrays of objects), and legacy route with no parquet flag 123 pass / 0 fail so nothing regresses when AE is off.

I thought our CI is failing on dependency after version bump. But it seems due to this single test:

2026-06-11T22:45:47.0923183Z Tests with failures:
2026-06-11T22:45:47.0924323Z  - org.opensearch.sql.sql.QueryValidationIT.testAliasToKeywordMultiFieldFailsWithBadRequest

yea fixed in this PR: #5545

…in build.gradle

Per review: keep all AE-route test exclusions in one place. Move the inline
Assume.assumeFalse(isAnalyticsParquetIndicesEnabled()) guard out of
CalciteAliasFieldAggregationIT into the gated analyticsEnabled exclude block in
integ-test/build.gradle, alongside the other doomed PPL ITs. The rationale
(raw-PUT alias index can't be created on the AE route) is carried into both the
build.gradle comment and a short note left at the test's init().

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit f649962

@Swiddis Swiddis merged commit 89d1040 into opensearch-project:main Jun 12, 2026
40 of 41 checks passed
ahkcs added a commit to ahkcs/sql that referenced this pull request Jun 12, 2026
Guard raw-document seeding in init() with isIndexExist: the parquet/composite
store is append-only on a same-_id PUT, so re-seeding test/test1 on every
@before inflated their row counts across the class. Pin SELECT-* and fields-
column order with an explicit | fields, since the analytics-engine route returns
columns in storage (alphabetical) order rather than mapping order. Branch the
tests that exercise analytics-route-incompatible behavior behind
isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double
literal arithmetic, multi-index integer/long schema merge, and the implicit
search-filter syntax.

Depends on opensearch-project#5541, which strips analytics-engine-unsupported field types at load
and unblocks index creation for this class.

Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips).
v2 route: 46/46 pass.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
ahkcs added a commit that referenced this pull request Jun 12, 2026
Guard raw-document seeding in init() with isIndexExist: the parquet/composite
store is append-only on a same-_id PUT, so re-seeding test/test1 on every
@before inflated their row counts across the class. Pin SELECT-* and fields-
column order with an explicit | fields, since the analytics-engine route returns
columns in storage (alphabetical) order rather than mapping order. Branch the
tests that exercise analytics-route-incompatible behavior behind
isAnalyticsParquetIndicesEnabled(): alias/nested field storage, decimal-vs-double
literal arithmetic, multi-index integer/long schema merge, and the implicit
search-filter syntax.

Depends on #5541, which strips analytics-engine-unsupported field types at load
and unblocks index creation for this class.

Analytics-engine route: 46 failing -> 0 failing (39 pass, 7 documented skips).
v2 route: 46/46 pass.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugFix PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants