
OPEN_STRUCT storage layer — columnar two-tier dense/sparse index (PR 2/4) #18643

Open
tarun11Mavani wants to merge 21 commits into apache:master from tarun11Mavani:open-struct-storage


@tarun11Mavani tarun11Mavani commented Jun 1, 2026

Summary

This is the second PR in a 4-part stack introducing DataType.OPEN_STRUCT. It adds the segment storage layer — the runtime that makes OPEN_STRUCT columns ingestible, sealable, and queryable. Without it, declaring dataType: OPEN_STRUCT in a schema does nothing.

RFC: https://docs.google.com/document/d/14kPmjDTKbO8l0ql4rrN7I5Yki5pqMw6GeGmxxc9grsU/edit?tab=t.0

PR 1 (SPI + data model): #18368

Architecture

OPEN_STRUCT columns use a two-tier columnar storage model:

  • Dense tier — keys with fill rate ≥ denseKeyMinFillRate (default 0.5) or explicitly listed in denseKeys are materialized as standard Pinot columns (<column>$<key>), each with its own forward index, dictionary, null vector, and optional per-key indexes (inverted, range, bloom).
  • Sparse tier — remaining keys are packed into a single JSON column (<column>$__sparse__).

Dense keys reuse the entire standard column infrastructure — no bespoke binary format. New index types (range, text, JSON) work on dense keys with zero custom code, and segment-level column pruning applies automatically.
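
As a rough illustration of the dense/sparse decision described above, here is a minimal sketch. It is not the PR's OpenStructColumnSplitter: the class and method names are hypothetical, and it assumes that explicit denseKeys are always dense and count toward maxDenseKeys, with remaining slots going to the highest fill-rate keys that meet the threshold.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only; names and tie-breaking rules are assumptions, not the PR's code. */
final class DenseKeyClassifierSketch {
  private DenseKeyClassifierSketch() {
  }

  /**
   * Returns the keys to materialize as dense columns; every other key would go to the sparse tier.
   *
   * @param docsWithKey number of docs in which each key was observed
   */
  static Set<String> classifyDense(Map<String, Integer> docsWithKey, int numDocs, double minFillRate,
      Set<String> explicitDenseKeys, int maxDenseKeys) {
    Set<String> dense = new LinkedHashSet<>();
    // Assumption: explicitly listed keys win first and count toward the cap.
    for (String key : explicitDenseKeys) {
      if (dense.size() < maxDenseKeys) {
        dense.add(key);
      }
    }
    List<Map.Entry<String, Integer>> candidates = new ArrayList<>(docsWithKey.entrySet());
    // Highest doc count first; ties broken by key name so the result is deterministic.
    candidates.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    for (Map.Entry<String, Integer> e : candidates) {
      if (dense.size() >= maxDenseKeys) {
        break;
      }
      double fillRate = numDocs == 0 ? 0.0 : (double) e.getValue() / numDocs;
      if (fillRate >= minFillRate) {
        dense.add(e.getKey());
      }
    }
    return dense;
  }
}
```

With 10 docs, a key seen in 5 of them sits exactly at the default 0.5 threshold and is classified dense; a key seen once stays sparse unless it is listed explicitly.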

Mutable (consuming) path

  • MutableOpenStructIndex — per-key in-memory state during consumption. Implements both MutableIndex and OpenStructIndexReader (dual-role: write-side index + read-side reader for queries against mutable segments). All observed keys are retained during consumption; dense/sparse classification is deferred to seal time, when fill rates are known.
  • MutableKeyColumn — per-key forward index + presence bitmap + type tracking. Three-level type resolution: declared child FieldSpec → value-based inference via OpenStructTypeInference.
  • MutableOpenStructDataSource — exposes per-key DataSources to the query layer via the OpenStructDataSource SPI. Supports getMapValue(docId) for row-major map reconstruction during seal.
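
To make the per-key layout concrete, the following is a simplified sketch of a per-key value store with a presence bitmap and a getMapValue(docId)-style reconstruction. It is an assumption-laden stand-in, not MutableKeyColumn itself: it uses plain Java collections instead of Pinot forward indexes, and all class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; the real classes use Pinot's mutable forward indexes. */
final class MutableKeyStoreSketch {
  /** One column per observed key: values by docId plus a presence bitmap. */
  private static final class KeyColumn {
    final List<Object> values = new ArrayList<>();
    final BitSet present = new BitSet();

    void set(int docId, Object value) {
      while (values.size() <= docId) {
        values.add(null); // pad docs in which this key was absent
      }
      values.set(docId, value);
      present.set(docId);
    }
  }

  private final Map<String, KeyColumn> columns = new LinkedHashMap<>();
  private int numDocs;

  /** Indexes one document; every observed key is retained, with no dense-key cap. */
  void add(Map<String, Object> row) {
    int docId = numDocs++;
    for (Map.Entry<String, Object> e : row.entrySet()) {
      columns.computeIfAbsent(e.getKey(), k -> new KeyColumn()).set(docId, e.getValue());
    }
  }

  /** Rebuilds the row-major map for one doc from the per-key columns. */
  Map<String, Object> getMapValue(int docId) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (Map.Entry<String, KeyColumn> e : columns.entrySet()) {
      if (e.getValue().present.get(docId)) {
        row.put(e.getKey(), e.getValue().values.get(docId));
      }
    }
    return row;
  }

  /** Fraction of docs holding the key, which is what a seal-time classifier would consume. */
  double fillRate(String key) {
    KeyColumn column = columns.get(key);
    if (column == null || numDocs == 0) {
      return 0.0;
    }
    return (double) column.present.cardinality() / numDocs;
  }
}
```

The presence bitmap is what lets the reconstruction distinguish "key absent in this doc" from "key present with a default-looking value".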

Seal / offline path

  • OpenStructColumnSplitter (~550 LOC) — classifies keys into dense vs sparse by fill rate (plus explicit denseKeys and maxDenseKeys), writes each dense key as a standard materialized column via the standard index creator pipeline (ForwardIndexCreator, DictionaryBasedInvertedIndexCreator, etc.), and packs the rest into a single sparse JSON column. Emits per-child and parent column metadata. Dense-key indexes are driven by a generic creator loop over IndexService.getAllIndexes() filtered to a vetted allowlist (inverted, range, bloom), with each index type's validate() run at build time against the resolved child FieldSpec.
  • FieldIndexConfigsUtil.fromFieldConfig — new additive method that builds FieldIndexConfigs from a single FieldConfig without requiring a TableConfig/Schema, enabling per-key index resolution for synthetic materialized children.
  • BaseSegmentCreator — merges the splitter's per-child metadata into the segment properties so each child loads as its own column. Registers splitter child columns in the DIMENSIONS property so the V3 converter discovers their index files.
  • Stats collector reuse — dense-key statistics (cardinality, min/max, distinct values, element lengths) are computed via the standard AbstractColumnStatisticsCollector family rather than hand-rolled tracking. Added a FieldSpec-based constructor to the base collector + 7 scalar collectors so they can be used without a StatsCollectorConfig/TableConfig.

Immutable (sealed) path

  • ImmutableSegmentImpl — single-pass post-load grouping. Scans column metadata for parentColumn, groups child columns by parent, and builds one ImmutableOpenStructDataSource per parent OPEN_STRUCT field. Child columns are removed from _dataSourceMap and are accessible only via the parent's getDataSource(key).
  • ImmutableSegmentLoader — exempts materialized child columns from schema-vs-segment reconciliation (they exist in the segment but not in the user-facing schema).
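
The single-pass grouping can be pictured roughly as below. This is a sketch under assumptions: column metadata is reduced to a plain column-name to parentColumn map (null for ordinary columns), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; real column metadata carries far more than a parent name. */
final class ChildGroupingSketch {
  private ChildGroupingSketch() {
  }

  /**
   * Groups child columns under their parent in one pass.
   *
   * @param columnToParent column name mapped to its parentColumn metadata value, or null if it has none
   */
  static Map<String, List<String>> groupByParent(Map<String, String> columnToParent) {
    Map<String, List<String>> childrenByParent = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : columnToParent.entrySet()) {
      String parent = e.getValue();
      if (parent != null) {
        childrenByParent.computeIfAbsent(parent, p -> new ArrayList<>()).add(e.getKey());
      }
    }
    return childrenByParent;
  }
}
```

Each resulting entry would correspond to one parent-level data source, with the listed children reachable only through it.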

Ingestion pipeline wiring

  • PinotDataType.getPinotDataTypeForIngestion — OPEN_STRUCT case returns MAP so the Map value passes through the transform pipeline unchanged.
  • StatsCollectorUtil.createStatsCollector — OPEN_STRUCT case uses MapColumnPreIndexStatsCollector; excluded from NoDictCollector optimization.
  • ForwardIndexType.shouldCreateIndex — returns false for OPEN_STRUCT parent columns (no forward index on the parent; data lives in materialized children).
  • PinotSegmentRecordReader — detects OPEN_STRUCT parents (no forward index), tracks their OpenStructDataSource, and reconstructs per-doc Map<String, Object> via getMapValue() for the row-major seal path.
  • SegmentColumnarIndexCreator — column-major build path reads from OpenStructDataSource.getMapValue() instead of attempting a PinotSegmentColumnReader on the parent.
  • IndexLoadingConfig.addOpenStructChildConfigs — resolves per-key FieldConfig from the parent's OpenStructIndexConfig and injects it into the config map before SegmentPreProcessor handlers run, preventing InvertedIndexHandler from stripping per-key inverted indexes.

Wiring / validation

  • OpenStructIndexType + OpenStructIndexPlugin — register the index via @AutoService(IndexPlugin.class). Reader factory is a no-op (children load as standard columns). Index handler returns needUpdateIndices() = false to prevent data corruption from handler-path index updates on OPEN_STRUCT parent columns.
  • OpenStructSupportedIndexes — vetted allowlist of per-key index types (inverted, range, bloom) with table-config-time validation.
  • TableConfigUtils — rejects user columns containing the reserved $ separator when any field is OPEN_STRUCT.
  • RealtimeSegmentStatsContainer — returns EmptyColumnStatistics for the OPEN_STRUCT parent.
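
The reserved-separator rule is simple enough to sketch. The version below is an assumption about the shape of the check, not the TableConfigUtils code: it takes the column names and a flag saying whether the schema contains an OPEN_STRUCT field, and the exception type is chosen arbitrarily.

```java
import java.util.List;

/** Illustrative only; the exception type and message are assumptions. */
final class ReservedSeparatorCheckSketch {
  static final char SEPARATOR = '$';

  private ReservedSeparatorCheckSketch() {
  }

  /** Rejects column names containing '$', but only when the schema has an OPEN_STRUCT field. */
  static void validate(List<String> columnNames, boolean hasOpenStructField) {
    if (!hasOpenStructField) {
      return; // schemas without OPEN_STRUCT keep their existing naming freedom
    }
    for (String name : columnNames) {
      if (name.indexOf(SEPARATOR) >= 0) {
        throw new IllegalStateException(
            "Column '" + name + "' contains reserved separator '" + SEPARATOR + "'");
      }
    }
  }
}
```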

pinot-spi changes

  • OpenStructTypeInference — value→DataType inference extracted from OpenStructNaming into its own single-responsibility class.
  • OpenStructNaming — added isMaterializedChildName and parentColumnName parser helpers for name-based child column detection.
  • V1Constants: HAS_SPARSE_COLUMN metadata key for parent column metadata.
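
For readers unfamiliar with the naming scheme, here is a hedged sketch of name construction and parsing. The method names isMaterializedChildName and parentColumnName come from this PR, but their signatures and exact semantics here are assumptions. In particular, this sketch splits on the first `$` and treats the sparse column as a child too, which may differ from the real helper.

```java
/** Illustrative only; not the pinot-spi OpenStructNaming class. */
final class OpenStructNamingSketch {
  static final String SEPARATOR = "$";
  static final String SPARSE_SUFFIX = "__sparse__";

  private OpenStructNamingSketch() {
  }

  /** Dense child name: <column>$<key>. */
  static String materializedName(String parent, String key) {
    return parent + SEPARATOR + key;
  }

  /** Sparse child name: <column>$__sparse__. */
  static String sparseName(String parent) {
    return parent + SEPARATOR + SPARSE_SUFFIX;
  }

  /** True when the name has a non-empty parent and a non-empty suffix around the first '$'. */
  static boolean isMaterializedChildName(String column) {
    int idx = column.indexOf(SEPARATOR);
    return idx > 0 && idx < column.length() - 1;
  }

  /** Returns the parent part; safe because user columns may not contain '$' in OPEN_STRUCT schemas. */
  static String parentColumnName(String column) {
    if (!isMaterializedChildName(column)) {
      throw new IllegalArgumentException("Not a materialized child name: " + column);
    }
    return column.substring(0, column.indexOf(SEPARATOR));
  }
}
```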

Backward compatibility

  • Existing DataType.MAP segments and schemas are untouched. Zero behavior change for MAP.
  • OPEN_STRUCT is opt-in at schema authoring time. Clusters with no OPEN_STRUCT schemas see no behavior change.
  • The $ column name restriction applies only to schemas containing at least one OPEN_STRUCT field.

Follow-up PRs

  • PR 3 — Query engine (pinot-core): OpenStructFilterOperator, ItemTransformFunction typing path, AggregationPlanNode, GroupByPlanNode.
  • PR 4 — Integration tests and end-to-end validation.

Test plan

  • OpenStructColumnSplitterTest — 18 tests covering dense/sparse classification, per-type materialization (INT, LONG, FLOAT, DOUBLE, STRING, BYTES, BIG_DECIMAL, BOOLEAN, TIMESTAMP), dictionary vs raw encoding, inverted/range/bloom index creation, absent-doc defaults, explicit denseKeys, maxDenseKeys cap, sparse JSON round-trip, BIG_DECIMAL scale-distinct values
  • OpenStructSegmentCreationTest — end-to-end segment creation with V1/V3 format conversion and index map verification
  • MutableOpenStructIndexTest — consume-time key tracking, type inference, per-key forward index reads, getMapValue() reconstruction, all-keys-retained semantics
  • MutableOpenStructDataSourceTest — per-key DataSource exposure, isMaterialized/isFullyMaterialized semantics
  • ImmutableOpenStructDataSourceTest — post-load child grouping, dense/sparse/mixed materialization status, getMapValue() for immutable segments
  • OpenStructIndexTypeTest — index registration, no-op handler, config extraction, per-key index validation
  • StatsCollectorUtilFieldSpecTest — FieldSpec-based collector factory for all scalar types
  • FieldIndexConfigsUtilTest: fromFieldConfig with dictionary/raw encoding, inverted/range/bloom indexes, default fallback
  • OpenStructTypeInferenceTest — value→DataType mapping for all input types including widening (Byte/Short→INT), folding (Date/UUID→STRING), and unrepresentable (Map, bare Object→null)
  • OpenStructNamingTest — materialized/sparse column naming, isMaterializedChildName, parentColumnName parsing
  • OpenStructIngestionCommitRealtimeTest — realtime e2e: Kafka consume, forceCommit, COUNT(*), dense/sparse child column presence and types, per-key forward/dictionary/inverted index verification, parent/sparse metadata properties
  • OpenStructIngestionCommitOfflineTest — offline e2e: row-major build path (PinotSegmentRecordReader → SegmentColumnarIndexCreator) with the same per-key index matrix and dense/sparse validation
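
The value→DataType policy exercised by OpenStructTypeInferenceTest can be approximated as follows. This is a sketch, not the real OpenStructTypeInference: the enum is a local stand-in for Pinot's DataType, only the widening and folding inputs named in this description are shown, and mapping java.sql.Timestamp to TIMESTAMP is an assumption.

```java
import java.math.BigDecimal;
import java.sql.Timestamp;
import java.util.Date;
import java.util.UUID;

/** Illustrative only; the real class delegates to PinotDataType.getSingleValueType. */
final class TypeInferenceSketch {
  /** Local stand-in for Pinot's DataType enum. */
  enum DataType { INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, BOOLEAN, TIMESTAMP, STRING, BYTES }

  private TypeInferenceSketch() {
  }

  /** Returns the inferred type, or null when the value is not representable as a scalar column. */
  static DataType infer(Object value) {
    if (value instanceof Integer || value instanceof Byte || value instanceof Short) {
      return DataType.INT; // Byte and Short widen to INT
    }
    if (value instanceof Long) {
      return DataType.LONG;
    }
    if (value instanceof Float) {
      return DataType.FLOAT;
    }
    if (value instanceof Double) {
      return DataType.DOUBLE;
    }
    if (value instanceof BigDecimal) {
      return DataType.BIG_DECIMAL;
    }
    if (value instanceof Boolean) {
      return DataType.BOOLEAN;
    }
    if (value instanceof Timestamp) {
      return DataType.TIMESTAMP; // checked before Date because Timestamp extends Date (assumed mapping)
    }
    if (value instanceof String || value instanceof Date || value instanceof UUID) {
      return DataType.STRING; // Date and UUID fold to STRING
    }
    if (value instanceof byte[]) {
      return DataType.BYTES;
    }
    return null; // Map, bare Object, and anything else unrepresentable
  }
}
```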

@tarun11Mavani tarun11Mavani force-pushed the open-struct-storage branch from 7f254e4 to d0e436f on June 1, 2026 05:26

codecov-commenter commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 69.13265% with 242 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.49%. Comparing base (30b3b31) to head (6ccb988).
⚠️ Report is 28 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...gment/index/openstruct/MutableOpenStructIndex.java | 61.70% | 28 Missing and 8 partials ⚠️ |
| ...ator/impl/openstruct/OpenStructColumnSplitter.java | 89.29% | 14 Missing and 15 partials ⚠️ |
| ...local/segment/index/loader/IndexLoadingConfig.java | 17.24% | 22 Missing and 2 partials ⚠️ |
| ...local/indexsegment/mutable/MutableSegmentImpl.java | 13.04% | 15 Missing and 5 partials ⚠️ |
| ...ment/creator/impl/SegmentColumnarIndexCreator.java | 4.76% | 19 Missing and 1 partial ⚠️ |
| ...l/indexsegment/immutable/ImmutableSegmentImpl.java | 25.00% | 16 Missing and 2 partials ⚠️ |
| .../index/openstruct/MutableOpenStructDataSource.java | 56.09% | 16 Missing and 2 partials ⚠️ |
| ...cal/segment/index/openstruct/MutableKeyColumn.java | 60.00% | 16 Missing ⚠️ |
| ...ocal/segment/readers/PinotSegmentRecordReader.java | 45.00% | 8 Missing and 3 partials ⚠️ |
| .../segment/index/openstruct/OpenStructIndexType.java | 76.74% | 6 Missing and 4 partials ⚠️ |

... and 13 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18643      +/-   ##
============================================
+ Coverage     64.41%   64.49%   +0.08%     
- Complexity     1282     1291       +9     
============================================
  Files          3362     3381      +19     
  Lines        207907   209304    +1397     
  Branches      32463    32714     +251     
============================================
+ Hits         133923   135000    +1077     
- Misses        63221    63450     +229     
- Partials      10763    10854      +91     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (?) |
| java-21 | 64.49% <69.13%> (+0.08%) ⬆️ |
| temurin | 64.49% <69.13%> (+0.08%) ⬆️ |
| unittests | 64.49% <69.13%> (+0.08%) ⬆️ |
| unittests1 | 56.66% <14.28%> (-0.14%) ⬇️ |
| unittests2 | 37.23% <64.03%> (+0.07%) ⬆️ |


…x (PR 2/4)

Storage layer for the OPEN_STRUCT column type: a self-describing column
whose keys are discovered at ingest time and stored columnar in two tiers.

Mutable (consuming) path:
- MutableOpenStructIndex / MutableKeyColumn: per-key dictionary-encoded
  forward index + presence bitmap, with 3-level type resolution (declared
  child FieldSpec, else value-based inference) and maxDenseKeys capping.
- MutableOpenStructDataSource exposes per-key DataSources to the query layer
  via the OpenStructDataSource SPI; implemented as a MutableIndex.

Seal / offline path:
- OpenStructColumnSplitter classifies keys into dense vs sparse by fill rate
  (plus explicit denseKeys and maxDenseKeys), writes each dense key as a
  standard materialized column (col$key) with optional dictionary/inverted/
  null-vector, and packs the rest into a single sparse JSON column
  (col$__sparse__). Emits per-child and parent column metadata.
- BaseSegmentCreator merges the splitter's per-child metadata into the
  segment properties so each child loads as its own column.

Immutable (sealed) path:
- ImmutableSegmentImpl groups materialized children (parentColumn metadata)
  under an ImmutableOpenStructDataSource in a single pass over column metadata.
- ImmutableSegmentLoader keeps materialized children that are absent from the
  user-facing schema.

Wiring / validation:
- OpenStructIndexType + OpenStructIndexPlugin register the index; reader
  factory is a no-op (children load as standard columns).
- TableConfigUtils rejects user columns containing the reserved '$' separator
  when any field is OPEN_STRUCT.
- RealtimeSegmentStatsContainer returns EmptyColumnStatistics for the parent.
- V1Constants: PARENT_COLUMN, HAS_SPARSE_COLUMN; OpenStructNaming helpers
  including shared value->DataType inference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tarun11Mavani tarun11Mavani force-pushed the open-struct-storage branch from c6a3db5 to 72cf061 on June 1, 2026 08:29
tarun11Mavani and others added 19 commits June 1, 2026 09:32
A BIG_DECIMAL-valued OPEN_STRUCT key aborted segment seal: BIG_DECIMAL is
its own stored type (unlike BOOLEAN->INT, TIMESTAMP->LONG), so the splitter's
type switches hit `default: throw`. It is reachable via inferDataType for any
java.math.BigDecimal or an explicit child FieldSpec.

Add BIG_DECIMAL as a variable-length stored type alongside STRING/BYTES in
OpenStructColumnSplitter: getDefaultValue (BigDecimal.ZERO), dictionary build,
raw var-byte forward index (putBigDecimal), and dictionary-vs-raw sizing.
Dictionary dedup uses BigDecimal.equals (matching SegmentDictionaryCreator's
equals-keyed indexOfSV), not compareTo, so scale-differing equal values
(1.0 vs 1.00) stay distinct instead of silently resolving to dict id 0.
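
The equals-versus-compareTo distinction is standard java.math.BigDecimal behavior and can be shown in isolation. The sketch below uses a hypothetical map-backed dictionary builder, not SegmentDictionaryCreator, to show why an equals-keyed lookup keeps 1.0 and 1.00 as separate entries.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; a stand-in for an equals-keyed dictionary. */
final class BigDecimalDictSketch {
  private BigDecimalDictSketch() {
  }

  /** Assigns dictionary ids in first-seen order, keyed by equals(), so scale-differing values stay distinct. */
  static Map<BigDecimal, Integer> buildDictionary(BigDecimal... values) {
    Map<BigDecimal, Integer> ids = new LinkedHashMap<>();
    for (BigDecimal v : values) {
      ids.putIfAbsent(v, ids.size());
    }
    return ids;
  }
}
```

A dedup step based on compareTo (for example a TreeSet) would collapse the two values into one entry, and a later equals-keyed id lookup for the dropped value would then miss.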

The realtime mutable path and segment min/max metadata already handle
BIG_DECIMAL and are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tisticsCollector

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… stats collector

Treats absent docs as null docs holding the default value (standard model). For keys
with absent docs, CARDINALITY now counts the default and MIN/MAX include it. Intended.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MetadataInfo

Replaces the hand-written emitVirtualColumnMetadata with the standard metadata
writer plus the OPEN_STRUCT-specific keys (PARENT_COLUMN, hasNullValue,
hasInvertedIndex). The shared writer additionally emits standard keys the old
code omitted, producing a superset of the previous metadata.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…istinctValuesPerKey

Replaces the _distinctValuesPerKey distinct-string-set tracking with the sealed
stats collector's cardinality and longest-element length. _totalRawBytesPerKey is
retained for the var-length raw-size estimate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dictElementSize is already 0 on the raw path (only assigned when a dictionary
creator exists), so the conditional alias was a no-op. Pass it directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reators

Drives each dense materialized child's forward, dictionary-id forward, and inverted index
through the standard ForwardIndexCreator / DictionaryBasedInvertedIndexCreator obtained from
StandardIndexes, using an IndexCreationContext built from the sealed stats collector (no
TableConfig). The dict-vs-raw decision now mirrors BaseSegmentCreator.createDictionaryForColumn
(standard default flags), and absent docs store the Pinot dimension null default. Deletes
writeRawForwardIndex, shouldUseDictionary, getDefaultValue, and _totalRawBytesPerKey.

Behavior changes (intended, unreleased feature): absent-doc defaults are now dimension nulls
(STRING "null", INT MIN_VALUE, ...); the local size-ratio dict downgrade is removed.
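
To illustrate what storing dimension null defaults for absent docs means for column statistics, here is a small INT-only sketch. The helper names are hypothetical; the only Pinot-specific fact used is the INT default of Integer.MIN_VALUE stated above.

```java
import java.util.Arrays;

/** Illustrative only; INT columns only, with null marking an absent doc in the input. */
final class AbsentDocDefaultsSketch {
  static final int INT_NULL_DEFAULT = Integer.MIN_VALUE;

  private AbsentDocDefaultsSketch() {
  }

  /** Produces a dense column where absent docs hold the INT dimension null default. */
  static int[] fillAbsent(Integer[] valuesByDocId) {
    int[] column = new int[valuesByDocId.length];
    for (int docId = 0; docId < valuesByDocId.length; docId++) {
      column[docId] = valuesByDocId[docId] == null ? INT_NULL_DEFAULT : valuesByDocId[docId];
    }
    return column;
  }

  /** Cardinality over the stored values, so the default counts as one distinct value. */
  static int cardinality(int[] column) {
    return (int) Arrays.stream(column).distinct().count();
  }

  /** Minimum over the stored values; requires a non-empty column. */
  static int min(int[] column) {
    return Arrays.stream(column).min().orElseThrow(IllegalArgumentException::new);
  }
}
```

For a key present in three of four docs with values 5, 7, 5, the stored column has cardinality 3 (5, 7, and the default) and its minimum is the default rather than 5.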

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…riteDenseKeyColumn

Split the ~144-line writeDenseKeyColumn into a focused orchestrator plus two
private helpers, mirroring how BaseSegmentCreator decomposes column creation:

- resolveUseDictionary(...): the three-step dict-vs-raw decision.
- writeForwardAndInvertedIndexes(...): dictionary + forward + inverted creation,
  the per-doc add loop, and the nested resource handling; returns the dictionary
  element size for metadata.

Behavior-preserving — no logic, ordering, or resource-management change. Covered
by the existing OpenStructColumnSplitterTest (16 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o OpenStructTypeInference

inferDataType performs value->DataType inference, an orthogonal concern to
OpenStructNaming's column-name string mapping (it uses none of the naming
constants). Relocate it to a dedicated OpenStructTypeInference class in the same
package so each class has a single responsibility; update the two callers.

Behavior-preserving — the method body is moved verbatim. The core Object
classification still delegates to PinotDataType.getSingleValueType; the
remaining PinotDataType->DataType switch is OPEN_STRUCT-specific policy that
exists nowhere else.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…taType

Data-driven test locking the value->DataType mapping policy now that it lives in
its own class: each return branch (INT via all four widening inputs, LONG, FLOAT,
DOUBLE, BIG_DECIMAL, BOOLEAN, TIMESTAMP, STRING via all four folding inputs,
BYTES) plus the unrepresentable cases (Map, bare Object) returning null.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ldConfig, no TableConfig)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…at table-config time

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (inverted/range/bloom)

Drive per-key index creation from IndexService#getAllIndexes (filtered to the OPEN_STRUCT
allowlist), building the dictionary separately and reconciling forward/dictionary encoding
with the dictionary decision. Each vetted index type's validate() runs against the resolved
child FieldSpec at build time so misconfigurations (e.g. range on a non-numeric raw key)
fail with the canonical message instead of crashing inside the creator.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…exCreator failures

dictCreator.seal() was positioned after the inner try/finally for index
creators. If any IndexCreator.seal() threw, the dictionary file was left
unsealed on disk. Move seal into the inner finally so it runs regardless
of index creator success/failure.
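
The ordering fix boils down to a try/finally pattern, sketched here with a hypothetical Sealable interface standing in for the dictionary and index creators.

```java
import java.util.List;

/** Illustrative only; Sealable is a hypothetical stand-in for Pinot's creator types. */
final class SealOrderingSketch {
  /** Minimal creator abstraction with a seal step that may fail. */
  interface Sealable {
    void seal() throws Exception;
  }

  private SealOrderingSketch() {
  }

  /** Seals the index creators, then always seals the dictionary, even if an index creator throws. */
  static void sealAll(List<Sealable> indexCreators, Sealable dictCreator) throws Exception {
    try {
      for (Sealable creator : indexCreators) {
        creator.seal();
      }
    } finally {
      dictCreator.seal(); // runs on both the success and failure paths
    }
  }
}
```

One caveat of this pattern in plain Java: if the dictionary seal also throws, its exception replaces the one raised by the index creator.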

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…build pipeline

The OPEN_STRUCT storage/creator layer was implemented but the offline
batch ingestion pipeline (SegmentIndexCreationDriverImpl) was never made
OPEN_STRUCT-aware, causing segment builds to throw.

Four gaps fixed:
- PinotDataType.getPinotDataTypeForIngestion: add OPEN_STRUCT case
  returning MAP so the Map value passes through the transform pipeline
  unchanged (unblocks both offline and realtime ingestion).
- StatsCollectorUtil.createStatsCollector: add OPEN_STRUCT case using
  MapColumnPreIndexStatsCollector and exclude from NoDictCollector opt.
- ForwardIndexType.shouldCreateIndex: return false for OPEN_STRUCT
  parent — it has no forward index (mirroring MutableSegmentImpl).
- BaseSegmentCreator.writeMetadata: register splitter child columns in
  the DIMENSIONS property so V3 converter discovers their index files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix three gaps in the realtime OPEN_STRUCT consume→seal path:

1. MutableOpenStructIndex: remove consume-time dense-key cap that
   incorrectly dropped keys by arrival order. All observed keys are
   now retained; dense/sparse classification deferred to seal time
   via OpenStructColumnSplitter.classify().

2. PinotSegmentRecordReader: OPEN_STRUCT parent columns have no
   forward index, so PinotSegmentColumnReader crashes on them. Track
   OpenStructDataSource columns separately and reconstruct per-doc
   map values via getMapValue(docId) for the row-major seal path.

3. SegmentColumnarIndexCreator: the column-major build path (used by
   RealtimeSegmentConverter when columnMajorSegmentBuilder is enabled)
   also tried to create PinotSegmentColumnReader for OPEN_STRUCT
   parents. Read from OpenStructDataSource.getMapValue(docId) instead,
   feeding the map values to the OpenStructColumnSplitter.

Add OpenStructDataSource.getMapValue() as a default SPI method (throws
UnsupportedOperationException), implemented by MutableOpenStructDataSource
which reconstructs the map from per-key MutableKeyColumns.

Add realtime e2e integration test (OpenStructIngestionCommitRealtimeTest)
that validates: Kafka consume, forceCommit, COUNT(*), dense/sparse child
column presence and types, per-key forward/dictionary index_map, and
parent/sparse metadata properties.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eProcessor

The SegmentPreProcessor's InvertedIndexHandler was stripping inverted indexes
from OPEN_STRUCT child columns because IndexLoadingConfig only knew about
schema-level columns. Add addOpenStructChildConfigs() to resolve per-key
FieldConfig from the parent's OpenStructIndexConfig and inject it into the
config map before handlers run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exercises the row-major offline build path (PinotSegmentRecordReader →
SegmentColumnarIndexCreator) with the same per-key index matrix and
dense/sparse validation as the realtime variant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tarun11Mavani tarun11Mavani marked this pull request as ready for review June 4, 2026 08:27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tarun11Mavani

@xiangfu0 @raghavyadav01 please take a look.

@tarun11Mavani tarun11Mavani changed the title from "[WIP] OPEN_STRUCT storage layer — columnar two-tier dense/sparse index (PR 2/4)" to "OPEN_STRUCT storage layer — columnar two-tier dense/sparse index (PR 2/4)" on Jun 4, 2026