Read Opt. #754

Open
ColinLeeo wants to merge 1 commit into develop from read_opt

@ColinLeeo ColinLeeo commented Mar 27, 2026

TsFile C++ Read Path Performance Optimization — Overview

Background

The current TsFile C++ read path decodes row by row and exposes a row-oriented result set API. In full-scan and filtered query scenarios, throughput can fall behind Parquet+Arrow: in the TIME_FILTER benchmark below, the row-based baseline reaches ~4.5M rows/s against ~6.4M rows/s for Parquet+Arrow. This optimization aims to make TsFile batch read throughput significantly exceed Parquet+Arrow while maintaining interface compatibility.

Summary of Optimizations

The optimizations span four layers:

1. Batch Decode Infrastructure

  • Added read_batch_int32/int64/float/double and skip_* batch interfaces to Decoder (PLAIN / TS2DIFF / Gorilla), processing 129 values per call instead of paying one virtual dispatch per value.
  • Added satisfy_batch_time batch filter interface to Filter, evaluating an entire batch of timestamps at once.
  • Eliminated intermediate stack buffer copies in TS2DIFF batch decode — reads directly from the wrapped ByteStream buffer pointer.
  • PLAIN batch decode now uses __builtin_bswap64/32 (compiles to a single ARM REV instruction) and skips the read_buf intermediate copy.
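
As an illustration of the PLAIN batch decode idea, here is a minimal sketch. It assumes values are stored big-endian, the host is little-endian, and the compiler is GCC or Clang. `plain_read_batch_int64` is a hypothetical free function written for this example, not the Decoder method added by the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not the PR's API: decodes up to `max_count` big-endian
// int64 values from `src` (which holds `src_len` bytes) into `out`.
// Returns the number of values decoded. Assumes a little-endian host and a
// GCC/Clang compiler (for __builtin_bswap64).
inline size_t plain_read_batch_int64(const uint8_t* src, size_t src_len,
                                     int64_t* out, size_t max_count) {
    assert(src != nullptr && out != nullptr);
    size_t available = src_len / sizeof(uint64_t);
    size_t n = available < max_count ? available : max_count;
    for (size_t i = 0; i < n; ++i) {
        uint64_t raw;
        // memcpy avoids an unaligned load; compilers fold it into one move.
        std::memcpy(&raw, src + i * sizeof(uint64_t), sizeof(raw));
        out[i] = static_cast<int64_t>(__builtin_bswap64(raw));
    }
    return n;
}
```

One call handles the whole batch, so any virtual dispatch on the decoder is paid once per batch rather than once per value.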

2. Single-Column Batch Read Path

  • Added DECODE_TV_BATCH method in ChunkReader / AlignedChunkReader: decodes time + value in batches of 129 rows, applies batch filter, and writes results into TsBlock.
  • SingleDeviceTsBlockReader adapted to the batch path, supporting get_next_tsblock to return TsBlock directly to the user.
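
The decode, filter, append flow described above can be sketched roughly as follows. `TimeRange`, `satisfy_batch`, and `append_selected` are hypothetical names chosen for this example, and plain `std::vector` columns stand in for TsBlock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical closed time range [lo, hi]; stands in for the PR's Filter.
struct TimeRange {
    int64_t lo;
    int64_t hi;
};

// Evaluates the filter for a whole batch, writing 1/0 per row into `mask`.
// Returns the number of rows that pass.
inline size_t satisfy_batch(const TimeRange& f, const int64_t* times,
                            size_t n, uint8_t* mask) {
    size_t pass = 0;
    for (size_t i = 0; i < n; ++i) {
        mask[i] = (times[i] >= f.lo && times[i] <= f.hi) ? 1 : 0;
        pass += mask[i];
    }
    return pass;
}

// Appends the rows selected by `mask` to the output columns.
inline void append_selected(const int64_t* times, const double* values,
                            const uint8_t* mask, size_t n,
                            std::vector<int64_t>& out_times,
                            std::vector<double>& out_values) {
    assert(out_times.size() == out_values.size());
    for (size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            out_times.push_back(times[i]);
            out_values.push_back(values[i]);
        }
    }
}
```

Producing a mask first keeps the filter loop branch-light and lets later stages reuse the same selection.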

3. Multi-Value Column Merged Read

  • Introduced MultiAlignedTimeseriesIndex to allow a single AlignedChunkReader to hold 1 time column + N value columns simultaneously.
  • Time column is decoded only once; N value columns share the decoded timestamps and filter mask.
  • VectorMeasurementColumnContext wraps a multi-value SSI; SingleDeviceTsBlockReader automatically detects and merges multiple measurements within the same device.
  • Fixed double-delete bug in SingleDeviceTsBlockReader::close() where multiple map entries pointed to the same VectorMeasurementColumnContext.
  • Fixed per-column buffer size tracking in get_cur_page_header (previously shared file_data_value_buf_size_ caused heap-buffer-overflow when columns had different page sizes).
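
The double-delete fix can be pictured with the sketch below: when several map entries alias one context, collect the distinct pointers first and delete each once. `ColumnContext` and `close_contexts` are hypothetical stand-ins; the PR's actual close() logic may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for a column context shared by several measurements.
struct ColumnContext {
    static int live;  // number of contexts currently alive
    ColumnContext() { ++live; }
    ~ColumnContext() {
        assert(live > 0);  // would fire if bookkeeping went negative
        --live;
    }
};
int ColumnContext::live = 0;

// Deletes every distinct context exactly once, even when several map
// entries alias the same pointer, then clears the map.
// Returns the number of distinct contexts freed.
inline size_t close_contexts(std::map<std::string, ColumnContext*>& by_name) {
    std::set<ColumnContext*> unique;
    for (std::map<std::string, ColumnContext*>::iterator it = by_name.begin();
         it != by_name.end(); ++it) {
        if (it->second != nullptr) unique.insert(it->second);
    }
    for (std::set<ColumnContext*>::iterator it = unique.begin();
         it != unique.end(); ++it) {
        delete *it;
    }
    by_name.clear();
    return unique.size();
}
```

Shared ownership via `std::shared_ptr` would be another option; the set-based approach keeps raw pointers, which may suit an existing C++11 codebase better.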

4. Parallel Decode + Batch Append Fast Path

  • Introduced DecodeThreadPool for page-level parallel decompression of N value columns (Snappy decompress in parallel).
  • In the scatter phase of multi_DECODE_TV_BATCH, when all rows pass the filter and no column has nulls, the per-row row_appender.append() loop is bypassed — each column's decoded batch is written to the Vector buffer in a single memcpy.
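
A rough sketch of the batch append fast path, under the assumption that the filter result is available as a 0/1 mask plus a pass count. `append_batch` is a hypothetical helper, and null handling on the slow path is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies a decoded batch into a column buffer.
// `mask` holds 1 for rows that passed the filter; `pass_count` is the number
// of ones in it. `has_nulls` reports whether the column batch contains nulls.
// Returns the number of rows written to `dst`.
inline size_t append_batch(const int64_t* decoded, const uint8_t* mask,
                           size_t n, size_t pass_count, bool has_nulls,
                           int64_t* dst) {
    assert(pass_count <= n);
    if (pass_count == n && !has_nulls) {
        // Fast path: every row is kept, so one contiguous copy suffices.
        std::memcpy(dst, decoded, n * sizeof(int64_t));
        return n;
    }
    // Slow path: scatter only the selected rows, one at a time.
    // (A real implementation would also record nulls here.)
    size_t w = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mask[i]) dst[w++] = decoded[i];
    }
    return w;
}
```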

Test Dataset

| Parameter | Value |
|---|---|
| Table | bench_table |
| Devices | 10 |
| Total rows | 1,000,000 (100,000 per device) |
| Columns | time, id1(TAG), id2(TAG), s1(INT64), s2(DOUBLE), s3(FLOAT), s4(INT32) |
| Encoding | Time: TS2DIFF, Values: PLAIN |
| Compression | Snappy |
| Platform | macOS ARM64 (Apple Silicon), clang -O3 |

Benchmark Results

TAG_FILTER — filter by device id, read 100,000 rows × 4 value columns from a single device:

| Read Mode | Throughput | vs Baseline |
|---|---|---|
| TsFile (row, pre-optimization baseline) | ~4.5M rows/s | 1.0x |
| TsFile (batch, single-column) | ~9.5M rows/s | 2.1x |
| TsFile (batch, multi-value + parallel + batch append) | ~21M rows/s | 4.7x |
| Parquet+Arrow | ~1.7M rows/s | |

TIME_FILTER — filter by time range, read 333,333 rows × 4 value columns across all devices:

| Read Mode | Throughput | vs Baseline |
|---|---|---|
| TsFile (row, pre-optimization baseline) | ~4.5M rows/s | 1.0x |
| TsFile (batch, single-column) | ~9.2M rows/s | 2.0x |
| TsFile (batch, multi-value + parallel + batch append) | ~19.5M rows/s | 4.3x |
| Parquet+Arrow | ~6.4M rows/s | |

Phase Timing Breakdown (Post-Optimization)

Instrumented timing of each phase within multi_DECODE_TV_BATCH:

| Phase | % of Total | Description |
|---|---|---|
| Time decode (TS2DIFF) | ~5% | 128-value block bit-unpacking + prefix sum |
| Filter + value decode (PLAIN bswap) | ~95% | Batch time filter + 4-column byte-swap decode |
| Scatter (write to TsBlock) | ~0% | Eliminated by batch append fast path |
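
To make the "prefix sum" step of the time decode concrete, here is a small sketch under a simplified, assumed block layout: a first value, a minimum delta, and per-row offsets that have already been bit-unpacked. This is not the exact TS2DIFF wire format, and `delta_prefix_sum` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified, assumed block layout (not the exact TS2DIFF wire format):
// `first` is the first value, and each stored offset is (delta - min_delta),
// already bit-unpacked into `offsets`. Writes n + 1 values into `out`.
inline void delta_prefix_sum(int64_t first, int64_t min_delta,
                             const uint64_t* offsets, size_t n,
                             int64_t* out) {
    assert(offsets != nullptr && out != nullptr);
    out[0] = first;
    for (size_t i = 0; i < n; ++i) {
        int64_t delta = min_delta + static_cast<int64_t>(offsets[i]);
        out[i + 1] = out[i] + delta;  // running (prefix) sum of deltas
    }
}
```

For example, with a first value of 1000, a minimum delta of 10, and offsets {0, 5, 0}, the reconstructed timestamps are 1000, 1010, 1025, 1035.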

PR Plan

Split into 5 PRs, merged in dependency order:

PR 1: Batch Decode Infrastructure
│     decoder.h, plain_decoder.h, ts2diff_decoder.h, gorilla_decoder.h
│     filter.h, and_filter.h, or_filter.h, time_operator.h/.cc
│     gorilla_codec_test.cc
│
└─► PR 2: Single-Column Batch Read Path
      │   chunk_reader.cc/.h, aligned_chunk_reader.cc/.h
      │   tsfile_series_scan_iterator.cc/.h
      │   single_device_tsblock_reader.cc/.h
      │   result_set.h, tsblock.h
      │
      └─► PR 3: Multi-Value Column Merged Read
            │   tsfile_common.h (MultiAlignedTimeseriesIndex)
            │   tsfile_io_reader.cc/.h
            │   aligned_chunk_reader.cc/.h (ValueColumnState, multi-value methods)
            │   single_device_tsblock_reader.cc/.h (VectorMeasurementColumnContext)
            │   vector.h
            │
            └─► PR 4: Parallel Decode + Batch Append Fast Path
                  thread_pool.h (new file)
                  aligned_chunk_reader.cc/.h (parallel decompress, batch append)

PR 5: Benchmark Tooling + Decoder Micro-Optimizations (independent)
      bench_read.cpp/.h (new files), examples CMakeLists
      plain_decoder.h (__builtin_bswap, direct pointer access)
      ts2diff_decoder.h (eliminate stack copy)
      third_party/simde (portable SIMD library)

PR 1 → 2 → 3 → 4 have sequential dependencies and must be merged in order. PR 5 has no dependencies and can be merged independently.

Correctness Verification

  • All 9 TableModel tests pass (including MultiLargePage large-data test).
  • All PLAIN / TS2DIFF / Gorilla codec tests pass.
  • Remaining reader/writer test results are consistent with the develop branch (10 pre-existing failures unaffected).

ColinLeeo added a commit that referenced this pull request May 26, 2026
Brings together batch decode infrastructure, multi-value aligned read,
parallel page decode, columnar tablet write, and SIMD micro-optimizations
from the long-lived `final` branch into a single review-ready change.

This change is a code snapshot, not a replay of `final` commit history --
the upstream history was a long sequence of WIP commits that wasn't
fit for review. Supersedes #749, #754, #774.

Read path
- Decoder base gains batch APIs (read_batch_int32/int64/float/double,
  skip_*); PLAIN, TS2DIFF, Gorilla decoders implement them. TS2DIFF
  has block-level peeking so time filters can skip blocks without
  decoding. Gorilla adds a raw-pointer GorillaBitReader that bypasses
  ByteStream overhead.
- ChunkReader / AlignedChunkReader add *_DECODE_TV_BATCH methods that
  decode time + value into a TsBlock in one pass, applying batch time
  filters before append.
- AlignedChunkReader supports a multi-value mode: one time chunk + N
  value chunks decoded in a single pass, sharing the decoded timestamps
  and filter mask. SingleDeviceTsBlockReader auto-detects same-device
  measurements via VectorMeasurementColumnContext.
- Optional page-level parallel decompression via a DecodeThreadPool +
  BlockingQueue when ENABLE_THREADS is set. Page-plan classification
  (SKIP / FULL_PASS / BOUNDARY) lets a scatter-free memcpy fast path
  fire when every row passes and no column has nulls.
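
A minimal sketch of how a SKIP / FULL_PASS / BOUNDARY classification could work, assuming each page carries min/max timestamp statistics and the filter is a closed range. `classify_page` and its signature are hypothetical; only the three plan names come from this change.

```cpp
#include <cassert>
#include <cstdint>

enum PagePlan { SKIP, FULL_PASS, BOUNDARY };

// Hypothetical classifier: compares a page's [page_min, page_max] timestamp
// statistics with a closed filter range [lo, hi].
inline PagePlan classify_page(int64_t page_min, int64_t page_max,
                              int64_t lo, int64_t hi) {
    assert(page_min <= page_max && lo <= hi);
    if (page_max < lo || page_min > hi) return SKIP;         // no overlap
    if (page_min >= lo && page_max <= hi) return FULL_PASS;  // fully inside
    return BOUNDARY;  // partial overlap: rows must be filtered one by one
}
```

Only FULL_PASS pages can take the memcpy fast path, because a BOUNDARY page still needs a per-row mask.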

Write path
- ValuePageWriter gains write_batch / write_string_batch that take
  timestamp+value+nullness arrays directly, removing the per-value
  append loop. Tablet exposes set_timestamps / set_column_values /
  set_column_string_repeated / reset for bulk reuse and switches
  StringColumn to an Arrow-compatible offset+buffer layout.
- TS2DIFFEncoder::flush now packs all deltas with a single
  pack_bits_msb + write_buf instead of per-value write_bits, falling
  back to the scalar path for the rare bit_width > 56 case.
- Int64Statistic::update_batch (NEON-accelerated min/max/sum).
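
For illustration, MSB-first bit packing of a batch might look like the sketch below. This `pack_bits_msb` is a hypothetical reimplementation written for this example, not the function in the change. In this sketch the 56-bit limit follows from the 64-bit accumulator (up to 7 pending bits plus 56 new ones); whether the real code has the same reason is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: packs the low `bit_width` bits of each value into a
// byte vector, most significant bit first. Requires 1 <= bit_width <= 56 so
// that the 64-bit accumulator never overflows (at most 7 pending bits plus
// 56 new bits).
inline std::vector<uint8_t> pack_bits_msb(const uint64_t* values, size_t n,
                                          unsigned bit_width) {
    assert(bit_width >= 1 && bit_width <= 56);
    std::vector<uint8_t> out;
    out.reserve((n * bit_width + 7) / 8);
    uint64_t acc = 0;      // pending bits, right-aligned
    unsigned pending = 0;  // valid bits in acc (< 8 between values)
    const uint64_t mask = (uint64_t(1) << bit_width) - 1;
    for (size_t i = 0; i < n; ++i) {
        acc = (acc << bit_width) | (values[i] & mask);
        pending += bit_width;
        while (pending >= 8) {
            pending -= 8;
            out.push_back(static_cast<uint8_t>(acc >> pending));
        }
        acc &= (uint64_t(1) << pending) - 1;  // keep only un-emitted bits
    }
    if (pending > 0) {
        // Left-align the final partial byte and pad with zero bits.
        out.push_back(static_cast<uint8_t>(acc << (8 - pending)));
    }
    return out;
}
```

Packing {5, 3, 6} at 3 bits each yields the bit string 101 011 110, that is, the bytes 0xAF and 0x00.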

Encoding / SIMD
- TS2DIFF batch decode adds AVX2 helpers via SIMDe (already on develop)
  for both i32 and i64; scalar fallback unchanged.
- PLAIN byte-swap path uses ARM NEON (vrev64q_u8 / vrev32q_u8) when
  available, falling back to __builtin_bswap.
- CMakeLists adds ENABLE_SIMD and turns on -O3 -march=native -flto in
  Release builds.

Allocator / ByteStream
- ByteStream caches page_mask_ (= page_size - 1) so the hot path uses
  a bitmask instead of modulo; wrap_from rounds buffer sizes up to a
  power of two so the mask remains correct. total_size_ widened to
  uint64_t to support files > 4GB.
- UncompressedCompressor now copies its output instead of aliasing
  caller buffers, letting callers free input safely.
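
The mask-instead-of-modulo trick relies on the page size being a power of two. A small sketch, with hypothetical helper names (`round_up_pow2`, `offset_in_page`) that are not taken from ByteStream itself:

```cpp
#include <cassert>
#include <cstdint>

// Rounds `v` up to the next power of two (returns v if it already is one).
// Requires 1 <= v <= 2^31 so the result fits in uint32_t.
inline uint32_t round_up_pow2(uint32_t v) {
    assert(v >= 1 && v <= (uint32_t(1) << 31));
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}

// Offset within a page: equivalent to pos % page_size when page_size is a
// power of two and page_mask == page_size - 1.
inline uint32_t offset_in_page(uint64_t pos, uint32_t page_mask) {
    return static_cast<uint32_t>(pos & page_mask);
}
```

With a 1000-byte buffer rounded up to 1024, the mask is 1023 and `pos & 1023` matches `pos % 1024` for every position.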

C wrapper / Arrow
- Trimmed unused metadata-export surface (TsFileStatisticBase,
  TimeseriesMetadata, DeviceTimeseriesMetadataEntry, tag-filter handles)
  out of the public C API. Internal tag filtering is unaffected.
- arrow_c.cc simplified: per-row offset handling for sliced
  variable-length arrays in place of the InvertArrowBitmap copy.

Tests / benchmarks
- New tsfile_reader_table_batch_test.cc covers the TsBlock batch read
  path. gorilla_codec_test.cc adds Int32/Int64/Float batch decode
  tests. examples/cpp_examples adds bench_read.cpp/.h and an
  examples/read_perf_compare/ target.
- Removed cwrapper_metadata_test.cc and common/path.cc (Path bodies
  inlined into path.h; the C metadata API they covered is gone).

Compatibility
- All new C++ methods are additions; no existing C++ API was removed.
- C wrapper headers lost the metadata export / tag filter symbols
  listed above -- downstream callers (Python wrapper in particular)
  will want a sanity check before merge.
- cpp/third_party/ intentionally left at develop's state so the
  recent MSVC compatibility fixes (WITH_STATIC_CRT OFF, CMP0054 NEW,
  CMAKE_POLICY_VERSION_MINIMUM=3.5, _MSC_VER guards) are preserved.

Verification
- cmake configure + make -j on macOS arm64 (AppleClang, C++11) builds
  cleanly: libtsfile.2.2.1.dev.dylib and TsFile_Test both link, zero
  errors, only unused-lambda-capture warnings in pre-existing tests.
- Full TsFile_Test run and downstream Python binding load are left as
  pre-merge checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>