Read Opt. by ColinLeeo · Pull Request #754 · apache/tsfile

ColinLeeo · 2026-03-27T08:55:58Z

TsFile C++ Read Path Performance Optimization — Overview

Background

The current TsFile C++ read path uses row-by-row decoding with a row-oriented result set API. In full-scan and filtered query scenarios, throughput falls behind Parquet+Arrow. This optimization aims to make TsFile batch read throughput significantly exceed Parquet+Arrow while maintaining interface compatibility.

Summary of Optimizations

The optimizations span four layers:

1. Batch Decode Infrastructure

Added read_batch_int32/int64/float/double and skip_* batch interfaces to Decoder (PLAIN / TS2DIFF / Gorilla), processing 129 values per call instead of paying one virtual dispatch per value.
Added satisfy_batch_time batch filter interface to Filter, evaluating an entire batch of timestamps at once.
Eliminated intermediate stack buffer copies in TS2DIFF batch decode — reads directly from the wrapped ByteStream buffer pointer.
PLAIN batch decode now uses __builtin_bswap64/32 (compiles to a single ARM REV instruction) and skips the read_buf intermediate copy.
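The PLAIN batch decode described above can be sketched roughly as follows. This is a minimal illustration, assuming the PLAIN payload is a run of big-endian 8-byte integers and a GCC/Clang toolchain; `plain_read_batch_int64` is a hypothetical name, not the signature used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the PR's API): decodes up to max_count big-endian
// int64 values straight from the source buffer, with no intermediate copy.
// Returns the number of values written to out.
inline std::size_t plain_read_batch_int64(const uint8_t* src, std::size_t src_len,
                                          int64_t* out, std::size_t max_count) {
    assert(src != nullptr && out != nullptr);
    std::size_t n = src_len / sizeof(int64_t);
    if (n > max_count) n = max_count;
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t raw;
        // memcpy is the portable way to do an unaligned load.
        std::memcpy(&raw, src + i * sizeof(raw), sizeof(raw));
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
        // Big-endian payload to host order; one byte-reverse per value.
        raw = __builtin_bswap64(raw);
#endif
        std::memcpy(&out[i], &raw, sizeof(raw));
    }
    return n;
}
```

A caller would invoke this once per batch and then advance the source pointer by `n * sizeof(int64_t)`, so the per-value cost is a load, a byte swap, and a store.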

2. Single-Column Batch Read Path

Added DECODE_TV_BATCH method in ChunkReader / AlignedChunkReader: decodes time + value in batches of 129 rows, applies batch filter, and writes results into TsBlock.
SingleDeviceTsBlockReader adapted to the batch path, supporting get_next_tsblock to return TsBlock directly to the user.
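The decode, batch-filter, append sequence above might look roughly like this. It is a simplified sketch: `TimeRangeFilter`, `SimpleBlock`, and `append_filtered_batch` are hypothetical stand-ins for the PR's Filter, TsBlock, and batch method, and the decode step is assumed to have already produced the `times` and `values` arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a time filter with a batch entry point.
struct TimeRangeFilter {
    int64_t lo;  // inclusive lower bound
    int64_t hi;  // inclusive upper bound

    // Sets mask[i] to 1 when times[i] lies in [lo, hi], else 0.
    // Returns the number of rows that passed.
    std::size_t satisfy_batch_time(const int64_t* times, std::size_t n,
                                   uint8_t* mask) const {
        assert(mask != nullptr || n == 0);
        std::size_t passed = 0;
        for (std::size_t i = 0; i < n; ++i) {
            mask[i] = static_cast<uint8_t>(times[i] >= lo && times[i] <= hi);
            passed += mask[i];
        }
        return passed;
    }
};

// Hypothetical stand-in for a TsBlock holding one time and one value column.
struct SimpleBlock {
    std::vector<int64_t> times;
    std::vector<double> values;
};

// Filters one decoded batch and appends the surviving rows to the block.
inline std::size_t append_filtered_batch(const int64_t* times, const double* values,
                                         std::size_t n, const TimeRangeFilter& filter,
                                         SimpleBlock& out) {
    std::vector<uint8_t> mask(n);
    const std::size_t passed = filter.satisfy_batch_time(times, n, mask.data());
    out.times.reserve(out.times.size() + passed);
    out.values.reserve(out.values.size() + passed);
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            out.times.push_back(times[i]);
            out.values.push_back(values[i]);
        }
    }
    return passed;
}
```

The filter is evaluated once per batch rather than once per row, so the virtual call (if the filter is polymorphic) is amortized over the whole batch.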

3. Multi-Value Column Merged Read

Introduced MultiAlignedTimeseriesIndex to allow a single AlignedChunkReader to hold 1 time column + N value columns simultaneously.
Time column is decoded only once; N value columns share the decoded timestamps and filter mask.
VectorMeasurementColumnContext wraps a multi-value SSI; SingleDeviceTsBlockReader automatically detects and merges multiple measurements within the same device.
Fixed double-delete bug in SingleDeviceTsBlockReader::close() where multiple map entries pointed to the same VectorMeasurementColumnContext.
Fixed per-column buffer size tracking in get_cur_page_header (previously shared file_data_value_buf_size_ caused heap-buffer-overflow when columns had different page sizes).
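The double-delete described above arises whenever several map keys alias one heap object. One common way to close such a map safely is to collect the distinct pointers first; the sketch below illustrates that pattern and is not necessarily how the PR fixes it. `ColumnContext` and `close_contexts` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for VectorMeasurementColumnContext. The counter lets
// a caller observe how many destructors actually ran.
struct ColumnContext {
    explicit ColumnContext(int* destroyed) : destroyed_(destroyed) {
        assert(destroyed != nullptr);
    }
    ~ColumnContext() { ++*destroyed_; }
    int* destroyed_;
};

// Deletes every distinct context exactly once, even when several measurement
// names map to the same pointer, then clears the map.
inline void close_contexts(std::map<std::string, ColumnContext*>& contexts) {
    std::set<ColumnContext*> unique;
    for (const auto& kv : contexts) {
        if (kv.second != nullptr) unique.insert(kv.second);
    }
    for (ColumnContext* ctx : unique) delete ctx;
    contexts.clear();
}
```

With three measurements where two share a context, `close_contexts` runs two destructors instead of three.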

4. Parallel Decode + Batch Append Fast Path

Introduced DecodeThreadPool for page-level parallel decompression of N value columns (Snappy decompress in parallel).
In the scatter phase of multi_DECODE_TV_BATCH, when all rows pass the filter and no column has nulls, the per-row row_appender.append() loop is bypassed — each column's decoded batch is written to the Vector buffer in a single memcpy.
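The batch append fast path above can be pictured as a branch in the scatter step. The following is a sketch under simplifying assumptions: a single INT64 column, a `std::vector` in place of the Vector buffer, and null handling omitted on the slow path. `scatter_column` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Appends one decoded batch to the output column.
// Returns true when the memcpy fast path was taken.
inline bool scatter_column(const int64_t* decoded, const uint8_t* pass_mask,
                           std::size_t n, std::size_t pass_count, bool has_nulls,
                           std::vector<int64_t>& out) {
    assert(pass_count <= n);
    if (pass_count == n && !has_nulls) {
        // Every row passed and nothing is null: copy the whole batch at once.
        const std::size_t old_size = out.size();
        out.resize(old_size + n);
        if (n > 0) {
            std::memcpy(out.data() + old_size, decoded, n * sizeof(int64_t));
        }
        return true;
    }
    // Slow path: per-row append driven by the filter mask.
    for (std::size_t i = 0; i < n; ++i) {
        if (pass_mask[i]) out.push_back(decoded[i]);
    }
    return false;
}
```

The fast path only applies when the two conditions hold for the whole batch, which is why the page-level check is made before any rows are copied.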

Test Dataset

Parameter	Value
Table	bench_table
Devices	10
Total rows	1,000,000 (100,000 per device)
Columns	time, id1(TAG), id2(TAG), s1(INT64), s2(DOUBLE), s3(FLOAT), s4(INT32)
Encoding	Time: TS2DIFF, Values: PLAIN
Compression	Snappy
Platform	macOS ARM64 (Apple Silicon), clang -O3

Benchmark Results

TAG_FILTER — filter by device id, read 100,000 rows × 4 value columns from a single device:

Read Mode	Throughput	vs Baseline
TsFile (row, pre-optimization baseline)	~4.5M rows/s	1.0x
TsFile (batch, single-column)	~9.5M rows/s	2.1x
TsFile (batch, multi-value + parallel + batch append)	~21M rows/s	4.7x
Parquet+Arrow	~1.7M rows/s	—

TIME_FILTER — filter by time range, read 333,333 rows × 4 value columns across all devices:

Read Mode	Throughput	vs Baseline
TsFile (row, pre-optimization baseline)	~4.5M rows/s	1.0x
TsFile (batch, single-column)	~9.2M rows/s	2.0x
TsFile (batch, multi-value + parallel + batch append)	~19.5M rows/s	4.3x
Parquet+Arrow	~6.4M rows/s	—

Phase Timing Breakdown (Post-Optimization)

Instrumented timing of each phase within multi_DECODE_TV_BATCH:

Phase	% of Total	Description
Time decode (TS2DIFF)	~5%	128-value block bit-unpacking + prefix sum
Filter + value decode (PLAIN bswap)	~95%	Batch time filter + 4-column byte-swap decode
Scatter (write to TsBlock)	~0%	Eliminated by batch append fast path
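The time-decode row in the table above mentions bit-unpacking followed by a prefix sum. A minimal sketch of the prefix-sum half, assuming a simplified TS2DIFF block model (a first value, a minimum delta, and non-negative offsets that have already been bit-unpacked); the function name and argument layout are illustrative, not taken from the PR:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reconstructs timestamps from a simplified TS2DIFF block.
// unpacked[i] is assumed to hold (delta_i - min_delta) as a non-negative value.
// Writes n + 1 values to out: first_value followed by n reconstructed values.
inline void ts2diff_reconstruct(int64_t first_value, int64_t min_delta,
                                const uint64_t* unpacked, std::size_t n,
                                int64_t* out) {
    assert(out != nullptr);
    out[0] = first_value;
    int64_t prev = first_value;
    for (std::size_t i = 0; i < n; ++i) {
        // Running (prefix) sum: each value is the previous one plus its delta.
        prev += min_delta + static_cast<int64_t>(unpacked[i]);
        out[i + 1] = prev;
    }
}
```

Because each output depends on the previous one, this loop is inherently sequential unless a vectorized prefix-sum scheme is used, which is one reason the phase is measured separately.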

PR Plan

Split into 5 PRs, merged in dependency order:

PR 1: Batch Decode Infrastructure
│     decoder.h, plain_decoder.h, ts2diff_decoder.h, gorilla_decoder.h
│     filter.h, and_filter.h, or_filter.h, time_operator.h/.cc
│     gorilla_codec_test.cc
│
└─► PR 2: Single-Column Batch Read Path
      │   chunk_reader.cc/.h, aligned_chunk_reader.cc/.h
      │   tsfile_series_scan_iterator.cc/.h
      │   single_device_tsblock_reader.cc/.h
      │   result_set.h, tsblock.h
      │
      └─► PR 3: Multi-Value Column Merged Read
            │   tsfile_common.h (MultiAlignedTimeseriesIndex)
            │   tsfile_io_reader.cc/.h
            │   aligned_chunk_reader.cc/.h (ValueColumnState, multi-value methods)
            │   single_device_tsblock_reader.cc/.h (VectorMeasurementColumnContext)
            │   vector.h
            │
            └─► PR 4: Parallel Decode + Batch Append Fast Path
                  thread_pool.h (new file)
                  aligned_chunk_reader.cc/.h (parallel decompress, batch append)

PR 5: Benchmark Tooling + Decoder Micro-Optimizations (independent)
      bench_read.cpp/.h (new files), examples CMakeLists
      plain_decoder.h (__builtin_bswap, direct pointer access)
      ts2diff_decoder.h (eliminate stack copy)
      third_party/simde (portable SIMD library)

PR 1 → 2 → 3 → 4 have sequential dependencies and must be merged in order. PR 5 has no dependencies and can be merged independently.

Correctness Verification

All 9 TableModel tests pass (including MultiLargePage large-data test).
All PLAIN / TS2DIFF / Gorilla codec tests pass.
Remaining reader/writer test results are consistent with the develop branch (10 pre-existing failures unaffected).

Brings together batch decode infrastructure, multi-value aligned read, parallel page decode, columnar tablet write, and SIMD micro-optimizations from the long-lived `final` branch into a single review-ready change. This change is a code snapshot, not a replay of `final` commit history -- the upstream history was a long sequence of WIP commits that wasn't fit for review. Supersedes #749, #754, #774.

Read path
- Decoder base gains batch APIs (read_batch_int32/int64/float/double, skip_*); PLAIN, TS2DIFF, Gorilla decoders implement them. TS2DIFF has block-level peeking so time filters can skip blocks without decoding. Gorilla adds a raw-pointer GorillaBitReader that bypasses ByteStream overhead.
- ChunkReader / AlignedChunkReader add *_DECODE_TV_BATCH methods that decode time + value into a TsBlock in one pass, applying batch time filters before append.
- AlignedChunkReader supports a multi-value mode: one time chunk + N value chunks decoded in a single pass, sharing the decoded timestamps and filter mask. SingleDeviceTsBlockReader auto-detects same-device measurements via VectorMeasurementColumnContext.
- Optional page-level parallel decompression via a DecodeThreadPool + BlockingQueue when ENABLE_THREADS is set. Page-plan classification (SKIP / FULL_PASS / BOUNDARY) lets a scatter-free memcpy fast path fire when every row passes and no column has nulls.

Write path
- ValuePageWriter gains write_batch / write_string_batch that take timestamp+value+nullness arrays directly, removing the per-value append loop. Tablet exposes set_timestamps / set_column_values / set_column_string_repeated / reset for bulk reuse and switches StringColumn to an Arrow-compatible offset+buffer layout.
- TS2DIFFEncoder::flush now packs all deltas with a single pack_bits_msb + write_buf instead of per-value write_bits, falling back to the scalar path for the rare bit_width > 56 case.
- Int64Statistic::update_batch (NEON-accelerated min/max/sum).
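The Arrow-compatible offset+buffer layout mentioned for StringColumn stores all string bytes back to back in one buffer, with an offsets array of length n + 1 marking the boundaries. A minimal sketch of that layout follows; `StringColumnSketch` is a hypothetical type, and 32-bit offsets are an assumption borrowed from Arrow's default string type rather than something the PR states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration of an offset+buffer string column.
// offsets has one more entry than the number of strings; string i occupies
// data[offsets[i] .. offsets[i + 1]).
struct StringColumnSketch {
    std::vector<int32_t> offsets;
    std::vector<char> data;

    StringColumnSketch() : offsets(1, 0) {}  // start with the single offset 0

    void append(const std::string& s) {
        data.insert(data.end(), s.begin(), s.end());
        offsets.push_back(static_cast<int32_t>(data.size()));
    }

    std::size_t size() const { return offsets.size() - 1; }

    std::string get(std::size_t i) const {
        assert(i < size());
        return std::string(data.begin() + offsets[i], data.begin() + offsets[i + 1]);
    }
};
```

Because the bytes are contiguous, a whole column can be handed to a consumer as two buffers without copying each string individually.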

Encoding / SIMD
- TS2DIFF batch decode adds AVX2 helpers via SIMDe (already on develop) for both i32 and i64; scalar fallback unchanged.
- PLAIN byte-swap path uses ARM NEON (vrev64q_u8 / vrev32q_u8) when available, falling back to __builtin_bswap.
- CMakeLists adds ENABLE_SIMD and turns on -O3 -march=native -flto in Release builds.

Allocator / ByteStream
- ByteStream caches page_mask_ (= page_size - 1) so the hot path uses a bitmask instead of modulo; wrap_from rounds buffer sizes up to a power of two so the mask remains correct. total_size_ widened to uint64_t to support files > 4GB.
- UncompressedCompressor now copies its output instead of aliasing caller buffers, letting callers free input safely.

C wrapper / Arrow
- Trimmed unused metadata-export surface (TsFileStatisticBase, TimeseriesMetadata, DeviceTimeseriesMetadataEntry, tag-filter handles) out of the public C API. Internal tag filtering is unaffected.
- arrow_c.cc simplified: per-row offset handling for sliced variable-length arrays in place of the InvertArrowBitmap copy.

Tests / benchmarks
- New tsfile_reader_table_batch_test.cc covers the TsBlock batch read path. gorilla_codec_test.cc adds Int32/Int64/Float batch decode tests. examples/cpp_examples adds bench_read.cpp/.h and an examples/read_perf_compare/ target.
- Removed cwrapper_metadata_test.cc and common/path.cc (Path bodies inlined into path.h; the C metadata API they covered is gone).

Compatibility
- All new C++ methods are additions; no existing C++ API was removed.
- C wrapper headers lost the metadata export / tag filter symbols listed above -- downstream callers (Python wrapper in particular) will want a sanity check before merge.
- cpp/third_party/ intentionally left at develop's state so the recent MSVC compatibility fixes (WITH_STATIC_CRT OFF, CMP0054 NEW, CMAKE_POLICY_VERSION_MINIMUM=3.5, _MSC_VER guards) are preserved.
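The ByteStream note above relies on the identity that `pos % page_size == pos & (page_size - 1)` holds only when `page_size` is a power of two, which is why buffer sizes are rounded up. A small sketch of the idea; `round_up_pow2` and `PagedOffset` are hypothetical names, not the PR's ByteStream members.

```cpp
#include <cassert>
#include <cstdint>

// Rounds v up to the next power of two (v itself if it already is one).
inline uint32_t round_up_pow2(uint32_t v) {
    assert(v > 0 && v <= 0x80000000u);
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

// Hypothetical illustration of caching a page mask next to the page size.
struct PagedOffset {
    uint32_t page_size;  // always a power of two after rounding
    uint32_t page_mask;  // page_size - 1, cached for the hot path

    explicit PagedOffset(uint32_t requested)
        : page_size(round_up_pow2(requested)), page_mask(page_size - 1) {}

    // Equivalent to pos % page_size, but a single AND instruction.
    uint32_t offset_in_page(uint64_t pos) const {
        return static_cast<uint32_t>(pos & page_mask);
    }
};
```

If the size were not rounded, the mask would have gaps in its bit pattern and `pos & page_mask` would no longer match the modulo result.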

Verification
- cmake configure + make -j on macOS arm64 (AppleClang, C++11) builds cleanly: libtsfile.2.2.1.dev.dylib and TsFile_Test both link, zero errors, only unused-lambda-capture warnings in pre-existing tests.
- Full TsFile_Test run and downstream Python binding load are left as pre-merge checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Read Opt.

5ceeeba

ColinLeeo mentioned this pull request May 26, 2026: TsFile C++ batch read/write optimization #823 (Open, 5 tasks)

ColinLeeo wants to merge 1 commit into develop from read_opt
