Erase op types from mapreduce bookkeeping for precompilation by lkdvos · Pull Request #67 · QuantumKitHub/Strided.jl

lkdvos · 2026-06-18T22:37:42Z

Motivation

Strided's mapreduce setup chain previously specialized on the map/reduce function types `f`/`op`/`initop`. The chain is `_mapreducedim!` → `_mapreduce_fuse!` (fuse contiguous dims) → `_mapreduce_order!` (sort dims by cache-importance) → `_mapreduce_block!` (compute cache blocks) → `_mapreduce_threaded!`, plus the public `map`/`map!`/`mapreduce`/`mapreducedim!`/`_mapreduce` entry points.

None of this bookkeeping depends on what those functions are: it only fuses dimensions, sorts the loop order, and computes cache blocks from the array shapes and strides, forwarding f/op/initop untouched to the kernel. Yet, because the chain specialized on the function types, a workload that calls mapreduce with many distinct ops (exactly what TensorOperations triggers) forced a fresh compilation of the entire setup chain for every (op, eltype) combination — a large per-op precompilation/TTFX cost.

What changed

The `f`/`op`/`initop` arguments are now marked `@nospecialize` across the whole chain. The bookkeeping specializes only on the array-shape signature (ndims `N`, number of arrays `M`, eltypes) and no longer on the ops, so a precompile workload compiles it once per shape signature and reuses it across every op. Because the bookkeeping runs once per mapreduce call (coarse granularity), erasing the op types is free at runtime.
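The effect of erasing the op type can be seen in a minimal sketch. The functions `plan_specialized`, `plan_erased` and `nspecializations` below are hypothetical stand-ins and are not part of Strided. `Base.specializations` is an unexported helper, assumed available (Julia 1.10 or later).

```julia
# Hypothetical stand-ins for a bookkeeping step; these are not Strided's functions.

# Specializes on the op type F: each distinct op can get its own compiled instance.
plan_specialized(op::F, dims::NTuple{N, Int}) where {F, N} = (op, prod(dims))

# Op type erased: `op` is only forwarded, so the compiler need not specialize on it.
function plan_erased(@nospecialize(op), dims::NTuple{N, Int}) where {N}
    return (op, prod(dims))
end

# Count the compiler-generated specializations of a single-method function.
# `Base.specializations` is unexported (assumed: Julia 1.10 or later).
nspecializations(f) = count(Returns(true), Base.specializations(only(methods(f))))

# Untyped container, so every call below goes through dynamic dispatch.
ops = Any[+, *, max, min]
for op in ops
    plan_specialized(op, (4, 4))
    plan_erased(op, (4, 4))
end

println("specialized on op: ", nspecializations(plan_specialized))
println("op type erased:    ", nspecializations(plan_erased))
```

One would expect the erased variant to report fewer specializations than the specialized one. The exact counts depend on the Julia version and on how the calls are inferred, so none are claimed here.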

The monolithic @generated _mapreduce_kernel! is kept untouched (its loop nest is left intact — splitting it regresses permutation runtime).

The per-array data (strides, offsets, …) is deliberately kept as stack-allocated Tuples. A variant that carried this data in M-erased Vectors — specializing the bookkeeping purely on N — was prototyped and rejected: it roughly doubled small-array call overhead (heap allocation + dynamic dispatch) while producing an identical method-specialization count, because the remaining specializations are bounded by the distinct arrays tuple types either way (the @generated kernel genuinely needs that concrete type). The reasoning is documented in a comment in mapreduce.jl.
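The Tuple-versus-Vector distinction can be illustrated at the type level. This is a small sketch with made-up stride values. The names `strides_tuple` and `strides_vec` are hypothetical and do not come from `mapreduce.jl`.

```julia
# Hypothetical per-array stride data for M = 2 arrays with N = 3 dimensions each.
strides_tuple = ((1, 4, 16), (16, 4, 1))   # M and N are encoded in the type
strides_vec = [[1, 4, 16], [16, 4, 1]]     # M and N are erased from the type

println(typeof(strides_tuple))   # a Tuple of Tuples of Int
println(typeof(strides_vec))     # a Vector of Vectors of Int

# A Tuple of Ints is a plain-bits type, so it can live on the stack or in registers.
println(isbitstype(typeof(strides_tuple)))
# A Vector is a heap-allocated, mutable container and is never a plain-bits type.
println(isbitstype(typeof(strides_vec)))
```

Because the Tuple type carries `M` and `N`, code consuming it specializes on both, which is the trade-off described above: more distinct signatures in principle, but no heap allocation for the per-array data.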

GPU extension

This PR does not change _mapreduce_block!'s signature, so the StridedGPUArraysExt _mapreduce_block! override's dispatch boundary is preserved unchanged. The GPU mapreduce/reduce tests pass on both JLArray and a real CuArray backend (see validation).

Overlap with the `@nospecialize`-only PR

This overlaps the separately-filed @nospecialize + _mapreduce_kernel_expr PR: both add @nospecialize to the bookkeeping. This PR is the focused, runtime-validated version of the bookkeeping precompile fix (it does not include the kernel _expr split) and supersedes the @nospecialize-on-bookkeeping portion of that PR — the maintainer can sequence/pick whichever is preferred. The documented Vector-vs-Tuple finding records why we stop at op-type erasure rather than going to pure-N.

Validation

All measured on Julia 1.12, baseline = main.

1. Tests pass. Pkg.test("Strided") passes single-threaded and multi-threaded (-t 4). The StridedGPUArraysExt extension precompiles and loads; the GPU mapreduce/reduce testsets pass on both JLArray (18/18, 14/14) and CuArray (18/18, 14/14).

2. Precompilation is more effective. Method specializations after a multi-op × multi-eltype × multi-ndims workload:

| function | baseline | branch |
| --- | --- | --- |
| `_mapreduce_block!` | 346 | 75 |
| `_mapreduce_fuse!` | 202 | 75 |
| `_mapreduce_order!` | 202 | 75 |
| `_mapreduce_threaded!` | 515 | 124 |
| `_mapreduce_kernel!` | 659 | 412 |
| `_computeblocks` | 19 | 18 |

Cold compile time (sum of Base.@timed(...).compile_time across the workload):

| workload | baseline | branch |
| --- | --- | --- |
| grid (ndims 2–7 × {Float64, ComplexF64} × 5 op kinds) | 28.8 s | 26.5 s |
| many distinct ops (8 ops × ndims 2–5, `map!` + `mapreducedim!`) | 14.5 s | 11.0 s (−24%) |

The many-distinct-ops number is the realistic TensorOperations scenario and shows the largest win.
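For reference, summing `compile_time` over a workload could look roughly like the sketch below. The helper `total_compile_time` and the two example thunks are hypothetical and are not taken from the `benchmark/` harness. The `compile_time` field of `@timed` is assumed to be available (Julia 1.11 or later).

```julia
# Hypothetical helper: run each thunk once and add up the reported compile time.
# Assumes the `compile_time` field of `@timed` (Julia 1.11 or later).
function total_compile_time(thunks)
    total = 0.0
    for thunk in thunks
        stats = @timed thunk()          # NamedTuple with value, time, compile_time, ...
        total += stats.compile_time     # seconds spent compiling during this call
    end
    return total
end

# Made-up workload; a real measurement would call the Strided entry points instead.
workload = Function[
    () -> sum(abs2, [1.0, 2.0, 3.0]),
    () -> map(x -> x + 1, (1, 2, 3)),
]

println("total compile_time (s): ", total_compile_time(workload))
```

A cold measurement needs a fresh Julia session per run, since anything already compiled in the session contributes no further compile time.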

3. Runtime unaffected. Single-thread, BenchmarkTools, back-to-back baseline vs branch.

- Large (~4M-element) arrays: ratios within noise across permute/add/reduce; back-to-back re-runs of the few cases that looked off in cross-run measurements confirmed they are noise (branch sometimes faster). No systematic regression.
- Tiny (4×4×4-scale) arrays: neutral on map/permute, and faster on reductions (e.g. reduce_inner N=2 ≈ −48%, reduce_full N=2 ≈ −24%), because @nospecialize also trims per-call work/allocations.
- Per-call allocations drop (e.g. ComplexF64 N=3 mapreducedim!: 768 B → 464 B).

A reusable benchmark/ harness (compile_bench.jl, manyops_compile.jl, runtime_bench.jl, runtime_small.jl, spec_count.jl, cases.jl) is included for future regression checks.

🤖 Generated with Claude Code

The mapreduce setup chain (`_mapreducedim!`, `_mapreduce_fuse!`, `_mapreduce_order!`, `_mapreduce_block!`, `_mapreduce_threaded!`, and the public `map`/`map!`/`mapreduce`/`mapreducedim!`/`_mapreduce` entry points) previously specialized on the map/reduce function types `f`/`op`/`initop`. None of the bookkeeping logic depends on what those functions are: it only fuses dimensions, sorts the loop order by cache-importance and computes cache blocks from the array shapes/strides. Yet a workload that calls mapreduce with many distinct ops (as TensorOperations does) forced a fresh compilation of the entire chain per (op, eltype) combination.

`@nospecialize` the function arguments throughout so the bookkeeping specializes on the array-shape signature (ndims/n-arrays/eltypes) but no longer on the ops. A precompile workload can then compile it once per shape signature and reuse it across every op. The bookkeeping runs once per mapreduce call (coarse granularity), so erasing the op types is free at runtime; the only function that still specializes on the op is the monolithic `@generated _mapreduce_kernel!`, which is kept untouched.

The per-array data is deliberately kept as stack-allocated `Tuple`s rather than `M`-erased `Vector`s; the latter (a pure-`N` bookkeeping variant) was prototyped and rejected because it roughly doubled small-array call overhead for no additional spec-count reduction (see the note in `mapreduce.jl`).

The GPU `_mapreduce_block!` extension hook is unchanged: this commit does not alter `_mapreduce_block!`'s signature, so the extension's dispatch boundary is preserved.

Adds a `benchmark/` harness (compile / many-op / runtime / spec-count) used to validate the change.

Method specializations after a multi-op × multi-eltype × multi-ndims workload:

| function | baseline | branch |
| --- | --- | --- |
| `_mapreduce_block!` | 346 | 75 |
| `_mapreduce_fuse!` | 202 | 75 |
| `_mapreduce_order!` | 202 | 75 |
| `_mapreduce_threaded!` | 515 | 124 |
| `_mapreduce_kernel!` | 659 | 412 |

Compile time: grid 28.8s -> 26.5s; many-distinct-ops 14.5s -> 11.0s.

Runtime (single-thread, BenchmarkTools) neutral-to-better on both tiny (4^N) and large (~4M-element) arrays; per-call allocations drop (e.g. 768 -> 464 B).

`Pkg.test` passes single- and multi-threaded, including the JLArray and CuArray GPU mapreduce/reduce tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-18T22:38:08Z

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Click here to view the suggested changes.

diff --git a/benchmark/cases.jl b/benchmark/cases.jl
index bc1045c..5c5785c 100644
--- a/benchmark/cases.jl
+++ b/benchmark/cases.jl
@@ -27,7 +27,7 @@ function sizetuple(N::Int, total::Int)
     return ntuple(_ -> d, N)
 end
 
-function make_runner(c::Case, sz::NTuple{N,Int}) where {N}
+function make_runner(c::Case, sz::NTuple{N, Int}) where {N}
     T = c.T
     if c.kind == permute
         p = reverse(ntuple(identity, Val(N)))        # reverse perm: defeats fusion
diff --git a/benchmark/compile_bench.jl b/benchmark/compile_bench.jl
index b0d2def..2fc0ac5 100644
--- a/benchmark/compile_bench.jl
+++ b/benchmark/compile_bench.jl
@@ -25,7 +25,7 @@ function main()
         kinds = (permute, add, reduce_inner, reduce_outer, reduce_full),
     )
 
-    rows = Tuple{String,Float64,Float64}[]   # name, compile_time, total_time
+    rows = Tuple{String, Float64, Float64}[]   # name, compile_time, total_time
     total_compile = 0.0
     for c in cases
         sz = sizetuple(c.N, SMALL_TOTAL)
@@ -51,7 +51,7 @@ function main()
     end
     println("-"^56)
     println(rpad("TOTAL compile_time (s)", 32), "  ", round(total_compile; digits = 4))
-    println("\nwrote $out")
+    return println("\nwrote $out")
 end
 
 main()
diff --git a/benchmark/manyops_compile.jl b/benchmark/manyops_compile.jl
index 9195281..75457a0 100644
--- a/benchmark/manyops_compile.jl
+++ b/benchmark/manyops_compile.jl
@@ -43,7 +43,7 @@ function main()
     open(joinpath(@__DIR__, "results", "manyops_$(LABEL).txt"), "w") do io
         println(io, "label=$LABEL Kops=$KOPS Nmax=$NMAXD total_compile_s=$(round(total; digits = 4))")
     end
-    println("[$LABEL] Kops=$KOPS Nmax=$NMAXD  TOTAL compile_time = $(round(total; digits = 4)) s")
+    return println("[$LABEL] Kops=$KOPS Nmax=$NMAXD  TOTAL compile_time = $(round(total; digits = 4)) s")
 end
 
 main()
diff --git a/benchmark/runtime_bench.jl b/benchmark/runtime_bench.jl
index d01557a..1456226 100644
--- a/benchmark/runtime_bench.jl
+++ b/benchmark/runtime_bench.jl
@@ -44,13 +44,13 @@ function main()
         println("== runtime [$LABEL] nthreads=$nt ==")
         for c in cases
             t = bench_one(c)
-            us = t * 1e6
+            us = t * 1.0e6
             println(io, "$(name(c))\t$nt\t$(round(us; digits = 3))")
             println(rpad(name(c), 32), "  nt=$nt  ", round(us; digits = 3), " us")
         end
     end
     close(io)
-    println("\nwrote $out")
+    return println("\nwrote $out")
 end
 
 main()
diff --git a/benchmark/runtime_small.jl b/benchmark/runtime_small.jl
index a90be81..3901bf9 100644
--- a/benchmark/runtime_small.jl
+++ b/benchmark/runtime_small.jl
@@ -29,11 +29,11 @@ function main()
     println("== runtime small [$LABEL] nt=1 ==")
     for c in cases
         t = bench_one(c, sizes[c.N])
-        ns = t * 1e9
+        ns = t * 1.0e9
         println(io, "$(name(c))\t$(round(ns; digits = 2))")
         println(rpad(name(c), 32), "  ", round(ns; digits = 2), " ns")
     end
     close(io)
-    println("\nwrote $out")
+    return println("\nwrote $out")
 end
 main()
diff --git a/benchmark/spec_count.jl b/benchmark/spec_count.jl
index 78e4a4c..7226917 100644
--- a/benchmark/spec_count.jl
+++ b/benchmark/spec_count.jl
@@ -43,6 +43,7 @@ function workload()
             end
         end
     end
+    return
 end
 
 function main()
@@ -67,6 +68,6 @@ function main()
             println(rpad(nm, 24), "  ", n)
         end
     end
-    println("wrote $out")
+    return println("wrote $out")
 end
 main()

codecov · 2026-06-18T22:45:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/mapreduce.jl | `80.50% <100.00%> (ø)` |


lkdvos marked this pull request as draft June 19, 2026 00:04

lkdvos wants to merge 1 commit into main from ld-flatten-precompile
