Erase op types from mapreduce bookkeeping for precompilation #67

Draft
lkdvos wants to merge 1 commit into main from ld-flatten-precompile

@lkdvos lkdvos commented Jun 18, 2026

Member

Motivation

Strided's mapreduce setup chain previously specialized on the map/reduce function types f/op/initop. The chain is _mapreducedim! → _mapreduce_fuse! (fuse contiguous dims) → _mapreduce_order! (sort dims by cache-importance) → _mapreduce_block! (compute cache blocks) → _mapreduce_threaded!, plus the public map/map!/mapreduce/mapreducedim!/_mapreduce entry points.

None of this bookkeeping depends on what those functions are: it only fuses dimensions, sorts the loop order, and computes cache blocks from the array shapes and strides, forwarding f/op/initop untouched to the kernel. Yet, because the chain specialized on the function types, a workload that calls mapreduce with many distinct ops (exactly what TensorOperations triggers) forced a fresh compilation of the entire setup chain for every (op, eltype) combination — a large per-op precompilation/TTFX cost.

What changed

The f/op/initop arguments are now marked @nospecialize across the whole chain. The bookkeeping specializes only on the array-shape signature (ndims N, number of arrays M, eltypes) and no longer on the ops, so a precompile workload compiles it once per shape signature and reuses it across every op. Because the bookkeeping runs once per mapreduce call (coarse granularity), erasing the op types has no measurable runtime cost (see Validation).
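As a minimal sketch of this pattern, using hypothetical `setup!` and `kernel!` stand-ins rather than Strided's actual functions: `@nospecialize` on the function argument asks the compiler not to compile the bookkeeping method per op, while the kernel it forwards to still specializes on the op.

```julia
# Hypothetical stand-ins for illustration; these are not Strided's functions.

# Hot loop: the `where {F}` forces specialization on the function type,
# so `f` can be inlined into the loop body.
function kernel!(f::F, dst::AbstractArray, src::AbstractArray) where {F}
    for i in eachindex(dst, src)
        dst[i] = f(src[i])
    end
    return dst
end

# Bookkeeping: runs once per call and never calls `f` itself, so it asks the
# compiler not to specialize on the type of `f`.
function setup!(@nospecialize(f), dst::AbstractArray, src::AbstractArray)
    axes(dst) == axes(src) || throw(DimensionMismatch("dst and src must match"))
    return kernel!(f, dst, src)  # the kernel method instance is selected per op
end

out_sq = setup!(abs2, zeros(3), [1.0, 2.0, 3.0])
out_neg = setup!(-, zeros(3), [1.0, 2.0, 3.0])
```

In this sketch both calls are expected to share one compiled `setup!` instance for the array signature, and only `kernel!` is compiled once per op.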

The monolithic @generated _mapreduce_kernel! is kept untouched (its loop nest is left intact — splitting it regresses permutation runtime).

The per-array data (strides, offsets, …) is deliberately kept as stack-allocated Tuples. A variant that carried this data in M-erased Vectors — specializing the bookkeeping purely on N — was prototyped and rejected: it roughly doubled small-array call overhead (heap allocation + dynamic dispatch) while producing an identical method-specialization count, because the remaining specializations are bounded by the distinct arrays tuple types either way (the @generated kernel genuinely needs that concrete type). The reasoning is documented in a comment in mapreduce.jl.
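The type-level side of this trade-off can be illustrated in plain Julia, independent of Strided's internals. The variable names below are hypothetical. A tuple of per-array strides carries the number of arrays M in its type, whereas a `Vector` erases M but has to live on the heap.

```julia
A = zeros(4, 4, 4)
B = zeros(4, 4, 4)
C = zeros(4, 4, 4)

# Per-array data as tuples: the number of arrays M is part of the type.
two_arrays = (strides(A), strides(B))
three_arrays = (strides(A), strides(B), strides(C))

# Per-array data as vectors: one type for every M, but heap-allocated.
two_vec = collect(two_arrays)
three_vec = collect(three_arrays)

same_tuple_type = typeof(two_arrays) == typeof(three_arrays)   # false
same_vector_type = typeof(two_vec) == typeof(three_vec)        # true
```

The vector form would let a method specialize on N alone, which is the variant the description reports as prototyped and rejected.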

GPU extension

This PR does not change _mapreduce_block!'s signature, so the StridedGPUArraysExt _mapreduce_block! override's dispatch boundary is preserved unchanged. The GPU mapreduce/reduce tests pass on both JLArray and a real CuArray backend (see validation).

Overlap with the @nospecialize-only PR

This overlaps the separately-filed @nospecialize + _mapreduce_kernel_expr PR: both add @nospecialize to the bookkeeping. This PR is the focused, runtime-validated version of the bookkeeping precompile fix (it does not include the kernel _expr split) and supersedes the @nospecialize-on-bookkeeping portion of that PR — the maintainer can sequence/pick whichever is preferred. The documented Vector-vs-Tuple finding records why we stop at op-type erasure rather than going to pure-N.

Validation

All measured on Julia 1.12, baseline = main.

1. Tests pass. Pkg.test("Strided") passes single-threaded and multi-threaded (-t 4). The StridedGPUArraysExt extension precompiles and loads; the GPU mapreduce/reduce testsets pass on both JLArray (18/18, 14/14) and CuArray (18/18, 14/14).

2. Precompilation is more effective. Method specializations after a multi-op × multi-eltype × multi-ndims workload:

  function              baseline  branch
  _mapreduce_block!     346       75
  _mapreduce_fuse!      202       75
  _mapreduce_order!     202       75
  _mapreduce_threaded!  515       124
  _mapreduce_kernel!    659       412
  _computeblocks        19        18

Cold compile time (sum of Base.@timed(...).compile_time across the workload):

  workload                                                      baseline  branch
  grid (ndims 2–7 × {Float64, ComplexF64} × 5 op kinds)         28.8 s    26.5 s
  many distinct ops (8 ops × ndims 2–5, map! + mapreducedim!)   14.5 s    11.0 s (−24%)

The many-distinct-ops workload is the realistic TensorOperations scenario and shows the larger of the two wins.
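A minimal sketch of this kind of measurement, assuming Julia 1.11 or later (where the result of `@timed` includes a `compile_time` field). The workload and the `total_compile_time` helper are hypothetical stand-ins, not the PR's benchmark harness.

```julia
# Hypothetical workload: each closure triggers compilation for a new op type.
workload = [
    () -> map(x -> x + 1, [1.0, 2.0]),
    () -> map(x -> 2x, [1.0, 2.0]),
    () -> sum(abs2, [1.0, 2.0]),
]

# Sum the compilation time (in seconds) reported by `@timed` over the workload.
function total_compile_time(thunks)
    total = 0.0
    for thunk in thunks
        total += (@timed thunk()).compile_time
    end
    return total
end

cold = total_compile_time(workload)   # first run pays for compilation
warm = total_compile_time(workload)   # a repeat run should report little or none
```

Comparing `cold` between a baseline and a branch checkout, each in a fresh Julia session, gives a number of the same kind as the table above.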

3. Runtime unaffected. Single-thread, BenchmarkTools, back-to-back baseline vs branch.

  • Large (~4M-element) arrays: ratios within noise across permute/add/reduce; back-to-back re-runs of the few cases that looked off in cross-run measurements confirmed they are noise (branch sometimes faster). No systematic regression.
  • Tiny (4×4×4-scale) arrays: neutral on map/permute, and faster on reductions (e.g. reduce_inner N=2 ≈ −48%, reduce_full N=2 ≈ −24%), because @nospecialize also trims per-call work/allocations.
  • Per-call allocations drop (e.g. ComplexF64 N=3 mapreducedim!: 768 B → 464 B).

A reusable benchmark/ harness (compile_bench.jl, manyops_compile.jl, runtime_bench.jl, runtime_small.jl, spec_count.jl, cases.jl) is included for future regression checks.

🤖 Generated with Claude Code

The mapreduce setup chain (`_mapreducedim!`, `_mapreduce_fuse!`,
`_mapreduce_order!`, `_mapreduce_block!`, `_mapreduce_threaded!`, and the
public `map`/`map!`/`mapreduce`/`mapreducedim!`/`_mapreduce` entry points)
previously specialized on the map/reduce function types `f`/`op`/`initop`.
None of the bookkeeping logic depends on what those functions are — it only
fuses dimensions, sorts the loop order by cache-importance and computes cache
blocks from the array shapes/strides — yet a workload that calls mapreduce with
many distinct ops (as TensorOperations does) forced a fresh compilation of the
entire chain per (op, eltype) combination.

`@nospecialize` the function arguments throughout so the bookkeeping specializes
on the array-shape signature (ndims/n-arrays/eltypes) but no longer on the ops.
A precompile workload can then compile it once per shape signature and reuse it
across every op. The bookkeeping runs once per mapreduce call (coarse
granularity), so erasing the op types is free at runtime; the only function that
still specializes on the op is the monolithic `@generated _mapreduce_kernel!`,
which is kept untouched.

The per-array data is deliberately kept as stack-allocated `Tuple`s rather than
`M`-erased `Vector`s; the latter (a pure-`N` bookkeeping variant) was prototyped
and rejected because it roughly doubled small-array call overhead for no
additional spec-count reduction (see the note in `mapreduce.jl`).

The GPU `_mapreduce_block!` extension hook is unchanged: this commit does not
alter `_mapreduce_block!`'s signature, so the extension's dispatch boundary is
preserved.

Adds a `benchmark/` harness (compile / many-op / runtime / spec-count) used to
validate the change.

Method specializations after a multi-op × multi-eltype × multi-ndims workload:

  function              baseline  branch
  _mapreduce_block!     346       75
  _mapreduce_fuse!      202       75
  _mapreduce_order!     202       75
  _mapreduce_threaded!  515       124
  _mapreduce_kernel!    659       412

Compile time: grid 28.8s -> 26.5s; many-distinct-ops 14.5s -> 11.0s.
Runtime (single-thread, BenchmarkTools) neutral-to-better on both tiny (4^N)
and large (~4M-element) arrays; per-call allocations drop (e.g. 768 -> 464 B).
`Pkg.test` passes single- and multi-threaded, including the JLArray and CuArray
GPU mapreduce/reduce tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Suggested changes:
diff --git a/benchmark/cases.jl b/benchmark/cases.jl
index bc1045c..5c5785c 100644
--- a/benchmark/cases.jl
+++ b/benchmark/cases.jl
@@ -27,7 +27,7 @@ function sizetuple(N::Int, total::Int)
     return ntuple(_ -> d, N)
 end
 
-function make_runner(c::Case, sz::NTuple{N,Int}) where {N}
+function make_runner(c::Case, sz::NTuple{N, Int}) where {N}
     T = c.T
     if c.kind == permute
         p = reverse(ntuple(identity, Val(N)))        # reverse perm: defeats fusion
diff --git a/benchmark/compile_bench.jl b/benchmark/compile_bench.jl
index b0d2def..2fc0ac5 100644
--- a/benchmark/compile_bench.jl
+++ b/benchmark/compile_bench.jl
@@ -25,7 +25,7 @@ function main()
         kinds = (permute, add, reduce_inner, reduce_outer, reduce_full),
     )
 
-    rows = Tuple{String,Float64,Float64}[]   # name, compile_time, total_time
+    rows = Tuple{String, Float64, Float64}[]   # name, compile_time, total_time
     total_compile = 0.0
     for c in cases
         sz = sizetuple(c.N, SMALL_TOTAL)
@@ -51,7 +51,7 @@ function main()
     end
     println("-"^56)
     println(rpad("TOTAL compile_time (s)", 32), "  ", round(total_compile; digits = 4))
-    println("\nwrote $out")
+    return println("\nwrote $out")
 end
 
 main()
diff --git a/benchmark/manyops_compile.jl b/benchmark/manyops_compile.jl
index 9195281..75457a0 100644
--- a/benchmark/manyops_compile.jl
+++ b/benchmark/manyops_compile.jl
@@ -43,7 +43,7 @@ function main()
     open(joinpath(@__DIR__, "results", "manyops_$(LABEL).txt"), "w") do io
         println(io, "label=$LABEL Kops=$KOPS Nmax=$NMAXD total_compile_s=$(round(total; digits = 4))")
     end
-    println("[$LABEL] Kops=$KOPS Nmax=$NMAXD  TOTAL compile_time = $(round(total; digits = 4)) s")
+    return println("[$LABEL] Kops=$KOPS Nmax=$NMAXD  TOTAL compile_time = $(round(total; digits = 4)) s")
 end
 
 main()
diff --git a/benchmark/runtime_bench.jl b/benchmark/runtime_bench.jl
index d01557a..1456226 100644
--- a/benchmark/runtime_bench.jl
+++ b/benchmark/runtime_bench.jl
@@ -44,13 +44,13 @@ function main()
         println("== runtime [$LABEL] nthreads=$nt ==")
         for c in cases
             t = bench_one(c)
-            us = t * 1e6
+            us = t * 1.0e6
             println(io, "$(name(c))\t$nt\t$(round(us; digits = 3))")
             println(rpad(name(c), 32), "  nt=$nt  ", round(us; digits = 3), " us")
         end
     end
     close(io)
-    println("\nwrote $out")
+    return println("\nwrote $out")
 end
 
 main()
diff --git a/benchmark/runtime_small.jl b/benchmark/runtime_small.jl
index a90be81..3901bf9 100644
--- a/benchmark/runtime_small.jl
+++ b/benchmark/runtime_small.jl
@@ -29,11 +29,11 @@ function main()
     println("== runtime small [$LABEL] nt=1 ==")
     for c in cases
         t = bench_one(c, sizes[c.N])
-        ns = t * 1e9
+        ns = t * 1.0e9
         println(io, "$(name(c))\t$(round(ns; digits = 2))")
         println(rpad(name(c), 32), "  ", round(ns; digits = 2), " ns")
     end
     close(io)
-    println("\nwrote $out")
+    return println("\nwrote $out")
 end
 main()
diff --git a/benchmark/spec_count.jl b/benchmark/spec_count.jl
index 78e4a4c..7226917 100644
--- a/benchmark/spec_count.jl
+++ b/benchmark/spec_count.jl
@@ -43,6 +43,7 @@ function workload()
             end
         end
     end
+    return
 end
 
 function main()
@@ -67,6 +68,6 @@ function main()
             println(rpad(nm, 24), "  ", n)
         end
     end
-    println("wrote $out")
+    return println("wrote $out")
 end
 main()

@codecov

codecov Bot commented Jun 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

  Files with missing lines  Coverage Δ
  src/mapreduce.jl          80.50% <100.00%> (ø)

@lkdvos lkdvos marked this pull request as draft June 19, 2026 00:04