Erase op types from mapreduce bookkeeping for precompilation #67
Draft
lkdvos wants to merge 1 commit into
The mapreduce setup chain (`_mapreducedim!`, `_mapreduce_fuse!`, `_mapreduce_order!`, `_mapreduce_block!`, `_mapreduce_threaded!`, and the public `map`/`map!`/`mapreduce`/`mapreducedim!`/`_mapreduce` entry points) previously specialized on the map/reduce function types `f`/`op`/`initop`. None of the bookkeeping logic depends on what those functions are: it only fuses dimensions, sorts the loop order by cache-importance and computes cache blocks from the array shapes/strides. Yet a workload that calls mapreduce with many distinct ops (as TensorOperations does) forced a fresh compilation of the entire chain per (op, eltype) combination.

`@nospecialize` the function arguments throughout so the bookkeeping specializes on the array-shape signature (ndims/n-arrays/eltypes) but no longer on the ops. A precompile workload can then compile it once per shape signature and reuse it across every op. The bookkeeping runs once per mapreduce call (coarse granularity), so erasing the op types is free at runtime; the only function that still specializes on the op is the monolithic `@generated _mapreduce_kernel!`, which is kept untouched.

The per-array data is deliberately kept as stack-allocated `Tuple`s rather than `M`-erased `Vector`s; the latter (a pure-`N` bookkeeping variant) was prototyped and rejected because it roughly doubled small-array call overhead for no additional spec-count reduction (see the note in `mapreduce.jl`).

The GPU `_mapreduce_block!` extension hook is unchanged: this commit does not alter `_mapreduce_block!`'s signature, so the extension's dispatch boundary is preserved.

Adds a `benchmark/` harness (compile / many-op / runtime / spec-count) used to validate the change.

Method specializations after a multi-op × multi-eltype × multi-ndims workload:

| function | baseline | branch |
| --- | --- | --- |
| `_mapreduce_block!` | 346 | 75 |
| `_mapreduce_fuse!` | 202 | 75 |
| `_mapreduce_order!` | 202 | 75 |
| `_mapreduce_threaded!` | 515 | 124 |
| `_mapreduce_kernel!` | 659 | 412 |

Compile time: grid 28.8s -> 26.5s; many-distinct-ops 14.5s -> 11.0s. Runtime (single-thread, BenchmarkTools) neutral-to-better on both tiny (4^N) and large (~4M-element) arrays; per-call allocations drop (e.g. 768 -> 464 B). `Pkg.test` passes single- and multi-threaded, including the JLArray and CuArray GPU mapreduce/reduce tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
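The pattern the commit message describes can be sketched in a few lines. This is a minimal illustration, not Strided's code: `toy_setup!` and `toy_kernel!` are hypothetical stand-ins for the bookkeeping chain and the kernel. The setup function only forwards `f` and `op`, so it marks them `@nospecialize`; the kernel keeps its type parameters so the hot loop stays fully typed.

```julia
# Hypothetical stand-ins; these are NOT Strided's actual functions.

# Kernel: still specializes on `f` and `op` (type parameters F and O),
# so the inner loop is compiled for the concrete functions.
function toy_kernel!(f::F, op::O, out::Vector{T}, x::Vector{T}) where {F, O, T}
    acc = out[1]
    for v in x
        acc = op(acc, f(v))
    end
    out[1] = acc
    return out
end

# Bookkeeping: `f` and `op` are only checked-around and forwarded, so their
# types are erased here. The call into the kernel is a dynamic dispatch,
# paid once per `toy_setup!` call.
function toy_setup!(@nospecialize(f), @nospecialize(op), out::Vector, x::Vector)
    length(out) == 1 || throw(DimensionMismatch("out must have length 1"))
    return toy_kernel!(f, op, out, x)
end

# Sum of squares, accumulated into a one-element output buffer.
out = toy_setup!(abs2, +, [0.0], [1.0, 2.0, 3.0])
```

The design choice mirrors the one stated above: the erased layer does shape-only work once per call, so the single dynamic dispatch at the kernel boundary is the only place where the op types matter.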
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:
diff --git a/benchmark/cases.jl b/benchmark/cases.jl
index bc1045c..5c5785c 100644
--- a/benchmark/cases.jl
+++ b/benchmark/cases.jl
@@ -27,7 +27,7 @@ function sizetuple(N::Int, total::Int)
return ntuple(_ -> d, N)
end
-function make_runner(c::Case, sz::NTuple{N,Int}) where {N}
+function make_runner(c::Case, sz::NTuple{N, Int}) where {N}
T = c.T
if c.kind == permute
p = reverse(ntuple(identity, Val(N))) # reverse perm: defeats fusion
diff --git a/benchmark/compile_bench.jl b/benchmark/compile_bench.jl
index b0d2def..2fc0ac5 100644
--- a/benchmark/compile_bench.jl
+++ b/benchmark/compile_bench.jl
@@ -25,7 +25,7 @@ function main()
kinds = (permute, add, reduce_inner, reduce_outer, reduce_full),
)
- rows = Tuple{String,Float64,Float64}[] # name, compile_time, total_time
+ rows = Tuple{String, Float64, Float64}[] # name, compile_time, total_time
total_compile = 0.0
for c in cases
sz = sizetuple(c.N, SMALL_TOTAL)
@@ -51,7 +51,7 @@ function main()
end
println("-"^56)
println(rpad("TOTAL compile_time (s)", 32), " ", round(total_compile; digits = 4))
- println("\nwrote $out")
+ return println("\nwrote $out")
end
main()
diff --git a/benchmark/manyops_compile.jl b/benchmark/manyops_compile.jl
index 9195281..75457a0 100644
--- a/benchmark/manyops_compile.jl
+++ b/benchmark/manyops_compile.jl
@@ -43,7 +43,7 @@ function main()
open(joinpath(@__DIR__, "results", "manyops_$(LABEL).txt"), "w") do io
println(io, "label=$LABEL Kops=$KOPS Nmax=$NMAXD total_compile_s=$(round(total; digits = 4))")
end
- println("[$LABEL] Kops=$KOPS Nmax=$NMAXD TOTAL compile_time = $(round(total; digits = 4)) s")
+ return println("[$LABEL] Kops=$KOPS Nmax=$NMAXD TOTAL compile_time = $(round(total; digits = 4)) s")
end
main()
diff --git a/benchmark/runtime_bench.jl b/benchmark/runtime_bench.jl
index d01557a..1456226 100644
--- a/benchmark/runtime_bench.jl
+++ b/benchmark/runtime_bench.jl
@@ -44,13 +44,13 @@ function main()
println("== runtime [$LABEL] nthreads=$nt ==")
for c in cases
t = bench_one(c)
- us = t * 1e6
+ us = t * 1.0e6
println(io, "$(name(c))\t$nt\t$(round(us; digits = 3))")
println(rpad(name(c), 32), " nt=$nt ", round(us; digits = 3), " us")
end
end
close(io)
- println("\nwrote $out")
+ return println("\nwrote $out")
end
main()
diff --git a/benchmark/runtime_small.jl b/benchmark/runtime_small.jl
index a90be81..3901bf9 100644
--- a/benchmark/runtime_small.jl
+++ b/benchmark/runtime_small.jl
@@ -29,11 +29,11 @@ function main()
println("== runtime small [$LABEL] nt=1 ==")
for c in cases
t = bench_one(c, sizes[c.N])
- ns = t * 1e9
+ ns = t * 1.0e9
println(io, "$(name(c))\t$(round(ns; digits = 2))")
println(rpad(name(c), 32), " ", round(ns; digits = 2), " ns")
end
close(io)
- println("\nwrote $out")
+ return println("\nwrote $out")
end
main()
diff --git a/benchmark/spec_count.jl b/benchmark/spec_count.jl
index 78e4a4c..7226917 100644
--- a/benchmark/spec_count.jl
+++ b/benchmark/spec_count.jl
@@ -43,6 +43,7 @@ function workload()
end
end
end
+ return
end
function main()
@@ -67,6 +68,6 @@ function main()
println(rpad(nm, 24), " ", n)
end
end
- println("wrote $out")
+ return println("wrote $out")
end
main()
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Motivation
Strided's mapreduce setup chain (`_mapreducedim!` → `_mapreduce_fuse!` (fuse contiguous dims) → `_mapreduce_order!` (sort dims by cache-importance) → `_mapreduce_block!` (compute cache blocks) → `_mapreduce_threaded!`, plus the public `map`/`map!`/`mapreduce`/`mapreducedim!`/`_mapreduce` entry points) previously specialized on the map/reduce function types `f`/`op`/`initop`.

None of this bookkeeping depends on what those functions are: it only fuses dimensions, sorts the loop order, and computes cache blocks from the array shapes and strides, forwarding `f`/`op`/`initop` untouched to the kernel. Yet, because the chain specialized on the function types, a workload that calls mapreduce with many distinct ops (exactly what TensorOperations triggers) forced a fresh compilation of the entire setup chain for every `(op, eltype)` combination, which is a large per-op precompilation/TTFX cost.

What changed
`@nospecialize` the `f`/`op`/`initop` arguments across the whole chain. The bookkeeping now specializes only on the array-shape signature (ndims `N`, number of arrays `M`, eltypes) and no longer on the ops, so a precompile workload compiles it once per shape signature and reuses it across every op. Because the bookkeeping runs once per mapreduce call (coarse granularity), erasing the op types is free at runtime.

The monolithic `@generated _mapreduce_kernel!` is kept untouched (its loop nest is left intact; splitting it regresses permutation runtime).

The per-array data (`strides`, `offsets`, …) is deliberately kept as stack-allocated `Tuple`s. A variant that carried this data in `M`-erased `Vector`s (specializing the bookkeeping purely on `N`) was prototyped and rejected: it roughly doubled small-array call overhead (heap allocation + dynamic dispatch) while producing an identical method-specialization count, because the remaining specializations are bounded by the distinct `arrays` tuple types either way (the `@generated` kernel genuinely needs that concrete type). The reasoning is documented in a comment in `mapreduce.jl`.

GPU extension
This PR does not change `_mapreduce_block!`'s signature, so the `StridedGPUArraysExt` `_mapreduce_block!` override's dispatch boundary is preserved unchanged. The GPU mapreduce/reduce tests pass on both `JLArray` and a real `CuArray` backend (see validation).

Overlap with the `@nospecialize`-only PR
This overlaps the separately-filed `@nospecialize` + `_mapreduce_kernel_expr` PR: both add `@nospecialize` to the bookkeeping. This PR is the focused, runtime-validated version of the bookkeeping precompile fix (it does not include the kernel `_expr` split) and supersedes the `@nospecialize`-on-bookkeeping portion of that PR; the maintainer can sequence/pick whichever is preferred. The documented Vector-vs-Tuple finding records why we stop at op-type erasure rather than going to pure-`N`.

Validation
All measured on Julia 1.12, baseline = `main`.

1. Tests pass. `Pkg.test("Strided")` passes single-threaded and multi-threaded (`-t 4`). The `StridedGPUArraysExt` extension precompiles and loads; the GPU `mapreduce`/`reduce` testsets pass on both `JLArray` (18/18, 14/14) and `CuArray` (18/18, 14/14).

2. Precompilation is more effective. Method specializations were counted after a multi-op × multi-eltype × multi-ndims workload for `_mapreduce_block!`, `_mapreduce_fuse!`, `_mapreduce_order!`, `_mapreduce_threaded!`, `_mapreduce_kernel!`, and `_computeblocks`.

Cold compile time (sum of `Base.@timed(...).compile_time` across the workload):
… `map!` + `mapreducedim!`)

The many-distinct-ops number is the realistic TensorOperations scenario and shows the largest win.
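The two measurements used here (method-specialization count and summed `compile_time`) can be reproduced on a toy pair of functions. This is a hedged sketch, not the `benchmark/` harness: `typed_setup`, `erased_setup`, `nspecs`, and `compile_seconds` are hypothetical names. It assumes Julia 1.11 or later, where `@timed` reports a `compile_time` field, and relies on `Base.specializations`, available since Julia 1.10. Exact counts and timings depend on the Julia version and on how the calls are dispatched, so none are shown.

```julia
# Hypothetical stand-ins for a setup function with and without op-type erasure.
typed_setup(f::F, x::Vector{Float64}) where {F} = sum(f, x)
erased_setup(@nospecialize(f), x::Vector{Float64}) = sum(f, x)

# Count the specializations recorded across all methods of `fn`.
nspecs(fn) = sum(m -> count(_ -> true, Base.specializations(m)), methods(fn); init = 0)

# Sum the compile time (in seconds) that `@timed` reports for each call.
function compile_seconds(setup, ops, x)
    total = 0.0
    for f in ops
        total += (@timed setup(f, x)).compile_time
    end
    return total
end

x = rand(8)
ops = Any[sin, cos, exp, abs2]  # Vector{Any}: each call is dispatched dynamically

t_typed = compile_seconds(typed_setup, ops, x)
t_erased = compile_seconds(erased_setup, ops, x)

println("typed:  ", nspecs(typed_setup), " specializations, ", t_typed, " s compile")
println("erased: ", nspecs(erased_setup), " specializations, ", t_erased, " s compile")
```

Storing the ops in a `Vector{Any}` is deliberate: it keeps every call a runtime dispatch, which is the situation in which `@nospecialize` lets one compiled instance serve all ops.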
3. Runtime unaffected. Single-thread, BenchmarkTools, back-to-back baseline vs branch.
- … 4×4×4-scale) arrays: neutral on map/permute, and faster on reductions (e.g. `reduce_inner` N=2 ≈ −48%, `reduce_full` N=2 ≈ −24%), because `@nospecialize` also trims per-call work/allocations.
- … `mapreducedim!`: 768 B → 464 B).

A reusable `benchmark/` harness (`compile_bench.jl`, `manyops_compile.jl`, `runtime_bench.jl`, `runtime_small.jl`, `spec_count.jl`, `cases.jl`) is included for future regression checks.

🤖 Generated with Claude Code