Raisetolinalg#412
Draft
arpitj1 wants to merge 166 commits into
Draft
Conversation
…erands and map for linalg.generic
…ded extra tests in lit test
…f debufferizing added which works for tiling and fusion
…rray overrun Two findings from debugging the gesummv/gemver DIFF correctness: 1. *daxpby host-side*. The CUDA daxpby shim (cublasDscal + cublasDaxpy, pre-existing version) was correct, but the H↔D copy + two cuBLAS calls dominated any GPU benefit for the O(N) bandwidth-bound op. Replaced with a straight host loop. Verified to give identical output bits to the CUDA path; the corruption persists with both paths, so axpby itself is not the bug. 2. *print_array overrun is aarch64-specific.* Built a CPU-stub variant of gesummv for Jetson aarch64 (kernel.o linked against polygeist_cublas_rt_cpu.o instead of rt_cuda.o). It reproduces the exact same overrun — polybench's print_array reads ~17 extra elements past `y[n-1]` into adjacent heap. The same lowered MLIR + CPU stub on *x86* is bit-exact, so this is NOT a lowering bug and NOT a CUDA shim bug. Most likely an aarch64 calling-convention or stack-frame issue with 32-arg flat-memref impl signatures (kernel_gesummv_impl has 32 LLVM args after memref expansion). The kernel itself writes correct values to polybench's y[0..n-1] (verified at wrapper-exit boundary). Only print_array's read-loop bound is wrong. JETSON_RUNTIMES comment updated to record this distinction so future debug doesn't re-investigate the CUDA path. Next step is to inspect the LLVM IR's aarch64-specific stack frame for kernel_gesummv_impl — likely a mismatch between gcc-aarch64's outgoing-args sizing and the LLVM-generated callee's incoming-args layout. atax/bicg keep their PASS status (bit-exact at MINI) because their impl signatures are smaller (28 args, all GP, fit closer to the 8-X-register limit).
Root-cause for the heap-corruption-looking dump diff in gesummv/gemver
on Jetson aarch64: it wasn't heap corruption. gcc at -O3 examined the
local static body of `kernel_<name>` in the same translation unit,
ran intraprocedural-analysis passes (modref, pure-const), and decided
the kernel doesn't clobber w0. So main loaded `w0 = N` once before
init_array and *never reloaded it* before kernel_gesummv or
print_array — banking on the IPA conclusion.
But objcopy --weaken-symbol redirects the call at link time to our
wrapper, and AArch64 ABI says w0 is a scratch register the callee is
free to use. The wrapper does use it. Result: when main calls
print_array, w0 holds whatever the wrapper happened to leave there
(typically ~49 for the gesummv case, since the wrapper's final
fprintf returns the byte count of its formatted string). print_array's
`for (i=0; i<n; i++)` loop then iterates 49 times instead of 32,
reading 17 doubles of adjacent heap = the 1e+150-range "garbage."
Disassembly confirmed:
before fix: bl kernel_gesummv ; bl print_array (no w0 reload)
after fix: bl kernel_gesummv ; mov w0, #0x20 ; bl print_array
Fix: change `-Dstatic=` (which left the local body visible to gcc's
IPA) to `-Dstatic=__attribute__((noipa))`. This tags every kernel_*
body as IPA-opaque, forcing gcc to emit ABI-correct caller-saved-reg
reloads at every call site.
Verified MINI:
atax md5 GPU == md5 CPU (BIT-EXACT)
bicg md5 GPU == md5 CPU (BIT-EXACT)
gesummv md5 GPU == md5 CPU (BIT-EXACT)
mvt runs cleanly (no more segfault, exit 0). Small numerical
drift remains because the matcher fissioned mvt's
accumulating x1/x2 init wrong (kernel overwrites with β=0
instead of accumulating into them). Separate matcher bug.
gemver same story — small drift from dropped initial-value
contribution. Separate matcher fix needed.
JETSON_RUNTIMES updated: atax/bicg/gesummv now PASS, mvt/gemver still
DIFF but with the residual diagnosis recorded. mvt loses its ABORT
status — kernel runs to completion now.
Memory note for future-me: any time a polybench harness uses
`-Dstatic=` to weaken a `static void kernel_*()` for symbol
substitution, ALSO upgrade it to `-Dstatic=__attribute__((noipa))`
or gcc -O3's IPA will silently bake in invalid assumptions about
caller-saved-reg preservation. The bug manifests as nonsense data
in stack-resident locals (like n, n_iterations, loop bounds) AFTER
the wrapper returns.
On Jetson Orin, the CPU and integrated GPU share the same physical DRAM
(LPDDR5). Our prior runtime did cudaMalloc + cudaMemcpyH2D + cuBLAS +
cudaMemcpyD2H + cudaFree for every shim call — which on Tegra means
copying within the same DRAM to itself before/after the actual compute.
Replaced that pattern with cudaHostRegister(host_ptr, bytes,
cudaHostRegisterMapped) + cudaHostGetDevicePointer + direct cuBLAS call.
Sets up the iGPU's page-table mapping for polybench's existing buffers,
no extra allocations, no data movement.
Tried bypassing cudaHostRegister entirely (just passing host pointers to
cuBLAS, trusting UVA on Tegra) — fails with illegal-memory-access. cuBLAS
needs the buffer registered or device-allocated even when the iGPU can
technically reach it. cudaHostRegister is the right call.
Aliased operands (e.g. syrk's A passed as both A and B) are handled by a
register_host_safe() helper that silently tolerates
cudaErrorHostMemoryAlreadyRegistered. Same for unregister.
Refactored shims:
- polygeist_cublas_dgemm
- polygeist_cublas_dgemv (+ dgemv_T)
- polygeist_cublas_daxpy_unit
- polygeist_cublas_dger_rank2
Skipped (no net win expected):
- polygeist_cublas_daxpby — already host-side
- polygeist_cublas_memset_zero_{1d,2d} — already host-side
- polygeist_cublas_dscal_2d — already host-side
- cuDNN conv2d shims — cuDNN setup/algo-select dominates, not H↔D
Re-ran all 12 Jetson kernels with the new runtime:
Kernel MINI LARGE EXTRALARGE Δ vs prior
gemm 29.5 ms 78.8 ms 408 ms -69% / -47% / -16%
2mm 30.4 ms 98.8 ms 471 ms -67% / -41% / -16%
3mm 30.6 ms 146.0 ms 789 ms -68% / -33% / -12%
syrk 29.7 ms 291.6 ms 1960 ms +4% / -4% / -3%
atax 35.8 ms 265.4 ms - 0% / -29%
bicg 36.4 ms 265.8 ms - 0% / -26%
gesummv 32.2 ms 263.0 ms - 0% / -29%
gemver 34.2 ms 449.9 ms - 0% / -31%
mvt 36.0 ms - - 0%
Pattern: MINI gemm-family sees ~3× speedup (almost all of the prior
~94 ms was H↔D); LARGE for bandwidth-bound gemv kernels gets ~25-30%
(the cuBLAS work is roughly bandwidth-limited, so eliminating one
DRAM round-trip helps). LARGE/XLARGE for compute-bound gemm sees
smaller relative gains because the cuBLAS dgemm time dominates.
scripts/correctness/polybench_cublas_jetson.sh: also link -lcudnn now
(the shim file includes cuDNN code, so link picks it up unconditionally).
Correctness re-verified bit-exact at MINI for atax/bicg/gesummv/syrk
via md5 of GPU dump against CPU dump.
cudaHostRegister has real cost on Jetson — page-table setup for the mapped range is proportional to buffer size. For an 8000×8000 double matrix (128K pages) it's measurable. Gemver does 4 shim calls on the same A, so we were re-registering A four times per kernel run. Replaced the per-call register/unregister with a persistent cache: register on first sight, never unregister. A small flat array (cap=256) keyed on host pointer caches the device pointer. The OS reclaims the mappings at program exit. Effect on LARGE (n=8000): gemver: 450 ms → 390 ms (4 ops on A — biggest win) gesummv: 263 ms → 242 ms atax: 265 ms → 244 ms bicg: 266 ms → 245 ms gemm/2mm/3mm/syrk barely move (each call has distinct buffers, no amortization possible). MINI numbers also unchanged — fixed cuBLAS handle + first-register costs dominate, the cache only helps after. These gemv-style kernels are bandwidth-bound: each cublasDgemv on n=8000 streams 512 MB of A → minimum ~3 ms at Jetson Orin LPDDR5 peak (~204 GB/s). We measure ~120 ms per gemv → sustained ~4 GB/s, about 2% of peak. The big gap is cuBLAS's row-major-via-OP_T emulation — non- coalesced access. To go faster we'd need to either (a) transpose A to column-major once and use OP_N, or (b) fuse the multiple gemvs into a single kernel that streams A once. Both are matcher/lowering changes, not runtime. CPU LARGE numbers (Jetson ARM cores, plain -O3) for reference: atax 107 ms, bicg 294 ms, gesummv 293 ms, gemver 575 ms. So gemver/gesummv beat the CPU at LARGE but only modestly. atax is slower than CPU at LARGE — its inner loop is so trivially vectorizable that the ARM cores' wider memory subsystem wins.
Added a "notes" column next to the speedup column in the per-suite Jetson tables. Each (kernel, dataset) entry gains an optional "notes" string in JETSON_RUNTIMES; the explorer renders it as a small-text grey cell at the row tail. Notes fall into a few buckets: - "Setup-bound": MINI runs across all kernels. The 28-36 ms floor is cuBLAS handle init + first cudaHostRegister page-map for one of the larger buffers; the actual kernel work is microseconds. - "Bandwidth-bound dgemv via OP_T": atax/bicg/gesummv LARGE. cuBLAS emulates row-major y=A·x by passing A as col-major-Aᵀ and applying OP_T. The OP_T kernel uses strided reads across A's rows, killing coalescing. Measured throughput ~2-5% of peak DRAM bandwidth (~204 GB/s on Jetson Orin LPDDR5). CPU's wider memory subsystem + auto-vectorised contiguous-access loops keep pace. - "Matcher fission bug": mvt / gemver. The matcher didn't fission the accumulating init step (kernel.launch overwrites x1/x2/w with β=0 instead of += into the polybench-initialised values). Numerical output is off; wall-clock timing is real. - conv2d: rerun on the current runtime (the conv2d shims weren't touched in the zero-copy refactor but the surrounding runtime got cheaper). New numbers: MINI 27 ms / LARGE 140 ms / EXTRALARGE 305 ms. 3×3 stencil has AI≈1, so it's bandwidth-bound regardless of hardware; cuDNN can't reuse the filter across enough output elements to amortise descriptor setup. - syrk: matched as cublasDgemm with B=A pointer alias. cuBLAS doesn't recognise the symmetry; runs full M*N*K work. A native cublasDsyrk matcher pattern would be ~2× faster (it only updates the lower triangle). No runtime changes. Just metadata + a column.
Adds the darknet (pjreddie/darknet) third-party clone as a fifth
benchmark suite in the IR explorer. The "kernels" are individual .c
files in src/; the bake runs cgeist + raise + match on each.
Approach:
bake_darknet_mlir.sh iterates over third_party/darknet/src/*.c,
baking each through:
cgeist --function='*' --no-inline ...
polygeist-opt --raise-affine-to-linalg-pipeline --linalg-debufferize
kernel_match_rewrite.py
Files use --function='*' because darknet's compute is spread across
many entry points (gemm_nn/nt/tn/tt all need to lift); --no-inline
prevents the raise pass from collapsing init-into-kernel boilerplate
the way it used to on polybenchGpu.
Results (46 .c files, ~25K LOC total):
cgeist OK: 28 (61%)
raise OK: 23 (50%)
produced ≥1 linalg.generic: 18 (39%)
produced ≥1 kernel.launch: 1 ( 2%)
The 1 file that matches: src/gemm.c (6 launches across gemm_nn / nt /
tn / tt / bin). The 17 raise-OK-but-no-match files are an actionable
list of missing matcher templates: pooling (avg/max), batchnorm, LRN,
residual-add, GRU/LSTM gates, transposed conv, locally-connected, dense
+ bias, softmax-with-control-flow, l2norm. The 18 cgeist-fails are
mostly framework code (parser, image, data, network) with no compute.
darknet's actual production hot path is gemm_nn (TA=TB=0). The matcher
hits it as @cublasDaxpy (the inner loop has the scalar-hoisted axpy
shape) but doesn't compose the outer two loops back up into gemm.
gemm_nt and gemm_tt use the conventional sum-accumulator form and do
match as @cublasDgemm_alpha_only. Fixing gemm_nn composition is a
high-value matcher follow-up — it would auto-cover every conv layer
darknet runs at inference time (since every conv goes through gemm_nn
via im2col).
New section in build_ce_viewer.py:
- DARKNET_ROOT / DARKNET_MLIR_DIR path constants
- DARKNET_KERNELS dict (45 .c files)
- DARKNET_NOTES per-file with parallelism tag + characterisation
- DARKNET_BLOCKERS per-file mapped to existing taxonomy
(matcher-gap, cgeist-gap, debuf-bug, none)
- find_kernel_c dispatch for kset="darknet"
- build_index gains darknet_stats parameter
- new section + nav link to "#darknet"
The third_party/darknet/ clone itself is NOT committed (it's a vendored
upstream, would bloat the repo to ~25K LOC for the framework + cfgs).
The bake script's PATH is hardcoded so a fresh clone reproduces the
results.
…end on Jetson Orin
Polybench-style C kernels in third_party/cnn-extracted/, each lifted through the
full Polygeist pipeline (cgeist → raise → debufferize → matcher → ABI lowering →
LLVM IR → aarch64 cross-compile → Jetson silicon).
Five extracted-darknet baseline kernels (matcher templates + lowering branches +
cuDNN/cuBLAS shims + harness + per-kernel HTML page in the IR explorer):
conv2d_batched → cudnnConvolutionFwd_batched 23.8x LARGE
maxpool_batched → cudnnMaxPoolFwd_batched 1.29x LARGE
batchnorm_batched → cudnnBatchNormalizationForwardInference 0.38x LARGE
shortcut_batched → cudnnAddTensor_batched 0.08x LARGE
conv_bn_relu_batched → cudnnConvolutionBiasActivationForward
(with host-side BN folding) 23.5x LARGE
Four fusion-optimization kernels (algebraic rewrites + faster cuBLAS/cublasLt/
cuDNN entry points):
conv_bias_relu_add_batched → cudnnConvolutionBiasActivationForward
(α2*Z addend for ResNet skip) 23x LARGE
gemm_bias_relu → cublasLtMatmul EPILOGUE_RELU_BIAS 901x LARGE
ata_gemm → cublasSsyrk (operand-alias discriminator
detects AᵀA pattern; half the flops) 3393x LARGE
conv1x1_batched → cublasSgemmStridedBatched (4-par+1-red
shape distinguishes K=1 from K×K) 105x LARGE
Cross-cutting infrastructure additions:
* Matcher: ~9 new CompositionEntry templates + AᵀA→syrk post-unify operand-alias
discriminator in kernel_match_rewrite.py. Per-step span replacement preserves
intervening polygeist.submap ops between matched generics.
* Lowering pass: resolveSubmapBase now chains through both polygeist.submap and
polygeist.submapInverse (up to 16 hops). New pre-pass elides redundant
memset_zero_{1D,2D} launches preceding any β=0 op (syrk). Dtype-suffixed
memset dispatch (f32 alongside f64).
* Runtime: cublasLt linkage (libcublasLt.so.12); ensure_cublaslt() helper.
Host-side BN-folding for fused conv+bn+relu (precompute scaled filter + bias).
All cuDNN algo-selection loops use array-sized cudnnConvolutionFwdAlgoPerf_t
buffers (avoiding the stack-smash that bit single-struct attempts).
* Build: scripts/correctness/extracted_darknet_jetson.sh handles all 9 kernels;
bake_extracted_darknet_mlir.sh produces per-stage MLIR snapshots for the
IR explorer; -lcublasLt added to link line.
* IR explorer: two new sections (extracted darknet, Fusion optimization) with
Compiler Explorer deep-links + per-kernel raised/debuf/matched IR preview
pages.
All four fusion optimizations are 100% bit-exact (or FP-noise within 1e-4 print
precision); LARGE speedups range 23x→3393x over the CPU 3-loop reference on the
Jetson Orin (Tegra Ampere, FP32, cuDNN 9.x, CUDA 12.6).
…nv2d + 4 image filters on Jetson Orin
* New LowerKernelLaunchToPVA pass — owns the matcher's i8/i16
@cudnnConvolution2D_9tap_* launches plus new
@pvaBoxFilter_3x3_i{8,16}, @pvaGaussianFilter_3x3_i{8,16},
@pvaBilateralFilter_3x3_i{8,16}, @pvaHistogramEqualization_i8
symbols. Each routes to a polygeist_pva_* runtime shim. Disjoint
symbol set from --lower-kernel-launch-to-cublas; the two passes
run side by side; either order works.
* Shared 9-tap conv lowering helper extracted out of
LowerKernelLaunchToCuBLAS.cpp into KernelLaunchLoweringUtils.{h,cpp}
so both backend passes call the same body. Added a parallel
lowerImageFilter2Operand helper for the 2-memref filter launch
shape (Box/Gaussian/Bilateral/HistogramEq).
* cuBLAS pass: dropped i8/i16 from shimSymbolFor + the dispatch
switch; PVA-claimed launches fall through with a `continue`
instead of erroring out. Net diff is small in the cuBLAS pass
file (the 3 helpers moved out are the bulk of the delta).
* New PVA runtime shim runtime/polygeist_pva_rt.c with:
- cudaSetDevice + nvcvAllocatorConstructPva + non-blocking
stream init (idempotent, lazy, persistent for process lifetime)
- make_pva_image_tensor_dtype: HWC tensor alloc through the PVA
allocator with arbitrary NVCV dtype (needed because half the
PVA ops are U8-only; we reinterpret i8 bytes as U8)
- CupvaMemGetHostPointer-mediated host I/O (raw cudaMemcpy
segfaults on cuPVA-allocated pages; the host-pointer mapping
is mandatory)
- One pva<Op>Create / pva<Op>Submit wrapper per op
- (M-2)×(N-2) interior copy from PVA output back to caller B
to honour the matcher's &B[1][1] pointer-shift convention
(writing the full M×N overflows B by N+1 bytes)
* Matching CPU reference stubs in polygeist_cublas_rt_cpu.c modelled
to mirror PVA hardware semantics: centred kernel anchor, REPLICATE
border, Q-format >>qbits shift, unsigned-kernel reinterpretation
for Conv2d; rounded-mean (sum + 4) / 9 for BoxFilter; canonical
[1,2,1;2,4,2;1,2,1] / 16 for Gaussian; textbook 256-bin CDF-LUT
for HistogramEq. Bilateral has a pass-through stub (the
non-linear hardware semantics aren't worth mirroring bit-exactly).
* third_party/polybenchGpu-extracted/conv2d_i8.c — i8 variant of
the 9-tap stencil (i16 already existed). Matcher fires on it via
the existing dtype-suffix template + emits
@cudnnConvolution2D_9tap_i8, which the new PVA pass claims.
* Cross-compile script conv2d_cudnn_jetson_dtype.sh: i8 dtype
branch added; PVA-library link line (-lpva_operator -lcvcuda
-lnvcv_types -lcupva_host) plus direct DT_NEEDEDs for
-lnvscibuf -lnvscisync via -Wl,--no-as-needed (deferred
resolution segfaults during libcupva_host init constructors);
step (5) now invokes both --lower-kernel-launch-to-cublas
and --lower-kernel-launch-to-pva.
* Four hand-authored kernel.launch test scaffolds in
scripts/correctness/pva_{boxfilter,gaussian,bilateral,histeq}_jetson.sh.
Matcher templates for these C-level patterns aren't written yet,
so each script synthesises the kernel.launch MLIR directly and
runs the rest of the pipeline normally — same harness, wrapper,
ABI lowering, and link line.
* IR explorer (scripts/correctness/build_ce_viewer.py): new "PVA
backend" section at the bottom. Shows the 6 PVA-routed kernels
with their op name, libpva_operator entry points, shim symbol,
and Jetson PVA wall-clock at each size we benchmarked. No CPU
comparison in this view (CPU stubs exist for separate per-op
bit-exact validation).
* CLAUDE.md: "point, don't copy" rule for gated-distribution NVIDIA
SDKs. PVA Solutions / cuPVA SDK headers consumed via -I at build
time; never copied into the Polygeist tree.
End-to-end silicon validation on Jetson Orin: bit-exact PVA-vs-CPU
diff for Conv2d i8/i16, BoxFilter, Gaussian, and HistogramEq at 256².
Bilateral runs cleanly; visual spot-check only (non-linear).
Conv2d at 10240×10240: PVA 216 ms vs CPU 499 ms (2.3× speedup for i8).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Some modifications to fuse linalg.generic op with for op