Raisetolinalg by arpitj1 · Pull Request #412 · llvm/Polygeist

arpitj1 · 2024-06-06T15:32:58Z

Some modifications to fuse linalg.generic op with for op

…eric and for op

…erands and map for linalg.generic

…ded extra tests in lit test

…generic

…cases failing

…f debufferizing added which works for tiling and fusion

…nonicalize

…rray overrun Two findings from debugging the gesummv/gemver DIFF correctness: 1. *daxpby host-side*. The CUDA daxpby shim (cublasDscal + cublasDaxpy, pre-existing version) was correct, but the H↔D copy + two cuBLAS calls dominated any GPU benefit for the O(N) bandwidth-bound op. Replaced with a straight host loop. Verified to give identical output bits to the CUDA path; the corruption persists with both paths, so axpby itself is not the bug. 2. *print_array overrun is aarch64-specific.* Built a CPU-stub variant of gesummv for Jetson aarch64 (kernel.o linked against polygeist_cublas_rt_cpu.o instead of rt_cuda.o). It reproduces the exact same overrun — polybench's print_array reads ~17 extra elements past `y[n-1]` into adjacent heap. The same lowered MLIR + CPU stub on *x86* is bit-exact, so this is NOT a lowering bug and NOT a CUDA shim bug. Most likely an aarch64 calling-convention or stack-frame issue with 32-arg flat-memref impl signatures (kernel_gesummv_impl has 32 LLVM args after memref expansion). The kernel itself writes correct values to polybench's y[0..n-1] (verified at wrapper-exit boundary). Only print_array's read-loop bound is wrong. JETSON_RUNTIMES comment updated to record this distinction so future debug doesn't re-investigate the CUDA path. Next step is to inspect the LLVM IR's aarch64-specific stack frame for kernel_gesummv_impl — likely a mismatch between gcc-aarch64's outgoing-args sizing and the LLVM-generated callee's incoming-args layout. atax/bicg keep their PASS status (bit-exact at MINI) because their impl signatures are smaller (28 args, all GP, fit closer to the 8-X-register limit).

Root-cause for the heap-corruption-looking dump diff in gesummv/gemver on Jetson aarch64: it wasn't heap corruption. gcc at -O3 examined the local static body of `kernel_<name>` in the same translation unit, ran intraprocedural-analysis passes (modref, pure-const), and decided the kernel doesn't clobber w0. So main loaded `w0 = N` once before init_array and *never reloaded it* before kernel_gesummv or print_array — banking on the IPA conclusion. But objcopy --weaken-symbol redirects the call at link time to our wrapper, and AArch64 ABI says w0 is a scratch register the callee is free to use. The wrapper does use it. Result: when main calls print_array, w0 holds whatever the wrapper happened to leave there (typically ~49 for the gesummv case, since the wrapper's final fprintf returns the byte count of its formatted string). print_array's `for (i=0; i<n; i++)` loop then iterates 49 times instead of 32, reading 17 doubles of adjacent heap = the 1e+150-range "garbage." Disassembly confirmed: before fix: bl kernel_gesummv ; bl print_array (no w0 reload) after fix: bl kernel_gesummv ; mov w0, #0x20 ; bl print_array Fix: change `-Dstatic=` (which left the local body visible to gcc's IPA) to `-Dstatic=__attribute__((noipa))`. This tags every kernel_* body as IPA-opaque, forcing gcc to emit ABI-correct caller-saved-reg reloads at every call site. Verified MINI: atax md5 GPU == md5 CPU (BIT-EXACT) bicg md5 GPU == md5 CPU (BIT-EXACT) gesummv md5 GPU == md5 CPU (BIT-EXACT) mvt runs cleanly (no more segfault, exit 0). Small numerical drift remains because the matcher fissioned mvt's accumulating x1/x2 init wrong (kernel overwrites with β=0 instead of accumulating into them). Separate matcher bug. gemver same story — small drift from dropped initial-value contribution. Separate matcher fix needed. JETSON_RUNTIMES updated: atax/bicg/gesummv now PASS, mvt/gemver still DIFF but with the residual diagnosis recorded. mvt loses its ABORT status — kernel runs to completion now. Memory note for future-me: any time a polybench harness uses `-Dstatic=` to weaken a `static void kernel_*()` for symbol substitution, ALSO upgrade it to `-Dstatic=__attribute__((noipa))` or gcc -O3's IPA will silently bake in invalid assumptions about caller-saved-reg preservation. The bug manifests as nonsense data in stack-resident locals (like n, n_iterations, loop bounds) AFTER the wrapper returns.

On Jetson Orin, the CPU and integrated GPU share the same physical DRAM (LPDDR5). Our prior runtime did cudaMalloc + cudaMemcpyH2D + cuBLAS + cudaMemcpyD2H + cudaFree for every shim call — which on Tegra means copying within the same DRAM to itself before/after the actual compute. Replaced that pattern with cudaHostRegister(host_ptr, bytes, cudaHostRegisterMapped) + cudaHostGetDevicePointer + direct cuBLAS call. Sets up the iGPU's page-table mapping for polybench's existing buffers, no extra allocations, no data movement. Tried bypassing cudaHostRegister entirely (just passing host pointers to cuBLAS, trusting UVA on Tegra) — fails with illegal-memory-access. cuBLAS needs the buffer registered or device-allocated even when the iGPU can technically reach it. cudaHostRegister is the right call. Aliased operands (e.g. syrk's A passed as both A and B) are handled by a register_host_safe() helper that silently tolerates cudaErrorHostMemoryAlreadyRegistered. Same for unregister. Refactored shims: - polygeist_cublas_dgemm - polygeist_cublas_dgemv (+ dgemv_T) - polygeist_cublas_daxpy_unit - polygeist_cublas_dger_rank2 Skipped (no net win expected): - polygeist_cublas_daxpby — already host-side - polygeist_cublas_memset_zero_{1d,2d} — already host-side - polygeist_cublas_dscal_2d — already host-side - cuDNN conv2d shims — cuDNN setup/algo-select dominates, not H↔D Re-ran all 12 Jetson kernels with the new runtime: Kernel MINI LARGE EXTRALARGE Δ vs prior gemm 29.5 ms 78.8 ms 408 ms -69% / -47% / -16% 2mm 30.4 ms 98.8 ms 471 ms -67% / -41% / -16% 3mm 30.6 ms 146.0 ms 789 ms -68% / -33% / -12% syrk 29.7 ms 291.6 ms 1960 ms +4% / -4% / -3% atax 35.8 ms 265.4 ms - 0% / -29% bicg 36.4 ms 265.8 ms - 0% / -26% gesummv 32.2 ms 263.0 ms - 0% / -29% gemver 34.2 ms 449.9 ms - 0% / -31% mvt 36.0 ms - - 0% Pattern: MINI gemm-family sees ~3× speedup (almost all of the prior ~94 ms was H↔D); LARGE for bandwidth-bound gemv kernels gets ~25-30% (the cuBLAS work is roughly bandwidth-limited, so eliminating one DRAM round-trip helps). LARGE/XLARGE for compute-bound gemm sees smaller relative gains because the cuBLAS dgemm time dominates. scripts/correctness/polybench_cublas_jetson.sh: also link -lcudnn now (the shim file includes cuDNN code, so link picks it up unconditionally). Correctness re-verified bit-exact at MINI for atax/bicg/gesummv/syrk via md5 of GPU dump against CPU dump.

cudaHostRegister has real cost on Jetson — page-table setup for the mapped range is proportional to buffer size. For an 8000×8000 double matrix (128K pages) it's measurable. Gemver does 4 shim calls on the same A, so we were re-registering A four times per kernel run. Replaced the per-call register/unregister with a persistent cache: register on first sight, never unregister. A small flat array (cap=256) keyed on host pointer caches the device pointer. The OS reclaims the mappings at program exit. Effect on LARGE (n=8000): gemver: 450 ms → 390 ms (4 ops on A — biggest win) gesummv: 263 ms → 242 ms atax: 265 ms → 244 ms bicg: 266 ms → 245 ms gemm/2mm/3mm/syrk barely move (each call has distinct buffers, no amortization possible). MINI numbers also unchanged — fixed cuBLAS handle + first-register costs dominate, the cache only helps after. These gemv-style kernels are bandwidth-bound: each cublasDgemv on n=8000 streams 512 MB of A → minimum ~3 ms at Jetson Orin LPDDR5 peak (~204 GB/s). We measure ~120 ms per gemv → sustained ~4 GB/s, about 2% of peak. The big gap is cuBLAS's row-major-via-OP_T emulation — non- coalesced access. To go faster we'd need to either (a) transpose A to column-major once and use OP_N, or (b) fuse the multiple gemvs into a single kernel that streams A once. Both are matcher/lowering changes, not runtime. CPU LARGE numbers (Jetson ARM cores, plain -O3) for reference: atax 107 ms, bicg 294 ms, gesummv 293 ms, gemver 575 ms. So gemver/gesummv beat the CPU at LARGE but only modestly. atax is slower than CPU at LARGE — its inner loop is so trivially vectorizable that the ARM cores' wider memory subsystem wins.

Added a "notes" column next to the speedup column in the per-suite Jetson tables. Each (kernel, dataset) entry gains an optional "notes" string in JETSON_RUNTIMES; the explorer renders it as a small-text grey cell at the row tail. Notes fall into a few buckets: - "Setup-bound": MINI runs across all kernels. The 28-36 ms floor is cuBLAS handle init + first cudaHostRegister page-map for one of the larger buffers; the actual kernel work is microseconds. - "Bandwidth-bound dgemv via OP_T": atax/bicg/gesummv LARGE. cuBLAS emulates row-major y=A·x by passing A as col-major-Aᵀ and applying OP_T. The OP_T kernel uses strided reads across A's rows, killing coalescing. Measured throughput ~2-5% of peak DRAM bandwidth (~204 GB/s on Jetson Orin LPDDR5). CPU's wider memory subsystem + auto-vectorised contiguous-access loops keep pace. - "Matcher fission bug": mvt / gemver. The matcher didn't fission the accumulating init step (kernel.launch overwrites x1/x2/w with β=0 instead of += into the polybench-initialised values). Numerical output is off; wall-clock timing is real. - conv2d: rerun on the current runtime (the conv2d shims weren't touched in the zero-copy refactor but the surrounding runtime got cheaper). New numbers: MINI 27 ms / LARGE 140 ms / EXTRALARGE 305 ms. 3×3 stencil has AI≈1, so it's bandwidth-bound regardless of hardware; cuDNN can't reuse the filter across enough output elements to amortise descriptor setup. - syrk: matched as cublasDgemm with B=A pointer alias. cuBLAS doesn't recognise the symmetry; runs full M*N*K work. A native cublasDsyrk matcher pattern would be ~2× faster (it only updates the lower triangle). No runtime changes. Just metadata + a column.

Adds the darknet (pjreddie/darknet) third-party clone as a fifth benchmark suite in the IR explorer. The "kernels" are individual .c files in src/; the bake runs cgeist + raise + match on each. Approach: bake_darknet_mlir.sh iterates over third_party/darknet/src/*.c, baking each through: cgeist --function='*' --no-inline ... polygeist-opt --raise-affine-to-linalg-pipeline --linalg-debufferize kernel_match_rewrite.py Files use --function='*' because darknet's compute is spread across many entry points (gemm_nn/nt/tn/tt all need to lift); --no-inline prevents the raise pass from collapsing init-into-kernel boilerplate the way it used to on polybenchGpu. Results (46 .c files, ~25K LOC total): cgeist OK: 28 (61%) raise OK: 23 (50%) produced ≥1 linalg.generic: 18 (39%) produced ≥1 kernel.launch: 1 ( 2%) The 1 file that matches: src/gemm.c (6 launches across gemm_nn / nt / tn / tt / bin). The 17 raise-OK-but-no-match files are an actionable list of missing matcher templates: pooling (avg/max), batchnorm, LRN, residual-add, GRU/LSTM gates, transposed conv, locally-connected, dense + bias, softmax-with-control-flow, l2norm. The 18 cgeist-fails are mostly framework code (parser, image, data, network) with no compute. darknet's actual production hot path is gemm_nn (TA=TB=0). The matcher hits it as @cublasDaxpy (the inner loop has the scalar-hoisted axpy shape) but doesn't compose the outer two loops back up into gemm. gemm_nt and gemm_tt use the conventional sum-accumulator form and do match as @cublasDgemm_alpha_only. Fixing gemm_nn composition is a high-value matcher follow-up — it would auto-cover every conv layer darknet runs at inference time (since every conv goes through gemm_nn via im2col). New section in build_ce_viewer.py: - DARKNET_ROOT / DARKNET_MLIR_DIR path constants - DARKNET_KERNELS dict (45 .c files) - DARKNET_NOTES per-file with parallelism tag + characterisation - DARKNET_BLOCKERS per-file mapped to existing taxonomy (matcher-gap, cgeist-gap, debuf-bug, none) - find_kernel_c dispatch for kset="darknet" - build_index gains darknet_stats parameter - new section + nav link to "#darknet" The third_party/darknet/ clone itself is NOT committed (it's a vendored upstream, would bloat the repo to ~25K LOC for the framework + cfgs). The bake script's PATH is hardcoded so a fresh clone reproduces the results.

…end on Jetson Orin Polybench-style C kernels in third_party/cnn-extracted/, each lifted through the full Polygeist pipeline (cgeist → raise → debufferize → matcher → ABI lowering → LLVM IR → aarch64 cross-compile → Jetson silicon). Five extracted-darknet baseline kernels (matcher templates + lowering branches + cuDNN/cuBLAS shims + harness + per-kernel HTML page in the IR explorer): conv2d_batched → cudnnConvolutionFwd_batched 23.8x LARGE maxpool_batched → cudnnMaxPoolFwd_batched 1.29x LARGE batchnorm_batched → cudnnBatchNormalizationForwardInference 0.38x LARGE shortcut_batched → cudnnAddTensor_batched 0.08x LARGE conv_bn_relu_batched → cudnnConvolutionBiasActivationForward (with host-side BN folding) 23.5x LARGE Four fusion-optimization kernels (algebraic rewrites + faster cuBLAS/cublasLt/ cuDNN entry points): conv_bias_relu_add_batched → cudnnConvolutionBiasActivationForward (α2*Z addend for ResNet skip) 23x LARGE gemm_bias_relu → cublasLtMatmul EPILOGUE_RELU_BIAS 901x LARGE ata_gemm → cublasSsyrk (operand-alias discriminator detects AᵀA pattern; half the flops) 3393x LARGE conv1x1_batched → cublasSgemmStridedBatched (4-par+1-red shape distinguishes K=1 from K×K) 105x LARGE Cross-cutting infrastructure additions: * Matcher: ~9 new CompositionEntry templates + AᵀA→syrk post-unify operand-alias discriminator in kernel_match_rewrite.py. Per-step span replacement preserves intervening polygeist.submap ops between matched generics. * Lowering pass: resolveSubmapBase now chains through both polygeist.submap and polygeist.submapInverse (up to 16 hops). New pre-pass elides redundant memset_zero_{1D,2D} launches preceding any β=0 op (syrk). Dtype-suffixed memset dispatch (f32 alongside f64). * Runtime: cublasLt linkage (libcublasLt.so.12); ensure_cublaslt() helper. Host-side BN-folding for fused conv+bn+relu (precompute scaled filter + bias). All cuDNN algo-selection loops use array-sized cudnnConvolutionFwdAlgoPerf_t buffers (avoiding the stack-smash that bit single-struct attempts). * Build: scripts/correctness/extracted_darknet_jetson.sh handles all 9 kernels; bake_extracted_darknet_mlir.sh produces per-stage MLIR snapshots for the IR explorer; -lcublasLt added to link line. * IR explorer: two new sections (extracted darknet, Fusion optimization) with Compiler Explorer deep-links + per-kernel raised/debuf/matched IR preview pages. All four fusion optimizations are 100% bit-exact (or FP-noise within 1e-4 print precision); LARGE speedups range 23x→3393x over the CPU 3-loop reference on the Jetson Orin (Tegra Ampere, FP32, cuDNN 9.x, CUDA 12.6).

…nv2d + 4 image filters on Jetson Orin * New LowerKernelLaunchToPVA pass — owns the matcher's i8/i16 @cudnnConvolution2D_9tap_* launches plus new @pvaBoxFilter_3x3_i{8,16}, @pvaGaussianFilter_3x3_i{8,16}, @pvaBilateralFilter_3x3_i{8,16}, @pvaHistogramEqualization_i8 symbols. Each routes to a polygeist_pva_* runtime shim. Disjoint symbol set from --lower-kernel-launch-to-cublas; the two passes run side by side; either order works. * Shared 9-tap conv lowering helper extracted out of LowerKernelLaunchToCuBLAS.cpp into KernelLaunchLoweringUtils.{h,cpp} so both backend passes call the same body. Added a parallel lowerImageFilter2Operand helper for the 2-memref filter launch shape (Box/Gaussian/Bilateral/HistogramEq). * cuBLAS pass: dropped i8/i16 from shimSymbolFor + the dispatch switch; PVA-claimed launches fall through with a `continue` instead of erroring out. Net diff is small in the cuBLAS pass file (the 3 helpers moved out are the bulk of the delta). * New PVA runtime shim runtime/polygeist_pva_rt.c with: - cudaSetDevice + nvcvAllocatorConstructPva + non-blocking stream init (idempotent, lazy, persistent for process lifetime) - make_pva_image_tensor_dtype: HWC tensor alloc through the PVA allocator with arbitrary NVCV dtype (needed because half the PVA ops are U8-only; we reinterpret i8 bytes as U8) - CupvaMemGetHostPointer-mediated host I/O (raw cudaMemcpy segfaults on cuPVA-allocated pages; the host-pointer mapping is mandatory) - One pva<Op>Create / pva<Op>Submit wrapper per op - (M-2)×(N-2) interior copy from PVA output back to caller B to honour the matcher's &B[1][1] pointer-shift convention (writing the full M×N overflows B by N+1 bytes) * Matching CPU reference stubs in polygeist_cublas_rt_cpu.c modelled to mirror PVA hardware semantics: centred kernel anchor, REPLICATE border, Q-format >>qbits shift, unsigned-kernel reinterpretation for Conv2d; rounded-mean (sum + 4) / 9 for BoxFilter; canonical [1,2,1;2,4,2;1,2,1] / 16 for Gaussian; textbook 256-bin CDF-LUT for HistogramEq. Bilateral has a pass-through stub (the non-linear hardware semantics aren't worth mirroring bit-exactly). * third_party/polybenchGpu-extracted/conv2d_i8.c — i8 variant of the 9-tap stencil (i16 already existed). Matcher fires on it via the existing dtype-suffix template + emits @cudnnConvolution2D_9tap_i8, which the new PVA pass claims. * Cross-compile script conv2d_cudnn_jetson_dtype.sh: i8 dtype branch added; PVA-library link line (-lpva_operator -lcvcuda -lnvcv_types -lcupva_host) plus direct DT_NEEDEDs for -lnvscibuf -lnvscisync via -Wl,--no-as-needed (deferred resolution segfaults during libcupva_host init constructors); step (5) now invokes both --lower-kernel-launch-to-cublas and --lower-kernel-launch-to-pva. * Four hand-authored kernel.launch test scaffolds in scripts/correctness/pva_{boxfilter,gaussian,bilateral,histeq}_jetson.sh. Matcher templates for these C-level patterns aren't written yet, so each script synthesises the kernel.launch MLIR directly and runs the rest of the pipeline normally — same harness, wrapper, ABI lowering, and link line. * IR explorer (scripts/correctness/build_ce_viewer.py): new "PVA backend" section at the bottom. Shows the 6 PVA-routed kernels with their op name, libpva_operator entry points, shim symbol, and Jetson PVA wall-clock at each size we benchmarked. No CPU comparison in this view (CPU stubs exist for separate per-op bit-exact validation). * CLAUDE.md: "point, don't copy" rule for gated-distribution NVIDIA SDKs. PVA Solutions / cuPVA SDK headers consumed via -I at build time; never copied into the Polygeist tree. End-to-end silicon validation on Jetson Orin: bit-exact PVA-vs-CPU diff for Conv2d i8/i16, BoxFilter, Gaussian, and HistogramEq at 256². Bilateral runs cleanly; visual spot-check only (non-linear). Conv2d at 10240×10240: PVA 216 ms vs CPU 499 ms (2.3× speedup for i8). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

arpitj1 added 30 commits June 6, 2024 08:41

Unfinished changes with prototype function

0b97da9

Loop over linalg.generic's input and output ops

69ef423

Some comments

7678a05

Partial changes from coding session to implement fusion of linalg.gen…

0e88095

…eric and for op

Incremental changes to fuse linalg and for loop- Logic for shifted op…

b57c0b8

…erands and map for linalg.generic

ran clang format

f54c33d

some compile time fixes

56e2c54

Some compile fixes

e253040

Fixed all the compilation issues. Sample MLIR not raised

e99b8a5

Bug fixes, generating some output at getLinalgArgMap

34f595c

Almost implementated remap in affine dim for multi idx

05bad97

Added submap op support and refactored the code to use submap

5bbf5ef

bunch of fixes. Now able to generate raise linalg code

9018d92

Now almost working second loop raising to linalg

ec041a0

Fixes to correctly raise 2 level for loops to linalg.generic

23138fc

Missed file update to enable linalg dialect in polygeist

5f20bd7

Fix for syms and dims calculation

b0e96aa

More tests added to cover different loop cases

ea76f0a

Now able to compile 3/any number of loops with parallel iter type; Ad…

591c84e

…ded extra tests in lit test

Non iter-arg variant of matrix-mul and conv are now raised to linalg.…

b0108e3

…generic

submap canonicalizer implemented

4362c80

Added reduction loops for linalg

77c8168

Fix for incorrect for loop dims

98f0119

Linalg.generic 4 loop cases raised- todo: reduction and some if-else …

59eec0b

…cases failing

Adding test case for all passing raising and lowering, example case o…

a363f13

…f debufferizing added which works for tiling and fusion

Added pass remove iter args from scf; Added psuedo code for submap ca…

814ca51

…nonicalize

Added removal of iter_args for affine loops

701f25a

Temporary reverted pass registeration as the code was failing

d285fb5

WIP commit

c40e7a9

Added submap of submap canonicalizer with test- failing

788a3c4

arpitj1 and others added 30 commits May 24, 2026 21:35

Make pipeline paths portable

6363ac5

Add fused im2col GEMM raising path

837dc46

Add tensor LLM lowering paths

c7162d1

Fold guarded stores for tensor softmax

d534c23

Add larger LLM forward benchmark

927275c

Add Llama runtime matching and Jetson benchmarks

5de644e

Add Llama extended and stencil benchmark tracking

952731c

Add 5x5 stencil matching coverage

fa802e3

Record Jetson 5x5 stencil timings

b2feced

Add generalized ntap stencil matching

6009047

Use tensor ntap stencil matching

f867eb9

Fix cgeist memref and record lowering crashes

1db4fc5

Fix cgeist scalar conditions and pointer subtraction

3d616d0

Add Whisper kernel fixtures and RMSNorm matcher family

7143325

Fix proxy app raising failures

984c9bf

Add proxy app kernel extraction probes

e671cd0

Fix proxy kernel linalg raising gaps

dd0abae

Verify proxy kernel correctness progress

e64e381

Add easy proxy pipeline fixtures

9ffc4a0

Add backend-capable kernel candidate matching

3c61b86

Add custom CUDA 7-point stencil backend

66b05d1

Add optional CPU BLAS runtime backend

ccaf958

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Raisetolinalg#412

Raisetolinalg#412
arpitj1 wants to merge 166 commits into
llvm:raisetolinalgfrom
arpitj1:raisetolinalg

arpitj1 commented Jun 6, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

arpitj1 commented Jun 6, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants