TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers#267
Open
YWHyuk wants to merge 14 commits into
Design-of-record + status + handoff for the C++ trace producer: post-vcix MLIR -> skeleton+API -> EmitC -> compiled .so that TOGSim dlopens and feeds to the existing timing Core. Async DMAs pair with explicit memory barriers by the runtime tag slot (tag_id, tag_slot) via the Core tag table; the SRAM-buffer last-writer DAG carries compute dependencies. Validated on the 256^3 GEMM: trace 2518 vs legacy 2698 on the real gem5 cycle table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
One op-walk generator and the one-line attribute builders/readers were copied across the passes. Consolidate into passes/_mlir_util.py (walk_ops; i32/i64/i64_array/str_attr; attr_int/attr_bool/attr_i64_array) and adopt it in lower_to_vcix, decompose_transfer, dma_fine_grained, lower_dma_to_gemmini, lower_vlane_idx. walk_ops needs no MLIR bindings so the module imports mlir.ir lazily; pure functions, no module-global state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
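A minimal sketch of what a bindings-free `walk_ops` could look like, assuming it is a pre-order generator over the `regions` / `blocks` / `operations` attributes that MLIR Python operations expose. The `fake_op` helper and the `SimpleNamespace` stand-ins are hypothetical and only show that the walk never needs to import `mlir.ir`; the actual helper in `passes/_mlir_util.py` may differ.

```python
from types import SimpleNamespace


def walk_ops(op):
    """Yield `op` and every nested op in pre-order (duck-typed, no MLIR import)."""
    yield op
    for region in op.regions:
        for block in region.blocks:
            for child in block.operations:
                yield from walk_ops(child)


def fake_op(name, children=()):
    """Hypothetical stand-in: an op with one region holding one block of `children`."""
    block = SimpleNamespace(operations=list(children))
    return SimpleNamespace(name=name, regions=[SimpleNamespace(blocks=[block])])


# Toy nest: a function containing a loop with a DMA pair, then a return.
module = fake_op("func", [
    fake_op("loop", [fake_op("dma_start"), fake_op("dma_wait")]),
    fake_op("return"),
])
visited = [op.name for op in walk_ops(module)]
```

Because the generator only touches attributes, the same function would work on real MLIR operations and on lightweight test doubles alike.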
The compiler half of the trace pipeline. build_skeleton (C2) reduces a post-vcix kernel to a loop skeleton + togsim.* API ops: dma_start -> togsim.dma (tag_id + runtime tag index), dma_wait -> explicit togsim.memory_barrier, compute node -> togsim.compute, then a use-based DCE strips the data math. dep_analysis derives per-op SRAM read/write buffers (the last-writer dependency DAG); cycle_table builds the tile_id->cycle sidecar; lower_to_emitc (C4) rewrites togsim.* to emitc.call_opaque and drives the upstream EmitC pipeline to C++. extension_codecache emits the .so + cycle sidecar opt-in (TORCHSIM_DUMP_TRACE_SO=1), snapshotting the gem5 cycle_list before the legacy TOG consumes it. tog_generator marked DEPRECATED. No static event_id: an async dma pairs with its barrier by the runtime tag slot, since one static op runs once per loop iteration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
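The last-writer dependency DAG described for dep_analysis can be sketched as follows. This is a simplified model, not the pass itself: the `Op` record, the buffer names, and `link_last_writers` are assumptions made for illustration, and real ops carry MLIR values rather than strings.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    reads: tuple = ()
    writes: tuple = ()
    deps: list = field(default_factory=list)


def link_last_writers(ops):
    """For each op, depend on the most recent writer of every buffer it reads."""
    last_writer = {}
    for op in ops:
        for buf in op.reads:
            producer = last_writer.get(buf)
            if producer is not None and producer not in op.deps:
                op.deps.append(producer)
        # Record this op as the new last writer only after its reads are resolved.
        for buf in op.writes:
            last_writer[buf] = op.name
    return ops


# Hypothetical program order: two barriers publish loaded buffers, a matmul
# consumes them, and a store reads the matmul result.
ops = link_last_writers([
    Op("barrier_A", writes=("sram_A",)),
    Op("barrier_B", writes=("sram_B",)),
    Op("matmul", reads=("sram_A", "sram_B"), writes=("sram_C",)),
    Op("store_C", reads=("sram_C",)),
])
deps = {op.name: op.deps for op in ops}
```

The barriers appear as writers here because the bridge treats a memory barrier as the last writer of the buffer its load filled.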
TOGSim side of the trace pipeline. togsim_runtime.{h,cc} is the producer
ABI (v11): togsim_dma (void, carries tag_id + tag_slot), togsim_compute,
togsim_memory_barrier (the explicit async-DMA sync), togsim_compute_barrier,
togsim_core_alloc. togsim_loader records a TraceRec stream; the bridge
(togsim_trace_bridge) turns it into a TileGraph: an async dma and its
memory_barrier pair by (tag_id, tag_slot) through the Core tag table
(set_tag_finish / register_tag_waiter), the barrier becomes the last-writer
of the loaded buffer, and the SRAM read/write-buffer DAG drives compute
deps with the occupancy/latency systolic-array pipeline + an explicit
compute fence before a store. main.cc gains --trace_so/--cycle_table;
Instruction/Core gain MEMORY_BAR + COMPUTE_BAR and the pipeline-child model.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
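As a rough model of the pairing through the Core tag table: a barrier registers as a waiter on a key, and the DMA's completion releases every waiter on that key. The method names mirror `set_tag_finish` / `register_tag_waiter` from the message above, but the signatures, the `TagTable` class, and the string waiters are assumptions for illustration only.

```python
class TagTable:
    """Toy tag table keyed by (tag_id, tag_slot)."""

    def __init__(self):
        self._finished = set()
        self._waiters = {}
        self.released = []  # waiters in the order they became runnable

    def register_tag_waiter(self, key, waiter):
        # If the DMA already finished, the waiter is runnable immediately.
        if key in self._finished:
            self.released.append(waiter)
        else:
            self._waiters.setdefault(key, []).append(waiter)

    def set_tag_finish(self, key):
        # Mark the DMA complete and release everything waiting on it.
        self._finished.add(key)
        self.released.extend(self._waiters.pop(key, []))


table = TagTable()
table.register_tag_waiter((7, 0), "memory_barrier#0")
table.register_tag_waiter((7, 1), "memory_barrier#1")
table.set_tag_finish((7, 1))            # slot 1 completes first
released_after_first = list(table.released)
table.set_tag_finish((7, 0))
```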
test_togsim_skeleton pins the togsim_ops vocabulary against the ABI header and exercises build_skeleton on a post-vcix fixture (event-id-free output, explicit memory_barrier). test_togsim_emitc builds the .so and checks the EmitC/symbol-table shape + that it runs against a stub runtime. The togsim_runtime test links the real runtime, runs the loader, and checks the recorded trace (resolved addresses, tag-paired barriers, looked-up cycles). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
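One way such a vocabulary pin could be written is to scan the header text for exported names and compare them with the Python-side set. Everything below is a hypothetical sketch: the header string is a placeholder (parameter lists elided, return types not taken from the real header apart from `togsim_dma` being void), and `TOGSIM_OPS` and `declared_symbols` are illustrative names rather than the contents of `togsim_ops`.

```python
import re

# Placeholder header text; only the exported names matter for this check.
ABI_HEADER = """
void togsim_dma(/* ... */);
void togsim_compute(/* ... */);
void togsim_memory_barrier(/* ... */);
void togsim_compute_barrier(/* ... */);
void togsim_core_alloc(/* ... */);
"""

# Assumed Python-side vocabulary, using the names listed for ABI v11.
TOGSIM_OPS = {
    "togsim_dma",
    "togsim_compute",
    "togsim_memory_barrier",
    "togsim_compute_barrier",
    "togsim_core_alloc",
}


def declared_symbols(header_text):
    """Collect every togsim_* identifier that is followed by an opening paren."""
    return set(re.findall(r"\b(togsim_\w+)\s*\(", header_text))


missing = TOGSIM_OPS - declared_symbols(ABI_HEADER)
extra = declared_symbols(ABI_HEADER) - TOGSIM_OPS
```

A check of this shape fails in both directions, so a symbol added to only one side of the contract is caught.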
The .so's exported entry function (the renamed kernel skeleton the loader dlopens and runs) is renamed togsim_emit -> togsim_kernel. Pure rename of the single ENTRY_SYMBOL contract (producer export == loader dlsym); no signature or behavior change. Updated togsim_ops.ENTRY_SYMBOL, the runtime header/loader, lower_to_emitc, the tests' dlsym/nm checks, and the design docs. Left togsim_emitc (the C4 lowering / its test) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…ag alloc) The trace bridge's dma tag key has an empty accum component, so it pairs correctly only for a single-tile reduction (the current GEMM). Document the agreed fix for multi-tile-K and conv: hoist the tag memref alloc into the reduction-loop body (coarse, pre-fine-grained DMA) so each reduction iteration gets a fresh tag whose runtime identity is the per-iteration tag_id -- no accum-axis enumeration, works for any reduction depth. Because that alloc dominates both the load and wait nests, dma and memory_barrier pair by the SSA tag handle, with tag_idx kept as the subtile slot. Comment only; no behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…d tag key) The bridge keyed the Core tag table on the static (tag_id, tag_slot), so the DMAs of successive reduction iterations of one static op shared a key and would collide for multi-tile-K (and conv, reduction = kh*kw*C). Mint a fresh per-DMA-record tag key (uniq) instead, and pair each memory_barrier with the CURRENT load for its (tag_id, tag_slot) -- it is 1 load : N barriers (the load runs once per reduction iteration; each consumer waits the same tag), and the load/consumer nests run in order within the reduction body, so "current load" is correct (not a FIFO). Distinct uniq per load => successive iterations never collide; axis-agnostic, no coordinate enumeration. Single-tile GEMM is unchanged (2518 cycles). FIXME kept: the per-iteration tag is reconstructed here from record order, while the producer IR still carries one static func-entry tag alloc -- the faithful fix is to hoist that memref.alloc into the reduction-loop body and emit a matching per-iteration togsim.tag_alloc threaded by SSA (then uniq is unnecessary). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
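The "current load" pairing can be illustrated with a small model. This is a sketch under assumptions: the tuple record layout and `pair_barriers` are hypothetical, and the real bridge works on TraceRec entries and the Core tag table rather than Python tuples.

```python
from itertools import count


def pair_barriers(records):
    """Pair each memory_barrier with the most recent DMA on its (tag_id, tag_slot)."""
    uniq = count()   # fresh key per DMA record
    current = {}     # (tag_id, tag_slot) -> uniq of the current load
    paired = []
    for kind, tag_id, tag_slot in records:
        key = (tag_id, tag_slot)
        if kind == "dma":
            current[key] = next(uniq)
        elif kind == "memory_barrier":
            # Not a FIFO: every consumer of this iteration sees the same load.
            paired.append((key, current[key]))
    return paired


# Two reduction iterations of one static DMA: the first load has two
# consumers, the second has one. The static key is identical throughout.
trace = [
    ("dma", 3, 0), ("memory_barrier", 3, 0), ("memory_barrier", 3, 0),
    ("dma", 3, 0), ("memory_barrier", 3, 0),
]
pairs = pair_barriers(trace)
```

Both barriers of the first iteration resolve to key 0 and the barrier of the second iteration resolves to key 1, so successive iterations do not collide even though the static (tag_id, tag_slot) never changes.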
…conv) A tag memref was allocated once at the func entry and reused by every reduction iteration of a static DMA, so the per-iteration tag identity was only an artifact of the timing path's bridge. Make it real in the IR: when fine-grained splits a matmul load, allocate a fresh tag memref.alloc just before the coarse dma_start and replace_all_uses_with the old tag -- this rewires both the re-emitted dma_start AND its dma_wait, and the coarse dma sits at the reduction-loop body level so the alloc dominates the load and wait nests. Each reduction iteration thus allocates its own tag (distinct for multi-tile-K / conv, no coordinate enumeration); the now-dead func-entry alloc is erased. Sync stores keep their tag. Legacy materializes to a distinct alloc per iteration (its calc_tag accum component becomes redundant); verified the 256^3 GEMM still passes and the trace path is unchanged at 2518 cycles. The bridge FIXME is updated: build_skeleton still collapses the in-loop alloc to one static tag_id, so the bridge's per-record uniq is still what distinguishes iterations until that identity is threaded as an SSA tag handle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
… slot build_skeleton carried the dma_wait tag index verbatim onto togsim.memory_barrier. lower_to_vcix builds that index with a -acc_iv term for each accumulation (reduction) loop var -- a sentinel marking the reduction axis, not an arithmetic offset (legacy TileGraphParser skips stride -1 for the same reason). The matching async load index (dma_fine_grained) is subtile-only, so at reduction iteration > 0 the producer evaluated -acc_iv to a negative slot, the recorded barrier tag_slot diverged from the load slot, and TOGSim aborted with "Key does not exist in subgraph's tag table" on subtile + multi-tile-K. _strip_accum_terms now drops the negative-coefficient dim terms from the wait's affine.apply (composing with a selector that zeros those dims), so the barrier slot is subtile-only and pairs with its load. Reduction iterations are still told apart by the per-iteration tag alloc and the fresh per-record Core key in the bridge, not by the slot. Single-tile kernels (no reduction term) fall through unchanged. Verified: 256x512x256 forced to 128x128 subtiles (2 K-tiles) now runs to 5774 cycles instead of crashing; single-tile 256^3 unchanged. Adds a self-contained regression for _strip_accum_terms. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
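The effect of dropping the negative-coefficient terms can be shown on a toy affine index. This sketch assumes a dict-of-coefficients representation and hypothetical dim names (`subtile`, `acc_iv`); the real `_strip_accum_terms` operates on the `affine.apply` map, so this only models the arithmetic.

```python
def strip_accum_terms(coeffs):
    """Drop dims whose coefficient is negative (the reduction-axis sentinel)."""
    return {dim: c for dim, c in coeffs.items() if c >= 0}


def evaluate(coeffs, values, const=0):
    """Evaluate a linear index: const + sum(coeff * dim_value)."""
    return const + sum(c * values[dim] for dim, c in coeffs.items())


# Assumed shapes of the two indices: the load is subtile-only, while the wait
# carries a -1 coefficient on the accumulation loop variable.
load_index = {"subtile": 1}
wait_index = {"subtile": 1, "acc_iv": -1}

# Reduction iteration 1, subtile 0.
ivs = {"subtile": 0, "acc_iv": 1}

load_slot = evaluate(load_index, ivs)
raw_wait_slot = evaluate(wait_index, ivs)                       # diverges from the load
fixed_wait_slot = evaluate(strip_accum_terms(wait_index), ivs)  # matches the load
```

With the sentinel left in, the barrier asks for slot -1 while the load wrote slot 0, which is the mismatch behind the tag-table abort; after stripping, both resolve to slot 0.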
Document that the trace tag_slot is subtile-only: build_skeleton strips the lower_to_vcix -acc_iv accumulation marker from the dma_wait index so a memory_barrier pairs with the slot its load wrote, mirroring legacy TileGraphParser's skip of stride -1. Record that subtile + multi-tile-K (256x512x256, 128x128 subtiles, 2 K-tiles) now runs at 5774 cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
The -1 coefficients lower_to_vcix puts on accumulation loop vars in the A/B dma_wait tag indices are a reduction-axis sentinel honored only by the legacy TOG path (TileGraphParser); the trace path strips them in build_skeleton._strip_accum_terms. Document this at both emission sites and note they are kept for byte-identity with the C++ -test-pytorchsim-to-vcix pass and should be removed (not flagged) once legacy retires. Comments only; output is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…_END Replace the bare togsim_core_alloc marker with a higher-order togsim_dispatch(ctx, tile_fn, iv, n_iv) wrapper. The runtime round-robins a core from the pool, brackets the work-item with TILE_BEGIN/TILE_END trace records, and invokes the producer's outlined tile function. The work-item scope is now exactly the function call, not an implicit "ops until the next core_alloc" range, and one general (kernel-independent) dispatcher serves every kernel via a uniform iv-array tile signature (togsim_tile_fn). Core alloc and the begin/end boundary are runtime-owned; the producer stays core-count transparent. TraceRec gains TILE_BEGIN/TILE_END (drops DISPATCH); the bridge opens a subgraph on TILE_BEGIN (bound to the record's core) and flushes it on TILE_END, and the reference timer treats both as zero-cost boundaries. Verified on the subtile 256x512x256 case: 5774 cycles, identical to the pre-outline core_alloc form. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
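The dispatch wrapper's behavior can be modeled in a few lines. This is a Python stand-in for the C ABI call `togsim_dispatch(ctx, tile_fn, iv, n_iv)`: the `Runtime` class, the tuple trace records, and `kernel_tile` are assumptions for illustration, and `n_iv` is dropped because a Python sequence carries its own length.

```python
class Runtime:
    """Toy runtime that owns core allocation and the work-item boundary."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self._next_core = 0
        self.trace = []

    def dispatch(self, tile_fn, iv):
        # Round-robin a core from the pool.
        core = self._next_core
        self._next_core = (self._next_core + 1) % self.num_cores
        # The work-item scope is exactly the tile function call.
        self.trace.append(("TILE_BEGIN", core))
        tile_fn(self, list(iv))
        self.trace.append(("TILE_END", core))


def kernel_tile(ctx, iv):
    """Hypothetical outlined tile body: records one compute per (m, n) tile."""
    m, n = iv
    ctx.trace.append(("compute", m, n))


rt = Runtime(num_cores=2)
for m in range(2):
    for n in range(2):
        rt.dispatch(kernel_tile, (m, n))

begin_cores = [rec[1] for rec in rt.trace if rec[0] == "TILE_BEGIN"]
```

The tile function never sees a core id, which reflects the point that the producer stays core-count transparent while the runtime owns the boundary records.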
…spatch lower_to_emitc now outlines the innermost parallel-loop body into a uniform togsim_kernel_tile(ctx, iv, n) func and replaces it with a togsim_dispatch(ctx, togsim_kernel_tile, iv, n) call, instead of inserting a bare togsim_core_alloc marker inline. The dispatcher loop marshals the parallel induction vars (m, n) into an int64 array and passes the tile fn as a verbatim function pointer (#emitc.opaque), so the work-item scope is the tile function body and the runtime wrapper owns the core-alloc + TILE_BEGIN/TILE_END boundary. The outline runs after the togsim.* ops become emitc.call_opaque: it moves the body ops into the tile fn, recovers each parallel index as index_cast(iv[k]) inside it, and remaps the captured ctx / induction vars / constants (Value == is identity; external constants are cloned). Only ctx, the parallel IVs, and constants may be captured (dynamic-shape captures raise -> P4). mlir-to-cpp renders a static togsim_kernel_tile defined before the extern "C" togsim_kernel dispatcher. togsim_ops gains DISPATCH_CALLEE / TILE_SYMBOL (drops CORE_ALLOC_CALLEE). Tests: the emitc/runtime harnesses define togsim_dispatch (calling the tile fn) and the skeleton/emitc contract checks use DISPATCH_CALLEE; the outlined .so builds, dlopens, and runs. Docs updated (outline DONE, ABI v12). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
What
Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.
Pipeline
Dependency model (no in-order, no runtime tag-hash, no op heuristics)
Dependencies are derived from two sources available pre-collapse:
SA_WEIGHTS buffer that folds preload->matmul.
MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; an async load -> compute waits the data's resp-complete.
COMPUTE_BAR (new): the compute fence; a store waits all systolic-array pipelines to drain.
Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.
Status
--trace_so.
build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).
Testing
tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
--trace_so GEMM through TOGSim.
Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).
🤖 Generated with Claude Code