TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers by YWHyuk · Pull Request #267 · PSAL-POSTECH/PyTorchSim

YWHyuk · 2026-06-19T04:22:35Z

What

Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.

Pipeline

post-vcix .mlir
  | build_skeleton.py        loops + memref.dma_start/wait -> togsim.* ; DCE the rest
  | dep_analysis.py          per-op read/write SRAM buffers (SSA) + vcix preload/matmul pairing
  | lower_to_emitc.py        togsim.* -> emitc.call_opaque ; drive upstream convert-*-to-emitc
  v
EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
  | run_producer (dlopen)    EmitCtx callbacks record a TraceRec stream
  | togsim_trace_bridge.cc   TraceRec -> TileGraph (explicit dependency DAG)
  v
existing Simulator / Core    cycles, DRAM traffic

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Dependencies are derived from two sources available pre-collapse:

- SRAM last-writer per buffer (load -> compute, the Y_spad accumulator chain), recovered via SSA plus a virtual SA_WEIGHTS buffer that folds preload -> matmul.
- The systolic array modeled as a pipeline (occupancy/latency split) with two explicit, distinctly named barriers:
  - MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; a compute op that consumes an async load waits until the data's response completes.
  - COMPUTE_BAR (new): the compute fence; a store waits for all systolic-array pipelines to drain.

Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.
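The last-writer rule can be illustrated with a minimal sketch. It is not the bridge's code: the `Op` record, the buffer names other than Y_spad and SA_WEIGHTS, and the choice to model each barrier as reading and rewriting the buffer it guards are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical trace record: a name plus the SRAM buffers it touches."""
    name: str
    reads: tuple = ()
    writes: tuple = ()
    deps: list = field(default_factory=list)


def link_last_writers(ops):
    """Give each op an edge to the most recent writer of every buffer it reads."""
    last_writer = {}  # buffer name -> index of the op that last wrote it
    for idx, op in enumerate(ops):
        for buf in op.reads:
            writer = last_writer.get(buf)
            if writer is not None and writer not in op.deps:
                op.deps.append(writer)
        for buf in op.writes:
            last_writer[buf] = idx
    return ops


# Assumption: each barrier is modeled as the new last writer of its buffer,
# so consumers depend on the barrier rather than on the raw DMA or matmul.
trace = [
    Op("dma_load_W", writes=("W_spad",)),
    Op("memory_barrier", reads=("W_spad",), writes=("W_spad",)),
    Op("preload", reads=("W_spad",), writes=("SA_WEIGHTS",)),
    Op("matmul", reads=("SA_WEIGHTS", "Y_spad"), writes=("Y_spad",)),
    Op("compute_barrier", reads=("Y_spad",), writes=("Y_spad",)),
    Op("dma_store_Y", reads=("Y_spad",)),
]
link_last_writers(trace)
for i, op in enumerate(trace):
    print(i, op.name, op.deps)
```

Each op ends up depending only on its immediate producer, which gives a chain load -> memory fence -> preload -> matmul -> compute fence -> store without any in-order assumption.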

Status

256^3 GEMM runs end-to-end through the real Simulator via --trace_so.
Cycle comparison vs the legacy build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
Known open items (documented in docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).

Testing

tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
Manual --trace_so GEMM through TOGSim.
Legacy path untouched (comment-only DEPRECATED markers).

Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).

🤖 Generated with Claude Code

Design-of-record + status + handoff for the C++ trace producer: post-vcix MLIR -> skeleton+API -> EmitC -> compiled .so that TOGSim dlopens and feeds to the existing timing Core. Async DMAs pair with explicit memory barriers by the runtime tag slot (tag_id, tag_slot) via the Core tag table; the SRAM-buffer last-writer DAG carries compute dependencies. Validated on the 256^3 GEMM: trace 2518 vs legacy 2698 on the real gem5 cycle table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

One op-walk generator and the one-line attribute builders/readers were copied across the passes. Consolidate into passes/_mlir_util.py (walk_ops; i32/i64/i64_array/str_attr; attr_int/attr_bool/attr_i64_array) and adopt it in lower_to_vcix, decompose_transfer, dma_fine_grained, lower_dma_to_gemmini, lower_vlane_idx. walk_ops needs no MLIR bindings so the module imports mlir.ir lazily; pure functions, no module-global state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
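A binding-free op walk might look like the sketch below. It is a guess at the shape of such a helper, not the contents of passes/_mlir_util.py: it assumes ops expose `regions`, regions expose `blocks`, and blocks expose `operations`, and it uses hypothetical `SimpleNamespace` stand-ins (built by `make_op`) instead of real MLIR objects.

```python
from types import SimpleNamespace


def walk_ops(op):
    """Yield `op` and every op nested in its regions, in pre-order.

    Duck-typed on purpose: nothing here imports MLIR bindings.
    """
    yield op
    for region in getattr(op, "regions", ()):
        for block in getattr(region, "blocks", ()):
            for child in getattr(block, "operations", ()):
                yield from walk_ops(child)


def make_op(name, *children):
    """Hypothetical stand-in for an MLIR operation with one region/block."""
    block = SimpleNamespace(operations=list(children))
    region = SimpleNamespace(blocks=[block])
    return SimpleNamespace(name=name, regions=[region] if children else [])


module = make_op(
    "builtin.module",
    make_op(
        "func.func",
        make_op("scf.for", make_op("memref.dma_start"), make_op("memref.dma_wait")),
        make_op("func.return"),
    ),
)
names = [op.name for op in walk_ops(module)]
print(names)
```

Because the generator only relies on attribute access, a pass can filter its output by op name without the helper module importing anything at load time.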

The compiler half of the trace pipeline. build_skeleton (C2) reduces a post-vcix kernel to a loop skeleton + togsim.* API ops: dma_start -> togsim.dma (tag_id + runtime tag index), dma_wait -> explicit togsim.memory_barrier, compute node -> togsim.compute, then a use-based DCE strips the data math. dep_analysis derives per-op SRAM read/write buffers (the last-writer dependency DAG); cycle_table builds the tile_id->cycle sidecar; lower_to_emitc (C4) rewrites togsim.* to emitc.call_opaque and drives the upstream EmitC pipeline to C++. extension_codecache emits the .so + cycle sidecar opt-in (TORCHSIM_DUMP_TRACE_SO=1), snapshotting the gem5 cycle_list before the legacy TOG consumes it. tog_generator marked DEPRECATED. No static event_id: an async dma pairs with its barrier by the runtime tag slot, since one static op runs once per loop iteration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

TOGSim side of the trace pipeline. togsim_runtime.{h,cc} is the producer ABI (v11): togsim_dma (void, carries tag_id + tag_slot), togsim_compute, togsim_memory_barrier (the explicit async-DMA sync), togsim_compute_barrier, togsim_core_alloc. togsim_loader records a TraceRec stream; the bridge (togsim_trace_bridge) turns it into a TileGraph: an async dma and its memory_barrier pair by (tag_id, tag_slot) through the Core tag table (set_tag_finish / register_tag_waiter), the barrier becomes the last-writer of the loaded buffer, and the SRAM read/write-buffer DAG drives compute deps with the occupancy/latency systolic-array pipeline + an explicit compute fence before a store. main.cc gains --trace_so/--cycle_table; Instruction/Core gain MEMORY_BAR + COMPUTE_BAR and the pipeline-child model. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
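The occupancy/latency split and the compute fence can be sketched as a toy timing model. The function names, the round-robin placement of matmuls on systolic arrays, and the cycle numbers are assumptions for illustration; they are not taken from the Core implementation.

```python
def schedule_matmuls(ready_times, num_sa, occupancy, latency):
    """Return the completion cycle of each matmul.

    occupancy: cycles a systolic array is busy before it accepts the next op.
    latency:   cycles from issue until the result is available.
    """
    free_at = [0] * num_sa  # cycle at which each array can accept a new op
    done_at = []
    for i, ready in enumerate(ready_times):
        sa = i % num_sa  # assumption: simple round-robin placement
        start = max(ready, free_at[sa])
        free_at[sa] = start + occupancy
        done_at.append(start + latency)
    return done_at


def compute_barrier(done_at):
    """A store behind the compute fence waits for every pipeline to drain."""
    return max(done_at, default=0)


# Four matmuls, all ready at cycle 0, on 2 arrays (hypothetical numbers).
done = schedule_matmuls([0, 0, 0, 0], num_sa=2, occupancy=4, latency=10)
store_can_start = compute_barrier(done)
print(done, store_can_start)
```

With occupancy shorter than latency, the second matmul on each array issues while the first is still in flight, which is the overlap the pipeline model is meant to capture.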

test_togsim_skeleton pins the togsim_ops vocabulary against the ABI header and exercises build_skeleton on a post-vcix fixture (event-id-free output, explicit memory_barrier). test_togsim_emitc builds the .so and checks the EmitC/symbol-table shape + that it runs against a stub runtime. The togsim_runtime test links the real runtime, runs the loader, and checks the recorded trace (resolved addresses, tag-paired barriers, looked-up cycles). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

Rename the .so's exported entry function (the kernel skeleton the loader resolves via dlsym and runs) from togsim_emit to togsim_kernel. This is a pure rename of the single ENTRY_SYMBOL contract (producer export == loader dlsym), with no signature or behavior change. Updated togsim_ops.ENTRY_SYMBOL, the runtime header/loader, lower_to_emitc, the tests' dlsym/nm checks, and the design docs. Left togsim_emitc (the C4 lowering and its test) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

…ag alloc) The trace bridge's dma tag key has an empty accum component, so it pairs correctly only for a single-tile reduction (the current GEMM). Document the agreed fix for multi-tile-K and conv: hoist the tag memref alloc into the reduction-loop body (coarse, pre-fine-grained DMA) so each reduction iteration gets a fresh tag whose runtime identity is the per-iteration tag_id -- no accum-axis enumeration, works for any reduction depth. Because that alloc dominates both the load and wait nests, dma and memory_barrier pair by the SSA tag handle, with tag_idx kept as the subtile slot. Comment only; no behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

…d tag key) The bridge keyed the Core tag table on the static (tag_id, tag_slot), so the DMAs of successive reduction iterations of one static op shared a key and would collide for multi-tile-K (and conv, reduction = kh*kw*C). Mint a fresh per-DMA-record tag key (uniq) instead, and pair each memory_barrier with the CURRENT load for its (tag_id, tag_slot) -- it is 1 load : N barriers (the load runs once per reduction iteration; each consumer waits on the same tag), and the load/consumer nests run in order within the reduction body, so "current load" is correct (not a FIFO). Distinct uniq per load => successive iterations never collide; axis-agnostic, no coordinate enumeration. Single-tile GEMM is unchanged (2518 cycles). FIXME kept: the per-iteration tag is reconstructed here from record order, while the producer IR still carries one static func-entry tag alloc -- the faithful fix is to hoist that memref.alloc into the reduction-loop body and emit a matching per-iteration togsim.tag_alloc threaded by SSA (then uniq is unnecessary). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
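The "current load" pairing described in this commit can be sketched as follows. The record layout `(kind, tag_id, tag_slot)` and the function name `pair_barriers` are hypothetical; only the rule itself (a fresh key per DMA record, each barrier bound to the most recent load for its static key) comes from the commit message.

```python
from itertools import count


def pair_barriers(records):
    """Pair each memory_barrier with the current load for its static key.

    records: list of (kind, tag_id, tag_slot) with kind "dma" or
    "memory_barrier". Returns (barrier_record_index, uniq_key) pairs.
    """
    next_uniq = count()
    current = {}  # (tag_id, tag_slot) -> uniq key of the most recent load
    pairs = []
    for idx, (kind, tag_id, tag_slot) in enumerate(records):
        key = (tag_id, tag_slot)
        if kind == "dma":
            current[key] = next(next_uniq)  # fresh key per DMA record
        elif kind == "memory_barrier":
            if key not in current:
                raise KeyError(f"barrier at record {idx} has no load for {key}")
            pairs.append((idx, current[key]))
    return pairs


# Two reduction iterations of one static op: same (tag_id, tag_slot) each time.
records = [
    ("dma", 0, 0),
    ("memory_barrier", 0, 0),
    ("memory_barrier", 0, 0),  # 1 load : N barriers
    ("dma", 0, 0),             # next reduction iteration
    ("memory_barrier", 0, 0),
]
pairs = pair_barriers(records)
print(pairs)
```

Overwriting `current[key]` on every load is what makes this a "current load" lookup rather than a FIFO: barriers never consume the entry, so any number of them can wait on the same load.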

…conv) A tag memref was allocated once at the func entry and reused by every reduction iteration of a static DMA, so the per-iteration tag identity was only an artifact of the timing path's bridge. Make it real in the IR: when fine-grained splits a matmul load, allocate a fresh tag memref.alloc just before the coarse dma_start and replace_all_uses_with the old tag -- this rewires both the re-emitted dma_start AND its dma_wait, and the coarse dma sits at the reduction-loop body level so the alloc dominates the load and wait nests. Each reduction iteration thus allocates its own tag (distinct for multi-tile-K / conv, no coordinate enumeration); the now-dead func-entry alloc is erased. Sync stores keep their tag. Legacy materializes to a distinct alloc per iteration (its calc_tag accum component becomes redundant); verified the 256^3 GEMM still passes and the trace path is unchanged at 2518 cycles. The bridge FIXME is updated: build_skeleton still collapses the in-loop alloc to one static tag_id, so the bridge's per-record uniq is still what distinguishes iterations until that identity is threaded as an SSA tag handle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

… slot build_skeleton carried the dma_wait tag index verbatim onto togsim.memory_barrier. lower_to_vcix builds that index with a -acc_iv term for each accumulation (reduction) loop var -- a sentinel marking the reduction axis, not an arithmetic offset (legacy TileGraphParser skips stride -1 for the same reason). The matching async load index (dma_fine_grained) is subtile-only, so at reduction iteration > 0 the producer evaluated -acc_iv to a negative slot, the recorded barrier tag_slot diverged from the load slot, and TOGSim aborted with "Key does not exist in subgraph's tag table" on subtile + multi-tile-K. _strip_accum_terms now drops the negative-coefficient dim terms from the wait's affine.apply (composing with a selector that zeros those dims), so the barrier slot is subtile-only and pairs with its load. Reduction iterations are still told apart by the per-iteration tag alloc and the fresh per-record Core key in the bridge, not by the slot. Single-tile kernels (no reduction term) fall through unchanged. Verified: 256x512x256 forced to 128x128 subtiles (2 K-tiles) now runs to 5774 cycles instead of crashing; single-tile 256^3 unchanged. Adds a self-contained regression for _strip_accum_terms. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
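The slot mismatch and its fix can be shown on a toy affine index. This sketch represents an index as a dict of per-dimension coefficients, which is a simplification and not how build_skeleton manipulates affine.apply; the dimension names and values are hypothetical.

```python
def strip_accum_terms(coeffs):
    """Drop dims with a negative coefficient (the reduction-axis sentinel)."""
    return {dim: c for dim, c in coeffs.items() if c >= 0}


def evaluate(coeffs, ivs):
    """Evaluate sum(coeff * induction_var) for a linear index."""
    return sum(c * ivs[dim] for dim, c in coeffs.items())


# Hypothetical indices: the load slot is subtile-only, while the wait index
# also carries a -acc_iv marker for the accumulation loop.
load_index = {"subtile_m": 2, "subtile_n": 1}
wait_index = {"subtile_m": 2, "subtile_n": 1, "acc_iv": -1}

# Second reduction iteration of the first subtile.
ivs = {"subtile_m": 0, "subtile_n": 0, "acc_iv": 1}

load_slot = evaluate(load_index, ivs)
raw_wait_slot = evaluate(wait_index, ivs)
fixed_wait_slot = evaluate(strip_accum_terms(wait_index), ivs)
print(load_slot, raw_wait_slot, fixed_wait_slot)
```

Evaluating the sentinel arithmetically gives a negative slot at reduction iteration 1, so the barrier looks up a key no load ever wrote; after stripping, the wait and load slots agree.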

Document that the trace tag_slot is subtile-only: build_skeleton strips the lower_to_vcix -acc_iv accumulation marker from the dma_wait index so a memory_barrier pairs with the slot its load wrote, mirroring legacy TileGraphParser's skip of stride -1. Record that subtile + multi-tile-K (256x512x256, 128x128 subtiles, 2 K-tiles) now runs at 5774 cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

The -1 coefficients lower_to_vcix puts on accumulation loop vars in the A/B dma_wait tag indices are a reduction-axis sentinel honored only by the legacy TOG path (TileGraphParser); the trace path strips them in build_skeleton._strip_accum_terms. Document this at both emission sites and note they are kept for byte-identity with the C++ -test-pytorchsim-to-vcix pass and should be removed (not flagged) once legacy retires. Comments only; output is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

…_END Replace the bare togsim_core_alloc marker with a higher-order togsim_dispatch(ctx, tile_fn, iv, n_iv) wrapper. The runtime round-robins a core from the pool, brackets the work-item with TILE_BEGIN/TILE_END trace records, and invokes the producer's outlined tile function. The work-item scope is now exactly the function call, not an implicit "ops until the next core_alloc" range, and one general (kernel-independent) dispatcher serves every kernel via a uniform iv-array tile signature (togsim_tile_fn). Core alloc and the begin/end boundary are runtime-owned; the producer stays core-count transparent. TraceRec gains TILE_BEGIN/TILE_END (drops DISPATCH); the bridge opens a subgraph on TILE_BEGIN (bound to the record's core) and flushes it on TILE_END, and the reference timer treats both as zero-cost boundaries. Verified on the subtile 256x512x256 case: 5774 cycles, identical to the pre-outline core_alloc form. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
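The dispatch wrapper's contract can be modeled in a few lines. This is a Python stand-in for the C ABI call togsim_dispatch(ctx, tile_fn, iv, n_iv): the `Runtime` class, the tuple-shaped trace records, and the `kernel_tile` body are assumptions, and the explicit n_iv argument is implicit in the Python list length.

```python
class Runtime:
    """Toy runtime that owns core allocation and the tile boundaries."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.next_core = 0
        self.trace = []

    def dispatch(self, tile_fn, iv):
        # Round-robin a core from the pool.
        core = self.next_core
        self.next_core = (self.next_core + 1) % self.num_cores
        # The work-item scope is exactly the tile function call.
        self.trace.append(("TILE_BEGIN", core))
        tile_fn(self, iv)
        self.trace.append(("TILE_END", core))


def kernel_tile(rt, iv):
    """Hypothetical outlined tile body with a uniform iv-array signature."""
    rt.trace.append(("COMPUTE", tuple(iv)))


rt = Runtime(num_cores=2)
for m in range(2):
    for n in range(2):
        rt.dispatch(kernel_tile, [m, n])

begin_cores = [core for kind, core in rt.trace if kind == "TILE_BEGIN"]
print(begin_cores)
```

The producer loop never names a core, which is the sense in which it stays core-count transparent: changing `num_cores` alters only the records the runtime emits.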

…spatch lower_to_emitc now outlines the innermost parallel-loop body into a uniform togsim_kernel_tile(ctx, iv, n) func and replaces it with a togsim_dispatch(ctx, togsim_kernel_tile, iv, n) call, instead of inserting a bare togsim_core_alloc marker inline. The dispatcher loop marshals the parallel induction vars (m, n) into an int64 array and passes the tile fn as a verbatim function pointer (#emitc.opaque), so the work-item scope is the tile function body and the runtime wrapper owns the core-alloc + TILE_BEGIN/TILE_END boundary. The outline runs after the togsim.* ops become emitc.call_opaque: it moves the body ops into the tile fn, recovers each parallel index as index_cast(iv[k]) inside it, and remaps the captured ctx / induction vars / constants (Value == is identity; external constants are cloned). Only ctx, the parallel IVs, and constants may be captured (dynamic-shape captures raise -> P4). mlir-to-cpp renders a static togsim_kernel_tile defined before the extern "C" togsim_kernel dispatcher. togsim_ops gains DISPATCH_CALLEE / TILE_SYMBOL (drops CORE_ALLOC_CALLEE). Tests: the emitc/runtime harnesses define togsim_dispatch (calling the tile fn) and the skeleton/emitc contract checks use DISPATCH_CALLEE; the outlined .so builds, dlopens, and runs. Docs updated (outline DONE, ABI v12). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

YWHyuk force-pushed the feature/togsim-cpp-trace branch from 9ab0a40 to a6589b8 on June 19, 2026 04:27

YWHyuk and others added 5 commits June 19, 2026 17:12

YWHyuk force-pushed the feature/togsim-cpp-trace branch from cc507fd to f5e8e55 on June 19, 2026 08:12

YWHyuk and others added 9 commits June 19, 2026 20:26

YWHyuk wants to merge 14 commits into develop from feature/togsim-cpp-trace
