
TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers #267

Open
YWHyuk wants to merge 14 commits into develop from feature/togsim-cpp-trace


@YWHyuk YWHyuk commented Jun 19, 2026

Collaborator

What

Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.

Pipeline

post-vcix .mlir
  | build_skeleton.py        loops + memref.dma_start/wait -> togsim.* ; DCE the rest
  | dep_analysis.py          per-op read/write SRAM buffers (SSA) + vcix preload/matmul pairing
  | lower_to_emitc.py        togsim.* -> emitc.call_opaque ; drive upstream convert-*-to-emitc
  v
EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
  | run_producer (dlopen)    EmitCtx callbacks record a TraceRec stream
  | togsim_trace_bridge.cc   TraceRec -> TileGraph (explicit dependency DAG)
  v
existing Simulator / Core    cycles, DRAM traffic

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Dependencies are derived from two sources available pre-collapse:

  • SRAM last-writer per buffer (load->compute, the Y_spad accumulator chain), recovered via SSA + a virtual SA_WEIGHTS buffer that folds preload->matmul.
  • The systolic array modeled as a pipeline (occupancy/latency split) with two explicit, distinctly-named barriers:
    • MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; compute that consumes an async load waits until the load's response has completed.
    • COMPUTE_BAR (new): the compute fence; a store waits for all systolic-array pipelines to drain.

Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.
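
As a rough illustration of the last-writer rule above, the sketch below derives dependencies from per-op read/write buffer sets. It is a toy model, not the code in dep_analysis.py: the Op class, the op names, and the W_spad/X_spad buffer names are assumptions for illustration (only Y_spad and SA_WEIGHTS appear in this PR), and it tracks read-after-write edges only.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical trace op: a name plus the SRAM buffers it reads and writes."""
    name: str
    reads: tuple = ()
    writes: tuple = ()


def last_writer_deps(ops):
    """Map each op index to the indices of the last writers of the buffers it reads."""
    last_writer = {}  # buffer name -> index of the most recent op that wrote it
    deps = {}
    for i, op in enumerate(ops):
        deps[i] = sorted({last_writer[b] for b in op.reads if b in last_writer})
        for b in op.writes:
            last_writer[b] = i
    return deps


ops = [
    Op("load_W", writes=("W_spad",)),
    Op("load_X", writes=("X_spad",)),
    # The virtual SA_WEIGHTS buffer folds the preload -> matmul pairing.
    Op("preload", reads=("W_spad",), writes=("SA_WEIGHTS",)),
    # Reading and writing Y_spad forms the accumulator chain between matmuls.
    Op("matmul_0", reads=("X_spad", "SA_WEIGHTS", "Y_spad"), writes=("Y_spad",)),
    Op("matmul_1", reads=("X_spad", "SA_WEIGHTS", "Y_spad"), writes=("Y_spad",)),
]
deps = last_writer_deps(ops)
for i, op in enumerate(ops):
    print(op.name, "<-", [ops[j].name for j in deps[i]])
```

In this model the second matmul depends on the first only through Y_spad, which is the accumulator chain the bullet refers to.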

Status

  • 256^3 GEMM runs end-to-end through the real Simulator via --trace_so.
  • Cycle comparison vs the legacy build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
  • Known open items (documented in docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).

Testing

  • tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
  • Manual --trace_so GEMM through TOGSim.
  • Legacy path untouched (comment-only DEPRECATED markers).

Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).

🤖 Generated with Claude Code

YWHyuk force-pushed the feature/togsim-cpp-trace branch from 9ab0a40 to a6589b8 (June 19, 2026 04:27)
YWHyuk and others added 5 commits June 19, 2026 17:12
Design-of-record + status + handoff for the C++ trace producer: post-vcix
MLIR -> skeleton+API -> EmitC -> compiled .so that TOGSim dlopens and feeds
to the existing timing Core. Async DMAs pair with explicit memory barriers
by the runtime tag slot (tag_id, tag_slot) via the Core tag table; the
SRAM-buffer last-writer DAG carries compute dependencies. Validated on the
256^3 GEMM: trace 2518 cycles vs legacy 2698 cycles on the real gem5 cycle table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
One op-walk generator and the one-line attribute builders/readers were
copied across the passes. Consolidate into passes/_mlir_util.py
(walk_ops; i32/i64/i64_array/str_attr; attr_int/attr_bool/attr_i64_array)
and adopt it in lower_to_vcix, decompose_transfer, dma_fine_grained,
lower_dma_to_gemmini, lower_vlane_idx. walk_ops needs no MLIR bindings so
the module imports mlir.ir lazily; pure functions, no module-global state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
The compiler half of the trace pipeline. build_skeleton (C2) reduces a
post-vcix kernel to a loop skeleton + togsim.* API ops: dma_start ->
togsim.dma (tag_id + runtime tag index), dma_wait -> explicit
togsim.memory_barrier, compute node -> togsim.compute, then a use-based DCE
strips the data math. dep_analysis derives per-op SRAM read/write buffers
(the last-writer dependency DAG); cycle_table builds the tile_id->cycle
sidecar; lower_to_emitc (C4) rewrites togsim.* to emitc.call_opaque and
drives the upstream EmitC pipeline to C++. extension_codecache emits the
.so + cycle sidecar opt-in (TORCHSIM_DUMP_TRACE_SO=1), snapshotting the
gem5 cycle_list before the legacy TOG consumes it. tog_generator marked
DEPRECATED. No static event_id: an async dma pairs with its barrier by the
runtime tag slot, since one static op runs once per loop iteration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
TOGSim side of the trace pipeline. togsim_runtime.{h,cc} is the producer
ABI (v11): togsim_dma (void, carries tag_id + tag_slot), togsim_compute,
togsim_memory_barrier (the explicit async-DMA sync), togsim_compute_barrier,
togsim_core_alloc. togsim_loader records a TraceRec stream; the bridge
(togsim_trace_bridge) turns it into a TileGraph: an async dma and its
memory_barrier pair by (tag_id, tag_slot) through the Core tag table
(set_tag_finish / register_tag_waiter), the barrier becomes the last-writer
of the loaded buffer, and the SRAM read/write-buffer DAG drives compute
deps with the occupancy/latency systolic-array pipeline + an explicit
compute fence before a store. main.cc gains --trace_so/--cycle_table;
Instruction/Core gain MEMORY_BAR + COMPUTE_BAR and the pipeline-child model.
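
A minimal sketch of the occupancy/latency split and the compute fence, under stated assumptions: the function names and the cycle numbers are hypothetical, data dependencies are ignored, and the real Core scheduling may differ. A systolic array accepts a new matmul once its occupancy window ends, while each result only becomes visible after the longer latency; a store behind a compute fence waits for the slowest in-flight result.

```python
def schedule_matmuls(num_matmuls, num_sas, occupancy, latency):
    """Issue matmuls to the earliest-free systolic array; return their finish cycles."""
    sa_free = [0] * num_sas  # cycle at which each SA can accept its next matmul
    finish = []
    for _ in range(num_matmuls):
        sa = min(range(num_sas), key=lambda k: sa_free[k])
        start = sa_free[sa]
        sa_free[sa] = start + occupancy  # SA is busy only for the occupancy window
        finish.append(start + latency)   # result drains after the full latency
    return finish


def compute_barrier(finish_times):
    """A store behind the fence may issue once every pipeline has drained."""
    return max(finish_times, default=0)


# Illustrative numbers only: 4 matmuls on 2 SAs, occupancy 4, latency 10.
finish = schedule_matmuls(num_matmuls=4, num_sas=2, occupancy=4, latency=10)
store_ready = compute_barrier(finish)
print(finish, store_ready)
```

Because occupancy is shorter than latency, back-to-back matmuls on one array overlap, which is why the fence has to wait for the latest finish time instead of the last issue.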

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
test_togsim_skeleton pins the togsim_ops vocabulary against the ABI header
and exercises build_skeleton on a post-vcix fixture (event-id-free output,
explicit memory_barrier). test_togsim_emitc builds the .so and checks the
EmitC/symbol-table shape + that it runs against a stub runtime. The
togsim_runtime test links the real runtime, runs the loader, and checks the
recorded trace (resolved addresses, tag-paired barriers, looked-up cycles).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
YWHyuk force-pushed the feature/togsim-cpp-trace branch from cc507fd to f5e8e55 (June 19, 2026 08:12)
YWHyuk and others added 9 commits June 19, 2026 20:26
The .so's exported entry function (the kernel skeleton that the loader
dlopens and runs) is renamed from togsim_emit to togsim_kernel. Pure rename of
the single ENTRY_SYMBOL contract (producer export == loader dlsym); no
signature or behavior change. Updated togsim_ops.ENTRY_SYMBOL, the runtime
header/loader, lower_to_emitc, the tests' dlsym/nm checks, and the design
docs. Left togsim_emitc (the C4 lowering / its test) untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…ag alloc)

The trace bridge's dma tag key has an empty accum component, so it pairs
correctly only for a single-tile reduction (the current GEMM). Document the
agreed fix for multi-tile-K and conv: hoist the tag memref alloc into the
reduction-loop body (coarse, pre-fine-grained DMA) so each reduction
iteration gets a fresh tag whose runtime identity is the per-iteration
tag_id -- no accum-axis enumeration, works for any reduction depth. Because
that alloc dominates both the load and wait nests, dma and memory_barrier
pair by the SSA tag handle, with tag_idx kept as the subtile slot. Comment
only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…d tag key)

The bridge keyed the Core tag table on the static (tag_id, tag_slot), so the
DMAs of successive reduction iterations of one static op shared a key and would
collide for multi-tile-K (and for conv, where reduction = kh*kw*C). Mint a fresh
per-DMA-record tag key (uniq) instead, and pair each memory_barrier with the
CURRENT load for its (tag_id, tag_slot). The relation is 1 load : N barriers:
the load runs once per reduction iteration and each consumer waits on the same
tag. Because the load and consumer nests run in order within the reduction body,
pairing with the current load (rather than a FIFO) is correct. A distinct uniq
per load means successive iterations never collide; the scheme is axis-agnostic
and needs no coordinate enumeration. Single-tile GEMM is unchanged (2518 cycles).

FIXME kept: the per-iteration tag is reconstructed here from record order, while
the producer IR still carries one static func-entry tag alloc -- the faithful fix
is to hoist that memref.alloc into the reduction-loop body and emit a matching
per-iteration togsim.tag_alloc threaded by SSA (then uniq is unnecessary).
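
The "current load" pairing can be sketched as follows. This is a toy model of the idea, not the bridge code: the record tuples and the pair_barriers name are assumptions for illustration.

```python
from itertools import count


def pair_barriers(records):
    """Give each dma a fresh uniq key; each memory_barrier takes the key of the
    most recent dma recorded for the same (tag_id, tag_slot)."""
    uniq = count()
    current = {}  # (tag_id, tag_slot) -> uniq key of the current load
    paired = []
    for kind, tag_id, tag_slot in records:
        key = (tag_id, tag_slot)
        if kind == "dma":
            current[key] = next(uniq)
            paired.append(("dma", current[key]))
        elif kind == "memory_barrier":
            if key not in current:
                raise KeyError(f"barrier {key} has no preceding load")
            paired.append(("memory_barrier", current[key]))
        else:
            raise ValueError(f"unknown record kind: {kind}")
    return paired


# Two reduction iterations of one static op: same (tag_id, tag_slot) each time.
records = [
    ("dma", 0, 0), ("memory_barrier", 0, 0), ("memory_barrier", 0, 0),
    ("dma", 0, 0), ("memory_barrier", 0, 0),
]
paired = pair_barriers(records)
print(paired)
```

The second iteration's barrier resolves to a different key than the first iteration's barriers, so the two loads never share a tag-table entry even though their static (tag_id, tag_slot) is identical.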

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…conv)

A tag memref was allocated once at the func entry and reused by every reduction
iteration of a static DMA, so the per-iteration tag identity was only an
artifact of the timing path's bridge. Make it real in the IR: when fine-grained
splits a matmul load, allocate a fresh tag memref.alloc just before the coarse
dma_start and replace_all_uses_with the old tag -- this rewires both the
re-emitted dma_start AND its dma_wait, and the coarse dma sits at the reduction-
loop body level so the alloc dominates the load and wait nests. Each reduction
iteration thus allocates its own tag (distinct for multi-tile-K / conv, no
coordinate enumeration); the now-dead func-entry alloc is erased. Sync stores
keep their tag.

Legacy materializes to a distinct alloc per iteration (its calc_tag accum
component becomes redundant); verified the 256^3 GEMM still passes and the trace
path is unchanged at 2518 cycles. The bridge FIXME is updated: build_skeleton
still collapses the in-loop alloc to one static tag_id, so the bridge's per-record
uniq is still what distinguishes iterations until that identity is threaded as an
SSA tag handle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
… slot

build_skeleton carried the dma_wait tag index verbatim onto
togsim.memory_barrier. lower_to_vcix builds that index with a -acc_iv term for each
accumulation (reduction) loop var -- a sentinel marking the reduction axis, not
an arithmetic offset (legacy TileGraphParser skips stride -1 for the same
reason). The matching async load index (dma_fine_grained) is subtile-only, so at
reduction iteration > 0 the producer evaluated -acc_iv to a negative slot, the
recorded barrier tag_slot diverged from the load slot, and TOGSim aborted with
"Key does not exist in subgraph's tag table" on subtile + multi-tile-K.

_strip_accum_terms now drops the negative-coefficient dim terms from the wait's
affine.apply (composing with a selector that zeros those dims), so the barrier
slot is subtile-only and pairs with its load. Reduction iterations are still
told apart by the per-iteration tag alloc and the fresh per-record Core key in
the bridge, not by the slot. Single-tile kernels (no reduction term) fall
through unchanged.
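
To make the slot mismatch concrete, here is a toy model in which an affine index is a dict of per-dimension coefficients. The real pass rewrites an MLIR affine.apply; the dict representation, the function names, and the sub_m/sub_n dimension names below are assumptions for illustration only.

```python
def strip_accum_terms_sketch(coeffs):
    """Drop negative-coefficient terms (the reduction-axis sentinel) from an index."""
    return {dim: c for dim, c in coeffs.items() if c >= 0}


def evaluate(coeffs, values):
    """Evaluate a linear index expression at concrete loop-variable values."""
    return sum(c * values[dim] for dim, c in coeffs.items())


load_index = {"sub_m": 2, "sub_n": 1}                # subtile-only, as on the load side
wait_index = {"sub_m": 2, "sub_n": 1, "acc_iv": -1}  # carries the -acc_iv sentinel

values = {"sub_m": 1, "sub_n": 0, "acc_iv": 3}       # some reduction iteration > 0
raw_slot = evaluate(wait_index, values)
fixed_slot = evaluate(strip_accum_terms_sketch(wait_index), values)
load_slot = evaluate(load_index, values)
print(raw_slot, fixed_slot, load_slot)
```

Evaluated verbatim, the wait index goes negative and no longer matches the load slot; with the sentinel term dropped, the barrier and the load agree on the subtile slot.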

Verified: 256x512x256 forced to 128x128 subtiles (2 K-tiles) now runs to 5774
cycles instead of crashing; single-tile 256^3 unchanged. Adds a self-contained
regression for _strip_accum_terms.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
Document that the trace tag_slot is subtile-only: build_skeleton strips the
lower_to_vcix -acc_iv accumulation marker from the dma_wait index so a
memory_barrier pairs with the slot its load wrote, mirroring legacy
TileGraphParser's skip of stride -1. Record that subtile + multi-tile-K
(256x512x256, 128x128 subtiles, 2 K-tiles) now runs at 5774 cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
The -1 coefficients lower_to_vcix puts on accumulation loop vars in the A/B
dma_wait tag indices are a reduction-axis sentinel honored only by the legacy
TOG path (TileGraphParser); the trace path strips them in
build_skeleton._strip_accum_terms. Document this at both emission sites and note
they are kept for byte-identity with the C++ -test-pytorchsim-to-vcix pass and
should be removed (not flagged) once legacy retires. Comments only; output is
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…_END

Replace the bare togsim_core_alloc marker with a higher-order
togsim_dispatch(ctx, tile_fn, iv, n_iv) wrapper. The runtime round-robins a core
from the pool, brackets the work-item with TILE_BEGIN/TILE_END trace records, and
invokes the producer's outlined tile function. The work-item scope is now exactly
the function call, not an implicit "ops until the next core_alloc" range, and one
general (kernel-independent) dispatcher serves every kernel via a uniform
iv-array tile signature (togsim_tile_fn). Core alloc and the begin/end boundary
are runtime-owned; the producer stays core-count transparent.

TraceRec gains TILE_BEGIN/TILE_END (drops DISPATCH); the bridge opens a subgraph
on TILE_BEGIN (bound to the record's core) and flushes it on TILE_END, and the
reference timer treats both as zero-cost boundaries. Verified on the subtile
256x512x256 case: 5774 cycles, identical to the pre-outline core_alloc form.
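
The dispatch contract described above can be modeled roughly like this. It is a Python stand-in for the C++ runtime, so the RuntimeSketch class, the COMPUTE record, and the tile function are hypothetical; only the round-robin core choice, the TILE_BEGIN/TILE_END bracketing, and the (ctx, iv, n_iv) tile signature come from the commit message.

```python
class RuntimeSketch:
    """Toy runtime: owns the core pool and the work-item boundary records."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.next_core = 0
        self.trace = []

    def dispatch(self, tile_fn, iv):
        # Round-robin a core from the pool; the producer never sees the core count.
        core = self.next_core
        self.next_core = (self.next_core + 1) % self.num_cores
        self.trace.append(("TILE_BEGIN", core))
        tile_fn(self, list(iv), len(iv))  # uniform (ctx, iv, n_iv) tile signature
        self.trace.append(("TILE_END", core))


def tile(ctx, iv, n_iv):
    """Hypothetical outlined tile body: records one op tagged with its indices."""
    ctx.trace.append(("COMPUTE", tuple(iv[:n_iv])))


rt = RuntimeSketch(num_cores=2)
for m in range(2):
    for n in range(2):
        rt.dispatch(tile, (m, n))
print(rt.trace)
```

The work-item scope here is exactly the tile_fn call, so every record between a TILE_BEGIN and its TILE_END belongs to one tile on one core.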

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…spatch

lower_to_emitc now outlines the innermost parallel-loop body into a uniform
togsim_kernel_tile(ctx, iv, n) func and replaces it with a
togsim_dispatch(ctx, togsim_kernel_tile, iv, n) call, instead of inserting a bare
togsim_core_alloc marker inline. The dispatcher loop marshals the parallel
induction vars (m, n) into an int64 array and passes the tile fn as a verbatim
function pointer (#emitc.opaque), so the work-item scope is the tile function body
and the runtime wrapper owns the core-alloc + TILE_BEGIN/TILE_END boundary.

The outline runs after the togsim.* ops become emitc.call_opaque: it moves the
body ops into the tile fn, recovers each parallel index as index_cast(iv[k])
inside it, and remaps the captured ctx / induction vars / constants (Value == is
identity; external constants are cloned). Only ctx, the parallel IVs, and
constants may be captured (dynamic-shape captures raise an error; deferred to P4). mlir-to-cpp
renders a static togsim_kernel_tile defined before the extern "C" togsim_kernel
dispatcher. togsim_ops gains DISPATCH_CALLEE / TILE_SYMBOL (drops
CORE_ALLOC_CALLEE).

Tests: the emitc/runtime harnesses define togsim_dispatch (calling the tile fn)
and the skeleton/emitc contract checks use DISPATCH_CALLEE; the outlined .so
builds, dlopens, and runs. Docs updated (outline DONE, ABI v12).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
