Feat/recursion v2 merge master #1374
Open
hero78119 wants to merge 20 commits into
## Problem

The Keccak precompile path carried redundant witness materialization and mixed memory-facing ecall work with permutation constraints. That made the prover path wider than necessary and harder to reason about when accounting for reads, writes, lookups, and GPU chip behavior.

## Design Rationale

This PR separates the memory boundary from the Keccak permutation work so each chip has a narrower responsibility: the ecall side handles state movement, while the core permutation chip owns the Keccak constraints. That makes the circuit accounting easier to validate and keeps memory interactions out of the core permutation path.

The witness reductions remove data that is already derivable from existing state: the `c_aux` prefix copy for `j = 0` and split-rotation witnesses for the zero-rotation RhoPi lane. These changes preserve the existing Keccak semantics and lookup model while reducing committed witness columns and range-check pressure.

Trade-off: the builder and witness generation now have a small amount of layout-specific branching, but the resulting circuit shape is more explicit and the hot path carries less redundant data.

## Change Highlights

- `ceno_zkvm`: split Keccak memory/ecall handling from the core permutation chip and rename the split chips for clarity.
- `ceno_zkvm`: reduce redundant Keccak witness columns and update lookup/range-check accounting to match the new layout.
- `ceno_zkvm`: update GPU chip wiring and debug comparison paths for the split Keccak chip structure.
- `ceno_zkvm`: clean up the Keccak ecall test helper lifetime so workspace clippy passes with warnings denied.

## Benchmark / Performance Impact

This is performance-sensitive because it reduces Keccak witness width and range-check work. End-to-end benchmark numbers are not included yet.

### Operation

| Operation | master (s) | this PR (s) | Improve (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| Not measured | N/A | N/A | N/A |

### Layer

| Layer | master (s) | this PR (s) | Improve (master -> this PR) |
|-------|------------|-------------|-----------------------------|
| Not measured | N/A | N/A | N/A |

Benchmark command(s):

```sh
# not run
```

Environment (CPU/GPU, core count, rust toolchain, commit hash): not measured

raw data:

- master: N/A
- this PR: N/A

## Testing

```sh
cargo check -p ceno_zkvm --lib
RUST_MIN_STACK=67108864 cargo test -p ceno_zkvm lookup_keccakf::tests::test_keccakf -- --nocapture
cargo make clippy
```

## Risks and Rollout

Main risk is mismatched witness layout or lookup accounting after splitting the Keccak path. The rollout is contained to the Keccak precompile/chip implementation; rollback is to restore the prior unified Keccak chip and witness layout.

## Follow-ups (optional)

- Add benchmark data for the Keccak split and witness-width reduction.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
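
As a rough illustration of the zero-rotation RhoPi reduction mentioned under Design Rationale in the Keccak description above: a 64-bit rotation is commonly expressed by splitting the lane into a high part and a low part and swapping them. When the rotation amount is zero (as it is for lane `(0, 0)` in Keccak's rho step), the output is just the input, so the split parts carry no extra information. The sketch below is a plain-integer model only; `rotl_via_split` is a hypothetical helper and is not taken from the `ceno_zkvm` circuit code.

```rust
/// Rotate a 64-bit lane left by `r` bits via an explicit hi/lo split.
/// Hypothetical helper for illustration; not the circuit's actual layout.
fn rotl_via_split(lane: u64, r: u32) -> u64 {
    assert!(r < 64, "rotation amount must be in 0..64");
    if r == 0 {
        // Zero rotation: the output equals the input, so no split parts
        // (and no range checks on them) are needed.
        return lane;
    }
    let hi = lane >> (64 - r); // top `r` bits of the lane
    let lo = lane & (u64::MAX >> r); // bottom `64 - r` bits of the lane
    (lo << r) | hi
}

fn main() {
    let lane = 0x0123_4567_89ab_cdef_u64;
    // The split-based rotation agrees with the built-in rotation.
    for r in 0..64 {
        assert_eq!(rotl_via_split(lane, r), lane.rotate_left(r));
    }
    println!("split-based rotation matches rotate_left for all r in 0..64");
}
```

In a constraint system, each split part would typically be a committed and range-checked witness, which is why skipping the split for the zero-rotation lane can reduce witness width.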
## Problem

The old tower argument proves each read, write, and lookup expression as its own tower spec. That is simple, but it creates a large proof surface: every spec carries its own tower proof metadata, evaluation points, and transcript data. Chips with many lookup expressions, especially Keccak, pay this cost hundreds or thousands of times.

This PR packs same-kind tower records together before tower proving, so the prover/verifier see fewer, wider tower specs instead of many narrow specs.

## Terminology

- **Tower**: the binary reduction tree used by the read/write/lookup argument. The leaf layer contains record MLEs; each internal layer folds adjacent values with the product or logup relation until one root remains.
- **Product tower**: the tower for read/write multiplicative checks. Each leaf row has two MLEs and internal nodes multiply the pair.
- **Logup tower**: the tower for lookup numerator/denominator checks. Each leaf row has `p1, p2, q1, q2`; internal nodes combine them into the next logup layer.
- **Spec**: one independent tower proof unit. Before this PR, each read/write/lookup expression usually became one spec.
- **Interleaving**: packing multiple specs into one wider leaf domain by adding operation bits after the row bits.
- **Virtual leaf**: a GPU representation that describes the interleaved leaf without materializing the full padded leaf layer.
- **Tail default**: rows outside the real occupied domain are known constants, so the prover can avoid storing them explicitly.

## Design Rationale

The protocol-level idea is to reduce the number of tower specs, not to change the meaning of the read/write/lookup argument.

Before interleaving, if a chip has `N` rows and `K` lookup expressions, it builds `K` independent logup towers:

```text
lk0 rows: [a0, a1, ..., aN]
lk1 rows: [b0, b1, ..., bN]
lk2 rows: [c0, c1, ..., cN]

proof specs: lk0 tower, lk1 tower, lk2 tower
```

After interleaving, the same values are packed into one logical tower.
The extra operation bits select which lookup expression is being addressed:

```text
row 0: [a0, b0, c0, padding]
row 1: [a1, b1, c1, padding]
row 2: [a2, b2, c2, padding]
...

proof specs: one interleaved lookup tower
```

For a toy case with `N = 4` rows and `K = 3` lookup expressions, the old layout has three towers of height `log2(4) = 2`. The interleaved layout has one tower over `4 rows * next_power_of_two(3) ops = 16` leaves, so its height is `log2(16) = 4`. The prover does some padding work, but the verifier and transcript only track one lookup tower spec instead of three.

For Keccak-like chips this matters more: many lookup expressions are compressed into one lookup tower. That is the main source of the proof-size reduction.

### Witness Build

Witness build now groups tower-facing MLEs by kind:

- read records become interleaved product tower inputs;
- write records become interleaved product tower inputs;
- lookup records become interleaved logup tower inputs.

On GPU, the build path keeps the interleaved leaf virtual where possible. It builds the first internal layer directly from the virtual leaf, then hands dense tower layers to the tower prover. This avoids keeping thousands of separate tower specs alive and avoids materializing the largest padded leaf layer.

### Proving

Tower proving runs sumcheck over the resulting product/logup tower specs. The verifier still checks the same product/logup relations; the difference is that the spec index is now encoded as operation bits in the MLE domain.

This reduces proof size because the proof contains fewer independent tower specs, fewer per-spec evaluation point lists, and fewer transcript commitments for those specs. The tradeoff is that some GPU work moves into wider interleaved towers, so aggregate tower-proving profiler spans can increase even when wall-clock `app_prove` improves through overlap and smaller proof surface.

## Change Highlights

- Add GPU tower interleaving for read/write/lookup tower records.
- Build tower-facing witnesses separately from unrelated witness outputs to reduce live VRAM pressure during tower proving.
- Add GPU memory estimation for the interleaved tower build/prove stages.
- Update prover/verifier handling for interleaved tower proof metadata and evaluation points.
- Keep the verifier semantics tied to the original read/write/lookup argument; interleaving changes representation, not the checked relation.

## Benchmark / Performance Impact

CI runs compare block `23817600` with GPU proving and proof output enabled:

- Before: https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/27769300721
- After: https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/27809992550

Published summaries:

- Before summary: https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/ceno/mainnet23817600-20260618-231248_summary.md
- After summary: https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/sc_first_round/mainnet23817600-20260619-144305_summary.md

### Wall Clock

| Metric | Before | After | Improvement |
| --- | ---: | ---: | ---: |
| E2E total | 83.300s | 72.400s | 1.15x faster, -13.1% |
| app_prove | 68.500s | 56.100s | 1.22x faster, -18.1% |
| Sum of shard `create_proof` times | 65.100s | 52.620s | 1.24x faster, -19.2% |
| emulator | 10.200s | 10.100s | 1.01x faster, -1.0% |

### Prover Stage Breakdown

These rows are profiler aggregate spans. With concurrent GPU proving, aggregate subspan time can exceed wall-clock `app_prove` because work from different shards overlaps.

| Stage | Before | After | Ratio / note |
| --- | ---: | ---: | --- |
| commit_traces | 17.590s | 17.263s | 1.02x faster |
| build_tower_witness_gpu | 1.428s | 18.765s | 13.14x more aggregate GPU work; interleaving moves real work into tower witness build |
| prove_tower_relation_gpu | 22.565s | 160.990s | 7.13x more aggregate GPU work; wider interleaved towers replace many narrow towers |
| prove_batched_main_constraints | 7.643s | 7.364s | 1.04x faster |
| pcs_opening | 9.953s | 7.996s | 1.24x faster |

### Proof Size

| Output | Before | After | Improvement |
| --- | ---: | ---: | ---: |
| `output/app_proof.bitcode` | 65.55 MiB | 23.75 MiB | 2.76x smaller, -63.8% |

The proof-size improvement is the primary win: the same block proof drops by `41.80 MiB` (`65.55 MiB - 23.75 MiB`). This matches the design goal of reducing the number of tower specs and their per-spec proof metadata.

## Testing

Validated through the benchmark CI runs above. Both runs completed successfully and verified the generated shard proofs.

## Risks and Rollout

- The main risk is representation mismatch between prover and verifier metadata for interleaved tower specs. The verifier must interpret operation bits consistently with the prover.
- GPU aggregate tower time is higher in the current implementation. This PR trades some prover-side tower work for much smaller proof size and better wall-clock app proving on the measured block.
- Rollback is straightforward: route tower witness build/proving back to the non-interleaved per-spec tower path.

## Follow-ups

- Continue optimizing interleaved tower GPU kernels, especially build/prove aggregate time.
- Add more targeted benchmarks for chips with very high lookup counts, such as Keccak, to track the cost of wider towers separately from full-block overlap.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
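
The toy case from the Design Rationale of the tower-interleaving description above (`N = 4` rows, `K = 3` lookup expressions, 16 leaves, height 4) can be reproduced with a small host-side sketch. This is an illustration under assumptions, not the GPU implementation: the function names are hypothetical, the leaf index `row * ops + op` simply follows the row-by-row diagram (the real bit order of row bits versus operation bits may differ), and the padding value is passed in rather than assumed.

```rust
/// Number of operation bits needed to address `k` specs (hypothetical helper).
fn op_bits(k: usize) -> u32 {
    assert!(k >= 1);
    k.next_power_of_two().trailing_zeros()
}

/// Height of the interleaved tower: row bits plus operation bits.
/// Assumes `rows` is a power of two.
fn interleaved_height(rows: usize, k: usize) -> u32 {
    assert!(rows.is_power_of_two());
    rows.trailing_zeros() + op_bits(k)
}

/// Pack `k` per-spec columns into one leaf vector, padding unused
/// operation slots with `pad`. Index order is an assumption taken
/// from the diagram: leaf[row * ops + op].
fn interleave(specs: &[Vec<u64>], pad: u64) -> Vec<u64> {
    assert!(!specs.is_empty());
    let rows = specs[0].len();
    let ops = specs.len().next_power_of_two();
    let mut leaf = vec![pad; rows * ops];
    for (op, column) in specs.iter().enumerate() {
        assert_eq!(column.len(), rows, "all specs must have the same row count");
        for (row, value) in column.iter().enumerate() {
            leaf[row * ops + op] = *value;
        }
    }
    leaf
}

fn main() {
    // Three lookup expressions over four rows, as in the toy case.
    let specs = vec![
        vec![10, 11, 12, 13], // lk0: a0..a3
        vec![20, 21, 22, 23], // lk1: b0..b3
        vec![30, 31, 32, 33], // lk2: c0..c3
    ];
    let leaf = interleave(&specs, 0);
    println!("leaves = {}, height = {}", leaf.len(), interleaved_height(4, 3));
    println!("row 0 = {:?}", &leaf[0..4]);
}
```

The sketch materializes the padded leaf for clarity; the description above notes that the GPU path keeps this leaf virtual where possible.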
## Summary

We want to thank **Sujal Tuladhar (EvilGenius)** for reporting this vulnerability.

- skip self-hosted integration jobs for fork PRs
- require same-repo PRs for GPU self-hosted CI and isolate PR cargo target output
- unset the runner registration PAT before starting the Actions runner
- document fork PR, PAT, and cache isolation guidance for the GPU runner
# Conflicts:
#	.github/workflows/integration.yml
#	Cargo.lock
#	Cargo.toml
#	ceno_recursion/src/zkvm_verifier/verifier.rs
#	ceno_zkvm/src/precompiles/lookup_keccakf.rs
#	ceno_zkvm/src/scheme/gpu/mod.rs
#	ceno_zkvm/src/scheme/verifier.rs
#	ceno_zkvm/src/structs.rs
## Problem

ShardRam currently mixes the leaf RAM/Poseidon work and the EC accumulation tree in one circuit. That keeps the large Poseidon-heavy leaf witness on a 2n domain even though the EC tree is the part that naturally needs the binary-tree 2n layout.

This PR splits the EC tree into a separate `ShardRamEcTreeCircuit` and connects the leaf and EC-tree chips through a compact custom RAM record.

## Design Rationale

Golden rules for chip splitting:

1. Best case: trade a smaller lookup domain, especially across a power-of-two boundary, for an extra product read/write argument. This is the highest-value split because it can reduce the dominant lookup proving work.
2. Second-best case: trade smaller resident memory for an extra product read/write argument. This is still useful when cached witness/device residency is the bottleneck, but it may not improve e2e time if the added chip/product work dominates.

The split keeps the ShardRam leaf on an n-sized domain and moves only the EC tree into a dedicated 2n chip. The chip boundary is connected with `RAMType::Custom` records carrying the EC point `(x[0..7], y[0..7])`.

Soundness-sensitive points:

- The ShardRam leaf still computes/binds the Poseidon-derived x-coordinate and y-sign/range constraints.
- `ShardRamEcTreeCircuit` consumes the same EC point records and proves the EC accumulation tree separately.
- The custom bridge rows are checked in tests so active leaf reads/writes match EC-tree writes/reads.
- Padding/custom rows use neutral values where required; the new RAM custom read/write padding is one.

Trade-off: this reduces cached raw ShardRam witness footprint, but introduces an extra chip and custom read/write product argument. In the current Reth benchmark shape, the saved resident witness is not the dominant e2e bottleneck.

## Change Highlights

- `ceno_zkvm/src/tables/shard_ram.rs`
  - Split `ShardRamEcTreeCircuit` from `ShardRamCircuit`.
  - Compact the custom bridge record to `ShardRamEcPoint + x + y`.
  - Remove duplicated RAM/Poseidon fields from EC-tree.
  - Fix CPU Poseidon witness assignment to use `config.perm_config.p3_cols[0].id` instead of the old hardcoded offset.
  - Add focused selector/padding/custom-record tests.
- `ceno_zkvm/src/instructions/gpu/chips/shard_ram.rs`
  - Update GPU column maps for the split leaf and EC-tree layouts.
- `ceno-gpu/cpp/common/witgen/shard_ram_per_row.cuh`
  - Update EC-tree witness generation to write only x/y plus structural selector data.

## Benchmark / Performance Impact

### Operation

Block: `23587691`, full shards, `CENO_GPU_WITGEN=0`, `CENO_GPU_CACHE_LEVEL=1`, GPU enabled.

| Operation | master (s) | this PR (s) | Ratio (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| reth-block | 14.153 | 14.440 | `-1.02x` |
| create_proof_of_shard, shard 0 span | 4.240 | 4.480 | `-1.06x` |
| create_proof_of_shard, shard 1 span | 2.570 | 2.540 | `1.01x` |
| app.verify | 0.266 | 0.261 | `1.02x` |

Structured metric note: the JSON `create_proof_of_shard_time_ms` sample was `2568ms` on master and `2542ms` on this PR, but the span log is the clearer full-shard comparison because it reports both shard spans.
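
The ratio column in the Operation table appears to use a signed convention: a positive value means this PR is faster by that factor, and a negative value means it is slower by that factor. That reading is inferred from the table rows, not stated by the author. A minimal sketch of the convention, with `signed_ratio` as a hypothetical helper name:

```rust
/// Signed speed ratio between a master timing and a PR timing (seconds).
/// Positive: PR is faster by that factor. Negative: PR is slower by that factor.
/// The sign convention is an assumption inferred from the table above.
fn signed_ratio(master: f64, pr: f64) -> f64 {
    if pr <= master {
        master / pr
    } else {
        -(pr / master)
    }
}

fn main() {
    // (label, master seconds, this-PR seconds) copied from the Operation table.
    let rows = [
        ("reth-block", 14.153, 14.440),
        ("create_proof_of_shard, shard 0 span", 4.240, 4.480),
        ("create_proof_of_shard, shard 1 span", 2.570, 2.540),
        ("app.verify", 0.266, 0.261),
    ];
    for (label, master, pr) in rows {
        println!("{label}: {:.2}x", signed_ratio(master, pr));
    }
}
```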

### Layer

| Layer / Memory item | master | this PR | Ratio (master -> this PR) |
|---------------------|--------|---------|-----------------------------|
| ShardRam scheduled proof reservation | 92.00 MiB | 61.50 MiB leaf + 89.52 MiB EC-tree | `-1.64x` scheduler reservation |
| ShardRam raw cached witness estimate | ~378 MiB | ~206.5 MiB | `1.83x` resident raw witness reduction |
| ShardRam leaf rows | 262144 | 131072 | `2.00x` |
| ShardRam leaf witness columns | ~378 | 371 | leaf no longer includes EC slope/tree columns |
| ShardRamEcTree rows | included in baseline ShardRam 2n layout | 262144 | moved to separate chip |

Detailed memtrack from this PR, shard 0:

| Circuit | rows | witness columns | structural columns | resident | main witness | tower prove | ecc | total scheduler estimate |
|---------|------|-----------------|--------------------|----------|--------------|-------------|-----|--------------------------|
| ShardRamCircuit | 131072 | 371 | 3 | 1.50 MiB | 16.00 MiB | 45.89 MiB | 0.00 MiB | 61.50 MiB |
| ShardRamEcTreeCircuit | 262144 | 21 | 7 | 7.00 MiB | 8.00 MiB | 19.89 MiB | 72.52 MiB | 89.52 MiB |

Interpretation:

- The intended resident raw-witness reduction is present: about `378 MiB -> 206.5 MiB`, roughly `171.5 MiB` saved.
- The current e2e time does not improve because scheduler proof reservation is now split into two chips, and the EC quark allocation (`72.52 MiB`) remains in `ShardRamEcTreeCircuit`.
- With `cache=1`, the scheduler `resident=` estimate does not include retained raw witness device backing, so it should not be used alone to judge the saved Poseidon-column footprint.
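
The raw cached witness estimates above are consistent with a simple `rows * witness columns * element size` model. The sketch below reproduces the reported figures under two assumptions that are not stated in the description: each witness element occupies 4 bytes, and the baseline column count is exactly 378 (the table gives `~378`). `witness_mib` is a hypothetical helper.

```rust
/// Assumed size of one witness element in bytes (a 32-bit field element).
/// This is an assumption chosen because it reproduces the reported numbers.
const BYTES_PER_ELEMENT: u64 = 4;

/// Raw witness footprint in MiB for a chip with `rows` rows and `cols` witness columns.
fn witness_mib(rows: u64, cols: u64) -> f64 {
    (rows * cols * BYTES_PER_ELEMENT) as f64 / (1024.0 * 1024.0)
}

fn main() {
    // Baseline: single ShardRam chip on the 2n domain.
    let master = witness_mib(262_144, 378);
    // This PR: leaf chip on the n domain plus the EC-tree chip on the 2n domain.
    let leaf = witness_mib(131_072, 371);
    let ec_tree = witness_mib(262_144, 21);
    let split = leaf + ec_tree;

    println!("master:  {master:.1} MiB");
    println!("this PR: {leaf:.1} MiB leaf + {ec_tree:.1} MiB EC-tree = {split:.1} MiB");
    println!("saved:   {:.1} MiB ({:.2}x)", master - split, master / split);
}
```

Under these assumptions the leaf accounts for `185.5 MiB` and the EC tree for `21 MiB`, which together give the `~206.5 MiB` figure in the table.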

Benchmark command(s):

```sh
# master baseline and this PR used the same Reth shape:
CENO_GPU_WITGEN=0 \
CENO_CONCURRENT_CHIP_PROVING=1 \
CENO_GPU_CACHE_LEVEL=1 \
CENO_GPU_JAGGED_RESHAPE_LOG_HEIGHT=23 \
CENO_MAX_CELL_PER_SHARD=805306368 \
CENO_GPU_MEM_TRACKING=0 \
CENO_GPU_LARGE_TASK_BOOKING_MARGIN_MB=0 \
OUTPUT_PATH=<metrics-json> \
RUST_LOG=info \
cargo run --features 'jemalloc,gpu' --bin ceno-reth-benchmark-bin --release \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_emul.path="../ceno/ceno_emul"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_host.path="../ceno/ceno_host"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_zkvm.path="../ceno/ceno_zkvm"' \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  -- \
  --block-number 23587691 \
  --chain-id 1 \
  --cache-dir block_data \
  --mode prove-app \
  --app-proofs ../ceno/app_proof.bitcode

# Extra diagnostic run for this PR, shard 0 only:
CENO_GPU_MEM_TRACKING=1 ... --shard-id 0
```

Environment:

- CPU: AMD Ryzen 9 5900XT 16-Core Processor, 32 logical CPUs
- GPU: NVIDIA GeForce RTX 5070 Ti, 16303 MiB, driver 570.172.08
- Rust: `rustc 1.93.0-nightly (07bdbaedc 2025-11-19)`
- Branch: `feat/shardram_circuit`
- This PR commit tested locally: `8162e6f45a53226a93bbf05bd03fd9edb163d53d`
- Baseline master commit tested locally: `678910c71624ab69ea776a82f3ec99971cc3e6d9`

Raw data:

- master:
  - `metrics_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.json`
  - `sanity_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.log`
- this PR:
  - `metrics_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.json`
  - `sanity_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.log`
  - memtrack diagnostic: `sanity_23587691_shard0_memtrack_shardram_split_current_20260623.log`

## CI Benchmark Comparison: Reth Block 23817600

Comparison source:

- Feature run: https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/28036220554
- Baseline run: https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/27939140577

Both runs passed. No artifacts were published, so the metrics below were extracted from the GitHub Actions log archive `1_benchmark.txt`.

| Metric | Baseline | Feature | Ratio |
|--------|----------|---------|-------|
| Ceno ref | `feat/opt#1e2ff5b` | `feat/shardram_circuit#de5cc96c` | changed |
| Ceno-GPU ref | `feat/opt_sc_first_round#c61c925` | `feat/shardram_circuit#134576c4` | changed |
| Block | `23817600` | `23817600` | same |
| Shards | `13` | `13` | same |
| `reth-block` | `72.7s` | `69.8s` | `1.04x` faster |
| sum `create_proof_of_shard` | `52.86s` | `50.03s` | `1.06x` faster |
| `app.verify` | `1.97s` | `1.96s` | `1.01x` faster |
| sum `generate_witness` | `40.03s` | `39.22s` | `1.02x` faster |
| proof size | `24,899,592 bytes`, `23.75 MiB` | `24,976,657 bytes`, `23.82 MiB` | `-1.00x` larger |
| total verifier chip groups | `710` | `723` | `-1.02x`, exactly +1 per shard |

Per-shard `create_proof_of_shard` spans:

| Shard | Baseline | Feature | Ratio |
|-------|----------|---------|-------|
| 0 | `4.64s` | `4.49s` | `1.03x` faster |
| 1 | `4.08s` | `4.06s` | `1.00x` faster |
| 2 | `4.30s` | `4.18s` | `1.03x` faster |
| 3 | `4.60s` | `4.39s` | `1.05x` faster |
| 4 | `4.57s` | `4.28s` | `1.07x` faster |
| 5 | `4.34s` | `4.23s` | `1.03x` faster |
| 6 | `4.20s` | `3.89s` | `1.08x` faster |
| 7 | `4.88s` | `4.68s` | `1.04x` faster |
| 8 | `4.21s` | `3.78s` | `1.11x` faster |
| 9 | `3.99s` | `3.64s` | `1.10x` faster |
| 10 | `4.13s` | `3.77s` | `1.10x` faster |
| 11 | `3.76s` | `3.42s` | `1.10x` faster |
| 12 | `1.16s` | `1.22s` | `-1.05x` slower |

Interpretation:

- The CI e2e improvement is real for this feature-branch comparison: `reth-block` improves from `72.7s` to `69.8s`, or `1.04x` faster.
- The improvement comes mainly from shard proving. The summed `create_proof_of_shard` spans are `1.06x` faster, almost matching the `reth-block` gain.
- Verification is flat and proof size is slightly larger, so the win is not from smaller proof output or fewer verifier chip groups.
- The feature adds one chip group per shard, consistent with the new `ShardRamEcTreeCircuit`.
- The logs show the expected split effect: baseline has `ShardRamCircuit estimated=170.52MB`; feature has `ShardRamCircuit estimated=104.39MB` plus `ShardRamEcTreeCircuit estimated=168.52MB`. The ShardRam leaf got smaller, while the new EC-tree chip adds work.
- Caveat: this is not a pure ShardRam split A/B. The feature run also switches Ceno, Ceno-GPU, and GKR refs.

The defensible conclusion is that the feature branch improves e2e because shard proving is faster across most shards despite a slightly larger proof and one extra chip per shard. This CI alone does not isolate how much of the improvement comes from the split itself versus the newer dependency set.

## Testing

```sh
cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_split_selectors_and_tower_padding -- --nocapture

cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_circuit -- --nocapture

cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

cargo run --config net.git-fetch-with-cli=true \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  --release --package ceno_zkvm --features gpu --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

# Reth full-shard GPU validation, cache=1/witgen=0/full shards, command shape shown above.
```

Outcomes:

- ShardRam selector/padding/custom-record test: passed.
- ShardRam circuit test: passed.
- CPU `keccak_syscall` e2e: passed.
- GPU `keccak_syscall` e2e: passed.
- Reth full-shard GPU validation: passed, final `exit code 0. Success.`

## Risks and Rollout

- Soundness risk is concentrated in the new custom bridge between leaf and EC-tree. This is covered by active-row custom read/write matching tests and CPU/GPU e2e validation.
- Performance risk: this split reduces cached raw witness VRAM but does not currently improve Reth e2e proof time under the tested `cache=1, witgen=0` shape. The extra chip and custom product argument can offset the raw witness saving.
- Rollback is local to the ShardRam split: revert the EC-tree chip split and restore the single-chip ShardRam layout.

## Follow-ups (optional)

- Add first-class metrics for retained raw witness device backing so cache=1 resident savings are visible directly in benchmark output.
- Investigate whether ShardRam leaf tower prove can be reduced by shrinking the custom bridge product argument or avoiding materialized main-witness outputs that are not needed across the full tower prove stage.
- Investigate scheduling policy so the extra EC-tree chip does not erase the resident-witness benefit in end-to-end proof time.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
# Conflicts:
#	ceno_zkvm/src/instructions/gpu/chips/shard_ram.rs
This reverts commit e89cbe7.