Feat/recursion v2 merge master #1374

Open
hero78119 wants to merge 20 commits into
feat/recursion-v2 from
feat/recursion-v2-merge-master
Conversation

@hero78119

Collaborator

Problem

Design Rationale

Change Highlights

Benchmark / Performance Impact

Operation

Operation master (s) this PR (s) Improve (master -> this PR)

Layer

Layer master (s) this PR (s) Improve (master -> this PR)

Benchmark command(s):

# paste exact command(s)

Environment (CPU/GPU, core count, rust toolchain, commit hash):

raw data:

  • master:
  • this PR:

Testing

# paste exact command(s), for example:
# cargo fmt --all --check
# cargo make clippy
# cargo make tests

Risks and Rollout

Follow-ups (optional)

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

hero78119 and others added 20 commits June 22, 2026 03:45
## Problem

The Keccak precompile path carried redundant witness materialization and
mixed memory-facing ecall work with permutation constraints. That made
the prover path wider than necessary and harder to reason about when
accounting for reads, writes, lookups, and GPU chip behavior.

## Design Rationale

This PR separates the memory boundary from the Keccak permutation work
so each chip has a narrower responsibility: the ecall side handles state
movement, while the core permutation chip owns the Keccak constraints.
That makes the circuit accounting easier to validate and keeps memory
interactions out of the core permutation path.

The witness reductions remove data that is already derivable from
existing state: the `c_aux` prefix copy for `j = 0` and split-rotation
witnesses for the zero-rotation RhoPi lane. These changes preserve the
existing Keccak semantics and lookup model while reducing committed
witness columns and range-check pressure.
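
To illustrate why a zero-rotation lane needs no split-rotation witnesses, here is a minimal sketch. It assumes a 64-bit lane and a split of the form `x = hi * 2^(64 - r) + lo`; the function names and the count of two witnesses per non-zero rotation are hypothetical and are not taken from the `ceno_zkvm` layout.

```rust
/// Split a 64-bit lane for a left-rotation by `r` bits (0 < r < 64).
/// Returns (hi, lo) with x = hi * 2^(64 - r) + lo.
fn split_for_rotation(x: u64, r: u32) -> (u64, u64) {
    assert!(r > 0 && r < 64);
    let hi = x >> (64 - r); // top r bits
    let lo = x & ((1u64 << (64 - r)) - 1); // low (64 - r) bits
    (hi, lo)
}

/// Recombine the two parts into the rotated lane: rotl(x, r) = lo * 2^r + hi.
fn rotate_from_split(hi: u64, lo: u64, r: u32) -> u64 {
    (lo << r) | hi
}

/// Rotate and report how many split witnesses were needed (hypothetical accounting).
/// For r == 0 the rotated lane is the input itself, so nothing extra is committed.
fn rotl_with_witnesses(x: u64, r: u32) -> (u64, usize) {
    if r == 0 {
        (x, 0)
    } else {
        let (hi, lo) = split_for_rotation(x, r);
        (rotate_from_split(hi, lo, r), 2)
    }
}
```

In this toy model, every non-zero offset pays for `hi` and `lo` (plus their range checks), while the zero-offset lane reuses the existing state value.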

Trade-off: the builder and witness generation now have a small amount of
layout-specific branching, but the resulting circuit shape is more
explicit and the hot path carries less redundant data.

## Change Highlights

- `ceno_zkvm`: split Keccak memory/ecall handling from the core
permutation chip and rename the split chips for clarity.
- `ceno_zkvm`: reduce redundant Keccak witness columns and update
lookup/range-check accounting to match the new layout.
- `ceno_zkvm`: update GPU chip wiring and debug comparison paths for the
split Keccak chip structure.
- `ceno_zkvm`: clean up the Keccak ecall test helper lifetime so
workspace clippy passes with warnings denied.

## Benchmark / Performance Impact

This is performance-sensitive because it reduces Keccak witness width
and range-check work. End-to-end benchmark numbers are not included yet.

### Operation

| Operation | master (s) | this PR (s) | Improve (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| Not measured | N/A | N/A | N/A |

### Layer

| Layer | master (s) | this PR (s) | Improve (master -> this PR) |
|-------|------------|-------------|-----------------------------|
| Not measured | N/A | N/A | N/A |

Benchmark command(s):

```sh
# not run
```

Environment (CPU/GPU, core count, rust toolchain, commit hash): not
measured

raw data:

- master: N/A
- this PR: N/A

## Testing

```sh
cargo check -p ceno_zkvm --lib
RUST_MIN_STACK=67108864 cargo test -p ceno_zkvm lookup_keccakf::tests::test_keccakf -- --nocapture
cargo make clippy
```

## Risks and Rollout

Main risk is mismatched witness layout or lookup accounting after
splitting the Keccak path. The rollout is contained to the Keccak
precompile/chip implementation; rollback is to restore the prior unified
Keccak chip and witness layout.

## Follow-ups (optional)

- Add benchmark data for the Keccak split and witness-width reduction.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
## Problem

The old tower argument proves each read, write, and lookup expression as
its own tower spec. That is simple, but it creates a large proof
surface: every spec carries its own tower proof metadata, evaluation
points, and transcript data. Chips with many lookup expressions,
especially Keccak, pay this cost hundreds or thousands of times.

This PR packs same-kind tower records together before tower proving, so
the prover/verifier see fewer, wider tower specs instead of many narrow
specs.

## Terminology

- **Tower**: the binary reduction tree used by the read/write/lookup
argument. The leaf layer contains record MLEs; each internal layer folds
adjacent values with the product or logup relation until one root
remains.
- **Product tower**: the tower for read/write multiplicative checks.
Each leaf row has two MLEs and internal nodes multiply the pair.
- **Logup tower**: the tower for lookup numerator/denominator checks.
Each leaf row has `p1, p2, q1, q2`; internal nodes combine them into the
next logup layer.
- **Spec**: one independent tower proof unit. Before this PR, each
read/write/lookup expression usually became one spec.
- **Interleaving**: packing multiple specs into one wider leaf domain by
adding operation bits after the row bits.
- **Virtual leaf**: a GPU representation that describes the interleaved
leaf without materializing the full padded leaf layer.
- **Tail default**: rows outside the real occupied domain are known
constants, so the prover can avoid storing them explicitly.
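
As a rough illustration of the logup tower described above, the sketch below folds a layer of fractions until one root remains. It assumes the usual fraction-addition rule `p1/q1 + p2/q2 = (p1*q2 + p2*q1) / (q1*q2)` and uses plain integers instead of field elements; the `Frac` type and function names are hypothetical.

```rust
/// One logup numerator/denominator pair. A leaf row (p1, p2, q1, q2)
/// corresponds to two adjacent `Frac` values in this toy model.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Frac {
    p: u128,
    q: u128,
}

/// Assumed combination rule: add two fractions without reducing them.
fn add_frac(a: Frac, b: Frac) -> Frac {
    Frac {
        p: a.p * b.q + b.p * a.q,
        q: a.q * b.q,
    }
}

/// Fold adjacent pairs into the next (half-size) layer.
fn fold_logup_layer(layer: &[Frac]) -> Vec<Frac> {
    layer.chunks(2).map(|c| add_frac(c[0], c[1])).collect()
}

/// Repeat the fold until a single root fraction remains.
fn logup_root(mut layer: Vec<Frac>) -> Frac {
    assert!(layer.len().is_power_of_two());
    while layer.len() > 1 {
        layer = fold_logup_layer(&layer);
    }
    layer[0]
}
```

A product tower has the same shape, with the pairwise combination replaced by a single multiplication.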

## Design Rationale

The protocol-level idea is to reduce the number of tower specs, not to
change the meaning of the read/write/lookup argument.

Before interleaving, if a chip has `N` rows and `K` lookup expressions,
it builds `K` independent logup towers:

```text
lk0 rows: [a0, a1, ..., a{N-1}]
lk1 rows: [b0, b1, ..., b{N-1}]
lk2 rows: [c0, c1, ..., c{N-1}]

proof specs: lk0 tower, lk1 tower, lk2 tower
```

After interleaving, the same values are packed into one logical tower.
The extra operation bits select which lookup expression is being
addressed:

```text
row 0: [a0, b0, c0, padding]
row 1: [a1, b1, c1, padding]
row 2: [a2, b2, c2, padding]
...

proof specs: one interleaved lookup tower
```

For a toy case with `N = 4` rows and `K = 3` lookup expressions, the old
layout has three towers of height `log2(4) = 2`. The interleaved layout
has one tower over `4 rows * next_power_of_two(3) ops = 16` leaves, so
its height is `log2(16) = 4`. The prover does some padding work, but the
verifier and transcript only track one lookup tower spec instead of
three.

For Keccak-like chips this matters more: many lookup expressions are
compressed into one lookup tower. That is the main source of the
proof-size reduction.
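
The toy-case arithmetic can be reproduced with a small sketch. The index order (`row * ops + op`, so the operation bits are the low bits of the leaf index) is an assumption read off the row diagram above, and the function names are hypothetical.

```rust
/// Leaf count and tower height of one interleaved tower.
/// Both dimensions are padded up to a power of two.
fn interleaved_shape(num_rows: usize, num_specs: usize) -> (usize, u32) {
    let rows = num_rows.next_power_of_two();
    let ops = num_specs.next_power_of_two();
    let leaves = rows * ops;
    // `leaves` is a power of two, so its log2 equals the trailing zero count.
    (leaves, leaves.trailing_zeros())
}

/// Assumed layout: each row holds `ops_padded` consecutive slots, one per spec.
fn leaf_index(row: usize, op: usize, ops_padded: usize) -> usize {
    row * ops_padded + op
}

/// Inverse of `leaf_index`: recover (row, op) from a flat leaf index.
fn split_index(idx: usize, ops_padded: usize) -> (usize, usize) {
    (idx / ops_padded, idx % ops_padded)
}
```

For `N = 4` and `K = 3`, `interleaved_shape(4, 3)` gives 16 leaves and height 4, and slot 3 of every row is padding.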

### Witness Build

Witness build now groups tower-facing MLEs by kind:

- read records become interleaved product tower inputs;
- write records become interleaved product tower inputs;
- lookup records become interleaved logup tower inputs.

On GPU, the build path keeps the interleaved leaf virtual where
possible. It builds the first internal layer directly from the virtual
leaf, then hands dense tower layers to the tower prover. This avoids
keeping thousands of separate tower specs alive and avoids materializing
the largest padded leaf layer.
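
A CPU-side sketch of the virtual-leaf idea, for a product tower only: the padded leaf layer is never allocated, and the first internal layer is computed by reading each slot on demand. The `VirtualLeaf` type, the toy modulus, and the `row * ops + op` layout are all assumptions made for illustration; the real GPU path is not shown in this PR description.

```rust
const P: u64 = 2_147_483_647; // toy prime modulus (2^31 - 1), not the real field

fn mul(a: u64, b: u64) -> u64 {
    (a % P) * (b % P) % P
}

/// Describes an interleaved leaf layer without storing the padded values.
struct VirtualLeaf<'a> {
    specs: &'a [Vec<u64>],
    num_rows: usize, // padded to a power of two
    num_ops: usize,  // padded to a power of two
    default: u64,    // known constant for slots outside the occupied domain
}

impl<'a> VirtualLeaf<'a> {
    fn new(specs: &'a [Vec<u64>], default: u64) -> Self {
        let rows = specs.iter().map(Vec::len).max().unwrap_or(0);
        VirtualLeaf {
            specs,
            num_rows: rows.next_power_of_two(),
            num_ops: specs.len().next_power_of_two(),
            default,
        }
    }

    fn len(&self) -> usize {
        self.num_rows * self.num_ops
    }

    /// Read one logical leaf; padding slots fall back to the default.
    fn get(&self, idx: usize) -> u64 {
        let (row, op) = (idx / self.num_ops, idx % self.num_ops);
        self.specs
            .get(op)
            .and_then(|s| s.get(row))
            .copied()
            .unwrap_or(self.default)
    }

    /// Build the first internal product layer directly from the virtual leaf.
    fn first_internal_layer(&self) -> Vec<u64> {
        (0..self.len() / 2)
            .map(|i| mul(self.get(2 * i), self.get(2 * i + 1)))
            .collect()
    }
}
```

Only the half-size internal layer is materialized; the full padded leaf exists only as the `get` rule.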

### Proving

Tower proving runs sumcheck over the resulting product/logup tower
specs. The verifier still checks the same product/logup relations; the
difference is that the spec index is now encoded as operation bits in
the MLE domain.
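
The claim that interleaving changes representation rather than the checked relation can be sanity-checked on a toy product tower. The sketch assumes padding slots hold the multiplicative identity and uses a small prime modulus in place of the real field; the helper names are hypothetical, and the actual verifier may consume per-spec values differently.

```rust
const P: u64 = 2_147_483_647; // toy prime modulus (2^31 - 1)

fn mul(a: u64, b: u64) -> u64 {
    (a % P) * (b % P) % P
}

/// Fold a power-of-two layer pairwise until the root remains.
fn product_root(mut layer: Vec<u64>) -> u64 {
    assert!(layer.len().is_power_of_two());
    while layer.len() > 1 {
        layer = layer.chunks(2).map(|c| mul(c[0], c[1])).collect();
    }
    layer[0] % P
}

/// Pack several specs into one leaf layer; unused slots get `pad`.
/// Assumed layout: index = row * ops_padded + op.
fn interleave(specs: &[Vec<u64>], pad: u64) -> Vec<u64> {
    let rows = specs.iter().map(Vec::len).max().unwrap_or(0).next_power_of_two();
    let ops = specs.len().next_power_of_two();
    let mut leaves = vec![pad; rows * ops];
    for (op, spec) in specs.iter().enumerate() {
        for (row, &v) in spec.iter().enumerate() {
            leaves[row * ops + op] = v;
        }
    }
    leaves
}
```

With `pad = 1`, the root of the single interleaved tower equals the product of the roots of the original per-spec towers, so the overall product claim is unchanged.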

This reduces proof size because the proof contains fewer independent
tower specs, fewer per-spec evaluation point lists, and fewer transcript
commitments for those specs. The tradeoff is that some GPU work moves
into wider interleaved towers, so aggregate tower-proving profiler spans
can increase even when wall-clock `app_prove` improves through overlap
and smaller proof surface.

## Change Highlights

- Add GPU tower interleaving for read/write/lookup tower records.
- Build tower-facing witnesses separately from unrelated witness outputs
to reduce live VRAM pressure during tower proving.
- Add GPU memory estimation for the interleaved tower build/prove
stages.
- Update prover/verifier handling for interleaved tower proof metadata
and evaluation points.
- Keep the verifier semantics tied to the original read/write/lookup
argument; interleaving changes representation, not the checked relation.

## Benchmark / Performance Impact

CI runs compare block `23817600` with GPU proving and proof output
enabled:

- Before:
https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/27769300721
- After:
https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/27809992550

Published summaries:

- Before summary:
https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/ceno/mainnet23817600-20260618-231248_summary.md
- After summary:
https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/sc_first_round/mainnet23817600-20260619-144305_summary.md

### Wall Clock

| Metric | Before | After | Improvement |
| --- | ---: | ---: | ---: |
| E2E total | 83.300s | 72.400s | 1.15x faster, -13.1% |
| app_prove | 68.500s | 56.100s | 1.22x faster, -18.1% |
| Sum of shard `create_proof` times | 65.100s | 52.620s | 1.24x faster, -19.2% |
| emulator | 10.200s | 10.100s | 1.01x faster, -1.0% |

### Prover Stage Breakdown

These rows are profiler aggregate spans. With concurrent GPU proving,
aggregate subspan time can exceed wall-clock `app_prove` because work
from different shards overlaps.

| Stage | Before | After | Ratio / note |
| --- | ---: | ---: | --- |
| commit_traces | 17.590s | 17.263s | 1.02x faster |
| build_tower_witness_gpu | 1.428s | 18.765s | 13.14x more aggregate GPU work; interleaving moves real work into tower witness build |
| prove_tower_relation_gpu | 22.565s | 160.990s | 7.13x more aggregate GPU work; wider interleaved towers replace many narrow towers |
| prove_batched_main_constraints | 7.643s | 7.364s | 1.04x faster |
| pcs_opening | 9.953s | 7.996s | 1.24x faster |

### Proof Size

| Output | Before | After | Improvement |
| --- | ---: | ---: | ---: |
| `output/app_proof.bitcode` | 65.55 MiB | 23.75 MiB | 2.76x smaller, -63.8% |

The proof-size improvement is the primary win: the same block proof
drops by `41.80 MiB` (65.55 MiB -> 23.75 MiB). This matches the design
goal of reducing the number of tower specs and their per-spec proof
metadata.
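
For readers who want to recompute the "Improvement" columns, a small helper along these lines reproduces the ratio-plus-percentage format. It is only a sketch of the arithmetic; the function names are hypothetical and the benchmark tooling may compute these figures differently.

```rust
/// Returns (before / after, signed percent change relative to `before`).
fn improvement(before: f64, after: f64) -> (f64, f64) {
    (before / after, (after - before) / before * 100.0)
}

/// Formats a table cell such as "<ratio>x faster, <signed percent>%".
/// `word` is the label to print after the ratio ("faster", "smaller", ...).
fn format_improvement(before: f64, after: f64, word: &str) -> String {
    let (ratio, pct) = improvement(before, after);
    format!("{:.2}x {}, {:+.1}%", ratio, word, pct)
}
```

Calling `format_improvement` with the before/after values of any row in the wall-clock or proof-size tables should reproduce that row's last column.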

## Testing

Validated through the benchmark CI runs above. Both runs completed
successfully and verified the generated shard proofs.

## Risks and Rollout

- The main risk is representation mismatch between prover and verifier
metadata for interleaved tower specs. The verifier must interpret
operation bits consistently with the prover.
- GPU aggregate tower time is higher in the current implementation. This
PR trades some prover-side tower work for much smaller proof size and
better wall-clock app proving on the measured block.
- Rollback is straightforward: route tower witness build/proving back to
the non-interleaved per-spec tower path.

## Follow-ups

- Continue optimizing interleaved tower GPU kernels, especially
build/prove aggregate time.
- Add more targeted benchmarks for chips with very high lookup counts,
such as Keccak, to track the cost of wider towers separately from
full-block overlap.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
## Summary

We want to thank **Sujal Tuladhar (EvilGenius)** for reporting this
vulnerability.

- skip self-hosted integration jobs for fork PRs
- require same-repo PRs for GPU self-hosted CI and isolate PR cargo
target output
- unset the runner registration PAT before starting the Actions runner
- document fork PR, PAT, and cache isolation guidance for the GPU runner
Remove one degree of freedom by simplifying each chip to carry just one proof.

### related PR
- to extend to vector of proofs:
#1126
- optimize with less global shard records
#1171
# Conflicts:
#	.github/workflows/integration.yml
#	Cargo.lock
#	Cargo.toml
#	ceno_recursion/src/zkvm_verifier/verifier.rs
#	ceno_zkvm/src/precompiles/lookup_keccakf.rs
#	ceno_zkvm/src/scheme/gpu/mod.rs
#	ceno_zkvm/src/scheme/verifier.rs
#	ceno_zkvm/src/structs.rs
## Problem

ShardRam currently mixes the leaf RAM/Poseidon work and the EC
accumulation tree in one circuit. That keeps the large Poseidon-heavy
leaf witness on a 2n domain even though the EC tree is the part that
naturally needs the binary-tree 2n layout.

This PR splits the EC tree into a separate `ShardRamEcTreeCircuit` and
connects the leaf and EC-tree chips through a compact custom RAM record.

## Design Rationale

Golden rules for chip splitting:

1. Best case: trade a smaller lookup domain, especially across a
power-of-two boundary, for an extra product read/write argument. This is
the highest-value split because it can reduce the dominant lookup
proving work.
2. Second-best case: trade smaller resident memory for an extra product
read/write argument. This is still useful when cached witness/device
residency is the bottleneck, but it may not improve e2e time if the
added chip/product work dominates.

The split keeps the ShardRam leaf on an n-sized domain and moves only
the EC tree into a dedicated 2n chip. The chip boundary is connected
with `RAMType::Custom` records carrying the EC point `(x[0..7],
y[0..7])`.

Soundness-sensitive points:

- The ShardRam leaf still computes/binds the Poseidon-derived
x-coordinate and y-sign/range constraints.
- `ShardRamEcTreeCircuit` consumes the same EC point records and proves
the EC accumulation tree separately.
- The custom bridge rows are checked in tests so active leaf
reads/writes match EC-tree writes/reads.
- Padding/custom rows use neutral values where required; the new RAM
custom read/write padding is one.
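
The bridge check in the tests can be pictured as a multiset comparison between the records one chip writes and the records the other chip reads. The sketch below is a plain-Rust stand-in: the circuit enforces this with the product read/write argument, not a hash map. The `EcPointRecord` type is hypothetical, and reading `x[0..7]` as a half-open range of 7 limbs is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical bridge record: an EC point as two 7-limb coordinates.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct EcPointRecord {
    x: [u32; 7],
    y: [u32; 7],
}

/// Count how many times each record occurs.
fn multiset(records: &[EcPointRecord]) -> HashMap<EcPointRecord, usize> {
    let mut counts = HashMap::new();
    for r in records {
        *counts.entry(*r).or_insert(0) += 1;
    }
    counts
}

/// The bridge is balanced when one side's writes match the other side's
/// reads as multisets, regardless of row order.
fn bridge_is_balanced(writes: &[EcPointRecord], reads: &[EcPointRecord]) -> bool {
    multiset(writes) == multiset(reads)
}
```

A record that is written but never read, or read with the wrong multiplicity, makes the two multisets differ, which is the failure mode the active-row tests are meant to catch.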

Trade-off: this reduces cached raw ShardRam witness footprint, but
introduces an extra chip and custom read/write product argument. In the
current Reth benchmark shape, the saved resident witness is not the
dominant e2e bottleneck.

## Change Highlights

- `ceno_zkvm/src/tables/shard_ram.rs`
  - Split `ShardRamEcTreeCircuit` from `ShardRamCircuit`.
  - Compact the custom bridge record to `ShardRamEcPoint + x + y`.
  - Remove duplicated RAM/Poseidon fields from EC-tree.
  - Fix CPU Poseidon witness assignment to use
    `config.perm_config.p3_cols[0].id` instead of the old hardcoded offset.
  - Add focused selector/padding/custom-record tests.
- `ceno_zkvm/src/instructions/gpu/chips/shard_ram.rs`
  - Update GPU column maps for the split leaf and EC-tree layouts.
- `ceno-gpu/cpp/common/witgen/shard_ram_per_row.cuh`
  - Update EC-tree witness generation to write only x/y plus structural
    selector data.

## Benchmark / Performance Impact

### Operation

Block: `23587691`, full shards, `CENO_GPU_WITGEN=0`,
`CENO_GPU_CACHE_LEVEL=1`, GPU enabled.

| Operation | master (s) | this PR (s) | Ratio (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| reth-block | 14.153 | 14.440 | `-1.02x` |
| create_proof_of_shard, shard 0 span | 4.240 | 4.480 | `-1.06x` |
| create_proof_of_shard, shard 1 span | 2.570 | 2.540 | `1.01x` |
| app.verify | 0.266 | 0.261 | `1.02x` |

Structured metric note: the JSON `create_proof_of_shard_time_ms` sample
was `2568ms` on master and `2542ms` on this PR, but the span log is the
clearer full-shard comparison because it reports both shard spans.

### Layer

| Layer / Memory item | master | this PR | Ratio (master -> this PR) |
|---------------------|--------|---------|-----------------------------|
| ShardRam scheduled proof reservation | 92.00 MiB | 61.50 MiB leaf + 89.52 MiB EC-tree | `-1.64x` scheduler reservation |
| ShardRam raw cached witness estimate | ~378 MiB | ~206.5 MiB | `1.83x` resident raw witness reduction |
| ShardRam leaf rows | 262144 | 131072 | `2.00x` |
| ShardRam leaf witness columns | ~378 | 371 | leaf no longer includes EC slope/tree columns |
| ShardRamEcTree rows | included in baseline ShardRam 2n layout | 262144 | moved to separate chip |

Detailed memtrack from this PR, shard 0:

| Circuit | rows | witness columns | structural columns | resident | main witness | tower prove | ecc | total scheduler estimate |
|---------|------|-----------------|--------------------|----------|--------------|-------------|-----|--------------------------|
| ShardRamCircuit | 131072 | 371 | 3 | 1.50 MiB | 16.00 MiB | 45.89 MiB | 0.00 MiB | 61.50 MiB |
| ShardRamEcTreeCircuit | 262144 | 21 | 7 | 7.00 MiB | 8.00 MiB | 19.89 MiB | 72.52 MiB | 89.52 MiB |

Interpretation:

- The intended resident raw-witness reduction is present: about `378 MiB
-> 206.5 MiB`, roughly `171.5 MiB` saved.
- The current e2e time does not improve because scheduler proof
reservation is now split into two chips, and the EC quark allocation
(`72.52 MiB`) remains in `ShardRamEcTreeCircuit`.
- With `cache=1`, the scheduler `resident=` estimate does not include
retained raw witness device backing, so it should not be used alone to
judge the saved Poseidon-column footprint.
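
The raw-witness figures above are consistent with a simple `rows * witness columns * element size` estimate. The sketch below assumes 4-byte field elements and counts only witness columns; both are assumptions that happen to reproduce the quoted numbers, not a description of how the scheduler computes them. The baseline column count is approximate in the table (`~378`).

```rust
/// Estimated raw witness size in MiB for one circuit.
/// `bytes_per_elem` is an assumption (4 for a 32-bit field element).
fn raw_witness_mib(rows: u64, witness_cols: u64, bytes_per_elem: u64) -> f64 {
    (rows * witness_cols * bytes_per_elem) as f64 / (1024.0 * 1024.0)
}

/// Baseline: a single ShardRam chip on the 2n domain.
fn baseline_estimate_mib() -> f64 {
    raw_witness_mib(262_144, 378, 4)
}

/// This PR: n-sized leaf chip plus the separate 2n EC-tree chip.
fn split_estimate_mib() -> f64 {
    raw_witness_mib(131_072, 371, 4) + raw_witness_mib(262_144, 21, 4)
}
```

Under these assumptions the baseline comes to 378 MiB and the split layout to 185.5 MiB + 21 MiB = 206.5 MiB, which is where the roughly 171.5 MiB saving comes from.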

Benchmark command(s):

```sh
# master baseline and this PR used the same Reth shape:
CENO_GPU_WITGEN=0 \
CENO_CONCURRENT_CHIP_PROVING=1 \
CENO_GPU_CACHE_LEVEL=1 \
CENO_GPU_JAGGED_RESHAPE_LOG_HEIGHT=23 \
CENO_MAX_CELL_PER_SHARD=805306368 \
CENO_GPU_MEM_TRACKING=0 \
CENO_GPU_LARGE_TASK_BOOKING_MARGIN_MB=0 \
OUTPUT_PATH=<metrics-json> \
RUST_LOG=info \
cargo run --features 'jemalloc,gpu' --bin ceno-reth-benchmark-bin --release \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_emul.path="../ceno/ceno_emul"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_host.path="../ceno/ceno_host"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_zkvm.path="../ceno/ceno_zkvm"' \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  -- \
  --block-number 23587691 \
  --chain-id 1 \
  --cache-dir block_data \
  --mode prove-app \
  --app-proofs ../ceno/app_proof.bitcode

# Extra diagnostic run for this PR, shard 0 only:
CENO_GPU_MEM_TRACKING=1 ... --shard-id 0
```

Environment:

- CPU: AMD Ryzen 9 5900XT 16-Core Processor, 32 logical CPUs
- GPU: NVIDIA GeForce RTX 5070 Ti, 16303 MiB, driver 570.172.08
- Rust: `rustc 1.93.0-nightly (07bdbaedc 2025-11-19)`
- Branch: `feat/shardram_circuit`
- This PR commit tested locally:
`8162e6f45a53226a93bbf05bd03fd9edb163d53d`
- Baseline master commit tested locally:
`678910c71624ab69ea776a82f3ec99971cc3e6d9`

Raw data:

- master:
  - `metrics_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.json`
  - `sanity_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.log`
- this PR:
  - `metrics_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.json`
  - `sanity_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.log`
- memtrack diagnostic:
  `sanity_23587691_shard0_memtrack_shardram_split_current_20260623.log`


## CI Benchmark Comparison: Reth Block 23817600

Comparison source:

- Feature run:
https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/28036220554
- Baseline run:
https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/27939140577

Both runs passed. No artifacts were published, so the metrics below were
extracted from the GitHub Actions log archive `1_benchmark.txt`.

| Metric | Baseline | Feature | Ratio |
|--------|----------|---------|-------|
| Ceno ref | `feat/opt#1e2ff5b` | `feat/shardram_circuit#de5cc96c` | changed |
| Ceno-GPU ref | `feat/opt_sc_first_round#c61c925` | `feat/shardram_circuit#134576c4` | changed |
| Block | `23817600` | `23817600` | same |
| Shards | `13` | `13` | same |
| `reth-block` | `72.7s` | `69.8s` | `1.04x` faster |
| sum `create_proof_of_shard` | `52.86s` | `50.03s` | `1.06x` faster |
| `app.verify` | `1.97s` | `1.96s` | `1.01x` faster |
| sum `generate_witness` | `40.03s` | `39.22s` | `1.02x` faster |
| proof size | `24,899,592 bytes`, `23.75 MiB` | `24,976,657 bytes`, `23.82 MiB` | `-1.00x` larger |
| total verifier chip groups | `710` | `723` | `-1.02x`, exactly +1 per shard |

Per-shard `create_proof_of_shard` spans:

| Shard | Baseline | Feature | Ratio |
|-------|----------|---------|-------|
| 0 | `4.64s` | `4.49s` | `1.03x` faster |
| 1 | `4.08s` | `4.06s` | `1.00x` faster |
| 2 | `4.30s` | `4.18s` | `1.03x` faster |
| 3 | `4.60s` | `4.39s` | `1.05x` faster |
| 4 | `4.57s` | `4.28s` | `1.07x` faster |
| 5 | `4.34s` | `4.23s` | `1.03x` faster |
| 6 | `4.20s` | `3.89s` | `1.08x` faster |
| 7 | `4.88s` | `4.68s` | `1.04x` faster |
| 8 | `4.21s` | `3.78s` | `1.11x` faster |
| 9 | `3.99s` | `3.64s` | `1.10x` faster |
| 10 | `4.13s` | `3.77s` | `1.10x` faster |
| 11 | `3.76s` | `3.42s` | `1.10x` faster |
| 12 | `1.16s` | `1.22s` | `-1.05x` slower |

Interpretation:

- The CI e2e improvement is real for this feature-branch comparison:
`reth-block` improves from `72.7s` to `69.8s`, or `1.04x` faster.
- The improvement comes mainly from shard proving. The summed
`create_proof_of_shard` spans are `1.06x` faster, almost matching the
`reth-block` gain.
- Verification is flat and proof size is slightly larger, so the win is
not from smaller proof output or fewer verifier chip groups.
- The feature adds one chip group per shard, consistent with the new
`ShardRamEcTreeCircuit`.
- The logs show the expected split effect: baseline has `ShardRamCircuit
estimated=170.52MB`; feature has `ShardRamCircuit estimated=104.39MB`
plus `ShardRamEcTreeCircuit estimated=168.52MB`. The ShardRam leaf got
smaller, while the new EC-tree chip adds work.
- Caveat: this is not a pure ShardRam split A/B. The feature run also
switches Ceno, Ceno-GPU, and GKR refs. The defensible conclusion is that
the feature branch improves e2e because shard proving is faster across
most shards despite a slightly larger proof and one extra chip per
shard. This CI alone does not isolate how much of the improvement comes
from the split itself versus the newer dependency set.

## Testing

```sh
cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_split_selectors_and_tower_padding -- --nocapture

cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_circuit -- --nocapture

cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

cargo run --config net.git-fetch-with-cli=true \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  --release --package ceno_zkvm --features gpu --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

# Reth full-shard GPU validation, cache=1/witgen=0/full shards, command shape shown above.
```

Outcomes:

- ShardRam selector/padding/custom-record test: passed.
- ShardRam circuit test: passed.
- CPU `keccak_syscall` e2e: passed.
- GPU `keccak_syscall` e2e: passed.
- Reth full-shard GPU validation: passed, final `exit code 0. Success.`

## Risks and Rollout

- Soundness risk is concentrated in the new custom bridge between leaf
and EC-tree. This is covered by active-row custom read/write matching
tests and CPU/GPU e2e validation.
- Performance risk: this split reduces cached raw witness VRAM but does
not currently improve Reth e2e proof time under the tested `cache=1,
witgen=0` shape. The extra chip and custom product argument can offset
the raw witness saving.
- Rollback is local to the ShardRam split: revert the EC-tree chip split
and restore the single-chip ShardRam layout.

## Follow-ups (optional)

- Add first-class metrics for retained raw witness device backing so
cache=1 resident savings are visible directly in benchmark output.
- Investigate whether ShardRam leaf tower prove can be reduced by
shrinking the custom bridge product argument or avoiding materialized
main-witness outputs that are not needed across the full tower prove
stage.
- Investigate scheduling policy so the extra EC-tree chip does not erase
the resident-witness benefit in end-to-end proof time.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
# Conflicts:
#	ceno_zkvm/src/instructions/gpu/chips/shard_ram.rs