ezkl.prove() deadlocks deterministically on certain inputs (mv_lookup hang) #1029

## Summary (Let me know if you need more info!)

`ezkl.prove()` hangs indefinitely on certain inputs. The same install, same compiled circuit, and same proving key handle other inputs in ~1 second. The hang appears to be in halo2's `mv_lookup` prover. We've reproduced it with a minimal self-contained script (~70 lines, tiny MLP, no external assets).

## Versions

- ezkl: **23.0.5** (from pip)
- Python: 3.14.4
- OS: Ubuntu 25.10
- Hardware: AMD Ryzen 9 9950X3D, 60GB RAM (CPU prove path, no icicle)

## Minimal reproducer

```python
import json, os, subprocess, sys, tempfile
import torch, torch.nn as nn, ezkl

WORK = tempfile.mkdtemp(prefix="ezkl-repro-")
os.chdir(WORK)

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.act, self.fc2 = nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1)
    def forward(self, x):
        return torch.sigmoid(self.fc2(self.act(self.fc1(x))))

torch.manual_seed(42)
model = TinyMLP().eval()
torch.onnx.export(model, torch.zeros(1, 6), "model.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=17, dynamo=False)

json.dump({"input_shapes":[[1,6]], "input_data":[[0.5]*6], "output_data":[[0.5]]},
          open("cal.json","w"))

ezkl.gen_settings("model.onnx", "settings.json")
ezkl.calibrate_settings("cal.json", "model.onnx", "settings.json", "resources")
ezkl.compile_circuit("model.onnx", "circuit.ezkl", "settings.json")
logrows = json.load(open("settings.json"))["run_args"]["logrows"]
ezkl.gen_srs("kzg.srs", logrows)
ezkl.setup("circuit.ezkl", "vk.key", "pk.key", srs_path="kzg.srs")

def try_prove(label, inp, timeout=60):
    json.dump({"input_shapes":[[1,6]], "input_data":[inp]}, open("input.json","w"))
    ezkl.gen_witness("input.json", "circuit.ezkl", "witness.json",
                     vk_path="vk.key", srs_path="kzg.srs")
    proof = f"proof-{label}.json"
    if os.path.exists(proof): os.remove(proof)
    code = (f"import ezkl; ezkl.prove(witness='witness.json', model='circuit.ezkl', "
            f"pk_path='pk.key', proof_path='{proof}', srs_path='kzg.srs')")
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout, capture_output=True)
        print(f"{label}: {'OK' if os.path.exists(proof) else 'NO PROOF'}")
    except subprocess.TimeoutExpired:
        print(f"{label}: HANG (>{timeout}s)")

# This input hangs deterministically
try_prove("hangs",     [0.95, 0.82, 0.71, 0.88, 0.05, 0.91])
# This input (every value halved) succeeds in ~1s on the same circuit
try_prove("succeeds",  [0.475, 0.41, 0.355, 0.44, 0.025, 0.455])
# Sanity check: this also succeeds
try_prove("baseline",  [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
```

Expected output:

```
hangs: HANG (>60s)
succeeds: OK
baseline: OK
```

## Empirical observations

We tested ~20 inputs against this tiny model and against a larger LSTM-based model in a separate project. Patterns observed:

- **Depends on the circuit and input together, not on the input alone.** The exact input `[0.95, ..., 0.91]` hangs on the TinyMLP above but succeeds on a different model (an LSTM-based anomaly detector). Conversely, inputs that hang on the LSTM (e.g. `[0.0, 0.042, 0.555, 0.004, 0.0, 0.355]`) succeed on the TinyMLP.
- **Robust to small input perturbations.** Changing one value by 0.01, or swapping value positions, does not escape the hang.
- **Sensitive to magnitude scaling.** Halving all input values escapes the hang: the proportions between values are unchanged, but the absolute magnitudes after quantization are smaller, and the prove succeeds.
- **Not a thread-count issue.** Tested `RAYON_NUM_THREADS=1, 2, 4, 8, 16, 32` against a hung input. All hang identically.
- **Not an environment issue.** Fresh cache rebuild reproduces. numpy 1.x vs 2.x doesn't change the behavior.
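
To make the magnitude-scaling observation concrete, here is a rough sketch of what the two inputs from the reproducer look like after fixed-point quantization. It assumes quantization is `round(x * 2**scale)` and that the input scale is 7; both are assumptions for illustration, and the scale the circuit actually uses should be read from the generated `settings.json` rather than taken from this sketch.

```python
def quantize(values, scale):
    """Fixed-point encode floats as round(x * 2**scale) (assumed scheme)."""
    return [round(v * 2 ** scale) for v in values]

SCALE = 7  # assumption for illustration; check settings.json for the real value

hanging = [0.95, 0.82, 0.71, 0.88, 0.05, 0.91]
halved = [v / 2 for v in hanging]

print(quantize(hanging, SCALE))  # [122, 105, 91, 113, 6, 116]
print(quantize(halved, SCALE))   # [61, 52, 45, 56, 3, 58]
```

Under these assumptions the halved input lands on roughly half-sized integers, which is consistent with the hang depending on the quantized magnitudes that flow through the circuit rather than on the ratios between input values.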

## Stack trace at the hang point

GDB on the hung process shows 33 threads (1 main + 32 rayon workers) in `futex_do_wait`. Main thread's stack:

```
#0 syscall
#1 rayon_core::latch::LockLatch::wait_and_reset
#2 rayon_core::registry::Registry::in_worker_cold
   .. rayon::iter::plumbing::bridge_producer_consumer::helper<...
      halo2_proofs::plonk::mv_lookup::Argument<...Fr> ...
      halo2_proofs::plonk::prover::create_proof<KZG, ShPlonk, Bn256, GraphCircuit>
        ::{closure#6}::{closure#0}
   ..>>
#6 halo2_proofs::plonk::prover::create_proof<...>
#7 ezkl::bindings::python::__pyfunction_prove
```

Worker threads are all parked in `rayon_core::sleep::Sleep::sleep`. The main thread is blocked on a latch that it expects the worker pool to release, but every worker is asleep, so nothing makes progress.

Process state: 0% CPU, stable memory (~109MB RSS), and the process's cumulative CPU TIME field does not advance.

## What we ruled out

- **Rayon nested-parallelism deadlock.** Ruled out by running with `RAYON_NUM_THREADS=1`: the hang persists with a single worker thread.
- **Input perturbation as fix.** Adding ε ∈ {0.001, 0.01, 0.05, 0.1} to zero values does not help.
- **Cache staleness.** Regenerating circuit/vk/pk from scratch reproduces.
- **numpy version.** Same behavior across numpy 1.x and 2.x.
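
For anyone who wants to repeat the thread-count check, a sweep along these lines should work. This is a minimal sketch, not the exact script we ran: `sweep_threads` is a hypothetical helper, and `cmd` stands for whatever argv launches the prove step (for example the `sys.executable -c ...` call from the reproducer).

```python
import os
import subprocess

def sweep_threads(cmd, counts, timeout=60):
    """Run `cmd` once per thread count; report 'ok', 'fail' or 'hang' for each."""
    results = {}
    for n in counts:
        # Rayon reads RAYON_NUM_THREADS when its global pool is first built,
        # so it has to be set in the child's environment before launch.
        env = dict(os.environ, RAYON_NUM_THREADS=str(n))
        try:
            done = subprocess.run(cmd, env=env, timeout=timeout, capture_output=True)
            results[n] = "ok" if done.returncode == 0 else "fail"
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child once the timeout expires.
            results[n] = "hang"
    return results
```

Calling `sweep_threads(cmd, [1, 2, 4, 8, 16, 32])` against a hanging input would be expected to return `"hang"` for every entry if the behaviour matches what we saw.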

## Workaround

We deployed a subprocess-timeout plus synthetic-fallback-input mechanism in our application's daemon: the prove call runs in a child process, and if it times out we retry with a known-good synthetic input. On a quiet test network, the first prove cycle hit the hang reliably, so the daemon could not function without the workaround.
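
The shape of that mechanism is roughly the following. This is a simplified sketch rather than our production code: `prove_with_fallback` and `demo_cmd` are hypothetical names, and `demo_cmd` is a stand-in child that merely sleeps to imitate a hang. In real use, `make_cmd` would write the witness for the given input and return an argv that calls `ezkl.prove(...)` as in the reproducer.

```python
import subprocess
import sys

def prove_with_fallback(make_cmd, primary, fallback, timeout=60):
    """Try `primary` in a child process; if it hangs or fails, try `fallback`.

    make_cmd(inp) returns the argv of a child process that proves `inp`.
    Returns (label, input) for the attempt that succeeded, or (None, None).
    """
    for label, inp in (("primary", primary), ("fallback", fallback)):
        try:
            # The child is killed when the timeout expires, so a hung prover
            # cannot block the calling process.
            done = subprocess.run(make_cmd(inp), timeout=timeout, capture_output=True)
        except subprocess.TimeoutExpired:
            continue
        if done.returncode == 0:
            return label, inp
    return None, None

def demo_cmd(inp):
    # Stand-in for the real prove step: "hangs" when the first value exceeds 0.9.
    code = f"import time; time.sleep(30 if {inp[0]!r} > 0.9 else 0)"
    return [sys.executable, "-c", code]
```

With the stand-in child, `prove_with_fallback(demo_cmd, [0.95] * 6, [0.5] * 6, timeout=3)` times out on the first input and returns the fallback. Note that a proof produced this way attests to the synthetic input, not the live one, so the caller has to record which input was actually proven.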

## Context

We use ezkl in a system where proofs are produced over live data features whose shape isn't controllable at proof time. The hang triggers on real-world inputs, not just edge cases. Happy to provide additional reproducers or stack traces if useful.

Thanks for ezkl — it's been great to build on.
