Summary (Let me know if you need more info!)
ezkl.prove() hangs indefinitely on certain inputs. The same install, same compiled circuit, and same proving key handle other inputs in ~1 second. The hang appears to be in halo2's mv_lookup prover. We've reproduced it with a minimal self-contained script (~70 lines, tiny MLP, no external assets).
Versions
- ezkl: 23.0.5 (from pip)
- Python: 3.14.4
- OS: Ubuntu 25.10
- Hardware: AMD Ryzen 9 9950X3D, 60GB RAM (CPU prove path, no icicle)
Minimal reproducer
import json, os, subprocess, sys, tempfile
import torch, torch.nn as nn, ezkl

WORK = tempfile.mkdtemp(prefix="ezkl-repro-")
os.chdir(WORK)

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.act, self.fc2 = nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc2(self.act(self.fc1(x))))

torch.manual_seed(42)
model = TinyMLP().eval()
torch.onnx.export(model, torch.zeros(1, 6), "model.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=17, dynamo=False)

json.dump({"input_shapes":[[1,6]], "input_data":[[0.5]*6], "output_data":[[0.5]]},
          open("cal.json","w"))
ezkl.gen_settings("model.onnx", "settings.json")
ezkl.calibrate_settings("cal.json", "model.onnx", "settings.json", "resources")
ezkl.compile_circuit("model.onnx", "circuit.ezkl", "settings.json")
logrows = json.load(open("settings.json"))["run_args"]["logrows"]
ezkl.gen_srs("kzg.srs", logrows)
ezkl.setup("circuit.ezkl", "vk.key", "pk.key", srs_path="kzg.srs")

def try_prove(label, inp, timeout=60):
    json.dump({"input_shapes":[[1,6]], "input_data":[inp]}, open("input.json","w"))
    ezkl.gen_witness("input.json", "circuit.ezkl", "witness.json",
                     vk_path="vk.key", srs_path="kzg.srs")
    proof = f"proof-{label}.json"
    if os.path.exists(proof): os.remove(proof)
    # Prove in a child process so a hang can be killed by the timeout
    code = (f"import ezkl; ezkl.prove(witness='witness.json', model='circuit.ezkl', "
            f"pk_path='pk.key', proof_path='{proof}', srs_path='kzg.srs')")
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout, capture_output=True)
        print(f"{label}: {'OK' if os.path.exists(proof) else 'NO PROOF'}")
    except subprocess.TimeoutExpired:
        print(f"{label}: HANG (>{timeout}s)")

# This input hangs deterministically
try_prove("hangs", [0.95, 0.82, 0.71, 0.88, 0.05, 0.91])
# This input (every value halved) succeeds in ~1s on the same circuit
try_prove("succeeds", [0.475, 0.41, 0.355, 0.44, 0.025, 0.455])
# Sanity check: this also succeeds
try_prove("baseline", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
Expected output:
hangs: HANG (>60s)
succeeds: OK
baseline: OK
Empirical observations
We tested ~20 inputs against this tiny model and against a larger LSTM-based model in a separate project. Patterns observed:
- Specific to the circuit-input pair, not to the input alone. The exact input [0.95, ..., 0.91] hangs on the TinyMLP above but succeeds on a different model (an LSTM-based anomaly detector). Conversely, inputs that hang on the LSTM (e.g. [0.0, 0.042, 0.555, 0.004, 0.0, 0.355]) succeed on the TinyMLP.
- Robust to small input perturbations. Changing one value by 0.01, or swapping value positions, does not escape the hang.
- Sensitive to magnitude scaling. Halving all input values escapes the hang: the values keep the same proportions, but their absolute magnitudes after quantization are smaller, and the prove succeeds.
- Not a thread-count issue. Tested RAYON_NUM_THREADS=1, 2, 4, 8, 16, 32 against a hung input. All hang identically.
- Not an environment issue. Fresh cache rebuild reproduces. numpy 1.x vs 2.x doesn't change the behavior.
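To make the magnitude-scaling observation easier to compare, here is a rough sketch of the fixed-point values the two inputs would map to. It assumes quantization is round(x * 2**scale), and the scale of 7 bits is a placeholder we picked for illustration; the real value is whatever calibration wrote to settings.json. The helper name quantize is ours, not an ezkl API.

```python
def quantize(values, scale_bits):
    """Fixed-point encode each value as round(x * 2**scale_bits).

    Assumption: this mirrors the general shape of ezkl's input
    quantization; it is not a call into ezkl itself.
    """
    return [round(v * (1 << scale_bits)) for v in values]

SCALE_BITS = 7  # placeholder; read the real scale from settings.json

hanging = [0.95, 0.82, 0.71, 0.88, 0.05, 0.91]
halved = [0.475, 0.41, 0.355, 0.44, 0.025, 0.455]

if __name__ == "__main__":
    print("hanging:", quantize(hanging, SCALE_BITS))
    print("halved: ", quantize(halved, SCALE_BITS))
```

Comparing the two lists against the lookup ranges in settings.json may help narrow down which lookup table entries the hanging input touches that the halved one does not.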
Stack trace at the hang point
GDB on the hung process shows 33 threads (1 main + 32 rayon workers) in futex_do_wait. Main thread's stack:
#0 syscall
#1 rayon_core::latch::LockLatch::wait_and_reset
#2 rayon_core::registry::Registry::in_worker_cold
.. rayon::iter::plumbing::bridge_producer_consumer::helper<...
halo2_proofs::plonk::mv_lookup::Argument<...Fr> ...
halo2_proofs::plonk::prover::create_proof<KZG, ShPlonk, Bn256, GraphCircuit>
::{closure#6}::{closure#0}
..>>
#6 halo2_proofs::plonk::prover::create_proof<...>
#7 ezkl::bindings::python::__pyfunction_prove
Worker threads are all parked on rayon_core::sleep::Sleep::sleep. The main thread is waiting on a latch held against the worker pool; the worker pool is asleep.
Process state: 0% CPU, stable memory (~109MB RSS), TIME field doesn't advance.
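For anyone reproducing this, a small sketch of how the "TIME doesn't advance" check can be automated on Linux. It samples utime + stime from /proc/<pid>/stat twice and reports whether the process consumed any CPU in between. The function names are ours and the 5-second interval is arbitrary.

```python
import os
import time

def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # comm (field 2) may contain spaces or ')', so split after the last ')'.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])

def cpu_seconds(pid):
    """Total CPU time consumed so far by pid, in seconds (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        ticks = cpu_ticks_from_stat(f.read())
    return ticks / os.sysconf("SC_CLK_TCK")

def looks_parked(pid, interval_s=5.0):
    """True if pid used no measurable CPU time over interval_s seconds."""
    before = cpu_seconds(pid)
    time.sleep(interval_s)
    return cpu_seconds(pid) == before
```

A process that is parked on a futex, as in the trace above, should return True from looks_parked; a prover that is merely slow should return False.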
What we ruled out
- Rayon nested-parallelism deadlock. Ruled out by running with RAYON_NUM_THREADS=1 (still hangs).
- Input perturbation as fix. Adding ε ∈ {0.001, 0.01, 0.05, 0.1} to zero values does not help.
- Cache staleness. Regenerating circuit/vk/pk from scratch reproduces.
- numpy version. Same behavior across numpy 1.x and 2.x.
Workaround
We deployed a subprocess-timeout plus synthetic-fallback-input mechanism in our application's daemon. On a quiet test network, the first prove cycle hit the hang reliably, so the daemon could not function without the workaround.
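A minimal sketch of the shape of that workaround, not our production code. It assumes the caller supplies make_cmd, a hypothetical function that turns an input into the command line for a child process that runs the prove step (as try_prove does in the reproducer). The names run_with_timeout and prove_with_fallback are ours.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # prover does not linger.
        return False
    return result.returncode == 0

def prove_with_fallback(make_cmd, primary, fallback, timeout_s=60):
    """Try the primary input, then the fallback input if that hangs or fails.

    make_cmd is a caller-supplied (hypothetical) function mapping an input
    to the argv list of a child process that performs the prove step.
    Returns "primary", "fallback", or None if both attempts failed.
    """
    for label, inp in (("primary", primary), ("fallback", fallback)):
        if run_with_timeout(make_cmd(inp), timeout_s):
            return label
    return None
```

Running the prove step in a child process is what makes the timeout enforceable: the hang sits inside native code, so it cannot be interrupted from within the same Python process.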
Context
We use ezkl in a system where proofs are produced over live data features whose values we can't control at proof time. The hang triggers on real-world inputs, not just edge cases. Happy to provide additional reproducers or stack traces if useful.
Thanks for ezkl — it's been great to build on.