ezkl.prove() deadlocks deterministically on certain inputs (mv_lookup hang) #1029

@evermaat

Description

Summary (Let me know if you need more info!)

ezkl.prove() hangs indefinitely on certain inputs. The same install, same compiled circuit, and same proving key handle other inputs in ~1 second. The hang appears to be in halo2's mv_lookup prover. We've reproduced it with a minimal self-contained script (~50 lines, tiny MLP, no external assets).

Versions

  • ezkl: 23.0.5 (from pip)
  • Python: 3.14.4
  • OS: Ubuntu 25.10
  • Hardware: AMD Ryzen 9 9950X3D, 60GB RAM (CPU prove path, no icicle)

Minimal reproducer

import json, os, subprocess, sys, tempfile
import torch, torch.nn as nn, ezkl

WORK = tempfile.mkdtemp(prefix="ezkl-repro-")
os.chdir(WORK)

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.act, self.fc2 = nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1)
    def forward(self, x):
        return torch.sigmoid(self.fc2(self.act(self.fc1(x))))

torch.manual_seed(42)
model = TinyMLP().eval()
torch.onnx.export(model, torch.zeros(1, 6), "model.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=17, dynamo=False)

# Calibration data: a single mid-range sample
with open("cal.json", "w") as f:
    json.dump({"input_shapes":[[1,6]], "input_data":[[0.5]*6], "output_data":[[0.5]]}, f)

# Standard pipeline: settings -> calibration -> compile -> SRS -> keys
ezkl.gen_settings("model.onnx", "settings.json")
ezkl.calibrate_settings("cal.json", "model.onnx", "settings.json", "resources")
ezkl.compile_circuit("model.onnx", "circuit.ezkl", "settings.json")
with open("settings.json") as f:
    logrows = json.load(f)["run_args"]["logrows"]
ezkl.gen_srs("kzg.srs", logrows)
ezkl.setup("circuit.ezkl", "vk.key", "pk.key", srs_path="kzg.srs")

def try_prove(label, inp, timeout=60):
    json.dump({"input_shapes":[[1,6]], "input_data":[inp]}, open("input.json","w"))
    ezkl.gen_witness("input.json", "circuit.ezkl", "witness.json",
                     vk_path="vk.key", srs_path="kzg.srs")
    proof = f"proof-{label}.json"
    if os.path.exists(proof): os.remove(proof)
    code = (f"import ezkl; ezkl.prove(witness='witness.json', model='circuit.ezkl', "
            f"pk_path='pk.key', proof_path='{proof}', srs_path='kzg.srs')")
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout, capture_output=True)
        print(f"{label}: {'OK' if os.path.exists(proof) else 'NO PROOF'}")
    except subprocess.TimeoutExpired:
        print(f"{label}: HANG (>{timeout}s)")

# This input hangs deterministically
try_prove("hangs",     [0.95, 0.82, 0.71, 0.88, 0.05, 0.91])
# This input (every value halved) succeeds in ~1s on the same circuit
try_prove("succeeds",  [0.475, 0.41, 0.355, 0.44, 0.025, 0.455])
# Sanity check: this also succeeds
try_prove("baseline",  [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

Expected output:

hangs: HANG (>60s)
succeeds: OK
baseline: OK

Empirical observations

We tested ~20 inputs against this tiny model and against a larger LSTM-based model in a separate project. Patterns observed:

  • Depends on the circuit/input pair, not on the input alone. The exact input [0.95, ..., 0.91] hangs on the TinyMLP above but succeeds on a different model (an LSTM-based anomaly detector). Conversely, inputs that hang on the LSTM (e.g. [0.0, 0.042, 0.555, 0.004, 0.0, 0.355]) succeed on the TinyMLP.
  • Robust to small input perturbations. Changing one value by 0.01, or swapping value positions, does not escape the hang.
  • Sensitive to magnitude scaling. Halving all input values escapes the hang: the values keep the same proportions to one another but have smaller absolute magnitudes after quantization, and the proof succeeds.
  • Not a thread-count issue. Tested RAYON_NUM_THREADS=1, 2, 4, 8, 16, 32 against a hung input. All hang identically.
  • Not an environment issue. Fresh cache rebuild reproduces. numpy 1.x vs 2.x doesn't change the behavior.
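
To make the magnitude-scaling observation concrete, here is a minimal sketch of what fixed-point quantization does to the hanging input and to its halved counterpart. The scale of 2^7 and the round-to-nearest rule are assumptions for illustration only; the scale actually used is whatever calibration wrote to settings.json, and ezkl's rounding may differ.

```python
def quantize(values, scale_bits):
    """Fixed-point quantize each value as round(x * 2**scale_bits).

    Round-to-nearest is an assumption made for this sketch.
    """
    return [round(v * (1 << scale_bits)) for v in values]

SCALE_BITS = 7  # assumed for illustration, not read from settings.json

hanging = [0.95, 0.82, 0.71, 0.88, 0.05, 0.91]
halved = [v / 2 for v in hanging]

q_hanging = quantize(hanging, SCALE_BITS)
q_halved = quantize(halved, SCALE_BITS)

print("hanging:", q_hanging)
print("halved: ", q_halved)
```

Under these assumptions the halved input maps to roughly half the integer range, which is one way a different set of lookup-table entries could be touched during proving.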

Stack trace at the hang point

GDB on the hung process shows 33 threads (1 main + 32 rayon workers) in futex_do_wait. Main thread's stack:

#0 syscall
#1 rayon_core::latch::LockLatch::wait_and_reset
#2 rayon_core::registry::Registry::in_worker_cold
   .. rayon::iter::plumbing::bridge_producer_consumer::helper<...
      halo2_proofs::plonk::mv_lookup::Argument<...Fr> ...
      halo2_proofs::plonk::prover::create_proof<KZG, ShPlonk, Bn256, GraphCircuit>
        ::{closure#6}::{closure#0}
   ..>>
#6 halo2_proofs::plonk::prover::create_proof<...>
#7 ezkl::bindings::python::__pyfunction_prove

Worker threads are all parked on rayon_core::sleep::Sleep::sleep. The main thread is waiting on a latch held against the worker pool; the worker pool is asleep.

Process state: 0% CPU, stable memory (~109MB RSS), TIME field doesn't advance.
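
For anyone checking whether a stuck prove process is parked rather than spinning, the following Linux-only sketch reads the accumulated CPU time from /proc/<pid>/stat twice and compares the two readings. The helper names are hypothetical, and this is only one way to observe the "TIME doesn't advance" symptom described above.

```python
import os
import time

def cpu_ticks(pid):
    """Return utime + stime (in clock ticks) for pid from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])

def looks_idle(pid, interval_s=1.0):
    """True if the process used no measurable CPU time during interval_s."""
    before = cpu_ticks(pid)
    time.sleep(interval_s)
    return cpu_ticks(pid) == before

# Demonstration on the current process; pass the hung PID in practice.
print("ticks so far:", cpu_ticks(os.getpid()))
```

A process for which looks_idle stays True over several intervals while it never exits matches the blocked-on-futex picture more closely than a busy loop would.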

What we ruled out

  • Rayon nested-parallelism deadlock. Ruled out by running with RAYON_NUM_THREADS=1, which still hangs.
  • Input perturbation as fix. Adding ε ∈ {0.001, 0.01, 0.05, 0.1} to zero values does not help.
  • Cache staleness. Regenerating circuit/vk/pk from scratch reproduces.
  • numpy version. Same behavior across numpy 1.x and 2.x.
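
The thread-count check can be scripted along the following lines. This is a sketch, not the script we actually ran: DEMO_CMD is a stand-in child that only verifies the variable is set, and would be replaced by the ezkl.prove subprocess command from the reproducer. The variable is passed through the child's environment so that it is in place before Rayon's pool is created.

```python
import os
import subprocess
import sys

def run_with_threads(cmd, n_threads, timeout_s=60):
    """Run cmd with RAYON_NUM_THREADS set; return 'OK', 'FAIL', or 'HANG'."""
    env = dict(os.environ, RAYON_NUM_THREADS=str(n_threads))
    try:
        result = subprocess.run(cmd, env=env, timeout=timeout_s,
                                capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "HANG"
    return "OK" if result.returncode == 0 else "FAIL"

# Stand-in child (hypothetical): exits 0 if the variable reached it.
DEMO_CMD = [sys.executable, "-c",
            "import os, sys; "
            "sys.exit(0 if os.environ.get('RAYON_NUM_THREADS', '').isdigit() else 1)"]

results = {n: run_with_threads(DEMO_CMD, n, timeout_s=30)
           for n in (1, 2, 4, 8, 16, 32)}
print(results)
```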

Workaround

We deployed a subprocess-timeout plus synthetic-fallback-input mechanism in our application's daemon. On a quiet test network, the first prove cycle hit the hang reliably, so the daemon could not function without the workaround.
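
A minimal sketch of that kind of workaround, under stated assumptions: the function names are hypothetical, make_cmd is whatever builds the child command for a given input, and demo_cmd is a stand-in that sleeps for the "bad" input instead of calling ezkl.prove. It is not the daemon's actual code.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired.
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def prove_with_fallback(make_cmd, primary, fallback, timeout_s=60):
    """Try the primary input, then the fallback.

    Returns 'primary' or 'fallback' for whichever succeeded, or None.
    """
    for label, inp in (("primary", primary), ("fallback", fallback)):
        if run_with_timeout(make_cmd(inp), timeout_s):
            return label
    return None

def demo_cmd(inp):
    """Stand-in for the prove child: hangs on 'bad', succeeds otherwise."""
    code = "import time; time.sleep(30)" if inp == "bad" else "pass"
    return [sys.executable, "-c", code]

outcome = prove_with_fallback(demo_cmd, "bad", "good", timeout_s=1)
print(outcome)
```

Running the prover in a child process is what makes the timeout enforceable: a thread blocked inside the native prove call cannot be interrupted from Python, whereas a child process can be killed.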

Context

We use ezkl in a system where proofs are produced over live data features whose shape isn't controllable at proof time. The hang triggers on real-world inputs, not just edge cases. Happy to provide additional reproducers or stack traces if useful.

Thanks for ezkl — it's been great to build on.
