fix(offline-diarizer): re-embed zero-vote spans instead of arbitrary cluster-0 tie-break #751

Open
ComicBit wants to merge 2 commits into FluidInference:main from ComicBit:fix/zero-vote-reembed

ComicBit commented Jul 3, 2026

Problem

In OfflineReconstruction.buildSegments, aggregated timeline frames whose per-cluster vote sums are all zero are tie-broken arbitrarily to cluster 0. A frame ends up with zero votes when the active local speaker slot received no embedding in any covering window (assignment −2 everywhere). As a result, whole speaker turns are silently absorbed into the surrounding speaker's segment.
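
A minimal sketch of why an all-zero frame lands on cluster 0, assuming the per-frame pick behaves like a first-wins argmax over the vote sums. `argmaxFirst` is a hypothetical helper for illustration, not the code in buildSegments.

```swift
/// Index of the largest vote sum, keeping the first index on ties.
/// Hypothetical stand-in for the per-frame cluster pick.
func argmaxFirst(_ votes: [Float]) -> Int? {
    guard var best = votes.first else { return nil }
    var bestIndex = 0
    // Strict ">" means later clusters only win with a strictly larger sum.
    for (index, value) in votes.enumerated() where value > best {
        best = value
        bestIndex = index
    }
    return bestIndex
}

let votedFrame: [Float] = [0.0, 2.5, 0.5]     // cluster 1 has the most votes
let zeroVoteFrame: [Float] = [0.0, 0.0, 0.0]  // no cluster has any vote

let votedPick = argmaxFirst(votedFrame)       // cluster 1
let zeroVotePick = argmaxFirst(zeroVoteFrame) // cluster 0, purely by index order
```

With no votes at all, the winner is decided by index order alone, which is the arbitrary tie-break described above.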

Reproduced on a 97 s two-speaker fixture: a clean 1.6 s turn (segmentation gaps delimit it correctly at 26.88–28.46 s, verified from raw powerset activations) had zero cluster votes in all six covering windows and was emitted as part of the other speaker's 13 s segment.

Fix

Optional post-pass (OfflineDiarizerConfig.zeroVoteReembed, disabled by default — no behavior change unless opted in):

  • detect maximal contiguous zero-vote runs (speech-active frames, no cluster votes, bounded by gaps/voted frames) ≥ minDurationSeconds (default 0.4 s)
  • re-embed each run's exact audio span (zero-padded window, weight mask covering only the span's frames so neighboring audio can't leak in)
  • assign to the closest speaker centroid by cosine similarity; a failed/NaN embedding falls back to the existing tie-break
  • the run becomes its own segment when its assigned cluster differs from its neighbors
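
The run-detection step above can be sketched as follows. This is an illustration under assumptions, not the ZeroVoteReembedder source: the type `ZeroVoteRun`, the function `detectZeroVoteRuns`, and its parameter names are hypothetical, and "speech-active" is taken to mean at least one active local speaker in the frame.

```swift
/// Half-open frame range [start, end) of consecutive zero-vote frames (hypothetical type).
struct ZeroVoteRun: Equatable {
    let start: Int
    let end: Int
}

/// Finds maximal contiguous runs of speech-active frames that carry no cluster votes
/// and last at least `minDuration` seconds. All names here are assumptions.
func detectZeroVoteRuns(
    speakerCount: [Int],      // active local speakers per frame
    votes: [[Float]],         // per-frame, per-cluster vote sums
    frameDuration: Double,    // seconds per frame
    minDuration: Double       // shorter runs are dropped
) -> [ZeroVoteRun] {
    precondition(speakerCount.count == votes.count, "per-frame arrays must align")
    var runs: [ZeroVoteRun] = []
    var runStart: Int? = nil

    // Closes the open run (if any) at `end` and keeps it when long enough.
    func closeRun(at end: Int) {
        guard let start = runStart else { return }
        runStart = nil
        if Double(end - start) * frameDuration >= minDuration {
            runs.append(ZeroVoteRun(start: start, end: end))
        }
    }

    for frame in votes.indices {
        let isZeroVote = speakerCount[frame] > 0 && votes[frame].allSatisfy { $0 == 0 }
        if isZeroVote {
            if runStart == nil { runStart = frame }
        } else {
            closeRun(at: frame)   // a gap or a voted frame bounds the run
        }
    }
    closeRun(at: votes.count)     // a run may extend to the end of the timeline
    return runs
}
```

A run is closed by the first non-speech or voted frame, or by the end of the timeline, so each reported range is maximal by construction.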

Results (fixture A/B, community-1 preset, min-segment 0.5)

| config | DER | speaker-error |
|---|---|---|
| baseline | 0.0404 | 0.0200 |
| + zeroVoteReembed | 0.0216 | 0.0012 |

Three real turns recovered (26.9–28.4 s, 72.6–73.7 s, 87.3–87.9 s), all matching ground truth.

Tests

ZeroVoteReembedderTests: 15 cases — run detection (voted/non-speech bounds, min-duration filter, multiple runs, timeline-end), assignment (no-margin win, tie → lowest index, NaN, dimension mismatch), disabled-by-default, and synthetic-frame reconstruction integration (segment split on run boundaries, disabled path never invokes the embedder, nil-embedding parity with disabled).

Pure decision logic lives in ZeroVoteReembedder (no RNG, stable iteration order) so it is testable without CoreML models.

Aggregated timeline frames whose per-cluster vote sums are all zero (the
active local speaker slot got assignment -2 in every covering window) were
tie-broken arbitrarily to cluster 0, silently absorbing whole speaker turns
into the surrounding speaker's segment (e.g. the 26.876-28.455s turn on
test_large.wav).

Reconstruction now detects maximal contiguous zero-vote runs (speech-active,
zero votes across all clusters, >= minDurationSeconds), re-embeds each run's
exact audio span via embedSpan, and assigns its frames to the closest speaker
centroid regardless of margin -- zero votes means there is no incumbent to
defend. The run becomes its own segment on the frame-run boundaries. Failed
or NaN embeddings keep the tie-break behavior.
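
The assignment step can be sketched like this, assuming plain cosine similarity with no margin requirement. `cosineSimilarity` and `closestCentroid` are hypothetical names. Returning nil stands for "keep the existing tie-break"; skipping an unusable centroid is an assumption of this sketch, not something the commit message states.

```swift
/// Cosine similarity, or nil when the inputs are unusable
/// (dimension mismatch, empty, zero norm, or NaN/inf anywhere).
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count, !a.isEmpty else { return nil }
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = (normA * normB).squareRoot()
    // NaN fails this comparison, so NaN embeddings fall through to nil.
    guard denominator > 0 else { return nil }
    let similarity = dot / denominator
    return similarity.isFinite ? similarity : nil
}

/// Index of the closest centroid regardless of margin; ties go to the lowest index.
/// Returns nil when no similarity can be computed, so the caller keeps its tie-break.
func closestCentroid(to embedding: [Float], centroids: [[Float]]) -> Int? {
    var best: (index: Int, similarity: Float)? = nil
    for (index, centroid) in centroids.enumerated() {
        guard let similarity = cosineSimilarity(embedding, centroid) else { continue }
        if let current = best, similarity <= current.similarity { continue }
        best = (index: index, similarity: similarity)
    }
    return best?.index
}
```

Because the comparison is strict, an exact tie keeps the earlier centroid, which makes the result independent of anything but the input order.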

New OfflineDiarizerConfig.ZeroVoteReembed sub-config (enabled: false by
default for upstream parity, minDurationSeconds: 0.4). Pure decision logic
in ZeroVoteReembedder (run detection + assignment) is model-free and unit
tested; extraction is injected into buildSegments as a spanEmbedder closure.
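
A rough sketch of the opt-in config and the injected closure, to show why the disabled path never touches the embedder. Everything here is hypothetical: `ZeroVoteReembedSketch`, `SpanEmbedder`, `embedRuns`, and the validation rule (finite and positive) are assumptions, not the FluidAudio types or signatures.

```swift
enum ConfigSketchError: Error {
    case invalidMinDuration(Double)
}

/// Hypothetical mirror of the sub-config: off by default, 0.4 s minimum run length.
struct ZeroVoteReembedSketch {
    var enabled = false
    var minDurationSeconds = 0.4

    /// Assumed rule: the threshold must be finite and positive.
    func validate() throws {
        guard minDurationSeconds.isFinite, minDurationSeconds > 0 else {
            throw ConfigSketchError.invalidMinDuration(minDurationSeconds)
        }
    }
}

/// Hypothetical closure type: embeds one time span (in seconds), or returns nil on failure.
typealias SpanEmbedder = (ClosedRange<Double>) -> [Float]?

/// Returns one optional embedding per run. When the feature is disabled or no
/// embedder was injected, every entry is nil and the embedder is never called.
func embedRuns(
    _ runs: [ClosedRange<Double>],
    config: ZeroVoteReembedSketch,
    spanEmbedder: SpanEmbedder?
) -> [[Float]?] {
    guard config.enabled, let embed = spanEmbedder else {
        return runs.map { _ in nil }
    }
    return runs.map(embed)
}
```

Passing the embedder as an optional closure keeps the decision logic free of model dependencies, so tests can inject a stub instead of loading CoreML models.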

test_large.wav A/B (min-segment=0.5): DER 0.040 -> 0.022, speaker error
0.020 -> 0.001; the 26.9-28.4s turn now emits as its own speaker segment.
Copilot AI review requested due to automatic review settings July 3, 2026 18:57

Copilot AI left a comment

Pull request overview

This PR adds an optional (disabled-by-default) post-pass to the offline diarization reconstruction pipeline to handle speech-active frames with zero cluster votes by re-embedding the exact audio span and assigning it to the closest centroid, avoiding the prior arbitrary “cluster 0” tie-break behavior that could absorb short speaker turns into neighboring segments.

Changes:

  • Introduces OfflineDiarizerConfig.zeroVoteReembed to gate and configure the post-pass (min-duration threshold).
  • Adds ZeroVoteReembedder pure logic (run detection + centroid assignment) and wires it into OfflineReconstruction.buildSegments via an injected spanEmbedder closure.
  • Implements OfflineEmbeddingExtractor.embedSpan to embed an exact audio span with masking, and adds comprehensive unit tests covering detection, assignment, config defaults, and reconstruction integration.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| Tests/FluidAudioTests/Diarizer/Offline/ZeroVoteReembedderTests.swift | Adds unit/integration tests for zero-vote run detection, assignment determinism, config validation, and reconstruction behavior with an injected embedder. |
| Sources/FluidAudio/Diarizer/Offline/Utils/ZeroVoteReembedder.swift | Adds pure run-detection and centroid-assignment logic for the zero-vote re-embed pass. |
| Sources/FluidAudio/Diarizer/Offline/Utils/OfflineReconstruction.swift | Adds optional spanEmbedder parameter and applies the zero-vote re-embed pass before segment accumulation. |
| Sources/FluidAudio/Diarizer/Offline/Extraction/OfflineEmbeddingExtractor.swift | Adds embedSpan to compute an embedding over an exact span using a zero-padded window and span-only weight mask. |
| Sources/FluidAudio/Diarizer/Offline/Core/OfflineDiarizerTypes.swift | Adds ZeroVoteReembed config surface + validation for minDurationSeconds. |
| Sources/FluidAudio/Diarizer/Offline/Core/OfflineDiarizerManager.swift | Provides spanEmbedder closure (using models + audioSource) to reconstruction when the feature is enabled. |

Comment on lines +55 to +58

    for frame in 0..<frameCount {
        let isZeroVote =
            speakerCountPerFrame[frame] > 0
            && activationSums[frame].allSatisfy { $0 == 0 }

Comment on lines +258 to +265

    try buffer.withUnsafeMutableBufferPointer { pointer in
        guard let baseAddress = pointer.baseAddress else { return }
        try audioSource.copySamples(
            into: baseAddress,
            offset: startSample,
            count: spanLength
        )
    }

Comment on lines +231 to +234

    /// Used by the short-segment relabel post-pass: the span's samples are placed at the
    /// start of a zero-padded model window and an all-active weight mask covering only the
    /// span's frames is applied, so the embedding reflects the span's speaker exclusively
    /// (neighboring audio never leaks in through the mask).
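
The windowing described in that doc comment can be sketched as below. This is only an illustration: `makeSpanWindow` and its parameters are hypothetical, and the uniform samples-to-frames mapping (rounded up) is an assumption that the real model's frame grid may not follow.

```swift
/// Places the span at the start of a zero-padded window and builds a frame mask
/// that is 1 over the span's frames and 0 elsewhere (hypothetical helper).
func makeSpanWindow(
    spanSamples: [Float],
    windowSamples: Int,
    framesPerWindow: Int
) -> (window: [Float], mask: [Float]) {
    precondition(windowSamples > 0 && framesPerWindow > 0, "window sizes must be positive")

    // Truncate spans longer than one model window.
    let usable = min(spanSamples.count, windowSamples)

    // Span first, zeros after: audio outside the span is never copied in.
    var window = [Float](repeating: 0, count: windowSamples)
    window.replaceSubrange(0..<usable, with: spanSamples.prefix(usable))

    // Assumed uniform frame grid; partial frames count as active.
    let samplesPerFrame = Double(windowSamples) / Double(framesPerWindow)
    let covered = Int((Double(usable) / samplesPerFrame).rounded(.up))
    let activeFrames = min(framesPerWindow, covered)

    var mask = [Float](repeating: 0, count: framesPerWindow)
    for frame in 0..<activeFrames {
        mask[frame] = 1
    }
    return (window: window, mask: mask)
}
```

Frames past the span get zero weight, so the zero padding does not contribute to the pooled embedding.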
Comment on lines +235 to +239

    let merged = mergeSegments(rawSegments, gapThreshold: gapThreshold)
    return sanitize(segments: merged)
    let output = sanitize(segments: merged)


    return output
… on buffer failure, doc + style

- detectRuns requires exactly one active speaker per frame: an overlap
  frame with zero votes must keep the existing tie-break rather than be
  collapsed to a single re-embedded speaker (+ test)
- embedSpan throws instead of silently embedding a zero buffer when the
  span buffer's baseAddress is unavailable
- embedSpan doc no longer references only the fork's relabel pass
- drop leftover temporary in buildSegments return
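
The buffer-failure change in the second bullet can be sketched as below, assuming a generic copy callback in place of the real audio source. `SpanBufferError` and `fillSpanBuffer` are hypothetical names, not the embedSpan implementation.

```swift
enum SpanBufferError: Error {
    case baseAddressUnavailable
}

/// Fills a fresh buffer through a raw pointer and throws, rather than silently
/// returning zeros, when the pointer is unavailable (hypothetical helper).
func fillSpanBuffer(
    count: Int,
    copy: (UnsafeMutablePointer<Float>, Int) throws -> Void
) throws -> [Float] {
    var buffer = [Float](repeating: 0, count: count)
    try buffer.withUnsafeMutableBufferPointer { pointer in
        guard let baseAddress = pointer.baseAddress else {
            // A bare `return` here would leave an all-zero buffer to be embedded.
            throw SpanBufferError.baseAddressUnavailable
        }
        try copy(baseAddress, pointer.count)
    }
    return buffer
}

// Usage: the callback stands in for the audio source's copy routine.
let ramp = try fillSpanBuffer(count: 3) { destination, count in
    for i in 0..<count {
        destination[i] = Float(i)
    }
}
```

Throwing lets the caller treat the span like any other failed embedding and keep the existing tie-break.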
ComicBit added a commit to ComicBit/FluidAudio that referenced this pull request Jul 3, 2026