fix(offline-diarizer): re-embed zero-vote spans instead of arbitrary cluster-0 tie-break #751
Open
ComicBit wants to merge 2 commits into
Aggregated timeline frames whose per-cluster vote sums are all zero (the active local speaker slot received assignment -2 in every covering window) were arbitrarily tie-broken to cluster 0, silently absorbing whole speaker turns into the surrounding speaker's segment (e.g. the 26.876–28.455 s turn on test_large.wav).

Reconstruction now detects maximal contiguous zero-vote runs (speech-active, zero votes across all clusters, lasting at least `minDurationSeconds`), re-embeds each run's exact audio span via `embedSpan`, and assigns the run's frames to the closest speaker centroid regardless of margin: zero votes means there is no incumbent to defend. Each run becomes its own segment, split at the run's frame boundaries. Runs whose embedding fails or contains NaN keep the tie-break behavior.

New `OfflineDiarizerConfig.ZeroVoteReembed` sub-config (`enabled: false` by default for upstream parity, `minDurationSeconds: 0.4`). The pure decision logic in `ZeroVoteReembedder` (run detection + assignment) is model-free and unit tested; extraction is injected into `buildSegments` as a `spanEmbedder` closure.

test_large.wav A/B (min-segment=0.5): DER 0.040 → 0.022, speaker error 0.020 → 0.001; the 26.9–28.4 s turn is now emitted as its own speaker segment.
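A minimal sketch of the run-detection step, under stated assumptions: votes arrive as a per-frame array of per-cluster sums, `activeSpeakers` counts speech-active local speakers per frame, and every frame has the same duration. `detectZeroVoteRuns` and its parameters are hypothetical, not the `ZeroVoteReembedder` API. The sketch already requires exactly one active speaker per frame, the stricter rule adopted after review.

```swift
/// Hypothetical sketch: find maximal contiguous zero-vote runs as half-open
/// frame ranges, keeping only runs at least `minDurationSeconds` long.
func detectZeroVoteRuns(
    votes: [[Float]],
    activeSpeakers: [Int],
    frameDuration: Double,
    minDurationSeconds: Double
) -> [Range<Int>] {
    precondition(votes.count == activeSpeakers.count, "one speaker count per frame")
    var runs: [Range<Int>] = []
    var runStart: Int? = nil

    // Close the open run (if any) at `end`, keeping it only if it is long enough.
    func closeRun(at end: Int) {
        guard let start = runStart else { return }
        if Double(end - start) * frameDuration >= minDurationSeconds {
            runs.append(start..<end)
        }
        runStart = nil
    }

    for frame in votes.indices {
        // Exactly one active speaker: overlap frames keep the existing tie-break.
        let isZeroVote = activeSpeakers[frame] == 1
            && votes[frame].allSatisfy { $0 == 0 }
        if isZeroVote {
            if runStart == nil { runStart = frame }
        } else {
            closeRun(at: frame)
        }
    }
    closeRun(at: votes.count)  // a run may extend to the end of the timeline
    return runs
}
```

Half-open ranges make the run boundaries directly usable as segment split points.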
Pull request overview
This PR adds an optional (disabled-by-default) post-pass to the offline diarization reconstruction pipeline to handle speech-active frames with zero cluster votes by re-embedding the exact audio span and assigning it to the closest centroid, avoiding the prior arbitrary “cluster 0” tie-break behavior that could absorb short speaker turns into neighboring segments.
Changes:
- Introduces `OfflineDiarizerConfig.zeroVoteReembed` to gate and configure the post-pass (min-duration threshold).
- Adds `ZeroVoteReembedder` pure logic (run detection + centroid assignment) and wires it into `OfflineReconstruction.buildSegments` via an injected `spanEmbedder` closure.
- Implements `OfflineEmbeddingExtractor.embedSpan` to embed an exact audio span with masking, and adds comprehensive unit tests covering detection, assignment, config defaults, and reconstruction integration.
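The centroid-assignment half of the pure logic might look like the sketch below. Cosine similarity is an assumption (the PR only says "closest speaker centroid"), and `closestCentroid(to:centroids:)` is a hypothetical name. It follows the behaviors the tests list: no margin check, ties to the lowest index, and `nil` for NaN or dimension mismatch so the caller keeps the tie-break.

```swift
/// Cosine similarity; NaN when either vector has zero norm.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = (normA * normB).squareRoot()
    return denominator > 0 ? dot / denominator : .nan
}

/// Hypothetical sketch: index of the most similar centroid, with no margin test
/// (zero votes means there is no incumbent). Returns nil on NaN input or a
/// dimension mismatch so the caller can fall back to the existing tie-break.
func closestCentroid(to embedding: [Float], centroids: [[Float]]) -> Int? {
    guard !embedding.isEmpty, !embedding.contains(where: { $0.isNaN }) else { return nil }
    var bestIndex: Int? = nil
    var bestScore = -Float.infinity
    for (index, centroid) in centroids.enumerated() {
        guard centroid.count == embedding.count else { return nil }
        let score = cosineSimilarity(embedding, centroid)
        if score.isNaN { continue }  // degenerate centroid: skip it
        // Strictly greater: on a tie the lowest index is kept.
        if score > bestScore {
            bestScore = score
            bestIndex = index
        }
    }
    return bestIndex
}
```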
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Tests/FluidAudioTests/Diarizer/Offline/ZeroVoteReembedderTests.swift | Adds unit/integration tests for zero-vote run detection, assignment determinism, config validation, and reconstruction behavior with an injected embedder. |
| Sources/FluidAudio/Diarizer/Offline/Utils/ZeroVoteReembedder.swift | Adds pure run-detection and centroid-assignment logic for the zero-vote re-embed pass. |
| Sources/FluidAudio/Diarizer/Offline/Utils/OfflineReconstruction.swift | Adds optional spanEmbedder parameter and applies the zero-vote re-embed pass before segment accumulation. |
| Sources/FluidAudio/Diarizer/Offline/Extraction/OfflineEmbeddingExtractor.swift | Adds embedSpan to compute an embedding over an exact span using a zero-padded window and span-only weight mask. |
| Sources/FluidAudio/Diarizer/Offline/Core/OfflineDiarizerTypes.swift | Adds ZeroVoteReembed config surface + validation for minDurationSeconds. |
| Sources/FluidAudio/Diarizer/Offline/Core/OfflineDiarizerManager.swift | Provides spanEmbedder closure (using models + audioSource) to reconstruction when the feature is enabled. |
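The config surface can be sketched as a small value type. `ZeroVoteReembedSketch` is a hypothetical stand-in for `OfflineDiarizerConfig.ZeroVoteReembed`: the defaults match the PR description, but the validation rule (finite and strictly positive) is an assumption, since the PR only says `minDurationSeconds` is validated.

```swift
enum ZeroVoteReembedConfigError: Error, Equatable {
    case invalidMinDuration(Double)
}

/// Hypothetical stand-in for `OfflineDiarizerConfig.ZeroVoteReembed`.
struct ZeroVoteReembedSketch: Equatable {
    /// Off by default, so behavior matches upstream unless opted in.
    var enabled: Bool = false
    /// Shortest zero-vote run (in seconds) that gets re-embedded.
    var minDurationSeconds: Double = 0.4

    /// Assumed rule: the threshold must be finite and strictly positive.
    func validate() throws {
        guard minDurationSeconds.isFinite, minDurationSeconds > 0 else {
            throw ZeroVoteReembedConfigError.invalidMinDuration(minDurationSeconds)
        }
    }
}
```

Opting in would then be `var config = ZeroVoteReembedSketch(); config.enabled = true`, leaving the threshold at its default.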
Comment on lines +55 to +58

```swift
for frame in 0..<frameCount {
    let isZeroVote =
        speakerCountPerFrame[frame] > 0
        && activationSums[frame].allSatisfy { $0 == 0 }
    // … (rest of the loop body is outside the commented lines)
}
```
Comment on lines +258 to +265

```swift
try buffer.withUnsafeMutableBufferPointer { pointer in
    // Returning here leaves `buffer` zero-filled, so a silent zero window gets embedded.
    guard let baseAddress = pointer.baseAddress else { return }
    try audioSource.copySamples(
        into: baseAddress,
        offset: startSample,
        count: spanLength
    )
}
```
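A sketch of the span copy with the behavior the follow-up commit describes: when the buffer's `baseAddress` is unavailable it throws rather than letting a zero buffer reach the embedder. `SampleSource`, `ArraySampleSource`, `loadSpan`, and `SpanEmbeddingError` are hypothetical stand-ins for the extractor's audio source and error type; only the `copySamples(into:offset:count:)` call shape comes from the excerpt.

```swift
enum SpanEmbeddingError: Error {
    case bufferUnavailable
    case spanTooLong
    case outOfRange
}

/// Hypothetical audio source mirroring the excerpt's call shape.
protocol SampleSource {
    func copySamples(into destination: UnsafeMutablePointer<Float>, offset: Int, count: Int) throws
}

/// In-memory source, handy for tests.
struct ArraySampleSource: SampleSource {
    let samples: [Float]
    func copySamples(into destination: UnsafeMutablePointer<Float>, offset: Int, count: Int) throws {
        guard offset >= 0, count >= 0, offset + count <= samples.count else {
            throw SpanEmbeddingError.outOfRange
        }
        for i in 0..<count { destination[i] = samples[offset + i] }
    }
}

/// Copies the span to the start of a zero-padded window; throws instead of
/// silently returning a zero window when no base address is available.
func loadSpan<Source: SampleSource>(
    from audioSource: Source, startSample: Int, spanLength: Int, windowLength: Int
) throws -> [Float] {
    guard spanLength <= windowLength else { throw SpanEmbeddingError.spanTooLong }
    var buffer = [Float](repeating: 0, count: windowLength)
    try buffer.withUnsafeMutableBufferPointer { pointer in
        guard let baseAddress = pointer.baseAddress else {
            throw SpanEmbeddingError.bufferUnavailable
        }
        try audioSource.copySamples(into: baseAddress, offset: startSample, count: spanLength)
    }
    return buffer
}
```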
Comment on lines +231 to +234

```swift
/// Used by the short-segment relabel post-pass: the span's samples are placed at the
/// start of a zero-padded model window and an all-active weight mask covering only the
/// span's frames is applied, so the embedding reflects the span's speaker exclusively
/// (neighboring audio never leaks in through the mask).
```
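The windowing idea in this doc comment (span at the start of a zero-padded window, mask active only over the span's frames) can be sketched as below. It assumes a fixed `samplesPerFrame` and one mask weight per frame; the real model's frame geometry and mask format are not shown in the PR, and `makeMaskedWindow` is a hypothetical helper.

```swift
/// A model-window input: audio samples plus one weight per output frame.
struct MaskedWindow {
    let samples: [Float]
    let frameMask: [Float]
}

/// Hypothetical helper: zero-padded window with a span-only frame mask.
func makeMaskedWindow(span: [Float], windowSamples: Int, samplesPerFrame: Int) -> MaskedWindow? {
    guard samplesPerFrame > 0, span.count <= windowSamples else { return nil }
    // Span samples first, zero padding after: nothing from neighboring audio.
    var samples = [Float](repeating: 0, count: windowSamples)
    samples.replaceSubrange(0..<span.count, with: span)
    // Ceiling division: every frame touched by at least one span sample is active.
    let frameCount = windowSamples / samplesPerFrame
    let spanFrames = min(frameCount, (span.count + samplesPerFrame - 1) / samplesPerFrame)
    var frameMask = [Float](repeating: 0, count: frameCount)
    for frame in 0..<spanFrames { frameMask[frame] = 1 }
    return MaskedWindow(samples: samples, frameMask: frameMask)
}
```

Marking a partially covered last frame as active is safe here: it holds only span samples and zero padding, never audio from outside the span.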
Comment on lines +235 to +239

```diff
 let merged = mergeSegments(rawSegments, gapThreshold: gapThreshold)
-return sanitize(segments: merged)
+let output = sanitize(segments: merged)
+
+
+return output
```
… on buffer failure, doc + style

- `detectRuns` requires exactly one active speaker per frame: an overlap frame with zero votes must keep the existing tie-break rather than be collapsed to a single re-embedded speaker (+ test)
- `embedSpan` throws instead of silently embedding a zero buffer when the span buffer's `baseAddress` is unavailable
- `embedSpan` doc no longer references only the fork's relabel pass
- drop leftover temporary in `buildSegments` return
ComicBit added a commit to ComicBit/FluidAudio that referenced this pull request on Jul 3, 2026
… throw (mirror of PR FluidInference#751 review fixes)
Problem
In `OfflineReconstruction.buildSegments`, aggregated timeline frames whose per-cluster vote sums are all zero are arbitrarily tie-broken to cluster 0. A frame ends up with zero votes when the active local speaker slot received no embedding in any covering window (assignment −2 everywhere). The result: whole speaker turns are silently absorbed into the surrounding speaker's segment.

Reproduced on a 97 s two-speaker fixture: a clean 1.6 s turn (segmentation gaps delimit it correctly at 26.88–28.46 s, verified from raw powerset activations) had zero cluster votes in all six covering windows and was emitted as part of the other speaker's 13 s segment.
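The PR does not show the tie-break code itself. Assuming it behaves like a first-index argmax over the per-cluster vote sums, this sketch shows why an all-zero row lands on cluster 0, and how a caller could detect the no-evidence case before picking a winner. Both function names are hypothetical.

```swift
/// First-index argmax: on ties (including an all-zero row) the lowest index wins.
func argmaxFirst(_ values: [Float]) -> Int? {
    guard var best = values.indices.first else { return nil }
    for index in values.indices where values[index] > values[best] {
        best = index
    }
    return best
}

/// Returns nil when there is no vote evidence at all, instead of guessing cluster 0.
func voteWinner(_ votes: [Float]) -> Int? {
    guard votes.contains(where: { $0 != 0 }) else { return nil }
    return argmaxFirst(votes)
}
```

With `argmaxFirst`, a zero-vote frame is indistinguishable from a frame that cluster 0 genuinely won; `voteWinner` surfaces it as `nil`, which is the signal a re-embedding pass can act on.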
Fix
Optional post-pass (`OfflineDiarizerConfig.zeroVoteReembed`, disabled by default, so there is no behavior change unless opted in), configured by `minDurationSeconds` (default 0.4 s).

Results (fixture A/B, community-1 preset, min-segment 0.5)
Three real turns recovered (26.9–28.4 s, 72.6–73.7 s, 87.3–87.9 s), all matching ground truth.
Tests
`ZeroVoteReembedderTests`: 15 cases covering run detection (voted/non-speech bounds, min-duration filter, multiple runs, timeline end), assignment (no-margin win, tie → lowest index, NaN, dimension mismatch), disabled-by-default, and synthetic-frame reconstruction integration (segment split on run boundaries, disabled path never invokes the embedder, nil-embedding parity with disabled).

Pure decision logic lives in `ZeroVoteReembedder` (no RNG, stable iteration order), so it is testable without CoreML models.
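The closure-injection pattern these tests rely on can be sketched as follows: with no embedder the labels pass through untouched, and a `nil` embedding or failed assignment keeps the existing labels. `SpanEmbedder`, `relabelZeroVoteRuns`, and the `assign` parameter are hypothetical; the real `buildSegments` signature is not shown in the PR.

```swift
/// Hypothetical closure shape: embed the audio between two timestamps (seconds),
/// returning nil on failure.
typealias SpanEmbedder = (_ startSeconds: Double, _ endSeconds: Double) -> [Float]?

/// Relabels each zero-vote run with the speaker chosen by `assign`.
func relabelZeroVoteRuns(
    labels: [Int],
    runs: [Range<Int>],
    frameDuration: Double,
    spanEmbedder: SpanEmbedder?,
    assign: ([Float]) -> Int?
) -> [Int] {
    // Disabled path: the embedder is never invoked and labels are unchanged.
    guard let embed = spanEmbedder else { return labels }
    var result = labels
    for run in runs {
        let start = Double(run.lowerBound) * frameDuration
        let end = Double(run.upperBound) * frameDuration
        // Failed embedding or no assignable speaker: keep the tie-break labels.
        guard let embedding = embed(start, end), let speaker = assign(embedding) else { continue }
        for frame in run { result[frame] = speaker }
    }
    return result
}
```

Because a relabeled run carries one label throughout, a later segment-building pass splits at the run's boundaries wherever that label differs from its neighbors. A counting fake embedder is enough to verify the disabled path without any CoreML model.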