fix: use output for python faithfulness statements by invocan-jsathe · Pull Request #201 · braintrustdata/autoevals

Janhavi Sathe (invocan-jsathe) · 2026-06-10T18:55:34Z

Summary

Fix a bug in Python Faithfulness where statements were extracted from expected instead of output, causing the scorer to evaluate the wrong text.
Update both sync and async Faithfulness paths in py/autoevals/ragas.py to route statement extraction through output.
Add a regression test in py/autoevals/test_ragas.py that uses mismatched output/expected and input-driven mocks so the final score depends on correct routing.

Bug

Faithfulness should score whether claims in the generated answer are supported by context.
In Python, it incorrectly passed expected to statement extraction, so claims were taken from ground truth rather than model output.

Fix

In Faithfulness._run_eval_async(...): change answer=expected -> answer=output.
In Faithfulness._run_eval_sync(...): change answer=expected -> answer=output.
Align sync required-field validation with async by requiring output as well.
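
For illustration, the routing change can be sketched with toy stand-ins. Everything below is an assumption made for the example: `extract_statements`, `check_required`, `faithfulness`, and the `use_output` flag are hypothetical and do not reproduce the actual code in py/autoevals/ragas.py, which calls an LLM rather than doing string matching.

```python
# Minimal sketch only: hypothetical stand-ins, not the autoevals implementation.

def extract_statements(question, answer):
    """Toy extractor: treat each sentence of `answer` as one statement."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def check_required(**fields):
    """Raise if any required field is None (assumed validation behavior)."""
    missing = [name for name, value in fields.items() if value is None]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")


def faithfulness(input, output, expected, context, use_output=True):
    # After the fix, output is validated and used for statement extraction.
    check_required(input=input, output=output, context=context)
    # use_output=False reproduces the old bug (answer=expected).
    answer = output if use_output else expected
    statements = extract_statements(question=input, answer=answer)
    if not statements:
        return 0.0
    # Toy verdict: a statement is "supported" if it appears verbatim in context.
    supported = sum(1 for s in statements if s.lower() in context.lower())
    return supported / len(statements)


context = "Paris is the capital of France. It lies on the Seine"
output = "Paris is the capital of Germany"   # model answer, unsupported
expected = "Paris is the capital of France"  # ground truth, supported

fixed_score = faithfulness("Capital of France?", output, expected, context)
buggy_score = faithfulness("Capital of France?", output, expected, context,
                           use_output=False)
```

In this sketch the buggy path scores the ground truth instead of the model answer, so an unfaithful output is reported as fully faithful.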

Test

Added test_faithfulness_extracts_statements_from_output:

Uses different output and expected.
Mocks extract_statements to derive statements from passed answer.
Mocks extract_faithfulness to derive verdicts from context containment.
Ensures score behavior reflects correct routing (would fail under old bug).
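
The shape of such a test can be sketched as follows. This is not the test added in py/autoevals/test_ragas.py: `score_faithfulness` is a hypothetical scorer that takes its two helpers as arguments, and the list-based return shapes of the mocks are assumptions chosen to keep the example self-contained.

```python
from unittest.mock import Mock


def score_faithfulness(output, expected, context,
                       extract_statements, extract_faithfulness):
    """Hypothetical scorer: statements must be extracted from `output`."""
    statements = extract_statements(answer=output)
    verdicts = extract_faithfulness(context=context, statements=statements)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def test_statements_come_from_output():
    # Input-driven mocks: results depend on the arguments actually passed,
    # so routing the wrong text changes the final score.
    extract_statements = Mock(
        side_effect=lambda answer: [s.strip() for s in answer.split(".") if s.strip()]
    )
    extract_faithfulness = Mock(
        side_effect=lambda context, statements: [
            1 if s in context else 0 for s in statements
        ]
    )

    context = "The sky is blue. Grass is green."
    output = "The sky is blue. Grass is purple."    # one unsupported claim
    expected = "The sky is blue. Grass is green."   # fully supported

    score = score_faithfulness(
        output, expected, context, extract_statements, extract_faithfulness
    )

    # With correct routing only half of the output's claims are supported;
    # extracting from `expected` instead would yield 1.0.
    assert score == 0.5
    extract_statements.assert_called_once_with(answer=output)
    return score
```

Because the mocks derive their results from their inputs, asserting on the score alone is enough to catch a routing regression; the `assert_called_once_with` check makes the intent explicit.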

Validation

uv run --extra dev --extra scipy pytest py/autoevals/test_ragas.py -k faithfulness_extracts_statements_from_output passed.
Pre-commit hooks pass on commit.

Co-authored-by: Cursor <cursoragent@cursor.com>

Barrett Pyke (barrettpyke) · 2026-06-10T20:04:20Z

Abhijeet Prasad (@AbhiPrasad) - Took a look at this one and it lgtm, but wanted to run it past you

github-actions · 2026-06-11T19:44:54Z

Braintrust eval report

Autoevals (main-1781207099)

Score	Average	Improvements	Regressions
NumericDiff	78% (+0pp)	7 🟢	11 🔴
Time_to_first_token	10.39tok (-0.38tok)	155 🟢	61 🔴
Llm_calls	1.55 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	1 🟢	1 🔴
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	526.63tok (-0.58tok)	1 🟢	1 🔴
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_5m_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_1h_tokens	0tok (+0tok)	-	-
Completion_tokens	467.56tok (+2.66tok)	108 🟢	100 🔴
Completion_reasoning_tokens	356.36tok (+2.04tok)	92 🟢	88 🔴
Completion_accepted_prediction_tokens	0tok (+0tok)	-	-
Completion_rejected_prediction_tokens	0tok (+0tok)	-	-
Completion_audio_tokens	0tok (+0tok)	-	-
Total_tokens	994.19tok (+2.08tok)	108 🟢	100 🔴
Estimated_cost	0$ (+0$)	91 🟢	88 🔴
Duration	10.43s (-0.35s)	156 🟢	62 🔴
Llm_duration	11.11s (-0.64s)	160 🟢	57 🔴

fix: use output for python faithfulness statements

cf33373

Co-authored-by: Cursor <cursoragent@cursor.com>

Barrett Pyke (barrettpyke) requested a review from Abhijeet Prasad (AbhiPrasad) June 10, 2026 20:03

Abhijeet Prasad (AbhiPrasad) approved these changes Jun 11, 2026

View reviewed changes

Barrett Pyke (barrettpyke) merged commit edfa259 into braintrustdata:main Jun 11, 2026
3 of 21 checks passed

Barrett Pyke (barrettpyke) merged 1 commit into braintrustdata:main from invocan-jsathe:fix/python-faithfulness-output-statements
