fix: use output for python faithfulness statements #201

Merged: Barrett Pyke (barrettpyke) merged 1 commit into braintrustdata:main from invocan-jsathe:fix/python-faithfulness-output-statements on Jun 11, 2026
@invocan-jsathe

Summary

  • Fix a bug in Python Faithfulness where statements were extracted from expected instead of output, causing the scorer to evaluate the wrong text.
  • Update both sync and async Faithfulness paths in py/autoevals/ragas.py to route statement extraction through output.
  • Add a regression test in py/autoevals/test_ragas.py that uses mismatched output/expected and input-driven mocks so the final score depends on correct routing.

Bug

Faithfulness should score whether claims in the generated answer are supported by context.
In Python, it incorrectly passed expected to statement extraction, so claims were taken from ground truth rather than model output.
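
To make the effect of the bug concrete, here is a minimal sketch with no LLM calls. The helpers `split_statements` and `faithfulness_score` are hypothetical stand-ins, not the autoevals implementation: statements are taken to be sentences, and a statement counts as supported when it appears verbatim in the context. The score is the fraction of supported statements.

```python
def split_statements(text: str) -> list[str]:
    """Toy stand-in for LLM statement extraction: one statement per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


def faithfulness_score(answer: str, context: str) -> float:
    """Fraction of statements in `answer` that appear verbatim in `context`."""
    statements = split_statements(answer)
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if s.lower() in context.lower())
    return supported / len(statements)


context = "Paris is the capital of France. The Seine flows through Paris."
output = "Paris is the capital of France. Paris has ten million residents."
expected = "Paris is the capital of France."

# Old behavior: statements come from the ground truth, so the unsupported
# claim in the model output is never checked.
buggy_score = faithfulness_score(expected, context)

# Fixed behavior: statements come from the model output.
fixed_score = faithfulness_score(output, context)

print(buggy_score, fixed_score)  # 1.0 0.5
```

With the old routing, the unsupported claim in `output` never reaches the scorer, so the answer looks fully faithful.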

Fix

  • In Faithfulness._run_eval_async(...): change answer=expected -> answer=output.
  • In Faithfulness._run_eval_sync(...): change answer=expected -> answer=output.
  • Align sync required-field validation with async by requiring output as well.
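
The shape of the change can be sketched as follows. This is an illustration under assumptions, not the code in py/autoevals/ragas.py: `check_required`, `extract_statements`, `run_eval_sync`, and `run_eval_async` are simplified hypothetical stand-ins, and statement extraction is reduced to sentence splitting. The point is that both paths validate the same fields, including `output`, and both pass `answer=output`.

```python
import asyncio
from typing import Optional


def check_required(scorer_name: str, **fields: Optional[str]) -> None:
    """Raise if any required field is missing (hypothetical helper)."""
    missing = [name for name, value in fields.items() if value is None]
    if missing:
        raise ValueError(f"{scorer_name} requires {', '.join(missing)}")


def extract_statements(question: str, answer: str) -> list[str]:
    """Toy stand-in for the LLM-backed extraction step."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def run_eval_sync(output=None, expected=None, input=None, context=None):
    # The sync path now requires `output`, matching the async path.
    check_required("Faithfulness", input=input, output=output, context=context)
    # Previously: answer=expected
    return extract_statements(question=input, answer=output)


async def run_eval_async(output=None, expected=None, input=None, context=None):
    check_required("Faithfulness", input=input, output=output, context=context)
    # Previously: answer=expected
    return extract_statements(question=input, answer=output)


args = dict(
    output="The sky is green.",
    expected="The sky is blue.",
    input="What color is the sky?",
    context="The sky is blue.",
)
sync_statements = run_eval_sync(**args)
async_statements = asyncio.run(run_eval_async(**args))
```

Requiring `output` up front means a missing answer fails with a clear error instead of reaching extraction as `None`.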

Test

Added test_faithfulness_extracts_statements_from_output:

  • Uses different output and expected.
  • Mocks extract_statements to derive statements from passed answer.
  • Mocks extract_faithfulness to derive verdicts from context containment.
  • Ensures score behavior reflects correct routing (would fail under old bug).
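
The structure of such a test can be sketched like this. It is a simplified, self-contained analogue rather than the test added in py/autoevals/test_ragas.py: the `faithfulness` function below is a hypothetical stand-in that takes its two extraction steps as arguments, and the mocks are plain `unittest.mock.Mock` objects whose return values depend on their inputs.

```python
from unittest.mock import Mock


def faithfulness(output, expected, input, context,
                 extract_statements, extract_faithfulness):
    """Hypothetical scorer: extraction steps are injected so they can be mocked."""
    statements = extract_statements(question=input, answer=output)
    verdicts = extract_faithfulness(context=context, statements=statements)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def test_faithfulness_extracts_statements_from_output():
    # The statement list is derived from whatever `answer` the scorer passes in.
    extract_statements = Mock(side_effect=lambda question, answer: [answer])
    # A statement is "supported" (verdict 1) only if the context contains it.
    extract_faithfulness = Mock(
        side_effect=lambda context, statements: [int(s in context) for s in statements]
    )

    score = faithfulness(
        output="The sky is green",
        expected="The sky is blue",
        input="What color is the sky?",
        context="The sky is blue",
        extract_statements=extract_statements,
        extract_faithfulness=extract_faithfulness,
    )

    # With answer=expected the single statement would be supported and the
    # score would be 1.0; with answer=output it is unsupported.
    assert score == 0.0
    assert extract_statements.call_args.kwargs["answer"] == "The sky is green"


test_faithfulness_extracts_statements_from_output()
```

Because the mocks compute their results from their arguments instead of returning fixed values, the final score changes if the wrong field is routed to extraction.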

Validation

  • uv run --extra dev --extra scipy pytest py/autoevals/test_ragas.py -k faithfulness_extracts_statements_from_output passed.
  • Pre-commit hooks pass on commit.

Co-authored-by: Cursor <cursoragent@cursor.com>
@barrettpyke

Abhijeet Prasad (@AbhiPrasad) - Took a look at this one and it lgtm, but wanted to run it past you

@barrettpyke Barrett Pyke (barrettpyke) merged commit edfa259 into braintrustdata:main Jun 11, 2026
3 of 21 checks passed
@github-actions

github-actions Bot commented Jun 11, 2026

Braintrust eval report

Autoevals (main-1781207099)

| Score | Average | Improvements | Regressions |
| --- | --- | --- | --- |
| NumericDiff | 78% (+0pp) | 7 🟢 | 11 🔴 |
| Time_to_first_token | 10.39tok (-0.38tok) | 155 🟢 | 61 🔴 |
| Llm_calls | 1.55 (+0) | - | - |
| Tool_calls | 0 (+0) | - | - |
| Errors | 0 (+0) | 1 🟢 | 1 🔴 |
| Llm_errors | 0 (+0) | - | - |
| Tool_errors | 0 (+0) | - | - |
| Prompt_tokens | 526.63tok (-0.58tok) | 1 🟢 | 1 🔴 |
| Prompt_cached_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_5m_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_1h_tokens | 0tok (+0tok) | - | - |
| Completion_tokens | 467.56tok (+2.66tok) | 108 🟢 | 100 🔴 |
| Completion_reasoning_tokens | 356.36tok (+2.04tok) | 92 🟢 | 88 🔴 |
| Completion_accepted_prediction_tokens | 0tok (+0tok) | - | - |
| Completion_rejected_prediction_tokens | 0tok (+0tok) | - | - |
| Completion_audio_tokens | 0tok (+0tok) | - | - |
| Total_tokens | 994.19tok (+2.08tok) | 108 🟢 | 100 🔴 |
| Estimated_cost | 0$ (+0$) | 91 🟢 | 88 🔴 |
| Duration | 10.43s (-0.35s) | 156 🟢 | 62 🔴 |
| Llm_duration | 11.11s (-0.64s) | 160 🟢 | 57 🔴 |

3 participants