fix(evaluation): Support non-English responses in ROUGE-1 matching by agharsallah · Pull Request #6292 · google/adk-python

agharsallah · 2026-07-04T22:12:33Z

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

Closes: Eval fails for non-English languages #3111

Problem:

The response_match_score metric (RougeEvaluator) always returns 0 for responses in non-Latin scripts, even when the actual and expected responses are identical. The root cause is the default rouge_score tokenizer, which lowercases the text and then replaces every character outside [a-z0-9] with a space. As a result, Thai, Chinese, Arabic, Japanese, Cyrillic, and other non-Latin text tokenizes to an empty token list, and every comparison scores 0.

Reproduction on main:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
scorer.score("สวัสดี", "สวัสดี")["rouge1"].fmeasure  # 0.0 — identical strings
scorer.score("hello", "hello")["rouge1"].fmeasure    # 1.0
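
The root cause can be illustrated without rouge_score installed. The following is a rough approximation of the behavior described above (lowercase, replace everything outside [a-z0-9] with a space, split); `ascii_only_tokenize` is a hypothetical name, stemming is omitted, and this is not the library's actual code.

```python
import re


def ascii_only_tokenize(text):
    """Approximate the described default: keep only runs of [a-z0-9]."""
    # Lowercase first, then turn every other character into a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # str.split() with no arguments drops empty strings.
    return cleaned.split()


print(ascii_only_tokenize("Hello, World!"))  # English survives
print(ascii_only_tokenize("สวัสดี"))          # Thai is erased entirely
print(ascii_only_tokenize("hello สวัสดี"))    # only the ASCII part remains
```

With no tokens on either side there is nothing to overlap, which is consistent with identical Thai strings scoring 0.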

Solution:

Pass a Unicode-aware tokenizer to RougeScorer in final_response_match_v1.py:

Word characters are detected with str.isalnum() plus Unicode combining-mark categories (Mn/Mc), so scripts with combining vowel signs (e.g. Thai สวัสดี, Devanagari matras) stay intact as single tokens.
Pure-ASCII tokens are delegated to rouge_score's own DefaultTokenizer, so lowercasing, Porter stemming, and scores for English text are unchanged (guarded by an equivalence test).
No new dependencies; tokenizers is re-exported through the existing google.adk.dependencies.rouge_scorer indirection, consistent with how rouge_scorer is imported today.
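
A minimal sketch of the tokenization idea in the points above, not the code in this PR. The names `is_word_char` and `unicode_aware_tokenize` are hypothetical, and ASCII tokens are simply lowercased here as a stand-in for delegating to rouge_score's DefaultTokenizer (so no stemming happens in this sketch).

```python
import unicodedata
from typing import List


def is_word_char(ch: str) -> bool:
    """Alphanumeric in any script, or a combining mark (Mn/Mc)."""
    return ch.isalnum() or unicodedata.category(ch) in ("Mn", "Mc")


def unicode_aware_tokenize(text: str) -> List[str]:
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-word character closes the token in progress.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    # Assumption: lowercase ASCII tokens instead of calling DefaultTokenizer.
    return [t.lower() if t.isascii() else t for t in tokens]


print(unicode_aware_tokenize("สวัสดี"))
print(unicode_aware_tokenize("Hello, мир!"))
```

Without the Mn/Mc check, the Thai vowel signs in สวัสดี would act as separators, because str.isalnum() is False for them, and the word would be split into fragments.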

Known limitation (noted for reviewers): languages written without spaces (Thai, Chinese) are matched at phrase granularity rather than word granularity, since proper word segmentation would require a language-specific segmenter. This PR fixes the "identical/overlapping text scores 0" bug without adding dependencies.
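
To make the limitation concrete, here is a small illustration in which plain whitespace splitting stands in for the tokenizer (an assumption for the example only). The space-separated variant is a hand-segmented string, not the output of any real segmenter.

```python
def whitespace_tokens(text):
    """Stand-in tokenizer: one token per whitespace-separated run."""
    return text.split()


# Unsegmented Chinese: each sentence is a single token.
a = whitespace_tokens("我喜欢猫")
b = whitespace_tokens("我喜欢狗")
shared_unsegmented = set(a) & set(b)

# Hand-segmented (hypothetical) version of the same sentences.
a_seg = whitespace_tokens("我 喜欢 猫")
b_seg = whitespace_tokens("我 喜欢 狗")
shared_segmented = set(a_seg) & set(b_seg)

print(a, b, shared_unsegmented)
print(a_seg, b_seg, shared_segmented)
```

The two sentences share most of their characters, yet at phrase granularity they have no token in common; only identical or space-delimited overlapping phrases contribute to the score.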

Testing Plan

Unit Tests:

I have added or updated unit tests for my change.
All unit tests pass locally.

New tests in tests/unittests/evaluation/test_final_response_match_v1.py:

identical non-English text scores 1.0 (Thai, Chinese, Arabic, Japanese, Russian)
partially overlapping non-English text scores the expected ROUGE-1 fraction
mixed English + non-English text
_UnicodeAwareTokenizer produces identical tokens to rouge_score's DefaultTokenizer for ASCII inputs (regression guard for English scoring, incl. stemming, punctuation, digits, underscores)
evaluator-level case with identical Thai/Chinese responses → PASSED
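
For reference, the "expected ROUGE-1 fraction" in the partial-overlap case can be worked out by hand. The sketch below computes a ROUGE-1 F-measure from token lists using clipped unigram counts; `rouge1_f` is a hypothetical helper, the Russian phrases are illustrative inputs rather than the PR's test data, and no stemming is applied.

```python
from collections import Counter


def rouge1_f(reference_tokens, candidate_tokens):
    """Unigram-overlap F-measure between two token lists."""
    if not reference_tokens or not candidate_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller side.
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


identical = rouge1_f("привет мир".split(), "привет мир".split())
partial = rouge1_f("привет мир".split(), "привет друг".split())
print(identical, partial)
```

In the partial case one of two unigrams matches on each side, so precision and recall are both 1/2 and the F-measure is 1/2.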

$ pytest tests/unittests/evaluation/test_final_response_match_v1.py -q
20 passed, 5 warnings in 1.53s

$ pytest tests/unittests/evaluation/ -q
496 passed, 27 warnings in 149.13s (0:02:29)

Manual End-to-End (E2E) Tests:

Ran the evaluator directly on the scenario from #3111 (agent instructed to reply with the word สวัสดี, expected response สวัสดี):

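# inv(...) is a helper not shown in this description; it is assumed to wrap a response string in an Invocation.
ev = RougeEvaluator(EvalMetric(metric_name="response_match_score", threshold=0.8))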
result = ev.evaluate_invocations([inv("สวัสดี")], [inv("สวัสดี")])
# before: score=0.0, status=EvalStatus.FAILED
# after:  score=1.0, status=EvalStatus.PASSED

Checklist

I have read the CONTRIBUTING.md document.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have added tests that prove my fix is effective or that my feature works.
New and existing unit tests pass locally with my changes.
I have manually tested my changes end-to-end.
Any dependent changes have been merged and published in downstream modules.

Additional context

Formatting verified with the repo-pinned tools: pyink 25.12 (unchanged), isort --profile google (clean), ruff 0.15.17 (all checks passed), and scripts/compliance_checks.py (exit 0).

The default rouge_score tokenizer drops every character outside [a-z0-9], so responses in non-Latin scripts (Thai, Chinese, Arabic, etc.) tokenize to nothing and response_match_score is always 0, even for identical texts.

Use a Unicode-aware tokenizer that keeps non-ASCII word characters (including combining marks such as Thai vowel signs) and delegates ASCII tokens to the default tokenizer, so stemming and scores for English text are unchanged.

Fixes google#3111

Signed-off-by: agharsallah <17379925+agharsallah@users.noreply.github.com>

google-cla · 2026-07-04T22:12:38Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.


agharsallah wants to merge 1 commit into google:main from agharsallah:fix/eval-rouge-non-english-3111
