
fix(evaluation): Support non-English responses in ROUGE-1 matching #6292

Open
agharsallah wants to merge 1 commit into google:main from agharsallah:fix/eval-rouge-non-english-3111

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

Problem:

The response_match_score metric (RougeEvaluator) always returns 0 for responses written entirely in non-Latin scripts, even when the actual and expected responses are identical. The root cause is the default rouge_score tokenizer, which lowercases the text and then replaces every character outside [a-z0-9] with a space. Text in Thai, Chinese, Arabic, Japanese, Cyrillic, etc. therefore tokenizes to an empty token list, and every comparison scores 0.
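The effect can be illustrated without rouge_score installed. The following is a minimal sketch that only mimics the normalization described above (lowercase, then replace everything outside [a-z0-9] with a space); `ascii_only_tokens` is a hypothetical name for illustration and is not part of rouge_score or ADK.

```python
import re


def ascii_only_tokens(text):
    """Mimic the described normalization: lowercase, then keep only [a-z0-9] runs."""
    # Every run of characters outside [a-z0-9] becomes a single space.
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # Splitting on whitespace drops the empty pieces.
    return normalized.split()


print(ascii_only_tokens("Hello, World 42"))  # ['hello', 'world', '42']
print(ascii_only_tokens("สวัสดี"))            # []
```

With an empty token list on both sides there is nothing to overlap, which is consistent with the 0 score reported for identical non-Latin strings.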

Reproduction on main:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
scorer.score("สวัสดี", "สวัสดี")["rouge1"].fmeasure  # 0.0 — identical strings
scorer.score("hello", "hello")["rouge1"].fmeasure    # 1.0

Solution:

Pass a Unicode-aware tokenizer to RougeScorer in final_response_match_v1.py:

  • Word characters are detected with str.isalnum() plus Unicode combining-mark categories (Mn/Mc), so scripts with combining vowel signs (e.g. Thai สวัสดี, Devanagari matras) stay intact as single tokens.
  • Pure-ASCII tokens are delegated to rouge_score's own DefaultTokenizer, so lowercasing, Porter stemming, and scores for English text are unchanged (guarded by an equivalence test).
  • No new dependencies; tokenizers is re-exported through the existing google.adk.dependencies.rouge_scorer indirection, consistent with how rouge_scorer is imported today.
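The word-character rule in the first bullet can be sketched roughly as follows. This is not the PR's `_UnicodeAwareTokenizer`: the names `is_word_char` and `unicode_aware_tokens` are hypothetical, the delegation of ASCII tokens to rouge_score's DefaultTokenizer (and therefore stemming) is replaced by plain lowercasing, and leaving non-ASCII tokens unchanged is an assumption of this sketch.

```python
import unicodedata


def is_word_char(ch):
    """A character counts as part of a word if it is alphanumeric or a combining mark."""
    # Mn = nonspacing mark, Mc = spacing combining mark (e.g. Thai vowel signs).
    return ch.isalnum() or unicodedata.category(ch) in ("Mn", "Mc")


def unicode_aware_tokens(text):
    """Split text into runs of word characters (illustrative only, no stemming)."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    # Assumption: lowercase pure-ASCII tokens only; the PR delegates these
    # to rouge_score's DefaultTokenizer instead.
    return [t.lower() if t.isascii() else t for t in tokens]


print(unicode_aware_tokens("Hello, สวัสดี!"))  # ['hello', 'สวัสดี']
```

Without the Mn/Mc check, the Thai vowel signs inside สวัสดี would fail `str.isalnum()` and split the word into fragments.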

Known limitation (noted for reviewers): languages written without spaces (Thai, Chinese) are matched at phrase granularity rather than word granularity, since proper word segmentation would require a language-specific segmenter. This PR fixes the "identical/overlapping text scores 0" bug without adding dependencies.
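To make the limitation concrete, here is a small sketch that uses whitespace splitting as a stand-in for any tokenizer that does not perform word segmentation. The helper `shared_tokens` and the sample sentences are illustrative assumptions, not taken from the PR's tests.

```python
def shared_tokens(reference, candidate):
    """Return the tokens two texts have in common, splitting on whitespace only."""
    return sorted(set(reference.split()) & set(candidate.split()))


# Russian separates words with spaces, so the shared word is found.
print(shared_tokens("привет мир", "привет друг"))  # ['привет']

# Chinese is written without spaces: each sentence is a single token, so
# "I like apples" and "I like bananas" share nothing despite a common prefix.
print(shared_tokens("我喜欢苹果", "我喜欢香蕉"))  # []
```

Identical phrases still match exactly, which is the case this PR targets; partial credit inside an unsegmented phrase would need a language-specific segmenter.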

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

New tests in tests/unittests/evaluation/test_final_response_match_v1.py:

  • identical non-English text scores 1.0 (Thai, Chinese, Arabic, Japanese, Russian)
  • partially overlapping non-English text scores the expected ROUGE-1 fraction
  • mixed English + non-English text
  • _UnicodeAwareTokenizer produces identical tokens to rouge_score's DefaultTokenizer for ASCII inputs (regression guard for English scoring, incl. stemming, punctuation, digits, underscores)
  • evaluator-level case with identical Thai/Chinese responses → PASSED

$ pytest tests/unittests/evaluation/test_final_response_match_v1.py -q
20 passed, 5 warnings in 1.53s

$ pytest tests/unittests/evaluation/ -q
496 passed, 27 warnings in 149.13s (0:02:29)
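For reference, the "expected ROUGE-1 fraction" in the partial-overlap tests can be worked out by hand from token lists. The sketch below computes a unigram-overlap F-measure using clipped counts; `rouge1_f` is a hypothetical helper written for this illustration, and the token lists are made-up examples rather than the PR's test data.

```python
from collections import Counter


def rouge1_f(reference_tokens, candidate_tokens):
    """Unigram-overlap F-measure computed from two token lists."""
    # Clipped overlap: each token counts at most as often as it appears in both lists.
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


# One of two tokens matches on each side: precision = recall = 0.5.
print(rouge1_f(["привет", "мир"], ["привет", "друг"]))  # 0.5
# Empty token lists (the pre-fix situation for non-Latin text) score 0.
print(rouge1_f([], []))  # 0.0
```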

Manual End-to-End (E2E) Tests:

Ran the evaluator directly on the scenario from #3111 (agent instructed to reply with the word สวัสดี, expected response สวัสดี):

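from google.adk.evaluation.eval_metrics import EvalMetric
from google.adk.evaluation.final_response_match_v1 import RougeEvaluator
# inv(text) is a small helper (not shown) that builds an invocation whose final response is text.
ev = RougeEvaluator(EvalMetric(metric_name="response_match_score", threshold=0.8))
result = ev.evaluate_invocations([inv("สวัสดี")], [inv("สวัสดี")])
# before: score=0.0, status=EvalStatus.FAILED
# after:  score=1.0, status=EvalStatus.PASSED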

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Formatting verified with the repo-pinned tools: pyink 25.12 (unchanged), isort --profile google (clean), ruff 0.15.17 (all checks passed), and scripts/compliance_checks.py (exit 0).

The default rouge_score tokenizer drops every character outside
[a-z0-9], so responses in non-Latin scripts (Thai, Chinese, Arabic,
etc.) tokenize to nothing and response_match_score is always 0, even
for identical texts.

Use a Unicode-aware tokenizer that keeps non-ASCII word characters
(including combining marks such as Thai vowel signs) and delegates
ASCII tokens to the default tokenizer, so stemming and scores for
English text are unchanged.

Fixes google#3111

Signed-off-by: agharsallah <17379925+agharsallah@users.noreply.github.com>

google-cla Bot commented Jul 4, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Development

Successfully merging this pull request may close these issues.

Eval fails for non-English languages