fix: correct suffix boundary lookup for prefixed last names (#100) #179
Merged
The prefix-joining loop located the suffix stop boundary with a value-based pieces.index() that searched from position 0. When a token value repeated (a trailing title that is also a suffix acronym, e.g. the second 'dr' in 'dr Vincent van Gogh dr'), it matched the leading occurrence, producing an empty slice that duplicated pieces and corrupted the middle name. Constrain the lookup to start at i + 1, consistent with the sibling next_prefix lookup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix inline comment in join_on_conjunctions: clarify that filter()
finds the value in pieces[i+1:] but index() searches from 0 by
default, and drop the misleading "title" framing (the token only
needs to satisfy is_suffix, not is_title)
- Add test for two-word prefix collision ("van der") — different loop
iteration count than the single-word case
- Add test with a genuine middle name alongside the repeated token,
since the pre-fix bug corrupted the middle field specifically
- Add @pytest.mark.timeout(2) to the #108 guard so the timeout is
enforced locally and in CI, not just by CI job limits
- Assert hn.last contains "Berg" in the #108 guard to catch silent
last-name corruption
- Add pytest-timeout dev dependency
- Resolve pre-existing stash conflict in docs/resources.rst (keep upstream)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
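The lookup change described in the commit messages above can be illustrated with a minimal sketch. It is not nameparser's actual code: the token list, the `SUFFIXES` set, and the `is_suffix` helper are simplified stand-ins assumed for illustration. It only shows how a value-based `list.index()` behaves with and without a start argument when a token value repeats.

```python
# Simplified stand-in for the tokenized name; not nameparser's internals.
pieces = ["dr", "Vincent", "van", "Gogh", "dr"]
SUFFIXES = {"dr"}  # hypothetical suffix set, assumed for this sketch


def is_suffix(token):
    """Hypothetical helper: True if the token counts as a suffix here."""
    return token.lower() in SUFFIXES


i = pieces.index("van")  # position of the last-name prefix (2)

# filter() only sees pieces[i + 1:], so it finds the trailing "dr" by value.
stop_at = next(filter(is_suffix, pieces[i + 1:]))

# A value-based index() with no start argument scans from position 0
# and lands on the leading "dr" instead.
buggy_j = pieces.index(stop_at)
# Passing i + 1 as the start restricts the search to tokens after the prefix.
fixed_j = pieces.index(stop_at, i + 1)

buggy_slice = pieces[i:buggy_j]  # pieces[2:0], an empty slice
fixed_slice = pieces[i:fixed_j]  # the prefix plus the last name

print(buggy_j, buggy_slice)  # 0 []
print(fixed_j, fixed_slice)  # 4 ['van', 'Gogh']
```

The empty slice in the unconstrained case is what the commit message describes as the source of the duplicated pieces; how the real loop then mishandles it is not reproduced here.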
Summary
- "dr Vincent van Gogh dr" was producing a corrupted middle name (" dr Vincent van") because pieces.index(stop_at) searched from position 0, matching the leading "dr" (a title that is also a suffix acronym) instead of the trailing one
- Added the i + 1 start argument, giving pieces.index(stop_at, i + 1), which makes the lookup consistent with the sibling next_prefix lookup just above it, which was already correct
Test Plan
- test_title_before_and_after_prefixed_last_name: asserts the agreed output for #100 (Strange parsing of name w lastname prefix and title before and after): title="dr", first="Vincent", middle="", last="van Gogh", suffix="dr"
- test_many_repeated_prefixes_does_not_blow_up: parses "Jan van der … Berg" (30× prefix) without hanging or raising
- test_prefix_is_first_name (Van Johnson), test_portuguese_prefixes, test_portuguese_dos, test_prefix_before_two_part_last_name_with_acronym_suffix
- mypy and ruff clean
Out of scope
Issues #121 and #132 were evaluated and excluded — they are irreducible ambiguities that collide with real names, not corruption bugs.
🤖 Generated with Claude Code