Background
ben (Hebrew/Arabic "son of") functions as a last-name prefix particle in names like "Ahmad ben Husain", exactly like van or von. It was removed from PREFIXES in v0.2.5 because it conflicts with the common English given/middle name "Ben" (short for Benjamin) — e.g. "Alex Ben Johnson" would incorrectly eat "Ben" as a prefix.
Proposed approach
A case-sensitive heuristic in is_prefix(): treat ben as a prefix only when it appears already lowercase in an otherwise mixed-case name. In "Ahmad ben Husain" the lowercase ben is a strong signal it's a particle; in "Alex Ben Johnson" the capitalized Ben signals a given name.
This is consistent with the existing precedent in is_an_initial(), which uses original casing to distinguish initials from other tokens.
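A minimal sketch of the check, to make the proposal concrete. The helper name `looks_like_ben_particle` and its signature are hypothetical (not part of nameparser), and the reading of "mixed-case" as "some other piece is capitalized but not all-caps" is an assumption:

```python
from typing import List


def looks_like_ben_particle(pieces: List[str], index: int) -> bool:
    """Hypothetical helper: is pieces[index] a lowercase 'ben' in a mixed-case name?"""
    # Exact, case-sensitive match: "Ben" and "BEN" are rejected here.
    if pieces[index] != "ben":
        return False
    others = pieces[:index] + pieces[index + 1:]
    # Assumption: "mixed-case" means another piece is capitalized but not all-caps.
    return any(p[:1].isupper() and not p.isupper() for p in others)


for name in ("Ahmad ben Husain", "Alex Ben Johnson"):
    print(name, "->", looks_like_ben_particle(name.split(), 1))
# Ahmad ben Husain -> True
# Alex Ben Johnson -> False
```

In the real parser this logic would have to live in or next to is_prefix(), which is where the call-site problem described further down comes in.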
Risks
The case-sensitive heuristic is a weak, easily destroyed signal and can fail in both directions:
- False positives on lowercased input. Datasets that arrive all-lowercase (e.g. "alex ben johnson") would have ben treated as a particle, eating the middle name. All-lowercase and all-uppercase input are common in real data.
- False negatives on capitalized particles, including the motivating names. Title-cased data breaks it: "David Ben Gurion" has a capitalized Ben that genuinely is the particle, so the heuristic would miss it. Any title-cased or ALL-CAPS dataset destroys the casing signal.
- Contradicts the library's own stance on casing. The parser lowercases for matching precisely because input casing is unreliable, and the whole capitalize() feature exists to repair bad casing. Basing a parse decision on casing reintroduces the assumption the rest of the library rejects.
- Doesn't resolve the ambiguity, only relocates it. ben (son-of) vs. "Ben" (Benjamin) is genuinely ambiguous; casing is a proxy that works for clean mixed-case input and silently fails otherwise.
Net: a default-on ben heuristic could be wrong more often than the status quo (where ben is just a normal name piece). The opt-in workaround below stays the safe recommendation; any default handling should be weighed against these failure modes, and would most defensibly ship opt-in rather than as a global default.
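The failure modes can be reproduced with a toy version of the heuristic run over the same names under different casings. Everything here is illustrative: `ben_is_particle` and its `require_mixed_case` flag are hypothetical, and the "intended" column is just the reading this issue assumes for each name. The flag is there because the outcome on all-lowercase input depends on how strictly the "otherwise mixed-case" condition is applied:

```python
def ben_is_particle(name: str, require_mixed_case: bool = True) -> bool:
    """Toy heuristic: does the name contain a lowercase 'ben' particle?"""
    pieces = name.split()
    if "ben" not in pieces:  # case-sensitive: "Ben" / "BEN" never match
        return False
    if not require_mixed_case:
        return True
    # Assumption: mixed-case = some piece is capitalized but not all-caps.
    return any(p[:1].isupper() and not p.isupper() for p in pieces)


# (input, intended reading: is "ben" a particle here?)
CASES = [
    ("Ahmad ben Husain", True),
    ("Alex Ben Johnson", False),
    ("alex ben johnson", False),
    ("ahmad ben husain", True),
    ("David Ben Gurion", True),
    ("AHMAD BEN HUSAIN", True),
]

for name, intended in CASES:
    strict = ben_is_particle(name)
    loose = ben_is_particle(name, require_mixed_case=False)
    print(f"{name!r:22} intended={intended!s:5} strict={strict!s:5} loose={loose!s:5}")
```

With the guard relaxed, "alex ben johnson" is a false positive; with the guard enforced, that case is fixed but "ahmad ben husain" becomes a false negative instead. "David Ben Gurion" and "AHMAD BEN HUSAIN" are missed either way.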
Why it's non-trivial (implementation)
is_prefix() is called from five places in parser.py:
- line 250: initials computation
- line 448: _split_last() for last_base/last_prefixes
- line 1054: main prefix-join loop during parsing
- line 1075: chained prefix lookahead
- line 1106: cap_word() during capitalization
Making is_prefix() case-sensitive globally would break the capitalization path (line 1106), where a capitalized Van in an all-caps input being normalized needs to still be recognized as a prefix and lowercased. A narrower fix — special-casing ben only in the parse-flow call sites, not in cap_word — would work but requires more surgical changes.
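One possible shape for the narrower fix, sketched outside the library. The `PREFIXES` set is a toy stand-in, and `is_prefix_in_parse` is a hypothetical wrapper that only the parse-flow call sites would use, while cap_word() keeps calling the unchanged case-insensitive check:

```python
from typing import List

# Toy subset standing in for the library's prefix set; "ben" is deliberately absent.
PREFIXES = {"van", "von", "de", "la"}


def is_prefix(piece: str) -> bool:
    """Case-insensitive check, left as-is for the capitalization path."""
    return piece.lower() in PREFIXES


def is_prefix_in_parse(piece: str, pieces: List[str]) -> bool:
    """Hypothetical wrapper for parse-flow call sites: adds the 'ben' special case."""
    if is_prefix(piece):
        return True
    # Assumption: mixed-case = some piece is capitalized but not all-caps.
    mixed_case = any(p[:1].isupper() and not p.isupper() for p in pieces)
    return piece == "ben" and mixed_case


print(is_prefix("VAN"))                                        # True
print(is_prefix("ben"))                                        # False
print(is_prefix_in_parse("ben", ["Ahmad", "ben", "Husain"]))   # True
print(is_prefix_in_parse("Ben", ["Alex", "Ben", "Johnson"]))   # False
```

The cost is that every parse-flow call site has to be switched to the wrapper and given access to the original-cased pieces, which is the surgical part.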
Workaround
Users with Hebrew/Arabic name datasets can add it themselves:
from nameparser.config import CONSTANTS
CONSTANTS.prefixes.add('ben')