Support 'ben' as a prefix particle for Hebrew/Arabic names #183

Description

@derek73

Background

ben (Hebrew/Arabic "son of") functions as a last-name prefix particle in names like "Ahmad ben Husain", exactly like van or von. It was removed from PREFIXES in v0.2.5 because it conflicts with the common English given/middle name "Ben" (short for Benjamin) — e.g. "Alex Ben Johnson" would incorrectly eat "Ben" as a prefix.

Proposed approach

A case-sensitive heuristic in is_prefix(): treat ben as a prefix only when it appears already lowercase in an otherwise mixed-case name. In "Ahmad ben Husain" the lowercase ben is a strong signal it's a particle; in "Alex Ben Johnson" the capitalized Ben signals a given name.

This is consistent with the existing precedent in is_an_initial(), which uses original casing to distinguish initials from other tokens.
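
A minimal sketch of the proposed check, to make the idea concrete. Everything here is a hypothetical stand-in: the PREFIXES subset, the CASE_SENSITIVE_PREFIXES set, the extra full_name argument, and the toy split_name() helper are assumptions for illustration and do not match nameparser's real signatures or parse flow.

```python
# Hypothetical stand-ins; nameparser's real constants and signatures differ.
PREFIXES = {"van", "von", "de"}
CASE_SENSITIVE_PREFIXES = {"ben"}  # assumed: only counts when written lowercase


def is_prefix(piece: str, full_name: str) -> bool:
    """Sketch of a prefix test where 'ben' depends on the original casing."""
    lowered = piece.lower()
    if lowered in PREFIXES:
        return True  # ordinary prefixes stay case-insensitive
    if lowered in CASE_SENSITIVE_PREFIXES:
        written_lowercase = piece == lowered
        # "otherwise mixed-case": some other part of the name uses capitals
        name_has_capitals = any(ch.isupper() for ch in full_name)
        return written_lowercase and name_has_capitals
    return False


def split_name(full_name: str) -> dict:
    """Toy first/middle/last split that joins a prefix onto the last name."""
    pieces = full_name.split()
    first, rest = pieces[0], pieces[1:]
    if not rest:
        return {"first": first, "middle": "", "last": ""}
    # The final piece is always part of the last name, so only scan before it.
    for i, piece in enumerate(rest[:-1]):
        if is_prefix(piece, full_name):
            return {"first": first,
                    "middle": " ".join(rest[:i]),
                    "last": " ".join(rest[i:])}
    return {"first": first, "middle": " ".join(rest[:-1]), "last": rest[-1]}


print(split_name("Ahmad ben Husain"))
# {'first': 'Ahmad', 'middle': '', 'last': 'ben Husain'}
print(split_name("Alex Ben Johnson"))
# {'first': 'Alex', 'middle': 'Ben', 'last': 'Johnson'}
```

The only behavioral difference from an ordinary prefix is the comparison against the piece as originally written, so "Van" and "van" are still handled identically.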

Risks

The case-sensitive heuristic is a weak, easily-destroyed signal and can fail in both directions:

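  • False positives on lowercased input. Unless the "otherwise mixed-case" condition is strictly enforced, datasets that arrive all-lowercase (e.g. "alex ben johnson") would have ben treated as a particle, eating the middle name. Enforcing it only trades that error for a miss on genuine particles in lowercased input (e.g. "ahmad ben husain"). All-lowercase and all-uppercase input are common in real data.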
  • False negatives on capitalized particles — including the motivating names. Title-cased data breaks it: "David Ben Gurion" has a capitalized Ben that genuinely is the particle, so the heuristic would miss it. Any title-cased or ALL-CAPS dataset destroys the casing signal.
  • Contradicts the library's own stance on casing. The parser lowercases for matching precisely because input casing is unreliable, and the whole capitalize() feature exists to repair bad casing. Basing a parse decision on casing reintroduces the assumption the rest of the library rejects.
  • Doesn't resolve the ambiguity, only relocates it. ben (son-of) vs. "Ben" (Benjamin) is genuinely ambiguous; casing is a proxy that works for clean mixed-case input and silently fails otherwise.
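
The failure modes above can be reproduced with a toy check run over a handful of casing variants. This is only an illustration: ben_is_particle(), the require_mixed_case flag, and the labeled CASES list are hypothetical and not part of nameparser. The flag contrasts a bare "is it lowercase" test with the stricter "lowercase in an otherwise mixed-case name" test.

```python
def ben_is_particle(piece: str, full_name: str, require_mixed_case: bool) -> bool:
    """Toy casing heuristic: only a lowercase-as-written 'ben' can be a particle."""
    if piece != "ben":
        return False  # "Ben" and "BEN" are never treated as particles
    if require_mixed_case:
        # Stricter variant: the rest of the name must contain capitals.
        return any(ch.isupper() for ch in full_name)
    return True


# (input, whether the second word really is the "son of" particle)
CASES = [
    ("Ahmad ben Husain", True),
    ("Alex Ben Johnson", False),
    ("alex ben johnson", False),   # lowercased given name
    ("ahmad ben husain", True),    # lowercased particle
    ("David Ben Gurion", True),    # title-cased particle
    ("AHMAD BEN HUSAIN", True),    # all-caps particle
]


def misparsed(require_mixed_case: bool) -> list:
    """Return the inputs where the heuristic disagrees with the label."""
    wrong = []
    for name, expected in CASES:
        piece = name.split()[1]
        if ben_is_particle(piece, name, require_mixed_case) != expected:
            wrong.append(name)
    return wrong


print(misparsed(require_mixed_case=False))
# ['alex ben johnson', 'David Ben Gurion', 'AHMAD BEN HUSAIN']
print(misparsed(require_mixed_case=True))
# ['ahmad ben husain', 'David Ben Gurion', 'AHMAD BEN HUSAIN']
```

Both variants get the two clean mixed-case inputs right and miss three of the other four, which is the "relocates the ambiguity" point in miniature.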

Net: a default-on ben heuristic could be wrong more often than the status quo (where ben is just a normal name piece). The opt-in workaround below stays the safe recommendation; any default handling should be weighed against these failure modes, and would most defensibly ship opt-in rather than as a global default.

Why it's non-trivial (implementation)

is_prefix() is called from five places in parser.py:

  • line 250: initials computation
  • line 448: _split_last() for last_base/last_prefixes
  • line 1054: main prefix-join loop during parsing
  • line 1075: chained prefix lookahead
  • line 1106: cap_word() during capitalization

Making is_prefix() case-sensitive globally would break the capitalization path (line 1106), where a capitalized Van in an all-caps input being normalized needs to still be recognized as a prefix and lowercased. A narrower fix — special-casing ben only in the parse-flow call sites, not in cap_word — would work but requires more surgical changes.
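
One possible shape for the narrower fix, sketched under assumptions: is_prefix() stays case-insensitive and unaware of ben, and a separate wrapper (here called is_prefix_for_parsing(), a hypothetical name) is used only at the parse-flow call sites. The PREFIXES subset and the simplified cap_word() are illustrative stand-ins, not the code in parser.py.

```python
PREFIXES = {"van", "von"}  # hypothetical subset; 'ben' deliberately stays out


def is_prefix(piece: str) -> bool:
    """Unchanged, case-insensitive check; still used by the capitalization path."""
    return piece.lower() in PREFIXES


def is_prefix_for_parsing(piece: str, full_name: str) -> bool:
    """Hypothetical wrapper for the parse-flow call sites only."""
    if is_prefix(piece):
        return True
    # Special case: lowercase 'ben' in a name that uses capitals elsewhere.
    return piece == "ben" and any(ch.isupper() for ch in full_name)


def cap_word(word: str) -> str:
    """Simplified capitalization path: lowercase prefixes, capitalize the rest."""
    return word.lower() if is_prefix(word) else word.capitalize()


print(cap_word("VAN"))                                     # van
print(cap_word("BEN"))                                     # Ben
print(is_prefix_for_parsing("ben", "Ahmad ben Husain"))    # True
print(is_prefix_for_parsing("Ben", "Alex Ben Johnson"))    # False
```

In this sketch cap_word("ben") would return "Ben", so a real change would still need to decide whether capitalization should preserve a particle that the parser accepted.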

Workaround

Users with Hebrew/Arabic name datasets can add it themselves:

# CONSTANTS is module-level, so this affects every name parsed afterwards
from nameparser.config import CONSTANTS
CONSTANTS.prefixes.add('ben')
