Skip to content

Auto-possessify greedy repeats in regular expressions #153044

Description

@serhiy-storchaka

Feature or enhancement

Proposal

Make the re compiler turn a greedy repeat into a possessive one when this cannot change what the pattern matches: when the repeated atom and every character that can possibly follow the repeat are provably disjoint, backtracking into the repeat is always futile, so a+b can be compiled as if it were a++b. PCRE2 performs the same optimization under the name "auto-possessification".

This is a pure compile-time transform: the POSSESSIVE_REPEAT opcodes exist since Python 3.11, and the analysis only fires when it can prove disjointness, so any imprecision can only cost an optimization, never change a match result.

The main benefit is failing or backtracking-heavy matches whose repeat is followed by a character set or category: the existing REPEAT_ONE fast path only covers literal tails, so for example \d+\. or \w+\s currently backtrack character by character. Measured speedups are 2.2–4.5× on such failing matches, and some patterns with catastrophic backtracking are defused as a side effect (though this is an optimization, not a security fix). Correction: where disjointness is provable, every give-back already fails at the first following atom, so this removes only a constant factor; the catastrophic nested-repeat patterns like (a+)+b are among the possible extensions below.

It also composes with the character class set operations (gh-152100): a repeat of a difference class that excludes its follower, such as [\w--\d]+\d, becomes possessive.

I have a working implementation validated by a differential fuzzer and will open a PR.

This is only a first step: the initial implementation is deliberately conservative, covering repeats of single-character atoms and of rigid multi-atom bodies, with the follower analysis traversing group boundaries, alternations, anchors and atomic groups. It can be extended later, for example to alternation bodies with pairwise-disjoint first characters, to word-boundary and lookahead followers, and to repeats of groups (the (a+)+b class).

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere.

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance or resource usagestdlibStandard Library Python modules in the Lib/ directorytopic-regextype-featureA feature request or enhancement

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions