Feature or enhancement
Proposal
Make the re compiler turn a greedy repeat into a possessive one when this cannot change what the pattern matches: when the repeated atom and every character that can possibly follow the repeat are provably disjoint, backtracking into the repeat is always futile, so a+b can be compiled as if it were a++b. PCRE2 performs the same optimization under the name "auto-possessification".
This is a pure compile-time transform: the POSSESSIVE_REPEAT opcodes exist since Python 3.11, and the analysis only fires when it can prove disjointness, so any imprecision can only cost an optimization, never change a match result.
The main benefit is failing or backtracking-heavy matches whose repeat is followed by a character set or category: the existing REPEAT_ONE fast path only covers literal tails, so for example \d+\. or \w+\s currently backtrack character by character. Measured speedups are 2.2–4.5× on such failing matches, and some patterns with catastrophic backtracking are defused as a side effect (though this is an optimization, not a security fix). Correction: where disjointness is provable, every give-back already fails at the first following atom, so this removes only a constant factor; the catastrophic nested-repeat patterns like (a+)+b are among the possible extensions below.
It also composes with the character class set operations (gh-152100): a repeat of a difference class that excludes its follower, such as [\w--\d]+\d, becomes possessive.
I have a working implementation validated by a differential fuzzer and will open a PR.
This is only a first step: the initial implementation is deliberately conservative, covering repeats of single-character atoms and of rigid multi-atom bodies, with the follower analysis traversing group boundaries, alternations, anchors and atomic groups. It can be extended later, for example to alternation bodies with pairwise-disjoint first characters, to word-boundary and lookahead followers, and to repeats of groups (the (a+)+b class).
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere.
Linked PRs
Feature or enhancement
Proposal
Make the re compiler turn a greedy repeat into a possessive one when this cannot change what the pattern matches: when the repeated atom and every character that can possibly follow the repeat are provably disjoint, backtracking into the repeat is always futile, so
a+bcan be compiled as if it werea++b. PCRE2 performs the same optimization under the name "auto-possessification".This is a pure compile-time transform: the POSSESSIVE_REPEAT opcodes exist since Python 3.11, and the analysis only fires when it can prove disjointness, so any imprecision can only cost an optimization, never change a match result.
The main benefit is failing or backtracking-heavy matches whose repeat is followed by a character set or category: the existing REPEAT_ONE fast path only covers literal tails, so for example
\d+\.or\w+\scurrently backtrack character by character. Measured speedups are 2.2–4.5× on such failing matches,and some patterns with catastrophic backtracking are defused as a side effect(though this is an optimization, not a security fix). Correction: where disjointness is provable, every give-back already fails at the first following atom, so this removes only a constant factor; the catastrophic nested-repeat patterns like(a+)+bare among the possible extensions below.It also composes with the character class set operations (gh-152100): a repeat of a difference class that excludes its follower, such as
[\w--\d]+\d, becomes possessive.I have a working implementation validated by a differential fuzzer and will open a PR.
This is only a first step: the initial implementation is deliberately conservative, covering repeats of single-character atoms and of rigid multi-atom bodies, with the follower analysis traversing group boundaries, alternations, anchors and atomic groups. It can be extended later, for example to alternation bodies with pairwise-disjoint first characters, to word-boundary and lookahead followers, and to repeats of groups (the
(a+)+bclass).Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere.
Linked PRs