gh-136063: WIP: Refactor email _header_value_parser and eliminate non-linear complexity by bitdancer · Pull Request #152710 · python/cpython

bitdancer · 2026-06-30T21:37:54Z

As a response to the report of quadratic complexity in the header value parser in the email package, I began a rewrite of the parser code. This has taken me considerably longer than I originally envisioned, since as I worked I realized I really wanted to rewrite all the tests to be more thorough, easier to understand, and easier to extend; and as I did the refactoring I kept finding things the parser was really doing wrong that I wanted to fix (mostly around encoded word edge cases).

At this point I'm about half way through the project, and Serhiy has proposed a patch to solve the quadratic complexity that is a minimal fix, rather than the rewrite that I'm currently engaged in, at the expense of breaking backward compatibility (which, since this is labeled as an internal module, might be OK).

I decided I should publish the work in progress so we can decide how we want to proceed.

If you look at this PR as an (un)finished whole, it will look pretty much like a replacement of the original code...because a lot of it is, in the end. However, I've structured the PR as a set of commits that can be reviewed as an incremental refactoring process. This is a curated set of changes: I do lots of rebase editing to keep the flow of the refactor consistent as I find new problems that result in new tests or require changes to earlier refactoring steps.

The refactoring is broken down in to two broad phases: refactoring and adding to the tests, and refactoring the code. Most of the commits are step-wise refactoring of the tests to make sure I didn't lose any test coverage. I do it this way for my own peace of mind, and to make it possible to ensure that any test changes on main get incorporated without, again, losing any test coverage. Along the way in the test refactoring part I fixed a few minor bugs in the existing parser, so that the test changes in the code refactoring phase are concerned primarily with functional improvements rather than minor bug fixes.

Since I took an incremental refactoring approach, I include backward compatibility wrappers so that at each step the code continues to pass all of its tests. This also allows for a deprecation period for the old parser API, if we wish to have one.

Like I said, this is half done (at least, I hope it is half :). In addition to the functions I haven't refactored yet, I also haven't dealt with the slicing for the error messages that Serhiy's PR handles, but I do have a plan for that ;)

Issue: Potential Quadratic Complexity Vulnerabilities in the email Module #136063

This is going to be pretty much a complete rewrite of the header value parser. As such, we want to do our best to ensure that we don't break anything. That means more tests. Writing tests for the email package is a bit enervating, given the number of different variations of input data and edge cases that really should be tested. I wrote the 'parameterize' helper to mitigate some of the pain of that test writing, but it was a pretty primitive tool and didn't really help all that much. To make it palatable to write more tests, I'm replacing that original, primitive decorator with a new parameterization framework informed by the years of test writing experience I've had since then. I'm pretty pleased with the result, which is conceptually based on a package I wrote a few years ago but never published outside of github. Its test suite gives a lot of usage examples, and it is much more intuitive than parameterize. As I move along in this refactor, there will be *many* more usage examples added to test__header_value_parser.

This new version gives a much more useful error message, and also supports specifying regexes to match the expected defects. It also no longer depends on the order of the defects, making it less fragile. In addition it supports using a function to produce the defect and regex, making it easier to verify defects that take arguments instead of a message string.

These will be testing for exact matches of the messages, but since there's only one place in the tests that need to be changed if the message changes, this seems reasonable. There could in theory be a defect subclass hierarchy here, but we'll leave that idea for some future refactor.

BUGFIX: In certain cases correctly detected defects were being dropped when combining syntactic units. These defects are now correctly copied to the higher level syntactic units, which will result in additional defects being reported in certain cases (mostly mailbox and MIME parameter lists). I found this bug fairly far along in the refactoring process, but it makes more sense to fix it early, since the test bug it triggered wasn't related to the code I was working on at the time.

This allows us to easily test that, for example, *every* non-printable character is treated as an error, not just the one used by a non-parameterized test.

The test refactoring process is a bit laborious. I ended up splitting the changesets up into incremental steps to make sure that I wasn't omitting or changing any tests that I didn't intend to. Some changesets are fairly clean as is, others are best viewed using git's --color-words or a similar substring rather than line based diff tool, such as github's diff display. Whether it will be worthwhile for anyone to review all these commits is an open question, but having them helped me get the refactoring correct, especially when I had to go back and fix mistakes and/or bugs at various points during the refactoring. Not to mention rebasing on main as some parser bugs were fixed there. Along the way we are going to fix a non-trivial number of small bugs. The commit messages will contain paragraphs starting with "BUGFIX:" for any such changes. All such bug fixes are internal: not worthy of a NEWS item beyond the one announcing the parser rewrite. Anything that doesn't fall into that category I intend to fix in main first in separate PRs. This commit is the setup, to make the next diff easier to read.

This functionality of this code will eventually get absorbed into a new function during the refactor, but as the test comments say, equivalent tests will still need to run against the new implementation. And we'll probably keep the function around until the deprecation period has expired.

This is more preparation for the refactoring steps. This is an expanded version of the old _test_get_x method that is much more general, works with the new test parameterization, has optional automatic testing of the 'start' attribute, and provides a convenience method for checking that only the expected token types are included in whatever is returned by the parsing method. This will be used to test both the new and the old API for all refactored methods. The refactoring process will be a several step process for each function: 1) refactor the existing tests to the new framework without changing the test arguments. This allows verification that the tests are the same if anyone cares to check. 2) simplify the test arguments using the parameterization, and in most cases add additional tests. 3) refactor the function to support the new API, and start testing both APIs. (1) is supported by having the first few keyword arguments to _test_parse be in the same order as the positional arguments in _test_get_x. In steps (1) and (2) we use the 'old_api_only' params map to call _test_method such that we ignore any warnings. In step (3) we switch to 'for_each_api', which expects a deprecation warning when using the old API call style. When the refactoring is complete, old_api_only can go away. When the deprecation happens, for_each_api can go away after a global s/for_each_api/Params/. Much of this code is informed by the later refactoring process, but it is easier to present and maintain it in this changeset.

This is an internal function whose functionality will get replaced during the refactor, but as the comment says, the replacement function will need to pass these tests as well. Adding these tests revealed a bug: the existing code wasn't properly reporting that there were quoted printables if the only ones were \\s, even though it was correctly decoding them. Clearly there are no test cases in the rest of the tests that cover this case, and since it only affects defect detection it has never been reported. I fixed the bug in the old code, even though the code will go away in the refactor, so that the tests are checking for the desired behavior. BUGFIX: the detection of obsolete quoted pairs in certain contexts (i.e.: domain literals) was imperfect, such that some defects of this type were missed. They should now be consistently detected.

This makes the next diff easier to compare, then we'll tidy up.

Unfortunately the existing behavior some of the new tests check is buggy: get_fws returns a WhitesSaceTerminal whose contents is an empty string, but 'value' still returns a single blank...neither of these is correct behavior. We'll deprecate calling get_fws that way when we do the refactor.

Code and comments to make the next diff cleaner. We're converting the _encoded_words tests because part of the parser refactor is going to involve a small refactor in this helper module.

Using 'actual_' as a prefix makes the error results clearer when tests fail. Adding 'defects=' doesn't really add much here, but it is the new pattern established in test__header_value_parser so it is nice to be consistent. The rest is a small simplification of how the encoder and decoder tests are factored.

Add comments to make the next diff cleaner.

And add regexes for the exceptions and defects. I replaced one test: the 'lang default is blank' test is redundant given that the test harness now tests that. I replaced it with one that checks that a non-default charset does appear in the charset attribute.

The params_map allows us to skip specifying 'C', and all these tests will fit on one line. I think it reads cleaner with the strings lined up.

Switch to 'content' instead of 'abc' for clarity as to which part it is, and update the names to harmonize with the ones I'm about to add.

content_getter of necessity passes all the tests that were written for get_unstructured, and then there are the additional ones to exercise the start, tl_type, text_type, end_chars, and qp keywords, plus test changes to test ew_indexes. content_getter is in many ways the heart of this refactor. It abstracts the lowest level of the parsing and makes it more consistent across RFC token types. It centralizes the handling of encoded words, which will allow us to decode them in more places (for better or worse). It also makes that handling much more efficient than the old code. As the generic workhorse method for the parser, content_getter knows almost nothing about the syntactic units or their requirements. Thus we tell it what token type list to use, what text type to wrap the terminals in, what characters are the stop characters, and whether or not quoted printables are to be decoded. And it tracks the encoded words that it decodes, recording their indexes in the input value. This last is because encoded words can be defects if they appear in various syntactic units, when we've decided that we're going to decode them anyway (which is most places). Only the higher level parsing functions can know if they are defects, so content_getter supplies a list of indexes that the higher level routines can use to generate defects. Right now a boolean flag would be enough, but eventually the defects will contain a pointer to the location of the defect in the value; tracking the index allows us to support that for EW defects.. In order to make it efficient for higher level functions to check for ew defects, this commit also adds support for copying the index list to each higher level token as the token lists are constructed. Otherwise the higher level functions would have to walk the parse tree for every token they process, whether those tokens had encoded words or not. We use a few cycles doing the copy as we go to avoid using a lot more cycles every time we process a token where encoded words are a defect. Getting all the new tests to pass required an addition to the get_encoded_word API: a 'decode_qp' keyword to allow telling it to decode quoted printables in the payload before decoding. content_getter also passes the _get_ptext_to_endchars tests (per the comment on those tests, since it is the replacement for that function), with the exception of the flag that _get_ptext_to_endchars returns about whether there was qp present or not. That flag functionality is to support get_dtext, where obsolete dtext can contain quoted printables but we need to register a defect. We won't be using content_getter for get_dtext, though, because we do *not* want to decode encoded words in domain literals for security reasons. We'll get the qp flag directly via _qp_unquote in get_dtext. In addition one test is changed, because by default content_getter does *not* stop at whitespace (that's a big part of the point of it), while _get_ptext_to_endchars does. Only one test tested for that, and by moving it we maintain the test that whitespace *works* as endchars, which is functionality we will use. The new test position is a slightly less thorough test of the _get_ptext_to_endchars API, but since that API is deprecated that doesn't matter.

We delete one of the _get_ptext_to_endchars tests that was checking for the correct non-detecting of quoted printables that came after the endchar because that test isn't relevant to content_getter. As noted that functionality will be handled directly by _qp_unquote. The fact that there is one less test for _get_ptext_to_endchars doesn't really matter since before this PR there were non at all.

This function is only RFC BNF adjacent, but is no further off than get_qp_ctext, which it will replace. Its tests add whitespace and ew support testing and reuses most of the tests from get_qp_ctext.

This doesn't add a lot of tests, but it does mean that any tests added to get_ccontent_sequence will have to be consciously skipped if they don't pass get_comment.

Reorganize the tests to put the ones involving whitespace together, so we can split those out in the next step.

get_bare_quoted string becomes the quoted-string equivalent of get_ccontent_sequence, implemented now via content_getter. It will replace get_qcontent the same way get_ccontent_sequence replaces get_qp_ctext. It is more closely aligned with the RFC BNF than get_ccontent_sequence, but now fully handles "dirty data" in the form of improperly encoded words. Unlike for get_ccontent_sequence, get_bare_quoted_string, while it is a functional replacement for get_content, is not a *direct* replacement: semantically it is different in that it also handles the required leading a trailing quotes (or the absence of the latter). We therefore can't re-use the get_qcontent tests, but I have verified that there are equivalent tests for each of them in the existing get_bare_quoted_string tests. BUGFIX: The fix for bpo-16983 enabled decoding of encoded words inside quoted strings, where they are technically invalid. That fix did not handle the not uncommon case of there being whitespace missing before such encoded words, which is now fixed. BUGFIX: The fix for bpo-16983 incorrectly registered a missing whitespace defect if an encoded word ended just before the trailing quote in a quoted string. This defect is no longer registered for that case. This changeset also tweaks the defect message for encoded word inside a quoted string to use the RFC notation "encoded-word" instead of "encoded word".

get_parameter uses it, but we'll fix that when we refactor that function.

This will replace get_atext. It handles encoded words at a level that is more useful to our error-recovery parsing. We want to recognize encoded words almost everywhere they occur, which means looking for them at the atext level. This will become clearer when we refactor get_dot_atom_text, but the disabled ew tests should give a clue.

BUGFIX: Per the RFC encoded words are allowed in comments, but previously we did not decode them. They are now correctly decoded.

No code is calling it any longer.

BUGFIX: get_cfws would return an empty token list if the input was empty or started with non-cfws. This is clearly buggy, since it makes the parse tree inaccurate. It now raises a HeaderParseError in this case. When called via the old API, it generates a warning message and continues to return the buggy value. The existing code never calls it wrong, and hopefully no one else does either.

BUGFIX: Previously, while encoded words in phrases (eg: display names) would be decoded if they were surrounded by whitespace, encoded words embedded in non-whitespace would not be. While the latter is not RFC compliant, most email clients do decode them. The email package now does so as well. BUGFIX: Missing leading whitespace on encoded words preceded by a comment was not previously reported as a defect, but is now. BUGFIX: The TokenList method 'startwith_fws' would raise an IndexError if the TokenList was empty. This could only be an issue to user code if that code directly manipulated the private _parse_tree attribute of headers, or used the private _header_value_parser module directly. This changeset starts the elimination of 'vtext' as a thing. Instead of an atom list being comprised of a mixture of atext and vtext, it is now all labeled as atext. This gives a more concrete meaning to the Terminal token_type: it is what context the text came from. This is how it is being used in the folder: things sourced from atext or ptext need to be quoted if they contain any specials. This should not be visible to user code unless it is really digging in to the internals of this private module or the _parse_tree attached to headers. As part of this refactoring I've added endswith_fws in parallel to the existing startswith_fws, to support the checks for missing whitespace.

Which adds EW support. Which cascades to partially fix EW support in get_dot_atom.

No code is using it at this point.

This completes the elimination of 'vtext' as a thing outside of direct calls to the old API. It also changes the treatment of the 'sort_of_valid_ew' case in the local part tests from an 'obs' case into a non-obs case, because now get_dot_atom is parsing it instead of raising an error.

Deprecating the lack of raise if there are no words behavior. BUGFIX: Previously get_phrase would return a phrase containing nothing or only comment folding whitespace if that's all there was. However, a phrase per the RFC must contain at least one word. get_phrase now correctly raises a HeaderParseError in this case. When called via the old API, the existing behavior is maintained, with a deprecation warning. In addition to the basic value -> value, start refactor, this commit adds correct missing ew whitespace detection to get_phrase. For some definition of correct. This commit also updates the 'comment_without_atom_in_obs_phrase' defect to talk about 'cfws' rather than 'comment', since that is what is really accepted as a defect. The constant name changes accordingly.

Along the way I've tidied up the 'dot' defects to all have the same format, and use 'local-part' instead of 'local part' to be consistent with the usage in other defects.

BUGFIX: get_local_part now correctly registers defects if encoded words are found in the local part.

bitdancer · 2026-06-30T21:42:03Z

Oh, and even if you don't look at the commits themselves you definitely want to look at the commit log, there's lots of information in there about the functional changes and why they are made.

bitdancer added 30 commits June 30, 2026 16:34

Make params the default for email test cases.

5797d0d

DROPME: temporary unittest usability hack.

a5d8705

Add for_each_character params filter.

f847561

This allows us to easily test that, for example, *every* non-printable character is treated as an error, not just the one used by a non-parameterized test.

Do _wsp_splitter test conversion.

664e2e3

Update _wsp_splitter test names.

2bb5c39

Begin refactoring of get_fws tests.

1f0ffc1

Rough conversion of existing get_fws tests.

872fb8a

This makes the next diff easier to compare, then we'll tidy up.

Fix whitespace and test names in get_fws tests.

522bc0d

Convert get_fws tests to keyword form.

302de5b

Improve the get_fws test names.

7af1603

Begin _encoded_words test refactor.

aa5fa3f

Code and comments to make the next diff cleaner. We're converting the _encoded_words tests because part of the parser refactor is going to involve a small refactor in this helper module.

Rough conversion of _encoded_word tests.

6cc9d14

Fix whitespace and test names in _encoded_words tests.

d3f4fac

Begin get_encoded_word test refactor.

d44e56d

Add comments to make the next diff cleaner.

Rough conversion of get_encoded_word tests.

2a18c46

Fix test names and whitespace in get_encoded_word tests.

a0a5aaa

Begin simplifying get_encoded_word 'raise' tests.

22f292a

Simplify get_encoded_word raise tests further.

17cc660

The params_map allows us to skip specifying 'C', and all these tests will fit on one line. I think it reads cleaner with the strings lined up.

Update get_encoded_word raise tests before adding more.

ad0423b

Switch to 'content' instead of 'abc' for clarity as to which part it is, and update the names to harmonize with the ones I'm about to add.

bitdancer added 23 commits June 30, 2026 16:34

Refactor parse_unstructured to use content_getter.

ce077d4

Add get_ccontent_sequence.

e541d1c

This function is only RFC BNF adjacent, but is no further off than get_qp_ctext, which it will replace. Its tests add whitespace and ew support testing and reuses most of the tests from get_qp_ctext.

Have get_comment reuse the get_ccontent_sequence tests.

2c3e1bb

This doesn't add a lot of tests, but it does mean that any tests added to get_ccontent_sequence will have to be consciously skipped if they don't pass get_comment.

Preliminary deprecation of get_qp_ctext.

5e07f12

Reorganize the tests to put the ones involving whitespace together, so we can split those out in the next step.

Set up easy POSTDEP removal for get_qp_ctext.

016047c

Preliminary deprecation of get_qcontent.

d5dff26

get_parameter uses it, but we'll fix that when we refactor that function.

Preliminary deprecation of get_atext.

9103f27

Refactor get_comment.

24087d8

BUGFIX: Per the RFC encoded words are allowed in comments, but previously we did not decode them. They are now correctly decoded.

Finalize deprecation of get_qp_ctext.

51e075a

No code is calling it any longer.

Refactor get_quoted_string.

ac79897

Refactor get_dot_atom_text.

e919de5

Which adds EW support. Which cascades to partially fix EW support in get_dot_atom.

Finalize deprecation of get_atext.

578913f

No code is using it at this point.

Refactor get_word.

2755b6d

Refactor get_obs_local_part.

c185843

Along the way I've tidied up the 'dot' defects to all have the same format, and use 'local-part' instead of 'local part' to be consistent with the usage in other defects.

Refactor get_local_part.

87c85ab

BUGFIX: get_local_part now correctly registers defects if encoded words are found in the local part.

bitdancer requested a review from serhiy-storchaka June 30, 2026 21:37

bitdancer requested a review from a team as a code owner June 30, 2026 21:37

bedevere-app Bot added the awaiting core review label Jun 30, 2026

bedevere-app Bot mentioned this pull request Jun 30, 2026

Potential Quadratic Complexity Vulnerabilities in the email Module #136063

Open

bitdancer marked this pull request as draft June 30, 2026 21:40

bedevere-app Bot removed the awaiting core review label Jun 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

gh-136063: WIP: Refactor email _header_value_parser and eliminate non-linear complexity#152710

gh-136063: WIP: Refactor email _header_value_parser and eliminate non-linear complexity#152710
bitdancer wants to merge 152 commits into
python:mainfrom
bitdancer:hvp_refactor

bitdancer commented Jun 30, 2026 •

edited by bedevere-app Bot

Loading

Uh oh!

bitdancer commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

bitdancer commented Jun 30, 2026 • edited by bedevere-app Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bitdancer commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bitdancer commented Jun 30, 2026 •

edited by bedevere-app Bot

Loading