gh-136063: WIP: Refactor email _header_value_parser and eliminate non-linear complexity#152710
Draft
bitdancer wants to merge 152 commits into
Draft
gh-136063: WIP: Refactor email _header_value_parser and eliminate non-linear complexity#152710bitdancer wants to merge 152 commits into
bitdancer wants to merge 152 commits into
Conversation
This is going to be pretty much a complete rewrite of the header value parser. As such, we want to do our best to ensure that we don't break anything. That means more tests. Writing tests for the email package is a bit enervating, given the number of different variations of input data and edge cases that really should be tested. I wrote the 'parameterize' helper to mitigate some of the pain of that test writing, but it was a pretty primitive tool and didn't really help all that much. To make it palatable to write more tests, I'm replacing that original, primitive decorator with a new parameterization framework informed by the years of test writing experience I've had since then. I'm pretty pleased with the result, which is conceptually based on a package I wrote a few years ago but never published outside of github. Its test suite gives a lot of usage examples, and it is much more intuitive than parameterize. As I move along in this refactor, there will be *many* more usage examples added to test__header_value_parser.
This new version gives a much more useful error message, and also supports specifying regexes to match the expected defects. It also no longer depends on the order of the defects, making it less fragile. In addition it supports using a function to produce the defect and regex, making it easier to verify defects that take arguments instead of a message string.
These will be testing for exact matches of the messages, but since there's only one place in the tests that need to be changed if the message changes, this seems reasonable. There could in theory be a defect subclass hierarchy here, but we'll leave that idea for some future refactor.
BUGFIX: In certain cases correctly detected defects were being dropped when combining syntactic units. These defects are now correctly copied to the higher level syntactic units, which will result in additional defects being reported in certain cases (mostly mailbox and MIME parameter lists). I found this bug fairly far along in the refactoring process, but it makes more sense to fix it early, since the test bug it triggered wasn't related to the code I was working on at the time.
This allows us to easily test that, for example, *every* non-printable character is treated as an error, not just the one used by a non-parameterized test.
The test refactoring process is a bit laborious. I ended up splitting the changesets up into incremental steps to make sure that I wasn't omitting or changing any tests that I didn't intend to. Some changesets are fairly clean as is, others are best viewed using git's --color-words or a similar substring rather than line based diff tool, such as github's diff display. Whether it will be worthwhile for anyone to review all these commits is an open question, but having them helped me get the refactoring correct, especially when I had to go back and fix mistakes and/or bugs at various points during the refactoring. Not to mention rebasing on main as some parser bugs were fixed there. Along the way we are going to fix a non-trivial number of small bugs. The commit messages will contain paragraphs starting with "BUGFIX:" for any such changes. All such bug fixes are internal: not worthy of a NEWS item beyond the one announcing the parser rewrite. Anything that doesn't fall into that category I intend to fix in main first in separate PRs. This commit is the setup, to make the next diff easier to read.
This functionality of this code will eventually get absorbed into a new function during the refactor, but as the test comments say, equivalent tests will still need to run against the new implementation. And we'll probably keep the function around until the deprecation period has expired.
This is more preparation for the refactoring steps. This is an expanded
version of the old _test_get_x method that is much more general, works
with the new test parameterization, has optional automatic testing of
the 'start' attribute, and provides a convenience method for checking
that only the expected token types are included in whatever is returned
by the parsing method.
This will be used to test both the new and the old API for all
refactored methods. The refactoring process will be a several step
process for each function:
1) refactor the existing tests to the new framework without
changing the test arguments. This allows verification that
the tests are the same if anyone cares to check.
2) simplify the test arguments using the parameterization, and
in most cases add additional tests.
3) refactor the function to support the new API, and start testing
both APIs.
(1) is supported by having the first few keyword arguments to
_test_parse be in the same order as the positional arguments in
_test_get_x. In steps (1) and (2) we use the 'old_api_only' params map
to call _test_method such that we ignore any warnings. In step (3) we
switch to 'for_each_api', which expects a deprecation warning when using
the old API call style. When the refactoring is complete, old_api_only
can go away. When the deprecation happens, for_each_api can go away
after a global s/for_each_api/Params/.
Much of this code is informed by the later refactoring process, but it
is easier to present and maintain it in this changeset.
This is an internal function whose functionality will get replaced during the refactor, but as the comment says, the replacement function will need to pass these tests as well. Adding these tests revealed a bug: the existing code wasn't properly reporting that there were quoted printables if the only ones were \\s, even though it was correctly decoding them. Clearly there are no test cases in the rest of the tests that cover this case, and since it only affects defect detection it has never been reported. I fixed the bug in the old code, even though the code will go away in the refactor, so that the tests are checking for the desired behavior. BUGFIX: the detection of obsolete quoted pairs in certain contexts (i.e.: domain literals) was imperfect, such that some defects of this type were missed. They should now be consistently detected.
This makes the next diff easier to compare, then we'll tidy up.
Unfortunately the existing behavior some of the new tests check is buggy: get_fws returns a WhitesSaceTerminal whose contents is an empty string, but 'value' still returns a single blank...neither of these is correct behavior. We'll deprecate calling get_fws that way when we do the refactor.
Code and comments to make the next diff cleaner. We're converting the _encoded_words tests because part of the parser refactor is going to involve a small refactor in this helper module.
Using 'actual_' as a prefix makes the error results clearer when tests fail. Adding 'defects=' doesn't really add much here, but it is the new pattern established in test__header_value_parser so it is nice to be consistent. The rest is a small simplification of how the encoder and decoder tests are factored.
Add comments to make the next diff cleaner.
And add regexes for the exceptions and defects. I replaced one test: the 'lang default is blank' test is redundant given that the test harness now tests that. I replaced it with one that checks that a non-default charset does appear in the charset attribute.
The params_map allows us to skip specifying 'C', and all these tests will fit on one line. I think it reads cleaner with the strings lined up.
Switch to 'content' instead of 'abc' for clarity as to which part it is, and update the names to harmonize with the ones I'm about to add.
content_getter of necessity passes all the tests that were written for get_unstructured, and then there are the additional ones to exercise the start, tl_type, text_type, end_chars, and qp keywords, plus test changes to test ew_indexes. content_getter is in many ways the heart of this refactor. It abstracts the lowest level of the parsing and makes it more consistent across RFC token types. It centralizes the handling of encoded words, which will allow us to decode them in more places (for better or worse). It also makes that handling much more efficient than the old code. As the generic workhorse method for the parser, content_getter knows almost nothing about the syntactic units or their requirements. Thus we tell it what token type list to use, what text type to wrap the terminals in, what characters are the stop characters, and whether or not quoted printables are to be decoded. And it tracks the encoded words that it decodes, recording their indexes in the input value. This last is because encoded words can be defects if they appear in various syntactic units, when we've decided that we're going to decode them anyway (which is most places). Only the higher level parsing functions can know if they are defects, so content_getter supplies a list of indexes that the higher level routines can use to generate defects. Right now a boolean flag would be enough, but eventually the defects will contain a pointer to the location of the defect in the value; tracking the index allows us to support that for EW defects.. In order to make it efficient for higher level functions to check for ew defects, this commit also adds support for copying the index list to each higher level token as the token lists are constructed. Otherwise the higher level functions would have to walk the parse tree for every token they process, whether those tokens had encoded words or not. We use a few cycles doing the copy as we go to avoid using a lot more cycles every time we process a token where encoded words are a defect. Getting all the new tests to pass required an addition to the get_encoded_word API: a 'decode_qp' keyword to allow telling it to decode quoted printables in the payload before decoding. content_getter also passes the _get_ptext_to_endchars tests (per the comment on those tests, since it is the replacement for that function), with the exception of the flag that _get_ptext_to_endchars returns about whether there was qp present or not. That flag functionality is to support get_dtext, where obsolete dtext can contain quoted printables but we need to register a defect. We won't be using content_getter for get_dtext, though, because we do *not* want to decode encoded words in domain literals for security reasons. We'll get the qp flag directly via _qp_unquote in get_dtext. In addition one test is changed, because by default content_getter does *not* stop at whitespace (that's a big part of the point of it), while _get_ptext_to_endchars does. Only one test tested for that, and by moving it we maintain the test that whitespace *works* as endchars, which is functionality we will use. The new test position is a slightly less thorough test of the _get_ptext_to_endchars API, but since that API is deprecated that doesn't matter.
We delete one of the _get_ptext_to_endchars tests that was checking for the correct non-detecting of quoted printables that came after the endchar because that test isn't relevant to content_getter. As noted that functionality will be handled directly by _qp_unquote. The fact that there is one less test for _get_ptext_to_endchars doesn't really matter since before this PR there were non at all.
This function is only RFC BNF adjacent, but is no further off than get_qp_ctext, which it will replace. Its tests add whitespace and ew support testing and reuses most of the tests from get_qp_ctext.
This doesn't add a lot of tests, but it does mean that any tests added to get_ccontent_sequence will have to be consciously skipped if they don't pass get_comment.
Reorganize the tests to put the ones involving whitespace together, so we can split those out in the next step.
get_bare_quoted string becomes the quoted-string equivalent of get_ccontent_sequence, implemented now via content_getter. It will replace get_qcontent the same way get_ccontent_sequence replaces get_qp_ctext. It is more closely aligned with the RFC BNF than get_ccontent_sequence, but now fully handles "dirty data" in the form of improperly encoded words. Unlike for get_ccontent_sequence, get_bare_quoted_string, while it is a functional replacement for get_content, is not a *direct* replacement: semantically it is different in that it also handles the required leading a trailing quotes (or the absence of the latter). We therefore can't re-use the get_qcontent tests, but I have verified that there are equivalent tests for each of them in the existing get_bare_quoted_string tests. BUGFIX: The fix for bpo-16983 enabled decoding of encoded words inside quoted strings, where they are technically invalid. That fix did not handle the not uncommon case of there being whitespace missing before such encoded words, which is now fixed. BUGFIX: The fix for bpo-16983 incorrectly registered a missing whitespace defect if an encoded word ended just before the trailing quote in a quoted string. This defect is no longer registered for that case. This changeset also tweaks the defect message for encoded word inside a quoted string to use the RFC notation "encoded-word" instead of "encoded word".
get_parameter uses it, but we'll fix that when we refactor that function.
This will replace get_atext. It handles encoded words at a level that is more useful to our error-recovery parsing. We want to recognize encoded words almost everywhere they occur, which means looking for them at the atext level. This will become clearer when we refactor get_dot_atom_text, but the disabled ew tests should give a clue.
BUGFIX: Per the RFC encoded words are allowed in comments, but previously we did not decode them. They are now correctly decoded.
No code is calling it any longer.
BUGFIX: get_cfws would return an empty token list if the input was empty or started with non-cfws. This is clearly buggy, since it makes the parse tree inaccurate. It now raises a HeaderParseError in this case. When called via the old API, it generates a warning message and continues to return the buggy value. The existing code never calls it wrong, and hopefully no one else does either.
BUGFIX: Previously, while encoded words in phrases (eg: display names) would be decoded if they were surrounded by whitespace, encoded words embedded in non-whitespace would not be. While the latter is not RFC compliant, most email clients do decode them. The email package now does so as well. BUGFIX: Missing leading whitespace on encoded words preceded by a comment was not previously reported as a defect, but is now. BUGFIX: The TokenList method 'startwith_fws' would raise an IndexError if the TokenList was empty. This could only be an issue to user code if that code directly manipulated the private _parse_tree attribute of headers, or used the private _header_value_parser module directly. This changeset starts the elimination of 'vtext' as a thing. Instead of an atom list being comprised of a mixture of atext and vtext, it is now all labeled as atext. This gives a more concrete meaning to the Terminal token_type: it is what context the text came from. This is how it is being used in the folder: things sourced from atext or ptext need to be quoted if they contain any specials. This should not be visible to user code unless it is really digging in to the internals of this private module or the _parse_tree attached to headers. As part of this refactoring I've added endswith_fws in parallel to the existing startswith_fws, to support the checks for missing whitespace.
Which adds EW support. Which cascades to partially fix EW support in get_dot_atom.
No code is using it at this point.
This completes the elimination of 'vtext' as a thing outside of direct calls to the old API. It also changes the treatment of the 'sort_of_valid_ew' case in the local part tests from an 'obs' case into a non-obs case, because now get_dot_atom is parsing it instead of raising an error.
Deprecating the lack of raise if there are no words behavior. BUGFIX: Previously get_phrase would return a phrase containing nothing or only comment folding whitespace if that's all there was. However, a phrase per the RFC must contain at least one word. get_phrase now correctly raises a HeaderParseError in this case. When called via the old API, the existing behavior is maintained, with a deprecation warning. In addition to the basic value -> value, start refactor, this commit adds correct missing ew whitespace detection to get_phrase. For some definition of correct. This commit also updates the 'comment_without_atom_in_obs_phrase' defect to talk about 'cfws' rather than 'comment', since that is what is really accepted as a defect. The constant name changes accordingly.
Along the way I've tidied up the 'dot' defects to all have the same format, and use 'local-part' instead of 'local part' to be consistent with the usage in other defects.
BUGFIX: get_local_part now correctly registers defects if encoded words are found in the local part.
Member
Author
|
Oh, and even if you don't look at the commits themselves you definitely want to look at the commit log, there's lots of information in there about the functional changes and why they are made. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
As a response to the report of quadratic complexity in the header value parser in the email package, I began a rewrite of the parser code. This has taken me considerably longer than I originally envisioned, since as I worked I realized I really wanted to rewrite all the tests to be more thorough, easier to understand, and easier to extend; and as I did the refactoring I kept finding things the parser was really doing wrong that I wanted to fix (mostly around encoded word edge cases).
At this point I'm about half way through the project, and Serhiy has proposed a patch to solve the quadratic complexity that is a minimal fix, rather than the rewrite that I'm currently engaged in, at the expense of breaking backward compatibility (which, since this is labeled as an internal module, might be OK).
I decided I should publish the work in progress so we can decide how we want to proceed.
If you look at this PR as an (un)finished whole, it will look pretty much like a replacement of the original code...because a lot of it is, in the end. However, I've structured the PR as a set of commits that can be reviewed as an incremental refactoring process. This is a curated set of changes: I do lots of rebase editing to keep the flow of the refactor consistent as I find new problems that result in new tests or require changes to earlier refactoring steps.
The refactoring is broken down in to two broad phases: refactoring and adding to the tests, and refactoring the code. Most of the commits are step-wise refactoring of the tests to make sure I didn't lose any test coverage. I do it this way for my own peace of mind, and to make it possible to ensure that any test changes on main get incorporated without, again, losing any test coverage. Along the way in the test refactoring part I fixed a few minor bugs in the existing parser, so that the test changes in the code refactoring phase are concerned primarily with functional improvements rather than minor bug fixes.
Since I took an incremental refactoring approach, I include backward compatibility wrappers so that at each step the code continues to pass all of its tests. This also allows for a deprecation period for the old parser API, if we wish to have one.
Like I said, this is half done (at least, I hope it is half :). In addition to the functions I haven't refactored yet, I also haven't dealt with the slicing for the error messages that Serhiy's PR handles, but I do have a plan for that ;)
emailModule #136063