HTML API: Ensure deferred byte processing is applied at read interfaces by sirreal · Pull Request #12385 · WordPress/wordpress-develop

sirreal · 2026-07-01T18:49:42Z

The Tag Processor defers some byte-level processing until values are read. That deferred processing is currently applied inconsistently.

Newlines should be normalized (CRLF and CR become LF). This comes from the input stream preprocessing phase.

NULL bytes are replaced with U+FFFD (REPLACEMENT CHARACTER). Where this applies depends on the tokenization rules; relevant for this PR is that it applies in tag names, attribute names, and attribute values.

For example (invisible characters have been replaced with visual representations):

<di␀v att␀x="␍x␀␍␊">␀x␀</di�v>y

Parses as:

└─DI␀V att␀x="␍x␀␍␊"
  └─#text "xy"

Expected:

├─DI�V att�x="␊x�␊"
│ └─#text "x"
└─#text "y"
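
The two replacements described above can be sketched in a few lines of PHP. This is only an illustration of the byte-level rules, not the code in this PR; the function name `sketch_preprocess_source_bytes()` is hypothetical.

```php
<?php
/**
 * Hypothetical helper (not part of the HTML API): applies newline
 * normalization and NULL replacement to a raw source span.
 */
function sketch_preprocess_source_bytes( string $raw ): string {
	// Replace CRLF first so a pair collapses to one LF, then any lone CR.
	$normalized = str_replace( array( "\r\n", "\r" ), "\n", $raw );

	// U+0000 becomes U+FFFD, which is the bytes EF BF BD in UTF-8.
	return str_replace( "\x00", "\u{FFFD}", $normalized );
}

// The attribute value from the example: CR, "x", NULL, CR, LF.
$value = sketch_preprocess_source_bytes( "\rx\x00\r\n" );
```

For that input, `$value` holds LF, "x", U+FFFD, LF, which is the attribute value shown in the expected tree.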

Trac ticket: https://core.trac.wordpress.org/ticket/65372

Use of AI Tools

AI assistance: Yes
Tool(s): Codex
Model(s): GPT-5.5
Used for: Fuzz testing, diagnosis, initial implementation, review.

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Red TDD step: browser-verified expectations for raw CR/CRLF/NUL in attribute values; passing pins for encoded / and for verbatim pass-through of API-supplied values. See #65372.

Attribute values read from the input document now normalize newlines (CRLF/CR to LF) and replace U+0000 NULL bytes with U+FFFD before decoding character references, matching what browsers produce for the same markup. Values enqueued through set_attribute() are plaintext API values and continue to pass through unchanged. See #65372.

Red TDD step: flushing add_class()/remove_class() updates must read the existing class attribute through the same input preprocessing as get_attribute(), normalizing newlines and replacing NULL bytes. See #65372.

class_name_updates_to_attributes_updates() reads the existing class value through the same preprocessing helper as get_attribute(), so add_class()/remove_class() no longer rebuild the attribute from raw source bytes containing CR or NULL. See #65372.

Red TDD step: browser-verified expectations that attribute names are exposed and addressed with U+FFFD replacing NULL bytes, that names collapsing after replacement behave as duplicates of one attribute, and that attribute updates target the replaced name. See #65372.

Attribute lookup keys are normalized where they are created, in parse_next_attribute(): NULL bytes are replaced with U+FFFD before lowercasing, as the tokenizer does in browsers. Names which collapse to the same replaced name are duplicates of one attribute (first one wins), lookups by the raw NULL spelling no longer match, and updates or removals by the replaced name target the source attribute. Raw document spans are untouched. See #65372.
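
A minimal sketch of the lookup-key rule described above, assuming PHP 8.2 or later, where `strtolower()` performs ASCII-only case folding. The function `sketch_index_attributes()` is hypothetical and is not the body of parse_next_attribute(); it only shows replacement before lowercasing and the first-one-wins handling of names that collapse to the same key.

```php
<?php
/**
 * Hypothetical sketch: map each normalized lookup key to the position of
 * the first source attribute that produced it.
 *
 * @param string[] $raw_names Attribute names in source order.
 * @return array<string, int> Lookup key => index of the winning attribute.
 */
function sketch_index_attributes( array $raw_names ): array {
	$index = array();
	foreach ( $raw_names as $position => $raw_name ) {
		// Replace NULL bytes first, then lowercase the result.
		$key = strtolower( str_replace( "\x00", "\u{FFFD}", $raw_name ) );

		// A later name that collapses to an existing key is a duplicate.
		if ( ! isset( $index[ $key ] ) ) {
			$index[ $key ] = $position;
		}
	}
	return $index;
}

// The second name collapses to the same key as the first and is ignored.
$index = sketch_index_attributes( array( "att\x00x", "ATT\u{FFFD}X", 'id' ) );
```

In this sketch a lookup by the raw NULL spelling finds nothing, because only the replaced spelling exists as a key.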

Red TDD step: tag names are exposed with U+FFFD replacing NULL bytes; passing pins confirm NULL bytes never select rawtext parsing and never appear in PI-lookalike comment tag names. See #65372.

get_tag() (and get_token_name(), which delegates to it) returns tag names with U+0000 NULL bytes replaced by U+FFFD, as the tokenizer does in browsers. Internal token identification continues to compare raw bytes: a NULL byte in a tag name already prevents rawtext detection, matching browsers, where the replaced name likewise never equals SCRIPT or the other special names. See #65372.

Red TDD step: browser-verified expectation that classList-equivalent reads preserve NULL bytes in values set through the API; the U+0000 replacement belongs to the tokenizer, and document-sourced values already receive it in get_attribute(). See #65372.

class_list() received its NULL-byte replacement when reading raw class values; that replacement now happens in get_attribute() for values from the input document. Performing it on API-supplied values diverged from browsers, where classList preserves NULL bytes in values set via setAttribute(). See #65372.

Benchmark-guided: reading an attribute value applies up to three str_replace passes, which doubled the read cost for long values containing no bytes that need replacement. Guarding with strpos keeps the common case at two fast scans; values are typically free of CR and NULL. Benchmark (PHP 8.4, medians of 3): scanning 100-tag documents and reading 3 attributes each, 2000 iterations: trunk 667ms, unguarded 714ms, guarded 699ms. Reading a 10.8KB clean attribute value 200k times: trunk 147ms, unguarded 313ms, guarded 258ms. The remaining cost is the unavoidable byte inspection. See #65372.
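
The guard can be pictured roughly as follows. This is a sketch under assumptions, not the PR's implementation: `sketch_guarded_preprocess()` is a hypothetical name, and character reference decoding is left out to keep the focus on the two `strpos()` scans.

```php
<?php
/**
 * Hypothetical sketch of guarding the replacement passes with strpos().
 */
function sketch_guarded_preprocess( string $value ): string {
	// Two fast scans decide whether any replacement pass is needed.
	$has_cr   = false !== strpos( $value, "\r" );
	$has_null = false !== strpos( $value, "\x00" );

	// Common case: nothing to replace, so return the input untouched.
	if ( ! $has_cr && ! $has_null ) {
		return $value;
	}

	if ( $has_cr ) {
		// CRLF first, then lone CR, so a pair yields a single LF.
		$value = str_replace( array( "\r\n", "\r" ), "\n", $value );
	}

	if ( $has_null ) {
		$value = str_replace( "\x00", "\u{FFFD}", $value );
	}

	return $value;
}
```

A clean value pays only for the two scans, while a value containing CR or NULL pays for the scans plus the passes it actually needs.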

Red TDD step from adversarial review: a named character reference without a terminating semicolon must decode when followed by a NULL byte or any non-ASCII byte. Replacing NULL with U+FFFD before decoding fed the decoder a multi-byte follower whose classification by ctype_alnum() depends on the process locale, suppressing valid decodes in attribute values, diverging from browsers and from trunk. See #65372.

The tokenizer replaces U+0000 NULL bytes as it consumes input, so a character reference without a terminating semicolon sees the raw NULL byte as its follower, which is unambiguous, and the reference decodes. Replacing before decoding handed the decoder U+FFFD's lead byte, whose ctype_alnum() classification depends on the process locale, wrongly suppressing the decode under UTF-8 locales. No character reference decodes into NULL, so replacing after decoding is equivalent for the value's own bytes and faithful to the tokenizer's order. See #65372.

Per the named-character-reference state, a semicolon-less reference is ambiguous only when followed by an ASCII alphanumeric or equals sign. ctype_alnum() classifies bytes 0x80 and above as alphanumeric under UTF-8 locales, wrongly suppressing decodes followed by any non-ASCII byte and making decoding depend on the process locale. See #65372.
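
One locale-independent way to express that check is to compare byte values directly instead of calling ctype_alnum(). The helper below is a hypothetical sketch, not the code in this PR; it assumes `$at` is the byte offset immediately after the matched reference name.

```php
<?php
/**
 * Hypothetical helper: reports whether the byte following a semicolon-less
 * named character reference makes it ambiguous inside an attribute value.
 * Only ASCII alphanumerics and "=" count; NULL and bytes >= 0x80 do not.
 */
function sketch_is_ambiguous_follower( string $document, int $at ): bool {
	// Nothing follows the reference at the end of the input.
	if ( $at >= strlen( $document ) ) {
		return false;
	}

	$code = ord( $document[ $at ] );

	return (
		0x3D === $code ||                     // "="
		( $code >= 0x30 && $code <= 0x39 ) || // 0-9
		( $code >= 0x41 && $code <= 0x5A ) || // A-Z
		( $code >= 0x61 && $code <= 0x7A )    // a-z
	);
}
```

With this check, `&not` followed by a NULL byte or by the lead byte of a multi-byte UTF-8 character is not ambiguous, while `&notit` and `&not=` are, whatever the process locale.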

Red TDD step from adversarial review: next_tag() must match tag names in the same U+FFFD-replaced alphabet that get_tag() exposes, so the getter round-trips into queries, raw NULL spellings match nothing, and the Tag Processor agrees with the HTML Processor, whose queries already compare against the replaced token name. See #65372.

next_tag() compared sought tag names against raw document bytes while get_tag() returns names with NULL bytes replaced by U+FFFD, breaking the getter-to-query round trip and disagreeing with the HTML Processor's queries. Matching now happens in the exposed alphabet; the existing byte comparison is unchanged for names without NULL bytes, so the hot path costs the same. See #65372.
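
A rough sketch of matching in the exposed alphabet while leaving the common path alone. The function `sketch_tag_name_matches()` is hypothetical and stands in for the comparison inside next_tag(); it assumes PHP 8.2 or later, where `strcasecmp()` folds ASCII case only.

```php
<?php
/**
 * Hypothetical sketch: compare a raw tag name from the document against a
 * sought name, using the same U+FFFD-replaced spelling that a getter exposes.
 */
function sketch_tag_name_matches( string $raw_tag_name, string $sought ): bool {
	// Hot path: names without NULL bytes are compared as they are.
	if ( false === strpos( $raw_tag_name, "\x00" ) ) {
		return 0 === strcasecmp( $raw_tag_name, $sought );
	}

	// Otherwise compare against the replaced spelling.
	$exposed = str_replace( "\x00", "\u{FFFD}", $raw_tag_name );
	return 0 === strcasecmp( $exposed, $sought );
}
```

Here a query for the replaced spelling matches a source tag containing a NULL byte, and a query using the raw NULL spelling matches nothing.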

Red TDD step from adversarial review: get_attribute( 'CLASS' ) returned a stale value when class updates were pending, because the flush guard compared the attribute name case-sensitively. See #65372.

Attribute lookups are ASCII-case-insensitive, but the pending-class flush in get_attribute() compared the requested name case-sensitively, returning a stale value for spellings like "CLASS". See #65372.
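
The difference between the two comparisons can be shown with a small sketch. The function name is hypothetical and the real flush guard in get_attribute() may be written differently; this assumes PHP 8.2 or later, where `strcasecmp()` is not affected by the locale.

```php
<?php
/**
 * Hypothetical sketch: decide whether a requested attribute name refers to
 * the class attribute, ignoring ASCII case.
 */
function sketch_is_class_attribute( string $requested_name ): bool {
	return 0 === strcasecmp( 'class', $requested_name );
}

// A case-sensitive comparison misses this spelling; the sketch does not.
$strict_match  = ( 'class' === 'CLASS' );
$relaxed_match = sketch_is_class_attribute( 'CLASS' );
```

A guard based on the strict comparison skips the flush for "CLASS", which is how the stale value described above could be returned.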

From adversarial review: pins for class helpers over replaced source values, boolean attributes with NULL-byte names, verbatim prefix matching in get_attribute_names_with_prefix(), and HTML Processor end-tag matching across NULL and U+FFFD spellings (browser-verified: both spellings tokenize to the same name). Documents the @since 7.1.0 behavior on indirectly-affected getters and the known asymmetry of set_modifiable_text(), whose value reads back normalized unlike attribute values, which round-trip verbatim. See #65372.

github-actions · 2026-07-01T18:49:55Z

Hi there! 👋

Thank you for your contribution to WordPress! 💖

It looks like this is your first pull request to wordpress-develop. Here are a few things to be aware of that may help you out!

No one monitors this repository for new pull requests. Pull requests must be attached to a Trac ticket to be considered for inclusion in WordPress Core. To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description.

Pull requests are never merged on GitHub. The WordPress codebase continues to be managed through the SVN repository that this GitHub repository mirrors. Please feel free to open pull requests to work on any contribution you are making.

More information about how GitHub pull requests can be used to contribute to WordPress can be found in the Core Handbook.

Please include automated tests. Including tests in your pull request is one way to help your patch be considered faster. To learn about WordPress' test suites, visit the Automated Testing page in the handbook.

If you have not had a chance, please review the Contribute with Code page in the WordPress Core Handbook.

The Developer Hub also documents the various coding standards that are followed:

Thank you,
The WordPress Project

github-actions · 2026-07-01T18:53:48Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

github-actions · 2026-07-01T18:58:52Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

sirreal added 30 commits June 11, 2026 18:29

HTML API: Add tests for attribute value input preprocessing.

cee0661

HTML API: Add tests for class updates over preprocessed values.

48d8fb4

HTML API: Add tests for NULL bytes in tag names.

135157f

HTML API: Add test for case-insensitive class update flushing.

5292c7d

HTML API: Flush class updates for any case spelling of "class".

8c26adf

Merge remote-tracking branch 'upstream/trunk' into HEAD

88a4d52

Merge branch 'trunk' into spec-compliant-getters

7f64468

Revert irrelevant doc changes

62d682f

Remove redundant tests

b7d4b39

Remove excessive docs

30b17da

Simplify tag name matching logic

7e451fe

Remove excessive documentation

b3d15f5

simplify comment

624fe63

Remove excessive documentation

5293caa

Simplify attribute value getter

7f77e93

Ignore new private method

908f4b3

sirreal added 4 commits July 1, 2026 20:30

Rework new function docs

fed8e18

Remove excessive docs

3c61f03

Reformat comment

480bce8

Test function types

d701662

sirreal mentioned this pull request Jul 1, 2026

HTML API: Apply input preprocessing consistently at Tag Processor read boundaries sirreal/wordpress-develop#53

Closed

sirreal marked this pull request as ready for review July 1, 2026 18:58

sirreal requested a review from dmsnell July 1, 2026 19:10

sirreal wants to merge 34 commits into WordPress:trunk from sirreal:spec-compliant-getters
