normalize_url lowercases the entire URL, silently deduplicating case-distinct pages #2008

Description

@vdusek

normalize_url lowercases the entire URL, including the path and query, even though its docstring says it only converts the scheme and netloc to lower case:

# Construct the final URL
yarl_new_url = parsed_url.with_query(sorted_search_params)
yarl_new_url = yarl_new_url.with_path(
    yarl_new_url.path.removesuffix('/'), keep_query=True, keep_fragment=keep_url_fragment
)
return str(yarl_new_url).lower()

Since compute_unique_key uses the normalized URL as the default unique_key, any two URLs that differ only in path or query casing collide:

  • https://example.com/Product/ABC and https://example.com/product/abc produce the same unique key.
  • https://example.com/?token=SeCrEt and https://example.com/?token=secret collide as well.
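The collision can be illustrated without the library. The sketch below is only a stand-in: `whole_url_lowercased` is a hypothetical helper that mimics the final `str(...).lower()` step and nothing else (no query sorting, no trailing-slash handling), and the `seen` set plays the role of a deduplicating request queue.

```python
def whole_url_lowercased(url: str) -> str:
    """Hypothetical stand-in: mimics only the final `str(...).lower()` step."""
    return url.lower()


urls = [
    'https://example.com/Product/ABC',
    'https://example.com/product/abc',
    'https://example.com/?token=SeCrEt',
    'https://example.com/?token=secret',
]

seen: set[str] = set()
kept: list[str] = []
for url in urls:
    key = whole_url_lowercased(url)
    if key in seen:
        # Dropped without any message, as a deduplicating queue would do.
        continue
    seen.add(key)
    kept.append(url)

print(kept)
```

Only the first URL of each pair survives; the second one never reaches the crawler.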

Per RFC 3986, only the scheme and host are case-insensitive. The path and query are case-sensitive.

Impact

On sites with case-sensitive paths (base64 or hashid identifiers, usernames, ...), case-distinct pages are silently deduplicated. The crawl finishes successfully with pages quietly missing. There's no log message and no statistic that would reveal it.

Proposed fix

Lowercase only the scheme and host, which matches browser behavior. Keep the path, query, and fragment casing intact.
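A minimal sketch of that rule, written against the standard library's `urllib.parse` rather than the `yarl` objects used in `normalize_url`, so it is an illustration and not a proposed patch. The function name `lowercase_scheme_and_host` is an assumption. Userinfo, if present, is left untouched because it can be case-sensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_scheme_and_host(url: str) -> str:
    """Lowercase the scheme and host only; keep path, query, and fragment as given."""
    parts = urlsplit(url)

    # netloc may look like "user:pass@Host:8080"; lowercase only the host:port part.
    userinfo, sep, hostport = parts.netloc.rpartition('@')
    netloc = f'{userinfo}{sep}{hostport.lower()}'

    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(lowercase_scheme_and_host('HTTPS://Example.COM/Product/ABC?token=SeCrEt'))
```

With this rule, `https://example.com/Product/ABC` and `https://example.com/product/abc` stay distinct, while `HTTPS://Example.COM/x` and `https://example.com/x` still map to the same string.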

This changes how default unique keys are computed: crawls that relied on the case-insensitive dedup will now visit more pages, and keys stored in persisted queues won't match newly computed ones. It should therefore land in 2.0 as a breaking change:

  • Document the change loudly in the upgrading guide.
  • Users who need the old behavior can pass an explicit unique_key or use transform_request_function to lowercase URLs before enqueuing.
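As a rough sketch of the second workaround, a user could compute a case-insensitive key themselves and attach it to each request. Both helper names below are hypothetical, and the key only approximates the old behavior: it lowercases the URL string and does not reproduce the rest of the normalization.

```python
def legacy_unique_key(url: str) -> str:
    """Hypothetical helper: a case-insensitive key, approximating the old default."""
    return url.lower()


def with_legacy_key(url: str) -> dict[str, str]:
    """Pair a URL with an explicit key; the original URL casing is preserved."""
    return {'url': url, 'unique_key': legacy_unique_key(url)}


a = with_legacy_key('https://example.com/Product/ABC')
b = with_legacy_key('https://example.com/product/abc')

# Same key, so a queue deduplicating on unique_key would treat them as one page.
print(a['unique_key'] == b['unique_key'])
```

The resulting `unique_key` value is what would be passed explicitly when building a request, or set inside a `transform_request_function`, assuming those hooks accept it as described above.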

Labels

  • bug: Something isn't working.
  • t-tooling: Issues with this label are in the ownership of the tooling team.
