`normalize_url` lowercases the entire URL, silently deduplicating case-distinct pages

### Description

`normalize_url` lowercases the entire URL, including the path and query, even though its docstring says it only converts the scheme and netloc to lower case:

https://github.com/apify/crawlee-python/blob/8cc09f2a94cc8547c37d94e40e70315f4d43dd25/src/crawlee/_utils/requests.py#L41-L47

Since `compute_unique_key` uses the normalized URL as the default `unique_key`, any two URLs that differ only in path or query casing collide:

- `https://example.com/Product/ABC` and `https://example.com/product/abc` produce the same unique key.
- `https://example.com/?token=SeCrEt` and `https://example.com/?token=secret` collide as well.

Per [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.1), only the scheme and host are case-insensitive. The path and query are case-sensitive.

### Impact

On sites with case-sensitive paths (base64 or hashid identifiers, usernames, ...), case-distinct pages are silently deduplicated. The crawl finishes successfully with pages quietly missing. There's no log message and no statistic that would reveal it.

### Proposed fix

Lowercase only the scheme and host, which matches browser behavior. Keep the path, query, and fragment casing intact.

This changes how default unique keys are computed: crawls that relied on the case-insensitive dedup will now visit more pages, and keys stored in persisted queues won't match newly computed ones. It should therefore land in 2.0 as a breaking change:

- Document the change loudly in the upgrading guide.
- Users who need the old behavior can pass an explicit `unique_key` or use `transform_request_function` to lowercase URLs before enqueuing.


	# Construct the final URL
	yarl_new_url = parsed_url.with_query(sorted_search_params)
	yarl_new_url = yarl_new_url.with_path(
	yarl_new_url.path.removesuffix('/'), keep_query=True, keep_fragment=keep_url_fragment
	)

	return str(yarl_new_url).lower()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`normalize_url` lowercases the entire URL, silently deduplicating case-distinct pages #2008

Description

Impact

Proposed fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

normalize_url lowercases the entire URL, silently deduplicating case-distinct pages #2008

Description

Description

Impact

Proposed fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`normalize_url` lowercases the entire URL, silently deduplicating case-distinct pages #2008