Description
normalize_url lowercases the entire URL, including the path and query, even though its docstring says it only converts the scheme and netloc to lower case:
|
# Construct the final URL |
|
yarl_new_url = parsed_url.with_query(sorted_search_params) |
|
yarl_new_url = yarl_new_url.with_path( |
|
yarl_new_url.path.removesuffix('/'), keep_query=True, keep_fragment=keep_url_fragment |
|
) |
|
|
|
return str(yarl_new_url).lower() |
Since compute_unique_key uses the normalized URL as the default unique_key, any two URLs that differ only in path or query casing collide:
https://example.com/Product/ABC and https://example.com/product/abc produce the same unique key.
https://example.com/?token=SeCrEt and https://example.com/?token=secret collide as well.
Per RFC 3986, only the scheme and host are case-insensitive. The path and query are case-sensitive.
Impact
On sites with case-sensitive paths (base64 or hashid identifiers, usernames, ...), case-distinct pages are silently deduplicated. The crawl finishes successfully with pages quietly missing. There's no log message and no statistic that would reveal it.
Proposed fix
Lowercase only the scheme and host, which matches browser behavior. Keep the path, query, and fragment casing intact.
This changes how default unique keys are computed: crawls that relied on the case-insensitive dedup will now visit more pages, and keys stored in persisted queues won't match newly computed ones. It should therefore land in 2.0 as a breaking change:
- Document the change loudly in the upgrading guide.
- Users who need the old behavior can pass an explicit
unique_key or use transform_request_function to lowercase URLs before enqueuing.
Description
normalize_urllowercases the entire URL, including the path and query, even though its docstring says it only converts the scheme and netloc to lower case:crawlee-python/src/crawlee/_utils/requests.py
Lines 41 to 47 in 8cc09f2
Since
compute_unique_keyuses the normalized URL as the defaultunique_key, any two URLs that differ only in path or query casing collide:https://example.com/Product/ABCandhttps://example.com/product/abcproduce the same unique key.https://example.com/?token=SeCrEtandhttps://example.com/?token=secretcollide as well.Per RFC 3986, only the scheme and host are case-insensitive. The path and query are case-sensitive.
Impact
On sites with case-sensitive paths (base64 or hashid identifiers, usernames, ...), case-distinct pages are silently deduplicated. The crawl finishes successfully with pages quietly missing. There's no log message and no statistic that would reveal it.
Proposed fix
Lowercase only the scheme and host, which matches browser behavior. Keep the path, query, and fragment casing intact.
This changes how default unique keys are computed: crawls that relied on the case-insensitive dedup will now visit more pages, and keys stored in persisted queues won't match newly computed ones. It should therefore land in 2.0 as a breaking change:
unique_keyor usetransform_request_functionto lowercase URLs before enqueuing.