request_fingerprint is not unique #2017
It is common for www and non-www content to be the same, but that is not the only way websites are implemented. For example, it may be even more common to redirect from the non-www version to www. In that case, if the fingerprints of the www and non-www URLs were treated as the same, requests could be missed.
Hi, did you mean a custom duplication filter? I use scrapy-deltafetch, but the URL was recorded twice. Thanks!
No, I mean the default dupefilter. There is a tradeoff: if you make URL canonicalization too aggressive, you may sometimes miss requests; if it is not aggressive enough, there will sometimes be duplicate requests. For a general framework it is better to err on the "more duplicate requests" side.
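To illustrate the tradeoff: the default fingerprint hashes the request URL essentially as given, so www and non-www variants of the same page get different fingerprints. Below is a minimal, hypothetical stand-in for the fingerprinting idea (Scrapy's real implementation lives in `scrapy.utils.request.request_fingerprint` and also hashes the method, body, and canonicalized URL; this sketch only hashes method plus raw URL).

```python
import hashlib

def fingerprint(method: str, url: str) -> str:
    """Hypothetical simplified fingerprint: SHA-1 over the
    request method and the URL exactly as given."""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    return h.hexdigest()

a = fingerprint("GET", "http://example.com/page")
b = fingerprint("GET", "http://www.example.com/page")
print(a == b)  # False: the www and non-www URLs hash differently
```

Because the two URLs differ as strings, both requests pass the dupefilter and the page is fetched (and recorded) twice.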
The default dupefilter uses request_fingerprint to check for unique URLs, so the two variants are recorded twice. So I meant that I should implement a custom dupefilter to normalize the www part of the URL. Thanks.
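A custom dupefilter along those lines could normalize the host before fingerprinting. The sketch below shows just the normalization and dedup logic with the standard library; the function names (`strip_www`, `fingerprint`) are made up for illustration, and in a real project you would instead subclass `scrapy.dupefilters.RFPDupeFilter` and override its fingerprinting.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def strip_www(url: str) -> str:
    """Drop a leading 'www.' from the host so both variants
    of a URL canonicalize to the same string."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

def fingerprint(url: str) -> str:
    """Fingerprint the normalized URL instead of the raw one."""
    return hashlib.sha1(strip_www(url).encode()).hexdigest()

seen = set()
for u in ("http://example.com/page", "http://www.example.com/page"):
    fp = fingerprint(u)
    if fp in seen:
        print("duplicate:", u)  # the www variant is now filtered out
    seen.add(fp)
```

Note the tradeoff mentioned above still applies: if a site actually serves different content on www and non-www hosts, this normalization would wrongly drop one of them.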
@nengine ah, yeah, that's correct. See also: #900. |
I'm not familiar with the deltafetch extension; maybe just passing the right …
Please see the example below. Should these generate the same fingerprint? Thanks.
1577e4ad857665390d44cd04a638104d0575d903
a907c28bf08125b8a87535a117c2d8a4a629415c