request_fingerprint is not unique #2017

Closed
nengine opened this issue May 27, 2016 · 6 comments

Comments

@nengine

nengine commented May 27, 2016

Please see the example below. Should it generate the same fingerprint? Thanks.

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# Same scheme, path, method and body; the only difference is the "www." host prefix
r1 = Request("http://www.example.com/123")
r2 = Request("http://example.com/123")

print(request_fingerprint(r1))
print(request_fingerprint(r2))

1577e4ad857665390d44cd04a638104d0575d903
a907c28bf08125b8a87535a117c2d8a4a629415c

@kmike
Member

kmike commented May 27, 2016

It is common for www and non-www content to be the same, but that is not the only way websites are implemented. For example, it may be even more common to redirect from the non-www version to www. In that case, if the fingerprints of http://example.com/123 and http://www.example.com/123 were the same, the redirect target of http://example.com/123 (that is, http://www.example.com/123) could be filtered out by the duplicate filter. So I think the current implementation is better.
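
For illustration (the query-string URLs below are made up, not from this thread): the default fingerprint already normalizes some things through w3lib's canonicalize_url, such as query argument order, while deliberately leaving the host alone:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# canonicalize_url sorts query arguments, so these two fingerprints match...
q1 = Request("http://example.com/123?a=1&b=2")
q2 = Request("http://example.com/123?b=2&a=1")
print(request_fingerprint(q1) == request_fingerprint(q2))  # True

# ...while the host is hashed as-is, so www/non-www fingerprints differ by design.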

@kmike kmike closed this as completed May 27, 2016
@nengine
Author

nengine commented May 27, 2016

Hi, did you mean a custom duplicate filter? I am using scrapy-deltafetch, but the URL was recorded twice. Thanks!

@kmike
Member

kmike commented May 27, 2016

No, I mean the default dupefilter.

There is a tradeoff: if you make URL canonicalization too aggressive, you may sometimes miss requests; if it is not aggressive enough, there will sometimes be duplicate requests. For a general framework it is better to err on the "more duplicate requests" side.

@nengine
Author

nengine commented May 27, 2016

The default dupefilter uses request_fingerprint to check for unique URLs, so the two URLs are recorded twice. So what I meant is that I should implement a custom dupefilter that normalizes the www part of the URL. Thanks.
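
A minimal sketch of such a dupefilter (not from the thread; the class name and the blunt string replacement are illustrative, and it relies on RFPDupeFilter exposing a request_fingerprint method that can be overridden):

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class StripWWWDupeFilter(RFPDupeFilter):
    # Treat http://www.example.com/... and http://example.com/... as duplicates.
    def request_fingerprint(self, request):
        # Illustrative normalization: drop a leading "www." from the host.
        url = request.url.replace("://www.", "://", 1)
        return request_fingerprint(request.replace(url=url))

It would be enabled with DUPEFILTER_CLASS = "myproject.dupefilters.StripWWWDupeFilter" in settings.py (the module path is hypothetical).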

@kmike
Member

kmike commented May 27, 2016

@nengine ah, yeah, that's correct.
There are some edge cases, because dupefilters are not the only users of request fingerprints; they are also used by the cache. So if you're using the Scrapy cache you will have to override a cache backend as well. It looks like DeltaFetch may also need a subclass, or a custom deltafetch_key in request.meta, since it uses the original request_fingerprint here.

See also: #900.

@kmike
Member

kmike commented May 27, 2016

I'm not familiar with the deltafetch extension; maybe just passing the right deltafetch_key would be enough, with no need to override built-in Scrapy components.
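
A sketch of that approach (assuming the middleware falls back to request_fingerprint only when request.meta["deltafetch_key"] is absent; the normalization and helper name are illustrative):

from scrapy.http import Request

def make_request(url):
    # Host-normalized key so www and non-www map to the same DeltaFetch entry.
    key = url.replace("://www.", "://", 1)
    return Request(url, meta={"deltafetch_key": key})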

Gallaecio added a commit to Gallaecio/scrapy that referenced this issue Sep 26, 2022