request_fingerprint is not unique #2017

Closed
nengine opened this issue May 27, 2016 · 6 comments

Comments

@nengine

nengine commented May 27, 2016

Please see the example below. Should it generate the same fingerprint? Thanks.

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# Same scheme, path, method and body; the only difference is the "www." host prefix
r1 = Request("http://www.example.com/123")
r2 = Request("http://example.com/123")

print(request_fingerprint(r1))
print(request_fingerprint(r2))

1577e4ad857665390d44cd04a638104d0575d903
a907c28bf08125b8a87535a117c2d8a4a629415c

@kmike
Member

kmike commented May 27, 2016

It is common for www and non-www content to be the same, but that is not the only way websites are implemented. For example, it may be even more common to redirect from the non-www version to www. In that case, if the fingerprints of http://example.com/123 and http://www.example.com/123 were the same, the redirect target of http://example.com/123 (that is, http://www.example.com/123) could be filtered out by the duplicate filter. So I think the current implementation is better.
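
For illustration (the query-string URLs below are made up, not from this thread): the default fingerprint already normalizes some things through w3lib's canonicalize_url, such as query argument order, while deliberately leaving the host alone:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# canonicalize_url sorts query arguments, so these two fingerprints match...
q1 = Request("http://example.com/123?a=1&b=2")
q2 = Request("http://example.com/123?b=2&a=1")
print(request_fingerprint(q1) == request_fingerprint(q2))  # True

# ...while the host is hashed as-is, so www/non-www fingerprints differ by design.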

@kmike kmike closed this as completed May 27, 2016
@nengine
Author

nengine commented May 27, 2016

Hi, did you mean a custom duplicate filter? I am using scrapy-deltafetch, but the URL was recorded twice. Thanks!

@kmike
Member

kmike commented May 27, 2016

No, I mean the default dupefilter.

There is a tradeoff: if you make URL canonicalization too aggressive, you may sometimes miss requests; if it is not aggressive enough, there will sometimes be duplicate requests. For a general framework it is better to err on the "more duplicate requests" side.

@nengine
Author

nengine commented May 27, 2016

The default dupefilter uses request_fingerprint to check for unique URLs, so the two URLs are recorded twice. So what I meant is that I should implement a custom dupefilter that normalizes the www part of the URL. Thanks.
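
A minimal sketch of such a dupefilter (not from the thread; the class name and the blunt string replacement are illustrative, and it relies on RFPDupeFilter exposing a request_fingerprint method that can be overridden):

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class StripWWWDupeFilter(RFPDupeFilter):
    # Treat http://www.example.com/... and http://example.com/... as duplicates.
    def request_fingerprint(self, request):
        # Illustrative normalization: drop a leading "www." from the host.
        url = request.url.replace("://www.", "://", 1)
        return request_fingerprint(request.replace(url=url))

It would be enabled with DUPEFILTER_CLASS = "myproject.dupefilters.StripWWWDupeFilter" in settings.py (the module path is hypothetical).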

@kmike
Member

kmike commented May 27, 2016

@nengine ah, yeah, that's correct.
There are some edge cases, because dupefilters are not the only users of request fingerprints; they are also used by the cache. So if you're using the Scrapy cache you will have to override a cache backend as well. It looks like DeltaFetch may also need a subclass, or a custom deltafetch_key in request.meta, since it uses the original request_fingerprint here.

See also: #900.

@kmike
Member

kmike commented May 27, 2016

I'm not familiar with the deltafetch extension; maybe just passing the right deltafetch_key would be enough, with no need to override built-in Scrapy components.
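
A sketch of that approach (assuming the middleware falls back to request_fingerprint only when request.meta["deltafetch_key"] is absent; the normalization and helper name are illustrative):

from scrapy.http import Request

def make_request(url):
    # Host-normalized key so www and non-www map to the same DeltaFetch entry.
    key = url.replace("://www.", "://", 1)
    return Request(url, meta={"deltafetch_key": key})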

Gallaecio added a commit to Gallaecio/scrapy that referenced this issue Sep 26, 2022