new allow_offsite parameter in OffsiteMiddleware #6151
base: master
Conversation
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #6151 +/- ##
==========================================
- Coverage 88.59% 88.52% -0.07%
==========================================
Files 159 159
Lines 11582 11582
Branches 1885 1885
==========================================
- Hits 10261 10253 -8
- Misses 994 1000 +6
- Partials 327 329 +2
I think this will show deprecation warnings for all requests with dont_filter=True, even ones that use it to skip the dupefilter?
Yes, but only once, as https://docs.python.org/3/library/warnings.html states:
scrapy/spidermiddlewares/offsite.py
Outdated
    ScrapyDeprecationWarning,
    stacklevel=2,
)
return True
Hm, I wonder if it'd be better to just keep supporting dont_filter, without a deprecation warning.
First, the flag says "dont_filter", and we respect it, and don't filter out the request. An argument can be made, though, that "don't filter" only applies to the deduplication filter, not to other types of filtering. I think that's valid, but that's not the current flag behavior, and it's also not in the flag name (it's not "dont_deduplicate").
Second, the user doesn't control all the dont_filter flags; Scrapy and other components can set this flag too. For example, the default start_requests implementation uses dont_filter=True.
It seems it's not possible to "deprecate the dont_filter flag in OffsiteMiddleware", because the user might be setting this flag not for the OffsiteMiddleware but for other Scrapy components, and the request may still end up in OffsiteMiddleware.
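The overloading described above can be sketched with a minimal, self-contained example (the function names here are illustrative, not Scrapy's actual API): a single dont_filter flag gates two unrelated checks, so setting it for one purpose silently disables the other.

```python
# Illustrative sketch (not Scrapy's real code): one dont_filter flag
# bypasses both deduplication and offsite filtering.
seen_urls = set()

def dupefilter_allows(url: str, dont_filter: bool) -> bool:
    """Deduplication filter: drop already-seen URLs unless dont_filter is set."""
    if dont_filter:
        return True
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True

def offsite_allows(host: str, allowed_domains: list, dont_filter: bool) -> bool:
    """Offsite filter: current behavior also honors dont_filter."""
    return dont_filter or any(
        host == d or host.endswith("." + d) for d in allowed_domains
    )

# A user setting dont_filter=True only to re-crawl a page also,
# perhaps unintentionally, bypasses the offsite check:
print(offsite_allows("other.example.org", ["example.com"], dont_filter=True))
```

A request produced with dont_filter=True for deduplication reasons (such as those from the default start_requests) therefore passes through both filters, which is exactly why deprecating the flag in only one of them is awkward.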
Makes sense. In that case, we need to clarify that the dont_filter value also affects some middlewares, not just the Scheduler.
Force-pushed from 1d557a5 to acba118
I'm proposing to add a new allow_offsite parameter in OffsiteMiddleware. Currently, the middleware relies on the Request.dont_filter attribute being set to True to allow offsite requests. However, it seems we cannot rely on this flag directly, since setting it to True could result in duplicated requests, as per its definition. There are cases where we may want to visit an offsite page but also want to filter out duplicate requests afterwards.