Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new allow_offsite parameter in OffsiteMiddleware #6151

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

BurnzZ
Copy link
Member

@BurnzZ BurnzZ commented Nov 22, 2023

I'm proposing to have a new allow_offsite parameter in OffsiteMiddleware. Currently, it relies on Request.dont_filter attribute to be set to True to allow offsite requests. However, it seems that we cannot rely on this flag directly since setting it to True could result in multiple duplicated requests, as per its definition:

dont_filter (bool) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops.

There are cases where we may want to visit an offsite page but also want to filter duplicate requests after.

Copy link

codecov bot commented Nov 22, 2023

Codecov Report

Merging #6151 (acba118) into master (70ba3a0) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6151      +/-   ##
==========================================
- Coverage   88.59%   88.52%   -0.07%     
==========================================
  Files         159      159              
  Lines       11582    11582              
  Branches     1885     1885              
==========================================
- Hits        10261    10253       -8     
- Misses        994     1000       +6     
- Partials      327      329       +2     
Files Coverage Δ
scrapy/spidermiddlewares/offsite.py 95.71% <100.00%> (ø)

... and 1 file with indirect coverage changes

@wRAR
Copy link
Member

wRAR commented Nov 22, 2023

I think this will show deprecation warnings for all requests with dont_filter=True, even ones that use it to skip the dupefilter?

@BurnzZ
Copy link
Member Author

BurnzZ commented Nov 28, 2023

I think this will show deprecation warnings for all requests with dont_filter=True, even ones that use it to skip the dupefilter?

Yes but only once as https://docs.python.org/3/library/warnings.html states:

Repetitions of a particular warning for the same source location are typically suppressed.

ScrapyDeprecationWarning,
stacklevel=2,
)
return True
Copy link
Member

@kmike kmike Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I wonder if it'd be better to just keep supporting dont_filter, without a deprecation warning.

First, the flag says "dont_filter", and we respect it, and don't filter out the request. An argument can be made though that "don't filter" only applies to deduplication filter, not to other types of filtering. I think that's valid, But that's not the current flag behavior, and also it's not in the flag name (it's not "dont_deduplicate").

Second, the user doesn't control all the don't_filter flags, Scrapy and other components can be setting this flag. For example, the default start_requests implementation uses dont_filter=True.

It seems it's not possibile to "deprecate don_filter flag in OffsiteMiddleware", because the user might be setting this flag not for the OffsiteMiddleware, but for other Scrapy components, but the request may still end up in OffsiteMiddleware.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. In that case, we need to clarify that the dont_filter value also affects some middlewares as well and not just for the Scheduler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants