new allow_offsite parameter in OffsiteMiddleware #6151

Open · wants to merge 6 commits into base: master
Changes from 3 commits
6 changes: 3 additions & 3 deletions docs/topics/request-response.rst
@@ -144,9 +144,9 @@ Request objects
:type priority: int

 :param dont_filter: indicates that this request should not be filtered by
-   the scheduler. This is used when you want to perform an identical
-   request multiple times, to ignore the duplicates filter. Use it with
-   care, or you will get into crawling loops. Default to ``False``.
+   the scheduler or some middlewares. This is used when you want to perform
+   an identical request multiple times, to ignore the duplicates filter.
+   Use it with care, or you will get into crawling loops. Default to ``False``.
:type dont_filter: bool

:param errback: a function that will be called if any exception was
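The ``dont_filter`` flag documented above is what the new ``allow_offsite`` key complements: it bypasses the scheduler's duplicates filter as well as some middlewares. A minimal sketch of its documented use (spider name and URL are hypothetical): re-issuing an identical request that the scheduler would otherwise drop as a duplicate.

    import scrapy

    class RetrySpider(scrapy.Spider):
        name = "retry"
        start_urls = ["https://example.com/status"]

        def parse(self, response):
            if b"ready" not in response.body:
                # Identical URL: without dont_filter=True the scheduler's
                # duplicates filter would silently drop this request.
                yield scrapy.Request(
                    response.url, dont_filter=True, callback=self.parse
                )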
7 changes: 4 additions & 3 deletions docs/topics/spider-middleware.rst
@@ -345,9 +345,10 @@ OffsiteMiddleware
:attr:`~scrapy.Spider.allowed_domains` attribute, or the
attribute is empty, the offsite middleware will allow all requests.

-If the request has the :attr:`~scrapy.Request.dont_filter` attribute
-set, the offsite middleware will allow the request even if its domain is not
-listed in allowed domains.
+If the request has the :attr:`~scrapy.Request.dont_filter` attribute set to
+``True`` or :attr:`Request.meta` has ``allow_offsite`` set to ``True``, then
+the OffsiteMiddleware will allow the request even if its domain is not listed
+in allowed domains.


RefererMiddleware
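To make the documentation change above concrete, here is a minimal sketch (spider and domain names are hypothetical) of the new ``allow_offsite`` meta key in use; unlike ``dont_filter``, it only exempts the request from OffsiteMiddleware, so the duplicates filter still applies:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # partner.tld is not in allowed_domains; allow_offsite lets this
            # request through OffsiteMiddleware without also bypassing the
            # scheduler's duplicates filter (as dont_filter=True would).
            yield scrapy.Request(
                "https://partner.tld/catalog", meta={"allow_offsite": True}
            )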
6 changes: 5 additions & 1 deletion scrapy/spidermiddlewares/offsite.py
@@ -50,7 +50,11 @@ async def process_spider_output_async(
    def _filter(self, request: Any, spider: Spider) -> bool:
        if not isinstance(request, Request):
            return True
-        if request.dont_filter or self.should_follow(request, spider):
+        if (
+            request.dont_filter
+            or request.meta.get("allow_offsite")
+            or self.should_follow(request, spider)
+        ):
            return True
        domain = urlparse_cached(request).hostname
        if domain and domain not in self.domains_seen:
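With this change a request survives ``_filter`` when any of three conditions holds: it is marked ``dont_filter``, its ``meta`` carries a truthy ``allow_offsite``, or its host matches the spider's allowed domains. A standalone sketch of just that boolean decision (not Scrapy's actual API; names are illustrative):

    def is_kept(dont_filter: bool, allow_offsite: bool, host_allowed: bool) -> bool:
        # Mirrors the updated check in OffsiteMiddleware._filter:
        #   request.dont_filter
        #   or request.meta.get("allow_offsite")
        #   or self.should_follow(request, spider)
        return dont_filter or allow_offsite or host_allowed

    assert is_kept(False, True, False)       # offsite, but explicitly allowed
    assert not is_kept(False, False, False)  # offsite and not exempted: filtered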
1 change: 1 addition & 0 deletions tests/test_spidermiddleware_offsite.py
@@ -29,6 +29,7 @@ def test_process_spider_output(self):
Request("http://scrapy.org/1"),
Request("http://sub.scrapy.org/1"),
Request("http://offsite.tld/letmepass", dont_filter=True),
Request("http://offsite-2.tld/allow", meta={"allow_offsite": True}),
Request("http://scrapy.test.org/"),
Request("http://scrapy.test.org:8000/"),
]
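The added request exercises the new branch end to end. A hedged sketch of how the surrounding test can assert it (``mw``, ``res``, ``reqs`` and ``spider`` stand in for the test class's fixtures, which are not shown here):

    out = list(mw.process_spider_output(res, reqs, spider))
    urls = [r.url for r in out if isinstance(r, Request)]
    # Both escape hatches keep their requests even though neither domain is
    # listed in the spider's allowed_domains.
    assert "http://offsite.tld/letmepass" in urls
    assert "http://offsite-2.tld/allow" in urls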