new allow_offsite parameter in OffsiteMiddleware #6151

Open · wants to merge 6 commits into master
7 changes: 4 additions & 3 deletions docs/topics/downloader-middleware.rst
@@ -797,9 +797,10 @@ OffsiteMiddleware
 :attr:`~scrapy.Spider.allowed_domains` attribute, or the
 attribute is empty, the offsite middleware will allow all requests.
 
-If the request has the :attr:`~scrapy.Request.dont_filter` attribute
-set, the offsite middleware will allow the request even if its domain is not
-listed in allowed domains.
+If the request has the :attr:`~scrapy.Request.dont_filter` attribute set to
+``True`` or :attr:`Request.meta` has ``allow_offsite`` set to ``True``, then
+the OffsiteMiddleware will allow the request even if its domain is not listed
+in allowed domains.
 
 RedirectMiddleware
 ------------------
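For illustration (not part of the diff), a minimal spider sketch of how the new meta key would be used once this lands; the spider name, URLs, and parse_partner callback are made up:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["a.example"]
    start_urls = ["https://a.example/"]

    def parse(self, response):
        # Offsite domain, but explicitly allowed for this one request.
        # Unlike dont_filter=True, the duplicates filter still applies.
        yield scrapy.Request(
            "https://b.example/partner-page",
            meta={"allow_offsite": True},
            callback=self.parse_partner,
        )

    def parse_partner(self, response):
        yield {"url": response.url}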
6 changes: 3 additions & 3 deletions docs/topics/request-response.rst
@@ -145,9 +145,9 @@ Request objects
    :type priority: int
 
    :param dont_filter: indicates that this request should not be filtered by
-      the scheduler. This is used when you want to perform an identical
-      request multiple times, to ignore the duplicates filter. Use it with
-      care, or you will get into crawling loops. Default to ``False``.
+      the scheduler or some middlewares. This is used when you want to perform
+      an identical request multiple times, to ignore the duplicates filter.
+      Use it with care, or you will get into crawling loops. Defaults to ``False``.
    :type dont_filter: bool
 
    :param errback: a function that will be called if any exception was
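As a side note, a small sketch of the dont_filter behavior this paragraph describes (hypothetical spider and URL); re-yielding an already-seen URL is dropped by the duplicates filter unless dont_filter is set:

import scrapy

class RetrySpider(scrapy.Spider):
    name = "retry"
    start_urls = ["https://example.com/prices"]

    def parse(self, response):
        price = response.css("span.price::text").get()
        if price is None:
            # The page rendered without a price: fetch the same URL again.
            # Without dont_filter=True the scheduler would drop this request
            # as a duplicate; with it, beware of the crawling loops the docs
            # warn about - a retry cap in meta would be prudent in practice.
            yield scrapy.Request(response.url, dont_filter=True, callback=self.parse)
        else:
            yield {"price": price}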
6 changes: 5 additions & 1 deletion scrapy/downloadermiddlewares/offsite.py
@@ -28,7 +28,11 @@ def request_scheduled(self, request, spider):
         self.process_request(request, spider)
 
     def process_request(self, request, spider):
-        if request.dont_filter or self.should_follow(request, spider):
+        if (
+            request.dont_filter
+            or request.meta.get("allow_offsite")
+            or self.should_follow(request, spider)
+        ):
             return None
         domain = urlparse_cached(request).hostname
         if domain and domain not in self.domains_seen:
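Worth spelling out the difference in scope here: request.meta["allow_offsite"] only short-circuits this offsite check, whereas dont_filter is also honored by the scheduler's duplicates filter. A quick comparison (illustrative URL):

from scrapy import Request

# Skips the offsite check only; the duplicates filter still applies.
Request("https://b.example/page", meta={"allow_offsite": True})

# Skips the offsite check and the scheduler's duplicates filter.
Request("https://b.example/page", dont_filter=True)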
6 changes: 5 additions & 1 deletion scrapy/spidermiddlewares/offsite.py
@@ -57,7 +57,11 @@ async def process_spider_output_async(
     def _filter(self, request: Any, spider: Spider) -> bool:
         if not isinstance(request, Request):
             return True
-        if request.dont_filter or self.should_follow(request, spider):
+        if (
+            request.dont_filter
+            or request.meta.get("allow_offsite")
+            or self.should_follow(request, spider)
+        ):
             return True
         domain = urlparse_cached(request).hostname
         if domain and domain not in self.domains_seen:
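Note that the two middlewares reject offsite requests differently: the downloader middleware raises IgnoreRequest (as the tests below assert), while this spider middleware's _filter returns False so the request is silently dropped from the spider output. A rough sketch of exercising _filter directly, reusing the helpers the PR's own tests rely on:

from scrapy import Request, Spider
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.test import get_crawler

crawler = get_crawler(Spider)
spider = crawler._create_spider(name="a", allowed_domains=["a.example"])
mw = OffsiteMiddleware.from_crawler(crawler)
mw.spider_opened(spider)

assert mw._filter(Request("https://a.example/ok"), spider) is True
# With this PR, the meta flag lets the offsite request through:
assert mw._filter(Request("https://b.example", meta={"allow_offsite": True}), spider) is True
assert mw._filter(Request("https://b.example"), spider) is False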
31 changes: 31 additions & 0 deletions tests/test_downloadermiddleware_offsite.py
@@ -62,6 +62,37 @@ def test_process_request_dont_filter(value, filtered):
         assert mw.process_request(request, spider) is None
 
 
+@pytest.mark.parametrize(
+    ("allow_offsite", "dont_filter", "filtered"),
+    (
+        (True, UNSET, False),
+        (True, None, False),
+        (True, False, False),
+        (True, True, False),
+        (False, UNSET, True),
+        (False, None, True),
+        (False, False, True),
+        (False, True, False),
+    ),
+)
+def test_process_request_allow_offsite(allow_offsite, dont_filter, filtered):
+    crawler = get_crawler(Spider)
+    spider = crawler._create_spider(name="a", allowed_domains=["a.example"])
+    mw = OffsiteMiddleware.from_crawler(crawler)
+    mw.spider_opened(spider)
+    kwargs = {"meta": {}}
+    if allow_offsite is not UNSET:
+        kwargs["meta"]["allow_offsite"] = allow_offsite
+    if dont_filter is not UNSET:
+        kwargs["dont_filter"] = dont_filter
+    request = Request("https://b.example", **kwargs)
+    if filtered:
+        with pytest.raises(IgnoreRequest):
+            mw.process_request(request, spider)
+    else:
+        assert mw.process_request(request, spider) is None
+
+
 @pytest.mark.parametrize(
     "value",
     (
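One detail for anyone reading the new test in isolation: UNSET is a module-level sentinel (already used by test_process_request_dont_filter above) that distinguishes "argument not passed at all" from an explicit None or False. Presumably something like:

UNSET = object()  # assumed definition; any unique sentinel object works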
1 change: 1 addition & 0 deletions tests/test_spidermiddleware_offsite.py
@@ -29,6 +29,7 @@ def test_process_spider_output(self):
             Request("http://scrapy.org/1"),
             Request("http://sub.scrapy.org/1"),
             Request("http://offsite.tld/letmepass", dont_filter=True),
+            Request("http://offsite-2.tld/allow", meta={"allow_offsite": True}),
             Request("http://scrapy.test.org/"),
             Request("http://scrapy.test.org:8000/"),
         ]