Fix the offsite middleware missing some requests (#6358)
Gallaecio committed May 13, 2024
1 parent 3562618 commit f149ea4
Showing 18 changed files with 401 additions and 83 deletions.
40 changes: 18 additions & 22 deletions docs/faq.rst
@@ -138,39 +138,37 @@ See previous question.
 How can I prevent memory errors due to many allowed domains?
 ------------------------------------------------------------
 
-If you have a spider with a long list of
-:attr:`~scrapy.Spider.allowed_domains` (e.g. 50,000+), consider
-replacing the default
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` spider middleware
-with a :ref:`custom spider middleware <custom-spider-middleware>` that requires
-less memory. For example:
+If you have a spider with a long list of :attr:`~scrapy.Spider.allowed_domains`
+(e.g. 50,000+), consider replacing the default
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` downloader
+middleware with a :ref:`custom downloader middleware
+<topics-downloader-middleware-custom>` that requires less memory. For example:
 
 -   If your domain names are similar enough, use your own regular expression
-    instead joining the strings in
-    :attr:`~scrapy.Spider.allowed_domains` into a complex regular
-    expression.
+    instead of joining the strings in :attr:`~scrapy.Spider.allowed_domains`
+    into a complex regular expression.
 
 -   If you can `meet the installation requirements`_, use pyre2_ instead of
     Python’s re_ to compile your URL-filtering regular expression. See
     :issue:`1908`.
 
-See also other suggestions at `StackOverflow`_.
+See also `other suggestions at StackOverflow
+<https://stackoverflow.com/q/36440681>`__.
 
 .. note:: Remember to disable
-    :class:`scrapy.spidermiddlewares.offsite.OffsiteMiddleware` when you enable
-    your custom implementation:
+    :class:`scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` when you
+    enable your custom implementation:
 
     .. code-block:: python
 
-        SPIDER_MIDDLEWARES = {
-            "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
-            "myproject.middlewares.CustomOffsiteMiddleware": 500,
+        DOWNLOADER_MIDDLEWARES = {
+            "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
+            "myproject.middlewares.CustomOffsiteMiddleware": 50,
         }
 
 .. _meet the installation requirements: https://github.com/andreasvc/pyre2#installation
 .. _pyre2: https://github.com/andreasvc/pyre2
 .. _re: https://docs.python.org/library/re.html
-.. _StackOverflow: https://stackoverflow.com/q/36440681/939364
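
To make the note above concrete, here is a minimal sketch of such a custom
downloader middleware (``myproject/middlewares.py``), assuming pyre2 is
installed and exposes its usual re-compatible ``compile``/``escape`` API.
``CustomOffsiteMiddleware`` is the name already used in the settings snippet;
everything else is illustrative, not part of the commit::

    import re2  # pyre2: re-compatible API with more memory-efficient patterns

    from scrapy.exceptions import IgnoreRequest
    from scrapy.utils.httpobj import urlparse_cached


    class CustomOffsiteMiddleware:
        """Drop requests whose host is not covered by allowed_domains."""

        def process_request(self, request, spider):
            if not hasattr(self, "_host_regex"):
                domains = [
                    re2.escape(d)
                    for d in getattr(spider, "allowed_domains", None) or []
                    if d
                ]
                # No domains means allow everything, like the built-in middleware.
                self._host_regex = (
                    re2.compile(rf'^(.*\.)?({"|".join(domains)})$') if domains else None
                )
            if request.dont_filter or self._host_regex is None:
                return None  # let the request continue through the chain
            host = urlparse_cached(request).hostname or ""
            if not self._host_regex.search(host):
                raise IgnoreRequest(f"Filtered offsite request: {request}")
            return None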

Can I use Basic HTTP Authentication in my spiders?
--------------------------------------------------
@@ -206,12 +204,10 @@ I get "Filtered offsite request" messages. How can I fix them?
 Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
 problem, so you may not need to fix them.
 
-Those messages are thrown by the Offsite Spider Middleware, which is a spider
-middleware (enabled by default) whose purpose is to filter out requests to
-domains outside the ones covered by the spider.
-
-For more info see:
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware`.
+Those messages are thrown by
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`, which is a
+downloader middleware (enabled by default) whose purpose is to filter out
+requests to domains outside the ones covered by the spider.
 
 What is the recommended way to deploy a Scrapy crawler in production?
 ---------------------------------------------------------------------
4 changes: 2 additions & 2 deletions docs/topics/benchmarking.rst
@@ -24,7 +24,8 @@ You should see an output like this::
      'scrapy.extensions.telnet.TelnetConsole',
      'scrapy.extensions.corestats.CoreStats']
     2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
-    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+    ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
+     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
      'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
      'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
      'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
@@ -37,7 +38,6 @@ You should see an output like this::
      'scrapy.downloadermiddlewares.stats.DownloaderStats']
     2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
     ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
-     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
      'scrapy.spidermiddlewares.referer.RefererMiddleware',
      'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
      'scrapy.spidermiddlewares.depth.DepthMiddleware']
38 changes: 38 additions & 0 deletions docs/topics/downloader-middleware.rst
@@ -763,6 +763,44 @@ HttpProxyMiddleware
 Keep in mind this value will take precedence over ``http_proxy``/``https_proxy``
 environment variables, and it will also ignore ``no_proxy`` environment variable.
 
+OffsiteMiddleware
+-----------------
+
+.. module:: scrapy.downloadermiddlewares.offsite
+   :synopsis: Offsite Middleware
+
+.. class:: OffsiteMiddleware
+
+   .. versionadded:: VERSION
+
+   Filters out Requests for URLs outside the domains covered by the spider.
+
+   This middleware filters out every request whose host name isn't in the
+   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
+   All subdomains of any domain in the list are also allowed.
+   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
+   but not ``www2.example.org`` nor ``example.org``.
+
+   When your spider returns a request for a domain not belonging to those
+   covered by the spider, this middleware will log a debug message similar to
+   this one::
+
+       DEBUG: Filtered offsite request to 'offsite.example': <GET http://offsite.example/some/page.html>
+
+   To avoid filling the log with too much noise, it will only print one of
+   these messages for each new domain filtered. So, for example, if another
+   request for ``offsite.example`` is filtered, no log message will be
+   printed. But if a request for ``other.example`` is filtered, a message
+   will be printed (but only for the first request filtered).
+
+   If the spider doesn't define an
+   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
+   attribute is empty, the offsite middleware will allow all requests.
+
+   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
+   set, the offsite middleware will allow the request even if its domain is
+   not listed in allowed domains.
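
For illustration, a minimal spider sketch (hypothetical names and URLs, not
part of the commit) that exercises the behavior documented above: the second
request would be filtered as offsite, while ``dont_filter`` lets the third one
through::

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.org"]  # subdomains are allowed too
        start_urls = ["https://www.example.org/"]

        def parse(self, response):
            # Allowed: bob.example.org is a subdomain of example.org.
            yield scrapy.Request("https://bob.example.org/page")
            # Filtered by OffsiteMiddleware: host not under example.org.
            yield scrapy.Request("http://offsite.example/some/page.html")
            # Allowed despite being offsite: dont_filter bypasses the check.
            yield scrapy.Request("http://offsite.example/other", dont_filter=True)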

RedirectMiddleware
------------------

2 changes: 1 addition & 1 deletion docs/topics/settings.rst
@@ -674,6 +674,7 @@ Default:
 .. code-block:: python
 
     {
+        "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
         "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
         "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
         "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
@@ -1605,7 +1606,6 @@ Default:
     {
         "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
-        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
         "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
         "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
         "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
11 changes: 9 additions & 2 deletions docs/topics/signals.rst
@@ -343,11 +343,18 @@ request_scheduled
 .. signal:: request_scheduled
 .. function:: request_scheduled(request, spider)
 
-    Sent when the engine schedules a :class:`~scrapy.Request`, to be
-    downloaded later.
+    Sent when the engine is asked to schedule a :class:`~scrapy.Request`, to be
+    downloaded later, before the request reaches the :ref:`scheduler
+    <topics-scheduler>`.
+
+    Raise :exc:`~scrapy.exceptions.IgnoreRequest` to drop a request before it
+    reaches the scheduler.
 
     This signal does not support returning deferreds from its handlers.
 
+    .. versionadded:: VERSION
+       Allow dropping requests with :exc:`~scrapy.exceptions.IgnoreRequest`.
+
     :param request: the request that reached the scheduler
     :type request: :class:`~scrapy.Request` object
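
A handler sketch using the new behavior; ``DropPdfRequests`` is a hypothetical
extension invented for this example, enabled like any other through the
``EXTENSIONS`` setting::

    from scrapy import signals
    from scrapy.exceptions import IgnoreRequest


    class DropPdfRequests:
        """Toy extension: veto any request whose URL ends in .pdf."""

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(
                ext.request_scheduled, signal=signals.request_scheduled
            )
            return ext

        def request_scheduled(self, request, spider):
            if request.url.endswith(".pdf"):
                # The engine catches this and the request never reaches the
                # scheduler (see the engine change in this commit).
                raise IgnoreRequest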

40 changes: 2 additions & 38 deletions docs/topics/spider-middleware.rst
@@ -51,8 +51,8 @@ value. For example, if you want to disable the off-site middleware:
 .. code-block:: python
 
     SPIDER_MIDDLEWARES = {
-        "myproject.middlewares.CustomSpiderMiddleware": 543,
-        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
+        "scrapy.spidermiddlewares.referer.RefererMiddleware": None,
+        "myproject.middlewares.CustomRefererSpiderMiddleware": 700,
     }
 
 Finally, keep in mind that some middlewares may need to be enabled through a
@@ -313,42 +313,6 @@ Default: ``False``
 
 Pass all responses, regardless of their status codes.
 
-OffsiteMiddleware
------------------
-
-.. module:: scrapy.spidermiddlewares.offsite
-   :synopsis: Offsite Spider Middleware
-
-.. class:: OffsiteMiddleware
-
-   Filters out Requests for URLs outside the domains covered by the spider.
-
-   This middleware filters out every request whose host names aren't in the
-   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
-   All subdomains of any domain in the list are also allowed.
-   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
-   but not ``www2.example.com`` nor ``example.com``.
-
-   When your spider returns a request for a domain not belonging to those
-   covered by the spider, this middleware will log a debug message similar to
-   this one::
-
-       DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>
-
-   To avoid filling the log with too much noise, it will only print one of
-   these messages for each new domain filtered. So, for example, if another
-   request for ``www.othersite.com`` is filtered, no log message will be
-   printed. But if a request for ``someothersite.com`` is filtered, a message
-   will be printed (but only for the first request filtered).
-
-   If the spider doesn't define an
-   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
-   attribute is empty, the offsite middleware will allow all requests.
-
-   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
-   set, the offsite middleware will allow the request even if its domain is not
-   listed in allowed domains.
-
 
 RefererMiddleware
 -----------------
3 changes: 2 additions & 1 deletion docs/topics/spiders.rst
@@ -75,7 +75,8 @@ scrapy.Spider
     An optional list of strings containing domains that this spider is
     allowed to crawl. Requests for URLs not belonging to the domain names
     specified in this list (or their subdomains) won't be followed if
-    :class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` is enabled.
+    :class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` is
+    enabled.
 
     Let's say your target url is ``https://www.example.com/1.html``,
     then add ``'example.com'`` to the list.
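
A quick sketch of what that looks like in spider code::

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]  # also covers www.example.com
        start_urls = ["https://www.example.com/1.html"]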
17 changes: 14 additions & 3 deletions scrapy/core/engine.py
@@ -27,14 +27,15 @@
 from scrapy import signals
 from scrapy.core.downloader import Downloader
 from scrapy.core.scraper import Scraper
-from scrapy.exceptions import CloseSpider, DontCloseSpider
+from scrapy.exceptions import CloseSpider, DontCloseSpider, IgnoreRequest
 from scrapy.http import Request, Response
 from scrapy.logformatter import LogFormatter
 from scrapy.settings import BaseSettings, Settings
 from scrapy.signalmanager import SignalManager
 from scrapy.spiders import Spider
 from scrapy.utils.log import failure_to_exc_info, logformatter_adapter
 from scrapy.utils.misc import create_instance, load_object
+from scrapy.utils.python import global_object_name
 from scrapy.utils.reactor import CallLaterOnce
 
 if TYPE_CHECKING:
@@ -291,9 +292,19 @@ def crawl(self, request: Request) -> None:
         self.slot.nextcall.schedule()  # type: ignore[union-attr]
 
     def _schedule_request(self, request: Request, spider: Spider) -> None:
-        self.signals.send_catch_log(
-            signals.request_scheduled, request=request, spider=spider
+        request_scheduled_result = self.signals.send_catch_log(
+            signals.request_scheduled,
+            request=request,
+            spider=spider,
+            dont_log=IgnoreRequest,
         )
+        for handler, result in request_scheduled_result:
+            if isinstance(result, Failure) and isinstance(result.value, IgnoreRequest):
+                logger.debug(
+                    f"Signal handler {global_object_name(handler)} dropped "
+                    f"request {request} before it reached the scheduler."
+                )
+                return
         if not self.slot.scheduler.enqueue_request(request):  # type: ignore[union-attr]
             self.signals.send_catch_log(
                 signals.request_dropped, request=request, spider=spider
77 changes: 77 additions & 0 deletions scrapy/downloadermiddlewares/offsite.py
@@ -0,0 +1,77 @@
+import logging
+import re
+import warnings
+
+from scrapy import signals
+from scrapy.exceptions import IgnoreRequest
+from scrapy.utils.httpobj import urlparse_cached
+
+logger = logging.getLogger(__name__)
+
+
+class OffsiteMiddleware:
+    @classmethod
+    def from_crawler(cls, crawler):
+        o = cls(crawler.stats)
+        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
+        crawler.signals.connect(o.request_scheduled, signal=signals.request_scheduled)
+        return o
+
+    def __init__(self, stats):
+        self.stats = stats
+        self.domains_seen = set()
+
+    def spider_opened(self, spider):
+        self.host_regex = self.get_host_regex(spider)
+
+    def request_scheduled(self, request, spider):
+        self.process_request(request, spider)
+
+    def process_request(self, request, spider):
+        if request.dont_filter or self.should_follow(request, spider):
+            return None
+        domain = urlparse_cached(request).hostname
+        if domain and domain not in self.domains_seen:
+            self.domains_seen.add(domain)
+            logger.debug(
+                "Filtered offsite request to %(domain)r: %(request)s",
+                {"domain": domain, "request": request},
+                extra={"spider": spider},
+            )
+            self.stats.inc_value("offsite/domains", spider=spider)
+        self.stats.inc_value("offsite/filtered", spider=spider)
+        raise IgnoreRequest
+
+    def should_follow(self, request, spider):
+        regex = self.host_regex
+        # hostname can be None for wrong urls (like javascript links)
+        host = urlparse_cached(request).hostname or ""
+        return bool(regex.search(host))
+
+    def get_host_regex(self, spider):
+        """Override this method to implement a different offsite policy"""
+        allowed_domains = getattr(spider, "allowed_domains", None)
+        if not allowed_domains:
+            return re.compile("")  # allow all by default
+        url_pattern = re.compile(r"^https?://.*$")
+        port_pattern = re.compile(r":\d+$")
+        domains = []
+        for domain in allowed_domains:
+            if domain is None:
+                continue
+            if url_pattern.match(domain):
+                message = (
+                    "allowed_domains accepts only domains, not URLs. "
+                    f"Ignoring URL entry {domain} in allowed_domains."
+                )
+                warnings.warn(message)
+            elif port_pattern.search(domain):
+                message = (
+                    "allowed_domains accepts only domains without ports. "
+                    f"Ignoring entry {domain} in allowed_domains."
+                )
+                warnings.warn(message)
+            else:
+                domains.append(re.escape(domain))
+        regex = rf'^(.*\.)?({"|".join(domains)})$'
+        return re.compile(regex)
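
Since ``get_host_regex`` is the documented override point, a different offsite
policy can be a small subclass. A sketch, where ``allowed_domain_pattern`` is a
hypothetical spider attribute invented for this example::

    import re

    from scrapy.downloadermiddlewares.offsite import OffsiteMiddleware


    class PatternOffsiteMiddleware(OffsiteMiddleware):
        def get_host_regex(self, spider):
            # Let a spider supply a ready-made host pattern instead of a
            # domain list; otherwise fall back to the default policy.
            pattern = getattr(spider, "allowed_domain_pattern", None)
            if pattern:
                return re.compile(pattern)
            return super().get_host_regex(spider)

Such a subclass would replace the built-in entry in ``DOWNLOADER_MIDDLEWARES``
(registered at priority 50 by default, per ``default_settings.py`` below).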
6 changes: 3 additions & 3 deletions scrapy/extensions/memusage.py
@@ -128,9 +128,9 @@ def _check_warning(self):
     def _send_report(self, rcpts, subject):
         """send notification mail with some additional useful info"""
         stats = self.crawler.stats
-        s = f"Memory usage at engine startup : {stats.get_value('memusage/startup')/1024/1024}M\r\n"
-        s += f"Maximum memory usage           : {stats.get_value('memusage/max')/1024/1024}M\r\n"
-        s += f"Current memory usage           : {self.get_virtual_size()/1024/1024}M\r\n"
+        s = f"Memory usage at engine startup : {stats.get_value('memusage/startup') / 1024 / 1024}M\r\n"
+        s += f"Maximum memory usage           : {stats.get_value('memusage/max') / 1024 / 1024}M\r\n"
+        s += f"Current memory usage           : {self.get_virtual_size() / 1024 / 1024}M\r\n"
 
         s += (
             "ENGINE STATUS ------------------------------------------------------- \r\n"
2 changes: 1 addition & 1 deletion scrapy/settings/default_settings.py
@@ -101,6 +101,7 @@
 
 DOWNLOADER_MIDDLEWARES_BASE = {
     # Engine side
+    "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
     "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
     "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
     "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
@@ -299,7 +300,6 @@
 SPIDER_MIDDLEWARES_BASE = {
     # Engine side
     "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
-    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
     "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
     "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
     "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
7 changes: 7 additions & 0 deletions scrapy/spidermiddlewares/offsite.py
@@ -8,9 +8,16 @@
 import warnings
 
 from scrapy import signals
+from scrapy.exceptions import ScrapyDeprecationWarning
 from scrapy.http import Request
 from scrapy.utils.httpobj import urlparse_cached
 
+warnings.warn(
+    "The scrapy.spidermiddlewares.offsite module is deprecated, use "
+    "scrapy.downloadermiddlewares.offsite instead.",
+    ScrapyDeprecationWarning,
+)
+
 logger = logging.getLogger(__name__)
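
For projects whose settings referenced the deprecated class explicitly, a
migration sketch (the priorities shown match the defaults added in this
commit)::

    # settings.py, before:
    SPIDER_MIDDLEWARES = {
        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
    }

    # after: drop the entry above; the replacement is enabled by default, so
    # an explicit entry is only needed to change its priority:
    DOWNLOADER_MIDDLEWARES = {
        "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
    }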


