Fix the offsite middleware missing some requests (#6358)
Gallaecio committed May 13, 2024
1 parent 3562618 commit f149ea4
Showing 18 changed files with 401 additions and 83 deletions.
40 changes: 18 additions & 22 deletions docs/faq.rst
@@ -138,39 +138,37 @@ See previous question.
 How can I prevent memory errors due to many allowed domains?
 ------------------------------------------------------------
 
-If you have a spider with a long list of
-:attr:`~scrapy.Spider.allowed_domains` (e.g. 50,000+), consider
-replacing the default
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` spider middleware
-with a :ref:`custom spider middleware <custom-spider-middleware>` that requires
-less memory. For example:
+If you have a spider with a long list of :attr:`~scrapy.Spider.allowed_domains`
+(e.g. 50,000+), consider replacing the default
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` downloader
+middleware with a :ref:`custom downloader middleware
+<topics-downloader-middleware-custom>` that requires less memory. For example:
 
 -   If your domain names are similar enough, use your own regular expression
-    instead joining the strings in
-    :attr:`~scrapy.Spider.allowed_domains` into a complex regular
-    expression.
+    instead of joining the strings in :attr:`~scrapy.Spider.allowed_domains`
+    into a complex regular expression.
 
 -   If you can `meet the installation requirements`_, use pyre2_ instead of
     Python’s re_ to compile your URL-filtering regular expression. See
     :issue:`1908`.
 
-See also other suggestions at `StackOverflow`_.
+See also `other suggestions at StackOverflow
+<https://stackoverflow.com/q/36440681>`__.
 
 .. note:: Remember to disable
-    :class:`scrapy.spidermiddlewares.offsite.OffsiteMiddleware` when you enable
-    your custom implementation:
+    :class:`scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` when you
+    enable your custom implementation:
 
     .. code-block:: python
 
-        SPIDER_MIDDLEWARES = {
-            "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
-            "myproject.middlewares.CustomOffsiteMiddleware": 500,
+        DOWNLOADER_MIDDLEWARES = {
+            "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
+            "myproject.middlewares.CustomOffsiteMiddleware": 50,
         }
 
 .. _meet the installation requirements: https://github.com/andreasvc/pyre2#installation
 .. _pyre2: https://github.com/andreasvc/pyre2
 .. _re: https://docs.python.org/library/re.html
-.. _StackOverflow: https://stackoverflow.com/q/36440681/939364
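
To make the note above concrete, here is a minimal sketch of such a custom
downloader middleware (``myproject/middlewares.py``), assuming pyre2 is
installed and exposes its usual re-compatible ``compile``/``escape`` API.
``CustomOffsiteMiddleware`` is the name already used in the settings snippet;
everything else is illustrative, not part of the commit::

    import re2  # pyre2: re-compatible API with more memory-efficient patterns

    from scrapy.exceptions import IgnoreRequest
    from scrapy.utils.httpobj import urlparse_cached


    class CustomOffsiteMiddleware:
        """Drop requests whose host is not covered by allowed_domains."""

        def process_request(self, request, spider):
            if not hasattr(self, "_host_regex"):
                domains = [
                    re2.escape(d)
                    for d in getattr(spider, "allowed_domains", None) or []
                    if d
                ]
                # No domains means allow everything, like the built-in middleware.
                self._host_regex = (
                    re2.compile(rf'^(.*\.)?({"|".join(domains)})$') if domains else None
                )
            if request.dont_filter or self._host_regex is None:
                return None  # let the request continue through the chain
            host = urlparse_cached(request).hostname or ""
            if not self._host_regex.search(host):
                raise IgnoreRequest(f"Filtered offsite request: {request}")
            return None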

Can I use Basic HTTP Authentication in my spiders?
--------------------------------------------------
@@ -206,12 +204,10 @@ I get "Filtered offsite request" messages. How can I fix them?
 Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
 problem, so you may not need to fix them.
 
-Those messages are thrown by the Offsite Spider Middleware, which is a spider
-middleware (enabled by default) whose purpose is to filter out requests to
-domains outside the ones covered by the spider.
-
-For more info see:
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware`.
+Those messages are thrown by
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`, which is a
+downloader middleware (enabled by default) whose purpose is to filter out
+requests to domains outside the ones covered by the spider.
 
 What is the recommended way to deploy a Scrapy crawler in production?
 ---------------------------------------------------------------------
4 changes: 2 additions & 2 deletions docs/topics/benchmarking.rst
@@ -24,7 +24,8 @@ You should see an output like this::
      'scrapy.extensions.telnet.TelnetConsole',
      'scrapy.extensions.corestats.CoreStats']
     2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
-    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+    ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
+     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
      'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
      'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
      'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
@@ -37,7 +38,6 @@ You should see an output like this::
      'scrapy.downloadermiddlewares.stats.DownloaderStats']
     2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
     ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
-     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
      'scrapy.spidermiddlewares.referer.RefererMiddleware',
      'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
      'scrapy.spidermiddlewares.depth.DepthMiddleware']
38 changes: 38 additions & 0 deletions docs/topics/downloader-middleware.rst
@@ -763,6 +763,44 @@ HttpProxyMiddleware
 Keep in mind this value will take precedence over ``http_proxy``/``https_proxy``
 environment variables, and it will also ignore ``no_proxy`` environment variable.
 
+OffsiteMiddleware
+-----------------
+
+.. module:: scrapy.downloadermiddlewares.offsite
+   :synopsis: Offsite Middleware
+
+.. class:: OffsiteMiddleware
+
+   .. versionadded:: VERSION
+
+   Filters out Requests for URLs outside the domains covered by the spider.
+
+   This middleware filters out every request whose host name isn't in the
+   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
+   All subdomains of any domain in the list are also allowed.
+   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
+   but not ``www2.example.org`` nor ``example.org``.
+
+   When your spider returns a request for a domain not belonging to those
+   covered by the spider, this middleware will log a debug message similar to
+   this one::
+
+       DEBUG: Filtered offsite request to 'offsite.example': <GET http://offsite.example/some/page.html>
+
+   To avoid filling the log with too much noise, it will only print one of
+   these messages for each new domain filtered. So, for example, if another
+   request for ``offsite.example`` is filtered, no log message will be
+   printed. But if a request for ``other.example`` is filtered, a message
+   will be printed (but only for the first request filtered).
+
+   If the spider doesn't define an
+   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
+   attribute is empty, the offsite middleware will allow all requests.
+
+   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
+   set, the offsite middleware will allow the request even if its domain is
+   not listed in allowed domains.
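
For illustration, a minimal spider sketch (hypothetical names and URLs, not
part of the commit) that exercises the behavior documented above: the second
request would be filtered as offsite, while ``dont_filter`` lets the third one
through::

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.org"]  # subdomains are allowed too
        start_urls = ["https://www.example.org/"]

        def parse(self, response):
            # Allowed: bob.example.org is a subdomain of example.org.
            yield scrapy.Request("https://bob.example.org/page")
            # Filtered by OffsiteMiddleware: host not under example.org.
            yield scrapy.Request("http://offsite.example/some/page.html")
            # Allowed despite being offsite: dont_filter bypasses the check.
            yield scrapy.Request("http://offsite.example/other", dont_filter=True)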

RedirectMiddleware
------------------

2 changes: 1 addition & 1 deletion docs/topics/settings.rst
@@ -674,6 +674,7 @@ Default:
 .. code-block:: python
 
     {
+        "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
         "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
         "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
         "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
@@ -1605,7 +1606,6 @@ Default:
     {
         "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
-        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
         "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
         "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
         "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
11 changes: 9 additions & 2 deletions docs/topics/signals.rst
@@ -343,11 +343,18 @@ request_scheduled
 .. signal:: request_scheduled
 .. function:: request_scheduled(request, spider)
 
-    Sent when the engine schedules a :class:`~scrapy.Request`, to be
-    downloaded later.
+    Sent when the engine is asked to schedule a :class:`~scrapy.Request`, to be
+    downloaded later, before the request reaches the :ref:`scheduler
+    <topics-scheduler>`.
+
+    Raise :exc:`~scrapy.exceptions.IgnoreRequest` to drop a request before it
+    reaches the scheduler.
 
     This signal does not support returning deferreds from its handlers.
 
+    .. versionadded:: VERSION
+       Allow dropping requests with :exc:`~scrapy.exceptions.IgnoreRequest`.
+
     :param request: the request that reached the scheduler
     :type request: :class:`~scrapy.Request` object
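
A handler sketch using the new behavior; ``DropPdfRequests`` is a hypothetical
extension invented for this example, enabled like any other through the
``EXTENSIONS`` setting::

    from scrapy import signals
    from scrapy.exceptions import IgnoreRequest


    class DropPdfRequests:
        """Toy extension: veto any request whose URL ends in .pdf."""

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(
                ext.request_scheduled, signal=signals.request_scheduled
            )
            return ext

        def request_scheduled(self, request, spider):
            if request.url.endswith(".pdf"):
                # The engine catches this and the request never reaches the
                # scheduler (see the engine change in this commit).
                raise IgnoreRequest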

40 changes: 2 additions & 38 deletions docs/topics/spider-middleware.rst
@@ -51,8 +51,8 @@ value. For example, if you want to disable the off-site middleware:
 .. code-block:: python
 
     SPIDER_MIDDLEWARES = {
-        "myproject.middlewares.CustomSpiderMiddleware": 543,
-        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
+        "scrapy.spidermiddlewares.referer.RefererMiddleware": None,
+        "myproject.middlewares.CustomRefererSpiderMiddleware": 700,
     }
 
 Finally, keep in mind that some middlewares may need to be enabled through a
@@ -313,42 +313,6 @@ Default: ``False``
 
 Pass all responses, regardless of their status codes.
 
-OffsiteMiddleware
------------------
-
-.. module:: scrapy.spidermiddlewares.offsite
-   :synopsis: Offsite Spider Middleware
-
-.. class:: OffsiteMiddleware
-
-   Filters out Requests for URLs outside the domains covered by the spider.
-
-   This middleware filters out every request whose host names aren't in the
-   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
-   All subdomains of any domain in the list are also allowed.
-   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
-   but not ``www2.example.com`` nor ``example.com``.
-
-   When your spider returns a request for a domain not belonging to those
-   covered by the spider, this middleware will log a debug message similar to
-   this one::
-
-       DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>
-
-   To avoid filling the log with too much noise, it will only print one of
-   these messages for each new domain filtered. So, for example, if another
-   request for ``www.othersite.com`` is filtered, no log message will be
-   printed. But if a request for ``someothersite.com`` is filtered, a message
-   will be printed (but only for the first request filtered).
-
-   If the spider doesn't define an
-   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
-   attribute is empty, the offsite middleware will allow all requests.
-
-   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
-   set, the offsite middleware will allow the request even if its domain is not
-   listed in allowed domains.
-
 
 RefererMiddleware
 -----------------
3 changes: 2 additions & 1 deletion docs/topics/spiders.rst
@@ -75,7 +75,8 @@ scrapy.Spider
     An optional list of strings containing domains that this spider is
     allowed to crawl. Requests for URLs not belonging to the domain names
     specified in this list (or their subdomains) won't be followed if
-    :class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` is enabled.
+    :class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` is
+    enabled.
 
     Let's say your target url is ``https://www.example.com/1.html``,
     then add ``'example.com'`` to the list.
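
A quick sketch of what that looks like in spider code::

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]  # also covers www.example.com
        start_urls = ["https://www.example.com/1.html"]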
17 changes: 14 additions & 3 deletions scrapy/core/engine.py
@@ -27,14 +27,15 @@
 from scrapy import signals
 from scrapy.core.downloader import Downloader
 from scrapy.core.scraper import Scraper
-from scrapy.exceptions import CloseSpider, DontCloseSpider
+from scrapy.exceptions import CloseSpider, DontCloseSpider, IgnoreRequest
 from scrapy.http import Request, Response
 from scrapy.logformatter import LogFormatter
 from scrapy.settings import BaseSettings, Settings
 from scrapy.signalmanager import SignalManager
 from scrapy.spiders import Spider
 from scrapy.utils.log import failure_to_exc_info, logformatter_adapter
 from scrapy.utils.misc import create_instance, load_object
+from scrapy.utils.python import global_object_name
 from scrapy.utils.reactor import CallLaterOnce
 
 if TYPE_CHECKING:
@@ -291,9 +292,19 @@ def crawl(self, request: Request) -> None:
         self.slot.nextcall.schedule()  # type: ignore[union-attr]
 
     def _schedule_request(self, request: Request, spider: Spider) -> None:
-        self.signals.send_catch_log(
-            signals.request_scheduled, request=request, spider=spider
+        request_scheduled_result = self.signals.send_catch_log(
+            signals.request_scheduled,
+            request=request,
+            spider=spider,
+            dont_log=IgnoreRequest,
         )
+        for handler, result in request_scheduled_result:
+            if isinstance(result, Failure) and isinstance(result.value, IgnoreRequest):
+                logger.debug(
+                    f"Signal handler {global_object_name(handler)} dropped "
+                    f"request {request} before it reached the scheduler."
+                )
+                return
         if not self.slot.scheduler.enqueue_request(request):  # type: ignore[union-attr]
             self.signals.send_catch_log(
                 signals.request_dropped, request=request, spider=spider
77 changes: 77 additions & 0 deletions scrapy/downloadermiddlewares/offsite.py
@@ -0,0 +1,77 @@
+import logging
+import re
+import warnings
+
+from scrapy import signals
+from scrapy.exceptions import IgnoreRequest
+from scrapy.utils.httpobj import urlparse_cached
+
+logger = logging.getLogger(__name__)
+
+
+class OffsiteMiddleware:
+    @classmethod
+    def from_crawler(cls, crawler):
+        o = cls(crawler.stats)
+        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
+        crawler.signals.connect(o.request_scheduled, signal=signals.request_scheduled)
+        return o
+
+    def __init__(self, stats):
+        self.stats = stats
+        self.domains_seen = set()
+
+    def spider_opened(self, spider):
+        self.host_regex = self.get_host_regex(spider)
+
+    def request_scheduled(self, request, spider):
+        self.process_request(request, spider)
+
+    def process_request(self, request, spider):
+        if request.dont_filter or self.should_follow(request, spider):
+            return None
+        domain = urlparse_cached(request).hostname
+        if domain and domain not in self.domains_seen:
+            self.domains_seen.add(domain)
+            logger.debug(
+                "Filtered offsite request to %(domain)r: %(request)s",
+                {"domain": domain, "request": request},
+                extra={"spider": spider},
+            )
+            self.stats.inc_value("offsite/domains", spider=spider)
+        self.stats.inc_value("offsite/filtered", spider=spider)
+        raise IgnoreRequest
+
+    def should_follow(self, request, spider):
+        regex = self.host_regex
+        # hostname can be None for wrong urls (like javascript links)
+        host = urlparse_cached(request).hostname or ""
+        return bool(regex.search(host))
+
+    def get_host_regex(self, spider):
+        """Override this method to implement a different offsite policy"""
+        allowed_domains = getattr(spider, "allowed_domains", None)
+        if not allowed_domains:
+            return re.compile("")  # allow all by default
+        url_pattern = re.compile(r"^https?://.*$")
+        port_pattern = re.compile(r":\d+$")
+        domains = []
+        for domain in allowed_domains:
+            if domain is None:
+                continue
+            if url_pattern.match(domain):
+                message = (
+                    "allowed_domains accepts only domains, not URLs. "
+                    f"Ignoring URL entry {domain} in allowed_domains."
+                )
+                warnings.warn(message)
+            elif port_pattern.search(domain):
+                message = (
+                    "allowed_domains accepts only domains without ports. "
+                    f"Ignoring entry {domain} in allowed_domains."
+                )
+                warnings.warn(message)
+            else:
+                domains.append(re.escape(domain))
+        regex = rf'^(.*\.)?({"|".join(domains)})$'
+        return re.compile(regex)
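
Since ``get_host_regex`` is the documented override point, a different offsite
policy can be a small subclass. A sketch, where ``allowed_domain_pattern`` is a
hypothetical spider attribute invented for this example::

    import re

    from scrapy.downloadermiddlewares.offsite import OffsiteMiddleware


    class PatternOffsiteMiddleware(OffsiteMiddleware):
        def get_host_regex(self, spider):
            # Let a spider supply a ready-made host pattern instead of a
            # domain list; otherwise fall back to the default policy.
            pattern = getattr(spider, "allowed_domain_pattern", None)
            if pattern:
                return re.compile(pattern)
            return super().get_host_regex(spider)

Such a subclass would replace the built-in entry in ``DOWNLOADER_MIDDLEWARES``
(registered at priority 50 by default, per ``default_settings.py`` below).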
6 changes: 3 additions & 3 deletions scrapy/extensions/memusage.py
@@ -128,9 +128,9 @@ def _check_warning(self):
     def _send_report(self, rcpts, subject):
         """send notification mail with some additional useful info"""
         stats = self.crawler.stats
-        s = f"Memory usage at engine startup : {stats.get_value('memusage/startup')/1024/1024}M\r\n"
-        s += f"Maximum memory usage           : {stats.get_value('memusage/max')/1024/1024}M\r\n"
-        s += f"Current memory usage           : {self.get_virtual_size()/1024/1024}M\r\n"
+        s = f"Memory usage at engine startup : {stats.get_value('memusage/startup') / 1024 / 1024}M\r\n"
+        s += f"Maximum memory usage           : {stats.get_value('memusage/max') / 1024 / 1024}M\r\n"
+        s += f"Current memory usage           : {self.get_virtual_size() / 1024 / 1024}M\r\n"
 
         s += (
             "ENGINE STATUS ------------------------------------------------------- \r\n"
2 changes: 1 addition & 1 deletion scrapy/settings/default_settings.py
@@ -101,6 +101,7 @@
 
 DOWNLOADER_MIDDLEWARES_BASE = {
     # Engine side
+    "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
     "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
     "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
     "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
@@ -299,7 +300,6 @@
 SPIDER_MIDDLEWARES_BASE = {
     # Engine side
     "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
-    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
     "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
     "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
     "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
7 changes: 7 additions & 0 deletions scrapy/spidermiddlewares/offsite.py
@@ -8,9 +8,16 @@
 import warnings
 
 from scrapy import signals
+from scrapy.exceptions import ScrapyDeprecationWarning
 from scrapy.http import Request
 from scrapy.utils.httpobj import urlparse_cached
 
+warnings.warn(
+    "The scrapy.spidermiddlewares.offsite module is deprecated, use "
+    "scrapy.downloadermiddlewares.offsite instead.",
+    ScrapyDeprecationWarning,
+)
+
 logger = logging.getLogger(__name__)
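
For projects whose settings referenced the deprecated class explicitly, a
migration sketch (the priorities shown match the defaults added in this
commit)::

    # settings.py, before:
    SPIDER_MIDDLEWARES = {
        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
    }

    # after: drop the entry above; the replacement is enabled by default, so
    # an explicit entry is only needed to change its priority:
    DOWNLOADER_MIDDLEWARES = {
        "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
    }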


