DNSCACHE_ENABLED not respected when in Spider.custom_settings #5988

Open
starrify opened this issue Jul 24, 2023 · 2 comments

@starrify (Contributor)

Description

It's observed that custom DNSCACHE_ENABLED is not respected when specified as part of Spider.custom_settings.

The issue affects other DNS* settings as well (verified). It's assumed that it would affect REACTOR_THREADPOOL_MAXSIZE as well (not verified -- see below for details).

Steps to Reproduce

Here's a minimal example I prepared that could trigger the issue:

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "DNSCACHE_ENABLED": False,
        "DNSCACHE_SIZE": 1234,
        "DNS_TIMEOUT": 5678,
    }
    start_urls = ["https://httpbingo.org/get"]

    def parse(self, response):
        from twisted.internet import reactor
        self.logger.info(f"FOO resolver {reactor.resolver}")
        self.logger.info(f"FOO timeout {reactor.resolver.timeout}")
        import scrapy.resolver
        self.logger.info(f"FOO cache {scrapy.resolver.dnscache}")
        self.logger.info(f"FOO cache size {scrapy.resolver.dnscache.limit}")

Upon invoking the spider, the log messages reveal that the custom_settings values are not respected:

$ scrapy runspider test.py 2>&1 | grep FOO
2023-07-24 21:27:51 [test] INFO: FOO resolver <scrapy.resolver.CachingThreadedResolver object at 0x7f950f775590>
2023-07-24 21:27:51 [test] INFO: FOO timeout 60.0
2023-07-24 21:27:51 [test] INFO: FOO cache LocalCache([('httpbingo.org', '77.83.142.42')])
2023-07-24 21:27:51 [test] INFO: FOO cache size 10000

However, if the same settings are passed via command-line arguments instead, they are respected:

$ scrapy runspider test.py -s DNSCACHE_ENABLED=False -s DNSCACHE_SIZE=4321 -s DNS_TIMEOUT=8765 2>&1 | grep FOO
2023-07-24 21:28:15 [test] INFO: FOO resolver <scrapy.resolver.CachingThreadedResolver object at 0x7fc1d5269590>
2023-07-24 21:28:15 [test] INFO: FOO timeout 8765.0
2023-07-24 21:28:15 [test] INFO: FOO cache LocalCache()
2023-07-24 21:28:15 [test] INFO: FOO cache size 0

Expected behavior: (please see above)

Actual behavior: (please see above)

Reproduces how often: Every time.

Versions

Both 2.9.0 (latest release) and 6e3e3c2 (latest Git revision).

Further Analysis

A spider's custom_settings attribute is accessed via Spider.update_settings in Crawler.__init__:

self.spidercls.update_settings(self.settings)

That ☝️ happens in CrawlerRunner.create_crawler, which is called from CrawlerRunner.crawl.
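For context, update_settings essentially merges the spider's custom_settings into the settings object it receives, at "spider" priority. A simplified sketch of what it does (not the verbatim source):

@classmethod
def update_settings(cls, settings):
    # custom_settings entries are applied at "spider" priority,
    # but only to the settings object that was passed in
    settings.setdict(cls.custom_settings or {}, priority="spider")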

Those DNS* setting entries are accessed in CrawlerProcess.start:

scrapy/scrapy/crawler.py, lines 384 to 385 in 52c0726:

resolver_class = load_object(self.settings["DNS_RESOLVER"])
resolver = create_instance(resolver_class, self.settings, self, reactor=reactor)

which happens later than the CrawlerRunner.crawl call above, in both scrapy crawl:

crawl_defer = self.crawler_process.crawl(spname, **opts.spargs)
if getattr(crawl_defer, "result", None) is not None and issubclass(
    crawl_defer.result.type, Exception
):
    self.exitcode = 1
else:
    self.crawler_process.start()

and in scrapy runspider:

self.crawler_process.crawl(spidercls, **opts.spargs)
self.crawler_process.start()

The cause of the issue is believed to be that Crawler.__init__, despite being called earlier, operates on a copy of the settings object:

scrapy/scrapy/crawler.py, lines 67 to 68 in 52c0726:

self.settings: Settings = settings.copy()
self.spidercls.update_settings(self.settings)

Therefore the settings update does not affect anything that happens later in CrawlerProcess.start. For the same reason it is assumed that REACTOR_THREADPOOL_MAXSIZE may be affected by the same issue:

tp.adjustPoolsize(maxthreads=self.settings.getint("REACTOR_THREADPOOL_MAXSIZE"))
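To illustrate the decoupling, here is a minimal standalone sketch using the scrapy.settings API (not actual crawler code):

from scrapy.settings import Settings

process_settings = Settings({"DNSCACHE_ENABLED": True, "DNSCACHE_SIZE": 10000})
crawler_settings = process_settings.copy()  # what Crawler.__init__ does
# what Spider.update_settings does with custom_settings:
crawler_settings.setdict({"DNSCACHE_ENABLED": False}, priority="spider")

print(crawler_settings.getbool("DNSCACHE_ENABLED"))  # False
print(process_settings.getbool("DNSCACHE_ENABLED"))  # True -- but this is the object CrawlerProcess.start() reads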

Temporary Workaround

It may be possible to re-apply the settings upon a spider's start, e.g. in the start_requests method:

    def start_requests(self):
        import scrapy.resolver
        if not self.settings.getbool("DNSCACHE_ENABLED"):
            scrapy.resolver.dnscache.limit = 0
        yield scrapy.Request("https://httpbingo.org/get")
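The same idea could presumably be extended to the other affected settings. A sketch, assuming reactor.resolver is the CachingThreadedResolver instance (as the log output above suggests) and that its timeout attribute is consulted on each lookup:

    def start_requests(self):
        import scrapy.resolver
        from twisted.internet import reactor

        # re-apply the cache-related settings to the shared module-level cache
        if self.settings.getbool("DNSCACHE_ENABLED"):
            scrapy.resolver.dnscache.limit = self.settings.getint("DNSCACHE_SIZE")
        else:
            scrapy.resolver.dnscache.limit = 0

        # assumption: the installed resolver keeps the DNS timeout in .timeout
        if hasattr(reactor.resolver, "timeout"):
            reactor.resolver.timeout = self.settings.getfloat("DNS_TIMEOUT")

        yield scrapy.Request("https://httpbingo.org/get")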
@wRAR (Member) commented Jul 25, 2023

Such things are documented at the end of https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process (which, as I now think, is not the only place it should be documented).

@GeorgeA92 (Contributor)

This is very similar to #4485, which relates to the TWISTED_REACTOR setting.

Quoting the original report: "That ☝️ happens in CrawlerRunner.create_crawler, which is called from CrawlerRunner.crawl."

It happens only once per process (not once per spider, as one would expect from custom_settings), and in the case of running multiple spiders in the same process it is not clear what the expected behavior is with a configuration like the following, where multiple spiders with different custom settings exist in a single CrawlerProcess:

process = CrawlerProcess()  # project settings <- the default DNS resolver reads its settings from here
process.crawl(MySpider1_with_DNS_custom_settings)  # custom_settings_1
process.crawl(MySpider2_with_other_DNS_related_custom_settings)  # custom_settings_2
process.start()

So, regarding making this take custom_settings into account, I have the same concern as in #4485 (comment).

However, it is possible to update the DNS resolver to use custom_settings for its initialization:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.resolver import CachingThreadedResolver


class CustomResolver(CachingThreadedResolver):

    @classmethod
    def from_crawler(cls, crawler, reactor):
        # CrawlerProcess.start() passes the CrawlerProcess itself here;
        # swap in the first Crawler so the resolver is built from its
        # (custom_settings-aware) settings instead of the process settings.
        if crawler.__class__ is CrawlerProcess:
            crawler = list(crawler.crawlers)[0]
        return super().from_crawler(crawler, reactor)


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "DNSCACHE_ENABLED": False,
        "DNSCACHE_SIZE": 1234,
        "DNS_TIMEOUT": 5678,
        "REACTOR_THREADPOOL_MAXSIZE": 40
    }
    start_urls = ["https://httpbingo.org/get"]


    def parse(self, response):
        from twisted.internet import reactor
        self.logger.info(f"FOO resolver {reactor.resolver}")
        self.logger.info(f"FOO timeout {reactor.resolver.timeout}")
        import scrapy.resolver
        self.logger.info(f"FOO cache {scrapy.resolver.dnscache}")
        self.logger.info(f"FOO cache size {scrapy.resolver.dnscache.limit}")

if __name__ == "__main__":
    process = CrawlerProcess(settings={"DNS_RESOLVER": CustomResolver})
    #process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
Log output:

2023-08-07 20:32:15 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-08-07 20:32:15 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1v  1 Aug 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2023-08-07 20:32:15 [scrapy.crawler] INFO: Overridden settings:
{'DNSCACHE_ENABLED': False,
 'DNSCACHE_SIZE': 1234,
 'DNS_RESOLVER': <class '__main__.CustomResolver'>,
 'DNS_TIMEOUT': 5678,
 'REACTOR_THREADPOOL_MAXSIZE': 40}
2023-08-07 20:32:15 [py.warnings] WARNING: <redacted>: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-08-07 20:32:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-08-07 20:32:15 [scrapy.extensions.telnet] INFO: Telnet Password: d414e7ebdd9f6323
2023-08-07 20:32:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2023-08-07 20:32:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-07 20:32:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-07 20:32:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-08-07 20:32:16 [scrapy.core.engine] INFO: Spider opened
2023-08-07 20:32:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-07 20:32:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-07 20:32:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbingo.org/get> (referer: None)
2023-08-07 20:32:16 [test] INFO: FOO resolver <__main__.CustomResolver object at 0x00000275A137B790>
2023-08-07 20:32:16 [test] INFO: FOO timeout 5678.0
2023-08-07 20:32:16 [test] INFO: FOO cache LocalCache()
2023-08-07 20:32:16 [test] INFO: FOO cache size 0
2023-08-07 20:32:16 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-07 20:32:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 220,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 707,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.286586,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 8, 7, 17, 32, 16, 366147),
 'httpcompression/response_bytes': 731,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 14,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 8, 7, 17, 32, 16, 79561)}
2023-08-07 20:32:16 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

It should work as expected for cases with only one spider in the CrawlerProcess and with the updated resolver enabled in the project settings.
