DNSCACHE_ENABLED not respected when in Spider.custom_settings #5988

Open
starrify opened this issue Jul 24, 2023 · 2 comments

@starrify (Contributor)

Description

It's observed that custom DNSCACHE_ENABLED is not respected when specified as part of Spider.custom_settings.

The issue affects other DNS* settings as well (verified). It's assumed that it would affect REACTOR_THREADPOOL_MAXSIZE as well (not verified -- see below for details).

Steps to Reproduce

Here's a minimal example I prepared that could trigger the issue:

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "DNSCACHE_ENABLED": False,
        "DNSCACHE_SIZE": 1234,
        "DNS_TIMEOUT": 5678,
    }
    start_urls = ["https://httpbingo.org/get"]

    def parse(self, response):
        from twisted.internet import reactor
        self.logger.info(f"FOO resolver {reactor.resolver}")
        self.logger.info(f"FOO timeout {reactor.resolver.timeout}")
        import scrapy.resolver
        self.logger.info(f"FOO cache {scrapy.resolver.dnscache}")
        self.logger.info(f"FOO cache size {scrapy.resolver.dnscache.limit}")

Upon invoking the spider, the log messages reveal that the custom_settings values are not respected:

$ scrapy runspider test.py 2>&1 | grep FOO
2023-07-24 21:27:51 [test] INFO: FOO resolver <scrapy.resolver.CachingThreadedResolver object at 0x7f950f775590>
2023-07-24 21:27:51 [test] INFO: FOO timeout 60.0
2023-07-24 21:27:51 [test] INFO: FOO cache LocalCache([('httpbingo.org', '77.83.142.42')])
2023-07-24 21:27:51 [test] INFO: FOO cache size 10000

However, if the same settings are passed via command-line arguments instead, they are respected:

$ scrapy runspider test.py -s DNSCACHE_ENABLED=False -s DNSCACHE_SIZE=4321 -s DNS_TIMEOUT=8765 2>&1 | grep FOO
2023-07-24 21:28:15 [test] INFO: FOO resolver <scrapy.resolver.CachingThreadedResolver object at 0x7fc1d5269590>
2023-07-24 21:28:15 [test] INFO: FOO timeout 8765.0
2023-07-24 21:28:15 [test] INFO: FOO cache LocalCache()
2023-07-24 21:28:15 [test] INFO: FOO cache size 0

Expected behavior: (please see above)

Actual behavior: (please see above)

Reproduces how often: Every time.

Versions

Both 2.9.0 (latest release) and 6e3e3c2 (latest Git revision).

Further Analysis

A spider's custom_settings attribute is accessed via Spider.update_settings in Crawler.__init__:

self.spidercls.update_settings(self.settings)

That ☝️ happens in CrawlerRunner.create_crawler, which is called from CrawlerRunner.crawl.
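For context, update_settings essentially merges the spider's custom_settings into the settings object it receives, at "spider" priority. A simplified sketch of what it does (not the verbatim source):

@classmethod
def update_settings(cls, settings):
    # custom_settings entries are applied at "spider" priority,
    # but only to the settings object that was passed in
    settings.setdict(cls.custom_settings or {}, priority="spider")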

Those DNS* setting entries are accessed in CrawlerProcess.start:

scrapy/scrapy/crawler.py, lines 384 to 385 in 52c0726:

resolver_class = load_object(self.settings["DNS_RESOLVER"])
resolver = create_instance(resolver_class, self.settings, self, reactor=reactor)

which happens later than the CrawlerRunner.crawl call above, in both scrapy crawl:

crawl_defer = self.crawler_process.crawl(spname, **opts.spargs)
if getattr(crawl_defer, "result", None) is not None and issubclass(
    crawl_defer.result.type, Exception
):
    self.exitcode = 1
else:
    self.crawler_process.start()

and in scrapy runspider:

self.crawler_process.crawl(spidercls, **opts.spargs)
self.crawler_process.start()

The cause of the issue is believed to be that Crawler.__init__, despite being called earlier, operates on a copy of the settings object:

scrapy/scrapy/crawler.py, lines 67 to 68 in 52c0726:

self.settings: Settings = settings.copy()
self.spidercls.update_settings(self.settings)

Therefore the settings update does not affect anything that happens later in CrawlerProcess.start. For the same reason it is assumed that REACTOR_THREADPOOL_MAXSIZE may be affected by the same issue:

tp.adjustPoolsize(maxthreads=self.settings.getint("REACTOR_THREADPOOL_MAXSIZE"))
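To illustrate the decoupling, here is a minimal standalone sketch using the scrapy.settings API (not actual crawler code):

from scrapy.settings import Settings

process_settings = Settings({"DNSCACHE_ENABLED": True, "DNSCACHE_SIZE": 10000})
crawler_settings = process_settings.copy()  # what Crawler.__init__ does
# what Spider.update_settings does with custom_settings:
crawler_settings.setdict({"DNSCACHE_ENABLED": False}, priority="spider")

print(crawler_settings.getbool("DNSCACHE_ENABLED"))  # False
print(process_settings.getbool("DNSCACHE_ENABLED"))  # True -- but this is the object CrawlerProcess.start() reads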

Temporary Workaround

It may be possible to re-apply the settings upon a spider's start, e.g. in the start_requests method:

    def start_requests(self):
        import scrapy.resolver
        if not self.settings.getbool("DNSCACHE_ENABLED"):
            scrapy.resolver.dnscache.limit = 0
        yield scrapy.Request("https://httpbingo.org/get")
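The same idea could presumably be extended to the other affected settings. A sketch, assuming reactor.resolver is the CachingThreadedResolver instance (as the log output above suggests) and that its timeout attribute is consulted on each lookup:

    def start_requests(self):
        import scrapy.resolver
        from twisted.internet import reactor

        # re-apply the cache-related settings to the shared module-level cache
        if self.settings.getbool("DNSCACHE_ENABLED"):
            scrapy.resolver.dnscache.limit = self.settings.getint("DNSCACHE_SIZE")
        else:
            scrapy.resolver.dnscache.limit = 0

        # assumption: the installed resolver keeps the DNS timeout in .timeout
        if hasattr(reactor.resolver, "timeout"):
            reactor.resolver.timeout = self.settings.getfloat("DNS_TIMEOUT")

        yield scrapy.Request("https://httpbingo.org/get")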
@wRAR (Member) commented Jul 25, 2023

Such things are documented at the end of https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process (which, as I now think, is not the only place it should be documented).

@GeorgeA92 (Contributor)

This is very similar to #4485, which relates to the TWISTED_REACTOR setting.

Quoting the original report: "That ☝️ happens in CrawlerRunner.create_crawler, which is called from CrawlerRunner.crawl."

It happens only once per process (not once per spider, as one would expect from custom_settings), and in the case of running multiple spiders in the same process it is not clear what the expected behavior is with a configuration like the following, where multiple spiders with different custom settings exist in a single CrawlerProcess:

process = CrawlerProcess()  # project settings <- the default DNS resolver reads its settings from here
process.crawl(MySpider1_with_DNS_custom_settings)  # custom_settings_1
process.crawl(MySpider2_with_other_DNS_related_custom_settings)  # custom_settings_2
process.start()

So, regarding making this take custom_settings into account, I have the same concern as in #4485 (comment).

However, it is possible to update the DNS resolver to use custom_settings for its initialization:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.resolver import CachingThreadedResolver


class CustomResolver(CachingThreadedResolver):

    @classmethod
    def from_crawler(cls, crawler, reactor):
        # CrawlerProcess.start() passes the CrawlerProcess itself here;
        # swap in the first Crawler so the resolver is built from its
        # (custom_settings-aware) settings instead of the process settings.
        if crawler.__class__ is CrawlerProcess:
            crawler = list(crawler.crawlers)[0]
        return super().from_crawler(crawler, reactor)


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "DNSCACHE_ENABLED": False,
        "DNSCACHE_SIZE": 1234,
        "DNS_TIMEOUT": 5678,
        "REACTOR_THREADPOOL_MAXSIZE": 40
    }
    start_urls = ["https://httpbingo.org/get"]


    def parse(self, response):
        from twisted.internet import reactor
        self.logger.info(f"FOO resolver {reactor.resolver}")
        self.logger.info(f"FOO timeout {reactor.resolver.timeout}")
        import scrapy.resolver
        self.logger.info(f"FOO cache {scrapy.resolver.dnscache}")
        self.logger.info(f"FOO cache size {scrapy.resolver.dnscache.limit}")

if __name__ == "__main__":
    process = CrawlerProcess(settings={"DNS_RESOLVER": CustomResolver})
    #process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
Log output:

2023-08-07 20:32:15 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-08-07 20:32:15 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1v  1 Aug 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2023-08-07 20:32:15 [scrapy.crawler] INFO: Overridden settings:
{'DNSCACHE_ENABLED': False,
 'DNSCACHE_SIZE': 1234,
 'DNS_RESOLVER': <class '__main__.CustomResolver'>,
 'DNS_TIMEOUT': 5678,
 'REACTOR_THREADPOOL_MAXSIZE': 40}
2023-08-07 20:32:15 [py.warnings] WARNING: <redacted>: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-08-07 20:32:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-08-07 20:32:15 [scrapy.extensions.telnet] INFO: Telnet Password: d414e7ebdd9f6323
2023-08-07 20:32:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2023-08-07 20:32:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-07 20:32:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-07 20:32:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-08-07 20:32:16 [scrapy.core.engine] INFO: Spider opened
2023-08-07 20:32:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-07 20:32:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-07 20:32:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbingo.org/get> (referer: None)
2023-08-07 20:32:16 [test] INFO: FOO resolver <__main__.CustomResolver object at 0x00000275A137B790>
2023-08-07 20:32:16 [test] INFO: FOO timeout 5678.0
2023-08-07 20:32:16 [test] INFO: FOO cache LocalCache()
2023-08-07 20:32:16 [test] INFO: FOO cache size 0
2023-08-07 20:32:16 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-07 20:32:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 220,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 707,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.286586,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 8, 7, 17, 32, 16, 366147),
 'httpcompression/response_bytes': 731,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 14,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 8, 7, 17, 32, 16, 79561)}
2023-08-07 20:32:16 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

It should work as expected for cases with only one spider in the CrawlerProcess and with the updated resolver enabled in the project settings.
