DNSCACHE_ENABLED not respected when in Spider.custom_settings #5988
Such things are documented at the end of https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process (which, as I now think, is not the only place it should be documented).
This is very similar to the related issue #4485.

It happens only once per process (not once per spider, as one would expect from `custom_settings`), and in the "Running multiple spiders in the same process" case it is not clear what the expected outcome is with a configuration like this, where multiple spiders with different custom settings exist in a single process:

```python
process = CrawlerProcess()  # project settings; the default DNS resolver reads its settings from here
process.crawl(MySpider1_with_DNS_custom_settings)                # custom_settings_1
process.crawl(MySpider2_with_other_DNS_related_custom_settings)  # custom_settings_2
process.start()
```

So, for making this honour `custom_settings`, I have the same concern as in #4485 (comment). However, it is possible to update the DNS resolver to use `custom_settings` on init:

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.resolver import CachingThreadedResolver


class CustomResolver(CachingThreadedResolver):
    @classmethod
    def from_crawler(cls, crawler, reactor):
        # When called from CrawlerProcess.start(), `crawler` is the process
        # itself; pick its first crawler so the spider's custom_settings
        # are used instead of the process-level settings.
        if crawler.__class__ is CrawlerProcess:
            crawler = list(crawler.crawlers)[0]
        return super().from_crawler(crawler, reactor)


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "DNSCACHE_ENABLED": False,
        "DNSCACHE_SIZE": 1234,
        "DNS_TIMEOUT": 5678,
        "REACTOR_THREADPOOL_MAXSIZE": 40,
    }
    start_urls = ["https://httpbingo.org/get"]

    def parse(self, response):
        from twisted.internet import reactor
        self.logger.info(f"FOO resolver {reactor.resolver}")
        self.logger.info(f"FOO timeout {reactor.resolver.timeout}")
        import scrapy.resolver
        self.logger.info(f"FOO cache {scrapy.resolver.dnscache}")
        self.logger.info(f"FOO cache size {scrapy.resolver.dnscache.limit}")


if __name__ == "__main__":
    process = CrawlerProcess(settings={"DNS_RESOLVER": CustomResolver})
    # process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
```

It should work as expected for cases with only one spider in the process.
Description
It's observed that a custom `DNSCACHE_ENABLED` is not respected when specified as part of `Spider.custom_settings`. The issue affects other `DNS*` settings as well (verified). It's assumed that it would affect `REACTOR_THREADPOOL_MAXSIZE` as well (not verified -- see below for details).
Steps to Reproduce
Here's a minimal example I prepared that could trigger the issue:
Upon invoking the spider, the log messages reveal that the `custom_settings` values are not respected. However, if the same settings are passed via command-line arguments instead, they are respected.
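For reference, a command-line invocation of that kind could look as follows (a sketch only, assuming the spider name `test` from the example script above and a standard Scrapy project; `-s` is Scrapy's option for overriding individual settings):

```shell
scrapy crawl test -s DNSCACHE_ENABLED=False -s DNSCACHE_SIZE=1234 -s DNS_TIMEOUT=5678
```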
Expected behavior: (please see above)
Actual behavior: (please see above)
Reproduces how often: Every time.
Versions
Both 2.9.0 (latest release) and 6e3e3c2 (latest Git revision).
Further Analysis
A spider's `custom_settings` attribute is accessed via `Spider.update_settings` in `Crawler.__init__`:

scrapy/scrapy/crawler.py, line 68 in 52c0726

That ☝️ happens in `CrawlerRunner.create_crawler`, which is called from `CrawlerRunner.crawl`.

The `DNS*` setting entries are accessed in `CrawlerProcess.start`:

scrapy/scrapy/crawler.py, lines 384 to 385 in 52c0726

which actually happens later than the above `CrawlerRunner.crawl` call, in both `scrapy crawl`:

scrapy/scrapy/commands/crawl.py, lines 23 to 30 in 52c0726

and in `scrapy runspider`:

scrapy/scrapy/commands/runspider.py, lines 54 to 55 in 52c0726

The cause of the issue is believed to be that the `Crawler.__init__` method, despite being called earlier, operates on a copy of the settings object:

scrapy/scrapy/crawler.py, lines 67 to 68 in 52c0726

Therefore the settings update does not affect anything that happens later in `CrawlerProcess.start`. For the same reason it is assumed that `REACTOR_THREADPOOL_MAXSIZE` may be impacted by the same issue:

scrapy/scrapy/crawler.py, line 388 in 52c0726
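The copy semantics can be modeled with plain Python (an illustration only; a dict stands in for Scrapy's `Settings` object, and this is not Scrapy's actual code):

```python
import copy

# Stand-in for the process-level settings object that
# CrawlerProcess.start() later reads the DNS* values from.
process_settings = {"DNSCACHE_ENABLED": True, "DNS_TIMEOUT": 60}

# Crawler.__init__ works on a copy; Spider.update_settings applies
# custom_settings to that copy only.
crawler_settings = copy.deepcopy(process_settings)
crawler_settings["DNSCACHE_ENABLED"] = False  # the spider's custom value

# The original object never sees the override, so the resolver is
# still configured with the process-level defaults.
print(process_settings["DNSCACHE_ENABLED"])  # → True
print(crawler_settings["DNSCACHE_ENABLED"])  # → False
```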
Temporary Workaround
It may be possible to re-apply the settings upon a spider's start, e.g. in the `start_requests` method.