
2.6.0 breaks calling multiple Spider in CrawlerProcess() #5435

Closed
hideishi-m opened this issue Mar 2, 2022 · 11 comments · Fixed by #5436
hideishi-m commented Mar 2, 2022

Description

Since 2.6.0, running multiple spiders from CrawlerProcess(), as shown in the common practices documentation, is broken:

https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

Steps to Reproduce

  1. Use Scrapy >= 2.6.0.
  2. Run the following code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request


class MySpider(scrapy.Spider):
    name = 'MySpider'

    def __init__(self, url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url

    def start_requests(self):
        yield Request(url=self.url, callback=self.parse)

    def parse(self, response):
        print(response.url)


process = CrawlerProcess({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1
})
process.crawl(MySpider, url='https://www.google.com')
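# On Scrapy 2.6.x this second crawl() raises
# twisted.internet.error.ReactorAlreadyInstalledError (see traceback below)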
process.crawl(MySpider, url='https://www.google.co.jp')
process.start()

Expected behavior:

The following is the result from Scrapy 2.5.1:

2022-03-02 18:49:45 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-02 18:49:45 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: afe09d724aae9642
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: bd1670acfb7fb550
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com> (referer: None)
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.jp> (referer: None)
https://www.google.com
https://www.google.co.jp
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7675,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.465932,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 66477),
 'httpcompression/response_bytes': 15980,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 19,
 'memusage/max': 47611904,
 'memusage/startup': 47611904,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 600545)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 216,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7602,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.431935,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 98011),
 'httpcompression/response_bytes': 14794,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 13,
 'memusage/max': 47669248,
 'memusage/startup': 47669248,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 666076)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)

Actual behavior:

The second process.crawl() call fails with twisted.internet.error.ReactorAlreadyInstalledError:

2022-03-02 18:49:12 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-02 18:49:12 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:12 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:12 [scrapy.extensions.telnet] INFO: Telnet Password: ce57e6aa863bb786
2022-03-02 18:49:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:13 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:13 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
Traceback (most recent call last):
  File "/home/kusanagi/work/scrapy/test.py", line 25, in <module>
    process.crawl(MySpider, url='https://www.google.co.jp')
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 256, in install
    installReactor(p)
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

Reproduces how often:

Always.

Versions

Scrapy : 2.6.1
lxml : 4.8.0.0
libxml2 : 2.9.12
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 22.1.0
Python : 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)]
pyOpenSSL : 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021)
cryptography : 36.0.1
Platform : Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17

Additional context

The intention of using the same MySpider from CrawlerProcess is to call Scrapy programmatically with different initial URLs, with some tweaks to the parser depending on the initial URL.

I think this is a fair use case, and it was working fine before 2.6.0.
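A possible interim workaround, sketched from the CrawlerRunner example on the same practices page (assuming CrawlerRunner, unlike CrawlerProcess in 2.6.x, does not try to install a reactor per crawler), is to manage the reactor yourself and reuse the MySpider class above:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1
})
runner.crawl(MySpider, url='https://www.google.com')
runner.crawl(MySpider, url='https://www.google.co.jp')
d = runner.join()                    # Deferred that fires once both crawls finish
d.addBoth(lambda _: reactor.stop())  # then stop the reactor
reactor.run()                        # blocks until reactor.stop() is called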

Gallaecio added the bug label Mar 2, 2022
@Gallaecio (Member)

Is it only reproducible if the spider class is the same?

@hideishi-m (Author)

No, it happens even if a different spider class is used.
I copied the complete MySpider class as MySpider2 and used MySpider2 for the second crawl.

process.crawl(MySpider, url='https://www.google.com')
process.crawl(MySpider2, url='https://www.google.co.jp')

The following is the last traceback:

Traceback (most recent call last):
  File "/home/kusanagi/work/scrapy/test.py", line 39, in <module>
    process.crawl(MySpider2, url='https://www.google.co.jp')
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 256, in install
    installReactor(p)
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

Gallaecio self-assigned this Mar 2, 2022
@Gallaecio (Member) commented Mar 2, 2022

I have identified 60c8838 as the cause (things work with its parent commit, 46ef9cf). Working on a fix.
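For context, the traceback bottoms out in Twisted refusing to install a second reactor in the same process. That underlying behavior can be demonstrated standalone (a minimal sketch, independent of Scrapy; the second install() is what each extra Crawler(..., init_reactor=True) effectively triggers):

from twisted.internet import default
from twisted.internet.error import ReactorAlreadyInstalledError

default.install()      # first installation succeeds (like the first crawl())
try:
    default.install()  # a second installation is rejected by Twisted
except ReactorAlreadyInstalledError as e:
    print(e)           # "reactor already installed"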

honzajavorek added a commit to juniorguru/junior.guru that referenced this issue Mar 3, 2022
Gallaecio added this to the 2.6.2 milestone Mar 4, 2022
honzajavorek added further commits to juniorguru/junior.guru that referenced this issue (Mar 4 – Mar 24, 2022)
@Gallaecio (Member)

#5436 (to be included in Scrapy 2.6.2)

@Lightjohn

Hi,

When will 2.6.2 be released? My personal project using scrapy check has been broken for a month, and even if it's not important, it's a little sad that a fix has not been released at this point.

Thanks

@Gallaecio
Copy link
Member

I have given an ETA a few times, all of them missed, so I think “soon” is the best I can do without lying again.

#5525 might delay the release a bit further if it is confirmed to be a breaking change introduced in 2.6.

@Lightjohn

I switched to the Git branch in requirements.txt in the meantime. It fixes the issue (as expected).

I know that feeling, so "soon" is good enough for me at the moment.
Good luck

@MichaelAquilina

Hi! We are also wondering when this fix will be released with 2.6.2.

We upgraded to Scrapy 2.6.1 to fix several vulnerabilities, but this broke scrapy check. We might have to disable it in favour of having a secure version of Scrapy:

https://security.snyk.io/vuln/SNYK-PYTHON-SCRAPY-2414471
https://security.snyk.io/vuln/SNYK-PYTHON-SCRAPY-1729576

@Gallaecio (Member)

You are probably aware, but just in case: there is a middle ground, installing the 2.6 branch from Git and pinning the latest commit until 2.6.2 is released. For example:

pip install git+https://github.com/scrapy/scrapy.git@e3e69d1209407c72a6478936bdbfd32cc22e9432
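If the dependency lives in requirements.txt instead, the same pin can be expressed as a direct reference (a hypothetical line using the commit above):

Scrapy @ git+https://github.com/scrapy/scrapy.git@e3e69d1209407c72a6478936bdbfd32cc22e9432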


MichaelAquilina commented Jul 21, 2022

Hi @Gallaecio, thanks for the suggestion. While it's true that this is a compromise to consider, the problem with that approach is that we will be unable to track possible future vulnerabilities via Snyk if we pin to a Git commit hash.

Are you aware of any ETA for the 2.6.2 release? Or is there currently no plan to release anything?

@Gallaecio (Member)

There is no ETA, but we do plan on releasing it. There are a few things we want to include in 2.6.2 before release, and the maintainers who need to review them are short on time, which is why it has been delayed.
