
2.6.0 breaks calling multiple Spider in CrawlerProcess() #5435

Closed
hideishi-m opened this issue Mar 2, 2022 · 11 comments · Fixed by #5436
hideishi-m commented Mar 2, 2022

Description

Since 2.6.0, running multiple spiders from CrawlerProcess(), as shown in the common practices documentation, is broken:

https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

Steps to Reproduce

  1. Use Scrapy >= 2.6.0.
  2. Run the following code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request


class MySpider(scrapy.Spider):
    name = 'MySpider'

    def __init__(self, url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url

    def start_requests(self):
        yield Request(url=self.url, callback=self.parse)

    def parse(self, response):
        print(response.url)


process = CrawlerProcess({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1
})
process.crawl(MySpider, url='https://www.google.com')
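# On Scrapy 2.6.x this second crawl() raises
# twisted.internet.error.ReactorAlreadyInstalledError (see traceback below)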
process.crawl(MySpider, url='https://www.google.co.jp')
process.start()

Expected behavior:

The following is the result from Scrapy 2.5.1:

2022-03-02 18:49:45 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-02 18:49:45 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: afe09d724aae9642
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: bd1670acfb7fb550
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com> (referer: None)
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.jp> (referer: None)
https://www.google.com
https://www.google.co.jp
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7675,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.465932,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 66477),
 'httpcompression/response_bytes': 15980,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 19,
 'memusage/max': 47611904,
 'memusage/startup': 47611904,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 600545)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 216,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7602,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.431935,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 98011),
 'httpcompression/response_bytes': 14794,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 13,
 'memusage/max': 47669248,
 'memusage/startup': 47669248,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 666076)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)

Actual behavior:

The second process.crawl() call fails with twisted.internet.error.ReactorAlreadyInstalledError:

2022-03-02 18:49:12 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-02 18:49:12 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:12 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:12 [scrapy.extensions.telnet] INFO: Telnet Password: ce57e6aa863bb786
2022-03-02 18:49:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:13 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:13 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
Traceback (most recent call last):
  File "/home/kusanagi/work/scrapy/test.py", line 25, in <module>
    process.crawl(MySpider, url='https://www.google.co.jp')
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 256, in install
    installReactor(p)
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

Reproduces how often:

Always.

Versions

Scrapy : 2.6.1
lxml : 4.8.0.0
libxml2 : 2.9.12
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 22.1.0
Python : 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)]
pyOpenSSL : 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021)
cryptography : 36.0.1
Platform : Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17

Additional context

The intention of using the same MySpider from CrawlerProcess is to call Scrapy programmatically with different initial URLs, with some tweaks to the parser depending on the initial URL.

I think this is a fair use case, and it was working fine before 2.6.0.
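A possible interim workaround, sketched from the CrawlerRunner example on the same practices page (assuming CrawlerRunner, unlike CrawlerProcess in 2.6.x, does not try to install a reactor per crawler), is to manage the reactor yourself and reuse the MySpider class above:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1
})
runner.crawl(MySpider, url='https://www.google.com')
runner.crawl(MySpider, url='https://www.google.co.jp')
d = runner.join()                    # Deferred that fires once both crawls finish
d.addBoth(lambda _: reactor.stop())  # then stop the reactor
reactor.run()                        # blocks until reactor.stop() is called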

Gallaecio added the bug label Mar 2, 2022
@Gallaecio (Member)

Is it only reproducible if the spider class is the same?

@hideishi-m (Author)

No, it happens even if a different spider class is used.
I copied the complete MySpider class as MySpider2 and used MySpider2 for the second crawl.

process.crawl(MySpider, url='https://www.google.com')
process.crawl(MySpider2, url='https://www.google.co.jp')

The following is the last traceback:

Traceback (most recent call last):
  File "/home/kusanagi/work/scrapy/test.py", line 39, in <module>
    process.crawl(MySpider2, url='https://www.google.co.jp')
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 256, in install
    installReactor(p)
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

Gallaecio self-assigned this Mar 2, 2022
@Gallaecio (Member) commented Mar 2, 2022

I have identified 60c8838 as the cause (things work with its parent commit, 46ef9cf). Working on a fix.
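For context, the traceback bottoms out in Twisted refusing to install a second reactor in the same process. That underlying behavior can be demonstrated standalone (a minimal sketch, independent of Scrapy; the second install() is what each extra Crawler(..., init_reactor=True) effectively triggers):

from twisted.internet import default
from twisted.internet.error import ReactorAlreadyInstalledError

default.install()      # first installation succeeds (like the first crawl())
try:
    default.install()  # a second installation is rejected by Twisted
except ReactorAlreadyInstalledError as e:
    print(e)           # "reactor already installed"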

honzajavorek added a commit to juniorguru/junior.guru that referenced this issue Mar 3, 2022
Gallaecio added this to the 2.6.2 milestone Mar 4, 2022
honzajavorek added further commits to juniorguru/junior.guru that referenced this issue (Mar 4 – Mar 24, 2022)
@Gallaecio (Member)

#5436 (to be included in Scrapy 2.6.2)

@Lightjohn

Hi,

When will 2.6.2 be released? My personal project using scrapy check has been broken for a month, and even if it's not important, it's a little sad that a fix has not been released at this point.

Thanks

@Gallaecio
Copy link
Member

I have given an ETA a few times, all of them missed, so I think “soon” is the best I can do without lying again.

#5525 might delay the release a bit further if it is confirmed to be a breaking change introduced in 2.6.

@Lightjohn

I switched to the Git branch in requirements.txt in the meantime. It fixes the issue (as expected).

I know that feeling, so "soon" is good enough for me at the moment.
Good luck

@MichaelAquilina

Hi! We are also wondering when this fix will be released with 2.6.2.

We upgraded to Scrapy 2.6.1 to fix several vulnerabilities, but this broke scrapy check. We might have to disable it in favour of having a secure version of Scrapy:

https://security.snyk.io/vuln/SNYK-PYTHON-SCRAPY-2414471
https://security.snyk.io/vuln/SNYK-PYTHON-SCRAPY-1729576

@Gallaecio (Member)

You are probably aware, but just in case: there is a middle ground, installing the 2.6 branch from Git and pinning the latest commit until 2.6.2 is released. For example:

pip install git+https://github.com/scrapy/scrapy.git@e3e69d1209407c72a6478936bdbfd32cc22e9432
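If the dependency lives in requirements.txt instead, the same pin can be expressed as a direct reference (a hypothetical line using the commit above):

Scrapy @ git+https://github.com/scrapy/scrapy.git@e3e69d1209407c72a6478936bdbfd32cc22e9432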


MichaelAquilina commented Jul 21, 2022

Hi @Gallaecio, thanks for the suggestion. While it's true that this is a compromise to consider, the problem with that approach is that we will be unable to track possible future vulnerabilities via Snyk if we pin to a Git commit hash.

Are you aware of any ETA for the 2.6.2 release? Or is there currently no plan to release anything?

@Gallaecio (Member)

There is no ETA, but we do plan on releasing it. There are a few things we want to include in 2.6.2 before release, and the maintainers who need to review them are short on time, which is why it has been delayed.
