Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Importing reactor in spider's file can cause ReactorAlreadyInstalledError #5525

Closed
adg-mh opened this issue Jun 9, 2022 · 6 comments
Closed
Assignees
Milestone

Comments

@adg-mh
Copy link

adg-mh commented Jun 9, 2022

Description

When running a spider with scrapy crawl ... <spider>, due to a change in the way the reactor is installed (possibly introduced in 60c8838), importing reactor from twisted at the top of that spider's python file will now cause an error (in the case where TWISTED_REACTOR is not defined):

twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

I was able to work around the issue by explicitly setting TWISTED_REACTOR which causes scrapy's install_reactor() helper to be used (which explicitly suppresses that error). This doesn't allow me to use twisted's default reactor per platform (without recreating the logic to do so myself).

Steps to Reproduce

from twisted.internet import reactor at the top of a spider's python file
(importing the reactor module installs the default reactor)

Versions

Scrapy       : 2.6.1
lxml         : 4.9.0.0
libxml2      : 2.9.12
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 22.4.0
Python       : 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.3 3 May 2022)
cryptography : 37.0.2
Platform     : Windows-10-10.0.17763-SP0
@Gallaecio
Copy link
Member

I think importing the reactor from any code that is executed before the Scrapy code can handle reactor initialization has become a bad practice. I believe the best way forward is to avoid that import until Scrapy has setup the right reactor.

Even if we made it so that your code works, it would still fail if in the future you decide to switch reactors, or Scrapy changes the default reactor.

Now, you have not shared your code, but I imagine you can rewrite it so that the reactor is imported within a function or method that uses it, so that importing the spider file or class does not cause this issue.

That said, @wRAR, could you have a look into this? I wonder if this could be breaking pre-2.6 code, and if we could make it so that things do not break where the installed reactor is the one Scrapy would install.

If we can fix it, maybe we could also log a deprecation warning (e.g. “Importing twisted.internet.reactor before Scrapy initializes it is deprecated. You should move such imports within scopes (e.g. functions, methods) that are only called once Scrapy has initialized the reactor.”), and we should discuss if this is worth adding to 2.6.2 or 2.6.3.

@Gallaecio Gallaecio added this to the 2.6.2 milestone Jun 10, 2022
@wRAR
Copy link
Member

wRAR commented Jun 10, 2022

Without thinking too much about this:

  • ideally no code should import twisted.internet.reactor at the module import time (so at the top level, directly or indirectly), but this is not always possible and also I don't think this is documented at our side
  • technically, there are several ways to deal with the case where Scrapy wants to install the rector but it's already installed (at least if it's the same type that it wanted); we shouldn't change them before understanding the implications and reasons of the current behavior
  • it's possible that this indeed breaks pre-2.6 code but I need to try some test cases to check that

@adg-mh
Copy link
Author

adg-mh commented Jun 10, 2022

I understand if this isn't going to be supported.

I did consider importing the reactor where it's used, which is a reactor.callLater call inside a spider_opened signal handler.
Generally I try to avoid placing imports inside any code but this is probably a case where I should let that go.

@wRAR
Copy link
Member

wRAR commented Jun 10, 2022

Well, it's possible that it will be supported unless TWISTED_REACTOR is set to non-default, we just need to think about possible problems.

@Gallaecio
Copy link
Member

Gallaecio commented Jun 14, 2022

I have verified that scrapy runspider works on 2.5.1 but not on 2.6.1 for:

from scrapy import Spider
from twisted.internet import reactor


class ToScrapeComSpider(Spider):
    name = 'toscrape_com'
    start_urls = ['https://toscrape.com']

    def parse(self, response):
        yield {'html': response.text}

@Gallaecio Gallaecio self-assigned this Jun 14, 2022
@Gallaecio
Copy link
Member

  • I don't think this is documented at our side

Turns out we do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants