
Fix #5970 #6082

Open
wants to merge 4 commits into
base: master

buungoo

@buungoo buungoo commented Oct 2, 2023

Logs a warning as specified in issue #5970

Fixes #5970

@wRAR
Member

wRAR commented Oct 2, 2023

  1. This needs a test.
  2. I think crawl is not the only command where it makes sense.
  3. There should be no need to check EXTENSIONS_BASE, you should use settings.getwithbase("EXTENSIONS") to get the final setting value.

@vishesh10
Contributor

> 1. This needs a test.
> 2. I think crawl is not the only command where it makes sense.
> 3. There should be no need to check EXTENSIONS_BASE, you should use settings.getwithbase("EXTENSIONS") to get the final setting value.

Hi @wRAR,
The problem with settings.getwithbase("EXTENSIONS") is that it does not reflect values overridden through custom_settings; it only sees values from the settings.py file. Is there another way to get the updated settings?

I have disabled the FeedExporter using custom_settings in the spider file:

custom_settings = {
    "EXTENSIONS": {
        "scrapy.extensions.feedexport.FeedExporter": None,
    },
}

The logs confirm that FeedExporter is not among the enabled extensions:

[scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']

However, when I call build_component_list(settings.getwithbase("EXTENSIONS")) in crawl.py, the result still includes the FeedExporter extension. Is there another way to read the updated settings?

Additionally, overridden_settings does not give me the updated value either.

Thanks

@wRAR
Member

wRAR commented May 11, 2024

Do you mean your existing code works in that case?

@vishesh10
Contributor

vishesh10 commented May 12, 2024

> Do you mean your existing code works in that case?

@wRAR,
No, it does not work with custom_settings either.
I tried using getwithbase to add the warning, but it only works when the feed exporter is disabled via settings.py; it does not work when custom_settings is used instead. I describe the problem in more detail below, with code snippets.

There are two scenarios. In Scenario 1, getwithbase('EXTENSIONS') correctly determines the status of the feed exporter; in Scenario 2 it does not. Is there any way to achieve the same result in Scenario 2?

Scenario 1 - FeedExporter disabled via settings.py

Create a simple spider

import scrapy
from bar.items import BooklistItem 

class BookSpider(scrapy.Spider):
    name = "book"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]   

    def parse(self, response):
        for article in response.css("article.product_pod"):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item

settings.py

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
     "scrapy.extensions.feedexport.FeedExporter": None,
}

crawl.py
Get the extensions from the crawler process settings:

extensions = build_component_list(self.crawler_process.settings.getwithbase('EXTENSIONS'))

In this case, the extensions list does not contain scrapy.extensions.feedexport.FeedExporter, because the value from settings.py is taken into account. A condition can therefore check whether the feed exporter is enabled.
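
To illustrate why Scenario 1 works, here is a minimal stdlib-only sketch of the merge. It is a simplified stand-in, not Scrapy's actual code: `enabled_components` is a hypothetical helper that only approximates what getwithbase plus build_component_list do (apply overrides on top of the base dict, drop entries set to None, sort by order), and the base dict lists just a few of the default extensions.

```python
# Simplified stand-in for Scrapy's settings merge; NOT Scrapy's actual code.
FEED_EXPORTER = "scrapy.extensions.feedexport.FeedExporter"

# Abbreviated, illustrative subset of the default extensions.
EXTENSIONS_BASE = {
    "scrapy.extensions.corestats.CoreStats": 0,
    FEED_EXPORTER: 0,
    "scrapy.extensions.logstats.LogStats": 0,
}


def enabled_components(base, overrides):
    """Hypothetical helper: merge overrides onto base, drop None, sort by order."""
    merged = {**base, **overrides}
    active = {path: order for path, order in merged.items() if order is not None}
    return sorted(active, key=active.get)


# Scenario 1: settings.py sets the FeedExporter entry to None.
project_extensions = {FEED_EXPORTER: None}
extensions = enabled_components(EXTENSIONS_BASE, project_extensions)
feed_exporter_enabled = FEED_EXPORTER in extensions
```

With the override in place, `feed_exporter_enabled` is False, which is the condition the warning can be based on.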

Scenario 2 - FeedExporter disabled via custom_settings

Add custom_settings to the spider itself and disable the FeedExporter there instead of in settings.py:

import scrapy
from bar.items import BooklistItem 

class BookSpider(scrapy.Spider):
    name = "book"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]   

    custom_settings = { 
        "EXTENSIONS": {
            "scrapy.extensions.feedexport.FeedExporter": None
        },  
    }   

    def parse(self, response):
        for article in response.css("article.product_pod"):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item

settings.py

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
#     "scrapy.extensions.feedexport.FeedExporter": None,
}

In this case, the extensions list still contains scrapy.extensions.feedexport.FeedExporter, because custom_settings is not taken into account. The code therefore treats the feed exporter as enabled when it is actually disabled, so the condition cannot be applied reliably.

Thanks

@wRAR
Member

wRAR commented May 12, 2024

custom_settings are applied in Crawler.__init__, which happens later, when self.crawler_process.crawl() is called. I don't have a suggestion how to resolve this.
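
The timing issue can be pictured with a toy model. This is only a sketch under assumptions: `ToySpider`, `ToyCrawler` and `ToyProcess` are hypothetical classes using plain dicts, not Scrapy's real classes. They only mimic the ordering described above, where the spider's overrides land on the crawler's own copy of the settings when the crawler is created.

```python
import copy

FEED_EXPORTER = "scrapy.extensions.feedexport.FeedExporter"


class ToySpider:
    # Spider-level override, analogous to custom_settings in Scenario 2.
    custom_settings = {"EXTENSIONS": {FEED_EXPORTER: None}}


class ToyCrawler:
    def __init__(self, spider_cls, settings):
        # Work on a copy so the process-level settings stay untouched.
        self.settings = copy.deepcopy(settings)
        for name, value in spider_cls.custom_settings.items():
            self.settings[name] = value


class ToyProcess:
    def __init__(self, settings):
        self.settings = settings

    def crawl(self, spider_cls):
        # The crawler (and the spider's overrides) only come into play here.
        return ToyCrawler(spider_cls, self.settings)


process = ToyProcess({"EXTENSIONS": {}})

# A command inspecting process.settings before crawl() sees no override.
override_visible_to_command = FEED_EXPORTER in process.settings["EXTENSIONS"]

crawler = process.crawl(ToySpider)

# The override exists only on the crawler's own settings copy.
override_visible_to_crawler = FEED_EXPORTER in crawler.settings["EXTENSIONS"]
process_settings_unchanged = process.settings["EXTENSIONS"] == {}
```

In this model, a check made in the command before crawl() cannot see the spider's override, which matches the behaviour reported in Scenario 2.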

@vishesh10
Contributor

> custom_settings are applied in Crawler.__init__, which happens later, when self.crawler_process.crawl() is called. I don't have a suggestion how to resolve this.

Would it be fine to cover only Scenario 1 for now?

@wRAR
Member

wRAR commented May 12, 2024

Yeah, it's much better than nothing. But please add a comment that custom_settings is not checked here.
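
A possible shape for the Scenario 1 check, including the requested comment, is sketched below. It is not the code in this PR: `warn_if_feed_exporter_disabled`, its parameters, the logger name and the warning text are all hypothetical, and the function takes an already merged extensions mapping (path to order, None meaning disabled) so that it runs without Scrapy.

```python
import logging

logger = logging.getLogger("feed_warning_sketch")  # hypothetical logger name

FEED_EXPORTER = "scrapy.extensions.feedexport.FeedExporter"


def warn_if_feed_exporter_disabled(merged_extensions, output_requested):
    """Log a warning and return True if output is requested but FeedExporter is off.

    merged_extensions maps extension paths to their order; None (or a missing
    entry) is treated as disabled.

    NOTE: spider-level custom_settings are NOT checked here, because they are
    applied later, when the crawler is created.
    """
    disabled = merged_extensions.get(FEED_EXPORTER) is None
    if output_requested and disabled:
        logger.warning(
            "An output option (-o/-O) was given, but %s is disabled; "
            "the option may have no effect.",
            FEED_EXPORTER,
        )
        return True
    return False


# Example: -o was passed while settings.py disables the FeedExporter.
warned = warn_if_feed_exporter_disabled({FEED_EXPORTER: None}, output_requested=True)
```

Returning a boolean in addition to logging is a choice made here only to keep the sketch easy to test.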


Successfully merging this pull request may close these issues.

Emit a warning if options -o or -O are specified when FeedExporter is disabled
3 participants