
scrapy.pqueues.ScrapyPriorityQueue #6266

Open
roykyle8 opened this issue Mar 4, 2024 · 8 comments

Comments


roykyle8 commented Mar 4, 2024

Description

The ScrapyPriorityQueue throws a builtins.KeyError.

2024-03-04 09:33:09,684 2112359:CRITICAL [twisted] Unhandled Error
Traceback (most recent call last):
  File "/opt/custom_crawler.py", line 107, in main
    crawler_process.start()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/pqueues.py", line 98, in pop
    q = self.queues[self.curprio]
builtins.KeyError: 0

Steps to Reproduce

The error occurs randomly and I am not sure how to reproduce it.

Versions

2.6.2

Resolution:

Since Python raises a KeyError when the key does not exist in the queue (which is a dict here), shouldn't we check whether the key exists first?

The check if self.curprio is None can be replaced with if self.curprio is None or self.curprio not in self.queues.

class ScrapyPriorityQueue:
    # other methods

    def pop(self):
        if self.curprio is None or self.curprio not in self.queues:
            return
        q = self.queues[self.curprio]
        m = q.pop()
        if not q:
            del self.queues[self.curprio]
            q.close()
            prios = [p for p, q in self.queues.items() if q]
            self.curprio = min(prios) if prios else None
        return m

wRAR (Member) commented Mar 4, 2024

Do you have a disk queue corrupted by e.g. a prior spider crash?

Otherwise you need to make a minimal reproducible example of this problem.

wRAR (Member) commented Mar 4, 2024

shouldn't we check whether the key exists first?

No, it should exist there because of how the class code is written.

roykyle8 (Author) commented Mar 5, 2024

@wRAR I clear the request cache directory before starting a new run, so starting a fresh crawl with a corrupted queue should not be possible.

Regarding the reproducible example, the error does not occur on every run or for every crawl, so reproducing it is very difficult.

Gallaecio (Member) commented

I clear the request cache directory before starting a new run.

HTTPCACHE_DIR? JOBDIR? Something else?

roykyle8 (Author) commented Mar 5, 2024

@Gallaecio I clear both HTTPCACHE_DIR and JOBDIR before starting a new run. I have a custom script that starts multiple spiders together (custom_crawler.py):

# custom_crawler.py:
from scrapy.crawler import Crawler, CrawlerProcess

def main(cmd_args):
    crawler_process = CrawlerProcess(None, install_root_handler=False)
    for sp in cmd_args.spiders:
        spider_instance = __load_spider_instance__(sp)
        http_cache_dir = f'/opt/http_cache/{spider_instance.name}'
        job_dir = f'/opt/job_dir/{spider_instance.name}'
        __delete_and_recreate_dir__(http_cache_dir)
        __delete_and_recreate_dir__(job_dir)

        custom_settings = {'HTTPCACHE_DIR': http_cache_dir, 'JOBDIR': job_dir}
        # SCHEDULER_DISK_QUEUE: scrapy.squeues.PickleFifoDiskQueue is set in the spider class
        # SCHEDULER_MEMORY_QUEUE: scrapy.squeues.FifoMemoryQueue is set in the spider class
        getattr(spider_instance, 'custom_settings').update(custom_settings)
        crawler_process.crawl(Crawler(spider_instance))

    crawler_process.start()

The spiders run fine, but I get the KeyError for some of them, and it does not occur on every run.

roykyle8 (Author) commented

@wRAR / @Gallaecio can you please guide me on how I can find and fix this issue?

wRAR (Member) commented Mar 10, 2024

No direct ideas, but I would start by logging additions to and removals from self.queues, to see whether the code's expectation (that the key always exists there when it is looked up) is actually being broken.
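
For reference, one way to add that logging (a debugging sketch, not a fix) is to subclass ScrapyPriorityQueue and point the SCHEDULER_PRIORITY_QUEUE setting at the subclass. The module path myproject/debug_pqueue.py and the class name LoggingPriorityQueue below are made up for the example; the sketch assumes only the push/pop API and the queues/curprio attributes visible in the pop() code quoted above.

# myproject/debug_pqueue.py -- hypothetical debugging helper, not part of Scrapy
import logging

from scrapy.pqueues import ScrapyPriorityQueue

logger = logging.getLogger(__name__)


class LoggingPriorityQueue(ScrapyPriorityQueue):
    # Same behaviour as ScrapyPriorityQueue, but logs curprio and the
    # existing priority keys around every push/pop, so a missing key can
    # be spotted right before the KeyError would occur.

    def push(self, request):
        super().push(request)
        logger.info("pqueue push: curprio=%r prios=%r",
                    self.curprio, sorted(self.queues))

    def pop(self):
        logger.info("pqueue pop: curprio=%r prios=%r",
                    self.curprio, sorted(self.queues))
        return super().pop()

Then, in settings.py or the spider's custom_settings:

SCHEDULER_PRIORITY_QUEUE = 'myproject.debug_pqueue.LoggingPriorityQueue'

If the KeyError happens again, the last "pqueue pop" line before the traceback should show whether curprio was pointing at a priority that is genuinely missing from self.queues.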

roykyle8 (Author) commented

@wRAR @Gallaecio I am still looking for a solution and trying to debug the KeyError; however, I have started seeing another issue.

2024-03-27 03:30:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:31:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
[the same logstats line repeats every minute, with identical counts, through 04:18]
2024-03-27 04:19:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)

The spider never completes; I have to kill the process and rerun it.
