
Add an option to limit disk and RAM usage by scheduler queue #6085

Open
Prometheus3375 opened this issue Oct 4, 2023 · 6 comments


@Prometheus3375

Prometheus3375 commented Oct 4, 2023

Summary

An option to limit RAM and disk usage by the scheduler queue would make the engine take new requests from the spiders only when there is space available.
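
For illustration only, one hypothetical shape such an option could take (these setting names do not exist in Scrapy; the values are placeholders):

    custom_settings = {
        # Hypothetical setting: stop taking new requests from the spider while
        # the on-disk scheduler queue exceeds this many bytes
        "SCHEDULER_DISK_QUEUE_MAX_SIZE": 2 * 1024**3,  # 2 GB
        # Hypothetical setting: same idea for the in-memory scheduler queue
        "SCHEDULER_MEMORY_QUEUE_MAX_SIZE": 256 * 1024**2,  # 256 MB
    }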

Motivation

Recently we ran into the issue of high disk usage by the scheduler queue. We are going through a company registry and making a lot of requests; these are the only requests the spider makes. Sample code:

    # Inside a scrapy.Spider subclass; Request is scrapy.Request
    # crn - company registration number
    def start(self, /) -> Iterator:
        # Reset generators to start from 0 after restarts
        for generator in self.generators:
            generator.reset()

        it = ((gen, crn) for gen in self.generators for crn in gen)
        for generator, crn in it:
            self.requests_in_progress += 1
            yield Request(
                self.url_pattern_check_crn.format(crn),
                self.check_crn_status,
                dont_filter=True,
                errback=self.check_crn_status_failed,
                cb_kwargs=dict(crn=crn, generator=generator),
                )

Currently, the generators produce 142,685,210 unique CRNs. Requests can end with a 404 (company not found) or a 200 (company found).

After ~110k successful requests, the disk queue occupies 10 GB, while RAM usage does not exceed 200 MB.

Describe alternatives you've considered

For now, we worked around the issue by increasing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN to a large enough number, but this helps only if the site can handle that many connections and does not apply rate limiting.
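
For reference, the workaround amounts to something like the following (these are real Scrapy settings; the concrete numbers are illustrative, not the exact values we use):

    custom_settings = {
        # Keep many requests in flight so fewer of them pile up in the
        # scheduler queue; only works if the target site tolerates it
        "CONCURRENT_REQUESTS": 256,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 256,
    }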

Additional context

There is the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting, which is described as follows:

Soft limit (in bytes) for response data being processed.
While the sum of the sizes of all responses being processed is above this value, Scrapy does not process new requests.

"Scrapy does not process new requests" means Scrapy does not take new requests from the spider or does not put already scheduled requests to the downloader?

We are also using a FIFO queue for this spider, but I do not think this matters.
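
For completeness, FIFO scheduler queues are selected with the standard queue settings; a typical configuration looks like this (the queue classes shown are the common FIFO choices, assumed rather than copied from our project):

    custom_settings = {
        # Standard Scrapy settings that select FIFO queue implementations
        "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
        "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
    }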

@Gallaecio
Member

Somewhat related to #3237, i.e. future work in that direction may help here.

@wRAR
Member

wRAR commented Oct 4, 2023

"Scrapy does not process new requests" means Scrapy does not take new requests from the spider or does not put already scheduled requests to the downloader?

The code (self._needs_backout(), which calls self.scraper.slot.needs_backout()) is in

    def _next_request(self) -> None:

It is about taking the next request from the scheduler and putting it into the downloader (and also about processing start_requests()).
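
For context, here is a paraphrased sketch of that check in scrapy/core/engine.py (not a verbatim copy of the current source, and the exact conditions can vary between Scrapy versions):

    def _needs_backout(self) -> bool:
        # While this returns True, _next_request() stops pulling requests from
        # the scheduler and stops consuming start requests.
        return (
            not self.running
            or bool(self.slot.closing)
            or self.downloader.needs_backout()
            # True while the total size of responses being processed exceeds
            # SCRAPER_SLOT_MAX_ACTIVE_SIZE
            or self.scraper.slot.needs_backout()
        )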

