
Add an option to limit disk and RAM usage by scheduler queue #6085

Open
Prometheus3375 opened this issue Oct 4, 2023 · 6 comments


@Prometheus3375

Prometheus3375 commented Oct 4, 2023

Summary

An option to limit RAM and disk usage by the scheduler queue would make the engine take new requests from the spiders only when there is space available.
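
For illustration only, one hypothetical shape such an option could take (these setting names do not exist in Scrapy; the values are placeholders):

    custom_settings = {
        # Hypothetical setting: stop taking new requests from the spider while
        # the on-disk scheduler queue exceeds this many bytes
        "SCHEDULER_DISK_QUEUE_MAX_SIZE": 2 * 1024**3,  # 2 GB
        # Hypothetical setting: same idea for the in-memory scheduler queue
        "SCHEDULER_MEMORY_QUEUE_MAX_SIZE": 256 * 1024**2,  # 256 MB
    }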

Motivation

Recently we ran into the issue of high disk usage by the scheduler queue. We are going through a company registry and making a lot of requests; these are the only requests the spider makes. Sample code:

    # Inside a scrapy.Spider subclass; Request is scrapy.Request
    # crn - company registration number
    def start(self, /) -> Iterator:
        # Reset generators to start from 0 after restarts
        for generator in self.generators:
            generator.reset()

        it = ((gen, crn) for gen in self.generators for crn in gen)
        for generator, crn in it:
            self.requests_in_progress += 1
            yield Request(
                self.url_pattern_check_crn.format(crn),
                self.check_crn_status,
                dont_filter=True,
                errback=self.check_crn_status_failed,
                cb_kwargs=dict(crn=crn, generator=generator),
                )

Currently, the generators produce 142,685,210 unique CRNs. Requests can end with a 404 (company not found) or a 200 (company found).

After ~110k successful requests, the disk queue occupies 10 GB, while RAM usage does not exceed 200 MB.

Describe alternatives you've considered

For now, we worked around the issue by increasing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN to a large enough number, but this helps only if the site can handle that many connections and does not apply rate limiting.
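
For reference, the workaround amounts to something like the following (these are real Scrapy settings; the concrete numbers are illustrative, not the exact values we use):

    custom_settings = {
        # Keep many requests in flight so fewer of them pile up in the
        # scheduler queue; only works if the target site tolerates it
        "CONCURRENT_REQUESTS": 256,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 256,
    }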

Additional context

There is the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting, which is described as follows:

Soft limit (in bytes) for response data being processed.
While the sum of the sizes of all responses being processed is above this value, Scrapy does not process new requests.

"Scrapy does not process new requests" means Scrapy does not take new requests from the spider or does not put already scheduled requests to the downloader?

We are also using a FIFO queue for this spider, but I do not think this matters.
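
For completeness, FIFO scheduler queues are selected with the standard queue settings; a typical configuration looks like this (the queue classes shown are the common FIFO choices, assumed rather than copied from our project):

    custom_settings = {
        # Standard Scrapy settings that select FIFO queue implementations
        "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
        "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
    }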

@Gallaecio
Member

Somewhat related to #3237, i.e. future work in that direction may help here.

@wRAR
Member

wRAR commented Oct 4, 2023

"Scrapy does not process new requests" means Scrapy does not take new requests from the spider or does not put already scheduled requests to the downloader?

The code (self._needs_backout(), which calls self.scraper.slot.needs_backout()) is in

    def _next_request(self) -> None:

It is about taking the next request from the scheduler and putting it into the downloader (and also about processing start_requests()).
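
For context, here is a paraphrased sketch of that check in scrapy/core/engine.py (not a verbatim copy of the current source, and the exact conditions can vary between Scrapy versions):

    def _needs_backout(self) -> bool:
        # While this returns True, _next_request() stops pulling requests from
        # the scheduler and stops consuming start requests.
        return (
            not self.running
            or bool(self.slot.closing)
            or self.downloader.needs_backout()
            # True while the total size of responses being processed exceeds
            # SCRAPER_SLOT_MAX_ACTIVE_SIZE
            or self.scraper.slot.needs_backout()
        )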

