
Add line buffering to file requests.seen of RFPDupeFilter #6019

Open
Prometheus3375 opened this issue Aug 18, 2023 · 1 comment

Comments


Prometheus3375 commented Aug 18, 2023

Motivation

Make RFPDupeFilter more reliable if the spider fails terribly.

Context

RFPDupeFilter, which is used by default in Scrapy, writes all request fingerprints to the file requests.seen, one fingerprint per line. The idea of saving them to a file is to preserve seen requests between restarts.
Currently, no buffering option is specified for this file.
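For context, the relevant file handling in RFPDupeFilter looks roughly like this (a simplified sketch, not the exact Scrapy source):

from pathlib import Path

class RFPDupeFilter(BaseDupeFilter):
    def __init__(self, path=None, debug=False):
        self.fingerprints = set()
        self.file = None
        if path:
            # no buffering argument, so the default (heuristic) policy applies
            self.file = Path(path, "requests.seen").open("a+", encoding="utf-8")
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + "\n")  # one fingerprint per line
        return False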

Take a look at the available buffering options for open():

  • 0 - switches buffering off (only allowed in binary mode)
  • 1 - selects line buffering (only usable in text mode)
  • Any integer > 1 - indicates the size of a fixed-size chunk buffer. Note that specifying a buffer size this way applies for binary buffered I/O, but TextIOWrapper (i.e., files opened with mode='r+') would have another buffering.
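For illustration, these options correspond to open() calls like the following (a minimal sketch; file names are arbitrary):

f0 = open("data.bin", "wb", buffering=0)       # unbuffered, binary mode only
f1 = open("data.txt", "w", buffering=1)        # line buffered, text mode only
f2 = open("chunks.bin", "wb", buffering=8192)  # fixed-size 8192-byte chunk buffer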

When the buffering argument is not provided, the following default policy is used:

  • Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
  • "Interactive" text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.

I quickly checked: for the file requests.seen, isatty() returns False, so heuristic chunk buffering is used.
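The check itself can be reproduced in a console (assuming a requests.seen file in the current directory; mode 'a+' creates it if missing):

>>> f = open('requests.seen', 'a+', encoding='utf-8')
>>> f.isatty()
False
>>> f.line_buffering
False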

I made a quick test in the Python console. First, I ran this code with default buffering and then closed the console.

>>> f = open('test.txt', 'w')
>>> f.write('123\n')
4
>>> f.write('456\n')
4
>>> f.write('789\n')
4

The file 'test.txt' was empty. Then I ran the same code, but with line buffering enabled:

>>> f = open('test.txt', 'w', buffering=1)
>>> f.write('123\n')
4
>>> f.write('456\n')
4
>>> f.write('789\n')
4

After the console was closed, all the text was present in the file.

My point is to add line buffering to requests.seen of RFPDupeFilter. Currently it behaves differently on different machines. If the spider fails terribly, only a part of the seen fingerprints will be present in the file, or worse, the last fingerprint will be incomplete.
Line buffering will unify the behavior, always writing whole fingerprints to the file.

Implementation

It is very easy: just add buffering=1 to the open() call inside the RFPDupeFilter implementation. Alternatively, it is possible to flush roughly every Nth fingerprint by sizing the binary chunk buffer accordingly (a fingerprint is a 40-character hex digest, so each line is 41 bytes including the newline):

self.file = Path(path, "requests.seen").open("a+", buffering=41 * N - 1, encoding="utf-8")
# write_through=True disables the TextIOWrapper buffer, so all lines go
# directly to the binary buffer, whose chunk size is set to 41 * N - 1
self.file.reconfigure(write_through=True)
self.file.seek(0)
self.fingerprints.update(x.rstrip() for x in self.file)

This is a bit tricky, because on Windows the line separator is \r\n, which is two characters; in that case 42 * N - 1 should be used, or newline='\n' should be passed to force \n as the line separator.
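For comparison, the simple line-buffered variant avoids this newline subtlety entirely; it is just the current open call with one extra argument:

self.file = Path(path, "requests.seen").open("a+", buffering=1, encoding="utf-8")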

If some flexibility is required, there can be a setting for buffering: current behavior, line buffering, or a custom chunk size. It would also be nice to have an option to disable the file entirely.
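As a rough sketch of what such a setting could look like (the name DUPEFILTER_FILE_BUFFERING is made up here, not an existing Scrapy setting):

# hypothetical values in settings.py:
#   DUPEFILTER_FILE_BUFFERING = 1       -> line buffering
#   DUPEFILTER_FILE_BUFFERING = 65536   -> custom chunk size
#   DUPEFILTER_FILE_BUFFERING = -1      -> keep the current default policy
buffering = settings.getint("DUPEFILTER_FILE_BUFFERING", -1)
# open() treats buffering=-1 the same as omitting the argument
self.file = Path(path, "requests.seen").open(
    "a+", buffering=buffering, encoding="utf-8"
)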

Gallaecio (Member) commented Aug 21, 2023

I think the change makes sense.

No strong opinion on whether it should be the default or not, or optional or hardcoded. I guess it mostly depends on how much of a performance impact we expect it to have.
