
Add line buffering to file requests.seen of RFPDupeFilter #6019

Open
Prometheus3375 opened this issue Aug 18, 2023 · 1 comment

Comments


Prometheus3375 commented Aug 18, 2023

Motivation

Make RFPDupeFilter more reliable if the spider fails terribly.

Context

RFPDupeFilter, which is used by default in Scrapy, writes all request fingerprints to the file requests.seen, one fingerprint per line. The idea of saving them to a file is to preserve seen requests between restarts.
Currently, no buffering option is specified for this file.
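For context, the relevant file handling in RFPDupeFilter looks roughly like this (a simplified sketch, not the exact Scrapy source):

from pathlib import Path

class RFPDupeFilter(BaseDupeFilter):
    def __init__(self, path=None, debug=False):
        self.fingerprints = set()
        self.file = None
        if path:
            # no buffering argument, so the default (heuristic) policy applies
            self.file = Path(path, "requests.seen").open("a+", encoding="utf-8")
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + "\n")  # one fingerprint per line
        return False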

Take a look at the available buffering options for open():

  • 0 - switches buffering off (only allowed in binary mode)
  • 1 - selects line buffering (only usable in text mode)
  • Any integer > 1 - indicates the size of a fixed-size chunk buffer. Note that specifying a buffer size this way applies for binary buffered I/O, but TextIOWrapper (i.e., files opened with mode='r+') would have another buffering.
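For illustration, these options correspond to open() calls like the following (a minimal sketch; file names are arbitrary):

f0 = open("data.bin", "wb", buffering=0)       # unbuffered, binary mode only
f1 = open("data.txt", "w", buffering=1)        # line buffered, text mode only
f2 = open("chunks.bin", "wb", buffering=8192)  # fixed-size 8192-byte chunk buffer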

When the buffering argument is not provided, the following default policy is used:

  • Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
  • "Interactive" text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.

I quickly checked: for the file requests.seen, isatty() returns False, so heuristic chunk buffering is used.
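The check itself can be reproduced in a console (assuming a requests.seen file in the current directory; mode 'a+' creates it if missing):

>>> f = open('requests.seen', 'a+', encoding='utf-8')
>>> f.isatty()
False
>>> f.line_buffering
False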

I made a quick test in the Python console. First, I ran this code with default buffering and then closed the console.

>>> f = open('test.txt', 'w')
>>> f.write('123\n')
4
>>> f.write('456\n')
4
>>> f.write('789\n')
4

The file 'test.txt' was empty. Then I ran the same code, but with line buffering enabled:

>>> f = open('test.txt', 'w', buffering=1)
>>> f.write('123\n')
4
>>> f.write('456\n')
4
>>> f.write('789\n')
4

After the console was closed, all the text was present in the file.

My point is to add line buffering to requests.seen of RFPDupeFilter. Currently it behaves differently on different machines. If the spider fails terribly, only a part of the seen fingerprints will be present in the file, or worse, the last fingerprint will be incomplete.
Line buffering will unify the behavior, always writing whole fingerprints to the file.

Implementation

It is very easy: just add buffering=1 to the open() call inside the RFPDupeFilter implementation. Alternatively, it is possible to flush roughly every Nth fingerprint by sizing the binary chunk buffer accordingly (a fingerprint is a 40-character hex digest, so each line is 41 bytes including the newline):

self.file = Path(path, "requests.seen").open("a+", buffering=41 * N - 1, encoding="utf-8")
# write_through=True disables the TextIOWrapper buffer, so all lines go
# directly to the binary buffer, whose chunk size is set to 41 * N - 1
self.file.reconfigure(write_through=True)
self.file.seek(0)
self.fingerprints.update(x.rstrip() for x in self.file)

This is a bit tricky, because on Windows the line separator is \r\n, which is two characters; in that case 42 * N - 1 should be used, or newline='\n' should be passed to force \n as the line separator.
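For comparison, the simple line-buffered variant avoids this newline subtlety entirely; it is just the current open call with one extra argument:

self.file = Path(path, "requests.seen").open("a+", buffering=1, encoding="utf-8")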

If some flexibility is required, there can be a setting for buffering: current behavior, line buffering, or a custom chunk size. It would also be nice to have an option to disable the file entirely.
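As a rough sketch of what such a setting could look like (the name DUPEFILTER_FILE_BUFFERING is made up here, not an existing Scrapy setting):

# hypothetical values in settings.py:
#   DUPEFILTER_FILE_BUFFERING = 1       -> line buffering
#   DUPEFILTER_FILE_BUFFERING = 65536   -> custom chunk size
#   DUPEFILTER_FILE_BUFFERING = -1      -> keep the current default policy
buffering = settings.getint("DUPEFILTER_FILE_BUFFERING", -1)
# open() treats buffering=-1 the same as omitting the argument
self.file = Path(path, "requests.seen").open(
    "a+", buffering=buffering, encoding="utf-8"
)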

Gallaecio (Member) commented Aug 21, 2023

I think the change makes sense.

No strong opinion on whether it should be the default or not, or optional or hardcoded. I guess it mostly depends on how much of a performance impact we expect it to have.
