You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Make RFPDupeFilter more reliable if spider fails terribly.
Context
RFPDupeFilter which is used by default in Scrapy, writes all fingerprints to file requests.seen, each fingerprint on one line. The idea of saving to a file is to preserve seen requests between restarts.
Currently, buffering option is not specified for this file.
Take a look on available buffering options:
0 - switches buffering off (only allowed in binary mode)
1 - selects line buffering (only usable in text mode)
Any integer > 1 - indicates the size of a fixed-size chunk buffer. Note that specifying a buffer size this way applies for binary buffered I/O, but TextIOWrapper (i.e., files opened with mode='r+') would have another buffering.
When buffering argument is not provided, next policy is used:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
"Interactive" text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
I quickly checked, file requests.seen has isatty() set to False, so it uses heuristic buffering.
I made a quick test in python console with line buffering. I run this code and then closed console.
My point is to add line buffering to requests.seen of RFPDupeFilter. Currently it behaves differently on different machines. If spider fails terribly, only a part of seen fingerprint will be present in the file or worse, the last fingerprint will be incomplete.
Line buffering will unify behavior, always writing whole fingerprint to the file.
Implementation
It is very easy, just add buffering=1 to method open inside RFPDupeFilter implementation. Alternatively, it is possible to flush every Nth fingerprint:
self.file=Path(path, "requests.seen").open("a+", buffering=41*N-1, encoding="utf-8")
# write_through=True disables TextIOWrapper buffer making all lines# to go directly to binary buffer for which chunk size is set at 41N - 1f.reconfigure(write_through=True)
self.file.seek(0)
self.fingerprints.update(x.rstrip() forxinself.file)
This is a bit tricky, because on Windows line feed is \r\n which is two chars and in such case 42 * N - 1 should be used or newline='\n' is specified to force \n as line feed.
If some flexibility is required, there can be a setting for buffering: current behavior, line buffering or custom chunk size. And it will be nice to have an option to disable file at all.
The text was updated successfully, but these errors were encountered:
No strong opinion on whether it should be default or not, or optional or hardcoded. I guess it mostly depends on how much of a performance impact we expect it to have.
Motivation
Make
RFPDupeFilter
more reliable if spider fails terribly.Context
RFPDupeFilter
which is used by default in Scrapy, writes all fingerprints to filerequests.seen
, each fingerprint on one line. The idea of saving to a file is to preserve seen requests between restarts.Currently, buffering option is not specified for this file.
Take a look on available buffering options:
When buffering argument is not provided, next policy is used:
I quickly checked, file
requests.seen
hasisatty()
set toFalse
, so it uses heuristic buffering.I made a quick test in python console with line buffering. I run this code and then closed console.
File 'test.txt' was empty. Then I run the same code, but added line buffering:
After console was closed, all text was present.
My point is to add line buffering to
requests.seen
ofRFPDupeFilter
. Currently it behaves differently on different machines. If spider fails terribly, only a part of seen fingerprint will be present in the file or worse, the last fingerprint will be incomplete.Line buffering will unify behavior, always writing whole fingerprint to the file.
Implementation
It is very easy, just add
buffering=1
to methodopen
insideRFPDupeFilter
implementation. Alternatively, it is possible to flush every Nth fingerprint:This is a bit tricky, because on Windows line feed is
\r\n
which is two chars and in such case42 * N - 1
should be used ornewline='\n'
is specified to force\n
as line feed.If some flexibility is required, there can be a setting for buffering: current behavior, line buffering or custom chunk size. And it will be nice to have an option to disable file at all.
The text was updated successfully, but these errors were encountered: