New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PI: Don't load entire file into memory when passed file name #2520
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2520 +/- ##
==========================================
- Coverage 94.54% 94.49% -0.06%
==========================================
Files 49 49
Lines 8173 8187 +14
Branches 1658 1659 +1
==========================================
+ Hits 7727 7736 +9
- Misses 276 280 +4
- Partials 170 171 +1 β View full report in Codecov by Sentry. |
Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement? And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time. |
Sorry, I got the before & after mixed up. fixed
Yeah, I can do. I'll have a bit more difficulty fixing the windows tests since I don't have a windows box to test on easily but I'll figure something out. |
AFAIK the concurrent access issues will only occur on Windows, but I cannot really state how much this would indeed affect real use-cases. I am not really sure about the fixed tests either - explicitly calling |
pypdf/_reader.py
Outdated
@@ -314,6 +314,7 @@ def __init__( | |||
|
|||
if isinstance(stream, (str, Path)): | |||
stream = open(stream, "rb") # noqa: SIM115 | |||
# Wish I could just close stream in __del__ but that fails a test very strangely |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity: Do you have some details about the failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure how much was relevant to drop in the commit but:
when adding a self.stream.close()
in a __del__
function, that does work most of the time.
The one test failure I was seeing was in tests/test_reader.py
, the failing test was test_get_page_of_encrypted_file
but interestingly this would pass on it's own. I narrowed down the source of the issue to the previous test test_issue297
's exception block where the PdfReader()
initializer was failing (that's what the test is testing for) and the __del__
block wasn't being called due to the exception happening in the __init__
.
It's very possible at some point the objects would be GCd but test failures were happening due to dangling file pointers at the following test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to add this to the commit
1a4b1af
to
a0415db
Compare
It's even worst than that, unfortunately! I'm not sure what the reference chain is from «Writer» -> «» If it's any consolation, the test failures are kind of an edge case where:
Sorry for jumping the gun on calling the tests solved! Still iterating on them. |
I could potentially add a Making it a context manager might work too and would mirror |
5c25bc8
to
0786520
Compare
I don't want this merged as it currently is, calling garbage collection manually in tests feels yucky. |
when you call |
This breaks if PdfReader contains any un-pickleable attributes (such as file pointers)
Was only ever being used unintentionally in the tests and doesn't really make sense. Use .clone() instead
This halves allocated memory when doing a simple PdfWriter(clone_from=Β«strΒ») I can't just close the self.stream in `__del__` because for some strange reason the unit tests mark it as unflagged even after the test block ends. Something about `__del__` finalizers being run on a second pass while `weakref.finalize()` is run on the first pass.
See py-pdf#2520, basically this was the last failing (only on windows) test because if the pdfreaders are implicitly opening file streams that don't get closed until they get garbage collected the .unlinks() create file lock errors.
pypdf/_reader.py
Outdated
if self._we_opened: | ||
self.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add test to maintain test coverage please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, I would prefer something like _opened_automatically
instead of _we_opened
to sound "more generic".
I should also add using |
def __deepcopy__(self, memo: Any) -> "IndirectObject": | ||
return IndirectObject(self.idnum, self.generation, self.pdf) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so found about removing deepcopy : some people may use it this could be considered as a regression. If we really want to remove it we shall use the depredication process
have you also been able to advance in your proposal ? |
Hi, sorry, I've been taking a break from things due to mental health but plan to be back on them sometime later next month. Moving this back to draft for now. |
This functionality originally added back in ced2890
Reduces memory usage by size of loaded file.
Benchmark script
Before stats
After stats