Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PI: Don't load entire file into memory when passed file name #2520

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

mjsir911
Copy link

@mjsir911 mjsir911 commented Mar 15, 2024

This functionality originally added back in ced2890

Reduces memory usage by size of loaded file.

Benchmark script
from pypdf import *

filename = '/home/msirabella/tmp/100MB-TESTFILE.ORG.pdf'

writer = PdfWriter(clone_from=filename)

writer.write("out.pdf")
Before stats
πŸ“ Total allocations:
	109695

πŸ“¦ Total memory allocated:
	409.726MB

πŸ“Š Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 6.000B   : 40707 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 40.000B  :   229 β–‡
	< 256.000B :    33 β–‡
	< 1.590KB  : 67394 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 10.104KB :  1060 β–‡
	< 64.190KB :   141 β–‡
	< 407.789KB:    47 β–‡
	< 2.530MB  :    82 β–‡
	< 16.072MB :     0 
	<=102.099MB:     2 β–‡
	--------------------------------------------
	max: 102.099MB

πŸ“‚ Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1223
	 REALLOC: 865
	 MMAP: 20

πŸ₯‡ Top 5 largest allocating locations (by size):
	- __init__:./pypdf/_reader.py:315 -> 204.210MB
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 48.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 26.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.360MB

πŸ₯‡ Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81058
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23017
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2101
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 988
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365
After stats
πŸ“ Total allocations:
	109687

πŸ“¦ Total memory allocated:
	205.521MB

πŸ“Š Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 4.000B   : 40707 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 18.000B  :     4 β–‡
	< 80.000B  :   227 β–‡
	< 348.000B :    39 β–‡
	< 1.468KB  : 67239 β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡
	< 6.341KB  :   737 β–‡
	< 27.388KB :   563 β–‡
	< 118.297KB:    68 β–‡
	< 510.959KB:    21 β–‡
	<=2.155MB  :    82 β–‡
	--------------------------------------------
	max: 2.155MB

πŸ“‚ Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1218
	 REALLOC: 862
	 MMAP: 20

πŸ₯‡ Top 5 largest allocating locations (by size):
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 46.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 24.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.356MB
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 4.844MB

πŸ₯‡ Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81056
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23015
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2095
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 989
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365

Copy link

codecov bot commented Mar 15, 2024

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 94.49%. Comparing base (c943f5f) to head (5209fcd).
Report is 5 commits behind head on main.

❗ Current head 5209fcd differs from pull request most recent head 1f81b68. Consider uploading reports for the commit 1f81b68 to get more accurate results

Files Patch % Lines
pypdf/_reader.py 78.57% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2520      +/-   ##
==========================================
- Coverage   94.54%   94.49%   -0.06%     
==========================================
  Files          49       49              
  Lines        8173     8187      +14     
  Branches     1658     1659       +1     
==========================================
+ Hits         7727     7736       +9     
- Misses        276      280       +4     
- Partials      170      171       +1     

β˜” View full report in Codecov by Sentry.
πŸ“’ Have feedback on the report? Share it here.

@mjsir911 mjsir911 changed the title Don't load entire file into memory when PdfReader passed file name PI: Don't load entire file into memory when passed file name Mar 15, 2024
@mjsir911 mjsir911 changed the title PI: Don't load entire file into memory when passed file name PI: Don't load entire file into memory when passed file name Mar 15, 2024
@stefan6419846
Copy link
Collaborator

Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

@stefan6419846 stefan6419846 added the PdfReader The PdfReader component is affected label Mar 15, 2024
@mjsir911
Copy link
Author

mjsir911 commented Mar 15, 2024

Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

Sorry, I got the before & after mixed up. fixed

And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

Yeah, I can do. I'll have a bit more difficulty fixing the windows tests since I don't have a windows box to test on easily but I'll figure something out.

@stefan6419846
Copy link
Collaborator

stefan6419846 commented Mar 15, 2024

AFAIK the concurrent access issues will only occur on Windows, but I cannot really state how much this would indeed affect real use-cases.

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

pypdf/_reader.py Outdated
@@ -314,6 +314,7 @@ def __init__(

if isinstance(stream, (str, Path)):
stream = open(stream, "rb") # noqa: SIM115
# Wish I could just close stream in __del__ but that fails a test very strangely
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity: Do you have some details about the failure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not sure how much was relevant to drop in the commit but:

when adding a self.stream.close() in a __del__ function, that does work most of the time.

The one test failure I was seeing was in tests/test_reader.py, the failing test was test_get_page_of_encrypted_file but interestingly this would pass on it's own. I narrowed down the source of the issue to the previous test test_issue297's exception block where the PdfReader() initializer was failing (that's what the test is testing for) and the __del__ block wasn't being called due to the exception happening in the __init__.

It's very possible at some point the objects would be GCd but test failures were happening due to dangling file pointers at the following test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to add this to the commit

@mjsir911 mjsir911 force-pushed the memory branch 2 times, most recently from 1a4b1af to a0415db Compare March 15, 2024 14:53
@mjsir911
Copy link
Author

mjsir911 commented Mar 15, 2024

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

It's even worst than that, unfortunately! I'm not sure what the reference chain is from «Writer» -> «»
Β«cloned from reader's streamΒ», but del writer seems to unclaim the dangling file pointer.

If it's any consolation, the test failures are kind of an edge case where:

  • user is running on windows
  • filereader (or clone_from transitively) is instantiated via string/Path
  • file is acted upon outside of the pdfreader's context (opened/unlinked/whatever again by name) while the pdfreader is still in scope / not garbage collected

Sorry for jumping the gun on calling the tests solved! Still iterating on them.

@mjsir911
Copy link
Author

mjsir911 commented Mar 15, 2024

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

It's even worst than that, unfortunately! I'm not sure what the reference chain is from -> <cloned from reader's stream>, but del writer seems to unclaim the dangling file pointer.

I could potentially add a .close() or something to PdfReader which would at least make this process explicit. I would still be unsure how to propogate that to PdfWriter's API though.

Making it a context manager might work too and would mirror PdfWriter

@mjsir911 mjsir911 force-pushed the memory branch 2 times, most recently from 5c25bc8 to 0786520 Compare March 15, 2024 15:15
pypdf/_reader.py Outdated Show resolved Hide resolved
tests/test_page.py Show resolved Hide resolved
@mjsir911
Copy link
Author

I don't want this merged as it currently is, calling garbage collection manually in tests feels yucky.

@mjsir911 mjsir911 marked this pull request as draft March 15, 2024 18:16
@pubpub-zz
Copy link
Collaborator

It's even worst than that, unfortunately! I'm not sure what the reference chain is from «Writer» -> «» «cloned from reader's stream», but del writer seems to unclaim the dangling file pointer.

when you call .clone_document_from_reader() or append(pages), you clone all objects from PdfReader(). during this process we need to keep connection between the writer's objects and the reader's object in order to keep parents links for example.
When you have finished your work, or when you need to append a new set of pages detached from the PdfReader, you have to call
writer.reset_translation().

This breaks if PdfReader contains any un-pickleable attributes (such as
file pointers)
Was only ever being used unintentionally in the tests and doesn't really
make sense. Use .clone() instead
This halves allocated memory when doing a simple
PdfWriter(clone_from=Β«strΒ»)

I can't just close the self.stream in `__del__` because for some strange
reason the unit tests mark it as unflagged even after the test block
ends. Something about `__del__` finalizers being run on a second pass
while `weakref.finalize()` is run on the first pass.
See py-pdf#2520, basically this was the last failing (only on windows) test
because if the pdfreaders are implicitly opening file streams that don't
get closed until they get garbage collected the .unlinks() create file
lock errors.
pypdf/_reader.py Outdated Show resolved Hide resolved
pypdf/_reader.py Outdated
Comment on lines 367 to 368
if self._we_opened:
self.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add test to maintain test coverage please

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I would prefer something like _opened_automatically instead of _we_opened to sound "more generic".

@mjsir911
Copy link
Author

I should also add using PdfReader as a contextmanager in some documentation somewhere

Comment on lines -286 to -288
def __deepcopy__(self, memo: Any) -> "IndirectObject":
return IndirectObject(self.idnum, self.generation, self.pdf)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so found about removing deepcopy : some people may use it this could be considered as a regression. If we really want to remove it we shall use the depredication process

@pubpub-zz
Copy link
Collaborator

I should also add using PdfReader as a contextmanager in some documentation somewhere

have you also been able to advance in your proposal ?

@mjsir911
Copy link
Author

have you also been able to advance in your proposal ?

Hi, sorry, I've been taking a break from things due to mental health but plan to be back on them sometime later next month. Moving this back to draft for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PdfReader The PdfReader component is affected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants