PI: Don't load entire file into memory when passed file name #2520

mjsir911 · 2024-03-15T02:32:58Z

This functionality was originally added back in ced2890.

Reduces memory usage by the size of the loaded file.

Benchmark script

from pypdf import PdfWriter

filename = '/home/msirabella/tmp/100MB-TESTFILE.ORG.pdf'

writer = PdfWriter(clone_from=filename)

writer.write("out.pdf")

Before stats

📏 Total allocations:
	109695

📦 Total memory allocated:
	409.726MB

📊 Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 6.000B   : 40707 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 40.000B  :   229 ▇
	< 256.000B :    33 ▇
	< 1.590KB  : 67394 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 10.104KB :  1060 ▇
	< 64.190KB :   141 ▇
	< 407.789KB:    47 ▇
	< 2.530MB  :    82 ▇
	< 16.072MB :     0 
	<=102.099MB:     2 ▇
	--------------------------------------------
	max: 102.099MB

📂 Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1223
	 REALLOC: 865
	 MMAP: 20

🥇 Top 5 largest allocating locations (by size):
	- __init__:./pypdf/_reader.py:315 -> 204.210MB
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 48.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 26.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.360MB

🥇 Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81058
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23017
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2101
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 988
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365

After stats

📏 Total allocations:
	109687

📦 Total memory allocated:
	205.521MB

📊 Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 4.000B   : 40707 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 18.000B  :     4 ▇
	< 80.000B  :   227 ▇
	< 348.000B :    39 ▇
	< 1.468KB  : 67239 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 6.341KB  :   737 ▇
	< 27.388KB :   563 ▇
	< 118.297KB:    68 ▇
	< 510.959KB:    21 ▇
	<=2.155MB  :    82 ▇
	--------------------------------------------
	max: 2.155MB

📂 Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1218
	 REALLOC: 862
	 MMAP: 20

🥇 Top 5 largest allocating locations (by size):
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 46.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 24.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.356MB
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 4.844MB

🥇 Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81056
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23015
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2095
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 989
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365
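To see in isolation why keeping the file handle open can save roughly one file's worth of memory, here is a minimal standard-library sketch. It does not use pypdf: `read_eagerly` and `read_lazily` are hypothetical stand-ins for the two ways of handling a file name, and `tracemalloc` stands in for the profiler that produced the reports above.

```python
import io
import os
import tempfile
import tracemalloc


def peak_bytes(func, *args):
    """Return the peak number of bytes traced while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def read_eagerly(path):
    # Assumed "before" behaviour: slurp the whole file into an in-memory buffer.
    with open(path, "rb") as fh:
        stream = io.BytesIO(fh.read())
    stream.seek(0, io.SEEK_END)
    return stream.tell()


def read_lazily(path, chunk_size=64 * 1024):
    # Assumed "after" behaviour: keep the file open and read small pieces on demand.
    total = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            total += len(chunk)
    return total


def make_test_file(size):
    """Create a temporary file of `size` zero bytes and return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"\0" * size)
    return path


if __name__ == "__main__":
    size = 8 * 1024 * 1024
    path = make_test_file(size)
    try:
        print("eager peak:", peak_bytes(read_eagerly, path))
        print("lazy peak: ", peak_bytes(read_lazily, path))
    finally:
        os.unlink(path)
```

The eager variant has to hold at least the full file contents at once, while the lazy variant only ever holds a chunk plus the I/O buffer.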

codecov · 2024-03-15T02:38:58Z

Codecov Report

Attention: Patch coverage is 78.57143%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 94.49%. Comparing base (c943f5f) to head (5209fcd).
Report is 5 commits behind head on main.

❗ Current head 5209fcd differs from pull request most recent head 1f81b68. Consider uploading reports for the commit 1f81b68 to get more accurate results

Files	Patch %	Lines
pypdf/_reader.py	78.57%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2520      +/-   ##
==========================================
- Coverage   94.54%   94.49%   -0.06%     
==========================================
  Files          49       49              
  Lines        8173     8187      +14     
  Branches     1658     1659       +1     
==========================================
+ Hits         7727     7736       +9     
- Misses        276      280       +4     
- Partials      170      171       +1

stefan6419846 · 2024-03-15T08:09:52Z

Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

mjsir911 · 2024-03-15T08:20:31Z

> Thanks for the PR. Are the stats correct? You need twice the memory afterwards, thus it would indicate that this is indeed no performance improvement?

Sorry, I got the before & after mixed up. fixed

> And could you please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows as each file can be open only once at each point in time.

Yeah, will do. It will be harder for me to fix the Windows tests since I don't have a Windows box to test on easily, but I'll figure something out.

stefan6419846 · 2024-03-15T14:39:20Z

AFAIK the concurrent access issues will only occur on Windows, but I cannot really state how much this would indeed affect real use-cases.

I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

stefan6419846 · 2024-03-15T14:39:44Z

pypdf/_reader.py

@@ -314,6 +314,7 @@ def __init__(

        if isinstance(stream, (str, Path)):
            stream = open(stream, "rb")  #  noqa: SIM115
+            # Wish I could just close stream in __del__ but that fails a test very strangely


Just out of curiosity: Do you have some details about the failure?

mjsir911 · 2024-03-15

Yeah, I'm not sure how much of this was relevant enough to put in the commit, but:

Adding a self.stream.close() call in a __del__ method does work most of the time.

The one test failure I was seeing was in tests/test_reader.py. The failing test was test_get_page_of_encrypted_file, but interestingly it would pass on its own. I narrowed the source of the issue down to the previous test, test_issue297, and its exception block, where the PdfReader() initializer was failing (that's what the test is testing for) and __del__ wasn't being called because the exception happened in __init__.

It's very possible the objects would be GC'd at some point, but test failures were happening due to dangling file pointers at the following test.

I'm going to add this to the commit.
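One plausible mechanism for the delayed cleanup described here (an assumption on my part, not something confirmed in this thread) is that on CPython a caught exception keeps its traceback alive, the traceback keeps the `__init__` frame alive, and that frame still references the half-built object. A minimal sketch with a hypothetical `FakeReader` class:

```python
import gc


class FakeReader:
    """Hypothetical stand-in for a reader that cleans up in __del__."""

    finalized = []  # records every __del__ call

    def __init__(self, fail=False):
        if fail:
            # Mirrors a constructor that raises part-way through.
            raise ValueError("cannot parse")

    def __del__(self):
        FakeReader.finalized.append(True)


def construct_and_keep_exception():
    """Fail inside __init__ and hand the caught exception to the caller."""
    try:
        FakeReader(fail=True)
    except ValueError as exc:
        # exc.__traceback__ -> __init__ frame -> self keeps the object alive.
        return exc


if __name__ == "__main__":
    exc = construct_and_keep_exception()
    print("finalized while exception is held:", len(FakeReader.finalized))
    del exc
    gc.collect()
    print("finalized after exception is dropped:", len(FakeReader.finalized))
```

A test helper that stores the raised exception for later inspection would hold the object open in the same way until that exception is released.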

mjsir911 · 2024-03-15T15:01:49Z

> I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

It's even worse than that, unfortunately! I'm not sure what the reference chain is from the writer to the cloned-from reader's stream, but del writer seems to release the dangling file pointer.

If it's any consolation, the test failures are kind of an edge case where:

- the user is running on Windows
- the reader (or clone_from, transitively) is instantiated via a string/Path
- the file is acted upon outside of the PdfReader's context (opened/unlinked/whatever again by name) while the PdfReader is still in scope / not garbage collected

Sorry for jumping the gun on calling the tests solved! Still iterating on them.

mjsir911 · 2024-03-15T15:04:58Z

> I am not really sure about the fixed tests either - explicitly calling .delete() or even having to close the embedded stream object does not really feel intuitive and maybe even clumsy.

> It's even worse than that, unfortunately! I'm not sure what the reference chain is from the writer to the cloned-from reader's stream, but del writer seems to release the dangling file pointer.

I could potentially add a .close() or something to PdfReader, which would at least make this process explicit. I would still be unsure how to propagate that to PdfWriter's API though.

Making it a context manager might work too and would mirror PdfWriter
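A minimal sketch of what the explicit `.close()` plus context-manager idea could look like. `StreamOwner` is a hypothetical class, not pypdf's implementation, and the flag name is only illustrative; the point is that the object closes the stream only if it opened it itself.

```python
import io
import os
import tempfile
from pathlib import Path
from typing import IO, Union


class StreamOwner:
    """Hypothetical reader-like class; not pypdf's actual implementation."""

    def __init__(self, stream: Union[str, Path, IO[bytes]]) -> None:
        # Remember whether we opened the file, so we only close what we own.
        self._opened_automatically = isinstance(stream, (str, Path))
        if self._opened_automatically:
            stream = open(stream, "rb")  # noqa: SIM115
        self.stream = stream

    def close(self) -> None:
        """Close the stream, but only if this object opened it."""
        if self._opened_automatically:
            self.stream.close()

    def __enter__(self) -> "StreamOwner":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with StreamOwner(path) as owner:
        print("open inside the block:", not owner.stream.closed)
    print("closed after the block:", owner.stream.closed)
    os.unlink(path)  # safe even on Windows, since the handle is released

    caller_stream = io.BytesIO(b"%PDF-")
    with StreamOwner(caller_stream):
        pass
    print("caller's stream left open:", not caller_stream.closed)
```

Leaving caller-supplied streams untouched keeps the behaviour unchanged for anyone already passing in a file object.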

mjsir911 · 2024-03-15T18:16:09Z

I don't want this merged as it currently is, calling garbage collection manually in tests feels yucky.

pubpub-zz · 2024-03-15T20:11:40Z

> It's even worse than that, unfortunately! I'm not sure what the reference chain is from the writer to the cloned-from reader's stream, but del writer seems to release the dangling file pointer.

When you call .clone_document_from_reader() or append(pages), you clone all objects from the PdfReader. During this process we need to keep a connection between the writer's objects and the reader's objects, for example in order to keep parent links.
When you have finished your work, or when you need to append a new set of pages detached from the PdfReader, you have to call writer.reset_translation().

This halves allocated memory when doing a simple PdfWriter(clone_from=<str>). I can't just close self.stream in `__del__` because, for some strange reason, the unit tests mark it as unflagged even after the test block ends. It has something to do with `__del__` finalizers being run on a second pass while `weakref.finalize()` callbacks are run on the first pass.
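For reference, `weakref.finalize()` registers a callback that runs when its target object is collected, independently of `__del__`. A minimal sketch, assuming a hypothetical `AutoClosingReader` class (this is not the code from the commit):

```python
import gc
import os
import tempfile
import weakref


class AutoClosingReader:
    """Hypothetical reader that owns the file it opens; not pypdf code."""

    def __init__(self, path):
        self.stream = open(path, "rb")  # noqa: SIM115
        # Register the stream's own close() method. The callback must not
        # reference `self`, otherwise the reader could never be collected.
        self._finalizer = weakref.finalize(self, self.stream.close)


def demo():
    """Return True if the stream was closed once the reader went away."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    reader = AutoClosingReader(path)
    stream = reader.stream  # keep the stream so its state can be inspected
    del reader
    gc.collect()
    closed = stream.closed
    os.unlink(path)
    return closed


if __name__ == "__main__":
    print("stream closed after reader was collected:", demo())
```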

pubpub-zz · 2024-03-21T05:32:28Z

pypdf/_reader.py

+        if self._we_opened:
+            self.close()


Can you add a test to maintain test coverage, please?

Additionally, I would prefer something like _opened_automatically instead of _we_opened to sound "more generic".

mjsir911 · 2024-03-21T18:44:41Z

I should also document using PdfReader as a context manager somewhere.

pubpub-zz · 2024-04-13T14:53:08Z

pypdf/generic/_base.py

-    def __deepcopy__(self, memo: Any) -> "IndirectObject":
-        return IndirectObject(self.idnum, self.generation, self.pdf)
-


I'm not so fond of removing __deepcopy__: some people may use it, so this could be considered a regression. If we really want to remove it, we should go through the deprecation process.
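As a rough illustration of that suggestion, the method could stay in place for a while and emit a warning. This is only a sketch using the standard `warnings` module: `Ref` is a hypothetical stand-in for `IndirectObject`, and pypdf may well prefer its own deprecation helpers.

```python
import copy
import warnings
from typing import Any


class Ref:
    """Hypothetical stand-in for IndirectObject, reduced to two fields."""

    def __init__(self, idnum: int, generation: int) -> None:
        self.idnum = idnum
        self.generation = generation

    def __deepcopy__(self, memo: Any) -> "Ref":
        # Keep the old behaviour, but tell callers it is going away.
        warnings.warn(
            "__deepcopy__ is deprecated; use .clone() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Ref(self.idnum, self.generation)


if __name__ == "__main__":
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        duplicate = copy.deepcopy(Ref(3, 0))
    print(duplicate.idnum, duplicate.generation)
    print([str(w.message) for w in caught])
```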

pubpub-zz · 2024-04-13T14:54:04Z

> I should also add using PdfReader as a contextmanager in some documentation somewhere

Have you also been able to make progress on your proposal?

mjsir911 · 2024-04-13T20:27:49Z

> have you also been able to advance in your proposal ?

Hi, sorry, I've been taking a break from things due to mental health but plan to be back on them sometime later next month. Moving this back to draft for now.

mjsir911 force-pushed the memory branch from 04fbcb3 to f66f49b Compare March 15, 2024 07:34

mjsir911 changed the title ~~Don't load entire file into memory when PdfReader passed file name~~ PI: Don't load entire file into memory when passed file name Mar 15, 2024

mjsir911 force-pushed the memory branch from f66f49b to 9ccce80 Compare March 15, 2024 07:41

mjsir911 changed the title ~~PI: Don't load entire file into memory when passed file name~~ PI: Don't load entire file into memory when passed file name Mar 15, 2024

stefan6419846 reviewed Mar 15, 2024

stefan6419846 added the PdfReader The PdfReader component is affected label Mar 15, 2024

stefan6419846 reviewed Mar 15, 2024

mjsir911 force-pushed the memory branch 2 times, most recently from 1a4b1af to a0415db Compare March 15, 2024 14:53

mjsir911 force-pushed the memory branch 2 times, most recently from 5c25bc8 to 0786520 Compare March 15, 2024 15:15

pubpub-zz reviewed Mar 15, 2024

mjsir911 marked this pull request as draft March 15, 2024 18:16

mjsir911 added 5 commits March 20, 2024 17:11

TST: Don't deepcopy PdfReader objects

9802481

This breaks if PdfReader contains any un-pickleable attributes (such as file pointers)

MAINT: remove deepcopy functionality

0057334

Was only ever being used unintentionally in the tests and doesn't really make sense. Use .clone() instead

STY: fix typo

1c82256

TST: Use buffer instead of opening file many times

294da34

See py-pdf#2520, basically this was the last failing (only on windows) test because if the pdfreaders are implicitly opening file streams that don't get closed until they get garbage collected the .unlinks() create file lock errors.

mjsir911 force-pushed the memory branch from d9e673e to b105b76 Compare March 21, 2024 00:27

fixup! PI: Don't load entire file into memory when passed file name

5209fcd

mjsir911 force-pushed the memory branch from b105b76 to 5209fcd Compare March 21, 2024 00:48

pubpub-zz reviewed Mar 21, 2024

fixup! PI: Don't load entire file into memory when passed file name

1f81b68

pubpub-zz reviewed Apr 13, 2024
