Uses buffered input for inline images to speed up reading #390

archivsozialebewegungen · 2018-01-26T14:13:19Z

There are PDF files in the wild that use large inline images, although the PDF specification recommends against this. If you try to open one of these files with PyPDF2, the process appears to hang because inline image data is read byte by byte. This patch introduces buffering for inline images. The (huge) buffer size was determined experimentally to give the best reading speed.
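For readers unfamiliar with the problem, here is a minimal sketch of the general idea: read the stream in large chunks and search each chunk for the `EI` operator that ends inline image data, instead of calling `read(1)` in a loop. This is not the code from this patch or from PyPDF2. The function name `read_inline_image_data`, the 1 MiB default, and the rule "`EI` must be surrounded by whitespace" are assumptions made for illustration, and a real parser needs more care because binary image data can contain that byte sequence by chance.

```python
import io

END_MARKER = b"EI"  # operator that ends inline image data in a content stream
WHITESPACE = b" \t\n\r\x00\x0c"  # PDF whitespace characters


def read_inline_image_data(stream, chunk_size=1024 * 1024):
    """Hypothetical helper: return the bytes before the first whitespace-delimited EI.

    The returned data excludes the single whitespace byte before EI.
    The stream is left positioned directly after the EI operator.
    """
    start = stream.tell()
    buffer = b""
    search_from = 0
    while True:
        chunk = stream.read(chunk_size)
        at_eof = len(chunk) == 0
        buffer += chunk
        pos = buffer.find(END_MARKER, search_from)
        while pos != -1:
            end = pos + len(END_MARKER)
            if end == len(buffer) and not at_eof:
                # The marker touches the end of the buffer, so the byte after
                # it is not known yet. Read more before deciding.
                break
            preceded = pos > 0 and buffer[pos - 1] in WHITESPACE
            followed = end == len(buffer) or buffer[end] in WHITESPACE
            if preceded and followed:
                stream.seek(start + end)
                return buffer[: pos - 1]
            pos = buffer.find(END_MARKER, pos + 1)
        if at_eof:
            raise ValueError("no EI operator found")
        # Step back far enough to catch a marker split across two chunks.
        search_from = max(0, len(buffer) - len(END_MARKER))
```

For example, `read_inline_image_data(io.BytesIO(b"abc EI\n"))` returns `b"abc"`. The per-call overhead is paid once per chunk rather than once per byte, which is the effect the buffering in this PR is after.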

MartinThoma · 2022-04-16T07:52:28Z

Thank you for your contribution!

MartinThoma · 2022-04-16T07:52:32Z

#740, which was merged and released yesterday, probably already did this.

MartinThoma · 2022-04-16T07:54:26Z

I know it's been a long time since you created this PR. Would you mind checking whether your PR still adds value (and potentially fixing the merge conflicts)?

archivsozialebewegungen · 2022-04-17T10:20:56Z

After a first glance at the code changed in #740, the buffering introduced there should mitigate the problem. I'm not sure whether the find/seek solution for detecting the end of the image data stream is faster or slower than my regex solution, but this should not make a big difference. The buffer size of only 8k might be more of a concern: I used 1m for a reason, so 8k might still result in poor performance for large images. I will create a performance test next week when I'm back in the office, where I have real-life data samples.
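A rough sketch of what such a buffer-size comparison could look like is shown below. It only times raw chunked reads of a file and does not exercise PyPDF2's parser, so it is a starting point rather than the performance test mentioned above. The helper names `read_all`, `time_chunk_sizes`, and `make_sample_file` are hypothetical, and the random sample file stands in for real-life PDFs.

```python
import os
import tempfile
import time


def read_all(path, chunk_size):
    """Read the file in chunks of `chunk_size` bytes and return the total bytes read."""
    total = 0
    # buffering=0 turns off Python's own buffer, so each read() of
    # `chunk_size` bytes goes to the operating system directly.
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def time_chunk_sizes(path, chunk_sizes, repeats=3):
    """Return {chunk_size: best wall-clock time in seconds over `repeats` runs}."""
    results = {}
    for size in chunk_sizes:
        best = float("inf")
        for _ in range(repeats):
            started = time.perf_counter()
            read_all(path, size)
            best = min(best, time.perf_counter() - started)
        results[size] = best
    return results


def make_sample_file(size_in_bytes):
    """Write random bytes to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as handle:
        handle.write(os.urandom(size_in_bytes))
        return handle.name
```

Calling `time_chunk_sizes(path, [1, 8 * 1024, 1024 * 1024])` on a sample file would compare byte-by-byte reads with the 8k and 1m sizes discussed here. The absolute numbers depend on hardware, the operating system's file cache, and the Python version, so they should be read as relative indications only.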

MartinThoma · 2022-04-17T11:59:30Z

Thank you so much! I would love to have performance tests in the test suite! (Maybe even in CI?)

MartinThoma · 2022-06-12T06:45:20Z

Hi @archivsozialebewegungen! Did you have the time to run performance tests?

MartinThoma · 2022-06-14T19:48:43Z

The main point of this PR was to use a buffer / read in bigger chunks.

We do read 8kB chunks now: https://github.com/py-pdf/PyPDF2/blob/main/PyPDF2/generic.py#L1176

As the code base has changed quite a bit, I'm closing this PR now. Feel free to submit another PR (I'll handle that one quicker 🤞 )

archivsozialebewegungen · 2022-06-15T09:01:38Z

I'm also sorry for answering late. Yesterday I tried to find one of the PDF files that caused trouble for us and could not find one. Finally I looped over all our PDF files and found no performance issues, even with version 1.26. I have two possible explanations for this puzzling behaviour:

- A performance boost from better hardware (unlikely to this degree)
- Much better memory management when reading files with Python 3.9 compared to Python 3.2, where the problem arose and made me introduce buffering

I think the latter is the most likely explanation, although I do not have proof.

So in short: the problem seems to be solved with and without buffering, and I fully agree with closing the PR.

Kind regards,
Michael

Uses buffered input for inline images to speed up reading

deba79f

MartinThoma added the nf-performance (Non-functional change: Performance) and Tiny (Pull requests that make a tiny change - and thus should be easy to merge) labels on Apr 6, 2022

MartinThoma added the needs-discussion (The PR/issue needs more discussion before we can continue) label on Apr 16, 2022

MartinThoma closed this Jun 14, 2022
