Endless Loop When Processing Certain Large PDF with PdfFileWriter #358

Closed
suokunlong opened this issue Jul 6, 2017 · 6 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfWriter The PdfWriter component is affected

Comments

@suokunlong

STEPS TO REPRODUCE:

  1. Download the test pdf file:
    https://suokunlong.cn/owncloud/index.php/s/bWyTHYfoMii3Yh9
    The file is named 2017-Textbook-EconomicLaw.pdf; it is 65.1 MB and has 511 pages.

  2. Run the following code:

from PyPDF2 import PdfFileWriter, PdfFileReader

pdf_in_filename = r"/path/to/2017-Textbook-EconomicLaw.pdf"
pdf_out_filename = r"/path/to/2017-Textbook-EconomicLaw-new.pdf"

pdf_out = PdfFileWriter()
pdf_in = PdfFileReader(open(pdf_in_filename, 'rb'))

numpages = pdf_in.getNumPages()
for i in range(numpages):
    pdf_out.addPage(pdf_in.getPage(i))

with open(pdf_out_filename, 'wb') as outputStream:
    pdf_out.write(outputStream)
  3. The code runs forever on the last line (pdf_out.write).

OTHER USEFUL INFORMATION:

  1. If I change the line
for i in range(numpages):

to

for i in range(3):

the output is written very quickly.

  2. If I open the test PDF file with evince on my Linux desktop and print it to a new PDF file, the above code finishes within 5 s on the new file.

PyPDF2.__version__
'1.25.1'
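
Since three pages are written quickly and all 511 are not, one way to narrow the problem down is to time the write for growing page counts. The following is only a sketch under assumptions: the helper names `doubling_counts`, `time_first_pages` and `probe` are made up for illustration, and it uses the PyPDF2 1.x names from the report. It has not been run against the test file.

```python
import io
import time


def doubling_counts(total):
    """Yield 1, 2, 4, ... and finally `total`, so a few runs cover the whole range."""
    if total <= 0:
        return
    n = 1
    while n < total:
        yield n
        n *= 2
    yield total


def time_first_pages(path, count):
    """Copy the first `count` pages of `path` into an in-memory PDF; return the write time in seconds."""
    # Imported lazily so doubling_counts() can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # PyPDF2 1.x names, as in the report

    with open(path, "rb") as fh:
        reader = PdfFileReader(fh)
        # Same call order as the report: ask for the page count before fetching pages.
        count = min(count, reader.getNumPages())
        writer = PdfFileWriter()
        for i in range(count):
            writer.addPage(reader.getPage(i))
        start = time.perf_counter()
        writer.write(io.BytesIO())  # only the write step is timed
        return time.perf_counter() - start


def probe(path, total):
    """Print the write time for growing page counts (hypothetical helper)."""
    for count in doubling_counts(total):
        # Printed before the write, so a stall shows which page count triggered it.
        print(f"writing first {count} pages ...", flush=True)
        print(f"  done in {time_first_pages(path, count):.2f}s")
```

Calling `probe(pdf_in_filename, 511)` would print one line per page count; the last "writing first N pages" line without a matching "done" line marks the range that contains the problematic page.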

@suokunlong
Author

For your information, I noticed this bug while trying to add bookmarks to the PDF file using:
https://github.com/RussellLuo/pdfbookmarker.

@suokunlong
Author

When I run pdfinfo on the test PDF file, I see:
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:RC4)

The re-printed PDF is not encrypted.

The issue may be related to the encryption.
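
To test the encryption theory, the reader can be checked and decrypted explicitly before any pages are copied. This is a minimal sketch, assuming the file uses an empty user password (pdfinfo could read it without one); `ensure_decrypted` is a hypothetical helper, while `isEncrypted` and `decrypt()` are the PyPDF2 1.x reader attributes. Whether this changes the hang is not established in this thread.

```python
def ensure_decrypted(reader, password=""):
    """Decrypt `reader` in place if it is encrypted; return True if it was encrypted."""
    if not reader.isEncrypted:
        return False
    # decrypt() returns 0 when the password matches neither the user nor the owner password.
    if reader.decrypt(password) == 0:
        raise ValueError("password did not decrypt the PDF")
    return True


# Usage with PyPDF2 1.x, reusing the variable name from the report:
# from PyPDF2 import PdfFileReader
# pdf_in = PdfFileReader(open(pdf_in_filename, "rb"))
# print("was encrypted:", ensure_decrypted(pdf_in))
```

If the call raises `ValueError`, the empty-password assumption is wrong for this file and the password would have to be supplied.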

@vstoykov
Contributor

If there is a broken image or the file is not decrypted correctly, then the changes in PR #331 will probably raise an error instead of taking forever. Can you check?

@sekrause
Contributor

Issue #329 could be related. As @vstoykov said, you should check out my PR #331.

@hecko

hecko commented Nov 4, 2020

I have just hit this issue with an embedded picture in a template PDF file, on a Mac with PyPDF2 1.26.0 and Python 3.

@MartinThoma added the PdfWriter and is-bug labels on Apr 7, 2022
@MartinThoma
Member

I'm closing this issue as I'm pretty certain this is solved (e.g. via #740) in recent versions of PyPDF2.

If you still encounter this issue with a recent PyPDF2 version, please let me know.
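
For retesting on a recent version, the original reproduction can be written with the newer class and method names. This is a sketch, assuming PyPDF2 2.x or later, where `PdfReader`, `PdfWriter`, `reader.pages` and `writer.add_page` are available; `copy_pages` is a hypothetical helper, and the thread does not confirm the result for the original test file.

```python
def copy_pages(reader, writer):
    """Append every page of `reader` to `writer`; return the number of pages copied."""
    copied = 0
    for page in reader.pages:
        writer.add_page(page)
        copied += 1
    return copied


# Usage with PyPDF2 2.x or later, with the paths from the report:
# from PyPDF2 import PdfReader, PdfWriter
# reader = PdfReader("/path/to/2017-Textbook-EconomicLaw.pdf")
# writer = PdfWriter()
# print("pages copied:", copy_pages(reader, writer))
# with open("/path/to/2017-Textbook-EconomicLaw-new.pdf", "wb") as output_stream:
#     writer.write(output_stream)
```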
