Reusing PdfMerger after write generates PDF with extra pages #1337

Closed
bashirmindee opened this issue Sep 9, 2022 · 4 comments
Labels
help wanted: We appreciate help everywhere - this one might be an easy start!
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-merge: From a user's perspective, merging is the affected feature/workflow

Comments

@bashirmindee

bashirmindee commented Sep 9, 2022

I was trying to merge the same PDF with itself multiple times. First I want the original PDF, then the PDF duplicated, then the PDF repeated three times, and so forth.

Environment

ubuntu 20.04
Python==3.8.12+
Package Version


pip==21.1.1
PyPDF2==2.10.5
setuptools==56.0.0
typing-extensions==4.3.0

$ python -m platform
Linux-5.11.0-40-generic-x86_64-with-glibc2.29
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.5

Code + PDF

This is a minimal, complete example that shows the issue:

##script.py

from PyPDF2 import PdfReader, PdfMerger

merger = PdfMerger()
reader = PdfReader("blank.pdf")

for j in range(9):
    merger.append(reader)
    merger.write(f"generated_pdfs/{len(merger.pages)}.pdf")

Here is the blank.pdf that causes the issue.

Expected behavior

1.pdf: must contain 1 page and contains 1 page ✅
2.pdf: must contain 2 pages but contains 3 pages ❌
3.pdf: must contain 3 pages but contains 6 pages ❌
4.pdf: must contain 4 pages but contains 10 pages ❌
5.pdf: must contain 5 pages but contains 15 pages ❌
6.pdf: must contain 6 pages but contains 21 pages ❌
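
One possible workaround, not proposed in this thread, is to avoid reusing the merger at all: build a fresh merger for each output file and append the source the required number of times. The following is a minimal sketch, assuming a freshly created `PdfMerger` writes exactly the pages appended to it. `write_copies` and its `make_merger` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def write_copies(make_merger, source, out_dir, max_copies):
    """Write 1.pdf .. <max_copies>.pdf, each built by its own merger.

    make_merger: zero-argument callable returning a new merger object
                 (for example PyPDF2.PdfMerger); hypothetical parameter.
    source:      anything the merger's append() accepts, e.g. a PdfReader.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for copies in range(1, max_copies + 1):
        merger = make_merger()  # a new merger, never reused after write()
        for _ in range(copies):
            merger.append(source)
        target = out_dir / f"{copies}.pdf"
        merger.write(str(target))
        merger.close()
        written.append(target)
    return written
```

With PyPDF2 installed, this could be called as `write_copies(PdfMerger, PdfReader("blank.pdf"), "generated_pdfs", 9)`.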

@MartinThoma MartinThoma added the workflow-merge label Sep 24, 2022
@MartinThoma
Member

Interesting. With PyPDF2==2.10.9:

  • 1.pdf contains 1 page
  • 2.pdf contains 3 pages (+2)
  • 3.pdf contains 6 pages (+3)
  • 4.pdf contains 10 pages (+4)
  • 5.pdf contains 15 pages (+5)
  • ...

I'm not sure why ...
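
The counts above are the running totals 1, 1+2, 1+2+3, and so on (the triangular numbers n(n+1)/2). That pattern would arise if every `write()` copied all queued pages into an output that is never cleared between calls. This is an assumption, not something confirmed in the thread; the toy class below is a pure-Python stand-in, not PyPDF2 code.

```python
class ToyMerger:
    """Toy stand-in (NOT PyPDF2): write() re-adds every queued page
    to an output buffer that persists across calls."""

    def __init__(self):
        self.pages = []    # pages queued via append()
        self._output = []  # assumed flaw: never reset between writes

    def append(self, page):
        self.pages.append(page)

    def write(self):
        # Copy all queued pages into the output again, then report
        # how many pages the written file would contain.
        self._output.extend(self.pages)
        return len(self._output)


def observed_counts(n):
    """Page count of each file when append + write is repeated n times."""
    merger = ToyMerger()
    counts = []
    for _ in range(n):
        merger.append("blank page")
        counts.append(merger.write())
    return counts


print(observed_counts(5))  # [1, 3, 6, 10, 15]
```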

@MartinThoma MartinThoma added the help wanted and is-bug labels Sep 24, 2022
@MartinThoma MartinThoma changed the title from "Reusing PdfFileMerger after write generates PDF with extra pages" to "Reusing PdfMerger after write generates PDF with extra pages" Sep 24, 2022
@pubpub-zz
Collaborator

@bashirmindee
A change that is part of PR #1371, in commit f9d7d19, should fix it. The other commits should not be required.
The change is small, so you should be able to copy it if you want to try it.

MartinThoma pushed a commit that referenced this issue Dec 11, 2022
The method `.clone(pdf_dest,[force_duplicate])` clones the object and all objects it references.

If an object has already been cloned, the existing clone is returned (unless `force_duplicate` is set).
This is mainly for internal use, but it can be used on a page.
For PageObject/DictionaryObject/[Encoded/Decoded/Content]Stream, an extra parameter `ignore_fields` provides the list of fields that should not be cloned.

When available, the pointer to an object is exposed in its `indirect_obj` attribute.

New API for add_page/insert_page that:

* returns the cloned page object
* accepts `ignore_fields` as a parameter

## Others

* The file is closed at the end of `PdfWriter.write` when a filename is provided
* Breaking Change: `add_outline_item` now has a parameter `before`, which is not the last parameter

## Update
* The public API of PdfMerger has been added to PdfWriter (ready to make PdfMerger an alias of it)
* Properly process outline merging
* Properly process named destinations

Deals with #1194, #1322, #471, #1337
@pubpub-zz
Collaborator

@bashirmindee
With the latest version of pypdf:

##script.py

from PyPDF2 import PdfReader, PdfWriter

writer = PdfWriter()
reader = PdfReader("blank.pdf")

for j in range(9):
    writer.append(reader)
    writer.write(f"generated_pdfs/{len(writer.pages)}.pdf")
    writer.reset_translation(reader)  # to append independent pages
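
The memoised `clone` described in the commit message and the `reset_translation` call in the script above can be pictured with a toy model. This is a sketch under assumptions, not pypdf's implementation: `ToyWriter`, its `_translation` dict, and the argument-free `reset_translation()` are hypothetical simplifications (the real call in the script takes the reader).

```python
class ToyWriter:
    """Toy model (NOT pypdf): remembers which source objects were cloned."""

    def __init__(self):
        self.objects = []       # clones owned by this writer
        self._translation = {}  # id(source object) -> clone already made

    def clone(self, source, force_duplicate=False):
        key = id(source)
        if not force_duplicate and key in self._translation:
            # Already cloned: hand back the existing clone.
            return self._translation[key]
        copy = dict(source)     # shallow copy stands in for a deep clone
        self.objects.append(copy)
        self._translation[key] = copy
        return copy

    def reset_translation(self):
        # Forget earlier clones so the next clone() makes a new object.
        self._translation.clear()


page = {"/Type": "/Page"}
writer = ToyWriter()
first = writer.clone(page)
again = writer.clone(page)   # same object as `first`
writer.reset_translation()
fresh = writer.clone(page)   # independent copy
print(first is again, first is fresh, len(writer.objects))  # True False 2
```

In this model, appending the same source twice without a reset reuses the earlier clone, which is why the script resets the translation before each new append to get independent pages.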

@pubpub-zz
Collaborator

I'm closing this as solved.
