HTML links to document page broken after merge #471

Closed
sanohin opened this issue Nov 21, 2018 · 15 comments · Fixed by #1858
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@sanohin

sanohin commented Nov 21, 2018

If a PDF file contains internal links (an HTML anchor tag with an element id as its href), they stop working after merging.

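<a href="#target">Go to target</a>
....some content
<div id="target">Target here</div>

from io import BytesIO

from PyPDF2 import PdfFileMerger, PdfFileReader


# merge_report is a hypothetical wrapper so that the final `return` is valid
def merge_report(pdf: bytes) -> bytes:
    # `pdf` holds the HTML page rendered to PDF, including its internal links
    report = PdfFileReader(BytesIO(pdf))
    merger = PdfFileMerger(strict=False)
    merger.append(report)

    result = BytesIO()
    merger.write(result)
    result.seek(0)
    return result.read()
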
@jackneil

jackneil commented Jun 7, 2019

I second this issue. It also occurs when merging via the pageObject.mergePage method.

@advename

I'll jump on the train - having the same issue here!

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 7, 2022
@MartinThoma MartinThoma added the needs-pdf The issue needs a PDF file to show the problem label Jun 26, 2022
@MartinThoma
Member

Can anybody share a PDF which shows this issue? Is it still an issue with the latest PyPDF2 version?

@SimplyOm

SimplyOm commented Jul 19, 2022

@MartinThoma Not sure if you're still looking for an example, but you can find one below:

I generated book.pdf using the sample document creation of the jupyter-book project. The internal HTML links in the Contents section work fine in book.pdf but do not work in book_out.pdf. The conversion from book.pdf to book_out.pdf uses the code snippet below:

from PyPDF2 import PdfReader, PdfWriter

PDF = "./doc/_build/pdf/book.pdf"
OUT_PDF = "./doc/_build/pdf/book_out.pdf"

reader = PdfReader(PDF)
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)  # add_page replaces the deprecated addPage

with open(OUT_PDF, "wb") as f:
    writer.write(f)

@MartinThoma MartinThoma added Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests and removed needs-pdf The issue needs a PDF file to show the problem labels Jul 19, 2022
@MartinThoma
Member

Thank you @SimplyOm 🤗

@dc-em

dc-em commented Oct 4, 2022

Hi @MartinThoma, is this solved? I'm facing the same issue.

@pubpub-zz
Collaborator

In progress, should come back soon

@pubpub-zz
Collaborator

I think the cause has been found:
a) The links use named destinations, which are not copied by add_page: I've coded the append/merge functions into PdfWriter.
b) Some types were not matching: I've added a function for that (implemented in the merge).
The changes are in PR #1371 (still in progress).
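
To illustrate point a), here is a toy sketch of why copying pages one by one can leave links dangling. It is not pypdf code: `ToyPdf`, `copy_pages_only`, `copy_with_dests` and `broken_links` are hypothetical names, and the model assumes only that links refer to targets by name while the name table lives at document level.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyPdf:
    """Toy document (not a pypdf class): links live on pages, names live on the document."""

    pages: list[list[str]] = field(default_factory=list)      # per page: names its links point to
    named_dests: dict[str, int] = field(default_factory=dict)  # name -> target page index


def copy_pages_only(src: ToyPdf) -> ToyPdf:
    """Copy page by page; the document-level name table is left behind."""
    return ToyPdf(pages=[list(links) for links in src.pages])


def copy_with_dests(src: ToyPdf) -> ToyPdf:
    """Copy the pages and carry the name table along with them."""
    return ToyPdf(
        pages=[list(links) for links in src.pages],
        named_dests=dict(src.named_dests),
    )


def broken_links(doc: ToyPdf) -> list[str]:
    """Return every link name that no longer resolves to a page."""
    return [name for links in doc.pages for name in links if name not in doc.named_dests]


# Page 0 links to "target", which the document maps to page 1.
source = ToyPdf(pages=[["target"], []], named_dests={"target": 1})
print(broken_links(source))
print(broken_links(copy_pages_only(source)))
print(broken_links(copy_with_dests(source)))
```

Only the page-by-page copy reports a dangling `target` link, which mirrors the symptom reported in this thread.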

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Oct 12, 2022
@manathan1984

On a related issue, when using merge, the internal links of a PDF seem to be broken. I mean links such as those in a research paper that point to a reference at the end of the PDF or to a section of the paper. Any ideas on how to keep those links active when merging?

MartinThoma pushed a commit that referenced this issue Dec 11, 2022
The method `.clone(pdf_dest,[force_duplicate])` clones the object and all objects it references.

If an object has already been cloned, the existing clone is returned (unless `force_duplicate` is set).
The method is mainly for internal use but can also be called on a page.
For PageObject/DictionaryObject/[Encoded/Decoded/Content]Stream, an extra parameter `ignore_fields` provides the list of fields that should not be cloned.
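
The "return the already cloned object" behaviour described above can be sketched with a memo table. This is a toy model under stated assumptions, not the library's implementation: `Node` is a hypothetical stand-in for a PDF object, and the `memo` dictionary plays the role of whatever bookkeeping the real writer keeps.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Toy stand-in for a PDF object that may reference other objects."""

    name: str
    refs: list[Node] = field(default_factory=list)


def clone(obj: Node, memo: dict[int, Node], force_duplicate: bool = False) -> Node:
    """Copy obj and everything it references, reusing earlier copies."""
    key = id(obj)
    if key in memo and not force_duplicate:
        return memo[key]  # already cloned: hand back the same copy
    copy = Node(obj.name)
    memo[key] = copy  # register before recursing so that cycles terminate
    copy.refs = [clone(ref, memo) for ref in obj.refs]
    return copy


# Two pages share one font object, and the font points back at page1 (a cycle).
font = Node("font")
page1 = Node("page1", [font])
page2 = Node("page2", [font])
font.refs.append(page1)

memo: dict[int, Node] = {}
copy1 = clone(page1, memo)
copy2 = clone(page2, memo)
print(copy1.refs[0] is copy2.refs[0])  # both copies share a single cloned font
```

Registering the copy in `memo` before descending into its references is what keeps shared objects shared and stops reference cycles from recursing forever.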

When available, the pointer to an object is stored in the `indirect_obj` attribute.

New API for add_page/insert_page:

* returns the cloned page object
* accepts `ignore_fields` as a parameter

## Others

* file is closed at the end of PdfWriter.write when a filename is provided
* Breaking Change: `add_outline_item` now has a parameter `before`, which is not the last parameter

## Update
* The public API of PdfMerger has been added to PdfWriter (ready to make PdfMerger an alias of it)
* Outline merging is processed properly
* Named destinations are processed properly

Deals with #1194, #1322, #471, #1337
@pubpub-zz
Collaborator

@manathan1984,
it is now recommended to use PdfWriter and append(), which should fix the issue. Can you try it and update the status of this issue?

@DX9807

DX9807 commented Feb 9, 2023

@pubpub-zz

writer = PdfWriter()
for pdf in ["cover_page.pdf", "main_report.pdf", "back_cover.pdf"]:
    writer.append(pdf)

with open("result.pdf", "wb") as f:
    writer.write(f)

I'm getting the error below when using PdfWriter and append():

AttributeError: 'NumberObject' object has no attribute 'indirect_reference'

@pubpub-zz
Collaborator

@DX9807
Can you please provide the PDF files?

@DX9807

DX9807 commented Feb 10, 2023

@pubpub-zz
Check the files given below
back_cover.pdf
central.pdf
cover_page.pdf

While trying to merge the above PDFs using PdfWriter and its append method, I get this error:

AttributeError: 'NumberObject' object has no attribute 'indirect_reference'

But when I use the PdfMerger class and its append method, the PDFs get merged, but the internal hyperlinks do not work.

@pubpub-zz pubpub-zz self-assigned this Feb 26, 2023
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 23, 2023
closes py-pdf#471

The issue was caused by named destinations using numbers instead of indirect objects to point to pages. This is normally not expected.
MartinThoma pushed a commit that referenced this issue Jun 25, 2023
The issue was caused by named destinations using numbers instead of indirect objects to point to pages. This is normally not expected.

Closes #471
Closes #1898
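
As a rough illustration of that fix description, the sketch below resolves a destination whose page entry is either a reference to a page object (the usual case) or a bare page number (the unusual case named in the commit). It is a toy model, not the library's code: `PageRef` and `resolve_page_index` are hypothetical names. The earlier `'NumberObject' object has no attribute 'indirect_reference'` error appears consistent with code that assumed only the first case.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class PageRef:
    """Toy indirect reference: identifies a page object by its object number."""

    obj_id: int


def resolve_page_index(target: Union[PageRef, int], page_ids: list[int]) -> int:
    """Return the zero-based index of the page a destination points to.

    page_ids lists the object number of every page, in document order.
    """
    if isinstance(target, PageRef):
        # Usual case: look the referenced page object up among the pages.
        return page_ids.index(target.obj_id)
    if isinstance(target, int) and 0 <= target < len(page_ids):
        # Unusual case: the destination stores a plain page number.
        return target
    raise ValueError(f"cannot resolve destination target {target!r}")


page_ids = [10, 14, 21]  # made-up object numbers for a three-page document
print(resolve_page_index(PageRef(14), page_ids))
print(resolve_page_index(2, page_ids))
```

Handling both shapes in one place means a file with the unusual form is resolved rather than raising an attribute error part-way through a merge.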
@rocketrefrigerator

Hello, is it included in 3.16.0?

@stefan6419846
Collaborator

If you have a look at the last commit referenced here (b1fa953), you will see that this fix is included since version 3.11.1.
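
For anyone who wants to check an installed version against that threshold, a minimal sketch follows. It assumes the distribution is installed under the name `pypdf`. The helper names `parse_release`, `has_fix` and `installed_has_fix` are made up for this example. Versions are compared as integer tuples because plain string comparison would rank "3.9.1" above "3.11.1".

```python
from __future__ import annotations

import re
from importlib.metadata import PackageNotFoundError, version

FIRST_FIXED = (3, 11, 1)  # first release containing the fix, per the comment above


def parse_release(text: str) -> tuple[int, ...]:
    """Turn a version string such as '3.16.0' into (3, 16, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def has_fix(installed: str) -> bool:
    """True if the given version string is at least 3.11.1."""
    return parse_release(installed) >= FIRST_FIXED


def installed_has_fix(package: str = "pypdf") -> bool:
    """Check the installed distribution; the name 'pypdf' is an assumption."""
    try:
        return has_fix(version(package))
    except PackageNotFoundError:
        return False


print(has_fix("3.16.0"))
print(has_fix("3.11.0"))
```

By this comparison, 3.16.0 is newer than 3.11.1 and therefore includes the fix.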
