Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Invalid Link #2450

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

BUG: Invalid Link #2450

wants to merge 6 commits into from

Conversation

rsinger417
Copy link
Contributor

@rsinger417 rsinger417 commented Feb 8, 2024

passes an IndirectObject for the target page instead of an integer. passing an integer creates an invalid link.

Fixes #2443

passes an IndirectObject for the target page instead of an integer. passing an integer creates an invalid link.
resolves py-pdf#2443 Issue
@stefan6419846
Copy link
Collaborator

Are you able to add a corresponding test case as well which shows the previous issue and demonstrates that your fix does indeed solve this?

@rsinger417
Copy link
Contributor Author

testpages.csv
testIndexCenterPnt.csv
Test Book0.pdf
link test .txt

link test.txt is the code. I could not upload a *.py file. The code needs to be changed to find the supporting files as the locations are hard coded.

An invalid link is created, it works, but if you delete a page the links are broken. Acrobat will re move then as invalid links when optimized.

@stefan6419846
Copy link
Collaborator

Could you add something of this as some automated unit/integration test?

@rsinger417
Copy link
Contributor Author

I don't know what I'm doing. I never read the book on GitHub. I don't known how to automated unit/integration test.
The code I uploaded was how I tested it. The links work fine with the final pdf until until you remove pages or optimize it
with Acrobat. The links are invalid because the destination page in the link are integers. The code was broke with PyPDF2
version 2.90 (7/31/2022) with the introduction of the method add_annotation in the class PdfWriter. add_link was deprecated
in version 2.9.0. I got the code fix from version 2.8.1 (7/25/2022) from the add_link method in the class PdfWriter line 1532
and line 1534, this was the latest version that I found where the link worked correctly. I left line 1532 unchanged.
1532 pages_obj = cast(Dict[str, Any], self.get_object(self._pages))
1534 page_dest = pages_obj[PA.KIDS][pagedest] # TODO: switch for external link
I replaced "pagedest" with "tmp["target_page_index"]" which is the integer value of the page and
"page_dest" with "taget_page" which is the IndirectObject of the page. This IndirectObject references the same page even when
other pages are removed and keeping it from being an invalid link. I got rid of the TODO comment.

@MartinThoma MartinThoma changed the title Update _writer.py - Invalid Link Fix #2448 BUG: Invalid Link Feb 13, 2024
@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Feb 13, 2024
@MartinThoma
Copy link
Member

@rsinger417 The test tests/test_generic.py::test_annotation_builder_link fails. Do you see why? (It could also be a test issue; I haven't looked into it so far)

@rsinger417
Copy link
Contributor Author

rsinger417 commented Feb 13, 2024 via email

@stefan6419846
Copy link
Collaborator

I do not see the warning in the CI log, but an actual error which maps to a changed code line:

>           target_page = pages_obj[PA.KIDS][tmp["target_page_index"]]
E           IndexError: list index out of range

It seems like tmp["target_page_index"] returns an invalid page number?

@rsinger417
Copy link
Contributor Author

what is the value in "target_page_index"? It should be a page number in the pdf.

@stefan6419846
Copy link
Collaborator

what is the value in "target_page_index"? It should be a page number in the pdf.

See the CI output which displays this value:

annotation = {'/Type': '/Annot', '/Subtype': '/Link', '/Rect': RectangleObject([100, 100, 300, 200]), '/Border': [50, 10, 4], '/Dest': {'target_page_index': 1, 'fit': '/Fit', 'fit_args': []}, '/P': IndirectObject(4, 0, 140453995250688)}

You should be able to locally debug this as well, as this is a permanent error. The background is that writer only has one page due to

writer.add_page(page)
Thus target_page_index=1 now points to an invalid page:

pages_obj: {'/Type': '/Pages', '/Count': 1, '/Kids': [IndirectObject(4, 0, 140453995250688)]}
pages_obj[PA.KIDS]: [IndirectObject(4, 0, 140453995250688)]
tmp: {'target_page_index': 1, 'fit': '/Fit', 'fit_args': []}

With this analysis, your proposed change is a breaking one and thus most likely requires a deprecation process - although it might be debatable whether the current implementation would constitute as a bug or desired behavior.

@rsinger417
Copy link
Contributor Author

The test is wrong
lines 954-955 should come before line 952 since there is no page 1 when link_annotation is call you will get the
"IndexError: list index out of range." You have to add the page first. The first page in the pdf is page 0, the second page is page 1.
If there is no second page in the pdf you will still get an error. You could put a test to check if it is out of range and return an error
such as "No such page 1 in pdf"

945 # Part 4: Internal Link
946 with pytest.warns(DeprecationWarning):
947 link_annotation = AnnotationBuilder.link(
948 rect=(100, 100, 300, 200),
949 target_page_index=1,
950 border=[50, 10, 4],
951 )
952 writer.add_annotation(0, link_annotation)
953
954 for page in reader.pages[1:]:
955 writer.add_page(page)

@stefan6419846
Copy link
Collaborator

The test is wrong

The test has been there before your change and worked, thus I would assume that this is/was some intended functionality. Apparently it was considered a valid use case to generate annotations for invalid pages which might be added later on. Yes, we could argue about how useful this is, but this is just how it has been in the past. The original change where this has been introduced is #1189.

Let's wait for the opinion of the other maintainers regarding this.

@rsinger417
Copy link
Contributor Author

Unless you add_page first there is no page and thus no IndirectLink to it. If there is no IndirecLink then the link will be invalid. How can you link to an internal page that does not exist. PR #1189 does not have this test in it. The date of this PR #1189 commit was when the links became invalid PyPDF2 2.9.0 7/31/2022. The links were good in PyPDF2 2.8.1. The invalid links will work but they are not valid and will be broken if pages are removed and if the file is optimized in Acrobat they will be removed.

@stefan6419846 stefan6419846 added on-hold PR requests that need clarification before they can be merged.A comment must give details needs-discussion The PR/issue needs more discussion before we can continue and removed soon PRs that are almost ready to be merged, issues that get solved pretty soon labels Feb 23, 2024
@stefan6419846
Copy link
Collaborator

@pubpub-zz @MartinThoma Any input on #2450 (comment) and how to continue with the (now failing) test regarding possibly breaking behavior?

@pubpub-zz
Copy link
Collaborator

destinations within the document are using indirect objects:
image
but for remote destination numbers are accepted:
image
(...)
image

Having a look deeper, I think the issue is coming from the error in the typing in the Link constructor which is not dealing with link to external documents.

@stefan6419846
Copy link
Collaborator

@pubpub-zz It seems like your misread my comment. The remaining issue is that until now, pypdf would allow links to pages not (yet) added to the file; while with the solution proposed in this PR, we would restrict this to pages already present in the file. I consider this a possibly breaking change, but wanted to get a second opinion on this before continuing.

@pubpub-zz
Copy link
Collaborator

@pubpub-zz It seems like your misread my comment. The remaining issue is that until now, pypdf would allow links to pages not (yet) added to the file; while with the solution proposed in this PR, we would restrict this to pages already present in the file. I consider this a possibly breaking change, but wanted to get a second opinion on this before continuing.

Having to add the pages within the PdfWriter before referencing in links is mandatory : you can not guess what will be the IndirectObject before adding it into the PdfWriter.

@stefan6419846
Copy link
Collaborator

In the current release, this would work indeed, id est referencing an invalid page. After merging this PR, this would change, thus it might be a breaking one - although I am not sure whether we consider the old behavior (allowing invalid references) a bug or not, as for a bug we would not have to really consider this a breaking change.

@pubpub-zz
Copy link
Collaborator

The legacy code was compatible with both arguments
With the current code you cannot create links to external documents

@stefan6419846
Copy link
Collaborator

@rsinger417 Could you please check/verify whether you can keep support for external links which do not point to the same document?

@rsinger417
Copy link
Contributor Author

external link works, the annotation uses "url" instead of "target_page_index" for an internal link such as
mylink=Link(rect=[600,600,700,700], border=[0,0,1,[3,2]], url="#2450")
rather than
mylink=Link(rect=[130,60,230,25], target_page_index=0)

@pubpub-zz
Copy link
Collaborator

external link works, the annotation uses "url" instead of "target_page_index" for an internal link such as mylink=Link(rect=[600,600,700,700], border=[0,0,1,[3,2]], url="#2450") rather than mylink=Link(rect=[130,60,230,25], target_page_index=0)

in PDF there is different links:
/Goto which are for link to pages
/URI which are links to URI/URL
but also:
/GotoR which are links to pages in a remote document
/GotoE which are links to pages in a document embedded in the document

It is there two links where page index are used with.

@ZupoLlask
Copy link

I think this PR may also fix issue #2346. Probably it may make sense to check if indeed that issue will also be solved.

@MartinThoma
Copy link
Member

I would assume that this is/was some intended functionality. Apparently it was considered a valid use case to generate annotations for invalid pages which might be added later on.

I'd be fine with removing it. The main question is if we need to go through the deprecation process (which would take quite long) or if we can simply say that it was a bug.

I don't see the use-case here, hence I'd say it is a bug - especially when there was this change of behavior in pypdf==2.9.0 (written by rsinger417 before; I didn't check).

Is there anything stopping people from adding the pages first?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion The PR/issue needs more discussion before we can continue on-hold PR requests that need clarification before they can be merged.A comment must give details
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pypdf creates invalid links with add_annotation since PyPDF2 2.9.0
5 participants