Microsoft Word table of contents Link annotation error. #2346

Closed
vokson opened this issue Dec 15, 2023 · 4 comments

@vokson

vokson commented Dec 15, 2023

I am trying to use PdfReader and PdfWriter to read and write annotations in a PDF file. The PDF was produced by Microsoft Word via Save As PDF. The Word file has 3 simple pages with the headings Page 1, Page 2, and Page 3, plus an automatic table of contents generated from these headings.
The links in the table of contents become Link annotations in the PDF file. One such annotation looks like this:

{'/Subtype': '/Link', '/Rect': [82.8, 711.57, 554.55, 731.07], '/BS': {'/W': 0}, '/F': 4, '/Dest': [IndirectObject(3, 0, 1202232362752), '/XYZ', 82, 785, 0], '/StructParent': 3}

The problem is that the value of the '/Dest' key is a list, but the code in _writer.py always expects a dictionary. The program then tries to read tmp["target_page_index"] from the list and crashes with an error.

Please help.

    if to_add.get("/Subtype") == "/Link" and "/Dest" in to_add:
        tmp = cast(Dict[Any, Any], to_add[NameObject("/Dest")])
        dest = Destination(
            NameObject("/LinkName"),
            tmp["target_page_index"],
            Fit(
                fit_type=tmp["fit"], fit_args=dict(tmp)["fit_args"]
            ),  # I have no clue why this dict-hack is necessary
        )
        to_add[NameObject("/Dest")] = dest.dest_array
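
The mismatch described above can be reproduced without pypdf. The following is only a minimal sketch in plain Python: the list `word_dest` is a stand-in for the '/Dest' value from the report (the string "page-ref" replaces the IndirectObject), and `dest_kind` is a hypothetical helper, not part of pypdf.

```python
from typing import Any


def dest_kind(dest: Any) -> str:
    """Classify a /Dest value by its Python shape (hypothetical helper)."""
    if isinstance(dest, dict):
        return "dict"
    if isinstance(dest, (list, tuple)):
        return "array"
    return "other"


# Stand-in for the /Dest value from the report; "page-ref" replaces
# the IndirectObject that points to the target page.
word_dest = ["page-ref", "/XYZ", 82, 785, 0]

try:
    # Same kind of lookup as tmp["target_page_index"] in the quoted code.
    word_dest["target_page_index"]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(dest_kind(word_dest))
print(error_message)
```

Checking the shape first, as `dest_kind` does, is one possible way to tell the two forms apart before indexing.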

Environment

$ python -m platform
Windows-10-10.0.19043-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.2, crypt_provider=('cryptography', '37.0.4'), PIL=9.4.0

Code + PDF

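    from io import BytesIO

    from pypdf import PdfReader, PdfWriter

    # Assumption: the original list of input files is not shown;
    # the attached Test.pdf is used here.
    filenames = ["Test.pdf"]

    annotations = {}
    writer = PdfWriter()
    in_memory_file = BytesIO()

    # Collect the raw annotation dictionaries of every file, keyed by page index.
    for filename in filenames:
        reader = PdfReader(filename, strict=False)
        for page_idx, page in enumerate(reader.pages):
            if "/Annots" in page:
                for annot in page["/Annots"]:
                    annotations.setdefault(page_idx, []).append(annot.get_object())
        del reader

    # Copy the pages of the first file into the writer.
    reader = PdfReader(filenames[0])
    for page in reader.pages:
        writer.add_page(page)

    del reader
    writer.remove_links()

    # Re-add the collected annotations; this is the call that raises the TypeError.
    for page_idx in annotations:
        for annot in annotations[page_idx]:
            writer.add_annotation(page_number=page_idx, annotation=annot)

    writer.write(in_memory_file)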

Test.docx
Test.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\entrypoints\..\logic\service_layer\message_bus.py", line 537, in handle_command
    result = handler(command, self._uow, self.handle)
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\entrypoints\..\logic\service_layer\command_handlers\command_service_handlers.py", line 929, in mix_pdf_files
    writer.add_annotation(page_number=page_idx, annotation=annot)
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\venv\lib\site-packages\pypdf\_writer.py", line 2803, in add_annotation
    tmp["target_page_index"],
TypeError: list indices must be integers or slices, not str

@ZupoLlask

I think this issue may be related to #2443. Maybe PR #2450 will also fix this specific issue...

@stefan6419846
Collaborator

You could easily verify this yourself by applying the patch to a local copy of your code.

With a shorter version of the above code, I get:

Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/pypdf_upstream/run.py", line 15, in <module>
    writer.add_annotation(page_number=page_idx, annotation=annot)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 2299, in add_annotation
    target_page = pages_obj[PA.KIDS][tmp["target_page_index"]]
TypeError: list indices must be integers or slices, not str

Thus it just fails earlier.

For reference, this is the shorter code:

from collections import defaultdict
from pypdf import PdfWriter


annotations = defaultdict(list)
writer = PdfWriter(clone_from="Test.pdf")
for page_idx, page in enumerate(writer.pages):
    if "/Annots" in page:
        for annot in page["/Annots"]:
            annotations[page_idx].append(annot.get_object())
writer.remove_links()

for page_idx in annotations:
    for annot in annotations[page_idx]:
        writer.add_annotation(page_number=page_idx, annotation=annot)

@stefan6419846
Collaborator

In this specific case, /Dest is an array where the first entry is an IndirectObject pointing to a Page object.
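
As a rough illustration of that array layout, the sketch below splits such a value into the three keys that the quoted _writer.py code reads (target_page_index, fit, fit_args). It uses plain Python only: `Ref` is a hypothetical stand-in for an indirect reference, `split_dest_array` is a hypothetical helper, and the list of page object ids is assumed, since the report only shows the id of the target page.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference (object id only)."""

    idnum: int


def split_dest_array(dest: List[Any], page_ids: List[int]) -> Dict[str, Any]:
    """Split [page_ref, fit_type, *fit_args] into the keys the quoted code reads."""
    page_ref, fit_type, *fit_args = dest
    return {
        # Position of the referenced page among the document's pages.
        "target_page_index": page_ids.index(page_ref.idnum),
        "fit": fit_type,
        "fit_args": fit_args,
    }


# The destination values come from the annotation in the report;
# the page object ids are assumed for illustration.
page_ids = [3, 10, 17]
dest = [Ref(3), "/XYZ", 82, 785, 0]
parts = split_dest_array(dest, page_ids)
print(parts)
```

With real pypdf objects, the page lookup would have to go through the document's page tree rather than a flat list of ids, so this is a model of the data layout, not a drop-in fix.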

@pubpub-zz
Collaborator

pubpub-zz commented Apr 2, 2024

@vokson
add_annotation expects its second argument to be an annotation created with pypdf.annotations, as stated in the documentation:
https://pypdf.readthedocs.io/en/stable/user/adding-pdf-annotations.html

In your code you are passing a DictionaryObject extracted from the /Annots array, which is not compatible.

I am closing this issue as not relevant. Feel free to clarify what you mean if you want it to be reopened.
