Microsoft Word table of contents Link annotation error. #2346

vokson · 2023-12-15T11:40:21Z

I am trying to use PdfReader and PdfWriter to read and write annotations in a PDF file. The PDF was produced by Microsoft Word (Save As PDF). The Word file has 3 simple pages with the headings Page 1, Page 2, Page 3 and an automatic table of contents built from these headings.
The links in the table of contents become Link annotations in the PDF file. An annotation looks like this:

{'/Subtype': '/Link', '/Rect': [82.8, 711.57, 554.55, 731.07], '/BS': {'/W': 0}, '/F': 4, '/Dest': [IndirectObject(3, 0, 1202232362752), '/XYZ', 82, 785, 0], '/StructParent': 3}

The problem is that the value of the '/Dest' key is a list, but your code in _writer.py always expects a dictionary. The program then tries to read tmp["target_page_index"] from the list and crashes with an error.

Please, help.

    if to_add.get("/Subtype") == "/Link" and "/Dest" in to_add:
        tmp = cast(Dict[Any, Any], to_add[NameObject("/Dest")])
        dest = Destination(
            NameObject("/LinkName"),
            tmp["target_page_index"],
            Fit(
                fit_type=tmp["fit"], fit_args=dict(tmp)["fit_args"]
            ),  # I have no clue why this dict-hack is necessary
        )
        to_add[NameObject("/Dest")] = dest.dest_array
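
The mismatch between the two shapes of /Dest can be reproduced without pypdf. The following is a minimal sketch that uses plain Python containers as stand-ins for the PDF objects; `split_dest` is a hypothetical helper written for illustration and is not part of pypdf, and the string "page-ref" merely replaces the IndirectObject from the annotation above. It assumes that the array form follows the layout seen in the report (page reference, fit type, then fit arguments).

```python
from typing import Any, List, Tuple


def split_dest(dest: Any) -> Tuple[Any, str, List[Any]]:
    """Return (target, fit_type, fit_args) for either shape of /Dest.

    Hypothetical helper for illustration only; not a pypdf function.
    """
    if isinstance(dest, dict):
        # Dictionary shape that the quoted _writer.py code expects.
        return dest["target_page_index"], dest["fit"], list(dest["fit_args"])
    if isinstance(dest, (list, tuple)) and len(dest) >= 2:
        # Array shape as written by Word: [page, fit type, arg, ...].
        return dest[0], dest[1], list(dest[2:])
    raise TypeError(f"unsupported /Dest value: {dest!r}")


# Stand-in for the array from the report; "page-ref" replaces the IndirectObject.
word_dest = ["page-ref", "/XYZ", 82, 785, 0]

# Indexing the array with a string key is what triggers the reported error.
try:
    word_dest["target_page_index"]
except TypeError as exc:
    error_message = str(exc)

# Checking the container type first avoids the crash.
target, fit_type, fit_args = split_dest(word_dest)
```

Checking the container type before indexing is only one possible way to tell the two shapes apart; whether pypdf should accept the array form at all is a separate question.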

Environment

$ python -m platform
Windows-10-10.0.19043-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.2, crypt_provider=('cryptography', '37.0.4'), PIL=9.4.0

Code + PDF

    from io import BytesIO

    from pypdf import PdfReader, PdfWriter

    # `filenames` is a list of input PDF paths supplied by the caller.
    annotations = {}
    writer = PdfWriter()
    in_memory_file = BytesIO()

    # Collect the raw annotation dictionaries of every input file, keyed by page index.
    for filename in filenames:
        reader = PdfReader(filename, strict=False)
        for page_idx, page in enumerate(reader.pages):
            if "/Annots" in page:
                for annot in page["/Annots"]:
                    annotations.setdefault(page_idx, []).append(annot.get_object())
        del reader

    # Copy the pages of the first file, then drop its link annotations.
    reader = PdfReader(filenames[0])
    for page in reader.pages:
        writer.add_page(page)

    del reader
    writer.remove_links()

    # Re-adding the raw dictionaries is the call that raises the TypeError.
    for page_idx, page_annots in annotations.items():
        for annot in page_annots:
            writer.add_annotation(page_number=page_idx, annotation=annot)

    writer.write(in_memory_file)

Test.docx
Test.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\entrypoints\..\logic\service_layer\message_bus.py", line 537, in handle_command
    result = handler(command, self._uow, self.handle)
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\entrypoints\..\logic\service_layer\command_handlers\command_service_handlers.py", line 929, in mix_pdf_files
    writer.add_annotation(page_number=page_idx, annotation=annot)
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\venv\lib\site-packages\pypdf\_writer.py", line 2803, in add_annotation
    tmp["target_page_index"],
TypeError: list indices must be integers or slices, not str

ZupoLlask · 2024-03-22T21:20:43Z

I think this issue may be related to #2443. Maybe PR #2450 will also fix this specific issue...

stefan6419846 · 2024-03-23T09:21:53Z

You could easily verify this yourself by applying the patch to a local copy of your code.

With a shorter version of the above code I get:

Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/pypdf_upstream/run.py", line 15, in <module>
    writer.add_annotation(page_number=page_idx, annotation=annot)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 2299, in add_annotation
    target_page = pages_obj[PA.KIDS][tmp["target_page_index"]]
TypeError: list indices must be integers or slices, not str

Thus it just fails earlier.

For reference, this is the shorter code:

from collections import defaultdict
from pypdf import PdfWriter


annotations = defaultdict(list)
writer = PdfWriter(clone_from="Test.pdf")
for page_idx, page in enumerate(writer.pages):
    if "/Annots" in page:
        for annot in page["/Annots"]:
            annotations[page_idx].append(annot.get_object())
writer.remove_links()

for page_idx in annotations:
    for annot in annotations[page_idx]:
        writer.add_annotation(page_number=page_idx, annotation=annot)

stefan6419846 · 2024-03-23T09:33:26Z

In this specific case, /Dest is an array where the first entry is an IndirectObject pointing to a Page object.
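
To make that layout concrete, here is a rough sketch of how a page index could be recovered from the first array entry. It uses stand-in objects only: `FakeRef` and `find_page_index` are hypothetical names introduced for illustration, and the only behaviour assumed of a reference is a `get_object()` method. With real pypdf objects, a reference read through a PdfReader may point into that reader rather than into the writer, so the lookup would need more care than shown here.

```python
from typing import Any, Optional, Sequence


class FakeRef:
    """Stand-in for an indirect reference; only get_object() is modelled."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def get_object(self) -> Any:
        return self._target


def find_page_index(page_ref: Any, pages: Sequence[Any]) -> Optional[int]:
    """Return the index of the page that page_ref points to, or None.

    Hypothetical helper for illustration only; not a pypdf function.
    """
    # Resolve the reference if it looks like one, otherwise use it as-is.
    target = page_ref.get_object() if hasattr(page_ref, "get_object") else page_ref
    for index, page in enumerate(pages):
        if page is target:
            return index
    return None


# Three stand-in pages and a destination array pointing at the third one.
pages = [{"name": "Page 1"}, {"name": "Page 2"}, {"name": "Page 3"}]
dest = [FakeRef(pages[2]), "/XYZ", 82, 785, 0]

target_page_index = find_page_index(dest[0], pages)
```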

pubpub-zz · 2024-04-02T19:14:14Z

@vokson
add_annotation expects, as its second argument, an annotation created from pypdf.annotations, as stated in the documentation:
https://pypdf.readthedocs.io/en/stable/user/adding-pdf-annotations.html

In your code you are using a DictionaryObject extracted from the /Annots array, which is not compatible.

I am closing this issue as not relevant. Feel free to clarify what you mean if you want it to be reopened.

ZupoLlask mentioned this issue Mar 22, 2024

BUG: Invalid Link #2450

Open

pubpub-zz closed this as completed Apr 2, 2024
