A PDF with dodgy (yet apparently valid per qpdf --check) structure is causing a crash #568

Open
fidothe opened this issue Mar 2, 2024 · 1 comment

fidothe commented Mar 2, 2024

I'm using Paperless-ngx to import a whole trove of archived PDFs, and the ocrmypdf process it uses is exploding in a few cases, with an error deep in pikepdf.

I've included the traceback here for context, but I'm guessing that the problem is that the PDF in question has a Root object with a Metadata member that is very definitely not an XMP object:

1 0 obj
<< /Metadata 19 0 R /Outlines 44 0 R /Pages 18 0 R /Type /Catalog>>
endobj
19 0 obj
<< /Ascent 770 /AvgWidth 441 /CapHeight 717 /Descent -230 /Flags 32 /FontBBox [ -951 -481 1445 1122 ] /FontFile2 20 0 R /FontName /KFCHCO+Helvetica /ItalicAngle 0 /MaxWidth 1500 /StemH 85 /StemV 98 /Type /FontDescriptor /XHeight 523>>
endobj

From the PDF trailer, object 1 0 is the Root: /Root 1 0 R
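For anyone wanting to check their own files for the same pattern, here is a rough, stdlib-only sketch of that inspection. It assumes the PDF has first been rewritten into an uncompressed, text-form dump like the excerpt above (for example with qpdf's QDF mode and object streams disabled), and the function name `inspect_metadata_target` is made up for illustration. It is a regex scan, not a real PDF parser, so treat the result as a hint only.

```python
import re

# Matches "N G obj ... endobj" blocks in an uncompressed, text-form PDF dump.
# Assumption: objects are not packed into object streams.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def inspect_metadata_target(dump: bytes) -> dict:
    """Report what the catalog's /Metadata entry points at (hypothetical helper)."""
    objects = {(int(n), int(g)): body for n, g, body in OBJ_RE.findall(dump)}
    for body in objects.values():
        # Only look at the document catalog (the Root object).
        if not re.search(rb"/Type\s*/Catalog\b", body):
            continue
        ref = re.search(rb"/Metadata\s+(\d+)\s+(\d+)\s+R", body)
        if ref is None:
            return {"found": False}
        key = (int(ref.group(1)), int(ref.group(2)))
        target = objects.get(key, b"")
        type_match = re.search(rb"/Type\s*/(\w+)", target)
        return {
            "found": True,
            "object": key,
            "type": type_match.group(1).decode() if type_match else None,
            # A stream object has "stream" right after its dictionary.
            "is_stream": re.search(rb">>\s*stream\b", target) is not None,
        }
    return {"found": False}
```

Run against the two objects quoted above, this should report that `/Metadata` points at object 19 0, whose `/Type` is `FontDescriptor` and which is a plain dictionary rather than a stream. That matches the `operation for stream attempted on object of type dictionary` error in the traceback.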

I'm not sure how to provide a sample PDF here, since the documents in question contain private data, and were generated by an old version of some PDF software a number of years ago...

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler
    return fn(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 191, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 141, in exec_concurrent
    pdf = ocrgraft.finalize()
          ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_graft.py", line 203, in finalize
    self.pdf_base.save(self.output_file)
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/_methods.py", line 316, in save
    self._save(
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/_cpphelpers.py", line 26, in update_xmp_pdfversion
    with pdf.open_metadata(set_pikepdf_as_editor=False, update_docinfo=False) as meta:
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/models/metadata.py", line 307, in wrapper
    self._load()
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/models/metadata.py", line 446, in _load
    data = self._pdf.Root.Metadata.read_bytes()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operation for stream attempted on object of type dictionary

@jbarlow83 (Member)

As frustrating as this is, the PDF is severely malformed - there's absolutely no support for a random font object being inserted at that position. The PDF generator went quite far off the rails of reasonableness, and I imagine this is only the first issue that came up. It's also not entirely out of line to raise a runtime error here - it's a totally unexpected situation, and deleting the metadata isn't necessarily the right thing to do.

My policy for this kind of error is to wait and see if lightning strikes twice. I assume that if another user comes forward with the same issue, then it may be widespread enough to warrant mitigation. I'd also accept evidence that, for example, a particular application consistently generates files with this problem and so there's a reasonable expectation of finding other PDFs with the same issue. Until then, for the sake of my sanity, I'll have to wait till more data is available.
