I'm using Paperless-ngx to import a whole trove of archived PDFs, and the ocrmypdf process it uses is exploding in a few cases, with an error deep in pikepdf.
I've included the traceback here for context, but I'm guessing that the problem is that the PDF in question has a Root object with a Metadata member that is very definitely not an XMP object:
1 0 obj
<< /Metadata 19 0 R /Outlines 44 0 R /Pages 18 0 R /Type /Catalog>>
endobj
From the PDF trailer, object 1 0 is the Root: /Root 1 0 R
I'm not sure how to provide a sample PDF here, since the documents in question contain private data, and were generated by an old version of some PDF software a number of years ago...
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler
    return fn(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 191, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 141, in exec_concurrent
    pdf = ocrgraft.finalize()
          ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.1.1/libexec/lib/python3.12/site-packages/ocrmypdf/_graft.py", line 203, in finalize
    self.pdf_base.save(self.output_file)
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/_methods.py", line 316, in save
    self._save(
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/_cpphelpers.py", line 26, in update_xmp_pdfversion
    with pdf.open_metadata(set_pikepdf_as_editor=False, update_docinfo=False) as meta:
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/models/metadata.py", line 307, in wrapper
    self._load()
  File "/opt/homebrew/opt/img2pdf/libexec/lib/python3.12/site-packages/pikepdf/models/metadata.py", line 446, in _load
    data = self._pdf.Root.Metadata.read_bytes()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operation for stream attempted on object of type dictionary
As frustrating as this is, the PDF is severely malformed: there is absolutely no support for a random font object being inserted at that position. The PDF generator went quite far off the rails of reasonableness, and I imagine this is only the first issue that came up. It's also not entirely out of line to raise a runtime error here: it's a totally unexpected situation, and deleting the metadata isn't necessarily the right thing to do.
My policy for this kind of error is to wait and see if lightning strikes twice. I assume that if another user comes forward with the same issue, then it may be widespread enough to warrant mitigation. I'd also accept evidence that, for example, a particular application consistently generates files with this problem and so there's a reasonable expectation of finding other PDFs with the same issue. Until then, for the sake of my sanity, I'll have to wait till more data is available.