While emplacing a pdf, it changes underlying page metadata incorrectly #565

mikejokic · 2024-02-08T07:06:06Z

Hi! I am using the emplace function to swap pages while preserving references to them. I then run a string match on the output pdf and want to extract the bounding box coordinates of the matched text.

I use the following for emplacing.

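from pikepdf import Pdf

pdf = Pdf.open('../tests/resources/fourpages.pdf')
congress = Pdf.open('../tests/resources/congress.pdf')
pdf.pages.append(congress.pages[0])  # Transfer page to new pdf
pdf.pages[2].emplace(pdf.pages[-1])  # Overwrite page 3 in place so references to it stay valid
del pdf.pages[-1]  # Remove donor page
print(pdf.pages[2].objgen)  # Check the (object, generation) ID of the emplaced page
pdf.save('output.pdf')  # Placeholder output path; the original snippet omitted it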

I then use pdfplumber to read the saved pdf and search for matching words, but the bounding box coordinates it returns are way off for the emplaced pages. I have to repair the pdf with ghostscript to correct this.

import pdfplumber

pike = pdfplumber.open('path')
# x is the index of the emplaced page
pike.pages[x].search("value", regex=False, case=False, return_chars=False)

Since the emplace function seems to cause this downstream error, should I be retaining any additional keys with the retain argument? For example:
Name.Parent, Name.Contents, Name.CropBox, Name.MediaBox, Name.Resources, Name.Rotate, Name.Type

If I simply copy the pages over one another, this error does not happen. So something within emplace causes this error.

Any help is appreciated @jbarlow83

jbarlow83 · 2024-02-09T01:16:04Z

I did the following

and the results seem fine to me, although it looks as if pdfplumber returns positions relative to the top-left corner rather than the bottom-left corner, as is conventional for PDF. Perhaps your example is different from your actual code.
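To illustrate the coordinate convention, here is a minimal sketch of flipping a top-left-origin box into bottom-left-origin PDF coordinates. The helper name to_pdf_coords, the sample match values, and the page height of 792 are made up for illustration. It also assumes the search result exposes x0, top, x1 and bottom keys, and that the page is unrotated with its origin at (0, 0).

```python
def to_pdf_coords(bbox, page_height):
    """Convert (x0, top, x1, bottom), measured from the top-left corner,
    into (x0, y0, x1, y1), measured from the bottom-left corner."""
    x0, top, x1, bottom = bbox
    # The vertical axis is flipped: distance from the top becomes
    # distance from the bottom, so the two vertical edges swap roles.
    return (x0, page_height - bottom, x1, page_height - top)


# Hypothetical match, shaped like one entry of a search result (assumed keys).
match = {"x0": 72.0, "top": 100.0, "x1": 200.0, "bottom": 112.0}
page_height = 792.0  # assumed US Letter page height in points

pdf_bbox = to_pdf_coords(
    (match["x0"], match["top"], match["x1"], match["bottom"]),
    page_height,
)
print(pdf_bbox)
```

If the flipped values line up with what another tool reports for the same text, the difference is only the coordinate convention and not something emplace changed.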

OCRmyPDF uses emplace as the primary means of adding OCR text to PDFs, i.e. if it were broken somehow, OCRmyPDF would be failing in most cases too.

If either PDF has structural markup, it won't be preserved by the emplace function, and migrating it is unfortunately quite complicated. QPDF doesn't do that yet, but the author intends to implement it, so it will have to wait for that.
