While emplacing a pdf, it changes underlying page metadata incorrectly #565

Open
mikejokic opened this issue Feb 8, 2024 · 1 comment

mikejokic commented Feb 8, 2024

Hi! I am using the emplace function to swap pages because I would like to preserve references to them. I then run a string match on the output PDF and extract the bounding-box coordinates of the matched text.

I use the following for emplacing.

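from pikepdf import Pdf

pdf = Pdf.open('../tests/resources/fourpages.pdf')
congress = Pdf.open('../tests/resources/congress.pdf')
pdf.pages.append(congress.pages[0])  # Transfer page to new pdf
pdf.pages[2].emplace(pdf.pages[-1])  # Overwrite the third page in place
del pdf.pages[-1]  # Remove donor page
print(pdf.pages[2].objgen)  # Inspect the object ID of the emplaced page
pdf.save('output.pdf')  # Placeholder path; save() needs a target unless the input was opened for overwriting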

I then read the saved PDF with pdfplumber and search for the matching words, but their bounding-box coordinates are way off on the emplaced pages. I have to repair the PDF with Ghostscript to correct this.

import pdfplumber

pike = pdfplumber.open('path')
# x is the index of the emplaced page
pike.pages[x].search("value", regex=False, case=False, return_chars=False)

Since the emplace function seems to cause this downstream error, should I be retaining any additional keys with the retain argument, such as these?
Name.Parent, Name.Contents, Name.CropBox, Name.MediaBox, Name.Resources, Name.Rotate, Name.Type

If I simply copy the pages over one another, this error does not happen. So something within emplace causes this error.

Any help is appreciated @jbarlow83

@jbarlow83
Member

I tried this myself (screenshot not preserved) and the results seem fine to me, although it looks as if pdfplumber returns positions relative to the top-left corner rather than the bottom-left corner, as is conventional for PDF. Perhaps your example is different from your actual code.
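To illustrate the top-left versus bottom-left origin difference, here is a minimal sketch of converting a pdfplumber search match into PDF-native coordinates. The helper name `plumber_match_to_pdf_bbox` is hypothetical, and the sketch assumes the page is unrotated and its MediaBox starts at (0, 0); it relies only on the `x0`, `top`, `x1`, and `bottom` keys that pdfplumber reports for a match, plus the page height.

```python
def plumber_match_to_pdf_bbox(match, page_height):
    """Convert a top-left-origin match to a bottom-left-origin bounding box.

    match: dict with "x0", "top", "x1", "bottom" (distances from the left
           and top edges of the page, as pdfplumber reports them).
    page_height: height of the page in points (e.g. page.height).
    Assumes an unrotated page whose MediaBox origin is (0, 0).
    Returns (x0, y0, x1, y1) with y measured upward from the bottom edge.
    """
    y0 = page_height - match["bottom"]  # lower edge in PDF space
    y1 = page_height - match["top"]     # upper edge in PDF space
    return (match["x0"], y0, match["x1"], y1)


# Example with made-up numbers on a US Letter page (792 pt tall)
example = {"x0": 72.0, "top": 100.0, "x1": 144.0, "bottom": 112.0}
pdf_bbox = plumber_match_to_pdf_bbox(example, 792.0)
```

If the converted boxes line up with the text, the coordinates were correct and only the origin convention differed. If they are still off, the page's Rotate or a non-zero MediaBox origin would be the next things to check.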

OCRmyPDF uses emplace as the primary means of adding OCR text to PDFs, so if it were broken somehow, OCRmyPDF would be failing in most cases too.

If either PDF has structural markup, it won't be preserved by the emplace function, and migrating it is unfortunately quite complicated. QPDF doesn't do that yet, but its author intends to implement it, so this will have to wait for that.
