Support for Page Segmentation Mode for calling Tesseract OCR #3122
Comments
Actually, after experiments with more PSMs in Tesseract, I'm now less sure PSM makes a difference. I half suspect there's something between Tesseract and [...]. I eliminated the chance that PyMuPDF was only sending part of my page to Tesseract by running PyMuPDF on the page image rather than the original PDF page. Here's the code:
import fitz  # PyMuPDF

# img is assumed to hold the rendered page image as bytes
doc2 = fitz.Document(stream=img)
page = doc2[0]
tp = page.get_textpage_ocr(
    dpi=300, full=True, language='eng', tessdata='/usr/share/tesseract-ocr/5/tessdata',
)
text = tp.extractTEXT()
# Collapse line breaks into single pipes so the output prints compactly
text_block = text.replace('\n', '|').replace('||', '|')
print(f"PyMuPDF on page image with get_textpage_ocr at 300dpi: text length: {len(text)}\n{text_block}\n")
Interesting ideas for sure - thanks for submitting this! At this point I should mention that it actually is our base library MuPDF that does the Tesseract communication. MuPDF would have to offer specifying that parameter and hand it through to Tesseract. May I suggest discussing options directly with the MuPDF colleagues? They are just a click away at our sister MuPDF Discord channel.
Stepping back from the detail, I'm surprised (in a way that I rarely am with PyMuPDF!) that get_textpage_ocr is missing big chunks of clear text from my PDF... I'll raise in the MuPDF discord as you suggest, and post the end result back here to close off the issue.
Well, that hurts. Can you let me have an example?
Submitted enhancement request to the MuPDF team here.
Feature request
Can OCR using Tesseract add a user-settable parameter for page segmentation mode (psm)?
This would be very useful because when source documents are forms, OCR recognizes the scattered pieces of text much better with psm 11 than the default psm 3.
It would be easiest with an optional psm parameter in Page.get_textpage_ocr.
Benefit
Here's an example with a one-page form I tried it on.
PyMuPDF today extracts 483 characters using standard "full page" OCR. Calling Tesseract directly with psm 11 gets 703 characters, about 45% more. The missing text makes a huge difference!
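To reproduce a comparison like this outside PyMuPDF, one option is to call the Tesseract command-line tool directly. The sketch below is only an illustration, not what PyMuPDF or MuPDF do internally: the helper names (build_tesseract_cmd, ocr_with_psm, percent_gain) are hypothetical, it assumes a Tesseract 4 or 5 executable named tesseract is on the PATH, and it assumes the page has already been rendered to an image file.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, psm=11, lang="eng"):
    """Build the argument list for a Tesseract CLI call that prints text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # "stdout" as the output base tells Tesseract to write the text to standard output.
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def ocr_with_psm(image_path, psm=11, lang="eng"):
    """Run Tesseract on an image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm=psm, lang=lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def percent_gain(baseline_chars, new_chars):
    """Relative increase in character count, as a percentage of the baseline."""
    return 100.0 * (new_chars - baseline_chars) / baseline_chars
```

With a rendered page saved as, say, page.png (a placeholder name), the lengths of ocr_with_psm("page.png", psm=3) and ocr_with_psm("page.png", psm=11) can be compared with percent_gain.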
Implementation notes
The new psm parameter would need to be passed to Pixmap.pdfocr_save(....) (L8369 in https://github.com/pymupdf/PyMuPDF/blob/056e3e43c8b99b6ec9657d7e4edb398f7826c03c/src_classic/fitz_old.i), and then to MuPDF's pixmap.ocr_recognize (L231 in https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/fitz/tessocr.cpp).
I found an example of how the psm parameter is set in the Tesseract C API docs:
https://tesseract-ocr.github.io/tessdoc/APIExample.html
Background on PSMs
A good writeup of the various page segmentation modes is here:
https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
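For quick reference, the numeric PSM values can be mirrored in a small Python enum. This is a sketch for illustration only: the member names follow Tesseract's PageSegMode constants with the PSM_ prefix dropped, while the PageSegMode class and the psm_cli_args helper here are hypothetical and not part of PyMuPDF or MuPDF.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Tesseract page segmentation modes, numbered as on the command line."""
    OSD_ONLY = 0                 # orientation and script detection only
    AUTO_OSD = 1                 # automatic segmentation with OSD
    AUTO_ONLY = 2                # automatic segmentation, no OSD or OCR
    AUTO = 3                     # fully automatic segmentation (the default)
    SINGLE_COLUMN = 4            # single column of text of variable sizes
    SINGLE_BLOCK_VERT_TEXT = 5   # single uniform block of vertical text
    SINGLE_BLOCK = 6             # single uniform block of text
    SINGLE_LINE = 7              # single text line
    SINGLE_WORD = 8              # single word
    CIRCLE_WORD = 9              # single word in a circle
    SINGLE_CHAR = 10             # single character
    SPARSE_TEXT = 11             # find as much text as possible, in no order
    SPARSE_TEXT_OSD = 12         # sparse text with OSD
    RAW_LINE = 13                # single line, bypassing Tesseract-specific hacks


def psm_cli_args(mode):
    """Validate a PSM value and return the matching CLI arguments."""
    validated = PageSegMode(mode)  # raises ValueError for unknown values
    return ["--psm", str(int(validated))]
```

For the form case described in this request, psm_cli_args(PageSegMode.SPARSE_TEXT) yields the arguments for psm 11, while PageSegMode.AUTO corresponds to the default psm 3.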