Support for Page Segmentation Mode for calling Tesseract OCR #3122

Open
stevesimmons opened this issue Feb 1, 2024 · 5 comments

Comments

@stevesimmons

stevesimmons commented Feb 1, 2024

Feature request

Could Tesseract-based OCR gain a user-settable parameter for the page segmentation mode (psm)?

This would be very useful because when source documents are forms, OCR recognizes the scattered pieces of text much better with psm 11 than the default psm 3.

It would be easiest with an optional parameter for psm in Page.get_textpage_ocr like this:

tp = page.get_textpage_ocr(dpi=300, full=True, psm=11, tessdata="...")

Benefit

Here's an example with a one-page form I tried it on.

PyMuPDF today extracts 483 characters using standard "full page" OCR. Calling Tesseract directly with psm 11 gets 703 characters, 40% more. The missing text makes a huge amount of difference!

import subprocess
import fitz  # PyMuPDF

doc = fitz.Document(stream=raw)  # raw holds the bytes of the source PDF

# Standard PyMuPDF OCR
page = doc[0]
tp = page.get_textpage_ocr(
    flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_PRESERVE_IMAGES,
    dpi=300, full=True, tessdata='/usr/share/tesseract-ocr/5/tessdata',
)
text = tp.extractTEXT()
print(len(text))                                # Default OCR on my sample doc got 483 characters

# Calling Tesseract directly, setting psm to 11 for disconnected text
pm = page.get_pixmap(dpi=300)
img = pm.tobytes('png')
rc = subprocess.run(
    "tesseract stdin stdout --psm 11 -l eng",
    input=img, stdout=subprocess.PIPE, shell=True,
)                                                 
text = rc.stdout.decode()
print(len(text))                               # OCR with psm=11 got 704 characters
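
Until a psm option exists upstream, the direct Tesseract call above could be wrapped in a small helper. This is only a sketch: build_tesseract_cmd and ocr_page_with_psm are hypothetical names, not part of PyMuPDF, and it assumes the tesseract executable is on PATH and that page is a PyMuPDF page object.

```python
import subprocess


def build_tesseract_cmd(psm=11, lang="eng"):
    """Return the argument list for a stdin -> stdout Tesseract run."""
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return ["tesseract", "stdin", "stdout", "--psm", str(psm), "-l", lang]


def ocr_page_with_psm(page, psm=11, lang="eng", dpi=300):
    """Render a PyMuPDF page to PNG and OCR it with the Tesseract CLI.

    Hypothetical helper: it bypasses get_textpage_ocr entirely, so the
    result is plain text rather than a TextPage.
    """
    png = page.get_pixmap(dpi=dpi).tobytes("png")
    result = subprocess.run(
        build_tesseract_cmd(psm, lang),
        input=png,
        stdout=subprocess.PIPE,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout.decode("utf-8")
```

Passing the command as a list avoids shell=True, so the psm and language values are never interpreted by a shell.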

Implementation notes

The new psm parameter would need to be passed to Pixmap.pdfocr_save(....). (L8369 in https://github.com/pymupdf/PyMuPDF/blob/056e3e43c8b99b6ec9657d7e4edb398f7826c03c/src_classic/fitz_old.i)

From there it would need to be handed on to MuPDF's pixmap.ocr_recognize (L231 in https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/fitz/tessocr.cpp).

I found an example of how the psm parameter is set in the Tesseract C++ API examples:
https://tesseract-ocr.github.io/tessdoc/APIExample.html

  PIX *image = pixRead(inputfile);
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init("/usr/src/tesseract/", "eng");
  api->SetPageSegMode(tesseract::PSM_AUTO_OSD); /* We'd need our input PSM here! */
  api->SetImage(image);
  api->Recognize(0);

Background on PSMs

A good writeup of the various page segmentation modes is here:
https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
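
For quick reference, the modes can be written down as a Python enum. This is a sketch for illustration only: the PSM class is a hypothetical helper, not something PyMuPDF or MuPDF provides. The member names follow Tesseract's PageSegMode constants with the PSM_ prefix dropped.

```python
from enum import IntEnum


class PSM(IntEnum):
    """Tesseract page segmentation modes, as accepted by --psm."""
    OSD_ONLY = 0                # orientation and script detection only
    AUTO_OSD = 1                # automatic segmentation with OSD
    AUTO_ONLY = 2               # automatic segmentation, no OSD or OCR
    AUTO = 3                    # fully automatic, no OSD (the default)
    SINGLE_COLUMN = 4           # one column of text of variable sizes
    SINGLE_BLOCK_VERT_TEXT = 5  # one uniform block of vertical text
    SINGLE_BLOCK = 6            # one uniform block of text
    SINGLE_LINE = 7             # a single text line
    SINGLE_WORD = 8             # a single word
    CIRCLE_WORD = 9             # a single word in a circle
    SINGLE_CHAR = 10            # a single character
    SPARSE_TEXT = 11            # as much text as possible, in no order
    SPARSE_TEXT_OSD = 12        # sparse text with OSD
    RAW_LINE = 13               # a single line, bypassing Tesseract hacks


# The two modes compared in this issue:
print(int(PSM.AUTO), int(PSM.SPARSE_TEXT))
```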

@stevesimmons
Author

stevesimmons commented Feb 1, 2024

Actually after experiments with more PSMs in Tesseract, I'm now less sure PSM makes a difference. I half suspect there's something between Tesseract and get_textpage_ocr that drops output.

To rule out PyMuPDF sending only part of my page to Tesseract, I ran get_textpage_ocr on the pixmap image from my comment above (a 300 dpi PNG, which I checked contains all the text) rather than on my original PDF (which, for the record, came from Microsoft Print To PDF). The result was the same as before: 40% of my text is missing.

Here's the code to run PyMuPDF OCR on the page image rather than the original PDF page:

doc2 = fitz.Document(stream=img)
page = doc2[0]
tp = page.get_textpage_ocr(
    dpi=300, full=True, language='eng', tessdata='/usr/share/tesseract-ocr/5/tessdata',
)
text = tp.extractTEXT()
text_block = text.replace('\n', '|').replace('||', '|')
print(f"PyMuPDF on page image with get_textpage_ocr at 300dpi: text length: {len(text)}\n{text_block}\n")
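
To see which pieces of text are dropped, rather than only comparing lengths, the two OCR outputs can be compared line by line. A minimal sketch, assuming both results are available as plain strings; normalized_lines and missing_lines are hypothetical helper names, and exact line matching is a rough heuristic because the two runs may split or spell lines differently.

```python
def normalized_lines(text):
    """Return the non-empty lines of text with runs of whitespace collapsed."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def missing_lines(reference, candidate):
    """Return lines found in the reference output but not in the candidate."""
    seen = set(normalized_lines(candidate))
    return [line for line in normalized_lines(reference) if line not in seen]


# Made-up strings standing in for the two OCR results:
direct_output = "Name:  John\n\nDate: 2024\nTotal: 5\n"
pymupdf_output = "Name: John\nTotal: 5\n"
print(missing_lines(direct_output, pymupdf_output))
```

Here the direct Tesseract output would be the reference and the get_textpage_ocr output the candidate.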

@stevesimmons changed the title from "Specify Page Segmentation Mode (PSM) when calling Tesseract for OCR" to "Page Segmentation Mode (PSM) for calling Tesseract OCR / or / possible issue in pixmap.get_textpage_ocr()" on Feb 1, 2024

@JorjMcKie
Collaborator

Interesting ideas for sure - thanks for submitting this!
As you wrote, given a general PDF page, PyMuPDF is quite flexible: it can OCR either the full page or only the images on it, while accepting any standard text on the page as is.
For a full-page OCR the DPI value makes sense, although only up to the inherent resolution of the image that represents the PDF page. In such a case (a scanned document), extracting that image and OCRing it probably delivers the best recognition rate possible, except potentially when using PSM.

At this point I should mention that it actually is our base library MuPDF that does the Tesseract communication. MuPDF would have to offer specifying that parameter and hand it through to Tesseract.
PyMuPDF is unable to do anything on its own here.

May I suggest discussing options directly with the MuPDF colleagues? They are just a click away at our sister MuPDF Discord channel.

@stevesimmons
Author

Stepping back from the detail, I'm surprised (in a way that I rarely am with PyMuPDF!) that get_textpage_ocr is missing big chunks of clear text from my PDF... I'll raise in the MuPDF discord as you suggest, and post the end result back here to close off the issue.

@JorjMcKie
Collaborator

> Stepping back from the detail, I'm surprised (in a way that I rarely am with PyMuPDF!) that get_textpage_ocr is missing big chunks of clear text from my PDF... I'll raise in the MuPDF discord as you suggest, and post the end result back here to close off the issue.

Well, that hurts. Can you let me have an example?

@JorjMcKie
Collaborator

Submitted enhancement request to the MuPDF team here.

@JorjMcKie changed the title from "Page Segmentation Mode (PSM) for calling Tesseract OCR / or / possible issue in pixmap.get_textpage_ocr()" to "Support for Page Segmentation Mode for calling Tesseract OCR" on Feb 11, 2024