[BUG] Pytesseract doesn't properly support multiframe images (e.g. TIFF) #343

Open
mnechita opened this issue Apr 19, 2021 · 3 comments

@mnechita

Reproduce:

  1. Grab a multiframe TIFF
  2. Call pytesseract.image_to_osd
  3. Output will be only for the first frame (page)

By contrast, running the tesseract binary directly on the same image produces the correct output, with an entry for each page.

Source of the bug:
When pytesseract saves the in-memory image to disk, Pillow requires the save_all=True parameter (see the Pillow docs) to write every frame of a multiframe image. The parameter is not specified, so the image is truncated to its first frame:

image.save(input_file_name, format=image.format)

Possible solution
Check Image.n_frames before saving and set the save_all parameter accordingly
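
A minimal sketch of that check is below. The helper name `save_all_frames` is hypothetical (it is not part of pytesseract), and it assumes `image` is a Pillow `Image` whose format has a multiframe writer, as TIFF does. Single-frame images are saved exactly as before, so formats without multiframe support should be unaffected.

```python
def save_all_frames(image, path):
    """Save a Pillow image to `path`, keeping every frame if it has several.

    Hypothetical helper for illustration; not pytesseract's actual code.
    """
    # Some image objects may not expose n_frames, so fall back to 1.
    n_frames = getattr(image, "n_frames", 1)

    # Pass save_all only for multiframe images; single-frame images are
    # saved with the same arguments as the current call.
    extra = {"save_all": True} if n_frames > 1 else {}
    image.save(path, format=image.format, **extra)
    return n_frames
```

With Pillow installed, something like `save_all_frames(Image.open("scan.tiff"), "copy.tiff")` should write all pages instead of only the first; the file names here are placeholders.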

I can create a PR with the changes if the solution sounds good enough.

@mnechita changed the title from "[BUG] Pytesseract doesn't properly support multiframes images (e.g. TIFF)" to "[BUG] Pytesseract doesn't properly support multiframe images (e.g. TIFF)" on Apr 19, 2021
@bozhodimitrov
Collaborator

bozhodimitrov commented Apr 19, 2021

Hi @mnechita and thank you for reporting this bug.
You are welcome to make a PR.

A possible workaround, if you don't need the Image object: pytesseract also supports passing the image as a path (str).
For example:

pytesseract.image_to_osd('example/image/path.jpeg')

I believe this should still work, because tesseract itself can handle multiframe images (or am I wrong?).

PS: Also, we will need a multiframe test image (test-multiframe.tiff?) and a test for this case.

@mnechita
Author

Hey,

Thanks for the reply. Nice suggestion, that actually helps me get around this in the meantime.

When I pass the path to the image, image_to_osd returns the proper string, but with output_type set to dict the result only contains the last frame. This made me realise the osd_to_dict function needs to change as well: either a bigger dict with page_number as the primary key, or a list of the current dicts. However, both approaches change the return structure and so break existing code that uses the library, unless the function returns a different structure per case (single-page/multipage). What do you think?
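
To make the "list of current dicts" option concrete, here is a minimal sketch. The function name `osd_to_dicts` and the `OSD_FIELDS` table are hypothetical, the key names are only assumed to mirror the current single-page dict, and the parser assumes each page block in tesseract's OSD text starts with a "Page number" line.

```python
# Hypothetical mapping from OSD labels to (dict key, type) pairs.
# The key names are an assumption made for this illustration.
OSD_FIELDS = {
    "Page number": ("page_num", int),
    "Orientation in degrees": ("orientation", int),
    "Rotate": ("rotate", int),
    "Orientation confidence": ("orientation_conf", float),
    "Script": ("script", str),
    "Script confidence": ("script_conf", float),
}


def osd_to_dicts(osd):
    """Parse OSD text into a list with one dict per page (sketch only)."""
    pages = []
    for line in osd.splitlines():
        label, sep, value = line.partition(": ")
        if not sep or label not in OSD_FIELDS:
            continue  # skip blank or unrecognised lines
        key, cast = OSD_FIELDS[label]
        # Assumption: a "Page number" line opens a new page block.
        if key == "page_num" or not pages:
            pages.append({})
        pages[-1][key] = cast(value.strip())
    return pages
```

Always returning a list changes the return type for existing callers. A backward-compatible variant could return `pages[0]` when there is only one page, which is the per-case option mentioned above, at the cost of a return type that depends on the input.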

To confirm: tesseract does support multiframe images. I've attached a sample OSD output generated from a 9-frame TIFF.
test_osd.txt

Will start working on a PR tonight after work.

@AjjuSolanki

Hello,
I passed one complete TIFF file to image_to_data. For almost all pages it runs fine and returns the OCR results, but for a few pages image_to_data returns a blank dictionary.
It has no levels, blocks, or text; the output is completely blank.
What could be the possible reason?
Any help or hint will be appreciated.

Thanks
