[BUG] Pytesseract doesn't properly support multiframe images (e.g. TIFF) #343

mnechita · 2021-04-19T10:16:39Z

Reproduce:

Grab a multiframe TIFF
Call pytesseract.image_to_osd
Output will be only for the first frame (page)

Whereas calling the tesseract process on the image will generate the correct output containing each page.

Source of the bug:
When calling save on the in-memory data, pillow requires the save_all=True parameter (pillow docs) to write every frame of a multiframe image to disk. pytesseract does not pass that parameter, so the saved image is truncated to the first frame.

pytesseract/pytesseract/pytesseract.py

Line 201 in 45fe798

image.save(input_file_name, format=image.format)

Possible solution
Check Image.n_frames before saving and set the save_all parameter accordingly.
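A minimal sketch of what that check could look like. The helper name save_input_image is hypothetical and is not part of pytesseract; it only assumes the image object exposes format, save() and (optionally) n_frames, as Pillow image objects do:

```python
def save_input_image(image, input_file_name):
    """Save `image` to disk, keeping every frame when it has more than one.

    Hypothetical helper for illustration; not pytesseract's actual code.
    """
    # n_frames may be missing on some image classes, so default to 1.
    frame_count = getattr(image, "n_frames", 1)

    save_kwargs = {"format": image.format}
    if frame_count > 1:
        # Without save_all=True, Pillow writes only the current frame.
        save_kwargs["save_all"] = True

    image.save(input_file_name, **save_kwargs)
    return save_kwargs
```

Single-frame images go through exactly the same save call as today, so only multiframe input changes behaviour.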

I can create a PR with the changes if the solution sounds good enough.


bozhodimitrov · 2021-04-19T17:23:58Z

Hi @mnechita and thank you for reporting this bug.
You are welcome to make a PR.

Possible workaround if you don't need the Image object: pytesseract also supports passing the image as a path (str).
For example:

pytesseract.image_to_osd('example/image/path.jpeg')

I believe that this should still work, because tesseract itself can handle multiframe images (or?)

PS: Also, we will need a multiframe test image (test-multiframe.tiff?) and a test for this case.

mnechita · 2021-04-20T10:11:06Z

Hey,

Thanks for the reply. Nice suggestion, that actually helps me get around this in the meantime.

When I pass the path to the image, image_to_osd returns the proper string, but with output_type set to dict the result contains only the last frame. This made me realise that the osd_to_dict function needs to change as well: it could return either a bigger dict with page_number as the primary key, or a list of the current per-page dicts. However, both approaches would break existing code that uses the library, because the structure changes, unless the function returns different dict structures per case (singlepage/multipage). What do you think?
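To make the "list of current dicts" option concrete, here is a rough sketch. The function name split_osd_pages is hypothetical, and it assumes that each page's block in the OSD text starts with a "Page number: N" line and that fields are "key: value" pairs; it keeps the raw labels and string values rather than guessing at the library's key names:

```python
from typing import Dict, List


def split_osd_pages(osd: str) -> List[Dict[str, str]]:
    """Split raw OSD text into one dict per page.

    Hypothetical sketch: assumes every page block begins with a
    "Page number" line, which is not confirmed for all tesseract versions.
    """
    pages: List[Dict[str, str]] = []
    for line in osd.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            # Skip blank lines and anything that is not "key: value".
            continue
        if key == "Page number" or not pages:
            # A new page block starts here.
            pages.append({})
        pages[-1][key] = value.strip()
    return pages
```

A single-page result would then be a one-element list, so callers could always index by page, at the cost of the structure change discussed above.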

To confirm, tesseract does support multiframe images. I've attached a sample OSD output generated from a 9-frame TIFF.
test_osd.txt

Will start working on a PR tonight after work.

AjjuSolanki · 2022-12-05T18:36:51Z

Hello,
I passed in one complete TIFF file. For almost all pages image_to_data runs fine and returns OCR output, but for a few pages it returns a blank dictionary.
It has no levels, blocks, or text, just blank output.
What could be the possible reason?
Any help or hint would be appreciated.

Thanks

mnechita changed the title ~~[BUG] Pytesseract doesn't properly support multiframes images (e.g. TIFF)~~ [BUG] Pytesseract doesn't properly support multiframe images (e.g. TIFF) Apr 19, 2021

bozhodimitrov added the Feature Request label May 28, 2021
