Support for .heic file format #363

Open
datatalking opened this issue Jun 12, 2021 · 5 comments

@datatalking

What would be the process for extending pytesseract so that it can work with .heic files?

Essentially we are batch renaming the files to .jpg, but the renamed files are not always accepted.

I read through a bit of https://tesseract-ocr.github.io and did not see references to this file type for lossless files.

This is for a public works project using archived public documents going back over 200 years, so having access to this file type would not only be a public service but would also save taxpayers money.
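Renaming a file changes only its extension, not the encoded bytes, which is one likely reason the renamed files are not always accepted. A minimal sketch of checking what a file really contains, assuming the usual signatures (JPEG data starts with `FF D8 FF`; HEIF-family files carry an `ftyp` box at byte offset 4 followed by a brand code). The helper names and the brand list are illustrative and not exhaustive:

```python
# Brand codes commonly seen in HEIC/HEIF files (illustrative, not exhaustive).
HEIF_BRANDS = {b"heic", b"heix", b"hevc", b"hevx", b"mif1", b"msf1"}


def sniff_image_type(header: bytes) -> str:
    """Classify the first bytes of a file as 'jpeg', 'heif', or 'unknown'."""
    if header[:3] == b"\xff\xd8\xff":
        return "jpeg"
    # ISO base media files: 4-byte box size, then b"ftyp", then the major brand.
    if len(header) >= 12 and header[4:8] == b"ftyp" and header[8:12] in HEIF_BRANDS:
        return "heif"
    return "unknown"


def sniff_file(path) -> str:
    """Read only the first 12 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_image_type(fh.read(12))
```

Running `sniff_file` over the renamed batch would show which `.jpg` files still hold HEIF data and need a real conversion.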

@bozhodimitrov
Collaborator

Hi @datatalking,
I'm not sure whether Tesseract itself supports this file format.
But it looks like there are lots of options suggested on StackOverflow: How to work with HEIC image file types in Python.
It is a very interesting and exotic image format, by the way. My assumption is that pre-processing will be required anyway.

@datatalking
Author

Hi @int3l,
For argument's sake where might we patch/add the feature into pytesseract?

@bozhodimitrov
Collaborator

One possibility is to change the prepare function. But that would require converting this format into one that Tesseract itself supports.
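The conversion step can also be done before pytesseract is called, without patching it. A minimal sketch, assuming the third-party packages Pillow, pillow-heif, and pytesseract are installed; the function names `png_target`, `convert_heic_to_png`, and `ocr_heic` are hypothetical and not part of pytesseract:

```python
from pathlib import Path


def png_target(src, dst_dir) -> Path:
    """Build the output path: same file stem, .png extension, inside dst_dir."""
    return Path(dst_dir) / (Path(src).stem + ".png")


def convert_heic_to_png(src, dst_dir) -> Path:
    """Decode one HEIC file and save it as PNG (lossless from here on)."""
    # Assumption: Pillow and pillow-heif are available in the environment.
    from PIL import Image
    from pillow_heif import register_heif_opener

    register_heif_opener()  # lets Image.open() read HEIF/HEIC files
    dst = png_target(src, dst_dir)
    with Image.open(src) as img:
        img.save(dst, format="PNG")
    return dst


def ocr_heic(src) -> str:
    """Decode a HEIC file in memory and hand the PIL image to pytesseract."""
    from PIL import Image
    from pillow_heif import register_heif_opener
    import pytesseract

    register_heif_opener()
    with Image.open(src) as img:
        return pytesseract.image_to_string(img.convert("RGB"))
```

For a batch of thousands of files, `convert_heic_to_png` could be mapped over `Path(folder).glob("*.heic")`. Writing PNG rather than JPEG avoids a second lossy compression pass on the scans.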

@datatalking
Author

@int3l What are the differences between the formats? As it stands now, I have to convert thousands of them to jpg or other formats, so I am curious which formats Tesseract supports.

@bozhodimitrov
Collaborator

Tesseract itself uses the Leptonica image library. Here is an unofficial list of supported formats: Leptonica Image I/O formats
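As a rough illustration, a batch job could check file extensions against the formats usually listed as readable through Leptonica (PNG, JPEG, TIFF, BMP, PNM, GIF, WebP, JPEG 2000) and convert only the rest. The extension set below is an assumption; actual support depends on how the local Leptonica library was built, and `needs_conversion` is a hypothetical helper:

```python
from pathlib import Path

# Assumed extensions for formats Leptonica can usually read; build-dependent.
LEPTONICA_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp",
    ".gif", ".webp", ".jp2", ".pnm", ".pbm", ".pgm", ".ppm",
}


def needs_conversion(path) -> bool:
    """Return True when the file extension is outside the assumed supported set."""
    return Path(path).suffix.lower() not in LEPTONICA_EXTENSIONS
```

This checks the extension only, so a HEIC file that was renamed to `.jpg` would still pass the check and fail later.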
