test_image_to_string_with_image_type[jpeg2000] failure with tesseract >4.1.x #419

Open
polyzen opened this issue Mar 15, 2022 · 8 comments
@polyzen

polyzen commented Mar 15, 2022

pytesseract.pytesseract.TesseractError: (1, 'Error in pixReadStreamJp2k: function not present Error in pixReadStream: jp2: no pix returned Error in pixRead: pix not read Error during processing.')

pytesseract 0.3.10
tesseract 5.1.0
pillow 9.0.1
openjpeg2 2.4.0
pytest 7.1.0
python 3.10.2

@bozhodimitrov
Collaborator

This error is related to tesseract itself. Which version is that?
Also, is there a sample image that causes that error?

@polyzen
Author

polyzen commented Mar 15, 2022

Oh right: tesseract 5.1.0

The image used by the test: https://github.com/madmaze/pytesseract/blob/v0.3.10/tests/data/test.jpeg2000

@bozhodimitrov
Collaborator

bozhodimitrov commented Mar 15, 2022

Well, hmmm. CI on master passes, so not sure what is going on there.
PS: Yep, your tesseract version is newer than what CI uses; CI still runs 4.1.x.

At this point, I would check what changed in 5.1.0 that broke jpeg2000 support, because 4.x clearly works with jpeg2000.
It might be the imaging library support in Tesseract or something like that.

Have you tried using tesseract directly with the jpeg2000 image?
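
One way to run that direct check is to call the tesseract binary on the image, bypassing pytesseract. The sketch below is a minimal example, assuming tesseract is on PATH. The helper names and the image path in the usage comment are hypothetical; passing `stdout` as the output base asks tesseract to print the recognized text rather than write a file.

```python
import subprocess


def build_tesseract_cmd(image_path, tesseract_cmd="tesseract"):
    """Build the argument list for a direct tesseract run (hypothetical helper)."""
    # "stdout" as the output base prints the text instead of writing <base>.txt.
    return [tesseract_cmd, str(image_path), "stdout"]


def run_tesseract_directly(image_path, tesseract_cmd="tesseract"):
    """Run tesseract on one image and return (returncode, stdout, stderr)."""
    proc = subprocess.run(
        build_tesseract_cmd(image_path, tesseract_cmd),
        capture_output=True,
        text=True,
        check=False,  # keep the failure output for inspection instead of raising
    )
    return proc.returncode, proc.stdout, proc.stderr


# Usage (assumes a local copy of the test image at this hypothetical path):
# code, text, err = run_tesseract_directly("tests/data/test.jpeg2000")
# print(code, err or text)
```

If the direct call reports the same pixReadStreamJp2k error, the problem lies in tesseract or its imaging library rather than in pytesseract.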

@polyzen
Author

polyzen commented Mar 16, 2022

> Have you tried using tesseract directly with the jpeg2000 image?

I haven't used tesseract myself yet; I only build pytesseract to provide it as an optional dependency for urlwatch in the Arch repos.

@bozhodimitrov
Collaborator

At the moment, I don't have tesseract 5.1.0 or an Arch instance around to test whether this is a pytesseract-related or a tesseract-specific issue. When I have time, I will try to boot up a container with that setup to check.

@polyzen changed the title from "test_image_to_string_with_image_type[jpeg2000] failure in 0.3.10" to "test_image_to_string_with_image_type[jpeg2000] failure with tesseract >4.1.x" on Mar 16, 2022

@mandree

mandree commented Mar 30, 2022

Same issue here. I debugged it, and in my case the root cause was determined as follows:

  • tesseract 5.1.0 (on FreeBSD 13.0 amd64) failed to process a JPEG2000 file, because:
  • tesseract uses leptonica for reading images; and
  • leptonica was compiled without OPENJPEG option, omitting the libopenjp2 library

The remedy for me was to recompile leptonica with OpenJPEG 2.4.0 support.

However for py-pytesseract, it should skip the test if there are indications that tesseract does not support JPEG2000.
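
One possible indication is the error text itself. The sketch below is only an illustration of that idea, not something pytesseract ships: the helper name and the marker constant are hypothetical, and the marker string is copied from the error message in the opening comment of this issue.

```python
# Marker copied from the leptonica error reported at the top of this issue.
JP2K_MISSING_MARKER = "pixReadStreamJp2k: function not present"


def is_missing_jp2k_support(error):
    """Return True if an error (or its message) shows leptonica lacks JPEG2000."""
    return JP2K_MISSING_MARKER in str(error)


# Possible use inside the test (assumes pytest and pytesseract are importable):
# try:
#     text = pytesseract.image_to_string(path_to_jpeg2000_file)
# except pytesseract.TesseractError as exc:
#     if is_missing_jp2k_support(exc):
#         pytest.skip("leptonica was built without OpenJPEG support")
#     raise
```

This skips only after the failure has happened, so it does not need any way of querying tesseract up front.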

@bozhodimitrov
Collaborator

Thank you for investigating that @mandree - I am not sure if there is a nice way to ask tesseract if that is the case or not.
Sadly, pytesseract is designed as a thin wrapper around the tesseract executable and doesn't provide any deeper integration.

@mandree

mandree commented Mar 30, 2022

You can apparently query tesseract with -v or --version.
See the line right below leptonica: it mentions libopenjp2 (or not).

First two examples from FreeBSD 13.0 amd64, third and last example on Fedora 35 x86_64.

With JPEG2000 support:

$ tesseract -v 
tesseract 5.1.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37+apng : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found OpenMP 201811
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
 Found libcurl/7.82.0 OpenSSL/1.1.1k zlib/1.2.11 libssh2/1.10.0 nghttp2/1.46.0

And without:

$ tesseract -v 
tesseract 5.1.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37+apng : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found OpenMP 201811
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
 Found libcurl/7.82.0 OpenSSL/1.1.1k zlib/1.2.11 libssh2/1.10.0 nghttp2/1.46.0

Fedora Linux:

$ tesseract -v
tesseract 4.1.3
 leptonica-1.81.1
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
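
The check described above could be automated roughly as follows. This is a sketch under assumptions, not pytesseract's API: both helper names are hypothetical, the detection is a plain substring search for libopenjp2 in the version banner, and stdout and stderr are combined in case a build prints the banner to stderr.

```python
import subprocess


def leptonica_has_openjpeg(version_output):
    """Return True if a `tesseract --version` banner lists libopenjp2."""
    return "libopenjp2" in version_output


def tesseract_version_output(tesseract_cmd="tesseract"):
    """Return the text printed by `tesseract --version` (stdout plus stderr)."""
    proc = subprocess.run(
        [tesseract_cmd, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout + proc.stderr


# Possible use in the test suite (assumes pytest is available):
# requires_jp2k = pytest.mark.skipif(
#     not leptonica_has_openjpeg(tesseract_version_output()),
#     reason="leptonica was built without OpenJPEG support",
# )
```

Applied to the three banners above, the substring check would accept only the first one.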

archlinux-github pushed a commit to archlinux/svntogit-community that referenced this issue Mar 30, 2022
ref: madmaze/pytesseract#419 (comment)


git-svn-id: file:///srv/repos/svn-community/svn@1177882 9fca08f4-af9d-4005-b8df-a31f2cc04f65