possibility to capture stderr #898

Closed
didiercolens opened this issue Feb 27, 2024 · 3 comments

@didiercolens

Thanks a lot for this great project. I played with it for the past week and ran into a few issues when recognising PNG/JPEG images:

Is your feature request related to a problem? Please describe.

  • when a PNG file with a bad CRC is recognised, tesseract.js prints
libpng error: IDAT: CRC error
missing function: setThrew
Aborted(-1)

to stderr. This is not caught by logger or errorLogger, and tesseract.js then throws a string: RuntimeError: Aborted(-1). Build with -sASSERTIONS for more info.

  • when a JPEG file cannot be read, tesseract.js prints
Invalid SOS parameters for sequential JPEG
Error in pixReadStreamJpeg: read error at scanline 0; nwarn = 1
Error in pixReadStreamJpeg: bad data
Error in pixReadStream: jpeg: no pix returned
Error in pixRead: pix not read
Image file /input cannot be read!

to stderr and then throws a string: Error: Error attempting to read image.

Describe the solution you'd like

  • instead of writing to stderr, write to errorHandler
  • return an error object with a more meaningful error
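
Until something like this exists in the library, the second point can be approximated on the caller's side. The following is only a minimal sketch: `toError` and `recognizeSafely` are hypothetical helper names, not part of tesseract.js, and `recognizeFn` stands in for something like `(image) => worker.recognize(image)`. It assumes the failure reaches the caller as a rejected promise, as described above.

```javascript
// Hypothetical helper (not part of tesseract.js): turn whatever was thrown
// into a real Error so callers can rely on .message and .stack.
function toError(thrown, context) {
  if (thrown instanceof Error) return thrown;
  const err = new Error(`${context}: ${String(thrown)}`);
  err.original = thrown; // keep the raw thrown value for debugging
  return err;
}

// Wrap any async recognition function and report the outcome as a plain
// object instead of letting a bare string propagate.
async function recognizeSafely(recognizeFn, image) {
  try {
    return { ok: true, result: await recognizeFn(image) };
  } catch (thrown) {
    return { ok: false, error: toError(thrown, 'recognition failed') };
  }
}
```

This does not make the message itself more informative; it only guarantees that callers always receive an Error object rather than a string.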

Describe alternatives you've considered

Additional context
To reproduce the PNG error, open a PNG image in a hex editor and modify a byte near the end of the file.
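
The same corruption can be scripted instead of done by hand. This is a rough sketch under stated assumptions: `corruptByteNearEnd` is a hypothetical helper, and the default offset of 16 is a guess based on the PNG layout (the last 12 bytes are the IEND chunk, so the 4 bytes before it are the CRC of the preceding chunk, which is often but not always the last IDAT). A different offset may be needed for a given file.

```javascript
// Return a copy of `png` with one byte inverted `offsetFromEnd` bytes before
// the end of the data. The default of 16 is an assumption, see the note above.
function corruptByteNearEnd(png, offsetFromEnd = 16) {
  if (offsetFromEnd < 1 || offsetFromEnd > png.length) {
    throw new RangeError('offsetFromEnd is outside the buffer');
  }
  const copy = Buffer.from(png); // copies the data; the input stays untouched
  const index = copy.length - offsetFromEnd;
  copy[index] ^= 0xff; // invert every bit so the byte is guaranteed to change
  return copy;
}
```

With Node's `fs` module this could be applied as `fs.writeFileSync('bad.png', corruptByteNearEnd(fs.readFileSync('good.png')))`, where both file names are placeholders.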

@Balearica
Collaborator

I replicated the .jpg result using the provided file, but was unable to replicate the .png result, so if you think this represents a distinct case please upload a sample image.

Regarding the .jpg image, it sounds like the core issue here is that there's a disconnect between the JavaScript exception thrown (and handled by errorHandler) and the messages printed to stderr, with the latter being more informative. While I would agree this is not ideal, I do not think changing this would be feasible.

Of the messages listed, the only one that is created within this repo (and the only one that is a JavaScript exception) is "Error attempting to read image." As can be seen in the code that throws this error (below), the only information we have to go on when creating this exception is an integer return code of 1 indicating the image was not read correctly, so the error message is as informative as it could be given that information.

if (res === 1) throw Error('Error attempting to read image.');

The other messages listed are printed to stderr by dependencies, and are not created by the code in this repo. For example, the Invalid SOS parameters message is printed by libjpeg and the Error in pixReadStreamJpeg errors are printed by leptonica.

We cannot change the fact that these dependencies send these messages to stderr, as we do not edit dependencies. Furthermore, we cannot send all stderr text to errorHandler, as errorHandler is for JavaScript exceptions, and not all messages printed to stderr will result in an exception (and vice versa). While many Tesseract.js exceptions are accompanied by meaningful messages printed to stderr by a dependency, this cannot be assumed as a rule. With these limitations in mind, I think that throwing the Error attempting to read image exception and having the stderr messages print to console is a reasonable behavior.
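
For callers who still want the stderr text next to the exception, one user-side option in Node is to hook `process.stderr.write` for the duration of a call. This is a sketch, not something tesseract.js provides: `withCapturedStderr` and its `passthrough` option are hypothetical names. It assumes the dependency messages actually pass through `process.stderr.write` in the calling thread, which this thread does not establish. Anything written directly to file descriptor 2 would bypass the hook.

```javascript
// Temporarily replace process.stderr.write, run `task`, and return whatever
// was written to stderr in the meantime together with the outcome.
async function withCapturedStderr(task, { passthrough = true } = {}) {
  const captured = [];
  const originalWrite = process.stderr.write;
  process.stderr.write = function (chunk, encoding, callback) {
    captured.push(typeof chunk === 'string' ? chunk : Buffer.from(chunk).toString('utf8'));
    // By default, still forward to the real stderr so nothing is lost.
    if (passthrough) return originalWrite.apply(process.stderr, arguments);
    const done = typeof encoding === 'function' ? encoding : callback;
    if (typeof done === 'function') done();
    return true;
  };
  try {
    const result = await task();
    return { ok: true, result, stderr: captured.join('') };
  } catch (thrown) {
    return { ok: false, error: thrown, stderr: captured.join('') };
  } finally {
    process.stderr.write = originalWrite; // always restore the original
  }
}
```

The hook is process-wide, so with several jobs running concurrently the captured text cannot be attributed to one job. This is the same limitation described above: stderr output and exceptions do not map onto each other one to one.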

@didiercolens
Author

Thanks for looking into this. By the way, I'm running this with a scheduler and 6 workers under a 4 GB RAM limit. After a while the container is killed by the kernel because it consumes too much RAM. I did not investigate, but it looks like a leak. I'll see if I can create a repro for you when I have time!

@Balearica
Collaborator

@didiercolens Okay, sounds good. I opened a new Git Issue (#900) to describe worker memory increasing due to large images, which is one cause of worker memory usage increasing over time. This may or may not be related to what you are experiencing. I am not currently aware of any leaks; however, there have been memory leaks in past versions, so it is possible.
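
A repro for the memory question could start by recording memory after each job. The sketch below is only an illustration: `sampleMemory` is a hypothetical helper, and `job` stands in for whatever submits one recognition, for example `() => scheduler.addJob('recognize', image)`. It measures the resident set size of the whole Node process, which is an assumption about what is relevant here. Steady growth in the samples would be consistent with a leak but would not prove one, since large images can also raise worker memory.

```javascript
// Run `job` `iterations` times, one after another, and record the process
// resident set size (in bytes) after each run.
async function sampleMemory(job, iterations) {
  const samples = [];
  for (let i = 0; i < iterations; i += 1) {
    await job(i);
    samples.push({ iteration: i, rssBytes: process.memoryUsage().rss });
  }
  return samples;
}
```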
