Disable non-text output formats by default #916

Open
Balearica opened this issue Apr 16, 2024 · 0 comments

By default, four output formats are produced: text, blocks, hocr, and tsv. It's safe to say that few, if any, users make use of more than one format. However, producing all four can significantly inflate runtime. This is especially true for blocks, which iterates over every symbol (and symbol choice) in the data and retrieves information about each one.

I recently encountered an image where creating the blocks output took 12 seconds, whereas recognition itself took just 10 seconds. While this is uncharacteristically long, it is unacceptable for a default option that few users benefit from to inflate runtime by more than 100% for any image. Even outside of this fringe case, testing on other documents shows that creating blocks often adds 0.25 to 0.50 seconds when scanning documents, which is a non-trivial increase.

I think it makes sense to leave text on by default, as presumably this is the most used and quickest to render, and some output format needs to be enabled by default. However, other formats should not be enabled unless the user actually wants them.

This is a breaking change, so it would need to wait until Tesseract.js v6. Restoring the previous behavior would simply be a matter of manually specifying the formats in the output argument to worker.recognize.
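As a rough sketch of what opting back in could look like, the snippet below builds the output argument explicitly and passes it as the third argument to worker.recognize. The helper names (`buildOutput`, `recognizeWithFormats`) and the `LEGACY_FORMATS` list are hypothetical and only cover the four formats named in this issue; the worker is assumed to be one already created by Tesseract.js.

```javascript
// The four formats this issue says are produced by default today.
const LEGACY_FORMATS = ['text', 'blocks', 'hocr', 'tsv'];

// Hypothetical helper: turn a list of format names into the `output`
// object, explicitly setting every legacy format to true or false.
function buildOutput(formats) {
  const output = {};
  for (const name of LEGACY_FORMATS) {
    output[name] = formats.includes(name);
  }
  return output;
}

// Hypothetical wrapper: `worker` is assumed to be a Tesseract.js worker.
// The second argument (recognition options) is left empty here.
async function recognizeWithFormats(worker, image, formats = LEGACY_FORMATS) {
  const { data } = await worker.recognize(image, {}, buildOutput(formats));
  return data;
}
```

Calling `recognizeWithFormats(worker, image)` would request all four formats, matching the current default, while `recognizeWithFormats(worker, image, ['text'])` would request text only, which approximates the proposed default on versions where everything is still enabled.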
