allow multiple output #511


badGarnet
Contributor

summary

This PR resolves #304 by adding a new function run_and_get_multiple_output that takes multiple extensions (output formats) and returns the corresponding outputs after a single invocation of tesseract. This saves compute time when the user needs multiple outputs from one input, e.g.,

text, pdf = run_and_get_multiple_output(image, extensions=['txt', 'pdf'])

walkthrough

The main addition in this PR is the function run_and_get_multiple_output. It accepts a list of extensions like ['pdf', 'txt']. Internally this function:

  1. assembles the command-line config arguments by mapping each extension to its required config arguments (stored in the constant EXTENTION_TO_CONFIG)
  2. invokes tesseract just once to generate all the requested files
  3. loads the result for each extension and returns the results in the same order as the input extensions
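The three steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the mapping entries, the names SKETCH_EXTENSION_TO_CONFIG, build_config and run_once_and_collect, and the injected run_ocr / read_result callables are all hypothetical stand-ins so the flow can run without tesseract installed.

```python
from typing import Callable, Dict, List

# Illustrative stand-in for the PR's EXTENTION_TO_CONFIG constant.
# The entries below are assumptions, not copied from the PR.
SKETCH_EXTENSION_TO_CONFIG: Dict[str, str] = {
    "txt": "-c tessedit_create_txt=1",
    "pdf": "-c tessedit_create_pdf=1",
    "hocr": "-c tessedit_create_hocr=1",
}


def build_config(extensions: List[str]) -> str:
    """Step 1: map each extension to its config arguments and join them."""
    unsupported = [ext for ext in extensions if ext not in SKETCH_EXTENSION_TO_CONFIG]
    if unsupported:
        # Extensions without a mapping entry are rejected up front.
        raise ValueError(f"unsupported extensions: {unsupported}")
    return " ".join(SKETCH_EXTENSION_TO_CONFIG[ext] for ext in extensions)


def run_once_and_collect(
    extensions: List[str],
    run_ocr: Callable[[str], None],
    read_result: Callable[[str], str],
) -> List[str]:
    """Steps 2 and 3: invoke the OCR engine once, then read one result per extension."""
    config = build_config(extensions)
    run_ocr(config)  # single invocation that produces every requested file
    # Results come back in the same order as the requested extensions.
    return [read_result(ext) for ext in extensions]
```

Passing the runner and reader in as callables is only a convenience for this sketch; it makes the "invoke once, read many" shape easy to check in isolation.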

Note that this PR supports only a subset of all extensions, limited to those whose config arguments can be combined in a single command. For example, the extension osd requires a different command-line parameter (--psm instead of -c) and is therefore not yet supported by this new function.

This PR refactors the function run_tesseract so it can handle multiple extensions: the key change is to filter out extensions that do not need to be appended to the command line arguments.
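A rough sketch of that filtering idea, under stated assumptions: the extensions are taken to arrive as one space-separated string, and the set CONFIG_ONLY_EXTENSIONS and the helper append_extensions are hypothetical names chosen for illustration, not the identifiers used in run_tesseract.

```python
from typing import List

# Assumed for illustration: extensions whose output is requested through
# other arguments, so they are not appended as trailing command-line arguments.
CONFIG_ONLY_EXTENSIONS = {"box", "osd", "tsv", "xml"}


def append_extensions(cmd_args: List[str], extension: str) -> List[str]:
    """Return a copy of cmd_args with only the extensions that belong on the command line."""
    result = list(cmd_args)
    for ext in extension.split():
        if ext not in CONFIG_ONLY_EXTENSIONS:
            result.append(ext)
    return result
```

For example, calling it with the string "txt box pdf" appends txt and pdf but skips box.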

This PR also refactors the code that reads the output into a helper _read_output so it can be reused by both the new run_and_get_multiple_output and existing run_and_get_output.
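As an illustration of what such a shared reader might look like, here is a minimal sketch. The name read_output_sketch, its return_bytes flag, and the UTF-8 default are assumptions made for this example; the actual _read_output in the PR may differ.

```python
from pathlib import Path
from typing import Union


def read_output_sketch(
    filename: str, return_bytes: bool = False, encoding: str = "utf-8"
) -> Union[str, bytes]:
    """Read one output file, as raw bytes (e.g. for pdf) or as decoded text."""
    data = Path(filename).read_bytes()
    return data if return_bytes else data.decode(encoding)
```

With a helper of this shape, both the single-output and multiple-output paths can call the same function once per generated file.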

test

This PR adds a unit test covering a few combinations of extension lists. I'd encourage the reviewer to run the function locally with a simple example such as

text, boxes = run_and_get_multiple_output(image, extensions=['txt', 'box'])

and compare its runtime to

text = image_to_string(image)
boxes = image_to_boxes(image)

The above example is likely a common usage pattern for follow-up analysis of the OCR results.
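For the runtime comparison, a small timing helper along these lines may be convenient. It is only a sketch: timed is a hypothetical helper written for this example, and it measures wall-clock time with time.perf_counter, keeping the best of a few repeats.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T], repeats: int = 3) -> Tuple[T, float]:
    """Call fn `repeats` times; return the last result and the best wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best
```

One could then wrap the single call in a lambda, e.g. timed(lambda: run_and_get_multiple_output(image, extensions=['txt', 'box'])), wrap the two separate calls in another, and compare the two timings.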

@badGarnet badGarnet marked this pull request as draft September 1, 2023 01:18
@badGarnet badGarnet marked this pull request as ready for review September 1, 2023 01:27
badGarnet and others added 3 commits September 1, 2023 11:25
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
@bozhodimitrov bozhodimitrov merged commit 07da369 into madmaze:master Sep 7, 2023
5 checks passed
Development

Successfully merging this pull request may close these issues.

Over Process when multi format out put on one image like get text and pdf results