summary
This PR resolves #304 by adding a new function `run_and_get_multiple_output` that can take multiple extensions (output formats) and return them after one invocation of `tesseract`. This saves compute time when the user tries to get multiple outputs from one input, e.g.,
walkthrough
The main addition in this PR is the function `run_and_get_multiple_output`. It accepts a list of extensions like `['pdf', 'txt']`. Internally this function:
1. maps the extensions to a config (via `EXTENTION_TO_CONFIG`).
2. runs `tesseract` just once to generate all the files needed
3. reads and returns the output for each of the `extensions`
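The steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the mapping `SKETCH_EXTENSION_TO_CONFIG`, the helpers `build_command` and `output_paths`, and the choice to pass one `-c` flag per setting are all assumptions made for this example. It only assembles the command line and the expected output file names; it does not invoke `tesseract`.

```python
from typing import Dict, List

# Hypothetical mapping for illustration only; the PR's EXTENTION_TO_CONFIG
# may contain different keys and values.
SKETCH_EXTENSION_TO_CONFIG: Dict[str, str] = {
    "hocr": "tessedit_create_hocr=1",
    "tsv": "tessedit_create_tsv=1",
}


def build_command(
    input_filename: str, output_base: str, extensions: List[str]
) -> List[str]:
    """Assemble one tesseract command line covering every requested extension."""
    cmd = ["tesseract", input_filename, output_base]
    # Extensions with a known setting are requested through `-c VAR=VALUE`.
    for ext in extensions:
        setting = SKETCH_EXTENSION_TO_CONFIG.get(ext)
        if setting is not None:
            cmd += ["-c", setting]
    # The remaining extensions are appended as plain config names (assumption).
    cmd += [ext for ext in extensions if ext not in SKETCH_EXTENSION_TO_CONFIG]
    return cmd


def output_paths(output_base: str, extensions: List[str]) -> List[str]:
    """File names to read back, in the same order as `extensions`."""
    return [f"{output_base}.{ext}" for ext in extensions]
```

Under these assumptions, `build_command("page.png", "out", ["pdf", "hocr"])` yields a single command that requests both formats, and `output_paths` gives the files to read afterwards.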
Note that this PR only allows a subset of all supported extensions. This restricts the config to options that can be assembled into a single command. E.g., the extension `osd` requires a different command line param, `--psm` instead of `-c`, and is therefore not supported yet by this new function.
This PR refactors the function `run_tesseract` so it can handle multiple extensions: the key change is to filter out extensions that do not need to be appended to the command line arguments.
This PR also refactors the code that reads the output into a helper `_read_output` so it can be reused by both the new `run_and_get_multiple_output` and the existing `run_and_get_output`.
test
This PR adds a unit test that covers a few combinations of different extension lists. I'd encourage the reviewer to run the function locally with a simple example of
and compare its runtime to
The above example can be a common usage pattern for follow-up analysis on the OCR results.
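For reviewers who want to compare runtimes locally, a small timing helper along these lines may be handy. It is a generic sketch and not the example from this PR: `best_runtime` is a hypothetical helper, and the callables passed to it are left to the reviewer (for instance one wrapping two separate single-output calls and one wrapping a single `run_and_get_multiple_output` call on the same image).

```python
import time
from typing import Callable, List


def best_runtime(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the smallest wall-clock time in seconds over `repeats` calls of `fn`."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # the OCR call under test; its result is discarded
        timings.append(time.perf_counter() - start)
    return min(timings)


# Usage idea (commented out because it needs tesseract and an image on disk;
# `separate_calls` and `combined_call` are placeholders to define locally):
# print(best_runtime(separate_calls), best_runtime(combined_call))
```

Taking the minimum over a few repeats reduces noise from other processes; whether the combined call is faster, and by how much, depends on the image and the formats requested.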