allow multiple output #511

badGarnet · 2023-08-31T20:09:49Z

summary

This PR resolves #304 by adding a new function, run_and_get_multiple_output, that takes multiple extensions (output formats) and returns all of the corresponding outputs from a single invocation of tesseract. This saves compute time when the user needs several outputs from one input, e.g.,

text, pdf = run_and_get_multiple_output(image, extensions=['txt', 'pdf'])

walkthrough

The main addition in this PR is the function run_and_get_multiple_output. It accepts a list of extensions like ['pdf', 'txt']. Internally this function:

1. assembles the command line config arguments by mapping each extension to the config arguments it requires (stored as a constant in EXTENTION_TO_CONFIG);
2. invokes tesseract just once to generate all of the requested files;
3. loads the result for each extension and returns the results in the same order as the input extensions.
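The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the mapping contents and the helper names assemble_config and collect_outputs are assumptions made for the example, and the tesseract invocation itself is left out.

```python
from typing import Dict, List

# Hypothetical mapping for illustration only; the PR keeps its own
# mapping in the constant EXTENTION_TO_CONFIG.
EXAMPLE_EXTENSION_TO_CONFIG: Dict[str, str] = {
    'box': 'tessedit_create_boxfile=1',
    'tsv': 'tessedit_create_tsv=1',
    'xml': 'tessedit_create_alto=1',
}


def assemble_config(extensions: List[str]) -> str:
    """Build one config string covering every requested extension.

    Extensions without an entry in the mapping contribute nothing here.
    """
    parts = [
        EXAMPLE_EXTENSION_TO_CONFIG[ext]
        for ext in extensions
        if ext in EXAMPLE_EXTENSION_TO_CONFIG
    ]
    return ' '.join(f'-c {part}' for part in parts)


def collect_outputs(
    outputs_by_extension: Dict[str, str],
    extensions: List[str],
) -> List[str]:
    """Return the loaded results in the same order as the request."""
    return [outputs_by_extension[ext] for ext in extensions]
```

Returning results in request order is what lets the caller unpack them positionally, as in text, pdf = run_and_get_multiple_output(image, extensions=['txt', 'pdf']).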

Note that this PR only allows a subset of all supported extensions, limited to those whose config arguments can be combined in one command. E.g., the extension osd requires a different command line parameter (--psm instead of -c) and is therefore not yet supported by this new function.

This PR refactors the function run_tesseract so it can handle multiple extensions: the key change is to filter out extensions that do not need to be appended to the command line arguments.
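One way to picture the filtering step is the sketch below. It is an assumption-laden illustration rather than the refactored run_tesseract itself: the function name build_cmd_args, the space-separated extensions string, and the contents of the excluded set are all chosen for the example.

```python
from typing import List, Set

# Assumed for illustration: extensions whose output is enabled through
# config arguments and which therefore are not appended to the command.
NOT_APPENDED: Set[str] = {'box', 'osd', 'tsv', 'xml'}


def build_cmd_args(input_path: str, output_base: str, extensions: str) -> List[str]:
    """Build the tesseract argument list for several extensions.

    `extensions` is a space-separated string such as 'txt pdf tsv'.
    """
    cmd_args = ['tesseract', input_path, output_base]
    for extension in extensions.split():
        # Skip extensions that do not need to be appended.
        if extension not in NOT_APPENDED:
            cmd_args.append(extension)
    return cmd_args
```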

This PR also refactors the code that reads the output into a helper _read_output so it can be reused by both the new run_and_get_multiple_output and existing run_and_get_output.
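A shared reading helper of this kind might look like the sketch below. The name read_output, its parameters, and the UTF-8 default are assumptions for illustration; the actual _read_output in this PR may differ.

```python
from typing import Union


def read_output(
    filename: str,
    return_bytes: bool = False,
    encoding: str = 'utf-8',
) -> Union[str, bytes]:
    """Read one output file, either as raw bytes or as decoded text."""
    # Always open in binary mode so formats such as pdf are read unchanged.
    with open(filename, 'rb') as output_file:
        data = output_file.read()
    return data if return_bytes else data.decode(encoding)
```

Keeping the decode decision in one place means both the single-output and multiple-output paths treat binary and text formats the same way.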

test

This PR adds a unit test covering a few combinations of extension lists. I'd encourage the reviewer to run the function locally with a simple example such as

text, boxes = run_and_get_multiple_output(image, extensions=['txt', 'box'])

and compare its runtime to

text = image_to_string(image)
boxes = image_to_boxes(image)

The above example can be a common usage pattern for follow-up analysis of the OCR results.
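For reviewers who want to run that comparison, a small timing harness along these lines may help. The helper names timed and compare_runtimes are hypothetical, and compare_runtimes assumes pytesseract and the tesseract binary are installed; no particular speedup is implied.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Call fn once and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_runtimes(image: Any) -> Tuple[float, float]:
    """Time one combined call against two separate calls on one image."""
    # Imported lazily so the harness itself loads without pytesseract.
    import pytesseract

    _, combined = timed(
        lambda: pytesseract.run_and_get_multiple_output(
            image, extensions=['txt', 'box']
        )
    )
    _, separate = timed(
        lambda: (
            pytesseract.image_to_string(image),
            pytesseract.image_to_boxes(image),
        )
    )
    return combined, separate
```

A single measurement is noisy, so repeating the calls a few times and comparing medians gives a fairer picture.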


badGarnet and others added 8 commits August 31, 2023 14:55

allow multiple output

7667dc8

[pre-commit.ci] auto fixes from pre-commit.com hooks

ff5bcb3

for more information, see https://pre-commit.ci

tidy

07d2dc0

Merge branch 'yao/allow-multiple-output-formats-in-one-run' of github…

68acc1b

….com:badGarnet/pytesseract into yao/allow-multiple-output-formats-in-one-run

fix init

7be4cd1

[pre-commit.ci] auto fixes from pre-commit.com hooks

b749424

for more information, see https://pre-commit.ci

break up test to avoid racing condition

8c8e439

Merge branch 'yao/allow-multiple-output-formats-in-one-run' of github…

6e61257

….com:badGarnet/pytesseract into yao/allow-multiple-output-formats-in-one-run

badGarnet marked this pull request as draft September 1, 2023 01:18

badGarnet marked this pull request as ready for review September 1, 2023 01:27

badGarnet and others added 3 commits September 1, 2023 11:25

Update tests/pytesseract_test.py

3162b83

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

fix flakey pdf comparison

8aaf8c6

enhance readme

e43ffd6

bozhodimitrov merged commit 07da369 into madmaze:master Sep 7, 2023
5 checks passed
