Multiple output formats #159

ELToulemonde · 2018-11-06T15:52:43Z

Hi,

Tesseract feature
Tesseract can produce multiple output formats from a single call, for example:

tesseract yourimage.tif out pdf tsv

This generates both out.pdf and out.tsv, so a single run yields OCR results in a format Python can read as well as a searchable PDF.

Producing both formats at once is worthwhile because, in my experience, it is twice as fast as generating them separately. I believe this is because the OCR computation is not repeated.
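For reference, the single call above can be reproduced from Python with the standard library alone. This is a minimal sketch, not pytesseract code: `build_tesseract_command` and `run_tesseract_once` are hypothetical helpers, and actually running the OCR assumes the `tesseract` binary is on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, output_base, formats):
    """Build one tesseract invocation that requests every format at once.

    Hypothetical helper for illustration; it is not part of pytesseract.
    """
    if not formats:
        raise ValueError("at least one output format is required")
    # Formats are passed as trailing arguments: tesseract in.tif out pdf tsv
    return ["tesseract", image_path, output_base, *formats]


def run_tesseract_once(image_path, output_base, formats):
    """Run OCR a single time; for pdf and tsv this writes out.pdf and out.tsv."""
    command = build_tesseract_command(image_path, output_base, formats)
    subprocess.run(command, check=True)


print(build_tesseract_command("yourimage.tif", "out", ["pdf", "tsv"]))
```

Keeping the command construction separate from the `subprocess.run` call makes the argument order easy to check without having tesseract installed.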

Not possible with pytesseract
However, this feature cannot be used through pytesseract, since it only exposes specific functions (one for each task).

Potential solutions

1. Edit the function image_to_pdf_or_hocr so that it accepts an extension such as "pdf tsv", which means modifying:

 if extension not in ['pdf', 'hocr']:
     extension = 'txt'

2. Expose the function run_tesseract and (with some precautions on extension) run_and_get_output.
3. Create a new dedicated function that handles a list of formats.
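As a rough illustration of the first option (extending `image_to_pdf_or_hocr`), the extension check could split a space-separated string and validate each entry. The helper name `normalize_extensions` and the set of accepted extensions are assumptions made for this sketch, not existing pytesseract code.

```python
# Assumed set of accepted extensions, chosen only for this illustration.
SUPPORTED_EXTENSIONS = ("pdf", "hocr", "tsv")


def normalize_extensions(extension):
    """Split a space-separated extension string and keep the supported entries.

    Falls back to 'txt' when nothing valid remains, mirroring the current check.
    """
    requested = extension.split()
    valid = [ext for ext in requested if ext in SUPPORTED_EXTENSIONS]
    return valid or ["txt"]


print(normalize_extensions("pdf tsv"))
print(normalize_extensions("docx"))
```

Silently dropping unknown entries follows the behaviour of the quoted check; raising an error instead would be a reasonable alternative.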


bozhodimitrov · 2018-11-06T17:23:48Z

As a matter of fact, pytesseract partially supports this scenario: you can use the config argument to pass the second extension (which is a weird way of specifying both output extensions).
But the problem is that pytesseract will return only one of the outputs, the one specified with the extension argument.

I am OK with solutions 1 and 3. But for 2, we need to agree that those will be the final function signatures for run_tesseract/run_and_get_output. Judging by the commits, we change those every year, so I prefer not to expose them; if we do expose them, there should be a clear default warning that the interface is not final.

My vote is for 3 and we can use image_to_outputs or something like that.

ELToulemonde · 2018-11-07T12:30:18Z

Ok for solution number 3 with function image_to_outputs

Regarding the output, I see multiple scenarios:

1. Return one of the outputs (the first one).
   For example: if extension = "tsv pdf", return only the output for "tsv", even though both are computed.
2. Return the generated file names.
   For example: if extension = "tsv pdf", return ("out.tsv", "out.pdf").
3. Let the user choose the type of each output with output_type.
   For example: extension = ("tsv", "pdf"), output_type = ("string", "filepath").
   The output would then be a tuple (string of tsv, path to pdf).
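The file-name scenario is the simplest to sketch. The helper below is hypothetical and assumes that each extension is also the suffix of the file tesseract writes, which holds for formats such as tsv, pdf, hocr and txt.

```python
def expected_output_paths(output_base, extension):
    """Return the file names tesseract should write, in the order requested.

    Hypothetical helper; assumes every extension doubles as the file suffix.
    """
    return tuple(f"{output_base}.{ext}" for ext in extension.split())


print(expected_output_paths("out", "tsv pdf"))
```

Returning paths leaves loading and cleanup to the caller, which keeps the function small but means the files cannot live in a temporary directory that is deleted on return.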

bozhodimitrov · 2018-11-07T12:44:34Z

I like the tuple approach (3) the most, since we can also extend it (if we are crazy enough :D).
Unfortunately, this change is not going to be trivial: we will need at least a single mapping (dictionary) between each extension and the name of the function that is going to be executed for that extension.
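One way such a mapping might look is a plain dictionary from extension to a handler function. Every name here (`EXTENSION_HANDLERS`, `parse_tsv`, `keep_bytes`, `decode_text`, `handle_output`) is hypothetical and chosen for this sketch; none of them exists in pytesseract.

```python
import csv
import io


def parse_tsv(raw):
    """Decode TSV bytes into a list of row dictionaries keyed by the header."""
    text = raw.decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def keep_bytes(raw):
    """Binary formats such as PDF are returned untouched."""
    return raw


def decode_text(raw):
    """Text formats are decoded to str."""
    return raw.decode("utf-8")


# Hypothetical mapping from extension to the function that handles it.
EXTENSION_HANDLERS = {
    "tsv": parse_tsv,
    "pdf": keep_bytes,
    "hocr": decode_text,
    "txt": decode_text,
}


def handle_output(extension, raw):
    """Dispatch raw file content to the handler registered for the extension."""
    try:
        handler = EXTENSION_HANDLERS[extension]
    except KeyError:
        raise ValueError(f"unsupported extension: {extension!r}") from None
    return handler(raw)


rows = handle_output("tsv", b"level\ttext\n5\tHello\n")
print(rows[0]["text"])
```

Supporting a new format then only requires adding one entry to the dictionary.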

As for the output, I guess we can keep the same approach as before: ask the user for the Output format and return that (again) as a tuple.

Let's think a bit about this, because if we make it nice and simple now, we will have fewer problems maintaining it in the future.

ELToulemonde · 2018-11-08T12:45:08Z

New function signature
We could have a signature like this:

def image_to_outputs(image: str,
                     lang: str = None,
                     config: str = '',
                     nice: int = 0,
                     extension: tuple = ('tsv', 'pdf'),
                     output_type: tuple = (Output.FILEPATH, Output.FILEPATH)) -> tuple:
    ...

Implementation
The way I see it, the function needs two parts:

1. Call tesseract (with all extensions)
2. Retrieve the results (extension by extension)

For part 1: we can reuse run_tesseract with a few modifications to how extensions are handled

For part 2: I think we need to develop a load_outputs function with a signature like
def load_outputs(file_path: str, output_type: str):
Incidentally, this new function could be reused in run_and_get_output, or could even replace it
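To make part 2 concrete, here is a minimal sketch of what `load_outputs` might do, together with a hypothetical `collect_outputs` wrapper that returns the tuple. `OutputKind` is a stand-in enum invented for this example (it is not pytesseract's `Output` class), and the demo writes its own files in place of a real tesseract run.

```python
import tempfile
from enum import Enum
from pathlib import Path


class OutputKind(Enum):
    """Stand-in for the proposed output types; not pytesseract's Output class."""
    STRING = "string"
    BYTES = "bytes"
    FILEPATH = "filepath"


def load_outputs(file_path, output_type):
    """Part 2: turn one generated file into the form the caller asked for."""
    path = Path(file_path)
    if output_type is OutputKind.FILEPATH:
        return str(path)
    if output_type is OutputKind.BYTES:
        return path.read_bytes()
    if output_type is OutputKind.STRING:
        return path.read_text(encoding="utf-8")
    raise ValueError(f"unknown output type: {output_type!r}")


def collect_outputs(output_base, extensions, output_types):
    """Load each extension's file in request order and return a tuple."""
    if len(extensions) != len(output_types):
        raise ValueError("extensions and output_types must have the same length")
    return tuple(
        load_outputs(f"{output_base}.{ext}", kind)
        for ext, kind in zip(extensions, output_types)
    )


# Demo: hand-written files stand in for the results of part 1 (the tesseract call).
with tempfile.TemporaryDirectory() as tmp:
    base = str(Path(tmp) / "out")
    Path(base + ".tsv").write_text("level\ttext\n5\tHello\n", encoding="utf-8")
    Path(base + ".pdf").write_bytes(b"%PDF-1.5")
    tsv_text, pdf_path = collect_outputs(
        base, ("tsv", "pdf"), (OutputKind.STRING, OutputKind.FILEPATH)
    )
    print(tsv_text.splitlines()[0])
    print(Path(pdf_path).name)
```

Note that a file-path result is only useful if the file outlives the call, so a real implementation would have to decide who owns and cleans up the output directory.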

alwhelan22 · 2021-05-11T12:38:33Z

I'd love to see this feature as well. I see there have been no updates since 2018, so I'm picking this back up.

binkjakub · 2022-09-02T09:02:35Z

I would also like to see this feature.

bozhodimitrov added the Feature Request label Nov 6, 2018

bozhodimitrov changed the title ~~[New feature] Multiple output formats~~ Multiple output formats Nov 6, 2018

ELToulemonde mentioned this issue Oct 18, 2022

[Refactoring] Reduce numbers of identique calls to tesseract #459

Closed
