Multiple output formats #159

Open
ELToulemonde opened this issue Nov 6, 2018 · 6 comments

@ELToulemonde
Contributor

Hi,

Tesseract feature
Tesseract can produce several output formats from a single call, for example:

tesseract yourimage.tif out pdf tsv

This generates both out.pdf and out.tsv, so a single run yields OCR results in a format Python can parse as well as a searchable PDF.

Producing both formats in one call is worthwhile because, in my experience, it is about twice as fast as two separate calls. I believe this is because the OCR computation is not repeated.
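For reference, the single call described above can be sketched from Python with the standard library alone. This is only an illustration, not pytesseract code: build_tesseract_command, expected_outputs and run_tesseract_multi are hypothetical helper names, the sketch assumes the tesseract binary is on PATH, and it assumes each format name is also the file extension (true for pdf and tsv).

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, formats):
    """Build the argv for one tesseract call that writes several formats."""
    if not formats:
        raise ValueError("at least one output format is required")
    # CLI shape: tesseract IMAGE OUTPUTBASE [CONFIGFILE...]
    # Each format name (pdf, tsv, ...) is passed as a trailing config name.
    return ["tesseract", str(image_path), str(output_base), *formats]


def expected_outputs(output_base, formats):
    """Assumption: tesseract writes OUTPUTBASE.<format> for each format."""
    return [Path(f"{output_base}.{fmt}") for fmt in formats]


def run_tesseract_multi(image_path, output_base, formats=("pdf", "tsv")):
    """Run OCR once and return the paths of the files that should exist."""
    command = build_tesseract_command(image_path, output_base, formats)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return expected_outputs(output_base, formats)
```

Calling run_tesseract_multi("yourimage.tif", "out") would run the same command as the example above and return the paths out.pdf and out.tsv.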

Not possible with pytesseract
This feature cannot be used through pytesseract, because it only exposes specific functions (one per task).

Potential solutions

  1. Edit the function image_to_pdf_or_hocr so that it accepts a combined extension such as "pdf tsv".
    This means modifying:
 if extension not in ['pdf', 'hocr']:
     extension = 'txt'
  2. Expose the functions run_tesseract and (with some precautions around extension) run_and_get_output
  3. Create a new dedicated function which handles a list of formats
@bozhodimitrov
Collaborator

As a matter of fact, pytesseract partially supports this scenario: you can use the config argument to pass the second extension (which is a weird way of specifying both output extensions).
The problem is that pytesseract will return only the one output selected by the extension argument.

I am OK with solutions 1 and 3. For 2, though, we would need to agree that those will be the final function signatures for run_tesseract/run_and_get_output. Judging by the commits, we change those every year, so I prefer not to expose them; if we do expose them, there should be a clear warning that the interface is not final.

My vote is for 3 and we can use image_to_outputs or something like that.

@bozhodimitrov bozhodimitrov changed the title [New feature] Multiple output formats Multiple output formats Nov 6, 2018
@ELToulemonde
Contributor Author

Ok for solution number 3 with function image_to_outputs

Regarding the output, I see multiple scenarios:

  1. Return one of the outputs (the first one)
    For example
    if extension = "tsv pdf" only return output for "tsv" even if both are computed

  2. Return generated file names
    For example:
    if extension = "tsv pdf" return ("out.tsv", "out.pdf")

  3. Let the user choose the type of each output with output_type
    For example:
    extension = ("tsv", "pdf"), output_type = ("string", "filepath")
    The output would then be a tuple (string of tsv, path to pdf)

@bozhodimitrov
Collaborator

I like the tuple approach (3) the most, since we can also extend it (if we are crazy enough :D).
Unfortunately, this change is not going to be trivial: we will need at least a single mapping (dictionary) between each extension and the name of the function that is going to be executed for that extension.
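One possible shape for such a mapping, as a rough sketch only: a dictionary from extension to a loader function, with a single dispatch helper. All names here (LOADERS, read_text, read_bytes, read_tsv_rows, load_by_extension) are hypothetical and are not part of pytesseract; the sketch also assumes the output files are UTF-8 and named OUTPUTBASE.<extension>.

```python
import csv
import io
from pathlib import Path


def read_text(path):
    """Return the file content as a string."""
    return Path(path).read_text(encoding="utf-8")


def read_bytes(path):
    """Return the raw file content (suitable for binary formats like PDF)."""
    return Path(path).read_bytes()


def read_tsv_rows(path):
    """Parse a tab-separated file into a list of dicts keyed by the header row."""
    text = Path(path).read_text(encoding="utf-8")
    # QUOTE_NONE: treat quote characters in OCR text as ordinary characters.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical mapping: one loader per supported extension.
LOADERS = {
    "txt": read_text,
    "hocr": read_text,
    "tsv": read_tsv_rows,
    "pdf": read_bytes,
}


def load_by_extension(output_base, extension):
    """Look up the loader for `extension` and apply it to OUTPUTBASE.<extension>."""
    try:
        loader = LOADERS[extension]
    except KeyError:
        raise ValueError(f"unsupported extension: {extension!r}") from None
    return loader(f"{output_base}.{extension}")
```

Supporting a new format would then only mean adding one entry to the dictionary, which keeps the dispatch logic in one place.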

As for the output, I guess we can keep the same approach as before: ask the user for the Output format and return that (again) as a tuple.

Let's think about this a bit, because if we make it nice and simple now, we will have fewer problems maintaining it in the future.

@ELToulemonde
Contributor Author

New function signature
We could have a signature like this:

def image_to_outputs(image: str,
                     lang: str=None,
                     config: str='',
                     nice: int=0,
                     extension: tuple=('tsv',  'pdf'),
                     output_type: tuple=(Output.FILEPATH, Output.FILEPATH)) -> tuple:

Implementation
The way I see it, the function needs two parts:

  1. Call tesseract (with all extensions)
  2. Retrieve results (extension by extension)

For part 1: we can reuse run_tesseract with a few modifications on extension management

For part 2: I think we need to develop a load_outputs function with a signature like
def load_outputs(file_path: str, output_type: str):
Incidentally, this new function could be reused inside run_and_get_output, or could even replace it.
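To make part 2 concrete, here is a minimal sketch of what load_outputs and the per-extension retrieval could look like. It is an assumption-laden illustration, not the agreed design: OutputKind is a hypothetical enum standing in for the Output.* constants in the signature above, and collect_outputs is a hypothetical helper that assumes tesseract has already written OUTPUTBASE.<extension> for every requested extension.

```python
from enum import Enum
from pathlib import Path


class OutputKind(Enum):
    """Hypothetical stand-in for the Output.* constants proposed above."""
    STRING = "string"
    BYTES = "bytes"
    FILEPATH = "filepath"


def load_outputs(file_path, output_type):
    """Return one generated file in the representation the caller asked for."""
    path = Path(file_path)
    if output_type is OutputKind.FILEPATH:
        return str(path)  # hand back the location, do not read the file
    if output_type is OutputKind.BYTES:
        return path.read_bytes()
    if output_type is OutputKind.STRING:
        return path.read_text(encoding="utf-8")
    raise ValueError(f"unsupported output_type: {output_type!r}")


def collect_outputs(output_base, extensions, output_types):
    """Part 2: retrieve results extension by extension, as a tuple."""
    if len(extensions) != len(output_types):
        raise ValueError("extension and output_type must have the same length")
    return tuple(
        load_outputs(f"{output_base}.{ext}", kind)
        for ext, kind in zip(extensions, output_types)
    )
```

One design point this leaves open: if the files are written to a temporary location that is cleaned up after the call, a FILEPATH result would need the file to be kept or moved somewhere the caller controls.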

@alwhelan22

I'd love to see this feature as well. I see there have been no updates since 2018, so I'm picking this back up.

@binkjakub

I would also like to see this feature.


4 participants