[Refactoring] Reduce number of identical calls to tesseract #459

ELToulemonde · 2022-10-18T13:26:31Z

Context

While exploring the code base to develop the functionality proposed in #159, I discovered that tesseract is often called multiple times on the exact same data with the same arguments, purely for formatting purposes.

Example:

For the function image_to_string, it is called 3 times. I checked at the process level, and 3 processes are indeed launched.

def image_to_string(
    image,
    lang=None,
    config='',
    nice=0,
    output_type=Output.STRING,
    timeout=0,
):
    """
    Returns the result of a Tesseract OCR run on the provided image to string
    """
    args = [image, 'txt', lang, config, nice, timeout]

    return {
        Output.BYTES: lambda: run_and_get_output(*(args + [True])),
        Output.DICT: lambda: {'text': run_and_get_output(*args)},
        Output.STRING: lambda: run_and_get_output(*args),
    }[output_type]()

Consequences

Computation seems to occur in parallel, so it doesn't have an immediate impact on computation time.
But it is sub-optimal:

A small machine will see longer wait times.
Multiple "huge" jobs calling tesseract will see longer wait times.
Computation resources are wasted, which is not very energy efficient.

Proposition

A small refactoring could allow us to reduce the number of calls by a factor of 2 to 3.

The refactoring would look like this.

Remove the return_bytes=False option in run_and_get_output and always return bytes.
The complete implementation would be:

def run_and_get_output(
    image,
    extension='',
    lang=None,
    config='',
    nice=0,
    timeout=0,
):
    with save(image) as (temp_name, input_filename):
        kwargs = {
            'input_filename': input_filename,
            'output_filename_base': temp_name,
            'extension': extension,
            'lang': lang,
            'config': config,
            'nice': nice,
            'timeout': timeout,
        }

        run_tesseract(**kwargs)
        filename = f"{kwargs['output_filename_base']}{extsep}{extension}"
        with open(filename, 'rb') as output_file:
            return output_file.read()

NB: If we want to avoid changing the signature of this function, we could keep it as it is and always call it with return_bytes=True

Implement a decode function

def decode_result(result: bytes) -> str:
    return result.decode(DEFAULT_ENCODING)
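A minimal sketch of how this helper behaves, assuming DEFAULT_ENCODING is 'utf-8' (an assumption made for this example; the sample bytes are made up):

```python
DEFAULT_ENCODING = 'utf-8'  # assumption for this sketch


def decode_result(result: bytes) -> str:
    # Turn the raw bytes read from the output file into text.
    return result.decode(DEFAULT_ENCODING)


raw = 'Café 42\n'.encode(DEFAULT_ENCODING)  # made-up sample output
text = decode_result(raw)
print(repr(text))
```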

In the interface functions, make one call to run_and_get_output and manipulate the output to get the expected results

def image_to_string(
    image,
    lang=None,
    config='',
    nice=0,
    output_type=Output.STRING,
    timeout=0,
):
    args = [image, 'txt', lang, config, nice, timeout]
    bytes_result = run_and_get_output(*args)
    decoded_result = decode_result(bytes_result)
    return {
        Output.BYTES: lambda: bytes_result,
        Output.DICT: lambda: {'text': decoded_result},
        Output.STRING: lambda: decoded_result,
    }[output_type]()
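One detail of the version above: decoding happens before the dispatch, so it also runs when Output.BYTES is requested. A possible variant, sketched here under assumptions, keeps the single call but decodes only when text is needed. The Output class, the run_and_get_output stub and the DEFAULT_ENCODING value are stand-ins defined only for this example, not the library's code.

```python
DEFAULT_ENCODING = 'utf-8'  # assumption for this sketch


class Output:
    # Stand-in for the library's output-type constants.
    BYTES = 'bytes'
    DICT = 'dict'
    STRING = 'string'


def run_and_get_output(*args):
    # Stub: pretend the OCR run produced these bytes.
    return b'hello\n'


def decode_result(result: bytes) -> str:
    return result.decode(DEFAULT_ENCODING)


def image_to_string(image, lang=None, config='', nice=0,
                    output_type=Output.STRING, timeout=0):
    args = [image, 'txt', lang, config, nice, timeout]
    bytes_result = run_and_get_output(*args)  # single call
    # Decoding is deferred into the branches that actually need text.
    return {
        Output.BYTES: lambda: bytes_result,
        Output.DICT: lambda: {'text': decode_result(bytes_result)},
        Output.STRING: lambda: decode_result(bytes_result),
    }[output_type]()


print(image_to_string('page.png', output_type=Output.BYTES))
print(image_to_string('page.png', output_type=Output.DICT))
```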

Finally, we would need to modify the function get_pandas_output:

def get_pandas_output(tesseract_outputs, config=None):
    if not pandas_installed:
        raise PandasNotSupported()

    kwargs = {'quoting': QUOTE_NONE, 'sep': '\t'}
    try:
        kwargs.update(config)
    except (TypeError, ValueError):
        pass

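    # tesseract_outputs is already bytes, which is what BytesIO expects;
    # decoding it to str first would raise a TypeError.
    return pd.read_csv(BytesIO(tesseract_outputs), **kwargs)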

NB: If we want to avoid changing the signature of this function we could create another one named get_pandas_from_tesseract_output.
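When wiring this up, it helps to remember that io.BytesIO only accepts bytes-like objects, while decoded text needs io.StringIO. The sketch below illustrates the difference with made-up TSV data; it uses the standard csv module instead of pandas only to stay dependency-free.

```python
import csv
from io import BytesIO, StringIO, TextIOWrapper

raw = b'level\ttext\n5\thello\n'  # made-up TSV output as bytes

# Raw bytes go into BytesIO; wrap it so csv can read it as text.
byte_stream = TextIOWrapper(BytesIO(raw), encoding='utf-8', newline='')
rows_from_bytes = list(csv.reader(byte_stream, delimiter='\t'))

# Decoded text goes into StringIO.
text_stream = StringIO(raw.decode('utf-8'), newline='')
rows_from_text = list(csv.reader(text_stream, delimiter='\t'))

# Passing a str to BytesIO is rejected with a TypeError.
try:
    BytesIO(raw.decode('utf-8'))
    str_accepted = True
except TypeError:
    str_accepted = False

print(rows_from_bytes == rows_from_text, str_accepted)
```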

Nice side effects from this refactoring

run_and_get_output would be simpler: def run_and_get_output(image, extension='', lang=None, config='', nice=0, timeout=0) -> bytes
Decoding the output from bytes would be moved next to the other output manipulations, making the code more readable.

Conclusion

What do you think? Did I miss something?


stefan6419846 · 2022-10-19T07:15:55Z

It seems like you are mixing something up here: The above construct will only call the branch actually needed, while Tesseract will use multiple threads by default when actually running (this is what you observed).

You can verify this with the following basic example as well:

choice = {
    'key1': lambda: print('1'),
    'key2': lambda: print('2'),
    'key3': lambda: print('3'),
}['key2']

choice()

This will only print 2 as expected.
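The same check can be applied to the dispatch pattern used in image_to_string by counting how often a stubbed run_and_get_output is invoked. The stub and the string keys below are placeholders for this sketch, not the library's code.

```python
calls = []


def run_and_get_output(*args):
    # Stub standing in for the real OCR call; records each invocation.
    calls.append(args)
    return 'text'


def image_to_string(image, output_type='string'):
    args = [image, 'txt']
    # All three lambdas are created, but only the selected one is called.
    return {
        'bytes': lambda: run_and_get_output(*(args + [True])),
        'dict': lambda: {'text': run_and_get_output(*args)},
        'string': lambda: run_and_get_output(*args),
    }[output_type]()


result = image_to_string('page.png', output_type='dict')
print(result, len(calls))
```

Here the stub is recorded exactly once, whichever output type is selected.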

ELToulemonde · 2022-10-19T13:14:47Z

Indeed, I misinterpreted what I saw.

Thanks, I learned something, and I'm closing the issue.

ELToulemonde changed the title ~~[Refactoring] Calls to tesseract reductions~~ [Refactoring] Reduce number of identical calls to tesseract Oct 18, 2022

ELToulemonde closed this as completed Oct 19, 2022
