Research the option of using stdin/stdout instead saving image on disk #172

Open
cgallay opened this issue Jan 2, 2019 · 14 comments

Comments

@cgallay

cgallay commented Jan 2, 2019

Hi,
I am wondering why you don't use the stdin argument to send the image to tesseract instead of saving it to disk?

temp_name, input_filename = save_image(image)

@bozhodimitrov
Collaborator

bozhodimitrov commented Jan 2, 2019

Hi @cgallay
short answer: that was the initial implementation
I agree that it's not the optimal solution and maybe it should be used only for debugging purposes.

This question is also relevant for the stdout.

@bozhodimitrov
Collaborator

I found some issues with tesseract's stdin/stdout handling; some modes and versions are affected.
For reference:
tesseract-ocr/tesseract#785, tesseract-ocr/tesseract#85, etc.

@bozhodimitrov bozhodimitrov changed the title why saving image on disk instead of sending it through stdin? Research the option of using stdin/stdout instead saving image on disk Jun 28, 2019
@bozhodimitrov
Collaborator

It seems that the problems with the stdin are fixed in tesseract 5.0, but testing is needed.

@AyushP123

AyushP123 commented Oct 30, 2019

Hi, I wanted to ask whether this feature has been added in the latest release of pytesseract. Saving the image to disk just to run the tesseract command is quite slow (I think that is what pytesseract is doing right now). Can I work on this feature if it's not implemented? (I will need help from you to implement it.)

@bozhodimitrov
Collaborator

bozhodimitrov commented Oct 30, 2019

At the moment pytesseract supports passing the path of an image directly, which skips creating temp files on disk. You can also pass the path to a text file containing a list of images for batch processing; this also skips the pytesseract temp files. Both of those options are documented.
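A minimal sketch of those two documented options, for readers who already have their images on disk. The helper names ocr_from_path and write_batch_list are made up for illustration; only pytesseract.image_to_string is part of the library, and the import is deferred so the helpers can be defined even where pytesseract is not installed.

```python
def write_batch_list(image_paths, list_path):
    """Write one image path per line, the format tesseract expects for batch input."""
    with open(list_path, 'w', encoding='utf-8') as handle:
        handle.write('\n'.join(image_paths) + '\n')
    return list_path


def ocr_from_path(path, lang=None):
    """OCR an image file, or a .txt list of image files, without a pytesseract temp copy."""
    import pytesseract  # deferred import: needs pytesseract and the tesseract binary

    # Passing a str path (instead of a PIL image) lets tesseract read the file itself.
    return pytesseract.image_to_string(path, lang=lang)
```

Usage would look like ocr_from_path('page.png') for a single file, or ocr_from_path(write_batch_list(['a.png', 'b.png'], 'images.txt')) for a batch; the file names here are placeholders.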

About the feature itself - we need to first test the tesseract stdin on most recent versions in order to be sure that it will work correctly with pytesseract.

@j-hap

j-hap commented Mar 11, 2021

I just fiddled around with it and this does the trick, for tesseract 4.1.1 and 5.0.0-alpha-20201231-243-gff83 at least (I'm on Windows 10 with both tesseract versions in MSYS2). It does not really honor the interface yet, because it does not give you any output files, even if an extension is given, but I was just trying to get tesseract with stdin to work.

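import shlex
import subprocess
import sys
from errno import ENOENT
from io import BytesIO

# Meant to replace pytesseract.pytesseract.run_and_get_output; it relies on that
# module's globals (tesseract_cmd, DEFAULT_ENCODING, get_errors, the exceptions).
def run_and_get_output(
    image,
    extension='',
    lang=None,
    config='',
    nice=0,
    timeout=0,
    return_bytes=False,
):

    cmd_args = []

    # 'nice' has to come before the tesseract executable it wraps
    if not sys.platform.startswith('win32') and nice != 0:
        cmd_args += ('nice', '-n', str(nice))

    cmd_args += (tesseract_cmd, 'stdin', 'stdout')

    if lang is not None:
        cmd_args += ('-l', lang)

    if config:
        cmd_args += shlex.split(config)

    # Encode the PIL image in memory instead of writing a temp file
    buffer = BytesIO()
    image.save(buffer, 'PNG')

    try:
        proc = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        if e.errno != ENOENT:
            raise
        raise TesseractNotFoundError()

    # communicate() feeds stdin and drains both pipes, which avoids a pipe deadlock
    try:
        stdout_data, stderr_data = proc.communicate(buffer.getvalue(), timeout=timeout or None)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()
        raise RuntimeError('Tesseract process timeout')

    # Check the exit status before returning (the original returned first)
    if proc.returncode:
        raise TesseractError(proc.returncode, get_errors(stderr_data))

    return stdout_data if return_bytes else stdout_data.decode(DEFAULT_ENCODING)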

@bozhodimitrov
Collaborator

Nice, you can always monkey patch the module level function and make it work for you.
One problem here is that this functionality might be limited to the newer versions of tesseract and we should consider the older 3.x versions too.
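As a rough sketch of the monkey-patching idea: the context manager below swaps a module-level function and restores it afterwards. The name patched is hypothetical and not part of pytesseract; the assumption is that pytesseract's public functions look up run_and_get_output on the pytesseract.pytesseract module at call time, so replacing that attribute is enough.

```python
from contextlib import contextmanager


@contextmanager
def patched(module, name, replacement):
    """Temporarily replace module.<name> and restore the original on exit."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield
    finally:
        # Restore even if the body raised, so the patch never leaks.
        setattr(module, name, original)
```

With a stdin-based replacement function at hand, usage would be along the lines of: with patched(pytesseract.pytesseract, 'run_and_get_output', my_stdin_version): text = pytesseract.image_to_string(img). Scoping the patch this way keeps the stock file-based behavior everywhere else.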

@j-hap

j-hap commented Mar 12, 2021

I just compiled version 3.05.02 from https://github.com/tesseract-ocr/tesseract/tree/3.05.02 and ran the same modification as above with no problems, using traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00. I know it's just a sample and not a full-fledged test, but maybe that helps.

@bozhodimitrov
Collaborator

Yes, this is very helpful. This means that this feature can be added to pytesseract.
But we will need to add additional tests in order to be sure that the other functionality doesn't break.
By other functionality, I mean the option to pass image paths as raw strings to pytesseract functions.
Also, I am not sure if tesseract will honor the configuration options in combination with the stdin input.

And finally, based on the above - we should decide if this will be the default implementation (and if we should keep the old one around or not).
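One way to probe the configuration concern could be a small comparison harness that runs the same image and the same options once through stdin and once through a file, then compares the results. This is only a sketch: run_tesseract, stdin_matches_file and the cmd_prefix parameter are illustrative names, and it assumes the tesseract CLI accepts the literal arguments stdin and stdout as discussed in this thread.

```python
import os
import subprocess
import tempfile


def run_tesseract(image_bytes, config_args=(), use_stdin=True, cmd_prefix=('tesseract',)):
    """Run the tesseract CLI on encoded image bytes and return its raw stdout."""
    if use_stdin:
        args = [*cmd_prefix, 'stdin', 'stdout', *config_args]
        result = subprocess.run(args, input=image_bytes, capture_output=True, check=True)
        return result.stdout

    # File-based reference run: write the same bytes to a temp file first.
    fd, path = tempfile.mkstemp(suffix='.png')
    try:
        with os.fdopen(fd, 'wb') as handle:
            handle.write(image_bytes)
        args = [*cmd_prefix, path, 'stdout', *config_args]
        result = subprocess.run(args, capture_output=True, check=True)
        return result.stdout
    finally:
        os.remove(path)


def stdin_matches_file(image_bytes, config_args=(), cmd_prefix=('tesseract',)):
    """True if stdin input and file input give identical output for these options."""
    via_stdin = run_tesseract(image_bytes, config_args, True, cmd_prefix)
    via_file = run_tesseract(image_bytes, config_args, False, cmd_prefix)
    return via_stdin == via_file
```

Calling stdin_matches_file(png_bytes, ('--psm', '6')) for a handful of option sets and tesseract versions would give a first indication; it is not a substitute for proper tests in the pytesseract suite.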

@mo-han

mo-han commented Apr 1, 2021

I just wrote a CLI wrapper for my own use, which uses stdin and stdout to communicate with the tesseract executable. I also convert the TSV output into a more intuitive dict structure. The wrapper object can be fed an image in different flavors: image file path, image bytes, PIL image object, or ndarray. No temp file is created for input or output.
https://github.com/mo-han/mo-han-toolbox/blob/master/mylib/wrapper/tesseract_ocr.py

from mylib.wrapper import tesseract_ocr
from PIL import Image

t=tesseract_ocr.TesseractOCRCLIWrapper(r'C:\Users\mo-han\AppData\Local\Programs\Tesseract-OCR\tesseract.exe')
t.set_language('chi_sim', 'eng').set_image_object(Image.open('r:1.jpg')).get_ocr_tsv_to_dict(psm=3, min_confidence=0.8)

[{'text': '名',
  'confidence': 0.91,
  'box': ((320, 10), (333, 10), (333, 23), (320, 23)),
  'page block paragraph line word level': (1, 1, 1, 1, 1, 5)},
 {'text': '称',
  'confidence': 0.87,
  'box': ((320, 28), (333, 28), (333, 40), (320, 40)),
  'page block paragraph line word level': (1, 1, 1, 1, 2, 5)},
 {'text': '修改',
  'confidence': 0.9,
  'box': ((320, 336), (333, 336), (333, 366), (320, 366)),
  'page block paragraph line word level': (1, 1, 1, 1, 3, 5)},
 {'text': '日',
  'confidence': 0.96,
  'box': ((320, 372), (333, 372), (333, 379), (320, 379)),
  'page block paragraph line word level': (1, 1, 1, 1, 4, 5)},
 {'text': '期',
  'confidence': 0.96,
  'box': ((320, 387), (333, 387), (333, 395), (320, 395)),
  'page block paragraph line word level': (1, 1, 1, 1, 5, 5)},
 {'text': ' ',
  'confidence': 0.95,
  'box': ((0, 195), (0, 195), (0, 224), (0, 224)),
  'page block paragraph line word level': (1, 2, 1, 1, 1, 5)},
 {'text': 'label_cn.txt',
  'confidence': 0.8,
  'box': ((283, 38), (295, 38), (295, 120), (283, 120)),
  'page block paragraph line word level': (1, 3, 1, 1, 1, 5)},
 {'text': '2019/8/5',
  'confidence': 0.87,
  'box': ((281, 337), (294, 337), (294, 401), (281, 401)),
  'page block paragraph line word level': (1, 3, 1, 1, 2, 5)},
 {'text': '15:24',
  'confidence': 0.95,
  'box': ((283, 408), (294, 408), (294, 446), (283, 446)),
  'page block paragraph line word level': (1, 3, 1, 1, 3, 5)},
 {'text': '2019/8/5',
  'confidence': 0.91,
  'box': ((254, 337), (267, 337), (267, 401), (254, 401)),
  'page block paragraph line word level': (1, 3, 1, 2, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((256, 408), (267, 408), (267, 446), (256, 446)),
  'page block paragraph line word level': (1, 3, 1, 2, 4, 5)},
 {'text': '2019/8/5',
  'confidence': 0.89,
  'box': ((227, 337), (240, 337), (240, 401), (227, 401)),
  'page block paragraph line word level': (1, 3, 1, 3, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((229, 408), (240, 408), (240, 446), (229, 446)),
  'page block paragraph line word level': (1, 3, 1, 3, 4, 5)}]
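For readers curious how a TSV-to-dict conversion like this can work, here is a simplified sketch. It is not mo-han's implementation: the function name tsv_to_words and the output keys are made up, and it assumes the standard tesseract TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), with conf reported on a 0 to 100 scale and -1 for non-word rows.

```python
import csv
import io


def tsv_to_words(tsv_text, min_confidence=0.0):
    """Parse tesseract TSV output into a list of word dicts, filtered by confidence."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t', quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = row.get('text') or ''
        confidence = float(row['conf']) / 100.0  # rescale to 0..1
        # Structural rows (page, block, line) carry conf == -1 and no text.
        if confidence < 0 or confidence < min_confidence or not text.strip():
            continue
        left, top = int(row['left']), int(row['top'])
        words.append({
            'text': text,
            'confidence': confidence,
            # (left, top, right, bottom) in pixels
            'box': (left, top, left + int(row['width']), top + int(row['height'])),
        })
    return words
```

The box here is a plain (left, top, right, bottom) tuple rather than the four corner points shown in the output above, which keeps the sketch short.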

@GreenCobalt

Hi there,
Having this implemented would be very useful: another dev and I are trying to read frames from a video, and a 400 ms processing time per frame at 30 fps leads to very long processing times. I have an OpenCV image in Python and would like to pass it directly to Tesseract instead of having it saved to disk.

@mo-han

mo-han commented Jun 4, 2021

@GreenCobalt
you could try my example code above

@bozhodimitrov
Collaborator

@GreenCobalt you can try https://github.com/sirfz/tesserocr for your use case, but I am not sure what underlying version of the tesseract implementation is used.

@dilerbatu

Wow! @j-hap's stdin version of run_and_get_output above really works, but not for OpenCV users as-is: they have to convert their image to a PIL Image first.
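A possible conversion for OpenCV users, sketched under the assumption that the frame is a uint8 NumPy array in OpenCV's usual layout (2-D grayscale or 3-channel BGR). The helper name opencv_to_pil is made up for illustration, and the channel swap is done with NumPy slicing so the sketch does not need cv2 itself.

```python
import numpy as np
from PIL import Image


def opencv_to_pil(frame):
    """Convert an OpenCV-style uint8 ndarray (grayscale or BGR) to a PIL Image."""
    if frame.ndim == 2:
        # Single-channel image, no channel reordering needed.
        return Image.fromarray(frame)
    if frame.ndim == 3 and frame.shape[2] == 3:
        # OpenCV stores channels as BGR; PIL expects RGB, so reverse the last axis.
        rgb = np.ascontiguousarray(frame[:, :, ::-1])
        return Image.fromarray(rgb)
    raise ValueError('expected a 2-D grayscale or 3-channel BGR array')
```

The resulting image can then be handed to the stdin-based function, for example pytesseract.image_to_string(opencv_to_pil(frame)).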

trentshapiro added a commit to trentshapiro/DraftGPT that referenced this issue Feb 11, 2023
updated run_and_get_output based on madmaze/pytesseract#172 (comment) to use stdin to avoid unnecessary disk writes, minor speed improvement