Research the option of using stdin/stdout instead saving image on disk #172

cgallay · 2019-01-02T06:47:29Z

Hi,
I am wondering why you don't use stdin argument to send the image to tesseract instead of saving it on the disk?

Line 208 in 25a9d38

temp_name, input_filename = save_image(image)

bozhodimitrov · 2019-01-02T18:15:05Z

Hi @cgallay
short answer: that was the initial implementation
I agree that it's not the optimal solution and maybe it should be used only for debugging purposes.

This question is also relevant for the stdout.

bozhodimitrov · 2019-01-04T15:11:17Z

I found some issues with tesseract's stdin/stdout handling; some modes/versions are affected.
For reference:
tesseract-ocr/tesseract#785, tesseract-ocr/tesseract#85, etc.

bozhodimitrov · 2019-07-26T11:41:57Z

It seems that the problems with the stdin are fixed in tesseract 5.0, but testing is needed.
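Testing across releases first requires knowing which release is installed. The following is a minimal sketch of one way to read it, assuming the banner printed by `tesseract --version` starts with `tesseract <version>`. The helper names `parse_tesseract_version` and `installed_tesseract_version` are hypothetical and not part of pytesseract.

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from the `tesseract --version` banner."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def installed_tesseract_version(tesseract_cmd="tesseract"):
    """Run the binary and parse its banner.

    stdout and stderr are merged because the stream used for the banner
    is assumed to differ between releases.
    """
    result = subprocess.run(
        [tesseract_cmd, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        check=True,
    )
    return parse_tesseract_version(result.stdout.decode("utf-8", "replace"))
```

A test suite could call `installed_tesseract_version()` once and skip the stdin tests on releases where the behaviour is known to be broken.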

AyushP123 · 2019-10-30T05:03:35Z

Hi, I wanted to ask whether this feature has been added in the latest release of pytesseract. Saving the image to disk for the purpose of running the tesseract command is quite slow as of now (I think that is what pytesseract is doing right now). Can I work on this feature if it's not implemented? (I will need help from you to implement it.)

bozhodimitrov · 2019-10-30T08:37:00Z

At the moment pytesseract supports passing the path of the image itself, which skips creating temp files on disk. You can also pass the path to a text file containing a list of images for batch processing; this also skips the pytesseract temp files. Both of those options are documented.
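The two path-based options can be sketched as follows. The helper names and the file names `page1.png` / `page2.png` are hypothetical. The only pytesseract call used is `image_to_string` with a path argument, which is imported lazily so the batch-file helper runs without pytesseract installed.

```python
import os
import tempfile


def write_batch_file(image_paths, batch_path):
    """Write one image path per line: the list format used for batch runs."""
    with open(batch_path, "w", encoding="utf-8") as handle:
        for path in image_paths:
            handle.write(path + "\n")
    return batch_path


def ocr_from_path(path):
    """Hand a path (a single image or a batch list) straight to pytesseract."""
    import pytesseract  # lazy import: only needed when OCR is actually run

    return pytesseract.image_to_string(path)


# Build a batch list in a scratch directory (hypothetical image names).
batch_path = os.path.join(tempfile.mkdtemp(), "images.txt")
write_batch_file(["page1.png", "page2.png"], batch_path)
```

Calling `ocr_from_path("page1.png")` or `ocr_from_path(batch_path)` would then run Tesseract on the existing files rather than on a temporary copy.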

About the feature itself - we need to first test the tesseract stdin on most recent versions in order to be sure that it will work correctly with pytesseract.

j-hap · 2021-03-11T07:03:08Z

I just fiddled around with it and this does the trick, for tesseract 4.1.1 and 5.0.0-alpha-20201231-243-gff83 at least (I'm on windows 10 with both tesseract versions in MSYS2). It does not really honor the interface yet, because it does not give you any output files, even if an extension is given, but I was just trying to get tesseract with stdin to work.

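# Replacement for run_and_get_output inside pytesseract.py. It relies on names
# that module already defines or imports: sys, shlex, subprocess, ENOENT,
# tesseract_cmd, DEFAULT_ENCODING, get_errors, TesseractError and
# TesseractNotFoundError.
def run_and_get_output(
    image,
    extension='',
    lang=None,
    config='',
    nice=0,
    timeout=0,
    return_bytes=False,
):
    cmd_args = []

    # `nice` wraps the tesseract command, so it has to come first.
    if not sys.platform.startswith('win32') and nice != 0:
        cmd_args += ('nice', '-n', str(nice))

    # 'stdin' and 'stdout' take the place of the input file and output base.
    cmd_args += (tesseract_cmd, 'stdin', 'stdout')

    if lang is not None:
        cmd_args += ('-l', lang)

    if config:
        cmd_args += shlex.split(config)

    try:
        proc = subprocess.Popen(
            cmd_args,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError as e:
        if e.errno != ENOENT:
            raise
        raise TesseractNotFoundError()

    # Stream the image to tesseract as PNG; communicate() then closes stdin.
    image.save(proc.stdin, 'PNG')
    try:
        stdout_data, stderr_data = proc.communicate(timeout=timeout or None)
    except subprocess.TimeoutExpired:
        proc.kill()
        raise RuntimeError('Tesseract process timeout')

    # Check the exit status before returning; the first draft returned
    # inside the try block, so this check was never reached.
    if proc.returncode:
        raise TesseractError(proc.returncode, get_errors(stderr_data))

    return stdout_data if return_bytes else stdout_data.decode(DEFAULT_ENCODING)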
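For readers who want to try the stdin/stdout approach outside pytesseract, here is a minimal standalone sketch using only the standard library. It assumes a `tesseract` executable on `PATH` and image data that is already encoded (PNG, for example). The function names `build_stdin_command` and `ocr_image_bytes` are hypothetical.

```python
import subprocess
import sys


def build_stdin_command(tesseract_cmd="tesseract", lang=None, extra_args=(), nice=0):
    """Build an argv that makes Tesseract read stdin and write to stdout."""
    cmd = []
    # `nice` is a separate program that wraps the real command, so it goes first.
    if nice != 0 and not sys.platform.startswith("win32"):
        cmd += ["nice", "-n", str(nice)]
    cmd += [tesseract_cmd, "stdin", "stdout"]
    if lang is not None:
        cmd += ["-l", lang]
    cmd += list(extra_args)
    return cmd


def ocr_image_bytes(image_bytes, timeout=None, **command_options):
    """Pipe encoded image bytes to Tesseract and return the recognized text."""
    result = subprocess.run(
        build_stdin_command(**command_options),
        input=image_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

With Pillow, the bytes could come from saving the image into an `io.BytesIO` buffer as PNG and passing `buffer.getvalue()` to `ocr_image_bytes`.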

bozhodimitrov · 2021-03-11T10:58:15Z

Nice, you can always monkey patch the module level function and make it work for you.
One problem here is that this functionality might be limited to the newer versions of tesseract and we should consider the older 3.x versions too.

j-hap · 2021-03-12T07:26:46Z

I just compiled version 3.05.02 from https://github.com/tesseract-ocr/tesseract/tree/3.05.02 and ran the same modification as above with no problems with traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00. I know it's just a sample and not a full fledged test, but maybe that helps.

bozhodimitrov · 2021-03-12T11:12:47Z

Yes, this is very helpful. This means that this feature can be added to pytesseract.
But we will need to add additional tests in order to be sure that the other functionality doesn't break.
By other functionality, I mean the option to pass image paths as raw strings to pytesseract functions.
Also, I am not sure if tesseract will honor the configuration options in combination with the stdin input.

And finally, based on the above - we should decide if this will be the default implementation (and if we should keep the old one around or not).
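One way to keep the path-based behaviour while adding stdin support is to decide per call which route to take. This is a sketch only, under the assumption that paths keep being passed through unchanged and that only in-memory data goes through stdin. `plan_tesseract_input` is a hypothetical helper, not an existing pytesseract function.

```python
import os


def plan_tesseract_input(image):
    """Return (input_argument, stdin_payload) for a path or for encoded bytes."""
    if isinstance(image, (str, os.PathLike)):
        # Existing behaviour: a path is handed to Tesseract unchanged.
        return os.fspath(image), None
    if isinstance(image, (bytes, bytearray)):
        # New behaviour: encoded image bytes are piped through stdin.
        return "stdin", bytes(image)
    raise TypeError("unsupported image input: %r" % type(image))
```

Tests for the feature could then cover both branches separately, which addresses the concern about breaking the raw-string path option.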

mo-han · 2021-04-01T04:52:33Z

I just wrote a CLI wrapper for my own use, which uses stdin and stdout to communicate with the tesseract executable. I also convert the TSV output into a more intuitive dict structure. The wrapper object can be fed an image in different flavors: image_file_path, image_bytes, PIL image object, ndarray. No temp file is created for input or output.
https://github.com/mo-han/mo-han-toolbox/blob/master/mylib/wrapper/tesseract_ocr.py

from mylib.wrapper import tesseract_ocr
from PIL import Image

t=tesseract_ocr.TesseractOCRCLIWrapper(r'C:\Users\mo-han\AppData\Local\Programs\Tesseract-OCR\tesseract.exe')
t.set_language('chi_sim', 'eng').set_image_object(Image.open('r:1.jpg')).get_ocr_tsv_to_dict(psm=3, min_confidence=0.8)

[{'text': '名',
  'confidence': 0.91,
  'box': ((320, 10), (333, 10), (333, 23), (320, 23)),
  'page block paragraph line word level': (1, 1, 1, 1, 1, 5)},
 {'text': '称',
  'confidence': 0.87,
  'box': ((320, 28), (333, 28), (333, 40), (320, 40)),
  'page block paragraph line word level': (1, 1, 1, 1, 2, 5)},
 {'text': '修改',
  'confidence': 0.9,
  'box': ((320, 336), (333, 336), (333, 366), (320, 366)),
  'page block paragraph line word level': (1, 1, 1, 1, 3, 5)},
 {'text': '日',
  'confidence': 0.96,
  'box': ((320, 372), (333, 372), (333, 379), (320, 379)),
  'page block paragraph line word level': (1, 1, 1, 1, 4, 5)},
 {'text': '期',
  'confidence': 0.96,
  'box': ((320, 387), (333, 387), (333, 395), (320, 395)),
  'page block paragraph line word level': (1, 1, 1, 1, 5, 5)},
 {'text': ' ',
  'confidence': 0.95,
  'box': ((0, 195), (0, 195), (0, 224), (0, 224)),
  'page block paragraph line word level': (1, 2, 1, 1, 1, 5)},
 {'text': 'label_cn.txt',
  'confidence': 0.8,
  'box': ((283, 38), (295, 38), (295, 120), (283, 120)),
  'page block paragraph line word level': (1, 3, 1, 1, 1, 5)},
 {'text': '2019/8/5',
  'confidence': 0.87,
  'box': ((281, 337), (294, 337), (294, 401), (281, 401)),
  'page block paragraph line word level': (1, 3, 1, 1, 2, 5)},
 {'text': '15:24',
  'confidence': 0.95,
  'box': ((283, 408), (294, 408), (294, 446), (283, 446)),
  'page block paragraph line word level': (1, 3, 1, 1, 3, 5)},
 {'text': '2019/8/5',
  'confidence': 0.91,
  'box': ((254, 337), (267, 337), (267, 401), (254, 401)),
  'page block paragraph line word level': (1, 3, 1, 2, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((256, 408), (267, 408), (267, 446), (256, 446)),
  'page block paragraph line word level': (1, 3, 1, 2, 4, 5)},
 {'text': '2019/8/5',
  'confidence': 0.89,
  'box': ((227, 337), (240, 337), (240, 401), (227, 401)),
  'page block paragraph line word level': (1, 3, 1, 3, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((229, 408), (240, 408), (240, 446), (229, 446)),
  'page block paragraph line word level': (1, 3, 1, 3, 4, 5)}]
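A structure similar to the output above can be built from Tesseract's TSV output with the standard library alone. The sketch below is not mo-han's implementation. It assumes the usual TSV header (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`), and the function name, the `position` key and the corner ordering of `box` are choices made for this example.

```python
import csv
import io

POSITION_KEYS = ("page_num", "block_num", "par_num", "line_num", "word_num", "level")


def tsv_to_words(tsv_text, min_confidence=0.0):
    """Turn Tesseract TSV output into a list of word dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        # TSV reports confidence on a 0-100 scale; -1 marks non-word rows.
        confidence = float(row["conf"]) / 100.0
        if confidence < 0 or confidence < min_confidence:
            continue
        left, top = int(row["left"]), int(row["top"])
        right, bottom = left + int(row["width"]), top + int(row["height"])
        words.append({
            "text": row.get("text") or "",
            "confidence": confidence,
            # Corners in clockwise order starting at the top-left.
            "box": ((left, top), (right, top), (right, bottom), (left, bottom)),
            "position": tuple(int(row[key]) for key in POSITION_KEYS),
        })
    return words
```

The TSV text itself can be obtained without temp files by running Tesseract with `stdin stdout` and the `tsv` config, then passing the decoded stdout to `tsv_to_words`.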

GreenCobalt · 2021-06-04T22:20:00Z

Hi there,
Having this implemented would be very useful: another dev and I are trying to read frames from a video, and a 400 ms processing time per frame at 30 fps leads to very long processing times. I have an OpenCV image in Python and would like to pass it directly into Tesseract instead of having it saved to disk.

mo-han · 2021-06-04T23:32:00Z

@GreenCobalt
you could try my example code above

bozhodimitrov · 2021-06-05T15:07:12Z

@GreenCobalt you can try https://github.com/sirfz/tesserocr for your use case, but I am not sure what underlying version of the tesseract implementation is used.

dilerbatu · 2022-08-29T08:04:25Z

@j-hap Wow, the run_and_get_output modification from your comment above really works! It does not work directly for OpenCV users, though: they have to convert their image into a PIL Image first.
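For OpenCV users, the frame only has to become an encoded image before it is piped to Tesseract. It does not have to be a PIL object. The sketch below is one dependency-free option: it encodes raw 8-bit BGR pixels (OpenCV's channel order) as a binary PPM. It is an assumption that the Tesseract build in use accepts PPM on stdin. The function name `bgr_to_ppm` is hypothetical.

```python
def bgr_to_ppm(width, height, bgr_bytes):
    """Encode raw 8-bit BGR pixel data as a binary PPM (P6) image."""
    if len(bgr_bytes) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes of pixel data")
    rgb = bytearray(len(bgr_bytes))
    # Swap the blue and red channels with strided slice assignment.
    rgb[0::3] = bgr_bytes[2::3]
    rgb[1::3] = bgr_bytes[1::3]
    rgb[2::3] = bgr_bytes[0::3]
    header = "P6\n{} {}\n255\n".format(width, height).encode("ascii")
    return header + bytes(rgb)
```

For a contiguous `uint8` frame of shape `(height, width, 3)`, the call would be `bgr_to_ppm(frame.shape[1], frame.shape[0], frame.tobytes())`. With OpenCV installed, `cv2.imencode('.png', frame)[1].tobytes()` produces PNG bytes instead and is probably the more common route.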

bozhodimitrov changed the title ~~why saving image on disk instead of sending it through stdin?~~ Research the option of using stdin/stdout instead saving image on disk Jun 28, 2019

bozhodimitrov added the Feature Request label Jun 28, 2019

nok mentioned this issue Jul 27, 2019

Update Tesseract for testing #216

Closed

bozhodimitrov mentioned this issue Aug 14, 2019

Process batch files from momery #224

Closed

trentshapiro added a commit to trentshapiro/DraftGPT that referenced this issue Feb 11, 2023

add forked pytesseract

19a7d53

updated run_and_get_output based on madmaze/pytesseract#172 (comment) to use stdin to avoid unnecessary disk writes, minor speed improvement
