Version 4 Development and Changes #662

Closed
Balearica opened this issue Sep 17, 2022 · 4 comments

@Balearica
Collaborator

Balearica commented Sep 17, 2022

Overview

Bug fixes continue to be released for Version 3, but all breaking changes will be released in Version 4, which is currently under development in the dev/v4 branch. Users eager to try the new features should be able to use this branch now; however, there is no guarantee that additional breaking changes will not be implemented. Note that using this branch also requires using the dev/v4 branch of Tesseract.js-core.

Summary

Breaking Changes

  1. createWorker is now async
    1. In most code this means worker = Tesseract.createWorker() should be replaced with worker = await Tesseract.createWorker()
    2. Calling with an invalid workerPath or corePath now produces an error in the form of a rejected promise (Rework error reporting from worker threads so all promises resolve #654)
  2. worker.load is no longer needed (createWorker now returns worker pre-loaded)
  3. getPDF function replaced by pdf recognize option (GetPDF() with Scheduler returns the same PDF file #488)
    1. This allows PDFs to be created when using a scheduler
    2. See browser and node examples for usage
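
As a rough illustration of the breaking changes above, here is a minimal sketch of a v4-style call sequence. The function name recognizeWithPdf is hypothetical, the Tesseract module is passed in as an argument so the sketch does not depend on how it is imported, and reading the PDF from data.pdf is an assumption rather than something stated in this issue. See the browser and node examples for the authoritative usage.

```javascript
// Sketch only: `Tesseract` is whatever you get from require('tesseract.js')
// in Node or from the browser global.
async function recognizeWithPdf(Tesseract, image) {
  let worker;
  try {
    // v3: const worker = Tesseract.createWorker(); await worker.load();
    // v4: createWorker is async and returns a worker that is already loaded.
    worker = await Tesseract.createWorker();
  } catch (err) {
    // An invalid workerPath or corePath now surfaces here as a rejection.
    throw new Error(`Worker could not be created: ${err}`);
  }
  try {
    await worker.loadLanguage('eng');
    await worker.initialize('eng');
    // v3 called worker.getPDF() afterwards; v4 requests the PDF as an output.
    const { data } = await worker.recognize(image, undefined, { pdf: true });
    return { text: data.text, pdf: data.pdf }; // `data.pdf` is an assumption
  } finally {
    await worker.terminate();
  }
}
```

Because the PDF is requested per recognize call, the same options can be passed to jobs run through a scheduler.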

Major New Features

  1. Processed images created by Tesseract can be retrieved using imageColor, imageGrey, and imageBinary options (Is it possible to obtain the Thresholded Image from tesseract? #588)
    1. See image-processing.html example for usage
  2. Image rotation options rotateAuto and rotateRadians have been added, which significantly improve accuracy on certain documents
    1. See issue Add rotation preprocessing option #648 for an example of how auto-rotation improves accuracy
    2. See image-processing.html example for usage of rotateAuto option
  3. Tesseract parameters (usually set using worker.setParameters) can now be set for single jobs using worker.recognize options (Allow for setting parameters for single recognize job when using scheduler #665)
    1. For example, a single job can be set to recognize only numbers using worker.recognize(image, {tessedit_char_whitelist: "0123456789"})
    2. As these settings are reverted after the job, this allows for using different parameters for specific jobs when working with schedulers
  4. Initialization parameters (e.g. load_system_dawg, load_number_dawg, and load_punc_dawg) can now be set (Add a way to set "Init Only" parameters (user_word_suffix, etc.) #613)
    1. The third argument to worker.initialize now accepts either (1) an object with key/value pairs or (2) a string containing contents to write to a config file
    2. For example, both of these lines set load_number_dawg to 0:
      1. worker.initialize('eng', "0", {load_number_dawg: "0"});
      2. worker.initialize('eng', "0", "load_number_dawg 0");
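
To make the per-job options above concrete, here is a small sketch that runs two jobs on the same worker: one restricted to digits, one with auto-rotation and the binarized image requested. The function name recognizeTwoWays is hypothetical, the worker is assumed to be already initialized, and the result field data.imageBinary is an assumption; image-processing.html remains the reference example.

```javascript
// Sketch only: `worker` is assumed to be an initialized Tesseract.js v4 worker.
async function recognizeTwoWays(worker, image) {
  // Parameters passed as recognize options apply to this job only
  // and are reverted afterwards.
  const digitsOnly = await worker.recognize(image, {
    tessedit_char_whitelist: '0123456789',
  });

  // Second argument: job options (auto-rotation).
  // Third argument: output formats (here, the binarized image).
  const rotated = await worker.recognize(
    image,
    { rotateAuto: true },
    { imageBinary: true },
  );

  return {
    digits: digitsOnly.data.text,
    rotatedText: rotated.data.text,
    binaryImage: rotated.data.imageBinary, // field name is an assumption
  };
}
```

Since the whitelist is not set through worker.setParameters, the second job is not affected by it, which is what makes this pattern usable with schedulers.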

Other Changes

  1. loadLanguage now resolves without error when language is loaded but writing to cache fails
    1. This allows for running in Firefox incognito mode using default settings (Tesseract fails when running in Firefox incognito browser #609)
  2. detect returns null values when OS detection fails rather than throwing an error (Failed to dectet OS #526)
  3. Memory leak causing crashes fixed (worker.recognize memory leak #678)
  4. Cache corruption should now be much less common (Fix asynchronous caching bug #666)
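
Since detect now returns null values instead of throwing, calling code needs a null check where it previously needed a try/catch. A minimal sketch, assuming the result's data object exposes script and orientation_degrees fields (these field names, and the helper name summarizeDetection, are assumptions and not confirmed in this issue):

```javascript
// Sketch only: `detectData` stands for the `data` object of a detect() result.
function summarizeDetection(detectData) {
  const { script, orientation_degrees: degrees } = detectData;
  // `== null` matches both null and undefined.
  if (script == null || degrees == null) {
    return 'Orientation/script detection failed';
  }
  return `${script}, rotated ${degrees} degrees`;
}
```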

Detail

New Output Format Interface

A single, unified interface has been added for specifying all output formats. output is now the 3rd argument to recognize (see example below). This replaces the separate getPDF function, as well as various setParameters options (tessjs_create_box, tessjs_create_hocr, tessjs_create_osd, tessjs_create_tsv, and tessjs_create_unlv).

const outputOpts = {
  text: true,
  blocks: true,
  hocr: true,
  tsv: true,
  box: false,
  unlv: false,
  osd: false,
  pdf: false,
  imageColor: false,
  imageGrey: false,
  imageBinary: false
};

const res = await worker.recognize(files[0], undefined, outputOpts);

Note: the default output formats (text, blocks, hocr, and tsv) are not changing between v3 and v4, so this change only impacts users who want non-default options. This also means that users who want text and pdf outputs only need to specify {pdf: true}, as text is already a default.
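
The effect described in the note can be pictured as the requested formats being layered over the defaults. The helper below is only an illustration of that behaviour, not the library's internal code; DEFAULT_OUTPUT and resolveOutputOpts are hypothetical names.

```javascript
// Defaults as listed in the note above (text, blocks, hocr, and tsv).
const DEFAULT_OUTPUT = { text: true, blocks: true, hocr: true, tsv: true };

// Illustration only: user-specified formats override or extend the defaults.
function resolveOutputOpts(overrides = {}) {
  return { ...DEFAULT_OUTPUT, ...overrides };
}
```

Under this picture, resolveOutputOpts({ pdf: true }) still has text set to true, which is why {pdf: true} alone is enough to get both text and pdf outputs.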

@Balearica Balearica pinned this issue Sep 17, 2022
This was referenced Oct 13, 2022
Balearica added a commit that referenced this issue Nov 25, 2022
See #662 for an explanation of the Tesseract.js Version 4 changes. The list below is auto-generated from commits.

* Added image preprocessing functions (rotate + save images)

* Updated createWorker to be async

* Reworked createWorker to be async and throw errors per #654

* Edited detect to return null when detection fails rather than throwing error per #526

* Updated types per #606 and #580 (#663) (#664)

* Removed unused files

* Added savePDF option to recognize per #488; cleaned up code for linter

* Updated download-pdf example for node to use new savePDF option

* Added OutputFormats option/interface for setting output

* Allowed for Tesseract parameters to be set through recognition options per #665

* Updated docs

* Edited loadLanguage to no longer overwrite cache with data from cache per #666

* Added interface for setting 'init only' options per #613

* Wrapped caching in try block per #609

* Fixed unit tests

* Updated setImage to resolve memory leak per #678

* Added debug output option per #681

* Fixed bug with saving images per #588

* Updated examples

* Updated readme and Tesseract.js-core version
@alaaeid1993

Is Arabic available in loadLanguage?

@Balearica
Collaborator Author

Balearica commented Apr 30, 2023

@alaaeid1993 Yes, the code for Arabic is ara. See this file for a list of languages and codes.

@alaaeid1993

@alaaeid1993 Yes, the code for Arabic is ara. See this file for a list of languages and codes.

Yes, the Arabic letters are no problem, but the Arabic numbers do not appear correctly.
I'm using the Angular version.

@Balearica
Collaborator Author

@alaaeid1993 This repo is for the JavaScript/WebAssembly port of Tesseract. We do not make changes to the Tesseract OCR engine or language data (.traineddata files). Issues related to accuracy are generally caused by Tesseract, and are therefore outside the scope of this repo.

To confirm, you can install the Tesseract CLI (the main project) with an equivalent version (v5.3.0 as of this writing) and run with equivalent settings. If you find that accuracy in Tesseract CLI is also unacceptable, then the issue is with Tesseract (not Tesseract.js), and you should look for a fix in the main Tesseract repo. If you find that Tesseract CLI produces correct results (with equivalent version/settings) but Tesseract.js does not, then we can discuss further here.

@Balearica Balearica unpinned this issue May 29, 2023
@naptha naptha deleted a comment from Kumar6174 Aug 1, 2023