Reduce file sizes #806

Closed
Balearica opened this issue Aug 21, 2023 · 10 comments

@Balearica
Collaborator

Balearica commented Aug 21, 2023

The amount of data loaded by Tesseract.js is quite large. For example, with default settings, a new user ends up downloading 15.34 MB of JavaScript and language data before recognition can run (not taking compression into account). While this is largely mitigated by caching language data after it is first downloaded (and should not be an issue for Node users at all), this amount of data likely annoys first-time browser users.

| File | Size |
| --- | --- |
| tesseract.min.js | 0.07 MB |
| worker.min.js | 0.13 MB |
| tesseract-core-simd.wasm.js | 4.74 MB |
| eng.traineddata.gz | 10.4 MB |
| total | 15.34 MB |

We should investigate whether this can be reduced without significant tradeoffs (e.g. runtime increase, dropping support for file formats, etc.).

@Balearica
Collaborator Author

Balearica commented Aug 21, 2023

Upon investigation, the primary change that would reduce file size is removing the Legacy engine and the corresponding .traineddata from the defaults. As the vast majority of users do not use the Legacy model, and it takes up a significant amount of space, it should be opt-in rather than opt-out.

English - 54% Total Reduction

| File | Size [LSTM + Legacy] | Size [LSTM Only] |
| --- | --- | --- |
| tesseract.min.js | 0.07 MB | 0.07 MB |
| worker.min.js | 0.13 MB | 0.13 MB |
| tesseract-core-simd.wasm.js | 4.74 MB | 3.95 MB |
| eng.traineddata.gz | 10.4 MB | 2.95 MB |
| total | 15.34 MB | 7.1 MB |

Chinese (Simplified) - 77% Total Reduction

| File | Size [LSTM + Legacy] | Size [LSTM Only] |
| --- | --- | --- |
| tesseract.min.js | 0.07 MB | 0.07 MB |
| worker.min.js | 0.13 MB | 0.13 MB |
| tesseract-core-simd.wasm.js | 4.74 MB | 3.95 MB |
| chi_sim.traineddata.gz | 20.2 MB | 1.72 MB |
| total | 25.14 MB | 5.87 MB |

Compiling Tesseract with different optimization settings would also reduce size significantly (by ~1.1 MB). However, this makes recognition significantly slower, so it is not worth it. See these benchmarks.
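
For anyone who wants to re-derive the totals and the percentage for the English table, here is a minimal sketch. The helper names `totalMB` and `percentReduction` are made up for illustration and are not part of Tesseract.js; the sizes are copied from the table.

```javascript
// Sizes in MB for the English case, copied from the table in this comment.
const lstmAndLegacy = [0.07, 0.13, 4.74, 10.4];
const lstmOnly = [0.07, 0.13, 3.95, 2.95];

// Sum a list of sizes and round to two decimals to hide floating-point noise.
function totalMB(sizes) {
  const sum = sizes.reduce((acc, size) => acc + size, 0);
  return Math.round(sum * 100) / 100;
}

// Percent reduction going from `before` to `after`, rounded to an integer.
function percentReduction(before, after) {
  return Math.round((1 - after / before) * 100);
}

const before = totalMB(lstmAndLegacy); // 15.34
const after = totalMB(lstmOnly);       // 7.1
console.log(percentReduction(before, after)); // 54
```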

@Balearica Balearica added this to the v5.0 milestone Aug 30, 2023
@Balearica
Collaborator Author

A benchmark shows this change leads to a ~50% reduction in runtime for first-time users. Numbers are shown below.

| Network Speed | Before | After | % Reduction |
| --- | --- | --- | --- |
| Slow | 13.9s | 6.3s | 55% |
| Med | 5.6s | 2.6s | 54% |
| Fast | 2.7s | 1.4s | 48% |

Details:

  1. This test was conducted using the "network throttling" feature in Chrome.
    1. "Slow" corresponds to 10 Mb/s + 20ms latency, "medium" corresponds to 30 Mb/s + 15ms latency, and "fast" corresponds to 100 Mb/s + 20ms latency.
  2. Cache was disabled and local storage was cleared.
    1. This forces code and language data to be re-downloaded, emulating the experience of a first-time user (the performance impact this change will have on repeat users is marginal, as the files will already be cached).
  3. This file was recognized.
    1. A more complex input would lead to a smaller change in percentage (although not absolute) terms, as a larger proportion of runtime would be spent on recognition.
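
As a rough sanity check on these numbers, the download portion can be estimated from file size and bandwidth alone. This is a back-of-envelope sketch, not how the benchmark was run: `estimateDownloadSeconds` is a hypothetical helper, it counts latency only once, and it ignores compression, parallel requests, and recognition time. The sizes are the English totals from earlier in this issue.

```javascript
// Rough estimate of download time in seconds.
// sizeMB: payload in megabytes; mbps: bandwidth in megabits per second;
// latencyMs: a single round trip, added once (a simplification).
function estimateDownloadSeconds(sizeMB, mbps, latencyMs) {
  return (sizeMB * 8) / mbps + latencyMs / 1000;
}

// "Slow" profile from the details above: 10 Mb/s + 20ms latency.
const slow = { mbps: 10, latencyMs: 20 };

const before = estimateDownloadSeconds(15.34, slow.mbps, slow.latencyMs);
const after = estimateDownloadSeconds(7.1, slow.mbps, slow.latencyMs);

console.log(before.toFixed(1), after.toFixed(1)); // 12.3 5.7
```

Both estimates land below the measured 13.9s and 6.3s, which would be consistent with the remaining time going to recognition and other overhead.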

@Balearica
Collaborator Author

Closing as completed. As of v5, by default only the LSTM code and data are loaded.

@lmk123

lmk123 commented Sep 28, 2023

How do I use the new language data?

Currently I have the langPath set to "https://tessdata.projectnaptha.com/4.0.0_best", but I noticed that https://tessdata.projectnaptha.com/4.0.0_best/chi_sim.traineddata.gz downloads at 11.4MB, not 1.72MB as you said.

@Balearica
Collaborator Author

@lmk123 If you use Tesseract.js v5 and do not set langPath, the new language data will be loaded automatically. If you wish to self-host the language data, then you should create a directory on your site with the language data files found here, and set langPath to that directory.

Some insight as to why the language data files are different sizes:

  1. The default data in Tesseract.js v5 comes from the 4.0.0_best_int directory
    1. This contains an integerized version of the tessdata_best data for LSTM, and no data for Legacy
    2. English is ~2.8 MB, Chinese is ~1.6 MB
  2. The default data in Tesseract.js v4 comes from the 4.0.0 directory
    1. This contains an integerized version of the tessdata_best data for LSTM, as well as data for Legacy
    2. English is ~10 MB, Chinese is ~19 MB
  3. The data you are currently using is from the 4.0.0_best directory
    1. This contains a non-integerized version of the tessdata_best data for LSTM, and no data for Legacy
    2. English is ~12 MB, Chinese is ~11 MB
      1. The non-integerized versions of the data are significantly larger, despite (purportedly) having minimal impact on recognition accuracy
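
A minimal sketch of the self-hosted setup, assuming Tesseract.js v5 is installed and the language files have been copied to a directory you control. `languageDataUrl` and `recognizeWithSelfHostedData` are hypothetical helpers written for this example; the URL pattern `<langPath>/<lang>.traineddata.gz` is inferred from the URLs quoted in this thread.

```javascript
// Hypothetical helper: the URL expected for a language file under a given
// langPath, following the pattern <langPath>/<lang>.traineddata.gz.
function languageDataUrl(langPath, lang) {
  return `${langPath.replace(/\/+$/, '')}/${lang}.traineddata.gz`;
}

// Sketch: recognize an image using language data served from `langPath`.
// Requires the tesseract.js v5 package; in a browser build, use the global
// Tesseract.createWorker instead of require().
async function recognizeWithSelfHostedData(image, langPath, lang = 'chi_sim') {
  const { createWorker } = require('tesseract.js');
  // The second argument is the engine mode; 1 means LSTM only.
  const worker = await createWorker(lang, 1, { langPath });
  try {
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    await worker.terminate();
  }
}

// Handy for checking in a browser tab that the file is actually served.
console.log(languageDataUrl('https://tessdata.projectnaptha.com/4.0.0_best', 'chi_sim'));
```

Opening the printed URL directly is a quick way to confirm which file (and which size) a given langPath will serve before wiring it into the worker.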

@lmk123

lmk123 commented Sep 30, 2023

Thanks for the detailed explanation, I figured it out.

Can you please put 4.0.0_best_int on "https://tessdata.projectnaptha.com/" as well? I found that "https://tessdata.projectnaptha.com/4.0.0_best_int/eng.traineddata.gz" opens to a 404.

I ask because cdn.jsdelivr.net doesn't work in my country, but I tested https://tessdata.projectnaptha.com and it works.

@Balearica
Collaborator Author

@lmk123 Unfortunately, the fact that https://tessdata.projectnaptha.com/ is not updating to include these files is not trivial to fix. That site is a GitHub Pages site configured to host that entire repo. At some point the repo passed GitHub's file size limit, so it stopped updating. This was part of the reason why the default was changed to use jsDelivr.

What country are you in? I was not aware that jsDelivr was blocked in certain countries. We should probably implement some sort of fallback mechanism if there are regional issues I was not aware of.
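
One possible shape for such a fallback, purely as a sketch: try a list of candidate hosts in order and use the first one that answers. Nothing like this exists in Tesseract.js as far as this thread states; `pickLangPath`, `defaultProbe`, and the candidate URLs in the usage note are hypothetical.

```javascript
// Default probe: a HEAD request with a 3 second timeout.
// Assumes a runtime with global fetch and AbortSignal.timeout
// (Node 18+ or a recent browser). Cross-origin rules may apply in browsers.
async function defaultProbe(url) {
  const res = await fetch(url, { method: 'HEAD', signal: AbortSignal.timeout(3000) });
  return res.ok;
}

// Return the first base URL in `candidates` that serves <lang>.traineddata.gz.
// `probe` is injectable so the selection logic can be tested offline.
async function pickLangPath(candidates, lang, probe = defaultProbe) {
  for (const base of candidates) {
    const url = `${base}/${lang}.traineddata.gz`;
    try {
      if (await probe(url)) return base;
    } catch (err) {
      // Network error or timeout: fall through to the next candidate.
    }
  }
  throw new Error(`No reachable host for ${lang}.traineddata.gz`);
}
```

Usage would be along the lines of `const langPath = await pickLangPath(['https://cdn.example/tessdata', 'https://mirror.example/tessdata'], 'eng')`, with the result passed as the `langPath` option when creating the worker.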

@lmk123

lmk123 commented Sep 30, 2023

> What country are you in?

China. The following picture (source) shows connectivity to cdn.jsdelivr.net from within China. Most of the results are red, which means they can't connect.

[image: connectivity results for cdn.jsdelivr.net in China]

> at some point that repo passed GitHub's file size limit

Maybe put 4.0.0_best_int in a separate repository?

@ivysrono

ivysrono commented Mar 3, 2024

> What country are you in? I was not aware that jsDelivr was blocked in certain countries. Should probably implement some sort of fallback mechanism if there are regional issues I was not aware of.

We can download the data manually, but where should we put the file on the PC?

By the way, could you convert the data to pure JS so we can store it anywhere?

@Balearica
Collaborator Author

The fact that Tesseract.js uses the jsDelivr CDN by default is unrelated to the reduction of file sizes, which is the topic of this issue. I created issue #899 to discuss jsDelivr not working in China, so discussion about the CDN should move there.

@naptha naptha locked as off-topic and limited conversation to collaborators Mar 3, 2024