Delete invalid .traineddata files in cache #753

Balearica · 2023-05-07T21:51:07Z

One of the most common error messages reported is "Error opening data file ./eng.traineddata" (or the equivalent for other languages). This is due to our current caching behavior.

When a .traineddata file is downloaded, any fetch response reported as ok (which corresponds to a status of 200-299) is cached.

tesseract.js/src/worker-script/index.js

Lines 108 to 111 in 7a087ca

    
    if (!resp.ok) {
      throw Error(`Network error while fetching ${fetchUrl}. Response code: ${resp.status}`);
    }
    data = await resp.arrayBuffer();

The cached file is then used until the user manually deletes it, even if the file is invalid. The assumption this code makes is that an ok response indicates that some .traineddata file was successfully downloaded, and if that file is somehow corrupted, that is because the developer uploaded a corrupted .traineddata file.

This does not appear to be the case. Some server configurations appear to return 200 responses, even if the langPath value is invalid (see #714). Furthermore, given user reports, this may even happen when the default langPath value is used (see #521), although the mechanism for this is unclear.

We should change the code so that Tesseract.js deletes the saved .traineddata file when it detects that the file is invalid. With this change, the next time the code is run it will try to download the .traineddata file from langPath again, rather than re-using cached data that has already been determined to be invalid.
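To illustrate the proposed behavior, here is a minimal sketch, not the actual Tesseract.js implementation. The `Map` cache, `isValidTrainedData`, `loadLangData`, and `makeFakeFetch` are all hypothetical names made up for this example, and the validity check is a toy stand-in for Tesseract failing to initialize with the data.

```javascript
// Toy validity check (an assumption for illustration): treat the data as invalid
// if it is empty or starts with '<', as an HTML error page would.
function isValidTrainedData(data) {
  return data.byteLength > 0 && data[0] !== 0x3c;
}

// Hypothetical loader: prefer the cache, download otherwise, and delete the
// cached entry when the data turns out to be invalid.
async function loadLangData(lang, fetchFn, cache) {
  const key = `${lang}.traineddata`;
  let data = cache.get(key);
  if (data === undefined) {
    const resp = await fetchFn(key);
    if (!resp.ok) {
      throw Error(`Network error while fetching ${key}. Response code: ${resp.status}`);
    }
    data = new Uint8Array(await resp.arrayBuffer());
    cache.set(key, data); // cached on any ok response, as described above
  }
  if (!isValidTrainedData(data)) {
    cache.delete(key); // the proposed change: do not keep known-bad data
    throw Error(`Error opening data file ./${key}`);
  }
  return data;
}

// Fake fetch for the demo: the first call returns an HTML page with status 200,
// later calls return placeholder bytes standing in for a real file.
function makeFakeFetch() {
  let calls = 0;
  const bodies = [
    Uint8Array.from('<html>Not found</html>', (c) => c.charCodeAt(0)),
    Uint8Array.from([1, 2, 3]),
  ];
  const fetchFn = async () => {
    const body = bodies[Math.min(calls, bodies.length - 1)];
    calls += 1;
    return { ok: true, status: 200, arrayBuffer: async () => body.buffer };
  };
  return { fetchFn, getCalls: () => calls };
}

(async () => {
  const demoCache = new Map(); // in-memory stand-in for the real cache
  const { fetchFn } = makeFakeFetch();
  try {
    await loadLangData('eng', fetchFn, demoCache);
  } catch (err) {
    console.log(`First run failed: ${err.message}`);
  }
  const data = await loadLangData('eng', fetchFn, demoCache);
  console.log(`Second run loaded ${data.length} bytes`);
})();
```

Without the `cache.delete(key)` line, the second run would read the HTML page back from the cache and fail again, which matches the behavior users have been reporting.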


Balearica · 2023-05-16T03:18:36Z

Summary of this change

TL;DR Setting cacheMethod: 'none' or cacheMethod: 'refresh' to avoid invalid files being cached should no longer be necessary.

Explanation

By default, Tesseract.js caches .traineddata files to ensure they are only downloaded once. This is because .traineddata files are very large (most common languages are 10-25MB) and are virtually never updated. In certain uses of Tesseract.js, the majority of runtime is attributable to downloading the .traineddata file.

Prior to v4.0.6 there was a bug where cached .traineddata files were never cleared even if they were invalid. Therefore, if a user somehow received an invalid .traineddata file, Tesseract.js would stop working until the cache was manually cleared (throwing the error "Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.").

Due to this bug, many developers using Tesseract.js started bypassing the caching feature entirely by setting cacheMethod: 'none' or cacheMethod: 'refresh'. This is widely cited in other issues as the solution for the caching bug (e.g. #334, #351, #398, #414, #439, #481, #618, #676).

Starting in v4.0.6 invalid .traineddata files should be automatically cleared from the cache. Therefore, setting cacheMethod: 'none' or cacheMethod: 'refresh' as a workaround for this bug should no longer be necessary.
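As a rough illustration of what this means for application code, the options objects below contrast the old workaround with the default. The variable names are made up for this example, and the way options are passed to createWorker differs between major versions, so check the API docs for the version you use.

```javascript
// Workaround commonly used before v4.0.6: bypass the cache entirely,
// so the language data is downloaded on every run.
const workaroundOptions = { cacheMethod: 'none' };

// Alternative workaround cited in the same issues.
const refreshOptions = { cacheMethod: 'refresh' };

// From v4.0.6 on: leave cacheMethod unset so the default caching applies
// and the language data is only downloaded once.
const defaultOptions = {};

// Illustrative only (assumed call shapes; verify against your version's docs):
//   v4: const worker = await createWorker(defaultOptions);
//   v5: const worker = await createWorker('eng', 1, defaultOptions);
```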

laurent22 · 2024-03-12T17:34:25Z

I'm wondering why Tesseract.js is handling this caching and downloading of training data? I would much prefer having full control over this rather than having to rely on some built-in solution which may or may not work (for me, as of 5.0.4 it doesn't work) and that's not really related to the core feature of Tesseract.

That way you could focus on developing what's unique about Tesseract.js. Downloading and caching files is something everybody can do, and the right solution often differs depending on the application. For example, someone may want a fully offline solution and bundle the training data with the app, or check for updates at a regular interval, etc.

Balearica · 2024-03-15T21:32:48Z

@laurent22 The purpose of Tesseract.js is to provide a high-level, user friendly interface for running OCR. The vast majority of users do not want to manage training data. Therefore, managing language data is within the scope of this project.

That being said, if you have some application that would benefit from having more control over language data than Tesseract.js currently provides, you can open a new GitHub issue with a feature request. For example, it would not be particularly difficult to allow for providing language data directly as an ArrayBuffer in an optional argument for createWorker.

Balearica changed the title ~~Delete invalid .traineddata files in cache~~ Rework cache options, delete invalid .traineddata files in cache May 11, 2023

Balearica changed the title ~~Rework cache options, delete invalid .traineddata files in cache~~ Delete invalid .traineddata files in cache May 11, 2023

Balearica added a commit that referenced this issue May 11, 2023

Delete invalid .traineddata files in cache per #753

d267874

Balearica mentioned this issue May 11, 2023

Delete invalid .traineddata files in cache per #753 #757

Merged

Balearica closed this as completed in #757 May 11, 2023

Balearica added a commit that referenced this issue May 11, 2023

Delete invalid .traineddata files in cache per #753 (#757)

6d8e9fa

Balearica mentioned this issue May 12, 2023

When I load chi_sim with the 4.0.0_best tessdata, the console throws a warning message #521

Closed

Balearica mentioned this issue May 29, 2023

Upgrading from v2 to v5 Guide #771

Open
