Delete invalid .traineddata files in cache #753

Closed
Balearica opened this issue May 7, 2023 · 3 comments · Fixed by #757


@Balearica
Collaborator

One of the most common error messages reported is Error opening data file ./eng.traineddata (or the equivalent for other languages). This is due to our current caching behavior.

When a .traineddata file is downloaded, any fetch response reported as ok (which corresponds to a status of 200-299) is cached.

if (!resp.ok) {
  throw Error(`Network error while fetching ${fetchUrl}. Response code: ${resp.status}`);
}
data = await resp.arrayBuffer();

The cached file is then used until the user manually deletes it, even if the file is invalid. The assumption this code makes is that an ok response indicates that some .traineddata file was successfully downloaded, and if that file is somehow corrupted, that is because the developer uploaded a corrupted .traineddata file.

This does not appear to be the case. Some server configurations appear to return 200 responses, even if the langPath value is invalid (see #714). Furthermore, given user reports, this may even happen when the default langPath value is used (see #521), although the mechanism for this is unclear.
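To illustrate why `resp.ok` alone is not a reliable signal, here is a minimal sketch using the standard Fetch `Response` class (available globally in browsers and in Node 18+). The helper `looksLikeHtmlResponse` is hypothetical and is not part of Tesseract.js. It assumes a misconfigured server labels its error page with a `text/html` content type, which may not hold for every server.

```javascript
// Hypothetical helper (not a Tesseract.js function): flags responses whose
// Content-Type suggests an HTML page rather than binary language data.
// Assumption: the server sets Content-Type honestly; some servers do not.
function looksLikeHtmlResponse(resp) {
  const contentType = resp.headers.get('content-type') || '';
  return contentType.toLowerCase().includes('text/html');
}

// Simulate a server that answers an invalid langPath with an HTML page
// and a 200 status, as described above.
const fallbackPage = new Response('<html><body>Not found</body></html>', {
  status: 200,
  headers: { 'content-type': 'text/html' },
});

// Simulate a response that carries binary data.
const binaryData = new Response(new Uint8Array([1, 2, 3]), {
  status: 200,
  headers: { 'content-type': 'application/octet-stream' },
});

// Both responses pass the `resp.ok` check, but only one plausibly
// contains a .traineddata file.
const fallbackPassesOkCheck = fallbackPage.ok;
const fallbackIsHtml = looksLikeHtmlResponse(fallbackPage);
const binaryIsHtml = looksLikeHtmlResponse(binaryData);
```

A check like this would only catch some bad downloads, so it complements, rather than replaces, clearing the cache once a file is found to be invalid.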

We should change tesseract.js so that it deletes the saved .traineddata file when it detects that the file is invalid. With this change, the next time the code is run it will try to download the .traineddata file from langPath again, rather than reusing cached data that has already been determined to be invalid.
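The proposed behavior could be sketched roughly as follows. This is not the actual Tesseract.js implementation: `initLanguage`, `cache`, `fetchLang`, `isValid`, and `notHtml` are all hypothetical names, the cache is a plain in-memory `Map`, and the validity check (rejecting data that begins with `<`) is only an assumed stand-in for however invalid data is really detected.

```javascript
// Sketch only. All names here are hypothetical stand-ins, not Tesseract.js
// internals. `cache` is any Map-like store, `fetchLang` downloads the data,
// and `isValid` decides whether the data is usable.
async function initLanguage(lang, { cache, fetchLang, isValid }) {
  let data;
  if (cache.has(lang)) {
    // Reuse the cached copy so the large file is downloaded only once.
    data = cache.get(lang);
  } else {
    data = await fetchLang(lang);
    cache.set(lang, data);
  }

  if (!isValid(data)) {
    // Delete the invalid entry so the next run downloads the file again
    // instead of reusing data already known to be bad.
    cache.delete(lang);
    throw new Error(`Invalid language data for "${lang}"; cache entry deleted`);
  }
  return data;
}

// Assumed example validator: treats data that starts with '<' (0x3c) as an
// HTML page that was served in place of a .traineddata file.
function notHtml(data) {
  const bytes = new Uint8Array(data);
  return bytes.length > 0 && bytes[0] !== 0x3c;
}
```

With this shape, a bad download fails once, and a later call to `initLanguage` triggers a fresh download because the cache entry is gone.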

@Balearica Balearica changed the title Delete invalid .traineddata files in cache Rework cache options, delete invalid .traineddata files in cache May 11, 2023
@Balearica Balearica changed the title Rework cache options, delete invalid .traineddata files in cache Delete invalid .traineddata files in cache May 11, 2023
@Balearica
Collaborator Author

Summary of this change

TL;DR Setting cacheMethod: 'none' or cacheMethod: 'refresh' to avoid invalid files being cached should no longer be necessary.

Explanation

By default, Tesseract.js caches .traineddata files to ensure they are only downloaded once. This is because .traineddata files are very large (most common languages are 10-25MB) and are virtually never updated. In certain uses of Tesseract.js, the majority of runtime is attributable to downloading the .traineddata file.

Prior to v4.0.6 there was a bug where cached .traineddata files were never cleared even if they were invalid. Therefore, if a user somehow received an invalid .traineddata file, Tesseract.js would stop working until it was manually cleared (throwing the error "Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.").

Due to this bug, many developers using Tesseract.js started bypassing the caching feature entirely by setting cacheMethod: 'none' or cacheMethod: 'refresh'. This is widely cited in other issues as the solution for the caching bug (e.g. #334, #351, #398, #414, #439, #481, #618, #676).

Starting in v4.0.6 invalid .traineddata files should be automatically cleared from the cache. Therefore, setting cacheMethod: 'none' or cacheMethod: 'refresh' as a workaround for this bug should no longer be necessary.

@laurent22

laurent22 commented Mar 12, 2024

I'm wondering why Tesseract.js is handling this caching and downloading of training data? I would much prefer having full control over this rather than having to rely on some built-in solution which may or may not work (for me, as of 5.0.4 it doesn't work) and that's not really related to the core feature of Tesseract.

That way you could focus on developing what's unique about Tesseract.js. Downloading and caching files is something everybody can do, and the right solution often differs depending on the application. For example, someone may want a fully offline solution that bundles the training data with the app, or may want to check for updates at a regular interval, etc.

@Balearica
Collaborator Author

@laurent22 The purpose of Tesseract.js is to provide a high-level, user friendly interface for running OCR. The vast majority of users do not want to manage training data. Therefore, managing language data is within the scope of this project.

That being said, if you have an application that would benefit from more control over language data than Tesseract.js currently provides, you can open a new GitHub issue with a feature request. For example, it would not be particularly difficult to allow language data to be provided directly as an ArrayBuffer in an optional argument to createWorker.
