[16.0] pdf content indexing: PyMupdf + Tesseract #431

Open
wants to merge 17 commits into
base: 16.0

len-foss (Contributor) commented Sep 8, 2023

This PR integrates PyMuPDF to perform text extraction.

The OCR with Tesseract has also been migrated from version 8. To be able to perform extraction on long documents, PyMuPDF is used to split the content into multiple images, because Tesseract has a limit on the image size it can process. I've also added the possibility to explicitly change Tesseract's base language with a context key.
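
For readers who want a picture of the splitting step, here is a minimal sketch; it is not the code from this PR. It assumes `pytesseract` and Pillow are installed alongside PyMuPDF. The names `tile_boxes` and `ocr_pdf` and the `max_pixels` cap are hypothetical, and the `lang` argument only stands in for whatever value the context key would carry.

```python
import io
import math


def tile_boxes(width, height, max_side):
    """Split a width x height area into boxes of at most max_side per side.

    Returns (x0, y0, x1, y1) tuples, row by row.
    """
    if max_side <= 0:
        raise ValueError("max_side must be positive")
    cols = max(1, math.ceil(width / max_side))
    rows = max(1, math.ceil(height / max_side))
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * max_side, r * max_side
            boxes.append(
                (x0, y0, min(x0 + max_side, width), min(y0 + max_side, height))
            )
    return boxes


def ocr_pdf(path, lang="eng", dpi=300, max_pixels=4000):
    """Render each page in tiles and OCR every tile with Tesseract."""
    # Third-party imports are local so tile_boxes works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    # max_pixels is an illustrative cap, not a documented Tesseract limit.
    # PDF coordinates are in points (1/72 inch), so convert pixels to points.
    max_points = max_pixels * 72 / dpi
    chunks = []
    with fitz.open(path) as doc:
        for page in doc:
            for box in tile_boxes(page.rect.width, page.rect.height, max_points):
                # Render only the clipped region of the page at the given DPI.
                pix = page.get_pixmap(dpi=dpi, clip=fitz.Rect(*box))
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                chunks.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(chunks)
```

A call such as `ocr_pdf("scan.pdf", lang="fra")` would OCR a hypothetical file with the French model. Note that a tile boundary can cut through a line of text, so a real implementation may prefer to split only between pages or to overlap tiles.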

It also adds a new module to perform OCR in individual jobs rather than in a cron.

len-foss mentioned this pull request Sep 15, 2023
4 tasks
len-foss (Contributor, Author) commented Nov 7, 2023

@agent-z28 FYI

2 participants