[16.0] pdf content indexing: PyMupdf + Tesseract #431

len-foss · 2023-09-08T15:32:22Z

This PR integrates PyMuPDF to perform text extraction.

The Tesseract OCR module has also been migrated from version 8. To be able to run extraction on long documents, PyMuPDF is used to split the content into multiple images, because Tesseract has a limit on the image size it can process. I've also added the option to explicitly change Tesseract's base language with a context key.
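For reviewers, here is a rough sketch of the idea, not the module's actual code. It assumes PyMuPDF (`fitz`), `pytesseract`, and Pillow are installed. The names `MAX_SIDE_PX`, `capped_zoom`, `lang_from_context`, `ocr_pdf`, and the context key `"ocr_lang"` are hypothetical and chosen only for illustration. The size cap is a placeholder value, not Tesseract's real limit, and scaling down oversized pages is an extra safeguard I'm assuming here on top of the per-page split.

```python
import io

# Hypothetical cap on the longest image side, in pixels (placeholder value).
MAX_SIDE_PX = 4000


def capped_zoom(width_pt, height_pt, dpi=300, max_side_px=MAX_SIDE_PX):
    """Return a render zoom for a page measured in points (1/72 inch).

    The zoom matches the requested DPI unless the rendered page would
    exceed max_side_px on its longest side, in which case it is reduced.
    """
    zoom = dpi / 72.0
    longest = max(width_pt, height_pt) * zoom
    if longest > max_side_px:
        zoom *= max_side_px / longest
    return zoom


def lang_from_context(context, default="eng"):
    """Pick the Tesseract language from a context dict (hypothetical key)."""
    return context.get("ocr_lang") or default


def ocr_pdf(data, context=None, dpi=300):
    """Render each PDF page to its own image and OCR the pages one by one."""
    # Third-party imports are kept local so the helpers above stay usable
    # even when the OCR stack is not installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    lang = lang_from_context(context or {})
    texts = []
    with fitz.open(stream=data, filetype="pdf") as doc:
        for page in doc:
            zoom = capped_zoom(page.rect.width, page.rect.height, dpi=dpi)
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            # PNG is used as the interchange format between PyMuPDF and Pillow.
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(texts)
```

Rendering page by page keeps every image handed to Tesseract bounded by the page size rather than by the document length, which is the point of the split. In this sketch a call such as `ocr_pdf(pdf_bytes, context={"ocr_lang": "fra"})` would select French, provided the matching Tesseract language data is installed.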

It also adds a new module to perform OCR in individual jobs rather than in a cron.

[ADD] tests for attachments_to_filesystem

…text

This is more performant and makes it easy to split pages, which avoids errors caused by Tesseract's maximum image size.

len-foss · 2023-11-07T11:33:16Z

@agent-z28 FYI

hbrunn and others added 13 commits September 7, 2023 11:09

[ADD] document_ocr

8d69a35

[FIX] CI

d3a37e5

[ADD] tests for attachments_to_filesystem

[ADD] cap the amount of documents to ocr per cronjob run

5ee5b3f

[FIX] ignore files with unknown mimetype

8ac1ca8

[FIX] use png as for pillow interchange

c10e84e

[IMP] document_ocr: handle invalid data in attachments gracefully

a00735b

[IMP] document_ocr: pre-commit execution

f1f13f1

[MIG] document_ocr -> attachment_indexation_ocr

5edbe16

[IMP] attachment_indexation_ocr: option to pass tesseract lang in con…

73cff87

…text

[IMP] attachment_indexation_ocr: convert pdf with fitz

6196f30

This is more performant and makes it easy to split pages, which avoids errors caused by Tesseract's maximum image size.

[REF] attachment_indexation_ocr: refactor test class for inheritance

ecadc83

[ADD] attachment_indexation_ocr_job

f31245a

[ADD] attachment_indexation_mupdf

7f394be

len-foss mentioned this pull request Sep 15, 2023

Migration to version 16.0 OCA/dms#213

Open

4 tasks

len-foss added 4 commits February 15, 2024 11:19

[UPD] requirements.txt: add dependency on textract

8d1125a

[ADD] attachment_indexation_textract

1125d29

[UPD] requirements.txt: use textract fork

8e98036

[FIX] attachment_indexation_textract: use a textract fork to fix pip

834573b
