Intelligent document processing with AI

ML-based PDF information extraction system with storage and search functions.

Currently pretrained on Hungarian EKR documents (a type of official national contract), but you can train 6 different models on your own data.

Services:

Backend

upload PDFs
- store the PDF file in AWS S3
- extract text with Tesseract OCR
- extract information using the model service
- store text data in Elasticsearch
search
- return PDF text data by query (match/Levenshtein/regex/...)
download
- download a PDF file by filename
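The search modes listed above map naturally onto Elasticsearch query types. The following is only a sketch of how such a request body could be built, not the backend's actual code (the backend itself is written in JavaScript). The function name `build_query`, the mode names, and the `text` field are assumptions made for illustration; the `match`, `fuzzy` (edit distance based) and `regexp` query types are part of the standard Elasticsearch Query DSL.

```python
def build_query(mode, value, field="text"):
    """Return an Elasticsearch request body for one search mode.

    `field` is a hypothetical index field holding the OCR text.
    """
    if mode == "match":
        # Full-text match on the analyzed field.
        clause = {"match": {field: value}}
    elif mode == "levenshtein":
        # The fuzzy query matches terms within a Levenshtein edit distance.
        clause = {"fuzzy": {field: {"value": value, "fuzziness": "AUTO"}}}
    elif mode == "regex":
        # The regexp query matches terms against a regular expression.
        clause = {"regexp": {field: {"value": value}}}
    else:
        raise ValueError(f"unsupported search mode: {mode}")
    return {"query": clause}


if __name__ == "__main__":
    print(build_query("levenshtein", "contract"))
```

The returned dictionary can be sent as the JSON body of a search request against the index that stores the extracted text.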

Tech: JavaScript, Express.js, Pdf-Poppler, Tesseract-OCR, Elasticsearch, AWS S3

Model

The backend can run any .py or .ipynb file, as long as it follows the expected input/output formats.

predict
- batch text information extraction with CRFSuite ML model (Conditional Random Fields)
- several other models were tried, but they reached lower accuracy on this amount of data
train
- todo
tested models (dataset):
- Custom neural networks:
  - Embedding + bi-LSTM
  - Embedding + bi-LSTM + LSTM
  - Embedding + bi-LSTM + LSTM + CRF
- BERT
- XGBoost
- CRFSuite
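To illustrate how CRF-based extraction of the kind used by predict typically works, here is a minimal sketch of per-token feature extraction followed by training and tagging with python-crfsuite. The feature set, the function names, and the `model.crfsuite` path are assumptions for illustration, not the repository's actual features or file layout. The `pycrfsuite.Trainer` and `pycrfsuite.Tagger` calls are part of the python-crfsuite package.

```python
def token_features(tokens, i):
    """Build a list of string features for the token at position i."""
    token = tokens[i]
    features = [
        "bias",
        "lower=" + token.lower(),
        "is_upper=%s" % token.isupper(),
        "is_title=%s" % token.istitle(),
        "is_digit=%s" % token.isdigit(),
        "suffix3=" + token[-3:],
    ]
    # Context features: neighbouring tokens or sequence boundaries.
    if i > 0:
        features.append("prev_lower=" + tokens[i - 1].lower())
    else:
        features.append("BOS")
    if i < len(tokens) - 1:
        features.append("next_lower=" + tokens[i + 1].lower())
    else:
        features.append("EOS")
    return features


def sentence_features(tokens):
    """Feature sequence for a whole token sequence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_and_tag(train_sentences, train_labels, new_tokens,
                  model_path="model.crfsuite"):
    """Train a CRF on labeled token sequences, then tag a new sequence."""
    import pycrfsuite  # imported lazily; only needed for training/tagging

    trainer = pycrfsuite.Trainer(verbose=False)
    for tokens, labels in zip(train_sentences, train_labels):
        trainer.append(sentence_features(tokens), labels)
    trainer.train(model_path)

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag(sentence_features(new_tokens))
```

Hand-crafted features like these give a linear-chain CRF useful signal from little labeled data, which is one plausible reason it can outperform neural models on a small dataset.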

Tech: Python, Flask, Keras, PyTorch, BERT, XGBoost, PyCRFSuite

Frontend

draft

Tech: JavaScript, React

gergomiklos/document-processing-with-ai
