


gergomiklos/document-processing-with-ai


Intelligent document processing with AI

ML-based PDF information extraction system with storage and search functions.

Currently pretrained on Hungarian EKR documents (a kind of official national contract), but you can train 6 different models on your own data.

Services:

Backend

  • upload pdfs
    • store pdf file in AWS S3
    • extract text with Tesseract OCR
    • extract information using the model service
    • store text data in Elasticsearch
  • search
    • return pdf text data by query (match/Levenshtein/regex/...)
  • download
    • download pdf file by filename

Tech: JavaScript, Express.js, Pdf-Poppler, Tesseract-OCR, Elasticsearch, AWS S3
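
The search modes listed above correspond to standard Elasticsearch query types. The sketch below is only an illustration, written in Python (the language of the model service) rather than taken from the backend: the field name `text`, the function name `build_search_body`, and the mode names are all assumptions. It builds request bodies with the `match`, `fuzzy` (Levenshtein edit distance), and `regexp` queries of the Elasticsearch Query DSL.

```python
def build_search_body(mode, query, field="text"):
    """Build an Elasticsearch request body for one search mode.

    The field name and the mode names are hypothetical; only the query
    types (match, fuzzy, regexp) come from the Elasticsearch Query DSL.
    """
    if mode == "match":
        # Analyzed full-text match on the stored OCR text.
        clause = {"match": {field: query}}
    elif mode == "levenshtein":
        # Fuzzy term query; "AUTO" picks the allowed edit distance
        # from the length of the term.
        clause = {"fuzzy": {field: {"value": query, "fuzziness": "AUTO"}}}
    elif mode == "regex":
        # Regular expression applied to the indexed terms of the field.
        clause = {"regexp": {field: {"value": query}}}
    else:
        raise ValueError(f"unsupported search mode: {mode}")
    return {"query": clause}
```

Fuzzy matching is useful here because OCR output often contains single-character errors that an exact match would miss.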

Model

The backend can run any .py or .ipynb file, as long as it follows the expected input/output formats

  • predict

    • batch text information extraction with CRFSuite ML model (Conditional Random Fields)
    • many other models have been tried, but those reached lower accuracy for this amount of data
  • train

    • todo
  • tested models (dataset):

    • Custom neural networks:
      • Embedding + bi-LSTM
      • Embedding + bi-LSTM + LSTM
      • Embedding + bi-LSTM + LSTM + CRF
    • BERT
    • XGBoost
    • CRFSuite

Tech: Python, Flask, Keras, PyTorch, BERT, XGBoost, PyCRFSuite
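
To make the CRF-based extraction step more concrete, here is a minimal sketch of how text can be prepared for and read back from a linear-chain CRF tagger. It is not the repository's code: the feature set, the BIO tagging scheme, the `AMOUNT` label, and the function names are all assumptions chosen for illustration. Per-token feature dicts of this shape are the kind of input CRFSuite bindings typically accept, and the decoder turns a predicted tag sequence into extracted fields.

```python
def token_features(tokens, i):
    """Describe token i by its own shape and its immediate neighbours."""
    token = tokens[i]
    features = {
        "bias": 1.0,
        "lower": token.lower(),
        "is_upper": token.isupper(),
        "is_title": token.istitle(),
        "is_digit": token.isdigit(),
        "suffix3": token[-3:],
    }
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sequence
    return features


def sequence_features(tokens):
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def decode_bio(tokens, tags):
    """Group tokens tagged B-X / I-X into (label, text) pairs."""
    fields = []
    label, chunk = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new field starts; close the previous one, if any.
            if label is not None:
                fields.append((label, " ".join(chunk)))
            label, chunk = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            chunk.append(token)
        else:
            # "O" or an inconsistent I- tag ends the current field.
            if label is not None:
                fields.append((label, " ".join(chunk)))
            label, chunk = None, []
    if label is not None:
        fields.append((label, " ".join(chunk)))
    return fields
```

For example, with the made-up tokens `["Contract", "value", ":", "1200", "HUF"]` and tags `["O", "O", "O", "B-AMOUNT", "I-AMOUNT"]`, `decode_bio` returns `[("AMOUNT", "1200 HUF")]`.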

Frontend

  • draft

Tech: JavaScript, React