Automatic Data Enrichment of Optical Recognition Systems on Forms

OCR technologies can read the contents of a form and extract the position and text of each word, but they cannot capture the relationships between the words. This prototype feeds two images to an OCR engine: an image of an unfilled form and an image of a filled form containing the data to be enriched. The OCR output is then run through a post-processing step that combines a modified Euclidean distance measure with fuzzy string search to cluster field names and field values in the filled form image.
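The clustering idea can be illustrated with a minimal sketch. This is not the prototype's code: the `Word` class, the function names, the vertical weighting used as the "modification" of the Euclidean distance, and the similarity threshold are all assumptions chosen for illustration, and `difflib` stands in for whichever fuzzy search the prototype uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import hypot


@dataclass
class Word:
    """One OCR result: recognised text plus the centre of its bounding box."""
    text: str
    x: float
    y: float


def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_distance(a: Word, b: Word, y_weight: float = 3.0) -> float:
    """Euclidean distance that penalises vertical offsets more heavily.

    The weighting is an illustrative assumption, not the prototype's
    actual modification.
    """
    return hypot(a.x - b.x, y_weight * (a.y - b.y))


def cluster_fields(blank_words, filled_words, min_similarity=0.8):
    """Group every non-label word of the filled form under its nearest label."""
    # Words of the filled form that fuzzily match a word of the blank form
    # are treated as field names (labels); the rest are field values.
    labels = [
        w for w in filled_words
        if any(similarity(w.text, b.text) >= min_similarity for b in blank_words)
    ]
    if not labels:
        return {}
    values = [w for w in filled_words if w not in labels]

    clusters = {label.text: [] for label in labels}
    for value in values:
        nearest = min(labels, key=lambda label: weighted_distance(label, value))
        clusters[nearest.text].append(value.text)
    return clusters


blank = [Word("Name", 10, 10), Word("Occupation", 10, 50)]
filled = [
    Word("Name", 10, 10), Word("John", 80, 11),
    Word("Occupatlon", 10, 50), Word("Baker", 80, 52),  # label with an OCR error
]
clusters = cluster_fields(blank, filled)
print(clusters)
```

Fuzzy matching lets a misread label such as "Occupatlon" still be recognised as a field name, while the weighted distance favours values that sit on the same line as their label.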

Quick start

Clone the repository: git clone https://github.com/Adilius/form-ocr-data-enrichment.git
Change directory to the repository: cd form-ocr-data-enrichment/
Install the required packages: pip install -r requirements.txt
Run the script: python app.py

Input files location

Three different types of forms were tested with this prototype. Each form type requires an image of the unfilled form as well as images of filled forms.

.
├── input                   # Contains input images
│   ├── form_blank          # Images of blank forms
│   │   ├── bottom_form
│   │   ├── middle_form
│   │   └── top_form
│   └── form_filled         # Images of filled forms
│       ├── bottom_form 
│       ├── middle_form 
│       └── top_form  
└── ...
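As a sketch of how this layout could be traversed, the snippet below collects the blank and filled images for each form type. The function name `collect_inputs` and the choice to accept any file type are assumptions for illustration; the prototype's own loading code may differ.

```python
from pathlib import Path

# Form types taken from the directory layout above.
FORM_TYPES = ("top_form", "middle_form", "bottom_form")


def list_files(folder: Path):
    """Return the files in a folder in sorted order, or [] if it is missing."""
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.is_file())


def collect_inputs(input_dir):
    """Map each form type to its blank and filled image paths."""
    root = Path(input_dir)
    inputs = {}
    for form_type in FORM_TYPES:
        blank = list_files(root / "form_blank" / form_type)
        filled = list_files(root / "form_filled" / form_type)
        # Every form type needs both kinds of image to be processed.
        if not blank or not filled:
            raise FileNotFoundError(f"Missing blank or filled images for {form_type}")
        inputs[form_type] = {"blank": blank, "filled": filled}
    return inputs
```

Calling `collect_inputs("input")` from the repository root would then return one entry per form type, each with a list of blank and a list of filled image paths.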

Input files example

Output files location

The output for each form type consists of a .csv file and an image file for each input image.

.
├── output                # Contains output
│   ├── top_form          # Output for top form
│   │   ├── 1.csv         # Raw output in text
│   │   ├── 1.png         # Image containing resulting bounding boxes
│   │   └── ...
│   └── ...
└── ...

Output

Image with the bounding boxes drawn on it. The original bounding boxes come from the text detection engine, so the better the engine, the better the boxes. Each colour indicates a grouping, and a thicker border indicates the field value (label):

Text recognition run on the boxes. The result is very poor; it depends entirely on the text recognition engine used, and I used a very weak one:

name	johyjohy dioe
occupation	dioecnixt ician
hometown	cnixtustly
favorite animal	ician0q
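Output like the above can be loaded into field name/value pairs with a few lines of Python. This is a sketch only: it assumes two delimited columns per row, as displayed above, and takes the delimiter as a parameter because the exact layout of the generated .csv files is not documented here. The function name `read_field_pairs` is hypothetical.

```python
import csv
import io


def read_field_pairs(text, delimiter="\t"):
    """Parse 'field name <delimiter> field value' rows into a dict."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Rows with fewer than two columns (for example blank lines) are skipped.
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


sample = (
    "name\tjohyjohy dioe\n"
    "occupation\tdioecnixt ician\n"
    "hometown\tcnixtustly\n"
    "favorite animal\tician0q\n"
)
pairs = read_field_pairs(sample)
print(pairs["name"])
```

For a file on disk, the text could be read with `Path("output/top_form/1.csv").read_text()` and passed in with whichever delimiter the file actually uses.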
