
OCR Confidence Analysis script written in python


UB-Mannheim/ocapy

 
 


OCA.py - Visualizing the word confidence of OCR results (ALTO-XML)

by Michael Kubina

OCA.py stands for OCR Confidence Analysis, a script written in Python.

This repository is the graduation project for the 2022 Data Librarian Certificate Course at the Technical University Cologne. The result of the project is the script OCA.py, published in August 2022. This is the accompanying Jupyter notebook with additional insights.

OCA.py is licensed under the GPLv3 (https://www.gnu.org/licenses/gpl-3.0.en.html).

Requirements

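The script uses the following Python libraries (os, pprint, and shutil are part of the Python standard library and need no separate installation):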

  • requests
  • BeautifulSoup
  • pandas
  • os
  • numpy
  • pprint
  • matplotlib
  • pillow
  • shutil
  • seaborn

Install the third-party packages with pip install -r requirements.txt.

This software also uses Bootstrap (https://getbootstrap.com/).

Usage

In the graduation_work branch, the script is tailored to the METS file location used by the Staats- und Universitätsbibliothek Hamburg, so you only need to provide a record identifier to run it. As a consequence, it currently works only on objects from this library; supporting METS files from other sources requires refactoring.
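
Whatever the METS source, the central step is reading the word confidence values from the ALTO files it points to. The following is a minimal sketch of that step, not the implementation used in OCA.py. It assumes the OCR engine wrote the standard ALTO WC attribute (a value between 0 and 1) on each String element. The sample XML, the function names word_confidences and low_confidence_words, and the 0.5 threshold are hypothetical and chosen for illustration only. It uses only the standard library so it runs without the requirements above.

```python
import statistics
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment for illustration; real files are much larger.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hamburg" WC="0.95"/>
    <SP/>
    <String CONTENT="Bibliothek" WC="0.80"/>
    <SP/>
    <String CONTENT="Zeitung" WC="0.35"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def word_confidences(alto_xml):
    """Return (word, confidence) pairs for every String element that has WC."""
    root = ET.fromstring(alto_xml)
    pairs = []
    for element in root.iter():
        # Compare the local tag name so any ALTO namespace version matches.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and "WC" in element.attrib:
            word = element.attrib.get("CONTENT", "")
            pairs.append((word, float(element.attrib["WC"])))
    return pairs


def low_confidence_words(pairs, threshold=0.5):
    """Return the words whose confidence is below the (assumed) threshold."""
    return [word for word, wc in pairs if wc < threshold]


pairs = word_confidences(SAMPLE_ALTO)
mean_wc = statistics.mean(wc for _, wc in pairs)
print(f"words: {len(pairs)}, mean confidence: {mean_wc:.2f}")
print("low confidence:", low_confidence_words(pairs))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch usable across ALTO schema versions, which matters when adapting the script to METS files from other libraries.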

