pdf-data-extraction

extracting financial information from pdf files and saving it in a RDBMS table for further processing

Background information

I'm regularly receiving account statements from the online trading platform, over which I'm trading financial instruments. All of these statements are solely available in pdf format - an improper format for further data processing. Thus, I implemented a small application that reads pdf files, parses the text streams, detects and extracts financial key figures, and finally writes this information into a (relational) database table.

Main application

The main application is written in Python 3 and is started by executing run_data_extraction.py. This application has one optional command line argument: the absolute path in which the pdf files are stored. The default value is the relative path ./pdf (w.r.t. the main application's path). These steps are executed when running the main application:

check command line arguments and initialize variables for file handling (paths/directories)
read all pdf files (filter on file extension *.pdf) into a list, loop over it, and execute steps 3. to 7.
read the content of the pdf file and convert it into a string (function: convert_pdf_to_txt())
write the string into a temporary file and create an instance of the FileStream class that handles this file
initialize ANTLR's lexer and parser
listen to the words (tokens) and store the detected information in an instance of ListenerDividend
parameterize an SQL template (insert statement) and execute it (method: Database.write_to_database())

The function convert_data_to_pdf() is based on the PDF Miner (https://github.com/pdfminer/pdfminer.six).

ListenerDividend is inherited from PdfDividendListener, that is automatically generated by ANTLR (see below). It initializes internal variables with default values and fills them with values read when processing the parse result. Furthermore, a check function is implemented that checks if a detected ISIN (international securities identification number) is valid.

The customized class Database handles the connection to the Oracle database (based on cx_Oracle), stores the SQL insert statement with binding variables to avoid SQL injections, and parametrizes the SQL tempalte before executing it.

ANother Tool for Language Recognition https://www.antlr.org/

ANTLR is a tool that helps you to create parsers. A parser takes a piece of text and transforms it in an organized structure, such as an Abstract Syntax Tree (AST). In order to get an AST, you need to

define a lexer and parser grammar
invoke ANTLR: it will generate a lexer and a parser in your target language
use the generated lexer and parser: you invoke them passing the code to recognize and they return to you an AST

The lexer and parser grammer used in this application is defined in PdfDividend.g4. By calling antlr4 -Dlanguage=Python3 PdfDividend.g4 from the shell (I use Windows' command line interpreter), ANTLR generates the lexer PdfDividendLexer.py, the parser PdfDividendParser.py, and the listener PdfDividendListener.py based on the grammer. These classes are imported into the main application (a class is inherited from the listener) and used during the parsing subprocess (see above).

Please refer to https://tomassetti.me/antlr-mega-tutorial/ which provides an elaborated tutorial for getting started with ANTRL and beyond.

Oracle RDBMS

The extracted financial information is written to the table raw_dividend in schema pms (portfolio management system). Each table in pms uses a 4 letter abbreviation, that is used as prefix for the table's columns. raw_dividend's primary key is rdiv_id, that is automatically filled by the trigger trg_rdiv_id fired during inserting data. This ID is generated by the sequence seq_rdiv_id, which generates a 8 digit number starting with 10000000. Furthermore, the column rdiv_tech_date is also filled automatically by the trigger trg_rdiv_id and stores the system's timestamp when inserting data. All other columns of raw_dividend are used for storing data extracted from or metadata of the pdf file.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
antlr		antlr
oracle_rdbms		oracle_rdbms
LICENSE		LICENSE
README.md		README.md
run_data_extraction.py		run_data_extraction.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

antlr

antlr

oracle_rdbms

oracle_rdbms

LICENSE

LICENSE

README.md

README.md

run_data_extraction.py

run_data_extraction.py

Repository files navigation

pdf-data-extraction

Background information

Main application

ANother Tool for Language Recognition https://www.antlr.org/

Oracle RDBMS

About

Releases

Packages

Languages

License

doc0815/pdf-data-extraction

Folders and files

Latest commit

History

Repository files navigation

pdf-data-extraction

Background information

Main application

ANother Tool for Language Recognition https://www.antlr.org/

Oracle RDBMS

About

Resources

License

Stars

Watchers

Forks

Languages