Skip to content

extracting financial information from pdf files and saving it in a RDBMS table for further processing

License

Notifications You must be signed in to change notification settings

doc0815/pdf-data-extraction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pdf-data-extraction

extracting financial information from pdf files and saving it in a RDBMS table for further processing

Background information

I'm regularly receiving account statements from the online trading platform, over which I'm trading financial instruments. All of these statements are solely available in pdf format - an improper format for further data processing. Thus, I implemented a small application that reads pdf files, parses the text streams, detects and extracts financial key figures, and finally writes this information into a (relational) database table.

Main application

The main application is written in Python 3 and is started by executing run_data_extraction.py. This application has one optional command line argument: the absolute path in which the pdf files are stored. The default value is the relative path ./pdf (w.r.t. the main application's path). These steps are executed when running the main application:

  1. check command line arguments and initialize variables for file handling (paths/directories)
  2. read all pdf files (filter on file extension *.pdf) into a list, loop over it, and execute steps 3. to 7.
  3. read the content of the pdf file and convert it into a string (function: convert_pdf_to_txt())
  4. write the string into a temporary file and create an instance of the FileStream class that handles this file
  5. initialize ANTLR's lexer and parser
  6. listen to the words (tokens) and store the detected information in an instance of ListenerDividend
  7. parameterize an SQL template (insert statement) and execute it (method: Database.write_to_database())

The function convert_data_to_pdf() is based on the PDF Miner (https://github.com/pdfminer/pdfminer.six).

ListenerDividend is inherited from PdfDividendListener, that is automatically generated by ANTLR (see below). It initializes internal variables with default values and fills them with values read when processing the parse result. Furthermore, a check function is implemented that checks if a detected ISIN (international securities identification number) is valid.

The customized class Database handles the connection to the Oracle database (based on cx_Oracle), stores the SQL insert statement with binding variables to avoid SQL injections, and parametrizes the SQL tempalte before executing it.

ANother Tool for Language Recognition https://www.antlr.org/

ANTLR is a tool that helps you to create parsers. A parser takes a piece of text and transforms it in an organized structure, such as an Abstract Syntax Tree (AST). In order to get an AST, you need to

  • define a lexer and parser grammar
  • invoke ANTLR: it will generate a lexer and a parser in your target language
  • use the generated lexer and parser: you invoke them passing the code to recognize and they return to you an AST

The lexer and parser grammer used in this application is defined in PdfDividend.g4. By calling antlr4 -Dlanguage=Python3 PdfDividend.g4 from the shell (I use Windows' command line interpreter), ANTLR generates the lexer PdfDividendLexer.py, the parser PdfDividendParser.py, and the listener PdfDividendListener.py based on the grammer. These classes are imported into the main application (a class is inherited from the listener) and used during the parsing subprocess (see above).

Please refer to https://tomassetti.me/antlr-mega-tutorial/ which provides an elaborated tutorial for getting started with ANTRL and beyond.

Oracle RDBMS

The extracted financial information is written to the table raw_dividend in schema pms (portfolio management system). Each table in pms uses a 4 letter abbreviation, that is used as prefix for the table's columns. raw_dividend's primary key is rdiv_id, that is automatically filled by the trigger trg_rdiv_id fired during inserting data. This ID is generated by the sequence seq_rdiv_id, which generates a 8 digit number starting with 10000000. Furthermore, the column rdiv_tech_date is also filled automatically by the trigger trg_rdiv_id and stores the system's timestamp when inserting data. All other columns of raw_dividend are used for storing data extracted from or metadata of the pdf file.

About

extracting financial information from pdf files and saving it in a RDBMS table for further processing

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published