The goal of this project is to enable quantitative analysis of academic papers.
To achieve this, the project contains logic for:
- Parsing academic white papers into a structured representation
- Doing analysis on the structured representations
The project currently focuses on taking a list of arbitrary papers in the form of PDFs and creating a dependency graph of the citations among them. This graph shows how the PDFs reference each other. Paper analyser achieves this through the following steps:
- Parse the papers to extract relevant data
  - Read the PDF files to a format usable in Python
  - Extract the title of a given paper
  - Extract the raw data of the "References" section
  - Parse the raw "References" section into individual references:
    - Extract the title and authors of the citation
    - Normalise the data of the extracted citations
- Do dependency analysis based on the above citation extractions
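The normalisation and dependency analysis steps can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names `normalise_title` and `build_citation_graph`, the input shape (a dict from paper title to the titles extracted from its references), and the example titles are all assumptions made for illustration.

```python
import re


def normalise_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace.

    This makes small formatting differences between a paper's own title
    and the way it is written in another paper's reference list irrelevant.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_citation_graph(papers: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each paper to the set of papers in the data set that it cites.

    `papers` (hypothetical input shape) maps a paper's title to the list of
    titles extracted from its "References" section.
    """
    # Index the data set by normalised title so lookups are exact matches.
    known = {normalise_title(title): title for title in papers}

    # Start every paper with an empty edge set so uncited papers still appear.
    graph: dict[str, set[str]] = {title: set() for title in papers}
    for title, cited_titles in papers.items():
        for cited in cited_titles:
            target = known.get(normalise_title(cited))
            # Keep only citations that point at another paper in the data set.
            if target is not None and target != title:
                graph[title].add(target)
    return graph


# Made-up titles, purely for illustration.
example = {
    "Paper A: An Overview": ["paper B - a follow up.", "Some Paper Outside The Set"],
    "Paper B: A Follow-Up": [],
}
citation_graph = build_citation_graph(example)
```

Matching on normalised titles is only one possible design. It is simple and deterministic, but it will miss citations whose extracted title is truncated or garbled, where a fuzzy comparison might do better.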
For a simple usage example, see simple_example.md.
Paper analyser takes PDF files of academic papers as input and outputs data about these papers. For convenience, we maintain a list of links to software analysis papers focused on software security in our sister repository, software-security-paper-list.
For an example of analysing many papers, see large_example.md.
We have also created visualisations for the output of the paper analyser, which make it easy to quickly understand the relationships between the academic papers in the data set.
For an example of the visualisations, see https://adalogics.com/software-security-research-citations-visualiser
These visualisations will be open sourced in the near future.
Example of a wordcloud generated from the papers in the "Fuzzing" section of software-security-paper-list. This wordcloud ignores the 100 most common English words: https://www.espressoenglish.net/the-100-most-common-words-in-english/
Barplot of the words in the papers in the "Fuzzing" section of software-security-paper-list. This plot ignores the 100 most common English words: https://www.espressoenglish.net/the-100-most-common-words-in-english/
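The word counts behind a wordcloud or barplot like these could be computed roughly as follows. This is a sketch under assumptions, not the project's code: `word_frequencies` is a hypothetical helper, and `COMMON_WORDS` is a shortened stand-in for the list of the 100 most common English words.

```python
import re
from collections import Counter

# Shortened stand-in for the list of the 100 most common English words.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "we", "that"}


def word_frequencies(text: str, stop_words: set[str] = COMMON_WORDS) -> Counter:
    """Count the words in `text`, skipping any word listed in `stop_words`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


# Made-up sample text standing in for the extracted text of a paper.
sample = "Fuzzing is the process of testing. We study fuzzing and the testing of parsers."
frequencies = word_frequencies(sample)
top_words = frequencies.most_common(2)
```

The resulting counter can be passed to any plotting or wordcloud library; `most_common(n)` gives the values for the tallest `n` bars.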
git clone https://github.com/AdaLogics/paper-analyser
cd paper-analyser
./install.sh
We welcome contributions.
Paper analyser is maintained by:
We are particularly interested in features for:
- Improved parsing of the PDF files to get better structured output
- Adding more data analysis to the project
If you would like to contribute but don't have a feature in mind, please see the list below for suggestions:
- Extraction of authors from papers
- Extraction of the actual text from the papers. This could be used for a lot of cool data analysis.