Paper analyser

The goal of this project is to enable quantitative analysis of academic papers.

To achieve this goal, the project contains logic for:

Parsing academic white papers into structured representations
Doing analysis on the structured representations

Paper dependency graph

The project currently focuses on taking a list of arbitrary papers in PDF form and creating a dependency graph of the citations amongst them. This graph shows how the PDFs reference each other. Paper analyser achieves this through the following steps:

1. Parse the papers to extract relevant data:
   1. Read the PDF files into a format usable in Python
   2. Extract the title of a given paper
   3. Extract the raw data of the "References" section
2. Parse the raw "References" section into individual references:
   1. Extract the title and authors of each citation
   2. Normalise the data of the extracted citations
3. Do dependency analysis based on the above citation extractions
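
The steps above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names `normalise`, `split_references` and `build_dependency_graph` are hypothetical, the input is assumed to be already-extracted text rather than PDFs, references are assumed to carry numbered markers such as `[1]`, and a citation is detected by simple substring matching of normalised titles.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lowercase the text and drop every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def split_references(raw_section):
    """Split a raw "References" section on numbered markers such as [1], [2]."""
    parts = re.split(r"\[\d+\]", raw_section)
    # Collapse line breaks left over from the PDF layout and skip empty chunks.
    return [" ".join(part.split()) for part in parts if part.strip()]


def build_dependency_graph(papers):
    """Map each paper title to the titles in the same set that it cites.

    `papers` maps a paper title to the raw text of its "References" section.
    Papers that cite nothing in the set do not appear as keys.
    """
    normalised_titles = {title: normalise(title) for title in papers}
    graph = defaultdict(list)
    for title, raw_section in papers.items():
        for reference in split_references(raw_section):
            normalised_reference = normalise(reference)
            for other, key in normalised_titles.items():
                # Assumption: a reference cites a paper when that paper's
                # normalised title occurs inside the normalised reference.
                if other != title and key and key in normalised_reference:
                    graph[title].append(other)
    return dict(graph)


# Made-up sample data standing in for text extracted from two PDFs.
papers = {
    "Fuzzing for Fun": (
        "[1] A. Author. Symbolic Execution Basics. 2015.\n"
        "[2] B. Writer. Unrelated Work. 2010."
    ),
    "Symbolic Execution Basics": "[1] C. Person. Some Older\nPaper. 2001.",
}
graph = build_dependency_graph(papers)
print(graph)
```

Normalising both sides before comparing makes the match tolerant of differences in case, punctuation and line breaks, which are common in text extracted from PDFs.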

Usage

For a simple usage example, see simple_example.md.

Paper analyser takes PDF files of academic papers as input and outputs data about these papers. For convenience, we maintain a list of links to software analysis papers focused on software security in our sister repository, software-security-paper-list.

For an example of analysing many papers, see large_example.md.

Example visualisation

We have also created visualisations of the paper analyser's output, which make it easy to quickly understand the relationships between the academic papers in the data set.

For an example of the visualisations, see https://adalogics.com/software-security-research-citations-visualiser

These visualisations will be open sourced in the near future.

Citation graph example:

Wordcloud of 85 fuzzing papers

Example of a wordcloud generated from the papers in the "Fuzzing" section of software-security-paper-list. This wordcloud excludes the 100 most common English words: https://www.espressoenglish.net/the-100-most-common-words-in-english/

Wordcount of 85 fuzzing papers

A bar plot of the word counts in the papers in the "Fuzzing" section of software-security-paper-list. This plot excludes the 100 most common English words: https://www.espressoenglish.net/the-100-most-common-words-in-english/
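
The word filtering behind the wordcloud and wordcount can be sketched as follows. This is an illustration under stated assumptions, not the code in gen_wordcloud.py: the function `count_words` is hypothetical, and `COMMON_WORDS` is a short placeholder for the full list of the 100 most common English words.

```python
import re
from collections import Counter

# Placeholder: a handful of entries standing in for the 100 most common English words.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for"}


def count_words(texts, excluded=COMMON_WORDS):
    """Count word occurrences across all texts, ignoring case and excluded words."""
    counts = Counter()
    for text in texts:
        # Treat every run of letters as one word.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in excluded)
    return counts


# Made-up sample sentences standing in for the extracted text of papers.
texts = [
    "Fuzzing is the art of finding bugs.",
    "Coverage-guided fuzzing finds bugs in parsers.",
]
counts = count_words(texts)
print(counts.most_common(2))
```

The resulting counts can be fed either to a bar plot or to a wordcloud generator, since both only need a mapping from word to frequency.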

Installation

git clone https://github.com/AdaLogics/paper-analyser
cd paper-analyser
./install.sh

Contribute

We welcome contributions.

Paper analyser is maintained by:

We are particularly interested in features for:

Improved parsing of the PDF files to get better structured output
Additional data analysis in the project

Feature suggestions

If you would like to contribute but don't have a feature in mind, please see the list below for suggestions:

Extraction of authors from papers
Extraction of the actual text from the papers. This could be used for a lot of cool data analysis.

