Text Processing: PageRank

The goal of this project is to use PageRank to derive a ranking of words in a document based on their PageRank scores. The PageRank score of a word serves as an indicator of the importance of the word in the text. Also compare the MRR of the above PageRank algorithm with the MRR of a ranking of words based on their TF-IDF ranking scheme.

The project name is: HW2_PageRank and it was created using the Eclipse IDE. The program solves all the questions in a single execution (run).

For point 1, a directed graph was used (using a hash table and adjacency lists as data structures). For point 2 and 2', a function was created that returns the PageRank scores per word by document. For point 3, a parameterized function was created that generates both unigrams and n-grams. For point 4, a function was created that calculates the MRR from a list of proposed n-grams and the phrases available in the gold-standard files. For point 5, a function was created that calculates the TF-IDF scores for each word of each abstract document and then calculates the MRR score to the proposed n-grams.

Data

For this project, it was used the WWW collection consisting of titles and abstracts of research articles published in the WWW conference.

Algorithms

Simple PageRank: run PageRank on each word graph corresponding to each document in the collection as follows:
- Initialization: s = [s(V1), . . . , s(Vn)] = [1/n, . . . , 1/n], where n = |V|.
- Score nodes in a graph using their PageRank obtained by iteratively computing the equation.
- Where α is a damping factor (α = 0.85) and p = [1/n, . . . , 1/n].

Weighted PageRank: assume the weight Wij of an edge (Vi, Vj) is calculated as the number of times the corresponding words Wi and Wj are adjacent in text. Run PageRank on each word graph corresponding to each document in the collection using:

Mean reciprocal rank (MRR): Rd is the rank at which the first correct prediction was found for d ∈ D.

Technologies and Techniques

Java (JDK 1.7)
Eclipse IDE
Page Rank
TF-IDF
Tokenization
Stemming
Lemmatization
Remove stop-words

Program Execution Rules

The project has an executable in the 'jar' folder. The JAR name is: KD_HW2_v1.7.jar and you must send the path of the "data/www/" directory as a parameter, the n-grams size and the comparison type as parameters.

Therefore, to run the program you should go into the jar directory on your machine and then execute the following command:

    cd jar_folder
    java -jar KD_HW2_v1.7.jar data/ [UNIGRAMS|NGRAMS] [DIRECT|BY_TOKENS]

Execution examples:

    cd jar_folder
    java -jar KD_HW2_v1.7.jar data/ UNIGRAMS DIRECT
    java -jar KD_HW2_v1.7.jar data/ UNIGRAMS BY_TOKENS
    java -jar KD_HW2_v1.7.jar data/ NGRAMS DIRECT			(DEFAULT CONFIGURATION)
    java -jar KD_HW2_v1.7.jar data/ NGRAMS BY_TOKENS

The .JAR program must be run with Java 7 or higher.

Program Output

Show the MRR Scores: Using only Unigrams and Direct (Full) comparison

	k=1	k=2	k=3	k=4	k=5	k=6	k=7	k=8	k=9	k=10
Simple MRR	0.056	0.080	0.090	0.098	0.104	0.106	0.109	0.111	0.113	0.115
Weighted MRR	0.053	0.079	0.088	0.096	0.101	0.103	0.106	0.108	0.110	0.112
TF-IDF MRR	0.167	0.209	0.228	0.239	0.243	0.248	0.250	0.252	0.253	0.254

Show the MRR Scores: Using only Unigrams and comparison by Token

	k=1	k=2	k=3	k=4	k=5	k=6	k=7	k=8	k=9	k=10
Simple MRR	0.394	0.503	0.537	0.556	0.566	0.570	0.574	0.576	0.577	0.579
Weighted MRR	0.388	0.500	0.534	0.554	0.564	0.569	0.572	0.574	0.575	0.576
TF-IDF MRR	0.576	0.674	0.706	0.714	0.718	0.721	0.722	0.723	0.723	0.723

Show the MRR Scores: Using N-grams and Direct (Full) comparison

	k=1	k=2	k=3	k=4	k=5	k=6	k=7	k=8	k=9	k=10
Simple MRR	0.050	0.069	0.086	0.100	0.112	0.120	0.125	0.129	0.134	0.136
Weighted MRR	0.053	0.076	0.094	0.107	0.120	0.128	0.132	0.137	0.141	0.143
TF-IDF MRR	0.077	0.118	0.149	0.166	0.179	0.189	0.195	0.199	0.203	0.206

Show the MRR Scores: Using N-grams and comparison by Token

	k=1	k=2	k=3	k=4	k=5	k=6	k=7	k=8	k=9	k=10
Simple MRR	0.717	0.777	0.791	0.798	0.803	0.805	0.807	0.807	0.808	0.808
Weighted MRR	0.720	0.779	0.794	0.801	0.806	0.808	0.809	0.809	0.810	0.810
TF-IDF MRR	0.764	0.804	0.819	0.827	0.831	0.832	0.833	0.834	0.834	0.835

Conclusions

The results between the simple and the weighted PageRank are practically the same. This may be due to the fact that the graphs weights are mostly low.
The results of TF-IDF were always better than those obtained by PageRanks.
The way in which the n-grams are created (unigrams, bigrams or trigrams) and the way in which they compare with the exits of the gold-standard files (directly or by tokens) have a marked influence on the results of the MRR.

Authors

Created By: Andres Segura Tinoco
Created On: June 27, 2018

License

This project is licensed under the terms of the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
code		code
data		data
images		images
jar		jar
support		support
LICENSE		LICENSE
README.md		README.md
_config.yml		_config.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code

code

data

data

images

images

jar

jar

support

support

LICENSE

LICENSE

README.md

README.md

_config.yml

_config.yml

Repository files navigation

Text Processing: PageRank

Data

Algorithms

Technologies and Techniques

Program Execution Rules

Program Output

Conclusions

Authors

License

About

Releases 1

Packages

Languages

License

ansegura7/TextProcessing_PageRank

Folders and files

Latest commit

History

Repository files navigation

Text Processing: PageRank

Data

Algorithms

Technologies and Techniques

Program Execution Rules

Program Output

Conclusions

Authors

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages