Skip to content

Document Indexing and retrieval on query using CACM corpus

Notifications You must be signed in to change notification settings

rajulkumar/BM25SearchEngine

Repository files navigation

Index and Retrieve
------------------
This is an implementation of a Indexing and Retrieval system using Okapi BM25 ranking algorithm.
This generates an inverted index of a document or a set of documents with textual index terms only(no numerics),from a given corpus, in a single file as the part of indexing.
The retrieval process is done by scoring all the documnents on a score generated by applying BM25 algorithm for a query. The documents are then ranked till the given count and returned as the result in a file.

Build
-----
Developed in Java using jdk1.8.0_05.
Compiled for jre8.
No third party/external libraries used.

Compile and Run
----------------
For windows:
1. Open executeIndex.bat/executeBM25.bat
2. Set the path of jdk bin on this system at the path variable.
3. Set the file path of the input files in the given variables.
4. Save and execute .bat
5. Response files are generated at the same location.

For linux/unix*:
1. Open executeIndex.sh/executeBM25.sh
2. Set the path of jdk bin on this system at the path variable.
3. Set the file path of the input files in the given variables.
4. Save and execute .sh
5. Response files are generated at the same location.

* not tested for linux/unix 

Alternatively,
The java files Indexer.java and BM25.java could be compiled from here by:
javac src/com/search/index/Indexer.java
javac src/com/search/retrieve/BM25.java

This should be executed from this location by:
java com/search/index/Indexer <corpus_filePath> <index.out_filePath>
e.g. java com/search/index/Indexer tccorpus.txt index.out
java com/search/retrieve/BM25 <index.out_filePath> <queries.txt_filePath> <resultCount> <result.eval_filePath>
e.g. java com/search/retrieve/BM25 index.out queries.txt 100 results.eval

* Please change the results.eval file after an execution as the results get appended to the given file

Results/response of execution
-----------------------------
1. The index file generated from the Indexer contains the inverted index of all the terms in the document along with the token counts of all the documnents for the key    ~all_docs in the file. The index lines are in the format <term>=<docID>:<termFrequency>#<docID>:<termFrequency>....#<docID>:<termFrequency>.
   The inverted index generated for the given corpus is in "index.out" at this same location.

2. The results of the ranking done based on BM25 scores are generated by BM25 in the file given as the input for the results. The result length for each query is      determined by the result count given as input. For more than one query, the results get appended in the same result file.
   The format of the result is <queryID> q0 <docID> <rank> <BM25_score> <system_name>.  
   The retrieval results for the given queries on the given document set is in "results.eval" at this same location.

Implementation
--------------
Indexer::

It starts by reading the corpus line by line. It looks for the # and a number to know the begining of the document and the next occurance of the # denotes the end.
On reading a line, it tokenises the line and checks each token if it's a complete number or not. 
If not then it get added to the inverted index by looking for the term and document combination and incermenting the count stored against them. If the combination is not found then a new key with the value 1 is added to index. 
This is done for each line till the end of document is reached. At this point the token count for the document is available which is then stored in a data structure.
Once the process for all the documnets end, the inverted index and the token counts for each document is written to the given output file. The token counts are all written in the first line of the file and then the inverted lines from the index data structure are witten down on the file.


BM25:

It begins by reading the first line of the index to get the documents and the token count for each document. This data is loaded in the memory for further evaluation.
Then the values are used to calculate the average document length to be used while scoring.
Then it reads the queries file for the given queries to work on and keeps it in a list.
After this the scoring process begins. In this, for each query the inverted index is searched for the query terms and the inverted lines are fetched and stored in the memory. Then the query term frequency is calculated to be used while scoring. 
With this we have all the data required to begin the scoring. First the value of K is calculated for each document. Then for each term in the query, the BM25 score is calculated and aggregated with the previous score which gives a final score. This process is done for all the documents in the corpus. In the end we receive BM25 scores for the documents with respect to a given query.
The documents are then sorted according the BM25 score and the result of the given result size is then written on the given output file. This process is repeated for each query in the file and the results for each is appended in the same result file in the prescribed format mentioned in the results section above.


References
-----------
1. JavaSE Documentation: https://docs.oracle.com/javase/8/docs/
2. Stackoverflow forum: http://stackoverflow.com

About

Document Indexing and retrieval on query using CACM corpus

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published