Information Retrieval (IR) Engine

Given the increasing volume of unstructured data in the world, Information Retrieval (IR), a sub-area of text mining, and Information Extraction (IE) are extremely important for dealing with that data efficiently. Companies across industry, marketing, economics and many other sectors depend heavily on the efficiency and robustness of these techniques and tools.

Developed at Aveiro University by @luminoso and @ruifpedro, this IR/IE engine handles the overall process of gathering, indexing and searching for relevant documents in huge collections of textual data, in order to extract knowledge from existing unstructured data.

Features:

Components are developed as modules;
Memory use adapts to the host;
Multi-threaded.
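
The last two features can be illustrated with a minimal sketch. It assumes, without confirmation from the source, that adapting to the host means sizing in-memory buffers from the JVM heap limit, and that multi-threading means handing documents to a pool of workers. The class name `HostAdaptiveWorker`, the 50% heap fraction and the token-counting task are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: names and thresholds are illustrative, not taken from the engine.
class HostAdaptiveWorker {

    // Assumed fraction of the maximum heap that in-memory buffers may occupy.
    static final double HEAP_FRACTION = 0.5;

    // Byte budget derived from the largest heap the JVM is allowed to use.
    static long bufferBudgetBytes() {
        return (long) (Runtime.getRuntime().maxMemory() * HEAP_FRACTION);
    }

    // True when the used heap exceeds the budget, i.e. buffered data should go to disk.
    static boolean shouldFlush() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return used > bufferBudgetBytes();
    }

    // Number of whitespace-separated tokens in one document.
    static int countTokens(String document) {
        String trimmed = document.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Runs one counting task per document on a pool sized to the host's CPU count.
    static List<Integer> countTokensInParallel(List<String> documents) {
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<Integer> task = () -> countTokens(doc);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) {
                counts.add(f.get()); // results come back in submission order
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Buffer budget (bytes): " + bufferBudgetBytes());
        System.out.println("Flush now? " + shouldFlush());
        System.out.println(countTokensInParallel(
                Arrays.asList("how to flush a buffer", "color picker")));
    }
}
```

Deriving both the budget and the pool size from `Runtime` keeps the same jar usable on small and large hosts without extra flags, which is one plausible reading of the feature list.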

The engine is currently adapted to process a CSV corpus collected from StackOverflow questions and answers; a small sample is included in the repository for demonstration purposes. Given the engine's modularity, it can be easily adapted to any other type of corpus. For further testing, the full-sized corpus can be downloaded here.

How to compile

This project uses Apache Maven for build management, so you can package everything into a jar by executing:

mvn clean package

The shade goal is included in the pom file to package a jar with every required dependency.

How to run

The engine requires Java 8 as its minimum Java version. It runs on both Oracle Java 8 and OpenJDK 8, but not on older Java versions.

There are several ways to run the engine. You can import it as a Maven project using your favorite IDE and run it from there, use the provided compiled jar, or use your own jar, packaged by you from the existing source code.

Display help

Run with the -h switch for help:

$ java -jar IR-2016_17-0.0.1-SNAPSHOT.jar -h

Option	Description	Default
-d <arg>	Directory containing text corpus to process	./stacksample
-f <arg>	Stop words to use	./stop_processed.txt
-o <arg>	Output directory to store processed index	./disk
-h	Print the help message

Processing the given sample

Processing the bundled sample requires no arguments, because the default corpus directory is ./stacksample:

$ java -jar IR-2016_17-0.0.1-SNAPSHOT.jar

Progress is printed while the engine runs.

Query the database

Run with the -q switch to open the interactive query interface over the processed index. The example below queries for the words buffer and color:

$ java -jar IR-2016_17-0.0.1-SNAPSHOT.jar -q
Insert query (Control+c to exit): buffer color
Number of results to query (10): 
┌──────────────────────────────────────────────────────────────────────────┐
│                         Information Retrieval                            │
├──────────────────────────────────────────────────────────────────────────┤
├───────────────────────────────┬──────────────────────────────────────────┤
│ Terms:                        │                                          │
│ • Query                       │ [buffer, color]                          │
│ • Tokenized                   │ [buffer, color]                          │
├───────────────────────────────┴──────────────────────────────────────────┤
├───────────────────────────────┬──────────────────────────────────────────┤
│ Results found                 │ 14                                       │
├───────────────────────────────┼──────────────────────────────────────────┤
│ Database size                 │ 958                                      │
├───────────────────────────────┼──────────────────────────────────────────┤
│ Token count                   │ 4804                                     │
├───────────────────────────────┼──────────────────────────────────────────┤
│ Results to retrieve           │ 10                                       │
├───────────────────────────────┴──────────────────────────────────────────┤
├──────┬────────────────────────┬──────────┬───────────────────────────────┤
│ Rank │         Score          │ Document │             Path              │
├──────┼────────────────────────┼──────────┼───────────────────────────────┤
│  1   │ 0.2792691357898761     │   896    │ ./stacksample/Questions.csv   │
│  2   │ 0.2544781393660751     │   781    │ ./stacksample/Questions.csv   │
│  3   │ 0.20702654309708213    │   354    │ ./stacksample/Questions.csv   │
│  4   │ 0.18761207533221913    │   340    │ ./stacksample/Questions.csv   │
│  5   │ 0.16297804872950658    │    37    │ ./stacksample/Answers.csv     │
│  6   │ 0.16273914301263454    │    12    │ ./stacksample/Answers.csv     │
│  7   │ 0.1576064181842389     │   394    │ ./stacksample/Questions.csv   │
│  8   │ 0.15217728526150623    │   108    │ ./stacksample/Answers.csv     │
│  9   │ 0.1287816215117765     │   322    │ ./stacksample/Questions.csv   │
│  10  │ 0.12816593333587364    │    6     │ ./stacksample/Answers.csv     │
└──────┴────────────────────────┴──────────┴───────────────────────────────┘
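
The prompt above asks how many results to retrieve and then prints only the best-scoring documents. The source does not show how the engine selects them, so the following is only a sketch of one common approach: a bounded min-heap that keeps the k highest scores. The class `TopKResults`, its method and the sample scores are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: selects the k best-scoring documents from a score map.
class TopKResults {

    // Returns at most k (documentId, score) pairs, highest score first.
    static List<Map.Entry<Integer, Double>> topK(Map<Integer, Double> scores, int k) {
        Comparator<Map.Entry<Integer, Double>> byScore = Map.Entry.comparingByValue();
        // Min-heap: the weakest of the current candidates sits at the head.
        PriorityQueue<Map.Entry<Integer, Double>> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<Integer, Double> e : scores.entrySet()) {
            heap.offer(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            if (heap.size() > k) {
                heap.poll(); // evict the lowest score once more than k are held
            }
        }
        List<Map.Entry<Integer, Double>> ranked = new ArrayList<>(heap);
        ranked.sort(byScore.reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<Integer, Double> scores = new HashMap<>();
        scores.put(1, 0.10);
        scores.put(2, 0.30);
        scores.put(3, 0.20);
        scores.put(4, 0.05);

        int rank = 1;
        for (Map.Entry<Integer, Double> e : topK(scores, 2)) {
            System.out.println(rank++ + "\t" + e.getValue() + "\t" + e.getKey());
        }
    }
}
```

The heap never holds more than k + 1 entries, so memory stays bounded by the requested result count rather than by the number of matching documents.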

Project architecture

The engine is designed as a set of macro modules that interact with each other. The overall view is the following:

Module	Description
Corpus Reader	Parses the input. In the given example, files in ./stacksample
Tokenizer	Tokenizes each document (stop-word removal, stemming, etc.)
Indexer	Processes the tokens, computes LNC and serializes the results
Searcher	Controls the query interface and the mechanisms to perform a query
Ranker	Ranks the results using the LNC/TLC approach
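
The Tokenizer, Indexer and Ranker rows can be sketched end to end. This sketch assumes that "LNC/TLC" refers to the lnc.ltc weighting scheme from SMART notation: documents get logarithmic term frequency with cosine normalization and no idf, while queries get logarithmic term frequency multiplied by idf and then cosine-normalized. That reading is an assumption, as are the class `LncLtcSketch` and all of its method names. Stemming is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a Tokenizer -> Indexer -> Ranker flow; not the engine's own classes.
class LncLtcSketch {

    // Lowercases, splits on non-alphanumeric characters and drops stop words (no stemming).
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "l": logarithmic term frequency, 1 + log10(tf), for every distinct token.
    static Map<String, Double> logTf(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            weights.put(e.getKey(), 1 + Math.log10(e.getValue()));
        }
        return weights;
    }

    // "c": cosine normalization, dividing each weight by the vector's Euclidean length.
    static Map<String, Double> normalize(Map<String, Double> weights) {
        double sumSq = 0;
        for (double w : weights.values()) {
            sumSq += w * w;
        }
        double norm = Math.sqrt(sumSq);
        Map<String, Double> out = new HashMap<>();
        if (norm == 0) {
            return out; // empty or all-zero vector: nothing to score
        }
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            out.put(e.getKey(), e.getValue() / norm);
        }
        return out;
    }

    // Document side (lnc): log tf, no idf, cosine-normalized.
    static Map<String, Double> lnc(List<String> docTokens) {
        return normalize(logTf(docTokens));
    }

    // Query side (ltc): log tf times idf = log10(N / df), cosine-normalized.
    static Map<String, Double> ltc(List<String> queryTokens, Map<String, Integer> docFreq, int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Double> e : logTf(queryTokens).entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df > 0) { // terms absent from the collection cannot match any document
                weights.put(e.getKey(), e.getValue() * Math.log10((double) numDocs / df));
            }
        }
        return normalize(weights);
    }

    // Score: dot product of the two normalized vectors.
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double s = 0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            s += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        return s;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "of"));
        Map<String, Double> doc = lnc(tokenize("The color of a buffer, a big buffer", stop));
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("buffer", 2);  // made-up document frequencies
        docFreq.put("color", 5);
        Map<String, Double> query = ltc(tokenize("buffer color", stop), docFreq, 100);
        System.out.println("score = " + score(query, doc));
    }
}
```

Keeping idf on the query side only means document vectors can be computed and serialized once per document, without revisiting them when collection statistics change.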
