
loukwn/similarity-and-range-queries-spark-thesis

Similarity and range queries in Spark

This is the code that accompanies my diploma thesis "Processing of similarity and range queries in cloud systems". It was developed in Scala and it uses the Spark API so that it can be run in a distributed environment.

Overview

The goal of this thesis is to answer complex queries of the form:

Given a collection of documents, return the document pairs that contain term A between minA and maxA times, term B between minB and maxB times, and so on. Then discard the pairs whose two documents are less than x% similar.
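To make the query shape concrete, here is a minimal sketch in plain Scala collections rather than Spark RDDs. It is not the thesis code: `Doc`, `TermRange`, `satisfies`, `jaccard` and `query` are hypothetical names, and it assumes that both documents of a pair must satisfy every term range and that similarity means exact Jaccard similarity over token sets.

```scala
object RangeQuerySketch {
  // Hypothetical representation: a document is an id plus its tokens.
  final case class Doc(id: String, tokens: Seq[String])

  // Hypothetical constraint: `term` must occur between `min` and `max` times (inclusive).
  final case class TermRange(term: String, min: Int, max: Int)

  // True when the document meets every term-frequency range.
  def satisfies(doc: Doc, ranges: Seq[TermRange]): Boolean =
    ranges.forall { r =>
      val n = doc.tokens.count(_ == r.term)
      n >= r.min && n <= r.max
    }

  // Exact Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|.
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // Assumption: the range filter is applied per document, then every
  // remaining pair is scored and kept only if it reaches `minSimilarity`.
  def query(docs: Seq[Doc],
            ranges: Seq[TermRange],
            minSimilarity: Double): Seq[(String, String, Double)] = {
    val matching = docs.filter(satisfies(_, ranges))
    for {
      (a, i) <- matching.zipWithIndex
      b <- matching.drop(i + 1)
      sim = jaccard(a.tokens.toSet, b.tokens.toSet)
      if sim >= minSimilarity
    } yield (a.id, b.id, sim)
  }
}
```

Scoring every remaining pair exactly is the quadratic step that MinHash and LSH are meant to make cheaper.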

The code uses MinHash to reduce the time needed to estimate the Jaccard similarity between two documents, and Locality Sensitive Hashing (LSH) to reduce the number of MinHash comparisons performed while the complex query described above executes.
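As an illustration of the MinHash idea, the sketch below builds a fixed-size signature per document and estimates Jaccard similarity as the fraction of signature positions that agree. It is a simplified stand-in, not the repository's implementation: the object and method names are hypothetical, and using seeded `MurmurHash3.stringHash` as the family of hash functions is an assumption made for brevity.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

object MinHashSketch {
  // One seed per signature position. Seeded MurmurHash3 stands in for a
  // family of independent hash functions (an assumption of this sketch).
  def makeSeeds(signatureSize: Int, rngSeed: Long = 42L): Array[Int] = {
    val rng = new Random(rngSeed)
    Array.fill(signatureSize)(rng.nextInt())
  }

  // Position i of the signature holds the minimum hash of the document's
  // tokens under the i-th seeded hash function.
  def signature(tokens: Set[String], seeds: Array[Int]): Array[Int] = {
    require(tokens.nonEmpty, "cannot MinHash an empty document")
    seeds.map(seed => tokens.iterator.map(t => MurmurHash3.stringHash(t, seed)).min)
  }

  // The fraction of positions where two signatures agree estimates the
  // Jaccard similarity of the underlying token sets.
  def estimate(a: Array[Int], b: Array[Int]): Double = {
    require(a.length == b.length && a.nonEmpty, "signatures must have equal, non-zero length")
    a.indices.count(i => a(i) == b(i)).toDouble / a.length
  }
}
```

Comparing two signatures costs time proportional to the signature size, regardless of how long the documents are. Both signatures must be built with the same seeds for the estimate to be meaningful.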

Screenshots

Comparison of the different methods, run on Databricks (Community Edition) with 20 Newsgroups as the input dataset (the notebook is included in the repo):

Execution of MinHash (signature size: 200) -> Time: 12.39 sec

Execution of MinHash + LSH (signature size: 200, bands: 25, rows: 8) -> Time: 9.01 sec

Execution of plain Jaccard -> Time: 31.52 sec
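The band and row settings in the MinHash + LSH run above (25 bands of 8 rows over a 200-value signature) can be read through the standard banding scheme, sketched below in plain Scala. This is an illustrative sketch, not the notebook's code: all names are hypothetical, and treating each band's raw values as its bucket key (rather than hashing them) is a simplification.

```scala
object LshBandingSketch {
  // Split a signature into `bands` bands of `rows` values each and return
  // one (band index, band values) bucket key per band.
  def bandKeys(signature: Array[Int], bands: Int, rows: Int): Seq[(Int, Seq[Int])] = {
    require(signature.length == bands * rows, "signature size must equal bands * rows")
    signature.toSeq.grouped(rows).zipWithIndex.map { case (band, i) => (i, band) }.toSeq
  }

  // Two documents become a candidate pair when they share at least one
  // bucket key, i.e. they agree on every row of some band.
  def candidatePairs(signatures: Map[String, Array[Int]],
                     bands: Int,
                     rows: Int): Set[(String, String)] = {
    val buckets = signatures.toSeq
      .flatMap { case (id, sig) => bandKeys(sig, bands, rows).map(key => key -> id) }
      .groupBy(_._1)
      .values
      .map(_.map(_._2).sorted)
    buckets.flatMap(ids => ids.combinations(2).map { case Seq(a, b) => (a, b) }).toSet
  }

  // Standard banding formula: probability that two documents with Jaccard
  // similarity s agree on at least one band, 1 - (1 - s^rows)^bands.
  def candidateProbability(s: Double, bands: Int, rows: Int): Double =
    1.0 - math.pow(1.0 - math.pow(s, rows), bands)
}
```

Only the candidate pairs need a full signature comparison, which is where the saving over comparing every pair comes from. With 25 bands of 8 rows, `candidateProbability` stays close to zero for weakly similar pairs and close to one for highly similar ones, so most of the skipped comparisons are pairs that would not have passed a high similarity threshold.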

License

MIT License
