Similarity and range queries in Spark

This is the code that accompanies my diploma thesis "Processing of similarity and range queries in cloud systems". It was developed in Scala and it uses the Spark API so that it can be run in a distributed environment.

Overview

The goal of this thesis is to answer complex queries of the form:

Given a collection of documents return the pairs that contain the term A [minA,maxA] times, the term B [minB,maxB] times.. etc. From these pairs filter out the ones that have less than x% similarity between their docs.

The code uses the MinHash technique to reduce the time it takes to calculate a Jaccard similarity between two documents, as well as the technique of Locality Sensitive Hashing to reduce the number of times that MinHash will be run during the execution of the aforementioned complex query.

Screenshots

Comparison of the different methods, running on the platform Databricks (Community version) and using 20Newsgroups as the input dataset (notebook is included in the repo):

Execution of Minhash (signature size: 200) -> Time: 12.39 sec

Execution of Minhash + LSH (signature size: 200, bands: 25, rows: 8) -> Time: 9.01 sec

Execution of plain Jaccard -> Time: 31.52 sec

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
project		project
screens		screens
src/main		src/main
.gitignore		.gitignore
DatabricksNotebook.zip		DatabricksNotebook.zip
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

project

project

screens

screens

src/main

src/main

.gitignore

.gitignore

DatabricksNotebook.zip

DatabricksNotebook.zip

LICENSE

LICENSE

README.md

README.md

build.sbt

build.sbt

Repository files navigation

Similarity and range queries in Spark

Overview

Screenshots

License

About

Releases

Packages

Languages

License

loukwn/similarity-and-range-queries-spark-thesis

Folders and files

Latest commit

History

Repository files navigation

Similarity and range queries in Spark

Overview

Screenshots

License

About

Resources

License

Stars

Watchers

Forks

Languages