Solutions of the task 'Speed up Entity Resolution with Bit Arrays' for the big data prak 2017: https://dbs.uni-leipzig.de/study/ss_2017/bigdprak

mam10eks/Big-Data-Prak-17

Big Data Prak SS 17: Speed up Entity Resolution with Bit Arrays

This repository contains the work done during the big data internship in the summer semester 2017 on the topic "Speed up Entity Resolution with Bit Arrays".

If you want a quick introduction, you can use:

  • some scripts (located in /scripts)
    • To run a single entity resolution on two input data sets of size 5000 and 20000, execute the ./scripts/run.sh script. It finishes within 10 minutes on my laptop and already lets you compare the performance of the trivial, the sorted set, and the two stage filter approaches. All three yield the same result, but the last one uses bit arrays to filter the data set and is therefore much faster.
    • For a more detailed evaluation with more input data sets and other approaches in different configurations, continue with the ./scripts/batch_run.sh script, which runs a large number of entity resolution approaches in a single batch. It writes detailed output to a JSON file that you can use to compare the approaches and their configurations against each other and to plot the results.
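To give a rough idea of why filtering with bit arrays can speed things up without changing the result, here is a minimal Python sketch. It is not the code from this repository: the bigram tokenization, the 64-bit signature, the Jaccard similarity, the threshold, and all function names are assumptions chosen for illustration. The filter derives a cheap upper bound on the similarity from the bit arrays and skips the exact comparison only when that bound is already below the threshold, so the trivial and the filtered variant return the same matches.

```python
import zlib

NUM_BITS = 64  # assumed signature width, chosen for illustration


def bigrams(text):
    """Tokenize a string into its set of lower-cased character bigrams."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def signature(tokens):
    """Bit array (stored in an int) with one bit set per hashed token."""
    sig = 0
    for token in tokens:
        sig |= 1 << (zlib.crc32(token.encode("utf-8")) % NUM_BITS)
    return sig


def popcount(x):
    return bin(x).count("1")


def jaccard_upper_bound(size_a, sig_a, size_b, sig_b):
    """Upper bound on the Jaccard similarity, using only sizes and bit arrays.

    Every bit set in sig_a but not in sig_b stems from at least one token
    that is in A but not in B (and vice versa), which caps |A ∩ B|.
    """
    only_a = popcount(sig_a & ~sig_b)
    only_b = popcount(sig_b & ~sig_a)
    max_common = min(size_a - only_a, size_b - only_b)
    union_at_least = size_a + size_b - max_common
    return 1.0 if union_at_least == 0 else max_common / union_at_least


def trivial_match(left, right, threshold):
    """Compare every pair exactly."""
    return {(l, r) for l in left for r in right
            if jaccard(bigrams(l), bigrams(r)) >= threshold}


def filtered_match(left, right, threshold):
    """Same result as trivial_match, but skip pairs the bit arrays rule out."""
    def prepare(records):
        prepared = []
        for rec in records:
            tokens = bigrams(rec)
            prepared.append((rec, tokens, signature(tokens)))
        return prepared

    matches, exact_comparisons = set(), 0
    right_prepared = prepare(right)
    for l, l_tokens, l_sig in prepare(left):
        for r, r_tokens, r_sig in right_prepared:
            bound = jaccard_upper_bound(len(l_tokens), l_sig,
                                        len(r_tokens), r_sig)
            if bound < threshold:
                continue  # cannot reach the threshold, skip exact comparison
            exact_comparisons += 1
            if jaccard(l_tokens, r_tokens) >= threshold:
                matches.add((l, r))
    return matches, exact_comparisons


if __name__ == "__main__":
    left = ["Anna Schmidt", "Peter Müller", "John Smith"]
    right = ["Ana Schmidt", "Petra Mueller", "Jon Smith", "Zoe Young"]
    matches, compared = filtered_match(left, right, 0.6)
    print(sorted(matches), "exact comparisons:", compared)
```

The bit operations are much cheaper than building set intersections and unions, so the more pairs the bound rules out, the larger the saving. How the approaches in this repository build and use their bit arrays may differ from this sketch.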

Since this is an internship, there are additionally:
