The Backbone Application: Instance Matching for Company Data using Dedupe

Description

The Backbone Application is a client-server application, based on a machine learning algorithm, that matches instances from different datasets and stores them in a database.

Instance matching algorithm

The instance matching algorithm matches data about companies. It receives a configuration file and two input datasets (.csv files), each containing data about different companies, and tries to find and link pairs of entities that refer to the same company. The links are represented by clusters: if the algorithm matches two companies, they are put in the same cluster. The output consists of two .csv files that contain the data from the input files plus two new columns: 'cluster_id' (the id of the cluster the company was assigned to) and 'link_score' (a score representing how similar that company is to the others assigned to the same cluster).

Notes:

companies that do not match any company in the other dataset are assigned to their own cluster (a single-element cluster)
the algorithm does a one-to-one match between the two datasets, i.e., a cluster contains at most two companies, one from each dataset; two entities in the same dataset that refer to the same company will never end up in the same cluster
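The clustering rules above can be illustrated with a minimal sketch. It is not the application's actual code: the function name `assign_clusters`, the shape of `matched_pairs`, and the use of `None` as the score of a single-element cluster are all assumptions made for illustration.

```python
from itertools import count


def assign_clusters(ids_a, ids_b, matched_pairs):
    """Map every record id to a (cluster_id, link_score) tuple.

    ids_a, ids_b: record ids of the two datasets (hypothetical inputs).
    matched_pairs: iterable of (id_a, id_b, score) tuples, one-to-one.
    Returns one dict per dataset.
    """
    next_id = count()
    out_a, out_b = {}, {}

    # A matched pair shares one cluster id and one score.
    for id_a, id_b, score in matched_pairs:
        if id_a in out_a or id_b in out_b:
            raise ValueError("matched_pairs is not one-to-one")
        cluster_id = next(next_id)
        out_a[id_a] = (cluster_id, score)
        out_b[id_b] = (cluster_id, score)

    # Every unmatched record gets its own single-element cluster.
    # Using None as its score is an assumption of this sketch.
    for ids, out in ((ids_a, out_a), (ids_b, out_b)):
        for record_id in ids:
            if record_id not in out:
                out[record_id] = (next(next_id), None)

    return out_a, out_b


# Made-up ids and score, for illustration only.
clusters_a, clusters_b = assign_clusters(
    ["a1", "a2"], ["b1", "b2", "b3"], [("a1", "b2", 0.93)]
)
print(clusters_a)
print(clusters_b)
```

Here "a1" and "b2" share a cluster, while "a2", "b1" and "b3" each end up alone, which mirrors the two notes above.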

Server Application

The server side is a RESTful application that:

can receive all the necessary input files that the matching algorithm needs
can extract company entities from the database (based on the jurisdiction the user specifies in the configuration file) to create the second dataset that the algorithm needs, if the user sent only one dataset with company data
can insert the results of the matching algorithm into the database
can run the matching algorithm once all the necessary input files have been provided
can search the database for companies by name or address
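The search capability in the last item can be sketched as a parameterized substring query. This is only an illustration: the `companies` table, its columns, and the `search_companies` function are hypothetical, and the standard library's `sqlite3` module is used as a stand-in for PostgreSQL so that the example runs without a database server.

```python
import sqlite3


def search_companies(conn, term, field="name"):
    """Return rows whose name or address contains `term`."""
    # Column names cannot be bound as query parameters, so restrict
    # them to a fixed whitelist before building the query string.
    if field not in ("name", "address"):
        raise ValueError("field must be 'name' or 'address'")
    query = (
        f"SELECT id, name, address FROM companies "
        f"WHERE {field} LIKE ? ORDER BY id"
    )
    # The search term itself is always passed as a bound parameter.
    return conn.execute(query, (f"%{term}%",)).fetchall()


# In-memory database with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO companies (name, address) VALUES (?, ?)",
    [("Acme AS", "Storgata 1, Oslo"), ("Beta Ltd", "1 High Street, London")],
)

print(search_companies(conn, "acme"))
print(search_companies(conn, "london", field="address"))
```

SQLite's `LIKE` is case-insensitive for ASCII text by default. With PostgreSQL and psycopg, the equivalent query would typically use `ILIKE` for a case-insensitive match and `%s` as the parameter placeholder.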

Client Application

The client side is a desktop application where the user can:

select and upload the input files that the algorithm needs
create a training file for the algorithm (the file is automatically sent to the server after it is created)
start the algorithm (after all the necessary files have been uploaded)
search for companies, by their names or addresses, in the database

Getting Started

Server Application

Run the api.py script from an IDE that supports Python, or from a terminal as in the example below:

python3 api.py

Client Application

Run the client_app.py script from an IDE that supports Python, or from a terminal as in the example below:

python3 client_app.py

Optional: Instance matching algorithm

To run only the instance matching algorithm, install Jupyter Notebook and use it to open the dedupe_interlinking_data.ipynb file, which can be found in the server_app folder.

Prerequisites

Jupyter Notebook (needed only if the matching algorithm is to be run on its own) - installation guide
Python 3 - installation guide
pandas (if Jupyter Notebook and Anaconda are not installed) - installation guide
NumPy (if Jupyter Notebook and Anaconda are not installed) - installation guide
dedupe - Dedupe's GitHub page can be found here

pip install dedupe

unidecode (used in the instance matching algorithm for preprocessing data)

pip install unidecode

simplejson (used, for example, in the instance matching algorithm to read the JSON configuration file)

pip install simplejson

flask - installation guide
requests
- official installation guide
- Stack Overflow installation guide
Database:
- PostgreSQL database - installation guide
- psycopg - installation guide
- (OPTIONAL) pgAdmin - tool for managing and visualizing the PostgreSQL database; download here

The module versions used at development time can be accessed here
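Before starting the server or the client, it can help to check which of the Python prerequisites listed above are importable. The sketch below is a convenience script, not part of the repository. The import names in `MODULES` are assumptions; in particular, `psycopg2` assumes the psycopg 2 driver.

```python
from importlib.util import find_spec

# Import names are assumptions; "psycopg2" assumes the psycopg 2 driver.
MODULES = [
    "pandas",
    "numpy",
    "dedupe",
    "unidecode",
    "simplejson",
    "flask",
    "requests",
    "psycopg2",
]


def missing_modules(names=MODULES):
    """Return the names from `names` that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are installed.")
```

The script only checks that each module can be located; it does not compare installed versions against the versions used at development time.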

Documentation

More detailed documentation can be found in the wiki pages.
