
wikisp

Six Degrees of Wikipedia is a captivating concept inspired by the theory of six degrees of separation from social networks: any two Wikipedia articles can usually be connected in six clicks or fewer. This project finds the shortest path between articles on the English Wikipedia, exploring the vast web of interconnected knowledge on the platform.

This project also lets you build a clean SQLite3 database with an adjacency list and a partitioned graph, for easy traversal and use in your own projects.
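
As a quick illustration of the underlying idea, here is a minimal sketch of a breadth-first search over a toy in-memory adjacency list. This is not the project's actual implementation (which uses serialized adjacency lists and a SQLite3 database), just the shortest-path technique in miniature:

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search: return the shortest chain of article
    titles from start to goal, or None if no path exists."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        article = queue.popleft()
        if article == goal:
            # Walk back through parents to rebuild the path.
            path = []
            while article is not None:
                path.append(article)
                article = parents[article]
            return path[::-1]
        for linked in adjacency.get(article, ()):
            if linked not in parents:
                parents[linked] = article
                queue.append(linked)
    return None

# Toy graph: each key links to the listed article titles.
links = {"Cat": ["Mammal"], "Mammal": ["Animal"], "Animal": ["Biology"]}
print(shortest_path(links, "Cat", "Biology"))  # ['Cat', 'Mammal', 'Animal', 'Biology']
```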

Table of contents

  • Requirements
  • Getting started
  • Running the webserver

Requirements

| Tool | Why it's needed |
| --- | --- |
| Python 3.x | Used for dump processing |
| Go 1.20 | Used to build the serialized adjacency lists and the webserver |
| NodeJS v18+ | Needed for development only |
| Docker | Used to build and run the webserver |

Getting started

Building Wikipedia link adjacency lists

To build the Wikipedia link adjacency lists, you first need to download a Wikipedia dump file from here (~30 GB). The file to download from the Wikipedia archives is named enwiki-xxxxxxxx-pages-articles-multistream.xml.bz2.

Once downloaded, set the following environment variables:

| Name | Description |
| --- | --- |
| OUT_DIR | Path of the directory where the link adjacency lists are written |
| WIKI_XML_DUMP | Path to the Wikipedia XML dump file |
| SQLITE3_DB_PATH | Path to a SQLite3 database holding information about adjacency and articles |
| ADJACENCY_LIST_PATH | Path to the serialized adjacency list; should equal OUT_DIR (optional if you are not planning to run the webserver) |
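
Before kicking off the long processing run, it can help to sanity-check the configuration. A tiny sketch (not part of the repository):

```python
import os

# Fail fast if a required variable is missing before starting the
# long dump-processing run.
for name in ("OUT_DIR", "WIKI_XML_DUMP", "SQLITE3_DB_PATH"):
    if not os.environ.get(name):
        raise SystemExit(f"{name} is not set")
```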

To run all of the dump-processing steps, run this command in a terminal:

make dump-processing

Detailed guide (to keep the raw CSV files and a clean SQLite3 graph database)

The following section is a step-by-step guide to building Wikipedia link adjacency lists, both as raw, unprocessed CSV files and as a cleaned SQLite3 graph database.

The SQLite3 schema is available here

1. Parsing dumps to CSV files

Once the environment variables are set, the first step in building the Wikipedia link adjacency lists is to parse the dumps and write them to CSV files (note: CSV is used because it is faster to parse). This is done with the following command in a terminal:

make step1-dp

This will create three CSV files (article.csv, redirect.csv, pagesmentioned.csv) in the directory set by OUT_DIR; a short loading sketch follows the list below.

  • article.csv: article titles

    | Article title |
    | --- |
    | string |

  • redirect.csv: article A's title actually redirects to article B's title

    | Article A title | Redirects to article B title |
    | --- | --- |
    | string | string |

  • pagesmentioned.csv: article A has a link to article B

    | Article A title | Contains links to article B title |
    | --- | --- |
    | string | string |
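
As the loading sketch, here is a minimal example that reads pagesmentioned.csv into an in-memory adjacency list. It assumes the two-column layout described above and a comma delimiter; check the generated files for the actual format:

```python
import csv
import os
from collections import defaultdict

# Build {article title: [linked article titles]} from pagesmentioned.csv.
adjacency = defaultdict(list)
path = os.path.join(os.environ["OUT_DIR"], "pagesmentioned.csv")
with open(path, newline="", encoding="utf-8") as f:
    for source, target in csv.reader(f):
        adjacency[source].append(target)

print(len(adjacency), "articles with outgoing links")
```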

2. Writing CSV files to the SQLite3 database

Once the dumps have been processed by step 1, the CSV files must be written to a SQLite3 database in order to perform some data manipulation: deleting articles that don't exist, removing redirect loops and chains, recording which articles are simply aliases of another article, and partitioning the graph in step 3.

This is done by running the following command in a terminal:

make step2-dp
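
To make the cleanup concrete, here is a minimal sketch of one of those manipulations: collapsing redirect chains (A → B → C becomes A → C) and deleting redirect loops. This is not the project's actual implementation, and the redirect table and column names are hypothetical; see the schema linked above for the real layout:

```python
import os
import sqlite3

con = sqlite3.connect(os.environ["SQLITE3_DB_PATH"])
# Hypothetical table: redirect(source, target).
redirects = dict(con.execute("SELECT source, target FROM redirect"))

def resolve(title):
    """Follow a redirect chain to its final target; None on a loop."""
    seen = set()
    while title in redirects:
        if title in seen:
            return None  # redirect loop: no valid final target
        seen.add(title)
        title = redirects[title]
    return title

for source in list(redirects):
    final = resolve(source)
    if final is None:
        con.execute("DELETE FROM redirect WHERE source = ?", (source,))
    else:
        con.execute("UPDATE redirect SET target = ? WHERE source = ?", (final, source))

con.commit()
con.close()
```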

3. "Partitioning" the graph

Once step 2 is done, the final required step is partitioning the graph. This reduces execution time for requests between articles that have no connecting path.

NOTE: Because this is a directed graph, the partitioning does not identify every pair of articles with no connecting path, but it does catch a significant number of them.

This is done using the following command:

make step3-dp

NOTE: Don't run this if you want to use the adjacency lists for your own projects.
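
The idea behind the partitioning can be sketched as follows, assuming the directed edges fit in memory: group articles into weakly connected components, so a query between articles in different components can be rejected immediately. Within a single component a directed path may still be missing, which is why only a significant fraction of no-path pairs is caught:

```python
from collections import defaultdict, deque

def weak_components(edges):
    """Label each node with a weakly-connected-component id."""
    undirected = defaultdict(set)
    for a, b in edges:
        undirected[a].add(b)
        undirected[b].add(a)

    component, next_id = {}, 0
    for start in undirected:
        if start in component:
            continue
        component[start] = next_id
        queue = deque([start])
        while queue:
            for neighbor in undirected[queue.popleft()]:
                if neighbor not in component:
                    component[neighbor] = next_id
                    queue.append(neighbor)
        next_id += 1
    return component

# Articles in different components can never reach each other.
parts = weak_components([("A", "B"), ("C", "D")])
assert parts["A"] == parts["B"] and parts["A"] != parts["C"]
```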

4. Optional: Building the serialized adjacency lists for the webserver to use

make step4-dp

5. Optional: Cleaning up the database to reduce size

After step 3, the database at SQLITE3_DB_PATH contains some large tables, such as article_link_edge_directed, that the webserver will not use. They can be dropped by running

make step5-dp

NOTE: Running VACUUM on the SQLite3 database will also reduce its size.
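
VACUUM can be run from any SQLite3 client; for example, from Python:

```python
import os
import sqlite3

# Rebuild the database file so the pages freed by the dropped tables
# are actually returned to the filesystem.
con = sqlite3.connect(os.environ["SQLITE3_DB_PATH"])
con.execute("VACUUM")
con.close()
```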

Running the webserver

In production mode


Environment variables needed:

| Name | Description |
| --- | --- |
| ADJACENCY_LIST_PATH | Path to the serialized adjacency list directory generated in section 1 |
| SQLITE3_DB_DIR | Path to the SQLite3 database directory generated in section 1 |
| CAPTCHA_ENABLED | Whether the captcha should be enabled (default: 1) |
| CAPTCHA_SECRET | Google reCAPTCHA secret (optional) |
| CAPTCHA_SITEKEY | Google reCAPTCHA site key (optional) |

Requirements: Docker and Docker Compose

  1. Build the webserver image

    make build-image
    
  2. Run the server

    make run-webapp
    

In development mode

Environment variables needed:

| Name | Description |
| --- | --- |
| ADJACENCY_LIST_PATH | Path to the serialized adjacency list directory generated in section 1 |
| SQLITE3_DB_PATH | Path to the SQLite3 database file generated in section 1 |
| CAPTCHA_ENABLED | Whether the captcha should be enabled (default: 1) |
| CAPTCHA_SECRET | Captcha secret (optional) |
| CAPTCHA_SITEKEY | Captcha site key (optional) |

  1. Run webpack to enable hot reloading of the webpage

    make webpack-watch
    cd shortestpath/webapp/client && npm run watch
    
  2. Host a Redis server

    Environment variables needed:

    | Name | Description |
    | --- | --- |
    | REDIS_HOST | Hostname where Redis is running |
    | REDIS_PORT | Port where Redis is running |
  3. Run the webserver

    Environment variable WIKISP_DEBUG must be set to 1

    make run-webpp-dev
    cd shortestpath/webapp && go run main.go
    
