SortedWordCounter

Description

This application runs WordCount on a remote file whose URL is passed as an execution parameter. After counting the words, it sorts them by descending frequency and stores the result in a remote FTP folder, which is also defined by parameters. Everything runs in Docker containers that hold a Hadoop cluster configuration with 1 master and 2 slaves.

The remote file to be processed can be either a single plain text file or a zip file with many files inside; both work the same way.
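As an illustration of how a single input stream can be treated the same way whether it is plain text or a zip archive, here is a minimal sketch that uses only the JDK. The class name `InputExtractor` and its methods are hypothetical and are not taken from this project's `io` package; the sketch also assumes the text is UTF-8 encoded. It peeks at the first four bytes for the zip signature and then returns one string per contained file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not the project's actual class.
public class InputExtractor {

    /** Returns one string for a plain text input, or one string per file entry for a zip input. */
    public static List<String> extractTexts(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        List<String> texts = new ArrayList<String>();
        if (looksLikeZip(in)) {
            ZipInputStream zip = new ZipInputStream(in);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // read() returns -1 at the end of the current entry
                    texts.add(readAll(zip));
                }
            }
        } else {
            texts.add(readAll(in));
        }
        return texts;
    }

    /**
     * Peeks at the first four bytes and compares them with the zip
     * local file header signature "PK\3\4". An empty archive starts with a
     * different signature and is treated as plain text by this sketch.
     */
    private static boolean looksLikeZip(BufferedInputStream in) throws IOException {
        in.mark(4);
        byte[] magic = new byte[4];
        int n = 0;
        while (n < 4) {
            int r = in.read(magic, n, 4 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        in.reset();
        return n == 4 && magic[0] == 'P' && magic[1] == 'K' && magic[2] == 3 && magic[3] == 4;
    }

    /** Reads the stream to its end and decodes it as UTF-8 (an assumption of this sketch). */
    private static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int r;
        while ((r = in.read(buf)) != -1) {
            out.write(buf, 0, r);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A caller could pass the stream obtained from `new java.net.URL(url).openStream()` to `extractTexts`. Buffering whole files in memory keeps the sketch short; for inputs of several hundred megabytes a streaming approach would be more appropriate.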

Libraries such as Apache FileUtils, the SDK to store/retrieve data from Amazon S3, and other utilities have been omitted in order to develop with the basics. The code has been compiled with Java 7 because of the Docker containers' configuration.

At first I tried to use Maven to handle dependencies but, because of problems with some libraries, I decided to prioritize other development work.

Packages structure

eu.antoniolopez.mapreduce: contains the main file
eu.antoniolopez.mapreduce.io: contains the classes to work with files: download, unzip, upload,...
eu.antoniolopez.mapreduce.mapper: two mappers, one to count words, the other to order by frequency
eu.antoniolopez.mapreduce.reducer: two reducers, one to count words, the other to order by frequency
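The two mapper/reducer pairs correspond to two chained stages: count the words, then order them by frequency. The sketch below simulates that data flow in plain Java, without any Hadoop dependency, so that it can be run on its own. It is not this project's code: the class and method names are hypothetical, and the tokenization (lowercasing, splitting on non-word characters) and the alphabetical ordering of ties are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free simulation of the two stages.
public class TwoStageWordCountSketch {

    /** Stage 1: the "map" step emits (word, 1); the "reduce" step sums the ones per word. */
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            // Assumption: words are lowercased and split on non-word characters.
            for (String token : line.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                Integer current = counts.get(token);
                counts.put(token, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    /**
     * Stage 2: the "map" step swaps each pair to (count, word) so that the
     * count becomes the sort key; keys are ordered descending and the
     * "reduce" step emits "word<TAB>count" lines.
     */
    public static List<String> sortByFrequency(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> byCount =
                new TreeMap<Integer, List<String>>(Collections.<Integer>reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            List<String> words = byCount.get(e.getValue());
            if (words == null) {
                words = new ArrayList<String>();
                byCount.put(e.getValue(), words);
            }
            words.add(e.getKey());
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            List<String> words = e.getValue();
            Collections.sort(words); // assumption: ties are listed alphabetically
            for (String word : words) {
                out.add(word + "\t" + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat and the hat", "The end");
        for (String row : sortByFrequency(countWords(lines))) {
            System.out.println(row);
        }
    }
}
```

In a real Hadoop job the second stage would typically rely on the framework's shuffle to sort the keys, with a descending comparator configured on the job; the `TreeMap` with a reversed comparator plays that role here.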

Installing Docker Containers

Follow these instructions to install the Docker container cluster used to test this MapReduce application: http://kiwenlau.blogspot.com.es/2015/05/quickly-build-arbitrary-size-hadoop.html

This is a 3-node Hadoop cluster based on Ubuntu, with Hadoop 2.3.0 and Java 7.

Start container, hadoop and run command

There is a script in this project's root called run.sh. It does almost the same as the original scripts, with some minimal modifications:

It doesn't open a bash console on the master container.
It copies the jar file from the project root to the master container.
It prints the command line to run Hadoop with this jar and a sample book, "War and Peace" by Tolstoy: http://www.gutenberg.org/ebooks/2600.txt.utf-8 (3.3 MB)

It has also been tested with a larger file (843.6 MB): https://dumps.wikimedia.org/enwiki/20160113/enwiki-20160113-abstract1.xml

Dependencies

All needed dependencies are stored in the src/lib folder.
