RaspPi-Cluster

An efficient quick-start tool to build a Raspberry Pi (or Debian-based) cluster with popular ecosystems such as Hadoop and Spark.

In the following tutorial I will use a four-node Raspberry Pi cluster as an example.

(Where several similar options exist, I'll use my preferred one.)

For instance, the memory allocation settings are sized for a Raspberry Pi 3 with 1 GB of RAM.

After setting up the environment, I'll deploy some popular distributed computing ecosystems on it, try to write quick-start scripts for them, and maybe add some example demos.

Usage

Usage in Detail !! (Manual)

Quick Start

Important!! First check the user settings in configure.yaml (for deeper settings, check the User Settings part of fabfile.py).

# Install local dependencies
python3 -m pip install -r requirements.txt

Quick Setup

fab update-and-upgrade # Bring apt packages up to date (this can also be done through the first-login GUI of Raspbian Buster)
fab env-setup # Quickly install basic utilities
fab set-hostname # Set the hostname of each node (a reboot will be needed)
fab hosts-config # Set every node's hostname-to-IP mapping on each Raspberry Pi (otherwise they can't find each other by hostname)
fab ssh-config # Generate an ssh key and distribute it to all nodes
fab change-passwd # Change the password for better security (remember to also update fabfile.py later if you change the pi user's password)
fab expand-swap # Expand the swap space (default 1024 MB; use --size=MEMSIZE to match your needs; the system default is 100 MB)

Regularly used functions (make sure you've generated an ssh key, or move your own ssh key to ./connection/id_rsa)

fab ssh-connect NODE_NUM # Connect to any node by its index without a password (use the -h flag to connect as the hadoop user)
fab uploadfile file_or_dir -s -p # Upload a file or folder to the remote nodes (use the -n=NODE_NUM flag for a specific node)
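
Every fab command above is a Fabric 2.x task defined in fabfile.py, which opens SSH connections to the nodes and runs shell commands on them. The snippet below is only a minimal sketch of that mechanism, with made-up IPs, user, and task name; the real tasks read these values from configure.yaml and the User Settings part of fabfile.py.

# A minimal Fabric 2.x sketch (not the repository's actual fabfile.py):
# run the same shell command on every node over SSH.
from fabric import Connection, task

# Placeholder values -- in this project they come from configure.yaml
NODE_IPS = ['192.168.0.101', '192.168.0.102', '192.168.0.103', '192.168.0.104']
USER = 'pi'
PASSWORD = 'raspberry'

@task
def uptime_all(ctx):
    """Print the uptime of every node (run with: fab uptime-all)."""
    for ip in NODE_IPS:
        conn = Connection(host=ip, user=USER, connect_kwargs={'password': PASSWORD})
        result = conn.run('uptime', hide=True)
        print(f'{ip}: {result.stdout.strip()}')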

Hadoop

If you changed the default hostnames in fabfile.py or configure.yaml, make sure you also change the Hadoop configuration files in ./Files.

(If you're using cloud servers, make sure you've opened the ports that Hadoop needs.)

fab install-hadoop # A one-button setup of the Hadoop environment on all nodes!!!

fab update-hadoop-conf # Whenever you update the configuration files locally, push them to all nodes at once

(the ssh key of the hadoop user is stored in ./connection/hadoopSSH)

Utility functions

fab start-hadoop
fab restart-hadoop
fab stop-hadoop

fab status-hadoop # Monitor Hadoop behavior

fab example-hadoop # If everything is set up, you can play around with some official Hadoop examples

Spark

If you changed the default hostnames in fabfile.py or configure.yaml, make sure you also change the Spark configuration files in ./Files.

fab install-spark

There are lots of utility functions, just like for Hadoop. Check them out with fab --list
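
As a quick sanity check that Spark jobs actually run on the cluster, you can submit a small PySpark script after fab install-spark finishes. The sketch below is illustrative only: the file name and sample count are made up, and it assumes Spark is submitted to YARN (adjust the --master value of spark-submit if you run Spark in standalone mode).

# pi_estimate.py -- a minimal PySpark sketch to exercise the cluster (illustrative only)
# Submit it, for example, with: spark-submit --master yarn pi_estimate.py
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PiEstimate').getOrCreate()
sc = spark.sparkContext

SAMPLES = 100000  # number of random points to draw

def inside(_):
    # Draw a random point in the unit square; return 1 if it falls inside the unit circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(SAMPLES), 8).map(inside).reduce(add)
print('Pi is roughly %f' % (4.0 * count / SAMPLES))
spark.stop()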

Jupyter Notebook with PySpark

This will be installed as the hadoop user.

fab install-jupyter
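
Once Jupyter is running, a notebook cell only needs a SparkSession to reach the cluster. The cell below is a minimal sketch; the master URL is an assumption (use 'yarn' if Spark sits on top of Hadoop YARN, or a spark://<master-hostname>:7077 URL for standalone mode).

# Notebook cell: create a SparkSession against the cluster (illustrative only)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('NotebookTest')
         .master('yarn')  # assumption: Spark on YARN; use spark://<master>:7077 for standalone
         .getOrCreate())

# A tiny DataFrame to confirm that the executors respond
df = spark.createDataFrame([(1, 'alpha'), (2, 'beta')], ['id', 'name'])
df.show()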

Docker Swarm

fab install-docker

VSCode code-server

fab install-codeserver

Example

Subject             Ecosystem   Purpose
MapReduce Practice  Hadoop      MapReduce practice with Hadoop Streaming (see the sketch below)
Spark Practice      Spark
Inverted Index                  Focus on multiple inverted index strategies for search
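
For the MapReduce practice above, Hadoop Streaming lets plain Python scripts act as the mapper and reducer: they read lines from stdin and emit tab-separated key/value pairs on stdout. The word-count script below is a generic sketch, not the example code shipped in this repository; pass it to the hadoop-streaming jar as -mapper 'python3 wordcount_streaming.py map' and -reducer 'python3 wordcount_streaming.py reduce'.

#!/usr/bin/env python3
# wordcount_streaming.py -- generic word-count mapper/reducer for Hadoop Streaming
# (illustrative sketch, not the repository's example code)
import sys

def mapper():
    # Emit every word with a count of 1
    for line in sys.stdin:
        for word in line.split():
            print(f'{word}\t1')

def reducer():
    # Streaming sorts mapper output by key, so equal words arrive consecutively
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip('\n').split('\t')
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f'{current}\t{count}')
            current, count = word, int(value)
    if current is not None:
        print(f'{current}\t{count}')

if __name__ == '__main__':
    mapper() if sys.argv[1] == 'map' else reducer()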

Steps

A step-by-step record of how I built this system.

  • Preparation
    • Hardware purchase
    • Software packages and dependencies (PC/laptop)
      • Python > 3.6
      • Fabric 2.X
  1. Setup Raspberry Pis

  2. Assemble the hardware

    rpi-cluster

  3. Follow steps in Quick Setup

    • Make sure
      1. (setup locale)
      2. update and upgrade
      3. setup environment
        1. git
        2. Java (JDK)
      4. set up hostnames (for each node, and so the nodes can resolve each other)
      5. ssh keys
      6. expand swap (if using a Raspberry Pi 3 or a Raspberry Pi 4 with little RAM)
  4. Setup fabric (brief notes) - execute shell commands remotely over SSH to all hosts at once!

    • I built some utility functions first and then moved on to setting up Hadoop
    • whenever a general-purpose operation is needed, I add it as a task.
  5. Setup Hadoop

  6. Setup Spark

  7. Setup Jupyter with PySpark and Parallel IPython

  8. Setup Docker Swarm - TODO

  9. Setup Kubernetes - TODO

  10. Setup Distributed TensorFlow - TODO

    • on Hadoop
    • on Kubernetes

Not Big Data / Cluster Related

  1. Setup VSCode code-server - TODO

Notes about distributed computing

Algorithm

Links

Notes about specific ecosystems

Hadoop

Spark

Distributed MongoDB

Kubernetes

Distributed TensorFlow

Elasticsearch

RediSearch

High Performance Computing (HPC)

Resource Manager

Intel has updated their DevCloud system, which is now called oneAPI.

Resource Allocation System (RAS)

Sun Grid Engine (SGE)

Torque/PBS

TODO
