RaspPi-Cluster

An efficient quick-start tool to build a Raspberry Pi (or Debian-based) cluster with popular ecosystems such as Hadoop and Spark.

In the following tutorial I will use a four-node Raspberry Pi cluster as an example.

(Where several similar options exist, I'll use my preferred one.)

For instance, the memory allocation settings are sized for a Raspberry Pi 3 with 1 GB of RAM.

After setting up the environment, I'll deploy some popular distributed computing ecosystems on it, try to write quick-start scripts for them, and maybe add some example demos.

Usage

Usage in Detail !! (Manual)

Quick Start

Important!! First check the user settings in configure.yaml (for deeper settings, check the User Settings part of fabfile.py).

# Install local dependencies
python3 -m pip install -r requirements.txt

Quick Setup

fab update-and-upgrade # Bring apt packages up to date (this can also be done through the first-login GUI of Raspbian Buster)
fab env-setup # Quickly install basic utilities
fab set-hostname # Set the hostname of each node (a reboot will be needed)
fab hosts-config # Set every node's hostname-to-IP mapping on each Raspberry Pi (otherwise they can't find each other by hostname)
fab ssh-config # Generate an ssh key and distribute it to all nodes
fab change-passwd # Change the password for better security (remember to also update fabfile.py later if you change the pi user's password)
fab expand-swap # Expand the swap space (default 1024 MB; use --size=MEMSIZE to match your needs; the system default is 100 MB)

Regularly used functions (make sure you've generated an ssh key, or move your own ssh key to ./connection/id_rsa)

fab ssh-connect NODE_NUM # Connect to any node by its index without a password (use the -h flag to connect as the hadoop user)
fab uploadfile file_or_dir -s -p # Upload a file or folder to the remote nodes (use the -n=NODE_NUM flag for a specific node)
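
Every fab command above is a Fabric 2.x task defined in fabfile.py, which opens SSH connections to the nodes and runs shell commands on them. The snippet below is only a minimal sketch of that mechanism, with made-up IPs, user, and task name; the real tasks read these values from configure.yaml and the User Settings part of fabfile.py.

# A minimal Fabric 2.x sketch (not the repository's actual fabfile.py):
# run the same shell command on every node over SSH.
from fabric import Connection, task

# Placeholder values -- in this project they come from configure.yaml
NODE_IPS = ['192.168.0.101', '192.168.0.102', '192.168.0.103', '192.168.0.104']
USER = 'pi'
PASSWORD = 'raspberry'

@task
def uptime_all(ctx):
    """Print the uptime of every node (run with: fab uptime-all)."""
    for ip in NODE_IPS:
        conn = Connection(host=ip, user=USER, connect_kwargs={'password': PASSWORD})
        result = conn.run('uptime', hide=True)
        print(f'{ip}: {result.stdout.strip()}')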

Hadoop

If you changed the default hostnames in fabfile.py or configure.yaml, make sure you also change the Hadoop configuration files in ./Files.

(If you're using cloud servers, make sure you've opened the ports that Hadoop needs.)

fab install-hadoop # A one-button setup of the Hadoop environment on all nodes!!!

fab update-hadoop-conf # Whenever you update the configuration files locally, push them to all nodes at once

(the ssh key of the hadoop user is stored in ./connection/hadoopSSH)

Utility functions

fab start-hadoop
fab restart-hadoop
fab stop-hadoop

fab status-hadoop # Monitor Hadoop behavior

fab example-hadoop # If everything is set up, you can play around with some official Hadoop examples

Spark

If you changed the default hostnames in fabfile.py or configure.yaml, make sure you also change the Spark configuration files in ./Files.

fab install-spark

There are lots of utility functions, just like for Hadoop. Check them out with fab --list
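
As a quick sanity check that Spark jobs actually run on the cluster, you can submit a small PySpark script after fab install-spark finishes. The sketch below is illustrative only: the file name and sample count are made up, and it assumes Spark is submitted to YARN (adjust the --master value of spark-submit if you run Spark in standalone mode).

# pi_estimate.py -- a minimal PySpark sketch to exercise the cluster (illustrative only)
# Submit it, for example, with: spark-submit --master yarn pi_estimate.py
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PiEstimate').getOrCreate()
sc = spark.sparkContext

SAMPLES = 100000  # number of random points to draw

def inside(_):
    # Draw a random point in the unit square; return 1 if it falls inside the unit circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(SAMPLES), 8).map(inside).reduce(add)
print('Pi is roughly %f' % (4.0 * count / SAMPLES))
spark.stop()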

Jupyter Notebook with PySpark

This will be installed as the hadoop user.

fab install-jupyter
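
Once Jupyter is running, a notebook cell only needs a SparkSession to reach the cluster. The cell below is a minimal sketch; the master URL is an assumption (use 'yarn' if Spark sits on top of Hadoop YARN, or a spark://<master-hostname>:7077 URL for standalone mode).

# Notebook cell: create a SparkSession against the cluster (illustrative only)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('NotebookTest')
         .master('yarn')  # assumption: Spark on YARN; use spark://<master>:7077 for standalone
         .getOrCreate())

# A tiny DataFrame to confirm that the executors respond
df = spark.createDataFrame([(1, 'alpha'), (2, 'beta')], ['id', 'name'])
df.show()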

Docker Swarm

fab install-docker

VSCode code-server

fab install-codeserver

Example

Subject             Ecosystem   Purpose
MapReduce Practice  Hadoop      MapReduce practice with Hadoop Streaming (see the sketch below)
Spark Practice      Spark
Inverted Index                  Focus on multiple inverted index strategies for search
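
For the MapReduce practice above, Hadoop Streaming lets plain Python scripts act as the mapper and reducer: they read lines from stdin and emit tab-separated key/value pairs on stdout. The word-count script below is a generic sketch, not the example code shipped in this repository; pass it to the hadoop-streaming jar as -mapper 'python3 wordcount_streaming.py map' and -reducer 'python3 wordcount_streaming.py reduce'.

#!/usr/bin/env python3
# wordcount_streaming.py -- generic word-count mapper/reducer for Hadoop Streaming
# (illustrative sketch, not the repository's example code)
import sys

def mapper():
    # Emit every word with a count of 1
    for line in sys.stdin:
        for word in line.split():
            print(f'{word}\t1')

def reducer():
    # Streaming sorts mapper output by key, so equal words arrive consecutively
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip('\n').split('\t')
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f'{current}\t{count}')
            current, count = word, int(value)
    if current is not None:
        print(f'{current}\t{count}')

if __name__ == '__main__':
    mapper() if sys.argv[1] == 'map' else reducer()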

Steps

A step-by-step record of how I built this system.

  • Preparation
    • Hardware purchase
    • Software packages and dependencies (PC/laptop)
      • Python > 3.6
      • Fabric 2.X
  1. Setup Raspberry Pis

  2. Assemble the hardware

    rpi-cluster

  3. Follow steps in Quick Setup

    • Make sure
      1. (setup locale)
      2. update and upgrade
      3. setup environment
        1. git
        2. Java (JDK)
      4. set up hostnames (for each node, and so the nodes can resolve each other)
      5. ssh keys
      6. expand swap (if using a Raspberry Pi 3 or a Raspberry Pi 4 with little RAM)
  4. Setup fabric (brief notes) - execute shell commands remotely over SSH to all hosts at once!

    • I built some utility functions first and then moved on to setting up Hadoop
    • whenever a general-purpose operation is needed, I add it as a task.
  5. Setup Hadoop

  6. Setup Spark

  7. Setup Jupyter with PySpark and Parallel IPython

  8. Setup Docker Swarm - TODO

  9. Setup Kubernetes - TODO

  10. Setup Distributed TensorFlow - TODO

    • on Hadoop
    • on Kubernetes

Not Big Data / Cluster Related

  1. Setup VSCode code-server - TODO

Notes about distributed computing

Algorithm

Links

Notes about specific ecosystems

Hadoop

Spark

Distributed MongoDB

Kubernetes

Distributed TensorFlow

Elasticsearch

RediSearch

High Performance Computing (HPC)

Resource Manager

Intel has updated their DevCloud system, which is now called oneAPI.

Resource Allocation System (RAS)

Sun Grid Engine (SGE)

Torque/PBS

TODO
