
noleto/pyspark-jupyter


Jupyter Notebook with Python + Spark for the Toulouse Data Science workshop

What it Gives You

  • Jupyter Notebook 3.2
  • Conda Python 2.7.x
  • pyspark, pandas, matplotlib, scipy, seaborn, scikit-learn pre-installed
  • Spark 1.6.0 for use in local mode
  • Unprivileged user jovyan (uid=1000, configurable, see options) in group users (gid=100) with ownership over /home/jovyan and /opt/conda

Basic Use

The following command starts a container with the Notebook server listening for HTTP connections on port 8888, without authentication configured. Port 4040 is also published so that Spark's web UI is reachable once a job is running.

docker run -d -p 8888:8888 -p 4040:4040 noleto/pyspark-jupyter
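If you want notebooks to survive container restarts, you can additionally mount a host directory into the container. A minimal sketch, assuming the base image's usual notebook directory /home/jovyan/work (check the base image page below if your version differs):

```shell
# same run command, with a host directory mounted for notebooks;
# /home/jovyan/work is an assumption based on the Jupyter Docker stack images
docker run -d -p 8888:8888 -p 4040:4040 \
    -v "$PWD/notebooks:/home/jovyan/work" \
    noleto/pyspark-jupyter
```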

Using Spark Local Mode

This configuration is suitable for working with Spark on small, local data sets.

  1. Run the container as shown above.
  2. Open a Python 2 notebook.
  3. A SparkContext, available as sc, is already configured for local mode.

For example, the first few cells in a Python 2 notebook might read:

# do something to prove it works
# (sc is the preconfigured SparkContext; xrange is Python 2's lazy range)
rdd = sc.parallelize(xrange(1000))
rdd.takeSample(False, 5)

Notebook Options

See the base image's page, Minimal Jupyter Notebook Stack, for the available notebook server options.

About

Workshop Data Munging with PySpark
