PySpark Cookbook

Code base for the PySpark Cookbook by Denny Lee and Tomasz Drabas.

Introduction

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.

You'll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You'll then get familiar with the modules available in PySpark and start using them. You'll also discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You'll then move on to using ML and MLlib to solve machine learning problems with PySpark, and use GraphFrames to solve graph-processing problems.

By the end of this book, you will be able to use the Python API for Apache Spark to solve problems associated with building data-intensive applications.

About the authors

Denny Lee is a Technical Product Marketing Manager with Databricks, working as close to Apache Spark as humanly possible. Previously, Denny was a Principal Program Manager at Microsoft for the Azure Cosmos DB team – Microsoft’s blazing fast, planet-scale managed document store service. He is a hands-on distributed systems and data sciences engineer with more than 20 years of experience developing Internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments.

He has extensive experience building greenfield teams and acting as a turnaround and change catalyst. Prior to joining the Azure Cosmos DB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft’s Hadoop on Windows and Azure service (currently known as HDInsight). Denny also has a Master's in Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers for the last fifteen years.

Tomasz Drabas is a Senior Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 15 years of experience in data analytics and data science in numerous fields (advanced technology, airlines, telecommunications, finance, and consulting), gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research with a focus on choice modeling and revenue management applications in the airline industry.

At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, and pattern recognition using Spark.

Tomasz also co-authored Learning PySpark with Denny Lee in 2017 and authored the Practical Data Analysis Cookbook (Python-focused), published by Packt Publishing in 2016.

You can purchase our books and videos from:

Packt Publishing
- Learning PySpark: https://www.packtpub.com/big-data-and-business-intelligence/learning-pyspark
- Learning PySpark (videos): https://www.packtpub.com/big-data-and-business-intelligence/learning-pyspark-video
- Practical Data Analysis Cookbook: https://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-cookbook
Amazon
- Learning PySpark: https://www.amazon.com/Learning-PySpark-Tomasz-Drabas/dp/1786463709
- Practical Data Analysis Cookbook: https://www.amazon.com/Practical-Analysis-Cookbook-Tomasz-Drabas/dp/1783551666
O'Reilly
- Learning PySpark: http://shop.oreilly.com/product/9781786463708.do
- Learning PySpark (videos): http://shop.oreilly.com/product/0636920172277.do
- Introduction to Apache Spark 2.0: http://shop.oreilly.com/product/0636920088851.do
- Practical Data Analysis Cookbook: http://shop.oreilly.com/product/9781783551668.do

