
Scaling up empirical research to bigger data with Python


This workshop is intended for researchers who have experience in analyzing data that comfortably fits in memory but are interested in scaling up to bigger-than-memory datasets. The following topics will be covered: measuring performance and memory usage; sampling and the split-apply-combine strategy; data type optimization; efficient storage with parquet; simple parallelization; introduction to Dask. Participants interested in following along will be provided with an example dataset and instructions on setting up a programming environment. All workshop materials will be publicly available in this GitHub repository. A prerequisite exercise should give you an idea of the expected prior knowledge of Python and pandas.
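To give a flavor of the split-apply-combine and data type optimization topics, here is a minimal sketch (not actual workshop code) of processing a bigger-than-memory CSV in chunks with pandas; the file name and column names are hypothetical.

import pandas as pd

partials = []
# read the file in memory-sized chunks ("split")
for chunk in pd.read_csv("businesses.csv", chunksize=100_000):
    # shrink dtypes to reduce each chunk's memory footprint
    chunk["employment"] = pd.to_numeric(chunk["employment"], downcast="integer")
    # aggregate within the chunk ("apply")
    partials.append(chunk.groupby("state")["employment"].sum())
# aggregate the partial results ("combine")
total = pd.concat(partials).groupby(level=0).sum()
# store the result in an efficient columnar format
total.to_frame().to_parquet("employment_by_state.pq")

Dask automates this pattern: dask.dataframe exposes a pandas-like API and performs the chunking and combining for you behind the scenes.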

If you are new to Python, I recommend reading about Jupyter and pandas. This book shows how to use Jupyter notebooks for teaching and learning, and the QuantEcon lectures, which use Python for economics and finance, are another good resource for beginners.

This workshop has been conducted by Anton Babkin at:

  • Department of Agricultural and Resource Economics, University of Connecticut, January 27 and February 3, 2021
  • Department of Agricultural and Applied Economics, University of Wisconsin-Madison, February 3 and 9, 2021
  • 2021 Data Science Research Bazaar, University of Wisconsin-Madison, February 10, 2021

Setup

You need a running Jupyter server in order to work with the workshop notebooks. The easiest way is to launch a free cloud instance in Binder. A more difficult (but potentially more reliable) alternative is to create a conda Python environment on your local computer.

Using Binder

Click this link to launch a new Binder instance and connect to it from your browser, then open and run the setup notebook to test the environment and download data. Normal launch time is under 30 seconds, but it might take longer if the repository has been recently updated, because Binder will need to rebuild the environment from scratch.

Note that the Binder platform provides computational resources for free, so usage limits are in place and availability cannot be guaranteed. Read here about the usage policy and available resources.

Local Python

This method requires some experience or readiness to read documentation. As a reward, you will have a persistent environment under your control that does not depend on cloud service availability. This is also a typical way to set up Python for data work.

  1. Download and install miniconda (Python 3), following instructions for your operating system.

  2. Open a terminal (Anaconda Prompt on Windows) and clone this repository into a folder of your choice (git clone https://github.com/antonbabkin/ds-bazaar-workshop.git). Alternatively, download and unpack the repository code as a ZIP archive.

  3. In the terminal, navigate to the repository folder and create a new conda environment. The environment specification will be read from the binder/environment.yml file, and all required packages will be downloaded and installed.

cd ds-bazaar-workshop
conda env create -f binder/environment.yml
  4. Activate the environment and start the JupyterLab server. This will open the Jupyter interface in a browser window.
conda activate ds-bazaar-workshop
jupyter lab
  5. In Jupyter, open and run the setup notebook to test the environment and download data.

Data

Run the cells of the setup notebook to download the data into your environment.

The core dataset used in examples is a synthetic fake, generated from annual historical snapshots of InfoGroup data. InfoGroup is a proprietary database of all businesses in the US, available to University of Wisconsin researchers.

The synthetic version (SynIG) provides a subset of core variables and was generated from the original data using a combination of random fake data, modeling, random sampling, record shuffling, and noise infusion to protect the confidentiality of the original data. It has the same format as the original and resembles it in some respects (e.g., the cross-sectional distribution of establishments and employment across states and sectors), and it is suitable for educational purposes or methodology development, but it cannot be used for analysis of actual businesses. Generation is described and performed in this notebook.
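As a purely hypothetical illustration of two of the techniques named above, record shuffling and noise infusion can be as simple as the following sketch (this is not the actual SynIG generator; see the linked notebook for that, and the column names here are made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"state": ["WI", "WI", "CT"], "employment": [10, 25, 7]})
# shuffle records to break links to the original row order
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
# infuse multiplicative noise into a sensitive numeric variable
df["employment"] = (df["employment"] * rng.normal(1.0, 0.1, size=len(df))).round().astype(int)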

License

Project code is licensed under the MIT license.

The content and provided data are licensed under the Creative Commons Attribution 4.0 International license.
