Better SEO for 10 Minutes to Dask #9182

Merged
merged 10 commits on Jun 24, 2022
1 change: 1 addition & 0 deletions docs/requirements-docs.txt
@@ -7,6 +7,7 @@ sphinx-remove-toctrees
sphinx_autosummary_accessors
sphinx-tabs
sphinx-design
jupyter_sphinx
toolz
cloudpickle>=1.5.0
pandas>=1.4.0
101 changes: 58 additions & 43 deletions docs/source/10-minutes-to-dask.rst
@@ -1,10 +1,19 @@
10 Minutes to Dask
==================

This is a short overview of what you can do with Dask. It is geared towards new users.
.. meta::
:description: This is a short overview of Dask geared towards new users. Additional Dask information can be found in the rest of the Dask documentation.

This is a short overview of Dask geared towards new users.
There is much more information contained in the rest of the documentation.

We normally import dask as follows:
.. figure:: images/dask-overview.svg
:alt: Dask overview. Dask is composed of three parts: collections, task graphs, and schedulers.
:align: center

High level collections are used to generate task graphs which can be executed by schedulers on a single machine or a cluster.

We normally import Dask as follows:

.. code-block:: python

@@ -17,16 +26,18 @@ We normally import dask as follows:

Based on the type of data you are working with, you might not need all of these.
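
For orientation, the collapsed import block above conventionally looks something like the following sketch; the aliases match the ``np``, ``pd``, ``da``, ``db``, and ``dd`` names used in the examples below, but treat the exact list as an assumption:

.. code-block:: python

   import numpy as np
   import pandas as pd

   import dask.array as da
   import dask.bag as db
   import dask.dataframe as dd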

Create a High-Level Collection
------------------------------
Creating a Dask Object
----------------------

You can make a Dask collection from scratch by supplying existing data and optionally
You can create a Dask object from scratch by supplying existing data and optionally
including information about how the chunks should be structured.

.. tabs::

.. group-tab:: DataFrame

See :doc:`dataframe`.

.. code-block:: python

>>> index = pd.date_range("2021-09-01", periods=2400, freq="1H")
@@ -43,7 +54,7 @@ including information about how the chunks should be structured.
2021-12-09 23:00:00 ... ...
Dask Name: from_pandas, 10 tasks

Now we have a DataFrame with 2 columns and 2400 rows composed of 10 partitions where
Now we have a Dask DataFrame with 2 columns and 2400 rows composed of 10 partitions where
each partition has 240 rows. Each partition represents a piece of the data.

Here are some key properties of a DataFrame:
@@ -75,30 +86,34 @@ including information about how the chunks should be structured.
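
The collapsed block above walks through those properties; a minimal sketch of inspecting them, assuming the ``ddf`` defined earlier (only the ``npartitions`` output is certain from the text, the rest are elided):

.. code-block:: python

   # number of partitions (2400 rows split into 240-row pieces)
   >>> ddf.npartitions
   10

   # index values at the partition boundaries (output elided)
   >>> ddf.divisions

   # a single partition is itself a smaller Dask DataFrame (output elided)
   >>> ddf.partitions[1]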

.. group-tab:: Array

.. code-block:: python
See :doc:`array`.

.. jupyter-execute::
Review thread:

Contributor Author: I really like the widget that pops up for Dask arrays, so that's why I changed this only for the Dask array sections.

Member: Oh I'm happy for you to change it for all of them if you like! I don't think I knew about jupyter-execute.

Contributor Author: ok cool! I think it really only makes a difference for array, unless you're saying they should use jupyter-execute for consistency?

Member: oh I assumed that the dataframe would also have their jupyter repr?

Contributor Author: it does! to me it didn't look all that different, but I can add it in.

Contributor Author: ok just pushed new changes w/ jupyter-execute added for dask bag and dask dataframe. visually, not sure which way I prefer (there's a bit more whitespace around some blocks, e.g.)

Member: hmm yeah I see what you mean. I am fine with either. I'll just merge whatever is here before releasing tomorrow :)

Contributor Author: ok, changed it back! thx for your help :)


>>> data = np.arange(100_000).reshape(200, 500)
... a = da.from_array(data, chunks=(100, 100))
... a
dask.array<array, shape=(200, 500), dtype=int64, chunksize=(100, 100), chunktype=numpy.ndarray>
import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a

Now we have a 2D array with the shape (200, 500) composed of 10 chunks where
each chunk has the shape (100, 100). Each chunk represents a piece of the data.

Here are some key properties of an Array:

.. code-block:: python
Here are some key properties of a Dask Array:

>>> # inspect the chunks
... a.chunks
((100, 100), (100, 100, 100, 100, 100))
.. jupyter-execute::

>>> # access a particular block of data
... a.blocks[1, 3]
dask.array<blocks, shape=(100, 100), dtype=int64, chunksize=(100, 100), chunktype=numpy.ndarray>
# inspect the chunks
a.chunks

# access a particular block of data
a.blocks[1, 3]

.. group-tab:: Bag

See :doc:`bag`.

.. code-block:: python

>>> b = db.from_sequence([1, 2, 3, 4, 5, 6, 2, 1], npartitions=2)
@@ -112,7 +127,7 @@ including information about how the chunks should be structured.
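
Past construction, a Bag behaves like a functional collection; a small sketch using the ``b`` defined above (only ``take`` is shown with output, since it executes eagerly and pulls from the first partition):

.. code-block:: python

   >>> b.take(3)                                # eagerly grab the first elements
   (1, 2, 3)
   >>> evens = b.filter(lambda x: x % 2 == 0)   # lazy
   >>> squares = b.map(lambda x: x ** 2)        # lazy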
Indexing
--------

Indexing Dask collections feels just like slicing numpy arrays or pandas dataframes.
Indexing Dask collections feels just like slicing NumPy arrays or pandas DataFrames.

.. tabs::

@@ -141,10 +156,9 @@ Indexing Dask collections feels just like slicing numpy arrays or pandas dataframes.
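
The DataFrame tab is collapsed above; a sketch of the pandas-style access it demonstrates, assuming the datetime-indexed ``ddf`` from earlier (the slice bounds are illustrative):

.. code-block:: python

   >>> col = ddf.b                               # select a column
   >>> window = ddf["2021-10-01":"2021-10-09"]   # slice rows by the index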

.. group-tab:: Array

.. code-block:: python
.. jupyter-execute::

>>> a[:50, 200]
dask.array<getitem, shape=(50,), dtype=int64, chunksize=(50,), chunktype=numpy.ndarray>
a[:50, 200]

.. group-tab:: Bag

@@ -362,13 +376,13 @@ triggering computation, we can inspect the task graph to figure out what's going on.
>>> result.dask
HighLevelGraph with 7 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f129df7a9d0>
0. from_pandas-0b850a81e4dfe2d272df4dc718065116
1. loc-fb7ada1e5ba8f343678fdc54a36e9b3e
2. getitem-55d10498f88fc709e600e2c6054a0625
3. series-cumsum-map-131dc242aeba09a82fea94e5442f3da9
4. series-cumsum-take-last-9ebf1cce482a441d819d8199eac0f721
5. series-cumsum-d51d7003e20bd5d2f767cd554bdd5299
6. sub-fed3e4af52ad0bd9c3cc3bf800544f57
1. from_pandas-0b850a81e4dfe2d272df4dc718065116
2. loc-fb7ada1e5ba8f343678fdc54a36e9b3e
3. getitem-55d10498f88fc709e600e2c6054a0625
4. series-cumsum-map-131dc242aeba09a82fea94e5442f3da9
5. series-cumsum-take-last-9ebf1cce482a441d819d8199eac0f721
6. series-cumsum-d51d7003e20bd5d2f767cd554bdd5299
7. sub-fed3e4af52ad0bd9c3cc3bf800544f57

>>> result.visualize()

@@ -382,12 +396,12 @@ triggering computation, we can inspect the task graph to figure out what's going on.
>>> b.dask
HighLevelGraph with 6 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7fd33a4aa400>
0. array-ef3148ecc2e8957c6abe629e08306680
1. amax-b9b637c165d9bf139f7b93458cd68ec3
2. amax-partial-aaf8028d4a4785f579b8d03ffc1ec615
3. amax-aggregate-07b2f92aee59691afaf1680569ee4a63
4. getitem-f9e225a2fd32b3d2f5681070d2c3d767
5. add-f54f3a929c7efca76a23d6c42cdbbe84
1. array-ef3148ecc2e8957c6abe629e08306680
2. amax-b9b637c165d9bf139f7b93458cd68ec3
3. amax-partial-aaf8028d4a4785f579b8d03ffc1ec615
4. amax-aggregate-07b2f92aee59691afaf1680569ee4a63
5. getitem-f9e225a2fd32b3d2f5681070d2c3d767
6. add-f54f3a929c7efca76a23d6c42cdbbe84

>>> b.visualize()

@@ -401,9 +415,9 @@ triggering computation, we can inspect the task graph to figure out what's going on.
>>> c.dask
HighLevelGraph with 3 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f96d0814fd0>
0. from_sequence-cca2a33ba6e12645a0c9bc0fd3fe6c88
1. lambda-93a7a982c4231fea874e07f71b4bcd7d
2. zip-474300792cc4f502f1c1f632d50e0272
1. from_sequence-cca2a33ba6e12645a0c9bc0fd3fe6c88
2. lambda-93a7a982c4231fea874e07f71b4bcd7d
3. zip-474300792cc4f502f1c1f632d50e0272

>>> c.visualize()

@@ -419,7 +433,7 @@ run into code that is parallelizable, but isn't just a big DataFrame or array.

.. group-tab:: Delayed: Lazy

Dask Delayed let you to wrap individual function calls into a lazily constructed task graph:
:doc:`delayed` lets you wrap individual function calls into a lazily constructed task graph:

.. code-block:: python

@@ -442,7 +456,7 @@ run into code that is parallelizable, but isn't just a big DataFrame or array.
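
The worked example is collapsed above; the core pattern is a minimal sketch like this (the function names are illustrative):

.. code-block:: python

   import dask

   @dask.delayed
   def inc(x):
       return x + 1

   @dask.delayed
   def add(x, y):
       return x + y

   a = inc(1)       # no work yet; returns a Delayed
   b = inc(2)
   c = add(a, b)    # still lazy: just grows the task graph
   c.compute()      # triggers execution and returns 5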
.. group-tab:: Futures: Immediate

Unlike the interfaces described so far, Futures are eager. Computation starts as soon
as the function is submitted.
as the function is submitted (see :doc:`futures`).

.. code-block:: python

@@ -471,7 +485,8 @@ run into code that is parallelizable, but isn't just a big DataFrame or array.
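
The documented example is collapsed above; a minimal sketch of the eager pattern, assuming ``dask.distributed`` is installed (the function name is illustrative):

.. code-block:: python

   from dask.distributed import Client

   client = Client()                  # starts a local cluster

   def inc(x):
       return x + 1

   future = client.submit(inc, 10)    # work begins immediately
   future.result()                    # blocks until done; returns 11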
Scheduling
----------

After you have generated a task graph, it is the scheduler's job to execute it.
After you have generated a task graph, it is the scheduler's job to execute it
(see :doc:`scheduling`).

By default when you call ``compute`` on a Dask object, Dask uses the thread
pool on your computer to run computations in parallel.
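
Other schedulers can be selected per call or via configuration; a sketch, assuming some Dask object ``x`` (see :doc:`scheduling` for the full set of options):

.. code-block:: python

   import dask

   x.compute(scheduler="processes")                 # choose per call

   with dask.config.set(scheduler="synchronous"):   # scoped default, handy for debugging
       x.compute()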
17 changes: 16 additions & 1 deletion docs/source/_static/style.css
@@ -6,4 +6,19 @@
.classifier::before {
content: ": ";
}


/* options for jupyter-sphinx extension */
div.jupyter_container {
box-shadow: none;
font-family: var(--pst-font-family-monospace);
border-radius: 0.4em;
}

.jupyter_container div.code_cell {
padding: 10px
}

.jupyter_container .output {
font-size: 16px;
padding: 10px
}
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -47,6 +47,7 @@
"sphinx_remove_toctrees",
"IPython.sphinxext.ipython_console_highlighting",
"IPython.sphinxext.ipython_directive",
"jupyter_sphinx",
"sphinx_copybutton",
"sphinx_design",
]