
Better SEO for 10 Minutes to Dask (#9182)
scharlottej13 committed Jun 24, 2022
1 parent aa801de commit 3c87d4e
Showing 4 changed files with 78 additions and 43 deletions.
1 change: 1 addition & 0 deletions docs/requirements-docs.txt
@@ -7,6 +7,7 @@ sphinx-remove-toctrees
sphinx_autosummary_accessors
sphinx-tabs
sphinx-design
jupyter_sphinx
toolz
cloudpickle>=1.5.0
pandas>=1.4.0
101 changes: 59 additions & 42 deletions docs/source/10-minutes-to-dask.rst
@@ -1,10 +1,19 @@
10 Minutes to Dask
==================

This is a short overview of what you can do with Dask. It is geared towards new users.
.. meta::
:description: This is a short overview of Dask geared towards new users. Additional Dask information can be found in the rest of the Dask documentation.

This is a short overview of Dask geared towards new users.
There is much more information contained in the rest of the documentation.

We normally import dask as follows:
.. figure:: images/dask-overview.svg
:alt: Dask overview. Dask is composed of three parts: collections, task graphs, and schedulers.
:align: center

High-level collections are used to generate task graphs which can be executed by schedulers on a single machine or a cluster.

We normally import Dask as follows:

.. code-block:: python
@@ -17,16 +26,18 @@ We normally import dask as follows:
Based on the type of data you are working with, you might not need all of these.
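
The import lines themselves are collapsed in this hunk. Under the conventional aliases they look roughly like this (a sketch of common Dask usage, not the exact collapsed lines):

.. code-block:: python

>>> import numpy as np
>>> import pandas as pd
>>> import dask.array as da        # Dask's NumPy-like arrays
>>> import dask.bag as db          # Dask's parallel lists
>>> import dask.dataframe as dd    # Dask's pandas-like DataFrames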

Create a High-Level Collection
------------------------------
Creating a Dask Object
----------------------

You can make a Dask collection from scratch by supplying existing data and optionally
You can create a Dask object from scratch by supplying existing data and optionally
including information about how the chunks should be structured.

.. tabs::

.. group-tab:: DataFrame

See :doc:`dataframe`.

.. code-block:: python
>>> index = pd.date_range("2021-09-01", periods=2400, freq="1H")
@@ -43,7 +54,7 @@ including information about how the chunks should be structured.
2021-12-09 23:00:00 ... ...
Dask Name: from_pandas, 10 tasks
Now we have a DataFrame with 2 columns and 2400 rows composed of 10 partitions where
Now we have a Dask DataFrame with 2 columns and 2400 rows composed of 10 partitions where
each partition has 240 rows. Each partition represents a piece of the data.
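
The construction of the underlying pandas DataFrame is collapsed in this hunk. A minimal sketch that reproduces the shape described above (the columns here are illustrative placeholders, not the ones in the doc):

.. code-block:: python

>>> import pandas as pd
... import dask.dataframe as dd
... index = pd.date_range("2021-09-01", periods=2400, freq="1H")
... # hypothetical columns; the real example's columns are collapsed in this diff
... df = pd.DataFrame({"a": range(2400), "b": 1.0}, index=index)
... ddf = dd.from_pandas(df, npartitions=10)  # 10 partitions of 240 rows each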

Here are some key properties of a DataFrame:
@@ -75,30 +86,36 @@ including information about how the chunks should be structured.
.. group-tab:: Array

.. code-block:: python
See :doc:`array`.

>>> data = np.arange(100_000).reshape(200, 500)
... a = da.from_array(data, chunks=(100, 100))
... a
dask.array<array, shape=(200, 500), dtype=int64, chunksize=(100, 100), chunktype=numpy.ndarray>
.. jupyter-execute::

import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a

Now we have a 2D array with the shape (200, 500) composed of 10 chunks where
each chunk has the shape (100, 100). Each chunk represents a piece of the data.
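
The chunk count follows from the shapes: 200 / 100 = 2 chunks along the first axis and 500 / 100 = 5 along the second, so 10 chunks in total, which ``a.numblocks`` confirms:

.. jupyter-execute::

# 2 blocks along axis 0, 5 along axis 1
a.numblocks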

Here are some key properties of an Array:
Here are some key properties of a Dask Array:

.. code-block:: python
.. jupyter-execute::

>>> # inspect the chunks
... a.chunks
((100, 100), (100, 100, 100, 100, 100))
# inspect the chunks
a.chunks

>>> # access a particular block of data
... a.blocks[1, 3]
dask.array<blocks, shape=(100, 100), dtype=int64, chunksize=(100, 100), chunktype=numpy.ndarray>
.. jupyter-execute::

# access a particular block of data
a.blocks[1, 3]
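
A block is itself a small lazy Dask array; calling ``compute`` on it materializes just that piece as a concrete NumPy array:

.. jupyter-execute::

# pull one 100x100 block into memory as a NumPy array
a.blocks[1, 3].compute()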

.. group-tab:: Bag

See :doc:`bag`.

.. code-block:: python
>>> b = db.from_sequence([1, 2, 3, 4, 5, 6, 2, 1], npartitions=2)
@@ -112,7 +129,7 @@ including information about how the chunks should be structured.
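
The Bag's repr and the rest of its example are collapsed above. A short sketch of standard Bag operations on the ``b`` created there:

.. code-block:: python

>>> b.take(3)                 # peek at the first elements
(1, 2, 3)
>>> b.distinct().compute()    # order of the result may vary
[1, 2, 3, 4, 5, 6]
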
Indexing
--------

Indexing Dask collections feels just like slicing numpy arrays or pandas dataframes.
Indexing Dask collections feels just like slicing NumPy arrays or pandas DataFrames.

.. tabs::

@@ -141,10 +158,9 @@ Indexing Dask collections feels just like slicing numpy arrays or pandas dataframes
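
The DataFrame tab is collapsed in this hunk; indexing a Dask DataFrame looks roughly like this (a sketch reusing the ``ddf`` from earlier):

.. code-block:: python

>>> ddf.b                            # column access returns a lazy Dask Series
>>> ddf["2021-10-01": "2021-10-09"]  # label-based slicing on the datetime index
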
.. group-tab:: Array

.. code-block:: python
.. jupyter-execute::

>>> a[:50, 200]
dask.array<getitem, shape=(50,), dtype=int64, chunksize=(50,), chunktype=numpy.ndarray>
a[:50, 200]

.. group-tab:: Bag

@@ -362,13 +378,13 @@ triggering computation, we can inspect the task graph to figure out what's going
>>> result.dask
HighLevelGraph with 7 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f129df7a9d0>
0. from_pandas-0b850a81e4dfe2d272df4dc718065116
1. loc-fb7ada1e5ba8f343678fdc54a36e9b3e
2. getitem-55d10498f88fc709e600e2c6054a0625
3. series-cumsum-map-131dc242aeba09a82fea94e5442f3da9
4. series-cumsum-take-last-9ebf1cce482a441d819d8199eac0f721
5. series-cumsum-d51d7003e20bd5d2f767cd554bdd5299
6. sub-fed3e4af52ad0bd9c3cc3bf800544f57
1. from_pandas-0b850a81e4dfe2d272df4dc718065116
2. loc-fb7ada1e5ba8f343678fdc54a36e9b3e
3. getitem-55d10498f88fc709e600e2c6054a0625
4. series-cumsum-map-131dc242aeba09a82fea94e5442f3da9
5. series-cumsum-take-last-9ebf1cce482a441d819d8199eac0f721
6. series-cumsum-d51d7003e20bd5d2f767cd554bdd5299
7. sub-fed3e4af52ad0bd9c3cc3bf800544f57
>>> result.visualize()
@@ -382,12 +398,12 @@ triggering computation, we can inspect the task graph to figure out what's going
>>> b.dask
HighLevelGraph with 6 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7fd33a4aa400>
0. array-ef3148ecc2e8957c6abe629e08306680
1. amax-b9b637c165d9bf139f7b93458cd68ec3
2. amax-partial-aaf8028d4a4785f579b8d03ffc1ec615
3. amax-aggregate-07b2f92aee59691afaf1680569ee4a63
4. getitem-f9e225a2fd32b3d2f5681070d2c3d767
5. add-f54f3a929c7efca76a23d6c42cdbbe84
1. array-ef3148ecc2e8957c6abe629e08306680
2. amax-b9b637c165d9bf139f7b93458cd68ec3
3. amax-partial-aaf8028d4a4785f579b8d03ffc1ec615
4. amax-aggregate-07b2f92aee59691afaf1680569ee4a63
5. getitem-f9e225a2fd32b3d2f5681070d2c3d767
6. add-f54f3a929c7efca76a23d6c42cdbbe84
>>> b.visualize()
@@ -401,9 +417,9 @@ triggering computation, we can inspect the task graph to figure out what's going
>>> c.dask
HighLevelGraph with 3 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f96d0814fd0>
0. from_sequence-cca2a33ba6e12645a0c9bc0fd3fe6c88
1. lambda-93a7a982c4231fea874e07f71b4bcd7d
2. zip-474300792cc4f502f1c1f632d50e0272
1. from_sequence-cca2a33ba6e12645a0c9bc0fd3fe6c88
2. lambda-93a7a982c4231fea874e07f71b4bcd7d
3. zip-474300792cc4f502f1c1f632d50e0272
>>> c.visualize()
@@ -419,7 +435,7 @@ run into code that is parallelizable, but isn't just a big DataFrame or array.

.. group-tab:: Delayed: Lazy

Dask Delayed let you to wrap individual function calls into a lazily constructed task graph:
:doc:`delayed` lets you wrap individual function calls into a lazily constructed task graph:

.. code-block:: python
@@ -442,7 +458,7 @@ run into code that is parallelizable, but isn't just a big DataFrame or array.
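
The body of the Delayed example is collapsed above; the pattern looks roughly like this (the ``inc`` and ``add`` functions are illustrative):

.. code-block:: python

>>> import dask
>>> @dask.delayed
... def inc(x):
...     return x + 1
>>> @dask.delayed
... def add(x, y):
...     return x + y
>>> a = inc(1)     # no work happens yet
>>> b = inc(2)
>>> c = add(a, b)  # builds up a task graph
>>> c.compute()    # executes the graph
5
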
.. group-tab:: Futures: Immediate

Unlike the interfaces described so far, Futures are eager. Computation starts as soon
as the function is submitted.
as the function is submitted (see :doc:`futures`).

.. code-block:: python
@@ -471,7 +487,8 @@ run into code that is parallelizable, but isn't just a big DataFrame or array.
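
The Futures example body is likewise collapsed; a minimal sketch of the eager interface, assuming a local ``dask.distributed`` client:

.. code-block:: python

>>> from dask.distributed import Client
>>> client = Client()                # start a local cluster
>>> def inc(x):
...     return x + 1
>>> future = client.submit(inc, 10)  # computation starts immediately
>>> future.result()                  # block until the result arrives
11
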
Scheduling
----------

After you have generated a task graph, it is the scheduler's job to execute it.
After you have generated a task graph, it is the scheduler's job to execute it
(see :doc:`scheduling`).

By default when you call ``compute`` on a Dask object, Dask uses the thread
pool on your computer to run computations in parallel.
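
You can also pick a scheduler per call. A brief sketch using the standard scheduler names Dask accepts:

.. code-block:: python

>>> import dask.array as da
>>> x = da.ones((1000, 1000), chunks=(100, 100)).sum()
>>> x.compute(scheduler="threads")      # local thread pool (the default here)
1000000.0
>>> x.compute(scheduler="processes")    # local process pool
>>> x.compute(scheduler="synchronous")  # single thread, handy for debugging
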
18 changes: 17 additions & 1 deletion docs/source/_static/style.css
@@ -6,4 +6,20 @@
.classifier::before {
content: ": ";
}


/* options for jupyter-sphinx extension */
div.jupyter_container {
box-shadow: none;
font-family: var(--pst-font-family-monospace);
border-radius: 0.4em;
}

.jupyter_container div.code_cell {
padding: 10px;
max-width: none !important;
}

.jupyter_container .output {
font-size: 16px;
padding: 10px;
}
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -47,6 +47,7 @@
"sphinx_remove_toctrees",
"IPython.sphinxext.ipython_console_highlighting",
"IPython.sphinxext.ipython_directive",
"jupyter_sphinx",
"sphinx_copybutton",
"sphinx_design",
]
