Skip to content

Commit 3fc6a7e

Browse files
committed
DOCS-modin-project#7287: Update Modin on Dask documentation
Signed-off-by: Kirill Suvorov <kirill.suvorov@intel.com>
1 parent 29861e6 commit 3fc6a7e

File tree

1 file changed

+75
-1
lines changed

1 file changed

+75
-1
lines changed

docs/development/using_pandas_on_dask.rst

+75-1
Original file line numberDiff line numberDiff line change
@@ -20,4 +20,78 @@ or turn them on in source code:
2020
2121
import modin.config as cfg
2222
cfg.Engine.put('dask')
23-
cfg.StorageFormat.put('pandas')
23+
cfg.StorageFormat.put('pandas')
24+
25+
Using Modin on Dask locally
26+
---------------------------
27+
28+
If you want to use a single node, just change the Modin Engine to Dask and
29+
continue working with the Modin Dataframe as if it were a Pandas Dataframe.
30+
You don't even have to initialize the Dask Client, because Modin will do it
31+
yourself or use the current one if it is already initialized:
32+
33+
.. code-block:: python
34+
35+
import modin.pandas as pd
36+
import modin.config as modin_cfg
37+
38+
modin_cfg.Engine.put("dask")
39+
df = pd.read_parquet("s3://my-bucket/big.parquet")
40+
41+
.. note:: In previous versions of Modin, you had to initialize Dask before importing Modin. As of Modin 0.9.0, This is no longer the case.
42+
43+
Using Modin on Dask Clusters
44+
----------------------------
45+
46+
If you want to use clusters of many machines, you don't need to do any additional steps.
47+
Just initialize a Dask Client on your cluster and use Modin as you would on a single node.
48+
As long as Dask Client is initialized before any dataframes are created, Modin
49+
will be able to connect to and use the Dask Cluster.
50+
51+
.. code-block:: python
52+
53+
from distributed import Client
54+
import modin.pandas as pd
55+
import modin.config as modin_cfg
56+
57+
# Please define your cluster here
58+
cluster = ...
59+
client = Client(cluster)
60+
61+
modin_cfg.Engine.put("dask")
62+
df = pd.read_parquet("s3://my-bucket/big.parquet")
63+
64+
To get more ways to deploy and run Dask clusters, visit the `Deploying Dask Clusters page`_.
65+
66+
How Modin uses Dask
67+
-------------------
68+
69+
Modin has a layered architecture, and the core abstraction for data manipulation
70+
is the Modin Dataframe, which implements a novel algebra that enables Modin to
71+
handle all of pandas (see Modin's documentation_ for more on the architecture).
72+
Modin's internal dataframe object has a scheduling layer that is able to partition
73+
and operate on data with Dask.
74+
75+
Conversion to and from Modin from Dask Dataframe
76+
------------------------------------------------
77+
78+
Modin DataFrame can be converted to/from Dask Dataframe with no-copy partition conversion.
79+
This allows you to take advantage of both Dask and Modin libraries for maximum performance.
80+
81+
.. code-block:: python
82+
83+
import modin.pandas as pd
84+
import modin.config as modin_cfg
85+
from modin.pandas.io import to_dask, from_dask
86+
87+
modin_cfg.Engine.put("dask")
88+
df = pd.read_parquet("s3://my-bucket/big.parquet")
89+
90+
# Convert Modin to Dask Dataframe
91+
dask_df = to_dask(df)
92+
93+
# Convert Dask to Modin Dataframe
94+
modin_df = from_dask(dask_df)
95+
96+
.. _Deploying Dask Clusters page: https://docs.dask.org/en/stable/deploying.html
97+
.. _documentation: https://modin.readthedocs.io/en/latest/development/architecture.html

0 commit comments

Comments
 (0)