Skip to content

Commit ebe94b0

Browse files
committed
Remove dublication
1 parent f2e0952 commit ebe94b0

File tree

4 files changed

+170
-274
lines changed

4 files changed

+170
-274
lines changed
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,134 @@
11
Using Modin in a Cluster
22
========================
33

4-
In this section, we show how Modin can be used to accelerate your pandas workflows in a cluster.
5-
Each Modin distributed engine has its own specifics regarding running and using a cluster so
6-
you can choose one of the following instructions to suit the engine you are using.
7-
8-
.. toctree::
9-
:maxdepth: 4
10-
11-
using_modin_cluster/using_modin_ray_cluster
12-
4+
.. note::
5+
| *Estimated Reading Time: 15 minutes*
6+
7+
Often in practice we have a need to exceed the capabilities of a single machine.
8+
Modin works and performs well in both local mode and in a cluster environment.
9+
The key advantage of Modin is that your python code does not change between
10+
local development and cluster execution. Users are not required to think about
11+
how many workers exist or how to distribute and partition their data;
12+
Modin handles all of this seamlessly and transparently.
13+
14+
.. note::
15+
It is possible to use a Jupyter notebook, but you will have to deploy a Jupyter server
16+
on the remote cluster head node and connect to it.
17+
18+
.. image:: ../../../img/modin_cluster.png
19+
:alt: Modin cluster
20+
:align: center
21+
22+
Extra requirements for AWS authentication
23+
-----------------------------------------
24+
25+
First of all, install the necessary dependencies in your environment:
26+
27+
.. code-block:: bash
28+
29+
pip install boto3
30+
31+
The next step is to setup your AWS credentials. One can set ``AWS_ACCESS_KEY_ID``,
32+
``AWS_SECRET_ACCESS_KEY`` and ``AWS_SESSION_TOKEN``(Optional) `AWS CLI environment variables`_ or
33+
just run the following command:
34+
35+
.. code-block:: bash
36+
37+
aws configure
38+
39+
Starting and connecting to the cluster
40+
--------------------------------------
41+
42+
This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge), 576 total CPUs.
43+
You can check the `Amazon EC2 pricing`_ page.
44+
45+
It is possble to manually create AWS EC2 instances and configure them or just use the `Ray CLI`_ to
46+
create and initialize a Ray cluster on AWS using `Modin's Ray cluster setup config`_,
47+
which we are going to utilize in this example.
48+
Refer to `Ray's autoscaler options`_ page on how to modify the file.
49+
50+
More details on how to launch a Ray cluster can be found on `Ray's cluster docs`_.
51+
52+
To start up the Ray cluster, run the following command in your terminal:
53+
54+
.. code-block:: bash
55+
56+
ray up modin-cluster.yaml
57+
58+
Once the head node has completed initialization, you can optionally connect to it by running the following command.
59+
60+
.. code-block:: bash
61+
62+
ray attach modin-cluster.yaml
63+
64+
To exit the ssh session and return back into your local shell session, type:
65+
66+
.. code-block:: bash
67+
68+
exit
69+
70+
Executing in a cluster environment
71+
----------------------------------
72+
73+
.. note::
74+
Be careful when using the `Ray client`_ to connect to a remote cluster.
75+
We don't recommend this connection mode, beacuse it may not work. Known bugs:
76+
- https://github.com/ray-project/ray/issues/38713,
77+
- https://github.com/modin-project/modin/issues/6641.
78+
79+
Modin lets you instantly speed up your workflows with a large data by scaling pandas
80+
on a cluster. In this tutorial, we will use a 12.5 GB `big_yellow.csv` file that was
81+
created by concatenating a 200MB `NYC Taxi dataset`_ file 64 times. Preparing this
82+
file was provided as part of our `Modin's Ray cluster setup config`_.
83+
84+
If you want to use the other dataset, you should provide it to each of
85+
the cluster nodes with the same path. We recomnend doing this by customizing the
86+
``setup_commands`` section of the `Modin's Ray cluster setup config`_.
87+
88+
To run any script in a remote cluster, you need to submit it to the Ray. In this way,
89+
the script file is sent to the the remote cluster head node and executed there.
90+
91+
In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the
92+
CSV file and executes such pandas operations as count, groupby and applymap.
93+
As a result of the script, you will see the size of the file being read and the execution
94+
time of each function.
95+
96+
.. note::
97+
Some Dataframe functions are executed asynchronously, so to correctly measure execution time
98+
of each function we need to wait for the execution result. We use the special ``execute`` function for this,
99+
but you shouldn't use it in a real case scenario.
100+
101+
You can submit this script to the existing remote cluster by running the following command.
102+
103+
.. code-block:: bash
104+
105+
ray submit modin-cluster.yaml exercise_5.py
106+
107+
To download or upload files to the cluster head node, use `ray rsync_down` or `ray rsync_up`.
108+
It may help you if you want to use some other Python modules that should be available to
109+
execute your own script or download a result file after executing the script.
110+
111+
.. code-block:: bash
112+
113+
# download a file from the cluster to the local machine:
114+
ray rsync_down modin-cluster.yaml '/path/on/cluster' '/local/path'
115+
# upload a file from the local machine to the cluster:
116+
ray rsync_up modin-cluster.yaml '/local/path' '/path/on/cluster'
117+
118+
Modin performance scales as the number of nodes and cores increases. The following
119+
chart shows the performance of the ``read_csv`` operation with different number of nodes,
120+
with improvements in performance as we increase the number of resources Modin can use.
121+
122+
.. image:: ../../../../examples/tutorial/jupyter/img/modin_cluster_perf.png
123+
:alt: Cluster Performance
124+
:align: center
125+
126+
.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
127+
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
128+
.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
129+
.. _`Modin's Ray cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
130+
.. _`Amazon EC2 pricing`: https://aws.amazon.com/ec2/pricing/on-demand/
131+
.. _`exercise_5.py`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py
132+
.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
133+
.. _`Ray CLI`: https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster
134+
.. _`AWS CLI environment variables`: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html

docs/getting_started/using_modin/using_modin_cluster/using_modin_ray_cluster.rst

-133
This file was deleted.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
![LOGO](../../../img/MODIN_ver2_hrz.png)
2+
3+
<center>
4+
<h1>Scale your pandas workflows on a Ray cluster</h2>
5+
</center>
6+
7+
**NOTE**: Before completing the exercise, please read the full instructions in the
8+
[Modin documenation](https://modin--6872.org.readthedocs.build/en/6872/getting_started/using_modin/using_modin_cluster.html).
9+
10+
The basic steps to run the script on a remote Ray cluster are:
11+
12+
Step 1. Install the necessary dependencies
13+
14+
```bash
15+
pip install boto3
16+
```
17+
18+
Step 2. Setup your AWS credentials.
19+
20+
```bash
21+
aws configure
22+
```
23+
24+
Step 3. Modify configuration file and start up the Ray cluster.
25+
26+
```bash
27+
ray up modin-cluster.yaml
28+
```
29+
30+
Step 4. Submit your script to the remote cluster.
31+
32+
```bash
33+
ray submit modin-cluster.yaml exercise_5.py
34+
```
35+
36+
Step 5. Shut down the Ray remote cluster.
37+
38+
```bash
39+
ray down

0 commit comments

Comments
 (0)