Commit d0377ba

Fix aes after code review
Author: Ubuntu
1 parent ebe94b0 commit d0377ba

4 files changed: +21 -50 lines changed

docs/getting_started/using_modin/using_modin_cluster.rst

+14 -12
@@ -15,7 +15,7 @@ Modin handles all of this seamlessly and transparently.
 It is possible to use a Jupyter notebook, but you will have to deploy a Jupyter server
 on the remote cluster head node and connect to it.

-.. image:: ../../../img/modin_cluster.png
+.. image:: ../../img/modin_cluster.png
    :alt: Modin cluster
    :align: center

@@ -29,7 +29,8 @@ First of all, install the necessary dependencies in your environment:
    pip install boto3

 The next step is to setup your AWS credentials. One can set ``AWS_ACCESS_KEY_ID``,
-``AWS_SECRET_ACCESS_KEY`` and ``AWS_SESSION_TOKEN``(Optional) `AWS CLI environment variables`_ or
+``AWS_SECRET_ACCESS_KEY`` and ``AWS_SESSION_TOKEN`` (Optional)
+(refer to `AWS CLI environment variables`_ to get more insight on this) or
 just run the following command:

 .. code-block:: bash
@@ -77,7 +78,7 @@ Executing in a cluster environment
 - https://github.com/modin-project/modin/issues/6641.

 Modin lets you instantly speed up your workflows with a large data by scaling pandas
-on a cluster. In this tutorial, we will use a 12.5 GB `big_yellow.csv` file that was
+on a cluster. In this tutorial, we will use a 12.5 GB ``big_yellow.csv`` file that was
 created by concatenating a 200MB `NYC Taxi dataset`_ file 64 times. Preparing this
 file was provided as part of our `Modin's Ray cluster setup config`_.

@@ -89,7 +90,7 @@ To run any script in a remote cluster, you need to submit it to the Ray. In this
 the script file is sent to the the remote cluster head node and executed there.

 In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the
-CSV file and executes such pandas operations as count, groupby and applymap.
+CSV file and executes such pandas operations as count, groupby and map.
 As a result of the script, you will see the size of the file being read and the execution
 time of each function.

@@ -104,8 +105,8 @@ You can submit this script to the existing remote cluster by running the followi

    ray submit modin-cluster.yaml exercise_5.py

-To download or upload files to the cluster head node, use `ray rsync_down` or `ray rsync_up`.
-It may help you if you want to use some other Python modules that should be available to
+To download or upload files to the cluster head node, use ``ray rsync_down`` or ``ray rsync_up``.
+It may help if you want to use some other Python modules that should be available to
 execute your own script or download a result file after executing the script.

 .. code-block:: bash
@@ -115,13 +116,14 @@ execute your own script or download a result file after executing the script.
    # upload a file from the local machine to the cluster:
    ray rsync_up modin-cluster.yaml '/local/path' '/path/on/cluster'

-Modin performance scales as the number of nodes and cores increases. The following
-chart shows the performance of the ``read_csv`` operation with different number of nodes,
-with improvements in performance as we increase the number of resources Modin can use.
+Shutting down the cluster
+--------------------------

-.. image:: ../../../../examples/tutorial/jupyter/img/modin_cluster_perf.png
-   :alt: Cluster Performance
-   :align: center
+Now that we have finished the computation, we need to shut down the cluster with `ray down` command.
+
+.. code-block:: bash
+
+   ray down modin-cluster.yaml

 .. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
 .. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html

examples/tutorial/jupyter/execution/pandas_on_ray/cluster/README.md

+1 -1
@@ -4,7 +4,7 @@
 <h1>Scale your pandas workflows on a Ray cluster</h2>
 </center>

-**NOTE**: Before completing the exercise, please read the full instructions in the
+**NOTE**: Before starting the exercise, please read the full instructions in the
 [Modin documenation](https://modin--6872.org.readthedocs.build/en/6872/getting_started/using_modin/using_modin_cluster.html).

 The basic steps to run the script on a remote Ray cluster are:
@@ -1,51 +1,20 @@
-import os
 import time
 import ray
+
 import modin.pandas as pd
-from modin.utils import execute

 ray.init(address="auto")
 cpu_count = ray.cluster_resources()["CPU"]
 assert cpu_count == 576, f"Expected 576 CPUs, but found {cpu_count}"

 file_path = "big_yellow.csv"
-file_size = os.path.getsize(file_path)
-
-
-# get human readable file size
-def sizeof_fmt(num, suffix="B"):
-    for unit in ("", "K", "M", "G", "T"):
-        if abs(num) < 1024.0:
-            return f"{num:3.1f}{unit}{suffix}"
-        num /= 1024.0
-    return f"{num:.1f}P{suffix}"
-
-
-print(f"File size is {sizeof_fmt(file_size)}") # noqa: T201

 t0 = time.perf_counter()
-df = pd.read_csv(file_path, quoting=3)
-t1 = time.perf_counter()
-print(f"read_csv time is {(t1 - t0):.3f}") # noqa: T201
-
-"""
-IMPORTANT:
-Some Dataframe functions are executed asynchronously, so to correctly measure execution time
-we need to wait for the execution result. We use the special `execute` function for this,
-but you should not use this function as it will slow down your script.
-"""

-t0 = time.perf_counter()
-execute(df.count())
-t1 = time.perf_counter()
-print(f"count time is {(t1 - t0):.3f}") # noqa: T201
-
-t0 = time.perf_counter()
-execute(df.groupby("passenger_count").count())
-t1 = time.perf_counter()
-print(f"groupby time is {(t1 - t0):.3f}") # noqa: T201
+df = pd.read_csv(file_path, quoting=3)
+df_count = df.count()
+df_groupby_count = df.groupby("passenger_count").count()
+df_map = df.map(str)

-t0 = time.perf_counter()
-execute(df.applymap(str))
 t1 = time.perf_counter()
-print(f"applymap time is {(t1 - t0):.3f}") # noqa: T201
+print(f"Full script time is {(t1 - t0):.3f}") # noqa: T201
Binary file not shown.
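
For readability, the script hunk above can be read as the following post-commit script. This is a sketch reassembled from the context and added lines of the diff, not a copy of the committed file. The split into `run_workload` and `main` is a hypothetical restructuring introduced here so the timed section can be exercised without a cluster. The `ray` and `modin` calls, the `576` CPU check, and the `big_yellow.csv` path are taken from the diff as-is.

```python
import time


def run_workload(pd, file_path):
    """Run the four operations from the post-commit hunk; return elapsed seconds.

    `pd` is any pandas-compatible module (`modin.pandas` in the diff).
    The function name and signature are hypothetical, not part of the commit.
    """
    t0 = time.perf_counter()
    df = pd.read_csv(file_path, quoting=3)
    df_count = df.count()
    df_groupby_count = df.groupby("passenger_count").count()
    df_map = df.map(str)
    t1 = time.perf_counter()
    return t1 - t0


def main():
    # Imported here so that run_workload can be imported on a machine
    # that has neither Ray nor Modin installed.
    import ray
    import modin.pandas as pd

    ray.init(address="auto")
    cpu_count = ray.cluster_resources()["CPU"]
    assert cpu_count == 576, f"Expected 576 CPUs, but found {cpu_count}"

    elapsed = run_workload(pd, "big_yellow.csv")
    print(f"Full script time is {elapsed:.3f}")  # noqa: T201
```

To use this on the cluster, call `main()` at the bottom of the file before submitting it with `ray submit`. The docstring removed in this commit says that some DataFrame functions execute asynchronously, so without the removed `execute` calls the single reported time may not cover operations that are still running when `t1` is taken.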
