|
1 | 1 | Using Modin in a Cluster
|
2 | 2 | ========================
|
3 | 3 |
|
4 |
| -In this section, we show how Modin can be used to accelerate your pandas workflows in a cluster. |
5 |
| -Each Modin distributed engine has its own specifics regarding running and using a cluster so |
6 |
| -you can choose one of the following instructions to suit the engine you are using. |
7 |
| - |
8 |
| -.. toctree:: |
9 |
| - :maxdepth: 4 |
10 |
| - |
11 |
| - using_modin_cluster/using_modin_ray_cluster |
12 |
| - |
| 4 | +.. note:: |
| 5 | + | *Estimated Reading Time: 15 minutes* |
| 6 | +
|
| 7 | +Often in practice we have a need to exceed the capabilities of a single machine. |
| 8 | +Modin works and performs well in both local mode and in a cluster environment. |
| 9 | +The key advantage of Modin is that your python code does not change between |
| 10 | +local development and cluster execution. Users are not required to think about |
| 11 | +how many workers exist or how to distribute and partition their data; |
| 12 | +Modin handles all of this seamlessly and transparently. |
| 13 | + |
| 14 | +.. note:: |
| 15 | + It is possible to use a Jupyter notebook, but you will have to deploy a Jupyter server |
| 16 | + on the remote cluster head node and connect to it. |
| 17 | + |
| 18 | +.. image:: ../../../img/modin_cluster.png |
| 19 | + :alt: Modin cluster |
| 20 | + :align: center |
| 21 | + |
| 22 | +Extra requirements for AWS authentication |
| 23 | +----------------------------------------- |
| 24 | + |
| 25 | +First of all, install the necessary dependencies in your environment: |
| 26 | + |
| 27 | +.. code-block:: bash |
| 28 | +
|
| 29 | + pip install boto3 |
| 30 | +
|
| 31 | +The next step is to setup your AWS credentials. One can set ``AWS_ACCESS_KEY_ID``, |
| 32 | +``AWS_SECRET_ACCESS_KEY`` and ``AWS_SESSION_TOKEN``(Optional) `AWS CLI environment variables`_ or |
| 33 | +just run the following command: |
| 34 | +
|
| 35 | +.. code-block:: bash |
| 36 | +
|
| 37 | + aws configure |
| 38 | +
|
| 39 | +Starting and connecting to the cluster |
| 40 | +-------------------------------------- |
| 41 | +
|
| 42 | +This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge), 576 total CPUs. |
| 43 | +You can check the `Amazon EC2 pricing`_ page. |
| 44 | +
|
| 45 | +It is possble to manually create AWS EC2 instances and configure them or just use the `Ray CLI`_ to |
| 46 | +create and initialize a Ray cluster on AWS using `Modin's Ray cluster setup config`_, |
| 47 | +which we are going to utilize in this example. |
| 48 | +Refer to `Ray's autoscaler options`_ page on how to modify the file. |
| 49 | +
|
| 50 | +More details on how to launch a Ray cluster can be found on `Ray's cluster docs`_. |
| 51 | +
|
| 52 | +To start up the Ray cluster, run the following command in your terminal: |
| 53 | +
|
| 54 | +.. code-block:: bash |
| 55 | +
|
| 56 | + ray up modin-cluster.yaml |
| 57 | +
|
| 58 | +Once the head node has completed initialization, you can optionally connect to it by running the following command. |
| 59 | +
|
| 60 | +.. code-block:: bash |
| 61 | +
|
| 62 | + ray attach modin-cluster.yaml |
| 63 | +
|
| 64 | +To exit the ssh session and return back into your local shell session, type: |
| 65 | +
|
| 66 | +.. code-block:: bash |
| 67 | +
|
| 68 | + exit |
| 69 | +
|
| 70 | +Executing in a cluster environment |
| 71 | +---------------------------------- |
| 72 | +
|
| 73 | +.. note:: |
| 74 | + Be careful when using the `Ray client`_ to connect to a remote cluster. |
| 75 | + We don't recommend this connection mode, beacuse it may not work. Known bugs: |
| 76 | + - https://github.com/ray-project/ray/issues/38713, |
| 77 | + - https://github.com/modin-project/modin/issues/6641. |
| 78 | +
|
| 79 | +Modin lets you instantly speed up your workflows with a large data by scaling pandas |
| 80 | +on a cluster. In this tutorial, we will use a 12.5 GB `big_yellow.csv` file that was |
| 81 | +created by concatenating a 200MB `NYC Taxi dataset`_ file 64 times. Preparing this |
| 82 | +file was provided as part of our `Modin's Ray cluster setup config`_. |
| 83 | +
|
| 84 | +If you want to use the other dataset, you should provide it to each of |
| 85 | +the cluster nodes with the same path. We recomnend doing this by customizing the |
| 86 | +``setup_commands`` section of the `Modin's Ray cluster setup config`_. |
| 87 | + |
| 88 | +To run any script in a remote cluster, you need to submit it to the Ray. In this way, |
| 89 | +the script file is sent to the the remote cluster head node and executed there. |
| 90 | + |
| 91 | +In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the |
| 92 | +CSV file and executes such pandas operations as count, groupby and applymap. |
| 93 | +As a result of the script, you will see the size of the file being read and the execution |
| 94 | +time of each function. |
| 95 | + |
| 96 | +.. note:: |
| 97 | + Some Dataframe functions are executed asynchronously, so to correctly measure execution time |
| 98 | + of each function we need to wait for the execution result. We use the special ``execute`` function for this, |
| 99 | + but you shouldn't use it in a real case scenario. |
| 100 | + |
| 101 | +You can submit this script to the existing remote cluster by running the following command. |
| 102 | + |
| 103 | +.. code-block:: bash |
| 104 | +
|
| 105 | + ray submit modin-cluster.yaml exercise_5.py |
| 106 | +
|
| 107 | +To download or upload files to the cluster head node, use `ray rsync_down` or `ray rsync_up`. |
| 108 | +It may help you if you want to use some other Python modules that should be available to |
| 109 | +execute your own script or download a result file after executing the script. |
| 110 | + |
| 111 | +.. code-block:: bash |
| 112 | +
|
| 113 | + # download a file from the cluster to the local machine: |
| 114 | + ray rsync_down modin-cluster.yaml '/path/on/cluster' '/local/path' |
| 115 | + # upload a file from the local machine to the cluster: |
| 116 | + ray rsync_up modin-cluster.yaml '/local/path' '/path/on/cluster' |
| 117 | +
|
| 118 | +Modin performance scales as the number of nodes and cores increases. The following |
| 119 | +chart shows the performance of the ``read_csv`` operation with different number of nodes, |
| 120 | +with improvements in performance as we increase the number of resources Modin can use. |
| 121 | + |
| 122 | +.. image:: ../../../../examples/tutorial/jupyter/img/modin_cluster_perf.png |
| 123 | + :alt: Cluster Performance |
| 124 | + :align: center |
| 125 | + |
| 126 | +.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config |
| 127 | +.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html |
| 128 | +.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv |
| 129 | +.. _`Modin's Ray cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml |
| 130 | +.. _`Amazon EC2 pricing`: https://aws.amazon.com/ec2/pricing/on-demand/ |
| 131 | +.. _`exercise_5.py`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py |
| 132 | +.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html |
| 133 | +.. _`Ray CLI`: https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster |
| 134 | +.. _`AWS CLI environment variables`: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html |
0 commit comments