Skip to content

Commit 97d88b2

Browse files
authored
FEAT-modin-project#6735: Make Modin on MPI through unidist component more obvious (modin-project#6736)
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
1 parent bee2c28 commit 97d88b2

File tree

21 files changed

+138
-56
lines changed

21 files changed

+138
-56
lines changed

.github/workflows/ci-notebooks.yml

+15-9
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ on:
88
- setup.cfg
99
- setup.py
1010
- requirements/env_hdk.yml
11+
- requirements/env_unidist_linux.yml
1112
concurrency:
1213
# Cancel other jobs in the same branch. We don't care whether CI passes
1314
# on old commits.
@@ -28,12 +29,17 @@ jobs:
2829
steps:
2930
- uses: actions/checkout@v3
3031
- uses: ./.github/actions/python-only
31-
if: matrix.execution != 'hdk_on_native'
32+
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
3233
- uses: ./.github/actions/mamba-env
3334
with:
3435
environment-file: requirements/env_hdk.yml
3536
activate-environment: modin_on_hdk
3637
if: matrix.execution == 'hdk_on_native'
38+
- uses: ./.github/actions/mamba-env
39+
with:
40+
environment-file: requirements/env_unidist_linux.yml
41+
activate-environment: modin_on_unidist
42+
if: matrix.execution == 'pandas_on_unidist'
3743
- name: Cache datasets
3844
uses: actions/cache@v2
3945
with:
@@ -43,29 +49,29 @@ jobs:
4349
# replace modin with . in the tutorial requirements file for `pandas_on_ray` and
4450
# `pandas_on_dask` since we need Modin built from sources
4551
- run: sed -i 's/modin/./g' examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
46-
if: matrix.execution != 'hdk_on_native'
52+
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
4753
# install dependencies required for notebooks execution for `pandas_on_ray` and `pandas_on_dask`
4854
# Override modin-spreadsheet install for now
4955
- run: |
5056
pip install -r examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
5157
pip install git+https://github.com/modin-project/modin-spreadsheet.git@49ffd89f683f54c311867d602c55443fb11bf2a5
52-
if: matrix.execution != 'hdk_on_native'
53-
# Build Modin from sources for `hdk_on_native`
58+
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
59+
# Build Modin from sources for `hdk_on_native` and `pandas_on_unidist`
5460
- run: pip install -e .
55-
if: matrix.execution == 'hdk_on_native'
61+
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
5662
# install test dependencies
5763
# NOTE: If you are changing the set of packages installed here, make sure that
5864
# the dev requirements match them.
5965
- run: pip install pytest pytest-cov black flake8 flake8-print flake8-no-implicit-concat
60-
if: matrix.execution != 'hdk_on_native'
66+
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
6167
- run: pip install flake8-print jupyter nbformat nbconvert
62-
if: matrix.execution == 'hdk_on_native'
68+
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
6369
- run: pip list
64-
if: matrix.execution != 'hdk_on_native'
70+
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
6571
- run: |
6672
conda info
6773
conda list
68-
if: matrix.execution == 'hdk_on_native'
74+
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
6975
# setup kernel configuration for `pandas_on_unidist` execution with mpi backend
7076
- run: python examples/tutorial/jupyter/execution/${{ matrix.execution }}/setup_kernel.py
7177
if: matrix.execution == 'pandas_on_unidist'

.github/workflows/ci.yml

+7-1
Original file line numberDiff line numberDiff line change
@@ -93,11 +93,17 @@ jobs:
9393
- uses: actions/checkout@v3
9494
- uses: ./.github/actions/python-only
9595
- run: python -m pip install -e ".[all]"
96-
- name: Ensure all engines start up
96+
- name: Ensure Ray and Dask engines start up
9797
run: |
9898
MODIN_ENGINE=dask python -c "import modin.pandas as pd; print(pd.DataFrame([1,2,3]))"
9999
MODIN_ENGINE=ray python -c "import modin.pandas as pd; print(pd.DataFrame([1,2,3]))"
100+
- name: Ensure MPI engine start up
101+
# Install a working MPI implementation beforehand so mpi4py can link to it
102+
run: |
103+
sudo apt install libmpich-dev
104+
python -m pip install -e ".[mpi]"
100105
MODIN_ENGINE=unidist UNIDIST_BACKEND=mpi mpiexec -n 1 python -c "import modin.pandas as pd; print(pd.DataFrame([1,2,3]))"
106+
if: matrix.os == 'ubuntu'
101107

102108
test-internals:
103109
needs: [lint-flake8, lint-black-isort]

README.md

+16-6
Original file line numberDiff line numberDiff line change
@@ -57,24 +57,30 @@ The charts below show the speedup you get by replacing pandas with Modin based o
5757
Modin can be installed with `pip` on Linux, Windows and MacOS:
5858

5959
```bash
60-
pip install "modin[all]" # (Recommended) Install Modin with all of Modin's currently supported engines.
60+
pip install "modin[all]" # (Recommended) Install Modin with Ray and Dask engines.
6161
```
6262

6363
If you want to install Modin with a specific engine, we recommend:
6464

6565
```bash
6666
pip install "modin[ray]" # Install Modin dependencies and Ray.
6767
pip install "modin[dask]" # Install Modin dependencies and Dask.
68-
pip install "modin[unidist]" # Install Modin dependencies and Unidist.
68+
pip install "modin[mpi]" # Install Modin dependencies and MPI through unidist.
6969
```
7070

71+
To get Modin on MPI through unidist (as of unidist 0.5.0) fully working
72+
it is required to have a working MPI implementation installed beforehand.
73+
Otherwise, installation of `modin[mpi]` may fail. Refer to
74+
[Installing with pip](https://unidist.readthedocs.io/en/latest/installation.html#installing-with-pip)
75+
section of the unidist documentation for more details about installation.
76+
7177
Modin automatically detects which engine(s) you have installed and uses that for scheduling computation.
7278

7379
#### From conda-forge
7480

7581
Installing from [conda forge](https://github.com/conda-forge/modin-feedstock) using `modin-all`
7682
will install Modin and four engines: [Ray](https://github.com/ray-project/ray), [Dask](https://github.com/dask/dask),
77-
[Unidist](https://github.com/modin-project/unidist) and [HDK](https://github.com/intel-ai/hdk).
83+
[MPI through unidist](https://github.com/modin-project/unidist) and [HDK](https://github.com/intel-ai/hdk).
7884

7985
```bash
8086
conda install -c conda-forge modin-all
@@ -85,10 +91,14 @@ Each engine can also be installed individually (and also as a combination of sev
8591
```bash
8692
conda install -c conda-forge modin-ray # Install Modin dependencies and Ray.
8793
conda install -c conda-forge modin-dask # Install Modin dependencies and Dask.
88-
conda install -c conda-forge modin-unidist # Install Modin dependencies and Unidist.
94+
conda install -c conda-forge modin-mpi # Install Modin dependencies and MPI through unidist.
8995
conda install -c conda-forge modin-hdk # Install Modin dependencies and HDK.
9096
```
9197

98+
Refer to
99+
[Installing with conda](https://unidist.readthedocs.io/en/latest/installation.html#installing-with-conda)
100+
section of the unidist documentation for more details on how to install a specific MPI implementation to run on.
101+
92102
To speed up conda installation we recommend using libmamba solver. To do this install it in a base environment:
93103

94104
```bash
@@ -119,7 +129,7 @@ export MODIN_ENGINE=unidist # Modin will use Unidist
119129
```
120130

121131
If you want to choose the Unidist engine, you should set the additional environment
122-
variable ``UNIDIST_BACKEND``, because currently Modin only supports Unidist on MPI:
132+
variable ``UNIDIST_BACKEND``. Currently, Modin only supports MPI through unidist:
123133

124134
```bash
125135
export UNIDIST_BACKEND=mpi # Unidist will use MPI backend
@@ -144,7 +154,7 @@ _Note: You should not change the engine after your first operation with Modin as
144154

145155
#### Which engine should I use?
146156

147-
On Linux, MacOS, and Windows you can install and use either Ray, Dask or Unidist. There is no knowledge required
157+
On Linux, MacOS, and Windows you can install and use either Ray, Dask or MPI through unidist. There is no knowledge required
148158
to use either of these engines as Modin abstracts away all of the complexity, so feel
149159
free to pick either!
150160

docs/conf.py

+20-1
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,16 @@ def noop_decorator(*args, **kwargs):
2626
ray.remote = noop_decorator
2727

2828
# fake modules if they're missing
29-
for mod_name in ("cudf", "cupy", "pyarrow.gandiva", "pyhdk", "pyhdk.hdk", "xgboost"):
29+
for mod_name in (
30+
"cudf",
31+
"cupy",
32+
"pyarrow.gandiva",
33+
"pyhdk",
34+
"pyhdk.hdk",
35+
"xgboost",
36+
"unidist",
37+
"unidist.config",
38+
):
3039
try:
3140
__import__(mod_name)
3241
except ImportError:
@@ -52,6 +61,16 @@ def noop_decorator(*args, **kwargs):
5261
sys.modules["pyhdk"].__version__ = "999"
5362
if not hasattr(sys.modules["xgboost"], "Booster"):
5463
sys.modules["xgboost"].Booster = type("Booster", (object,), {})
64+
if not hasattr(sys.modules["unidist"], "remote"):
65+
sys.modules["unidist"].remote = noop_decorator
66+
if not hasattr(sys.modules["unidist"], "core"):
67+
sys.modules["unidist"].core = type("core", (object,), {})
68+
if not hasattr(sys.modules["unidist"].core, "base"):
69+
sys.modules["unidist"].core.base = type("base", (object,), {})
70+
if not hasattr(sys.modules["unidist"].core.base, "object_ref"):
71+
sys.modules["unidist"].core.base.object_ref = type("object_ref", (object,), {})
72+
if not hasattr(sys.modules["unidist"].core.base.object_ref, "ObjectRef"):
73+
sys.modules["unidist"].core.base.object_ref.ObjectRef = type("ObjectRef", (object,), {})
5574

5675
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
5776
import modin

docs/development/architecture.rst

+5-4
Original file line numberDiff line numberDiff line change
@@ -216,8 +216,8 @@ documentation page on :doc:`contributing </development/contributing>`.
216216
- Uses the `Dask Futures`_ execution framework.
217217
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
218218
- For more information on the execution path, see the :doc:`pandas on Dask </flow/modin/core/execution/dask/implementations/pandas_on_dask/index>` page.
219-
- :doc:`pandas on Unidist </development/using_pandas_on_unidist>`
220-
- Uses the Unidist_ execution framework.
219+
- :doc:`pandas on MPI </development/using_pandas_on_mpi>`
220+
- Uses MPI_ through the Unidist_ execution framework.
221221
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
222222
- For more information on the execution path, see the :doc:`pandas on Unidist </flow/modin/core/execution/unidist/implementations/pandas_on_unidist/index>` page.
223223
- :doc:`pandas on Python </development/using_pandas_on_python>`
@@ -228,8 +228,8 @@ documentation page on :doc:`contributing </development/contributing>`.
228228
- Uses the Ray_ execution framework.
229229
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
230230
- For more information on the execution path, see the :doc:`experimental pandas on Ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index>` page.
231-
- pandas on Unidist (experimental)
232-
- Uses the Unidist_ execution framework.
231+
- pandas on MPI (experimental)
232+
- Uses MPI_ through the Unidist_ execution framework.
233233
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
234234
- For more information on the execution path, see the :doc:`experimental pandas on Unidist </flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index>` page.
235235
- pandas on Dask (experimental)
@@ -375,6 +375,7 @@ details. The documentation covers most modules, with more docs being added every
375375
.. _Arrow tables: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html
376376
.. _Ray: https://github.com/ray-project/ray
377377
.. _Unidist: https://github.com/modin-project/unidist
378+
.. _MPI: https://www.mpi-forum.org/
378379
.. _code: https://github.com/modin-project/modin/blob/master/modin/core/dataframe
379380
.. _Dask: https://github.com/dask/dask
380381
.. _Dask Futures: https://docs.dask.org/en/latest/futures.html

docs/development/index.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ Development
1010
using_pandas_on_ray
1111
using_pandas_on_dask
1212
using_pandas_on_python
13-
using_pandas_on_unidist
13+
using_pandas_on_mpi
1414
using_hdk
1515
using_pyarrow_on_ray
1616

docs/development/partition_api.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -36,7 +36,7 @@ in the worker process that processes a function (please, refer to `Dask document
3636

3737
Unidist engine
3838
--------------
39-
Currently, Modin only supports unidist on MPI backend. There is no mentioned above issue for
39+
Currently, Modin only supports MPI through unidist. There is no mentioned above issue for
4040
Modin on ``Unidist`` engine using ``MPI`` backend with ``pandas`` in-memory format
4141
because ``Unidist`` saves any objects in the MPI worker process that processes a function
4242
(please, refer to `Unidist documentation`_ for more information).

docs/development/using_pandas_on_unidist.rst docs/development/using_pandas_on_mpi.rst

+22-5
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,14 @@
1-
pandas on Unidist
2-
=================
1+
pandas on MPI through unidist
2+
=============================
33

4-
This section describes usage related documents for the pandas on Unidist component of Modin.
4+
This section describes usage related documents for the pandas on MPI through unidist component of Modin.
55

66
Modin uses pandas as a primary memory format of the underlying partitions and optimizes queries
77
ingested from the API layer in a specific way to this format. Thus, there is no need to care of choosing it
88
but you can explicitly specify it anyway as shown below.
99

10-
One of the execution engines that Modin uses is Unidist. Currently, Modin only supports Unidist on MPI backend.
11-
To enable the pandas on Unidist execution using MPI backend you should set the following environment variables:
10+
One of the execution engines that Modin uses is MPI through unidist.
11+
To enable the pandas on MPI through unidist execution you should set the following environment variables:
1212

1313
.. code-block:: bash
1414
@@ -36,4 +36,21 @@ To run a python application you should use ``mpiexec -n 1 python <script.py>`` c
3636
For more information on how to run a python application with unidist on MPI backend
3737
please refer to `Unidist on MPI`_ section of the unidist documentation.
3838

39+
As of unidist 0.5.0 there is support for a shared object store for MPI backend.
40+
The feature allows to improve performance in the workloads,
41+
where workers use same data multiple times by reducing data copies.
42+
You can enable the feature by setting the following environment variable:
43+
44+
.. code-block:: bash
45+
46+
export UNIDIST_MPI_SHARED_OBJECT_STORE=True
47+
48+
or turn it on in source code:
49+
50+
.. code-block:: python
51+
52+
import unidist.config as unidist_cfg
53+
54+
unidist_cfg.MpiSharedObjectStore.put(True)
55+
3956
.. _`Unidist on MPI`: https://unidist.readthedocs.io/en/latest/using_unidist/unidist_on_mpi.html

docs/flow/modin/core/execution/unidist/implementations/pandas_on_unidist/index.rst

+2-1
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,8 @@ PandasOnUnidist Execution
66
Queries that perform data transformation, data ingress or data egress using the `pandas on Unidist` execution
77
pass through the Modin components detailed below.
88

9-
To enable `pandas on Unidist` execution, please refer to the usage section in :doc:`pandas on Unidist </development/using_pandas_on_unidist>`.
9+
To enable `pandas on MPI through unidist` execution,
10+
please refer to the usage section in :doc:`pandas on MPI through unidist </development/using_pandas_on_mpi>`.
1011

1112
Data Transformation
1213
'''''''''''''''''''

docs/getting_started/faq.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -119,7 +119,7 @@ and Modin will do computation with that engine:
119119
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
120120
export MODIN_ENGINE=dask # Modin will use Dask
121121
122-
pip install "modin[unidist]" # Install Modin dependencies and Unidist to run on Unidist.
122+
pip install "modin[mpi]" # Install Modin dependencies and MPI to run on MPI through unidist.
123123
export MODIN_ENGINE=unidist # Modin will use Unidist
124124
export UNIDIST_BACKEND=mpi # Unidist will use MPI backend.
125125

0 commit comments

Comments
 (0)