Commit 31f8bd0

REFACTOR-modin-project#6812: Remove 'PyarrowOnRay' execution in favour of pyarrow-backed pandas dataframes (modin-project#6848)
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
1 parent 5c15c48 commit 31f8bd0

30 files changed: +8 -966 lines changed

.github/workflows/ci-required.yml

-2
@@ -66,7 +66,6 @@ jobs:
             asv_bench/benchmarks/__init__.py asv_bench/benchmarks/io/__init__.py \
             asv_bench/benchmarks/scalability/__init__.py \
             modin/core/io \
-            modin/experimental/core/execution/ray/implementations/pyarrow_on_ray \
             modin/pandas/series.py \
             modin/core/execution/python \
             modin/pandas/dataframe.py \
@@ -90,7 +89,6 @@ jobs:
           python scripts/doc_checker.py modin/experimental/pandas/io.py \
             modin/experimental/pandas/__init__.py
       - run: python scripts/doc_checker.py modin/core/storage_formats/base
-      - run: python scripts/doc_checker.py modin/experimental/core/storage_formats/pyarrow
       - run: python scripts/doc_checker.py modin/core/storage_formats/pandas
       - run: |
           python scripts/doc_checker.py \

.github/workflows/ci.yml

-30
@@ -683,36 +683,6 @@ jobs:
       - run: python -m pytest modin/pandas/test/test_io.py --verbose
       - uses: ./.github/actions/upload-coverage

-  test-pyarrow:
-    needs: [lint-flake8, lint-black-isort]
-    runs-on: ubuntu-latest
-    defaults:
-      run:
-        shell: bash -l {0}
-    strategy:
-      matrix:
-        python-version: ["3.9"]
-    env:
-      MODIN_STORAGE_FORMAT: pyarrow
-      MODIN_EXPERIMENTAL: "True"
-    name: test (pyarrow, python ${{matrix.python-version}})
-    services:
-      moto:
-        image: motoserver/moto
-        ports:
-          - 5000:5000
-        env:
-          AWS_ACCESS_KEY_ID: foobar_key
-          AWS_SECRET_ACCESS_KEY: foobar_secret
-    steps:
-      - uses: actions/checkout@v3
-      - uses: ./.github/actions/mamba-env
-        with:
-          environment-file: environment-dev.yml
-          python-version: ${{matrix.python-version}}
-      - run: sudo apt update && sudo apt install -y libhdf5-dev
-      - run: python -m pytest modin/pandas/test/test_io.py::TestCsv --verbose
-
   test-spreadsheet:
     needs: [lint-flake8, lint-black-isort]
     runs-on: ubuntu-latest

asv_bench/benchmarks/utils/compatibility.py

+1 -1
@@ -47,4 +47,4 @@
 assert ASV_USE_IMPL in ("modin", "pandas")
 assert ASV_DATASET_SIZE in ("big", "small")
 assert ASV_USE_ENGINE in ("ray", "dask", "python", "native", "unidist")
-assert ASV_USE_STORAGE_FORMAT in ("pandas", "hdk", "pyarrow")
+assert ASV_USE_STORAGE_FORMAT in ("pandas", "hdk")

docs/conf.py

-1
@@ -29,7 +29,6 @@ def noop_decorator(*args, **kwargs):
 for mod_name in (
     "cudf",
     "cupy",
-    "pyarrow.gandiva",
     "pyhdk",
     "pyhdk.hdk",
     "xgboost",

docs/development/architecture.rst

+3 -9
@@ -56,7 +56,7 @@ For the simplicity the other execution systems - Dask and MPI are omitted and on
   on a selected storage format and mapping or compiling the Dataframe Algebra DAG to and actual
   execution sequence.
 * Storage formats module is responsible for mapping the abstract operation to an actual executor call, e.g. pandas,
-  PyArrow, custom format.
+  HDK, custom format.
 * Orchestration subsystem is responsible for spawning and controlling the actual execution environment for the
   selected execution. It spawns the actual nodes, fires up the execution environment, e.g. Ray, monitors the state
   of executors and provides telemetry
@@ -228,10 +228,6 @@ documentation page on :doc:`contributing </development/contributing>`.
     - Uses HDK as an engine.
     - The storage format is `hdk` and the in-memory partition type is a pyarrow Table. When defaulting to pandas, the pandas DataFrame is used.
     - For more information on the execution path, see the :doc:`HDK on Native </flow/modin/experimental/core/execution/native/implementations/hdk_on_native/index>` page.
-- :doc:`Pyarrow on Ray </development/using_pyarrow_on_ray>` (experimental)
-    - Uses the Ray_ execution framework.
-    - The storage format is `pyarrow` and the in-memory partition type is a pyarrow Table.
-    - For more information on the execution path, see the :doc:`Pyarrow on Ray </flow/modin/experimental/core/execution/ray/implementations/pyarrow_on_ray>` page.
 - cuDF on Ray (experimental)
     - Uses the Ray_ execution framework.
     - The storage format is `cudf` and the in-memory partition type is a cuDF DataFrame.
@@ -252,7 +248,7 @@ following figure illustrates this concept.
    :align: center

 Currently, the main in-memory format of each partition is a `pandas DataFrame`_ (:doc:`pandas storage format </flow/modin/core/storage_formats/pandas/index>`).
-:doc:`HDK </flow/modin/experimental/core/storage_formats/hdk/index>`, :doc:`PyArrow </flow/modin/experimental/core/storage_formats/pyarrow/index>`
+:doc:`HDK </flow/modin/experimental/core/storage_formats/hdk/index>`
 and cuDF are also supported as experimental in-memory formats in Modin.


@@ -333,8 +329,7 @@ details. The documentation covers most modules, with more docs being added every
 │ │ │ │ └───implementations
 │ │ │ │ └─── :doc:`hdk_on_native </flow/modin/experimental/core/execution/native/implementations/hdk_on_native/index>`
 │ │ │ ├─── :doc:`storage_formats </flow/modin/experimental/core/storage_formats/index>`
-| │ │ | ├─── :doc:`hdk </flow/modin/experimental/core/storage_formats/hdk/index>`
-│ │ │ | └─── :doc:`pyarrow </flow/modin/experimental/core/storage_formats/pyarrow/index>`
+| │ │ | └───:doc:`hdk </flow/modin/experimental/core/storage_formats/hdk/index>`
 | | | └─── :doc:`io </flow/modin/experimental/core/io/index>`
 │ │ ├─── :doc:`pandas </flow/modin/experimental/pandas>`
 │ │ ├─── :doc:`sklearn </flow/modin/experimental/sklearn>`
@@ -350,7 +345,6 @@ details. The documentation covers most modules, with more docs being added every
 └───stress_tests

 .. _pandas Dataframe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
-.. _Arrow tables: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html
 .. _Ray: https://github.com/ray-project/ray
 .. _Unidist: https://github.com/modin-project/unidist
 .. _MPI: https://www.mpi-forum.org/

docs/development/index.rst

-1
@@ -12,7 +12,6 @@ Development
     using_pandas_on_python
     using_pandas_on_mpi
     using_hdk
-    using_pyarrow_on_ray

 .. meta::
     :description lang=en:

docs/development/using_pyarrow_on_ray.rst

-4
This file was deleted.

docs/flow/modin/core/storage_formats/index.rst

+2 -3
@@ -8,9 +8,8 @@ of objects that are stored in the partitions of the selected Core Modin Datafram
 The base storage format in Modin is pandas. In that format, Modin Dataframe operates with
 partitions that hold ``pandas.DataFrame`` objects. Pandas is the most natural storage format
 since high-level DataFrame objects mirror its API, however, Modin's storage formats are not
-limited to the objects that conform to pandas API. There are formats that are able to store
-``pyarrow.Table`` (:doc:`pyarrow storage format </flow/modin/experimental/core/storage_formats/pyarrow/index>`) or even instances of
-SQL-like databases (:doc:`HDK storage format </flow/modin/experimental/core/storage_formats/hdk/index>`)
+limited to the objects that conform to pandas API. There is format that are able to store
+even instances of SQL-like databases (:doc:`HDK storage format </flow/modin/experimental/core/storage_formats/hdk/index>`)
 inside Modin Dataframe's partitions.

 The storage format + execution engine (Ray, Dask, etc.) form the execution backend.

docs/flow/modin/experimental/core/execution/ray/implementations/pyarrow_on_ray.rst

-27
This file was deleted.

docs/flow/modin/experimental/core/storage_formats/index.rst

-2
@@ -7,11 +7,9 @@ Experimental storage formats
 and provides a limited set of functionality:

 * :doc:`hdk <hdk/index>`
-* :doc:`pyarrow <pyarrow/index>`


 .. toctree::
     :hidden:

     hdk/index
-    pyarrow/index

docs/flow/modin/experimental/core/storage_formats/pyarrow/index.rst

-27
This file was deleted.

docs/flow/modin/experimental/core/storage_formats/pyarrow/parsers.rst

-15
This file was deleted.

docs/flow/modin/experimental/core/storage_formats/pyarrow/query_compiler.rst

-21
This file was deleted.

modin/config/envvars.py

+1 -1
@@ -266,7 +266,7 @@ class StorageFormat(EnvironmentVariable, type=str):

     varname = "MODIN_STORAGE_FORMAT"
     default = "Pandas"
-    choices = ("Pandas", "Hdk", "Pyarrow", "Cudf")
+    choices = ("Pandas", "Hdk", "Cudf")


 class IsExperimental(EnvironmentVariable, type=bool):
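
The practical effect of the `choices` change above is that `pyarrow` is no longer an accepted value for `MODIN_STORAGE_FORMAT`. The following is a rough, standalone sketch of that kind of check. It is not Modin's implementation: the helper name `read_storage_format` and the case-insensitive matching are assumptions made for illustration, and only the variable name, default, and choices come from the diff.

```python
import os

# Values accepted after this commit, taken from the diff above.
STORAGE_FORMAT_CHOICES = ("Pandas", "Hdk", "Cudf")


def read_storage_format(environ=os.environ):
    """Hypothetical helper: read MODIN_STORAGE_FORMAT and validate it."""
    raw = environ.get("MODIN_STORAGE_FORMAT", "Pandas")
    # Assumption: match case-insensitively and return the canonical spelling.
    for choice in STORAGE_FORMAT_CHOICES:
        if raw.lower() == choice.lower():
            return choice
    raise ValueError(
        f"Unsupported storage format {raw!r}; expected one of {STORAGE_FORMAT_CHOICES}"
    )
```

Under these assumptions, `read_storage_format({"MODIN_STORAGE_FORMAT": "pyarrow"})` raises `ValueError`, while an unset variable falls back to `"Pandas"`.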

modin/core/execution/dispatching/factories/factories.py

-15
@@ -570,21 +570,6 @@ def prepare(cls):
 # that have little coverage of implemented functionality or are not stable enough.


-@doc(_doc_factory_class, execution_name="experimental PyarrowOnRay")
-class ExperimentalPyarrowOnRayFactory(BaseFactory):  # pragma: no cover
-    @classmethod
-    @doc(_doc_factory_prepare_method, io_module_name="experimental ``PyarrowOnRayIO``")
-    def prepare(cls):
-        from modin.experimental.core.execution.ray.implementations.pyarrow_on_ray.io import (
-            PyarrowOnRayIO,
-        )
-
-        if not IsExperimental.get():
-            raise ValueError("'PyarrowOnRay' only works in experimental mode.")
-
-        cls.io_cls = PyarrowOnRayIO
-
-
 @doc(_doc_factory_class, execution_name="experimental HdkOnNative")
 class ExperimentalHdkOnNativeFactory(BaseFactory):
     @classmethod

modin/experimental/core/execution/ray/implementations/pyarrow_on_ray/__init__.py

-14
This file was deleted.

modin/experimental/core/execution/ray/implementations/pyarrow_on_ray/dataframe/__init__.py

-14
This file was deleted.

modin/experimental/core/execution/ray/implementations/pyarrow_on_ray/dataframe/dataframe.py

-74
This file was deleted.

modin/experimental/core/execution/ray/implementations/pyarrow_on_ray/io/__init__.py

-18
This file was deleted.
