Override parallelization mode within a scope #971

Open

manopapad opened this issue Feb 12, 2025 · 1 comment

@manopapad (Contributor)

@elliottslaughter brought this up today: he has a small array that he nonetheless wants to split among leaf tasks. He is currently working around this with LEGATE_TEST=1, but that is a global setting.

The options we'd like to expose are:

  • use the size-based heuristic (the default, as is done today)
  • parallelize across all available cores (what Elliott wanted to do)
  • (once available) task-parallel execution mode (e.g. assign different iterations of a loop to different devices)

The plan is to expose this as a scope annotation (e.g. a Python with clause) that affects every task launch within the scope. Alternatively, we could set this per array or per task launch.
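
To make the intent concrete, here is a minimal sketch of what the scope annotation could look like. All names here (Parallelization, parallelization, some_leaf_task) are placeholders invented for illustration, not an existing Legate API:

```python
# Hypothetical sketch of a scope-based override; names and mechanism are
# placeholders, not an existing Legate interface.
from contextlib import contextmanager
from enum import Enum, auto


class Parallelization(Enum):
    SIZE_HEURISTIC = auto()  # default: size-based heuristic, as today
    ALL_CORES = auto()       # always parallelize across all available cores
    TASK_PARALLEL = auto()   # (future) assign loop iterations to different devices


_current_mode = Parallelization.SIZE_HEURISTIC


def current_parallelization() -> Parallelization:
    """Mode the runtime would consult when partitioning a task launch."""
    return _current_mode


@contextmanager
def parallelization(mode: Parallelization):
    """Override the parallelization mode for every task launch in this scope."""
    global _current_mode
    previous, _current_mode = _current_mode, mode
    try:
        yield
    finally:
        _current_mode = previous


# Usage: force aggressive partitioning of a small array without LEGATE_TEST=1.
# with parallelization(Parallelization.ALL_CORES):
#     out = some_leaf_task(small_array)
```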

@elliottslaughter

Some thoughts, in no particular order:

In the cases I have right now, I always need this in the context of a Python task specifically. (There's one exception I can think of, but I'll get to it later.) It's not as if I care about the granularity of +, *, etc. operations. Your heuristics for those are probably fine; or at least, the current tuning settings probably work well enough.

The reason Python tasks break your invariants is that the amount of work can be highly disproportionate to the amount of data coming into the task. I have one particular small-ish input (on the order of thousands of elements) that drives the rest of the computation. Partitioning it directly leads to partitioning all the rest of the work. But Legate sees it as a single small input and assumes I'm doing O(N) FLOPs on it, or something like that, which throws the heuristics wildly off.

The observation is that this is a property of the task, not the call site. Any time we call this task, we should partition aggressively. If the user forgets to annotate the call site, it is highly unlikely that we actually want to skip the partitioning.

It would be nice to preserve the property that library writers can provide tasks that induce the right partitioning when called. My understanding is that this is true in "native" Legate libraries today. No one writes a call to Cholesky and expects to have to manually add with statements to get the partitioning right. But Python tasks break this abstraction. In other words, Python tasks are not actually powerful enough to match what you can do with Legate libraries. I'd prefer to think about what abstractions could allow Python tasks to fill this role, because arguably they are (or should be) the easiest way to build Legate libraries.
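
To illustrate the "property of the task" idea, here is a purely hypothetical sketch in which the partitioning preference is declared on the task definition itself, so every call site inherits it; nothing here is an existing Legate API:

```python
# Illustrative only: an imagined decorator that records a partitioning
# preference on the task, instead of requiring a scope at every call site.
import functools


def python_task(*, partitioning="size_heuristic"):
    """Imagined task decorator that attaches a partitioning preference."""

    def wrap(fn):
        @functools.wraps(fn)
        def launch(*args, **kwargs):
            # A real runtime would consult launch.partitioning when deciding
            # how to split the task's inputs; here we simply call the function.
            return fn(*args, **kwargs)

        launch.partitioning = partitioning
        return launch

    return wrap


@python_task(partitioning="all_cores")
def drive_computation(small_input):
    # Small input (thousands of elements) that fans out into most of the work,
    # so it should always be partitioned aggressively regardless of its size.
    ...
```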

The only exception I'm aware of is cases where I want to stream operations that are memory-constrained. One case that came up for us recently was essentially the equivalent of:

```python
np.min(np.abs(a[:, np.newaxis] - b[np.newaxis]), axis=0)
```

Right now I have to manually break this down in order to avoid blowing out memory. But even here, it's not that I care about parallelism per se. I mainly care about staging the computation to avoid an OOM condition.
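
For context, the manual staging looks roughly like the following (plain NumPy shown; the chunk size is an arbitrary illustration):

```python
import numpy as np


def min_abs_diff(a, b, chunk=4096):
    """Staged equivalent of np.min(np.abs(a[:, None] - b[None]), axis=0),
    processing b in chunks so the full (len(a), len(b)) intermediate is
    never materialized at once."""
    out = np.empty(b.shape[0], dtype=np.result_type(a, b))
    for start in range(0, b.shape[0], chunk):
        stop = start + chunk
        # The intermediate here is only (len(a), chunk) elements at most.
        out[start:stop] = np.min(
            np.abs(a[:, np.newaxis] - b[np.newaxis, start:stop]), axis=0
        )
    return out
```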
