Override parallelization mode within a scope #971

Open

manopapad opened this issue Feb 12, 2025 · 1 comment

@manopapad (Contributor)

@elliottslaughter brought this up today: he has a small array that he nonetheless wants to split among leaf tasks. He is currently working around this with LEGATE_TEST=1, but that is a global setting.

The options we'd like to expose are:

  • use the size-based heuristic (the default, as is done today)
  • parallelize across all available cores (what Elliott wanted to do)
  • (once available) task-parallel execution mode (e.g. assign different iterations of a loop to different devices)

The plan is to expose this as a scope annotation (e.g. a Python with clause) that affects every task launch within the scope. Alternatively, we could set this per array or per task launch.
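
To make the intent concrete, here is a minimal sketch of what the scope annotation could look like. All names here (Parallelization, parallelization, some_leaf_task) are placeholders invented for illustration, not an existing Legate API:

```python
# Hypothetical sketch of a scope-based override; names and mechanism are
# placeholders, not an existing Legate interface.
from contextlib import contextmanager
from enum import Enum, auto


class Parallelization(Enum):
    SIZE_HEURISTIC = auto()  # default: size-based heuristic, as today
    ALL_CORES = auto()       # always parallelize across all available cores
    TASK_PARALLEL = auto()   # (future) assign loop iterations to different devices


_current_mode = Parallelization.SIZE_HEURISTIC


def current_parallelization() -> Parallelization:
    """Mode the runtime would consult when partitioning a task launch."""
    return _current_mode


@contextmanager
def parallelization(mode: Parallelization):
    """Override the parallelization mode for every task launch in this scope."""
    global _current_mode
    previous, _current_mode = _current_mode, mode
    try:
        yield
    finally:
        _current_mode = previous


# Usage: force aggressive partitioning of a small array without LEGATE_TEST=1.
# with parallelization(Parallelization.ALL_CORES):
#     out = some_leaf_task(small_array)
```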

@elliottslaughter

Some thoughts, in no particular order:

In the cases I have right now, I always need this in the context of a Python task specifically. (There's one exception I can think of, but I'll get to it later.) It's not as if I care about the granularity of +, *, etc. operations. Your heuristics for those are probably fine; or at least, the current tuning settings probably work well enough.

The reason Python tasks break your invariants is that the amount of work can be highly disproportionate to the amount of data coming into the task. I have one particular small-ish input (on the order of thousands of elements) that drives the rest of the computation. Partitioning it directly leads to partitioning all the rest of the work. But Legate sees it as a single small input and assumes I'm doing O(N) FLOPs on it, or something like that, which throws the heuristics wildly off.

The observation is that this is a property of the task, not the call site. Any time we call this task, we should partition aggressively. If the user forgets to annotate the call site, it is highly unlikely that we actually want to skip the partitioning.

It would be nice to preserve the property that library writers can provide tasks that induce the right partitioning when called. My understanding is that this is true in "native" Legate libraries today. No one writes a call to Cholesky and expects to have to manually add with statements to get the partitioning right. But Python tasks break this abstraction. In other words, Python tasks are not actually powerful enough to match what you can do with Legate libraries. I'd prefer to think about what abstractions could allow Python tasks to fill this role, because arguably they are (or should be) the easiest way to build Legate libraries.
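
To illustrate the "property of the task" idea, here is a purely hypothetical sketch in which the partitioning preference is declared on the task definition itself, so every call site inherits it; nothing here is an existing Legate API:

```python
# Illustrative only: an imagined decorator that records a partitioning
# preference on the task, instead of requiring a scope at every call site.
import functools


def python_task(*, partitioning="size_heuristic"):
    """Imagined task decorator that attaches a partitioning preference."""

    def wrap(fn):
        @functools.wraps(fn)
        def launch(*args, **kwargs):
            # A real runtime would consult launch.partitioning when deciding
            # how to split the task's inputs; here we simply call the function.
            return fn(*args, **kwargs)

        launch.partitioning = partitioning
        return launch

    return wrap


@python_task(partitioning="all_cores")
def drive_computation(small_input):
    # Small input (thousands of elements) that fans out into most of the work,
    # so it should always be partitioned aggressively regardless of its size.
    ...
```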

The only exception I'm aware of is cases where I want to stream operations that are memory-constrained. One case that came up for us recently was essentially the equivalent of:

```python
np.min(np.abs(a[:, np.newaxis] - b[np.newaxis]), axis=0)
```

Right now I have to manually break this down in order to avoid blowing out memory. But even here, it's not that I care about parallelism per se. I mainly care about staging the computation to avoid an OOM condition.
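
For context, the manual staging looks roughly like the following (plain NumPy shown; the chunk size is an arbitrary illustration):

```python
import numpy as np


def min_abs_diff(a, b, chunk=4096):
    """Staged equivalent of np.min(np.abs(a[:, None] - b[None]), axis=0),
    processing b in chunks so the full (len(a), len(b)) intermediate is
    never materialized at once."""
    out = np.empty(b.shape[0], dtype=np.result_type(a, b))
    for start in range(0, b.shape[0], chunk):
        stop = start + chunk
        # The intermediate here is only (len(a), chunk) elements at most.
        out[start:stop] = np.min(
            np.abs(a[:, np.newaxis] - b[np.newaxis, start:stop]), axis=0
        )
    return out
```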
