
doc: small improvements
nfurmento committed Sep 30, 2024
1 parent 8d512da commit 00257b6
Showing 4 changed files with 75 additions and 244 deletions.
51 changes: 17 additions & 34 deletions doc/doxygen/chapters/starpu_basics/scheduling.doxy
@@ -18,54 +18,37 @@

\section TaskSchedulingPolicy Task Scheduling Policies

The basics of the scheduling policy are as follows:

<ul>
<li>
The scheduler can schedule tasks (<c>push</c> operation) as soon as they become ready to run, i.e. when they are no longer waiting for any tags, data dependencies or task dependencies.
</li>
<li>
Workers pull tasks from the scheduler one by one (<c>pop</c> operation).
</li>
</ul>

This means that scheduling policies usually contain at least one queue of tasks to store them between the time they become available and the time a worker can grab them.

By default, StarPU uses the work-stealing scheduler \b lws. This is because it provides correct load balancing and locality even if the application codelets do not have performance models. Other non-modeled scheduling policies can be selected from the list below, thanks to the \ref STARPU_SCHED environment variable, for example <c>export STARPU_SCHED=dmda</c>. Use <c>help</c> to get the list of available schedulers.
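
The scheduling policy can also be selected programmatically when initializing StarPU, as in the following minimal sketch (assuming the rest of the default configuration is sufficient):

\code{.c}
/* Minimal sketch: select the dmda scheduler at initialization time
 * instead of relying on the STARPU_SCHED environment variable. */
struct starpu_conf conf;
starpu_conf_init(&conf);
conf.sched_policy_name = "dmda";
int ret = starpu_init(&conf);
STARPU_CHECK_RETURN_VALUE(ret, "starpu_init");
\endcode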

The starpu_sched_get_predefined_policies() function returns a NULL-terminated array of all predefined scheduling policies available in StarPU. The starpu_sched_get_sched_policy_in_ctx() and starpu_sched_get_sched_policy() functions return the scheduling policy of a task within a specific context or the default context, respectively.
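
For instance, the predefined policies can be listed programmatically. The following is only a minimal sketch, assuming that the starpu_sched_policy structure exposes the policy_name and policy_description fields:

\code{.c}
/* Minimal sketch: print the name and description of every predefined
 * scheduling policy known to this StarPU build. */
struct starpu_sched_policy **policies = starpu_sched_get_predefined_policies();
for (int i = 0; policies[i] != NULL; i++)
	fprintf(stderr, "%s: %s\n", policies[i]->policy_name, policies[i]->policy_description);
\endcode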

\subsection NonPerformanceModelingPolicies Non Performance Modelling Policies

- The <b>eager</b> scheduler uses a central task queue, from which all workers draw tasks to work on concurrently. However, this does not allow data prefetching, since the scheduling decision is made late. If a task has a non-zero priority, it is placed at the front of the queue.

- The <b>random</b> scheduler uses one queue per worker, and distributes tasks randomly according to the assumed overall performance of each worker.

- The <b>ws</b> (work stealing) scheduler uses one queue per worker, and by default schedules a task on the worker that released it. When a worker becomes idle, it steals a task from the most loaded worker.

- The <b>lws</b> (locality work stealing) scheduler uses one queue per worker, and by default schedules a task on the worker that released it. When a worker becomes idle, it steals a task from neighboring workers. It also takes priorities into account.

- The <b>prio</b> scheduler also uses a central task queue, but sorts tasks by the priority specified by the application.

- The <b>heteroprio</b> scheduler uses different priorities for the different processing units. This scheduler must be configured in order to work correctly and to deliver high performance, as described in the corresponding section.

\subsection DMTaskSchedulingPolicy Performance Model-Based Task Scheduling Policies

100 changes: 22 additions & 78 deletions doc/doxygen/chapters/starpu_basics/tasks.doxy
@@ -18,95 +18,44 @@

\section TaskGranularity Task Granularity

Similar to other runtimes, StarPU introduces some overhead in managing tasks. This overhead, while not always negligible, is mitigated by its intelligent scheduling and data management capabilities. The typical order of magnitude for this overhead is a few microseconds, which is significantly less than the inherent CUDA overhead. To ensure that this overhead remains insignificant, the work assigned to a task should be substantial enough.

Ideally, the length of tasks should be relatively large to effectively offset this overhead. It is advisable to consider the offline performance feedback, which provides insight into task lengths. Monitoring task lengths becomes critical when you are experiencing suboptimal performance.

To gauge the scalability potential based on task size, you can run the <c>tests/microbenchs/tasks_size_overhead.sh</c> script. It provides a visual representation of the speedup achievable with independent tasks of very small size.

This benchmark is installed in <c>$STARPU_PATH/lib/starpu/examples/</c>. It gives an idea of how long a task should be (in µs) for the StarPU overhead to be low enough to maintain efficiency. The script generates a graph showing the speedup trends for tasks of different sizes, correlated with the number of CPUs used.

For example, in the figure below, for 128 µs tasks (the red line), the StarPU overhead is low enough to guarantee a good speedup if the number of CPUs is not more than 36. But with the same number of CPUs, 64 µs tasks (the black line) cannot achieve a proper speedup. The number of CPUs must be reduced to about 17 to maintain efficiency.

\image html tasks_size_overhead.png
\image latex tasks_size_overhead.png "" width=\textwidth

To determine the task sizes your application is using, you can use <c>starpu_fxt_data_trace</c>, as explained in \ref DataTrace.

The choice of scheduler in StarPU also plays an important role. Different schedulers have different effects on the overall execution. For example, the \c dmda scheduler may require additional time to make decisions, while the \c eager scheduler tends to be more immediate in its decisions.

To evaluate the impact of the scheduler choice on your target machine, you can once again use the \c tasks_size_overhead.sh script. This script provides valuable insight into how different schedulers affect performance in relation to task size.

\section TaskSubmission Task Submission

To allow StarPU to effectively perform online optimizations, it is recommended to submit tasks asynchronously whenever possible. The goal is to maximize the level of asynchronous submission, allowing StarPU to have more flexibility in optimizing the scheduling process. Ideally, all tasks should be submitted asynchronously, and the use of functions like starpu_task_wait_for_all() or starpu_data_unregister() should be limited to waiting for task completion.

StarPU will then be able to rework the whole schedule, overlap computation with communication, manage local accelerator memory usage, etc. A simple example can be found in <c>examples/basic_examples/variable.c</c>.
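
As an illustration, the sketch below submits a whole batch of tasks asynchronously and synchronizes only once at the end. The codelet <c>cl</c>, the handles <c>handles[i]</c> and the task count <c>ntasks</c> are assumed to have been set up beforehand:

\code{.c}
/* Minimal sketch: submit all tasks asynchronously, then synchronize once. */
for (unsigned i = 0; i < ntasks; i++)
{
	int ret = starpu_task_insert(&cl, STARPU_RW, handles[i], 0);
	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_insert");
}

/* A single synchronization point, once everything has been submitted. */
starpu_task_wait_for_all();
\endcode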

\section TaskPriorities Task Priorities

StarPU's default behavior is to consider tasks in the order in which they are submitted by the application. However, in scenarios where the application programmer has knowledge about certain tasks that should be prioritized due to their impact on performance (such as tasks whose output is critical to subsequent tasks), the starpu_task::priority field can be used to convey this information to StarPU's scheduling process.

An example can be found in <c>examples/heat/dw_factolu_tag.c</c>.
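
For instance, a task on the critical path can be given the highest priority. The following minimal sketch assumes a codelet <c>cl</c> and a data handle <c>handle</c> defined elsewhere:

\code{.c}
/* Minimal sketch: raise the priority of a task on the critical path. */
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
task->priority = STARPU_MAX_PRIO;
starpu_task_submit(task);

/* The same effect with the task insertion helper. */
starpu_task_insert(&cl, STARPU_RW, handle, STARPU_PRIORITY, STARPU_MAX_PRIO, 0);
\endcode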

\section SettingManyDataHandlesForATask Setting Many Data Handles For a Task

The maximum number of data that a task can manage is set by the macro \ref STARPU_NMAXBUFS. This macro has a default value that can be changed with the \c configure option \ref enable-maxbuffers "--enable-maxbuffers".

However, if you have specific cases where you need tasks to manage more data than the maximum allowed, you can use the starpu_task::dyn_handles field when defining a task, along with the starpu_codelet::dyn_modes field when defining the corresponding codelet.

This dynamic handle mechanism allows tasks to handle additional data beyond the usual limit imposed by \ref STARPU_NMAXBUFS.

\code{.c}
enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
@@ -146,17 +95,12 @@ starpu_task_insert(&dummy_big_cl,
0);
\endcode

The whole code for this complex data interface is available in <c>examples/basic_examples/dynamic_handles.c</c>.

\section SettingVariableDataHandlesForATask Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is set with starpu_codelet::nbuffers. However, this field can be set to \ref STARPU_VARIABLE_NBUFFERS, in which case starpu_task::nbuffers must be set, and starpu_task::modes (or starpu_task::dyn_modes, see \ref SettingManyDataHandlesForATask) should be used to specify the modes for the handles. Examples in <c>examples/basic_examples/dynamic_handles.c</c> show how to implement this.
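
A minimal sketch is given below; it assumes a CPU implementation <c>func</c> and two registered handles <c>handle_a</c> and <c>handle_b</c>:

\code{.c}
/* Minimal sketch: a codelet accepting a variable number of buffers. */
struct starpu_codelet var_cl =
{
	.cpu_funcs = { func },
	.nbuffers = STARPU_VARIABLE_NBUFFERS,
};

/* Submit a task using this codelet with exactly two handles. */
struct starpu_task *task = starpu_task_create();
task->cl = &var_cl;
task->nbuffers = 2;
task->handles[0] = handle_a;
task->handles[1] = handle_b;
task->modes[0] = STARPU_R;
task->modes[1] = STARPU_RW;
starpu_task_submit(task);
\endcode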

\section InsertTaskUtility Insert Task Utility

41 changes: 11 additions & 30 deletions doc/doxygen/chapters/starpu_installation/building.doxy
@@ -16,59 +16,40 @@

/*! \page BuildingAndInstallingStarPU Building and Installing StarPU

Depending on the level of customization required for the library installation, we offer several solutions.

<ol>
<li><b>Basic Installation or Evaluation:</b> If you just want to try out the library, evaluate its performance on simple cases, run examples, or use the latest stable version, we recommend the following options:
<ul>
<li>
For Linux Debian or Ubuntu distributions, consider using the latest StarPU Debian package (see \ref InstallingABinaryPackage).
</li>
<li>
For macOS, you can use Brew and follow the steps in \ref InstallingASourcePackage.
</li>
<li>
Use an already installed module on a cluster, as explained in \ref UsingModule.
</li>
</ul>
</li>
<li><b>Customization for Specific Needs:</b> If you intend to use StarPU but need modifications, such as switching to a different version (git branch), changing the default MPI, using a preferred compiler, or modifying the source code, consider these options:
<ul>
<li>
Guix or Spack may be useful, as these package managers allow dynamic changes during source-based builds. See \ref InstallingASourcePackage for details.
</li>
<li>
Alternatively, you can build directly from source using the library's native build system (Makefile, GNU autotools). Instructions can be found in \ref InstallingFromSource.
</li>
</ul>
</li>
<li>
<b>Experiment Reproducibility:</b> If your focus is on the reproducibility of experiments, we recommend using Guix. Refer to \ref InstallingASourcePackage for guidance.
</li>
</ol>

Whichever solution you choose, you can use the tool <c>bin/starpu_config</c> to view all the configuration parameters that were used during the StarPU installation.

Please refer to the documentation provided for the specific installation steps and details of each solution.

\section InstallingABinaryPackage Installing a Binary Package


