From 3404801507180736693cdb195fb087a7d6e4432e Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 9 Jul 2024 16:58:59 +0200 Subject: [PATCH 01/19] Improve documentation - Fix typos - Fix variable names - Improve formulation for Complications and Fixes - Improve kokkos_malloc code --- .../Kokkos-and-Virtual-Functions.md | 171 +++++++++++------- 1 file changed, 109 insertions(+), 62 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index d61189a18..1a1455e63 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -4,28 +4,29 @@ Due to oddities of GPU programming, the use of virtual functions in Kokkos paral ## The Problem -In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following code +In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following code (using Cuda instructions as an example) ```c++ class Derived : public Base { - /** fields */ + // fields + public: KOKKOS_FUNCTION virtual void Bar(){ - // TODO: implement all of physics + // all of physics } }; -Base* hostClassInstance = new Derived(); -Base* deviceClassInstance; -cudaMalloc((void**)&deviceClassInstance, sizeof(Derived)); -cudaMemcpy(deviceClassInstance, hostClassInstance, sizeof(Derived), cudaMemcpyHostToDevice); +Base* hostInstance = new Derived(); +Base* deviceInstance; +cudaMalloc((void**)&deviceInstance, sizeof(Derived)); +cudaMemcpy(deviceInstance, hostInstance, sizeof(Derived), cudaMemcpyHostToDevice); -Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA(const int i) { - deviceClassInstance->Bar(); +Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { + deviceInstance->Bar(); }); ``` -At a glance this should be fine, we've made a device instance of a class, copied the contents of a host instance into it, and then used it. This code will typically crash, however, because `virtualFunction` will call a host version of the function. To understand why, you'll need to understand a bit about how virtual functions are implemented. +At a glance this should be fine, we've made a device instance of a class, copied the contents of a host instance into it, and then used it. This code will typically crash, however, because `deviceInstance` will call a host version of `Bar()`. To understand why, you'll need to understand a bit about how virtual functions are implemented. ## V-Tables, V-Pointers, V-ery annoying with GPUs @@ -45,7 +46,7 @@ Credit: the content of this section is adapted from Pablo Arias [here](https://p ## Then why doesn't my code work? -The reason the intro code might break is that when dealing with GPU-compatible classes with virtual functions, there isn't one Vtable, but two. The first holds the host versions of the virtual functions, while the second holds the device functions. +The reason the intro code might break is that when dealing with GPU-compatible classes with virtual functions, there isn't one Vtable, but two. The first holds the host version of the virtual functions, while the second holds the device functions. ![VTableDevice](./figures/VirtualFunctions-VTablesHostDevice.png) @@ -53,27 +54,35 @@ Since we construct the class instance on the host, so it's Vpointer points to th ![VPointerToHost](./figures/VirtualFunctions-VPointerToHost.png) -Our cudaMemcpy faithfully copied all of the members of the class, including the Vpointer merrily pointing at host functions, which we then call on the device. +Our `cudaMemcpy()` faithfully copied all of the members of the class, including the Vpointer merrily pointing at host functions, which we then call on the device. ## How to fix this -The problem here is that we are constructing the class on the Host. If we were constructing on the Device, we'd get the correct Vpointer, and thus the correct functions (but only for calls on the device). In pseudocode, we want to move from +The problem here is that we are constructing the class on the Host. If we were constructing on the Device, we'd get the correct Vpointer, and thus the correct functions (but only for calls on the device). We want to move from ```c++ -Base* hostInstance = new Derived(); // allocate and initialize host -Base* deviceInstance; // cudaMalloc'd to allocate -cudaMemcpy(deviceInstance, hostInstance); // to initialize the deivce -Kokkos::parallel_for(... { - // use deviceInstance +Base* hostInstance = new Derived(); // allocate and initialize on host +Base* deviceInstance; // declare +cudaMalloc((void**)&deviceInstance, sizeof(Derived)); // alocate on device +cudaMemcpy(deviceInstance, hostInstance, sizeof(Derived), cudaMemcpyHostToDevice); // initialize on device by copy + +Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { + // use on device }); ``` -To one where we construct on the device using a technique called `placement new` +to one where we construct on the device using a technique called *placement new* ```c++ -Base* deviceInstance; // cudaMalloc'd to allocate it -Kokkos::parallel_for(... { - new((Derived*)deviceInstance) Derived(); // construct an instance in the place, the pointer deviceInstance points to +Base* deviceInstance; // declare +cudaMalloc((void**)&deviceInstance, sizeof(Derived)); // alocate on device +Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { + new (static_cast(deviceInstance)) Derived(); // initialize on device +}); +Kokkos::fence("Wait for initialize"); + +Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { + // use on device }); ``` @@ -82,6 +91,35 @@ This code is extremely ugly, but leads to functional virtual function calls on t ![VPointerToDevice](./figures/VirtualFunctions-VPointerToDevice.png) Note that like with other uses of `new`, you need to later `free` the memory. + +## Make it portable + +We can get rid of the call to `cudaMalloc()` and make the previous code portable + +```cpp +auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate memory on device +Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { + new (static_cast(deviceInstanceMemory)) Derived(); // initialize on device +}); +Kokkos::fence("Wait for initialize"); +Base* deviceInstance = static_cast(deviceInstanceMemory); // declare on this memory + +Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { + // use on device +}); +Kokkos::fence(); + +Kokkos::parallel_for("Destroy", 1, KOKKOS_LAMBDA (const int i) { + deviceInstance->~Base(); // destroy on device +}); +Kokkos::fence("Wait for destroy"); +Kokkos::kokkos_free(deviceInstanceMemory); // free +``` + +We replaced `cudaMalloc()` by `Kokkos::kokkos_malloc()` and introduced a distinction between the memory that the instance uses, and the actual instance object. +Since the kernel does not have a return type, we use a static cast to associate the later to the former. +Then, the object is destroyed, and its memory freed. + For a full working example, see [the example in the repo](https://github.com/kokkos/kokkos/blob/master/example/virtual_functions/main.cpp). ## Complications and Fixes @@ -89,80 +127,89 @@ For a full working example, see [the example in the repo](https://github.com/kok The first problem people run into with this is that they want to initialize some fields or nested classes based on host data before moving data down to the device ```c++ -Base* hostInstance = new Derived(); // allocate and initialize host -hostInstance->setAField(someHostValue); -Base* deviceInstance; // cudaMalloc'd to allocate -cudaMemcpy(deviceInstance, hostInstance); // to initialize the deivce -Kokkos::parallel_for(... { - // use deviceInstance +Base* hostInstance = new Derived(); // allocate and initialize on host +hostInstance->setAField(someHostValue); // set on host +Base* deviceInstance; +cudaMalloc((void**)&deviceInstance, sizeof(Derived)); +cudaMemcpy(deviceInstance, hostInstance, sizeof(Derived), cudaMemcpyHostToDevice); + +Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { + // use on device }); ``` We can't translate this easily, the naive translation would be ```c++ -Base* deviceInstance; // cudaMalloc'd to allocate it -Kokkos::parallel_for(... { - new((Derived*)deviceInstance) Derived(); // initialize an instance, and place the result in the pointer deviceInstance - deviceInstance->setAField(someHostValue); +auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); +Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { + new (static_cast(deviceInstanceMemory)) Derived(); }); +Kokkos::fence("Wait for initialize"); +Base* deviceInstance = static_cast(deviceInstanceMemory); + +Kokkos::parallel_for("Set", 1, KOKKOS_LAMBDA (const int i) { + deviceInstance->setAField(value); // set on device +}); +Kokkos::fence("Wait for set"); ``` -Which would crash for accessing the host value `someHostValue` on the device (or this value would need to be copied into the `parallel_for`). The most productive solution we've found in these cases is to allocate the class in `SharedSpace`, initialize it on the device, and then fill in fields on the host. To wit: +This would crash if `someHostValue` is not accessible on the device (e.g. for a small array). The most productive solution we've found in these cases is to allocate the instance in `SharedSpace`, initialize it on the device, and then fill in fields on the host. To wit: ```c++ -Base* deviceInstance = Kokkos::kokkos_malloc(sizeof(Derived)); -Kokkos::parallel_for(... { - new((Derived*)deviceInstance) Derived(); // construct an instance in the place the the pointer deviceInstance points to +auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate on shared space +Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { + new (static_cast(deviceInstanceMemory)) Derived(); }); -deviceInstance->setAField(someHostValue); // set some field on the host +Kokkos::fence("Wait for initialize"); +Base* deviceInstance = static_cast(deviceInstanceMemory); +deviceInstance->setAField(someHostValue); // set on host ``` This is the solution that the code teams we have talked to have said is the most productive way to solve the problem. Nevertheless, it should be kept in mind, that this restricts virtual function calls to the device. +In the case presented above, `setAField()` is not a virtual function. ## But what if I do not really need the V-Tables on the device side? -Consider the following example which calls the `virtual Bar()` on the device from a pointer of derived class type. -One might think this should work because no V-Table lookup on the device is neccessary. + +Consider the following example which calls the `virtual` `Bar()` on the device from a pointer of derived class type. +One might think this should work because no V-Table lookup on the device is necessary. + ```c++ #include -#include struct Base { - KOKKOS_DEFAULTED_FUNCTION - virtual ~Base() = default; + KOKKOS_DEFAULTED_FUNCTION + virtual ~Base() = default; - KOKKOS_FUNCTION - virtual void Bar() const = 0; + KOKKOS_FUNCTION + virtual void Bar() const = 0; }; struct Derived : public Base { - KOKKOS_FUNCTION - void Bar() const override - { printf("Hello from Derived\n"); } - - void apply(){ - Kokkos::parallel_for("myLoop",10, - KOKKOS_CLASS_LAMBDA (const size_t i) { this->Bar(); } - ); - } + KOKKOS_FUNCTION + void Bar() const override + { Kokkos::printf("Hello from Derived\n"); } + + void apply(){ + Kokkos::parallel_for("myLoop", 10, + KOKKOS_CLASS_LAMBDA (const size_t i) { this->Bar(); } + ); + } }; int main (int argc, char *argv[]) { - Kokkos::initialize(argc,argv); - { - auto derivedPtr = std::make_shared(); - derivedPtr->apply(); - Kokkos::fence(); - } - Kokkos::finalize(); + Kokkos::ScopeGuard kokkos(argc, argv); + auto derivedPtr = std::make_shared(); + derivedPtr->apply(); + Kokkos::fence(); } ``` ### Why is this not portable? -Inside the `parallel_for` `Bar()` is called. As `Derived` derives from the pure virtual class `Base`, the 'Bar()' function is marked `override`. +Inside the `parallel_for`, `Bar()` is called. As `Derived` derives from the pure virtual class `Base`, the `Bar()` function is marked `override`. On ROCm 5.2 this results in a memory access violation. When executing the `this->Bar()` call, the runtime looks into the V-Table and dereferences a host function pointer on the device. @@ -171,7 +218,7 @@ When executing the `this->Bar()` call, the runtime looks into the V-Table and de Notice, that the `parallel_for` is called from a pointer of type `Derived` and not a pointer of type `Base` pointing to an `Derived` object. Thus, no V-Table lookup for the `Bar()` would be necessary as it can be deduced from the context of the call that it will be `Derived::Bar()`. But here it comes down to how the compiler handles the lookup. NVCC understands that the call is coming from an `Derived` object and thinks: "Oh, I see, that you are calling from an `Derived` object, I know it will be the `Bar()` in this class scope, I will do this for you". -ROCm, on the other hand, sees your call and thinks “Oh, this is a call to a virtual method, I will look that up for you” - failing to dereference the host function pointer in the host virtual function table. +ROCm, on the other hand, sees your call and thinks "Oh, this is a call to a virtual method, I will look that up for you", failing to dereference the host function pointer in the host virtual function table. ### How to solve this? Strictly speaking, the observed behavior on NVCC is an optimization that uses the context information to avoid the V-Table lookup. From 06fe6042dcd644c720914a4151463b167533b09e Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 9 Jul 2024 17:16:25 +0200 Subject: [PATCH 02/19] Fix format typo --- .../ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 1a1455e63..35bf64a8e 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -34,15 +34,15 @@ Virtual functions allow a program to handle Derived classes through a pointer to ![VTable](./figures/VirtualFunctions-VTables.png) -Okay, so now we have Vtables, if a class knows what type it is it could call the correct function. But how does it know? +Okay, so now we have Vtables, if a class knows what type it is, it could call the correct function. But how does it know? Remember that we have one Vtable shared amongst all instances of a type. Each instance, however, has a hidden member called the Vpointer, which the compiler points at construction to the correct Vtable. So a call to a virtual function simply dereferences that pointer, and then indexes into the Vtable to find the precise virtual function called. ![VPointer](./figures/VirtualFunctions-VPointers.png) -Now that we know what the compiler is doing to implement virtual functions, we'll look at why it doesn't work with GPU's +Now that we know what the compiler is doing to implement virtual functions, we'll look at why it doesn't work with GPU's. -Credit: the content of this section is adapted from Pablo Arias [here](https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/) +Credit: the content of this section is adapted from Pablo Arias [here](https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/). ## Then why doesn't my code work? @@ -171,7 +171,7 @@ In the case presented above, `setAField()` is not a virtual function. ## But what if I do not really need the V-Tables on the device side? -Consider the following example which calls the `virtual` `Bar()` on the device from a pointer of derived class type. +Consider the following example which calls the virtual function `Bar()` on the device from a pointer of derived class type. One might think this should work because no V-Table lookup on the device is necessary. ```c++ From b0d80f87f388aa6e4675de18a2b81ac18de909fb Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 16 Jul 2024 16:33:40 +0200 Subject: [PATCH 03/19] Update docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Cédric Chevalier --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 35bf64a8e..8487b41c0 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -75,7 +75,7 @@ to one where we construct on the device using a technique called *placement new* ```c++ Base* deviceInstance; // declare -cudaMalloc((void**)&deviceInstance, sizeof(Derived)); // alocate on device +cudaMalloc((void**)&deviceInstance, sizeof(Derived)); // allocate on device Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { new (static_cast(deviceInstance)) Derived(); // initialize on device }); From 94dca49cf953d9e74b5a8c7d0e7c5e52798b0f97 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 16 Jul 2024 18:08:36 +0200 Subject: [PATCH 04/19] Add warning for virtual functions --- .../Kokkos-and-Virtual-Functions.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 8487b41c0..417680370 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -1,7 +1,16 @@ # 14. Kokkos and Virtual Functions +```{warning} +Using virtual functions in parallel regions is not a good idea in general. It often degrades performance, it requires specific code for a correct execution on GPU, and it is not portable on every backend. We recommend to use a different approach whenever possible. +``` + Due to oddities of GPU programming, the use of virtual functions in Kokkos parallel regions can be complicated. This document describes the problems you're likely to face, where they come from, and how to work around them. +Please note that virtual functions can be executed on device for the following backends: + +- Cuda; and +- HIP. + ## The Problem In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following code (using Cuda instructions as an example) @@ -12,8 +21,8 @@ class Derived : public Base { public: KOKKOS_FUNCTION virtual void Bar(){ - // all of physics - } + // function body + } }; Base* hostInstance = new Derived(); From e90221a0312965d33f044ac755d50c2b5af6500f Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Thu, 18 Jul 2024 13:03:59 +0200 Subject: [PATCH 05/19] Remove all reference to mixed code --- .../Kokkos-and-Virtual-Functions.md | 236 ++++++++---------- 1 file changed, 109 insertions(+), 127 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 417680370..e48589d3e 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -1,7 +1,7 @@ # 14. Kokkos and Virtual Functions ```{warning} -Using virtual functions in parallel regions is not a good idea in general. It often degrades performance, it requires specific code for a correct execution on GPU, and it is not portable on every backend. We recommend to use a different approach whenever possible. +Using virtual functions in parallel regions is not a good idea in general. It often degrades performance, requires specific code for a correct execution on GPU, and is not portable on every backend. We recommend to use a different approach whenever possible. ``` Due to oddities of GPU programming, the use of virtual functions in Kokkos parallel regions can be complicated. This document describes the problems you're likely to face, where they come from, and how to work around them. @@ -13,33 +13,45 @@ Please note that virtual functions can be executed on device for the following b ## The Problem -In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following code (using Cuda instructions as an example) +In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following serial code to accelerate ```c++ -class Derived : public Base { - // fields +class Base { + public: + void Foo() {} + + virtual void Bar() {} +}; +class Derived : public Base { public: - KOKKOS_FUNCTION virtual void Bar(){ - // function body - } + void Bar() override {} }; -Base* hostInstance = new Derived(); -Base* deviceInstance; -cudaMalloc((void**)&deviceInstance, sizeof(Derived)); -cudaMemcpy(deviceInstance, hostInstance, sizeof(Derived), cudaMemcpyHostToDevice); -Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { - deviceInstance->Bar(); -}); +int main(int argc, char *argv[]) + // create + Base* instance = new Derived(); + + // use + for (int i = 0; i < 10; i++) { + instance->Bar(); + } + + // cleanup + delete instance; +} ``` -At a glance this should be fine, we've made a device instance of a class, copied the contents of a host instance into it, and then used it. This code will typically crash, however, because `deviceInstance` will call a host version of `Bar()`. To understand why, you'll need to understand a bit about how virtual functions are implemented. +This code is more complex to accelerate than it looks like. +Using a straightforward approach, we would accelerate the `for` loop with `parallel_for` and copy `instance` on the GPU memory (not disclosing how for now). +Then, we would call `Bar()` inside the `parallel_for`. +At a glance this should be fine, but it will typically crash, however, because `instance` will call a host version of `Bar()`. +To understand why, you'll need to understand a bit about how virtual functions are implemented. -## V-Tables, V-Pointers, V-ery annoying with GPUs +## Vtables, Vpointers, Very annoying with GPUs -Virtual functions allow a program to handle Derived classes through a pointer to their Base class and have things work as they should. To make this work, the compiler needs some way to identify whether a pointer which is nominally to a Base class really is a pointer to the Base, or whether it's really a pointer to any Derived class. This happens through Vpointers and Vtables. For every class with virtual functions, there is one Vtable shared among all instances, this table contains function pointers for all the virtual functions the class implements. +Virtual functions allow a program to handle Derived classes through a pointer to their Base class and have things work as they should. To make this work, the compiler needs some way to identify whether a pointer which is nominally to a Base class really is a pointer to the Base, or whether it's really a pointer to any Derived class. This happens through Vpointers and Vtables. For every class with virtual functions, there is one Vtable shared among all instances, this table contains function pointers for all the virtual functions the class implements. ![VTable](./figures/VirtualFunctions-VTables.png) @@ -53,9 +65,9 @@ Now that we know what the compiler is doing to implement virtual functions, we'l Credit: the content of this section is adapted from Pablo Arias [here](https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/). -## Then why doesn't my code work? +## Then why the straightforward approach doesn't work? -The reason the intro code might break is that when dealing with GPU-compatible classes with virtual functions, there isn't one Vtable, but two. The first holds the host version of the virtual functions, while the second holds the device functions. +The reason the straightforward approach described above fails is that when dealing with GPU-compatible classes with virtual functions, there isn't one Vtable, but two. The first holds the host version of the virtual functions, while the second holds the device functions. ![VTableDevice](./figures/VirtualFunctions-VTablesHostDevice.png) @@ -63,178 +75,148 @@ Since we construct the class instance on the host, so it's Vpointer points to th ![VPointerToHost](./figures/VirtualFunctions-VPointerToHost.png) -Our `cudaMemcpy()` faithfully copied all of the members of the class, including the Vpointer merrily pointing at host functions, which we then call on the device. - -## How to fix this +We faithfully copied all of the members of the class on the GPU memory, including the Vpointer happily pointing at host functions, which we then call on the device. -The problem here is that we are constructing the class on the Host. If we were constructing on the Device, we'd get the correct Vpointer, and thus the correct functions (but only for calls on the device). We want to move from +## Make it work -```c++ -Base* hostInstance = new Derived(); // allocate and initialize on host -Base* deviceInstance; // declare -cudaMalloc((void**)&deviceInstance, sizeof(Derived)); // alocate on device -cudaMemcpy(deviceInstance, hostInstance, sizeof(Derived), cudaMemcpyHostToDevice); // initialize on device by copy - -Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { - // use on device -}); -``` +The problem here is that we are constructing the instance on the Host. +If we were constructing it on the Device, we'd get the correct Vpointer, and thus the correct functions. +Note that this would allow to call virtual functions on the device only, not on the host. -to one where we construct on the device using a technique called *placement new* +To that aim, we first allocate memory on the device, then construct on the device using a technique called *placement new* -```c++ -Base* deviceInstance; // declare -cudaMalloc((void**)&deviceInstance, sizeof(Derived)); // allocate on device -Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { - new (static_cast(deviceInstance)) Derived(); // initialize on device -}); -Kokkos::fence("Wait for initialize"); - -Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { - // use on device -}); -``` +```cpp +#include -This code is extremely ugly, but leads to functional virtual function calls on the device. The Vpointer now points to the device Vtable. +class Base { + public: + void Foo() {} -![VPointerToDevice](./figures/VirtualFunctions-VPointerToDevice.png) + KOKKOS_FUNCTION + virtual void Bar() {} +}; -Note that like with other uses of `new`, you need to later `free` the memory. +class Derived : public Base { + public: + KOKKOS_FUNCTION + void Bar() override {} +}; -## Make it portable +int main(int argc, char *argv[]) +{ + Kokkos::ScopeGuard kokkos(argc, argv); -We can get rid of the call to `cudaMalloc()` and make the previous code portable + // create + auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate memory on device + Kokkos::parallel_for("initialize", 1, KOKKOS_LAMBDA (const int i) { + new (static_cast(deviceInstanceMemory)) Derived(); // initialize on device + }); + Kokkos::fence("wait for initialize"); + Base* deviceInstance = static_cast(deviceInstanceMemory); // declare on this memory + + // use + Kokkos::parallel_for("myKernel", 10, KOKKOS_LAMBDA (const int i) { + deviceInstance->Bar(); + }); + Kokkos::fence(); -```cpp -auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate memory on device -Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { - new (static_cast(deviceInstanceMemory)) Derived(); // initialize on device -}); -Kokkos::fence("Wait for initialize"); -Base* deviceInstance = static_cast(deviceInstanceMemory); // declare on this memory - -Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { - // use on device -}); -Kokkos::fence(); - -Kokkos::parallel_for("Destroy", 1, KOKKOS_LAMBDA (const int i) { - deviceInstance->~Base(); // destroy on device -}); -Kokkos::fence("Wait for destroy"); -Kokkos::kokkos_free(deviceInstanceMemory); // free + // cleanup + Kokkos::parallel_for("destroy", 1, KOKKOS_LAMBDA (const int i) { + deviceInstance->~Base(); // destroy on device + }); + Kokkos::fence("wait for destroy"); + Kokkos::kokkos_free(deviceInstanceMemory); // free +} ``` -We replaced `cudaMalloc()` by `Kokkos::kokkos_malloc()` and introduced a distinction between the memory that the instance uses, and the actual instance object. -Since the kernel does not have a return type, we use a static cast to associate the later to the former. -Then, the object is destroyed, and its memory freed. - -For a full working example, see [the example in the repo](https://github.com/kokkos/kokkos/blob/master/example/virtual_functions/main.cpp). +We first use the `KOKKOS_FUNCTION` macro to make the methods callable from a kernel. +When creating the instance, note that we introduce a distinction between the *memory* that the it uses, and the actual instantiated *object*. +The construct is done on the device, withing a single-iteration `parallel_for`, using placement new. +Since the kernel does not have a return type, we use a static cast to associate the object to the memory allocation. -## Complications and Fixes +Like with other uses of `new`, we need to later free the memory. +The destruct is done on the device, again with a single-iteration `parallel_for`. -The first problem people run into with this is that they want to initialize some fields or nested classes based on host data before moving data down to the device +This code is extremely ugly, but leads to functional virtual function calls on the device. The Vpointer now points to the device Vtable. +Remember that those virtual functions cannot be called on the host anymore! -```c++ -Base* hostInstance = new Derived(); // allocate and initialize on host -hostInstance->setAField(someHostValue); // set on host -Base* deviceInstance; -cudaMalloc((void**)&deviceInstance, sizeof(Derived)); -cudaMemcpy(deviceInstance, hostInstance, sizeof(Derived), cudaMemcpyHostToDevice); - -Kokkos::parallel_for("DeviceKernel", SomeCudaPolicy, KOKKOS_LAMBDA (const int i) { - // use on device -}); -``` +![VPointerToDevice](./figures/VirtualFunctions-VPointerToDevice.png) -We can't translate this easily, the naive translation would be +For a full working example, see [the example in the repo](https://github.com/kokkos/kokkos/blob/master/example/virtual_functions/main.cpp). -```c++ -auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); -Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { - new (static_cast(deviceInstanceMemory)) Derived(); -}); -Kokkos::fence("Wait for initialize"); -Base* deviceInstance = static_cast(deviceInstanceMemory); - -Kokkos::parallel_for("Set", 1, KOKKOS_LAMBDA (const int i) { - deviceInstance->setAField(value); // set on device -}); -Kokkos::fence("Wait for set"); -``` +## What if I need a setter with host values? -This would crash if `someHostValue` is not accessible on the device (e.g. for a small array). The most productive solution we've found in these cases is to allocate the instance in `SharedSpace`, initialize it on the device, and then fill in fields on the host. To wit: +The first problem people run into with this is that they want to initialize some fields based on host data, with a setter which is *not* a virtual function. +Calling this setter on the device would crash if the host data is not easily copyable on the device (e.g. for a small array). +The most productive solution we've found in these cases is to allocate the instance in `SharedSpace`, initialize it on the device, and then fill in fields on the host ```c++ +// create auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate on shared space -Kokkos::parallel_for("Initialize", 1, KOKKOS_LAMBDA (const int i) { - new (static_cast(deviceInstanceMemory)) Derived(); -}); -Kokkos::fence("Wait for initialize"); -Base* deviceInstance = static_cast(deviceInstanceMemory); +// ... deviceInstance->setAField(someHostValue); // set on host ``` -This is the solution that the code teams we have talked to have said is the most productive way to solve the problem. Nevertheless, it should be kept in mind, that this restricts virtual function calls to the device. -In the case presented above, `setAField()` is not a virtual function. +The setter is still called on the host. +Beware that this is only valid for backends that support `SharedSpace`. -## But what if I do not really need the V-Tables on the device side? +## But what if I do not really need the Vtables on the device side? Consider the following example which calls the virtual function `Bar()` on the device from a pointer of derived class type. -One might think this should work because no V-Table lookup on the device is necessary. +One might think this should work because no Vtable lookup on the device is necessary. ```c++ #include -struct Base -{ - KOKKOS_DEFAULTED_FUNCTION - virtual ~Base() = default; - +class Base { + public: KOKKOS_FUNCTION virtual void Bar() const = 0; }; -struct Derived : public Base -{ +class Derived : Base { + public: KOKKOS_FUNCTION - void Bar() const override - { Kokkos::printf("Hello from Derived\n"); } + void Bar() const override { + Kokkos::printf("Hello from Derived\n"); + } - void apply(){ + void apply() { Kokkos::parallel_for("myLoop", 10, KOKKOS_CLASS_LAMBDA (const size_t i) { this->Bar(); } ); } }; -int main (int argc, char *argv[]) +int main(int argc, char *argv[]) { Kokkos::ScopeGuard kokkos(argc, argv); + auto derivedPtr = std::make_shared(); derivedPtr->apply(); Kokkos::fence(); } ``` + ### Why is this not portable? Inside the `parallel_for`, `Bar()` is called. As `Derived` derives from the pure virtual class `Base`, the `Bar()` function is marked `override`. On ROCm 5.2 this results in a memory access violation. -When executing the `this->Bar()` call, the runtime looks into the V-Table and dereferences a host function pointer on the device. +When executing the `this->Bar()` call, the runtime looks into the Vtable and dereferences a host function pointer on the device. ### But if that is the case, why does it work with NVCC? Notice, that the `parallel_for` is called from a pointer of type `Derived` and not a pointer of type `Base` pointing to an `Derived` object. -Thus, no V-Table lookup for the `Bar()` would be necessary as it can be deduced from the context of the call that it will be `Derived::Bar()`. +Thus, no Vtable lookup for the `Bar()` would be necessary as it can be deduced from the context of the call that it will be `Derived::Bar()`. But here it comes down to how the compiler handles the lookup. NVCC understands that the call is coming from an `Derived` object and thinks: "Oh, I see, that you are calling from an `Derived` object, I know it will be the `Bar()` in this class scope, I will do this for you". ROCm, on the other hand, sees your call and thinks "Oh, this is a call to a virtual method, I will look that up for you", failing to dereference the host function pointer in the host virtual function table. ### How to solve this? -Strictly speaking, the observed behavior on NVCC is an optimization that uses the context information to avoid the V-Table lookup. +Strictly speaking, the observed behavior on NVCC is an optimization that uses the context information to avoid the Vtable lookup. If the compiler does not apply this optimization, you can help in different ways by providing additional information. Unfortunately, none of these strategies is fully portable to all backends. -- Tell the compiler not to look up any function name in the V-Table when calling `Bar()` by using [qualified name lookup](https://en.cppreference.com/w/cpp/language/qualified_lookup). For this, you tell the compiler which function you want by spelling out the class scope in which the function should be found e.g. `this->Derived::Bar();`. This behavior is specified in the C++ Standard. Nevertheless, some backends are not fully compliant to the Standard. -- Changing the `override` to `final` on the `Bar()` in the `Derived` class. This tells the compiler `Bar()` is not changing in derived objects. Many compilers do use this in optimization and deduce which function to call without the V-Table. Nevertheless, this might only work with certain compilers, as this effect of adding `final` is not specified in the C++ Standard. +- Tell the compiler not to look up any function name in the Vtable when calling `Bar()` by using [qualified name lookup](https://en.cppreference.com/w/cpp/language/qualified_lookup). For this, you tell the compiler which function you want by spelling out the class scope in which the function should be found e.g. `this->Derived::Bar();`. This behavior is specified in the C++ Standard. Nevertheless, some backends are not fully compliant to the Standard. +- Changing the `override` to `final` on the `Bar()` in the `Derived` class. This tells the compiler `Bar()` is not changing in derived objects. Many compilers do use this in optimization and deduce which function to call without the Vtable. Nevertheless, this might only work with certain compilers, as this effect of adding `final` is not specified in the C++ Standard. - Similarly, the entire derived class `Implementation` can be marked `final`. This is compiler dependent too, for the same reasons. ## Questions/Follow-up From 1e1440e6ece117bfd3ef2f62f6048c3c8d59e12d Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Thu, 18 Jul 2024 14:20:26 +0200 Subject: [PATCH 06/19] Minor improvements --- .../Kokkos-and-Virtual-Functions.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index e48589d3e..3afbf39d8 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -9,7 +9,9 @@ Due to oddities of GPU programming, the use of virtual functions in Kokkos paral Please note that virtual functions can be executed on device for the following backends: - Cuda; and -- HIP. +- HIP (with a limitation, as explained at the end). + +Especially, SYCL 2020 [cannot handle virtual functions](https://github.com/AlexeySachkov/llvm/blob/private/asachkov/virtual-functions-extension-spec/sycl/doc/design/VirtualFunctions.md?rgh-link-date=2024-07-16T15%3A15%3A11Z). ## The Problem @@ -43,11 +45,11 @@ int main(int argc, char *argv[]) } ``` -This code is more complex to accelerate than it looks like. +This code is more complex to port on GPU than it looks like. Using a straightforward approach, we would accelerate the `for` loop with `parallel_for` and copy `instance` on the GPU memory (not disclosing how for now). Then, we would call `Bar()` inside the `parallel_for`. At a glance this should be fine, but it will typically crash, however, because `instance` will call a host version of `Bar()`. -To understand why, you'll need to understand a bit about how virtual functions are implemented. +To understand why, we need to understand a bit about how virtual functions are implemented. ## Vtables, Vpointers, Very annoying with GPUs @@ -81,7 +83,7 @@ We faithfully copied all of the members of the class on the GPU memory, includin The problem here is that we are constructing the instance on the Host. If we were constructing it on the Device, we'd get the correct Vpointer, and thus the correct functions. -Note that this would allow to call virtual functions on the device only, not on the host. +Note that this would allow to call virtual functions on the device only, not on the host anymore. To that aim, we first allocate memory on the device, then construct on the device using a technique called *placement new* @@ -144,7 +146,7 @@ Remember that those virtual functions cannot be called on the host anymore! For a full working example, see [the example in the repo](https://github.com/kokkos/kokkos/blob/master/example/virtual_functions/main.cpp). -## What if I need a setter with host values? +## What if I need a setter that works with host values? The first problem people run into with this is that they want to initialize some fields based on host data, with a setter which is *not* a virtual function. Calling this setter on the device would crash if the host data is not easily copyable on the device (e.g. for a small array). @@ -201,7 +203,7 @@ int main(int argc, char *argv[]) ### Why is this not portable? Inside the `parallel_for`, `Bar()` is called. As `Derived` derives from the pure virtual class `Base`, the `Bar()` function is marked `override`. -On ROCm 5.2 this results in a memory access violation. +On ROCm 6.0 this results in a memory access violation. When executing the `this->Bar()` call, the runtime looks into the Vtable and dereferences a host function pointer on the device. ### But if that is the case, why does it work with NVCC? @@ -209,7 +211,7 @@ When executing the `this->Bar()` call, the runtime looks into the Vtable and der Notice, that the `parallel_for` is called from a pointer of type `Derived` and not a pointer of type `Base` pointing to an `Derived` object. Thus, no Vtable lookup for the `Bar()` would be necessary as it can be deduced from the context of the call that it will be `Derived::Bar()`. But here it comes down to how the compiler handles the lookup. NVCC understands that the call is coming from an `Derived` object and thinks: "Oh, I see, that you are calling from an `Derived` object, I know it will be the `Bar()` in this class scope, I will do this for you". -ROCm, on the other hand, sees your call and thinks "Oh, this is a call to a virtual method, I will look that up for you", failing to dereference the host function pointer in the host virtual function table. +ROCm, on the other hand, sees the call and thinks "Oh, this is a call to a virtual method, I will look that up for you", failing to dereference the host function pointer in the host virtual function table. ### How to solve this? Strictly speaking, the observed behavior on NVCC is an optimization that uses the context information to avoid the Vtable lookup. From cae3f87033e317e5205026c224a497fcedf13720 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Thu, 18 Jul 2024 15:39:36 +0200 Subject: [PATCH 07/19] Apply suggestions from Daniel Co-authored-by: Daniel Arndt --- .../Kokkos-and-Virtual-Functions.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 3afbf39d8..3906b946a 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -6,16 +6,16 @@ Using virtual functions in parallel regions is not a good idea in general. It of Due to oddities of GPU programming, the use of virtual functions in Kokkos parallel regions can be complicated. This document describes the problems you're likely to face, where they come from, and how to work around them. -Please note that virtual functions can be executed on device for the following backends: +Please note that virtual functions can be executed on the device for the following backends: - Cuda; and - HIP (with a limitation, as explained at the end). -Especially, SYCL 2020 [cannot handle virtual functions](https://github.com/AlexeySachkov/llvm/blob/private/asachkov/virtual-functions-extension-spec/sycl/doc/design/VirtualFunctions.md?rgh-link-date=2024-07-16T15%3A15%3A11Z). +Especially, SYCL 2020 [cannot handle virtual functions](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_overview). ## The Problem -In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following serial code to accelerate +In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following serial code ```c++ class Base { @@ -46,7 +46,7 @@ int main(int argc, char *argv[]) ``` This code is more complex to port on GPU than it looks like. -Using a straightforward approach, we would accelerate the `for` loop with `parallel_for` and copy `instance` on the GPU memory (not disclosing how for now). +Using a straightforward approach, we would replace the `for` loop with `parallel_for` and copy `instance` on the GPU memory (not disclosing how for now). Then, we would call `Bar()` inside the `parallel_for`. At a glance this should be fine, but it will typically crash, however, because `instance` will call a host version of `Bar()`. To understand why, we need to understand a bit about how virtual functions are implemented. @@ -73,7 +73,7 @@ The reason the straightforward approach described above fails is that when deali ![VTableDevice](./figures/VirtualFunctions-VTablesHostDevice.png) -Since we construct the class instance on the host, so it's Vpointer points to the host Vtable. +Since we construct the class instance on the host, its Vpointer points to the host Vtable. ![VPointerToHost](./figures/VirtualFunctions-VPointerToHost.png) @@ -133,8 +133,8 @@ int main(int argc, char *argv[]) We first use the `KOKKOS_FUNCTION` macro to make the methods callable from a kernel. When creating the instance, note that we introduce a distinction between the *memory* that the it uses, and the actual instantiated *object*. -The construct is done on the device, withing a single-iteration `parallel_for`, using placement new. -Since the kernel does not have a return type, we use a static cast to associate the object to the memory allocation. +The construct is done on the device, within a single-iteration `parallel_for`, using placement new. +Since the kernel does not have a return type, we use a static cast to associate the object with the memory allocation. Like with other uses of `new`, we need to later free the memory. The destruct is done on the device, again with a single-iteration `parallel_for`. From 9233ed5e88280b5c8eb8c58d85f766a7140058ad Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Fri, 19 Jul 2024 15:46:51 +0200 Subject: [PATCH 08/19] Apply first suggestions from Thomas Co-authored-by: Thomas Padioleau --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 3906b946a..9aa5fdafc 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -46,7 +46,7 @@ int main(int argc, char *argv[]) ``` This code is more complex to port on GPU than it looks like. -Using a straightforward approach, we would replace the `for` loop with `parallel_for` and copy `instance` on the GPU memory (not disclosing how for now). +Using a straightforward approach, we would annotate functions with `KOKKOS_FUNCTION`, replace the `for` loop with `parallel_for` and copy `instance` on the GPU memory (not disclosing how for now). Then, we would call `Bar()` inside the `parallel_for`. At a glance this should be fine, but it will typically crash, however, because `instance` will call a host version of `Bar()`. To understand why, we need to understand a bit about how virtual functions are implemented. @@ -136,8 +136,7 @@ When creating the instance, note that we introduce a distinction between the *me The construct is done on the device, within a single-iteration `parallel_for`, using placement new. Since the kernel does not have a return type, we use a static cast to associate the object with the memory allocation. -Like with other uses of `new`, we need to later free the memory. -The destruct is done on the device, again with a single-iteration `parallel_for`. +The destructor must explicitly be called from the device, again with a single-iteration `parallel_for`. Then the memory allocation can be release with `kokkos_free`. This code is extremely ugly, but leads to functional virtual function calls on the device. The Vpointer now points to the device Vtable. Remember that those virtual functions cannot be called on the host anymore! From 835cf2d9afbdef9ccd197234ee260cf4c4ee9cfc Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Fri, 19 Jul 2024 15:47:38 +0200 Subject: [PATCH 09/19] Add a line to warn that SharedSpace doesn't solve the Vpointer/Vtable problem --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 9aa5fdafc..1bf0256e5 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -161,6 +161,8 @@ deviceInstance->setAField(someHostValue); // set on host The setter is still called on the host. Beware that this is only valid for backends that support `SharedSpace`. +Keep in mind that, despite using a "unified" `SharedSpace`, you still have to resort to placement new in order to have the correct Vpointer and hence Vtable on the device! + ## But what if I do not really need the Vtables on the device side? Consider the following example which calls the virtual function `Bar()` on the device from a pointer of derived class type. From e2b25615990cf7f7864f2b9c933a524ba31d9c10 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Mon, 22 Jul 2024 17:48:56 +0200 Subject: [PATCH 10/19] Use void* for the return type of kokkos_malloc --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 1bf0256e5..51a2032bd 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -109,7 +109,7 @@ int main(int argc, char *argv[]) Kokkos::ScopeGuard kokkos(argc, argv); // create - auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate memory on device + void* deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate memory on device Kokkos::parallel_for("initialize", 1, KOKKOS_LAMBDA (const int i) { new (static_cast(deviceInstanceMemory)) Derived(); // initialize on device }); @@ -153,7 +153,7 @@ The most productive solution we've found in these cases is to allocate the insta ```c++ // create -auto deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate on shared space +void* deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate on shared space // ... deviceInstance->setAField(someHostValue); // set on host ``` From 8dc084b2da12c644aa040978d2dee40d26265b99 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Wed, 24 Jul 2024 10:36:24 +0200 Subject: [PATCH 11/19] Rephrase restriction to set field on device --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 51a2032bd..de097b48e 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -147,8 +147,8 @@ For a full working example, see [the example in the repo](https://github.com/kok ## What if I need a setter that works with host values? -The first problem people run into with this is that they want to initialize some fields based on host data, with a setter which is *not* a virtual function. -Calling this setter on the device would crash if the host data is not easily copyable on the device (e.g. for a small array). +The first problem people run into with this is when they want to initialize some fields based on host data, using a setter which is *not* a virtual function. +Calling this setter on the device would crash if the host data doesn't form a valid type on the device: especially, if the data type is not copyable on the device (e.g. for a pointer), or if the data instance is invalid on the device. The most productive solution we've found in these cases is to allocate the instance in `SharedSpace`, initialize it on the device, and then fill in fields on the host ```c++ From 345f1c44d974f2aa12fa57e7818d1c6862f56062 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 3 Sep 2024 17:23:30 +0200 Subject: [PATCH 12/19] Apply suggestions from Thomas Co-authored-by: Thomas Padioleau --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index de097b48e..3163026cc 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -113,14 +113,12 @@ int main(int argc, char *argv[]) Kokkos::parallel_for("initialize", 1, KOKKOS_LAMBDA (const int i) { new (static_cast(deviceInstanceMemory)) Derived(); // initialize on device }); - Kokkos::fence("wait for initialize"); Base* deviceInstance = static_cast(deviceInstanceMemory); // declare on this memory // use Kokkos::parallel_for("myKernel", 10, KOKKOS_LAMBDA (const int i) { deviceInstance->Bar(); }); - Kokkos::fence(); // cleanup Kokkos::parallel_for("destroy", 1, KOKKOS_LAMBDA (const int i) { From 445f9958461845af8a6a76f66956011330c398d7 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 3 Sep 2024 17:55:42 +0200 Subject: [PATCH 13/19] Follow Damien's comments --- .../Kokkos-and-Virtual-Functions.md | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 3163026cc..1ce8e73b2 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -11,7 +11,7 @@ Please note that virtual functions can be executed on the device for the followi - Cuda; and - HIP (with a limitation, as explained at the end). -Especially, SYCL 2020 [cannot handle virtual functions](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_overview). + ## The Problem @@ -106,7 +106,8 @@ class Derived : public Base { int main(int argc, char *argv[]) { - Kokkos::ScopeGuard kokkos(argc, argv); + Kokkos::initialize(argc, argv); + { // create void* deviceInstanceMemory = Kokkos::kokkos_malloc(sizeof(Derived)); // allocate memory on device @@ -126,6 +127,9 @@ int main(int argc, char *argv[]) }); Kokkos::fence("wait for destroy"); Kokkos::kokkos_free(deviceInstanceMemory); // free + + } + Kokkos::finalize(); } ``` @@ -191,18 +195,22 @@ class Derived : Base { int main(int argc, char *argv[]) { - Kokkos::ScopeGuard kokkos(argc, argv); + Kokkos::initialize(argc, argv); + { - auto derivedPtr = std::make_shared(); + auto derivedPtr = std::make_unique(); derivedPtr->apply(); Kokkos::fence(); + + } + Kokkos::finalize(); } ``` ### Why is this not portable? Inside the `parallel_for`, `Bar()` is called. As `Derived` derives from the pure virtual class `Base`, the `Bar()` function is marked `override`. -On ROCm 6.0 this results in a memory access violation. +On ROCm (tested up to 6.0) this results in a memory access violation. When executing the `this->Bar()` call, the runtime looks into the Vtable and dereferences a host function pointer on the device. ### But if that is the case, why does it work with NVCC? From 7d7a2c7a2462ce5765265534bb4e0d1f101da51c Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Fri, 27 Sep 2024 11:26:39 +0200 Subject: [PATCH 14/19] Apply suggestions from code review Co-authored-by: JBludau <104908666+JBludau@users.noreply.github.com> --- .../Kokkos-and-Virtual-Functions.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 1ce8e73b2..a3b09b368 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -26,12 +26,14 @@ class Base { }; class Derived : public Base { + public: public: void Bar() override {} }; int main(int argc, char *argv[]) +{ // create Base* instance = new Derived(); @@ -67,7 +69,7 @@ Now that we know what the compiler is doing to implement virtual functions, we'l Credit: the content of this section is adapted from Pablo Arias [here](https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/). -## Then why the straightforward approach doesn't work? +## Then why doesn't the straightforward approach work? The reason the straightforward approach described above fails is that when dealing with GPU-compatible classes with virtual functions, there isn't one Vtable, but two. The first holds the host version of the virtual functions, while the second holds the device functions. @@ -81,8 +83,8 @@ We faithfully copied all of the members of the class on the GPU memory, includin ## Make it work -The problem here is that we are constructing the instance on the Host. -If we were constructing it on the Device, we'd get the correct Vpointer, and thus the correct functions. +The problem here is that we are constructing the instance on the host. +If we were constructing it on the device, we'd get the correct Vpointer, and thus the correct functions. Note that this would allow to call virtual functions on the device only, not on the host anymore. To that aim, we first allocate memory on the device, then construct on the device using a technique called *placement new* @@ -136,7 +138,7 @@ int main(int argc, char *argv[]) We first use the `KOKKOS_FUNCTION` macro to make the methods callable from a kernel. When creating the instance, note that we introduce a distinction between the *memory* that the it uses, and the actual instantiated *object*. The construct is done on the device, within a single-iteration `parallel_for`, using placement new. -Since the kernel does not have a return type, we use a static cast to associate the object with the memory allocation. +Since the kernel does not have a return type, we use a static cast to associate the object type with the memory allocation. The destructor must explicitly be called from the device, again with a single-iteration `parallel_for`. Then the memory allocation can be release with `kokkos_free`. @@ -149,8 +151,7 @@ For a full working example, see [the example in the repo](https://github.com/kok ## What if I need a setter that works with host values? -The first problem people run into with this is when they want to initialize some fields based on host data, using a setter which is *not* a virtual function. -Calling this setter on the device would crash if the host data doesn't form a valid type on the device: especially, if the data type is not copyable on the device (e.g. for a pointer), or if the data instance is invalid on the device. +The first problem people run into with this is when they want to set some fields based on host data. As the object instance resides in device memory, it might not be accessible by the host. But the fields can be set within a `parallel_for` on the device. Nevertheless, this requires that the lambda or functor that sets the fields on the device must have access to the host data. The most productive solution we've found in these cases is to allocate the instance in `SharedSpace`, initialize it on the device, and then fill in fields on the host ```c++ From 0b30a5f2a1a9a8db1553f926139e20cce8cbf60a Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Fri, 27 Sep 2024 11:36:25 +0200 Subject: [PATCH 15/19] Apply modified comments from Jakob --- .../Kokkos-and-Virtual-Functions.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index a3b09b368..32736f024 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -15,7 +15,7 @@ Please note that virtual functions can be executed on the device for the followi ## The Problem -In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following serial code +In GPU programming, you might have run into the bug of calling a host function from the device. A similar thing can happen for subtle reasons in code using virtual functions. Consider the following serial code: ```c++ class Base { @@ -26,14 +26,12 @@ class Base { }; class Derived : public Base { - public: public: void Bar() override {} }; -int main(int argc, char *argv[]) -{ +int main(int argc, char *argv[]) { // create Base* instance = new Derived(); @@ -87,7 +85,7 @@ The problem here is that we are constructing the instance on the host. If we were constructing it on the device, we'd get the correct Vpointer, and thus the correct functions. Note that this would allow to call virtual functions on the device only, not on the host anymore. -To that aim, we first allocate memory on the device, then construct on the device using a technique called *placement new* +Therefore, we first allocate memory on the device, then construct on the device using a technique called [*placement new*](https://en.cppreference.com/w/cpp/language/new#Placement_new): ```cpp #include @@ -137,10 +135,11 @@ int main(int argc, char *argv[]) We first use the `KOKKOS_FUNCTION` macro to make the methods callable from a kernel. When creating the instance, note that we introduce a distinction between the *memory* that the it uses, and the actual instantiated *object*. -The construct is done on the device, within a single-iteration `parallel_for`, using placement new. +The object instance is constructed on the device, within a single-iteration `parallel_for`, using [placement new](https://en.cppreference.com/w/cpp/language/new#Placement_new). Since the kernel does not have a return type, we use a static cast to associate the object type with the memory allocation. -The destructor must explicitly be called from the device, again with a single-iteration `parallel_for`. Then the memory allocation can be release with `kokkos_free`. +For not [trivially destructable](https://en.cppreference.com/w/cpp/language/destructor#Trivial_destructor) objects the destructor must explicitly be called on the device. +After destructing the object in a single-iteration `parallel_for`, the memory allocation can be finally release with `kokkos_free`. This code is extremely ugly, but leads to functional virtual function calls on the device. The Vpointer now points to the device Vtable. Remember that those virtual functions cannot be called on the host anymore! @@ -152,7 +151,7 @@ For a full working example, see [the example in the repo](https://github.com/kok ## What if I need a setter that works with host values? The first problem people run into with this is when they want to set some fields based on host data. As the object instance resides in device memory, it might not be accessible by the host. But the fields can be set within a `parallel_for` on the device. Nevertheless, this requires that the lambda or functor that sets the fields on the device must have access to the host data. -The most productive solution we've found in these cases is to allocate the instance in `SharedSpace`, initialize it on the device, and then fill in fields on the host +The most productive solution we've found in these cases is to allocate the object instance in `SharedSpace`, which allows to have the object constructed on the device, and then to set fields on the host: ```c++ // create From 431993a776aa546413998909373b040b1a5e6c37 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Fri, 27 Sep 2024 11:44:24 +0200 Subject: [PATCH 16/19] Small typo improvements --- .../ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 32736f024..4959df226 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -11,7 +11,7 @@ Please note that virtual functions can be executed on the device for the followi - Cuda; and - HIP (with a limitation, as explained at the end). - +Especially, SYCL 2020 [cannot handle virtual functions](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#architecture). ## The Problem @@ -65,11 +65,11 @@ Remember that we have one Vtable shared amongst all instances of a type. Each in Now that we know what the compiler is doing to implement virtual functions, we'll look at why it doesn't work with GPU's. -Credit: the content of this section is adapted from Pablo Arias [here](https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/). +Credit: the content of this section is adapted from [this article of Pablo Arias(https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/). ## Then why doesn't the straightforward approach work? -The reason the straightforward approach described above fails is that when dealing with GPU-compatible classes with virtual functions, there isn't one Vtable, but two. The first holds the host version of the virtual functions, while the second holds the device functions. +The reason why the straightforward approach described above fails is that when dealing with GPU-compatible classes with virtual functions, there isn't one Vtable, but two. The first holds the host version of the virtual functions, while the second holds the device functions. ![VTableDevice](./figures/VirtualFunctions-VTablesHostDevice.png) @@ -134,7 +134,7 @@ int main(int argc, char *argv[]) ``` We first use the `KOKKOS_FUNCTION` macro to make the methods callable from a kernel. -When creating the instance, note that we introduce a distinction between the *memory* that the it uses, and the actual instantiated *object*. +When creating the instance, note that we introduce a distinction between the *memory* that it uses, and the actual instantiated *object*. The object instance is constructed on the device, within a single-iteration `parallel_for`, using [placement new](https://en.cppreference.com/w/cpp/language/new#Placement_new). Since the kernel does not have a return type, we use a static cast to associate the object type with the memory allocation. From ce49bf36d1906dbcb3ce0e0d258f119beb426003 Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Fri, 27 Sep 2024 11:57:35 +0200 Subject: [PATCH 17/19] Small modifications --- .../Kokkos-and-Virtual-Functions.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 4959df226..7c942bcad 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -65,7 +65,7 @@ Remember that we have one Vtable shared amongst all instances of a type. Each in Now that we know what the compiler is doing to implement virtual functions, we'll look at why it doesn't work with GPU's. -Credit: the content of this section is adapted from [this article of Pablo Arias(https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/). +Credit: the content of this section is adapted from [this article of Pablo Arias](https://pabloariasal.github.io/2017/06/10/understanding-virtual-tables/). ## Then why doesn't the straightforward approach work? @@ -210,24 +210,24 @@ int main(int argc, char *argv[]) ### Why is this not portable? Inside the `parallel_for`, `Bar()` is called. As `Derived` derives from the pure virtual class `Base`, the `Bar()` function is marked `override`. -On ROCm (tested up to 6.0) this results in a memory access violation. +On ROCm (at least up to 6.0) this results in a memory access violation. When executing the `this->Bar()` call, the runtime looks into the Vtable and dereferences a host function pointer on the device. ### But if that is the case, why does it work with NVCC? -Notice, that the `parallel_for` is called from a pointer of type `Derived` and not a pointer of type `Base` pointing to an `Derived` object. +Notice that the `parallel_for` is called from a pointer of type `Derived` and not a pointer of type `Base` pointing to an `Derived` object. Thus, no Vtable lookup for the `Bar()` would be necessary as it can be deduced from the context of the call that it will be `Derived::Bar()`. But here it comes down to how the compiler handles the lookup. NVCC understands that the call is coming from an `Derived` object and thinks: "Oh, I see, that you are calling from an `Derived` object, I know it will be the `Bar()` in this class scope, I will do this for you". ROCm, on the other hand, sees the call and thinks "Oh, this is a call to a virtual method, I will look that up for you", failing to dereference the host function pointer in the host virtual function table. ### How to solve this? Strictly speaking, the observed behavior on NVCC is an optimization that uses the context information to avoid the Vtable lookup. -If the compiler does not apply this optimization, you can help in different ways by providing additional information. Unfortunately, none of these strategies is fully portable to all backends. +If the compiler does not apply this optimization, you can help in different ways by providing additional information. Unfortunately, none of these strategies are fully portable to all backends. -- Tell the compiler not to look up any function name in the Vtable when calling `Bar()` by using [qualified name lookup](https://en.cppreference.com/w/cpp/language/qualified_lookup). For this, you tell the compiler which function you want by spelling out the class scope in which the function should be found e.g. `this->Derived::Bar();`. This behavior is specified in the C++ Standard. Nevertheless, some backends are not fully compliant to the Standard. -- Changing the `override` to `final` on the `Bar()` in the `Derived` class. This tells the compiler `Bar()` is not changing in derived objects. Many compilers do use this in optimization and deduce which function to call without the Vtable. Nevertheless, this might only work with certain compilers, as this effect of adding `final` is not specified in the C++ Standard. -- Similarly, the entire derived class `Implementation` can be marked `final`. This is compiler dependent too, for the same reasons. +- Tell the compiler not to look up any function name in the Vtable when calling `Bar()` by using [qualified name lookup](https://en.cppreference.com/w/cpp/language/qualified_lookup). For this, you tell the compiler which function you want by spelling out the class scope in which the function should be found e.g. `this->Derived::Bar();`. This behavior is specified in the C++ standard. Nevertheless, some backends are not fully compliant to the standard. +- Changing the `override` to `final` on the `Bar()` in the `Derived` class. This tells the compiler `Bar()` is not changing in derived objects. Many compilers do use this in optimization and deduce which function to call without the Vtable. Nevertheless, this might only work with certain compilers, as this effect of adding `final` is not specified in the C++ standard. +- Similarly, the entire derived class `Derived` can be marked `final`. This is compiler dependent too, for the same reasons. ## Questions/Follow-up -This is intended to be an educational resource for our users. If something doesn't make sense, or you have further questions, you'd be doing us a favor by letting us know on [Slack](https://kokkosteam.slack.com) or [GitHub](https://github.com/kokkos/kokkos) +This is intended to be an educational resource for our users. If something doesn't make sense, or you have further questions, you'd be doing us a favor by letting us know on [Slack](https://kokkosteam.slack.com) or [GitHub](https://github.com/kokkos/kokkos). From a27b00edf0d15074eb2d79f3dfcc69fc5bf04e4a Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 1 Oct 2024 10:13:51 +0200 Subject: [PATCH 18/19] Update docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md Co-authored-by: Thomas Padioleau --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 7c942bcad..789ff7cdd 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -138,7 +138,7 @@ When creating the instance, note that we introduce a distinction between the *me The object instance is constructed on the device, within a single-iteration `parallel_for`, using [placement new](https://en.cppreference.com/w/cpp/language/new#Placement_new). Since the kernel does not have a return type, we use a static cast to associate the object type with the memory allocation. -For not [trivially destructable](https://en.cppreference.com/w/cpp/language/destructor#Trivial_destructor) objects the destructor must explicitly be called on the device. +For not [trivially destructible](https://en.cppreference.com/w/cpp/language/destructor#Trivial_destructor) objects the destructor must explicitly be called on the device. After destructing the object in a single-iteration `parallel_for`, the memory allocation can be finally release with `kokkos_free`. This code is extremely ugly, but leads to functional virtual function calls on the device. The Vpointer now points to the device Vtable. From ad8293ef1f4356cccd9af0237ea6129f552b8abf Mon Sep 17 00:00:00 2001 From: Paul Zehner Date: Tue, 1 Oct 2024 10:16:07 +0200 Subject: [PATCH 19/19] Remove fence before calling kokkos_free --- docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md index 789ff7cdd..b1c752770 100644 --- a/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md +++ b/docs/source/ProgrammingGuide/Kokkos-and-Virtual-Functions.md @@ -125,7 +125,6 @@ int main(int argc, char *argv[]) Kokkos::parallel_for("destroy", 1, KOKKOS_LAMBDA (const int i) { deviceInstance->~Base(); // destroy on device }); - Kokkos::fence("wait for destroy"); Kokkos::kokkos_free(deviceInstanceMemory); // free }