R: Out-of-memory failures on some machines
Relevant link: gramineproject/examples#40

On 20 September 2022, the periodically tested R workload started failing under the following conditions:

The surprising part was that the R workload failed only under this set of conditions: it didn't fail on the same machine on bare metal (outside a Docker container), it didn't fail in Ubuntu Docker containers, and it didn't fail on small machines (with 4-8 CPU cores).

It turns out that R was updated to version 4.2.1 Patched on 14 September 2022, which was exactly the date of our previous weekly run of the R workload tests. This R update explains conditions 2 and 3. The bare-metal OS on that machine was not updated, so it still had the old version of R. Ubuntu apparently had not yet put the updated R package into its repository, but CentOS 8 had. And because the Docker image with R is rebuilt before every test run, it picked up the updated R package.

So why did the updated R start failing with "out of memory"? Because something changed in R's OpenBLAS and OpenMP dependencies. Previously, R (through these dependencies) spawned a small number of threads; we saw 4 threads while debugging. The updated R now spawns, by default, as many threads as there are CPU cores, and pre-allocates 128MB for each thread. On a large machine with 100 CPU cores, this requires ~13GB of enclave memory, whereas our R manifest specifies an enclave size of only 2GB. This explains condition 1.

Solution: fortunately, OpenBLAS + OpenMP can limit the number of threads they spawn. Thus, to avoid running out of enclave memory, we simply set …
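A back-of-the-envelope sketch of the arithmetic behind condition 1. It uses only the figures from this note (128MB pre-allocated per thread, 2GB enclave size); the helper names and the optional thread cap are hypothetical, and the model ignores every other allocation R makes, so it only gives a lower bound on the memory needed.

```python
from typing import Optional

PER_THREAD_MB = 128          # per-thread pre-allocation reported in this note
ENCLAVE_SIZE_MB = 2 * 1024   # 2GB enclave size from the R manifest


def threads_spawned(num_cores: int, cap: Optional[int] = None) -> int:
    """Hypothetical model: the updated R spawns one thread per core unless a cap is set."""
    return num_cores if cap is None else min(num_cores, cap)


def preallocated_mb(num_threads: int) -> int:
    """Memory pre-allocated by the threading libraries, in MB."""
    return num_threads * PER_THREAD_MB


def fits_in_enclave(num_threads: int) -> bool:
    # Only the per-thread pre-allocations are counted: False means the enclave is
    # certainly too small, True does not guarantee that it is big enough.
    return preallocated_mb(num_threads) <= ENCLAVE_SIZE_MB


if __name__ == "__main__":
    for cores in (4, 8, 100):
        for cap in (None, 4):
            n = threads_spawned(cores, cap)
            print(f"cores={cores:3} cap={cap} threads={n:3} -> "
                  f"{preallocated_mb(n):5} MB, fits={fits_in_enclave(n)}")
```

With 100 cores and no cap this gives 12800MB (about 12.8GB, the ~13GB from the note), far above the 2GB enclave; with a cap, the requirement no longer grows with the core count.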
Get rid of
Illegal instruction during
[ This is a dump of random notes. Once we have collected a critical mass, we will reorganize them as FAQ entries or blog posts. ]
I decided to put each note in its own "Answer", so it is easier to read and navigate here.
The list of notes:
Failed to open Intel SGX device error