
Realm: segmentation fault at startup #1523

Closed
Tracked by #1032
mariodirenzo opened this issue Aug 18, 2023 · 29 comments
@mariodirenzo

I've noticed a regression in one of the CI tests of HTR. In particular, the same code version works fine on commit 707b7479e of control_replication and fails with a segmentation fault at startup on 48d22ecfe. If I revert e42aec2e with a git revert -m 1 e42aec2e, the error goes away.

The error can be reproduced on sapling2 with the following steps on the login node:

  • source /home/mariodr/load_htr3.sh
  • cd /home/mariodr/htr3/solverTests/Speelman_DV250
  • rm -rf slurm-* sample0;DEBUG=1 ../../prometeo.sh -i 1x1x1.json

A job will be submitted to the queue and will hang on one of the gpu nodes.

Legion is installed at /home/mariodr/legion3 and is compiled with DEBUG=1 srun -N 1 --exclusive -p gpu /home/mariodr/legion3/language/scripts/setup_env.py --prefix /home/mariodr/legion/language/

@elliottslaughter, can you please add this issue to #1032 with high priority?

@lightsighter (Contributor)

Assigning @eddy16112 since these are the changes for the new programmable machine model in Realm.

@lightsighter (Contributor)

@mariodirenzo This doesn't look like a Realm crash, it looks like a segfault in a Regent-generated task:

(gdb) bt
#0  0x00007fefd4dc623f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fefb254a920, 
    rem=rem@entry=0x7fefb254a920) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007fefd4dcbec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7fefb254a920, 
    remaining=remaining@entry=0x7fefb254a920) at nanosleep.c:27
#2  0x00007fefd4dcbdfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007fefe0f6ca99 in Realm::realm_freeze (signal=11) at /home/mariodr/legion3/runtime/realm/runtime_impl.cc:200
#4  <signal handler called>
#5  0x000055a0fba5f2b8 in $<workSingle> ()
#6  0x000055a0fba5f27d in $__regent_task_workSingle_primary () at /home/mariodr/legion3/language/src/regent/std_base.t:1214
#7  0x00007fefe12e8dd4 in Realm::LocalTaskProcessor::execute_task (this=0x55a108262880, func_id=311, task_args=...)
    at /home/mariodr/legion3/runtime/realm/proc_impl.cc:1175
#8  0x00007fefe10ee57a in Realm::Task::execute_on_processor (this=0x55a10be9b910, p=...)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:326
#9  0x00007fefe10f2964 in Realm::KernelThreadTaskScheduler::execute_task (this=0x55a108262c20, task=0x55a10be9b910)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:1421
#10 0x00007fefe10f16ae in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55a108262c20)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:1160
#11 0x00007fefe10f1cff in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55a108262c20)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:1272
#12 0x00007fefe10fa3a6 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/threads.inl:97
#13 0x00007fefe10c5e9e in Realm::KernelThread::pthread_entry (data=0x7fefb0004d90)
    at /home/mariodr/legion3/runtime/realm/threads.cc:781
#14 0x00007fefd4cc4609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x00007fefd4e08133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Do you think the code being emitted by Terra is wrong? If so we're probably going to need @elliottslaughter to tell us how to look at that.

@lightsighter (Contributor)

FWIW, there's evidence that at least one other Terra-generated task in the same program started just fine:

Thread 37 (Thread 0x7fefbab57000 (LWP 287728)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fefe10bc78b in Realm::Doorbell::wait_slow (this=0x7fefbab56b40) at /home/mariodr/legion3/runtime/realm/mutex.cc:304
#2  0x00007fefe0fb0002 in Realm::Doorbell::wait (this=0x7fefbab56b40) at /home/mariodr/legion3/runtime/realm/mutex.inl:81
#3  0x00007fefe10bdd48 in Realm::FIFOCondVar::wait (this=0x7fefbab4a8f0) at /home/mariodr/legion3/runtime/realm/mutex.cc:1084
#4  0x00007fefe10f2b60 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x55a108262c20, switch_to=0x7fefb0004d90) at /home/mariodr/legion3/runtime/realm/tasks.cc:1469
#5  0x00007fefe10f08ed in Realm::ThreadedTaskScheduler::thread_blocking (this=0x55a108262c20, thread=0x55a10be5a750) at /home/mariodr/legion3/runtime/realm/tasks.cc:963
#6  0x00007fefe12c84be in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7fefbab4af87: false) at /home/mariodr/legion3/runtime/realm/threads.inl:218
#7  0x00007fefe12b6c68 in Realm::Event::wait_faultaware (this=0x7fefbab4af88, poisoned=@0x7fefbab4af87: false) at /home/mariodr/legion3/runtime/realm/event_impl.cc:244
#8  0x00007fefe00144ae in Legion::Internal::LgEvent::wait_faultaware (this=0x7fefbab4af88, poisoned=@0x7fefbab4af87: false) at /home/mariodr/legion3/runtime/legion/legion_types.h:3069
#9  0x00007fefe0797364 in Legion::Internal::FutureImpl::wait (this=0x7fefb0001130, silence_warnings=false, warning_string=0x0) at /home/mariodr/legion3/runtime/legion/runtime.cc:991
#10 0x00007fefdfffbc56 in Legion::Future::get_void_result (this=0x7fefb00068b0, silence_warnings=false, warning_string=0x0) at /home/mariodr/legion3/runtime/legion/legion.cc:2389
#11 0x00007fefe004df75 in legion_future_get_void_result (handle_=...) at /home/mariodr/legion3/runtime/legion/legion_c.cc:3198
#12 0x000055a0fbdce5e2 in $<main> () at /home/mariodr/legion3/language/src/regent/codegen.t:10023
#13 0x000055a0fbdcdfdd in $__regent_task_main_primary () at /home/mariodr/legion3/language/src/regent/std_base.t:1214
#14 0x00007fefe12e8dd4 in Realm::LocalTaskProcessor::execute_task (this=0x55a108262880, func_id=313, task_args=...) at /home/mariodr/legion3/runtime/realm/proc_impl.cc:1175
#15 0x00007fefe10ee57a in Realm::Task::execute_on_processor (this=0x55a10c062390, p=...) at /home/mariodr/legion3/runtime/realm/tasks.cc:326
#16 0x00007fefe10f2964 in Realm::KernelThreadTaskScheduler::execute_task (this=0x55a108262c20, task=0x55a10c062390) at /home/mariodr/legion3/runtime/realm/tasks.cc:1421
#17 0x00007fefe10f16ae in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/tasks.cc:1160
#18 0x00007fefe10f1cff in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/tasks.cc:1272
#19 0x00007fefe10fa3a6 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/threads.inl:97
#20 0x00007fefe10c5e9e in Realm::KernelThread::pthread_entry (data=0x55a10be5a750) at /home/mariodr/legion3/runtime/realm/threads.cc:781
#21 0x00007fefd4cc4609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fefd4e08133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@mariodirenzo (Author)

Whatever it is, it goes away after reverting commit e42aec2e.

@elliottslaughter (Contributor)

@mariodirenzo, I'm trying to figure out the exact state of your Legion repo with and without the commit reverted.

Just so we're all on the same page, could you prepare a branch where (a) the revert commit is present (that should pass) and (b) the commit immediately prior fails?

The only thing that changed recently in Regent is the addition of Terra binaries in setup_env.py. If you want to double check that, you can pass --no-terra-binary to be extra sure you aren't getting the binary. That is assuming your reproduction branch is on a commit new enough to have that. My understanding is the Realm config support merged in before my change to setup_env.py.

@elliottslaughter (Contributor)

Just FYI, there are no other recent changes to Regent aside from the setup_env.py change (in the time frame of the commits @mariodirenzo is talking about).

It's always possible for there to be a timing difference that exposes a bug. However, in this case I suspect the reverted merge is much more relevant. We've hit multiple Realm regressions in the last week, and it has become increasingly clear that our existing CI infrastructure just isn't sufficient.

@eddy16112 (Contributor)

@lightsighter How do you run the program? I followed the steps, but I cannot find the exact command to run the program.

@mariodirenzo (Author)

could you prepare a branch that (a) has the revert commit (that should pass)

You can find it at /home/mariodr/legion4.
It is built with DEBUG=1 srun -N 1 --exclusive -p gpu /home/mariodr/legion4/language/scripts/setup_env.py --prefix /home/mariodr/legion/language/.
If you export LEGION_DIR=/home/mariodr/legion4 and then you run the code, it works fine.

the commit immediately prior fails?

707b7479e and 48d22ecfe are successive commits on control_replication. If you want, I can switch to master and bisect further to see if the problem reproduces there.

How do you run the program?

@eddy16112, rm -rf slurm-* sample0;DEBUG=1 ../../prometeo.sh -i 1x1x1.json should run the program through Slurm. The job will be assigned to one of the GPU nodes.

@eddy16112 (Contributor)

@mariodirenzo I copied /home/mariodr/htr3/solverTests/Speelman_DV250 into my directory, and here is what I have in the slurm-* output file:

(base) wwu@sapling2:/scratch2/wwu/htr3/solverTests/Speelman_DV250$ cat slurm-5414.out 
Sending output to .
/scratch2/wwu/htr3/jobscripts/jobscript_shared.sh: line 108: USE_OPENMP: unbound variable

@mariodirenzo (Author)

Can you retry after sourcing /home/mariodr/load_htr3.sh again?

@eddy16112 (Contributor)

Yes, I had modified it to:

export LEGION_DIR="/home/mariodr/legion3"
export HTR_DIR="/home/mariodr/htr3"
alias make_legion3="$LEGION_DIR/language/scripts/setup_env.py --prefix /home/mariodr/legion/language"

@mariodirenzo (Author)

The load script should look like this:

(base) mariodr@sapling2:~$ cat load_htr3.sh
# Module loads
module load cmake/3.26.3
module load cuda/11.7
module load mpi/openmpi/4.1.5
module load slurm/23.02.1
# Build config
export CONDUIT=ibv
export CC=gcc
export CXX=g++
# Path setup
export LEGION_DIR=/home/mariodr/legion3
export HDF_ROOT="$LEGION_DIR"/language/hdf/install
export HTR_DIR=/home/mariodr/htr3
#export SCRATCH=/scratch/oldhome/`whoami`
# CUDA config
export CUDA_HOME=/usr/local/cuda-11.7
export CUDA="$CUDA_HOME"
export GPU_ARCH=pascal

# FFTW config
export FFTW_ROOT=/home/mariodr/fftw-3.3.8_install
export LD_LIBRARY_PATH=/home/mariodr/fftw-3.3.8_install/lib:$LD_LIBRARY_PATH

export USE_CUDA=1
export USE_OPENMP=1
export USE_GASNET=1
export USE_HDF=1
export USE_FFTW=1
export MAX_DIM=3
export REALM_BACKTRACE=1

@eddy16112 (Contributor)

Now I am getting an error: /home/mariodr/htr3/src/prometeo_CH4_43SpIonsMix.exec: error while loading shared libraries: libhdf5.so.101: cannot open shared object file: No such file or directory. Apparently HDF5 should be installed into "$LEGION_DIR"/language/hdf/install, but it is not there.

@lightsighter (Contributor)

FWIW, the reproducer doesn't work for me anymore either:

Sending output to .
Invoking Legion on 1 rank(s), 1 node(s) (1 rank(s) per node), as follows:
/home/mariodr/htr3/src/prometeo_CH4_43SpIonsMix.exec -i 1x1x1.json -ll:force_kthreads -logfile ./%.log -lg:safe_ctrlrepl 2 -ll:cpu 1 -ll:ocpu 2 -ll:onuma 1 -ll:othr 9 -ll:ostack 8 -ll:gpu 4 -ll:fsize 14000 -ll:zsize 512 -ll:ib_zsize 512 -ll:util 4 -ll:io 4 -ll:bgwork 4 -ll:cpu_bgwork 100 -ll:util_bgwork 100 -ll:csize 220000 -lg:eager_alloc_percentage 30 -ll:rsize 512 -ll:ib_rsize 512 -ll:gsize 0 -ll:stacksize 8 -lg:sched -1 -lg:hysteresis 0
/home/mariodr/htr3/src/prometeo_CH4_43SpIonsMix.exec: error while loading shared libraries: libhdf5.so.101: cannot open shared object file: No such file or directory

@mariodirenzo (Author)

You just need to change export HDF_ROOT="$LEGION_DIR"/language/hdf/install into export HDF_ROOT=/home/mariodr/legion/language/hdf/install.

@lightsighter (Contributor)

The segfault happens right in the first few instructions of the function:

Dump of assembler code for function $<workSingle>:
   0x0000557420f6a2a0 <+0>:     push   %rbp
   0x0000557420f6a2a1 <+1>:     mov    %rsp,%rbp
   0x0000557420f6a2a4 <+4>:     push   %r15
   0x0000557420f6a2a6 <+6>:     push   %r14
   0x0000557420f6a2a8 <+8>:     push   %r13
   0x0000557420f6a2aa <+10>:    push   %r12
   0x0000557420f6a2ac <+12>:    push   %rbx
   0x0000557420f6a2ad <+13>:    and    $0xffffffffffffffe0,%rsp
   0x0000557420f6a2b1 <+17>:    sub    $0x6102a0,%rsp
=> 0x0000557420f6a2b8 <+24>:    mov    %rdx,0x168(%rsp)
   0x0000557420f6a2c0 <+32>:    mov    %rsi,0x178(%rsp)
   0x0000557420f6a2c8 <+40>:    mov    %rdi,%r15
...

The register values look OK, but the stack address being written is clearly not accessible:

(gdb) info registers rsp
rsp            0x7ff8bc912e00      0x7ff8bc912e00
(gdb) p *(long*)(0x7ff8bc912e00+0x168)
Cannot access memory at address 0x7ff8bc912f68

There are really only two possibilities:

  1. The compiler generated bad code. (Seems unlikely since we see another Regent-generated function running.)
  2. Realm messed up the arguments being passed into the function or the stack for the task.

@elliottslaughter (Contributor)

Isn't this a stack overflow? We just subtracted 0x6102a0 (a bit over 6 MiB) off of the stack pointer, and the first store after that point is what fails.

@elliottslaughter (Contributor)

Is it possible that Wei's change caused the stack size not to be parsed properly? I know HTR needs a larger than usual stack size to run correctly. And since that code just got modified....

@eddy16112 (Contributor)

Looks like the stack size (-ll:stacksize) is not parsed correctly, but I do not understand why. Here is how I parsed it:

  void CoreModuleConfig::configure_from_cmdline(std::vector<std::string>& cmdline)
  {
    assert(finish_configured == false);
    // parse command line arguments
    CommandLineParser cp;
    cp.add_option_int("-ll:cpu", num_cpu_procs)
      .add_option_int("-ll:util", num_util_procs)
      .add_option_int("-ll:io", num_io_procs)
      .add_option_int("-ll:concurrent_io", concurrent_io_threads)
      .add_option_int_units("-ll:csize", sysmem_size, 'm')
      .add_option_int_units("-ll:stacksize", stack_size, 'm', true /*binary*/, true /*keep*/)
      .add_option_bool("-ll:pin_util", pin_util_procs)
      .add_option_int("-ll:cpu_bgwork", cpu_bgwork_timeslice)
      .add_option_int("-ll:util_bgwork", util_bgwork_timeslice)
      .add_option_int("-ll:ext_sysmem", use_ext_sysmem)
      .parse_command_line(cmdline);
    printf("cpus %d, stack size %lu\n", num_cpu_procs, stack_size);
  }

When I run any Realm example with -ll:cpu 3 -ll:stacksize 4, I get this print:

cpus 3, stack size 2097152

The -ll:cpu value is correct, but the stack size is not: -ll:stacksize 4 should yield 4194304 (4 MiB, given the 'm' units), yet it stays at the 2097152-byte (2 MiB) default.

@eddy16112 (Contributor)

I found the problem: -ll:stacksize is parsed twice, once in the core module and once in RuntimeImpl, and I did not parse them in the correct order. I created a fix, https://gitlab.com/StanfordLegion/legion/-/merge_requests/888. @mariodirenzo, could you please cherry-pick that commit and see if it works for you?

@elliottslaughter (Contributor)

In addition to @mariodirenzo's confirmation, do we have any direct tests of the command-line parser in Realm? Because this is a subtle issue (i.e., only certain applications will actually crash if we don't parse this flag), it would be nice to directly test this, so we don't rely on user reports to discover future regressions.

@mariodirenzo (Author)

@eddy16112, your patch fixes the issue. Please let me know when it is merged into control_replication.

@eddy16112 (Contributor)

I will run the full CI tonight, and if it passes, I will merge it tomorrow.

@eddy16112 (Contributor)

In addition to @mariodirenzo's confirmation, do we have any direct tests of the command-line parser in Realm? Because this is a subtle issue (i.e., only certain applications will actually crash if we don't parse this flag), it would be nice to directly test this, so we don't rely on user reports to discover future regressions.

The problem is that Realm currently does not provide an API to query every command-line argument, so I am not sure how we can test it.

@lightsighter (Contributor)

So at least in this case, there is a way to test it: have a task that does a massive call to alloca that could only be satisfied if the -ll:stacksize argument had been properly passed and parsed by Realm. Maybe the test could also take a bunch of standard machine configuration flags and then check that the machine model matches the specification. Obviously it would be hard to test that every Realm flag was being parsed correctly, but we could probably at least get some of the more frequent ones and that would give us some decent test coverage.

@elliottslaughter (Contributor)

I agree with Mike. Testing alloca plus querying the machine model would be sufficient to start, and we can evolve the test if we get an ability to query the configuration directly later.

@eddy16112 (Contributor)

@mariodirenzo The patch has been pushed into the control replication branch.

@mariodirenzo (Author)

Thank you

@eddy16112 (Contributor)

@elliottslaughter @lightsighter I have updated the Realm unit test to better cover the command-line parser and the machine config API: https://gitlab.com/StanfordLegion/legion/-/merge_requests/904. Hopefully it will catch similar errors next time.
