
Realm: segmentation fault at startup #1523

Closed
Tracked by #1032
mariodirenzo opened this issue Aug 18, 2023 · 29 comments
@mariodirenzo

I've noticed a regression in one of the CI tests of HTR. In particular, the same code version works fine on commit 707b7479e of control_replication and fails with a segmentation fault at startup on 48d22ecfe. If I revert e42aec2e with a git revert -m 1 e42aec2e, the error goes away.

The error can be reproduced on sapling2 with the following steps on the login node:

  • source /home/mariodr/load_htr3.sh
  • cd /home/mariodr/htr3/solverTests/Speelman_DV250
  • rm -rf slurm-* sample0;DEBUG=1 ../../prometeo.sh -i 1x1x1.json

A job will be submitted to the queue and will hang on one of the gpu nodes.

Legion is installed at /home/mariodr/legion3 and is compiled with DEBUG=1 srun -N 1 --exclusive -p gpu /home/mariodr/legion3/language/scripts/setup_env.py --prefix /home/mariodr/legion/language/

@elliottslaughter, can you please add this issue to #1032 with high priority?

@lightsighter (Contributor)

Assigning @eddy16112 since these are the changes for the new programmable machine model in Realm.

@lightsighter (Contributor)

@mariodirenzo This doesn't look like a Realm crash, it looks like a segfault in a Regent-generated task:

(gdb) bt
#0  0x00007fefd4dc623f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fefb254a920, 
    rem=rem@entry=0x7fefb254a920) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007fefd4dcbec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7fefb254a920, 
    remaining=remaining@entry=0x7fefb254a920) at nanosleep.c:27
#2  0x00007fefd4dcbdfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007fefe0f6ca99 in Realm::realm_freeze (signal=11) at /home/mariodr/legion3/runtime/realm/runtime_impl.cc:200
#4  <signal handler called>
#5  0x000055a0fba5f2b8 in $<workSingle> ()
#6  0x000055a0fba5f27d in $__regent_task_workSingle_primary () at /home/mariodr/legion3/language/src/regent/std_base.t:1214
#7  0x00007fefe12e8dd4 in Realm::LocalTaskProcessor::execute_task (this=0x55a108262880, func_id=311, task_args=...)
    at /home/mariodr/legion3/runtime/realm/proc_impl.cc:1175
#8  0x00007fefe10ee57a in Realm::Task::execute_on_processor (this=0x55a10be9b910, p=...)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:326
#9  0x00007fefe10f2964 in Realm::KernelThreadTaskScheduler::execute_task (this=0x55a108262c20, task=0x55a10be9b910)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:1421
#10 0x00007fefe10f16ae in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55a108262c20)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:1160
#11 0x00007fefe10f1cff in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55a108262c20)
    at /home/mariodr/legion3/runtime/realm/tasks.cc:1272
#12 0x00007fefe10fa3a6 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/threads.inl:97
#13 0x00007fefe10c5e9e in Realm::KernelThread::pthread_entry (data=0x7fefb0004d90)
    at /home/mariodr/legion3/runtime/realm/threads.cc:781
#14 0x00007fefd4cc4609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x00007fefd4e08133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Do you think the code being emitted by Terra is wrong? If so we're probably going to need @elliottslaughter to tell us how to look at that.

@lightsighter (Contributor)

FWIW, there's evidence that at least one other Terra-generated task in the same program started just fine:

Thread 37 (Thread 0x7fefbab57000 (LWP 287728)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fefe10bc78b in Realm::Doorbell::wait_slow (this=0x7fefbab56b40) at /home/mariodr/legion3/runtime/realm/mutex.cc:304
#2  0x00007fefe0fb0002 in Realm::Doorbell::wait (this=0x7fefbab56b40) at /home/mariodr/legion3/runtime/realm/mutex.inl:81
#3  0x00007fefe10bdd48 in Realm::FIFOCondVar::wait (this=0x7fefbab4a8f0) at /home/mariodr/legion3/runtime/realm/mutex.cc:1084
#4  0x00007fefe10f2b60 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x55a108262c20, switch_to=0x7fefb0004d90) at /home/mariodr/legion3/runtime/realm/tasks.cc:1469
#5  0x00007fefe10f08ed in Realm::ThreadedTaskScheduler::thread_blocking (this=0x55a108262c20, thread=0x55a10be5a750) at /home/mariodr/legion3/runtime/realm/tasks.cc:963
#6  0x00007fefe12c84be in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7fefbab4af87: false) at /home/mariodr/legion3/runtime/realm/threads.inl:218
#7  0x00007fefe12b6c68 in Realm::Event::wait_faultaware (this=0x7fefbab4af88, poisoned=@0x7fefbab4af87: false) at /home/mariodr/legion3/runtime/realm/event_impl.cc:244
#8  0x00007fefe00144ae in Legion::Internal::LgEvent::wait_faultaware (this=0x7fefbab4af88, poisoned=@0x7fefbab4af87: false) at /home/mariodr/legion3/runtime/legion/legion_types.h:3069
#9  0x00007fefe0797364 in Legion::Internal::FutureImpl::wait (this=0x7fefb0001130, silence_warnings=false, warning_string=0x0) at /home/mariodr/legion3/runtime/legion/runtime.cc:991
#10 0x00007fefdfffbc56 in Legion::Future::get_void_result (this=0x7fefb00068b0, silence_warnings=false, warning_string=0x0) at /home/mariodr/legion3/runtime/legion/legion.cc:2389
#11 0x00007fefe004df75 in legion_future_get_void_result (handle_=...) at /home/mariodr/legion3/runtime/legion/legion_c.cc:3198
#12 0x000055a0fbdce5e2 in $<main> () at /home/mariodr/legion3/language/src/regent/codegen.t:10023
#13 0x000055a0fbdcdfdd in $__regent_task_main_primary () at /home/mariodr/legion3/language/src/regent/std_base.t:1214
#14 0x00007fefe12e8dd4 in Realm::LocalTaskProcessor::execute_task (this=0x55a108262880, func_id=313, task_args=...) at /home/mariodr/legion3/runtime/realm/proc_impl.cc:1175
#15 0x00007fefe10ee57a in Realm::Task::execute_on_processor (this=0x55a10c062390, p=...) at /home/mariodr/legion3/runtime/realm/tasks.cc:326
#16 0x00007fefe10f2964 in Realm::KernelThreadTaskScheduler::execute_task (this=0x55a108262c20, task=0x55a10c062390) at /home/mariodr/legion3/runtime/realm/tasks.cc:1421
#17 0x00007fefe10f16ae in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/tasks.cc:1160
#18 0x00007fefe10f1cff in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/tasks.cc:1272
#19 0x00007fefe10fa3a6 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55a108262c20) at /home/mariodr/legion3/runtime/realm/threads.inl:97
#20 0x00007fefe10c5e9e in Realm::KernelThread::pthread_entry (data=0x55a10be5a750) at /home/mariodr/legion3/runtime/realm/threads.cc:781
#21 0x00007fefd4cc4609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fefd4e08133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@mariodirenzo (Author)

Whatever it is, it goes away after reverting commit e42aec2e.

@elliottslaughter (Contributor)

@mariodirenzo, I'm trying to figure out the exact state of your Legion repo with and without the commit reverted.

Just so we're all on the same page, could you prepare a branch where (a) the revert commit is present (that should pass) and (b) the commit immediately prior fails?

The only thing that changed recently in Regent is the addition of Terra binaries in setup_env.py. If you want to double check that, you can pass --no-terra-binary to be extra sure you aren't getting the binary. That is assuming your reproduction branch is on a commit new enough to have that. My understanding is the Realm config support merged in before my change to setup_env.py.

@elliottslaughter (Contributor)

Just FYI, there are no other recent changes to Regent aside from the setup_env.py change (in the time frame of the commits @mariodirenzo is talking about).

It's always possible for there to be a timing difference that exposes a bug. However, in this case I suspect the reverted merge is much more relevant. We've hit multiple Realm regressions in the last week, and it has become increasingly clear that our existing CI infrastructure just isn't sufficient.

@eddy16112 (Contributor)

@lightsighter How do you run the program? I followed the steps, but I cannot find the exact command to run the program.

@mariodirenzo (Author)

could you prepare a branch that (a) has the revert commit (that should pass)

You can find it at /home/mariodr/legion4.
It is built with DEBUG=1 srun -N 1 --exclusive -p gpu /home/mariodr/legion4/language/scripts/setup_env.py --prefix /home/mariodr/legion/language/.
If you export LEGION_DIR=/home/mariodr/legion4 and then you run the code, it works fine.

the commit immediately prior fails?

707b7479e and 48d22ecfe are successive commits on control_replication. If you want, I can switch to master and bisect further to see if the problem reproduces there.

How do you run the program?

@eddy16112, rm -rf slurm-* sample0;DEBUG=1 ../../prometeo.sh -i 1x1x1.json should run the program through Slurm. The job will be assigned to one of the GPU nodes.

@eddy16112 (Contributor)

@mariodirenzo I copied /home/mariodr/htr3/solverTests/Speelman_DV250 into my directory, and here is what I have in the slurm-* output file:

(base) wwu@sapling2:/scratch2/wwu/htr3/solverTests/Speelman_DV250$ cat slurm-5414.out 
Sending output to .
/scratch2/wwu/htr3/jobscripts/jobscript_shared.sh: line 108: USE_OPENMP: unbound variable

@mariodirenzo (Author)

Can you retry after sourcing /home/mariodr/load_htr3.sh again?

@eddy16112 (Contributor)

Yes, I had modified it to:

export LEGION_DIR="/home/mariodr/legion3"
export HTR_DIR="/home/mariodr/htr3"
alias make_legion3="$LEGION_DIR/language/scripts/setup_env.py --prefix /home/mariodr/legion/language"

@mariodirenzo (Author)

The load script should look like this:

(base) mariodr@sapling2:~$ cat load_htr3.sh
# Module loads
module load cmake/3.26.3
module load cuda/11.7
module load mpi/openmpi/4.1.5
module load slurm/23.02.1
# Build config
export CONDUIT=ibv
export CC=gcc
export CXX=g++
# Path setup
export LEGION_DIR=/home/mariodr/legion3
export HDF_ROOT="$LEGION_DIR"/language/hdf/install
export HTR_DIR=/home/mariodr/htr3
#export SCRATCH=/scratch/oldhome/`whoami`
# CUDA config
export CUDA_HOME=/usr/local/cuda-11.7
export CUDA="$CUDA_HOME"
export GPU_ARCH=pascal

# FFTW config
export FFTW_ROOT=/home/mariodr/fftw-3.3.8_install
export LD_LIBRARY_PATH=/home/mariodr/fftw-3.3.8_install/lib:$LD_LIBRARY_PATH

export USE_CUDA=1
export USE_OPENMP=1
export USE_GASNET=1
export USE_HDF=1
export USE_FFTW=1
export MAX_DIM=3
export REALM_BACKTRACE=1

@eddy16112 (Contributor)

Now I am getting an error: /home/mariodr/htr3/src/prometeo_CH4_43SpIonsMix.exec: error while loading shared libraries: libhdf5.so.101: cannot open shared object file: No such file or directory. Apparently HDF5 should be installed into "$LEGION_DIR"/language/hdf/install, but it is not there.

@lightsighter (Contributor)

FWIW, the reproducer doesn't work for me anymore either:

Sending output to .
Invoking Legion on 1 rank(s), 1 node(s) (1 rank(s) per node), as follows:
/home/mariodr/htr3/src/prometeo_CH4_43SpIonsMix.exec -i 1x1x1.json -ll:force_kthreads -logfile ./%.log -lg:safe_ctrlrepl 2 -ll:cpu 1 -ll:ocpu 2 -ll:onuma 1 -ll:othr 9 -ll:ostack 8 -ll:gpu 4 -ll:fsize 14000 -ll:zsize 512 -ll:ib_zsize 512 -ll:util 4 -ll:io 4 -ll:bgwork 4 -ll:cpu_bgwork 100 -ll:util_bgwork 100 -ll:csize 220000 -lg:eager_alloc_percentage 30 -ll:rsize 512 -ll:ib_rsize 512 -ll:gsize 0 -ll:stacksize 8 -lg:sched -1 -lg:hysteresis 0
/home/mariodr/htr3/src/prometeo_CH4_43SpIonsMix.exec: error while loading shared libraries: libhdf5.so.101: cannot open shared object file: No such file or directory

@mariodirenzo (Author)

You just need to change export HDF_ROOT="$LEGION_DIR"/language/hdf/install into export HDF_ROOT=/home/mariodr/legion/language/hdf/install.

@lightsighter (Contributor)

The segfault happens right in the first few instructions of the function:

Dump of assembler code for function $<workSingle>:
   0x0000557420f6a2a0 <+0>:     push   %rbp
   0x0000557420f6a2a1 <+1>:     mov    %rsp,%rbp
   0x0000557420f6a2a4 <+4>:     push   %r15
   0x0000557420f6a2a6 <+6>:     push   %r14
   0x0000557420f6a2a8 <+8>:     push   %r13
   0x0000557420f6a2aa <+10>:    push   %r12
   0x0000557420f6a2ac <+12>:    push   %rbx
   0x0000557420f6a2ad <+13>:    and    $0xffffffffffffffe0,%rsp
   0x0000557420f6a2b1 <+17>:    sub    $0x6102a0,%rsp
=> 0x0000557420f6a2b8 <+24>:    mov    %rdx,0x168(%rsp)
   0x0000557420f6a2c0 <+32>:    mov    %rsi,0x178(%rsp)
   0x0000557420f6a2c8 <+40>:    mov    %rdi,%r15
...

The register values look OK, but the stack address being written is clearly not accessible:

(gdb) info registers rsp
rsp            0x7ff8bc912e00      0x7ff8bc912e00
(gdb) p *(long*)(0x7ff8bc912e00+0x168)
Cannot access memory at address 0x7ff8bc912f68

There are really only two possibilities:

  1. The compiler generated bad code. (Seems unlikely since we see another Regent-generated function running.)
  2. Realm messed up the arguments being passed into the function or the stack for the task.

@elliottslaughter (Contributor)

Isn't this a stack overflow? We just subtracted 0x6102a0 (a bit over 6 MiB) off of the stack pointer, and the first store after that point is what fails.

@elliottslaughter (Contributor)

Is it possible that Wei's change caused the stack size not to be parsed properly? I know HTR needs a larger than usual stack size to run correctly. And since that code just got modified....

@eddy16112 (Contributor)

Looks like the stack size (-ll:stacksize) is not parsed correctly, but I do not understand why. Here is how I parsed it:

  void CoreModuleConfig::configure_from_cmdline(std::vector<std::string>& cmdline)
  {
    assert(finish_configured == false);
    // parse command line arguments
    CommandLineParser cp;
    cp.add_option_int("-ll:cpu", num_cpu_procs)
      .add_option_int("-ll:util", num_util_procs)
      .add_option_int("-ll:io", num_io_procs)
      .add_option_int("-ll:concurrent_io", concurrent_io_threads)
      .add_option_int_units("-ll:csize", sysmem_size, 'm')
      .add_option_int_units("-ll:stacksize", stack_size, 'm', true /*binary*/, true /*keep*/)
      .add_option_bool("-ll:pin_util", pin_util_procs)
      .add_option_int("-ll:cpu_bgwork", cpu_bgwork_timeslice)
      .add_option_int("-ll:util_bgwork", util_bgwork_timeslice)
      .add_option_int("-ll:ext_sysmem", use_ext_sysmem)
      .parse_command_line(cmdline);
    printf("cpus %d, stack size %lu\n", num_cpu_procs, stack_size);
  }

When I run any Realm example with -ll:cpu 3 -ll:stacksize 4, I get this print:

cpus 3, stack size 2097152

The -ll:cpu value is correct, but the stack size is not: -ll:stacksize 4 should yield 4194304 (4 MiB, given the 'm' units), yet it stays at the 2097152-byte (2 MiB) default.

@eddy16112 (Contributor)

I found the problem: -ll:stacksize is parsed twice, once in the core module and once in RuntimeImpl, and I did not parse them in the correct order. I created a fix, https://gitlab.com/StanfordLegion/legion/-/merge_requests/888. @mariodirenzo, could you please cherry-pick that commit and see if it works for you?

@elliottslaughter (Contributor)

In addition to @mariodirenzo's confirmation, do we have any direct tests of the command-line parser in Realm? Because this is a subtle issue (i.e., only certain applications will actually crash if we don't parse this flag), it would be nice to directly test this, so we don't rely on user reports to discover future regressions.

@mariodirenzo (Author)

@eddy16112, your patch fixes the issue. Please let me know when it is merged into control_replication.

@eddy16112 (Contributor)

I will run the full CI tonight, and if it passes, I will merge it tomorrow.

@eddy16112 (Contributor)

In addition to @mariodirenzo's confirmation, do we have any direct tests of the command-line parser in Realm? Because this is a subtle issue (i.e., only certain applications will actually crash if we don't parse this flag), it would be nice to directly test this, so we don't rely on user reports to discover future regressions.

The problem is that Realm currently does not provide an API to query every command-line argument, so I am not sure how we can test it.

@lightsighter (Contributor)

So at least in this case, there is a way to test it: have a task that does a massive call to alloca that could only be satisfied if the -ll:stacksize argument had been properly passed and parsed by Realm. Maybe the test could also take a bunch of standard machine configuration flags and then check that the machine model matches the specification. Obviously it would be hard to test that every Realm flag was being parsed correctly, but we could probably at least get some of the more frequent ones and that would give us some decent test coverage.

@elliottslaughter (Contributor)

I agree with Mike. Testing alloca plus querying the machine model would be sufficient to start, and we can evolve the test if we get an ability to query the configuration directly later.

@eddy16112 (Contributor)

@mariodirenzo The patch has been pushed into the control replication branch.

@mariodirenzo (Author)

Thank you

@eddy16112 (Contributor)

@elliottslaughter @lightsighter I have updated the Realm unit test to better cover the command-line parser and the machine config API: https://gitlab.com/StanfordLegion/legion/-/merge_requests/904. Hopefully it will catch similar errors next time.
