Non-deterministic hangs while running HTR #1167

Closed
Tracked by #1032
jmwang14 opened this issue Dec 4, 2021 · 65 comments

@jmwang14

jmwang14 commented Dec 4, 2021

I am running a 45-node simulation on Lassen using HTR and the 11/24 Legion commit f4f80752 of control_replication. The runtime hangs non-deterministically, usually after ~10-100 time steps. Attached are two backtraces from one of these hangs, taken several minutes apart. Legion is compiled with CXXFLAGS='-g' and the GASNet version is 2021.9.0. HTR is run with DEBUG=1.

bt-first.tar.gz
bt-second.tar.gz

@mariodirenzo

@elliottslaughter can you add this issue to #1032?
I do not know whether this is Realm- or Legion-related, but it is definitely top priority.

@lightsighter

Try running with -ll:defalloc 0 and see if it reproduces.

@jmwang14

jmwang14 commented Dec 5, 2021

It still hangs with -ll:defalloc 0

@lightsighter

Try with -lg:inorder and get backtraces if it hangs.

@lightsighter

Also which GASNet version and conduit are you using?

@lightsighter

Try the fix for #1070 as well

@streichler

Try the fix for #1070 as well

It looks like commit f4f80752 includes this fix already.

@jmwang14

jmwang14 commented Dec 5, 2021

I'll run with -ll:defalloc 0 -lg:inorder and get backtraces. GASNet version is 2021.9.0 and CONDUIT=ibv.

@jmwang14

jmwang14 commented Dec 6, 2021

Backtrace attached.
bt.tar.gz

@lightsighter

How far did this run make it before it hung? Can you also do a run with -lg:safe_ctrlrepl 1?

@jmwang14

jmwang14 commented Dec 6, 2021

On the previous run, it hung after the 9th time step. For the attached backtrace with -lg:safe_ctrlrepl 1, the runtime froze (rather than hanging) before the end of the first time step.

bt.tar.gz

@lightsighter

All signs point to this being a Realm or a GASNet hang. There is no rhyme or reason to these backtraces. Usually whenever it's Legion's fault for a hang, it will only happen without -lg:inorder and -lg:inorder will make the hang go away or make it completely deterministic. In these backtraces we don't see either. The -lg:inorder flag isn't producing any kind of deterministic execution or deterministic hang. I would put my money on GASNet. From now on do all runs with -lg:inorder and -lg:safe_ctrlrepl 1 unless instructed otherwise.

@streichler what realm flags do we need to use to check for outstanding GASNet messages still in flight? The other alternative would be to check for outstanding DMAs. There might be an indication that we could be getting stuck on lots of tiny DMA requests for future values.

@streichler

Let's look at DMA requests first, running with -level dma=2,xplan=1 and attaching the log files for every rank here please.

@jmwang14

jmwang14 commented Dec 7, 2021

Which log files are you referring to? 0.log, 1.log etc. are all empty for me.

I ran with -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 -level dma=2,xplan=1, and the runtime freezes before the first time step is complete. It seems to catch the following assertion error:

prometeo_O2OMix.exec: /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug/runtime/realm/event_impl.cc:1852: void Realm::GenEventImpl::trigger(Realm::EventImpl::gen_t, int, bool, Realm::TimeLimit): Assertion 'gen_triggered == (generation.load() + 1)' failed.

(The same assertion error was printed with just -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 — I had missed it earlier.)

@mariodirenzo

Which log files are you referring to?

If you run HTR with DEBUG=1 in your environment, the files 0.log, 1.log, etc. are generated for each of the nodes/ranks and are what Sean was asking for (provided that the code runs long enough to produce some logging lines).

@streichler

(The same assertion error was printed with just -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 — I had missed it earlier.)

Ok, let's run first with -level event=2 and see if we can determine whether there's a double-trigger of an event.

@jmwang14

jmwang14 commented Dec 8, 2021

Log files attached, with -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 -level event=2. It failed before the first time step was complete. No error message printed this time.

log-files.zip

@streichler

The -level event=2 was only going to be useful for that assert fail, so let's maybe run with both sets of logging (i.e. -level dma=2,xplan=1,event=2) so that we can try to debug whichever of the two ways it fails.

@jmwang14

jmwang14 commented Dec 9, 2021

Log files attached, using the flags -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 -level dma=2,xplan=1,event=2. This time it tripped the assertion error (again before the first time step):

prometeo_O2OMix.exec: /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug/runtime/realm/event_impl.cc:1852: void Realm::GenEventImpl::trigger(Realm::EventImpl::gen_t, int, bool, Realm::TimeLimit): Assertion 'gen_triggered == (generation.load() + 1)' failed.

log-files.tar.gz

@jmwang14

Hi Sean, have you had the opportunity to check out the log files? Please let me know if there is any other information I can provide. This is a critical issue for me, so I appreciate any time you're able to give it. I am avoiding the latest commit due to these hangs and resorting to an old commit that still has the distributed collectable error, but a large portion of those simulations fail to start, so it can slow things down quite a bit.

@lightsighter

Do we have backtraces yet for the double event triggers?

@jmwang14

What Legion flags do I use for that case? I can obtain backtraces for -level dma=2,xplan=1,event=2 if that is what you are referring to.

@jmwang14

Log files and backtraces attached, running with -level dma=2,xplan=1,event=2. Three nodes froze with the same assertion error:
prometeo_O2OMix.exec: /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug/runtime/realm/event_impl.cc:1852: void Realm::GenEventImpl::trigger(Realm::EventImpl::gen_t, int, bool, Realm::TimeLimit): Assertion 'gen_triggered == (generation.load() + 1)' failed.

bt.tar.gz
log-files.tar.gz

@lightsighter

This continues to look like a control replication violation. At least two different shards are disagreeing on who the owner is of an asynchronous collective operation.

These runs are all done with Legion having been built in debug mode? You can build with -O2 but Legion must have DEBUG_LEGION defined.

What happens if you run with -lg:safe_ctrlrepl 2?

@jmwang14

jmwang14 commented Jan 5, 2022

Hi Mike, I recompiled Legion with -O2 -DDEBUG_LEGION, and ran HTR with -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 2 -level dma=2,xplan=1,event=2. I don't have backtraces for this run, but I've attached log files. Error messages of the type

[27 - 7fffbb63f8b0]   12.766334 {5}{runtime}: [error 607] LEGION ERROR: Detected control replication violation when invoking future_from_value in task workSingle (UID 27) on shard 27. The hash summary for the function does not align with the hash summaries from other call sites. (from file /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug2/runtime/legion/legion_context.cc:12143)

were printed for a number of shards.

log-files.tar.gz

@elliottslaughter

elliottslaughter commented Jan 5, 2022

@jmwang14 This error message indicates a control replication violation, like @lightsighter suspected. This is an application bug.

Does your code contain a task named future_from_value? I can't find this string anywhere in the Regent compiler or Legion C API interface.

It would appear that you're either calling a different sequence of tasks, or perhaps calling this task with different values on different shards.
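
For illustration, here is a minimal C++ sketch of the kind of pattern that trips this check. This is hypothetical code, not HTR's (the task and variable names are invented): a replicated task feeds a shard-divergent value into Future::from_value, so the hash summaries computed under -lg:safe_ctrlrepl disagree across shards.

// Hypothetical sketch, not HTR code: a replicated task body that would
// violate control replication because the value handed to the runtime
// can differ from shard to shard.
#include <chrono>
#include <vector>
#include "legion.h"

using namespace Legion;

void work_task(const Task *task,
               const std::vector<PhysicalRegion> &regions,
               Context ctx, Runtime *runtime)
{
  // Wall-clock time is node-local, so each shard generally computes a
  // different value here.
  double local_time = std::chrono::duration<double>(
      std::chrono::steady_clock::now().time_since_epoch()).count();

  // Under control replication every shard must make the same sequence of
  // runtime calls with the same arguments. This call hashes its argument,
  // and -lg:safe_ctrlrepl compares the hashes across shards, producing the
  // "hash summary does not align" error when they differ.
  Future f = Future::from_value(runtime, local_time);  // divergent value -> violation
  (void)f;
}

Typical culprits are anything node-local: wall-clock timers, rank-dependent or uninitialized data, random numbers, or the iteration order of unordered containers.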

@jmwang14

jmwang14 commented Jan 5, 2022

That string does not appear in the HTR source. However, I do see a future_from_value on line 11449 of $LEGION_DIR/runtime/legion/legion_context.cc, commit f4f80752.

@elliottslaughter

For posterity, this is the line in question (from a newer commit, but it should be the same line):

verify_replicable(hasher, "future_from_value");

That makes me suspect you're going through this API call:

legion_future_from_untyped_pointer(legion_runtime_t runtime,

Which would seem to indicate the application is creating a future.

At this point I think there are two debugging options:

  1. Get a backtrace with Regent debug symbols enabled, i.e., regent.py -g my_script_name.rg .... You need Regent symbols specifically because this is going to be an application bug and we want to know which line of code in the application is causing the issue.
  2. Printf debugging. You know the task that's causing the problem: workSingle. If you want, I can help you narrow down the possible places where a problem could be occurring. Then you'd go through and format.println all those values and see which one is diverging.

Let me know if you need any more help.
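
For the printf route, a throwaway helper along these lines can make the cross-shard diff easier. This is just a hypothetical C++ sketch (in Regent you would use format.println directly) and the names are made up:

// Hypothetical debugging helper: print a candidate value together with the
// hostname and PID so the output of different ranks can be compared after
// the run (e.g. grep by label and look for values that differ).
#include <cstdio>
#include <string>
#include <unistd.h>

template<typename T>
void dump_shard_value(const char *label, const T &value)
{
  char host[256] = {0};
  gethostname(host, sizeof(host) - 1);
  std::printf("[%s pid=%d] %s = %s\n", host, static_cast<int>(getpid()),
              label, std::to_string(value).c_str());
  std::fflush(stdout);
}

Calling something like dump_shard_value("dt", dt) at each suspect point (the label and variable are illustrative) and then diffing the per-rank output usually pinpoints the diverging value quickly.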

@jmwang14

jmwang14 commented Jan 6, 2022

I have attached backtraces after recompiling the code with regent.py -g .... They do not seem to show the failing line in task workSingle. The error message below was also printed from each of the 45 nodes (shard 28 shown):

[28 - 7fff8af8f8b0]   14.482024 {5}{runtime}: [error 607] LEGION ERROR: Detected control replication violation when invoking future_from_value in task workSingle (UID 28) on shard 28. The hash summary for the function does not align with the hash summaries from other call sites. (from file /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug2/runtime/legion/legion_context.cc:12143)

bt.tar.gz

@elliottslaughter

I'm going to need to see more of the build output to know what step is failing and why.

None of this has changed in a long time, so unless you've done something on your end, I don't see why this would be failing now.

@lightsighter

I agree with @elliottslaughter, that seems like a build issue. I would try doing a clean build. The change from yesterday passed through our CI infrastructure just fine and was a minor change that modified just a few characters.

@jmwang14

This is with clean builds of both HTR and Legion. I've attached the output of make clean; make -j |& tee make.log in the HTR source, if it is helpful. All of the runs reported above were with f4f80752 of Legion, which was from 11/24/2021. Is there any chance some change since then could be leading to the murmur_hash3_32 error? I've also kept my local HTR branch frozen throughout.

make.log

@elliottslaughter

I think the problem is at 95b191b#diff-2225ef11b1f16dfc659dfc3a248ca9babe50c1aed760d2152050b38f396e1292R1042 , which appears to break -foffline 1.

Do you really need -foffline 1? It might be that you could work around with -fcuda-offline 1, which might be sufficient for what you're doing.

@mariodirenzo

Do you really need -foffline 1?

We had to switch to -foffline because of #956. Maybe -fcuda-offline would be sufficient for the particular setup that Jonathan is using, but it is not general enough.

@magnatelee

A fix has been merged into control_replication. @jmwang14 can you pull and try again?

@jmwang14

With the latest commit f2c5fc66 on control_replication, the simulation hangs during startup with no error message. Log files are attached. I kept all the previous flags: -DDEBUG_LEGION for building Legion, regent.py -g ... for building HTR, and -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 2 -level dma=2,xplan=1,event=2 for running HTR.

log-files.tar.gz

@lightsighter

When you say that this hung during startup, do you mean that it hung before starting the top-level task, or that it hung while the application was doing its startup? Most of these log files are so short that I would find it hard to believe the application had even started yet. My prior probability that something is wrong with either your system configuration or your GASNet install is growing considerably. We haven't even made it to a place where we're running dma operations yet; there's not a single one in the logs. Everything is just event work so far, and such a tiny amount of it that it would be hard for anything to have happened beyond initial Legion and Realm startup. That code is so well polished (literally every Legion program running on every machine we've ever run on has touched it) that I would be surprised if there was an issue there.

@jmwang14

I've tried two other tests:
(1) Using -ll:bgwork 1 at runtime
(2) Compiling Legion with REALM_NETWORKS="gasnetex" (and therefore GASNet-2021.9.0), and using -ll:bgwork 1 at runtime.

Both lead to the same behavior. I'm not sure where exactly the code is hanging, but the application does start up. It goes through a few initial operations like creating the output directory and getting the wall time, but stops before it sets up the mesh and flow variables. It is hanging earlier than the previous hangs.

@mariodirenzo I see the sample0 directory but without console.txt — I think this means that initSingle must've completed but it is hanging on or before the second line of SIM.DeclSymbols.

log-files.tar.gz

@lightsighter

Can you get backtraces from all the nodes for the early hang when running with -ll:force_kthreads?

@jmwang14

Log files and backtraces attached.

bt.tar.gz

log-files.tar.gz

@lightsighter

Here is a backtrace that is a pretty damning indictment of either Realm or GASNet:

Thread 14 (Thread 0x7fce29eff8b0 (LWP 39654)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00007ffff6e87de0 in Realm::Doorbell::wait_slow (this=0x7fce29f00038) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/mutex.cc:241
#2  0x00007ffff6e89278 in wait (this=0x7fce29f00038) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/mutex.inl:81
#3  Realm::UnfairCondVar::wait (this=0x7fce29efd850) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/mutex.cc:874
#4  0x00007ffff6eb3078 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x181bbd50, switch_to=0x7fbf70007b10) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/tasks.cc:1376
#5  0x00007ffff6eb473c in Realm::ThreadedTaskScheduler::thread_blocking (this=0x181bbd50, thread=<optimized out>) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/tasks.cc:901
#6  0x00007ffff700aaa8 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7fce29efe168: false) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/threads.inl:218
#7  0x00007ffff6ff9598 in Realm::Event::wait_faultaware (this=0x7fce29efe270, poisoned=@0x7fce29efe168: false) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/event_impl.cc:254
#8  0x00007ffff6ff980c in Realm::Event::wait (this=<optimized out>) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/event_impl.cc:206
#9  0x00000000104e102c in Legion::Internal::LgEvent::wait() const ()
#10 0x00007ffff699bc18 in Legion::Internal::Runtime::find_messenger (this=0x18ac1ac0, sid=<optimized out>) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/legion/runtime.cc:20972

Legion is waiting for the response message from a remote node that establishes an endpoint connection between a pair of nodes. This only needs to happen once between each pair of nodes (in each direction), so once it is set up it never needs to happen again during the run. The send/receive pair of messages used to set up these endpoints are incredibly simple:

https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/runtime.cc?expanded=true&viewer=simple#L21008-21047

Both the send and receive Realm tasks are launched with no preconditions. If there were bugs in this code, we would be in a terrible place for all Legion applications. I can say with 99% certainty that the hang in this case is either in Realm or GASNet with a Realm task or a GASNet message getting lost.
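
To make the pattern concrete, here is an illustrative sketch of that handshake (this is not Legion's actual code, which is at the link above; duplicate-request handling and error paths are omitted). The first caller that needs a messenger for a peer sends the setup request and blocks until the peer's response handler installs the endpoint; the backtrace above is parked in exactly that kind of wait.

// Illustrative sketch only; the real implementation is in runtime.cc at the
// link above. Names are simplified.
#include <condition_variable>
#include <map>
#include <mutex>

struct Messenger { int peer; /* endpoint state lives here */ };

class EndpointTable {
  std::mutex lock;
  std::condition_variable ready;
  std::map<int, Messenger*> messengers;   // one per remote node, built lazily
public:
  Messenger *find_messenger(int peer) {
    std::unique_lock<std::mutex> l(lock);
    if (messengers.count(peer) == 0) {
      send_endpoint_request(peer);        // fire-and-forget, no preconditions
      // Block until handle_endpoint_response() installs the messenger.
      // If the request or the response message is lost in the network layer,
      // nothing ever wakes this thread, which matches the backtrace above.
      ready.wait(l, [&]{ return messengers.count(peer) != 0; });
    }
    return messengers[peer];
  }
  // Runs when the peer's response message arrives.
  void handle_endpoint_response(int peer, Messenger *m) {
    { std::lock_guard<std::mutex> g(lock); messengers[peer] = m; }
    ready.notify_all();
  }
private:
  void send_endpoint_request(int /*peer*/) { /* network send; stubbed out */ }
};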

@streichler

@jmwang14 can you do a run of the gasnetex build with -level task=2,event=2,gexmsg=2,amhandler=2 and (assuming it still hangs), attach the logs you get?

@jmwang14

Attached.
log-files.tar.gz

@ct-williams

I have also been running the HTR solver on Lassen, with Legion commit 42b768e10 of control_replication. I have been encountering the same non-deterministic hangs, and they seem to become increasingly frequent as more nodes are used.

Most recently, though, when trying to run HTR with the 03/08 Legion commit cbdc99666 of control_replication, the run has been failing with the following assertion:

Legion::Internal::DistributedCollectable::remove_base_valid_ref(Legion::Internal::ReferenceSource, Legion::Internal::ReferenceMutator*, int): Assertion `previous >= cnt' failed.

Attached is a backtrace I obtained with Legion compiled in Debug mode, executing HTR with the following flags: -lg:inorder -ll:force_kthreads -lg:safe_ctrlrepl 2. I’m not sure if this more recent issue is connected to the hanging described above, but I just wanted to bring it to your attention.

bt.tar.gz

@streichler

@ct-williams I believe that was one of the Legion failure modes observed in #1193 (long read).

@lightsighter

@ct-williams Ignore the comment from @streichler; this is a completely different failure mode from #1193. Rebuild with -DDEBUG_LEGION_GC and report new backtraces for the failure mode.

@lightsighter

Make sure you pull the most recent control replication (1b34214e9e90be59) to get a fix for -DDEBUG_LEGION_GC.

@ct-williams

To follow up on my original post, I have encountered another instance of HTR hanging when run on 32 nodes on Lassen. I was using the 3/10 Legion commit ad8ffeac9 of control_replication, compiled with CC_FLAGS="-g -O2". Attached are the backtraces, though it does seem like some of the files are truncated.
bt.tar.gz

@lightsighter

lightsighter commented Mar 13, 2022

You ran this with -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 2? Definitely some of these files have been truncated because the Thread 1 backtraces only appear in 16 of the 32 files. Can you try to generate them again using the flags above when running?

@ct-williams

I re-ran the simulation with -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 2, but now I receive the following error in the slurm file:

CU: CUDA_DRIVER_FNPTR(cuMemcpy2DAsync) (&copy_info, stream->get_stream()) = 700 (CUDA_ERROR_ILLEGAL_ADDRESS): an illegal memory access was encountered
Legion process received signal 6: Aborted
Process 109063 on node lassen90 is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemcpyHtoDAsync) (static_cast<CUdeviceptr>(out_base + out_offset), reinterpret_cast<const void *>(in_base + in_offset), bytes, stream->get_stream()) = 700 (CUDA_ERROR_ILLEGAL_ADDRESS): an illegal memory access was encountered
Legion process received signal 6: Aborted

Also, attached are the backtraces. bt2.tar.gz

@mariodirenzo

I am reporting here part of my exchange on Slack with @lightsighter, because I've observed a failure mode similar to what is reported in the previous post for commit c333df6a of control_replication on Sapling. This way, we are all on the same page.

I am running a single-node job on a GPU node with 4 ranks per node. The runtime is compiled with DEBUG=1 and HTR is executed with -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 2. I've been able to see at least four (possibly distinct) failure modes:

  • mode 1
[0 - 7fdd4edce840]  174.008633 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/cuda/cuda_module.cc:343: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemcpy2DAsync) (&copy_info, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
  • mode 2
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26113: Legion::Internal::EquivalenceSet* Legion::Internal::Runtime::find_or_request_equivalence_set(Legion::DistributedID, Legion::Internal::RtEvent&): Assertion `LEGION_DISTRIBUTED_HELP_DECODE(did) == EQUIVALENCE_SET_DC' failed.
Legion process received signal 6: Aborted
Process 217253 on node g0004.stanford.edu is frozen!
  • mode 3
Legion process received signal 11: Segmentation fault
Process 213999 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 213995 on node g0004.stanford.edu is frozen!
Legion process received signal 6: Aborted
Process 213995 on node g0004.stanford.edu is frozen!
  • mode 4
Legion process received signal 11: Segmentation fault
Process 210891 on node g0004.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 210891 on node g0004.stanford.edu is frozen!

If the runtime is compiled in release mode, the execution is carried out successfully. Similarly, if the same setup is executed on CPUs, the run is successful even in debug mode.

If the runtime is updated to the most recent commit of control replication e08ecb0f, the error becomes

prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/legion_trace.cc:6726: Legion::Internal::TraceLocalID Legion::Internal::PhysicalTemplate::find_trace_local_id(Legion::Internal::Memoizable*): Assertion `operations.front().find(op_key) != operations.front().end()' failed.
Legion process received signal 6: Aborted
Process 249881 on node g0003.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/legion_trace.cc:6726: Legion::Internal::TraceLocalID Legion::Internal::PhysicalTemplate::find_trace_local_id(Legion::Internal::Memoizable*): Assertion `operations.front().find(op_key) != operations.front().end()' failed.
Legion process received signal 6: Aborted
Process 249885 on node g0003.stanford.edu is frozen!

and this happens on both GPUs and CPUs.

@lightsighter

Illegal address errors are almost guaranteed to be a bug in a CUDA kernel. CUDA can return asynchronous failures to literally any CUDA API call. Try running with CUDA_LAUNCH_BLOCKING=1 in your environment and see what backtraces you get.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device

@mariodirenzo

As suggested by @lightsighter, commenting out these lines (https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/legion_analysis.cc#L1200-1209) avoids the tracing issue on e08ecb0f.

If the code is executed in this configuration with CUDA_LAUNCH_BLOCKING=1 in the environment, it fails in a few different ways:

  • mode 1
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26113: Legion::Internal::EquivalenceSet* Legion::Internal::Runtime::find_or_request_equivalence_set(Legion::DistributedID, Legion::Internal::RtEvent&): Assertion `LEGION_DISTRIBUTED_HELP_DECODE(did) == EQUIVALENCE_SET_DC' failed.
Legion process received signal 6: Aborted
Process 251431 on node g0004.stanford.edu is frozen!
  • mode 2
[3 - 7f278bbed840]  252.998180 {6}{realm}: invalid event handle: id=7f1d081edc00
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 254986 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 254985 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 254981 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
Legion process received signal 6: Aborted
Process 254985 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
Legion process received signal 6: Aborted
Process 254981 on node g0004.stanford.edu is frozen!
  • mode 3
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/legion_analysis.cc:16239: void Legion::Internal::EquivalenceSet::send_equivalence_set(Legion::AddressSpaceID): Assertion `(collective_mapping == NULL) || !collective_mapping->contains(target)' failed.
Legion process received signal 6: Aborted
Process 258088 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
Legion process received signal 6: Aborted
Process 258090 on node g0004.stanford.edu is frozen!
  • mode 4
malloc(): invalid size (unsorted)
Legion process received signal 11: Segmentation fault
Process 264432 on node g0004.stanford.edu is frozen!
Legion process received signal 6: Aborted
Process 264430 on node g0004.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 264432 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:20985: Legion::Internal::MessageManager* Legion::Internal::Runtime::find_messenger(Legion::AddressSpaceID): Assertion `sid < LEGION_MAX_NUM_NODES' failed.
Legion process received signal 6: Aborted
Process 264432 on node g0004.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 264430 on node g0004.stanford.edu is frozen!

@lightsighter

It's going to take me at least a week to fix this. Conceptually the problem is simple, but fixing it in a way that doesn't cause thousands of merge conflicts with the collective instance branch is very hard.

@mariodirenzo

Do we have any workaround that we could use until you push a fix?
Do you think that this could potentially lead to the hangs observed in release mode?

@mariodirenzo

I've tried running with Legion on e08ecb0f140ee29616a8dec9886278c13b075573 (with the lines https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/legion_analysis.cc#L1200-1209 commented out), compiling with DEBUG=1 and -DDEBUG_LEGION_GC, and I encounter at least two failure modes:

  • mode 1
Legion process received signal 11: Segmentation fault
Process 399060 on node g0002.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 399060 on node g0002.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 399060 on node g0002.stanford.edu is frozen!
  • mode 2
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 402242 on node g0002.stanford.edu is frozen!

If I update Legion to the latest version (88461bfbad64599fb6611054bd5ecfd5955d6ff3), the error message consistently becomes

[1 - 7f2fb6ff6840]  238.282098 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405571 on node g0002.stanford.edu is frozen!
[0 - 7f42c9fee840]  238.380398 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405569 on node g0002.stanford.edu is frozen!
[2 - 7f750fbed840]  238.387420 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405573 on node g0002.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2465: Realm::GenEventImpl* Realm::RuntimeImpl::get_genevent_impl(Realm::Event): Assertion `id.is_event()' failed.
Legion process received signal 6: Aborted
Process 405573 on node g0002.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2465: Realm::GenEventImpl* Realm::RuntimeImpl::get_genevent_impl(Realm::Event): Assertion `id.is_event()' failed.
Legion process received signal 6: Aborted
Process 405571 on node g0002.stanford.edu is frozen!
[3 - 7fce7bffe840]  238.424275 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405574 on node g0002.stanford.edu is frozen!

@mariodirenzo

mariodirenzo commented May 5, 2022

I am hitting this issue in a six-node run that hangs randomly.
The runtime is built on 03d207e1 with CC_FLAGS="-g -O2".
HTR is executed with the flags -lg:no_physical_tracing -lg:inorder -level task=2,dma=2,xplan=1 -ll:force_kthreads.
The tarball at /home/mariodr/logs.tar.gz on sapling contains the backtraces of all the six nodes (bt_*.log) and the -level task=2,dma=2,xplan=1 logs (*.log).

@mariodirenzo

Hi @jmwang14, I think that this Realm issue was fixed in May 2022. Could you please close the issue so it is easier for me to keep track of the outstanding problems?
