Non-deterministic hangs while running HTR #1167

Closed
Tracked by #1032
jmwang14 opened this issue Dec 4, 2021 · 65 comments

@jmwang14

jmwang14 commented Dec 4, 2021

I am running a 45-node simulation on Lassen using HTR and the 11/24 Legion commit f4f80752 of control_replication. The runtime hangs non-deterministically, usually after ~10-100 time steps. Attached are two backtraces from one of these hangs, taken several minutes apart. Legion is compiled with CXXFLAGS='-g' and the GASNet version is 2021.9.0. HTR is run with DEBUG=1.

bt-first.tar.gz
bt-second.tar.gz

@mariodirenzo

@elliottslaughter can you add this issue to #1032?
I do not know whether this is Realm- or Legion-related, but it is definitely top priority.

@lightsighter

Try running with -ll:defalloc 0 and see if it reproduces.

@jmwang14

jmwang14 commented Dec 5, 2021

It still hangs with -ll:defalloc 0

@lightsighter

Try with -lg:inorder and get backtraces if it hangs.

@lightsighter

Also which GASNet version and conduit are you using?

@lightsighter

Try the fix for #1070 as well

@streichler

Try the fix for #1070 as well

It looks like commit f4f80752 includes this fix already.

@jmwang14

jmwang14 commented Dec 5, 2021

I'll run with -ll:defalloc 0 -lg:inorder and get backtraces. GASNet version is 2021.9.0 and CONDUIT=ibv.

@jmwang14

jmwang14 commented Dec 6, 2021

Backtrace attached.
bt.tar.gz

@lightsighter

How far did this run make it before it hung? Can you also do a run with -lg:safe_ctrlrepl 1?

@jmwang14

jmwang14 commented Dec 6, 2021

On the previous run, it hung after the 9th time step. For the attached backtrace with -lg:safe_ctrlrepl 1, the runtime froze (rather than hanging) before the end of the first time step.

bt.tar.gz

@lightsighter

All signs point to this being a Realm or a GASNet hang. There is no rhyme or reason to these backtraces. Usually whenever it's Legion's fault for a hang, it will only happen without -lg:inorder and -lg:inorder will make the hang go away or make it completely deterministic. In these backtraces we don't see either. The -lg:inorder flag isn't producing any kind of deterministic execution or deterministic hang. I would put my money on GASNet. From now on do all runs with -lg:inorder and -lg:safe_ctrlrepl 1 unless instructed otherwise.

@streichler what realm flags do we need to use to check for outstanding GASNet messages still in flight? The other alternative would be to check for outstanding DMAs. There might be an indication that we could be getting stuck on lots of tiny DMA requests for future values.

@streichler

Let's look at DMA requests first, running with -level dma=2,xplan=1 and attaching the log files for every rank here please.

@jmwang14

jmwang14 commented Dec 7, 2021

Which log files are you referring to? 0.log, 1.log etc. are all empty for me.

I ran with -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 -level dma=2,xplan=1, and the runtime freezes before the first time step is complete. It seems to catch the following assertion error:

prometeo_O2OMix.exec: /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug/runtime/realm/event_impl.cc:1852: void Realm::GenEventImpl::trigger(Realm::EventImpl::gen_t, int, bool, Realm::TimeLimit): Assertion 'gen_triggered == (generation.load() + 1)' failed.

(The same assertion error was printed with just -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 — I had missed it earlier.)

@mariodirenzo

Which log files are you referring to?

If you run HTR with DEBUG=1 in your environment, the files 0.log, 1.log, etc. are generated for each of the nodes/ranks and are what Sean was asking for (provided that the code runs long enough to produce some logging lines).

@streichler

(The same assertion error was printed with just -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 — I had missed it earlier.)

Ok, let's run first with -level event=2 and see if we can determine whether there's a double-trigger of an event.

@jmwang14

jmwang14 commented Dec 8, 2021

Log files attached, with -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 -level event=2. It failed before the first time step was complete. No error message printed this time.

log-files.zip

@streichler

The -level event=2 was only going to be useful for that assert fail, so let's maybe run with both sets of logging (i.e. -level dma=2,xplan=1,event=2) so that we can try to debug whichever of the two ways it fails.

@jmwang14

jmwang14 commented Dec 9, 2021

Log files attached, using the flags -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 1 -level dma=2,xplan=1,event=2. This time it tripped the assertion error (again before the first time step):

prometeo_O2OMix.exec: /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug/runtime/realm/event_impl.cc:1852: void Realm::GenEventImpl::trigger(Realm::EventImpl::gen_t, int, bool, Realm::TimeLimit): Assertion 'gen_triggered == (generation.load() + 1)' failed.

log-files.tar.gz

@jmwang14

Hi Sean, have you had the opportunity to check out the log files? Please let me know if there is any other information I can provide. This is a critical issue for me, so I appreciate any time you're able to give it. I am avoiding the latest commit due to these hangs and resorting to an old commit that still has the distributed collectable error, but a large portion of those simulations fail to start, so it can slow things down quite a bit.

@lightsighter

Do we have backtraces yet for the double event triggers?

@jmwang14

What Legion flags do I use for that case? I can obtain backtraces for -level dma=2,xplan=1,event=2 if that is what you are referring to.

@jmwang14

Log files and backtraces attached, running with -level dma=2,xplan=1,event=2. Three nodes froze with the same assertion error:
prometeo_O2OMix.exec: /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug/runtime/realm/event_impl.cc:1852: void Realm::GenEventImpl::trigger(Realm::EventImpl::gen_t, int, bool, Realm::TimeLimit): Assertion 'gen_triggered == (generation.load() + 1)' failed.

bt.tar.gz
log-files.tar.gz

@lightsighter

This continues to look like a control replication violation. At least two different shards are disagreeing on who the owner is of an asynchronous collective operation.

These runs are all done with Legion having been built in debug mode? You can build with -O2 but Legion must have DEBUG_LEGION defined.

What happens if you run with -lg:safe_ctrlrepl 2?

@jmwang14

jmwang14 commented Jan 5, 2022

Hi Mike, I recompiled Legion with -O2 -DDEBUG_LEGION, and ran HTR with -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 2 -level dma=2,xplan=1,event=2. I don't have backtraces for this run, but I've attached log files. Error messages of the type

[27 - 7fffbb63f8b0]   12.766334 {5}{runtime}: [error 607] LEGION ERROR: Detected control replication violation when invoking future_from_value in task workSingle (UID 27) on shard 27. The hash summary for the function does not align with the hash summaries from other call sites. (from file /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug2/runtime/legion/legion_context.cc:12143)

were printed for a number of shards.

log-files.tar.gz

@elliottslaughter

elliottslaughter commented Jan 5, 2022

@jmwang14 This error message indicates a control replication violation, like @lightsighter suspected. This is an application bug.

Does your code contain a task named future_from_value? I can't find this string anywhere in the Regent compiler or Legion C API interface.

It would appear that you're either calling a different sequence of tasks, or perhaps calling this task with different values on different shards.
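
For illustration, here is a minimal C++ sketch of the kind of pattern that trips this check. This is hypothetical code, not HTR's (the task and variable names are invented): a replicated task feeds a shard-divergent value into Future::from_value, so the hash summaries computed under -lg:safe_ctrlrepl disagree across shards.

// Hypothetical sketch, not HTR code: a replicated task body that would
// violate control replication because the value handed to the runtime
// can differ from shard to shard.
#include <chrono>
#include <vector>
#include "legion.h"

using namespace Legion;

void work_task(const Task *task,
               const std::vector<PhysicalRegion> &regions,
               Context ctx, Runtime *runtime)
{
  // Wall-clock time is node-local, so each shard generally computes a
  // different value here.
  double local_time = std::chrono::duration<double>(
      std::chrono::steady_clock::now().time_since_epoch()).count();

  // Under control replication every shard must make the same sequence of
  // runtime calls with the same arguments. This call hashes its argument,
  // and -lg:safe_ctrlrepl compares the hashes across shards, producing the
  // "hash summary does not align" error when they differ.
  Future f = Future::from_value(runtime, local_time);  // divergent value -> violation
  (void)f;
}

Typical culprits are anything node-local: wall-clock timers, rank-dependent or uninitialized data, random numbers, or the iteration order of unordered containers.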

@jmwang14

jmwang14 commented Jan 5, 2022

That string does not appear in the HTR source. However, I do see a future_from_value on line 11449 of $LEGION_DIR/runtime/legion/legion_context.cc, commit f4f80752.

@elliottslaughter

For posterity, this is the line in question (from a newer commit, but it should be the same line):

verify_replicable(hasher, "future_from_value");

That makes me suspect you're going through this API call:

legion_future_from_untyped_pointer(legion_runtime_t runtime,

Which would seem to indicate the application is creating a future.

At this point I think there are two debugging options:

  1. Get a backtrace with Regent debug symbols enabled, i.e., regent.py -g my_script_name.rg .... You need Regent symbols specifically because this is going to be an application bug and we want to know which line of code in the application is causing the issue.
  2. Printf debugging. You know the task that's causing the problem: workSingle. If you want, I can help you narrow down the possible places where a problem could be occurring. Then you'd go through and format.println all those values and see which one is diverging.

Let me know if you need any more help.
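
For the printf route, a throwaway helper along these lines can make the cross-shard diff easier. This is just a hypothetical C++ sketch (in Regent you would use format.println directly) and the names are made up:

// Hypothetical debugging helper: print a candidate value together with the
// hostname and PID so the output of different ranks can be compared after
// the run (e.g. grep by label and look for values that differ).
#include <cstdio>
#include <string>
#include <unistd.h>

template<typename T>
void dump_shard_value(const char *label, const T &value)
{
  char host[256] = {0};
  gethostname(host, sizeof(host) - 1);
  std::printf("[%s pid=%d] %s = %s\n", host, static_cast<int>(getpid()),
              label, std::to_string(value).c_str());
  std::fflush(stdout);
}

Calling something like dump_shard_value("dt", dt) at each suspect point (the label and variable are illustrative) and then diffing the per-rank output usually pinpoints the diverging value quickly.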

@jmwang14

jmwang14 commented Jan 6, 2022

I have attached backtraces after recompiling the code with regent.py -g .... They do not seem to show the failing line in task workSingle. The error message below was also printed from each of the 45 nodes (shard 28 shown):

[28 - 7fff8af8f8b0]   14.482024 {5}{runtime}: [error 607] LEGION ERROR: Detected control replication violation when invoking future_from_value in task workSingle (UID 28) on shard 28. The hash summary for the function does not align with the hash summaries from other call sites. (from file /usr/WS1/wang83/build/legion-gpu-2021-11-24-debug2/runtime/legion/legion_context.cc:12143)

bt.tar.gz

@elliottslaughter

I'm going to need to see more of the build output to know what step is failing and why.

None of this has changed in a long time, so unless you've done something on your end, I don't see why this would be failing now.

@lightsighter

I agree with @elliottslaughter, that seems like a build issue. I would try doing a clean build. The change from yesterday passed through our CI infrastructure just fine and was a minor change that modified just a few characters.

@jmwang14

This is with clean builds of both HTR and Legion. I've attached the output of make clean; make -j |& tee make.log in the HTR source, if it is helpful. All of the runs reported above were with f4f80752 of Legion, which was from 11/24/2021. Is there any chance some change since then could be leading to the murmur_hash3_32 error? I've also kept my local HTR branch frozen throughout.

make.log

@elliottslaughter

I think the problem is at 95b191b#diff-2225ef11b1f16dfc659dfc3a248ca9babe50c1aed760d2152050b38f396e1292R1042 , which appears to break -foffline 1.

Do you really need -foffline 1? It might be that you could work around with -fcuda-offline 1, which might be sufficient for what you're doing.

@mariodirenzo

Do you really need -foffline 1?

We had to switch to -foffline because of #956. Maybe -fcuda-offline would be sufficient for the particular setup that Jonathan is using, but it is not general enough.

@magnatelee

A fix has been merged into control_replication. @jmwang14 can you pull and try again?

@jmwang14

With the latest commit f2c5fc66 on control_replication, the simulation hangs during startup with no error message. Log files are attached. I kept all the previous flags: -DDEBUG_LEGION for building Legion, regent.py -g ... for building HTR, and -ll:defalloc 0 -lg:inorder -lg:safe_ctrlrepl 2 -level dma=2,xplan=1,event=2 for running HTR.

log-files.tar.gz

@lightsighter

When you say that this hung during startup, do you mean that it hung before starting the top-level task, or that it hung while the application was doing its startup? Most of these log files are so short that I would find it hard to believe the application had even started yet. My prior probability that something is wrong with either your system configuration or your GASNet install is growing considerably. We haven't even made it to a place where we're running dma operations yet; there's not a single one in the logs. Everything is just event work so far, and such a tiny amount of it that it would be hard for anything to have happened beyond initial Legion and Realm startup. That code is so well polished (literally every Legion program running on every machine we've ever run on has touched it) that I would be surprised if there was an issue there.

@jmwang14

I've tried two other tests:
(1) Using -ll:bgwork 1 at runtime
(2) Compiling Legion with REALM_NETWORKS="gasnetex" (and therefore GASNet-2021.9.0), and using -ll:bgwork 1 at runtime.

Both lead to the same behavior. I'm not sure where exactly the code is hanging, but the application does start up. It goes through a few initial operations like creating the output directory and getting the wall time, but stops before it sets up the mesh and flow variables. It is hanging earlier than the previous hangs.

@mariodirenzo I see the sample0 directory but without console.txt — I think this means that initSingle must've completed but it is hanging on or before the second line of SIM.DeclSymbols.

log-files.tar.gz

@lightsighter

Can you get backtraces from all the nodes for the early hang when running with -ll:force_kthreads?

@jmwang14

Log files and backtraces attached.

bt.tar.gz

log-files.tar.gz

@lightsighter

Here is a backtrace that is a pretty damning indictment of either Realm or GASNet:

Thread 14 (Thread 0x7fce29eff8b0 (LWP 39654)):
#0  syscall () at ../sysdeps/unix/sysv/linux/powerpc/syscall.S:29
#1  0x00007ffff6e87de0 in Realm::Doorbell::wait_slow (this=0x7fce29f00038) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/mutex.cc:241
#2  0x00007ffff6e89278 in wait (this=0x7fce29f00038) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/mutex.inl:81
#3  Realm::UnfairCondVar::wait (this=0x7fce29efd850) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/mutex.cc:874
#4  0x00007ffff6eb3078 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x181bbd50, switch_to=0x7fbf70007b10) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/tasks.cc:1376
#5  0x00007ffff6eb473c in Realm::ThreadedTaskScheduler::thread_blocking (this=0x181bbd50, thread=<optimized out>) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/tasks.cc:901
#6  0x00007ffff700aaa8 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7fce29efe168: false) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/threads.inl:218
#7  0x00007ffff6ff9598 in Realm::Event::wait_faultaware (this=0x7fce29efe270, poisoned=@0x7fce29efe168: false) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/event_impl.cc:254
#8  0x00007ffff6ff980c in Realm::Event::wait (this=<optimized out>) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/realm/event_impl.cc:206
#9  0x00000000104e102c in Legion::Internal::LgEvent::wait() const ()
#10 0x00007ffff699bc18 in Legion::Internal::Runtime::find_messenger (this=0x18ac1ac0, sid=<optimized out>) at /usr/WS1/wang83/build/legion-gpu-2022-01-12-gasnetex-g/runtime/legion/runtime.cc:20972

Legion is waiting for the response message from a remote node that establishes an endpoint connection between a pair of nodes. This only needs to happen once between each pair of nodes (in each direction), so once it is set up it never needs to happen again during the run. The send/receive pair of messages used to set up these endpoints are incredibly simple:

https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/runtime.cc?expanded=true&viewer=simple#L21008-21047

Both the send and receive Realm tasks are launched with no preconditions. If there were bugs in this code, we would be in a terrible place for all Legion applications. I can say with 99% certainty that the hang in this case is either in Realm or GASNet with a Realm task or a GASNet message getting lost.
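
To make the pattern concrete, here is an illustrative sketch of that handshake (this is not Legion's actual code, which is at the link above; duplicate-request handling and error paths are omitted). The first caller that needs a messenger for a peer sends the setup request and blocks until the peer's response handler installs the endpoint; the backtrace above is parked in exactly that kind of wait.

// Illustrative sketch only; the real implementation is in runtime.cc at the
// link above. Names are simplified.
#include <condition_variable>
#include <map>
#include <mutex>

struct Messenger { int peer; /* endpoint state lives here */ };

class EndpointTable {
  std::mutex lock;
  std::condition_variable ready;
  std::map<int, Messenger*> messengers;   // one per remote node, built lazily
public:
  Messenger *find_messenger(int peer) {
    std::unique_lock<std::mutex> l(lock);
    if (messengers.count(peer) == 0) {
      send_endpoint_request(peer);        // fire-and-forget, no preconditions
      // Block until handle_endpoint_response() installs the messenger.
      // If the request or the response message is lost in the network layer,
      // nothing ever wakes this thread, which matches the backtrace above.
      ready.wait(l, [&]{ return messengers.count(peer) != 0; });
    }
    return messengers[peer];
  }
  // Runs when the peer's response message arrives.
  void handle_endpoint_response(int peer, Messenger *m) {
    { std::lock_guard<std::mutex> g(lock); messengers[peer] = m; }
    ready.notify_all();
  }
private:
  void send_endpoint_request(int /*peer*/) { /* network send; stubbed out */ }
};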

@streichler

@jmwang14 can you do a run of the gasnetex build with -level task=2,event=2,gexmsg=2,amhandler=2 and (assuming it still hangs), attach the logs you get?

@jmwang14

Attached.
log-files.tar.gz

@ct-williams

I have also been running the HTR solver on Lassen, with Legion commit 42b768e10 of control_replication. I have been encountering the same non-deterministic hangs, and they seem to become increasingly frequent as more nodes are used.

Most recently, though, when trying to run HTR with the 03/08 Legion commit cbdc99666 of control_replication, the run has been failing with the following assertion:

Legion::Internal::DistributedCollectable::remove_base_valid_ref(Legion::Internal::ReferenceSource, Legion::Internal::ReferenceMutator*, int): Assertion `previous >= cnt' failed.

Attached is a backtrace I obtained with Legion compiled in Debug mode, executing HTR with the following flags: -lg:inorder -ll:force_kthreads -lg:safe_ctrlrepl 2. I’m not sure if this more recent issue is connected to the hanging described above, but I just wanted to bring it to your attention.

bt.tar.gz

@streichler

@ct-williams I believe that was one of the Legion failure modes observed in #1193 (long read).

@lightsighter

@ct-williams Ignore the comment from @streichler; this is a completely different failure mode from #1193. Rebuild with -DDEBUG_LEGION_GC and report new backtraces for the failure mode.

@lightsighter

Make sure you pull the most recent control replication (1b34214e9e90be59) to get a fix for -DDEBUG_LEGION_GC.

@ct-williams

To follow up on my original post, I have encountered another instance of HTR hanging when run on 32 nodes on Lassen. I was using the 3/10 Legion commit ad8ffeac9 of control_replication, compiled with CC_FLAGS="-g -O2". Attached are the backtraces, though it does seem like some of the files are truncated.
bt.tar.gz

@lightsighter

lightsighter commented Mar 13, 2022

You ran this with -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 2? Definitely some of these files have been truncated because the Thread 1 backtraces only appear in 16 of the 32 files. Can you try to generate them again using the flags above when running?

@ct-williams

I re-ran the simulation with -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 2, but now I receive the following error in the slurm file:

CU: CUDA_DRIVER_FNPTR(cuMemcpy2DAsync) (&copy_info, stream->get_stream()) = 700 (CUDA_ERROR_ILLEGAL_ADDRESS): an illegal memory access was encountered
Legion process received signal 6: Aborted
Process 109063 on node lassen90 is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemcpyHtoDAsync) (static_cast<CUdeviceptr>(out_base + out_offset), reinterpret_cast<const void *>(in_base + in_offset), bytes, stream->get_stream()) = 700 (CUDA_ERROR_ILLEGAL_ADDRESS): an illegal memory access was encountered
Legion process received signal 6: Aborted

Also, attached are the backtraces. bt2.tar.gz

@mariodirenzo

I am reporting here part of my exchange on Slack with @lightsighter, because I've observed a failure mode similar to what is reported in the previous post for commit c333df6a of control_replication on Sapling. This way, we are all on the same page.

I am running a single-node job on a GPU node with 4 ranks per node. The runtime is compiled with DEBUG=1 and HTR is executed with -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 2. I've been able to see at least four (possibly distinct) failure modes:

  • mode 1
[0 - 7fdd4edce840]  174.008633 {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/cuda/cuda_module.cc:343: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemcpy2DAsync) (&copy_info, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
CU: CUDA_DRIVER_FNPTR(cuMemsetD8Async) (CUdeviceptr(out_base + out_offset), fill_u8, bytes, stream->get_stream()) = 710 (CUDA_ERROR_ASSERT): device-side assert triggered
Legion process received signal 6: Aborted
Process 279081 on node g0002.stanford.edu is frozen!
  • mode 2
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26113: Legion::Internal::EquivalenceSet* Legion::Internal::Runtime::find_or_request_equivalence_set(Legion::DistributedID, Legion::Internal::RtEvent&): Assertion `LEGION_DISTRIBUTED_HELP_DECODE(did) == EQUIVALENCE_SET_DC' failed.
Legion process received signal 6: Aborted
Process 217253 on node g0004.stanford.edu is frozen!
  • mode 3
Legion process received signal 11: Segmentation fault
Process 213999 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 213995 on node g0004.stanford.edu is frozen!
Legion process received signal 6: Aborted
Process 213995 on node g0004.stanford.edu is frozen!
  • mode 4
Legion process received signal 11: Segmentation fault
Process 210891 on node g0004.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 210891 on node g0004.stanford.edu is frozen!

If the runtime is compiled in release mode, the execution is carried out successfully. Similarly, if the same setup is executed on CPUs, the run is successful even in debug mode.

If the runtime is updated to the most recent commit of control replication e08ecb0f, the error becomes

prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/legion_trace.cc:6726: Legion::Internal::TraceLocalID Legion::Internal::PhysicalTemplate::find_trace_local_id(Legion::Internal::Memoizable*): Assertion `operations.front().find(op_key) != operations.front().end()' failed.
Legion process received signal 6: Aborted
Process 249881 on node g0003.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/legion_trace.cc:6726: Legion::Internal::TraceLocalID Legion::Internal::PhysicalTemplate::find_trace_local_id(Legion::Internal::Memoizable*): Assertion `operations.front().find(op_key) != operations.front().end()' failed.
Legion process received signal 6: Aborted
Process 249885 on node g0003.stanford.edu is frozen!

and this happens on both GPUs and CPUs.

@lightsighter

Illegal address errors are almost guaranteed to be a bug in a CUDA kernel. CUDA can return asynchronous failures to literally any CUDA API call. Try running with CUDA_LAUNCH_BLOCKING=1 in your environment and see what backtraces you get.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device

@mariodirenzo

As suggested by @lightsighter, commenting out these lines (https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/legion_analysis.cc#L1200-1209) avoids the tracing issue on e08ecb0f.

If the code is executed in this configuration with CUDA_LAUNCH_BLOCKING=1 in the environment, it fails in a few different ways:

  • mode 1
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26113: Legion::Internal::EquivalenceSet* Legion::Internal::Runtime::find_or_request_equivalence_set(Legion::DistributedID, Legion::Internal::RtEvent&): Assertion `LEGION_DISTRIBUTED_HELP_DECODE(did) == EQUIVALENCE_SET_DC' failed.
Legion process received signal 6: Aborted
Process 251431 on node g0004.stanford.edu is frozen!
  • mode 2
[3 - 7f278bbed840]  252.998180 {6}{realm}: invalid event handle: id=7f1d081edc00
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 254986 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 254985 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 254981 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
Legion process received signal 6: Aborted
Process 254985 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
Legion process received signal 6: Aborted
Process 254981 on node g0004.stanford.edu is frozen!
  • mode 3
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/legion_analysis.cc:16239: void Legion::Internal::EquivalenceSet::send_equivalence_set(Legion::AddressSpaceID): Assertion `(collective_mapping == NULL) || !collective_mapping->contains(target)' failed.
Legion process received signal 6: Aborted
Process 258088 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:25891: void Legion::Internal::Runtime::register_distributed_collectable(Legion::DistributedID, Legion::Internal::DistributedCollectable*): Assertion `(finder->second.first == dc) || (finder->second.first == NULL)' failed.
Legion process received signal 6: Aborted
Process 258090 on node g0004.stanford.edu is frozen!
  • mode 4
malloc(): invalid size (unsorted)
Legion process received signal 11: Segmentation fault
Process 264432 on node g0004.stanford.edu is frozen!
Legion process received signal 6: Aborted
Process 264430 on node g0004.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 264432 on node g0004.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:20985: Legion::Internal::MessageManager* Legion::Internal::Runtime::find_messenger(Legion::AddressSpaceID): Assertion `sid < LEGION_MAX_NUM_NODES' failed.
Legion process received signal 6: Aborted
Process 264432 on node g0004.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 264430 on node g0004.stanford.edu is frozen!

@lightsighter

It's going to take me at least a week to fix this. Conceptually the problem is simple, but fixing it in a way that doesn't cause thousands of merge conflicts with the collective instance branch is very hard.

@mariodirenzo

Do we have any workaround that we could use until you push a fix?
Do you think that this could potentially lead to the hangs observed in release mode?

@mariodirenzo

I've tried running with Legion on e08ecb0f140ee29616a8dec9886278c13b075573 (with the lines https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/legion_analysis.cc#L1200-1209 commented out), compiling with DEBUG=1 and -DDEBUG_LEGION_GC, and I encounter at least two failure modes:

  • mode 1
Legion process received signal 11: Segmentation fault
Process 399060 on node g0002.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 399060 on node g0002.stanford.edu is frozen!
Legion process received signal 11: Segmentation fault
Process 399060 on node g0002.stanford.edu is frozen!
  • mode 2
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/legion/runtime.cc:26157: Legion::Internal::DistributedCollectable* Legion::Internal::Runtime::find_or_request_distributed_collectable(Legion::DistributedID, Legion::Internal::RtEvent&) [with T = Legion::Internal::EquivalenceSet; Legion::Internal::MessageKind MK = Legion::Internal::SEND_EQUIVALENCE_SET_REQUEST; Legion::DistributedID = long long unsigned int]: Assertion `target != address_space' failed.
Legion process received signal 6: Aborted
Process 402242 on node g0002.stanford.edu is frozen!

If I update Legion to the latest version (88461bfbad64599fb6611054bd5ecfd5955d6ff3), the error message consistently becomes

[1 - 7f2fb6ff6840]  238.282098 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405571 on node g0002.stanford.edu is frozen!
[0 - 7f42c9fee840]  238.380398 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405569 on node g0002.stanford.edu is frozen!
[2 - 7f750fbed840]  238.387420 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405573 on node g0002.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2465: Realm::GenEventImpl* Realm::RuntimeImpl::get_genevent_impl(Realm::Event): Assertion `id.is_event()' failed.
Legion process received signal 6: Aborted
Process 405573 on node g0002.stanford.edu is frozen!
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2465: Realm::GenEventImpl* Realm::RuntimeImpl::get_genevent_impl(Realm::Event): Assertion `id.is_event()' failed.
Legion process received signal 6: Aborted
Process 405571 on node g0002.stanford.edu is frozen!
[3 - 7fce7bffe840]  238.424275 {6}{realm}: invalid event handle: id=4
prometeo_ConstPropMix.exec: /home/mariodr/legion/runtime/realm/runtime_impl.cc:2458: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.
Legion process received signal 6: Aborted
Process 405574 on node g0002.stanford.edu is frozen!

@mariodirenzo

mariodirenzo commented May 5, 2022

I am hitting this issue in a six-node run that hangs randomly.
The runtime is built on 03d207e1 with CC_FLAGS="-g -O2".
HTR is executed with the flags -lg:no_physical_tracing -lg:inorder -level task=2,dma=2,xplan=1 -ll:force_kthreads.
The tarball at /home/mariodr/logs.tar.gz on sapling contains the backtraces of all the six nodes (bt_*.log) and the -level task=2,dma=2,xplan=1 logs (*.log).

@mariodirenzo

Hi @jmwang14, I think that this Realm issue was fixed in May 2022. Could you please close the issue so it is easier for me to keep track of the outstanding problems?
