Non-deterministic hangs while running HTR #1167
Comments
@elliottslaughter can you add this issue to #1032? |
Try running with |
It still hangs with |
Try with |
Also which GASNet version and conduit are you using? |
Try the fix for #1070 as well |
It looks like commit |
I'll run with |
Backtrace attached. |
How far did this run make it before it hung? Can you also do a run with |
On the previous run, it hung after the 9th time step. For the attached backtrace with |
All signs point to this being a Realm or a GASNet hang. There is no rhyme or reason to these backtraces. Usually whenever it's Legion's fault for a hang, it will only happen without @streichler what realm flags do we need to use to check for outstanding GASNet messages still in flight? The other alternative would be to check for outstanding DMAs. There might be an indication that we could be getting stuck on lots of tiny DMA requests for future values. |
Let's look at DMA requests first, running with |
Which log files are you referring to? I ran with
(The same assertion error was printed with just |
If you run HTR with |
Ok, let's run first with |
Log files attached, with |
The |
Log files attached, using the flags |
Hi Sean, have you had the opportunity to check out the log files? Please let me know if there is any other information I can provide. This is a critical issue for me, so I appreciate any time you're able to give it. I am avoiding the latest commit due to these hangs and falling back to an old commit that still has the distributed collectable error, but a large portion of those simulations fail to start, which can slow things down quite a bit. |
Do we have backtraces yet for the double event triggers? |
What Legion flags do I use for that case? I can obtain backtraces for |
Log files and backtraces attached, running with |
This continues to look like a control replication violation. At least two different shards are disagreeing on who the owner is of an asynchronous collective operation. These runs are all done with Legion having been built in debug mode? You can build with What happens if you run with |
Hi Mike, I recompiled Legion with
were printed for a number of shards. |
@jmwang14 This error message indicates a control replication violation, like @lightsighter suspected. This is an application bug. Does your code contain a task with that name? It would appear that you're either calling a different sequence of tasks, or perhaps calling this task with different values on different shards. |
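For readers who are new to this failure mode, here is a minimal, hypothetical C++ sketch (it is not HTR or Legion code, and every name in it is made up) of how this kind of control replication violation arises: each shard executes the same task body, so any runtime call whose presence or arguments depend on shard-local, non-deterministic data makes the shards disagree on the sequence of collective operations.

```cpp
// Hypothetical, simplified illustration (not HTR or Legion code): in a
// control-replicated task, every shard runs the same body and must issue the
// same sequence of runtime calls with the same values.
#include <cstdio>
#include <cstdlib>
#include <ctime>

static int futures_created = 0;

// Stand-in for "create a future from an application value" (roughly the kind
// of call discussed above); the real API is Legion's, this is just a counter.
void create_future_from_value(double v) { (void)v; ++futures_created; }

// Executed once per shard in a control-replicated run.
void replicated_task_body(int shard_id) {
  (void)shard_id;
  // Shard-local, non-deterministic value (timer, RNG, uninitialized data...).
  double local_value = std::rand() / (double)RAND_MAX;

  // BUG: the decision depends on data that can differ between shards, so some
  // shards create a future here and others do not; the shards now disagree on
  // the sequence (and ownership) of the underlying collective operations.
  if (local_value > 0.5)
    create_future_from_value(local_value);

  // Correct pattern: every shard makes the same call with the same value,
  // e.g. a value all shards computed deterministically or agreed on first.
}

int main() {
  std::srand((unsigned)std::time(nullptr));
  for (int shard = 0; shard < 4; ++shard)  // pretend these are 4 shards
    replicated_task_body(shard);
  std::printf("futures created: %d (must be 0 or 4, never in between)\n",
              futures_created);
  return 0;
}
```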
That string does not appear in the HTR source. However I do see a |
For posterity, this is the line in question (from a newer commit, but it should be the same line): legion/runtime/legion/legion_context.cc, line 11456, at commit e67a940
That makes me suspect you're going through this API call: legion/runtime/legion/legion_c.h, line 2621, at commit e67a940
This would seem to indicate that the application is creating a future. At this point I think there are two debugging options:
Let me know if you need any more help. |
I have attached backtraces after recompiling the code with |
I'm going to need to see more of the build output to know what step is failing and why. None of this has changed in a long time, so unless you've done something on your end, I don't see why this would be failing now. |
I agree with @elliottslaughter, that seems like a build issue. I would try doing a clean build. The change from yesterday passed through our CI infrastructure just fine and was a minor change that modified just a few characters. |
This is with clean builds of both HTR and Legion. I've attached the output of |
I think the problem is at 95b191b#diff-2225ef11b1f16dfc659dfc3a248ca9babe50c1aed760d2152050b38f396e1292R1042 , which appears to break Do you really need |
We had to switch to -foffline because of #956. Maybe on the particular setup that Jonathan is using, -foffline-cuda might be sufficient, but it is not general enough. |
A fix has been merged with |
With the latest commit |
When you say that this hung during startup, do you mean that it hung before starting the top-level task, or that it hung while the application was doing its startup? Most of these log files are so short that I would find it hard to believe the application had even started yet. My prior probability that something is wrong with either your system configuration or your GASNet install is growing pretty considerably. We haven't even made it to a place where we're running DMA operations yet; there's not a single one in the logs. Everything is just event work so far, but such a tiny amount of it that it would be hard for anything to have happened beyond initial Legion and Realm startup. That code is so well polished (literally every Legion program running on every machine we've ever run on has touched it) that I would be surprised if there was an issue there. |
I've tried two other tests: Both lead to the same behavior. I'm not sure exactly where the code is hanging, but the application does start up. It goes through a few initial operations, like creating the output directory and getting the wall time, but stops before it sets up the mesh and flow variables. It is hanging earlier than the previous hangs did. @mariodirenzo I see the |
Can you get backtraces from all the nodes for the early hang when running with |
Log files and backtraces attached. |
Here is a backtrace that is a pretty damning indictment of either Realm or GASNet:
Legion is waiting for the response message from a remote node that establishes an endpoint connection between a pair of nodes. This only needs to happen once between each pair of nodes (in each direction), so once it is set up it never needs to happen again during the run. The send/receive pair of messages used to set up these endpoints is incredibly simple: both the send and receive Realm tasks are launched with no preconditions. If there were bugs in this code, we would be in a terrible place for all Legion applications. I can say with 99% certainty that the hang in this case is either in Realm or GASNet, with a Realm task or a GASNet message getting lost. |
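To make the failure mode above concrete, here is a purely conceptual C++ sketch, with invented names and no relation to Realm's actual code, of a once-per-pair endpoint handshake; it shows why a single dropped response turns into a permanent hang rather than an error.

```cpp
// Conceptual sketch only (not Realm's implementation; all names invented):
// why a lost reply in a once-per-pair endpoint handshake becomes a permanent
// hang -- the handshake is only attempted once, so nothing ever retries it.
#include <cstdio>
#include <map>

struct Node {
  int id;
  std::map<int, bool> endpoint_ready;  // peer id -> connection established

  // Sender side: ask `peer` to set up an endpoint, then wait for the reply.
  void request_endpoint(Node &peer, bool reply_is_lost) {
    if (endpoint_ready[peer.id]) return;          // done at most once per pair
    peer.handle_request(*this, reply_is_lost);    // "send" the request message
    if (!endpoint_ready[peer.id])
      std::printf("node %d: still waiting on node %d -> hang\n", id, peer.id);
  }

  // Receiver side: establish its direction and send the response back.
  void handle_request(Node &sender, bool reply_is_lost) {
    endpoint_ready[sender.id] = true;
    if (!reply_is_lost)
      sender.endpoint_ready[id] = true;           // the response "arrives"
    // If the response is dropped anywhere along the way, the sender blocks
    // forever: the handshake is never reissued for this pair of nodes.
  }
};

int main() {
  Node a{0}, b{1};
  a.request_endpoint(b, /*reply_is_lost=*/false);  // normal case: completes
  Node c{2}, d{3};
  c.request_endpoint(d, /*reply_is_lost=*/true);   // lost reply: hangs
  return 0;
}
```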
@jmwang14 can you do a run of the gasnetex build with |
Attached. |
I have also been running the HTR solver on Lassen with the Legion commit Most recently, though, when trying to run HTR with the 03/08 Legion commit
Attached is a backtrace I obtained with Legion compiled in Debug mode, executing HTR with the following flags: |
@ct-williams I believe that was one of the Legion failure modes that were observed in #1193 (long read). |
@ct-williams Ignore the comment from @streichler, this is a completely different failure mode from #1193. Rebuild with |
Make sure you pull the most recent control replication ( |
To follow up on my original post, I have encountered another instance of HTR hanging when run on 32 nodes on Lassen. I was using the 3/10 Legion commit |
You ran this with |
I re-ran the simulation with
Also, attached are the backtraces. bt2.tar.gz |
I am reporting here a part of my exchange on Slack with @lightsighter because I've observed a failure mode similar to what is reported in the previous post for the commit I am running a single node job on a GPU node with 4 ranks per node. The runtime is compiled with
If the runtime is compiled in release mode, the execution is carried out successfully. Similarly, if the same setup is executed on CPUs, the run is successful even in debug mode. If the runtime is updated to the most recent commit of control replication
and this happens both on GPUs and CPUs |
Illegal address errors are almost always caused by a bug in a CUDA kernel. CUDA can report asynchronous failures on literally any CUDA API call. Try running with |
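As a general aside (the helper below is our own invention, not part of Legion or HTR): because CUDA reports kernel faults asynchronously, a common way to localize an illegal address error is to synchronize and check immediately after each launch while debugging, for example:

```cpp
// Sketch of a common way to localize an asynchronous CUDA failure (names here
// are ours, not Legion's): force a device synchronization right after each
// kernel launch so the error is reported at the offending launch site instead
// of at some unrelated later API call. Setting CUDA_LAUNCH_BLOCKING=1 in the
// environment has a similar effect, and compute-sanitizer / cuda-memcheck can
// pinpoint the exact bad access.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Call immediately after a kernel launch while debugging.
inline void check_last_launch(const char *where) {
  cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
  if (err == cudaSuccess)
    err = cudaDeviceSynchronize();        // surfaces faults from inside kernels
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error after %s: %s\n",
                 where, cudaGetErrorString(err));
    std::abort();
  }
}

int main() {
  // Usage pattern (my_kernel is a hypothetical kernel name):
  //   my_kernel<<<grid, block>>>(args...);
  //   check_last_launch("my_kernel");
  check_last_launch("startup");  // no launches yet, so this reports no error
  return 0;
}
```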
As suggested by @lightsighter, commenting out these lines (https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/legion/legion_analysis.cc#L1200-1209) avoids the tracing issue on If the code is executed in this configuration with the |
It's going to take me at least a week to fix this. Conceptually the problem is simple, but fixing it in a way that doesn't cause thousands of merge conflicts with the collective instance branch is very hard. |
Do we have any workaround that we could use until you push a fix? |
I've tried running with Legion on
if I update Legion to the latest version ( |
I am hitting this issue in a six-node run that randomly hangs. |
Hi @jmwang14, I think that this Realm issue was fixed in May 2022. Could you please close the issue so it is easier for me to keep track of the outstanding problems? |
I am running a 45-node simulation on Lassen using HTR and the 11/24 Legion commit f4f80752 of control_replication. The runtime hangs non-deterministically, usually after ~10-100 time steps. Attached are two backtraces taken during one of these hangs, several minutes apart. Legion is compiled with CXXFLAGS='-g' and the GASNet version is 2021.9.0. HTR is run with DEBUG=1.
bt-first.tar.gz
bt-second.tar.gz