Realm: segmentation fault at startup #1523
Comments
Assigning @eddy16112 since these are the changes for the new programmable machine model in Realm. |
@mariodirenzo This doesn't look like a Realm crash, it looks like a segfault in a Regent-generated task:
Do you think the code being emitted by Terra is wrong? If so we're probably going to need @elliottslaughter to tell us how to look at that. |
FWIW, there's evidence that at least one other Terra generated task in the same program started just fine:
|
Whatever it is, it goes away by reverting the commit |
@mariodirenzo, I'm trying to figure out the exact state of your Legion repo with and without the commit reverted. Just so we're all on the same page, could you prepare a branch that (a) has the revert commit (that should pass) and (b) the commit immediately prior fails? The only thing that changed recently in Regent is the addition of Terra binaries in |
Just FYI, there are no other recent changes to Regent aside from the It's always possible for there to be a timing difference that exposes a bug. However, in this case I strongly suspect that the reverted merge is much more relevant. We've hit multiple Realm regressions in the last week, and it has become increasingly clear that our existing CI infrastructure just isn't sufficient. |
@lightsighter How do you run the program? I followed the steps, but I cannot see the exact command to run the program. |
You can find it at
@eddy16112, |
@mariodirenzo I copied /home/mariodr/htr3/solverTests/Speelman_DV250 into my directory, and here is what I have in the slurm-*.txt
|
Can you retry after sourcing |
Yes, I modified it to
|
The load script should look like this
|
Now I am getting an error of |
FWIW, the reproducer doesn't work for me anymore either:
|
you just need to change |
The segfault happens right in the first few instructions of the function:
The values look ok, but the memory trying to be read is clearly not accessible.
There are really only two possibilities:
|
Isn't this a stack overflow? We just subtracted something like 61 KiB off of the stack pointer, and the first store after that point is what fails. |
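The arithmetic behind the stack-overflow hypothesis can be sketched as follows. The 61 KiB frame size comes from the comment above; the stack sizes and the amount already in use are made-up values for illustration, not Realm defaults, and `frame_fits` is a hypothetical helper.

```python
KIB = 1024
MIB = 1024 * KIB


def frame_fits(stack_bytes, used_bytes, frame_bytes):
    """Return True if a new frame of frame_bytes fits in the remaining stack."""
    return used_bytes + frame_bytes <= stack_bytes


frame = 61 * KIB   # frame size mentioned in the thread
used = 8 * KIB     # assumed: stack already consumed by the callers

# Assumed small stack: 8 KiB + 61 KiB exceeds 64 KiB, so the first store
# below the adjusted stack pointer would touch inaccessible memory.
print(frame_fits(64 * KIB, used, frame))   # False

# Assumed larger stack: the same frame fits comfortably.
print(frame_fits(4 * MIB, used, frame))    # True
```

If the requested stack size is silently ignored, a task that only fits in the larger stack ends up in the first case.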
Is it possible that Wei's change caused the stack size not to be parsed properly? I know HTR needs a larger than usual stack size to run correctly. And since that code just got modified.... |
Looks like the stack size (-ll:stacksize) is not parsed correctly, but I do not understand why. Here is how I parsed it:
When I run any realm example with -ll:cpu 3 -ll:stacksize 4, I am getting a print:
The -ll:cpu is correct, but the stack size is not. |
I found the problem, the |
In addition to @mariodirenzo's confirmation, do we have any direct tests of the command-line parser in Realm? Because this is a subtle issue (i.e., only certain applications will actually crash if we don't parse this flag), it would be nice to directly test this, so we don't rely on user reports to discover future regressions. |
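A direct parser test of the kind asked for here could look roughly like the sketch below. It is written in Python purely for illustration and is not Realm code: `parse_flags`, `check_every_flag_is_parsed`, and the default values are hypothetical, and it assumes the parsed configuration can be read back after parsing, which may not hold for Realm.

```python
def parse_flags(argv, defaults):
    """Parse '-ll:<name> <int>' pairs into a copy of defaults.

    Hypothetical stand-in for a runtime's command-line parser.
    """
    config = dict(defaults)
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("-ll:") and i + 1 < len(argv):
            name = arg[len("-ll:"):]
            if name in config:
                config[name] = int(argv[i + 1])
                i += 2  # consumed the flag and its value
                continue
        i += 1  # unknown or incomplete argument: skip it
    return config


def check_every_flag_is_parsed(defaults):
    """Set each flag to a non-default value and list the flags the parser missed."""
    failures = []
    for name, default in defaults.items():
        wanted = default + 1
        parsed = parse_flags(["-ll:" + name, str(wanted)], defaults)
        if parsed[name] != wanted:
            failures.append(name)
    return failures


DEFAULTS = {"cpu": 1, "stacksize": 2}  # illustrative values only

print(parse_flags(["-ll:cpu", "3", "-ll:stacksize", "4"], DEFAULTS))
print(check_every_flag_is_parsed(DEFAULTS))
```

The point of looping over every known flag is that a flag dropped from the parser shows up as a named failure, rather than as a crash in whichever application happens to depend on it.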
@eddy16112, your patch fixes the issue. Please let me know when it is merged in |
I will run the full CI tonight, and if it passes, I will merge it tomorrow. |
The problem is that Realm currently does not provide an API to query every command-line argument, so I am not sure how we can test it. |
So at least in this case, there is a way to test it: have a task that does a massive call to |
I agree with Mike. Testing |
@mariodirenzo The patch has been pushed into the control replication branch. |
Thank you |
@elliottslaughter @lightsighter I have updated the Realm unit test to better cover the command-line parser and the machine config API: https://gitlab.com/StanfordLegion/legion/-/merge_requests/904. Hopefully it will catch errors like the one in this issue next time. |
I've noticed a regression in one of the CI tests of HTR. In particular, the same code version works fine on commit 707b7479e of control_replication and fails with a segmentation fault at startup on 48d22ecfe. If I revert e42aec2e with a git revert -m 1 e42aec2e, the error goes away.
The error can be reproduced on sapling2 with the following steps on the login node:
source /home/mariodr/load_htr3.sh
cd /home/mariodr/htr3/solverTests/Speelman_DV250
rm -rf slurm-* sample0;DEBUG=1 ../../prometeo.sh -i 1x1x1.json
A job will be submitted to the queue and will hang on one of the gpu nodes.
Legion is installed at /home/mariodr/legion3 and is compiled with
DEBUG=1 srun -N 1 --exclusive -p gpu /home/mariodr/legion3/language/scripts/setup_env.py --prefix /home/mariodr/legion/language/
@elliottslaughter, can you please add this issue to #1032 with high priority?