Regent: memory leak #745
Comments
If you update to the newest master, we just fixed one possible source of leaks. Otherwise, you can follow the instructions here: #711 (comment). You don't need a super long run; 10-100 timesteps or so should be fine.
It's worth noting that it's not a leak if the program is actually still holding onto all the resources it's requesting and not deleting them. For example, some programs keep making lots of logical regions and never delete them, which may look like a leak but is actually a bug in the user code. We don't have any tools for finding those kinds of issues at the moment.
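As a purely illustrative Regent sketch (hypothetical code, not from this thread): a program that keeps creating regions and never deletes them retains all of them, so memory grows even though the runtime isn't leaking anything. Deleting a region with `__delete` releases its resources:

```
import "regent"

task make_work()
  var r = region(ispace(int1d, 1024), double)
  -- ... use r ...
  -- Without this, the region (and any instances backing it) stays
  -- live for the rest of the run, which looks like a leak but is
  -- really retained state owned by the application:
  __delete(r)
end

task main()
  for i = 0, 1000 do
    make_work()
  end
end
regentlib.start(main)
```

The same reasoning applies to any resource the program creates repeatedly without cleaning up.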
I see. Can you give me some pointers to where such memory allocations might be happening? My solver uses just 3 large regions which are passed around in a lot of functions. Is there some de-allocation mechanism which I'm unaware of? Thanks.
If you're only allocating three regions then it's probably not them. It can also happen if you're making lots of partitions and throwing them away (without deleting them). I advise you to follow @elliottslaughter's instructions and the steps in #711 to see if you can identify what is happening.
@elliottslaughter I ran my solver for 1 timestep/iteration and the logs are 29 GB. I tried running
Is there a way to run with a smaller problem size? E.g. how many subregions and partitions are you creating? Also, just for sanity, what is your
Sure, give me a day and I will run it on a smaller grid and get back to you. Right now I have no partitions or subregions; my goal was just to make sure I get correct output. Everything is basically running on one core. My plan was to debug the base code first, and add parallelism later, since I'm new to Regent and parallel programming. (If you think this might not be a good approach and partitions should be incorporated from the beginning, please feel free to let me know.) I don't use the
Just to confirm, what's the failure mode? Are you watching memory usage climb in

If it's the former, then this is really starting to look like an application bug. There are still a handful of places where it could be happening in the runtime or compiler, but 10 MB/s is a lot, so this is looking increasingly unlikely.

If it's the latter, then probably the thing to do is to get in with Legion Prof and/or Legion Spy and figure out which task is causing the new instances to be created.
I was able to generate logs for a smaller run and analyse it with
And yes, I am seeing memory usage climb in
Can you attach that log here?
The log is about 8 GB. The output of
It should compress pretty well though; see if you can upload a .tar.gz file of the legion_gc output. It might also help us to see a Legion Spy log from just a few iterations.
Also, how many subregions do you have in your partitions? And how many nodes is the machine you are running on?
If you can't upload the full file, run legion_gc with the
I have no partitions right now. As I said, my goal was to get correct output first and add partitions later, since I'm new to Regent and parallel programming. I'm currently running on a single-node machine, and will use a distributed one later. You can find the file here: https://filebin.net/3yc339xnbnfrlwry
Ah, so I think I may have figured out the mystery. There are only two ways that you can make that many views with only 4 physical instances of regions, and if you don't do any partitioning, then there is only one. :) Can you share roughly how deep your task tree is? Do you happen to be doing recursion with any of your tasks?
I don't do any recursion.
All of these inner tasks just read data (but not

Could it be that, since I haven't done any partitioning, the parallelisation of these tasks is leading to the leaked views? Memory usage on commenting out

Also, could you please explain what views are and how you figured this out? It would be really helpful for me in the future. Thanks a lot.
There are two ways to address this (one of which I'll recommend over the other).
In general I recommend the second approach. Tasks are not the same thing as functions: they come with both runtime overhead and the memory needed to maintain internal state while executing them. Legion and Regent work best when these overheads are small compared to the amount of compute being performed and data being processed by the tasks. In general, Legion tasks need roughly ~1 ms of compute to amortize these overheads. In your particular case, this will probably mean having your tasks process many points at a time instead of just one.
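To make the granularity point concrete, here is a hypothetical Regent sketch (names and the `* 0.5` update are invented for illustration) contrasting a per-point task with a coarse-grained one that processes a whole region per launch:

```
import "regent"

-- Fine-grained: roughly one point of work per task launch, so the
-- runtime's per-task overhead dwarfs the actual compute.
task update_point(r : region(ispace(int1d), double), i : int1d)
where reads writes(r) do
  r[i] = r[i] * 0.5
end

-- Coarse-grained: one launch processes every point in the region,
-- amortizing the per-task cost over many points of compute.
task update_block(r : region(ispace(int1d), double))
where reads writes(r) do
  for i in r do
    r[i] = r[i] * 0.5
  end
end
```

The coarse-grained version does the same work with far fewer task launches, which is what keeps the runtime and memory overheads small relative to the compute.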
I see. One more question though. I just wish to understand how to use Regent and Legion efficiently. Thanks a ton for helping me out!
Right, the way to get parallelism is by partitioning. You control how much parallelism you get by choosing how many subregions to create for each partition. So you're not really at risk of losing parallelism; the partitions can be as fine as they need to be to enable the parallelism you require. Note that in your current setup, without partitioning, you are probably not getting much parallelism. That's because whenever you
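A minimal sketch of this pattern (hypothetical Regent code, assuming an equal partition fits the problem): split a region into disjoint subregions and launch a task on each, which lets the runtime run the launches in parallel because the pieces don't overlap:

```
import "regent"

task step(r : region(ispace(int1d), double))
where reads writes(r) do
  for i in r do
    r[i] += 1.0
  end
end

task main()
  var r = region(ispace(int1d, 1024 * 1024), double)
  fill(r, 0.0)
  -- The size of this color space sets the degree of parallelism;
  -- it can be made as fine as needed.
  var cs = ispace(int1d, 8)
  var p = partition(equal, r, cs)  -- disjoint, roughly equal pieces
  for c in cs do
    step(p[c])  -- launches on disjoint subregions can run in parallel
  end
end
regentlib.start(main)
```

Scaling up later is then mostly a matter of growing the color space rather than restructuring the tasks.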
Thank you, this was really helpful.
@rupanshusoi Are you still hitting this? If not, maybe we can close this issue.
No, it has been resolved. Thanks.
My CFD solver in Regent seems to leak about 10 MB/sec/node during execution. I'm not sure what is causing the leak and it's not clear to me which Legion debugging tool might be useful to locate memory leaks. Also, what are some common ways memory leaks can happen in Regent?