
Regent: memory leak #745

Closed
rupanshusoi opened this issue Feb 8, 2020 · 23 comments

@rupanshusoi
Contributor

My CFD solver in Regent seems to leak about 10 MB/sec/node during execution. I'm not sure what is causing the leak and it's not clear to me which Legion debugging tool might be useful to locate memory leaks. Also, what are some common ways memory leaks can happen in Regent?

@elliottslaughter
Contributor

If you update to the newest master, we just fixed one possible source of leaks.

Otherwise, you can follow the instructions here: #711 (comment). You don't need a super long run; 10-100 timesteps or so should be fine.

@elliottslaughter elliottslaughter self-assigned this Feb 8, 2020
@elliottslaughter elliottslaughter added bug planned Feature/fix to be actively worked on - needs release target Regent Issues pertaining to Regent labels Feb 8, 2020
@elliottslaughter elliottslaughter added this to the 20.03 milestone Feb 8, 2020
@lightsighter
Contributor

It's worth noting that it's not a leak if the program is actually still holding onto all the resources it's requesting and not deleting them. For example, some programs keep making lots of logical regions and never delete them which may look like a leak, but is actually a bug in the user code. We don't have any tools for finding those kinds of issues at the moment.

@rupanshusoi
Contributor Author

I see. Can you give me some pointers to where such memory allocations might be happening? My solver uses just 3 large regions which are passed around in a lot of functions. Is there some de-allocation mechanism which I'm unaware of? Thanks.

@lightsighter
Contributor

If you're only allocating three regions then it's probably not them. It can also happen if you're making lots of partitions and throwing them away (without deleting them). I advise you to follow @elliottslaughter's instructions and the steps in #711 to see if you can identify what is happening.
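To illustrate the pattern described above, here is a hypothetical Regent sketch (the names `cells`, `num_steps`, `N`, and `some_task` are made up, not from the original code). A loop that creates a fresh partition each timestep accumulates runtime metadata unless each partition is explicitly deleted, which can look like a leak even though the runtime is behaving correctly:

```
-- Hypothetical sketch: cells, num_steps, N, some_task are assumed names.
for t = 0, num_steps do
  -- A new partition object is created every iteration...
  var p = partition(equal, cells, ispace(int1d, N))
  for c in ispace(int1d, N) do
    some_task(p[c])
  end
  -- ...so it must be deleted, or the runtime retains its state for it.
  __delete(p)
end
```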

@rupanshusoi
Contributor Author

@elliottslaughter I ran my solver for 1 timestep/iteration and the logs are 29 GB. I tried running legion_gc.py on it locally but had to kill it when the RAM usage went over 100 GB. How do you want to proceed?

@elliottslaughter
Contributor

Is there a way to run with a smaller problem size? E.g. how many subregions and partitions are you creating?

Also just for sanity, what is your -ll:csize?

@rupanshusoi
Contributor Author

Sure, give me a day and I will run it on a smaller grid and get back to you.

Right now I have no partitions or subregions - my goal was just to make sure I get correct output. Everything is basically running on one core. My plan was to debug the base code first and add parallelism later, since I'm new to Regent and parallel programming. (If you think this might not be a good approach and partitions should be incorporated from the beginning, please feel free to let me know.)

I don't use the -ll:csize flag.

@elliottslaughter
Contributor

Just to confirm, what's the failure mode? Are you watching memory usage climb in htop (or similar) or are you seeing a message like "Default mapper failed allocation"?

If it's the former, then this is really starting to look like an application bug. There are still a handful of places where it could be happening in the runtime or compiler, but 10 MB/s is a lot, so this is looking increasingly unlikely.

If it's the latter, then probably the thing to do is to get in with Legion Prof and/or Legion Spy and figure out which task is causing the new instances to be created.

@rupanshusoi
Contributor Author

I was able to generate logs for a smaller run and analyse it with legion_gc.py. Here is the summary:

  Leaked Futures: 0
  Leaked Future Maps: 0
  LEAKED CONSTRAINTS: 4
  LEAKED MANAGERS: 4
  Pinned Managers: 0
  LEAKED VIEWS: 193580
  Leaked Equivalence Sets: 0
  LEAKED INDEX SPACES: 4
  Leaked Index Partitions: 0
  LEAKED FIELD SPACES: 4
  Leaked Regions: 0
  Leaked Partitions: 0

And yes, I am seeing memory usage climb in htop.

@lightsighter
Contributor

Can you attach that log here?

@rupanshusoi
Contributor Author

The log is about 8 GB. The output of legion_gc.py is about 350 MB.

@lightsighter
Contributor

It should compress pretty well though; see if you can upload a .tar.gz file of the legion_gc output. It might also help us to see a Legion Spy log from just a few iterations.

@lightsighter
Contributor

Also, how many subregions do you have in your partitions? How many nodes is the machine you are running on too?

@lightsighter
Contributor

If you can't upload the full file, run legion_gc with the -lv option and report just some of the output for a few of the leaked views.

@rupanshusoi
Contributor Author

I have no partitions right now. As I said, my goal was to get correct output first and add partitions later, since I'm new to Regent and parallel programming. I'm currently running on a single-node machine, and will use a distributed one later. htop shows only a single core taking the workload for the entire run.

You can find the file here https://filebin.net/3yc339xnbnfrlwry

@lightsighter
Contributor

Ah, so I think I may have figured out the mystery. There are only two ways you can make that many views with only 4 physical instances of regions, and if you don't do any partitioning, then there is only one. :) Can you share roughly how deep your task tree is? Do you happen to be doing recursion in any of your tasks?

@rupanshusoi
Contributor Author

I don't do any recursion.
But I think I know what you mean. The basic structure of my code can be seen here: #744. All the tasks inside the inner loop are basically a bunch of loops themselves. The only one of them that calls a lot of other tasks is flux_res. A loop inside flux_res would look something like:

for pt in wallpts do
    var a = flux1(pt)
    var b = flux2(pt)
    var c = flux3(pt)
    pt.flux = a + b + c
end

All of these inner tasks just read data (but not pt.flux) so they can be run simultaneously. The size of wallpts can be as big as the grid size, which is about 50,000 points for my current case.

Could it be that since I haven't done any partitioning, the parallelisation of these tasks is leading to the leaked views? Memory usage on commenting out flux_res was pretty much constant for 4-5 iterations.

Also, could you please explain what views are and how you figured this out? It would be really helpful for me in the future. Thanks a lot.

@lightsighter
Contributor

flux1, flux2, and flux3 are task launches themselves, right? If they are, then this is where we are getting into trouble. We're getting lots of outstanding tasks which are all trying to run in parallel based on how your mapper is choosing to map them. The reason we're getting lots of views is that your mapper is choosing to map the regions for the intermediate tasks in the task tree rather than using virtual mappings. Each of those mappings creates a "view" onto the physical instance for the region, and that's how we're getting a huge number of views with only four physical instances.

There are two ways to address this (one of which I'll recommend over the other).

  1. You can modify your mapper to be more disciplined in how you execute tasks to avoid mapping quite so far into the future or make virtual mappings for all your intermediate tasks that just turn around and launch sub-tasks.
  2. "Chunk" your tasks so that they handle more work instead of making lots of very small tasks to process individual points.

In general I recommend the second approach. Tasks are not the same as functions: they come with overhead, both runtime overhead and the memory needed to maintain internal state while executing them. Legion and Regent work best when these overheads are small compared to the amount of compute being performed and data being processed by tasks. In general, Legion tasks need about ~1 ms of compute to amortize the cost of these overheads. In your particular case this will probably mean having your tasks process many points instead of just one at a time.
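A hypothetical Regent sketch of the "chunking" approach (the `Point` type, field names, and flux signatures are assumptions, not from the original code): instead of launching flux1/flux2/flux3 as separate tasks per point, a single task processes a whole region of points and calls the fluxes as ordinary functions:

```
-- Hypothetical sketch; Point and the flux1/2/3 signatures are assumed.
task flux_res_chunk(pts : region(Point))
where reads(pts), writes(pts.flux) do
  for pt in pts do
    -- flux1/flux2/flux3 are plain functions here, not task launches,
    -- so no per-point runtime state (views, instances) is created.
    pt.flux = flux1(pt) + flux2(pt) + flux3(pt)
  end
end
```

One task launch now covers many points, so the per-task overhead is amortized over the whole chunk of work.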

@rupanshusoi
Contributor Author

I see. One more question though.
Assigning more computation to a task will lead to less parallelism than is theoretically possible, since in theory all of flux1, flux2 and flux3 can run in parallel for all points simultaneously. Would making several partitions and calling the tasks separately on each one be more efficient? My understanding is that having more partitions allows Legion to utilise more cores and hence achieve higher parallelism.

I just wish to understand how to use Regent and Legion efficiently. Thanks a ton for helping me out!

@elliottslaughter
Contributor

Right, the way to get parallelism is by partitioning. You can control how much parallelism by choosing how many subregions to create for each partition. So you're not really at risk of losing parallelism, the partitions can be as fine as they need to be to enable the parallelism you require.

Note that in your current setup, if you don't do partitioning, you are probably not getting much parallelism. That's because whenever you take reads writes privileges on a region, even if the task only modifies one point inside that region, Regent/Legion will assume it modifies the entire region. Regent/Legion don't do any fancy analysis to determine what you actually change inside your task; they just trust the privileges you declare yourself. So there may have been a small amount of parallelism in your code (if you used reads and no reads writes), but whenever you used reads writes it would have caused everything to block on that task.
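As a hypothetical Regent sketch of the partitioning approach (the names `wallpts`, `num_chunks`, and `flux_res_chunk` are assumptions): an equal partition splits the points into disjoint subregions, and launching one task per subregion lets Legion run them in parallel because their privileges don't conflict:

```
-- Hypothetical sketch; wallpts, num_chunks, flux_res_chunk are assumed names.
var colors = ispace(int1d, num_chunks)
var p = partition(equal, wallpts, colors)
-- Subregions of an equal partition are disjoint, so these launches
-- carry non-interfering privileges and can run in parallel.
__demand(__index_launch)
for c in colors do
  flux_res_chunk(p[c])
end
```

Choosing `num_chunks` sets the granularity: enough subregions to keep all cores busy, while each task still does well over ~1 ms of work.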

@rupanshusoi
Contributor Author

Thank you, this was really helpful.

@elliottslaughter
Contributor

@rupanshusoi Are you still hitting this? If not maybe we can close this issue.

@rupanshusoi
Contributor Author

No, it has been resolved. Thanks.
