
Regent: memory leak #745

Closed
rupanshusoi opened this issue Feb 8, 2020 · 23 comments

@rupanshusoi
Contributor

My CFD solver in Regent seems to leak about 10 MB/sec/node during execution. I'm not sure what is causing the leak and it's not clear to me which Legion debugging tool might be useful to locate memory leaks. Also, what are some common ways memory leaks can happen in Regent?

@elliottslaughter
Contributor

If you update to the newest master, we just fixed one possible source of leaks.

Otherwise, you can follow the instructions here: #711 (comment). You don't need a super long run; 10-100 timesteps or so should be fine.

@elliottslaughter elliottslaughter self-assigned this Feb 8, 2020
@elliottslaughter elliottslaughter added bug planned Feature/fix to be actively worked on - needs release target Regent Issues pertaining to Regent labels Feb 8, 2020
@elliottslaughter elliottslaughter added this to the 20.03 milestone Feb 8, 2020
@lightsighter
Contributor

It's worth noting that it's not a leak if the program is actually still holding onto all the resources it's requesting and not deleting them. For example, some programs keep making lots of logical regions and never delete them which may look like a leak, but is actually a bug in the user code. We don't have any tools for finding those kinds of issues at the moment.

@rupanshusoi
Contributor Author

I see. Can you give me some pointers to where such memory allocations might be happening? My solver uses just 3 large regions which are passed around in a lot of functions. Is there some de-allocation mechanism which I'm unaware of? Thanks.

@lightsighter
Contributor

If you're only allocating three regions then it's probably not them. It can also happen if you're making lots of partitions and throwing them away (without deleting them). I advise you to follow @elliottslaughter's instructions and the steps in #711 to see if you can identify what is happening.
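To illustrate the pattern described above, here is a hypothetical Regent sketch (the names `cells`, `num_steps`, `N`, and `some_task` are made up, not from the original code). A loop that creates a fresh partition each timestep accumulates runtime metadata unless each partition is explicitly deleted, which can look like a leak even though the runtime is behaving correctly:

```
-- Hypothetical sketch: cells, num_steps, N, some_task are assumed names.
for t = 0, num_steps do
  -- A new partition object is created every iteration...
  var p = partition(equal, cells, ispace(int1d, N))
  for c in ispace(int1d, N) do
    some_task(p[c])
  end
  -- ...so it must be deleted, or the runtime retains its state for it.
  __delete(p)
end
```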

@rupanshusoi
Contributor Author

@elliottslaughter I ran my solver for 1 timestep/iteration and the logs are 29 GB. I tried running legion_gc.py on it locally but had to kill it when the RAM usage went over 100 GB. How do you want to proceed?

@elliottslaughter
Contributor

Is there a way to run with a smaller problem size? E.g. how many subregions and partitions are you creating?

Also just for sanity, what is your -ll:csize?

@rupanshusoi
Contributor Author

Sure, give me a day and I will run it on a smaller grid and get back to you.

Right now I have no partitions or subregions - my goal was just to make sure I get correct output. Everything is basically running on one core. My plan was to debug the base code first and add parallelism later, since I'm new to Regent and parallel programming. (If you think this might not be a good approach and partitions should be incorporated from the beginning, please feel free to let me know.)

I don't use the -ll:csize flag.

@elliottslaughter
Contributor

Just to confirm, what's the failure mode? Are you watching memory usage climb in htop (or similar) or are you seeing a message like "Default mapper failed allocation"?

If it's the former, then this is really starting to look like an application bug. There are still a handful of places where it could be happening in the runtime or compiler, but 10 MB/s is a lot, so this is looking increasingly unlikely.

If it's the latter, then probably the thing to do is to get in with Legion Prof and/or Legion Spy and figure out which task is causing the new instances to be created.

@rupanshusoi
Contributor Author

I was able to generate logs for a smaller run and analyse it with legion_gc.py. Here is the summary:

  Leaked Futures: 0
  Leaked Future Maps: 0
  LEAKED CONSTRAINTS: 4
  LEAKED MANAGERS: 4
  Pinned Managers: 0
  LEAKED VIEWS: 193580
  Leaked Equivalence Sets: 0
  LEAKED INDEX SPACES: 4
  Leaked Index Partitions: 0
  LEAKED FIELD SPACES: 4
  Leaked Regions: 0
  Leaked Partitions: 0

And yes, I am seeing memory usage climb in htop.

@lightsighter
Contributor

Can you attach that log here?

@rupanshusoi
Contributor Author

The log is about 8 GB. The output of legion_gc.py is about 350 MB.

@lightsighter
Contributor

It should compress pretty well though; see if you can upload a .tar.gz file of the legion_gc output. It might also help us to see a Legion Spy log from just a few iterations.

@lightsighter
Contributor

Also, how many subregions do you have in your partitions? How many nodes is the machine you are running on too?

@lightsighter
Contributor

If you can't upload the full file, run legion_gc with the -lv option and report just some of the output for a few of the leaked views.

@rupanshusoi
Contributor Author

I have no partitions right now. As I said, my goal was to get correct output first and add partitions later, since I'm new to Regent and parallel programming. I'm currently running on a single-node machine, and will use a distributed one later. htop shows only a single core taking the workload for the entire run.

You can find the file here https://filebin.net/3yc339xnbnfrlwry

@lightsighter
Contributor

Ah, so I think I may have figured out the mystery. There are only two ways you can make that many views with only 4 physical instances of regions, and if you don't do any partitioning, then there is only one. :) Can you share roughly how deep your task tree is? Do you happen to be doing recursion in any of your tasks?

@rupanshusoi
Contributor Author

I don't do any recursion.
But I think I know what you mean. The basic structure of my code can be seen here: #744. All the tasks inside the inner loop are basically a bunch of loops themselves. The only one of them that calls a lot of other tasks is flux_res. A loop inside flux_res would look something like:

for pt in wallpts do
    var a = flux1(pt)
    var b = flux2(pt)
    var c = flux3(pt)
    pt.flux = a + b + c
end

All of these inner tasks just read data (but not pt.flux) so they can be run simultaneously. The size of wallpts can be as big as the grid size, which is about 50,000 points for my current case.

Could it be that since I haven't done any partitioning, the parallelisation of these tasks is leading to the leaked views? Memory usage on commenting out flux_res was pretty much constant for 4-5 iterations.

Also, could you please explain what views are and how you figured this out? It would be really helpful for me in the future. Thanks a lot.

@lightsighter
Contributor

flux1, flux2, and flux3 are task launches themselves, right? If they are, then this is where we are getting into trouble. We're getting lots of outstanding tasks which are all trying to run in parallel based on how your mapper is choosing to map them. The reason we're getting lots of views is that your mapper is choosing to map the regions for the intermediate tasks in the task tree rather than using virtual mappings. Each of those mappings creates a "view" onto the physical instance for the region, and that's how we're getting a huge number of views with only four physical instances.

There are two ways to address this (one of which I'll recommend over the other).

  1. You can modify your mapper to be more disciplined in how you execute tasks to avoid mapping quite so far into the future or make virtual mappings for all your intermediate tasks that just turn around and launch sub-tasks.
  2. "Chunk" your tasks so that they handle more work instead of making lots of very small tasks to process individual points.

In general I recommend the second approach. Tasks are not the same as functions: they come with overhead, both runtime overhead and the memory needed to maintain internal state while executing them. Legion and Regent work best when these overheads are small compared to the amount of compute being performed and data being processed by tasks. In general, Legion tasks need about ~1 ms of compute to amortize the cost of these overheads. In your particular case this will probably mean having your tasks process many points instead of just one at a time.
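A hypothetical Regent sketch of the "chunking" approach (the `Point` type, field names, and flux signatures are assumptions, not from the original code): instead of launching flux1/flux2/flux3 as separate tasks per point, a single task processes a whole region of points and calls the fluxes as ordinary functions:

```
-- Hypothetical sketch; Point and the flux1/2/3 signatures are assumed.
task flux_res_chunk(pts : region(Point))
where reads(pts), writes(pts.flux) do
  for pt in pts do
    -- flux1/flux2/flux3 are plain functions here, not task launches,
    -- so no per-point runtime state (views, instances) is created.
    pt.flux = flux1(pt) + flux2(pt) + flux3(pt)
  end
end
```

One task launch now covers many points, so the per-task overhead is amortized over the whole chunk of work.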

@rupanshusoi
Contributor Author

I see. One more question though.
Assigning more computation to a task will lead to less parallelism than is theoretically possible, since in theory all of flux1, flux2 and flux3 can run in parallel for all points simultaneously. Would making several partitions and calling the tasks separately on each one be more efficient? My understanding is that having more partitions allows Legion to utilise more cores and hence achieve higher parallelism.

I just wish to understand how to use Regent and Legion efficiently. Thanks a ton for helping me out!

@elliottslaughter
Contributor

Right, the way to get parallelism is by partitioning. You can control how much parallelism by choosing how many subregions to create for each partition. So you're not really at risk of losing parallelism, the partitions can be as fine as they need to be to enable the parallelism you require.

Note that in your current setup, if you don't do partitioning, you are probably not getting much parallelism. That's because whenever you take reads writes privileges on a region, even if the task only modifies one point inside that region, Regent/Legion will assume it modifies the entire region. Regent/Legion don't do any fancy analysis to determine what you actually change inside your task; they just trust the privileges you declare yourself. So there may have been a small amount of parallelism in your code (if you used reads and no reads writes), but whenever you used reads writes it would have caused everything to block on that task.
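As a hypothetical Regent sketch of the partitioning approach (the names `wallpts`, `num_chunks`, and `flux_res_chunk` are assumptions): an equal partition splits the points into disjoint subregions, and launching one task per subregion lets Legion run them in parallel because their privileges don't conflict:

```
-- Hypothetical sketch; wallpts, num_chunks, flux_res_chunk are assumed names.
var colors = ispace(int1d, num_chunks)
var p = partition(equal, wallpts, colors)
-- Subregions of an equal partition are disjoint, so these launches
-- carry non-interfering privileges and can run in parallel.
__demand(__index_launch)
for c in colors do
  flux_res_chunk(p[c])
end
```

Choosing `num_chunks` sets the granularity: enough subregions to keep all cores busy, while each task still does well over ~1 ms of work.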

@rupanshusoi
Contributor Author

Thank you, this was really helpful.

@elliottslaughter
Contributor

@rupanshusoi Are you still hitting this? If not maybe we can close this issue.

@rupanshusoi
Contributor Author

No, it has been resolved. Thanks.
