Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Legion_prof: assertion failed: copy_inst_info._src.unwrap() == chan_src #1606

Closed
Tracked by #1032
mariodirenzo opened this issue Dec 1, 2023 · 13 comments
Closed
Tracked by #1032

Comments

@mariodirenzo
Copy link

I'm trying to visualize the timeline of the two attached logs produced with the commit cbbc900287e1c20a5c1e11e82bea0e27cdaf577f of Legion. However, I keep getting the assertion

thread 'main' panicked at 'assertion failed: copy_inst_info._src.unwrap() == chan_src', src/state.rs:2076:17

Do you have any advice on what might be wrong?

@elliottslaughter, can you please add this to #1032?

prof_0.log
prof_1.log

@eddy16112
Copy link
Contributor

Usually a CopyInfo can contains multiple CopyInstInfos, which have the same src and dst memories. This error tells some of the CopyInstInfos do not have the same src/dst memories. It is usually a Legion bug.
@lightsighter Here is what I found. The CopyInfo has 3 CopyInstInfos:

src=System Memory 0x1e00010000000000,dst=System Memory 0x1e00000000000000,fevent=9223372174312603679 ,indirect= False
src=System Memory 0x1e00010000000000,dst=System Memory 0x1e00000000000000,fevent=9223372174312603679 ,indirect= False
src=Framebuffer Memory 0x1e00010000000002,dst=System Memory 0x1e00000000000000,fevent=9223372174312603679 ,indirect= False

The src of the last one is not consistence with the first two.

@lightsighter
Copy link
Contributor

What are the other parameters for each of the CopyInstInfos?

@eddy16112
Copy link
Contributor

What are the other parameters for each of the CopyInstInfos?

Sure, they are

src=System Memory 0x1e00010000000000,dst=System Memory 0x1e00000000000000,src_fid=25,dst_fid25,src_inst_uid=0x8000804001d00011,dst_inst_uid=0x80000020001000fd,num_hops=2,fevent=9223372174312603679 ,indirect= False

src=System Memory 0x1e00010000000000,dst=System Memory 0x1e00000000000000,src_fid=28,dst_fid28,src_inst_uid=0x8000804001d00011,dst_inst_uid=0x80000020001000fd,num_hops=2,fevent=9223372174312603679 ,indirect= False

src=Framebuffer Memory 0x1e00010000000002,dst=System Memory 0x1e00000000000000,src_fid=26,dst_fid26,src_inst_uid=0x800080200260002a,dst_inst_uid=0x80000020001000fd,num_hops=3,fevent=9223372174312603679 ,indirect= False

@lightsighter
Copy link
Contributor

This isn't a bug in Legion; it is an issue with the profiler. Look at the src_inst_uid parameters in the CopyInstInfo objects. Two of them are the same while the third one is different which corresponds with fact that memories are different. Realm allows clients like Legion to group copies together so you can issue copies from multiple source instances together in the same Realm copy operation for different fields, and Legion has done exactly that fusion here. The fact that the profiler is not handling that case is its problem.

@eddy16112
Copy link
Contributor

The hierarchy of the profiler is:

  • Channel
    • CopyInfo
      • CopyInstInfo

but now a CopyInfo contains two types of CopyInstInfo, which belong to different channels, I am not sure how to handle it properly. Why does Legion group these two types into one CopyInfo? Since their src and dst are different, Realm will eventually issue at least two copies respectively.

@lightsighter
Copy link
Contributor

lightsighter commented Dec 3, 2023

Since their src and dst are different, Realm will eventually issue at least two copies respectively.

I don't think that is always true. If the src and dst instances are on different nodes and a multi-hop copy is required, then Realm can do the copies from each of the source instances into one intermediate buffer, and then do a single RDMA operation from the intermediate buffer to the destination instance. The Realm API encourages clients to express copies fused together in this way so that it can take advantage of optimizations like that, and Legion is a good client that tries to give Realm as much information in as few API calls as possible. Realm might end up splitting some of these copies apart, but some of them might also get fused.

but now a CopyInfo contains two types of CopyInstInfo, which belong to different channels, I am not sure how to handle it properly.

How is it different than what we do with the gather copies? In some sense this is a kind of gather copy, but just gathering different fields from different source instances into the same destination instance. Another possibility is just to put this copy on both channels, and only show the respective fields associated with each channel.

@eddy16112
Copy link
Contributor

How is it different than what we do with the gather copies?

For gather copies, we will at least have one CopyInstInfo whose indirect is true, then we can identify it is a gather copy, and then create a new/find an existing gather channel using the dst as the identification. If we consider this copy as a gather copy, then we mixed it with other indirect copies.

Another possibility is just to put this copy on both channels, and only show the respective fields associated with each channel.

I am not sure if having one copy in multiple channels will break the current design of legion prof. If it does not work, I am thinking about cloning the CopyInfo into two, and first one will have the first two CopyInstInfo, and the second one will have the last CoptInstInfo, then we can put these two CopyInfo into the correct channels. @elliottslaughter any ideas?

@lightsighter
Copy link
Contributor

I personally prefer the latter approach.

@elliottslaughter
Copy link
Contributor

If "the latter approach" means splitting the copy into one part per channel, I agree.

@eddy16112
Copy link
Contributor

@mariodirenzo Can you please try this commit https://gitlab.com/StanfordLegion/legion/-/commit/592efe8d0230b25a76c1139e9094f00a94fdf2b6? This is a fix for the legion_prof.py, if it works for you, I will apply the same fix for the rust code.

@mariodirenzo
Copy link
Author

The patch works for me. Thanks for working on it

@elliottslaughter
Copy link
Contributor

@mariodirenzo We made a few more fixes and then merged it into master and control_replication. Please confirm it works for you.

@mariodirenzo
Copy link
Author

Both python and rust versions of the profiler work now. Thanks for fixing this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants