-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#2388: LB stat files should not be written out every phase and should be marked with a trace event #2393
Conversation
Pipelines resultsPR tests (clang-9, ubuntu, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (clang-13, alpine, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (clang-12, ubuntu, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (clang-11, ubuntu, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (gcc-9, ubuntu, mpich, zoltan) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (clang-13, ubuntu, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (clang-14, ubuntu, mpich, verbose) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (gcc-12, ubuntu, mpich, verbose, kokkos) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (clang-10, ubuntu, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (intel icpx, ubuntu, mpich, verbose) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (intel icpc, ubuntu, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose) Build for 1771158 (2025-02-10 19:06:00 UTC)
PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich) Build for 1771158 (2025-02-10 19:06:00 UTC)
|
return; | ||
} | ||
|
||
#if vt_check_enabled(trace_enabled) | ||
theTrace()->addUserEventBracketedBegin(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to actually register a user event with an associated name and pass it here instead of just passing 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
966c8df
to
968c2b9
Compare
return; | ||
} | ||
|
||
#if vt_check_enabled(trace_enabled) | ||
auto write_lb_stats_event = theTrace()->registerUserEventColl("write_lb_stats"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to do this once the first time it's used (or on startup) and then reuse the event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check to registerUserEventColl()
so that it will first look for the event name in user_event_
before creating a new event hash. If the event already exists, it returns the corresponding ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still an issue here that this is a collective call that all nodes must make. It seems this is conditional on the node, which will cause a problem I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed; I create the write event on startup, and added a getter to theTrace()
to access it.
return; | ||
} | ||
|
||
#if vt_check_enabled(trace_enabled) | ||
auto write_lb_stats_event = theTrace()->registerUserEventColl("write_lb_stats"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still an issue here that this is a collective call that all nodes must make. It seems this is conditional on the node, which will cause a problem I think.
5ba7405
to
5744127
Compare
src/vt/trace/trace.cc
Outdated
@@ -84,6 +84,7 @@ void Trace::initialize() /*override*/ { | |||
|
|||
// Register a trace user event to demarcate flushes that occur | |||
flush_event_ = trace::registerEventCollective("trace_flush"); | |||
write_stats_event_ = trace::registerEventCollective("write_lb_stats"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to LBManager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Fixes #2388