#2201: implement memory aware TemperedLB in vt #2203

Closed
wants to merge 73 commits

Conversation

ppebay
Contributor

@ppebay ppebay commented Oct 18, 2023

Resolves #2201

In particular, this PR:

  • adds the apparatus needed to support several transfer strategies, selectable by the user;
  • refactors the existing transfer stage into the Original strategy;
  • implements the SwapClusters transfer strategy, a simplification of LBAF's ClusteringStrategy (without sub-clustering).
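
As a rough illustration of the first two bullets, the sketch below shows one way a user-selectable transfer strategy could be parsed and dispatched. All names here (`TransferStrategy`, `parseTransferStrategy`, `runTransferStage`, and the stand-in stage functions) are hypothetical and chosen for this example; they are not taken from the vt sources.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum; the strategy names mirror those in the PR description.
enum class TransferStrategy { Original, SwapClusters };

// Map a user-supplied option string to a strategy.
// Returns false (leaving `out` untouched) if the name is not recognized.
bool parseTransferStrategy(std::string const& name, TransferStrategy& out) {
  if (name == "Original") { out = TransferStrategy::Original; return true; }
  if (name == "SwapClusters") { out = TransferStrategy::SwapClusters; return true; }
  return false;
}

// Stand-ins for the two transfer stages. Each returns a label so the
// dispatch can be checked without a runtime.
std::string originalTransfer() { return "original"; }
std::string swapClustersTransfer() { return "swap-clusters"; }

// Dispatch to the stage selected by the user.
std::string runTransferStage(TransferStrategy strategy) {
  switch (strategy) {
    case TransferStrategy::Original: return originalTransfer();
    case TransferStrategy::SwapClusters: return swapClustersTransfer();
  }
  return "unreachable";
}
```

Keeping the existing stage behind the `Original` label means current users see no behavior change unless they opt into another strategy.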

@ppebay ppebay requested review from lifflander and nlslatt October 18, 2023 10:37
@ppebay ppebay self-assigned this Oct 18, 2023
@ppebay ppebay linked an issue Oct 18, 2023 that may be closed by this pull request
1 task
@github-actions

github-actions bot commented Oct 18, 2023

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for 10c35df (2024-01-25 22:56:08 UTC)



The following tests FAILED:
  238 - vt:TestCheckpoint.test_checkpoint_in_place_2_proc_2 (Timeout)
  239 - vt:TestCheckpoint.test_checkpoint_in_place_3_proc_2 (Timeout)
  255 - vt:*/TestLoadBalancerOther.test_load_balancer_other_1/*_proc_2 (Timeout)
  256 - vt:*/TestLoadBalancerOther.test_load_balancer_other_keep_last_elm/*_proc_2 (Timeout)

Build log


Collaborator

@nlslatt nlslatt left a comment

Minor comments

@lifflander lifflander force-pushed the 2201-implement-memory-aware-temperedlb-in-vt branch 2 times, most recently from 3cd9309 to c13aa2f Compare November 29, 2023 17:11
Collaborator

@nlslatt nlslatt left a comment

I didn't do a full review, just wanted to make some comments.

@lifflander lifflander force-pushed the 2201-implement-memory-aware-temperedlb-in-vt branch from f2332af to 441d27b Compare December 4, 2023 20:06
@ppebay ppebay requested a review from lifflander December 11, 2023 20:07
@nlslatt
Collaborator

nlslatt commented Dec 15, 2023

@lifflander @ppebay I am unable to run this in a production environment:

vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [13] ------------------------------------------------ Fatal Error on Node 13 ------------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 
vt: [13]              Reason: Event being closed must be on the top of the open event stack.
vt: [13]    Assertion failed: (not open_events_.empty() and open_events_.back().ep == ep and open_events_.back().event == event)
vt: [13]                Node: 13
vt: [13]           Num Nodes: 14
vt: [13]                File: vt/src/vt/trace/trace.cc
vt: [13]                Line: 398
vt: [13]            Function: endProcessing
vt: [13]                Code: 1
vt: [13]           Build SHA: 68121476eacc3b25e4703bfbd22c9d91275f6046
vt: [13]           Build Ref: refs/heads/2201-implement-memory-aware-temperedlb-in-vt
vt: [13]         Description: heads/load-balancing-0-g68121476ea
vt: [13]            GIT Repo: *dirty*
vt: [13]            Hostname: mz7
vt: [13] 
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Dump Stack Backtrace on Node 13 --------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 0   18  0x1f54602 vt::debug::stack::dumpStack(int) + 50
vt: [13] 1   18  0x1b29cec vt::runtime::Runtime::output(std::string, int, bool, bool, bool) + 1516
vt: [13] 2   18  0x19d555e vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::string, int, bool, bool, bool, bool) + 94
vt: [13] 3   18  0x19d1da3 vt::output(std::string, int, bool, bool, bool, bool) + 67
vt: [13] 4   18  0x1dc1e80 std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::string, std::string const&, std::string const&, int, std::string const&, int, std::tuple<>&&) [clone .isra.0] + 192
vt: [13] 5   18  0x1dcfaf9 vt::trace::Trace::endProcessing(vt::trace::TraceProcessingTag const&, vt::TimeTypeWrapper) + 681
vt: [13] 6   18  0x1dd11d8 std::_Function_handler<void (), vt::trace::Trace::startup()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 24
vt: [13] 7   18  0x1f0af42 vt::sched::Scheduler::triggerEvent(vt::sched::SchedulerEvent const&) + 98
vt: [13] 8   18  0x229f8fd vt::vrt::collection::lb::TemperedLB::considerSwapsAfterLock(vt::messaging::MsgSharedPtr<vt::vrt::collection::lb::TemperedLB::LockedInfoMsg>) + 3965
vt: [13] 9   18  0x22a3c77 vt::vrt::collection::lb::TemperedLB::lockObtained(vt::vrt::collection::lb::TemperedLB::LockedInfoMsg*) + 2983
vt: [13] 10  18  0x1ee491a vt::runnable::RunnableNew::run() + 138
vt: [13] 11  18  0x2336fda vt::sched::BaseUnit::execute() + 26
vt: [13] 12  18  0x1f114bc vt::sched::Scheduler::runWorkUnit(vt::sched::BaseUnit&) + 92
vt: [13] 13  18  0x1f11bf7 vt::sched::Scheduler::runSchedulerOnceImpl(bool) + 1063
vt: [13] 14  18  0x22ac4b7 vt::vrt::collection::lb::TemperedLB::swapClusters() + 695
vt: [13] 15  18  0x22b35f6 vt::vrt::collection::lb::TemperedLB::doLBStages(double) + 7478
vt: [13] 16  18  0x22b419c vt::vrt::collection::lb::TemperedLB::runLB(double) + 1004
vt: [13] 17  0   0x0 Unwinding error: unable to obtain symbol name for this frame + 0
vt: [13] 18  18  0x1cf2fba vt::vrt::collection::balance::LBManager::runLB(unsigned long, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 2234
vt: [13] 19  18  0x1cf41bd vt::vrt::collection::balance::LBManager::startLB(unsigned long, vt::vrt::collection::balance::LBType, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 3053
vt: [13] 20  18  0x1cf4ec9 vt::vrt::collection::balance::LBManager::selectStartLB(unsigned long) + 569
vt: [13] 21  18  0x1aea35f void vt::runInEpoch<vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}>(vt::epoch::EpochType, vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}&&) + 111
vt: [13] 22  18  0x1aea9f3 vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook) + 787
vt: [13] 23  18  0x1aef658 vt::phase::PhaseManager::nextPhaseCollective() + 328
[snip]
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [13] ------------------------------------------------ Fatal Error on Node 13 ------------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 
vt: [13] Message: Assertion Failed
vt: [13] 
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Dump Stack Backtrace on Node 13 --------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 0   18  0x1f54602 vt::debug::stack::dumpStack(int) + 50
vt: [13] 1   18  0x1b29cec vt::runtime::Runtime::output(std::string, int, bool, bool, bool) + 1516
vt: [13] 2   18  0x1b2a437 vt::runtime::Runtime::abort(std::string, int) + 55
vt: [13] 3   18  0x19d5418 vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::abort(std::string, int) + 72
vt: [13] 4   18  0x19d1d00 vt::abort(std::string, int) + 32
vt: [13] 5   18  0x19d55cf vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::string, int, bool, bool, bool, bool) + 207
vt: [13] 6   18  0x19d1da3 vt::output(std::string, int, bool, bool, bool, bool) + 67
vt: [13] 7   18  0x1dc1e80 std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::string, std::string const&, std::string const&, int, std::string const&, int, std::tuple<>&&) [clone .isra.0] + 192
vt: [13] 8   18  0x1dcfaf9 vt::trace::Trace::endProcessing(vt::trace::TraceProcessingTag const&, vt::TimeTypeWrapper) + 681
vt: [13] 9   18  0x1dd11d8 std::_Function_handler<void (), vt::trace::Trace::startup()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 24
vt: [13] 10  18  0x1f0af42 vt::sched::Scheduler::triggerEvent(vt::sched::SchedulerEvent const&) + 98
vt: [13] 11  18  0x229f8fd vt::vrt::collection::lb::TemperedLB::considerSwapsAfterLock(vt::messaging::MsgSharedPtr<vt::vrt::collection::lb::TemperedLB::LockedInfoMsg>) + 3965
vt: [13] 12  18  0x22a3c77 vt::vrt::collection::lb::TemperedLB::lockObtained(vt::vrt::collection::lb::TemperedLB::LockedInfoMsg*) + 2983
vt: [13] 13  18  0x1ee491a vt::runnable::RunnableNew::run() + 138
vt: [13] 14  18  0x2336fda vt::sched::BaseUnit::execute() + 26
vt: [13] 15  18  0x1f114bc vt::sched::Scheduler::runWorkUnit(vt::sched::BaseUnit&) + 92
vt: [13] 16  18  0x1f11bf7 vt::sched::Scheduler::runSchedulerOnceImpl(bool) + 1063
vt: [13] 17  18  0x22ac4b7 vt::vrt::collection::lb::TemperedLB::swapClusters() + 695
vt: [13] 18  18  0x22b35f6 vt::vrt::collection::lb::TemperedLB::doLBStages(double) + 7478
vt: [13] 19  18  0x22b419c vt::vrt::collection::lb::TemperedLB::runLB(double) + 1004
vt: [13] 20  0   0x0 Unwinding error: unable to obtain symbol name for this frame + 0
vt: [13] 21  18  0x1cf2fba vt::vrt::collection::balance::LBManager::runLB(unsigned long, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 2234
vt: [13] 22  18  0x1cf41bd vt::vrt::collection::balance::LBManager::startLB(unsigned long, vt::vrt::collection::balance::LBType, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 3053
vt: [13] 23  18  0x1cf4ec9 vt::vrt::collection::balance::LBManager::selectStartLB(unsigned long) + 569
vt: [13] 24  18  0x1aea35f void vt::runInEpoch<vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}>(vt::epoch::EpochType, vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}&&) + 111
vt: [13] 25  18  0x1aea9f3 vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook) + 787
vt: [13] 26  18  0x1aef658 vt::phase::PhaseManager::nextPhaseCollective() + 328
[snip]

@lifflander
Collaborator

I think that the way I've implemented this with a recursive handler is causing tracing issues. I will look into it.
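
To make the assertion in the log concrete: the message says an event being closed must be on top of the open event stack, i.e. events must close in last-in, first-out order. The sketch below is a hypothetical miniature (`MiniTrace` and both helper functions are invented for illustration, not vt's `Trace` class) showing how closing an outer event while a nested one is still open breaks that rule. It illustrates the failure class only and is not a diagnosis of the actual bug.

```cpp
#include <cassert>
#include <vector>

// Hypothetical miniature of a tracer that requires LIFO open/close of events.
struct MiniTrace {
  std::vector<int> open_events;

  void begin(int event) { open_events.push_back(event); }

  // Returns false where an assertion like the one in the log would fire:
  // the event being closed is not the most recently opened one.
  bool end(int event) {
    if (open_events.empty() || open_events.back() != event) {
      return false;
    }
    open_events.pop_back();
    return true;
  }
};

// An outer handler (event 1) triggers nested work (event 2). Closing the
// outer event while the inner one is still open violates the LIFO rule.
bool closeOuterWhileInnerOpen() {
  MiniTrace trace;
  trace.begin(1);      // outer handler starts
  trace.begin(2);      // nested work starts inside the outer handler
  return trace.end(1); // outer closed first: out of order
}

// The same nesting, closed innermost-first, satisfies the check.
bool closeInOrder() {
  MiniTrace trace;
  trace.begin(1);
  trace.begin(2);
  bool inner_ok = trace.end(2);
  bool outer_ok = trace.end(1);
  return inner_ok && outer_ok;
}
```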

ppebay and others added 24 commits January 25, 2024 14:55
@lifflander lifflander force-pushed the 2201-implement-memory-aware-temperedlb-in-vt branch from 6130120 to 10c35df Compare January 25, 2024 23:13
Collaborator

@nlslatt nlslatt left a comment

I've flagged some stuff that needs to be fixed (mostly typos) on the rebased branch (not this one!)

tasks for the other shared_ids across all ranks do not need to be split across
multiple ranks to perfectly balance the load (time).

Below is one solution with a perfectly balanced load and decent communication.
Collaborator

Remove the word "perfectly" and replace it with something weaker, as long as the solution presented below is in fact well balanced. If the solution below is not well balanced, delete it.

Collaborator

Removed "perfectly", but I haven't checked the given solution.

}

bool TemperedLB::memoryTransferCriterion(double try_total_bytes, double src_bytes) {
  // FIXME: incomplete implementation that ignores memory regrouping
Collaborator

This FIXME isn't needed for swapping whole clusters, just for subclustering, right?
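
For context on what a whole-cluster memory check might involve, here is a minimal sketch. It assumes a single per-rank byte limit and treats each cluster's footprint as one number; both are assumptions made for illustration. The function `swapFitsInMemory` and its parameters are hypothetical, it is not the `memoryTransferCriterion` under review, and it does not model the regrouping of shared memory that the FIXME refers to.

```cpp
#include <cassert>

// Hypothetical check for swapping one whole cluster from each of two ranks.
// Assumes each rank has the same byte limit and that a cluster's memory
// moves with it as a single lump (no shared-block regrouping).
bool swapFitsInMemory(
  double src_rank_bytes,    // current footprint of the source rank
  double dst_rank_bytes,    // current footprint of the destination rank
  double src_cluster_bytes, // cluster leaving the source rank
  double dst_cluster_bytes, // cluster leaving the destination rank
  double limit_bytes        // assumed per-rank memory limit
) {
  double src_after = src_rank_bytes - src_cluster_bytes + dst_cluster_bytes;
  double dst_after = dst_rank_bytes - dst_cluster_bytes + src_cluster_bytes;
  return src_after <= limit_bytes && dst_after <= limit_bytes;
}
```

With a limit of 100 bytes, swapping a 30-byte cluster on an 80-byte rank with a 20-byte cluster on a 60-byte rank leaves both ranks at 70 bytes, so the swap is accepted under these assumptions.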

@nlslatt
Collaborator

nlslatt commented Apr 15, 2024

I fixed the above typos in the rebased branch already.

@lifflander
Collaborator

I fixed the above typos in the rebased branch already.

Do you think we should fully implement the sub-clustering with the full work model before we merge this?

@nlslatt
Collaborator

nlslatt commented May 7, 2024

Closing because this is superseded by #2278

@nlslatt nlslatt closed this May 7, 2024
Development

Successfully merging this pull request may close these issues.

Implement memory-aware TemperedLB in VT
3 participants