Legion: AllGatherCollective<false>::~AllGatherCollective() [INORDER = false]: Assertion `done_triggered' failed. #1590

Closed
Tracked by #1032
mariodirenzo opened this issue Nov 6, 2023 · 5 comments


@mariodirenzo

mariodirenzo commented Nov 6, 2023

I'm intermittently hitting this assertion on a two-node run on sapling. When I do not hit the assertion, the execution hangs.
The backtraces for the two threads that end up in clock_nanosleep are:

#0  0x00007f8e35e3523f in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f8e35e3aec7 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f8e35e3adfe in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x000055fd40ab32ce in Realm::realm_freeze (signal=6) at realm/runtime_impl.cc:200
#4  <signal handler called>
#5  0x00007f8e35d9b00b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f8e35d7a859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007f8e360058d1 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f8e3601137c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f8e360113e7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007f8e36012145 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x000055fd406c5f73 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7f8e02510d90, stage=0, derez=...) at legion/legion_replication.cc:12713
#12 0x000055fd406c5ec9 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7f8e02510d90, derez=...) at legion/legion_replication.cc:12455
#13 0x000055fd405e0c5a in Legion::Internal::ReplicateContext::handle_collective_message (this=0x7f8e0a584840, derez=...) at legion/legion_context.cc:21075
#14 0x000055fd3f98e576 in Legion::Internal::ShardTask::handle_collective_message (this=0x7f8dc4001420, derez=...) at legion/legion_tasks.cc:8168
#15 0x000055fd406b1444 in Legion::Internal::ShardManager::handle_collective_message (this=0x7f8dc4000cd0, derez=...) at legion/legion_replication.cc:10638
#16 0x000055fd406b5b9d in Legion::Internal::ShardManager::handle_collective_message (derez=..., runtime=0x55fd58d72bc0) at legion/legion_replication.cc:11728
#17 0x000055fd3fc359e3 in Legion::Internal::VirtualChannel::handle_messages (this=0x7f8e0a509d70, num_messages=1, runtime=0x55fd58d72bc0, remote_address_space=0, args=0x7f8dea5c6ea0 "@", arglen=40) at legion/runtime.cc:13376
#18 0x000055fd3fc33889 in Legion::Internal::VirtualChannel::process_message (this=0x7f8e0a509d70, args=0x7f8dea5c6e84, arglen=60, runtime=0x55fd58d72bc0, remote_address_space=0) at legion/runtime.cc:11742
#19 0x000055fd3fc3d936 in Legion::Internal::MessageManager::receive_message (this=0x7f8e0a509d00, args=0x7f8dea5c6e80, arglen=68) at legion/runtime.cc:13524
#20 0x000055fd3fc708c9 in Legion::Internal::Runtime::process_message_task (this=0x55fd58d72bc0, args=0x7f8dea5c6e7c, arglen=72) at legion/runtime.cc:26647
#21 0x000055fd3fc82183 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f8dea5c6e70, arglen=76, userdata=0x55fd58d72360, userlen=8, p=...) at legion/runtime.cc:32338
#22 0x000055fd40a8572b in Realm::LocalTaskProcessor::execute_task (this=0x55fd57808c30, func_id=4, task_args=...) at realm/proc_impl.cc:1175
#23 0x000055fd40af8f6d in Realm::Task::execute_on_processor (this=0x7f8e14035250, p=...) at realm/tasks.cc:326
#24 0x000055fd40afd3bc in Realm::KernelThreadTaskScheduler::execute_task (this=0x55fd45d83700, task=0x7f8e14035250) at realm/tasks.cc:1421
#25 0x000055fd40afc150 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd45d83700) at realm/tasks.cc:1160
#26 0x000055fd40afc72a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55fd45d83700) at realm/tasks.cc:1272
#27 0x000055fd40b0d03c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55fd45d83700) at realm/threads.inl:97
#28 0x000055fd40b1018a in Realm::KernelThread::pthread_entry (data=0x7f8dc650f220) at realm/threads.cc:781
#29 0x00007f8e43f7a609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#30 0x00007f8e35e77133 in clone () from /lib/x86_64-linux-gnu/libc.so.6

and

#0  0x00007f8e35e3523f in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f8e35e3aec7 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f8e35e3adfe in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x000055fd40ab32ce in Realm::realm_freeze (signal=6) at realm/runtime_impl.cc:200
#4  <signal handler called>
#5  0x00007f8e35d9b00b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f8e35d7a859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007f8e35d7a729 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007f8e35d8bfd6 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000055fd406c555a in Legion::Internal::AllGatherCollective<false>::~AllGatherCollective (this=0x7f8e02510d90) at legion/legion_replication.cc:12387
#10 0x000055fd406d75e7 in Legion::Internal::CreateCollectiveFillView::~CreateCollectiveFillView (this=0x7f8e02510d90) at legion/legion_replication.h:1372
#11 0x000055fd406d7609 in Legion::Internal::CreateCollectiveFillView::~CreateCollectiveFillView (this=0x7f8e02510d90) at legion/legion_replication.h:1372
#12 0x000055fd4068b692 in Legion::Internal::ReplIndexFillOp::deactivate (this=0x7f8dea5cd200, freeop=true) at legion/legion_replication.cc:2890
#13 0x000055fd3f7fa922 in Legion::Internal::Operation::commit_operation (this=0x7f8dea5cd200, do_deactivate=true, wait_on=...) at legion/legion_ops.cc:2291
#14 0x000055fd3f849698 in Legion::Internal::IndexFillOp::trigger_commit (this=0x7f8dea5cd200) at legion/legion_ops.cc:19231
#15 0x000055fd3f7fa458 in Legion::Internal::Operation::complete_operation (this=0x7f8dea5cd200, wait_on=..., first_invocation=true) at legion/legion_ops.cc:2110
#16 0x000055fd3f847501 in Legion::Internal::FillOp::trigger_complete (this=0x7f8dea5cd200) at legion/legion_ops.cc:18741
#17 0x000055fd3f7f9e0c in Legion::Internal::Operation::complete_execution (this=0x7f8dea5cd200, wait_on=...) at legion/legion_ops.cc:1978
#18 0x000055fd3f84745b in Legion::Internal::FillOp::trigger_execution (this=0x7f8dea5cd200) at legion/legion_ops.cc:18730
#19 0x000055fd4068bffc in Legion::Internal::ReplIndexFillOp::trigger_ready (this=0x7f8dea5cd200) at legion/legion_replication.cc:3007
#20 0x000055fd3fd927be in Legion::Internal::Memoizable<Legion::Internal::ReplIndexFillOp>::trigger_ready (this=0x7f8dea5cd200) at legion/legion_ops.inl:111
#21 0x000055fd3fd91f60 in Legion::Internal::Predicated<Legion::Internal::ReplIndexFillOp>::trigger_ready (this=0x7f8dea5cd200) at legion/legion_ops.inl:215
#22 0x000055fd405983e2 in Legion::Internal::InnerContext::process_ready_queue (this=0x7f8e0a584840) at legion/legion_context.cc:8731
#23 0x000055fd405a9411 in Legion::Internal::InnerContext::handle_ready_queue (args=0x7f8e0250d810) at legion/legion_context.cc:12492
#24 0x000055fd3fc8219f in Legion::Internal::Runtime::legion_runtime_task (args=0x7f8e0250d810, arglen=12, userdata=0x55fd5b28fcd0, userlen=8, p=...) at legion/runtime.cc:32348
#25 0x000055fd40a8572b in Realm::LocalTaskProcessor::execute_task (this=0x55fd57809920, func_id=4, task_args=...) at realm/proc_impl.cc:1175
#26 0x000055fd40af8f6d in Realm::Task::execute_on_processor (this=0x7f8e0250d690, p=...) at realm/tasks.cc:326
#27 0x000055fd40afd3bc in Realm::KernelThreadTaskScheduler::execute_task (this=0x55fd45d12d00, task=0x7f8e0250d690) at realm/tasks.cc:1421
#28 0x000055fd40afc150 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd45d12d00) at realm/tasks.cc:1160
#29 0x000055fd40afc72a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55fd45d12d00) at realm/tasks.cc:1272
#30 0x000055fd40b0d03c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55fd45d12d00) at realm/threads.inl:97
#31 0x000055fd40b1018a in Realm::KernelThread::pthread_entry (data=0x7f8dac019b30) at realm/threads.cc:781
#32 0x00007f8e43f7a609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#33 0x00007f8e35e77133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
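
One plausible reading of the two backtraces above, offered as an assumption rather than a confirmed diagnosis: both operate on the same object (this=0x7f8e02510d90), with one thread inside the base-class destructor while the other dispatches the virtual unpack_stage on it. Once a derived destructor has finished, virtual calls resolve to the base class, and if the base version is pure virtual the call ends up in __cxa_pure_virtual. The sketch below shows only that dispatch rule, single-threaded, with hypothetical class names and a non-pure base method so that the behavior stays well-defined. It is not Legion code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a base collective; NOT the Legion class.
struct CollectiveBase {
  explicit CollectiveBase(std::vector<std::string>& log) : log_(log) {}
  virtual ~CollectiveBase() {
    // The derived part has already been destroyed when this body runs,
    // so this virtual call resolves to CollectiveBase::unpack_stage.
    log_.push_back(unpack_stage());
  }
  // Non-pure on purpose: a pure virtual here would make the call above
  // undefined behavior (typically an abort via __cxa_pure_virtual).
  virtual std::string unpack_stage() { return "base"; }

 protected:
  std::vector<std::string>& log_;
};

// Hypothetical stand-in for a derived collective.
struct FillViewCollective : CollectiveBase {
  using CollectiveBase::CollectiveBase;
  std::string unpack_stage() override { return "derived"; }
};

// Records which override runs while the object is alive and which one
// runs from inside the base destructor.
std::vector<std::string> run_demo() {
  std::vector<std::string> log;
  {
    FillViewCollective c(log);
    CollectiveBase& base_ref = c;
    log.push_back(base_ref.unpack_stage());  // object alive: derived override
  }  // destructor runs here: base version is recorded
  assert(log.size() == 2);  // one entry from the live call, one from the destructor
  return log;
}
```

In a multi-threaded setting, a message handler that reaches the object after its destruction has begun would see the same base-class dispatch, which is why an object like this normally has to outlive every message that can still arrive for it.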

A failing execution will be available with debug symbols at

Legion process received signal 6: Aborted
Process 2581434 on node g0004.stanford.edu is frozen!
Legion process received signal 6: Aborted
Process 2581434 on node g0004.stanford.edu is frozen!

for another 2 hours and 45 mins. (cuda-gdb is required to see the debug symbols)

@elliottslaughter, can you please add this issue to #1032 with high priority?

@mariodirenzo
Author

The previous job expired, and I now have a new process that failed with the same assertion:

Process 2583453 on node g0004.stanford.edu is frozen!
Process 2583453 on node g0004.stanford.edu is frozen!

@lightsighter
Contributor

What happens if you run with -lg:safe_ctrlrepl 1?

@lightsighter
Contributor

Also, these line numbers do not align with the most recent control replication commit. Please run with the most recent control replication.

@lightsighter
Contributor

Pull and try again.

@mariodirenzo
Author

This issue is resolved on the latest shardrefine. Thanks.
