I'm randomly hitting this assertion on a two-node run on sapling. When I do not hit the assertion, the execution hangs.
The backtraces for the two threads that end up in clock_nanosleep are:
#0 0x00007f8e35e3523f in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f8e35e3aec7 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f8e35e3adfe in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x000055fd40ab32ce in Realm::realm_freeze (signal=6) at realm/runtime_impl.cc:200
#4 <signal handler called>
#5 0x00007f8e35d9b00b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007f8e35d7a859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007f8e360058d1 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007f8e3601137c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007f8e360113e7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007f8e36012145 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x000055fd406c5f73 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7f8e02510d90, stage=0, derez=...) at legion/legion_replication.cc:12713
#12 0x000055fd406c5ec9 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7f8e02510d90, derez=...) at legion/legion_replication.cc:12455
#13 0x000055fd405e0c5a in Legion::Internal::ReplicateContext::handle_collective_message (this=0x7f8e0a584840, derez=...) at legion/legion_context.cc:21075
#14 0x000055fd3f98e576 in Legion::Internal::ShardTask::handle_collective_message (this=0x7f8dc4001420, derez=...) at legion/legion_tasks.cc:8168
#15 0x000055fd406b1444 in Legion::Internal::ShardManager::handle_collective_message (this=0x7f8dc4000cd0, derez=...) at legion/legion_replication.cc:10638
#16 0x000055fd406b5b9d in Legion::Internal::ShardManager::handle_collective_message (derez=..., runtime=0x55fd58d72bc0) at legion/legion_replication.cc:11728
#17 0x000055fd3fc359e3 in Legion::Internal::VirtualChannel::handle_messages (this=0x7f8e0a509d70, num_messages=1, runtime=0x55fd58d72bc0, remote_address_space=0, args=0x7f8dea5c6ea0 "@", arglen=40) at legion/runtime.cc:13376
#18 0x000055fd3fc33889 in Legion::Internal::VirtualChannel::process_message (this=0x7f8e0a509d70, args=0x7f8dea5c6e84, arglen=60, runtime=0x55fd58d72bc0, remote_address_space=0) at legion/runtime.cc:11742
#19 0x000055fd3fc3d936 in Legion::Internal::MessageManager::receive_message (this=0x7f8e0a509d00, args=0x7f8dea5c6e80, arglen=68) at legion/runtime.cc:13524
#20 0x000055fd3fc708c9 in Legion::Internal::Runtime::process_message_task (this=0x55fd58d72bc0, args=0x7f8dea5c6e7c, arglen=72) at legion/runtime.cc:26647
#21 0x000055fd3fc82183 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f8dea5c6e70, arglen=76, userdata=0x55fd58d72360, userlen=8, p=...) at legion/runtime.cc:32338
#22 0x000055fd40a8572b in Realm::LocalTaskProcessor::execute_task (this=0x55fd57808c30, func_id=4, task_args=...) at realm/proc_impl.cc:1175
#23 0x000055fd40af8f6d in Realm::Task::execute_on_processor (this=0x7f8e14035250, p=...) at realm/tasks.cc:326
#24 0x000055fd40afd3bc in Realm::KernelThreadTaskScheduler::execute_task (this=0x55fd45d83700, task=0x7f8e14035250) at realm/tasks.cc:1421
#25 0x000055fd40afc150 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd45d83700) at realm/tasks.cc:1160
#26 0x000055fd40afc72a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55fd45d83700) at realm/tasks.cc:1272
#27 0x000055fd40b0d03c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55fd45d83700) at realm/threads.inl:97
#28 0x000055fd40b1018a in Realm::KernelThread::pthread_entry (data=0x7f8dc650f220) at realm/threads.cc:781
#29 0x00007f8e43f7a609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#30 0x00007f8e35e77133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
and
#0 0x00007f8e35e3523f in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f8e35e3aec7 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f8e35e3adfe in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x000055fd40ab32ce in Realm::realm_freeze (signal=6) at realm/runtime_impl.cc:200
#4 <signal handler called>
#5 0x00007f8e35d9b00b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007f8e35d7a859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007f8e35d7a729 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x00007f8e35d8bfd6 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x000055fd406c555a in Legion::Internal::AllGatherCollective<false>::~AllGatherCollective (this=0x7f8e02510d90) at legion/legion_replication.cc:12387
#10 0x000055fd406d75e7 in Legion::Internal::CreateCollectiveFillView::~CreateCollectiveFillView (this=0x7f8e02510d90) at legion/legion_replication.h:1372
#11 0x000055fd406d7609 in Legion::Internal::CreateCollectiveFillView::~CreateCollectiveFillView (this=0x7f8e02510d90) at legion/legion_replication.h:1372
#12 0x000055fd4068b692 in Legion::Internal::ReplIndexFillOp::deactivate (this=0x7f8dea5cd200, freeop=true) at legion/legion_replication.cc:2890
#13 0x000055fd3f7fa922 in Legion::Internal::Operation::commit_operation (this=0x7f8dea5cd200, do_deactivate=true, wait_on=...) at legion/legion_ops.cc:2291
#14 0x000055fd3f849698 in Legion::Internal::IndexFillOp::trigger_commit (this=0x7f8dea5cd200) at legion/legion_ops.cc:19231
#15 0x000055fd3f7fa458 in Legion::Internal::Operation::complete_operation (this=0x7f8dea5cd200, wait_on=..., first_invocation=true) at legion/legion_ops.cc:2110
#16 0x000055fd3f847501 in Legion::Internal::FillOp::trigger_complete (this=0x7f8dea5cd200) at legion/legion_ops.cc:18741
#17 0x000055fd3f7f9e0c in Legion::Internal::Operation::complete_execution (this=0x7f8dea5cd200, wait_on=...) at legion/legion_ops.cc:1978
#18 0x000055fd3f84745b in Legion::Internal::FillOp::trigger_execution (this=0x7f8dea5cd200) at legion/legion_ops.cc:18730
#19 0x000055fd4068bffc in Legion::Internal::ReplIndexFillOp::trigger_ready (this=0x7f8dea5cd200) at legion/legion_replication.cc:3007
#20 0x000055fd3fd927be in Legion::Internal::Memoizable<Legion::Internal::ReplIndexFillOp>::trigger_ready (this=0x7f8dea5cd200) at legion/legion_ops.inl:111
#21 0x000055fd3fd91f60 in Legion::Internal::Predicated<Legion::Internal::ReplIndexFillOp>::trigger_ready (this=0x7f8dea5cd200) at legion/legion_ops.inl:215
#22 0x000055fd405983e2 in Legion::Internal::InnerContext::process_ready_queue (this=0x7f8e0a584840) at legion/legion_context.cc:8731
#23 0x000055fd405a9411 in Legion::Internal::InnerContext::handle_ready_queue (args=0x7f8e0250d810) at legion/legion_context.cc:12492
#24 0x000055fd3fc8219f in Legion::Internal::Runtime::legion_runtime_task (args=0x7f8e0250d810, arglen=12, userdata=0x55fd5b28fcd0, userlen=8, p=...) at legion/runtime.cc:32348
#25 0x000055fd40a8572b in Realm::LocalTaskProcessor::execute_task (this=0x55fd57809920, func_id=4, task_args=...) at realm/proc_impl.cc:1175
#26 0x000055fd40af8f6d in Realm::Task::execute_on_processor (this=0x7f8e0250d690, p=...) at realm/tasks.cc:326
#27 0x000055fd40afd3bc in Realm::KernelThreadTaskScheduler::execute_task (this=0x55fd45d12d00, task=0x7f8e0250d690) at realm/tasks.cc:1421
#28 0x000055fd40afc150 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55fd45d12d00) at realm/tasks.cc:1160
#29 0x000055fd40afc72a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55fd45d12d00) at realm/tasks.cc:1272
#30 0x000055fd40b0d03c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55fd45d12d00) at realm/threads.inl:97
#31 0x000055fd40b1018a in Realm::KernelThread::pthread_entry (data=0x7f8dac019b30) at realm/threads.cc:781
#32 0x00007f8e43f7a609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#33 0x00007f8e35e77133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
A failing execution will be available with debug symbols at
Legion process received signal 6: Aborted
Process 2581434 on node g0004.stanford.edu is frozen!
Legion process received signal 6: Aborted
Process 2581434 on node g0004.stanford.edu is frozen!
for another 2 hours and 45 mins. (cuda-gdb is required to see the debug symbols)
@elliottslaughter, can you please add this issue to #1032 with high priority?