Realm::broadcast_trigger #1840

Open
Tracked by #1032
dpassiatore opened this issue Feb 28, 2025 · 3 comments

@dpassiatore

This is HTR running on 16 nodes on Lassen. We get the following assertion:

prometeo.exec: /g/g92/dodipass/legion/runtime/realm/barrier_impl.cc:465: void Realm::broadcast_trigger(Barrier, const std::vector<RemoteNotification>&, const std::vector<int>&, EventImpl::gen_t, EventImpl::gen_t, EventImpl::gen_t, NodeID, unsigned int, ReductionOpID, const void*, size_t, bool): Assertion `((long long)max_recommended_payload - (long long)reduce_data_size - (long long)sizeof(BarrierTriggerMessageArgsInternal) - (long long)sizeof(size_t)) > 0' failed.

This is the backtrace:

Signal 6 received by node 0, process 3304 (thread 20005ecbf8b0) - obtaining backtrace
Signal 6 received by process 3304 (thread 20005ecbf8b0) at: stack trace: 12 frames
  [0] = [0x2000000504d8]
  [1] = /lib64/libc.so.6(abort+0x2b4) [0x20000d282134]
  [2] = /lib64/libc.so.6(+0x357d4) [0x20000d2757d4]
  [3] = /lib64/libc.so.6(__assert_fail+0x64) [0x20000d2758c4]
  [4] = /g/g92/dodipass/HTRpp/bin/prometeo.exec() [0x13023400]
  [5] = /g/g92/dodipass/HTRpp/bin/prometeo.exec() [0x130260d4]
  [6] = /g/g92/dodipass/HTRpp/bin/prometeo.exec(Realm::IncomingMessageManager::do_work(Realm::TimeLimit)+0x134) [0x131f9694]
  [7] = /g/g92/dodipass/HTRpp/bin/prometeo.exec() [0x12ffe34c]
  [8] = /g/g92/dodipass/HTRpp/bin/prometeo.exec() [0x12ffee60]
  [9] = /g/g92/dodipass/HTRpp/bin/prometeo.exec() [0x13131cd8]
  [10] = /lib64/libpthread.so.0(+0x8cd4) [0x200000128cd4]
  [11] = /lib64/libc.so.6(clone+0xe4) [0x20000d367f14]
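
For readers unfamiliar with the check, the arithmetic in the assertion message can be sketched in isolation. This is only an illustration of the expression as printed above, not Realm's actual code: `HeaderStandIn`, `remaining_budget`, `payload_fits`, and `require_payload_fits` are hypothetical names, and the real size of `BarrierTriggerMessageArgsInternal` is not given in this issue. The thread also does not say which operand was responsible for the failure on Lassen.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for Realm's BarrierTriggerMessageArgsInternal.
// Its real layout and size are not shown in the issue; 32 bytes is an assumption.
struct HeaderStandIn {
  unsigned long long fields[4];
};

// Bytes left in one message after the reduction data, the header, and one
// size_t. The casts to a signed type mirror the (long long) casts in the
// assertion: with plain size_t arithmetic an oversized reduction payload
// would wrap around to a huge positive value instead of going negative.
constexpr long long remaining_budget(std::size_t max_recommended_payload,
                                     std::size_t reduce_data_size) {
  return static_cast<long long>(max_recommended_payload)
       - static_cast<long long>(reduce_data_size)
       - static_cast<long long>(sizeof(HeaderStandIn))
       - static_cast<long long>(sizeof(std::size_t));
}

// True when the condition in the assertion message would hold.
constexpr bool payload_fits(std::size_t max_recommended_payload,
                            std::size_t reduce_data_size) {
  return remaining_budget(max_recommended_payload, reduce_data_size) > 0;
}

// Mirrors the failing check: aborts (signal 6) when nothing is left over.
inline void require_payload_fits(std::size_t max_recommended_payload,
                                 std::size_t reduce_data_size) {
  assert(payload_fits(max_recommended_payload, reduce_data_size));
}
```

Read this way, the condition fails whenever the reduction data plus the fixed per-message overhead is at least as large as the recommended payload, which is consistent with the abort reached through `__assert_fail` in the backtrace.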

@elliottslaughter
Contributor

@artempriakhin is the one who touched the barrier code most recently.

@apryakhin
Contributor

Yes, this one should be on me. I will work on a fix Monday first thing.

@apryakhin apryakhin added the bug label Mar 1, 2025
@apryakhin apryakhin self-assigned this Mar 1, 2025
@apryakhin apryakhin added this to the realm-25.05 milestone Mar 1, 2025
@apryakhin
Contributor

This is in progress. The patch will be out soon.
