Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: fold HPX build into default workflow #7331

Merged
merged 3 commits into from
Feb 19, 2025
Merged

Conversation

junghans
Copy link
Contributor

@junghans junghans commented Sep 12, 2024

Due to some build failure with Fedora's HPX package in: https://koji.fedoraproject.org/koji/taskinfo?taskID=123213163

@junghans
Copy link
Contributor Author

junghans commented Sep 12, 2024

Ok we can reproduce it here:

The following tests FAILED:
	  2 - Kokkos_CoreUnitTest_Serial2 (Failed)
	  3 - Kokkos_CoreUnitTest_HPX (Failed)
Errors while running CTest

This is on x86_64, no ppc64le and aarch64, Kokkos_CoreUnitTest_HPX even segfaults.

@junghans
Copy link
Contributor Author

Another side note @diehlpk figured out that:

export TESTS_ARGUMENTS="--hpx:ini=hpx.stacks.use_guard_pages=0"

helps with the hpx test.

/CC @hkaiser

@masterleinad
Copy link
Contributor

We might just disable the respective since HPX throws its own exceptions by using std::set_new_handler (or try to catch that exception and throw our own).

@hkaiser
Copy link
Contributor

hkaiser commented Sep 12, 2024

Yes, currently HPX throws its own exceptions in tthis case, which is wrong. See STEllAR-GROUP/hpx#6543 for a possible fix.

Where can I see the actual runtime error generated by HPX, btw?

@masterleinad
Copy link
Contributor

Where can I see the actual runtime error generated by HPX, btw?

https://github.com/kokkos/kokkos/actions/runs/10834072931/job/30062369318?pr=7331 has

[ RUN ] hpx.view_bad_alloc
terminate called after throwing an instance of 'hpx::detail::exception_with_infohpx::exception'
what(): new allocator failed to allocate memory: HPX(out_of_memory)

and it results from

if (arg_alloc_size)
ptr = operator new(arg_alloc_size, std::align_val_t(alignment),
std::nothrow_t{});
if (!ptr || (reinterpret_cast<uintptr_t>(ptr) == ~uintptr_t(0)) ||
(reinterpret_cast<uintptr_t>(ptr) & alignment_mask)) {
Impl::throw_bad_alloc(name(), arg_alloc_size, arg_label);
}
we are explicitly requesting not to throw an error when allocating.

@hkaiser
Copy link
Contributor

hkaiser commented Sep 12, 2024

Where can I see the actual runtime error generated by HPX, btw?

https://github.com/kokkos/kokkos/actions/runs/10834072931/job/30062369318?pr=7331 has

[ RUN ] hpx.view_bad_alloc
terminate called after throwing an instance of 'hpx::detail::exception_with_infohpx::exception'
what(): new allocator failed to allocate memory: HPX(out_of_memory)

and it results from

if (arg_alloc_size)
ptr = operator new(arg_alloc_size, std::align_val_t(alignment),
std::nothrow_t{});
if (!ptr || (reinterpret_cast<uintptr_t>(ptr) == ~uintptr_t(0)) ||
(reinterpret_cast<uintptr_t>(ptr) & alignment_mask)) {
Impl::throw_bad_alloc(name(), arg_alloc_size, arg_label);
}

we are explicitly requesting not to throw an error when allocating.

Ok, this is something else, then. Why is a handler set by std::set_new_handler invoked in this case to begin with?

@masterleinad
Copy link
Contributor

Ok, this is something else, then. Why is a handler set by std::set_new_handler invoked in this case to begin with?

It seems https://github.com/llvm/llvm-project/blob/82a36468c74a29b6154639d659550c62457e655b/libcxx/src/new.cpp#L162-L166 is executed but we don't hit the catch clause for some reason but abort with an unhandled exception.

@hkaiser
Copy link
Contributor

hkaiser commented Sep 12, 2024

Ok, this is something else, then. Why is a handler set by std::set_new_handler invoked in this case to begin with?

It seems https://github.com/llvm/llvm-project/blob/82a36468c74a29b6154639d659550c62457e655b/libcxx/src/new.cpp#L162-L166 is executed but we don't hit the catch clause for some reason but abort with an unhandled exception.

Not sure what I can do about this... Would you have any ideas? It shouldn't matter whether the exception thrown from our new-handler is actually derived from std::bad_alloc (as it should, see the PR I linked above).

@hkaiser
Copy link
Contributor

hkaiser commented Sep 13, 2024

While I'm still not sure why this is happening, I can offer adding an option to HPX that allows disabling to set a new-handler (in the same way as one can already disable HPX's signal handling). This option could be used in environments like the Kokkos HPX backend. What do you think?

@masterleinad
Copy link
Contributor

Note that we already have a HPX build in our CI that I would expect to reproduce the issue if we update the HPX version there, see

- name: checkout hpx
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
with:
repository: STELLAR-GROUP/hpx
ref: v1.9.0
path: hpx
.

@junghans
Copy link
Contributor Author

junghans commented Sep 16, 2024

Note that we already have a HPX build in our CI that I would expect to reproduce the issue if we update the HPX version there, see

- name: checkout hpx
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
with:
repository: STELLAR-GROUP/hpx
ref: v1.9.0
path: hpx

.

Let's see.

@junghans
Copy link
Contributor Author

@masterleinad github-Linux-hpx works, but the build with the hpx-devel package on Fedora fails. Not sure if it is worth digging, but the fedora build logs are here:
https://koji.fedoraproject.org/koji/buildinfo?buildID=2540054

@masterleinad
Copy link
Contributor

@masterleinad github-Linux-hpx works, but the build with the hpx-devel package on Fedora fails. Not sure if it is worth digging, but the fedora build logs are here:

Interesting. I can reproduce the issue locally with clang and libc++.

@junghans
Copy link
Contributor Author

@masterleinad github-Linux-hpx works, but the build with the hpx-devel package on Fedora fails. Not sure if it is worth digging, but the fedora build logs are here:

Interesting. I can reproduce the issue locally with clang and libc++.

Did you build with:

            -DCMAKE_BUILD_TYPE=Debug \
            -DHPX_WITH_UNITY_BUILD=ON \
            -DHPX_WITH_MALLOC=system \
            -DHPX_WITH_NETWORKING=OFF \
            -DHPX_WITH_EXAMPLES=OFF \
            -DHPX_WITH_TESTS=OFF \

@junghans
Copy link
Contributor Author

@masterleinad @hkaiser any update or patch to try?

@masterleinad
Copy link
Contributor

@masterleinad @hkaiser any update or patch to try?

No, I think we need someone to dedicate some time to understand the root cause why the compiler even calls the handler set by std::set_new_handler.

@junghans
Copy link
Contributor Author

Ok, ping me if there is any patch to add to hpx package on Fedora!

@masterleinad
Copy link
Contributor

@diehlpk Can you try to have a look at this?

@diehlpk
Copy link
Contributor

diehlpk commented Oct 2, 2024

@masterleinad Currently,. I have no time but maybe @hkaiser can have a look?

@junghans
Copy link
Contributor Author

junghans commented Feb 7, 2025

@hkaiser @diehlpk
Now we are getting:

[ 26%] Building CXX object core/unit_test/CMakeFiles/Kokkos_CoreUnitTest_HPX.dir/hpx/TestHPX_AtomicOperations_longint.cpp.o
In file included from /__w/kokkos/kokkos/core/unit_test/hpx/TestHPX_InParallel.cpp:17:
/__w/kokkos/kokkos/core/unit_test/hpx/TestHPX_InParallel.cpp: In member function 'virtual void {anonymous}::hpx_in_parallel_for_range_policy_Test::TestBody()':
/__w/kokkos/kokkos/core/unit_test/hpx/TestHPX_InParallel.cpp:31:54: error: 'static bool Kokkos::Experimental::HPX::in_parallel(const Kokkos::Experimental::HPX&)' is deprecated [-Werror=deprecated-declarations]
   31 |   ASSERT_FALSE(Kokkos::Experimental::HPX::in_parallel());
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~

@masterleinad
Copy link
Contributor

Now we are getting:

We just forgot to guard this test properly for deprecated code. Disabling deprecation warnings or warnings as errors (or deleting the test) is a easy workaround to investiagte the original issues here.

@hkaiser
Copy link
Contributor

hkaiser commented Feb 7, 2025

@hkaiser @diehlpk Now we are getting:

[ 26%] Building CXX object core/unit_test/CMakeFiles/Kokkos_CoreUnitTest_HPX.dir/hpx/TestHPX_AtomicOperations_longint.cpp.o
In file included from /__w/kokkos/kokkos/core/unit_test/hpx/TestHPX_InParallel.cpp:17:
/__w/kokkos/kokkos/core/unit_test/hpx/TestHPX_InParallel.cpp: In member function 'virtual void {anonymous}::hpx_in_parallel_for_range_policy_Test::TestBody()':
/__w/kokkos/kokkos/core/unit_test/hpx/TestHPX_InParallel.cpp:31:54: error: 'static bool Kokkos::Experimental::HPX::in_parallel(const Kokkos::Experimental::HPX&)' is deprecated [-Werror=deprecated-declarations]
   31 |   ASSERT_FALSE(Kokkos::Experimental::HPX::in_parallel());
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~

We need to ask @msimberg what the alternative way is to avoid using the deprecated functionality.

@junghans
Copy link
Contributor Author

junghans commented Feb 7, 2025

Thanks @masterleinad

@msimberg
Copy link
Contributor

msimberg commented Feb 7, 2025

We need to ask @msimberg what the alternative way is to avoid using the deprecated functionality.

I think @masterleinad's commit (8316044) is the only right thing to do. HPX::in_parallel is deprecated, but the only place where it's used (and where the warning is coming from) is the unit test for HPX::in_parallel itself. There's no replacement. So simply disabling the warning until the functionality, and the test, is fully removed makes sense.

@junghans
Copy link
Contributor Author

junghans commented Feb 14, 2025

Update: Good news, the patched hpx-1.10.0 from Fedora and HPX_HANDLE_FAILED_NEW=0 make the test pass. (https://github.com/kokkos/kokkos/actions/runs/13318784191/job/37199813966?pr=7331)

Now we have 3 options:

@masterleinad
Copy link
Contributor

Now we have 3 options:

I think we can just wait a week.

@junghans junghans changed the title CI: try to build with HPX on Fedora CI: fold HPX build into default workflow Feb 14, 2025
Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me

@junghans
Copy link
Contributor Author

The patched hpx made it into Fedora, so this is ready to be merged.

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me.

Copy link
Contributor

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

This is not a blocker, but I want to note that the kokkos+hpx build took 2.5 hours on the latest workflow run (https://github.com/kokkos/kokkos/actions/runs/13357950343/job/37303340997). I don't remember it taking that long in the past and a semi-random build from a few months ago took a bit over an hour (https://github.com/kokkos/kokkos/actions/runs/12170179713/job/33945205886, without ccache as far as I can tell, and same number of threads for the build (2)). From a couple of days ago on master it's 1 hour 50 minutes (https://github.com/kokkos/kokkos/actions/runs/13340947907/job/37265301428). This could be apples to oranges if kokkos has had a lot of tests added recently, or something else in the setup has changed, but HPX may also have regressed in compile times (@hkaiser?).

@junghans
Copy link
Contributor Author

@msimberg thanks for the analysis, I noticed something similar but didn't do the historic analysis.
Could it have something to do with Debug vs RelWithDebInfo? For the Debug the build is faster, but the tests are slower and vice versa.

@hkaiser
Copy link
Contributor

hkaiser commented Feb 17, 2025

HPX may also have regressed in compile times (@hkaiser?).

I'm not aware of this.

@msimberg
Copy link
Contributor

@msimberg thanks for the analysis, I noticed something similar but didn't do the historic analysis. Could it have something to do with Debug vs RelWithDebInfo? For the Debug the build is faster, but the tests are slower and vice versa.

Yeah, that's definitely possible. I vaguely seem to remember something similar from when I was testing things a long time ago.

HPX may also have regressed in compile times (@hkaiser?).

I'm not aware of this.

Ok, good to know.

In any case, as said, not a blocker. I just wanted to highlight it as something to be aware of and maybe keep an eye on.

@junghans
Copy link
Contributor Author

Ok. then let's merge this and I will do another pass to speed up the CI afterwards. One obvious things would be continue-on-error: false, but that is a discussion for another day.

@junghans junghans requested review from dalg24 and hkaiser February 18, 2025 19:08
Copy link
Contributor

@hkaiser hkaiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@dalg24 dalg24 merged commit ac67b49 into kokkos:develop Feb 19, 2025
32 of 34 checks passed
@junghans junghans deleted the patch-5 branch February 19, 2025 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants