Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Set g_device_list to NULL after free #10805

Closed
wants to merge 1 commit into from

Conversation

zachdworkin
Copy link
Contributor

fi_info has a double free issue. The efa g_device_list isn't being set to NULL after being freed so the next time efa_device_list_finalize is called it will be freed again (double free). Setting it to NULL removes this.

fi_info has a double free issue. The efa g_device_list
isn't being set to NULL after being freed so the next time
efa_device_list_finalize is called it will be freed
again (double free). Setting it to NULL removes this.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
@zachdworkin zachdworkin requested a review from wzamazon February 20, 2025 17:49
@shijin-aws shijin-aws requested a review from a team February 20, 2025 18:38
@shijin-aws
Copy link
Contributor

shijin-aws commented Feb 20, 2025

@zachdworkin Is it a new issue after b7b9dd6?

@zachdworkin
Copy link
Contributor Author

Yes its a new issue after that commit. I rebased before finding this bug. The way I found it was running the fi_info binary with any provider and saw the double free report. Libfabric must be configured so that efa is built (HAVE_EFA 1 in config.log) to reproduce

@shijin-aws
Copy link
Contributor

@zachdworkin Thanks for catching it ! Let me run with ASAN to reproduce

@zachdworkin
Copy link
Contributor Author

@shijin-aws no problem! Seems simple enough. What does ASAN stand for?

@shijin-aws
Copy link
Contributor

ASAN is address sanitizer that we used to catch illegal memory access and double free error. But I couldn't reproduce the double free report: fi_info -p tcp just runs ok... how do you get that double free report?

@shijin-aws
Copy link
Contributor

I think I can see where the bug is introduced:

Before b7b9dd6, efa_prov is never returned in the error path, so EFA_FIN is never called . But now efa_prov is returned anyway and EFA_FIN will be called in the process reaper again. So if g_device_list is not initialized and reset as NULL, EFA_FIN can see a garbage efa_device_list and double free it.

@zachdworkin
Copy link
Contributor Author

zachdworkin commented Feb 20, 2025

steps to reproduce:
git reset --hard upstream/main
make distclean
./autogen.sh
./configure --prefix=install/location
make -j install

(check HAVE_EFA 1 in config.log)

install/location/bin/fi_info -p verbs
double free error

if run with -p efa, shm, tcp, etc the error still appears and using gdb to follow it leads to the changes in this patch

@shijin-aws
Copy link
Contributor

shijin-aws commented Feb 20, 2025

I think this PR is only a bandaid as we have more clean up issue here - The EFA_INI code block needs to move the initialization code to the getinfo. I need to think about whether I need to revert that commit now before i can have a good fix shortly

@shijin-aws
Copy link
Contributor

@zachdworkin can you try if #10807 fixes the issue for you?

@zachdworkin
Copy link
Contributor Author

@shijin-aws #10807 fixes it. Should I close this and let that PR be the one to fix it?

@shijin-aws
Copy link
Contributor

Yes, you can close this one

@shijin-aws shijin-aws closed this Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants