Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Debain Upgrade] After debian upgrade the docker exec calls are taking more time than expected #21920

Open
gpunathilell opened this issue Mar 4, 2025 · 3 comments
Assignees
Labels
Triaged this issue has been triaged

Comments

@gpunathilell
Copy link
Contributor

gpunathilell commented Mar 4, 2025

Issue Summary

This issue was discovered after warm reboot started failing due to docker execs timing out (There is a check_docker_exec function as part of fast-reboot and warm-reboot https://github.com/sonic-net/sonic-utilities/blob/2f8508f182566dd75a7c66cfd8dcccaa21470485/scripts/fast-reboot#L469). Issue previously opened : #21236
We see that the docker execs are taking more time after the debian upgrade from 202311 (Bullseye) to 202411/202405(Bookworm) And it is affecting warm reboot since we have a check for 1 second timeout for docker execs.

Analysis:

There seems to be an increase in docker exec times for almost all dockers between 202311 and 202411, This is tested without any additional stress/load on CPU. With a simple bash script for 1000 runs of docker execs and measuring maximum, minimum and average time for a simple docker exec (docker exec swss echo "success"). The same dockers which are used for the reboot check in warm-reboot are used
This is result from 202311:

Running tests for container: bgp
Results for container: bgp
Average: 0.136636 seconds
Min: 0.11147 seconds
Max: 0.324412 seconds



Running tests for container: lldp
Results for container: lldp
Average: 0.133063 seconds
Min: 0.109969 seconds
Max: 0.337552 seconds



Running tests for container: swss
Results for container: swss
Average: 0.100698 seconds
Min: 0.076642 seconds
Max: 0.269502 seconds



Running tests for container: syncd
Results for container: syncd
Average: 0.0983149 seconds
Min: 0.0771413 seconds
Max: 0.251317 seconds



Running tests for container: database
Results for container: database
Average: 0.099173 seconds
Min: 0.0770285 seconds
Max: 0.235249 seconds

And this is from 202411:

Running tests for container: bgp
Results for container: bgp
Average: 0.161145 seconds
Min: 0.119685 seconds
Max: 0.821305 seconds



Running tests for container: lldp
Results for container: lldp
Average: 0.168193 seconds
Min: 0.119636 seconds
Max: 0.539189 seconds



Running tests for container: swss
Results for container: swss
Average: 0.116055 seconds
Min: 0.0840077 seconds
Max: 0.254673 seconds



Running tests for container: syncd
Results for container: syncd
Average: 0.117382 seconds
Min: 0.0828154 seconds
Max: 0.404531 seconds



Running tests for container: database
Results for container: database
Average: 0.169357 seconds
Min: 0.120504 seconds
Max: 0.606531 seconds

The maximum time which is seen is higher across all the dockers. This seems to affect warm boot (because of the timeout check for docker execs) but issue could be something bigger such as performance of dockers after the debian upgrade

@gpunathilell
Copy link
Contributor Author

Hi @saiarcot895 This is the output of strace from the system where failure was seen. The output here is basically taken using the command time strace docker exec ... and after thousand iterations whichever exec takes the longest, the strace output of that command is stored in the corresponding docker_max_strace_output.txt Let me know if any additional information is required
time_op.log
teamd_max_strace_output.txt
syncd_max_strace_output.txt
swss_max_strace_output.txt
radv_max_strace_output.txt
lldp_max_strace_output.txt
database_max_strace_output.txt
bgp_max_strace_output.txt

@gpunathilell
Copy link
Contributor Author

I have also taken the strace output of the minimum time from a separate run of the same simulation. Please find the information below:
time_op.log
teamd_min_strace_output.txt
syncd_min_strace_output.txt
swss_min_strace_output.txt
radv_min_strace_output.txt
lldp_min_strace_output.txt
database_min_strace_output.txt
bgp_min_strace_output.txt

@saiarcot895
Copy link
Contributor

Hi @gpunathilell,

Please use the following command for strace; that way, at least partial timing information is collected for all syscalls: strace -ttf --strings-in-hex=non-ascii-chars -s 128 ....

The strace output is needed only for one container; I'm fairly certain this is a container-independent issue.

Additionally, can you try the same experiment (1000 runs of docker exec database echo success) on the database container, but with all other containers stopped? If the max (and maybe average) time is lower, then this will prove that the time taken for docker exec to complete is subject to the system load. If they're the same, then the factor here is the time needed for system calls and/or the work that docker (and associated binaries) are doing.

@arlakshm arlakshm added the Triaged this issue has been triaged label Mar 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Triaged this issue has been triaged
Projects
None yet
Development

No branches or pull requests

3 participants