Orchagent crashes on start #46

Closed
marian-pritsak opened this issue Dec 15, 2016 · 5 comments

@marian-pritsak
Contributor

Sonic-buildimage revision - c199614b69be65edfecc504f6fd3042dbc3d195c

Orchagent aborted with no error messages

Backtrace:

Thread 2 (Thread 0x7efd6c232700 (LWP 50)):
#0  0x00007efd6c314893 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007efd6d23f07e in swss::Select::select (this=this@entry=0x7efd6c231e80, c=c@entry=0x7efd6c231e18, fd=fd@entry=0x7efd6c231dfc, timeout=timeout@entry=4294967295) at select.cpp:77
#2  0x00007efd6d725d24 in ntf_thread () at sai_redis_switch.cpp:37
#3  0x00007efd6cbab970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007efd6d9430a4 in start_thread (arg=0x7efd6c232700) at pthread_create.c:309
#5  0x00007efd6c31b62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7efd6e1f3740 (LWP 45)):
#0  0x00007efd6c268067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007efd6c269448 in __GI_abort () at abort.c:89
#2  0x00007efd6cb55b3d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007efd6cb53bb6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007efd6cb53c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007efd6d72762f in ~thread (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/thread:146
#6  destroy<std::thread> (this=<optimized out>, __p=<optimized out>) at /usr/include/c++/4.9/ext/new_allocator.h:124
#7  _S_destroy<std::thread> (__p=<optimized out>, __a=...) at /usr/include/c++/4.9/bits/alloc_traits.h:282
#8  destroy<std::thread> (__a=..., __p=<optimized out>) at /usr/include/c++/4.9/bits/alloc_traits.h:411
#9  std::_Sp_counted_ptr_inplace<std::thread, std::allocator<std::thread>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:524
#10 0x00007efd6d7276db in _M_release (this=0xfc23c0) at /usr/include/c++/4.9/bits/shared_ptr_base.h:149
#11 ~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:666
#12 ~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:914
#13 std::shared_ptr<std::thread>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr.h:93
#14 0x00007efd6c26aeaf in __cxa_finalize (d=0x7efd6d93a508) at cxa_finalize.c:56
#15 0x00007efd6d71d983 in __do_global_dtors_aux () from /usr/lib/x86_64-linux-gnu/libsairedis.so.0
#16 0x00007ffd7933b5d0 in ?? ()
#17 0x00007efd6dfeefca in _dl_fini () at dl-fini.c:252
Backtrace stopped: frame did not save the PC

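
Frames #4 to #13 of Thread 1 suggest that the abort comes from the `std::thread` destructor: in C++, destroying a `std::thread` that is still joinable calls `std::terminate`. Here the destructor appears to run from a `std::shared_ptr<std::thread>` during global teardown of libsairedis, while `ntf_thread` (Thread 2) is still blocked in `select`. Below is a minimal sketch of that pattern and one way to avoid it. `NotificationWorker` and its members are hypothetical names chosen for illustration; this is not the actual `sai_redis_switch.cpp` code.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical stand-in for a library-owned notification thread.
// Names are illustrative only and do not come from sairedis.
class NotificationWorker {
public:
    // Start the background loop once; a second call is a no-op so that a
    // joinable std::thread is never overwritten (which would also terminate).
    void start() {
        if (joinable()) {
            return;
        }
        running_ = true;
        thread_ = std::make_shared<std::thread>([this] {
            while (running_) {
                // Placeholder for the blocking wait (select() in the backtrace).
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }

    // Ask the loop to exit and wait for it. After join() the std::thread is
    // no longer joinable, so destroying it is safe.
    void stop() {
        running_ = false;
        if (thread_ && thread_->joinable()) {
            thread_->join();
            assert(!thread_->joinable());
        }
    }

    bool joinable() const {
        return thread_ && thread_->joinable();
    }

    // Without this stop(), destroying a still-running worker would destroy a
    // joinable std::thread and end in std::terminate, as in Thread 1 above.
    ~NotificationWorker() {
        stop();
    }

private:
    std::atomic<bool> running_{false};
    std::shared_ptr<std::thread> thread_;
};
```

In this sketch the owner joins the thread before the `std::shared_ptr` releases it. The same idea would apply to a static object: its thread has to be stopped and joined explicitly before the process exits, rather than left to the global destructors.
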
@liatgrozovik

It can easily be reproduced by running the following command:
systemctl restart swss

In recent versions we see it a lot. Some containers are not started, and each time it is a different container.
In some cases restarting orchagent fixes it, but in other cases it does not.

This was observed even without the new PR for the single ONIE image.

@andriymoroz-mlnx
Collaborator

Attached logs in a zip archive. They contain the output of
tail -n 1000 /var/log/syslog
and
docker logs syncd

logs.zip

@lguohan
Contributor

lguohan commented Jan 26, 2017

The logs show that syncd is doing a hard reinit, which happens when the database has not been cleared. The database is cleared when swss starts, but I do not see that the swss service has started. Do you have swss.service in /etc/systemd/system/?

Can you configure syncd.service to run after swss.service and reboot the box to see whether the issue is addressed?

@andriymoroz-mlnx
Collaborator

Updated the After= directive in syncd.service:

root@arc-switch1027:/home/admin# cat /etc/systemd/system/syncd.service
[Unit]
Description=syncd container
Requires=database.service
After=database.service swss.service

[Service]
User=root
ExecStartPre=/etc/init.d/sxdkernel start
ExecStartPre=/usr/bin/mst start
#ExecStartPre=/etc/mlnx/msn2700 start
ExecStart=/usr/bin/docker start -a syncd
ExecStop=/usr/bin/docker stop syncd
#ExecStopPost=/etc/mlnx/msn2700 stop
ExecStopPost=/etc/init.d/sxdkernel stop
ExecStopPost=/usr/bin/mst stop
Restart=always

[Install]
WantedBy=multi-user.target

After the restart, swss is not running:

root@arc-switch1027:/home/admin# docker ps
CONTAINER ID        IMAGE                                            COMMAND                  CREATED             STATUS              PORTS               NAMES
03c6399c50a9        arc-build-server:5000/docker-teamd:latest        "/bin/sh -c '/usr/bin"   6 hours ago         Up 2 minutes                            teamd
53bb708b257f        arc-build-server:5000/docker-fpm:latest          "/bin/sh -c '/usr/bin"   6 hours ago         Up 2 minutes                            bgp
45cc761bc120        arc-build-server:5000/docker-syncd-mlnx:latest   "/bin/bash /usr/bin/s"   6 hours ago         Up 2 minutes                            syncd
0fd13597d4c4        arc-build-server:5000/docker-lldp-sv2:latest     "/bin/sh -c '/usr/bin"   6 hours ago         Up 2 minutes                            lldp
e8658a96186e        arc-build-server:5000/docker-snmp-sv2:latest     "/bin/sh -c '/usr/bin"   6 hours ago         Up 2 minutes                            snmp
78d59417d3f2        arc-build-server:5000/docker-database:latest     "/bin/sh -c 'service "   6 hours ago         Up 2 minutes                            database
root@arc-switch1027:/home/admin#

This usually can be fixed with:
systemctl restart database && systemctl restart swss

@oleksandrivantsiv
Contributor

The issue is no longer reproducible after the flush DB commands were moved from syncd.service to swss.service in the following commit:
sonic-net/sonic-mgmt@6e1f696

michaelli10 pushed a commit to michaelli10/SONiC that referenced this issue May 23, 2020