[BGP] The 'set src' command is sometimes set with no loopback ip address #21931

Open
dgsudharsan opened this issue Mar 5, 2025 · 10 comments
Labels
FRR 🚥, Issue for 202411, Triaged

Comments

@dgsudharsan
Collaborator

dgsudharsan commented Mar 5, 2025

This issue has been seen starting with 202411.

bgpcfgd has logic to apply the 'set src' command when it receives the loopback interface:

2025 Feb 24 19:54:55.980637 MSN-2700 INFO bgp#bgpcfgd: The 'set src' configuration with Loopback0 ip 'FC00:1::32' has been scheduled to be added
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel1017|FC00::29/126', 'SET', (('state', 'ok'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel105|FC00::9/126', 'SET', (('state', 'ok'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel1017', 'SET', (('vrf', ''),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel1011|10.0.0.12/31', 'SET', (('state', 'ok'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel1023|10.0.0.28/31', 'SET', (('state', 'ok'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel1014|10.0.0.16/31', 'SET', (('state', 'ok'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel108', 'SET', (('vrf', ''),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('Loopback0|10.1.0.32/32', 'SET', (('state', 'ok'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 INFO bgp#bgpcfgd: The 'set src' configuration with Loopback0 ip '10.1.0.32' has been scheduled to be added
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel1014', 'SET', (('vrf', ''),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('PortChannel1023|FC00::39/126', 'SET', (('state', 'ok'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('Loopback0', 'SET', (('NULL', 'NULL'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('Loopback0|FC00:1::32/128', 'SET', (('NULL', 'NULL'),))'
2025 Feb 24 19:54:55.980637 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('Loopback0|10.1.0.32/32', 'SET', (('NULL', 'NULL'),))'
2025 Feb 24 19:54:55.996435 MSN-2700 DEBUG bgp#bgpcfgd: Received message : '('fc00::42', 'SET', (('asn', '64001'), ('holdtime', '10'), ('keepalive', '3'), ('local_addr', 'fc00::41'), ('name', 'ARISTA01T0'), ('nhopself', '0'), ('rrclient', '0')))'

In rare cases, this logic does not result in 'set src' being configured with the loopback IP address:

route-map RM_SET_SRC6 permit 10
exit
!
route-map RM_SET_SRC permit 10
exit

During this time, there is no indication from zebra of any error related to this:

2025 Feb 24 19:54:48.685582 MSN-2700 NOTICE bgp#zebra[34]: [V98V0-MTWPF] client 52 says hello and bids fair to announce only bgp routes vrf=0
2025 Feb 24 19:54:59.978720 MSN-2700 ERR bgp#zebra[34]: [HSYZM-HV7HF] Extended Error: Carrier for nexthop device is down
2025 Feb 24 19:54:59.978720 MSN-2700 ERR bgp#zebra[34]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=34, pid=3716991302
2025 Feb 24 19:54:59.978720 MSN-2700 ERR bgp#zebra[34]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (18[if 4 vrfid 0]) into the kernel
2025 Feb 24 19:55:00.080107 MSN-2700 ERR bgp#zebra[34]: [HSYZM-HV7HF] Extended Error: Carrier for nexthop device is down
2025 Feb 24 19:55:00.080182 MSN-2700 ERR bgp#zebra[34]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=45, pid=3716991302

Checking the swss rec shows that INTF_TABLE for the loopback is set only after the bgpcfgd processing:

2025-02-24.19:55:06.815196|INTF_TABLE:Loopback0|SET|NULL:NULL|mac_addr:00:00:00:00:00:00
2025-02-24.19:55:06.815219|INTF_TABLE:Loopback0:10.1.0.32/32|SET|scope:global|family:IPv4
2025-02-24.19:55:06.815511|INTF_TABLE:Loopback0:FC00:1::32/128|SET|scope:global|family:IPv6
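The ordering gap between the two logs can be quantified with a small script. This is only a sketch: the helper names are hypothetical, and the timestamp layouts are assumed from the excerpts above (month abbreviations are parsed in the default C/English locale).

```python
from datetime import datetime

# Timestamp layouts as they appear in the excerpts above (assumed, not a spec).
SYSLOG_FMT = "%Y %b %d %H:%M:%S.%f"     # e.g. 2025 Feb 24 19:54:55.980637
SWSS_REC_FMT = "%Y-%m-%d.%H:%M:%S.%f"   # e.g. 2025-02-24.19:55:06.815196


def parse_syslog_ts(line):
    """Parse the leading timestamp of a syslog line (its first four fields)."""
    return datetime.strptime(" ".join(line.split()[:4]), SYSLOG_FMT)


def parse_swss_rec_ts(line):
    """Parse the leading timestamp of a swss rec line (text before the first '|')."""
    return datetime.strptime(line.split("|", 1)[0], SWSS_REC_FMT)


def gap_seconds(syslog_line, swss_rec_line):
    """Seconds from the syslog event to the swss rec event (positive if swss is later)."""
    delta = parse_swss_rec_ts(swss_rec_line) - parse_syslog_ts(syslog_line)
    return delta.total_seconds()


bgpcfgd_line = ("2025 Feb 24 19:54:55.980637 MSN-2700 INFO bgp#bgpcfgd: "
                "The 'set src' configuration with Loopback0 ip '10.1.0.32' "
                "has been scheduled to be added")
swss_line = "2025-02-24.19:55:06.815196|INTF_TABLE:Loopback0|SET|NULL:NULL|mac_addr:00:00:00:00:00:00"

print(gap_seconds(bgpcfgd_line, swss_line))
```

For the two lines quoted here, the INTF_TABLE entry lands roughly 10.8 seconds after bgpcfgd scheduled the 'set src' configuration.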

Hence it is not clear whether this is an FRR issue or a bgpcfgd issue. Currently bgpcfgd has no logs or records for the commit command. I recommend adding one, which would be helpful for debugging.
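As a rough illustration of the kind of logging meant here, the sketch below wraps a vtysh invocation and logs both the commands and the result. It is not the actual bgpcfgd code: `push_commands`, the logger name, and the use of repeated `-c` arguments are assumptions made for the example.

```python
import logging
import subprocess

log = logging.getLogger("bgpcfgd-sketch")  # hypothetical logger name


def push_commands(cmds, runner=subprocess.run):
    """Hypothetical helper: log the commands, apply them via vtysh, log the outcome.

    `runner` is injectable so the function can be exercised without vtysh.
    """
    argv = ["vtysh", "-c", "configure terminal"]
    for cmd in cmds:
        argv += ["-c", cmd]

    log.info("Pushing %d command(s) to vtysh: %s", len(cmds), "; ".join(cmds))
    result = runner(argv, capture_output=True, text=True)

    if result.returncode != 0:
        # Record everything needed to tell a bgpcfgd failure from an FRR one.
        log.error("vtysh failed rc=%d stdout=%r stderr=%r",
                  result.returncode, result.stdout, result.stderr)
        return False

    log.info("vtysh accepted the commands")
    return True
```

With a record like this in syslog, a missing 'set src' could be traced either to bgpcfgd never pushing the command or to FRR not applying it.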

Tech support logs

frr.running_config

@dgsudharsan
Collaborator Author

@kperumalbfn Have you encountered this bug since 202411? We have seen it twice so far after config load_minigraph.

@dgsudharsan
Collaborator Author

@hasan-brcm @adyeung For visibility.

@dgsudharsan
Collaborator Author

@StormLiangMS can you please help to assign or investigate this issue?

@StormLiangMS
Contributor

ack

@arlakshm
Contributor

@prsunny for visibility.

@arlakshm added the Triaged label on Mar 12, 2025
@StormLiangMS
Contributor

StormLiangMS commented Mar 12, 2025

I think it is a display issue. When you use the vtysh command below to show the full running config, you can see the complete set src info.
vtysh -c "show run"
'''
route-map RM_SET_SRC permit 10
set src 10.1.0.32
exit
!
route-map RM_SET_SRC6 permit 10
set src fc00:1::32
exit
'''

But if you use show run bgp, it displays as you found.

I have seen this since 202405; 202411 behaves the same.

Below is the FRR zebra version for 202405.
root@str2-msn4600c-acs-01:/# /usr/lib/frr/zebra --version
zebra version 8.5.4

I also checked the routes in the kernel, and they have the correct src info.
For example:
192.168.112.0/25 nhid 230 via 10.0.0.61 dev PortChannel1019 proto bgp src 10.1.0.32 metric 20
192.168.112.128/25 nhid 230 via 10.0.0.61 dev PortChannel1019 proto bgp src 10.1.0.32 metric 20
192.168.120.0/25 nhid 188 via 10.0.0.63 dev PortChannel1020 proto bgp src 10.1.0.32 metric 20
192.168.120.128/25 nhid 188 via 10.0.0.63 dev PortChannel1020 proto bgp src 10.1.0.32 metric 20

@dgsudharsan this looks more like an FRR bug in the show command. Could you help follow up with the FRR community?

@dgsudharsan
Collaborator Author

@StormLiangMS No, this time it's a functional issue. We see test_default_route fail.

The expected state is shown below.

After executing "ip route list exact 0.0.0.0/0", the default route should include the source IP address (src 10.1.0.32):

'default nhid 332 proto bgp src 10.1.0.32 metric 20 '                      <<<<<< Has "src 10.1.0.32"
'    nexthop via 10.0.0.9 dev PortChannel108 weight 1 '
'    nexthop via 10.0.0.13 dev PortChannel1011 weight 1 '
'    nexthop via 10.0.0.1 dev PortChannel102 weight 1 '
'    nexthop via 10.0.0.17 dev PortChannel1014 weight 1 '
'    nexthop via 10.0.0.29 dev PortChannel1023 weight 1 '
'    nexthop via 10.0.0.21 dev PortChannel1017 weight 1 '
'    nexthop via 10.0.0.5 dev PortChannel105 weight 1 '
'    nexthop via 10.0.0.25 dev PortChannel1020 weight 1 '

However, in the problem state the output is the following:

'default nhid 428 proto bgp metric 20 '                          <<<<<<
'    nexthop via 10.0.0.13 dev PortChannel1011 weight 1 '
'    nexthop via 10.0.0.17 dev PortChannel1014 weight 1 '
'    nexthop via 10.0.0.1 dev PortChannel102 weight 1 '
'    nexthop via 10.0.0.9 dev PortChannel108 weight 1 '
'    nexthop via 10.0.0.29 dev PortChannel1023 weight 1 '
'    nexthop via 10.0.0.5 dev PortChannel105 weight 1 '
'    nexthop via 10.0.0.21 dev PortChannel1017 weight 1 '
'    nexthop via 10.0.0.25 dev PortChannel1020 weight 1 '
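The difference between the two outputs above can be checked mechanically. This is a minimal sketch, assuming the `src` attribute, when present, appears on the first line of the `ip route list exact 0.0.0.0/0` output as it does in the samples; `route_src` is a hypothetical helper, not part of sonic-mgmt.

```python
def route_src(ip_route_output):
    """Return the 'src' address from the first line of the route, or None if absent."""
    lines = ip_route_output.strip().splitlines()
    if not lines:
        return None
    tokens = lines[0].split()
    if "src" in tokens:
        idx = tokens.index("src")
        if idx + 1 < len(tokens):
            return tokens[idx + 1]
    return None


# Shortened copies of the two states shown above.
good = """default nhid 332 proto bgp src 10.1.0.32 metric 20
    nexthop via 10.0.0.9 dev PortChannel108 weight 1
    nexthop via 10.0.0.13 dev PortChannel1011 weight 1"""

bad = """default nhid 428 proto bgp metric 20
    nexthop via 10.0.0.13 dev PortChannel1011 weight 1
    nexthop via 10.0.0.17 dev PortChannel1014 weight 1"""

print(route_src(good))
print(route_src(bad))
```

The first call reports the loopback address and the second reports that no source address is set, which matches the failing condition described here.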

@StormLiangMS
Contributor

StormLiangMS commented Mar 13, 2025

I see. Do we have steps to repro this? Did any sonic-mgmt test fail because of it? @dgsudharsan

@dgsudharsan
Collaborator Author

There are no concrete steps. It's a statistical issue seen immediately after deployment and running config load_minigraph. Seen twice so far.

@StormLiangMS
Contributor

> There are no concrete steps. It's a statistical issue seen immediately after deployment and running config load_minigraph. Seen twice so far.

I see, I will try to repro on our end. Could you check the output of vtysh -c "show run" the next time you see this issue? I'd like to know whether the config is missing or it is something else.
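To make that check repeatable, the captured `show run` text could be scanned for route-maps that lack a `set src` line. The sketch below is an assumption-laden example: `route_maps_missing_set_src` is a hypothetical helper, and it assumes each route-map block ends with `exit`, as in the outputs quoted in this thread.

```python
def route_maps_missing_set_src(running_config, names=("RM_SET_SRC", "RM_SET_SRC6")):
    """Return the names of the watched route-maps whose block has no 'set src' line."""
    missing = []
    current = None      # watched route-map currently being parsed, if any
    has_src = False
    for raw in running_config.splitlines():
        line = raw.strip()
        if line.startswith("route-map "):
            name = line.split()[1]
            current = name if name in names else None
            has_src = False
        elif current and line.startswith("set src "):
            has_src = True
        elif current and line == "exit":
            if not has_src:
                missing.append(current)
            current = None
    return missing


healthy = """route-map RM_SET_SRC permit 10
 set src 10.1.0.32
exit
!
route-map RM_SET_SRC6 permit 10
 set src fc00:1::32
exit"""

problem = """route-map RM_SET_SRC6 permit 10
exit
!
route-map RM_SET_SRC permit 10
exit"""

print(route_maps_missing_set_src(healthy))
print(route_maps_missing_set_src(problem))
```

An empty result on the `show run` output would point to a display difference in `show run bgp`, while a non-empty result would point to the configuration really being absent.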
