Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New deploy-mg causes TSA to be in config_db (and golden), causing test errors when startup_tsa_tsb is disabled #17246

Closed
Javier-Tan opened this issue Feb 28, 2025 · 8 comments · Fixed by #17247

Comments

@Javier-Tan
Copy link
Contributor

Javier-Tan commented Feb 28, 2025

Issue Type

Enhancement

Description

Since TSA/TSB was changed to start on swss startup (sonic-net/sonic-buildimage#21587), deploy-mg causes tsa_enabled: true in config_db. This could affect tests

If tsa_enabled is not explicitly set to false and saved to config before running tests, tsa_enabled will remain in the config, which is not desired (and deploy-mg is a common setup step before running nightly tests). Additionally pretest will save this to golden config.

In the case where pretest is run, startup_tsa_tsb service will be disabled, meaning any config reload will cause tsa_enabled to be on indefinitely causing test failures

This also causes "Core dump or config check failed" and triggers config reload, causing 15m per test delay

Issue Details

To recreate this issue:

Run deploy-mg on device and check tsa_enabled in config_db

Impact or Proposed Behavior

In general, tsa_enabled should be false for both config_db.json and golden_config_db.json

Importance or Severity

High

Is it platform specific

generic

Relevant log output

N/A

Output of show version

N/A

Attach files (if any)

N/A

@rlhui
Copy link

rlhui commented Mar 1, 2025

@Javier-Tan , can we please have a more descriptive title for this issue?

@Javier-Tan Javier-Tan changed the title [Bug|Enhancement|Test Gap]: New deploy-mg causes TSA to be in config_db (and golden), causing errors when startup_tsa_tsb is disabled Mar 1, 2025
@Javier-Tan
Copy link
Contributor Author

@rlhui done, sorry not sure why the title didnt save

@Javier-Tan Javier-Tan changed the title New deploy-mg causes TSA to be in config_db (and golden), causing errors when startup_tsa_tsb is disabled New deploy-mg causes TSA to be in config_db (and golden), causing test errors when startup_tsa_tsb is disabled Mar 1, 2025
@yxieca
Copy link
Collaborator

yxieca commented Mar 5, 2025

The belief is that deploy-mg will generate golden config with default state, e.g. TSA.

The pretest will flip the golden config to TSB.

However, if pretest is not executed, then the golden config is in TSA state. Can you confirm this is what happened?

From reading the pretest code, there seems to be another opportunity for error: the pretest golden config collector write in memory db contents directly to the golden config json file, but it didn't save the same contents in /etc/sonic/config_db.json (config save). So before a config save was called by any test, a config reload will put the DUT back to TSA state.

@yxieca yxieca added the Triaged label Mar 5, 2025
@Javier-Tan
Copy link
Contributor Author

Javier-Tan commented Mar 5, 2025

Hi @yxieca, so deploy-mg actually executes the 3 relevant steps:

  • load_minigraph
  • config save
  • remove running golden config files (for all the asics) if exists

In this case, load_minigraph puts the state into TSA, this was not the case previously until a recent .15 change sonic-net/sonic-buildimage#21587

The opportunity for error you mention is what we're mainly trying to tackle, there are 2 main issues that stem from TSB + config save not being executed some time before testing starts:

  • Tests generally failing due to TSA being enabled in config_db (may be triggered by config reload during the test)
  • Even if those tests don't fail, config_db checker will find a mismatch in TSA state between golden_running_config and current config, causing a 15m config reload (golden_running_config will have TSB while config will have TSA after config_reload)

In any case, we think putting TSB in deploy-mg before config save is safe (same as previous behaviour) and a config save after executing test_disable_startup_tsa_tsb_service in pretest as a redundant measure to make sure the testbed has TSB in config

Please let me know if you have any concerns/questions about the related fix, @arlakshm for viz

@anamehra
Copy link
Contributor

While this PR #17247 fixes the issue during deploy-mg, this is still a gap when someone executing a config save which tsa is true to write the config down to disk config_db files. I am wondering why this config tsa_enabled must be via CONFIG_DB?

@Javier-Tan
Copy link
Contributor Author

Javier-Tan commented Mar 26, 2025

Hey @anamehra, for testing specifically, we are planning a redundant measure #17278

This is a typically redundant measure because: generally we do expect deploy-mg/pretest (note: #17494) to have run before the tests which means we assume the "healthy" state to have tsa_enabled=False which should then be policed by the global core_dump_and_config_check fixture, ensuring tsa_enabled=False is set after every test.

However in the case deploy-mg/pretest is skipped sanity-check will catch it, if sanity-check is skipped and tsa_enabled=True then the testbed would be considered in an unhealthy state.

As for scenarios outside testing using tsa_enabled in config_db maybe @arlakshm can give more context

@anamehra
Copy link
Contributor

While this PR #17247 fixes the issue during deploy-mg, this is still a gap when someone executing a config save which tsa is true to write the config down to disk config_db files. I am wondering why this config tsa_enabled must be via CONFIG_DB?

Nevermind, we will require this on CONFIG_DB to be saved during maintenance mode in production.

@anamehra
Copy link
Contributor

Hey @anamehra, for testing specifically, we are planning a redundant measure #17278

This is a typically redundant measure because: generally we do expect deploy-mg/pretest (note: #17494) to have run before the tests which means we assume the "healthy" state to have tsa_enabled=False which should then be policed by the global core_dump_and_config_check fixture, ensuring tsa_enabled=False is set after every test.

However in the case deploy-mg/pretest is skipped sanity-check will catch it, if sanity-check is skipped and tsa_enabled=True then the testbed would be considered in an unhealthy state.

As for scenarios outside testing using tsa_enabled in config_db maybe @arlakshm can give more context

Understood, my worry is setups when are not running sonic-mgmt or if not configred for sonic-mgmt but other automated or manual testing. If someone saves the config by error and does not realize that it's set. Wondering if we should display some text like "In Maintenance Mode" in 'show ip bgp summary' output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants