improve readiness check for searchd and index freshness #657

Merged
merged 2 commits into from
Feb 13, 2025

ltclm (Contributor) commented Feb 13, 2025

Some improvements to the sphinxsearch readiness probe:

  • Ensure searchd is running before proceeding.
  • Verify the existence and freshness of the index sync status file.
  • Introduce a MAX_AGE environment variable (default: 300 seconds) that sets the maximum allowed age of the index sync status file.
  • Skip further tests while index-sync-rotate.sh is running, detected by checking for the presence of a lock file.

The lock file check prevents false positives. Some sync runs can take 10 to 15 minutes when larger indexes have been updated in the source EFS, and during that time the readiness probe used to raise false-positive alerts.
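
As a rough illustration of the first three points, the checks could be sketched as below. This is not the code from this PR: the function names and the way the status file path is passed in are assumptions made for the example, and the lock check is left out here. Only `MAX_AGE`, its default of 300 seconds, `searchd` and the use of `stat -c %Y` come from the PR itself.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Apart from MAX_AGE, searchd and stat -c %Y,
# every name here is an assumption and not taken from the PR.

# Maximum allowed age of the status file in seconds (default: 300).
MAX_AGE="${MAX_AGE:-300}"

# Returns 0 when a process named exactly "searchd" is running.
searchd_is_running() {
    pgrep -x searchd > /dev/null
}

# Returns 0 when the file exists and was modified at most max_age seconds ago.
status_file_is_fresh() {
    local file="$1" max_age="$2" now file_mtime
    [ -f "${file}" ] || return 1
    file_mtime=$(stat -c %Y "${file}") || return 1
    now=$(date +%s)
    [ $(( now - file_mtime )) -le "${max_age}" ]
}

# Runs the checks in order and reports the first failure on stderr.
readiness_check() {
    local status_file="$1"
    if ! searchd_is_running; then
        echo "searchd is not running" >&2
        return 1
    fi
    if ! status_file_is_fresh "${status_file}" "${MAX_AGE}"; then
        echo "${status_file} is missing or older than ${MAX_AGE}s" >&2
        return 1
    fi
    return 0
}
```

A probe script built on this would end with something like `readiness_check "${LAST_SYNC}"`, since a Kubernetes exec probe treats exit code 0 as ready and any other exit code as not ready.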

ltshb (Contributor) left a comment

Looks good as far as I can tell. Did you test it with a local image on the DEV cluster?

file_mtime=$(stat -c %Y "${LAST_SYNC}")
# check if index-sync-rotate.sh is currently running with the lock file
# if a sync is currently running, further tests should not be executed
LOCK_FILE="/tmp/index-sync-rotate.sh"
Member

Have you considered using flock? Using lock files correctly is surprisingly hard, and that tool can simplify things greatly. That said, I lack context on what exactly this script is supposed to do and how it's used, so I can't make a more specific recommendation.

Contributor Author

Thank you for the feedback. Yes, the script index-sync-rotate.sh uses flock for the locking:
https://github.com/geoadmin/service-search-sphinx/blob/develop-2025-03-12/scripts/index-sync-rotate.sh#L76

The script in this PR only checks whether the lock is active, which means that a sync is currently running.
It is used only for the readiness probe in Kubernetes. One of its tests detects pods with outdated index files. Inside the pod, index-sync-rotate.sh is executed every 5 minutes to sync (externally updated) Sphinx index files into local storage. As part of this sync, the search service is restarted with SIGHUP.
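
For readers unfamiliar with the locking side, a script that holds a flock lock for the duration of its work commonly follows the pattern below. This is a generic sketch and not the contents of index-sync-rotate.sh (the link above points to the real script); `run_exclusively` and its message are hypothetical names.

```shell
#!/usr/bin/env bash
# Generic flock pattern, shown for illustration; not the project's script.

# Runs the given command while holding an exclusive lock on lock_file.
# Returns 1 without running the command if the lock is already held.
run_exclusively() {
    local lock_file="$1"
    shift
    (
        # File descriptor 9 is opened on the lock file by the redirection
        # below; -n makes flock fail at once instead of waiting.
        flock -n 9 || { echo "lock busy, skipping this run" >&2; exit 1; }
        "$@"
    ) 9>"${lock_file}"
}
```

The lock is released when the subshell exits and file descriptor 9 is closed, whether or not the lock file itself is removed afterwards.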

Member

Cool. If that lock file is already managed by flock, its existence alone does not necessarily mean the lock is active. There are cases where the EXIT handler won't run. I think you want to test the return value of flock -n "${LOCK_FILE}" true.
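
A minimal sketch of that suggestion, assuming `LOCK_FILE` is the same file the sync script locks with flock. The function name `sync_in_progress` is hypothetical, and the code actually merged in this PR may differ.

```shell
#!/usr/bin/env bash
# Sketch of testing whether a flock lock is currently held.

# Returns 0 if another process holds the lock, 1 if the lock is free,
# and 2 if flock failed for some other reason.
sync_in_progress() {
    local lock_file="$1" rc=0
    # Without a lock file nothing can be holding the lock. Checking first
    # also avoids having flock create the file.
    [ -e "${lock_file}" ] || return 1
    # Try to take the lock without waiting and run the no-op "true".
    flock -n "${lock_file}" true || rc=$?
    case "${rc}" in
        0) return 1 ;;  # lock acquired and released again: no sync running
        1) return 0 ;;  # default flock exit code for a conflicting lock
        *) return 2 ;;  # unexpected flock error
    esac
}
```

Unlike a test for the file's existence, this keeps working when a stale lock file is left behind, for example after the process was terminated with SIGKILL and its EXIT handler never ran.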

Contributor Author

Thanks for that advice! I have changed the test accordingly 👍

ltclm (Contributor Author) commented Feb 13, 2025

> Looks good as far as I can tell. Did you test it with a local image on the DEV cluster?

Thanks @ltshb, yes, I have tested it on the dev cluster. It is working as expected.

rebert (Contributor) left a comment

Thanks.

flock is a better way to check whether the lock file is active.
ltclm force-pushed the fix_PB-1425_readiness_probe branch from b36c4fa to 40ea517 on February 13, 2025 12:29
ltclm merged commit b5e0a87 into master on Feb 13, 2025
5 checks passed
ltclm deleted the fix_PB-1425_readiness_probe branch on February 13, 2025 12:52
4 participants