improve readiness check for searchd and index freshness #657
Some improvements to the sphinxsearch readiness probe:
- Ensure `searchd` is running before proceeding.
- Verify the existence and freshness of the index sync status file.
- Introduce a `MAX_AGE` env variable with a default value of 300 seconds to determine the maximum allowable age of the index sync status file.
- Prevent further tests if `index-sync-rotate.sh` is currently running by checking for the presence of a lock file.

Checking for the presence of a lock file will prevent false positives in the future. Some sync commands can take up to 10-15 minutes if bigger indexes have been updated in the source EFS. During that time, the readiness probe previously reported false-positive alerts.
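For readers following along, the freshness test described above could look roughly like the sketch below. It is not the code from this PR: the function name `is_fresh` and its argument order are hypothetical, and it assumes GNU `stat` (the `-c %Y` form that also appears in the diff).

```shell
#!/bin/bash
# Sketch only: is_fresh is a hypothetical helper, not part of the PR.
# Default maximum age in seconds, overridable via the MAX_AGE env variable.
MAX_AGE="${MAX_AGE:-300}"

# is_fresh FILE [MAX_AGE]
# Succeeds if FILE exists and was modified at most MAX_AGE seconds ago.
is_fresh() {
    local file="$1"
    local max_age="${2:-$MAX_AGE}"
    local now file_mtime

    # A missing status file counts as "not fresh".
    [ -f "$file" ] || return 1

    now=$(date +%s)
    # Modification time as seconds since the epoch (GNU stat).
    file_mtime=$(stat -c %Y "$file") || return 1

    [ $(( now - file_mtime )) -le "$max_age" ]
}
```

A probe would call something like `is_fresh "${LAST_SYNC}" || exit 1` once it has confirmed that `searchd` is running and that no sync is in progress.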
Looks good as far as I can tell. Did you test it with a local image on the DEV cluster?
file_mtime=$(stat -c %Y "${LAST_SYNC}")
# check if index-sync-rotate.sh is currently running with the lock file
# if a sync is currently running, further tests should not be executed
LOCK_FILE="/tmp/index-sync-rotate.sh"
Have you considered using flock? Using lock files correctly is surprisingly hard, and tools like that can simplify things greatly. That said, I lack the context on what exactly this script is supposed to do and how it's used, so I can't make a more specific recommendation.
thank you for the feedback. yes, the script index-sync-rotate.sh is using flock for the locking:
https://github.com/geoadmin/service-search-sphinx/blob/develop-2025-03-12/scripts/index-sync-rotate.sh#L76
this script here only checks whether the lock is active, which means that a sync is currently running.
it is used solely for the readiness probe in kubernetes. one of the tests is to detect pods with outdated index files. inside the pod, index-sync-rotate.sh is executed every 5m in order to sync (externally updated) sphinx index files into the local storage. as part of this sync, the search service is restarted with SIGHUP.
Cool. If that lock file is already managed by flock, its existence alone does not necessarily mean the lock is active. There are cases where the EXIT handler won't run. I think you want to test the return value of flock -n "${LOCK_FILE}" true.
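To illustrate the suggestion, here is a minimal sketch of such a check. The function name `sync_in_progress` is hypothetical and this is not the code that ended up in the PR. It relies on `flock -n` from util-linux failing immediately with exit status 1 (the default conflict status) when another process holds the lock.

```shell
#!/bin/bash
# Sketch only: sync_in_progress is a hypothetical helper, not part of the PR.

# sync_in_progress LOCK_FILE
# Succeeds only if another process currently holds a flock on LOCK_FILE.
# A leftover lock file that nobody holds is reported as "no sync running".
sync_in_progress() {
    local lock_file="$1"
    local rc=0

    # No lock file at all: no sync can be holding it.
    [ -e "$lock_file" ] || return 1

    # Try to take the lock without blocking and run a no-op.
    # The lock is released again as soon as `true` exits.
    flock -n "$lock_file" true || rc=$?

    # Exit status 1 is flock's default "lock is held elsewhere" status.
    [ "$rc" -eq 1 ]
}
```

Because the test acquires and immediately releases the lock, a stale file left behind by a killed sync does not block the remaining readiness checks.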
thanks for that advice! i have changed the test accordingly 👍
thanks @ltshb, yes, i have tested it on the dev cluster. it is working as expected.
Thanks
flock is a better way to check the lockfile for being active or not
b36c4fa to 40ea517