chef_node_checkin monitoring check causes large number of defunct processes #210
@boxidau Can you confirm which version of the chef_node_check plugin you were using? @dearing contributed some fixes within the last 16 days: racker/rackspace-monitoring-agent-plugins-contrib#82
Ooh, that is interesting. I'm unsure how my changes would cause something like this; if so, it would be happening all over the place.
@martinb3 From what I've seen today, platformstack uses the plugin as defined here: https://github.com/rackspace-cookbooks/platformstack/blob/master/attributes/cloud_monitoring.rb#L116. It updates this every Chef run, so it would have been up to date.

Running the script directly does actually seem to work correctly, contrary to my last comment. The server I was on was showing a huge amount of time since the last run, and this was in fact true, since this issue was causing the Chef run to fail right at the end.

I've also lodged an issue against the virgo-agent-toolkit/rackspace-monitoring-agent repo (virgo-agent-toolkit/rackspace-monitoring-agent#789), since it could well be how the virgo agent invokes plugins.
Interesting indeed, luvit/luvit#780.
I haven't seen this continue to be an issue since the updates, though I was catching some nodes here and there that still needed the update. I consider it resolved at this time. @boxidau?
This is most likely a bug with rackspace-cloud-monitoring or the actual chef_node_check.py script; however, it causes Chef runs to eventually fail with a 413 error reported in Slack.

I believe the 413 error is coming from chef-manage rejecting the node details because the process list is many thousands of entries long.
An extract from `ps ax` showed that there were literally 20,000 defunct processes. Restarting rackspace-monitoring-agent cleared the defunct processes, and the 413 error stopped being produced at the end of a Chef run.
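For anyone hitting the same symptom, a quick way to confirm the zombie count without scanning the full `ps ax` output is to check the state field in `/proc/<pid>/stat` ("Z" means defunct/zombie). This is only a hypothetical diagnostic sketch for Linux hosts, not part of the plugin or the agent:

```python
#!/usr/bin/env python
"""Hypothetical helper: count defunct (zombie) processes on a Linux host."""
import os

def count_zombies():
    zombies = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % pid) as f:
                stat = f.read()
        except (IOError, OSError):
            continue  # the process exited between listdir() and open()
        if ')' not in stat:
            continue
        # The state field is the first character after the ") " that closes
        # the command name, e.g. "1234 (agent) Z ..." -> "Z" for zombie.
        state = stat[stat.rindex(')') + 2]
        if state == 'Z':
            zombies += 1
    return zombies

if __name__ == '__main__':
    print(count_zombies())
```

Restarting the agent clears these entries because the orphaned zombies get reparented to init, which reaps them, which matches the behaviour described above.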
This is really more for reference, but maybe we should consider not enabling the chef_node_check plugin by default until it can be fixed upstream? Also, the script doesn't work correctly anyway; the data it reports is wrong.