chef_node_checkin monitoring check causes large number of defunct processes #210

Open
boxidau opened this issue Jul 23, 2015 · 5 comments

boxidau commented Jul 23, 2015

This is most likely a bug in rackspace-cloud-monitoring or in the chef_node_check.py script itself; however, it causes Chef runs to eventually fail with the following message in Slack:

Chef client run failed on er-staging-api-tag-v1-0-32-0 (67.699942186 seconds). 22 resources updated
413 "Request Entity Too Large" 

I believe the 413 error comes from chef-manage rejecting the node details because the process list is many thousands of entries long.

Here is an extract from ps ax; there were literally 20,000 defunct processes:

32172 ?        Z      0:00 [chef_node_check] <defunct>
32186 ?        Z      0:00 [chef_node_check] <defunct>
32189 ?        Z      0:00 [chef_node_check] <defunct>
32190 ?        Z      0:00 [chef_node_check] <defunct>
32203 ?        Z      0:00 [chef_node_check] <defunct>
32204 ?        Z      0:00 [chef_node_check] <defunct>
32212 ?        Z      0:00 [chef_node_check] <defunct>
32220 ?        Z      0:00 [chef_node_check] <defunct>
32224 ?        Z      0:00 [chef_node_check] <defunct>
32226 ?        Z      0:00 [chef_node_check] <defunct>

Restarting rackspace-monitoring-agent cleared the defunct processes, and the 413 error was no longer produced at the end of a Chef run.
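For reference, here's a quick way to count the zombies without paging through ps output. This is just a sketch, assuming Python 3 on Linux with a /proc filesystem; the helper is hypothetical and not part of the plugin or the agent:

```python
#!/usr/bin/env python3
"""Count defunct (zombie) processes whose command name matches a fragment.

Illustrative helper only; scans /proc, which is Linux-specific.
"""
import os


def count_zombies(name_fragment="chef_node_check"):
    zombies = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % pid) as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat looks like: "<pid> (<comm>) <state> ..."
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 1:].split()[0]
        if state == "Z" and name_fragment in comm:
            zombies += 1
    return zombies


if __name__ == "__main__":
    print(count_zombies())
```

Matching on the truncated name as it appears in the process table ("chef_node_check") is deliberate, since that's what the ps output above shows.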

This is really more for reference; maybe we should consider not enabling the chef_node_check plugin by default until it can be fixed upstream? Also, the script doesn't work correctly anyway; the data it reports is wrong.

martinb3 (Contributor) commented

@boxidau Can you confirm what version of the chef_node_check plugin you were using? @dearing contributed some fixes in the last 16 days: racker/rackspace-monitoring-agent-plugins-contrib#82


dearing commented Jul 23, 2015

Ooh, that is interesting. I'm unsure how my changes would cause something like this; if they did, it would be happening all over the place.

boxidau (Author) commented Jul 23, 2015

@martinb3 From what I've seen today, platformstack uses the plugin as defined here: https://github.com/rackspace-cookbooks/platformstack/blob/master/attributes/cloud_monitoring.rb#L116

It updates this every Chef run, so it would have been up to date.

Running the script directly does actually seem to work correctly, contrary to my last comment. The server I was on was showing a huge amount of time since the last run; this was in fact true, since this issue was causing the Chef run to fail right at the end.

I've lodged an issue at virgo-agent-toolkit/rackspace-monitoring-agent#789 too, since it could well be how the virgo agent invokes plugins.
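For anyone curious about the mechanism, here's a tiny sketch of how the zombies pile up. It's illustrative only; it assumes the agent spawns the plugin as a child process and never collects its exit status, which is not confirmed agent code:

```python
#!/usr/bin/env python3
"""Illustration: children that are never wait()ed on stay <defunct>."""
import subprocess
import time

# Spawn a few short-lived children and keep the handles around without
# ever calling wait()/poll() on them.
children = [subprocess.Popen(["true"]) for _ in range(5)]

time.sleep(1)  # the children have exited, but their exit status is uncollected

# At this point `ps ax | grep defunct` in another shell shows 5 zombie
# entries owned by this process. Reaping them clears the zombies:
for child in children:
    child.wait()
```

If the parent never reaps its children and never exits (like a long-running agent), the zombies accumulate, which would match the process table above.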


dearing commented Jul 23, 2015

Interesting indeed, luvit/luvit#780.
Nice catch @boxidau.


dearing commented Oct 5, 2015

I haven't seen this continue to be an issue since the updates, though I was catching some nodes here and there that still needed the update. I consider it resolved at this time. @boxidau?
