chef_node_checkin monitoring check causes large number of defunct processes #210

Open
boxidau opened this issue Jul 23, 2015 · 5 comments

boxidau commented Jul 23, 2015

This is most likely a bug in rackspace-cloud-monitoring or in the chef_node_check.py script itself; however, it causes Chef runs to eventually fail with the following message in Slack:

Chef client run failed on er-staging-api-tag-v1-0-32-0 (67.699942186 seconds). 22 resources updated
413 "Request Entity Too Large" 

I believe the 413 error comes from chef-manage rejecting the node details because the process list is many thousands of entries long.

Here is an extract from ps ax; there were literally 20,000 defunct processes:

32172 ?        Z      0:00 [chef_node_check] <defunct>
32186 ?        Z      0:00 [chef_node_check] <defunct>
32189 ?        Z      0:00 [chef_node_check] <defunct>
32190 ?        Z      0:00 [chef_node_check] <defunct>
32203 ?        Z      0:00 [chef_node_check] <defunct>
32204 ?        Z      0:00 [chef_node_check] <defunct>
32212 ?        Z      0:00 [chef_node_check] <defunct>
32220 ?        Z      0:00 [chef_node_check] <defunct>
32224 ?        Z      0:00 [chef_node_check] <defunct>
32226 ?        Z      0:00 [chef_node_check] <defunct>

Restarting rackspace-monitoring-agent cleared the defunct processes, and the 413 error was no longer produced at the end of a Chef run.
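For reference, here's a quick way to count the zombies without paging through ps output. This is just a sketch, assuming Python 3 on Linux with a /proc filesystem; the helper is hypothetical and not part of the plugin or the agent:

```python
#!/usr/bin/env python3
"""Count defunct (zombie) processes whose command name matches a fragment.

Illustrative helper only; scans /proc, which is Linux-specific.
"""
import os


def count_zombies(name_fragment="chef_node_check"):
    zombies = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % pid) as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat looks like: "<pid> (<comm>) <state> ..."
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 1:].split()[0]
        if state == "Z" and name_fragment in comm:
            zombies += 1
    return zombies


if __name__ == "__main__":
    print(count_zombies())
```

Matching on the truncated name as it appears in the process table ("chef_node_check") is deliberate, since that's what the ps output above shows.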

This is really more for reference; maybe we should consider not enabling the chef_node_check plugin by default until it can be fixed upstream? Also, the script doesn't work correctly anyway; the data it reports is wrong.

martinb3 (Contributor) commented

@boxidau Can you confirm what version of the chef_node_check plugin you were using? @dearing contributed some fixes in the last 16 days: racker/rackspace-monitoring-agent-plugins-contrib#82


dearing commented Jul 23, 2015

Ooh, that is interesting. I'm unsure how my changes would cause something like this; if they did, it would be happening all over the place.

boxidau (Author) commented Jul 23, 2015

@martinb3 From what I've seen today, platformstack uses the plugin as defined here: https://github.com/rackspace-cookbooks/platformstack/blob/master/attributes/cloud_monitoring.rb#L116

It updates this every Chef run, so it would have been up to date.

Running the script directly does actually seem to work correctly, contrary to my last comment. The server I was on was showing a huge amount of time since the last run; this was in fact true, since this issue was causing the Chef run to fail right at the end.

I've lodged an issue at virgo-agent-toolkit/rackspace-monitoring-agent#789 too, since it could well be how the virgo agent invokes plugins.
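For anyone curious about the mechanism, here's a tiny sketch of how the zombies pile up. It's illustrative only; it assumes the agent spawns the plugin as a child process and never collects its exit status, which is not confirmed agent code:

```python
#!/usr/bin/env python3
"""Illustration: children that are never wait()ed on stay <defunct>."""
import subprocess
import time

# Spawn a few short-lived children and keep the handles around without
# ever calling wait()/poll() on them.
children = [subprocess.Popen(["true"]) for _ in range(5)]

time.sleep(1)  # the children have exited, but their exit status is uncollected

# At this point `ps ax | grep defunct` in another shell shows 5 zombie
# entries owned by this process. Reaping them clears the zombies:
for child in children:
    child.wait()
```

If the parent never reaps its children and never exits (like a long-running agent), the zombies accumulate, which would match the process table above.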


dearing commented Jul 23, 2015

Interesting indeed, luvit/luvit#780.
Nice catch @boxidau.


dearing commented Oct 5, 2015

I haven't seen this continue to be an issue since the updates, though I was catching some nodes here and there that still needed the update. I consider it resolved at this time. @boxidau?
