Chef troubleshooting

Symptoms

HAProxy is missing workers:

lb7.cluster.gitlab.com HAProxy_gitlab_443/worker4.cluster.gitlab.com is UNKNOWN - Check output not found in local checks

Nodes are missing chef roles:

jeroen@xps15:~/src/gitlab/chef-repo$ bundle exec knife node show worker1.cluster.gitlab.com
Node Name:   worker1.cluster.gitlab.com
Environment: _default
FQDN:        worker1.cluster.gitlab.com
IP:          10.1.0.X
Run List:    
Roles:       
Recipes:     
Platform:    ubuntu 16.04
Tags:

Knife ssh does not work:

bundle exec knife ssh "name:worker1.cluster.gitlab.com" "uptime"
WARNING: Failed to connect to  -- Errno::ECONNREFUSED: Connection refused - connect(2)

Resolution

Check if the workers have the chef role gitlab-cluster-worker. HAProxy config is generated with a chef search on this specific role.
```
$ bundle exec knife node show worker1.cluster.gitlab.com
```
If not, restore the worker with knife node from file:
```
$ bundle exec knife node from file worker1.cluster.gitlab.com.json
```
Run chef-client on the node. Once the chef-client run has finished on the nodes, force a chef-client run on the load balancers to regenerate the HAProxy config with the workers:
```
$ bundle exec knife ssh -p2222 -a ipaddress role:gitlab-cluster-lb 'sudo chef-client'
$ bundle exec knife ssh -p2222 -a ipaddress role:gitlab-cluster-lb-pages 'sudo chef-client'
```
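The role check in the first step can be scripted so that it is easy to repeat across several workers. The sketch below is a hypothetical helper, `node_has_role` (not part of knife or chef-repo), that reads `knife node show` output on stdin and succeeds only if the `Roles:` line lists the given role. It assumes roles are printed on a single line, separated by commas, in the summary format shown under Symptoms.

```shell
# node_has_role ROLE
# Hypothetical helper: reads `knife node show NODE` output on stdin and
# exits 0 if the "Roles:" line contains ROLE, 1 otherwise.
# Assumption: roles appear on one line, comma-separated.
node_has_role() {
  awk -v role="$1" '
    $1 == "Roles:" {
      for (i = 2; i <= NF; i++) {
        name = $i
        gsub(/,/, "", name)          # strip the separating comma
        if (name == role) found = 1  # exact match only, no prefixes
      }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Example (run from chef-repo):
#   bundle exec knife node show worker1.cluster.gitlab.com \
#     | node_has_role gitlab-cluster-worker || echo "role missing"
```

An empty `Roles:` line, as in the symptom output, makes the helper exit with status 1, which is the case where the node should be restored from file.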
If nodes are missing chef roles, follow the same steps as above for missing HAProxy workers.

If knife ssh does not work, check whether the IP address is correct for the node:
```
$ bundle exec knife node show worker1.cluster.gitlab.com
```
If ipaddress contains a wrong public IP, update /etc/ipaddress.txt on the node and run chef-client.

If ipaddress contains a private (local) IP, make sure /etc/ipaddress.txt is set and the node has at least the chef role base-X, where X is the OS type (for example debian). Check chef-repo/roles/base-* for all current base roles.
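To tell the two cases above apart quickly, the reported address can be classified as private or public. The sketch below is a hypothetical helper, `is_private_ip`, that matches the RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) with shell patterns. It does not validate that the input is a well-formed IPv4 address, and the way the address is extracted in the commented example assumes that `knife node show -a ipaddress` prints an `ipaddress: VALUE` line.

```shell
# is_private_ip ADDRESS
# Hypothetical helper: exits 0 if ADDRESS falls in an RFC 1918 private
# range, 1 otherwise. Pattern matching only, no input validation.
is_private_ip() {
  case "$1" in
    10.*) return 0 ;;                                         # 10.0.0.0/8
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;       # 172.16.0.0/12
    192.168.*) return 0 ;;                                    # 192.168.0.0/16
    *) return 1 ;;
  esac
}

# Example (run from chef-repo; the awk extraction is an assumption about
# the output format):
#   ip=$(bundle exec knife node show worker1.cluster.gitlab.com -a ipaddress \
#        | awk '/ipaddress:/ {print $2}')
#   is_private_ip "$ip" && echo "private IP: check /etc/ipaddress.txt and base role"
```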

Alerts

Chef client failures have reached critical levels

Alert name: ChefClientErrorCritical

Alert text: At least 10% of type TYPE are failing chef-runs

What to do:

Find one of the nodes that is affected
- The alert is summarized; click the link to the Prometheus graph in the alert (an easy way to reach the alerting environment) and change the query to just chef_client_error > 0. This lists a metric for each node that is currently broken, from which you can pick one of the type that is alerting. Often some correlation or commonality stands out and points to a suitable first candidate.
On that node, inspect the chef logs (sudo grep chef-client /var/log/syslog|less) to determine what's broken.
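Syslog on a busy node can hold many chef-client lines, so it may help to narrow the output to the most recent failures. The sketch below is a hypothetical filter, `chef_errors`, that reads log lines on stdin and keeps only chef-client lines that mention `ERROR` or `FATAL`. It assumes the syslog entries contain the string `chef-client` and that the client logs at those Chef log levels.

```shell
# chef_errors [COUNT]
# Hypothetical filter: reads syslog lines on stdin and prints the last
# COUNT (default 20) chef-client lines that mention ERROR or FATAL.
chef_errors() {
  grep 'chef-client' | grep -E 'ERROR|FATAL' | tail -n "${1:-20}"
}

# Example (on the affected node):
#   sudo cat /var/log/syslog | chef_errors 10
```

If the filter prints nothing, fall back to the full grep shown above, since the failure may be logged at a different level.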

It could be anything, but td-agent with incompatible gem combinations is a common cause. In that case, use td-agent-gem to manually adjust the installed versions until the affected gems, often Google-related, are all compatible with each other (compare versions against a still-functional node if necessary). Alternatively, delete all the installed gems and start again; running chef-client may bootstrap things again in that case.
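Comparing gem versions against a healthy node is easier with a diff of the two gem lists. The sketch below is a hypothetical helper, `gem_list_diff`, that takes two files holding saved `td-agent-gem list` output (one from the broken node, one from a working node) and prints only the lines that differ. How the lists are collected from each node is left open, and the file names in the example are placeholders.

```shell
# gem_list_diff BROKEN_LIST GOOD_LIST
# Hypothetical helper: each argument is a file containing `td-agent-gem list`
# output from one node. Prints lines that appear in only one of the files,
# prefixed with "<" (only in BROKEN_LIST) or ">" (only in GOOD_LIST).
gem_list_diff() {
  sorted_broken=$(mktemp) || return 2
  sorted_good=$(mktemp) || return 2
  sort "$1" > "$sorted_broken"
  sort "$2" > "$sorted_good"
  # Identical lists produce no output; that is not treated as an error.
  diff "$sorted_broken" "$sorted_good" | grep '^[<>]' || true
  rm -f "$sorted_broken" "$sorted_good"
}

# Example, after saving the output of `td-agent-gem list` from each node:
#   gem_list_diff broken-node-gems.txt good-node-gems.txt
```

Lines that show up on both sides with different version numbers are the first candidates to adjust with td-agent-gem.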
