Troubleshooting Cluster Operator

This guide describes how to troubleshoot common problems with the RabbitMQ Cluster Kubernetes Operator.

This guide may be helpful for DIY RabbitMQ on Kubernetes deployments but such environments are not its primary focus.

Common Scenarios and Errors

Certain errors have dedicated sections below.

RabbitMQ Cluster Fails to Deploy

A RabbitMQ instance does not become available within a few minutes of being created, and its RabbitMQ Pods do not run.

Common reasons for such failure are:

  • Incorrect imagePullSecrets configuration. This prevents the image from being pulled from a Docker registry.
  • Incorrect storageClassName configuration.

Potential solution to resolve this issue: correct the imagePullSecrets and storageClassName configuration of the instance.
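
To tell the two causes above apart, it can help to look at the Pod events and at the state of the PersistentVolumeClaims. The following is a minimal sketch, not an official procedure: the function name inspect_failed_deploy is hypothetical, and the namespace and Pod name are placeholders supplied by the caller.

```shell
# Sketch: collect evidence for a RabbitMQ instance whose Pods do not start.
# $1 = namespace of the instance, $2 = name of a Pod that is not running.
inspect_failed_deploy() {
  # Pod events usually show image pull failures caused by wrong imagePullSecrets.
  kubectl -n "$1" describe pod "$2"
  # A PersistentVolumeClaim that stays Pending often points at a wrong storageClassName.
  kubectl -n "$1" get pvc
  # Storage classes that actually exist in the cluster, for comparison.
  kubectl get storageclass
}

# Usage (placeholder names):
#   inspect_failed_deploy rmq-instance-1 example-server-0
```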

Pods Are Not Being Created

An error such as

pods POD-NAME is forbidden: unable to validate against any pod security policy: []

is reported as an event of the underlying ReplicaSet of the Kubernetes Operator deployment, or as an event of the underlying StatefulSet of the RabbitmqCluster.

This occurs if pod security policy admission control is enabled for the Kubernetes cluster, but you have not created the necessary PodSecurityPolicy and corresponding role-based access control (RBAC) resources.

A potential solution is to create the PodSecurityPolicy and RBAC resources by following the procedure in Pod Security Policies.
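
The error above is an event rather than a log line, so it has to be looked up on the ReplicaSet or StatefulSet. One way to do that is sketched below. The function name show_workload_events is hypothetical, and the namespaces in the usage comment are only examples.

```shell
# Sketch: list events recorded against a given kind of workload object.
# $1 = namespace, $2 = object kind (StatefulSet or ReplicaSet).
show_workload_events() {
  kubectl -n "$1" get events \
    --field-selector "involvedObject.kind=$2" \
    --sort-by=.metadata.creationTimestamp
}

# Usage (example namespaces):
#   show_workload_events rmq-instance-1 StatefulSet
#   show_workload_events rabbitmq-system ReplicaSet
```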

Pods Are Stuck in the Terminating State

After deleting a RabbitmqCluster instance, some Pods are stuck in the terminating state. RabbitMQ is still running in the affected Pods.

The likely cause is a leftover quorum queue in RabbitMQ.

Potential solution to resolve this issue:

  • Ensure there are no messages in the queue, or that it is acceptable to delete those messages.
  • Force-delete the affected Pod by running:
kubectl delete pod --force --grace-period=0 POD-NAME

This example uses a Pod name:

kubectl delete pod --force --grace-period=0 rabbit-rollout-restart-server-1
# warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
# pod 'rabbit-rollout-restart-server-1' force deleted
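
Before force-deleting a Pod, it is worth confirming that deletion was actually requested and checking how many messages the remaining queues hold. The sketch below assumes that rabbitmqctl can be run in the Pod's default container, which is plausible because RabbitMQ is still running there. Both function names are hypothetical.

```shell
# Sketch: pre-checks for a Pod that appears stuck in the terminating state.
# $1 = namespace, $2 = Pod name.

# Prints the deletion timestamp if deletion was requested, nothing otherwise.
pod_deletion_timestamp() {
  kubectl -n "$1" get pod "$2" -o jsonpath='{.metadata.deletionTimestamp}'
}

# Lists each queue with its type and message count.
list_queues_in_pod() {
  kubectl -n "$1" exec "$2" -- rabbitmqctl list_queues name type messages
}

# Usage (placeholder names):
#   pod_deletion_timestamp NAMESPACE rabbit-rollout-restart-server-1
#   list_queues_in_pod NAMESPACE rabbit-rollout-restart-server-1
```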

Check the Status of an Instance

To view the status of an instance, run:

kubectl -n NAMESPACE get all

Where NAMESPACE is the Kubernetes namespace of the instance.

For example:

kubectl -n rmq-instance-1 get all
# NAME                   READY   STATUS    RESTARTS   AGE
# pod/example-server-0   1/1     Running   0          2m27s

# NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                        AGE
# service/example-nodes   ClusterIP   None         None          4369/TCP                       2m27s
# service/example         ClusterIP                None          5672/TCP,15672/TCP,15692/TCP   2m28s

# NAME                              READY   AGE
# statefulset.apps/example-server   1/1     2m28s
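
The status check can be wrapped in small helpers, for example to also list the RabbitmqCluster resources and to block until the Pods report Ready. This is a sketch under assumptions: the function names are hypothetical, and the plural resource name rabbitmqclusters is assumed to be the one registered by the Operator's custom resource definition.

```shell
# Sketch: status helpers for a RabbitMQ instance.
# $1 = namespace of the instance.
instance_status() {
  kubectl -n "$1" get all
  # Assumption: the custom resource is exposed under the name "rabbitmqclusters".
  kubectl -n "$1" get rabbitmqclusters
}

# Blocks until every Pod in the namespace is Ready, or the timeout expires.
# $1 = namespace, $2 = timeout such as 120s.
wait_for_pods_ready() {
  kubectl -n "$1" wait --for=condition=Ready pod --all --timeout="$2"
}

# Usage:
#   instance_status rmq-instance-1
#   wait_for_pods_ready rmq-instance-1 120s
```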

Cluster Operator Fails on Startup

After being deployed, the RabbitMQ Cluster Operator fails during startup and its Pod is restarted.

Common reasons for such failure are:

  • The Operator can’t connect to the Kubernetes API.

Potential solution to resolve this issue:

  • Check whether the Operator is still crashing. Pod restarts resolve many transient issues, so a restart on its own is a symptom, not a problem.
  • Check the Operator logs (kubectl -n rabbitmq-system logs -l
  • You may see an error such as:
    • Failed to get API Group-Resources
    • Get https://ADDRESS:443/api: connect: connection refused
  • Check whether your Kubernetes cluster is healthy, specifically the kube-apiserver component.
  • Check whether any network policies could prevent the Operator from reaching the Kubernetes API server.
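
The checks above can be sketched as a pair of shell helpers. The function names are hypothetical. The readiness request is sent from the machine running kubectl, so it only shows that the API server is up, not that the Operator Pod can reach it. The rabbitmq-system namespace is the one used in the log command above.

```shell
# Sketch: first-pass diagnostics for an Operator that fails on startup.
operator_diagnostics() {
  # Operator Pods and their restart counts.
  kubectl -n rabbitmq-system get pods
  # Readiness of the API server as seen from this client.
  kubectl get --raw='/readyz?verbose'
  # Network policies that apply in the Operator namespace.
  kubectl -n rabbitmq-system get networkpolicies
}

# Logs of the previous (crashed) container of an Operator Pod.
# $1 = Operator Pod name, as listed by "kubectl -n rabbitmq-system get pods".
previous_operator_logs() {
  kubectl -n rabbitmq-system logs "$1" --previous
}

# Usage:
#   operator_diagnostics
#   previous_operator_logs POD-NAME
```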