RabbitMQ for PCF v1.9.8

Monitoring and KPIs for Pre‑Provisioned RabbitMQ for PCF

This topic explains how to monitor the health of the pre-provisioned version of the RabbitMQ for Pivotal Cloud Foundry (PCF) service using the logs, metrics, and Key Performance Indicators (KPIs) generated by RabbitMQ for PCF component VMs.

Pre-provisioned RabbitMQ for PCF components generate many of the same metrics as the on-demand RabbitMQ service components.

See Logging and Metrics for general information about logging and metrics in PCF.

Direct the Logs

To enable monitoring for RabbitMQ for PCF, operators designate an external syslog endpoint for RabbitMQ component log messages. The endpoint serves as the input to a monitoring platform such as Datadog, Papertrail, or SumoLogic.

To specify the destination for RabbitMQ for PCF log messages, do the following:

  1. From the Ops Manager Installation Dashboard, click the RabbitMQ tile.
  2. In the RabbitMQ tile, click the Settings tab.
  3. Click Syslog.
  4. Enter your syslog address and port.
  5. Click Save.
  6. Return to the Ops Manager Installation Dashboard and click Apply Changes to redeploy with the changes.

Logging Formats

With pre-provisioned RabbitMQ for PCF logging configured, three types of components generate logs: the RabbitMQ message server nodes, the service broker, and HAProxy. If you have multiple server or HAProxy nodes, you can identify logs from individual nodes by their index, which corresponds to the index of the RabbitMQ VM instances displayed in Ops Manager:

  • The logs for RabbitMQ server nodes follow the format [job=rabbitmq-server-partition-GUID index=X]
  • The logs for HAProxy nodes follow the format [job=rabbitmq-haproxy-partition-GUID index=X]
  • The logs for the RabbitMQ service broker follow the format [job=rabbitmq-broker-partition-GUID index=X]

RabbitMQ and HAProxy servers log at the info level and capture errors, warnings, and informational messages.

The logging format does not change in v1.9.0. However, for users familiar with documentation for previous versions of the tile, note that the tag formerly called the app_name is now called the program_name. The generic log format is as follows:

<PRI>TIMESTAMP IP_ADDRESS PROGRAM_NAME [job=NAME index=JOB_INDEX id=JOB_ID] MESSAGE

The raw logs look similar to the following:

<7>2017-06-28T16:06:10.733560+00:00 10.244.16.133 vcap.agent [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  2017/06/28 16:06:10  CEF:0|CloudFoundry|BOSH|1|agent_api|ssh|1|duser=director.be5a66bb-a9b4-459f-a0d3-1fc5c9c3ed79.be148cc6-91ef-4eed-a788-237b0b8c63b7 src=10.254.50.4 spt=4222 shost=5ae233e0-ecc5-4868-9ae0-f9767571251b
<86>2017-06-28T16:06:16.704572+00:00 10.244.16.133 useradd [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  new group: name=bosh_ly0d2rbjr, GID=1003
<86>2017-06-28T16:06:16.704663+00:00 10.244.16.133 useradd [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  new user: name=bosh_ly0d2rbjr, UID=1001, GID=1003, home=/var/vcap/bosh_ssh/bosh_ly0d2rbjr, shell=/bin/bash
<86>2017-06-28T16:06:16.736932+00:00 10.244.16.133 usermod [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  add 'bosh_ly0d2rbjr' to group 'admin'
<86>2017-06-28T16:06:16.736964+00:00 10.244.16.133 usermod [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  add 'bosh_ly0d2rbjr' to group 'vcap'

Logs sent to external logging tools such as Papertrail may be presented in a different format.

The following table describes the logging tags used in this format:

Tag Description
PRI A value that, in a future release, will be used to describe the severity of the log message and the facility it came from.
TIMESTAMP The timestamp of when the log is forwarded, for example, 2016-08-24T05:14:15.000003Z. The timestamp value is typically slightly later than when the log message was generated.
IP_ADDRESS The internal IP address of the server on which the log message originated.
PROGRAM_NAME The process name of the program that generated the message. Same as app_name before v1.9.0. For more information, see RabbitMQ Program Names below.
NAME The BOSH instance group name (for example, rabbitmq_server)
JOB_INDEX BOSH job index. Used to distinguish between multiple instances of the same job.
JOB_ID BOSH VM GUID. This is distinct from the CID displayed in the Ops Manager Status tab, which corresponds to the VM ID assigned by the infrastructure provider.
MESSAGE The log message that appears
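The tagged fields above can be pulled out of a raw log line with a small parser. The following is a minimal sketch, assuming the generic format shown earlier; the helper name `parse_log_line` is hypothetical, not part of the tile.

```python
import re

# Pattern mirrors the generic format described above:
# <PRI>TIMESTAMP IP_ADDRESS PROGRAM_NAME [job=NAME index=JOB_INDEX id=JOB_ID] MESSAGE
LOG_PATTERN = re.compile(
    r"^<(?P<pri>\d+)>"
    r"(?P<timestamp>\S+)\s+"
    r"(?P<ip_address>\S+)\s+"
    r"(?P<program_name>\S+)\s+"
    r"\[job=(?P<job>\S+) index=(?P<index>\d+) id=(?P<id>\S+)\]\s+"
    r"(?P<message>.*)$"
)

def parse_log_line(line):
    """Return a dict of the tagged fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Running this over the raw `useradd` example above yields `pri="86"`, `program_name="useradd"`, `job="rmq"`, and `index="0"`, which is enough to route logs per node in a monitoring platform.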

RabbitMQ Program Names

Program Name Description
rabbitmq_server_cluster_check Checks that the RabbitMQ cluster is healthy. Runs after every deploy.
rabbitmq_server_node_check Checks that the RabbitMQ node is healthy. Runs after every deploy.
rabbitmq_route_registrar_stderr Registers the route for the management API with the Gorouter in your Elastic Runtime deployment.
rabbitmq_route_registrar_stdout Registers the route for the management API with the Gorouter in your Elastic Runtime deployment.
rabbitmq_server The Erlang VM and RabbitMQ apps. Logs may span multiple lines.
rabbitmq_server_drain Shuts down the Erlang VM and RabbitMQ apps. Runs as part of the BOSH lifecycle.
rabbitmq_server_http_api_access Access to the RabbitMQ management UI.
rabbitmq_server_init Starts the Erlang VM and RabbitMQ.
rabbitmq_server_post_deploy_stderr Runs the node check and cluster check. Runs after every deploy.
rabbitmq_server_post_deploy_stdout Runs the node check and cluster check. Runs after every deploy.
rabbitmq_server_pre_start Runs before the rabbitmq-server job is started.
rabbitmq_server_sasl Supervisor, progress, and crash reporting for the Erlang VM and RabbitMQ apps.
rabbitmq_server_shutdown_stderr Stops the RabbitMQ app and Erlang VM.
rabbitmq_server_shutdown_stdout Stops the RabbitMQ app and Erlang VM.
rabbitmq_server_startup_stderr Starts the RabbitMQ app and Erlang VM, then configures users and permissions.
rabbitmq_server_startup_stdout Starts the RabbitMQ app and Erlang VM, then configures users and permissions.
rabbitmq_server_upgrade Shuts down Erlang VM and RabbitMQ app if required during an upgrade.

Metrics

Metrics are regularly-generated log messages that report measured component states. The metrics polling interval defaults to 30 seconds. This interval is a configuration option on the RabbitMQ tile (Settings > RabbitMQ). The interval setting applies to all components deployed by the tile.

Metrics are long, single lines of text that follow the format:

origin:"p-rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" valueMetric: < name:"/p-rabbitmq/rabbitmq/system/memory" value:1024 unit:"MB">

Partition Indicator

A new metric has been introduced to help identify network partitions. It exposes how many cluster nodes each node can reach. When a node is partitioned, the only node it recognizes is itself, which is a strong indication that the node might be in a partition.

An example of this metric is:

origin:"p-rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" valueMetric: < name:"/p-rabbitmq/rabbitmq/erlang/reachable_nodes" value:3 unit:"count">

Monitors can be created to emit alerts when a cluster appears to be partitioned. The metric is emitted once per node in the cluster, and each healthy node reports the full cluster size, itself included. For example, in a three-node cluster each node is expected to report a value of 3 (two reachable peers plus itself), so a monitor can expect the values to total 9 (nine). If the total falls short, an alert can be sent to the team.
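The partition rule above can be sketched as a small check: sum the latest reachable_nodes value from each node and compare it against the cluster size squared. The function name is hypothetical; how the per-node values are collected depends on your monitoring platform.

```python
def partition_suspected(reachable_counts, cluster_size):
    """reachable_counts: latest reachable_nodes value reported by each node.

    Each healthy node should report the full cluster size (itself included),
    so a healthy cluster totals cluster_size * cluster_size.
    """
    expected_total = cluster_size * cluster_size  # e.g. 9 for a 3-node cluster
    return sum(reachable_counts) < expected_total
```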

Recovering from a network partition

Refer to the official RabbitMQ guide to understand how to recover from a network partition: https://www.rabbitmq.com/partitions.html

Key Performance Indicators

Key Performance Indicators (KPIs) for RabbitMQ for PCF are metrics that operators find most useful for monitoring their RabbitMQ service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

Pivotal provides the following KPIs as general alerting and response guidance for typical RabbitMQ for PCF installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

For a list of all RabbitMQ for PCF raw component metrics, see Component Metrics Reference.

Component Heartbeats

Key RabbitMQ for PCF components periodically emit heartbeat metrics: the RabbitMQ server nodes, HAProxy nodes, and the Service Broker. The heartbeats are Boolean metrics, where 1 means the system is available, and 0 or the absence of a heartbeat metric means the service is not responding and should be investigated.
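The heartbeat semantics above suggest a simple check, following the recommended measurement used by each heartbeat KPI below (average over the last 5 minutes, red critical when below 1). This is a sketch with assumed names; the absence of samples is treated as a failure, per the note that a missing heartbeat means the service is not responding.

```python
def heartbeat_status(samples):
    """samples: heartbeat values (0 or 1) received in the last 5 minutes."""
    if not samples:
        # No heartbeat metric at all: the component should be investigated.
        return "red"
    average = sum(samples) / len(samples)
    return "red" if average < 1 else "ok"
```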

Service Broker Heartbeat


p-rabbitmq.service_broker.heartbeat

Description RabbitMQ Service Broker is alive poll, which indicates if the component is available and able to respond to requests.

Use: If the Service Broker does not emit heartbeats, this indicates that it is offline. The Service Broker is required to create, update, and delete service instances, which are critical for dependent tiles such as Spring Cloud Services and Spring Cloud Data Flow.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Check the RabbitMQ Service Broker logs for errors. You can find this VM by targeting your RabbitMQ deployment with BOSH and running bosh vms.

HAProxy Heartbeat


p-rabbitmq.haproxy.heartbeat

Description RabbitMQ HAProxy is alive poll, which indicates if the component is available and able to respond to requests.

Use: If the HAProxy does not emit heartbeats, this indicates that it is offline. To be functional, service instances require HAProxy.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Check the RabbitMQ HAProxy logs for errors. You can find the VM by targeting your RabbitMQ deployment with BOSH and running bosh vms, which lists HAProxy_GUID.

Server Heartbeat


p-rabbitmq.rabbitmq.heartbeat

Description RabbitMQ Server is alive poll, which indicates if the component is available and able to respond to requests.

Use: If the server does not emit heartbeats, this indicates that it is offline. To be functional, service instances require RabbitMQ Server.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Check the RabbitMQ Server logs for errors. You can find the VM by targeting your RabbitMQ deployment with BOSH and running bosh vms, which lists rabbitmq.

RabbitMQ Server KPIs

The following KPIs come from the RabbitMQ server component:

File Descriptors


p-rabbitmq.rabbitmq.system.file_descriptors

Description File descriptors consumed.

Use: If the number of file descriptors consumed becomes too large, the VM may lose the ability to perform disk IO, which can cause data loss.

Note: This assumes non-persistent messages are handled by retries or some other logic by the producers.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 50000
Red critical: > 55000
Recommended response The default ulimit for RabbitMQ for PCF v1.6 and later is 60000. If this metric is met or exceeded for an extended period of time, consider one of the following actions:
  • Scaling the rabbit nodes in the tile Resource Config pane.
  • Increasing the ulimit
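The recommended measurement and thresholds above can be expressed as a small alerting rule: average the file-descriptor samples over the last 10 minutes and compare against the documented yellow (> 50000) and red (> 55000) levels, relative to the 60000 default ulimit. The function and constant names are illustrative.

```python
YELLOW_FDS = 50000  # yellow warning threshold from the table above
RED_FDS = 55000     # red critical threshold from the table above

def fd_alert_level(samples):
    """samples: file_descriptors values from the last 10 minutes."""
    average = sum(samples) / len(samples)
    if average > RED_FDS:
        return "red"
    if average > YELLOW_FDS:
        return "yellow"
    return "ok"
```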

Erlang Processes


p-rabbitmq.rabbitmq.system.erlang_processes

Description Erlang processes consumed by RabbitMQ, which runs on an Erlang VM.

Use: This is the key indicator of the processing capability of a node.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 900000
Red critical: > 950000
Recommended response The default Erlang process limit in RabbitMQ for PCF v1.6 and later is 1,048,816. If this metric meets or exceeds the recommended thresholds for extended periods of time, consider scaling the RabbitMQ nodes in the tile Resource Config pane.
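The same measurement pattern (10-minute average against yellow/red thresholds) applies to the Erlang process KPI, so a generic helper can serve both this metric and others like it. Names are assumptions for illustration; the thresholds come from the table above.

```python
ERLANG_YELLOW = 900_000  # yellow warning threshold
ERLANG_RED = 950_000     # red critical threshold

def alert_level(samples, yellow, red):
    """Classify a metric by its average over the sampled window."""
    average = sum(samples) / len(samples)
    if average > red:
        return "red"
    if average > yellow:
        return "yellow"
    return "ok"
```

Usage: `alert_level(erlang_process_samples, ERLANG_YELLOW, ERLANG_RED)` returns "yellow" or "red" when the averaged process count crosses a threshold, at which point scaling the RabbitMQ nodes in the tile Resource Config pane is the recommended response.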

BOSH System Health Metrics

The BOSH layer that underlies PCF generates healthmonitor metrics for all VMs in the deployment. However, these metrics are not included in the Loggregator Firehose by default. To get these metrics, do either of the following:

  • To send BOSH HM metrics through the Firehose, install the open-source HM Forwarder.
  • To retrieve BOSH health metrics outside of the Firehose, install the JMX Bridge for PCF tile.

In a future release the BOSH system health metrics will be available directly from the Firehose.

All BOSH-deployed components generate the following system metrics. Coming from RabbitMQ for PCF components, these system metrics serve as KPIs for the RabbitMQ for PCF service.

RAM


system.mem.percent

Description RAM being consumed by the p-rabbitmq VM.

Use: RabbitMQ is considered to be in a good state when it holds few or no messages. In other words, “an empty rabbit is a happy rabbit.” Alerting on this metric can indicate that too few consumers or apps are reading messages from the queue.

Healthmonitor reports when RabbitMQ uses more than 40% of its RAM for the past ten minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 40
Red critical: > 50
Recommended response Add more consumers to drain the queue as fast as possible.

CPU


system.cpu.percent

Description CPU being consumed by the p-rabbitmq VM.

Use: A node that experiences context switching or high CPU usage will become unresponsive. This also affects the ability of the node to report metrics.

Healthmonitor reports when RabbitMQ uses more than 40% of its CPU for the past ten minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 60
Red critical: > 75
Recommended response Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible.

Ephemeral Disk


system.disk.percent

Description Ephemeral Disk being consumed by the p-rabbitmq VM.

Use: If the ephemeral disk fills up, there are too few consumers.

Healthmonitor reports when RabbitMQ uses more than 40% of its ephemeral disk for the past ten minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 60
Red critical: > 75
Recommended response Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible.

Persistent Disk


persistent.disk.percent

Description Persistent Disk being consumed by the p-rabbitmq VM.

Use: If the persistent disk fills up, there are too few consumers.

Healthmonitor reports when RabbitMQ uses more than 40% of its persistent disk for the past ten minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 60
Red critical: > 75
Recommended response Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible.

Component Metric Reference

RabbitMQ for PCF component VMs emit the following raw metrics. The full name of the metric follows the format: /p-rabbitmq/COMPONENT/METRIC-NAME

RabbitMQ Server Metrics

RabbitMQ for PCF message server components emit the following metrics.

Full Name Unit Description
/p-rabbitmq.rabbitmq.heartbeat boolean Indicates whether the RabbitMQ server is available and able to respond to requests
/p-rabbitmq/rabbitmq/erlang/erlang_processes count The number of Erlang processes
/p-rabbitmq/rabbitmq/system/memory MB The memory in MB used by the node
/p-rabbitmq/rabbitmq/connections/count count The total number of connections to the node
/p-rabbitmq/rabbitmq/consumers/count count The total number of consumers registered in the node
/p-rabbitmq/rabbitmq/messages/delivered count The total number of messages with the status deliver_get on the node
/p-rabbitmq/rabbitmq/messages/delivered_no_ack count The number of messages with the status deliver_no_ack on the node
/p-rabbitmq/rabbitmq/messages/delivered_rate rate The rate at which messages are being delivered to consumers or clients on the node
/p-rabbitmq/rabbitmq/messages/published count The total number of messages with the status publish on the node
/p-rabbitmq/rabbitmq/messages/published_rate rate The rate at which messages are being published by the node
/p-rabbitmq/rabbitmq/messages/redelivered count The total number of messages with the status redeliver on the node
/p-rabbitmq/rabbitmq/messages/redelivered_rate rate The rate at which messages are getting the status redeliver on the node
/p-rabbitmq/rabbitmq/messages/get_no_ack count The number of messages with the status get_no_ack on the node
/p-rabbitmq/rabbitmq/messages/get_no_ack_rate rate The rate at which messages get the status get_no_ack on the node
/p-rabbitmq/rabbitmq/messages/pending count The number of messages with the status messages_unacknowledged on the node
/p-rabbitmq/rabbitmq/messages/depth count The number of messages with the status messages_unacknowledged or messages_ready on the node
/p-rabbitmq/rabbitmq/system/file_descriptors count The number of open file descriptors on the node
/p-rabbitmq/rabbitmq/exchanges/count count The total number of exchanges on the node
/p-rabbitmq/rabbitmq/messages/available count The total number of messages with the status messages_ready on the node
/p-rabbitmq/rabbitmq/queues/count count The number of queues on the node
/p-rabbitmq/rabbitmq/channels/count count The number of channels on the node
/p-rabbitmq/rabbitmq/queues/VHOST-NAME/QUEUE-NAME/consumers count The number of consumers per virtual host per queue
/p-rabbitmq/rabbitmq/queues/VHOST-NAME/QUEUE-NAME/depth count The number of messages with the status messages_unacknowledged or messages_ready per virtual host per queue

HAProxy Metrics

RabbitMQ for PCF HAProxy components emit the following metrics.

Full Name Unit Description
/p-rabbitmq.haproxy.heartbeat boolean Indicates whether the RabbitMQ HAProxy component is available and able to respond to requests
/p-rabbitmq/haproxy/health/connections count The total number of concurrent front-end connections to the server
/p-rabbitmq/haproxy/backend/qsize/amqp size The total size of the AMQP queue on the server
/p-rabbitmq/haproxy/backend/retries/amqp count The number of AMQP retries to the server
/p-rabbitmq/haproxy/backend/ctime/amqp time The total time to establish the TCP AMQP connection to the server