RabbitMQ for PCF v1.8.20

Monitoring and KPIs for Pre‑Provisioned RabbitMQ for PCF

This topic explains how to monitor the health of the pre-provisioned version of the RabbitMQ for Pivotal Cloud Foundry (PCF) service using the logs, metrics, and Key Performance Indicators (KPIs) generated by RabbitMQ for PCF component VMs.

Pre-provisioned RabbitMQ for PCF components generate many of the same metrics as the on-demand RabbitMQ service components.

For general information about logging and metrics in PCF, see Logging and Metrics.

Direct the Logs

To enable monitoring for RabbitMQ for PCF, operators designate an external syslog endpoint for RabbitMQ component log messages. The endpoint serves as the input to a monitoring platform such as Datadog, Papertrail, or SumoLogic.

To specify the destination for RabbitMQ for PCF log messages, do the following:

  1. From the Ops Manager Installation Dashboard, click the RabbitMQ tile.
  2. In the RabbitMQ tile, click the Settings tab.
  3. Click Syslog.
  4. Enter your syslog address and port.
  5. Click Save.
  6. Return to the Ops Manager Installation Dashboard and click Apply Changes to redeploy with the changes.
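Before relying on the endpoint, it can help to confirm that it receives messages at all. The following is a minimal sketch that sends one test message with Python's standard `logging.handlers.SysLogHandler`. It assumes the endpoint accepts syslog over UDP; the function name and logger name are hypothetical, and the host and port are whatever you entered in the Syslog pane.

```python
import logging
import logging.handlers
import socket


def make_syslog_logger(host: str, port: int,
                       name: str = "rabbitmq-syslog-check") -> logging.Logger:
    """Return a logger that forwards records to a syslog endpoint over UDP.

    The logger name is a hypothetical label, not something the tile defines.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the test message off the root logger
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        socktype=socket.SOCK_DGRAM,  # assumption: the endpoint listens on UDP
    )
    logger.addHandler(handler)
    return logger
```

For example, `make_syslog_logger("syslog.example.com", 514).info("test message")` sends a single message to a placeholder endpoint. UDP gives no delivery confirmation, so check the monitoring platform to see whether the message arrived.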

Log Formats

With pre-provisioned RabbitMQ for PCF configured correctly, three types of components generate logs: the RabbitMQ message server nodes, the service broker, and HAProxy. If you have multiple server or HAProxy nodes, you can identify logs from individual nodes by their index, which corresponds to the index of the RabbitMQ VM instances displayed in Ops Manager:

  • The logs for RabbitMQ server nodes follow the format [job=rabbitmq-server-partition-GUID index=X]
  • The logs for HAProxy nodes follow the format [job=rabbitmq-haproxy-partition-GUID index=X]
  • The logs for the RabbitMQ service broker follow the format [job=rabbitmq-broker-partition-GUID index=X]
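The job tags listed above can be pulled out of a log line with a regular expression. The sketch below is illustrative only: the `LogSource` type and the `parse_log_source` function are hypothetical names, and it assumes the tag appears in the line exactly as `[job=... index=N]`.

```python
import re
from typing import NamedTuple, Optional

# Matches the "[job=<job name> index=<n>]" tag described above.
TAG_RE = re.compile(r"\[job=(?P<job>\S+) index=(?P<index>\d+)\]")


class LogSource(NamedTuple):
    component: str  # "server", "haproxy", "broker", or "unknown"
    job: str        # full job name, including the partition GUID
    index: int      # node index, as displayed in Ops Manager


def parse_log_source(line: str) -> Optional[LogSource]:
    """Return the component, job, and index for a log line, or None if untagged."""
    match = TAG_RE.search(line)
    if match is None:
        return None
    job = match.group("job")
    if job.startswith("rabbitmq-server"):
        component = "server"
    elif job.startswith("rabbitmq-haproxy"):
        component = "haproxy"
    elif job.startswith("rabbitmq-broker"):
        component = "broker"
    else:
        component = "unknown"
    return LogSource(component, job, int(match.group("index")))
```

Grouping log lines by `(component, index)` then separates the output of individual nodes.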

RabbitMQ and HAProxy servers log at the info level and capture errors, warnings, and informational messages.

Metrics

Metrics are regularly generated log messages that report measured component states. The metrics polling interval defaults to 30 seconds. This interval is a configuration option on the RabbitMQ tile (Settings > RabbitMQ). The interval setting applies to all components deployed by the tile.

Metrics are long, single lines of text that follow the format:

origin:"p-rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" valueMetric: < name:"/p-rabbitmq/rabbitmq/system/memory" value:1024 unit:"MB">
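A line in this format can be split into its fields with a small parser. The following is a rough sketch rather than an official parser: `parse_metric_line` is a hypothetical helper, and it assumes every field is a `key:value` pair whose value is either double-quoted or a bare token, as in the sample above. It flattens the nested `valueMetric` group into the same dictionary.

```python
import re
from typing import Dict, Union

# key:"quoted value" or key:bare_value; the "valueMetric: <" wrapper is skipped
# because "<" is neither a quote nor a bare-token character.
PAIR_RE = re.compile(r'(\w+):\s*(?:"([^"]*)"|([^\s<>"]+))')


def parse_metric_line(line: str) -> Dict[str, Union[str, int, float]]:
    """Parse one metric line into a flat dict of its fields."""
    fields: Dict[str, Union[str, int, float]] = {}
    for key, quoted, bare in PAIR_RE.findall(line):
        fields[key] = quoted or bare
    if "timestamp" in fields:
        fields["timestamp"] = int(fields["timestamp"])
    if "value" in fields:
        fields["value"] = float(fields["value"])
    return fields
```

Applied to the sample line, this yields `name` as `/p-rabbitmq/rabbitmq/system/memory`, `value` as `1024.0`, and `unit` as `MB`, alongside the `origin`, `deployment`, `job`, `index`, and `ip` fields.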

Key Performance Indicators

Key Performance Indicators (KPIs) for RabbitMQ for PCF are metrics that operators find most useful for monitoring their RabbitMQ service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

Pivotal provides the following KPIs as general alerting and response guidance for typical RabbitMQ for PCF installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

For a list of all RabbitMQ for PCF raw component metrics, see Component Metrics Reference.

Component Heartbeats

Key RabbitMQ for PCF components periodically emit heartbeat metrics: the RabbitMQ server nodes, HAProxy nodes, and the Service Broker. The heartbeats are Boolean metrics, where 1 means the system is available, and 0 or the absence of a heartbeat metric means the service is not responding and should be investigated.

Service Broker Heartbeat


p-rabbitmq.service_broker.heartbeat

Description: Liveness poll for the RabbitMQ Service Broker, which indicates whether the component is available and able to respond to requests.

Use: If the Service Broker does not emit heartbeats, this indicates that it is offline. The Service Broker is required to create, update, and delete service instances, which are critical for dependent tiles such as Spring Cloud Services and Spring Cloud Data Flow.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  • Yellow warning: N/A
  • Red critical: < 1
Recommended response: Check the RabbitMQ Service Broker logs for errors. You can find this VM by targeting your RabbitMQ deployment with BOSH and running bosh vms.

HAProxy Heartbeat


p-rabbitmq.haproxy.heartbeat

Description: Liveness poll for RabbitMQ HAProxy, which indicates whether the component is available and able to respond to requests.

Use: If the HAProxy does not emit heartbeats, this indicates that it is offline. To be functional, service instances require HAProxy.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  • Yellow warning: N/A
  • Red critical: < 1
Recommended response: Check the RabbitMQ HAProxy logs for errors. You can find the VM by targeting your RabbitMQ deployment with BOSH and running bosh vms, which lists HAProxy_GUID.

Server Heartbeat


p-rabbitmq.rabbitmq.heartbeat

Description: Liveness poll for the RabbitMQ Server, which indicates whether the component is available and able to respond to requests.

Use: If the server does not emit heartbeats, this indicates that it is offline. To be functional, service instances require RabbitMQ Server.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  • Yellow warning: N/A
  • Red critical: < 1
Recommended response: Check the RabbitMQ Server logs for errors. You can find the VM by targeting your RabbitMQ deployment with BOSH and running bosh vms, which lists rabbitmq.
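The heartbeat guidance above (average over the last 5 minutes, critical when the average is below 1, and a missing heartbeat treated as not responding) can be sketched as a small check. This is one possible way to apply the thresholds, not the logic of any particular monitoring platform. The function name and the `(timestamp, value)` sample layout are assumptions.

```python
from typing import Iterable, Tuple


def heartbeat_status(samples: Iterable[Tuple[float, float]],
                     now: float,
                     window_s: float = 300.0) -> str:
    """Classify a heartbeat metric as "ok" or "critical".

    samples: (timestamp in seconds, heartbeat value) pairs, where the value
             is 1 for available and 0 for unavailable.
    now:     current time in seconds, on the same clock as the timestamps.
    """
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    if not recent:
        # No heartbeat in the window counts as "not responding".
        return "critical"
    average = sum(recent) / len(recent)
    return "critical" if average < 1 else "ok"
```

With the default 30-second interval, a 5-minute window holds roughly ten samples, so a single 0 is enough to pull the average below 1 and raise the alert.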

RabbitMQ Server KPIs

The following KPIs come from the RabbitMQ server component:

File Descriptors


p-rabbitmq.rabbitmq.system.file_descriptors

Description: File descriptors consumed.

Use: If the number of file descriptors consumed becomes too large, the VM may lose the ability to perform disk I/O, which can cause data loss.

Note: This assumes non-persistent messages are handled by retries or some other logic by the producers.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
  • Yellow warning: > 50000
  • Red critical: > 55000
Recommended response: The default ulimit for RabbitMQ for PCF v1.6 and later is 60000. If this metric meets or exceeds the recommended thresholds for an extended period of time, consider one of the following actions:
  • Scaling the RabbitMQ nodes in the tile Resource Config pane.
  • Increasing the ulimit.

Erlang Processes


p-rabbitmq.rabbitmq.system.erlang_processes

Description: Erlang processes consumed by RabbitMQ, which runs on an Erlang VM.

Use: This is the key indicator of the processing capability of a node.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
  • Yellow warning: > 900000
  • Red critical: > 950000
Recommended response: The default Erlang process limit in RabbitMQ for PCF v1.6 and later is 1,048,816. If this metric meets or exceeds the recommended thresholds for extended periods of time, consider scaling the RabbitMQ nodes in the tile Resource Config pane.
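As an illustration, the server KPI thresholds and default limits quoted above can be collected into a lookup table and applied to a 10-minute average. The dictionary layout and function names below are hypothetical. The numbers are the ones stated in this topic and should be adjusted if you change the ulimit or tune the thresholds for your installation.

```python
# Thresholds and default limits as stated in this topic.
SERVER_KPIS = {
    "p-rabbitmq.rabbitmq.system.file_descriptors": {
        "yellow": 50000, "red": 55000, "limit": 60000,
    },
    "p-rabbitmq.rabbitmq.system.erlang_processes": {
        "yellow": 900000, "red": 950000, "limit": 1048816,
    },
}


def classify(metric: str, average: float) -> str:
    """Return "red", "yellow", or "ok" for a 10-minute average."""
    kpi = SERVER_KPIS[metric]
    if average > kpi["red"]:
        return "red"
    if average > kpi["yellow"]:
        return "yellow"
    return "ok"


def percent_of_limit(metric: str, average: float) -> float:
    """Return how much of the default limit the average consumes, in percent."""
    return 100.0 * average / SERVER_KPIS[metric]["limit"]
```

The thresholds sit close to the limits. For file descriptors, the yellow threshold of 50000 is already about 83% of the default ulimit of 60000, so a yellow alert leaves limited headroom.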

BOSH System Metrics

All BOSH-deployed components generate the following system metrics. When emitted by RabbitMQ for PCF components, these system metrics serve as KPIs for the RabbitMQ for PCF service.

RAM


system.mem.percent

Description: RAM being consumed by the p-rabbitmq VM.

Use: RabbitMQ is considered to be in a good state when it holds few or no messages. In other words, “an empty rabbit is a happy rabbit.” An alert on this metric can indicate that there are too few consumers or apps reading messages from the queue.

Healthmonitor reports when RabbitMQ has used more than 40% of its RAM over the past 10 minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
  • Yellow warning: > 40
  • Red critical: > 50
Recommended response: Add more consumers to drain the queue as fast as possible.

CPU


system.cpu.percent

Description: CPU being consumed by the p-rabbitmq VM.

Use: A node that experiences heavy context switching or high CPU usage can become unresponsive. This also affects the ability of the node to report metrics.

Healthmonitor reports when RabbitMQ has used more than 40% of its CPU over the past 10 minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
  • Yellow warning: > 60
  • Red critical: > 75
Recommended response: Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible.

Ephemeral Disk


system.disk.percent

Description: Ephemeral disk being consumed by the p-rabbitmq VM.

Use: If the ephemeral disk fills up, this can indicate that there are too few consumers.

Healthmonitor reports when RabbitMQ has used more than 40% of its disk over the past 10 minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
  • Yellow warning: > 60
  • Red critical: > 75
Recommended response: Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible.

Persistent Disk


persistent.disk.percent

Description: Persistent disk being consumed by the p-rabbitmq VM.

Use: If the persistent disk fills up, this can indicate that there are too few consumers.

Healthmonitor reports when RabbitMQ has used more than 40% of its disk over the past 10 minutes.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
  • Yellow warning: > 60
  • Red critical: > 75
Recommended response: Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible.
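The "average over last 10 minutes" measurement used by the BOSH system metrics can be sketched as a rolling window over timestamped samples. The class and function names below are hypothetical and the threshold table simply restates the values given in this topic. A real monitoring platform would normally provide this aggregation itself.

```python
from collections import deque
from typing import Deque, Optional, Tuple

# (yellow, red) thresholds in percent, as stated in this topic.
SYSTEM_THRESHOLDS = {
    "system.mem.percent": (40, 50),
    "system.cpu.percent": (60, 75),
    "system.disk.percent": (60, 75),
    "persistent.disk.percent": (60, 75),
}


class RollingAverage:
    """Average of the samples received within the last window_s seconds."""

    def __init__(self, window_s: float = 600.0) -> None:
        self.window_s = window_s
        self._samples: Deque[Tuple[float, float]] = deque()

    def add(self, ts: float, value: float) -> None:
        """Record a sample and drop samples that are window_s old or older."""
        self._samples.append((ts, value))
        while self._samples and self._samples[0][0] <= ts - self.window_s:
            self._samples.popleft()

    def average(self) -> Optional[float]:
        """Return the mean of the retained samples, or None if there are none."""
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)


def alert_level(metric: str, average: float) -> str:
    """Return "red", "yellow", or "ok" for a windowed average."""
    yellow, red = SYSTEM_THRESHOLDS[metric]
    if average > red:
        return "red"
    if average > yellow:
        return "yellow"
    return "ok"
```

Averaging over the window keeps a single short spike from raising an alert, while sustained usage above a threshold still does.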

Component Metrics Reference

RabbitMQ for PCF component VMs emit the following raw metrics. The full name of the metric follows the format: /p-rabbitmq/COMPONENT/METRIC-NAME
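To route metrics by component, a full name in this format can be split into its component and metric-name parts. The helper below is a hypothetical sketch that assumes names strictly follow /p-rabbitmq/COMPONENT/METRIC-NAME and rejects anything else.

```python
from typing import Tuple


def split_metric_name(full_name: str) -> Tuple[str, str]:
    """Split "/p-rabbitmq/COMPONENT/METRIC-NAME" into (COMPONENT, METRIC-NAME)."""
    parts = full_name.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "p-rabbitmq":
        raise ValueError("unexpected metric name: " + full_name)
    # The metric name itself may contain further slashes, such as "system/memory".
    return parts[1], "/".join(parts[2:])
```

For example, `/p-rabbitmq/rabbitmq/system/memory` splits into the component `rabbitmq` and the metric name `system/memory`.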

RabbitMQ Server Metrics

RabbitMQ for PCF message server components emit the following metrics.

Full Name | Unit | Description
/p-rabbitmq/rabbitmq/heartbeat | boolean | Indicates whether the RabbitMQ server is available and able to respond to requests
/p-rabbitmq/rabbitmq/erlang/erlang_processes | count | The number of Erlang processes
/p-rabbitmq/rabbitmq/system/memory | MB | The memory in MB used by the node
/p-rabbitmq/rabbitmq/connections/count | count | The total number of connections to the node
/p-rabbitmq/rabbitmq/consumers/count | count | The total number of consumers registered in the node
/p-rabbitmq/rabbitmq/messages/delivered | count | The total number of messages with the status deliver_get on the node
/p-rabbitmq/rabbitmq/messages/delivered_no_ack | count | The number of messages with the status deliver_no_ack on the node
/p-rabbitmq/rabbitmq/messages/delivered_rate | rate | The rate at which messages are being delivered to consumers or clients on the node
/p-rabbitmq/rabbitmq/messages/published | count | The total number of messages with the status publish on the node
/p-rabbitmq/rabbitmq/messages/published_rate | rate | The rate at which messages are being published by the node
/p-rabbitmq/rabbitmq/messages/redelivered | count | The total number of messages with the status redeliver on the node
/p-rabbitmq/rabbitmq/messages/redelivered_rate | rate | The rate at which messages are getting the status redeliver on the node
/p-rabbitmq/rabbitmq/messages/get_no_ack | count | The number of messages with the status get_no_ack on the node
/p-rabbitmq/rabbitmq/messages/get_no_ack_rate | rate | The rate at which messages get the status get_no_ack on the node
/p-rabbitmq/rabbitmq/messages/pending | count | The number of messages with the status messages_unacknowledged on the node
/p-rabbitmq/rabbitmq/system/file_descriptors | count | The number of open file descriptors on the node
/p-rabbitmq/rabbitmq/exchanges/count | count | The total number of exchanges on the node
/p-rabbitmq/rabbitmq/messages/available | count | The total number of messages with the status messages_ready on the node
/p-rabbitmq/rabbitmq/queues/count | count | The number of queues on the node
/p-rabbitmq/rabbitmq/channels/count | count | The number of channels on the node
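This topic notes that KPIs can be derived by applying formulas to raw metrics. As one hypothetical example, which is not a KPI defined by Pivotal, the raw server metrics above can be combined into a "backlog per consumer" figure that reflects the recurring advice to add consumers when queues fill up. The function name and the choice of formula are assumptions.

```python
from typing import Mapping


def backlog_per_consumer(metrics: Mapping[str, float]) -> float:
    """Return (ready + unacknowledged messages) divided by registered consumers.

    metrics maps full metric names to their latest values for one node.
    Missing metrics are treated as 0.
    """
    available = metrics.get("/p-rabbitmq/rabbitmq/messages/available", 0.0)
    pending = metrics.get("/p-rabbitmq/rabbitmq/messages/pending", 0.0)
    consumers = metrics.get("/p-rabbitmq/rabbitmq/consumers/count", 0.0)
    backlog = available + pending
    if consumers == 0:
        # A backlog with no consumers cannot drain; report it as unbounded.
        return float("inf") if backlog > 0 else 0.0
    return backlog / consumers
```

A value that keeps rising over time suggests that consumers are not keeping up with publishers. Any alert threshold for it would have to come from observing your own installation.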

RabbitMQ Service Broker Metric

RabbitMQ for PCF service broker components emit the following metric.

Full Name | Unit | Description
/p-rabbitmq/service_broker/heartbeat | boolean | Indicates whether the service broker is available and able to respond to requests

HAProxy Metrics

RabbitMQ for PCF HAProxy components emit the following metrics.

Full Name | Unit | Description
/p-rabbitmq/haproxy/heartbeat | boolean | Indicates whether the RabbitMQ HAProxy component is available and able to respond to requests
/p-rabbitmq/haproxy/health/connections | count | The total number of concurrent front-end connections to the server
/p-rabbitmq/haproxy/backend/qsize/amqp | size | The total size of the AMQP queue on the server
/p-rabbitmq/haproxy/backend/retries/amqp | count | The number of AMQP retries to the server
/p-rabbitmq/haproxy/backend/ctime/amqp | time | The total time to establish the TCP AMQP connection to the server