Monitoring and KPIs for Pre‑Provisioned RabbitMQ for PCF
Warning: RabbitMQ for Pivotal Cloud Foundry v1.16 is no longer supported because it has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
This topic explains how to monitor the health of the pre-provisioned version of the RabbitMQ for Pivotal Cloud Foundry (PCF) service using the logs, metrics, and Key Performance Indicators (KPIs) generated by RabbitMQ for PCF component VMs.
Pre-provisioned RabbitMQ for PCF components generate many of the same metrics as the on-demand RabbitMQ service components.
See Overview of Logging and Metrics for general information about logging and metrics in PCF.
Setting up Syslog Forwarding
Operators can enable log forwarding by configuring an external syslog endpoint for RabbitMQ component log messages. For instructions on setting up syslog forwarding, see Configure Syslog Forwarding and Metrics Polling Interval.
If syslog forwarding is enabled, log entries with timestamps can also be found locally in /var/log/messages. In any case, logs are available under /var/vcap/sys/log/.
Logging Formats
With logging configured for pre-provisioned RabbitMQ for PCF, three types of components generate logs: the RabbitMQ message server nodes, the service broker, and HAProxy. If you have multiple server or HAProxy nodes, you can identify logs from individual nodes by their index, which corresponds to the index of the RabbitMQ VM instances displayed in Ops Manager:
- The logs for RabbitMQ server nodes follow the format
[job=rabbitmq-server-partition-GUID index=X]
- The logs for HAProxy nodes follow the format
[job=rabbitmq-haproxy-partition-GUID index=X]
- The logs for the RabbitMQ service broker follow the format
[job=rabbitmq-broker-partition-GUID index=X]
RabbitMQ and HAProxy servers log at the info level and capture errors, warnings, and informational messages.
For users familiar with documentation for previous versions of the tile, the tag formerly called app_name is now called program_name.
The generic log format is as follows:
<PRI>TIMESTAMP IP_ADDRESS PROGRAM_NAME [job=NAME index=JOB_INDEX id=JOB_ID] MESSAGE
The raw logs look similar to the following:
<7>2017-06-28T16:06:10.733560+00:00 10.244.16.133 vcap.agent [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5] 2017/06/28 16:06:10 CEF:0|CloudFoundry|BOSH|1|agent_api|ssh|1|duser=director.be5a66bb-a9b4-459f-a0d3-1fc5c9c3ed79.be148cc6-91ef-4eed-a788-237b0b8c63b7 src=10.254.50.4 spt=4222 shost=5ae233e0-ecc5-4868-9ae0-f9767571251b
<86>2017-06-28T16:06:16.704572+00:00 10.244.16.133 useradd [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5] new group: name=bosh_ly0d2rbjr, GID=1003
<86>2017-06-28T16:06:16.704663+00:00 10.244.16.133 useradd [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5] new user: name=bosh_ly0d2rbjr, UID=1001, GID=1003, home=/var/vcap/bosh_ssh/bosh_ly0d2rbjr, shell=/bin/bash
<86>2017-06-28T16:06:16.736932+00:00 10.244.16.133 usermod [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5] add 'bosh_ly0d2rbjr' to group 'admin'
<86>2017-06-28T16:06:16.736964+00:00 10.244.16.133 usermod [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5] add 'bosh_ly0d2rbjr' to group 'vcap'
Logs sent to external logging tools such as Papertrail may be presented in a different format.
The following table describes the logging tags used in this template:
Tag | Description |
---|---|
PRI | A value that in future will be used to describe the severity of the log message and the facility it came from. |
TIMESTAMP | The time at which the log is forwarded, for example, 2016-08-24T05:14:15.000003Z. This is typically slightly after the time the log message was generated. |
IP_ADDRESS | The internal IP address of the server on which the log message originated. |
PROGRAM_NAME | Process name of the program that generated the message. Same as app_name before v1.9.0. For more information about program names, see RabbitMQ Program Names below. |
NAME | The BOSH instance group name (for example, rabbitmq_server ) |
JOB_INDEX | BOSH job index. Used to distinguish between multiple instances of the same job. |
JOB_ID | BOSH VM GUID. This is distinct from the CID displayed in the Ops Manager Status tab, which corresponds to the VM ID assigned by the infrastructure provider. |
MESSAGE | The log message that appears |
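The tags in the generic log format can be extracted with a regular expression. The following is a minimal sketch, not part of the product: the names `LOG_LINE` and `parse_log_line` are hypothetical, the `id=` tag is treated as optional, and PRI is decoded on the assumption that it follows the standard syslog encoding (facility × 8 + severity).

```python
import re

# Pattern for:
# <PRI>TIMESTAMP IP_ADDRESS PROGRAM_NAME [job=NAME index=JOB_INDEX id=JOB_ID] MESSAGE
LOG_LINE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>\S+) "
    r"(?P<ip_address>\S+) "
    r"(?P<program_name>\S+) "
    r"\[job=(?P<name>\S+) index=(?P<job_index>\d+)(?: id=(?P<job_id>[^\]\s]+))?\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the tags of one forwarded log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    tags = match.groupdict()
    # Assumption: standard syslog encoding, PRI = facility * 8 + severity.
    tags["facility"], tags["severity"] = divmod(int(tags["pri"]), 8)
    tags["job_index"] = int(tags["job_index"])
    return tags


sample = (
    "<86>2017-06-28T16:06:16.736932+00:00 10.244.16.133 usermod "
    "[job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5] "
    "add 'bosh_ly0d2rbjr' to group 'admin'"
)
parsed = parse_log_line(sample)
print(parsed["program_name"], parsed["job_index"], parsed["facility"], parsed["severity"])
```

Filtering the parsed entries on `name` and `job_index` is one way to isolate the logs of a single node when several server or HAProxy nodes forward to the same endpoint.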
RabbitMQ Program Names
Program Name | Description |
---|---|
rabbitmq_server_cluster_check | Checks that the RabbitMQ cluster is healthy. Runs after every deploy. |
rabbitmq_server_node_check | Checks that the RabbitMQ node is healthy. Runs after every deploy. |
rabbitmq_route_registrar_stderr | Registers the route for the management API with the Gorouter in your Pivotal Application Service deployment. |
rabbitmq_route_registrar_stdout | Registers the route for the management API with the Gorouter in your Pivotal Application Service deployment. |
rabbitmq_server | The Erlang VM and RabbitMQ apps. Logs may span multiple lines. |
rabbitmq_server_drain | Shuts down the Erlang VM and RabbitMQ apps. Runs as part of the BOSH lifecycle. |
rabbitmq_server_http_api_access | Access to the RabbitMQ Management UI. |
rabbitmq_server_init | Starts the Erlang VM and RabbitMQ. |
rabbitmq_server_post_deploy_stderr | Runs the node check and cluster check. Runs after every deploy. |
rabbitmq_server_post_deploy_stdout | Runs the node check and cluster check. Runs after every deploy. |
rabbitmq_server_pre_start | Runs before the rabbitmq-server job is started. |
rabbitmq_server_sasl | Supervisor, progress, and crash reporting for the Erlang VM and RabbitMQ apps. |
rabbitmq_server_shutdown_stderr | Stops the RabbitMQ app and Erlang VM. |
rabbitmq_server_shutdown_stdout | Stops the RabbitMQ app and Erlang VM. |
rabbitmq_server_startup_stderr | Starts the RabbitMQ app and Erlang VM, then configures users and permissions. |
rabbitmq_server_startup_stdout | Starts the RabbitMQ app and Erlang VM, then configures users and permissions. |
rabbitmq_server_upgrade | Shuts down Erlang VM and RabbitMQ app if required during an upgrade. |
Metrics
Metrics are regularly generated log messages that report measured component states. The metrics polling interval defaults to 30 seconds and is a configuration option on the RabbitMQ tile (Settings > RabbitMQ). Setting this interval to -1 disables metrics. The interval setting applies to all components deployed by the tile.
Metrics are long, single lines of text that follow the format:
origin:"p-rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" valueMetric: < name:"/p-rabbitmq/rabbitmq/system/memory" value:1024 unit:"MB">
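A metric line in this format is a flat sequence of `key:value` pairs, so it can be split with a small regular expression. The sketch below is illustrative only; `FIELD` and `parse_metric_line` are hypothetical names, and it assumes that values are either double-quoted strings or bare tokens, as in the example line.

```python
import re

# One key:value pair; the value is either "quoted" or a bare token.
FIELD = re.compile(r'(\w+):(?:"([^"]*)"|([^\s<>]+))')


def parse_metric_line(line):
    """Return the fields of one metric line as a dict."""
    fields = {}
    for key, quoted, bare in FIELD.findall(line):
        fields[key] = quoted if quoted else bare
    # Convert the numeric fields; everything else stays a string.
    fields["timestamp"] = int(fields["timestamp"])
    fields["value"] = float(fields["value"])
    return fields


sample = (
    'origin:"p-rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 '
    'deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" '
    'valueMetric: < name:"/p-rabbitmq/rabbitmq/system/memory" value:1024 unit:"MB">'
)
metric = parse_metric_line(sample)
print(metric["name"], metric["value"], metric["unit"])
```

The `valueMetric:` marker itself carries no value, so the pattern skips it and picks up the nested `name`, `value`, and `unit` fields directly.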
Partition Indicator
A new metric has been introduced to help identify network partitions. It reports how many nodes each node can reach, including itself. When a node is in a partition, the only node it recognizes is itself, so a value of 1 is a good indication that the node might be partitioned.
An example of this metric is:
origin:"p-rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" valueMetric: < name:"/p-rabbitmq/rabbitmq/erlang/reachable_nodes" value:3 unit:"count">
You can create monitors that emit alerts when a cluster seems to be in a partition. One metric is emitted for each node in the cluster. For example, in a three-node cluster a monitor can expect the values to total 9 (nine), because each node is expected to emit 3 (2 reachable nodes and itself). If the total differs, the monitor can send an alert to the team.
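The three-node arithmetic generalizes: in a healthy cluster of N nodes, every node reports N, so the values total N × N. A minimal sketch of such a monitor check follows; the function names and the shape of the input (a mapping from node index to the latest reachable_nodes value) are assumptions made for illustration.

```python
def cluster_looks_partitioned(reachable_by_node, cluster_size):
    """True if the reachable_nodes values do not total cluster_size squared."""
    expected_total = cluster_size * cluster_size
    return sum(reachable_by_node.values()) != expected_total


def suspect_nodes(reachable_by_node, cluster_size):
    """Indexes of nodes that report fewer reachable nodes than the cluster size."""
    return sorted(
        index for index, count in reachable_by_node.items() if count < cluster_size
    )


healthy = {0: 3, 1: 3, 2: 3}      # each node sees itself and 2 peers
partitioned = {0: 1, 1: 2, 2: 2}  # node 0 sees only itself

print(cluster_looks_partitioned(healthy, 3))      # False
print(cluster_looks_partitioned(partitioned, 3))  # True
print(suspect_nodes(partitioned, 3))              # [0, 1, 2]
```

A node that stops reporting altogether also lowers the total, so the same check flags a missing node as well as a partitioned one.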
Recovering from a network partition
See Clustering and Network Partitions in the RabbitMQ documentation to learn how to recover from a network partition.
Key Performance Indicators
Key Performance Indicators (KPIs) for RabbitMQ for PCF are metrics that operators find most useful for monitoring their RabbitMQ service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.
Pivotal provides the following KPIs as general alerting and response guidance for typical RabbitMQ for PCF installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.
For a list of all RabbitMQ for PCF raw component metrics, see Component Metrics Reference.
Component Heartbeats
Key RabbitMQ for PCF components periodically emit heartbeat metrics: the RabbitMQ server nodes, HAProxy nodes, and the Service Broker. The heartbeats are Boolean metrics, where 1 means the system is available, and 0 or the absence of a heartbeat metric means the service is not responding and should be investigated.
Service Broker Heartbeat
p-rabbitmq.service_broker.heartbeat | |
---|---|
Description | RabbitMQ Service Broker is-alive poll, which indicates whether the component is available and able to respond to requests. Use: If the Service Broker does not emit heartbeats, this indicates that it is offline. The Service Broker is required to create, update, and delete service instances, which are critical for dependent tiles such as Spring Cloud Services and Spring Cloud Data Flow. Origin: Doppler/Firehose Type: boolean Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: < 1 |
Recommended response | Check the RabbitMQ Service Broker logs for errors. You can find this VM by targeting your RabbitMQ deployment with BOSH and running the following command: bosh -d service-instance_GUID vms |
HAProxy Heartbeat
p-rabbitmq.haproxy.heartbeat | |
---|---|
Description | RabbitMQ HAProxy is-alive poll, which indicates whether the component is available and able to respond to requests. Use: If the HAProxy does not emit heartbeats, this indicates that it is offline. To be functional, service instances require HAProxy. Origin: Doppler/Firehose Type: boolean Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: < 1 |
Recommended response | Check the RabbitMQ HAProxy logs for errors. You can find the VM by targeting your RabbitMQ deployment with BOSH and running the following command, which lists HAProxy_GUID: bosh -d service-instance_GUID vms |
Server Heartbeat
p-rabbitmq.rabbitmq.heartbeat | |
---|---|
Description | RabbitMQ Server is-alive poll, which indicates whether the component is available and able to respond to requests. Use: If the server does not emit heartbeats, this indicates that it is offline. To be functional, service instances require RabbitMQ Server. Origin: Doppler/Firehose Type: boolean Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: < 1 |
Recommended response | Check the RabbitMQ Server logs for errors. You can find the VM by targeting your RabbitMQ deployment with BOSH and running the following command, which lists rabbitmq: bosh -d service-instance_GUID vms |
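All three heartbeat KPIs share the same rule: average the value over the last 5 minutes and treat anything below 1 as critical, with a missing heartbeat counting the same as a 0. One way to sketch this is shown below. The function name, the `(unix_seconds, value)` sample shape, and the `tolerance` parameter (which allows one sample to fall outside the window because of timing jitter) are assumptions, not part of the product.

```python
def heartbeat_status(samples, now, window_s=300, interval_s=30, tolerance=1):
    """Classify a heartbeat from (unix_seconds, value) samples.

    Samples that never arrived are counted as 0, so a silent component
    is reported as critical rather than ignored.
    """
    recent = [value for ts, value in samples if now - window_s < ts <= now]
    # Number of samples the window should hold, minus the allowed jitter.
    expected = max(1, window_s // interval_s - tolerance)
    missing = max(0, expected - len(recent))
    average = sum(recent) / (len(recent) + missing)
    return "critical" if average < 1 else "ok"


now = 1000
healthy = [(now - 30 * i, 1) for i in range(10)]  # one heartbeat every 30 s
flapping = healthy[:-1] + [(now - 270, 0)]        # one failed poll in the window
silent = []                                       # nothing received at all

print(heartbeat_status(healthy, now))   # ok
print(heartbeat_status(flapping, now))  # critical
print(heartbeat_status(silent, now))    # critical
```

If the polling interval on the tile is changed from the 30 s default, `interval_s` needs to change with it, otherwise the expected sample count is wrong.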
RabbitMQ Server KPIs
The following KPIs come from the RabbitMQ server component:
File Descriptors
p-rabbitmq.rabbitmq.system.file_descriptors | |
---|---|
Description | File descriptors consumed. Use: If the number of file descriptors consumed becomes too large, the VM might lose the ability to perform disk I/O, which can cause data loss. Note: This assumes non-persistent messages are handled by retries or some other logic by the producers. Origin: Doppler/Firehose Type: count Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 10 minutes |
Recommended alert thresholds | Yellow warning: > 250000 Red critical: > 280000 |
Recommended response | The default ulimit for RabbitMQ for PCF is 300000. If this metric is met or exceeded for an extended period of time, consider one of the following actions:
|
Erlang Processes
p-rabbitmq.rabbitmq.erlang.erlang_processes | |
---|---|
Description | Erlang processes consumed by RabbitMQ, which runs on an Erlang VM. Use: This is the key indicator of the processing capability of a node. Origin: Doppler/Firehose Type: count Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 10 minutes |
Recommended alert thresholds | Yellow warning: > 900000 Red critical: > 950000 |
Recommended response | The default Erlang process limit in RabbitMQ for PCF v1.6 and later is 1,048,816. If this metric meets or exceeds the recommended thresholds for extended periods of time, consider scaling the RabbitMQ nodes in the tile Resource Config pane. |
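The recommended thresholds for the two server KPIs can be captured in a small lookup table. This is a sketch under stated assumptions: the `SERVER_KPI_THRESHOLDS` table and the `classify` function are hypothetical names, and the value passed in is expected to already be the 10-minute average.

```python
# (yellow warning, red critical) thresholds from the KPI tables above.
SERVER_KPI_THRESHOLDS = {
    "p-rabbitmq.rabbitmq.system.file_descriptors": (250000, 280000),
    "p-rabbitmq.rabbitmq.erlang.erlang_processes": (900000, 950000),
}


def classify(metric_name, averaged_value):
    """Return 'green', 'yellow', or 'red' for an averaged KPI value."""
    yellow, red = SERVER_KPI_THRESHOLDS[metric_name]
    if averaged_value > red:
        return "red"
    if averaged_value > yellow:
        return "yellow"
    return "green"


print(classify("p-rabbitmq.rabbitmq.system.file_descriptors", 120000))  # green
print(classify("p-rabbitmq.rabbitmq.system.file_descriptors", 260000))  # yellow
print(classify("p-rabbitmq.rabbitmq.erlang.erlang_processes", 960000))  # red
```

Keeping the thresholds in data rather than in code makes it easier to tune them per installation, as recommended earlier in this topic.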
BOSH System Health Metrics
The BOSH layer that underlies Pivotal Cloud Foundry generates healthmonitor metrics for all VMs in the deployment.
As of RabbitMQ for Pivotal Cloud Foundry v2.0, these metrics are included in the Loggregator Firehose by default.
For more information, see BOSH System Metrics Available in Loggregator Firehose in Pivotal Application Service Release Notes.
All BOSH-deployed components generate the system health metrics below. These component metrics are from RabbitMQ for PCF components, and serve as KPIs for the RabbitMQ for PCF service.
RAM
system.mem.percent | |
---|---|
Description | RAM being consumed by the p-rabbitmq VM. Use: RabbitMQ is considered to be in a good state when it has few or no messages. In other words, “an empty rabbit is a happy rabbit.” An alert on this metric can indicate that there are too few consumers or apps reading messages from the queue. Healthmonitor reports when RabbitMQ has used more than 40% of its RAM over the past ten minutes. Origin: BOSH HM Type: percent Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 10 minutes |
Recommended alert thresholds | Yellow warning: > 40 Red critical: > 50 |
Recommended response | Add more consumers to drain the queue as fast as possible. |
CPU
system.cpu.user | |
---|---|
Description | CPU being consumed by user processes on the p-rabbitmq VM. Use: A node that experiences context switching or high CPU usage becomes unresponsive. This also affects the ability of the node to report metrics. Healthmonitor reports when RabbitMQ has used more than 40% of its CPU over the past ten minutes. Origin: BOSH HM Type: percent Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 10 minutes |
Recommended alert thresholds | Yellow warning: > 60 Red critical: > 75 |
Recommended response | Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible. |
Ephemeral Disk
system.disk.ephemeral.percent | |
---|---|
Description | Ephemeral disk being consumed by the p-rabbitmq VM. Use: If the system disk fills up, there are too few consumers. Healthmonitor reports when RabbitMQ has used more than 40% of its ephemeral disk over the past ten minutes. Origin: BOSH HM Type: percent Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 10 minutes |
Recommended alert thresholds | Yellow warning: > 60 Red critical: > 75 |
Recommended response | Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible. |
Persistent Disk
system.disk.persistent.percent | |
---|---|
Description | Persistent disk being consumed by the p-rabbitmq VM. Use: If the system disk fills up, there are too few consumers. Healthmonitor reports when RabbitMQ has used more than 40% of its persistent disk over the past ten minutes. Origin: BOSH HM Type: percent Frequency: 30 s (default), 10 s (configurable minimum) |
Recommended measurement | Average over last 10 minutes |
Recommended alert thresholds | Yellow warning: > 60 Red critical: > 75 |
Recommended response | Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible. |
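Each BOSH system health KPI is measured as an average over the last 10 minutes. A sliding window is one way to maintain that average as samples arrive. The sketch below is illustrative: the `RollingAverage` class is a hypothetical helper, and timestamps are assumed to be in seconds.

```python
from collections import deque


class RollingAverage:
    """Average of the samples received within the last window_s seconds."""

    def __init__(self, window_s=600):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, value), oldest first

    def add(self, timestamp_s, value):
        self.samples.append((timestamp_s, value))
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= timestamp_s - self.window_s:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)


ram = RollingAverage(window_s=600)
for timestamp_s, percent in [(0, 30.0), (300, 50.0), (600, 70.0)]:
    ram.add(timestamp_s, percent)

# The sample at t=0 has aged out, leaving 50.0 and 70.0.
print(ram.average())  # 60.0
```

The resulting average is what gets compared with the yellow and red thresholds in the tables above, for example 40 and 50 for system.mem.percent.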
Component Metrics Reference
RabbitMQ for PCF component VMs emit the following raw metrics. The full name of the metric follows the format: /p-rabbitmq/COMPONENT/METRIC-NAME
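The KPI sections of this topic write metric names with dots (for example, p-rabbitmq.rabbitmq.system.file_descriptors), while the raw names use slashes. The following sketch splits a raw name into its component and metric parts and converts it to the dotted form. Both function names are hypothetical, and the dotted conversion assumes that the two notations differ only in the separator.

```python
def split_metric_name(full_name):
    """Split '/p-rabbitmq/COMPONENT/METRIC-NAME' into (component, metric_name)."""
    parts = full_name.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "p-rabbitmq":
        raise ValueError(f"not a /p-rabbitmq/COMPONENT/METRIC-NAME name: {full_name!r}")
    return parts[1], "/".join(parts[2:])


def to_dotted(full_name):
    """Convert a slash-separated raw name to the dotted form used for KPIs."""
    return full_name.strip("/").replace("/", ".")


print(split_metric_name("/p-rabbitmq/rabbitmq/system/file_descriptors"))
print(to_dotted("/p-rabbitmq/rabbitmq/system/file_descriptors"))
print(split_metric_name("/p-rabbitmq/haproxy/backend/qsize/amqp"))
```

Names that do not follow the slash-separated format are rejected with a `ValueError` rather than guessed at.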
RabbitMQ Server Metrics
RabbitMQ for PCF message server components emit the following metrics.
Full Name | Unit | Description |
---|---|---|
/p-rabbitmq.rabbitmq.heartbeat | boolean | Indicates whether the RabbitMQ server is available and able to respond to requests |
/p-rabbitmq/rabbitmq/erlang/erlang_processes | count | The number of Erlang processes |
/p-rabbitmq/rabbitmq/system/memory | MB | The memory in MB used by the node |
/p-rabbitmq/rabbitmq/system/mem_alarm | boolean | Indicates if the memory alarm went off |
/p-rabbitmq/rabbitmq/system/disk_free_alarm | boolean | Indicates if the disk free alarm went off |
/p-rabbitmq/rabbitmq/system/disk_free | MB | The disk space available on the node |
/p-rabbitmq/rabbitmq/connections/count | count | The total number of connections to the node |
/p-rabbitmq/rabbitmq/consumers/count | count | The total number of consumers registered in the node |
/p-rabbitmq/rabbitmq/messages/delivered | count | The total number of messages with the status deliver_get on the node |
/p-rabbitmq/rabbitmq/messages/delivered_noack | count | The number of messages with the status deliver_noack on the node |
/p-rabbitmq/rabbitmq/messages/delivered_rate | rate | The rate per second at which messages are being delivered to consumers or clients on the node |
/p-rabbitmq/rabbitmq/messages/published | count | The total number of messages with the status publish on the node |
/p-rabbitmq/rabbitmq/messages/published_rate | rate | The rate per second at which messages are being published by the node |
/p-rabbitmq/rabbitmq/messages/redelivered | count | The total number of messages with the status redeliver on the node |
/p-rabbitmq/rabbitmq/messages/redelivered_rate | rate | The rate per second at which messages are getting the status redeliver on the node |
/p-rabbitmq/rabbitmq/messages/get_no_ack | count | The number of messages with the status get_no_ack on the node |
/p-rabbitmq/rabbitmq/messages/get_no_ack_rate | rate | The rate per second at which messages get the status get_no_ack on the node |
/p-rabbitmq/rabbitmq/messages/pending | count | The number of messages with the status messages_unacknowledged on the node |
/p-rabbitmq/rabbitmq/messages/depth | count | The number of messages with the status messages_unacknowledged or messages_ready on the node |
/p-rabbitmq/rabbitmq/system/file_descriptors | count | The number of open file descriptors on the node |
/p-rabbitmq/rabbitmq/exchanges/count | count | The total number of exchanges on the node |
/p-rabbitmq/rabbitmq/messages/available | count | The total number of messages with the status messages_ready on the node |
/p-rabbitmq/rabbitmq/queues/count | count | The number of queues on the node |
/p-rabbitmq/rabbitmq/channels/count | count | The number of channels on the node |
/p-rabbitmq/rabbitmq/vhosts/count | count | The number of vhosts |
/p-rabbitmq/rabbitmq/queues/VHOST-NAME/QUEUE-NAME/consumers | count | The number of consumers per virtual host per queue |
/p-rabbitmq/rabbitmq/queues/VHOST-NAME/QUEUE-NAME/depth | count | The number of messages with the status messages_unacknowledged or messages_ready per virtual host per queue |
HAProxy Metrics
RabbitMQ for PCF HAProxy components emit the following metrics.
Full Name | Unit | Description |
---|---|---|
/p-rabbitmq.haproxy.heartbeat | boolean | Indicates whether the RabbitMQ HAProxy component is available and able to respond to requests |
/p-rabbitmq/haproxy/health/connections | count | The total number of concurrent front-end connections to the server |
/p-rabbitmq/haproxy/backend/qsize/amqp | size | The total size of the AMQP queue on the server |
/p-rabbitmq/haproxy/backend/retries/amqp | count | The number of AMQP retries to the server |
/p-rabbitmq/haproxy/backend/ctime/amqp | time | The total time to establish the TCP AMQP connection to the server |