Monitoring and KPIs for On-Demand VMware Tanzu RabbitMQ for VMs

Note: Pivotal Platform is now part of VMware Tanzu. In v1.20 and later, VMware Tanzu RabbitMQ [VMs] is named VMware Tanzu RabbitMQ for VMs.

This topic explains how to monitor the health of the on-demand version of the VMware Tanzu RabbitMQ for VMs service using the logs, metrics, and Key Performance Indicators (KPIs) generated by Tanzu RabbitMQ component VMs.

On-Demand Tanzu RabbitMQ components generate many of the same metrics as the pre-provisioned Tanzu RabbitMQ service components. For information about metrics for the pre-provisioned service, see Monitoring and KPIs for Pre‑Provisioned VMware Tanzu RabbitMQ for VMs.

Note: On-Demand service metrics are prefixed with p.rabbitmq to distinguish them from the pre-provisioned service metrics.

See Overview of Logging and Metrics for general information about logging and metrics in VMware Tanzu Application Service for VMs.

Configure Syslog Forwarding

Syslog forwarding is preconfigured and enabled by default. VMware recommends that you keep the default setting because it is good operational practice. However, you can opt out by selecting No for Do you want to configure syslog? in the Ops Manager Settings tab.

To enable monitoring for Tanzu RabbitMQ, operators designate an external syslog endpoint for Tanzu RabbitMQ component log entries. The endpoint serves as the input to a monitoring platform such as Datadog, Papertrail, or SumoLogic.

To specify the destination for Tanzu RabbitMQ log entries:

  1. From the Ops Manager Installation Dashboard, click the Tanzu RabbitMQ tile.
  2. In the Tanzu RabbitMQ tile, click the Settings tab.
  3. Click Syslog.
     [Screenshot: the Syslog pane of the RabbitMQ tile settings, with the radio button group 'Do you want to configure Syslog forwarding?', the fields Address, Port, Transport Protocol, Enable TLS, Permitted Peer, SSL Certificate, Queue Size, Forward Debug Logs, and Custom rsyslog Configuration, and a Save Syslog Settings button.]
  4. Configure the fields on the Syslog pane as follows:

    Field Description
    Syslog Address Enter the IP or DNS address of the syslog server
    Syslog Port Enter the port of the syslog server
    Transport Protocol Select the transport protocol of the syslog server. The options are TCP, UDP, or RELP.
    Enable TLS Enable TLS to the syslog server.
    Permitted Peer If there are several peer servers that can respond to remote syslog connections, enter a wildcard in the domain, such as *.example.com.
    SSL Certificate If the server certificate is not signed by a known authority, such as an internal syslog server, enter the CA certificate of the log management service endpoint.
    Queue Size The number of log entries the buffer holds before dropping messages. A larger buffer size might overload the system. The default is 100000.
    Forward Debug Logs Some components produce very long debug logs. This option prevents them from being forwarded. These logs are still written to local disk.
    Custom Rules The custom rsyslog rules are written in RainerScript and are inserted before the rule that forwards logs. For the list of custom rules you can add in this field, see Add RabbitMQ Syslog Custom Rules below. For more information about the program names you can use in the custom rules, see RabbitMQ Program Names below.

  5. Click Save.

  6. Return to the Ops Manager Installation Dashboard.

  7. Click Review Pending Changes. For more information about this Ops Manager page, see Reviewing Pending Product Changes.

  8. Click Apply Changes to redeploy with the changes.

Logging Format

With on-demand Tanzu RabbitMQ logging configured, two types of components generate logs: the server nodes and the service broker.

  • The logs for RabbitMQ server nodes follow the format [job=rabbitmq-server-partition-GUID index=0]
  • The logs for the RabbitMQ service broker follow the format [job=rabbitmq-broker-partition-GUID index=0]

The RabbitMQ VMs log at the info level and capture errors, warnings, and informational messages.

For users familiar with documentation for previous versions of the tile, the tag formerly called the app_name is now called the program_name.

The generic log format is as follows:

<PRI>TIMESTAMP IP_ADDRESS PROGRAM_NAME [job=NAME index=JOB_INDEX id=JOB_ID] MESSAGE

The raw logs look similar to the following:

<7>2017-06-28T16:06:10.733560+00:00 10.244.16.133 vcap.agent [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  2017/06/28 16:06:10  CEF:0|CloudFoundry|BOSH|1|agent_api|ssh|1|duser=director.be5a66bb-a9b4-459f-a0d3-1fc5c9c3ed79.be148cc6-91ef-4eed-a788-237b0b8c63b7 src=10.254.50.4 spt=4222 shost=5ae233e0-ecc5-4868-9ae0-f9767571251b
<86>2017-06-28T16:06:16.704572+00:00 10.244.16.133 useradd [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  new group: name=bosh_ly0d2rbjr, GID=1003
<86>2017-06-28T16:06:16.704663+00:00 10.244.16.133 useradd [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  new user: name=bosh_ly0d2rbjr, UID=1001, GID=1003, home=/var/vcap/bosh_ssh/bosh_ly0d2rbjr, shell=/bin/bash
<86>2017-06-28T16:06:16.736932+00:00 10.244.16.133 usermod [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  add 'bosh_ly0d2rbjr' to group 'admin'
<86>2017-06-28T16:06:16.736964+00:00 10.244.16.133 usermod [job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  add 'bosh_ly0d2rbjr' to group 'vcap'

Logs sent to external logging tools such as Papertrail might be presented in a different format.

The following table describes the logging tags used in this template:

Tag Description
PRI A value that in future will be used to describe the severity of the log entry and the facility it came from.
TIMESTAMP This is the timestamp of when the log is forwarded, for example, 2016-08-24T05:14:15.000003Z. The timestamp value is typically slightly after when the log entry was generated.
IP_ADDRESS The internal IP address of the server on which the log entry originated
PROGRAM_NAME Process name of the program that generated the message. Same as app_name before v1.9.0. For more information about program names, see RabbitMQ Program Names below.
NAME The BOSH instance group name (for example, rabbitmq_server)
JOB_INDEX BOSH job index. Used to distinguish between multiple instances of the same job.
JOB_ID BOSH VM GUID. This is distinct from the CID displayed in the Ops Manager Status tab, which corresponds to the VM ID assigned by the infrastructure provider.
MESSAGE The log entry that appears
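
The generic log format above can be split into its tags with a regular expression. The following is a minimal sketch, not part of the tile: the function name `parse_log_line` is hypothetical, and the facility and severity values assume that PRI follows the standard syslog encoding (facility * 8 + severity).

```python
import re

# <PRI>TIMESTAMP IP_ADDRESS PROGRAM_NAME [job=NAME index=JOB_INDEX id=JOB_ID] MESSAGE
# The id tag is treated as optional because some log lines carry only job and index.
LOG_LINE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>\S+) "
    r"(?P<ip_address>\S+) "
    r"(?P<program_name>\S+) "
    r"\[job=(?P<name>\S+) index=(?P<job_index>\d+)(?: id=(?P<job_id>[^\]\s]+))?\]"
    r"\s*(?P<message>.*)$"
)


def parse_log_line(line):
    """Split one forwarded log line into its tags, or return None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pri"] = int(fields["pri"])
    # Assumption: standard syslog encoding, PRI = facility * 8 + severity.
    fields["facility"], fields["severity"] = divmod(fields["pri"], 8)
    fields["job_index"] = int(fields["job_index"])
    return fields


sample = (
    "<86>2017-06-28T16:06:16.704572+00:00 10.244.16.133 useradd "
    "[job=rmq index=0 id=e37ecdca-5b10-4141-abd8-e1d777dfd8b5]  "
    "new group: name=bosh_ly0d2rbjr, GID=1003"
)
parsed = parse_log_line(sample)
print(parsed["program_name"], parsed["name"], parsed["message"])
```

Logs that pass through an external logging tool might arrive in a different format, in which case the pattern needs to be adjusted.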

RabbitMQ Program Names

For new service instances created using Tanzu RabbitMQ v1.20 and later, the default program name is rabbitmq-server. Existing service instances, including instances upgraded from Tanzu RabbitMQ v1.19 and earlier, have the program names listed in the table below.

If you want new service instances to keep the program names that are in Tanzu RabbitMQ v1.19 and earlier, you must manually add custom rules. For the custom rules to add, see Add RabbitMQ Syslog Custom Rules below.

The following table lists the program names you can make available for use in the logs:

Program Name Description
rabbitmq_server_cluster_check Checks that the RabbitMQ cluster is healthy. Runs after every deploy.
rabbitmq_server_node_check Checks that the RabbitMQ node is healthy. Runs after every deploy.
rabbitmq_route_registrar_stderr Registers the route for the management API with the Gorouter in your VMware Tanzu Application Service for VMs deployment.
rabbitmq_route_registrar_stdout Registers the route for the management API with the Gorouter in your VMware Tanzu Application Service for VMs deployment.
rabbitmq_server The Erlang VM and RabbitMQ apps. Logs can span multiple lines.
rabbitmq_server_drain Shuts down the Erlang VM and RabbitMQ apps. Runs as part of the BOSH lifecycle.
rabbitmq_server_http_api_access Access to the RabbitMQ Management UI.
rabbitmq_server_init Starts the Erlang VM and RabbitMQ.
rabbitmq_server_post_deploy_stderr Runs the node check and cluster check. Runs after every deploy.
rabbitmq_server_post_deploy_stdout Runs the node check and cluster check. Runs after every deploy.
rabbitmq_server_pre_start Runs before the rabbitmq-server job is started.
rabbitmq_server_sasl Supervisor, progress, and crash reporting for the Erlang VM and RabbitMQ apps.
rabbitmq_server_shutdown_stderr Stops the RabbitMQ app and Erlang VM.
rabbitmq_server_shutdown_stdout Stops the RabbitMQ app and Erlang VM.
rabbitmq_server_startup_stderr Starts the RabbitMQ app and Erlang VM, then configures users and permissions.
rabbitmq_server_startup_stdout Starts the RabbitMQ app and Erlang VM, then configures users and permissions.
rabbitmq_server_upgrade Shuts down Erlang VM and RabbitMQ app if required during an upgrade.

Add RabbitMQ Syslog Custom Rules

Tanzu RabbitMQ syslog configuration is now managed by Ops Manager. For new service instances, to continue to filter logs in the same way as Tanzu RabbitMQ v1.19 and earlier, you must manually add custom rules. This retains the program names.

The custom rsyslog rules are written in RainerScript. For more information, see the RainerScript documentation.

To add custom rules that keep the same program names as in Tanzu RabbitMQ v1.19 and earlier:

Note: You can add a subset of the rules below depending on how you want to filter the logs.

  1. Add the following rules to the Custom Syslog Configuration field in the Syslog pane:

    module(load="imfile")
    input(type="imfile"
    File="/var/vcap/sys/log/broker/post-start.stdout.log"
    Tag="rabbitmq_on_demand_broker_post_start_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/broker/post-start.stderr.log"
    Tag="rabbitmq_on_demand_broker_post_start_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/broker/broker.log"
    Tag="rabbitmq_on_demand_broker")
    
    input(type="imfile"
    File="/var/vcap/sys/log/broker/broker-ctl.log"
    Tag="rabbitmq_on_demand_broker_ctl")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-service-broker/broker_stdout.log"
    Tag="rabbitmq_broker_startup_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-service-broker/broker_stderr.log"
    Tag="rabbitmq_broker_startup_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/route_registrar/route_registrar.log"
    Tag="rabbitmq_broker_route_registrar_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/route_registrar/route_registrar.err.log"
    Tag="rabbitmq_broker_route_registrar_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-haproxy/haproxy.log"
    Tag="rabbitmq_haproxy")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-haproxy/pre-start.stderr.log"
    Tag="rabbitmq_haproxy_pre_start_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-haproxy/pre-start.stdout.log"
    Tag="rabbitmq_haproxy_pre_start_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-haproxy/startup_stderr.log"
    Tag="rabbitmq_haproxy_pre_startup_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-haproxy/startup_stdout.log"
    Tag="rabbitmq_haproxy_pre_startup_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/rabbit@*-sasl.log"
    Tag="rabbitmq_server_sasl")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/rabbit@*.log"
    Tag="rabbitmq_server")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/startup_stderr.log"
    Tag="rabbitmq_server_startup_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/startup_stdout.log"
    Tag="rabbitmq_server_startup_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/shutdown_stdout.log"
    Tag="rabbitmq_server_shutdown_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/shutdown_stderr.log"
    Tag="rabbitmq_server_shutdown_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/management-ui/access.log*"
    Tag="rabbitmq_server_http_api_access")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/upgrade.log"
    Tag="rabbitmq_server_upgrade")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/init.log"
    Tag="rabbitmq_server_init")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/node-check.log"
    Tag="rabbitmq_server_node_check")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/cluster-check.log"
    Tag="rabbitmq_server_cluster_check")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/post-deploy.stderr.log"
    Tag="rabbitmq_server_post_deploy_stderr")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/post-deploy.stdout.log"
    Tag="rabbitmq_server_post_deploy_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/drain.log"
    Tag="rabbitmq_server_drain")
    
    input(type="imfile"
    File="/var/vcap/sys/log/rabbitmq-server/pre-start.log"
    Tag="rabbitmq_server_pre_start")
    
    input(type="imfile"
    File="/var/vcap/sys/log/route_registrar/route_registrar.log"
    Tag="rabbitmq_route_registrar_stdout")
    
    input(type="imfile"
    File="/var/vcap/sys/log/route_registrar/route_registrar.err.log"
    Tag="rabbitmq_route_registrar_stderr")
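
Because you can add only a subset of these rules, it can be convenient to generate the text from a short list of tags and log files. The following Python sketch illustrates that idea under stated assumptions: the `RULES` selection and the `render_rules` helper are hypothetical, and the paths and tags are copied from the rules above.

```python
# Hypothetical selection: tag -> log file, copied from the rules listed above.
RULES = {
    "rabbitmq_server": "/var/vcap/sys/log/rabbitmq-server/rabbit@*.log",
    "rabbitmq_server_sasl": "/var/vcap/sys/log/rabbitmq-server/rabbit@*-sasl.log",
    "rabbitmq_server_init": "/var/vcap/sys/log/rabbitmq-server/init.log",
}


def render_rules(rules):
    """Return RainerScript text: one module load, then one imfile input per tag."""
    lines = ['module(load="imfile")']
    for tag, path in rules.items():
        lines.append('input(type="imfile"')
        lines.append(f'File="{path}"')
        lines.append(f'Tag="{tag}")')
        lines.append("")  # blank line between inputs, as in the rules above
    return "\n".join(lines).rstrip() + "\n"


config_text = render_rules(RULES)
print(config_text)
```

The printed text can then be pasted into the custom rules field in the Syslog pane.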
    

Metrics

Metrics are regularly-generated log entries that report measured component states. Metrics are long, single lines of text that follow the format:

origin:"p.rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" valueMetric: < name:"/p.rabbitmq/rabbitmq/system/memory" value:1024 unit:"MB">
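
A monitoring script can break such a line into its fields before evaluating it. The sketch below is illustrative only: the `parse_metric` function is hypothetical, and it assumes that every field has the `key:value` shape shown in the example line.

```python
import re

# Matches key:"quoted value" or key:bare_value pairs.
KEY_VALUE = re.compile(r'(\w+):("[^"]*"|[^\s<>]+)')


def parse_metric(line):
    """Return the key:value pairs of one metric line as a dict."""
    fields = {key: raw.strip('"') for key, raw in KEY_VALUE.findall(line)}
    fields["timestamp"] = int(fields["timestamp"])
    fields["value"] = float(fields["value"])
    return fields


line = (
    'origin:"p.rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 '
    'deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" '
    'valueMetric: < name:"/p.rabbitmq/rabbitmq/system/memory" value:1024 unit:"MB">'
)
metric = parse_metric(line)
print(metric["name"], metric["value"], metric["unit"])
```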

Configure the Metrics Polling Interval

To configure the metrics polling interval:

  1. From the Ops Manager Installation Dashboard, click the Tanzu RabbitMQ tile.
  2. In the Tanzu RabbitMQ tile, click the Settings tab.
  3. Click Metrics.
     [Screenshot: the pane 'Metrics settings for both Pre-Provisioned and On-Demand service offerings', with the required field 'Metrics polling interval' set to 30, help text stating that the interval is in seconds and that -1 disables metrics, and a Save button.]

  4. Configure the fields on the Metrics pane as follows:

    Field Description
    Metrics polling interval The default setting is 30 seconds for all deployed components. VMware recommends that you do not change this interval. To avoid overwhelming components, do not set this below 10 seconds. Set this to -1 to disable metrics. Changing this setting affects all deployed instances.

  5. Click Save.

  6. Return to the Ops Manager Installation Dashboard.

  7. Click Review Pending Changes. For more information about this Ops Manager page, see Reviewing Pending Product Changes.

  8. Click Apply Changes to redeploy with the changes.
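
The guidance for the polling interval can be expressed as a small check, for example in a script that reviews a planned configuration before it is applied. This is a hypothetical sketch: the function and constant names are not part of the tile, and the values are the ones stated in the table above.

```python
DEFAULT_INTERVAL = 30  # seconds, the default for all deployed components
MINIMUM_INTERVAL = 10  # seconds, lower values risk overwhelming components
DISABLED = -1          # special value that disables metrics


def check_polling_interval(seconds):
    """Classify a proposed interval as 'disabled', 'too_low', 'default', or 'ok'."""
    if seconds == DISABLED:
        return "disabled"
    if seconds < MINIMUM_INTERVAL:
        return "too_low"
    if seconds == DEFAULT_INTERVAL:
        return "default"
    return "ok"


for candidate in (-1, 5, 30, 60):
    print(candidate, check_polling_interval(candidate))
```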

Key Performance Indicators

The following sections describe the metrics used as Key Performance Indicators and other useful metrics for monitoring the Tanzu RabbitMQ on-demand service.

Key Performance Indicators (KPIs) for Tanzu RabbitMQ are metrics that operators find most useful for monitoring their Tanzu RabbitMQ service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

VMware provides the following KPIs as general alerting and response guidance for typical Tanzu RabbitMQ installations. VMware recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. VMware also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

For a list of all Tanzu RabbitMQ raw component metrics, see Component Metrics Reference below.

Component Heartbeats

Key Tanzu RabbitMQ components periodically emit heartbeat metrics: the RabbitMQ server nodes, HAProxy nodes, and the Service Broker. The heartbeats are Boolean metrics, where 1 means the system is available, and 0 or the absence of a heartbeat metric means the service is not responding and should be investigated.

Service Broker Heartbeat


p.rabbitmq/service_broker/heartbeat

Description: Poll that checks whether the RabbitMQ Service Broker is alive, which indicates if the component is available and able to respond to requests.

Use: If the Service Broker does not emit heartbeats, this indicates that it is offline. The Service Broker is required to create, update, and delete service instances, which are critical for dependent tiles such as Spring Cloud Services and Spring Cloud Data Flow.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
Yellow warning: N/A
Red critical: < 1
Recommended response: Search the RabbitMQ Service Broker logs for errors. You can find this VM by targeting your Tanzu RabbitMQ deployment with BOSH and running the following command: bosh -d service-instance_GUID vms

Server Heartbeat


p.rabbitmq/rabbitmq/heartbeat

Description: Poll that checks whether the RabbitMQ Server is alive, which indicates if the component is available and able to respond to requests.

Use: If the server does not emit heartbeats, this indicates that it is offline. To be functional, service instances require RabbitMQ Server.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
Yellow warning: N/A
Red critical: < 1
Recommended response: Search the RabbitMQ Server logs for errors. You can find the VM by targeting your Tanzu RabbitMQ deployment with BOSH and listing its VMs by running: bosh -d service-instance_GUID vms
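
Because a missing heartbeat is as significant as a heartbeat of 0, an alert rule has to account for polls that never arrived. The following sketch shows one way to compute the recommended five-minute average. It is an illustration, not the behavior of any particular monitoring product: the function names and the sample layout are assumptions, and the default 30 s polling interval is assumed.

```python
def heartbeat_average(samples, now, window_s=300, interval_s=30):
    """Average heartbeat over the window, counting missing polls as 0.

    samples: list of (unix_seconds, value) pairs, where value is 1 or 0.
    """
    expected = window_s // interval_s
    recent = [value for ts, value in samples if now - window_s < ts <= now]
    # Divide by the expected number of polls, not the number received,
    # so that heartbeats that never arrived pull the average below 1.
    return min(sum(recent) / expected, 1.0)


def heartbeat_status(samples, now):
    """Apply the red critical threshold (< 1) to the averaged heartbeat."""
    return "critical" if heartbeat_average(samples, now) < 1 else "ok"


healthy = [(ts, 1) for ts in range(30, 301, 30)]  # ten polls, all answered
degraded = healthy[:8]                            # the last two polls are missing
print(heartbeat_status(healthy, now=300), heartbeat_status(degraded, now=300))
```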

RabbitMQ Server KPIs

The following KPIs come from the RabbitMQ server component:

File Descriptors


p.rabbitmq/rabbitmq/system/file_descriptors

Description: File descriptors consumed.

Use: If the number of file descriptors consumed becomes too large, the VM might lose the ability to perform disk I/O, which can cause data loss.

Note: This assumes non-persistent messages are handled by retries or some other logic by the producers.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
Yellow warning: > 250000
Red critical: > 280000
Recommended response: The default ulimit for Tanzu RabbitMQ is 300000. If this metric meets or exceeds the recommended thresholds for an extended period of time, consider reducing the load on the server.

Erlang Processes


p.rabbitmq/rabbitmq/erlang/erlang_processes

Description: Erlang processes consumed by RabbitMQ, which runs on an Erlang VM.

Use: This is the key indicator of the processing capability of a node.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
Yellow warning: > 900000
Red critical: > 950000
Recommended response: The default Erlang process limit in Tanzu RabbitMQ v1.6 and later is 1,048,816. If this metric meets or exceeds the recommended thresholds for extended periods of time, consider scaling the RabbitMQ nodes in the tile Resource Config pane.
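
The recommended measurement for the server KPIs is an average over the last 10 minutes rather than a single sample. The sketch below shows one way to keep such a rolling average and compare it with the thresholds above. The class and function names are hypothetical, and the sample values are invented for illustration.

```python
from collections import deque


class RollingAverage:
    """Average of the samples seen in the last window_s seconds."""

    def __init__(self, window_s=600):
        self.window_s = window_s
        self.samples = deque()  # (unix_seconds, value), oldest first

    def add(self, ts, value):
        self.samples.append((ts, value))
        # Discard samples that have fallen out of the window.
        while self.samples[0][0] <= ts - self.window_s:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)


def status(average, yellow, red):
    """Compare an averaged KPI with its yellow and red thresholds."""
    if average > red:
        return "red"
    if average > yellow:
        return "yellow"
    return "ok"


# Invented file descriptor samples at the default 30 s polling interval.
fds = RollingAverage()
for i in range(40):
    fds.add(i * 30, 240000 if i < 20 else 270000)
fd_status = status(fds.average(), yellow=250000, red=280000)
print(fd_status)
```

Averaging over the window keeps a short spike from raising an alert, at the cost of reacting a few minutes later to a sustained increase.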

Enable the Prometheus Plugin

Tanzu RabbitMQ supports enabling the rabbitmq_prometheus plugin for on-demand instances. For more information about the plugin and monitoring RabbitMQ using Prometheus and Grafana, see the RabbitMQ documentation. For a list of plugins that can be enabled for on-demand instances, see RabbitMQ Server Plugins.

Enabling the plugin causes Prometheus-style metrics to be emitted at SERVICE-INSTANCE-ID:15692/metrics. To pull these metrics from the service instances, you must configure a Prometheus instance. If Prometheus is deployed in the environment, use the following scrape configuration in the Prometheus config file to discover RabbitMQ instances:

job_name: rabbitmq
metrics_path: "/metrics"
scheme: http
dns_sd_configs:
- names:
  - q-s4.rabbitmq-server.*.*.bosh.
  type: A
  port: 15692

The wildcards in the DNS name of the scrape configuration ensure that Prometheus also discovers service instances that are created in the future.
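
To check what an instance emits before Prometheus is configured, you can read the endpoint directly. The following Python sketch is illustrative only: the helper names are hypothetical, the sample text is invented rather than captured from a real instance, and the parser is deliberately simplified (it assumes no trailing timestamps and no commas inside label values).

```python
from urllib.request import urlopen


def fetch_metrics(host, port=15692):
    """Read the raw text from the endpoint exposed by the rabbitmq_prometheus plugin."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=5) as response:
        return response.read().decode("utf-8")


def parse_prometheus_text(text):
    """Parse Prometheus text lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        name, _, label_part = metric.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if pair:
                key, _, val = pair.partition("=")
                labels[key] = val.strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Invented sample text in the Prometheus exposition format.
sample_text = """\
# TYPE rabbitmq_connections gauge
rabbitmq_connections 3
rabbitmq_queue_messages_ready{queue="orders"} 12
"""
samples = parse_prometheus_text(sample_text)
print(samples)
```

In a real environment, `fetch_metrics` would be called with the address of a service instance and its result passed to `parse_prometheus_text`.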

The RabbitMQ team has written pre-set Grafana dashboards that you can import into Grafana. For more information about these dashboards, see the RabbitMQ documentation.

BOSH System Health Metrics

The BOSH layer that underlies Ops Manager generates healthmonitor metrics for all VMs in the deployment. As of Ops Manager v2.0, these metrics are included in the Loggregator Firehose by default. For more information, see BOSH System Metrics Available in Loggregator Firehose in VMware Tanzu Application Service for VMs Release Notes.

All BOSH-deployed components generate the system health metrics below. These component metrics are from Tanzu RabbitMQ components, and serve as KPIs for the Tanzu RabbitMQ service.

RAM


system.mem.percent

Description: RAM being consumed by the p.rabbitmq VM.

Use: RabbitMQ is considered to be in a good state when it has few or no messages. In other words, “an empty rabbit is a happy rabbit.” An alert on this metric can indicate that there are too few consumers or apps that read messages from the queue.

Healthmonitor reports when RabbitMQ has used more than 40% of its RAM for the past ten minutes.

Origin: BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
Yellow warning: > 40
Red critical: > 50
Recommended response: Add more consumers to drain the queue as fast as possible.

CPU


system.cpu.user

Description: CPU being consumed by user processes on the p.rabbitmq VM.

Use: A node that experiences context switching or high CPU usage becomes unresponsive. This also affects the ability of the node to report metrics.

Healthmonitor reports when RabbitMQ has used more than 40% of its CPU for the past ten minutes.

Origin: BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
Yellow warning: > 60
Red critical: > 75
Recommended response: Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible.

Ephemeral Disk


system.disk.ephemeral.percent

Description: Ephemeral Disk being consumed by the p.rabbitmq VM.

Use: If the system disk fills up, there are too few consumers.

Healthmonitor reports when RabbitMQ has used more than 50% of its Ephemeral Disk for the past ten minutes.

Origin: BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
Yellow warning: > 50
Red critical: > 75
Recommended response: Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible. Insufficient disk space leads to node failures and might result in data loss because all disk writes fail.

Persistent Disk


system.disk.persistent.percent

Description: Persistent Disk being consumed by the p.rabbitmq VM.

Use: If the system disk fills up, there are too few consumers.

Healthmonitor reports when RabbitMQ uses more than 50% of its Persistent Disk.

Origin: BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds:
Yellow warning: > 50
Red critical: > 75
Recommended response: Remember that “an empty rabbit is a happy rabbit”. Add more consumers to drain the queue as fast as possible. Insufficient disk space leads to node failures and might result in data loss because all disk writes fail.
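
The four system health KPIs share the same shape, so their thresholds can be kept in one table and evaluated together. The sketch below illustrates this under stated assumptions: the `evaluate` function and the input dictionary are hypothetical, and the input values are assumed to be 10-minute averages that were computed elsewhere.

```python
# (yellow, red) thresholds in percent, copied from the KPI descriptions above.
THRESHOLDS = {
    "system.mem.percent": (40, 50),
    "system.cpu.user": (60, 75),
    "system.disk.ephemeral.percent": (50, 75),
    "system.disk.persistent.percent": (50, 75),
}


def evaluate(averages):
    """Map each known metric's averaged value to 'ok', 'yellow', or 'red'."""
    result = {}
    for name, value in averages.items():
        if name not in THRESHOLDS:
            continue  # ignore metrics that are not KPIs
        yellow, red = THRESHOLDS[name]
        if value > red:
            result[name] = "red"
        elif value > yellow:
            result[name] = "yellow"
        else:
            result[name] = "ok"
    return result


report = evaluate({
    "system.mem.percent": 45,
    "system.cpu.user": 80,
    "system.disk.ephemeral.percent": 50,
    "system.healthy": 1,
})
print(report)
```

Keeping the thresholds in data rather than in code makes it easier to tune them per installation, as recommended earlier in this topic.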

Determine If There Is a Network Partition

You can use the reachable_nodes metric to help to identify network partitions. This metric shows how many nodes in the cluster each individual node is aware of. A good indication that a node might be in a partition is when it is aware of only itself.

Here is an example of this metric:

origin:"p.rabbitmq" eventType:ValueMetric timestamp:1441188462382091652 deployment:"cf-rabbitmq" job:"cf-rabbitmq-node" index:"0" ip:"10.244.3.46" valueMetric: < name:"/p.rabbitmq/rabbitmq/erlang/reachable_nodes" value:3 unit:"count">

You can create monitors to emit alerts in case a cluster seems to be in a partition. In a healthy cluster that is not undergoing upgrades, each node’s reachable_nodes count is equal to the number of nodes in the cluster.

To monitor for network partitions, VMware recommends alerting when one of the nodes starts reporting a reachable_nodes count that is less than the size of the cluster.

During rolling upgrades, nodes lose contact with other nodes. Therefore, only alert if a lowered reachable_nodes count persists longer than the expected upgrade time.
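
The alerting rule described above, including the allowance for rolling upgrades, can be sketched as follows. This is a minimal illustration: the function name, the layout of `history`, and the 600-second grace period are assumptions, and the grace period should be replaced with the upgrade time expected in your environment.

```python
def partition_suspects(history, cluster_size, now, grace_s):
    """Return nodes whose reachable_nodes has stayed below cluster_size for longer than grace_s.

    history: dict mapping a node index to a list of
             (unix_seconds, reachable_nodes) samples, oldest first.
    """
    suspects = []
    for node, samples in history.items():
        low_since = None
        for ts, reachable in samples:
            if reachable < cluster_size:
                if low_since is None:
                    low_since = ts  # start of the current low period
            else:
                low_since = None    # the node sees the whole cluster again
        if low_since is not None and now - low_since > grace_s:
            suspects.append(node)
    return sorted(suspects)


history = {
    "0": [(0, 3), (30, 3), (60, 3)],
    "1": [(0, 3), (30, 1), (60, 1)],  # aware of only itself since t=30
    "2": [(0, 3), (30, 2), (60, 3)],  # brief dip, then recovered
}
print(partition_suspects(history, cluster_size=3, now=700, grace_s=600))
```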

Recover from a Network Partition

For information about how to recover from a network partition, see the RabbitMQ documentation.

Component Metrics Reference

Tanzu RabbitMQ component VMs emit the following raw metrics. The full name of the metric follows the format: /p.rabbitmq/COMPONENT/METRIC-NAME

RabbitMQ Server Metrics

Tanzu RabbitMQ message server components emit the following metrics.

Full Name Unit Description
/p.rabbitmq/rabbitmq/heartbeat boolean Indicates whether the RabbitMQ server is available and able to respond to requests
/p.rabbitmq/rabbitmq/erlang/erlang_processes count The number of Erlang processes
/p.rabbitmq/rabbitmq/erlang/reachable_nodes count The number of nodes the current node can reach
/p.rabbitmq/rabbitmq/system/memory MB The memory in MB used by the node
/p.rabbitmq/rabbitmq/system/mem_alarm boolean Indicates if the memory alarm went off
/p.rabbitmq/rabbitmq/system/disk_free MB The disk space available on the node
/p.rabbitmq/rabbitmq/system/disk_free_alarm boolean Indicates if the disk free alarm went off
/p.rabbitmq/rabbitmq/connections/count count The total number of connections to the node
/p.rabbitmq/rabbitmq/consumers/count count The total number of consumers registered in the node
/p.rabbitmq/rabbitmq/messages/delivered count The total number of messages with the status deliver_get on the node
/p.rabbitmq/rabbitmq/messages/delivered_noack count The number of messages with the status deliver_noack on the node
/p.rabbitmq/rabbitmq/messages/delivered_rate rate The rate per second at which messages are being delivered to consumers or clients on the node
/p.rabbitmq/rabbitmq/messages/published count The total number of messages with the status publish on the node
/p.rabbitmq/rabbitmq/messages/published_rate rate The rate per second at which messages are being published by the node
/p.rabbitmq/rabbitmq/messages/redelivered count The total number of messages with the status redeliver on the node
/p.rabbitmq/rabbitmq/messages/redelivered_rate rate The rate per second at which messages are getting the status redeliver on the node
/p.rabbitmq/rabbitmq/messages/get_no_ack count The number of messages with the status get_no_ack on the node
/p.rabbitmq/rabbitmq/messages/get_no_ack_rate rate The rate per second at which messages get the status get_no_ack on the node
/p.rabbitmq/rabbitmq/messages/pending count The number of messages with the status messages_unacknowledged on the node
/p.rabbitmq/rabbitmq/messages/depth count The number of messages with the status messages_unacknowledged or messages_ready on the node
/p.rabbitmq/rabbitmq/system/file_descriptors count The number of open file descriptors on the node
/p.rabbitmq/rabbitmq/exchanges/count count The total number of exchanges on the node
/p.rabbitmq/rabbitmq/messages/available count The total number of messages with the status messages_ready on the node
/p.rabbitmq/rabbitmq/queues/count count The number of queues on the node
/p.rabbitmq/rabbitmq/channels/count count The number of channels on the node
/p.rabbitmq/rabbitmq/vhosts/count count The number of vhosts