PCF Metrics v1.4

Monitoring PCF Metrics

This topic explains how to monitor the health of the Pivotal Cloud Foundry (PCF) Metrics service using the logs, metrics, and Key Performance Indicators (KPIs) emitted by Cloud Foundry and the Metrics app itself.

Key Performance Indicators

KPIs for PCF Metrics are the metrics that operators find most useful for monitoring their PCF Metrics service. KPIs are high-signal-value metrics that can indicate emerging issues.

Pivotal provides the following KPIs as general alerting and response guidance for typical PCF Metrics installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

BOSH Metrics

All BOSH-deployed components generate the following metrics. Monitor them to ensure that they are not consuming excess resources.


system.mem.percent

Description: Percentage used of the VM memory for MySQL, Redis, ElasticSearch data, and ElasticSearch master.

Use: High VM memory usage is likely to degrade data storage and access performance.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: N/A; Red critical: > 80%
Recommended response: Scale up as appropriate.

persistent.disk.percent

Description: Percentage used of the VM persistent disk for MySQL, Redis, ElasticSearch data, and ElasticSearch master.

Use: Data service disks that fill up can cause data loss and performance degradation.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: N/A; Red critical: > 80%
Recommended response: Scale up as appropriate.
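
As a rough illustration of the measurement and threshold recommended for the two BOSH metrics above, the following Python sketch averages percentage samples over a 10-minute window and compares the result with the 80% red critical threshold. The function names and the `(timestamp, percent)` sample format are assumptions made for this example. They are not part of PCF Metrics, JMX Bridge, or BOSH, and how you collect the samples depends on your monitoring tool.

```python
from statistics import mean

WINDOW_SECONDS = 10 * 60        # "Average over last 10 minutes"
RED_CRITICAL_PERCENT = 80.0     # "Red critical: > 80%"


def window_average(samples, now, window=WINDOW_SECONDS):
    """Average the values of samples taken within the last `window` seconds.

    `samples` is a list of (timestamp_in_seconds, percent) tuples
    (hypothetical format). Returns None when the window holds no samples.
    """
    recent = [value for ts, value in samples if now - window < ts <= now]
    return mean(recent) if recent else None


def is_critical(samples, now):
    """Return True when the windowed average exceeds the red critical threshold."""
    average = window_average(samples, now)
    return average is not None and average > RED_CRITICAL_PERCENT
```

At the default 30 s frequency, a 10-minute window holds about 20 samples per VM. Averaging over the window keeps a single short spike from triggering the alert.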

System Metrics

All apps pushed using Cloud Foundry automatically emit the following app system metrics. PCF Metrics is itself a collection of CF apps, so it can be monitored by PCF Metrics (among other monitoring services). The following KPIs can indicate problems with your installation.


system.mem.percent

Description: Percentage used of the app container memory for PCF Metrics apps (metrics-ingestor, mysql-logqueue, elasticsearch-logqueue, metrics, metrics-ui).

Use: PCF Metrics apps that run out of memory are likely to perform poorly.

Origin: Firehose
Type: percent
Frequency: Every minute
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: N/A; Red critical: > 80%
Recommended response: Scale up as appropriate.

persistent.disk.percent

Description: Percentage used of the app container persistent disk for PCF Metrics apps (metrics-ingestor, mysql-logqueue, elasticsearch-logqueue, metrics, metrics-ui).

Use: PCF Metrics apps that run out of disk are likely to perform poorly.

Origin: Firehose
Type: percent
Frequency: Every minute
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: N/A; Red critical: > 80%
Recommended response: Scale up as appropriate.
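
Because these app metrics arrive once per minute for each of the PCF Metrics apps listed above, a check can be run per app. The Python sketch below is one possible way to do that: it takes the last 10 one-minute samples for each app and reports the apps whose average exceeds 80%. The `samples_by_app` mapping and the function name are hypothetical and stand in for whatever your Firehose consumer provides.

```python
from statistics import mean

# App names as listed in the metric descriptions above.
PCF_METRICS_APPS = (
    "metrics-ingestor",
    "mysql-logqueue",
    "elasticsearch-logqueue",
    "metrics",
    "metrics-ui",
)
RED_CRITICAL_PERCENT = 80.0   # "Red critical: > 80%"
SAMPLES_PER_WINDOW = 10       # one sample per minute over 10 minutes


def apps_over_threshold(samples_by_app):
    """Return the apps whose average over the last 10 samples exceeds 80%.

    `samples_by_app` maps an app name to a list of percent values,
    oldest first (hypothetical format). Apps without samples are skipped.
    """
    flagged = []
    for app in PCF_METRICS_APPS:
        recent = samples_by_app.get(app, [])[-SAMPLES_PER_WINDOW:]
        if recent and mean(recent) > RED_CRITICAL_PERCENT:
            flagged.append(app)
    return flagged
```

Skipping apps without samples keeps missing data from being reported as a resource problem. You may prefer to alert on missing data separately.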

Custom Metrics

All PCF Metrics apps are already set up to emit certain custom metrics that indicate app health. Once you configure the export of these metrics, they can be monitored by PCF Metrics (among other monitoring services). The following KPIs can indicate problems with your installation. For more information about how to set up custom metrics, refer to the Metrics Forwarder documentation.


metric_processor.envelopes_stored.rate.1_minute

Description: The rate at which the metrics processor stores metric envelopes to its persistent data store for mysql-logqueue.

Use: A rate of zero indicates that no metrics have been stored, which is likely caused by major metrics processing errors or failures.

Origin: Firehose
Type: count per minute
Frequency: Every minute
Recommended measurement: At all times for the past 30 minutes
Recommended alert thresholds: Equal to 0 at all times for the past 30 minutes
Recommended response: Consult the troubleshooting document for further guidance.

log_processor.envelopes_stored.rate.1_minute

Description: The rate at which the log processor stores log envelopes to its persistent data store for PCF Metrics apps (metrics-ingestor, mysql-logqueue, elasticsearch-logqueue, metrics, metrics-ui).

Use: A rate of zero indicates that no logs have been stored, which is likely caused by major log processing errors or failures.

Origin: Firehose
Type: count per minute
Frequency: Every minute
Recommended measurement: At all times for the past 30 minutes
Recommended alert thresholds: Equal to 0 at all times for the past 30 minutes
Recommended response: Consult the troubleshooting document for further guidance.
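
The two envelope-rate KPIs above signal trouble when the stored rate stays at zero for the whole 30-minute measurement period. The Python sketch below shows one way such a check could look, assuming one rate sample per minute. The function name and the list-based input are hypothetical and are not part of PCF Metrics.

```python
WINDOW_MINUTES = 30  # "At all times for the past 30 minutes"


def is_stalled(rates_per_minute, window=WINDOW_MINUTES):
    """Return True when every sample in a full window is zero.

    `rates_per_minute` is a list of rate values, oldest first, with one
    sample per minute (hypothetical format). A window with fewer than
    `window` samples is treated as insufficient data, not as a stall.
    """
    recent = rates_per_minute[-window:]
    return len(recent) == window and all(rate == 0 for rate in recent)
```

Requiring every sample in the window to be zero means that a single non-zero minute resets the alert, which matches an "at all times" condition and not an average.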

If you have any further questions regarding monitoring PCF Metrics, refer to the PCF Metrics troubleshooting guide.
