PCF Metrics v1.6

Monitoring PCF Metrics

This topic explains how to monitor the health of the Pivotal Cloud Foundry (PCF) Metrics service using the logs, metrics, and Key Performance Indicators (KPIs) emitted by Cloud Foundry and the Metrics application itself.

For more information about monitoring PCF, see Monitoring Pivotal Cloud Foundry.

Healthwatch

The recommended way to monitor PCF Metrics is with Healthwatch. Once Healthwatch is installed, navigate to the JobHealth dashboard to view the PCF Metrics deployment, which is named apmPostgres.

Healthwatch also supports alerting based on VM persistent disk percentage (system.disk.persistent.percent) and VM health (system.healthy).

Key Performance Indicators

KPIs for PCF Metrics are the metrics that operators find most useful for monitoring their PCF Metrics service. KPIs are high-signal-value metrics that can indicate emerging issues.

Pivotal provides the following KPIs as general alerting and response guidance for typical PCF Metrics installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.
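One possible way to fine-tune alert levels from historical trends is to look at high percentiles of past readings. The Python sketch below is illustrative only: the function name suggest_thresholds and the choice of the 95th and 99th percentiles are assumptions made for this example, not Pivotal guidance.

```python
from statistics import quantiles


def suggest_thresholds(history, warn_pct=95, crit_pct=99):
    """Suggest (warning, critical) levels from historical percent readings.

    Hypothetical helper: the percentile choices are assumptions, not
    values taken from the PCF Metrics documentation.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    # quantiles(n=100) returns 99 cut points: the 1st through 99th percentiles.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[warn_pct - 1], cuts[crit_pct - 1]
```

Compare the suggested levels with the recommended thresholds in the sections that follow, and treat them as a starting point for review rather than a replacement.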

BOSH Metrics

All BOSH-deployed components generate the following metrics. Monitor them to ensure that they are not consuming excess resources.


system.mem.percent

Description: Percentage of VM memory in use on the MySQL, Redis, and PostgreSQL VMs.

Use: Excessive VM memory usage is likely to degrade data storage and access performance.

Origin: bosh-system-metrics-forwarder
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: > 80%; Red critical: > 85%
Recommended response: Scale up as appropriate.
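The recommended measurement above (an average over the last 10 minutes, compared against yellow and red thresholds) can be sketched as follows. This is a minimal Python illustration, not part of Healthwatch or PCF Metrics: the class name RollingAverageAlert, the status strings, and the assumption that samples arrive as (timestamp in seconds, percent) pairs are all hypothetical.

```python
from collections import deque


class RollingAverageAlert:
    """Track percent samples and classify their average over a time window."""

    def __init__(self, window_seconds=600, warn=80.0, crit=85.0):
        self.window_seconds = window_seconds  # 10 minutes
        self.warn = warn                      # yellow warning threshold
        self.crit = crit                      # red critical threshold
        self.samples = deque()                # (timestamp_seconds, percent)

    def add(self, timestamp, percent):
        self.samples.append((timestamp, percent))
        # Drop samples that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def status(self):
        if not self.samples:
            return "unknown"
        average = sum(p for _, p in self.samples) / len(self.samples)
        if average > self.crit:
            return "red"
        if average > self.warn:
            return "yellow"
        return "ok"
```

At the default 30 s frequency, the window holds about 20 samples, so a single brief spike does not trigger an alert on its own.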

system.disk.persistent.percent

Description: Percentage of VM persistent disk in use on the MySQL, Redis, and PostgreSQL VMs.

Use: If the persistent disks of the data services fill up, data loss and performance degradation can result.

Origin: bosh-system-metrics-forwarder
Type: percent
Frequency: 30 s (default), 10 s (configurable minimum)
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: > 70%; Red critical: > 80%
Recommended response: Scale up as appropriate.

Component Metrics

All applications pushed using Cloud Foundry automatically emit the following application component metrics. Because PCF Metrics is itself a collection of CF applications, it can be monitored by PCF Metrics, among other monitoring services. The following KPIs can indicate problems with your installation.


system.mem.percent

Description: Percentage of application container memory in use by PCF Metrics applications.

Use: PCF Metrics applications that run out of memory are likely to suffer degraded performance.

Origin: rep
Type: percent
Frequency: Every minute
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: > 80%; Red critical: > 90%
Recommended response: Scale up as appropriate.

persistent.disk.percent

Description: Percentage of application container persistent disk in use by PCF Metrics applications.

Use: PCF Metrics applications that run out of disk are likely to suffer degraded performance.

Origin: Firehose
Type: percent
Frequency: Every minute
Recommended measurement: Average over last 10 minutes
Recommended alert thresholds: Yellow warning: N/A; Red critical: > 80%
Recommended response: Scale up as appropriate.
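The component metric thresholds above could be kept in a small lookup table, with the missing yellow warning for disk represented explicitly. This Python sketch is only an illustration under stated assumptions: the names Threshold, APP_THRESHOLDS, and classify are hypothetical, and the input is assumed to be an already computed 10-minute average.

```python
from typing import NamedTuple, Optional


class Threshold(NamedTuple):
    warn: Optional[float]  # None where the guidance lists N/A
    crit: float


# Values copied from the component metrics KPIs in this section.
APP_THRESHOLDS = {
    "system.mem.percent": Threshold(warn=80.0, crit=90.0),
    "persistent.disk.percent": Threshold(warn=None, crit=80.0),
}


def classify(metric, ten_minute_average):
    """Return "red", "yellow", or "ok" for a 10-minute average reading."""
    threshold = APP_THRESHOLDS[metric]
    if ten_minute_average > threshold.crit:
        return "red"
    if threshold.warn is not None and ten_minute_average > threshold.warn:
        return "yellow"
    return "ok"
```

For example, classify("system.mem.percent", 85.0) yields "yellow", while a disk reading of 75.0 stays "ok" because no warning level is defined for that metric.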

Custom Metrics

Installing the Metrics Forwarder tile lets you gather more fine-grained metrics from the PCF Metrics deployment.

All PCF Metrics applications are already set up to emit certain custom metrics to indicate application health. As long as the Metrics Forwarder is installed, custom metrics are viewable in PCF Metrics. The following KPIs can indicate problems with your installation.


metric-processor.envelopes-stored.rate.1-minute

Description: The rate at which the metrics processor stores metric envelopes to its persistent data store for metrics-queue.

Use: A rate of zero indicates that no metrics have been stored, which is likely caused by a major metrics processing error or failure.

Origin: metrics-forwarder
Type: gauge
Frequency: Every minute
Recommended measurement: Every minute for the past 30 minutes
Recommended alert thresholds: A value of 0 at any point in the past 30 minutes
Recommended response: Consult the troubleshooting document for further guidance.

log-processor.logstore.bulk-inserts.rate.1_minute

Description: The rate at which the log processor stores log envelopes to its persistent data store for PCF Metrics applications.

Use: A rate of zero indicates that no logs have been stored, which is likely caused by a major log processing error or failure.

Origin: metrics-forwarder
Type: gauge
Frequency: Every minute
Recommended measurement: Every minute for the past 30 minutes
Recommended alert thresholds: A value of 0 at any point in the past 30 minutes
Recommended response: Consult the troubleshooting document for further guidance.
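The zero-rate alert condition for the two custom metrics can be sketched as a check over the most recent 30 one-minute samples. This is a hedged Python illustration: the function name stalled_processors and the input shape (a dict mapping each metric name to a list of rates, oldest first) are assumptions, not part of the Metrics Forwarder API.

```python
def stalled_processors(rates_by_metric, window=30):
    """Return the metric names with a 0 reading in their last `window` samples."""
    alerts = []
    for name, rates in rates_by_metric.items():
        recent = rates[-window:]  # one sample per minute, newest last
        if any(rate == 0 for rate in recent):
            alerts.append(name)
    return alerts


# Example input using the two metric names from this section.
samples = {
    "metric-processor.envelopes-stored.rate.1-minute": [12.0] * 29 + [0.0],
    "log-processor.logstore.bulk-inserts.rate.1_minute": [0.0] + [7.5] * 30,
}
alerting = stalled_processors(samples)
```

In this example only the metric processor alerts: the log processor's zero reading is 31 minutes old and falls outside the 30-minute window.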

If you have any further questions regarding monitoring PCF Metrics, refer to the PCF Metrics troubleshooting guide.
