Monitoring Pivotal Cloud Foundry

This guide describes how Pivotal Cloud Foundry (PCF) operators can monitor their deployments.

In This Guide

This guide includes the following topics:

  • Key Performance Indicators: A list of Key Performance Indicators (KPIs) that operators may want to monitor in their PCF deployments to help ensure they remain in a good operational state.
  • Key Capacity Scaling Indicators: A list of capacity scaling indicators that operators may want to monitor to determine when they need to scale their PCF deployments.
  • Configuring a Monitoring System: Guidance for setting up PCF with third-party monitoring platforms to continuously monitor component metrics and trigger health alerts.

For information about logging and metrics in PCF and about monitoring of services for PCF, see Additional Resources below.

KPI Changes from PCF v1.11 to v1.12

This list highlights new and changed KPIs in PCF v1.12.

The internal MySQL job included in Elastic Runtime now emits metrics. See the Elastic Runtime MySQL KPIs.

  • New KPI: gorouter.file_descriptors. New component functionality in the Gorouter system needs monitoring.
  • New KPI: gorouter.backend_exhausted_conns. New component functionality in the Gorouter system needs monitoring.
  • New Elastic Runtime MySQL KPIs: Elastic Runtime's internal MySQL database now emits metrics. See Elastic Runtime MySQL KPIs for KPIs based on these metrics. PCF v1.10.25+ and v1.11.11+ patch releases also include this change.
  • Modified KPI: Firehose Dropped Messages. To measure Firehose Dropped Messages correctly, you must now add loggregator.doppler.dropped to DopplerServer.doppler.shedEnvelopes.
  • Modified KPI: Firehose Throughput. To measure Firehose Throughput correctly, you must now add loggregator.doppler.ingress to DopplerServer.listeners.totalReceivedMessageCount.
  • Modified KPI: Firehose Loss Rate. Because the calculations for Firehose Dropped Messages and Firehose Throughput have changed, the formula for Firehose Loss Rate has been modified accordingly.
  • Modified KPI: Scalable Syslog Drain Bindings Count. Improvements in the underlying Adapter's memory utilization have increased the number of drains each adapter can handle, which allows the recommended scaling threshold to be raised accordingly. As of May 2018, however, the recommended scaling threshold for this metric was modified downward.
  • Modified KPI: BBS Time to Handle Requests. Diego now aggregates this metric and, every 60 seconds, emits the maximum value observed during that 60-second interval. This changes the expected frequency of the metric and the recommended alerting measurement.
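The Firehose KPI changes above combine two metrics per KPI. The following is a minimal sketch of that arithmetic, not an official implementation. It assumes you already collect, for each Doppler instance, the increase of each counter over the same measurement interval; the DopplerCounters class and the function names are hypothetical. It also assumes that Firehose Loss Rate is the combined dropped-message count divided by the combined throughput.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DopplerCounters:
    """Hypothetical input: per-Doppler counter increases over one interval."""

    shed_envelopes: float  # DopplerServer.doppler.shedEnvelopes
    dropped: float         # loggregator.doppler.dropped
    total_received: float  # DopplerServer.listeners.totalReceivedMessageCount
    ingress: float         # loggregator.doppler.ingress


def firehose_dropped_messages(counters: Iterable[DopplerCounters]) -> float:
    # Firehose Dropped Messages: shedEnvelopes plus dropped, summed across all Dopplers.
    return sum(c.shed_envelopes + c.dropped for c in counters)


def firehose_throughput(counters: Iterable[DopplerCounters]) -> float:
    # Firehose Throughput: totalReceivedMessageCount plus ingress, summed across all Dopplers.
    return sum(c.total_received + c.ingress for c in counters)


def firehose_loss_rate(counters: Iterable[DopplerCounters]) -> float:
    # Assumed formula: combined dropped messages divided by combined throughput.
    counters = list(counters)  # materialize once so both sums see the same data
    throughput = firehose_throughput(counters)
    if throughput == 0:
        return 0.0  # no traffic in the interval, so nothing was lost
    return firehose_dropped_messages(counters) / throughput


if __name__ == "__main__":
    dopplers = [DopplerCounters(10, 5, 900, 100), DopplerCounters(0, 5, 400, 100)]
    print(firehose_dropped_messages(dopplers))  # 20
    print(firehose_throughput(dopplers))        # 1500
    print(firehose_loss_rate(dopplers))
```

Summing both metric families before dividing means a deployment that still emits only one of the two families is measured correctly, because the missing family simply contributes zero.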
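To illustrate what the BBS Time to Handle Requests change means for a monitoring query, the following sketch collapses raw samples into one maximum per 60-second window, so the monitoring system receives a single data point per minute. This is not Diego's code. The epoch-aligned fixed windows, the (timestamp_seconds, value) input shape, and the max_per_window name are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def max_per_window(
    samples: Iterable[Tuple[float, float]], window_seconds: int = 60
) -> List[Tuple[int, float]]:
    """Collapse (timestamp_seconds, value) samples into one max per fixed window.

    Returns (window_start_seconds, max_value) pairs sorted by window start.
    Hypothetical helper; Diego's actual aggregation may differ in detail.
    """
    maxima: Dict[int, float] = {}
    for timestamp, value in samples:
        # Align each sample to the start of its fixed-size window.
        start = int(timestamp // window_seconds) * window_seconds
        if start not in maxima or value > maxima[start]:
            maxima[start] = value
    return sorted(maxima.items())


if __name__ == "__main__":
    raw = [(0, 0.2), (30, 1.5), (59.9, 0.7), (60, 0.1), (125, 3.0)]
    print(max_per_window(raw))  # [(0, 1.5), (60, 0.1), (120, 3.0)]
```

In this sketch, only the largest value in each window survives. An alerting rule that evaluates over less than one full 60-second interval may therefore see no data at all, and averaging the emitted points yields an average of per-minute maxima rather than an average of raw request times.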

Additional Resources

For information about logging and metrics in PCF, see the following topics:

  • Configuring System Logging in Elastic Runtime: This topic explains how to configure the PCF Loggregator system to scale its maximum throughput and to forward logs to an external aggregator service.
  • Logging and Metrics: A guide to Loggregator, the system that aggregates and streams logs and metrics from user apps and system components in Elastic Runtime.

For information about KPIs and metrics for PCF services, see the following topics:
