Configuring a Monitoring System

This topic describes how to set up Pivotal Cloud Foundry (PCF) with third-party monitoring platforms to continuously monitor system metrics and trigger health alerts.

To perform a manual, one-time check of current PCF system status from Ops Manager, see Monitoring Virtual Machines in Pivotal Cloud Foundry.

Pivotal recommends that operators experiment with different combinations of metrics and alerts appropriate to their specific requirements. As an example, the Datadog Config repository shows how the Pivotal Cloud Ops team monitors the health of its Cloud Foundry deployments using a customized Datadog dashboard.

Note: Pivotal does not support any monitoring platforms.


As a prerequisite to PCF monitoring, you need an account with a monitoring platform such as SignalFx, Datadog, OpenTDSB, New Relic, vRealize Operations (vROPS), AppDynamics, Dynatrace, or another tool.

To set up PCF monitoring, you then configure PCF and your monitoring platform as follows:

  • In PCF:

    • Install a nozzle that extracts BOSH and CF metrics from the Firehose and sends them to the monitoring platform.
    • (Optional) Deploy smoke tests or other apps that generate custom metrics. Pivotal recommends custom metrics for production environments.
  • In your monitoring platform:

    • Customize a dashboard that lets you check and diagnose system health.
    • Create alerts that generate communications regarding attention-worthy conditions.

BOSH Health Monitor and CF Component Metrics

You can configure PCF to direct metrics from all Elastic Runtime component VMs, including system components and hosts, to a monitoring platform. To do this, you configure component logs and metrics to stream from the Loggregator Firehose endpoint and install a nozzle that filters out the logs and directs the metrics to the monitoring platform.

The Firehose metrics come from two sources:

  • BOSH Health Monitor: The BOSH layer that underlies PCF generates healthmonitor metrics for all VMs in the deployment. These metrics are not in the Firehose by default. You must install the Open-Source HM Forwarder to send these VM metrics through the Firehose.
    • Alternatively, you can also retrieve BOSH health metrics outside of the Firehose with the JMX Bridge for PCF tile.
  • Cloud Foundry components: Cloud Foundry component VMs for executive control, hosting, routing, traffic control, authentication, and other internal functions generate metrics.

Note: PCF components specific to your IaaS also generate key metrics for health monitoring.

See the following topics for lists of high-signal-value metrics and capacity scaling indicators in a PCF deployment:

Metrics Path from Component to Firehose

PCF component metrics originate from the Metron agents on their source components, then travel through Dopplers to the Traffic Controller.

The Traffic Controller aggregates both metrics and log messages system-wide from all Dopplers, and emits them from its Firehose endpoint.

Smoke Tests and Custom System Metrics

PCF includes smoke tests, which are functional unit and integration tests on all major system components. By default, whenever an operator upgrades to a new version of Elastic Runtime, these smoke tests run as a post-deploy errand.

Production systems typically also have an app that runs smoke tests periodically, for example every five minutes, and generate “pass/fail” metrics from the results. For example smoke tests, see the Pivotal Cloud Ops CF Smoke Tests repository.

Operators can also generate other custom system metrics based on multi-component tests. An example is average outbound latency between components.

PCF Monitoring Setup

Perform the following steps to set up PCF monitoring:

  1. Install a nozzle that extracts BOSH and CF metrics from the Loggregator Firehose and sends them to the monitoring platform.

  2. If you are not using the JMX Bridge nozzle, install the HM Forwarder process to run on the BOSH Health Monitor VM. This process routes health metrics to the local Metron agent, and it does not install automatically as part of PCF. You do not need the HM Forwarder with the JMX Bridge nozzle, which queries the Health Monitor directly.

  3. Install a custom app to generate smoke test or other custom system metrics.

  4. Customize your monitoring platform dashboard and alerts.

Install a Nozzle

To monitor BOSH and CF component metrics, you install a nozzle that directs the metrics from the Firehose to your monitoring platform. The nozzle process takes the Firehose output, ignores the logs, and sends the metrics.

If you do not use the JMX Bridge OpenTSDB Firehose Nozzle, you must install a BOSH HM Forwarder job on the VM that runs the BOSH Health Monitor and its Metron agent.

You can see an example nozzle for sending metrics to Datadog in the datadog-firehose-nozzle GitHub repository. You configure the Datadog account credentials, API location, and other fields and options in the config/datadog-firehose-nozzle.json file.

Deploy a Custom System Metrics App

For production systems, Pivotal recommends deploying an app that runs regular smoke tests and other custom tests and generates metrics from the results.

A custom system metrics app sends metrics to the monitoring platform directly, so you must configure it with your platform endpoint and account information. The app does not run a Metron agent, and the Firehose does not carry custom system metrics.

The app can run in its own Docker container, on a Concourse VM, or elsewhere.

See the Pivotal Cloud Ops CF Smoke Tests repository for more information and examples of smoke test and custom system metrics apps.

See the Metrics topic in the Concourse documentation for how to set up Concourse to generate custom system metrics.

Configure your Monitoring Platform

Monitoring platforms support two types of monitoring:

  • A dashboard for active monitoring when you are at a keyboard and screen
  • Automated alerts for when your attention is elsewhere

Some monitoring solutions offer both in one package. Others require putting the two pieces together.

See the Datadog Config repository for an example of how to configure a dashboard and alerts for Cloud Foundry in Datadog.

Determine Warning and Critical Thresholds

To properly configure your monitoring dashboard and alerts, you need to establish what thresholds should drive alerting and red/yellow/green dashboard behavior. PCF users can refer to these recommended Key Performance Indicators and Key Capacity Scaling Indicators.

Some key metrics of operational value have more fixed thresholds, and recommended threshold numbers for them are generally the same values across different foundations and use cases. These metrics tend to revolve around the health and performance of key components that can impact the performance of the entire system.

Other key metrics of operational value are more dynamic in nature. This means that you must establish a baseline and yellow/red thresholds suitable for your system and its use cases. Initially, baselines can be established by watching values of key metrics over time and noting what seems to be a good starting threshold level between acceptable and unacceptable system performance and health. The initial baselines should continue to be refined as you determine the appropriate balance between early detection and reducing unnecessary alert fatigue. These types of more dynamic measures should be revisited on occasion to ensure they are still appropriate to the current system configuration and its usage patterns.

Customize Your Dashboard

You customize a dashboard by defining elements on the screen that show values derived from one or more metrics. These dashboard elements typically use simple formulas, such as averaging metric values over the past 60 seconds or summing them up over related instances. They are also often normalized to display with 3 or fewer digits for easy reading and color-coded red, yellow, or green to indicate health conditions.

Datadog dashboard

In Datadog, for example, you can define a screen_template query that watches the auctioneer.AuctioneerLRPAuctionsFailed metric and displays its current average over the past minute.

Create Alerts

You create alerts by defining boolean conditions based on operations over one or more metrics, and an action that the platform takes when an alert triggers. The booleans typically check whether metric values exceed or fall below thresholds, or compare metric values against each other.

In Datadog,for example, you can define an alert_template condition that triggers when the auctioneer.AuctioneerLRPAuctionsFailed metric indicates an average of more than one failed auction per minute for the past 15 minutes:

  "query": "min(last_15m):per_minute(avg:datadog.nozzle.auctioneer.AuctioneerLRPAuctionsFailed{deployment:<%= metron_agent_diego_deployment %>}) > 1",
  "message": "##Description:\nDiego internal metrics \n\n## Escalation Path:\nDiego \n\n## Possible Causes:\nThose alerts were a pretty strong signal for us to look at the BBS, which was locked up\n\n## Potential Solutions:\nEscalate to Diego team\n><%= cloudops_pagerduty %> <%= diego_oncall %>",
  "name": "<%= environment %> Diego: LRP Auction Failure per min is too high",
  "no_data_timeframe": 30,
  "notify_no_data": false

Actions that an alert triggers can include sending a pager or SMS message, sending an email, generating a support ticket, or passing the alert to a alerting system such as PagerDuty.

Monitoring Platforms

Some monitoring solutions offer both alerts and a dashboard in one package, while others require separate packages for alerts and dashboard.

Popular monitoring platforms among PCF customers include the following:

Create a pull request or raise an issue on the source for this page in GitHub