
Monitoring a Pivotal Cloud Foundry Deployment

This topic describes how to set up Pivotal Cloud Foundry (PCF) with third-party monitoring platforms to continuously monitor system metrics and trigger health alerts.

To perform a manual, one-time check of current PCF system status from Ops Manager, see Monitoring Virtual Machines in Pivotal Cloud Foundry.

Pivotal recommends that operators experiment with different combinations of metrics and alerts appropriate to their specific requirements. As an example, the Datadog Config repository shows how the Pivotal Cloud Ops team monitors the health of its Cloud Foundry deployments using a customized Datadog dashboard.

Note: Pivotal does not support any third-party monitoring platforms.

Overview

As a prerequisite to PCF monitoring, you need an account with a monitoring platform such as Datadog or OpenTSDB.

To set up PCF monitoring, you then configure PCF and your monitoring platform as follows:

  • In PCF:

    • Install a nozzle that extracts BOSH and CF metrics from the Firehose and sends them to the monitoring platform.
    • (Optional) Deploy smoke tests or other apps that generate custom metrics. Pivotal recommends custom metrics for production environments.
  • In your monitoring platform:

    • Customize a dashboard that lets you check and diagnose system health.
    • Create alerts that generate communications regarding attention-worthy conditions.

BOSH Health Monitor and CF Component Metrics

You can configure PCF to direct metrics from all Elastic Runtime component VMs, including system components and hosts, to a monitoring platform. To do this, you configure component logs and metrics to stream from the Loggregator Firehose endpoint and install a nozzle that filters out the logs and directs the metrics to the monitoring platform.

The Firehose logs and metrics come from two sources: the BOSH Health Monitor and Cloud Foundry components.

BOSH Health Monitor

The BOSH layer that underlies PCF generates healthmonitor metrics for all VMs in the deployment. The Pivotal Cloud Ops team considers the following metrics the most important for monitoring system health:

  • bosh.healthmonitor.system.cpu: CPU usage, percent of total available on VM
  • bosh.healthmonitor.system.mem: Memory usage, percent of total available on VM
  • bosh.healthmonitor.system.disk: Disk usage, percent of total available on VM
  • bosh.healthmonitor.system.healthy: 1 if VM is healthy, 0 otherwise
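
For example, an alert can watch bosh.healthmonitor.system.healthy and trigger when any VM reports unhealthy. The Datadog-style monitor below is a hedged sketch modeled on the alert example later in this topic; the datadog.nozzle metric prefix and the deployment tag value are assumptions that depend on how your nozzle is configured.

    {
        "query": "min(last_5m):avg:datadog.nozzle.bosh.healthmonitor.system.healthy{deployment:cf} by {job,index} < 1",
        "message": "## Description:\nA BOSH-managed VM is reporting unhealthy.\n\n## Possible Causes:\nVM failure, or a failing job process on the VM.",
        "name": "BOSH: VM health check failing",
        "notify_no_data": true,
        "no_data_timeframe": 30
    }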

Cloud Foundry Components

Cloud Foundry component VMs for executive control, hosting, routing, traffic control, authentication, and other internal functions generate metrics. See the Cloud Foundry Component Metrics topic for a detailed list of metrics generated by Cloud Foundry.

The Pivotal Cloud Ops team finds the following PCF component metrics particularly useful to monitor:

  • auctioneer.AuctioneerFetchStatesDuration
  • auctioneer.AuctioneerLRPAuctionsFailed
  • bbs.Domain.cf_apps
  • bbs.CrashedActualLRPs
  • bbs.LRPsMissing
  • bbs.ConvergenceLRPDuration
  • bbs.RequestLatency
  • DopplerServer.listeners.receivedEnvelopes
  • DopplerServer.TruncatingBuffer.totalDroppedMessages
  • gorouter.total_routes
  • gorouter.ms_since_last_registry_update
  • MetronAgent.dropsondeMarshaller.sentEnvelopes
  • nsync_bulker.DesiredLRPSyncDuration
  • rep.CapacityRemainingMemory
  • rep.CapacityTotalMemory
  • rep.RepBulkSyncDuration
  • route_emitter.RouteEmitterSyncDuration

PCF components specific to your IaaS also generate key metrics for health monitoring.

Metrics Path from Component to Firehose

PCF component metrics originate from the Metron agents on their source components, then travel through Dopplers to the Traffic Controller.

The Traffic Controller aggregates both metrics and log messages system-wide from all Dopplers, and emits them from its Firehose endpoint.

Smoke Tests and Custom System Metrics

PCF includes smoke tests, which are functional unit and integration tests on all major system components. By default, whenever an operator upgrades to a new version of Elastic Runtime, these smoke tests run as a post-deploy errand.

Production systems typically also have an app that runs smoke tests periodically, for example every five minutes, and generates “pass/fail” metrics from the results. For examples of smoke tests, see the Pivotal Cloud Ops CF Smoke Tests repository.

Operators can also generate other custom system metrics based on multi-component tests. An example is average outbound latency between components.

PCF Monitoring Setup

Perform the following steps to set up PCF monitoring:

  1. Install a nozzle that extracts BOSH and CF metrics from the Loggregator Firehose and sends them to the monitoring platform.

  2. If you are not using the JMX Bridge nozzle, install the HM Forwarder process to run on the BOSH Health Monitor VM. This process routes health metrics to the local Metron agent, and it does not install automatically as part of PCF. You do not need the HM Forwarder with the JMX Bridge nozzle, which queries the Health Monitor directly.

  3. Install a custom app to generate smoke test or other custom system metrics.

  4. Customize your monitoring platform dashboard and alerts.

Install a Nozzle

To monitor BOSH and CF component metrics, you install a nozzle that directs the metrics from the Firehose to your monitoring platform. The nozzle process takes the Firehose output, ignores the logs, and sends the metrics.

If you do not use the JMX Bridge OpenTSDB Firehose Nozzle, you must install a BOSH HM Forwarder job on the VM that runs the BOSH Health Monitor and its Metron agent.

You can see an example nozzle for sending metrics to Datadog in the datadog-firehose-nozzle GitHub repository. You configure the Datadog account credentials, API location, and other fields and options in the config/datadog-firehose-nozzle.json file.
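
The sketch below illustrates the kinds of fields such a configuration file typically contains: UAA credentials that let the nozzle read from the Firehose, the Traffic Controller URL, and the Datadog API location and key. The field names and values shown are illustrative assumptions only; consult config/datadog-firehose-nozzle.json in the repository for the authoritative schema.

    {
        "UAAURL": "https://uaa.system.example.com",
        "Client": "datadog-firehose-nozzle",
        "ClientSecret": "example-client-secret",
        "TrafficControllerURL": "wss://doppler.system.example.com:443",
        "FirehoseSubscriptionID": "datadog-nozzle",
        "DataDogURL": "https://app.datadoghq.com/api/v1/series",
        "DataDogAPIKey": "example-api-key",
        "FlushDurationSeconds": 15,
        "InsecureSSLSkipVerify": false
    }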

Deploy a Custom System Metrics App

For production systems, Pivotal recommends deploying an app that runs regular smoke tests and other custom tests and generates metrics from the results.

A custom system metrics app sends metrics to the monitoring platform directly, so you must configure it with your platform endpoint and account information. The app does not run a Metron agent, and the Firehose does not carry custom system metrics.
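
For example, if Datadog is the monitoring platform, the app could POST a gauge-type series to the Datadog metrics API with the account API key as a query parameter. The payload below is a hedged sketch; the metric name, timestamp, and tags are illustrative placeholders.

    {
        "series": [
            {
                "metric": "smoke_tests.cf_push.pass",
                "points": [[1490000000, 1]],
                "type": "gauge",
                "tags": ["deployment:cf", "environment:prod"]
            }
        ]
    }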

The app can run in its own Docker container, on a Concourse VM, or elsewhere.

See the Pivotal Cloud Ops CF Smoke Tests repository for more information and examples of smoke test and custom system metrics apps.

See the Metrics topic in the Concourse documentation for how to set up Concourse to generate custom system metrics.

Configure your Monitoring Platform

Monitoring platforms support two types of monitoring:

  • A dashboard for active monitoring when you are at a keyboard and screen
  • Automated alerts for when your attention is elsewhere

Some monitoring solutions offer both in one package. Others require putting the two pieces together.

See the Datadog Config repository for an example of how to configure a dashboard and alerts for Cloud Foundry in Datadog.

Customize Your Dashboard

You customize a dashboard by defining elements on the screen that show values derived from one or more metrics. These dashboard elements typically use simple formulas, such as averaging metric values over the past 60 seconds or summing them up over related instances. They are also often normalized to display with 3 or fewer digits for easy reading and color-coded red, yellow, or green to indicate health conditions.

Datadog dashboard

In Datadog, for example, you can define a screen_template query that watches the auctioneer.AuctioneerLRPAuctionsFailed metric and displays its current average over the past minute.
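
A hedged sketch of such a graph definition, using the classic Datadog graph JSON format, is shown below; the deployment tag value is a placeholder, and the .rollup(avg, 60) function averages the metric over 60-second intervals.

    {
        "viz": "timeseries",
        "requests": [
            {
                "q": "avg:datadog.nozzle.auctioneer.AuctioneerLRPAuctionsFailed{deployment:cf}.rollup(avg, 60)"
            }
        ]
    }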

Create Alerts

You create alerts by defining boolean conditions based on operations over one or more metrics, and an action that the platform takes when an alert triggers. The booleans typically check whether metric values exceed or fall below thresholds, or compare metric values against each other.

In Datadog, for example, you can define an alert_template condition that triggers when the auctioneer.AuctioneerLRPAuctionsFailed metric indicates an average of more than one failed auction per minute for the past 15 minutes:


    {
        "query": "min(last_15m):per_minute(avg:datadog.nozzle.auctioneer.AuctioneerLRPAuctionsFailed{deployment:<%= metron_agent_diego_deployment %>}) > 1",
        "message": "##Description:\nDiego internal metrics \n\n## Escalation Path:\nDiego \n\n## Possible Causes:\nThose alerts were a pretty strong signal for us to look at the BBS, which was locked up\n\n## Potential Solutions:\nEscalate to Diego team\n><%= cloudops_pagerduty %> <%= diego_oncall %>",
        "name": "<%= environment %> Diego: LRP Auction Failure per min is too high",
        "no_data_timeframe": 30,
        "notify_no_data": false
    }

Actions that an alert triggers can include sending a pager or SMS message, sending an email, generating a support ticket, or passing the alert to an alerting system such as PagerDuty.

Monitoring Platforms

Some monitoring solutions offer both alerts and a dashboard in one package, while others require separate packages for alerts and dashboard.

Popular monitoring platforms among PCF customers include Datadog and OpenTSDB.
