Configuring a Monitoring System

This topic describes how to set up Pivotal Cloud Foundry (PCF) with third-party monitoring platforms to continuously monitor system metrics and trigger health alerts.

Note: You can also use PCF Healthwatch to monitor your deployment. PCF Healthwatch is a service tile developed and supported by Pivotal.

To perform a manual, one-time check of current PCF system status from Ops Manager, see Monitoring Virtual Machines in Pivotal Cloud Foundry.

Pivotal recommends that operators experiment with different combinations of metrics and alerts appropriate to their specific requirements. As an example, the Datadog Config repository shows how the Pivotal Cloud Ops team monitors the health of its Cloud Foundry deployments using a customized Datadog dashboard.

Overview

As a prerequisite to PCF monitoring, you need an account with a monitoring platform such as SignalFx, Datadog, OpenTSDB, New Relic, vRealize Operations (vROps), AppDynamics, Dynatrace, or another tool.

To set up PCF monitoring, you then configure PCF and your monitoring platform as follows:

  • In PCF:

    • Install a nozzle that extracts BOSH and CF metrics from the Firehose and sends them to the monitoring platform.
    • (Optional) Deploy smoke tests or other apps that generate custom metrics. Pivotal recommends custom metrics for production environments.
  • In your monitoring platform:

    • Customize a dashboard that lets you check and diagnose system health.
    • Create alerts that generate communications regarding attention-worthy conditions.

BOSH Health Monitor and CF Component Metrics

You can configure PCF to direct metrics from all Pivotal Application Service (PAS) component VMs, including system components and hosts, to a monitoring platform. To do this, you configure component logs and metrics to stream from the Loggregator Firehose endpoint and install a nozzle that filters out the logs and directs the metrics to the monitoring platform.

The Loggregator Firehose metrics come from two sources:

  • The BOSH Health Monitor: The BOSH Health Monitor receives health monitoring metrics from BOSH Agents for all VMs in a deployment.

    Note: BOSH-reported system metrics are now available in the Loggregator Firehose by default. If your monitoring tool is set up to consume metrics from both the Loggregator Firehose and the PCF JMX Bridge tile, you may receive duplicate data. To prevent this, stop using PCF JMX Bridge to consume BOSH-reported system metrics outside of the Firehose. For more guidance, see BOSH System Metrics Available in Loggregator Firehose.

  • Cloud Foundry components: Cloud Foundry component VMs for executive control, hosting, routing, traffic control, authentication, and other internal functions generate metrics.

Note: PCF components specific to your IaaS also generate key metrics for health monitoring.

See the following topics for lists of high-signal-value metrics and capacity scaling indicators in a PCF deployment:

  • Key Performance Indicators
  • Key Capacity Scaling Indicators

Metrics Path from Component to Firehose

PCF component metrics originate from the Metron agents on their source components, then travel through Dopplers to the Traffic Controller.

The Traffic Controller aggregates both metrics and log messages system-wide from all Dopplers, and emits them from its Firehose endpoint.
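
To see where each metric in this stream originates, you can subscribe to the Firehose endpoint and inspect the metadata that travels with every envelope. The following minimal Go sketch uses the open-source NOAA consumer library to do this; the endpoint URL, subscription ID, OAuth token, and TLS settings are placeholders that vary by deployment.

package main

import (
    "crypto/tls"
    "fmt"
    "log"

    "github.com/cloudfoundry/noaa/consumer"
)

func main() {
    // The Traffic Controller Firehose endpoint and OAuth token are
    // deployment-specific; the values here are placeholders.
    trafficControllerURL := "wss://doppler.sys.example.com:443"
    authToken := "bearer PLACEHOLDER-TOKEN" // for example, from `cf oauth-token` or a UAA client

    // InsecureSkipVerify keeps the sketch short; configure proper CA
    // certificates in a real deployment.
    c := consumer.New(trafficControllerURL, &tls.Config{InsecureSkipVerify: true}, nil)
    defer c.Close()

    envelopes, errs := c.Firehose("example-subscription", authToken)
    go func() {
        for err := range errs {
            log.Printf("firehose error: %v", err)
        }
    }()

    // Each envelope carries metadata identifying the component that emitted it.
    for e := range envelopes {
        fmt.Printf("origin=%s deployment=%s job=%s index=%s type=%s\n",
            e.GetOrigin(), e.GetDeployment(), e.GetJob(), e.GetIndex(), e.GetEventType())
    }
}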

Smoke Tests and Custom System Metrics

PCF includes smoke tests, which are functional unit and integration tests on all major system components. By default, whenever an operator upgrades to a new version of PAS, these smoke tests run as a post-deploy errand.

Production systems typically also have an app that runs smoke tests periodically, for example every five minutes, and generates “pass/fail” metrics from the results. For examples of smoke tests, see the Pivotal Cloud Ops CF Smoke Tests repository.

Operators can also generate other custom system metrics based on multi-component tests. An example is average outbound latency between components.
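
Whether the source is a smoke test or a multi-component check, the pattern is the same: run the check on a timer and record a pass/fail or measured value. The Go sketch below illustrates that pattern; runSmokeTest and recordMetric are hypothetical placeholders for your own test logic and your monitoring platform's client, and the endpoint shown is not a real system domain.

package main

import (
    "log"
    "net/http"
    "time"
)

// runSmokeTest is a placeholder check. A real smoke test might push, scale,
// and delete a test app, or exercise a critical user-facing endpoint.
func runSmokeTest() bool {
    resp, err := http.Get("https://api.sys.example.com/v2/info") // hypothetical system domain
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}

// recordMetric stands in for your monitoring platform's client library or API call.
func recordMetric(name string, value float64) {
    log.Printf("%s=%f", name, value)
}

func main() {
    ticker := time.NewTicker(5 * time.Minute) // run every five minutes
    defer ticker.Stop()
    for range ticker.C {
        if runSmokeTest() {
            recordMetric("smoke_tests.passed", 1)
        } else {
            recordMetric("smoke_tests.passed", 0)
        }
    }
}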

PCF Monitoring Setup

Perform the following steps to set up PCF monitoring:

  1. Install a nozzle that extracts BOSH and CF metrics from the Loggregator Firehose and sends them to the monitoring platform.

  2. Install a custom app to generate smoke test or other custom system metrics.

  3. Customize your monitoring platform dashboard and alerts.

Install a Nozzle

To monitor BOSH and CF component metrics, you install a nozzle that directs the metrics from the Firehose to your monitoring platform. The nozzle process takes the Firehose output, ignores the logs, and sends the metrics.

You can see an example nozzle for sending metrics to Datadog in the datadog-firehose-nozzle GitHub repository. You configure the Datadog account credentials, API location, and other fields and options in the config/datadog-firehose-nozzle.json file.
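
Conceptually, the core of such a nozzle is a loop that reads envelopes from the Firehose, drops log messages, and forwards value metrics and counter events. The sketch below is not the Datadog nozzle itself but an illustrative Go outline using the NOAA consumer and sonde-go event types; forwardToPlatform is a hypothetical stand-in for a real platform client, and the endpoint and token are placeholders.

package main

import (
    "crypto/tls"
    "log"

    "github.com/cloudfoundry/noaa/consumer"
    "github.com/cloudfoundry/sonde-go/events"
)

// forwardToPlatform is a hypothetical stand-in for a monitoring platform client,
// for example one that batches points and POSTs them to the platform's API.
func forwardToPlatform(name string, value float64, tags map[string]string) {
    log.Printf("metric %s=%f tags=%v", name, value, tags)
}

func main() {
    // Endpoint and token are placeholders; see the earlier Firehose sketch for details.
    c := consumer.New("wss://doppler.sys.example.com:443", &tls.Config{InsecureSkipVerify: true}, nil)
    defer c.Close()

    envelopes, errs := c.Firehose("example-nozzle", "bearer PLACEHOLDER-TOKEN")
    go func() {
        for err := range errs {
            log.Printf("firehose error: %v", err)
        }
    }()

    for e := range envelopes {
        tags := map[string]string{
            "deployment": e.GetDeployment(),
            "job":        e.GetJob(),
            "index":      e.GetIndex(),
        }
        switch e.GetEventType() {
        case events.Envelope_ValueMetric:
            vm := e.GetValueMetric()
            forwardToPlatform(e.GetOrigin()+"."+vm.GetName(), vm.GetValue(), tags)
        case events.Envelope_CounterEvent:
            ce := e.GetCounterEvent()
            forwardToPlatform(e.GetOrigin()+"."+ce.GetName(), float64(ce.GetTotal()), tags)
        default:
            // LogMessage, HttpStartStop, and other envelope types are ignored
            // by a metrics-only nozzle.
        }
    }
}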

Deploy a Custom System Metrics App

For production systems, Pivotal recommends deploying an app that runs regular smoke tests and other custom tests and generates metrics from the results.

A custom system metrics app sends metrics to the monitoring platform directly, so you must configure it with your platform endpoint and account information. The app does not run a Metron agent, and the Firehose does not carry custom system metrics.

The app can run in its own Docker container, on a Concourse VM, or elsewhere.
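
Wherever it runs, the app submits metrics over your platform's API. As an illustration, the following Go sketch sends a single gauge value directly to a generic HTTP metrics endpoint. The endpoint, header name, and payload shape are hypothetical; replace them with the API that your monitoring platform documents, and supply the endpoint and credentials through the app's configuration.

package main

import (
    "bytes"
    "encoding/json"
    "log"
    "net/http"
    "os"
    "time"
)

// metricPoint is a generic payload shape; adjust it to match the API of the
// monitoring platform you use (the fields here are hypothetical).
type metricPoint struct {
    Name      string            `json:"name"`
    Value     float64           `json:"value"`
    Timestamp int64             `json:"timestamp"`
    Tags      map[string]string `json:"tags"`
}

func sendMetric(name string, value float64) error {
    point := metricPoint{
        Name:      name,
        Value:     value,
        Timestamp: time.Now().Unix(),
        Tags:      map[string]string{"deployment": "cf", "source": "smoke-tests"},
    }
    body, err := json.Marshal(point)
    if err != nil {
        return err
    }
    // Endpoint and credentials come from the environment so the app can be
    // configured per foundation without code changes.
    req, err := http.NewRequest("POST", os.Getenv("METRICS_ENDPOINT"), bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("X-Api-Key", os.Getenv("METRICS_API_KEY")) // header name varies by platform
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}

func main() {
    if err := sendMetric("smoke_tests.passed", 1); err != nil {
        log.Printf("failed to send metric: %v", err)
    }
}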

See the Pivotal Cloud Ops CF Smoke Tests repository for more information and examples of smoke test and custom system metrics apps.

See the Metrics topic in the Concourse documentation for how to set up Concourse to generate custom system metrics.

Configure Your Monitoring Platform

Monitoring platforms support two types of monitoring:

  • A dashboard for active monitoring when you are at a keyboard and screen
  • Automated alerts for when your attention is elsewhere

Some monitoring solutions offer both in one package. Others require putting the two pieces together.

See the Datadog Config repository for an example of how to configure a dashboard and alerts for Cloud Foundry in Datadog.

Determine Warning and Critical Thresholds

To properly configure your monitoring dashboard and alerts, you need to establish which thresholds should drive alerting and red/yellow/green dashboard behavior. PCF operators can refer to the recommended Key Performance Indicators and Key Capacity Scaling Indicators topics.

Some key metrics of operational value have more fixed thresholds, and recommended threshold numbers for them are generally the same values across different foundations and use cases. These metrics tend to revolve around the health and performance of key components that can impact the performance of the entire system.

Other key metrics of operational value are more dynamic in nature, which means that you must establish a baseline and yellow/red thresholds suitable for your system and its use cases. To establish an initial baseline, watch the values of key metrics over time and note a starting threshold that separates acceptable from unacceptable system performance and health. Continue to refine these baselines as you find the appropriate balance between early detection and unnecessary alert fatigue, and revisit them periodically to ensure they still fit the current system configuration and its usage patterns.
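
As a simple illustration of deriving starting thresholds from observed values, the Go sketch below computes a baseline mean and standard deviation over a set of samples and sets the yellow and red levels two and three standard deviations above the mean. The multipliers and sample values are arbitrary examples, not recommended settings.

package main

import (
    "fmt"
    "math"
)

// thresholds derives illustrative warning (yellow) and critical (red) levels
// from observed metric samples: mean plus 2 and 3 standard deviations.
// The multipliers are starting points to refine against alert fatigue.
func thresholds(samples []float64) (warning, critical float64) {
    var sum float64
    for _, v := range samples {
        sum += v
    }
    mean := sum / float64(len(samples))

    var variance float64
    for _, v := range samples {
        variance += (v - mean) * (v - mean)
    }
    stddev := math.Sqrt(variance / float64(len(samples)))

    return mean + 2*stddev, mean + 3*stddev
}

func main() {
    // Hypothetical hourly averages of a dynamic metric observed over a baseline period.
    observed := []float64{12, 15, 11, 14, 13, 16, 12, 13}
    warn, crit := thresholds(observed)
    fmt.Printf("warning above %.1f, critical above %.1f\n", warn, crit)
}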

Customize Your Dashboard

You customize a dashboard by defining elements on the screen that show values derived from one or more metrics. These dashboard elements typically use simple formulas, such as averaging metric values over the past 60 seconds or summing them up over related instances. They are also often normalized to display with 3 or fewer digits for easy reading and color-coded red, yellow, or green to indicate health conditions.

Datadog dashboard

In Datadog, for example, you can define a screen_template query that watches the auctioneer.AuctioneerLRPAuctionsFailed metric and displays its current average over the past minute.
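
Behind such an element, the dashboard applies a simple formula like the ones described above. The Go sketch below mirrors that kind of calculation: it averages per-instance values collected over the past minute and maps the result to a green, yellow, or red status. The sample values and thresholds are illustrative only.

package main

import "fmt"

// status maps an aggregated metric value to the color a dashboard element
// would show; the thresholds passed in are illustrative only.
func status(value, warn, crit float64) string {
    switch {
    case value >= crit:
        return "red"
    case value >= warn:
        return "yellow"
    default:
        return "green"
    }
}

func main() {
    // Hypothetical per-instance values of a metric sampled over the past 60 seconds.
    perInstance := []float64{0.0, 1.0, 0.0, 2.0}

    // Dashboard elements typically apply a simple formula such as a sum or average.
    var sum float64
    for _, v := range perInstance {
        sum += v
    }
    avg := sum / float64(len(perInstance))

    fmt.Printf("avg=%.2f status=%s\n", avg, status(avg, 1, 3))
}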

Create Alerts

You create alerts by defining boolean conditions based on operations over one or more metrics, and an action that the platform takes when an alert triggers. The booleans typically check whether metric values exceed or fall below thresholds, or compare metric values against each other.

In Datadog, for example, you can define an alert_template condition that triggers when the auctioneer.AuctioneerLRPAuctionsFailed metric indicates an average of more than one failed auction per minute for the past 15 minutes:

{
  "query": "min(last_15m):per_minute(avg:datadog.nozzle.auctioneer.AuctioneerLRPAuctionsFailed{deployment:<%= metron_agent_diego_deployment %>}) > 1",
  "message": "##Description:\nDiego internal metrics \n\n## Escalation Path:\nDiego \n\n## Possible Causes:\nThose alerts were a pretty strong signal for us to look at the BBS, which was locked up\n\n## Potential Solutions:\nEscalate to Diego team\n><%= cloudops_pagerduty %> <%= diego_oncall %>",
  "name": "<%= environment %> Diego: LRP Auction Failure per min is too high",
  "no_data_timeframe": 30,
  "notify_no_data": false
}

Actions that an alert triggers can include sending a pager or SMS message, sending an email, generating a support ticket, or passing the alert to an alerting system such as PagerDuty.

Monitoring Platforms

Some monitoring solutions offer both alerts and a dashboard in one package, while others require separate packages for alerts and dashboard.

Popular monitoring platforms among PCF customers include SignalFx, Datadog, OpenTSDB, New Relic, vRealize Operations (vROps), AppDynamics, and Dynatrace.
