PCF Healthwatch v1.1

Monitoring PCF Healthwatch

This topic explains how to monitor the health of Pivotal Cloud Foundry (PCF) Healthwatch using the metrics and key performance indicators (KPIs) generated by the service.

For general information about monitoring PCF, see Monitoring Pivotal Cloud Foundry.

About Metrics

PCF Healthwatch emits metrics in the following format:

origin:"healthwatch" eventType:ValueMetric timestamp:1509638101820496694 deployment:"healthwatch-app-dev-v1-1" job:"healthwatch-forwarder" index:"097f4b1e-5ca8-4866-82d5-00883798dad4" ip:"10.0.16.29" valueMetric:<name:"healthwatch.metrics.published" value:38 unit:"count">

All PCF Healthwatch-emitted metrics have the healthwatch origin.
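
As an illustration of the format above, the following minimal Python sketch splits one such line into key/value fields. The function name parse_metric_line and the regular expression are assumptions made for this example. They are not part of PCF Healthwatch or of any Firehose tooling, and the sketch handles only the flat text form shown here.

```python
import re

# One key:value pair; the value is either a quoted string or a bare token.
# Hypothetical helper pattern, written only for the text format shown above.
PAIR = re.compile(r'(\w+):(?:"([^"]*)"|([^\s<>]+))')


def parse_metric_line(line):
    """Return the fields of one text-formatted metric line as a dict of strings."""
    fields = {}
    for key, quoted, bare in PAIR.findall(line):
        # Exactly one of the two capture groups is non-empty for a match.
        fields[key] = quoted or bare
    return fields


sample = (
    'origin:"healthwatch" eventType:ValueMetric timestamp:1509638101820496694 '
    'deployment:"healthwatch-app-dev-v1-1" job:"healthwatch-forwarder" '
    'index:"097f4b1e-5ca8-4866-82d5-00883798dad4" ip:"10.0.16.29" '
    'valueMetric:<name:"healthwatch.metrics.published" value:38 unit:"count">'
)
fields = parse_metric_line(sample)
print(fields["origin"], fields["name"], fields["value"], fields["unit"])
```

Filtering parsed lines on an origin field equal to healthwatch would then select only the metrics that PCF Healthwatch emits.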

Key Performance Indicators for PCF Healthwatch

This section describes the KPIs that you can use to monitor the health of PCF Healthwatch.

Number of Healthwatch Nozzle Disconnects from Firehose


healthwatch.ingestor.disconnects

Description: Number of forced disconnects of the PCF Healthwatch data ingestor nozzle from the Firehose.

Use: An unusual increase in the number of disconnects from the Firehose typically indicates that you need to scale the nozzle up. The Firehose disconnects nozzles that are slow consumers to protect apps from backpressure. This metric can also spike during a PCF deployment because the Traffic Controller VMs restart, logging a disconnect.

A prolonged period of losing metrics as a result of disconnects can endanger the assessments that PCF Healthwatch makes using platform metrics from the Firehose.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: Dynamic
  Red critical: Dynamic
Recommended response: If no known deployment occurred and the spike is sustained, increase the number of PCF Healthwatch Ingestor instances and monitor this metric to ensure that it returns to a normal state.

You can scale Ingestor instances in the Healthwatch Component Config tab of the PCF Healthwatch tile or using the cf scale healthwatch-ingestor command. While cf scale helps you to quickly scale the instances, you should also update the tile configuration so that the next deployment does not override the manual scaling.
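
This topic does not define how a dynamic threshold is computed. As one possible approach, the sketch below flags an unusual increase when the latest 5-minute average exceeds a baseline mean by a multiple of the baseline standard deviation. The function name is_unusual_increase, the multiplier k, and the choice of baseline window are all assumptions for illustration, not values recommended by PCF Healthwatch.

```python
from statistics import mean, pstdev


def is_unusual_increase(baseline, recent, k=3.0):
    """Flag a spike relative to historical behavior.

    baseline: historical 5-minute averages of the metric (assumed input).
    recent:   the per-minute samples from the last 5 minutes.
    k:        hypothetical sensitivity multiplier; tune for your environment.
    """
    threshold = mean(baseline) + k * pstdev(baseline)
    return mean(recent) > threshold


# Example: a quiet baseline and a burst of disconnects.
quiet_baseline = [2, 3, 2, 3, 2, 3]
print(is_unusual_increase(quiet_baseline, [10, 12, 9, 11, 10]))
```

Because this metric can also spike during a PCF deployment, a check like this is best combined with knowledge of whether a deployment was in progress.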

Number of Metrics Dropped by Healthwatch Data Loader


healthwatch.loader.dropped

Description: Number of metrics dropped by the PCF Healthwatch data loader, which loads incoming data into the PCF Healthwatch datastore.

Use: An unusual increase in the number of metrics dropped by the PCF Healthwatch Loader likely indicates that you need to scale this component up. A prolonged period of dropping metrics can endanger the assessments that PCF Healthwatch makes using platform metrics from the Firehose.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: Dynamic
  Red critical: Dynamic
Recommended response: Increase the number of PCF Healthwatch Loader instances and monitor this metric to ensure that it returns to a normal state.

You can scale Loader instances in the Healthwatch Component Config tab of the PCF Healthwatch tile or using the cf scale healthwatch-loader command. While cf scale helps you to quickly scale the instances, you should also update the tile configuration so that the next deployment does not override the manual scaling.

PCF Healthwatch UI Availability


healthwatch.ui.available

Description: The PCF Healthwatch UI is currently available. This assessment is made using a probe that looks for a successful response: 1 = available, 0 = not available, or timeout (10 s).

Use: Indicates that the PCF Healthwatch UI is running and available to product users. An issue with the UI does not affect the assessments that PCF Healthwatch makes, but loss of the UI can prevent users from viewing those assessments.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: ≤ 0.6
  Red critical: ≤ 0.4
Recommended response:
  1. Ensure the healthwatch app is running in the healthwatch space of the system org.
  2. Check the app logs for any obvious errors.
  3. Verify that the /info endpoint is available on the healthwatch app route.
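
The recommended measurement and alert thresholds for healthwatch.ui.available can be sketched as follows: average the 0/1 samples from the last 5 minutes, then compare the result against the yellow and red thresholds. The function name classify_availability and the returned labels are assumptions for this example. Only the threshold values come from this topic.

```python
def classify_availability(samples, yellow=0.6, red=0.4):
    """Classify a window of 0/1 availability samples.

    samples: the metric values from the last 5 minutes (one per 60 s emission).
    Returns "red", "yellow", or "ok" (labels chosen for this sketch).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    average = sum(samples) / len(samples)
    if average <= red:
        return "red"
    if average <= yellow:
        return "yellow"
    return "ok"


print(classify_availability([1, 1, 1, 0, 0]))
```

With one sample per minute, a 5-minute window holds five values, so two failed probes out of five reach the yellow threshold and three reach the red threshold.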

CLI Health Test Availability


health.check.cliCommand.probe.available

Description: PCF Healthwatch has up-to-date results for the CLI Command Health Test, which means that the test was recently available.

Metric values: 1 = available, 0 = not available, or timeout (10 s)

This assessment is made by looking for results within the configured test schedule plus timeout. For example, a test runner scheduled on 5-minute intervals with a 2-minute timeout must show a test result within the last 7 minutes to succeed.

Use: Indicates that PCF Healthwatch is assessing the health of the Cloud Foundry Command Line Interface (cf CLI) commands. If these continuous validation tests fail to make up-to-date assessments, they are no longer a reliable warning mechanism.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: ≤ 0.6
  Red critical: ≤ 0.4
Recommended response:
  1. Ensure the cf-health-check app is running in the healthwatch space of the system org.
  2. Check the app logs for any obvious errors.
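
The up-to-date rule in the description above (a result must exist within the test schedule plus the timeout) can be sketched as a simple time comparison. The function name results_are_fresh and its parameters are hypothetical. The 5-minute interval and 2-minute timeout are the example values from this topic, not necessarily your configured values.

```python
from datetime import datetime, timedelta


def results_are_fresh(last_result_at, now, interval, timeout):
    """Return True if the latest test result falls within interval + timeout."""
    return now - last_result_at <= interval + timeout


now = datetime(2017, 11, 2, 12, 0, 0)
interval = timedelta(minutes=5)
timeout = timedelta(minutes=2)

# A result from 6 minutes ago is inside the 7-minute window.
print(results_are_fresh(now - timedelta(minutes=6), now, interval, timeout))
# A result from 8 minutes ago is outside the window.
print(results_are_fresh(now - timedelta(minutes=8), now, interval, timeout))
```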

Canary App Health Test Availability


health.check.CanaryApp.probe.available

Description: PCF Healthwatch has up-to-date results for the Canary App Health Test, which means that the test was recently available.

Metric values: 1 = available, 0 = not available, or timeout (10 s)

This assessment of up-to-date results is made by looking for results within the configured test schedule plus timeout. For example, a test runner scheduled on 5-minute intervals with a 2-minute timeout must show a test result within the last 7 minutes to succeed.

Use: Indicates that PCF Healthwatch is assessing the current state of health for the canary app. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: ≤ 0.6
  Red critical: ≤ 0.4
Recommended response:
  1. Ensure the canary-health-check app is running in the healthwatch space of the system org. Check the app logs for any obvious errors.
  2. Verify that Apps Manager is running and accessible through the URL configured in the CANARY_URL environment variable of the canary-health-check app.

BOSH Director Health Test Availability


health.check.bosh.director.probe.available

Description: PCF Healthwatch has up-to-date results for the BOSH Director Health Test, which means that the test was recently available.

Metric values: 1 = available, 0 = not available, or timeout (10 s)

This assessment of up-to-date results is made by looking for results within the configured test schedule plus timeout. For example, a test runner scheduled on 5-minute intervals with a 2-minute timeout must show a test result within the last 7 minutes to succeed.

Use: Indicates that PCF Healthwatch is assessing the current state of health for the BOSH Director. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: ≤ 0.6
  Red critical: ≤ 0.4
Recommended response:
  1. Ensure the bosh-health-check app is running in the healthwatch space of the system org. Check the app logs for any obvious errors.
  2. SSH into the running bosh-health-check app and copy the BOSH manifest from /home/vcap/app/health_check_manifest.yml. Try to deploy it manually on the BOSH Director and check for errors.

Ops Manager Health Test Availability


health.check.OpsMan.probe.available

Description: PCF Healthwatch has up-to-date results for the Ops Manager Health Test, which means that the test was recently available.

Metric values: 1 = available, 0 = not available, or timeout (10 s)

This assessment of up-to-date results is made by looking for results within the configured test schedule plus timeout. For example, a test runner scheduled on 5-minute intervals with a 2-minute timeout must show a test result within the last 7 minutes to succeed.

Use: Indicates that PCF Healthwatch is assessing the current state of health for Ops Manager. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: ≤ 0.6
  Red critical: ≤ 0.4
Recommended response:
  1. Ensure the opsmanager-health-check app is running in the healthwatch space of the system org. Check the app logs for any obvious errors.
  2. Verify that Ops Manager is running and accessible through the URL configured in the OPSMANAGER_URL environment variable of the opsmanager-health-check app.

Number of Healthwatch Super Metrics Published to Firehose


healthwatch.metrics.published

Description: Number of PCF Healthwatch metrics published back to the Firehose.

Use: If an operator has not made changes that impact the number or frequency of assessments, an unusual drop in the number of metrics published can indicate that PCF Healthwatch is experiencing a computation or publication issue.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement: Average over last 5 minutes
Recommended alert thresholds:
  Yellow warning: Dynamic
  Red critical: Dynamic
Recommended response:
  1. Verify that the healthwatch-forwarder VM is running.
  2. Check all of the logs in /var/vcap/sys/log on the VM.
  3. Verify that the *-health-check apps in the healthwatch space of the system org are running and that their logs show no obvious errors.

Number of Healthwatch Continuous Validation Tests Executed

This section describes the metrics that you can use to monitor the number of continuous validation tests executed by PCF Healthwatch.

CLI Command Health


health.check.cliCommand.probe.count

Description: Number of PCF Healthwatch CLI Command Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.cliCommand.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 5 minutes across 2 runner apps.

Origin: Firehose
Type: Gauge
Frequency: 60 s
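
One way to watch for the negative variance described above is to compare the recent average probe count against a baseline taken during normal operation. The sketch below is an assumption-laden illustration: the function name has_negative_variance and the 25% tolerance are hypothetical and are not values defined by PCF Healthwatch.

```python
def has_negative_variance(baseline_counts, recent_counts, tolerance=0.25):
    """Return True if recent probe counts fall well below the baseline.

    baseline_counts: probe.count samples from a period of normal operation.
    recent_counts:   the most recent probe.count samples.
    tolerance:       hypothetical allowed fractional drop before flagging.
    """
    expected = sum(baseline_counts) / len(baseline_counts)
    observed = sum(recent_counts) / len(recent_counts)
    return observed < expected * (1 - tolerance)


print(has_negative_variance([2, 2, 2, 2], [1, 0, 1, 1]))
```

Reset the baseline after an operator scales the test runner or changes the test frequency, because those changes legitimately alter the normal count.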

Ops Manager Health


health.check.OpsMan.probe.count

Description: Number of PCF Healthwatch Ops Manager Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.OpsMan.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 1 minute across 2 runner apps.

Origin: Firehose
Type: Gauge
Frequency: 60 s

Canary App Health


health.check.CanaryApp.probe.count

Description: Number of PCF Healthwatch Canary App Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.CanaryApp.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 1 minute across 2 runner apps.

Origin: Firehose
Type: Gauge
Frequency: 60 s

BOSH Director Health


health.check.bosh.director.probe.count

Description: Number of PCF Healthwatch BOSH Director Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.bosh.director.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 10 minutes using 1 runner app.

Origin: Firehose
Type: Gauge
Frequency: 60 s

Other Metrics Available

This section describes other metrics that you can use to monitor PCF Healthwatch.

BOSH Deployment Check Probe


health.bosh.deployment.probe.count

Description: Number of PCF Healthwatch BOSH Deployment Occurrence probes completed in the measured time interval.

Use: When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 30 seconds across 2 runner apps.

Origin: Firehose
Type: Gauge
Frequency: 60 s