Monitoring PCF Healthwatch

Warning: PCF Healthwatch v1.6 is no longer supported or available for download. PCF Healthwatch v1.6 has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.

This topic explains how to monitor the health of Pivotal Cloud Foundry (PCF) Healthwatch using the metrics and key performance indicators (KPIs) generated by the service.

For general information about monitoring PCF, see Monitoring Pivotal Cloud Foundry.

About Metrics

PCF Healthwatch emits metrics in the following format:

origin:"healthwatch" eventType:ValueMetric timestamp:1509638101820496694 deployment:"healthwatch-app-dev-v1-3"
job:"healthwatch-forwarder" index:"097f4b1e-5ca8-4866-82d5-00883798dad4" ip:"10.0.16.29" 
valueMetric:<name:"metrics.published" value:38 unit:"count">

All PCF Healthwatch-emitted metrics have the healthwatch origin.
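
Because PCF Healthwatch metrics are also readable from Log Cache under the healthwatch-forwarder source ID, you can spot-check any PromQL expression in this topic from a workstation. The following is a minimal sketch, assuming the Log Cache CLI plugin (which provides the cf query command) is installed and you are logged in with admin permissions. The $deployment label in the queries in this topic is a dashboard template variable; replace it with your deployment name, or drop it when querying directly:

# Run an instant PromQL query against Log Cache.
# The expression below is the Healthwatch UI availability SLI from this topic.
cf query 'avg_over_time(healthwatch_ui_available{source_id="healthwatch-forwarder"}[5m])'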

Service Level Indicators for PCF Healthwatch

Service Level Indicators (SLIs) verify that key features of the PCF Healthwatch product are working as expected. These SLIs are the most important operational metrics emitted about Healthwatch itself, because they indicate the reliability of the assessments that Healthwatch makes.

CLI Health Test Availability

Description

Use: Indicates that PCF Healthwatch is assessing the health of the Cloud Foundry Command Line Interface (cf CLI) commands. If these continuous validation tests fail to make up-to-date assessments, they are no longer a reliable warning mechanism.

Metrics
name: health.check.cliCommand.probe.available
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: avg_over_time(health_check_cliCommand_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])

Thresholds:
  Yellow warning: <= 0.6
  Red critical: <= 0.4

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response
  1. Ensure the cf-health-check app is running in the healthwatch space of the system org.
  2. Check the app logs for any obvious errors, as shown in the sketch after this list.
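
A minimal sketch of these checks with the cf CLI, assuming you are logged in as a user with access to the system org:

# Target the space where PCF Healthwatch deploys its test apps.
cf target -o system -s healthwatch

# Confirm the app is running and all instances are healthy.
cf app cf-health-check

# Scan recent app logs for obvious errors.
cf logs cf-health-check --recent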

Canary App Health Test Availability

Description

Use: Indicates that PCF Healthwatch is assessing the current state of health for the canary app. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism.

Metrics
name: health.check.CanaryApp.probe.available
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: avg_over_time(health_check_CanaryApp_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])

Thresholds:
  Yellow warning: <= 0.6
  Red critical: <= 0.4

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response
  1. Ensure the canary-health-check app is running in the healthwatch space of the system org. Check the app logs for any obvious errors.
  2. Verify that Apps Manager is running and accessible through the URL configured in the CANARY_URL environment variable of the canary-health-check app, as shown in the sketch after this list.
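
A minimal sketch of the URL check with the cf CLI; the URL passed to curl is a placeholder for the value returned by cf env:

# Read the configured Apps Manager URL from the app environment.
cf target -o system -s healthwatch
cf env canary-health-check | grep CANARY_URL

# Verify the URL responds; substitute the value returned above.
curl -sSL -o /dev/null -w '%{http_code}\n' https://apps.example.com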

BOSH Director Health Test Availability

Description

Use: Indicates that PCF Healthwatch is assessing the current state of health for the BOSH Director. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism.

Metrics
name: health.check.bosh.director.probe.available
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: avg_over_time(health_check_bosh_director_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])

Thresholds:
  Yellow warning: <= 0.6
  Red critical: <= 0.4

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response
  1. Ensure the bosh-health-check app is running in the healthwatch space of the system org. Check the app logs for any obvious errors.
  2. SSH into the running bosh-health-check app and copy the BOSH manifest from /home/vcap/app/health_check_manifest.yml. Try to deploy it manually on the BOSH Director and check for errors, as shown in the sketch after this list.
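
A minimal sketch of the manual check, assuming the cf and bosh CLIs are installed; BOSH_ENV is a placeholder alias for your Director, and DEPLOYMENT-NAME must match the name field inside the copied manifest:

# Copy the generated health-check manifest out of the running app.
cf ssh bosh-health-check -c 'cat /home/vcap/app/health_check_manifest.yml' > health_check_manifest.yml

# Attempt the same deployment by hand and watch for errors.
bosh -e BOSH_ENV -d DEPLOYMENT-NAME deploy health_check_manifest.yml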

Ops Manager Health Test Availability

Description

Use: Indicates that PCF Healthwatch is assessing the current state of health for Ops Manager. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism.

Metrics
name: health.check.OpsMan.probe.available
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: avg_over_time(health_check_OpsMan_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])

Thresholds:
  Yellow warning: <= 0.6
  Red critical: <= 0.4

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response
  1. Ensure the opsmanager-health-check app is running in the healthwatch space of the system org. Check the app logs for any obvious errors.
  2. Verify that Ops Manager is running and accessible through the URL configured in the OPSMANAGER_URL environment variable of the opsmanager-health-check app, as shown in the sketch after this list.
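
A minimal sketch of the URL check; the -k flag tolerates a self-signed Ops Manager certificate, and the URL passed to curl is a placeholder for the value returned by cf env:

# Read the configured Ops Manager URL from the app environment.
cf target -o system -s healthwatch
cf env opsmanager-health-check | grep OPSMANAGER_URL

# Verify Ops Manager answers at that URL; substitute the value returned above.
curl -ksS -o /dev/null -w '%{http_code}\n' https://opsman.example.com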

PCF Healthwatch UI Availability

Description

Use: Indicates that the Healthwatch UI is running and available to product users. While an issue with the UI does not impact the assessments that PCF Healthwatch is making, loss of the UI can impact user ability to visually reference these assessments.

Metrics
name: healthwatch.ui.available
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: avg_over_time(healthwatch_ui_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])

Thresholds:
  Yellow warning: <= 0.6
  Red critical: <= 0.4

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response
  1. Ensure the healthwatch app is running in the healthwatch space of the system org.
  2. Check the app logs for any obvious errors.
  3. Verify that the /info endpoint is available on the healthwatch app route, as shown in the sketch after this list.
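
A minimal sketch of these checks; the route passed to curl is a placeholder for the route reported by cf app:

# Confirm the UI app is running, scan its logs, and note its route.
cf target -o system -s healthwatch
cf app healthwatch
cf logs healthwatch --recent

# Hit the /info endpoint on the app route; substitute your route.
curl -sS https://healthwatch.sys.example.com/info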

Key Performance Indicators for PCF Healthwatch

This section describes the KPIs that you can use to monitor the health of PCF Healthwatch.

Number of PCF Healthwatch Nozzle Disconnects from Firehose

Description

Use: An unusual increase in the number of disconnects from the Firehose typically indicates that you need to scale the nozzle up. The Firehose disconnects nozzles that are slow consumers to protect apps from backpressure. This metric can also spike during a PCF deployment because the Traffic Controller VMs restart, logging a disconnect.

A prolonged period of losing metrics as a result of disconnects can endanger the assessments that PCF Healthwatch makes using platform metrics from the Firehose.

Metrics
name: ingestor.disconnects
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: avg_over_time(ingestor_disconnects{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])

Thresholds:
  Yellow warning: >= 20
  Red critical: >= 30

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response

If no known deployment occurred and the spike is sustained, increase the number of PCF Healthwatch Ingestor instances and monitor this metric to ensure that it returns to a normal state.

You can scale Ingestor instances in the Healthwatch Component Config tab of the PCF Healthwatch tile or using the cf scale healthwatch-ingestor command. While cf scale helps you to quickly scale the instances, you should also update the tile configuration so that the next deployment does not override the manual scaling.
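
For example, a quick manual scale-up with the cf CLI; the instance count here is illustrative, not a recommendation:

# Scale the ingestor, then watch ingestor.disconnects return to normal.
cf target -o system -s healthwatch
cf scale healthwatch-ingestor -i 3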

Number of Ingestor Dropped Metrics

Description

Use: An unusual increase in the number of dropped messages by the PCF Healthwatch Ingestor likely indicates that you need to scale up this component and verify the health of Redis. A prolonged period of dropping messages can endanger the assessments that PCF Healthwatch makes using platform metrics from the Firehose.

Metrics
name: ingestor.dropped
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: avg_over_time(ingestor_dropped{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])

Thresholds:
  Yellow warning: >= 10
  Red critical: >= 20

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response

Verify the health of the Redis VM and increase the number of PCF Healthwatch Ingestor instances. Monitor this metric to ensure that it returns to a normal state.

You can scale Ingestor instances using the cf scale healthwatch-ingestor command. While cf scale helps you to quickly scale the instances, you should also update the Ingestor Count in the tile configuration, located in the Healthwatch Component Config tab. Otherwise, the next Apply Changes will override the manual scaling.
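
A minimal sketch of the Redis VM health check with the bosh CLI; BOSH_ENV and HEALTHWATCH-DEPLOYMENT are placeholders for your Director alias and Healthwatch deployment name:

# List VM health for the deployment, including CPU, memory, and disk vitals.
bosh -e BOSH_ENV -d HEALTHWATCH-DEPLOYMENT vms --vitals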

Redis Queue Size

Description

Use: An unusual spike in the number of queued metrics can indicate that PCF Healthwatch Workers are unable to keep up with the volume of metrics from the Firehose. A large Redis queue will result in value metrics and counter events being delayed; if the queue becomes completely full, metrics will be lost altogether. This will also adversely affect PCF Healthwatch’s ability to calculate super value metrics.

Metrics
name: redis.valueMetricQueue.size
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

name: redis.counterEventQueue.size
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: redis_valueMetricQueue_size{source_id="healthwatch-forwarder",deployment="$deployment"} + redis_counterEventQueue_size{source_id="healthwatch-forwarder",deployment="$deployment"}

Thresholds:
  Red critical: >= 10000

Measurement

Average over last 5 minutes

Recommended Response

If the spike is sustained, increase the number of PCF Healthwatch Worker instances and monitor this metric to ensure that it returns to a normal state.

You can scale Worker instances in the Healthwatch Component Config tab of the PCF Healthwatch tile or using the cf scale healthwatch-worker command. While cf scale helps you to quickly scale the instances, you should also update the tile configuration so that the next deployment does not override the manual scaling.
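
To confirm that the queues are draining after a scale-up, you can query the combined depth directly; a minimal sketch, assuming the Log Cache CLI plugin is installed:

# Current combined queue depth across both Redis queues.
cf query 'redis_valueMetricQueue_size{source_id="healthwatch-forwarder"} + redis_counterEventQueue_size{source_id="healthwatch-forwarder"}'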

Number of Healthwatch Super Metrics Published to Firehose

Description

Use: If an operator has not made changes that impact the number or frequency of assessments, an unusual drop in the number of metrics published can indicate that PCF Healthwatch may be experiencing a computation or publication issue.

Metrics
name: metrics.published
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: sum(metrics_published{source_id="healthwatch-forwarder",deployment="$deployment"}) by (job)

Thresholds:
  Yellow warning: <= 20
  Red critical: <= 10

These thresholds are environment-specific.

Measurement

Average over last 5 minutes

Recommended Response
  1. Verify that the healthwatch-forwarder VM is running.
  2. Check all of the logs in /var/vcap/sys/log on the VM, as shown in the sketch after this list.
  3. Verify that the *-health-check apps in the healthwatch space of the system org are running and that their logs do not show any obvious errors.
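
A minimal sketch of the first two steps with the bosh CLI; BOSH_ENV and HEALTHWATCH-DEPLOYMENT are placeholders for your Director alias and Healthwatch deployment name:

# Confirm the VM is running, then SSH into it.
bosh -e BOSH_ENV -d HEALTHWATCH-DEPLOYMENT vms
bosh -e BOSH_ENV -d HEALTHWATCH-DEPLOYMENT ssh healthwatch-forwarder

# On the VM, review the component logs.
ls /var/vcap/sys/log
sudo tail -n 100 /var/vcap/sys/log/*/*.log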

Other Metrics Available

This section describes other metrics that you can use to monitor PCF Healthwatch.

Number of Healthwatch Events Published to PCF Event Alerts

Description

Number of PCF Healthwatch alerting events triggered and published to [PCF Event Alerts](http://docs.pivotal.io/event-alerts/index.html).

Use: This metric is primarily informational. Because the number of alerting events can vary greatly, Pivotal does not recommend alerting on this metric itself.

name: events.published
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: events_published{source_id="healthwatch-forwarder",deployment="$deployment"}

BOSH Deployment Check Probe Count

Description

Number of PCF Healthwatch BOSH Deployment Occurrence probes completed in the measured time interval.

Use: When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 30 seconds across 2 runner apps.

name: health.bosh.deployment.probe.count
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: health_bosh_deployment_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}

CLI Command Health Probe Count

Description

Number of PCF Healthwatch CLI Command Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.cliCommand.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 5 minutes across 2 runner apps.

name: health.check.cliCommand.probe.count
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: health_check_cliCommand_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}

Ops Manager Health Probe Count

Description

Number of PCF Healthwatch Ops Manager Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.OpsMan.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 1 minute across 2 runner apps.

name: health.check.OpsMan.probe.count
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: health_check_OpsMan_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}

Canary App Health Probe Count

Description

Number of PCF Healthwatch Canary App Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.CanaryApp.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 1 minute across 2 runner apps.

name: health.check.CanaryApp.probe.count
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: health_check_CanaryApp_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}

BOSH Director Health Probe Count

Description

Number of PCF Healthwatch BOSH Director Health probe assessments completed in the measured time interval.

Use: For alerting purposes, Pivotal suggests using health.check.bosh.director.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

In the default installation, these tests run every 10 minutes using 1 runner app.

name: health.check.bosh.director.probe.count
firehose origin: healthwatch
log-cache source id: healthwatch-forwarder
type: gauge
frequency: 60s

PromQL: health_check_bosh_director_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}