Monitoring PCF Healthwatch
Warning: PCF Healthwatch v1.6 is no longer supported or available for download. PCF Healthwatch v1.6 has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
This topic explains how to monitor the health of Pivotal Cloud Foundry (PCF) Healthwatch using the metrics and key performance indicators (KPIs) generated by the service.
For general information about monitoring PCF, see Monitoring Pivotal Cloud Foundry.
About Metrics
PCF Healthwatch emits metrics in the following format:
origin:"healthwatch" eventType:ValueMetric timestamp:1509638101820496694 deployment:"healthwatch-app-dev-v1-3"
job:"healthwatch-forwarder" index:"097f4b1e-5ca8-4866-82d5-00883798dad4" ip:"10.0.16.29"
valueMetric:<name:"metrics.published" value:38 unit:"count">
All PCF Healthwatch-emitted metrics have the `healthwatch` origin.
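For reference, here is a minimal sketch of how these metrics are addressed when queried through a PromQL-compatible endpoint such as Log Cache (which is what the queries later in this topic assume): the dotted Firehose name in the envelope above, `metrics.published`, is written with underscores in PromQL, and every Healthwatch series shares the same source id.

```
# Query the metric from the envelope above (metrics.published); dots in the
# Firehose name become underscores in PromQL:
metrics_published{source_id="healthwatch-forwarder"}

# Select every series emitted by PCF Healthwatch for one deployment
# ($deployment is a template variable, as in the queries later in this topic):
{source_id="healthwatch-forwarder", deployment="$deployment"}
```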
Service Level Indicators for PCF Healthwatch
Service Level Indicators (SLIs) verify that key features of the PCF Healthwatch product are working as expected. These SLIs are the most important operational metrics emitted about Healthwatch itself, because they indicate the reliability of the assessments Healthwatch is making.
CLI Health Test Availability
Description | Use: Indicates that PCF Healthwatch is assessing the health of Cloud Foundry Command Line Interface (cf CLI) commands. If these continuous validation tests fail to make up-to-date assessments, they are no longer a reliable warning mechanism. |
---|---|
PromQL | `avg_over_time(health_check_cliCommand_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | |
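If you evaluate Healthwatch metrics in a Prometheus-compatible system, the PromQL and thresholds above translate directly into alerting rules. The following is a sketch only: the group, alert, and severity names are illustrative, the thresholds are the environment-specific defaults from the table, and the `deployment="$deployment"` matcher is omitted because `$deployment` is a dashboard-style template variable that you would pin or drop in a rule file.

```yaml
# Sketch of alerting rules for the CLI health test availability SLI.
# Assumes Healthwatch metrics are scraped into a Prometheus-compatible system;
# names are illustrative and thresholds are the defaults listed above.
groups:
- name: healthwatch-sli-cli
  rules:
  - alert: HealthwatchCliProbeAvailabilityLow
    expr: avg_over_time(health_check_cliCommand_probe_available{source_id="healthwatch-forwarder"}[5m]) <= 0.6
    for: 5m
    labels:
      severity: warning          # yellow threshold
  - alert: HealthwatchCliProbeAvailabilityCritical
    expr: avg_over_time(health_check_cliCommand_probe_available{source_id="healthwatch-forwarder"}[5m]) <= 0.4
    for: 5m
    labels:
      severity: critical         # red threshold
```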
Canary App Health Test Availability
Description | Use: Indicates that PCF Healthwatch is assessing the current state of health for the canary app. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism. |
---|---|
PromQL | `avg_over_time(health_check_CanaryApp_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | |
BOSH Director Health Test Availability
Description | Use: Indicates that PCF Healthwatch is assessing the current state of health for the BOSH Director. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism. |
---|---|
PromQL | `avg_over_time(health_check_bosh_director_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | |
Ops Manager Health Test Availability
Description | Use: Indicates that PCF Healthwatch is assessing the current state of health for Ops Manager. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism. |
---|---|
PromQL | `avg_over_time(health_check_OpsMan_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | |
PCF Healthwatch UI Availability
Description | Use: Indicates that the Healthwatch UI is running and available to product users. While an issue with the UI does not impact the assessments that PCF Healthwatch is making, loss of the UI can impact users' ability to visually reference those assessments. |
---|---|
PromQL | `avg_over_time(healthwatch_ui_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | |
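The Canary App, BOSH Director, Ops Manager, and Healthwatch UI availability SLIs above all use the same 5-minute window and the same default thresholds, so they can be expressed as one rule group. The sketch below alerts at the red (critical) threshold only and uses illustrative names; add a parallel set of rules at `<= 0.6` if you also want the yellow warning level.

```yaml
# Sketch: one rule group for the remaining availability SLIs, alerting at the
# default red threshold (<= 0.4). Group and alert names are illustrative.
groups:
- name: healthwatch-sli-availability
  rules:
  - alert: HealthwatchCanaryAppProbeUnavailable
    expr: avg_over_time(health_check_CanaryApp_probe_available{source_id="healthwatch-forwarder"}[5m]) <= 0.4
    labels:
      severity: critical
  - alert: HealthwatchBoshDirectorProbeUnavailable
    expr: avg_over_time(health_check_bosh_director_probe_available{source_id="healthwatch-forwarder"}[5m]) <= 0.4
    labels:
      severity: critical
  - alert: HealthwatchOpsManProbeUnavailable
    expr: avg_over_time(health_check_OpsMan_probe_available{source_id="healthwatch-forwarder"}[5m]) <= 0.4
    labels:
      severity: critical
  - alert: HealthwatchUiUnavailable
    expr: avg_over_time(healthwatch_ui_available{source_id="healthwatch-forwarder"}[5m]) <= 0.4
    labels:
      severity: critical
```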
Key Performance Indicators for PCF Healthwatch
This section describes the KPIs that you can use to monitor the health of PCF Healthwatch.
Number of PCF Healthwatch Nozzle Disconnects from Firehose
Description | Use: An unusual increase in the number of disconnects from the Firehose typically indicates that you need to scale the nozzle up. The Firehose disconnects nozzles that are slow consumers to protect apps from backpressure. This metric can also spike during a PCF deployment because the Traffic Controller VMs restart, which logs a disconnect. A prolonged period of losing metrics as a result of disconnects can endanger the assessments that PCF Healthwatch makes using platform metrics from the Firehose. |
---|---|
PromQL | `avg_over_time(ingestor_disconnects{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
Thresholds | Yellow warning: >= 20. Red critical: >= 30. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | If no known deployment occurred and the spike is sustained, increase the number of PCF Healthwatch Ingestor instances and monitor this metric to ensure that it returns to a normal state. You can scale Ingestor instances in the Healthwatch Component Config tab of the PCF Healthwatch tile. |
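Because a short spike in disconnects is expected during a platform deployment, an alert on this KPI benefits from a sustain period. A sketch follows, with illustrative names; the `for:` duration is an assumption to tune, and the thresholds are the environment-specific defaults above.

```yaml
# Sketch: alert on sustained Healthwatch nozzle disconnects from the Firehose.
groups:
- name: healthwatch-kpi-nozzle-disconnects
  rules:
  - alert: HealthwatchNozzleDisconnectsElevated
    expr: avg_over_time(ingestor_disconnects{source_id="healthwatch-forwarder"}[5m]) >= 20
    for: 15m                     # ride out the brief spike from a deployment
    labels:
      severity: warning
  - alert: HealthwatchNozzleDisconnectsCritical
    expr: avg_over_time(ingestor_disconnects{source_id="healthwatch-forwarder"}[5m]) >= 30
    for: 15m
    labels:
      severity: critical
```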
Number of Ingestor Dropped Metrics
Description | Use: An unusual increase in the number of messages dropped by the PCF Healthwatch Ingestor likely indicates that you need to scale up this component and verify the health of Redis. A prolonged period of dropping messages can endanger the assessments that PCF Healthwatch makes using platform metrics from the Firehose. |
---|---|
PromQL | `avg_over_time(ingestor_dropped{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
Thresholds | Yellow warning: >= 10. Red critical: >= 20. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | Verify the health of the Redis VM and increase the number of PCF Healthwatch Ingestor instances, then monitor this metric to ensure that it returns to a normal state. You can scale Ingestor instances in the Healthwatch Component Config tab of the PCF Healthwatch tile. |
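A rule for dropped Ingestor messages follows the same pattern as the disconnect rule above. The sketch below uses the default yellow threshold from the table and illustrative names; add a parallel rule at `>= 20` for the red (critical) level if desired.

```yaml
# Sketch: warn on messages dropped by the Healthwatch Ingestor.
groups:
- name: healthwatch-kpi-ingestor-dropped
  rules:
  - alert: HealthwatchIngestorDroppingMetrics
    expr: avg_over_time(ingestor_dropped{source_id="healthwatch-forwarder"}[5m]) >= 10
    labels:
      severity: warning
```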
Redis Queue Size
Description | Use: An unusual spike in the number of queued metrics can indicate that PCF Healthwatch Workers are unable to keep up with the volume of metrics from the Firehose. A large Redis queue delays value metrics and counter events; if the queue becomes completely full, metrics are lost altogether. This also adversely affects PCF Healthwatch's ability to calculate super value metrics. |
---|---|
Metrics | names: redis.valueMetricQueue.size, redis.counterEventQueue.size; firehose origin: healthwatch; log-cache source id: healthwatch-forwarder; type: gauge; frequency: 60s |
PromQL | `redis_valueMetricQueue_size{source_id="healthwatch-forwarder",deployment="$deployment"} + redis_counterEventQueue_size{source_id="healthwatch-forwarder",deployment="$deployment"}` |
Thresholds | Red critical: >= 10000 |
Measurement | Average over last 5 minutes |
Recommended Response | If the spike is sustained, increase the number of PCF Healthwatch Worker instances and monitor this metric to ensure that it returns to a normal state. You can scale Worker instances in the Healthwatch Component Config tab of the PCF Healthwatch tile. |
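The Redis queue depth is the sum of the two queue gauges, so an alert can wrap the same expression used in the table, averaged over the 5-minute measurement window. The sketch below uses the default critical threshold and illustrative names.

```yaml
# Sketch: alert when the combined Redis value-metric and counter-event queues
# stay at or above the default critical threshold.
groups:
- name: healthwatch-kpi-redis-queue
  rules:
  - alert: HealthwatchRedisQueueBackingUp
    expr: >
      avg_over_time(redis_valueMetricQueue_size{source_id="healthwatch-forwarder"}[5m])
        + avg_over_time(redis_counterEventQueue_size{source_id="healthwatch-forwarder"}[5m])
        >= 10000
    labels:
      severity: critical
```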
Number of Healthwatch Super Metrics Published to Firehose
Description | Use: If an operator has not made changes that impact the number or frequency of assessments, an unusual drop in the number of metrics published can indicate that PCF Healthwatch may be experiencing a computation or publication issue. |
---|---|
PromQL | `sum(metrics_published{source_id="healthwatch-forwarder",deployment="$deployment"}) by (job)` |
Thresholds | Yellow warning: <= 20. Red critical: <= 10. These thresholds are environment-specific. |
Measurement | Average over last 5 minutes |
Recommended Response | |
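A drop in published super metrics can be caught by alerting on the same per-job sum as the table. The sketch below uses the default yellow threshold, the 5-minute averaging window from the Measurement row, and illustrative names.

```yaml
# Sketch: alert when the number of Healthwatch super metrics published to the
# Firehose drops to or below the default yellow threshold for any job.
groups:
- name: healthwatch-kpi-metrics-published
  rules:
  - alert: HealthwatchSuperMetricsPublishedLow
    expr: >
      sum by (job) (
        avg_over_time(metrics_published{source_id="healthwatch-forwarder"}[5m])
      ) <= 20
    labels:
      severity: warning
```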
Other Metrics Available
This section describes other metrics that you can use to monitor PCF Healthwatch.
Number of Healthwatch Events Published to PCF Event Alerts
Description | Number of PCF Healthwatch Event Alerts triggered and published to [PCF Event Alerts](http://docs.pivotal.io/event-alerts/index.html). Use: This metric is primarily interesting for informational purposes. Because the number of alerting events can vary greatly, Pivotal does not recommend alerting on this metric itself. |
---|---|
Metrics | name: events.published; firehose origin: healthwatch; log-cache source id: healthwatch-forwarder; type: gauge; frequency: 60s |
PromQL | `events_published{source_id="healthwatch-forwarder",deployment="$deployment"}` |
BOSH Deployment Check Probe Count
Description | Number of PCF Healthwatch BOSH Deployment Occurrence probes completed in the measured time interval. Use: When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 30 seconds across 2 runner apps. |
---|---|
Metrics | name: health.bosh.deployment.probe.count; firehose origin: healthwatch; log-cache source id: healthwatch-forwarder; type: gauge; frequency: 60s |
PromQL | `health_bosh_deployment_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
CLI Command Health Probe Count
Description | Number of PCF Healthwatch CLI Command Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the CLI Health Test Availability SLI described earlier in this topic. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 5 minutes across 2 runner apps. |
---|---|
Metrics | name: health.check.cliCommand.probe.count; firehose origin: healthwatch; log-cache source id: healthwatch-forwarder; type: gauge; frequency: 60s |
PromQL | `health_check_cliCommand_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
Ops Manager Health Probe Count
Description | Number of PCF Healthwatch Ops Manager Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the Ops Manager Health Test Availability SLI described earlier in this topic. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every minute across 2 runner apps. |
---|---|
Metrics | name: health.check.OpsMan.probe.count; firehose origin: healthwatch; log-cache source id: healthwatch-forwarder; type: gauge; frequency: 60s |
PromQL | `health_check_OpsMan_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
Canary App Health Probe Count
Description | Number of PCF Healthwatch Canary App Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the Canary App Health Test Availability SLI described earlier in this topic. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every minute across 2 runner apps. |
---|---|
Metrics | name: health.check.CanaryApp.probe.count; firehose origin: healthwatch; log-cache source id: healthwatch-forwarder; type: gauge; frequency: 60s |
PromQL | `health_check_CanaryApp_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
BOSH Director Health Probe Count
Description | Number of PCF Healthwatch BOSH Director Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the BOSH Director Health Test Availability SLI described earlier in this topic. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 10 minutes using 1 runner app. |
---|---|
Metrics | name: health.check.bosh.director.probe.count; firehose origin: healthwatch; log-cache source id: healthwatch-forwarder; type: gauge; frequency: 60s |
PromQL | `health_check_bosh_director_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
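For the probe-count metrics in this section, the guidance above is to watch for an unexpected negative variance from the normal pattern rather than for a fixed value. One way to express that is to compare each count to its own recent average. This is an assumption, not a documented rule: the 0.5 factor, the 1-hour baseline, and the names are all illustrative, and the same shape applies to the other probe-count metrics above.

```yaml
# Sketch: flag a probe count that falls well below its own one-hour baseline,
# one way to detect the "unexpected negative variance" described above.
groups:
- name: healthwatch-probe-count-variance
  rules:
  - alert: HealthwatchCliProbeCountDropped
    expr: >
      health_check_cliCommand_probe_count{source_id="healthwatch-forwarder"}
        < 0.5 * avg_over_time(health_check_cliCommand_probe_count{source_id="healthwatch-forwarder"}[1h])
    for: 15m
    labels:
      severity: warning
```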