Monitoring Pivotal Healthwatch
This topic explains how to monitor the health of Pivotal Healthwatch using the metrics and key performance indicators (KPIs) generated by the service.
For general information about monitoring Pivotal Application Service, see Monitoring PAS.
About Metrics
Pivotal Healthwatch emits metrics in the following format:
```
origin:"healthwatch" eventType:ValueMetric timestamp:1509638101820496694 deployment:"healthwatch-app-dev-v1-3"
job:"healthwatch-forwarder" index:"097f4b1e-5ca8-4866-82d5-00883798dad4" ip:"10.0.16.29"
valueMetric:<name:"metrics.published" value:38 unit:"count">
```
All Pivotal Healthwatch-emitted metrics have the `healthwatch` origin.
Service Level Indicators for Pivotal Healthwatch
Service Level Indicators (SLIs) monitor whether key features of the Pivotal Healthwatch product are working as expected. These SLIs are the most important operational metrics emitted about Healthwatch itself, because they indicate the reliability of the assessments Healthwatch makes.
CLI Health Test Availability
| Description | Use: Indicates that Pivotal Healthwatch is assessing the health of Cloud Foundry Command Line Interface (cf CLI) commands. If these continuous validation tests fail to make up-to-date assessments, they are no longer a reliable warning mechanism. |
|---|---|
| Metrics | |
| PromQL | `avg_over_time(health_check_cliCommand_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
| Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | |
Canary App Health Test Availability
| Description | Use: Indicates that Pivotal Healthwatch is assessing the current state of health for the canary app. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism. |
|---|---|
| Metrics | |
| PromQL | `avg_over_time(health_check_CanaryApp_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
| Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | |
BOSH Director Health Test Availability
| Description | Use: Indicates that Pivotal Healthwatch is assessing the current state of health for the BOSH Director. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism. |
|---|---|
| Metrics | |
| PromQL | `avg_over_time(health_check_bosh_director_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
| Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | |
Ops Manager Health Test Availability
| Description | Use: Indicates that Pivotal Healthwatch is assessing the current state of health for Ops Manager. If this continuous validation test fails to make up-to-date assessments, it is no longer a reliable warning mechanism. |
|---|---|
| Metrics | |
| PromQL | `avg_over_time(health_check_OpsMan_probe_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
| Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | |
Pivotal Healthwatch UI Availability
| Description | Use: Indicates that the Healthwatch UI is running and available to product users. While an issue with the UI does not impact the assessments that Pivotal Healthwatch is making, loss of the UI can impact users' ability to visually reference those assessments. |
|---|---|
| Metrics | |
| PromQL | `avg_over_time(healthwatch_ui_available{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
| Thresholds | Yellow warning: <= 0.6. Red critical: <= 0.4. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | |
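If you scrape these SLIs into a Prometheus-compatible alerting system, the documented thresholds map directly onto alerting rules. The sketch below shows the pattern for the CLI health SLI only; the group and alert names and the `for:` durations are illustrative assumptions, the `deployment="$deployment"` selector (a dashboard variable) is dropped here and should be replaced with a concrete deployment filter, and the same shape applies to the other four availability metrics.

```yaml
# Sketch of Prometheus-style alerting rules for one Healthwatch SLI.
# Alert names and labels are illustrative; thresholds mirror the
# environment-specific defaults documented above.
groups:
  - name: healthwatch-slis
    rules:
      - alert: HealthwatchCliProbeAvailabilityWarning
        expr: avg_over_time(health_check_cliCommand_probe_available{source_id="healthwatch-forwarder"}[5m]) <= 0.6
        for: 5m
        labels:
          severity: warning
      - alert: HealthwatchCliProbeAvailabilityCritical
        expr: avg_over_time(health_check_cliCommand_probe_available{source_id="healthwatch-forwarder"}[5m]) <= 0.4
        for: 5m
        labels:
          severity: critical
```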
Key Performance Indicators for Pivotal Healthwatch
This section describes the KPIs that you can use to monitor the health of Pivotal Healthwatch.
Number of Pivotal Healthwatch Nozzle Disconnects from RLP
| Description | Use: An unusual increase in the number of disconnects from the RLP typically indicates a connection issue between the RLP and the Healthwatch Ingestor. This metric can spike during a PAS deployment because the Traffic Controller VMs restart, logging a disconnect. A prolonged period of losing metrics as a result of disconnects can endanger the assessments that Pivotal Healthwatch makes using platform metrics from the RLP. |
|---|---|
| Metrics | |
| PromQL | `avg_over_time(ingestor_disconnects{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
| Thresholds | Yellow warning: >= 20. Red critical: >= 30. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | If no known deployment occurred and the spike is sustained, increase the number of Pivotal Healthwatch Ingestor instances and monitor this metric to ensure that it returns to a normal state. You can scale Ingestor instances in the Healthwatch Component Config tab of the Pivotal Healthwatch tile. |
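Because this metric is expected to spike briefly during a PAS deployment, a sustained-elevation alert is more useful than an instantaneous one. The sketch below, with illustrative names and a `for:` duration chosen as an assumption, only fires when the 5-minute average stays above the warning threshold for an extended period:

```yaml
# Sketch: sustained-disconnect alert. The `for:` clause suppresses the
# short spike expected during a PAS deployment (Traffic Controller VM
# restarts); only a sustained elevation fires the alert.
groups:
  - name: healthwatch-ingestor
    rules:
      - alert: HealthwatchIngestorDisconnectsSustained
        expr: avg_over_time(ingestor_disconnects{source_id="healthwatch-forwarder"}[5m]) >= 20
        for: 15m
        labels:
          severity: warning
```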
Number of Ingestor Dropped Metrics
| Description | Use: An unusual increase in the number of messages dropped by the Pivotal Healthwatch Ingestor likely indicates that you need to scale up this component and verify the health of Redis. A prolonged period of dropping messages can endanger the assessments that Pivotal Healthwatch makes using platform metrics from the Firehose. |
|---|---|
| Metrics | |
| PromQL | `avg_over_time(ingestor_dropped{source_id="healthwatch-forwarder",deployment="$deployment"}[5m])` |
| Thresholds | Yellow warning: >= 10. Red critical: >= 20. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | Verify the health of the Redis VM and increase the number of Pivotal Healthwatch Ingestor instances. Monitor this metric to ensure that it returns to a normal state. |
Redis Queue Size
| Description | Use: An unusual spike in the number of queued metrics can indicate that Pivotal Healthwatch Workers are unable to keep up with the volume of metrics from the Firehose. A large Redis queue delays value metrics and counter events; if the queue becomes completely full, metrics are lost altogether. This also adversely affects Pivotal Healthwatch's ability to calculate super value metrics. |
|---|---|
| Metrics | name: `redis.counterEventQueue.size`; firehose origin: `healthwatch`; log-cache source id: `healthwatch-forwarder`; type: gauge; frequency: 60s |
| PromQL | `redis_valueMetricQueue_size{source_id="healthwatch-forwarder",deployment="$deployment"} + redis_counterEventQueue_size{source_id="healthwatch-forwarder",deployment="$deployment"}` |
| Thresholds | Red critical: >= 10000 |
| Measurement | Average over last 5 minutes |
| Recommended Response | If the spike is sustained, increase the number of Pivotal Healthwatch Worker instances and monitor this metric to ensure that it returns to a normal state. You can scale Worker instances in the Healthwatch Component Config tab of the Pivotal Healthwatch tile. |
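The combined-queue PromQL above can be used directly as an alert expression. The sketch below is a minimal example; the group and alert names are illustrative assumptions, and the `$deployment` dashboard variable is replaced with a plain source filter:

```yaml
# Sketch: alert when the combined Redis queue depth crosses the
# documented critical threshold of 10000 queued metrics.
groups:
  - name: healthwatch-redis
    rules:
      - alert: HealthwatchRedisQueueCritical
        expr: >
          redis_valueMetricQueue_size{source_id="healthwatch-forwarder"}
          + redis_counterEventQueue_size{source_id="healthwatch-forwarder"}
          >= 10000
        for: 5m
        labels:
          severity: critical
```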
Number of Healthwatch Super Metrics Published to Firehose
| Description | Use: If an operator has not made changes that impact the number or frequency of assessments, an unusual drop in the number of metrics published can indicate that Pivotal Healthwatch may be experiencing a computation or publication issue. |
|---|---|
| Metrics | |
| PromQL | `sum(metrics_published{source_id="healthwatch-forwarder",deployment="$deployment"}) by (job)` |
| Thresholds | Yellow warning: <= 20. Red critical: <= 10. These thresholds are environment specific. |
| Measurement | Average over last 5 minutes |
| Recommended Response | |
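Because the query aggregates per job, an alert on it fires independently for each publishing job. A minimal sketch, with illustrative names and the `$deployment` selector dropped:

```yaml
# Sketch: alert when the per-job count of published super metrics drops
# below the documented thresholds. Names and durations are illustrative.
groups:
  - name: healthwatch-publishing
    rules:
      - alert: HealthwatchSuperMetricsLowWarning
        expr: sum(metrics_published{source_id="healthwatch-forwarder"}) by (job) <= 20
        for: 5m
        labels:
          severity: warning
      - alert: HealthwatchSuperMetricsLowCritical
        expr: sum(metrics_published{source_id="healthwatch-forwarder"}) by (job) <= 10
        for: 5m
        labels:
          severity: critical
```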
Other Metrics Available
This section describes other metrics that you can use to monitor Pivotal Healthwatch.
Number of Healthwatch Events Published to Pivotal Event Alerts
| Description | Number of Pivotal Healthwatch Event Alerts triggered and published to Pivotal Event Alerts. Use: This metric is primarily useful for informational purposes. Because the number of alerting events can vary greatly, Pivotal does not recommend alerting on this metric itself. |
|---|---|
| Metrics | name: `events.published`; firehose origin: `healthwatch`; log-cache source id: `healthwatch-forwarder`; type: gauge; frequency: 60s |
| PromQL | `events_published{source_id="healthwatch-forwarder",deployment="$deployment"}` |
BOSH Deployment Check Probe Count
| Description | Number of Pivotal Healthwatch BOSH Deployment Occurrence probes completed in the measured time interval. Use: When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 30 seconds across 2 runner apps. |
|---|---|
| Metrics | name: `health.bosh.deployment.probe.count`; firehose origin: `healthwatch`; log-cache source id: `healthwatch-forwarder`; type: gauge; frequency: 60s |
| PromQL | `health_bosh_deployment_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
CLI Command Health Probe Count
| Description | Number of Pivotal Healthwatch CLI Command Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the CLI Health Test Availability metric described above. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 5 minutes across 2 runner apps. |
|---|---|
| Metrics | name: `health.check.cliCommand.probe.count`; firehose origin: `healthwatch`; log-cache source id: `healthwatch-forwarder`; type: gauge; frequency: 60s |
| PromQL | `health_check_cliCommand_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
Ops Manager Health Probe Count
| Description | Number of Pivotal Healthwatch Ops Manager Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the Ops Manager Health Test Availability metric described above. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 1 minute across 2 runner apps. |
|---|---|
| Metrics | name: `health.check.OpsMan.probe.count`; firehose origin: `healthwatch`; log-cache source id: `healthwatch-forwarder`; type: gauge; frequency: 60s |
| PromQL | `health_check_OpsMan_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
Canary App Health Probe Count
| Description | Number of Pivotal Healthwatch Canary App Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the Canary App Health Test Availability metric described above. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 1 minute across 2 runner apps. |
|---|---|
| Metrics | name: `health.check.CanaryApp.probe.count`; firehose origin: `healthwatch`; log-cache source id: `healthwatch-forwarder`; type: gauge; frequency: 60s |
| PromQL | `health_check_CanaryApp_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
BOSH Director Health Probe Count
| Description | Number of Pivotal Healthwatch BOSH Director Health probe assessments completed in the measured time interval. Use: For alerting purposes, Pivotal suggests using the BOSH Director Health Test Availability metric described above. When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality. In the default installation, these tests run every 10 minutes using 1 runner app. |
|---|---|
| Metrics | name: `health.check.bosh.director.probe.count`; firehose origin: `healthwatch`; log-cache source id: `healthwatch-forwarder`; type: gauge; frequency: 60s |
| PromQL | `health_check_bosh_director_probe_count{source_id="healthwatch-forwarder",deployment="$deployment"}` |
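"Unexpected negative variance from the normal pattern" can itself be expressed as a query by comparing a probe count against its own recent baseline. The sketch below shows one way to do this for the CLI command probe count; the 0.5 ratio, the window lengths, and the rule names are illustrative assumptions that should be tuned per environment, and the same pattern applies to the other probe-count metrics:

```yaml
# Sketch: flag a probe count that falls well below its own 1-hour
# baseline. The 0.5 ratio and window lengths are illustrative.
groups:
  - name: healthwatch-probe-counts
    rules:
      - alert: HealthwatchCliProbeCountBelowBaseline
        expr: >
          avg_over_time(health_check_cliCommand_probe_count{source_id="healthwatch-forwarder"}[5m])
          < 0.5 * avg_over_time(health_check_cliCommand_probe_count{source_id="healthwatch-forwarder"}[1h])
        for: 10m
        labels:
          severity: warning
```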