PCF Healthwatch Metrics

Warning: PCF Healthwatch v1.6 is no longer supported or available for download. PCF Healthwatch v1.6 has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.

This topic lists the super metrics created by Pivotal Cloud Foundry (PCF) Healthwatch.

Note: For external monitoring consumers, PCF Healthwatch forwards the metrics that it creates into the Loggregator Firehose. You can retrieve these metrics with a nozzle or through the Log Cache API.

Platform Service Level Indicators

PCF Healthwatch generates metrics that directly indicate the health of several platform components. You can use these metrics with service level objectives to calculate percent availability and error budgets.
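
As an illustration of the calculation mentioned above, the following minimal sketch derives percent availability and the remaining error budget from a series of pass/fail samples. The function names and the 99.5% objective are hypothetical and chosen only for this example; they are not part of PCF Healthwatch.

```python
from typing import Sequence


def percent_availability(samples: Sequence[int]) -> float:
    """Return the percentage of samples that passed (1 = pass, 0 = fail)."""
    if not samples:
        raise ValueError("at least one sample is required")
    return 100.0 * sum(1 for s in samples if s == 1) / len(samples)


def error_budget_remaining(samples: Sequence[int], slo_percent: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no failures so far, 0.0 means the budget is exhausted,
    and a negative value means the objective has been missed.
    """
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    # Number of failed samples the objective tolerates over this window.
    allowed_failures = len(samples) * (100.0 - slo_percent) / 100.0
    actual_failures = sum(1 for s in samples if s == 0)
    return 1.0 - actual_failures / allowed_failures


# Hypothetical window: 1000 checks, 2 of which failed, against a 99.5% objective.
window = [1] * 998 + [0] * 2
availability = percent_availability(window)
budget_left = error_budget_remaining(window, slo_percent=99.5)
```

In this sketch, a window of 1000 checks with a 99.5% objective tolerates 5 failed checks, so 2 failures leave roughly 60% of the budget unspent.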

Cloud Foundry CLI Health

The Cloud Foundry Command Line Interface (cf CLI) enables developers to create and manage apps on PCF. PCF Healthwatch executes a continuous test suite for validating the core functions of the cf CLI.

The table below provides information about the cf CLI health smoke tests and the metrics that are generated for these tests.

| Test | Metric | Frequency | Description |
|---|---|---|---|
| Can login | healthwatch.health.check.cliCommand.login and healthwatch.health.check.cliCommand.login.timeout | 5 min | 1 = pass or 0 = fail |
| Can push | healthwatch.health.check.cliCommand.push and healthwatch.health.check.cliCommand.push.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Can start | healthwatch.health.check.cliCommand.start and healthwatch.health.check.cliCommand.start.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Receiving logs | healthwatch.health.check.cliCommand.logs and healthwatch.health.check.cliCommand.logs.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Can stop | healthwatch.health.check.cliCommand.stop and healthwatch.health.check.cliCommand.stop.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Can delete | healthwatch.health.check.cliCommand.delete and healthwatch.health.check.cliCommand.delete.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Test app push time | healthwatch.health.check.cliCommand.pushTime | 5 min | Time in ms |
| Overall smoke test battery result | healthwatch.health.check.cliCommand.success | 5 min | 1 = pass or 0 = fail |
| Overall smoke test battery run time | healthwatch.health.check.cliCommand.duration | 5 min | Time in ms |

Note: Timeout metrics are written only when a test times out, which happens after 2 minutes. Their value is always 0.

Note: PCF Healthwatch runs this test suite in the system org and the healthwatch space.
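
The value conventions described above (1, 0, -1, and timeout metrics that exist only when a timeout occurs) can be turned into a per-step status. The sketch below assumes that the metrics from one test run have already been collected into a dictionary keyed by metric name; how they are retrieved is outside its scope, and the helper names are hypothetical.

```python
from typing import Dict, Mapping

PREFIX = "healthwatch.health.check.cliCommand."
STEPS = ["login", "push", "start", "logs", "stop", "delete"]


def classify_step(metrics: Mapping[str, float], step: str) -> str:
    """Map the metrics of one smoke test step to a readable status."""
    # A timeout metric is written only when a timeout occurs, so its mere
    # presence is the signal; its value is always zero.
    if PREFIX + step + ".timeout" in metrics:
        return "timeout"
    value = metrics.get(PREFIX + step)
    if value == 1:
        return "pass"
    if value == 0:
        return "fail"
    if value == -1:
        return "did not run"
    return "missing"


def summarize_run(metrics: Mapping[str, float]) -> Dict[str, str]:
    """Return the status of every smoke test step for one test run."""
    return {step: classify_step(metrics, step) for step in STEPS}


# Hypothetical run: login passed, push timed out, later steps did not run.
run = {
    PREFIX + "login": 1,
    PREFIX + "push": 0,
    PREFIX + "push.timeout": 0,
    PREFIX + "start": -1,
    PREFIX + "logs": -1,
    PREFIX + "stop": -1,
    PREFIX + "delete": -1,
}
statuses = summarize_run(run)
```

Treating -1 as its own status keeps steps that never ran from being counted as failures when calculating availability.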

Ops Manager Health

Issues with Ops Manager health can impact an operator’s ability to perform an upgrade or to scale PCF. Therefore, it is recommended to continuously monitor Ops Manager health. PCF Healthwatch executes this check as follows:

Note: This metric is not emitted if the Ops Manager health check is disabled. For more information, see Installing PCF Healthwatch.

| Test | Metric | Frequency | Description |
|---|---|---|---|
| Ops Manager availability | healthwatch.health.check.OpsMan.available | 1 min | 1 = pass or 0 = fail |

Canary App Health

App availability and responsiveness issues can significantly impact the experience of end users. By default, PCF Healthwatch uses Apps Manager as a canary app. Apps Manager is a good indicator of the overall health of apps running on PCF: if it is down, other apps are likely failing as well, and if it is up and responsive, other apps and their underlying routing are likely healthy. As of v1.3, you can configure PCF Healthwatch to use a different PCF app for this test. See Configuring Canary App Health Endpoint.

| Test | Metric | Frequency | Description |
|---|---|---|---|
| Apps Manager availability | healthwatch.health.check.CanaryApp.available | 1 min | 1 = pass; 0 = fail or 10-second timeout |
| Apps Manager response time | healthwatch.health.check.CanaryApp.responseTime | 1 min | Time in ms |

BOSH Director Health

If the BOSH Director is not responsive and functional, BOSH-managed VMs lose their resiliency. It is recommended to continuously monitor the health of the BOSH Director. PCF Healthwatch executes this check as follows:

| Test | Metric | Frequency | Description |
|---|---|---|---|
| BOSH Director health | healthwatch.health.check.bosh.director.success and healthwatch.health.check.bosh.director.timeout | 10 min | 1 = pass or 0 = fail |

Note: PCF Healthwatch deploys, stops, starts, and deletes a VM named bosh-health-check as part of this test suite.

Note: The timeout metric is written if deploying or deleting the VM takes more than 10 minutes.

Logging Performance

This section lists metrics used to monitor Loggregator, the PCF component responsible for logging.

Log Transport Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. Two versions of the metric (per minute and per hour) are used to monitor the Loggregator Firehose.

| Reports | Metric | Description |
|---|---|---|
| Log Transport Loss Rate | healthwatch.Firehose.LossRate.1H and healthwatch.Firehose.LossRate.1M | Loss rate per hour (1H) and per minute (1M) |

Doppler Message Rate Capacity

This derived metric is recommended for automating and monitoring the scaling of Doppler, a component of Loggregator. It represents the average load on each Doppler instance. When the average nears 1 million messages per minute per Doppler instance (the recommended maximum load), add Doppler instances.

| Reports | Metric | Description |
|---|---|---|
| Doppler Message Rate Capacity | healthwatch.Doppler.MessagesAverage.1M | The number of messages coming in to Doppler (loggregator.doppler.ingress) divided by the current number of Doppler instances, which gives the approximate load on each Doppler instance in envelopes per minute |
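
The division described above, and the scaling decision that follows from it, can be sketched as shown below. The 1 million messages-per-minute ceiling comes from this topic; the `target_utilization` headroom factor and the function names are assumptions made for illustration only.

```python
import math

# Recommended maximum load per Doppler instance, in messages per minute.
RECOMMENDED_MAX_PER_DOPPLER = 1_000_000


def doppler_messages_average(ingress_per_minute: float, doppler_instances: int) -> float:
    """Average per-minute message load on each Doppler instance."""
    if doppler_instances <= 0:
        raise ValueError("doppler_instances must be positive")
    return ingress_per_minute / doppler_instances


def dopplers_needed(ingress_per_minute: float, target_utilization: float = 0.8) -> int:
    """Smallest instance count that keeps the average under the target.

    target_utilization is a hypothetical headroom factor: 0.8 means
    "scale before reaching 80% of the recommended maximum".
    """
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    per_instance_limit = RECOMMENDED_MAX_PER_DOPPLER * target_utilization
    return max(1, math.ceil(ingress_per_minute / per_instance_limit))


# Hypothetical load: 4.5 million messages per minute across 5 Doppler instances.
average = doppler_messages_average(4_500_000, 5)
suggested = dopplers_needed(4_500_000)
```

With these example numbers, the average is 900,000 messages per minute per instance, which is close enough to the recommended maximum that the sketch suggests a sixth instance.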

Adapter Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.

| Reports | Metric | Description |
|---|---|---|
| Adapter loss rate (syslog drain performance) | healthwatch.SyslogDrain.Adapter.LossRate.1M | Loss rate per minute |

Reverse Log Proxy Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.

| Reports | Metric | Description |
|---|---|---|
| Reverse Log Proxy loss rate (syslog drain performance) | healthwatch.SyslogDrain.RLP.LossRate.1M | Loss rate per minute |

Syslog Drain Binding Capacity

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.

| Reports | Metric | Description |
|---|---|---|
| Average number of drain bindings across Adapter instances | healthwatch.SyslogDrain.Adapter.BindingsAverage.5M | The number of reported syslog drain bindings (cf-syslog-drain.scheduler.drains) divided by the current number of Syslog Adapters (cf-syslog-drain.scheduler.adapters), reported as a 5-minute rolling average |

Note: Each adapter can handle a limited number of bindings. Calculating the ratio of drain bindings to available Adapters helps to monitor and scale Adapter instances.
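
The ratio and rolling average described above can be sketched as follows. This example assumes one observation per minute and a window of five observations, and it averages the per-observation ratios; the sampling interval and the exact averaging method used by PCF Healthwatch are not stated in this topic, so treat both as assumptions. The class name is hypothetical.

```python
from collections import deque


class BindingsAverage:
    """Rolling average of drain bindings per Syslog Adapter."""

    def __init__(self, window: int = 5) -> None:
        # Keep only the most recent `window` ratios; older ones fall out.
        self._ratios = deque(maxlen=window)

    def observe(self, drains: float, adapters: int) -> float:
        """Record one observation and return the current rolling average."""
        if adapters <= 0:
            raise ValueError("adapters must be positive")
        self._ratios.append(drains / adapters)
        return sum(self._ratios) / len(self._ratios)


# Hypothetical observations of (drain bindings, adapter instances).
tracker = BindingsAverage(window=5)
for drains, adapters in [(100, 2), (120, 2), (180, 3)]:
    current = tracker.observe(drains, adapters)
```

A rising average with a constant adapter count indicates that each Adapter is handling more bindings over time.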

Capacity Available

This section lists metrics used to monitor the total percentage of available memory, disk, and cell container capacity.

Number of Available Free Chunks of Memory

This derived metric is recommended for automating and monitoring platform scaling. Insufficient free chunks of memory can prevent pushing and scaling apps. The number of free chunks remaining is a more valuable indicator of impending cf push errors due to lack of memory than the Memory Remaining metrics alone.

| Reports | Metric | Description |
|---|---|---|
| Available free chunks of memory | healthwatch.Diego.AvailableFreeChunks | The current number of calculated 4 GB chunks of free memory across all cells. The default chunk size of 4 GB can be modified through the API. |
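
To illustrate why a chunk count can be more informative than total remaining memory, the sketch below counts whole chunks per cell and sums them. It assumes that a chunk must fit entirely on a single cell, which is an interpretation of the description above rather than a documented formula; the function name and the sample values are hypothetical.

```python
from typing import Iterable


def available_free_chunks(free_mb_per_cell: Iterable[float], chunk_mb: int = 4096) -> int:
    """Count whole chunks of free memory, cell by cell.

    Assumes a chunk cannot span cells, so leftover memory smaller than
    one chunk on a cell does not contribute to the total.
    """
    if chunk_mb <= 0:
        raise ValueError("chunk_mb must be positive")
    return sum(int(free_mb // chunk_mb) for free_mb in free_mb_per_cell)


# Hypothetical fragmented platform: 12,000 MB free in total, but no single
# cell has room for a 4 GB app instance.
fragmented = available_free_chunks([3000, 3000, 3000, 3000])

# Hypothetical platform with the free memory concentrated on fewer cells.
concentrated = available_free_chunks([9000, 3000, 4096])
```

In the fragmented case the total remaining memory looks healthy, yet the chunk count is zero, which is the situation in which a push of a 4 GB instance would be expected to fail.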

Percentage of Memory Available

This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment setup can result in insufficient capacity to tolerate the failure of an entire availability zone (AZ).

| Reports | Metric | Description |
|---|---|---|
| Available memory | healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M | Percentage of available memory across all cells (averaged over last 5 min) |
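
The link between available percentage and AZ failure tolerance can be sketched with simple arithmetic. The example assumes that cells and workloads are spread evenly across AZs, so that losing one of N AZs removes 1/N of total capacity; under that assumption, at least 100/N percent must be available for the surviving AZs to absorb the displaced workload. This is an illustrative lower bound, not the recommended percentage for any particular environment, and the function names are hypothetical.

```python
def percent_available(remaining: float, total: float) -> float:
    """Percentage of total capacity that is still available."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * remaining / total


def min_percent_to_survive_az_loss(az_count: int) -> float:
    """Lower bound on available percentage needed to lose one AZ.

    Assumes capacity and workload are evenly balanced across AZs.
    """
    if az_count < 2:
        raise ValueError("tolerating an AZ failure requires at least 2 AZs")
    return 100.0 / az_count


# Hypothetical platform: 3 AZs, 120 GB of cell memory, 30 GB remaining.
available = percent_available(remaining=30, total=120)
threshold = min_percent_to_survive_az_loss(az_count=3)
can_tolerate_az_loss = available >= threshold
```

With these example numbers, 25% of memory is available against a bound of about 33%, so the sketch reports that the platform could not absorb the loss of one AZ.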

Total Memory Available

This derived metric is recommended for automating platform scaling and understanding ongoing trends for capacity planning.

| Reports | Metric | Description |
|---|---|---|
| Available memory | healthwatch.Diego.TotalAvailableMemoryCapacity.5M | Total remaining available memory across all cells (averaged over the last 5 min) |

Number of Available Free Chunks of Disk

This derived metric is recommended for automating and monitoring platform scaling. Insufficient free chunks of disk can prevent pushing and scaling apps. Diego cannot stage app instances and tasks without at least 6 GB free. The number of free chunks remaining is a more valuable indicator of impending cf push errors due to lack of disk than the Disk Remaining metrics alone.

| Reports | Metric | Description |
|---|---|---|
| Available free chunks of disk | healthwatch.Diego.AvailableFreeChunksDisk | The current number of calculated 6 GB chunks of free disk across all cells. The default chunk size of 6 GB can be modified through the API. |

Percentage of Disk Available

This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment setup can result in insufficient capacity to tolerate the failure of an entire AZ.

| Reports | Metric | Description |
|---|---|---|
| Available disk | healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M | Percentage of available disk across all cells (averaged over last 5 min) |

Total Disk Available

This derived metric is recommended for automating platform scaling and understanding ongoing trends for capacity planning.

| Reports | Metric | Description |
|---|---|---|
| Available disk | healthwatch.Diego.TotalAvailableDiskCapacity.5M | Total remaining available disk across all cells (averaged over the last 5 min) |

Percentage of Cell Container Capacity Available

This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment setup can result in insufficient capacity to tolerate the failure of an entire AZ.

| Reports | Metric | Description |
|---|---|---|
| Available cell container capacity | healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M | Percentage of available cell container capacity (averaged over last 5 min) |

Total Cell Container Capacity Available

This derived metric is recommended for automating platform scaling and understanding ongoing trends for capacity planning.

| Reports | Metric | Description |
|---|---|---|
| Available cell container capacity | healthwatch.Diego.TotalAvailableContainerCapacity.5M | Total remaining available cell container capacity (averaged over last 5 min) |

BOSH Deployment Occurrence

Monitoring BOSH deployment occurrence adds context to related data, such as VM (job) health.

Limitation: You can determine when a BOSH deployment starts or completes, but you cannot currently determine which VMs it affects.

| Reports | Metric | Frequency | Description |
|---|---|---|---|
| BOSH deployment occurrence | healthwatch.bosh.deployment | 30 sec | 1 = a running deployment or 0 = not a running deployment |
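
One way to use this metric as context is to convert its 0/1 samples into deployment windows and then check whether another event, such as a VM health alert, fell inside one of them. The sketch below assumes the samples are available as time-ordered (timestamp, value) pairs; the function names and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Window = Tuple[float, float]


def deployment_windows(samples: Sequence[Tuple[float, int]]) -> List[Window]:
    """Collapse time-ordered (timestamp, value) samples into windows.

    Each window spans from the first to the last consecutive sample
    whose value is 1 (a running deployment).
    """
    windows: List[Window] = []
    start = None
    last_running = None
    for timestamp, value in samples:
        if value == 1:
            if start is None:
                start = timestamp
            last_running = timestamp
        elif start is not None:
            windows.append((start, last_running))
            start = None
    if start is not None:
        # The series ended while a deployment was still running.
        windows.append((start, last_running))
    return windows


def during_deployment(timestamp: float, windows: Sequence[Window]) -> bool:
    """Return True if the timestamp falls inside any deployment window."""
    return any(begin <= timestamp <= end for begin, end in windows)


# Hypothetical samples taken every 30 seconds (timestamps in seconds).
series = [(0, 0), (30, 1), (60, 1), (90, 0), (120, 1)]
windows = deployment_windows(series)
alert_in_deployment = during_deployment(45, windows)
```

Because the metric does not identify the affected VMs, a match only indicates that a deployment was in progress at that time, not that it caused the event.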