PCF Healthwatch v1.1

PCF Healthwatch Metrics

This topic lists the super metrics created by Pivotal Cloud Foundry (PCF) Healthwatch.

Note: For external monitoring consumers, PCF Healthwatch forwards the metrics that it creates into the Loggregator Firehose.

Cloud Foundry CLI Health

The Cloud Foundry Command Line Interface (cf CLI) enables developers to create and manage apps on PCF. PCF Healthwatch executes a continuous test suite for validating the core functions of the cf CLI.

The table below provides information about the cf CLI health smoke tests and the metrics that are generated for these tests.

Test | Metric | Frequency | Description
Can login | healthwatch.health.check.cliCommand.login and healthwatch.health.check.cliCommand.login.timeout | 5 min | 1 = pass or 0 = fail
Can push | healthwatch.health.check.cliCommand.push and healthwatch.health.check.cliCommand.push.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run
Can start | healthwatch.health.check.cliCommand.start and healthwatch.health.check.cliCommand.start.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run
Receiving logs | healthwatch.health.check.cliCommand.logs and healthwatch.health.check.cliCommand.logs.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run
Can stop | healthwatch.health.check.cliCommand.stop and healthwatch.health.check.cliCommand.stop.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run
Can delete | healthwatch.health.check.cliCommand.delete and healthwatch.health.check.cliCommand.delete.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run
Test app push time | healthwatch.health.check.cliCommand.pushTime | 5 min | Time in ms
Overall smoke test battery result | healthwatch.health.check.cliCommand.success | 5 min | 1 = pass or 0 = fail
Overall smoke test battery run time | healthwatch.health.check.cliCommand.duration | 5 min | Time in ms

Note: A timeout metric is written only when the corresponding test times out (after 2 minutes). Its value is always zero.

Note: PCF Healthwatch runs this test suite in the system org and the healthwatch space.
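As an illustration of how a monitoring consumer might read these values, the following minimal sketch maps a pass/fail metric to the meanings listed in the table and treats the presence of a timeout metric as the signal, since its value is always zero. The function name interpret_cli_metric is hypothetical and not part of PCF Healthwatch. The sketch does not apply to the pushTime and duration metrics, which report milliseconds.

```python
# Meanings taken from the cf CLI health table; the helper itself is hypothetical.
STEP_RESULTS = {1: "pass", 0: "fail", -1: "did not run"}


def interpret_cli_metric(name: str, value: float) -> str:
    """Return a readable result for a cliCommand pass/fail or timeout metric."""
    # A timeout metric is only written when a timeout occurs and its value
    # is always zero, so its presence is the signal, not its value.
    if name.endswith(".timeout"):
        return "timed out"
    if value not in STEP_RESULTS:
        raise ValueError(f"unexpected value {value!r} for {name}")
    return STEP_RESULTS[value]
```

For example, interpret_cli_metric("healthwatch.health.check.cliCommand.push", -1) returns "did not run".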

Ops Manager Health

Issues with Ops Manager health can impact an operator’s ability to perform an upgrade or to scale PCF. Therefore, it is recommended to continuously monitor Ops Manager health. PCF Healthwatch executes this check as follows:

Note: This metric is not emitted if the Ops Manager health check is disabled. For more information, see Installing PCF Healthwatch.

Test | Metric | Frequency | Description
Ops Manager availability | healthwatch.health.check.OpsMan.available | 1 min | 1 = pass or 0 = fail

Canary App Health

App availability and responsiveness issues can significantly impact the experience of end users. PCF Healthwatch uses Apps Manager as a canary app and continuously checks its health. Because of the functions Apps Manager provides, Pivotal recommends it as a canary for insight into the performance of other apps running on PCF.

Test | Metric | Frequency | Description
Apps Manager availability | healthwatch.health.check.CanaryApp.available | 1 min | 1 = pass, or 0 = fail or 10-second timeout
Apps Manager response time | healthwatch.health.check.CanaryApp.responseTime | 1 min | Time in ms

BOSH Director Health

If the BOSH Director is not responsive and functional, BOSH-managed VMs lose their resiliency. It is recommended to continuously monitor the health of the BOSH Director. PCF Healthwatch executes this check as follows:

Test | Metric | Frequency | Description
BOSH Director health | healthwatch.health.check.bosh.director.success and healthwatch.health.check.bosh.director.timeout | 10 min | 1 = pass or 0 = fail

Note: PCF Healthwatch deploys, stops, starts, and deletes a VM named bosh-health-check as part of this test suite.

Note: The timeout metric is written if deploying or deleting the VM takes more than 10 minutes.

Logging Performance

This section lists metrics used to monitor Loggregator, the PCF component responsible for logging.

Firehose Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. Two versions of the metric (per minute and per hour) are used to monitor the Loggregator Firehose.

Reports | Metric | Description
Firehose loss rate | healthwatch.Firehose.LossRate.1H and healthwatch.Firehose.LossRate.1M | Loss rate per hour (1H) and per minute (1M)

Adapter Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.

Reports | Metric | Description
Adapter loss rate (syslog drain performance) | healthwatch.SyslogDrain.Adapter.LossRate.1M | Loss rate per minute

Reverse Log Proxy Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.

Reports | Metric | Description
Reverse Log Proxy loss rate (syslog drain performance) | healthwatch.SyslogDrain.RLP.LossRate.1M | Loss rate per minute

Syslog Drain Binding Capacity

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.

Reports | Metric | Description
Average number of drain bindings across Adapter instances | healthwatch.SyslogDrain.Adapter.BindingsAverage.5M | The number of reported syslog drain bindings (cf-syslog-drain.scheduler.drains) divided by the current number of Syslog Adapters (cf-syslog-drain.scheduler.adapters), reported as a 5-minute rolling average

Note: Each adapter can handle a limited number of bindings. Calculating the ratio of drain bindings to available Adapters helps to monitor and scale Adapter instances.
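The bindings-per-Adapter ratio described above can be sketched as follows. This is a minimal illustration, not the PCF Healthwatch implementation: the class name BindingsAverage is hypothetical, and the rolling window is assumed to be a fixed number of most recent samples (choose the size so that it spans roughly 5 minutes at your sampling interval).

```python
from collections import deque
from typing import Optional


class BindingsAverage:
    """Rolling average of drain bindings per Syslog Adapter (hypothetical helper)."""

    def __init__(self, window_size: int) -> None:
        # Assumption: the window holds a fixed count of recent ratio samples.
        self._ratios = deque(maxlen=window_size)

    def add_sample(self, drains: float, adapters: float) -> Optional[float]:
        """Record one (drains, adapters) reading and return the current average."""
        # Skip readings with no Adapters to avoid dividing by zero.
        if adapters > 0:
            self._ratios.append(drains / adapters)
        if not self._ratios:
            return None
        return sum(self._ratios) / len(self._ratios)
```

Feeding in readings of cf-syslog-drain.scheduler.drains and cf-syslog-drain.scheduler.adapters yields a value that can be compared against the per-Adapter binding limit for your deployment.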

Capacity Available

This section lists metrics used to monitor the total percentage of available memory, disk, and cell container capacity.

Number of Available Free Chunks of Memory

This derived metric is recommended for automating and monitoring platform scaling. Insufficient free chunks of memory can prevent pushing and scaling apps. The number of free chunks remaining is a better indicator of impending cf push errors caused by lack of memory than the Memory Remaining metrics alone.

Reports | Metric | Description
Available free chunks of memory | healthwatch.Diego.AvailableFreeChunks | The current number of 4-GB chunks of free memory across all cells
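To show why the chunk count can be more informative than total remaining memory, the sketch below counts whole 4-GB chunks cell by cell. It assumes that a chunk must fit within a single cell and that per-cell remaining memory is available in MB. Both are assumptions made for illustration, and the function name is hypothetical.

```python
CHUNK_MB = 4 * 1024  # one 4-GB chunk, expressed in MB


def available_free_chunks(remaining_memory_mb_per_cell):
    """Count whole 4-GB chunks of free memory, cell by cell (hypothetical helper)."""
    # Assumption: free memory on different cells cannot be combined into one chunk.
    return sum(int(mb // CHUNK_MB) for mb in remaining_memory_mb_per_cell)


# Three cells with 6000, 3000, and 9000 MB free hold 18000 MB in total,
# but only 1 + 0 + 2 = 3 whole chunks.
chunks = available_free_chunks([6000, 3000, 9000])
```

In this example the total free memory would cover four 4-GB apps on paper, yet only three could actually be placed under the stated assumption.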

Percentage of Memory Available

This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment set-up can result in insufficient capacity to tolerate failure of an entire AZ.

Reports | Metric | Description
Available memory | healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M | Percentage of available memory across all cells (averaged over the last 5 min)
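The following sketch illustrates the arithmetic behind a percentage-available metric and its relationship to AZ failure tolerance. It is not the PCF Healthwatch formula: it assumes the percentage is total remaining capacity divided by total capacity across cells, and that cells are spread evenly across AZs. Both function names are hypothetical. The same arithmetic applies to disk and cell container capacity.

```python
def percent_available(remaining_per_cell, total_per_cell):
    """Percentage of capacity still available across all cells (assumed formula)."""
    total = sum(total_per_cell)
    if total == 0:
        raise ValueError("total capacity must be greater than zero")
    return sum(remaining_per_cell) / total * 100


def min_percent_to_survive_az_loss(az_count):
    """Lowest available percentage that still absorbs the loss of one AZ.

    Assumption: capacity is spread evenly, so losing one AZ removes 1/az_count
    of total capacity, and that share must be free elsewhere.
    """
    if az_count < 2:
        raise ValueError("at least two AZs are needed to tolerate an AZ failure")
    return 100 / az_count
```

Under these assumptions, a three-AZ deployment would need roughly one third of its capacity free to absorb the loss of an entire AZ. Use the percentages recommended for your environment rather than this estimate.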

Total Memory Available

This derived metric is recommended for automating platform scaling and understanding on-going trends for capacity planning. This metric is available as of v1.1.4.

Reports | Metric | Description
Available memory | healthwatch.Diego.TotalAvailableMemoryCapacity.5M | Total remaining available memory across all cells (averaged over the last 5 min)

Percentage of Disk Available

This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment set-up can result in insufficient capacity to tolerate failure of an entire AZ.

Reports | Metric | Description
Available disk | healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M | Percentage of available disk across all cells (averaged over the last 5 min)

Total Disk Available

This derived metric is recommended for automating platform scaling and understanding on-going trends for capacity planning. This metric is available as of v1.1.4.

Reports | Metric | Description
Available disk | healthwatch.Diego.TotalAvailableDiskCapacity.5M | Total remaining available disk across all cells (averaged over the last 5 min)

Percentage of Cell Container Capacity Available

This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment set-up can result in insufficient capacity to tolerate failure of an entire AZ.

Reports | Metric | Description
Available cell container capacity | healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M | Percentage of available cell container capacity (averaged over the last 5 min)

Total Cell Container Capacity Available

This derived metric is recommended for automating platform scaling and understanding on-going trends for capacity planning. This metric is available as of v1.1.4.

Reports | Metric | Description
Available cell container capacity | healthwatch.Diego.TotalAvailableContainerCapacity.5M | Total remaining available cell container capacity (averaged over the last 5 min)

BOSH Deployment Occurrence

Monitoring BOSH deployment occurrence adds context to related data, such as VM (job) health.

Limitation: The metric indicates when a BOSH deployment starts or completes, but it cannot currently identify which VMs the deployment affects.

Reports | Metric | Frequency | Description
BOSH deployment occurrence | healthwatch.bosh.deployment | 30 sec | 1 = a running deployment or 0 = not a running deployment
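One way to use this metric as context is to turn its samples into deployment windows and check whether another event, such as a VM health alert, fell inside one. The sketch below assumes the samples are available as (timestamp in seconds, value) pairs sorted by time. The function names are hypothetical and the approach is only an illustration.

```python
def deployment_windows(samples):
    """Group consecutive samples with value 1 into (start, end) windows.

    samples: iterable of (timestamp_seconds, value) pairs, sorted by time.
    """
    windows = []
    start = None
    last = None
    for timestamp, value in samples:
        if value == 1:
            if start is None:
                start = timestamp  # first sample of a new deployment
            last = timestamp
        elif start is not None:
            windows.append((start, last))  # deployment ended before this sample
            start = None
    if start is not None:
        windows.append((start, last))  # deployment still running at the last sample
    return windows


def during_deployment(timestamp, windows):
    """Return True if the timestamp falls inside any deployment window."""
    return any(start <= timestamp <= end for start, end in windows)
```

Because the metric is emitted every 30 seconds, window boundaries derived this way are accurate only to within one sampling interval.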