PCF Healthwatch Metrics
Warning: PCF Healthwatch v1.6 is no longer supported or available for download. PCF Healthwatch v1.6 has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
This topic lists the super metrics created by Pivotal Cloud Foundry (PCF) Healthwatch.
Note: For external monitoring consumers, PCF Healthwatch forwards the metrics that it creates into the Loggregator Firehose. You can retrieve these metrics with a nozzle or through the Log Cache API.
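For example, a minimal Go sketch for pulling recent Healthwatch gauge envelopes through the Log Cache read endpoint (/api/v1/read/SOURCE-ID on the Log Cache CF Auth Proxy) might look like the following. The LOG_CACHE_ADDR, CF_TOKEN, and SOURCE_ID values are placeholders that you supply for your foundation; in particular, the source ID under which the Healthwatch metrics appear is an assumption here, not a documented constant.

```go
// A minimal sketch, not an official client: it reads recent gauge envelopes
// from the Log Cache read endpoint and prints the raw JSON response.
// LOG_CACHE_ADDR, CF_TOKEN, and SOURCE_ID are placeholders for your foundation.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	addr := os.Getenv("LOG_CACHE_ADDR") // for example, https://log-cache.SYSTEM-DOMAIN
	token := os.Getenv("CF_TOKEN")      // for example, the output of `cf oauth-token`
	sourceID := os.Getenv("SOURCE_ID")  // source ID under which Healthwatch metrics appear (assumption)

	// Request the ten most recent gauge envelopes for the source ID.
	url := fmt.Sprintf("%s/api/v1/read/%s?envelope_types=GAUGE&descending=true&limit=10", addr, sourceID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The response is a JSON batch of envelopes; filter for healthwatch.* metric names.
	fmt.Println(string(body))
}
```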
Platform Service Level Indicators
PCF Healthwatch generates metrics that directly indicate the health of several platform components. You can use these metrics with service level objectives to calculate percent availability and error budgets.
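For example, the following sketch shows how the 1/0 results of any of the checks in this section can feed an availability and error budget calculation. The 99.5% target and the sample counts are assumptions for illustration, not Healthwatch defaults.

```go
// Illustrative only: compute percent availability and remaining error budget
// from pass/fail (1/0) check results, against an assumed availability target.
package main

import "fmt"

func main() {
	const sloTarget = 0.995 // assumed service level objective, not a Healthwatch default

	// Hypothetical tallies of a 1/0 check (for example,
	// healthwatch.health.check.cliCommand.success) over a 30-day window of
	// 5-minute runs: 30 * 24 * 12 = 8640 samples.
	totalRuns := 8640
	failedRuns := 25

	availability := float64(totalRuns-failedRuns) / float64(totalRuns)
	errorBudget := (1 - sloTarget) * float64(totalRuns) // allowed failures for the window
	budgetLeft := errorBudget - float64(failedRuns)

	fmt.Printf("availability: %.3f%%\n", availability*100)
	fmt.Printf("error budget: %.1f allowed failures, %.1f remaining\n", errorBudget, budgetLeft)
}
```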
Cloud Foundry CLI Health
The Cloud Foundry Command Line Interface (cf CLI) enables developers to create and manage apps on PCF. PCF Healthwatch executes a continuous test suite for validating the core functions of the cf CLI.
The table below provides information about the cf CLI health smoke tests and the metrics that are generated for these tests.
Test | Metric | Frequency | Description |
---|---|---|---|
Can login | healthwatch.health.check.cliCommand.login and healthwatch.health.check.cliCommand.login.timeout | 5 min | 1 = pass or 0 = fail |
Can push | healthwatch.health.check.cliCommand.push and healthwatch.health.check.cliCommand.push.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Can start | healthwatch.health.check.cliCommand.start and healthwatch.health.check.cliCommand.start.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Receiving logs | healthwatch.health.check.cliCommand.logs and healthwatch.health.check.cliCommand.logs.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Can stop | healthwatch.health.check.cliCommand.stop and healthwatch.health.check.cliCommand.stop.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Can delete | healthwatch.health.check.cliCommand.delete and healthwatch.health.check.cliCommand.delete.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Test app push time | healthwatch.health.check.cliCommand.pushTime | 5 min | Time in ms |
Overall smoke test battery result | healthwatch.health.check.cliCommand.success | 5 min | 1 = pass or 0 = fail |
Overall smoke test battery run time | healthwatch.health.check.cliCommand.duration | 5 min | Time in ms |
Note: Timeout metrics are written only when a timeout occurs, after 2 minutes. Their value is always zero.
Note: PCF Healthwatch runs this test suite in the system org and the healthwatch space.
Ops Manager Health
Issues with Ops Manager health can impact an operator’s ability to perform an upgrade or to scale PCF. Therefore, it is recommended to continuously monitor Ops Manager health. PCF Healthwatch executes this check as follows:
Note: This metric is not emitted if the Ops Manager health check is disabled. For more information, see Installing PCF Healthwatch.
Test | Metric | Frequency | Description |
---|---|---|---|
Ops Manager availability | healthwatch.health.check.OpsMan.available | 1 min | 1 = pass or 0 = fail |
Canary App Health
App availability and responsiveness issues can significantly impact the experience of end users. By default, PCF Healthwatch uses Apps Manager as a canary app. Apps Manager is an excellent indicator of the overall health of apps running on PCF: if Apps Manager is down, it is highly likely that other apps are failing as well, and if it is up and responsive, other apps and their underlying routing are likely healthy too. As of v1.3, you can configure PCF Healthwatch to use a different PCF app for this test. See Configuring Canary App Health Endpoint.
Test | Metric | Frequency | Description |
---|---|---|---|
Apps Manager availability | healthwatch.health.check.CanaryApp.available | 1 min | 1 = pass or 0 = fail (10 s timeout) |
Apps Manager response time | healthwatch.health.check.CanaryApp.responseTime | 1 min | Time in ms |
BOSH Director Health
If the BOSH Director is not responsive and functional, BOSH-managed VMs lose their resiliency. It is recommended to continuously monitor the health of the BOSH Director. PCF Healthwatch executes this check as follows:
Test | Metric | Frequency | Description |
---|---|---|---|
BOSH Director health | healthwatch.health.check.bosh.director.success and healthwatch.health.check.bosh.director.timeout | 10 min | 1 = pass or 0 = fail |
Note: PCF Healthwatch deploys, stops, starts, and deletes a VM named bosh-health-check as part of this test suite.
Note: The timeout metric is written if deploying or deleting the VM takes more than 10 minutes.
Logging Performance
This section lists metrics used to monitor Loggregator, the PCF component responsible for logging.
Log Transport Loss Rate
This derived metric is recommended for automating and monitoring platform scaling. Two versions of the metric (per minute and per hour) are used to monitor the Loggregator Firehose.
Reports | Metric | Description |
---|---|---|
Log Transport Loss Rate | healthwatch.Firehose.LossRate.1H and healthwatch.Firehose.LossRate.1M | Loss rate per hour and per minute, respectively |
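The sketch below shows how a loss rate of this form is typically interpreted, assuming it represents envelopes dropped as a fraction of envelopes ingested over the window; the counter values are hypothetical.

```go
// Illustrative only: a transport loss rate expressed as dropped envelopes
// divided by ingested envelopes over the measurement window.
package main

import "fmt"

func main() {
	// Hypothetical per-minute deltas of the underlying Loggregator counters.
	ingressPerMin := 900_000.0 // envelopes ingested during the window
	droppedPerMin := 1_800.0   // envelopes dropped during the window

	lossRate := droppedPerMin / ingressPerMin
	fmt.Printf("loss rate: %.4f (%.2f%%)\n", lossRate, lossRate*100)
}
```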
Doppler Message Rate Capacity
This derived metric is recommended for automating and monitoring the platform scaling of the Doppler component of Loggregator. It represents the average load on each Doppler instance. When the average approaches 1 million messages per minute per Doppler instance (the recommended maximum load), add more Doppler instances.
Reports | Metric | Description |
---|---|---|
Doppler Message Rate Capacity | healthwatch.Doppler.MessagesAverage.1M | The number of messages reported coming into Doppler (loggregator.doppler.ingress) divided by the current number of Doppler instances, producing an approximate envelopes-per-minute load per Doppler instance |
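For illustration, the following sketch approximates the per-instance load the same way the table above describes, dividing total ingress per minute by the number of Doppler instances; the input values and the 80% alert threshold are assumptions.

```go
// Illustrative only: approximate the per-Doppler envelope rate as total
// ingress per minute divided by the number of Doppler instances, and compare
// it to the recommended maximum of 1 million messages per minute per instance.
package main

import "fmt"

func main() {
	ingressPerMin := 3_600_000.0 // hypothetical total loggregator.doppler.ingress over the last minute
	dopplerInstances := 4.0      // current number of Doppler instances

	perInstance := ingressPerMin / dopplerInstances
	fmt.Printf("average load: %.0f envelopes/min per Doppler instance\n", perInstance)

	const recommendedMax = 1_000_000.0
	if perInstance > 0.8*recommendedMax { // 80% alert threshold is an assumption
		fmt.Println("approaching the recommended maximum; consider adding Doppler instances")
	}
}
```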
Adapter Loss Rate
This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.
Reports | Metric | Description |
---|---|---|
Adapter loss rate (syslog drain performance) | healthwatch.SyslogDrain.Adapter.LossRate.1M | Loss rate per minute |
Reverse Log Proxy Loss Rate
This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.
Reports | Metric | Description |
---|---|---|
Reverse Log Proxy loss rate (syslog drain performance) | healthwatch.SyslogDrain.RLP.LossRate.1M | Loss rate per minute |
Syslog Drain Binding Capacity
This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the CF Syslog Drain feature of Loggregator.
Reports | Metric | Description |
---|---|---|
Average number of drain bindings across Adapter instances | healthwatch.SyslogDrain.Adapter.BindingsAverage.5M | The number of reported syslog drain bindings (cf-syslog-drain.scheduler.drains) divided by the current number of Syslog Adapters (cf-syslog-drain.scheduler.adapters), producing a 5-minute rolling average |
Note: Each adapter can handle a limited number of bindings. Calculating the ratio of drain bindings to available Adapters helps to monitor and scale Adapter instances.
Capacity Available
This section lists metrics used to monitor the total percentage of available memory, disk, and cell container capacity.
Number of Available Free Chunks of Memory
This derived metric is recommended for automating and monitoring platform scaling. Insufficient free chunks of memory can prevent pushing and scaling apps. Monitoring the number of free chunks remaining is a more valuable indicator of impending cf push errors due to lack of memory than relying on the Memory Remaining metrics alone.
Reports | Metric | Description |
---|---|---|
Available free chunks of memory | healthwatch.Diego.AvailableFreeChunks | The current number of calculated 4-GB chunks of free memory across all cells. The default chunk size of 4 GB can be modified via the API. |
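For illustration, the following sketch computes a chunk count under the assumption that the metric is the number of whole chunks of free capacity on each cell, summed across all cells; the per-cell values are hypothetical. The same calculation applies to the 6-GB disk chunks metric below.

```go
// A sketch of the free-chunk calculation, assuming the metric counts whole
// chunks of free capacity per cell and sums them across cells. The default
// chunk size is 4 GB for memory and 6 GB for disk; cell values are hypothetical.
package main

import "fmt"

// freeChunks counts how many whole chunks of size chunkMB fit into the free
// capacity reported by each cell, summed across cells.
func freeChunks(freePerCellMB []int, chunkMB int) int {
	total := 0
	for _, free := range freePerCellMB {
		total += free / chunkMB // integer division: only whole chunks count
	}
	return total
}

func main() {
	// Hypothetical remaining memory (in MB) reported by three Diego cells.
	freeMemoryMB := []int{10_240, 6_144, 3_072}
	fmt.Println("free 4-GB memory chunks:", freeChunks(freeMemoryMB, 4*1024)) // 2 + 1 + 0 = 3

	// Hypothetical remaining disk (in MB) reported by two Diego cells.
	fmt.Println("free 6-GB disk chunks:  ", freeChunks([]int{20_480, 5_120}, 6*1024)) // 3 + 0 = 3
}
```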
Percentage of Memory Available
This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment set-up can result in insufficient capacity to tolerate failure of an entire AZ.
Reports | Metric | Description |
---|---|---|
Available memory | healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M | Percentage of available memory across all cells (averaged over the last 5 min) |
Total Memory Available
This derived metric is recommended for automating platform scaling and understanding on-going trends for capacity planning.
Reports | Metric | Description |
---|---|---|
Available memory | healthwatch.Diego.TotalAvailableMemoryCapacity.5M | Total remaining available memory across all cells (averaged over the last 5 min) |
Number of Available Free Chunks of Disk
This derived metric is recommended for automating and monitoring platform scaling. Insufficient free chunks of disk can prevent pushing and scaling apps. Diego cannot stage app instances and tasks without at least 6 GB free. Monitoring the number of free chunks remaining is a more valuable indicator of impending cf push errors due to lack of disk than relying on the Disk Remaining metrics alone.
Reports | Metric | Description |
---|---|---|
Available free chunks of disk | healthwatch.Diego.AvailableFreeChunksDisk | The current number of calculated 6-GB chunks of free disk across all cells. The default chunk size of 6 GB can be modified via the API. |
Percentage of Disk Available
This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment set-up can result in insufficient capacity to tolerate failure of an entire AZ.
Reports | Metric | Description |
---|---|---|
Available disk | healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M | Percentage of available disk across all cells (averaged over the last 5 min) |
Total Disk Available
This derived metric is recommended for automating platform scaling and understanding on-going trends for capacity planning.
Reports | Metric | Description |
---|---|---|
Available disk | healthwatch.Diego.TotalAvailableDiskCapacity.5M | Total remaining available disk across all cells (averaged over the last 5 min) |
Percentage of Cell Container Capacity Available
This derived metric is recommended for automating and monitoring platform scaling. Going below the recommended percentages for your environment set-up can result in insufficient capacity to tolerate failure of an entire AZ.
Reports | Metric | Description |
---|---|---|
Available cell container capacity | healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M | Percentage of available cell container capacity (averaged over the last 5 min) |
Total Cell Container Capacity Available
This derived metric is recommended for automating platform scaling and understanding on-going trends for capacity planning.
Reports | Metric | Description |
---|---|---|
Available cell container capacity | healthwatch.Diego.TotalAvailableContainerCapacity.5M | Total remaining available cell container capacity (averaged over the last 5 min) |
BOSH Deployment Occurrence
Monitoring BOSH deployment occurrence adds context to related data, such as VM (job) health.
Limitation: PCF Healthwatch can detect when a BOSH deployment starts or completes, but it cannot currently determine which VMs the deployment affects.
Reports | Metric | Frequency | Description |
---|---|---|---|
BOSH deployment occurrence | healthwatch.bosh.deployment | 30 sec | 1 = a deployment is running or 0 = no deployment is running |