PCF Healthwatch Metrics

This topic lists the super metrics created by Pivotal Cloud Foundry (PCF) Healthwatch.

Note: For external monitoring consumers, PCF Healthwatch forwards the metrics it creates into the Loggregator Firehose.

In this topic, you can also find information about the existing PCF platform component and BOSH VM metrics used by PCF Healthwatch.

Healthwatch: Cloud Foundry CLI Health

The Cloud Foundry command line interface (CLI) enables developers to create and manage PCF apps. PCF Healthwatch executes a continuous test suite validating the core app developer functions of the CLI. Running a continuous validation test suite often provides more meaningful assurance of functionality than monitoring metric trends alone.

See the table below for information on generated metrics related to the Cloud Foundry CLI Health smoke tests.

| Test | Metric | Frequency | Description |
| --- | --- | --- | --- |
| Can login | healthwatch.health.check.cliCommand.login and healthwatch.health.check.cliCommand.login.timeout | 5 min | 1 = pass or 0 = fail |
| Can push | healthwatch.health.check.cliCommand.push and healthwatch.health.check.cliCommand.push.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Can start | healthwatch.health.check.cliCommand.start and healthwatch.health.check.cliCommand.start.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Receiving logs | healthwatch.health.check.cliCommand.logs and healthwatch.health.check.cliCommand.logs.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Can stop | healthwatch.health.check.cliCommand.stop and healthwatch.health.check.cliCommand.stop.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Can delete | healthwatch.health.check.cliCommand.delete and healthwatch.health.check.cliCommand.delete.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
| Test app push time | healthwatch.health.check.cliCommand.pushTime | 5 min | Time in ms |
| Overall smoke test battery result | healthwatch.health.check.cliCommand.success | 5 min | 1 = pass or 0 = fail |
| Overall smoke test battery run time | healthwatch.health.check.cliCommand.duration | 5 min | Time in ms |

Note: Timeout metrics are written only when a timeout occurs. Their value is always zero.

Note: PCF Healthwatch runs this test suite in the system org and the healthwatch space.
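Because these pass/fail results are emitted to the Loggregator Firehose as plain name and value pairs, a downstream consumer can reduce each five-minute window to a single alert decision. The following sketch illustrates one way to do that; the metric names come from the table above, but the input shape (a list of name/value tuples) and the evaluate_smoke_tests helper are assumptions for illustration, not part of PCF Healthwatch.

```python
# Illustrative consumer for the CLI smoke-test metrics listed above.
# Assumes the metrics have already been read off the Firehose as (name, value) pairs.

SMOKE_TEST_PREFIX = "healthwatch.health.check.cliCommand."

def evaluate_smoke_tests(samples):
    """Return a list of human-readable problems found in one 5-minute window.

    `samples` is an iterable of (metric_name, value) tuples; this shape is an
    assumption for the example, not a PCF Healthwatch API.
    """
    problems = []
    for name, value in samples:
        if not name.startswith(SMOKE_TEST_PREFIX):
            continue
        check = name[len(SMOKE_TEST_PREFIX):]
        if check.endswith(".timeout"):
            # Timeout metrics are written only when a timeout occurs.
            problems.append(f"{check.removesuffix('.timeout')} timed out")
        elif check == "success" and value == 0:
            problems.append("overall smoke test battery failed")
        elif value == 0 and check not in ("pushTime", "duration"):
            problems.append(f"{check} failed")
    return problems

if __name__ == "__main__":
    window = [
        ("healthwatch.health.check.cliCommand.login", 1),
        ("healthwatch.health.check.cliCommand.push", 0),
        ("healthwatch.health.check.cliCommand.start", -1),  # -1 = test did not run
        ("healthwatch.health.check.cliCommand.success", 0),
    ]
    for problem in evaluate_smoke_tests(window):
        print("ALERT:", problem)
```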

Healthwatch: Ops Manager Health

Issues with Ops Manager health can impact an operator’s ability to perform an upgrade or to rescale the PCF platform when necessary. Therefore, it is recommended to continuously monitor Ops Manager availability. PCF Healthwatch executes this check as a part of its test suite.

| Test | Metric | Frequency | Description |
| --- | --- | --- | --- |
| Ops Manager availability | healthwatch.health.check.OpsMan.available | 1 min | 1 = pass or 0 = fail |

Healthwatch: Apps Manager Health

App availability and responsiveness issues can result in significant end user impacts. PCF Healthwatch uses Apps Manager as a canary app and continuously checks its health. Because of the functions Apps Manager provides, Pivotal recommends it as a canary for insight into the performance of other apps on the foundation.

| Test | Metric | Frequency | Description |
| --- | --- | --- | --- |
| Apps Manager availability | healthwatch.health.check.AppsMan.available | 1 min | 1 = pass or 0 = fail (10 s timeout) |
| Apps Manager response time | healthwatch.health.check.AppsMan.responseTime | 1 min | Time in ms |

Healthwatch: BOSH Director Health

Losing the BOSH Director does not significantly impact the experience of PCF end users. However, this issue means a loss of resiliency for BOSH-managed VMs. It is recommended to continuously monitor the health of the BOSH Director. PCF Healthwatch executes this check as a part of its test suite.

| Test | Metric | Frequency | Description |
| --- | --- | --- | --- |
| BOSH Director health | healthwatch.health.check.bosh.director.success and healthwatch.health.check.bosh.director.timeout | 10 min | 1 = pass or 0 = fail |

Note: The timeout metric is written if a deploy or delete task takes more than 10 minutes.

Note: PCF Healthwatch deploys, stops, starts, and deletes a VM named bosh-health-check as part of this test suite.

Healthwatch: Logging Performance Loss Rates

This section lists metrics used to monitor Loggregator, the PCF component responsible for logging.

Firehose Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. Two versions of the metric (per minute and per hour) are used to monitor the Loggregator Firehose.

| Reports | Metric | Description |
| --- | --- | --- |
| Firehose loss rate | healthwatch.Firehose.LossRate.1H and healthwatch.Firehose.LossRate.1M | Loss rate per hour and per minute |
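Because this derived metric is intended to feed scaling automation, an operator might poll the per-minute and per-hour values together and react only to sustained loss. The sketch below is a minimal example of that idea; the 1% threshold and the needs_loggregator_scaling helper are illustrative assumptions, not recommendations from PCF Healthwatch.

```python
# Minimal sketch of a scaling signal driven by the Firehose loss-rate metrics.
# The 1% threshold is an example value, not an official recommendation.

LOSS_RATE_1M = "healthwatch.Firehose.LossRate.1M"
LOSS_RATE_1H = "healthwatch.Firehose.LossRate.1H"
LOSS_THRESHOLD = 0.01  # assumed example threshold (1% loss)

def needs_loggregator_scaling(latest):
    """Decide whether to flag Loggregator for scaling.

    `latest` maps metric names to their most recent values; this shape is an
    assumption about the operator's metric store.
    """
    per_minute = latest.get(LOSS_RATE_1M, 0.0)
    per_hour = latest.get(LOSS_RATE_1H, 0.0)
    # Require sustained loss (hourly) as well as current loss (per minute)
    # so that a single noisy minute does not trigger scaling.
    return per_minute > LOSS_THRESHOLD and per_hour > LOSS_THRESHOLD

if __name__ == "__main__":
    print(needs_loggregator_scaling({LOSS_RATE_1M: 0.03, LOSS_RATE_1H: 0.02}))  # True
    print(needs_loggregator_scaling({LOSS_RATE_1M: 0.03, LOSS_RATE_1H: 0.0}))   # False
```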

Adapter Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the Scalable Syslog feature of Loggregator.

| Reports | Metric | Description |
| --- | --- | --- |
| Adapter loss rate (syslog drain performance) | healthwatch.SyslogDrain.Adapter.LossRate.1M | Loss rate per minute |

Reverse Log Proxy Loss Rate

This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the Scalable Syslog feature of Loggregator.

| Reports | Metric | Description |
| --- | --- | --- |
| Reverse Log Proxy loss rate (syslog drain performance) | healthwatch.SyslogDrain.RLP.LossRate.1M | Loss rate per minute |

Healthwatch: Percentage of Capacity Available

This section lists metrics used to monitor the total percentage of available memory, disk, and cell container capacity.

Percentage of Memory Available

This derived metric is recommended for automating and monitoring platform scaling.

| Reports | Metric | Description |
| --- | --- | --- |
| Available memory | healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M | Percentage of available memory (averaged over last 5 min) |

Percentage of Disk Available

This derived metric is recommended for automating and monitoring platform scaling.

| Reports | Metric | Description |
| --- | --- | --- |
| Available disk | healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M | Percentage of available disk (averaged over last 5 min) |

Percentage of Cell Container Capacity Available

This derived metric is recommended for automating and monitoring platform scaling.

| Reports | Metric | Description |
| --- | --- | --- |
| Available cell container capacity | healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M | Percentage of available cell container capacity (averaged over last 5 min) |
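All three of these 5-minute capacity averages lend themselves to the same kind of scaling check. The sketch below shows one illustrative consumer; the 35% free-capacity threshold and the low_capacity_dimensions helper are assumptions, not documented recommendations.

```python
# Illustrative check of the Healthwatch capacity percentages listed above.
# The 35% free threshold is an example value, not an official recommendation.

CAPACITY_METRICS = (
    "healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M",
    "healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M",
    "healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M",
)
MIN_FREE_PERCENT = 35.0  # assumed example threshold

def low_capacity_dimensions(latest):
    """Return the capacity metrics whose 5-minute average is below the threshold.

    `latest` maps metric names to their most recent values (0-100).
    """
    return [
        name for name in CAPACITY_METRICS
        if latest.get(name, 100.0) < MIN_FREE_PERCENT
    ]

if __name__ == "__main__":
    sample = {
        "healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M": 28.0,
        "healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M": 61.0,
        "healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M": 44.0,
    }
    for name in low_capacity_dimensions(sample):
        print("Consider adding Diego cells:", name)
```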

Healthwatch: BOSH Deployment Occurrence

Monitoring BOSH deployment occurrence adds context to related data, such as VM (job) health.

Limitation: PCF Healthwatch can determine that a BOSH deployment has started or completed. However, it cannot currently determine which VMs the deployment affects.

| Reports | Metric | Frequency | Description |
| --- | --- | --- | --- |
| BOSH deployment occurrence | healthwatch.bosh.deployment | 30 sec | 1 = a deployment is running or 0 = no deployment is running |

Other Existing Platform Metrics Used

This section lists the existing platform metrics used by PCF Healthwatch. For more information about these metrics, see Key Performance Indicators and Key Capacity Scaling Indicators.

Job Health

The Job Health metric is used for every VM in the CF deployment, and it is provided through BOSH. This does not include additional deployments, such as RabbitMQ or Redis.

| Reports | Metric | Description |
| --- | --- | --- |
| Job health | system.healthy | 1 = system is healthy or 0 = system is not healthy |

Job Vitals

The Job Vitals metrics are written for core ERT jobs, and they are provided through BOSH. This does not include additional deployments, such as RabbitMQ or Redis.

| Reports | Metric | Description |
| --- | --- | --- |
| CPU utilization | system.cpu.user | Percentage of CPU used |
| Memory utilization | system.mem.percent | Percentage of system memory used |
| Disk utilization | system.disk.system.percent | Percentage of system disk used |
| Persistent disk utilization | system.disk.persistent.percent | Percentage of persistent disk used |
| Ephemeral disk utilization | system.disk.ephemeral.percent | Percentage of ephemeral disk used |

Diego Cell Capacity

The Capacity metrics are used to monitor the amount of memory, disk, and container capacity available on Diego cells.

| Reports | Metric | Description |
| --- | --- | --- |
| Available memory | rep.CapacityRemainingMemory | Amount of memory (MiB) available for a Diego cell to allocate to containers |
| Total memory | rep.CapacityTotalMemory | Total amount of memory (MiB) available for this cell to allocate to containers |
| Available disk | rep.CapacityRemainingDisk | Amount of disk (MiB) available for a Diego cell to allocate to containers |
| Total disk | rep.CapacityTotalDisk | Total amount of disk (MiB) available for this cell to allocate to containers |
| Available container capacity | rep.CapacityRemainingContainers | Remaining number of containers this cell can host |
| Total container capacity | rep.CapacityTotalContainers | Total number of containers this cell can host |
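The Healthwatch percentage-of-capacity metrics listed earlier are derived from these per-cell values. This topic does not spell out the exact derivation, so the sketch below shows only a plausible aggregation (summing remaining and total capacity across cells), which is an assumption rather than the documented formula.

```python
# Plausible aggregation of per-cell rep.Capacity* metrics into a platform-wide
# percentage, in the spirit of healthwatch.Diego.TotalPercentageAvailable*Capacity.5M.
# This is an assumed derivation, not the documented Healthwatch formula.

def total_percentage_available(cells, remaining_key, total_key):
    """Aggregate remaining/total capacity across Diego cells into a percentage.

    `cells` is a list of dicts of the latest rep.* gauge values per cell,
    e.g. {"rep.CapacityRemainingMemory": 4096, "rep.CapacityTotalMemory": 16384}.
    """
    remaining = sum(cell.get(remaining_key, 0) for cell in cells)
    total = sum(cell.get(total_key, 0) for cell in cells)
    return 100.0 * remaining / total if total else 0.0

if __name__ == "__main__":
    cells = [
        {"rep.CapacityRemainingMemory": 4096, "rep.CapacityTotalMemory": 16384},
        {"rep.CapacityRemainingMemory": 8192, "rep.CapacityTotalMemory": 16384},
    ]
    pct = total_percentage_available(
        cells, "rep.CapacityRemainingMemory", "rep.CapacityTotalMemory"
    )
    print(f"Available memory: {pct:.1f}%")  # 37.5%
```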

Application Instances

The Application Instances metrics are used to monitor the health of application instances (AIs). For more information about the lifecycle of an app container and crash events, see Crash Events.

| Reports | Metric | Description |
| --- | --- | --- |
| Current running AIs and change in running AIs | bbs.LRPsRunning | Total number of LRP instances running on Diego cells |
| Crashed AIs | bbs.CrashedActualLRPs | Total number of LRP instances that have crashed in a deployment |
| Missing AIs | bbs.LRPsMissing | Total number of LRP instances that are desired but have no record in the BBS |
| Extra AIs | bbs.LRPsExtra | Total number of LRP instances that are no longer desired but still have a BBS record |
| Auctioneer AI starts | auctioneer.AuctioneerLRPAuctionsStarted | Number of LRP instances that the Auctioneer successfully placed on Diego cells |
| Auctioneer AI failures | auctioneer.AuctioneerLRPAuctionsFailed | Number of LRP instances that the Auctioneer failed to place on Diego cells |
| Auctioneer task placement failures | auctioneer.AuctioneerTaskAuctionsFailed | Number of Tasks that the Auctioneer failed to place on Diego cells |

Diego Health

The Diego health and performance metrics are used to monitor core Diego functionality.

| Reports | Metric | Description |
| --- | --- | --- |
| BBS time to handle requests | bbs.RequestLatency | Time in ns that the BBS took to handle requests aggregated across all its API endpoints |
| BBS time to run LRP convergence | bbs.ConvergenceLRPDuration | Time in ns that the BBS took to run its LRP convergence pass |
| Auctioneer time to fetch Cell state | auctioneer.AuctioneerFetchStatesDuration | Time in ns that the Auctioneer took to fetch state from all the Diego cells when running its auction |
| Route Emitter time to sync | route_emitter.RouteEmitterSyncDuration | Time in ns that the active route-emitter took to perform its synchronization pass |
| Cell Rep time to sync | rep.RepBulkSyncDuration | Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual Garden containers |
| Locket active presences | locket.ActivePresences | Total count of active presences* |
| Locket active locks | locket.ActiveLocks | Total count of how many locks the system components are holding |
| Diego Cell health check | rep.UnhealthyCell | 0 = healthy Cell or 1 = unhealthy Cell |
| Diego and Cloud Controller synced check | bbs.Domain.cf-apps | Indicates whether the cf-apps domain is up-to-date |

* Presences are defined as the registration records that the Cells maintain to advertise themselves to the platform.

The Diego Cell periodically checks its health against the Garden backend; rep.UnhealthyCell reports the result of this check.

An up-to-date cf-apps domain (bbs.Domain.cf-apps) means that Cloud Foundry app requests from Cloud Controller have been synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution.

Logging Performance

The Loggregator Firehose and Scalable Syslog metrics are used to monitor PCF logging performance.

| Reports | Metric | Description |
| --- | --- | --- |
| Firehose throughput | DopplerServer.listeners.totalReceivedMessageCount (+ loggregator.doppler.ingress in PCF v1.12) | Total number of messages received across all Doppler listeners |
| Firehose dropped messages | DopplerServer.doppler.shedEnvelopes (+ loggregator.doppler.dropped in PCF v1.12) | Total number of messages intentionally dropped by Doppler due to back pressure |
| Syslog drain binding count | scalablesyslog.scheduler.drains | Number of scalable syslog drain bindings |
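The throughput and dropped-message counters above are cumulative, so a loss rate comparable in spirit to the healthwatch.Firehose.LossRate metrics can be approximated from deltas between two readings. The sketch below shows that calculation; defining loss rate as dropped divided by received over the interval is an assumption, not the formula documented for PCF Healthwatch.

```python
# Approximate a Firehose loss rate from the two Doppler counters listed above.
# Both counters are cumulative, so the rate is computed from deltas between
# two readings. Dropped/received is an assumed definition of "loss rate".

RECEIVED = "DopplerServer.listeners.totalReceivedMessageCount"
DROPPED = "DopplerServer.doppler.shedEnvelopes"

def loss_rate(previous, current):
    """Compute the fraction of messages dropped between two counter readings.

    `previous` and `current` map the metric names above to counter values.
    """
    received_delta = current[RECEIVED] - previous[RECEIVED]
    dropped_delta = current[DROPPED] - previous[DROPPED]
    if received_delta <= 0:
        return 0.0
    return dropped_delta / received_delta

if __name__ == "__main__":
    t0 = {RECEIVED: 1_000_000, DROPPED: 120}
    t1 = {RECEIVED: 1_250_000, DROPPED: 370}
    print(f"Loss rate over interval: {loss_rate(t0, t1):.4%}")  # 0.1000%
```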

Router

The Router metrics are used to monitor the health and performance of the Gorouter.

| Reports | Metric | Description |
| --- | --- | --- |
| Router throughput | gorouter.total_requests | Lifetime number of requests completed by the Gorouter VM |
| Router latency | gorouter.latency | Time (ms) the Gorouter takes to handle requests to its app endpoints |
| Router jobs CPU | system.cpu.user | CPU utilization of the Gorouter job(s) as reported by BOSH |
| 502 bad gateways | gorouter.bad_gateways | Lifetime number of bad gateways, or 502 responses, from the Gorouter itself |
| All 5XX errors | gorouter.responses.5xx | Lifetime number of requests completed by the Gorouter VM for HTTP status family 5xx, server errors |
| Number of routes registered | gorouter.total_routes | Current total number of routes registered with the Gorouter |
| Router file descriptors | gorouter.file_descriptors | Number of file descriptors currently used by the Gorouter job* |
| Router exhausted connections | gorouter.backend_exhausted_conns | Lifetime number of requests that have been rejected by the Gorouter VM due to the `Max Connections Per Backend` limit being reached across all tried backends* |
| Time since last route registered | gorouter.ms_since_last_registry_update | Time in ms since the last route register was received |

* These metrics are relevant to PCF v1.12 and do not appear in PCF Healthwatch if it is running on PCF v1.11.
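Because gorouter.total_requests and gorouter.responses.5xx are lifetime counters, a useful server-error rate comes from comparing two readings rather than from the raw values. The sketch below shows one illustrative calculation; the five_xx_error_rate helper and the sample numbers are assumptions.

```python
# Illustrative 5xx error-rate calculation from the Gorouter lifetime counters.
# Lifetime counters only ever grow, so rates are computed from deltas.

TOTAL = "gorouter.total_requests"
SERVER_ERRORS = "gorouter.responses.5xx"

def five_xx_error_rate(previous, current):
    """Fraction of requests in the interval that returned a 5xx status.

    `previous` and `current` map the metric names above to counter readings
    taken at two points in time.
    """
    total_delta = current[TOTAL] - previous[TOTAL]
    error_delta = current[SERVER_ERRORS] - previous[SERVER_ERRORS]
    if total_delta <= 0:
        return 0.0
    return error_delta / total_delta

if __name__ == "__main__":
    t0 = {TOTAL: 5_000_000, SERVER_ERRORS: 2_300}
    t1 = {TOTAL: 5_060_000, SERVER_ERRORS: 2_420}
    print(f"5xx rate over interval: {five_xx_error_rate(t0, t1):.3%}")  # 0.200%
```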