PCF Healthwatch Metrics
- Healthwatch: Cloud Foundry CLI Health
- Healthwatch: Ops Manager Health
- Healthwatch: Apps Manager Health
- Healthwatch: BOSH Director Health
- Healthwatch: Logging Performance Loss Rates
- Healthwatch: Percentage of Capacity Available
- Healthwatch: BOSH Deployment Occurrence
- Other Existing Platform Metrics Used
This topic lists the super metrics created by Pivotal Cloud Foundry (PCF) Healthwatch.
Note: For external monitoring consumers, PCF Healthwatch forwards the metrics it creates into the Loggregator Firehose.
In this topic, you can also find information about the existing PCF platform component and BOSH VM metrics used by PCF Healthwatch.
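For example, an external consumer can read these metrics from the Loggregator Firehose and filter on the healthwatch. name prefix. The sketch below uses the open source NOAA consumer library; the traffic controller URL, subscription ID, and OAuth token are placeholder assumptions, and the snippet is illustrative only, not part of PCF Healthwatch.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"strings"

	"github.com/cloudfoundry/noaa/consumer"
	"github.com/cloudfoundry/sonde-go/events"
)

func main() {
	// Placeholder endpoint and token: replace with your foundation's
	// traffic controller URL and a valid OAuth token. Adjust the TLS
	// configuration to match your environment.
	c := consumer.New("wss://doppler.sys.example.com:443", &tls.Config{}, nil)
	envelopes, errs := c.Firehose("healthwatch-example-nozzle", "oauth-token")

	go func() {
		for err := range errs {
			fmt.Println("firehose error:", err)
		}
	}()

	// Print only the ValueMetrics that PCF Healthwatch creates.
	for env := range envelopes {
		if env.GetEventType() != events.Envelope_ValueMetric {
			continue
		}
		vm := env.GetValueMetric()
		if strings.HasPrefix(vm.GetName(), "healthwatch.") {
			fmt.Printf("%s = %v %s\n", vm.GetName(), vm.GetValue(), vm.GetUnit())
		}
	}
}
```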
Healthwatch: Cloud Foundry CLI Health
The Cloud Foundry command line interface (CLI) enables developers to create and manage PCF apps. PCF Healthwatch executes a continuous test suite validating the core app developer functions of the CLI. Running a continuous validation test suite often provides more meaningful assurance of functionality than monitoring trends in metrics alone.
See the table below for information on generated metrics related to the Cloud Foundry CLI Health smoke tests.
Test | Metric | Frequency | Description |
---|---|---|---|
Can login | healthwatch.health.check.cliCommand.login and healthwatch.health.check.cliCommand.login.timeout | 5 min | 1 = pass or 0 = fail |
Can push | healthwatch.health.check.cliCommand.push and healthwatch.health.check.cliCommand.push.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Can start | healthwatch.health.check.cliCommand.start and healthwatch.health.check.cliCommand.start.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Receiving logs | healthwatch.health.check.cliCommand.logs and healthwatch.health.check.cliCommand.logs.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Can stop | healthwatch.health.check.cliCommand.stop and healthwatch.health.check.cliCommand.stop.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Can delete | healthwatch.health.check.cliCommand.delete and healthwatch.health.check.cliCommand.delete.timeout | 5 min | 1 = pass, 0 = fail, or -1 = test did not run |
Test app push time | healthwatch.health.check.cliCommand.pushTime | 5 min | Time in ms |
Overall smoke test battery result | healthwatch.health.check.cliCommand.success | 5 min | 1 = pass or 0 = fail |
Overall smoke test battery run time | healthwatch.health.check.cliCommand.duration | 5 min | Time in ms |
Note: Timeout metrics are written only when a timeout occurs. Their value is always zero.
Note: PCF Healthwatch runs this test suite in the system org and the healthwatch space.
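When consuming these results programmatically, the 1/0/-1 convention and the write-only-on-timeout behavior described in the notes above can be folded into a small helper. The following Go sketch is illustrative; the function name and shape are not part of PCF Healthwatch.

```go
// cliCheckStatus interprets a healthwatch.health.check.cliCommand.* value
// using the convention documented above: 1 = pass, 0 = fail, and
// -1 = the test did not run (for example, because an earlier step in the
// suite did not complete). A timeout metric, when present, always carries
// the value 0 and indicates that the corresponding step timed out.
func cliCheckStatus(value float64, timedOut bool) string {
	if timedOut {
		return "timeout"
	}
	switch value {
	case 1:
		return "pass"
	case 0:
		return "fail"
	case -1:
		return "did not run"
	default:
		return "unknown"
	}
}
```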
Healthwatch: Ops Manager Health
Issues with Ops Manager health can impact an operator’s ability to perform an upgrade or to rescale the PCF platform when necessary. Therefore, it is recommended to continuously monitor Ops Manager availability. PCF Healthwatch executes this check as a part of its test suite.
Test | Metric | Frequency | Description |
---|---|---|---|
Ops Manager availability | healthwatch.health.check.OpsMan.available | 1 min | 1 = pass or 0 = fail |
Healthwatch: Apps Manager Health
App availability and responsiveness issues can result in significant end user impacts. PCF Healthwatch uses Apps Manager as a canary app and continuously checks its health. Because of the functions Apps Manager provides, Pivotal recommends it as a canary for insight into the performance of other apps on the foundation.
Test | Metric | Frequency | Description |
---|---|---|---|
Apps Manager availability | healthwatch.health.check.AppsMan.available | 1 min | 1 = pass or 0 = fail (10-second timeout) |
Apps Manager response time | healthwatch.health.check.AppsMan.responseTime | 1 min | Time in ms |
Healthwatch: BOSH Director Health
Losing the BOSH Director does not significantly impact the experience of PCF end users. However, this issue means a loss of resiliency for BOSH-managed VMs. It is recommended to continuously monitor the health of the BOSH Director. PCF Healthwatch executes this check as a part of its test suite.
Test | Metric | Frequency | Description |
---|---|---|---|
BOSH Director health | healthwatch.health.check.bosh.director.success and healthwatch.health.check.bosh.director.timeout | 10 min | 1 = pass or 0 = fail |
Note: The timeout metric is written if a deploy or delete task takes more than 10 minutes.
Note: PCF Healthwatch deploys, stops, starts, and deletes a VM named bosh-health-check as part of this test suite.
Healthwatch: Logging Performance Loss Rates
This section lists metrics used to monitor Loggregator, the PCF component responsible for logging.
Firehose Loss Rate
This derived metric is recommended for automating and monitoring platform scaling. Two versions of the metric (per minute and per hour) are used to monitor the Loggregator Firehose.
Reports | Metric | Description |
---|---|---|
Firehose loss rate | healthwatch.Firehose.LossRate.1H and healthwatch.Firehose.LossRate.1M | Loss rate per hour and per minute |
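As a rough illustration of what a loss-rate style metric represents, the sketch below divides the messages dropped during a window by the messages received during the same window, using deltas of the Doppler counters listed under Logging Performance later in this topic. The exact calculation behind healthwatch.Firehose.LossRate.1M and .1H is not specified here; this is an assumption for illustration.

```go
// firehoseLossRate sketches a loss-rate calculation over a time window:
// the change in dropped messages (for example, DopplerServer.doppler.shedEnvelopes)
// divided by the change in received messages
// (for example, DopplerServer.listeners.totalReceivedMessageCount).
// It mirrors the general idea of a loss rate, not necessarily the exact
// formula PCF Healthwatch uses.
func firehoseLossRate(droppedDelta, receivedDelta float64) float64 {
	if receivedDelta <= 0 {
		return 0 // no traffic in the window, so no meaningful loss rate
	}
	return droppedDelta / receivedDelta
}
```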
Adapter Loss Rate
This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the Scalable Syslog feature of Loggregator.
Reports | Metric | Description |
---|---|---|
Adapter loss rate (syslog drain performance) | healthwatch.SyslogDrain.Adapter.LossRate.1M | Loss rate per minute |
Reverse Log Proxy Loss Rate
This derived metric is recommended for automating and monitoring platform scaling. The metric is used to monitor the Scalable Syslog feature of Loggregator.
Reports | Metric | Description |
---|---|---|
Reverse Log Proxy loss rate (syslog drain performance) | healthwatch.SyslogDrain.RLP.LossRate.1M | Loss rate per minute |
Healthwatch: Percentage of Capacity Available
This section lists metrics used to monitor the total percentage of available memory, disk, and cell container capacity.
Percentage of Memory Available
This derived metric is recommended for automating and monitoring platform scaling.
Reports | Metric | Description |
---|---|---|
Available memory | healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M | Percentage of available memory (averaged over last 5 min) |
Percentage of Disk Available
This derived metric is recommended for automating and monitoring platform scaling.
Reports | Metric | Description |
---|---|---|
Available disk | healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M | Percentage of available disk (averaged over last 5 min) |
Percentage of Cell Container Capacity Available
This derived metric is recommended for automating and monitoring platform scaling.
Reports | Metric | Description |
---|---|---|
Available cell container capacity | healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M | Percentage of available cell container capacity (averaged over last 5 min) |
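A foundation-wide percentage like these is naturally derived from per-cell remaining and total values, such as the rep.Capacity* metrics listed under Diego Cell Capacity later in this topic. The Go sketch below assumes a simple sum-then-divide aggregation, which may differ from the exact calculation PCF Healthwatch performs.

```go
// percentAvailable sketches how a foundation-wide available-capacity
// percentage could be derived from per-cell values, for example
// rep.CapacityRemainingMemory and rep.CapacityTotalMemory across all
// Diego cells. The aggregation shown (sum, then divide) is an assumption.
func percentAvailable(remainingPerCell, totalPerCell []float64) float64 {
	var remaining, total float64
	for _, r := range remainingPerCell {
		remaining += r
	}
	for _, t := range totalPerCell {
		total += t
	}
	if total == 0 {
		return 0
	}
	return 100 * remaining / total
}
```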
Healthwatch: BOSH Deployment Occurrence
Monitoring BOSH deployment occurrence adds context to related data, such as VM (job) health.
Limitation: PCF Healthwatch can detect when a BOSH deployment starts or completes, but it cannot currently determine which VMs the deployment affects.
Reports | Metric | Frequency | Description |
---|---|---|---|
BOSH deployment occurrence | healthwatch.bosh.deployment | 30 sec | 1 = a deployment is running or 0 = no deployment is running |
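One practical use of this metric is to add deployment context to VM health alerting, for example by suppressing or annotating system.healthy alerts while a deployment is running, since VMs are expected to be recreated during deploys. The policy below is an illustrative sketch, not behavior built into PCF Healthwatch.

```go
// shouldPageOnJobHealth sketches an alerting policy that combines the job
// health metric (system.healthy: 1 = healthy, 0 = unhealthy) with the BOSH
// deployment occurrence metric (healthwatch.bosh.deployment: 1 = a deployment
// is running). An unhealthy VM only pages when no deployment is in progress.
func shouldPageOnJobHealth(systemHealthy, boshDeployment float64) bool {
	jobUnhealthy := systemHealthy == 0
	deploymentRunning := boshDeployment == 1
	return jobUnhealthy && !deploymentRunning
}
```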
Other Existing Platform Metrics Used
This section lists the existing platform metrics used by PCF Healthwatch. For more information about these metrics, see Key Performance Indicators and Key Capacity Scaling Indicators.
Job Health
The Job Health metric is used for every VM in the CF deployment, and it is provided through BOSH. This does not include additional deployments, such as RabbitMQ or Redis.
Reports | Metric | Description |
---|---|---|
Job health | system.healthy | 1 = system is healthy or 0 = system is not healthy |
Job Vitals
The Job Vitals metrics are written for core ERT jobs, and they are provided through BOSH. This does not include additional deployments, such as RabbitMQ or Redis.
Reports | Metric | Description |
---|---|---|
CPU utilization | system.cpu.user | Percentage of CPU used |
Memory utilization | system.mem.percent | Percentage of system memory used |
Disk utilization | system.disk.system.percent | Percentage of system disk used |
Persistent disk utilization | system.disk.persistent.percent | Percentage of persistent disk used |
Ephemeral disk utilization | system.disk.ephemeral.percent | Percentage of ephemeral disk used |
Diego Cell Capacity
The Capacity metrics are used to monitor the amount of memory, disk, and container capacity available for Diego cell(s).
Reports | Metric | Description |
---|---|---|
Available memory | rep.CapacityRemainingMemory | Amount of memory (MiB) available for a Diego cell to allocate to containers |
Total memory | rep.CapacityTotalMemory | Total amount of memory (MiB) available for this cell to allocate to containers |
Available disk | rep.CapacityRemainingDisk | Amount of disk (MiB) available for a Diego cell to allocate to containers |
Total disk | rep.CapacityTotalDisk | Total amount of disk (MiB) available for this cell to allocate to containers |
Available container capacity | rep.CapacityRemainingContainers | Remaining number of containers this cell can host |
Total container capacity | rep.CapacityTotalContainers | Total number of containers this cell can host |
Application Instances
The Application Instances metrics are used to monitor the health of application instances (AIs). For more information about the lifecycle of an app container and crash events, see Crash Events.
Reports | Metric | Description |
---|---|---|
Current running AIs and change in running AIs | bbs.LRPsRunning | Total number of LRP instances running on Diego cells |
Crashed AIs | bbs.CrashedActualLRPs | Total number of LRP instances that have crashed in a deployment |
Missing AIs | bbs.LRPsMissing | Total number of LRP instances that are desired but have no record in the BBS |
Extra AIs | bbs.LRPsExtra | Total number of LRP instances that are no longer desired but still have a BBS record |
Auctioneer AI starts | auctioneer.AuctioneerLRPAuctionsStarted | Number of LRP instances that the Auctioneer successfully placed on Diego cells |
Auctioneer AI failures | auctioneer.AuctioneerLRPAuctionsFailed | Number of LRP instances that the Auctioneer failed to place on Diego cells |
Auctioneer task placement failures | auctioneer.AuctioneerTaskAuctionsFailed | Number of Tasks that the Auctioneer failed to place on Diego cells |
Diego Health
The Diego health and performance metrics are used to monitor core Diego functionality.
Reports | Metric | Description |
---|---|---|
BBS time to handle requests | bbs.RequestLatency | Time in ns that the BBS took to handle requests aggregated across all its API endpoints |
BBS time to run LRP convergence | bbs.ConvergenceLRPDuration | Time in ns that the BBS took to run its LRP convergence pass |
Auctioneer time to fetch Cell state | auctioneer.AuctioneerFetchStatesDuration | Time in ns that the Auctioneer took to fetch state from all the Diego cells when running its auction |
Route Emitter time to sync | route_emitter.RouteEmitterSyncDuration | Time in ns that the active route-emitter took to perform its synchronization pass |
Cell Rep time to sync | rep.RepBulkSyncDuration | Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual Garden containers |
Locket active presences | locket.ActivePresences | Total count of active presences* |
Locket active locks | locket.ActiveLocks | Total count of how many locks the system components are holding |
Diego Cell health check | rep.UnhealthyCell | 0 = healthy Cell or 1 = unhealthy Cell† |
Diego and Cloud Controller in sync check | bbs.Domain.cf-apps | Indicates whether the cf-apps domain is up-to-date, meaning that Cloud Foundry app requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution |
* Presences are defined as the registration records that the Cells maintain to advertise themselves to the platform.
† The Diego cell periodically checks its health against the Garden backend.
Logging Performance
The Loggregator Firehose and Scalable Syslog metrics are used to monitor PCF logging performance.
Reports | Metric | Description |
---|---|---|
Firehose throughput | DopplerServer.listeners.totalReceivedMessageCount (+ loggregator.doppler.ingress in PCF v1.12) | Total number of messages received across all Doppler listeners |
Firehose dropped messages | DopplerServer.doppler.shedEnvelopes (+ loggregator.doppler.dropped in PCF v1.12) | Total number of messages intentionally dropped by Doppler due to back pressure |
Syslog drain binding count | scalablesyslog.scheduler.drains | Number of scalable syslog drain bindings |
Router
The Router metrics are used to monitor the health and performance of the Gorouter.
Reports | Metric | Description |
---|---|---|
Router throughput | gorouter.total_requests | Lifetime number of requests completed by the Gorouter VM |
Router latency | gorouter.latency | Time in ms that the Gorouter takes to handle requests to its app endpoints |
Router jobs CPU | system.cpu.user | CPU utilization of the Gorouter job(s) as reported by BOSH |
502 bad gateways | gorouter.bad_gateways | Lifetime number of bad gateways, or 502 responses, from the Gorouter itself |
All 5XX errors | gorouter.responses.5xx | Lifetime number of requests completed by the Gorouter VM for HTTP status family 5xx, server errors |
Number of routes registered | gorouter.total_routes | Current total number of routes registered with the Gorouter |
Router file descriptors | gorouter.file_descriptors | Number of file descriptors currently used by the Gorouter job* |
Router exhausted connections | gorouter.backend_exhausted_conns | Lifetime number of requests rejected by the Gorouter VM because the `Max Connections Per Backend` limit was reached across all tried backends* |
Time since last route registered | gorouter.ms_since_last_registry_update | Time in ms since the last route register was received |
* These metrics are relevant to PCF v1.12 and do not appear in PCF Healthwatch if it is running on PCF v1.11.
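Because gorouter.total_requests and gorouter.responses.5xx are lifetime counters, rates are usually derived from their deltas over a polling interval. The sketch below shows one such derivation; it is illustrative and is not a metric that PCF Healthwatch emits.

```go
// serverErrorRatio sketches a windowed server-error ratio for the Gorouter:
// the change in gorouter.responses.5xx over an interval divided by the
// change in gorouter.total_requests over the same interval.
func serverErrorRatio(delta5xx, deltaTotalRequests float64) float64 {
	if deltaTotalRequests <= 0 {
		return 0 // no completed requests in the window
	}
	return delta5xx / deltaTotalRequests
}
```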