Key Performance Indicators

This topic describes Key Performance Indicators (KPIs) that operators may want to monitor in their Pivotal Cloud Foundry (PCF) deployment to help ensure it is in a good operational state.

The following PCF v1.11 KPIs are provided for operators to give general guidance on monitoring a PCF deployment using platform component and system (BOSH) metrics. Although many metrics are emitted from the platform, the following PCF v1.11 KPIs are high-signal-value metrics that can indicate emerging platform issues.

This alerting and response guidance has been shown to apply to most deployments. Pivotal recommends that operators continue to fine-tune the alert measures to their deployment by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, deployment-specific monitoring metrics, thresholds, and alerts based on learning from their deployments.

Note: Thresholds noted as “dynamic” in the tables below indicate that while a metric is highly important to watch, the relative numbers to set threshold warnings at are specific to a given PCF environment and its use cases. These dynamic thresholds should be occasionally revisited because the PCF foundation and its usage continue to evolve. See Determine Warning and Critical Thresholds for more information.

Diego Auctioneer Metrics

Auctioneer App Instance (AI) Placement Failures


auctioneer.AuctioneerLRPAuctionsFailed

Description The number of Long Running Process (LRP) instances that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: This metric can indicate that PCF is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either because it is not deployed or because it lacks sufficient resources to accept the work.

This error most commonly occurs due to capacity issues, for example, if cells do not have enough resources or if cells are cycling between healthy and unhealthy states.

Origin: Firehose
Type: Counter (Integer)
Frequency: During each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 0.5
Red critical: ≥ 1
Recommended response
  1. To determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the Cloud Controller (CC) API.
  2. Investigate the health of your Diego cells to determine if they are the resource type causing the problem.
  3. Scale additional cells using Ops Manager.
  4. If scaling cells does not solve the problem, pull Diego brain logs and BBS node logs and contact Pivotal Support telling them that LRP auctions are failing.
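
As a sketch of the recommended measurement above (per-minute delta of a cumulative counter, averaged over a 5-minute window), the following Python example shows the calculation; the same approach applies to the other cumulative counters in this topic. The samples list is an assumed input representing (timestamp, counter value) pairs pulled from your metrics store.

    # Sketch: per-minute delta of a cumulative counter, averaged over a 5-minute window.
    # `samples` is a list of (timestamp_seconds, counter_value) pairs covering the last
    # 5 minutes, oldest first, e.g. for auctioneer.AuctioneerLRPAuctionsFailed.
    def per_minute_delta_avg(samples):
        if len(samples) < 2:
            return 0.0
        (start_ts, start_val), (end_ts, end_val) = samples[0], samples[-1]
        elapsed = end_ts - start_ts
        if elapsed <= 0:
            return 0.0
        total_delta = max(end_val - start_val, 0)  # guard against counter resets on job restart
        return total_delta / (elapsed / 60.0)      # average increase per minute

    samples = [(0, 100), (60, 100), (120, 101), (180, 101), (240, 102), (300, 103)]
    rate = per_minute_delta_avg(samples)
    if rate >= 1:
        print("RED: LRP auction failures at %.2f per minute" % rate)
    elif rate >= 0.5:
        print("YELLOW: LRP auction failures at %.2f per minute" % rate)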

Auctioneer Time to Fetch Cell State


auctioneer.AuctioneerFetchStatesDuration

Description Time in ns that the auctioneer took to fetch state from all the Diego cells when running its auction.

Use: Indicates how the cells themselves are performing. Alerting on this metric can warn you that app staging requests to Diego may be failing.

Origin: Firehose
Type: Gauge, integer in ns
Frequency: During event, during each auction
Recommended measurement Maximum over the last 5 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 2 s
Red critical: ≥ 5 s
Recommended response
  1. Check the health of the cells by reviewing the logs and looking for errors.
  2. Review IaaS console metrics.
  3. Pull Diego brain logs and cell logs and contact Pivotal Support telling them that fetching cell states is taking too long.
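
The measurement above, and the other nanosecond-duration KPIs in this topic such as bbs.ConvergenceLRPDuration, bbs.RequestLatency, rep.RepBulkSyncDuration, and route_emitter.RouteEmitterSyncDuration, reduces to taking the maximum gauge value over the window and converting from ns to seconds. A minimal Python sketch, assuming the raw gauge values have already been pulled from your metrics store:

    # Sketch: maximum of a nanosecond-duration gauge over a window, expressed in seconds.
    NANOS_PER_SECOND = 1_000_000_000

    def max_duration_seconds(samples_ns):
        # samples_ns: raw gauge values (ns) observed during the measurement window
        return max(samples_ns) / NANOS_PER_SECOND if samples_ns else 0.0

    fetch_states_ns = [1.2e9, 2.6e9, 1.9e9]  # example AuctioneerFetchStatesDuration values
    worst = max_duration_seconds(fetch_states_ns)
    if worst >= 5:
        print("RED: fetching cell states took %.1f s" % worst)
    elif worst >= 2:
        print("YELLOW: fetching cell states took %.1f s" % worst)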

Auctioneer App Instance Starts


auctioneer.AuctioneerLRPAuctionsStarted

Description The number of LRP instances that the auctioneer successfully placed on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Provides a sense of the running system activity level in your environment and of how many app instances have been started over time. The recommended measurement below can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window.

Origin: Firehose
Type: Counter (Integer)
Frequency: During event, during each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response When observing a significant amount of container churn, do the following:

  1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity.
  2. If container churn appears to continue over an extended period, pull logs from the Diego Brain and BBS node before contacting Pivotal support.
When observing extended periods of high or low activity trends, scale up or down CF components as needed.

Auctioneer Task Placement Failures


auctioneer.AuctioneerTaskAuctionsFailed

Description The number of Tasks that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either because it is not deployed or because it lacks sufficient resources to accept the work.

This error most commonly occurs due to capacity issues, for example, if cells do not have enough resources or if cells are cycling between healthy and unhealthy states.

Origin: Firehose
Type: Counter (Float)
Frequency: During event, during each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 0.5
Red critical: ≥ 1
Recommended response
  1. To determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the CC API.
  2. Investigate the health of Diego cells.
  3. Scale additional cells using Ops Manager.
  4. If scaling cells does not solve the problem, pull Diego brain logs and BBS logs for troubleshooting, then contact Pivotal Support and tell them that Task auctions are failing.

Diego BBS Metrics

BBS Time to Run LRP Convergence


bbs.ConvergenceLRPDuration

Description Time in ns that the BBS took to run its LRP convergence pass.

Use: If the convergence run takes too long, apps or Tasks may be crashing without restarting. This symptom can also indicate loss of connectivity to the BBS database.

Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: During event, every 30 seconds when LRP convergence runs, emission should be near-constant on a running deployment
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 10 s
Red critical: ≥ 20 s
Recommended response
  1. Check BBS logs for errors.
  2. Try vertically scaling the BBS VM resources up. For example, add more CPUs or memory, depending on its system.cpu and system.memory metrics.
  3. If that does not solve the issue, pull the BBS logs and contact Pivotal Support for additional troubleshooting.

BBS Time to Handle Requests


bbs.RequestLatency

Description Time in ns that the BBS took to handle requests, aggregated across all its API endpoints.

Use: If this metric rises, the PCF API is slowing. Response to certain cf CLI commands is slow if request latency is high.

Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: During event, when the BBS API handles requests, emission should be near-constant on a running deployment
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 5 s
Red critical: ≥ 10 s
Recommended response
  1. Check CPU and memory statistics in Ops Manager.
  2. Check BBS logs for faults and errors that can indicate issues with BBS.
  3. Try scaling the BBS VM resources up. For example, add more CPUs or memory, depending on its system.cpu and system.memory metrics.
  4. If the above steps do not solve the issue, collect a sample of the cell logs from the BBS VMs and contact Pivotal Support to troubleshoot further.

Cloud Controller and Diego in Sync


bbs.Domain.cf-apps

Description Indicates if the cf-apps Domain is up-to-date, meaning that CF App requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution.
  • 1 means cf-apps Domain is up-to-date
  • No data received means cf-apps Domain is not up-to-date
Use: If the cf-apps Domain does not stay up-to-date, changes requested in the Cloud Controller are not guaranteed to propagate throughout the system. If the Cloud Controller and Diego are out of sync, then the apps that are running could differ from those that are desired.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: not applicable
Red critical: < 1
Recommended response
  1. Check the BBS logs.
  2. If the problem continues, pull Diego brain logs and BBS logs and contact Pivotal Support to say that the cf-apps domain is not being kept fresh.
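
Because a stale cf-apps domain shows up either as an average below 1 or as no data received at all, an alert check needs to treat an empty result set as critical. A minimal Python sketch, assuming samples holds whatever bbs.Domain.cf-apps values your metrics store returned for the last 5 minutes:

    # Sketch: cf-apps domain freshness check.
    # An empty sample set (no data received) is treated the same as an average below 1.
    def cf_apps_domain_fresh(samples):
        if not samples:
            return False  # no data received means the domain is not up-to-date
        return (sum(samples) / len(samples)) >= 1

    if not cf_apps_domain_fresh([1, 1, 1, 1, 1]):
        print("RED: cf-apps domain is not being kept fresh")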

More App Instances Than Expected


bbs.LRPsExtra

Description Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs.

Use: If Diego has more LRPs running than expected, there may be problems with the BBS.

Deleting an app with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsExtra is unusual and should be investigated.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response
  1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
  2. If the condition persists, pull the BBS logs and contact Pivotal Support.

Fewer App Instances Than Expected


bbs.LRPsMissing

Description Total number of LRP instances that are desired but have no record in the BBS. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs.

Use: If Diego has fewer LRPs running than expected, there may be problems with the BBS.

An app push with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsMissing is unusual and should be investigated.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response
  1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
  2. If the condition persists, pull the BBS logs and contact Pivotal Support.

Crashed App Instances


bbs.CrashedActualLRPs

Description Total number of LRP instances that have crashed.

Use: Indicates how many instances in the deployment are in a crashed state. An increase in bbs.CrashedActualLRPs can indicate several problems, from a bad app with many instances associated, to a platform issue that is resulting in app crashes. Use this metric to help create a baseline for your deployment. After you have a baseline, you can create a deployment-specific alert to notify of a spike in crashes above the trend line. Tune alert values to your deployment.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Examine the BBS logs for apps that are crashing, and review the cell logs to see whether the problem is with the apps themselves rather than a platform issue.
  2. Before contacting Pivotal Support, pull the BBS logs and, if particular apps are the problem, also pull the logs from their Diego cells.

Running App Instances, Rate of Change


1hr average of bbs.LRPsRunning – prior 1hr average of bbs.LRPsRunning

Description Rate of change in app instances being started or stopped on the platform. It is derived from bbs.LRPsRunning, which represents the total number of LRP instances that are running on Diego cells.

Use: Delta reflects upward or downward trend for app instances started or stopped. Helps to provide a picture of the overall growth trend of the environment for capacity planning. You may want to alert on delta values outside of the expected range.

Origin: Firehose
Type: Gauge (Float)
Frequency: During event, emission should be constant on a running deployment
Recommended measurement derived=(1-hour average of bbs.LRPsRunning – prior 1-hour average of bbs.LRPsRunning)
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Scale components as necessary.
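
A minimal Python sketch of the derived measurement, assuming current_hour and prior_hour hold the bbs.LRPsRunning gauge values observed during the current hour and the hour before it:

    # Sketch: rate of change in running app instances, derived from bbs.LRPsRunning.
    # derived = (1-hour average) - (prior 1-hour average)
    def hourly_rate_of_change(current_hour, prior_hour):
        avg = lambda values: sum(values) / len(values) if values else 0.0
        return avg(current_hour) - avg(prior_hour)

    delta = hourly_rate_of_change(current_hour=[410, 418, 425], prior_hour=[395, 400, 402])
    print("App instance delta over the last hour: %+.1f" % delta)
    # Alert on delta values outside the range you consider normal for this foundation.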

Diego Cell Metrics

Remaining Memory Available — Cell Memory Chunks Available


rep.CapacityRemainingMemory

Description Remaining amount of memory in MiB available for this Diego cell to allocate to containers.

Use: Indicates the available cell memory. Insufficient cell memory can prevent pushing and scaling apps.

The strongest operational value of this metric is in understanding a deployment's average app size and monitoring or alerting on whether enough cells have capacity large enough to accept pushes of that standard app size. For example, when pushing a 4 GB app, Diego has trouble placing that app if no single cell has 4 GB or more of memory free.

As an example, Pivotal Cloud Ops uses a standard of 4 GB, and computes and monitors the number of cells with at least 4 GB free. When the number of cells with at least 4 GB free falls below a defined threshold, this is a scaling indicator to increase capacity. This free chunk count threshold should be tuned to the deployment size and the standard size of apps being pushed to the deployment.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement For alerting:
  1. Determine the size of a standard app in your deployment. This is the suggested value to calculate free chunks of Remaining Memory by.
  2. Create a script or tool that can iterate through each Diego Cell and do the following (a sketch of this calculation appears at the end of this entry):
    1. Pull the rep.CapacityRemainingMemory metric for each cell.
    2. Divide the values received by 1024 to get the value in GB (if the desired threshold is GB-based).
    3. Compare recorded values to your minimum capacity threshold, and count the number of cells that have equal or greater than the desired amount of free chunk space.
  3. Determine a desired scaling threshold based on the minimum amount of free chunks that are acceptable in this deployment given historical trends.
  4. Set an alert to indicate the need to scale cell memory capacity when the value falls below the desired threshold number.
For visualization purposes:
Looking at this metric (rep.CapacityRemainingMemory) as a minimum value per cell has more informational value than alerting value. It can be an informative heatmap visualization, showing average variance and density over time.
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Assign more resources to the cells or assign more cells.
  2. Scale additional Diego cells using Ops Manager.
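
A minimal Python sketch of the alerting calculation described in the Recommended measurement above. The fetch_remaining_memory_mib helper is hypothetical and stands in for however you pull rep.CapacityRemainingMemory per cell from your metrics store; the 4 GB chunk size and the minimum cell count are deployment-specific assumptions.

    # Sketch: count Diego cells that still have at least one standard-app-sized chunk free.
    STANDARD_APP_GIB = 4           # step 1: size of a standard app in this deployment (assumed)
    MIN_CELLS_WITH_FREE_CHUNK = 3  # step 3: minimum acceptable number of cells with a free chunk (assumed)

    def fetch_remaining_memory_mib():
        # Hypothetical helper: returns {cell_id: rep.CapacityRemainingMemory in MiB}.
        return {"cell-0": 6144, "cell-1": 2048, "cell-2": 9216}

    def cells_with_free_chunk(remaining_mib_by_cell, chunk_gib=STANDARD_APP_GIB):
        # step 2: convert MiB to GB-scale values and compare against the chunk size
        return sum(1 for mib in remaining_mib_by_cell.values() if mib / 1024 >= chunk_gib)

    free_chunks = cells_with_free_chunk(fetch_remaining_memory_mib())
    if free_chunks < MIN_CELLS_WITH_FREE_CHUNK:  # step 4: alert to scale cell memory capacity
        print("Scale cell memory: only %d cells have >= %d GB free" % (free_chunks, STANDARD_APP_GIB))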

Remaining Memory Available — Overall Remaining Memory Available


rep.CapacityRemainingMemory
(Alternative Use)

Description Remaining amount of memory in MiB available for this Diego cell to allocate to containers.

Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement Minimum over the last 5 minutes divided by 1024 (across all instances)
Recommended alert thresholds Yellow warning: ≤ 32 GB
Red critical: ≤ 16 GB
Recommended response
  1. Assign more resources to the cells or assign more cells.
  2. Scale additional Diego cells via Ops Manager.

Remaining Disk Available


rep.CapacityRemainingDisk

Description Remaining amount of disk in MiB available for this Diego cell to allocate to containers.

Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 4 GB free, the recommended red threshold is based on the minimum disk capacity across the deployment falling below 4 GB in the previous 5 minutes.

It can also be meaningful to assess how many chunks of free disk space are above a given threshold, similar to rep.CapacityRemainingMemory.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement Minimum over the last 5 minutes divided by 1024 (across all instances)
Recommended alert thresholds Yellow warning: ≤ 6 GB
Red critical: ≤ 3.5 GB
Recommended response
  1. Assign more resources to the cells or assign more cells.
  2. Scale additional cells using Ops Manager.

Cell Rep Time to Sync


rep.RepBulkSyncDuration

Description Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual garden containers.

Use: Sync times that are too high can indicate issues with the BBS.

Origin: Firehose
Type: Gauge (Float in ns)
Frequency: 30 s
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 5 s
Red critical: ≥ 10 s
Recommended response
  1. Investigate BBS logs for faults and errors.
  2. If a particular cell or cells appear problematic, pull logs for the cells and the BBS logs before contacting Pivotal Support.

Unhealthy Cells


rep.UnhealthyCell

Description The Diego cell periodically checks its health against the garden backend. For Diego cells, 0 means healthy, and 1 means unhealthy.

Use: Set an alert for further investigation if multiple unhealthy Diego cells are detected in the given time window. If one cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple cells are impacted, this can indicate a larger problem with Diego.

The suggested alert threshold is based on multiple unhealthy cells in the given time window.

Origin: Firehose
Type: Gauge (Float, 0-1)
Frequency: 30 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: not applicable
Red critical: > 1
Recommended response
  1. Investigate Diego cell servers for faults and errors.
  2. If a particular cell or cells appear problematic, pull logs for that cell, as well as the BBS logs before contacting Pivotal Support.
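
Because a single impacted cell has limited end-user impact, one way to express the alert is to count the cells whose rep.UnhealthyCell maximum over the window is non-zero and page only when more than one cell is affected. A minimal Python sketch with an assumed per-cell sample map:

    # Sketch: count Diego cells reporting unhealthy (rep.UnhealthyCell == 1) in the window.
    # recent_by_cell maps cell_id -> gauge values observed over the last 5 minutes.
    def unhealthy_cell_count(recent_by_cell):
        return sum(1 for values in recent_by_cell.values() if values and max(values) >= 1)

    recent_by_cell = {"cell-0": [0, 0, 0], "cell-1": [0, 1, 1], "cell-2": [0, 0, 1]}
    if unhealthy_cell_count(recent_by_cell) > 1:
        print("RED: multiple Diego cells are unhealthy; investigate cell and BBS logs")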

Diego Locket Metrics

Active Locks


locket.ActiveLocks

Description Total count of how many locks the system components are holding. As of PCF v1.11, the BBS, Auctioneer, TPS Watcher, and Routing API components have migrated to Locket from Consul lock.

Use: If the Active Lock count is greater than the expected maximum, there is likely a problem with Diego.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: not applicable
Red critical: > 4
Recommended response
  1. Run monit status to inspect for failing processes.
  2. If there are no failing processes, then review the logs for the components using the locket service: BBS, Auctioneer, TPS Watcher, and Routing API. Look for indications that only one of each component is active at a time.
  3. Focus triage on the BBS first:
    • A healthy BBS shows obvious activity around starting or claiming LRPs.
    • An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
  4. If the BBS appears healthy, then check the Auctioneer to ensure it is processing auction payloads.
    Recent logs for the Auctioneer should show that all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt to execute should correspond to app dev activity, such as a cf push. The TPS Watcher is primarily active only when application instances crash, so if the TPS Watcher is suspected, review its most recent logs.
  5. If unable to resolve on-going excessive active locks, pull logs from the Diego BBS and Auctioneer VMs, which will include the locket service component logs, and contact Pivotal Support.

Active Presences


locket.ActivePresences

Description Total count of active presences. Presences are defined as the registration records that the cells maintain to advertise themselves to the platform.

Use: If the Active Presences count is far from the expected, there might be a problem with Diego.

The number of active presences varies according to the number of cells deployed. Therefore, during purposeful scale adjustments to PCF, adjust this alerting threshold accordingly.
Establish an initial threshold by observing the historical trends for the deployment over a brief period of time, and increase the threshold as more cells are deployed. During a rolling deploy, this metric shows variance during the BOSH lifecycle when cells are evacuated and restarted. Tolerable variance is typically within the bounds of the Max Inflight Container Starts setting established in Elastic Runtime (see the sketch at the end of this entry).

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement Maximum over the last 15 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Ensure that the variance is not the result of an active rolling deploy. Also ensure that the alert threshold is appropriate to the number of cells in the current deployment.
  2. Run monit status to inspect for failing processes.
  3. If there are no failing processes, then review the logs for the components using the locket service: BBS, Auctioneer, TPS Watcher, and Routing API.
  4. Focus triage on the BBS first:
    • A healthy BBS shows obvious activity around starting or claiming LRPs.
    • An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
  5. If the BBS appears healthy, then check the Auctioneer to ensure it is processing auction payloads.
    Recent logs for the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app dev activity, such as a cf push. The TPS Watcher is primarily only active when application instances crash, so if the TPS Watcher is suspected, review the most recent logs.
  6. If you are unable to resolve the problem, pull the logs from the Diego BBS and Auctioneer VMs, which include the locket service component logs, and contact Pivotal Support.
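
One way to keep this threshold tracking the deployment, rather than hard-coding it, is to compare the presence count against the number of deployed cells and tolerate variance up to the Max Inflight Container Starts value, since a rolling deploy can legitimately take that many cells out of rotation at once. A minimal Python sketch with assumed inputs:

    # Sketch: compare locket.ActivePresences against the number of deployed Diego cells,
    # tolerating variance up to the Max Inflight Container Starts setting.
    def presences_within_expected(active_presences, deployed_cells, max_inflight):
        return abs(deployed_cells - active_presences) <= max_inflight

    if not presences_within_expected(active_presences=44, deployed_cells=50, max_inflight=4):
        print("Active presences outside expected range; rule out a rolling deploy, then check monit and BBS/Auctioneer logs")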

Diego Route Emitter Metrics

Route Emitter Time to Sync


route_emitter.RouteEmitterSyncDuration

Description Time in ns that the active route-emitter took to perform its synchronization pass.

Use: Increases in this metric indicate that the route emitter may have trouble maintaining an accurate routing table to broadcast to the GoRouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 s for the yellow threshold and ≥ 10 s for the critical threshold. Pivotal has observed on its Pivotal Web Services deployment that above 10 s, the BBS may be failing.

Origin: Firehose
Type: Gauge (Float in ns)
Frequency: 60 s
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Investigate the route_emitter and Diego BBS logs for errors.
  2. Verify that app routes are functional by making a request to an app, pushing an app and pinging it, or if applicable, checking that your smoke tests have passed.

GoRouter Metrics

Router Throughput


gorouter.total_requests

Description The lifetime number of requests completed by the GoRouter VM

Use: Provides insight into the overall traffic flow through a deployment. For performance and capacity management, consider this metric a measure of router throughput and convert it to requests per second by taking the delta value of gorouter.total_requests and deriving back to 1 s, or sum_over_all_indexes(gorouter.total_requests.delta)/5 (a sketch appears at the end of this entry). This helps you see trends in the throughput rate that indicate a need to scale the GoRouter. Use the trends you observe to tune the threshold alerts for this metric.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Average over the last 5 minutes of the derived per second calculation
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response For optimizing the GoRouter, consider the requests-per-second derived metric in the context of router latency and GoRouter VM CPU utilization. From performance and load testing of the GoRouter, Pivotal has observed that at approximately 2500 requests per second, latency can begin to increase.

To increase throughput and maintain low latency, scale the GoRouters either horizontally or vertically and watch that the system.cpu.user metric for the GoRouter stays in the suggested range of 60-70% CPU Utilization.
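
A minimal Python sketch of the requests-per-second derivation described in the Use section above, assuming deltas_by_router holds the per-instance increase in gorouter.total_requests over one 5-second emission interval:

    # Sketch: derive router throughput (requests per second) from gorouter.total_requests.
    # sum_over_all_indexes(gorouter.total_requests.delta) / 5
    EMISSION_INTERVAL_S = 5

    def requests_per_second(deltas_by_router):
        return sum(deltas_by_router.values()) / EMISSION_INTERVAL_S

    rps = requests_per_second({"router/0": 6200, "router/1": 5900})
    print("Router throughput: %.0f req/s" % rps)
    # Consider this value alongside gorouter.latency and the GoRouter's system.cpu.user,
    # as described in the Recommended response above.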

Router Handling Latency


gorouter.latency

Description The time in milliseconds that the GoRouter takes to handle requests to its app endpoints. This is the average round trip response time to an app, which includes router handling.

Use: Indicates how GoRouter jobs in PCF are impacting overall app responsiveness. Latencies above 100 ms can indicate problems with the network, misbehaving apps, or the need to scale the GoRouter itself due to ongoing traffic congestion. An alert value on this metric should be tuned to the specifics of the deployment and its underlying network considerations; a suggested starting point is 100 ms.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: Emitted per GoRouter request, emission should be constant on a running deployment
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Extended periods of high latency can point to several factors. The GoRouter latency measure includes network and app latency impacts as well.

  1. First inspect logs for network issues and indications of misbehaving apps.
  2. If it appears that the GoRouter needs to scale due to ongoing traffic congestion, do not scale on the latency metric alone. Also look at the CPU utilization of the GoRouter VMs and keep it within the suggested maximum range of 60-70% (see the sketch after this list).
  3. Resolve high utilization by scaling the GoRouter.
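
A minimal Python sketch of the scaling decision described above, treating CPU utilization as the cross-check so that latency alone does not trigger scaling. The 100 ms starting point and the 70% ceiling come from the guidance in this entry; the averaged inputs are assumed to come from your metrics store.

    # Sketch: decide whether high gorouter.latency warrants scaling the GoRouter.
    def should_scale_gorouter(avg_latency_ms, avg_cpu_user_pct,
                              latency_threshold_ms=100, cpu_ceiling_pct=70):
        # Scale only when latency is high AND the GoRouter VMs are CPU-bound.
        return avg_latency_ms >= latency_threshold_ms and avg_cpu_user_pct >= cpu_ceiling_pct

    if should_scale_gorouter(avg_latency_ms=140, avg_cpu_user_pct=82):
        print("GoRouter appears CPU-bound with high latency; scale horizontally or vertically")
    else:
        print("High latency without high CPU; investigate the network and app behavior first")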

Time Since Last Route Register Received


gorouter.ms_since_last_registry_update

Description Time in milliseconds since the last route register was received.

Use: Indicates if routes are not being registered to apps correctly.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: 30 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: not applicable
Red critical: > 30,000
This threshold is suitable for normal platform usage. It alerts if it has been at least 30 seconds since the GoRouter last received a message from an app.
Recommended response
  1. Search the GoRouter and route_emitter logs for connection issues to NATS.
  2. Check the BOSH logs to see if the NATS, GoRouter, or route_emitter VMs are failing.
  3. Look more broadly at the health of all VMs, particularly Diego-related VMs.
  4. If problems persist, pull the GoRouter and route_emitter logs and contact Pivotal Support to say there are consistently long delays in route registry.

Router Error: 502 Bad Gateway


gorouter.bad_gateways

Description The lifetime number of bad gateways, or 502 responses, from GoRouter itself.
The GoRouter emits a 502 bad gateway error when it has a route in the routing table and, in attempting to make a connection to the backend, finds that the backend does not exist.

Use: Indicates that route tables might be stale. Stale routing tables suggest an issue in the route register management plane, which indicates that something has likely changed with the locations of the containers. Always investigate unexpected increases in this metric.

Origin: Firehose
Type: Counter (Integer, Lifetime)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Look in the GoRouter and route_emitter logs for connection issues to NATS.
  2. Check the BOSH logs to see if the NATS, GoRouter, or route_emitter VMs are failing.
  3. Look broadly at the health of all VMs, particularly Diego-related VMs.
  4. If problems persist, pull GoRouter and route_emitter logs, then contact Pivotal Support, telling them that there has been an unusual increase in GoRouter bad gateway responses.

Router Error: Server Error


gorouter.responses.5xx

Description The lifetime number of requests completed by the GoRouter VM for HTTP status family 5xx, server errors.

Use: A repeatedly crashing app is often the cause of a big increase in 5xx responses. However, response issues from apps can also cause an increase in 5xx responses. Always investigate an unexpected increase in this metric.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Look for out-of-memory errors and other app-level errors.
  2. As a temporary measure, ensure that the troublesome app is scaled to more than one instance.

Number of GoRouter Routes Registered


gorouter.total_routes

Description The current total number of routes registered with the GoRouter

Use: Indicates uptake and gives a picture of the overall growth of the environment for capacity planning.

Pivotal also recommends alerting on this metric if the number of routes falls outside of the normal range for your deployment. For example, dramatic increases in the total routes outside of expected business events might point to a denial-of-service attack. Or, dramatic decreases in this metric volume might indicate a problem with the route registration process, such as an app outage or that something in the route register management plane has failed.

If visualizing these metrics on a dashboard, gorouter.total_routes can be helpful by visualizing dramatic drops. However, for alerting purposes, the metric gorouter.ms_since_last_registry_update is more valuable for quicker identification of GoRouter issues. Alerting thresholds for gorouter.total_routes should focus on dramatic increases and decreases out of expected range.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement 5-minute average of the per second delta
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. For capacity needs, scale up or down the GoRouter VMs as necessary.
  2. For significant drops in current total routes, see the gorouter.ms_since_last_registry_update metric value for additional context.
  3. Check the GoRouter and route_emitter logs for connection issues to NATS.
  4. Check the BOSH logs to see if the NATS, GoRouter, or route_emitter VMs are failing.
  5. Look broadly at the health of all VMs, particularly Diego-related VMs.
  6. If problems persist, pull the GoRouter and route_emitter logs and contact Pivotal Support.
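
If you do alert on this metric, a deviation-from-baseline check is usually more useful than a fixed threshold, since the normal route count is deployment-specific. A minimal Python sketch, where the 30% tolerance and the input averages are assumptions to tune for your foundation:

    # Sketch: flag dramatic swings in gorouter.total_routes relative to a historical baseline.
    def route_count_anomalous(recent_avg, baseline_avg, tolerance=0.30):
        if baseline_avg <= 0:
            return True
        return abs(recent_avg - baseline_avg) / baseline_avg > tolerance

    if route_count_anomalous(recent_avg=420, baseline_avg=1000):
        print("Route count far outside normal range; check gorouter.ms_since_last_registry_update and NATS health")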

Firehose Metrics

Firehose Throughput


DopplerServer.listeners.totalReceivedMessageCount

Description The total number of messages received across all Doppler listeners: UDP, TCP, TLS, and GRPC.

Use: Provides insight into how much traffic the logging system handles. This metric is an indicator of logging consistency.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Scale up the Firehose log receiver and Dopplers on consistent upward trends.
Pivotal recommends that you do not scale down these components on flat or downward delta trends because unexpected spikes in throughput can cause log loss if not scaled appropriately.

Firehose Dropped Messages


DopplerServer.doppler.shedEnvelopes

Description The lifetime total number of messages intentionally dropped by Doppler due to back pressure.

Use: Indicates logging consistency. Set an alert to indicate if too much traffic is coming into the Dopplers or if the Firehose consumers are not keeping pace. Both issues result in dropped messages.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response Scale up the Firehose log receiver and Dopplers.

System (BOSH) Metrics

VM Health


system.healthy

Description 1 means the system is healthy, and 0 means the system is not healthy.

Use: This is the most important BOSH metric to monitor. It indicates if the VM emitting the metric is healthy. Review this metric for all VMs to estimate the overall health of the system.

Multiple unhealthy VMs signal problems with the underlying IaaS layer.

Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (Float, 0-1)
Frequency: 60 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: not applicable
Red critical: < 1
Recommended response Investigate VM logs for the unhealthy component(s).

VM Memory Used


system.mem.percent

Description System Memory — Percentage of memory used on the VM

Use: Set an alert and investigate if the free RAM is low over an extended period.

Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 10 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response The response depends on the job the metric is associated with. If appropriate, scale affected jobs out and monitor for improvement.

VM Disk Used


system.disk.system.percent

Description System disk — Percentage of the system disk used on the VM

Use: Set an alert to indicate when the system disk is almost full.

Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response Investigate what is filling the job's system partition.
This partition should not typically fill up because BOSH deploys jobs to use ephemeral and persistent disks.

VM Ephemeral Disk Used


system.disk.ephemeral.percent

Description Ephemeral disk — Percentage of the ephemeral disk used on the VM

Use: Set an alert and investigate if the ephemeral disk usage is too high for a job over an extended period.

Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response
  1. Run bosh vms --details to view jobs on affected deployments.
  2. Determine cause of the data consumption, and, if appropriate, increase disk space or scale out the affected jobs.

VM Persistent Disk Used


system.disk.persistent.percent

Description Persistent disk — Percentage of persistent disk used on the VM

Use: Set an alert and investigate further if the persistent disk usage for a job is too high over an extended period.

Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response
  1. Run bosh vms --details to view jobs on affected deployments.
  2. Determine cause of the data consumption, and, if appropriate, increase disk space or scale out affected jobs.

VM CPU Utilization


system.cpu.user

Description CPU utilization — The percentage of CPU spent in user processes

Use: Set an alert and investigate further if the CPU utilization is too high for a job.

For monitoring GoRouter performance, CPU utilization of the GoRouter VM is the recommended key capacity scaling indicator. For more information, see GoRouter Latency and Throughput.

Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 85%
Red critical: ≥ 95%
Recommended response
  1. Investigate the cause of the spike.
  2. If the cause is a normal workload increase, then scale up the affected jobs.