Key Performance Indicators

This topic describes Key Performance Indicators (KPIs) that operators may want to monitor in their Pivotal Cloud Foundry (PCF) deployment to help ensure it is in a good operational state.

The following PCF v2.0 KPIs are provided for operators to give general guidance on monitoring a PCF deployment using platform component and system (BOSH) metrics. Although many metrics are emitted from the platform, the following PCF v2.0 KPIs are high-signal-value metrics that can indicate emerging platform issues.

This alerting and response guidance has been shown to apply to most deployments. Pivotal recommends that operators continue to fine-tune the alert measures to their deployment by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, deployment-specific monitoring metrics, thresholds, and alerts based on learning from their deployments.

Note: Thresholds noted as “dynamic” in the tables below indicate that while a metric is highly important to watch, the relative numbers to set threshold warnings at are specific to a given PCF environment and its use cases. These dynamic thresholds should be occasionally revisited because the PCF foundation and its usage continue to evolve. See Determine Warning and Critical Thresholds for more information.

Diego Auctioneer Metrics

Auctioneer App Instance (AI) Placement Failures


auctioneer.AuctioneerLRPAuctionsFailed

Description The number of Long Running Process (LRP) instances that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: This metric can indicate that PCF is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.

Origin: Firehose
Type: Counter (Integer)
Frequency: During each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 0.5
Red critical: ≥ 1
Recommended response
  1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error and resource constraint, you may also find a failure reason in the Cloud Controller (CC) API.
  2. Investigate the health of your Diego cells to determine if they are the resource type causing the problem.
  3. Consider scaling additional cells using Ops Manager.
  4. If scaling cells does not solve the problem, pull Diego brain logs and BBS node logs and contact Pivotal Support telling them that LRP auctions are failing.
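The recommended measurement and thresholds above can be sketched as follows. This is a minimal illustration, assuming samples of the cumulative counter have already been collected from the Firehose as (timestamp in seconds, value) pairs. The function names and the counter-reset handling are hypothetical and are not part of PCF.

```python
from typing import List, Tuple

def per_minute_delta(samples: List[Tuple[float, float]], window_s: float = 300.0) -> float:
    """Average per-minute increase of a cumulative counter over the trailing window.

    samples: (unix_seconds, counter_value) pairs sorted by time (assumed input shape).
    """
    if len(samples) < 2:
        return 0.0
    end = samples[-1][0]
    window = [s for s in samples if s[0] >= end - window_s]
    if len(window) < 2:
        return 0.0
    total = 0.0
    for (_, prev), (_, cur) in zip(window, window[1:]):
        # Assumption: a drop means the auctioneer job restarted and the counter reset.
        total += cur - prev if cur >= prev else cur
    minutes = (window[-1][0] - window[0][0]) / 60.0
    return total / minutes if minutes > 0 else 0.0

def severity(rate: float, yellow: float = 0.5, red: float = 1.0) -> str:
    """Map the per-minute rate to the yellow/red thresholds from the table above."""
    if rate >= red:
        return "red"
    if rate >= yellow:
        return "yellow"
    return "ok"
```

The same calculation would apply to any other cumulative counter in this topic that uses a per-minute delta averaged over a 5-minute window.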

Auctioneer Time to Fetch Cell State


auctioneer.AuctioneerFetchStatesDuration

Description Time in ns that the auctioneer took to fetch state from all the Diego cells when running its auction.

Use: Indicates how the cells themselves are performing. Alerting on this metric helps surface cases where app staging requests to Diego may be failing.

Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: During event, during each auction
Recommended measurement Maximum over the last 5 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 2 s
Red critical: ≥ 5 s
Recommended response
  1. Check the health of the cells by reviewing the logs and looking for errors.
  2. Review IaaS console metrics.
  3. Pull Diego brain logs and cell logs and contact Pivotal Support telling them that fetching cell states is taking too long.
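As a rough sketch of the measurement above, the nanosecond gauge samples from the last 5 minutes can be reduced to a maximum and converted to seconds before comparing against the thresholds. The function names are hypothetical, and the input is assumed to be a non-empty list of samples already collected from the Firehose.

```python
NS_PER_SECOND = 1_000_000_000

def max_seconds(samples_ns):
    """Largest gauge sample in the window, converted from nanoseconds to seconds.

    samples_ns must contain at least one sample (max() raises ValueError otherwise).
    """
    return max(samples_ns) / NS_PER_SECOND

def severity(seconds, yellow_s=2.0, red_s=5.0):
    """Compare a duration in seconds against the yellow/red thresholds."""
    if seconds >= red_s:
        return "red"
    if seconds >= yellow_s:
        return "yellow"
    return "ok"
```

The other duration gauges in this topic that are emitted in ns differ only in window length and threshold values, which can be passed as arguments.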

Auctioneer App Instance Starts


auctioneer.AuctioneerLRPAuctionsStarted

Description The number of LRP instances that the auctioneer successfully placed on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The recommended measurement, below, can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window.

Origin: Firehose
Type: Counter (Integer)
Frequency: During event, during each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response When observing a significant amount of container churn, do the following:

  1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity.
  2. If container churn appears to continue over an extended period, pull logs from the Diego Brain and BBS node before contacting Pivotal Support.
When observing extended periods of high or low activity trends, scale up or down CF components as needed.

Auctioneer Task Placement Failures


auctioneer.AuctioneerTaskAuctionsFailed

Description The number of Tasks that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.

Origin: Firehose
Type: Counter (Float)
Frequency: During event, during each auction
Recommended measurement Per minute delta averaged over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 0.5
Red critical: ≥ 1
Recommended response
  1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the CC API.
  2. Investigate the health of Diego cells.
  3. Consider scaling additional cells using Ops Manager.
  4. If scaling cells does not solve the problem, pull Diego brain logs and BBS logs and contact Pivotal Support for additional troubleshooting. Inform Pivotal Support that Task auctions are failing.

Diego BBS Metrics

BBS Time to Run LRP Convergence


bbs.ConvergenceLRPDuration

Description Time in ns that the BBS took to run its LRP convergence pass.

Use: If the convergence run begins taking too long, apps or Tasks may be crashing without restarting. This symptom can also indicate loss of connectivity to the BBS database.

Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: During event, every 30 seconds when LRP convergence runs. Emission should be near-constant on a running deployment.
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 10 s
Red critical: ≥ 20 s
Recommended response
  1. Check BBS logs for errors.
  2. Try vertically scaling the BBS VM resources up. For example, add more CPUs or memory depending on its system.cpu/system.memory metrics.
  3. If that does not solve the issue, pull the BBS logs and contact Pivotal Support for additional troubleshooting.

BBS Time to Handle Requests


bbs.RequestLatency

Description The maximum latency, observed over the past 60 seconds, that the BBS took to handle requests across all its API endpoints. Diego aggregates the underlying measurements and emits only this 60-second maximum.

Use: If this metric rises, the PCF API is slowing. Response to certain cf CLI commands is slow if request latency is high.

Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: 60 s
Recommended measurement Average over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 5 s
Red critical: ≥ 10 s
Recommended response
  1. Check CPU and memory statistics in Ops Manager.
  2. Check BBS logs for faults and errors that can indicate issues with BBS.
  3. Try scaling the BBS VM resources up. For example, add more CPUs/memory depending on its system.cpu/system.memory metrics.
  4. If the above steps do not solve the issue, collect a sample of the cell logs from the BBS VMs and contact Pivotal Support to troubleshoot further.

Cloud Controller and Diego in Sync


bbs.Domain.cf-apps

Description Indicates if the cf-apps Domain is up-to-date, meaning that CF App requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution.
  • 1 means cf-apps Domain is up-to-date
  • No data received means cf-apps Domain is not up-to-date
Use: If the cf-apps Domain does not stay up-to-date, changes requested in the Cloud Controller are not guaranteed to propagate throughout the system. If the Cloud Controller and Diego are out of sync, then apps running could vary from those desired.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response
  1. Check the BBS logs.
  2. If the problem continues, pull Diego brain logs and BBS logs and contact Pivotal Support to say that the cf-apps domain is not being kept fresh.
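Because a missing emission means the domain is not up-to-date, an average over the window has to count absent samples as well as received ones. A minimal sketch of one way to do that, assuming the 30 s frequency listed above and a list of the values received within the window (the function names are hypothetical):

```python
def domain_freshness(received, window_s=300, interval_s=30):
    """Average of the gauge over the window, counting missing emissions as 0.

    received: values seen during the window, oldest first (assumed input shape).
    """
    expected = window_s // interval_s      # 10 emissions in 5 minutes at 30 s
    recent = received[-expected:]          # ignore anything beyond the expected count
    return sum(recent) / expected

def is_critical(average):
    """Red critical threshold from the table above: average < 1."""
    return average < 1
```

With this approach, two missed emissions in a 5-minute window give an average of 0.8, which trips the red threshold.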

More App Instances Than Expected


bbs.LRPsExtra

Description Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs.

Use: If Diego has more LRPs running than expected, there may be problems with the BBS.

Deleting an app with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsExtra is unusual and should be investigated.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response
  1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
  2. If the condition persists, pull the BBS logs and contact Pivotal Support.

Fewer App Instances Than Expected


bbs.LRPsMissing

Description Total number of LRP instances that are desired but have no record in the BBS. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs.

Use: If Diego has fewer LRPs running than expected, there may be problems with the BBS.

An app push with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsMissing is unusual and should be investigated.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response
  1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
  2. If the condition persists, pull the BBS logs and contact Pivotal Support.

Crashed App Instances


bbs.CrashedActualLRPs

Description Total number of LRP instances that have crashed.

Use: Indicates how many instances in the deployment are in a crashed state. An increase in bbs.CrashedActualLRPs can indicate several problems, from a bad app with many instances associated, to a platform issue that is resulting in app crashes. Use this metric to help create a baseline for your deployment. After you have a baseline, you can create a deployment-specific alert to notify of a spike in crashes above the trend line. Tune alert values to your deployment.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Look at the BBS logs for apps that are crashing and at the cell logs to see if the problem is with the apps themselves, rather than a platform issue.
  2. Before contacting Pivotal Support, pull the BBS logs and, if particular apps are the problem, pull the logs from their Diego cells too.

Running App Instances, Rate of Change


1hr average of bbs.LRPsRunning – prior 1hr average of bbs.LRPsRunning

Description Rate of change in app instances being started or stopped on the platform. It is derived from bbs.LRPsRunning, which represents the total number of LRP instances that are running on Diego cells.

Use: Delta reflects upward or downward trend for app instances started or stopped. Helps to provide a picture of the overall growth trend of the environment for capacity planning. You may want to alert on delta values outside of the expected range.

Origin: Firehose
Type: Gauge (Float)
Frequency: During event, emission should be constant on a running deployment
Recommended measurement derived=(1-hour average of bbs.LRPsRunning – prior 1-hour average of bbs.LRPsRunning)
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Scale components as necessary.
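The derived measurement above can be sketched as the difference between two adjacent 1-hour averages. This is only an illustration, assuming bbs.LRPsRunning samples are available as (timestamp in seconds, value) pairs; the function name is hypothetical.

```python
from statistics import mean

def hourly_delta(samples, now):
    """1-hour average of the gauge minus the prior 1-hour average.

    samples: (unix_seconds, value) pairs (assumed input shape).
    Returns None when either hour has no samples.
    """
    current = [v for t, v in samples if now - 3600 < t <= now]
    prior = [v for t, v in samples if now - 7200 < t <= now - 3600]
    if not current or not prior:
        return None
    return mean(current) - mean(prior)
```

A positive result indicates growth in running app instances over the last hour, and a negative result indicates a decline. What counts as out of range depends on the deployment, which is why the thresholds are dynamic.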

Diego Cell Metrics

Remaining Memory Available — Cell Memory Chunks Available


rep.CapacityRemainingMemory

Description Remaining amount of memory in MiB available for this Diego cell to allocate to containers.

Use: Indicates the available cell memory. Insufficient cell memory can prevent pushing and scaling apps.

The strongest operational value of this metric is to understand a deployment’s average app size and monitor/alert on ensuring that at least some cells have large enough capacity to accept standard app size pushes. For example, when pushing a 4 GB app, Diego would have trouble placing that app if no single cell has 4 GB or more of remaining capacity.

As an example, Pivotal Cloud Ops uses a standard of 4 GB, and computes and monitors for the number of cells with at least 4 GB free. When the number of cells with at least 4 GB free falls below a defined threshold, this is a scaling indicator alert to increase capacity. This free chunk count threshold should be tuned to the deployment size and the standard size of apps being pushed to the deployment.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement For alerting:
  1. Determine the size of a standard app in your deployment. This is the suggested value to calculate free chunks of Remaining Memory by.
  2. Create a script/tool that can iterate through each Diego Cell and do the following:
    1. Pull the rep.CapacityRemainingMemory metric for each cell.
    2. Divide the values received by 1024 to convert from MiB to GB (if the desired threshold is GB-based).
    3. Compare recorded values to your minimum capacity threshold, and count the number of cells that have equal or greater than the desired amount of free chunk space.
  3. Determine a desired scaling threshold based on the minimum amount of free chunks that are acceptable in this deployment given historical trends.
  4. Set an alert to indicate the need to scale cell memory capacity when the value falls below the desired threshold number.
For visualization purposes:
Looking at this metric (rep.CapacityRemainingMemory) as a minimum value per cell has more informational value than alerting value. It can be an interesting heatmap visualization, showing average variance and density over time.
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Assign more resources to the cells or assign more cells.
  2. Scale additional Diego cells using Ops Manager.
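The free-chunk alerting steps in the recommended measurement can be sketched as below. This is a hedged example, not a supported tool: it assumes the latest rep.CapacityRemainingMemory value for each cell has already been pulled into a dictionary keyed by a cell identifier, that the value is in MiB as the Description states, and that 4 GB is the standard app size. The function names and the minimum-cell threshold are hypothetical.

```python
MIB_PER_GB = 1024

def count_free_chunks(remaining_mib_by_cell, chunk_gb=4.0):
    """Count cells whose remaining memory can hold one standard-size app."""
    return sum(
        1
        for remaining_mib in remaining_mib_by_cell.values()
        if remaining_mib / MIB_PER_GB >= chunk_gb
    )

def needs_scaling(free_chunk_cells, min_cells):
    """True when fewer cells than the chosen threshold have a free chunk."""
    return free_chunk_cells < min_cells

# Example with made-up values: two of four cells have at least 4 GB free.
cells = {"cell-0": 8192, "cell-1": 4096, "cell-2": 2048, "cell-3": 512}
free = count_free_chunks(cells)
```

The `min_cells` value is the deployment-specific threshold from step 3 and should come from historical trends rather than a fixed default.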

Remaining Memory Available — Overall Remaining Memory Available


rep.CapacityRemainingMemory
(Alternative Use)

Description Remaining amount of memory in MiB available for this Diego cell to allocate to containers.

Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement Minimum over the last 5 minutes divided by 1024 (across all instances)
Recommended alert thresholds Yellow warning: ≤ 64 GB
Red critical: ≤ 32 GB
Recommended response
  1. Assign more resources to the cells or assign more cells.
  2. Scale additional Diego cells via Ops Manager.
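One possible reading of the measurement above is to take each cell's minimum value over the last 5 minutes, sum those minimums across all cells, and divide by 1024 to express the result in GB. The sketch below follows that reading. The summing step is an assumption based on the "overall sum of capacity" wording in the Use field, and the function names are hypothetical.

```python
def overall_remaining_gb(samples_mib_by_cell):
    """Sum of each cell's minimum sample in the window, converted from MiB to GB."""
    total_mib = sum(min(samples) for samples in samples_mib_by_cell.values())
    return total_mib / 1024

def severity(remaining_gb, yellow_gb=64.0, red_gb=32.0):
    """Lower is worse here, so thresholds are compared with <=."""
    if remaining_gb <= red_gb:
        return "red"
    if remaining_gb <= yellow_gb:
        return "yellow"
    return "ok"
```

Tracking the same value over longer periods gives the capacity consumption trend mentioned in the Use field.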

Remaining Disk Available


rep.CapacityRemainingDisk

Description Remaining amount of disk in MiB available for this Diego cell to allocate to containers.

Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 4 GB free, the recommended red threshold is based on the minimum disk capacity across the deployment falling below 4 GB in the previous 5 minutes.

It can also be meaningful to assess how many chunks of free disk space are above a given threshold, similar to rep.CapacityRemainingMemory.

Origin: Firehose
Type: Gauge (Integer in MiB)
Frequency: 60 s
Recommended measurement Minimum over the last 5 minutes divided by 1024 (across all instances)
Recommended alert thresholds Yellow warning: ≤ 8 GB
Red critical: ≤ 4 GB
Recommended response
  1. Assign more resources to the cells or assign more cells.
  2. Scale additional cells using Ops Manager.

Cell Rep Time to Sync


rep.RepBulkSyncDuration

Description Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual garden containers.

Use: Sync times that are too high can indicate issues with the BBS.

Origin: Firehose
Type: Gauge (Float in ns)
Frequency: 30 s
Recommended measurement Maximum over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: ≥ 5 s
Red critical: ≥ 10 s
Recommended response
  1. Investigate BBS logs for faults and errors.
  2. If a particular cell or cells appear problematic, pull logs for the cells and the BBS logs before contacting Pivotal Support.

Unhealthy Cells


rep.UnhealthyCell

Description The Diego cell periodically checks its health against the garden backend. For Diego cells, 0 means healthy, and 1 means unhealthy.

Use: Set an alert for further investigation if multiple unhealthy Diego cells are detected in the given time window. If one cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple cells are impacted, this can indicate a larger problem with Diego.

Suggested alert threshold based on multiple unhealthy cells in the given time window.

Origin: Firehose
Type: Gauge (Float, 0-1)
Frequency: 30 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: > 1
Recommended response
  1. Investigate Diego cell servers for faults and errors.
  2. If a particular cell or cells appear problematic, pull logs for that cell, as well as the BBS logs before contacting Pivotal Support.

Diego Locket Metrics

Active Locks


locket.ActiveLocks

Description Total count of how many locks the system components are holding.

Use: If the ActiveLocks count is not equal to the expected value, there is likely a problem with Diego.

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: ≠ 4
Recommended response
  1. Run monit status to inspect for failing processes.
  2. If there are no failing processes, then review the logs for the components using the Locket service: BBS, Auctioneer, TPS Watcher, and Routing API. Look for indications that only one of each component is active at a time.
  3. Focus triage on the BBS first:
    • A healthy BBS shows obvious activity around starting or claiming LRPs.
    • An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
    • Reference the BBS-level Locket metric Locks Held by BBS. A value of 0 indicates Locket issues at the BBS level.
  4. If the BBS appears healthy, then check the Auctioneer to ensure it is processing auction payloads.
    • Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push.
    • Reference the Auctioneer-level Locket metric Locks Held by Auctioneer. A value of 0 indicates Locket issues at the Auctioneer level.
  5. The TPS Watcher is primarily active when app instances crash. Therefore, if the TPS Watcher is suspected, review the most recent logs.
  6. If you are unable to resolve ongoing excessive active locks, pull logs from the Diego BBS and Auctioneer VMs, which include the Locket service component logs, and contact Pivotal Support.

Locks Held by BBS


bbs.LockHeld

Description Whether a BBS instance holds the expected BBS lock (in Locket). 1 means the active BBS server holds the lock, and 0 means the lock was lost.

Use: This metric is complementary to Active Locks, and it offers a BBS-level version of the Locket metrics. Although it is emitted per BBS instance, only 1 active lock is held by BBS. Therefore, the expected value is 1. The metric may occasionally be 0 when the BBS instances are performing a leader transition, but a prolonged value of 0 indicates an issue with BBS.

Origin: Firehose
Type: Gauge
Frequency: Periodically
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: ≠ 1
Recommended response
  1. Run monit status on the Diego database VM to check for failing processes.
  2. If there are no failing processes, then review the logs for BBS.
    • A healthy BBS shows obvious activity around starting or claiming LRPs.
    • An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
  3. If you are unable to resolve the issue, pull logs from the Diego BBS and Auctioneer VMs, which include the Locket service component logs, and contact Pivotal Support.

Locks Held by Auctioneer


auctioneer.LockHeld

Description Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost.

Use: This metric is complementary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only 1 active lock is held by Auctioneer. Therefore, the expected value is 1. The metric may occasionally be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer.

Origin: Firehose
Type: Gauge
Frequency: Periodically
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: ≠ 1
Recommended response
  1. Run monit status on the Diego Database VM to check for failing processes.
  2. If there are no failing processes, then review the logs for Auctioneer.
    • Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push.
  3. If you are unable to resolve the issue, pull logs from the Diego BBS and Auctioneer VMs, which include the Locket service component logs, and contact Pivotal Support.

Active Presences


locket.ActivePresences

Description Total count of active presences. Presences are defined as the registration records that the cells maintain to advertise themselves to the platform.

Use: If the Active Presences count is far from the expected, there might be a problem with Diego.

The number of active presences varies according to the number of cells deployed. Therefore, during purposeful scale adjustments to PCF, this alerting threshold should be adjusted.
Establish an initial threshold by observing the historical trends for the deployment over a brief period of time. Increase the threshold as more cells are deployed. During a rolling deploy, this metric shows variance during the BOSH lifecycle when cells are evacuated and restarted. Tolerable variance is within the bounds of the Max Inflight Container Starts value established in Pivotal Application Service (PAS).

Origin: Firehose
Type: Gauge
Frequency: 60 s
Recommended measurement Maximum over the last 15 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Ensure that the variance is not the result of an active rolling deploy. Also ensure that the alert threshold is appropriate to the number of cells in the current deployment.
  2. Run monit status to inspect for failing processes.
  3. If there are no failing processes, then review the logs for the components using the Locket service: BBS, Auctioneer, TPS Watcher, and Routing API.
  4. Focus triage on the BBS first:
    • A healthy BBS shows obvious activity around starting or claiming LRPs.
    • An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
  5. If the BBS appears healthy, then check the Auctioneer to ensure it is processing auction payloads.
    Recent logs for the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app dev activity, such as a cf push. The TPS Watcher is primarily only active when application instances crash, so if the TPS Watcher is suspected, review the most recent logs.
  6. If you are unable to resolve the problem, pull the logs from the Diego BBS and Auctioneer VMs, which include the Locket service component logs, and contact Pivotal Support.

Diego Route Emitter Metrics

Route Emitter Time to Sync


route_emitter.RouteEmitterSyncDuration

Description Time in ns that the active Route Emitter took to perform its synchronization pass.

Use: Increases in this metric indicate that the Route Emitter may have trouble maintaining an accurate routing table to broadcast to the Gorouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 s for the yellow threshold and ≥ 10 s for the critical threshold. Pivotal has observed on its Pivotal Web Services deployment that above 10 s, the BBS may be failing.

Origin: Firehose
Type: Gauge (Float in ns)
Frequency: 60 s
Recommended measurement Maximum, per job, over the last 15 minutes divided by 1,000,000,000
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response If all or many jobs show as impacted, there is likely an issue with Diego.
  1. Investigate the Route Emitter and Diego BBS logs for errors.
  2. Verify that app routes are functional by making a request to an app, pushing an app and pinging it, or if applicable, checking that your smoke tests have passed.
If only one or a few jobs show as impacted, there is likely a connectivity issue and the impacted job should be investigated further.

PAS MySQL KPIs

When PAS uses an internal MySQL database, as configured in the PAS tile Settings tab > Databases pane, the database cluster generates KPIs as described below.

MySQL Server Availability

/mysql/available
Description The MySQL Server is currently responding to requests, which indicates that the server is running.

Use: This metric is especially useful in single-node mode, where cluster metrics are not relevant. If the server does not emit heartbeats, it is offline.

Origin: Firehose
Envelope Type: Gauge
Unit: boolean
Frequency: 30 s (default)
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Run mysql-diag and check the MySQL Server logs for errors.

Galera Cluster Node Readiness

/mysql/galera/wsrep_ready
Description Shows whether each cluster node can accept queries. Returns only 0 or 1. When this metric is 0, almost all queries to that node fail with the error:
ERROR 1047 (08S01) Unknown Command

Use: Discover when nodes of a cluster have been unable to communicate and, thus, unable to accept transactions.

Origin: Firehose
Envelope Type: Gauge
Unit: boolean
Frequency: 30 s (default)
Recommended measurement Average of values of each cluster node, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 1.0
Red critical: 0 (cluster is down)
Recommended response - Run mysql-diag and check the MySQL Server logs for errors.
- Make sure there has been no infrastructure event that affects intra-cluster communication.
- Ensure that wsrep_ready has not been set to off by using the query:
SHOW STATUS LIKE 'wsrep_ready';
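The node-readiness measurement and thresholds above can be sketched as an average over every sample from every node in the window. This assumes the 0/1 values have already been collected per node from the Firehose; the function names and node labels are hypothetical.

```python
def wsrep_ready_average(samples_by_node):
    """Average of all 0/1 readiness samples across all cluster nodes in the window."""
    values = [v for samples in samples_by_node.values() for v in samples]
    return sum(values) / len(values)

def severity(average):
    """Yellow when any node was not ready, red when no node was ready."""
    if average == 0:
        return "red"
    if average < 1.0:
        return "yellow"
    return "ok"
```

In a three-node cluster where one node reports 0 for the whole window, the average is about 0.67, which falls in the yellow range.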

Galera Cluster Size

/mysql/galera/wsrep_cluster_size
Description The number of cluster nodes with which each node is communicating normally.

Use: When running in a multi-node configuration, this metric indicates if each member of the cluster is communicating normally with all other nodes.

Origin: Firehose
Envelope Type: Gauge
Unit: count
Frequency: 30 s (default)
Recommended measurement (Average of the values of each node / cluster size), over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3.0 (availability compromised)
Red critical: < 1.0 (cluster unavailable)
Recommended response Run mysql-diag and check the MySQL Server logs for errors.

Galera Cluster Status

/mysql/galera/wsrep_cluster_status
Description Shows the primary status of the cluster component that the node is in.
Values are:
- Primary = 1
- Non-primary = 0
- Disconnected = -1
See: https://mariadb.com/kb/en/mariadb/galera-cluster-status-variables/

Use: Any value other than “Primary” indicates that the node is part of a nonoperational component. This occurs in cases of multiple membership changes that result in a loss of quorum.

Origin: Firehose
Envelope Type: Gauge
Unit: integer (see above)
Frequency: 30 s (default)
Recommended measurement Sum of each of the nodes, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3
Red critical: < 1
Recommended response - Check node status to ensure that they are all in working order and able to receive write-sets.
- Run mysql-diag and check the MySQL Server logs for errors.

Connections per Second

/mysql/net/connections
Description Connections per second made to the server.

Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue.

Origin: Firehose
Envelope Type: Gauge
Unit: count
Frequency: 30 s (default)
Recommended measurement (Average of all nodes / max connections), over last 1 minute
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 90%
Recommended response - Run mysql-diag and check the MySQL Server logs for errors.
- When approaching 100% of max connections, apps may experience periods when they cannot connect to the database. The connections per second for the cluster vary based on application instances and app utilization. If this threshold is met or exceeded for an extended period of time, monitor app usage to ensure everything is behaving as expected.
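The percentage used in the thresholds above comes from dividing the average of the per-node values by the maximum connection limit. A minimal sketch follows, assuming the 1-minute average for each node is already available and that the limit is supplied by the operator. The function names, node labels, and the example limit of 1500 are made up for illustration.

```python
def connection_utilization(connections_by_node, max_connections):
    """Average of the per-node values as a percentage of the connection limit."""
    average = sum(connections_by_node.values()) / len(connections_by_node)
    return 100.0 * average / max_connections

def severity(percent, yellow=80.0, red=90.0):
    if percent > red:
        return "red"
    if percent > yellow:
        return "yellow"
    return "ok"

# Example with made-up values and a hypothetical limit of 1500 connections.
usage = connection_utilization({"mysql-0": 1300, "mysql-1": 1250, "mysql-2": 1350}, 1500)
```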

Query Rate

/mysql/performance/questions
Description The rate of statements executed by the server, shown as queries per second.

Use: The cluster should always be processing some queries, if only as part of the internal automation.

Origin: Firehose
Envelope Type: Gauge
Unit: count
Frequency: 30 s (default)
Recommended measurement Average over the last two minutes
Recommended alert thresholds Yellow warning: 0 for 90 s
Red critical: 0 for 120 s
Recommended response If the rate is ever zero for an extended time, run mysql-diag and investigate the MySQL server logs to understand why the query rate changed and determine the appropriate action.
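
One way to evaluate the "0 for 90 s" and "0 for 120 s" thresholds is to measure how long the most recent readings have stayed at zero. The sketch below assumes timestamped samples collected at the default 30 s frequency. The function names are hypothetical.

```python
from typing import List, Tuple

YELLOW_AFTER_S = 90
RED_AFTER_S = 120


def trailing_zero_seconds(samples: List[Tuple[float, float]]) -> float:
    """Return how long the most recent readings have stayed at zero.

    samples: (timestamp_seconds, queries_per_second) pairs, oldest first.
    """
    first_zero_ts = None
    for timestamp, rate in reversed(samples):
        if rate != 0:
            break
        first_zero_ts = timestamp
    if first_zero_ts is None:
        return 0.0
    return float(samples[-1][0] - first_zero_ts)


def query_rate_alert(samples: List[Tuple[float, float]]) -> str:
    """Classify the trailing zero run against the thresholds above."""
    idle = trailing_zero_seconds(samples)
    if idle >= RED_AFTER_S:
        return "red"
    if idle >= YELLOW_AFTER_S:
        return "yellow"
    return "ok"
```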

MySQL CPU Busy Time

/mysql/performance/busy_time
Description Percentage of CPU time spent by MySQL on user activity, executing user code, as opposed to kernel activity processing system calls.

Use: This closely reflects the amount of server activity dedicated to app queries.

Origin: Firehose
Envelope Type: Gauge
Unit: percentage
Frequency: 30 s (default)
Recommended measurement Average over last 2 minutes
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 90%
Recommended response - If this metric meets or exceeds the recommended thresholds for extended periods of time, run SHOW PROCESSLIST and identify which queries or apps are using so much CPU. Optionally redeploy the MySQL jobs using VMs with more CPU capacity.
- Run mysql-diag and check the MySQL Server logs for errors.

Gorouter Metrics

Router File Descriptors


gorouter.file_descriptors

Description The number of file descriptors currently used by the Gorouter job.

Use: Indicates an impending issue with the Gorouter. Without proper mitigation, it is possible for an unresponsive app to eventually exhaust available Gorouter file descriptors and cause route starvation for other apps running on PCF. Under heavy load, this unmitigated situation can also result in the Gorouter losing its connection to NATS and all routes being pruned.

While a drop in gorouter.total_routes or an increase in gorouter.ms_since_last_registry_update helps to surface that the issue may already be occurring, alerting on gorouter.file_descriptors indicates that such an issue is impending.

The Gorouter limits the number of file descriptors to 100,000 per job. Once the limit is met, the Gorouter is unable to establish any new connections.

To reduce the risk of DDoS attacks, Pivotal recommends doing one or both of the following:

  • Within PAS, set Max Connections Per Backend to define how many requests can be routed to any particular app instance. This prevents a single app from using all Gorouter connections. The value specified should be determined by the operator based on the use cases for that foundation. For example, Pivotal sets the number of connections to 500 for Pivotal Web Services.
  • Add rate limiting at the load balancer level.
Origin: Firehose
Type: Gauge
Frequency: 5 s
Recommended measurement Maximum, per Gorouter job, over the last 5 minutes
Recommended alert thresholds Yellow warning: 50,000 per job
Red critical: 60,000 per job
Recommended response
  1. Identify which app(s) are requesting excessive connections and resolve the impacting issues with these apps.
  2. If the above recommended mitigation steps have not already been taken, do so.
  3. Consider adding more Gorouter VM resources to increase the number of available file descriptors.
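
The per-job measurement above can be illustrated with a short sketch. It assumes the samples from the last 5 minutes are already grouped as (job, value) pairs, and it treats the thresholds as "greater than or equal to", which the table does not state explicitly. The function name and job identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple

YELLOW_AT = 50_000
RED_AT = 60_000


def fd_alerts(samples: Iterable[Tuple[str, int]]) -> Dict[str, str]:
    """Take the maximum gorouter.file_descriptors per job and classify it.

    samples: (gorouter_job_id, file_descriptors) readings from the window.
    """
    peak: Dict[str, int] = {}
    for job, value in samples:
        peak[job] = max(peak.get(job, 0), value)

    alerts: Dict[str, str] = {}
    for job, value in peak.items():
        if value >= RED_AT:  # assumption: thresholds are inclusive
            alerts[job] = "red"
        elif value >= YELLOW_AT:
            alerts[job] = "yellow"
        else:
            alerts[job] = "ok"
    return alerts
```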

Router Exhausted Connections


gorouter.backend_exhausted_conns

Description The lifetime number of requests that have been rejected by the Gorouter VM due to the Max Connections Per Backend limit being reached across all tried backends. The limit controls the number of concurrent TCP connections to any particular app instance and is configured within PAS.

Use: Indicates that PCF is mitigating risk to other applications by self-protecting the platform against one or more unresponsive applications. Increases in this metric indicate the need to investigate and resolve issues with potentially unresponsive applications. A rapid rate of change upward is concerning and should be assessed further.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute, per Gorouter job, over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. If gorouter.backend_exhausted_conns spikes, first look to the Router Throughput metric gorouter.total_requests to determine if this measure is high or low in relation to normal bounds for this deployment.
  2. If Router Throughput appears within normal bounds, it is likely that gorouter.backend_exhausted_conns is spiking due to an unresponsive application, possibly due to application code issues or underlying application dependency issues. To help determine the problematic application, look in access logs for repeated calls to one application. Then proceed to troubleshoot this application accordingly.
  3. If Router Throughput also shows unusual spikes, the cause of the gorouter.backend_exhausted_conns spikes is likely external to the platform. Increases in load may be due to expected business events driving additional traffic to applications. Unexpected increases in load may indicate a DDoS attack risk.
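
The triage steps above amount to a small decision procedure, sketched below. The upper bound for "normal" throughput is an assumption that each operator would derive from historical data. The function name and the returned labels are hypothetical.

```python
def triage_exhausted_conns(exhausted_spiking: bool,
                           requests_per_second: float,
                           normal_max_rps: float) -> str:
    """Suggest where to look when gorouter.backend_exhausted_conns spikes.

    normal_max_rps is a deployment-specific upper bound for Router Throughput.
    """
    if not exhausted_spiking:
        return "no action"
    if requests_per_second <= normal_max_rps:
        # Throughput looks normal: an unresponsive app is the likely cause.
        return "investigate unresponsive app"
    # Throughput is also unusually high: the cause is likely external load.
    return "investigate external load"
```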

Router Throughput


gorouter.total_requests

Description The lifetime number of requests completed by the Gorouter VM, emitted per Gorouter instance

Use: The aggregation of these values across all Gorouters provides insight into the overall traffic flow of a deployment. Unusually high spikes, if not known to be associated with an expected increase in demand, could indicate a DDoS risk. For performance and capacity management, consider this metric a measure of router throughput per job. Convert it to requests per second by taking the delta value of gorouter.total_requests and dividing by the 5-second emission interval, or (gorouter.total_requests.delta)/5, per Gorouter instance. This helps you see trends in the throughput rate that indicate a need to scale the Gorouter instances. Use the trends you observe to tune the threshold alerts for this metric.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Average over the last 5 minutes of the derived per second calculation
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response For optimizing the Gorouter, consider the requests-per-second derived metric in the context of router latency and Gorouter VM CPU utilization. From performance and load testing of the Gorouter, Pivotal has observed that at approximately 2500 requests per second, latency can begin to increase.

To increase throughput and maintain low latency, scale the Gorouters either horizontally or vertically and watch that the system.cpu.user metric for the Gorouter stays in the suggested range of 60-70% CPU Utilization.
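
The requests-per-second derivation for gorouter.total_requests can be sketched as follows. The sketch assumes consecutive readings of the lifetime counter for a single Gorouter instance, taken at the 5 s emission interval. Skipping intervals where the counter decreases (for example, after a restart) is an assumption, not documented behavior. The function names are hypothetical.

```python
from typing import List, Sequence

EMIT_INTERVAL_S = 5  # gorouter.total_requests is emitted every 5 s


def requests_per_second(counter_samples: Sequence[int],
                        interval_s: float = EMIT_INTERVAL_S) -> List[float]:
    """Convert consecutive lifetime-counter readings into per-second rates."""
    rates: List[float] = []
    for previous, current in zip(counter_samples, counter_samples[1:]):
        delta = current - previous
        if delta < 0:
            # Assumed handling: the counter was reset, so skip this interval.
            continue
        rates.append(delta / interval_s)
    return rates


def average_rate(counter_samples: Sequence[int],
                 interval_s: float = EMIT_INTERVAL_S) -> float:
    """Average the derived per-second rates over the supplied window."""
    rates = requests_per_second(counter_samples, interval_s)
    return sum(rates) / len(rates) if rates else 0.0
```

For example, readings of 1000, 6000, 16000, and 28500 yield rates of 1000, 2000, and 2500 requests per second. The last value is at the level where the text above notes that latency can begin to increase.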

Router Handling Latency


gorouter.latency

Description The time in milliseconds that the Gorouter takes to handle requests to its app endpoints. This is the average round trip response time to an app, which includes router handling.

Use: Indicates how Gorouter jobs in PCF are impacting overall app responsiveness. Latencies above 100 ms can indicate problems with the network, misbehaving apps, or the need to scale the Gorouter itself due to ongoing traffic congestion. An alert value on this metric should be tuned to the specifics of the deployment and its underlying network considerations; a suggested starting point is 100 ms.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: Emitted per Gorouter request, emission should be constant on a running deployment
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Extended periods of high latency can point to several factors. The Gorouter latency measure includes network and app latency impacts as well.

  1. First inspect logs for network issues and indications of misbehaving apps.
  2. If it appears that the Gorouter needs to scale due to ongoing traffic congestion, do not scale on the latency metric alone. You should also look at the CPU utilization of the Gorouter VMs and keep it within a maximum 60-70% range.
  3. Resolve high utilization by scaling the Gorouter.

Time Since Last Route Register Received


gorouter.ms_since_last_registry_update

Description Time in milliseconds since the last route register was received, emitted per Gorouter instance

Use: Indicates if routes are not being registered to apps correctly.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: 30 s
Recommended measurement Maximum over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: > 30,000
This threshold is suitable for normal platform usage. It alerts if more than 30 seconds have passed since the Gorouter last received a message from an app.
Recommended response
  1. Search the Gorouter and Route Emitter logs for connection issues to NATS.
  2. Check the BOSH logs to see if the NATS, Gorouter, or Route Emitter VMs are failing.
  3. Look more broadly at the health of all VMs, particularly Diego-related VMs.
  4. If problems persist, pull the Gorouter and Route Emitter logs and contact Pivotal Support to say there are consistently long delays in route registry.

Router Error: 502 Bad Gateway


gorouter.bad_gateways

Description The lifetime number of bad gateways, or 502 responses, from the Gorouter itself, emitted per Gorouter instance.
The Gorouter emits a 502 bad gateway error when it has a route in the routing table and, in attempting to make a connection to the backend, finds that the backend does not exist.

Use: Indicates that route tables might be stale. Stale routing tables suggest an issue in the route register management plane, which indicates that something has likely changed with the locations of the containers. Always investigate unexpected increases in this metric.

Origin: Firehose
Type: Counter (Integer, Lifetime)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Check the Gorouter and Route Emitter logs to see if they are experiencing issues when connecting to NATS.
  2. Check the BOSH logs to see if the NATS, Gorouter, or Route Emitter VMs are failing.
  3. Look broadly at the health of all VMs, particularly Diego-related VMs.
  4. If problems persist, pull Gorouter and Route Emitter logs and contact Pivotal Support to say there has been an unusual increase in Gorouter bad gateway responses.

Router Error: Server Error


gorouter.responses.5xx

Description The lifetime number of requests completed by the Gorouter VM for HTTP status family 5xx, server errors, emitted per Gorouter instance.

Use: A repeatedly crashing app is often the cause of a big increase in 5xx responses. However, response issues from apps can also cause an increase in 5xx responses. Always investigate an unexpected increase in this metric.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Look for out-of-memory errors and other app-level errors.
  2. As a temporary measure, ensure that the troublesome app is scaled to more than one instance.

Number of Gorouter Routes Registered


gorouter.total_routes

Description The current total number of routes registered with the Gorouter, emitted per Gorouter instance

Use: The aggregation of these values across all Gorouters indicates uptake and gives a picture of the overall growth of the environment for capacity planning.

Pivotal also recommends alerting on this metric if the number of routes falls outside of the normal range for your deployment. Dramatic decreases in this metric volume may indicate a problem with the route registration process, such as an app outage, or that something in the route register management plane has failed.

If visualizing these metrics on a dashboard, gorouter.total_routes can be helpful for visualizing dramatic drops. However, for alerting purposes, the gorouter.ms_since_last_registry_update metric is more valuable for quicker identification of Gorouter issues. Alerting thresholds for gorouter.total_routes should focus on dramatic increases or decreases out of expected range.

Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
Recommended measurement 5-minute average of the per second delta
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. For capacity needs, scale up or down the Gorouter VMs as necessary.
  2. For significant drops in current total routes, see the gorouter.ms_since_last_registry_update metric value for additional context.
  3. Check the Gorouter and Route Emitter logs to see if they are experiencing issues when connecting to NATS.
  4. Check the BOSH logs to see if the NATS, Gorouter, or Route Emitter VMs are failing.
  5. Look broadly at the health of all VMs, particularly Diego-related VMs.
  6. If problems persist, pull the Gorouter and Route Emitter logs and contact Pivotal Support.

Number of Route Registration Messages Sent and Received


gorouter.registry_message.route-emitter
route_emitter.MessagesEmitted

Description This KPI is based on the following metrics:

  • route_emitter.MessagesEmitted reports the lifetime number of route registration messages sent by the Route Emitter component. The metric is emitted for each Route Emitter.
  • gorouter.registry_message.route-emitter reports the lifetime number of route registration messages received by the Gorouter. The metric is emitted for each Gorouter instance.
Dynamic configuration that enables the Gorouter to route HTTP requests to apps is published by the Route Emitter component colocated on each Diego cell to the NATS clustered message bus. All router instances subscribed to this message bus receive the same configuration. (Router instances within an isolation segment receive configuration only for cells in the same isolation segment.)

As Gorouters prune app instances from the route when a TTL expires, each Route Emitter periodically publishes the routing configuration for the app instances on the same cell.

Therefore, the aggregate number of route registration messages published by all the Route Emitters should be equal to the number of messages received by each Gorouter instance.

Use: A difference in the rate of change of these metrics is an indication of an issue in the control plane responsible for updating the routers with changes to the routing table.

Pivotal recommends alerting when the number of messages received per second for a given router instance falls below the sum of messages emitted per second across all Route Emitters.

If visualizing these metrics on a dashboard, look for increases in the difference between the rate of messages received and sent. If the number of messages received by a Gorouter instance drops below the sum of messages sent by the Route Emitters, this is an indication of a problem in the control plane.

Origin: Firehose
Type: Counter
Frequency: With each event
Recommended measurement Difference of 5-minute average of the per second deltas for gorouter.registry_message.route-emitter and sum of route_emitter.MessagesEmitted for all Route Emitters
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response
  1. Check the Gorouter and Route Emitter logs to see if they are experiencing issues when connecting to NATS.
  2. Check the BOSH logs to see if the NATS, Gorouter, or Route Emitter VMs are failing.
  3. Look broadly at the health of all VMs, particularly Diego-related VMs.
  4. If problems persist, pull the Gorouter and Route Emitter logs and contact Pivotal Support.
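
The recommended measurement compares the per-second rate received by one Gorouter instance with the summed rate of all Route Emitters. A minimal sketch follows, assuming the first and last lifetime-counter readings in a 5-minute window are available for each component. The function names are hypothetical.

```python
from typing import Iterable, Tuple

WINDOW_S = 300  # 5-minute window


def per_second_rate(first: int, last: int, window_s: float = WINDOW_S) -> float:
    """Average per-second delta of a lifetime counter over the window."""
    return (last - first) / window_s


def registration_gap(gorouter_counter: Tuple[int, int],
                     emitter_counters: Iterable[Tuple[int, int]],
                     window_s: float = WINDOW_S) -> float:
    """Return emitted rate (all Route Emitters) minus received rate (one Gorouter).

    Each counter is a (first_reading, last_reading) pair for the window.
    A positive result means the Gorouter received fewer messages than were sent.
    """
    emitted = sum(per_second_rate(first, last, window_s)
                  for first, last in emitter_counters)
    received = per_second_rate(gorouter_counter[0], gorouter_counter[1], window_s)
    return emitted - received
```

A gap that stays near zero suggests the control plane is keeping up. A growing positive gap is the condition described above as worth alerting on.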

UAA Metrics

UAA Throughput


uaa.requests.global.completed.count

Description The lifetime number of requests completed by the UAA VM, emitted per UAA instance. This number includes health checks.

Use: For capacity planning purposes, the aggregation of these values across all UAA instances can provide insight into the overall load that UAA is processing. Pivotal recommends alerting on unexpected spikes per UAA instance. Unusually high spikes, if they are not associated with an expected increase in demand, could indicate a DDoS risk and should be investigated.

For performance and capacity management, look at the UAA Throughput metric as either a requests-completed-per-second or requests-completed-per-minute rate to determine the throughput per UAA instance. This helps you see trends in the throughput rate that may indicate a need to scale UAA instances. Use the trends you observe to tune the threshold alerts for this metric.

From performance and load testing of UAA, Pivotal has observed that while UAA endpoints can have different throughput behavior, once throughput reaches its peak value per VM, it stays constant and latency increases.

Origin: Firehose
Type: Gauge (Integer), emitted value increments over the lifetime of the VM like a counter
Frequency: 5 s
Recommended measurement Average over the last 5 minutes of the derived requests-per-second or requests-per-minute rate, per instance
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response For optimizing UAA, consider this metric in the context of UAA Request Latency and UAA VM CPU Utilization. To increase throughput and maintain low latency, scale the UAA VMs horizontally by editing the number of your UAA VM instances in the Resource Config pane of the PAS tile and ensure that the system.cpu.user metric for UAA is not sustained in the suggested range of 80-90% maximum CPU utilization.

UAA Request Latency


gorouter.latency.uaa

Description Time in milliseconds that UAA took to process a request that the Gorouter sent to UAA endpoints.

Use: Indicates how responsive UAA has been to requests sent from the Gorouter. Some operations may take longer to process, such as creating bulk users and groups. It is important to correlate latency observed with the endpoint and evaluate this data in the context of overall historical latency from that endpoint. Unusual spikes in latency could indicate the need to scale UAA VMs.

This metric is emitted only for the routers serving the UAA system component and is not emitted per isolation segment even if you are using isolated routers.

Origin: Firehose
Type: Gauge (Float in ms)
Frequency: Emitted per Gorouter request to UAA
Recommended measurement Maximum, per job, over the last 5 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Latency depends on the endpoint and operation being used. It is important to correlate the latency with the endpoint and evaluate this data in the context of the historical latency from that endpoint.
  1. Inspect which endpoints requests are hitting. Use historical data to determine if the latency is unusual for that endpoint. A list of UAA endpoints is available in the UAA API documentation.
  2. If it appears that UAA needs to be scaled due to ongoing traffic congestion, do not scale based on the latency metric alone. You should also ensure that the system.cpu.user metric for UAA stays in the suggested range of 80-90% maximum CPU utilization.
  3. Resolve high utilization by scaling UAA VMs horizontally. To scale UAA, navigate to the Resource Config pane of the PAS tile and edit the number of your UAA VM instances.

UAA Requests In Flight


uaa.server.inflight.count

Description The number of requests UAA is currently processing (in-flight requests), emitted per UAA instance.

Use: Indicates how many concurrent requests are currently in flight for the UAA instance. Unusually high spikes, if they are not associated with an expected increase in demand, could indicate a DDoS risk.

From performance and load testing of the UAA component, Pivotal has observed that the number of concurrent requests impacts throughput and latency. The UAA Requests In Flight metric helps you see trends in the request rate that may indicate the need to scale UAA instances. Use the trends you observe to tune the threshold alerts for this metric.

Origin: Firehose
Type: Gauge (Integer)
Frequency: 5 s
Recommended measurement Maximum, per job, over the last 5 minutes
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response To increase throughput and maintain low latency when the number of in-flight requests is high, scale UAA VMs horizontally by editing the UAA VM field in the Resource Config pane of the PAS tile. Ensure that the system.cpu.user metric for UAA is not sustained in the suggested range of 80-90% maximum CPU utilization.

Firehose Metrics

Firehose Throughput


DopplerServer.listeners.totalReceivedMessageCount + loggregator.doppler.ingress

Description The total number of messages received across all Doppler listeners: UDP, TCP, TLS, and GRPC.

Use: Provides insight into how much traffic the logging system handles. This metric is an indicator of logging consistency.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: Dynamic
Red critical: Dynamic
Recommended response Scale up the Firehose log receiver and Dopplers on consistent upward trends.
Pivotal recommends that you do not scale down these components on flat or downward delta trends, because unexpected spikes in throughput can cause log loss if the components are not scaled appropriately.

Firehose Dropped Messages


DopplerServer.doppler.shedEnvelopes + loggregator.doppler.dropped

Description The lifetime total number of messages intentionally dropped by Doppler due to back pressure.

Use: Indicates logging consistency. Set an alert to indicate if too much traffic is coming into the Dopplers or if the Firehose consumers are not keeping pace. Both issues result in dropped messages.

Origin: Firehose
Type: Counter (Integer)
Frequency: 5 s
Recommended measurement Maximum delta per minute over a 5-minute window
Recommended alert thresholds Yellow warning: ≥ 5
Red critical: ≥ 10
Recommended response Scale up the Firehose log receiver and Dopplers.
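
The "maximum delta per minute over a 5-minute window" measurement can be sketched as shown below. It assumes evenly spaced readings of the lifetime counter at the 5 s frequency, so every twelfth sample marks a minute boundary. The function names are hypothetical.

```python
from typing import Sequence

YELLOW_AT = 5
RED_AT = 10


def max_delta_per_minute(counter_samples: Sequence[int],
                         samples_per_minute: int = 12) -> int:
    """Return the largest one-minute increase of a lifetime counter.

    counter_samples: evenly spaced readings, oldest first (12 per minute at 5 s).
    """
    minute_marks = counter_samples[::samples_per_minute]
    deltas = [later - earlier
              for earlier, later in zip(minute_marks, minute_marks[1:])]
    return max(deltas, default=0)


def dropped_alert(delta_per_minute: int) -> str:
    """Apply the thresholds from the table above."""
    if delta_per_minute >= RED_AT:
        return "red"
    if delta_per_minute >= YELLOW_AT:
        return "yellow"
    return "ok"
```

A 5-minute window at 5 s frequency holds 61 readings, which gives six minute marks and five per-minute deltas.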

System (BOSH) Metrics

VM Health


system.healthy

Description 1 means the system is healthy, and 0 means the system is not healthy.

Use: This is the most important BOSH metric to monitor. It indicates if the VM emitting the metric is healthy. Review this metric for all VMs to estimate the overall health of the system.

Multiple unhealthy VMs signal problems with the underlying IaaS layer.

Origin: Firehose
Type: Gauge (Float, 0-1)
Frequency: 60 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Investigate CF logs for the unhealthy component(s).
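
Because the red threshold is an average below 1, any unhealthy reading in the window triggers the alert. The sketch below lists the VMs that would alert, assuming the samples from the last 5 minutes are grouped per VM. VMs with no samples are skipped here. Whether to alert on missing data is left as a deployment-specific choice. The function name and VM labels are hypothetical.

```python
from statistics import mean
from typing import List, Mapping, Sequence


def unhealthy_vms(samples_by_vm: Mapping[str, Sequence[float]]) -> List[str]:
    """Return the VMs whose average system.healthy over the window is below 1."""
    return sorted(
        vm for vm, samples in samples_by_vm.items()
        if samples and mean(samples) < 1
    )
```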

VM Memory Used


system.mem.percent

Description System Memory — Percentage of memory used on the VM

Use: Set an alert and investigate if the free RAM is low over an extended period.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 10 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response The response depends on the job the metric is associated with. If appropriate, scale affected jobs out and monitor for improvement.

VM Disk Used


system.disk.system.percent

Description System disk — Percentage of the system disk used on the VM

Use: Set an alert to indicate when the system disk is almost full.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response Investigate what is filling the job's system partition.
This partition should not typically fill because BOSH deploys jobs to use ephemeral and persistent disks.

VM Ephemeral Disk Used


system.disk.ephemeral.percent

Description Ephemeral disk — Percentage of the ephemeral disk used on the VM

Use: Set an alert and investigate if the ephemeral disk usage is too high for a job over an extended period.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response
  1. Run bosh vms --details to view jobs on affected deployments.
  2. Determine the cause of the data consumption and, if appropriate, increase disk space or scale out the affected jobs.

VM Persistent Disk Used


system.disk.persistent.percent

Description Persistent disk — Percentage of persistent disk used on the VM

Use: Set an alert and investigate further if the persistent disk usage for a job is too high over an extended period.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 30 minutes
Recommended alert thresholds Yellow warning: ≥ 80%
Red critical: ≥ 90%
Recommended response
  1. Run bosh vms --details to view jobs on affected deployments.
  2. Determine the cause of the data consumption and, if appropriate, increase disk space or scale out the affected jobs.

VM CPU Utilization


system.cpu.user

Description CPU utilization — The percentage of CPU spent in user processes

Use: Set an alert and investigate further if the CPU utilization is too high for a job.

For monitoring Gorouter performance, CPU utilization of the Gorouter VM is the recommended key capacity scaling indicator. For more information, see Gorouter Latency and Throughput.

Origin: Firehose
Type: Gauge (%)
Frequency: 60 s
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: ≥ 85%
Red critical: ≥ 95%
Recommended response
  1. Investigate the cause of the spike.
  2. If the cause is a normal workload increase, then scale up the affected jobs.