Key Capacity Scaling Indicators

This topic describes key capacity scaling indicators that operators monitor to determine when they need to scale their Pivotal Cloud Foundry (PCF) deployments.

Pivotal provides these indicators to operators as general guidance for capacity scaling. Each indicator is based on platform metrics from different components. This guidance is applicable to most PCF v2.1 deployments. Pivotal recommends that operators fine-tune the suggested alert thresholds by observing historical trends for their deployments.

Diego Cell Capacity Scaling Indicators

There are three key capacity scaling indicators recommended for Diego cells:

Diego Cell Memory Capacity

rep.CapacityRemainingMemory / rep.CapacityTotalMemory

Description: Percentage of remaining memory capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingMemory indicates the remaining amount, in MiB, of memory available for this cell to allocate to containers.
The metric rep.CapacityTotalMemory indicates the total amount, in MiB, of memory available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to tolerate the failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(35%)
How to scale: Scale up your Diego cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Disk Capacity

rep.CapacityRemainingDisk / rep.CapacityTotalDisk

Description: Percentage of remaining disk capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingDisk indicates the remaining amount, in MiB, of disk available for this cell to allocate to containers.
The metric rep.CapacityTotalDisk indicates the total amount, in MiB, of disk available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to tolerate the failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(35%)
How to scale: Scale up your Diego cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Container Capacity

rep.CapacityRemainingContainers / rep.CapacityTotalContainers

Description: Percentage of remaining container capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingContainers indicates the remaining number of containers this cell can host.
The metric rep.CapacityTotalContainers indicates the total number of containers this cell can host.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to tolerate the failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(35%)
How to scale: Scale up your Diego cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
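
All three Diego cell indicators are the same derived ratio computed over a different pair of gauges. The following minimal sketch, in Python, shows one way to derive the deployment-wide average and compare it against the recommended 35% threshold; how you sample the per-cell gauge pairs from your monitoring system is an assumption left out of scope, and the sample values are illustrative.

    def remaining_capacity_pct(cells):
        # Each element of `cells` is a (remaining, total) gauge pair for one
        # cell, e.g. rep.CapacityRemainingMemory and rep.CapacityTotalMemory.
        # The same computation applies to the disk and container pairs.
        percentages = [100.0 * remaining / total for remaining, total in cells]
        return sum(percentages) / len(percentages)

    # Illustrative samples: three cells, each with 32768 MiB allocatable.
    memory_samples = [(9830, 32768), (12288, 32768), (8192, 32768)]
    avg_pct = remaining_capacity_pct(memory_samples)
    if avg_pct < 35.0:  # recommended threshold for a three-AZ deployment
        print(f"avg remaining {avg_pct:.1f}% < 35% -- scale up Diego cells")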

Firehose Performance Scaling Indicators

There are two key capacity scaling indicators recommended for Firehose performance.

Log Transport Loss Rate

loggregator.doppler.dropped / loggregator.doppler.ingress

Description: This derived value represents the loss rate occurring as messages are transported from the Metron Agent components (the log ingress point) through the Doppler components to the Firehose endpoints. The metric loggregator.doppler.ingress represents the number of messages entering Dopplers for transport through the Firehose, and loggregator.doppler.dropped represents the number of messages dropped without delivery. Messages include the combined stream of logs from all apps and the metrics data from Cloud Foundry components.
Purpose: Excessive dropped messages can indicate that the Dopplers or Traffic Controllers are not processing messages fast enough.

The recommended scaling indicator is to look at the total dropped as a percentage of the total throughput and to scale if the derived loss rate grows greater than 0.01.
Recommended thresholds:
Scale indicator: ≥ 0.01
If alerting:
Yellow warning: ≥ 0.005
Red critical: ≥ 0.01
How to scale: Scale up the number of Traffic Controller and Doppler instances.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Base metrics are emitted every 5 s
Applies to: cf:doppler
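
Because the base metrics are counters, the loss rate is typically derived from the change in each counter over an interval. A minimal sketch, assuming two successive samples of the dropped and ingress counters summed across Doppler instances (the sample values are illustrative, not real output):

    def loss_rate(dropped_prev, dropped_now, ingress_prev, ingress_now):
        # Counters only increase between restarts, so the loss rate over the
        # interval is the ratio of the two counter deltas.
        delta_ingress = ingress_now - ingress_prev
        if delta_ingress <= 0:
            return 0.0  # no traffic in the interval
        return (dropped_now - dropped_prev) / delta_ingress

    rate = loss_rate(dropped_prev=1200, dropped_now=1260,
                     ingress_prev=900000, ingress_now=910000)
    if rate >= 0.01:
        print(f"loss rate {rate:.4f} >= 0.01 -- red: scale Dopplers and Traffic Controllers")
    elif rate >= 0.005:
        print(f"loss rate {rate:.4f} >= 0.005 -- yellow warning")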

Doppler Message Rate Capacity

loggregator.doppler.ingress (sum across instances) / current number of Doppler instances

Description: This derived value represents the average rate of envelopes (messages) per Doppler instance. Deriving a per-Doppler envelopes-per-second or envelopes-per-minute rate can indicate the need to scale when Doppler instances are at their recommended maximum load.
Purpose: The recommended scaling indicator is to look at the average load on the Doppler instances and to increase the number of Doppler instances when the derived rate reaches 16,000 envelopes per second, or 1 million envelopes per minute.
Recommended thresholds:
Scale indicator: ≥ 16,000 envelopes per second (or 1 million envelopes per minute)
How to scale: Increase the number of Doppler VMs in the Resource Config pane of the Pivotal Application Service (PAS) tile.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 15 s
Applies to: cf:doppler
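
A small worked example of this sizing rule: given a summed ingress rate and the current instance count, derive the per-Doppler load and the smallest instance count that stays at or under the 16,000 envelopes-per-second recommendation. The function below is a hypothetical sketch, not an official sizing tool.

    import math

    MAX_ENVELOPES_PER_SEC = 16000  # recommended per-Doppler maximum

    def doppler_sizing(total_envelopes_per_sec, current_instances):
        # Average load per instance, and the instance count needed to keep
        # the average at or under the recommended maximum.
        per_instance = total_envelopes_per_sec / current_instances
        needed = math.ceil(total_envelopes_per_sec / MAX_ENVELOPES_PER_SEC)
        return per_instance, max(needed, current_instances)

    per_instance, recommended = doppler_sizing(90000, current_instances=5)
    print(f"{per_instance:.0f} envelopes/s per Doppler; recommended count: {recommended}")
    # -> 18000 envelopes/s per Doppler; recommended count: 6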

CF Syslog Drain Performance Scaling Indicators

There are three key capacity scaling indicators recommended for CF Syslog Drain performance.

Note: These CF Syslog Drain scaling indicators are only relevant if your deployment contains apps using the CF syslog drain binding feature.

Adapter Loss Rate

cf-syslog-drain.adapter.dropped / cf-syslog-drain.adapter.ingress

Description: The loss rate of the Syslog Adapters, that is, the total messages dropped as a percentage of the total traffic coming through the Syslog Adapters. Total messages include only logs for bound applications.

This loss rate is specific to the Syslog Adapters and does not impact the Firehose loss rate. For example, you can suffer lossiness in syslog while not suffering any lossiness in the Firehose.
Purpose: Indicates that the syslog drains are not keeping up with the number of logs that a syslog-drain-bound app is producing. This likely means that the syslog drain consumer is failing to keep up with the incoming log volume.

The recommended scaling indicator is to look at the maximum per-minute loss rate over a 5-minute window and to scale if the derived loss rate grows greater than 0.1.
Recommended thresholds:
Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale: Performance test your syslog server, and review the logs of the syslog-consuming system for intake and other performance issues that indicate a need to scale the consuming system.
Additional details:
Origin: Firehose
Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:cf-syslog
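
A minimal sketch of the windowed computation described above, assuming five successive per-minute (dropped, ingress) counter deltas are available; the same computation applies to the loggregator.rlp.* pair in the next indicator. The window values are illustrative.

    def max_loss_rate(per_minute_samples):
        # `per_minute_samples` holds (dropped, ingress) counter deltas for
        # each minute in the window; the indicator is the worst minute.
        rates = [dropped / ingress
                 for dropped, ingress in per_minute_samples if ingress > 0]
        return max(rates, default=0.0)

    # Illustrative 5-minute window of per-minute deltas.
    window = [(5, 1000), (120, 1000), (0, 900), (30, 1100), (8, 950)]
    worst = max_loss_rate(window)
    if worst >= 0.1:
        print(f"max per-minute loss rate {worst:.2f} >= 0.1 -- check drain consumers")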

Reverse Log Proxy Loss Rate

loggregator.rlp.dropped / loggregator.rlp.ingress

Description: The loss rate of the Reverse Log Proxies (RLPs), that is, the total messages dropped as a percentage of the total traffic coming through the Reverse Log Proxy. Total messages include only logs for bound applications.

This loss rate is specific to the RLP and does not impact the Firehose loss rate. For example, you can suffer lossiness in syslog while not suffering any lossiness in the Firehose.
Purpose: Excessive dropped messages can indicate that the RLP is overloaded and that the Traffic Controllers need to be scaled.

The recommended scaling indicator is to look at the maximum per-minute loss rate over a 5-minute window and to scale if the derived loss rate grows greater than 0.1.
Recommended thresholds:
Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale: Scale up the number of Traffic Controller instances to further balance log load.
Additional details:
Origin: Firehose
Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:cf-syslog

CF Syslog Drain Bindings Count

cf-syslog-drain.scheduler.drains

Description: The number of CF syslog drain bindings.
Purpose: Each Syslog Adapter can handle approximately 500 drain bindings. The recommended initial configuration is a minimum of two Syslog Adapters (to handle approximately 1000 drain bindings), with a new Adapter instance added for every 500 additional drain bindings.

Therefore, the recommended initial scaling indicator is 900 (as a maximum value over a 1-hour window). This indicates the need to scale up from the initial two-Adapter configuration to three Adapters.
Recommended thresholds:
Scale indicator: ≥ 900
Consider this threshold dynamic; adjust it as adoption of CF syslog drains in the deployment increases or decreases.
How to scale: Increase the number of Syslog Adapter VMs in the Resource Config pane of the Pivotal Application Service (PAS) tile.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:cf-syslog
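
One way to encode this sizing guidance in code. The formula below is an assumption derived from the numbers above (a minimum of two Adapters, roughly 500 bindings per Adapter, plus one Adapter of headroom so that scaling triggers at roughly 900, 1,400, and so on), not an official rule.

    import math

    def recommended_adapters(drain_bindings):
        # Minimum of two Adapters; ~500 bindings per Adapter plus one
        # Adapter of headroom (hypothetical encoding of the guidance above).
        return max(2, math.ceil(drain_bindings / 500) + 1)

    for bindings in (400, 900, 1400):
        print(bindings, "->", recommended_adapters(bindings))
    # 400 -> 2, 900 -> 3, 1400 -> 4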

Router Performance Scaling Indicator

There is one key capacity scaling indicator recommended for Router performance.

Router VM CPU Utilization

system.cpu.user of the Gorouter VM(s)

Description: CPU utilization of the Gorouter VM(s).
Purpose: High CPU utilization of the Gorouter VMs can increase latency and cause throughput (requests per second) to level off. Pivotal recommends keeping CPU utilization within a maximum range of 60-70% for best Gorouter performance.

If you want to increase throughput capabilities while also keeping latency low, Pivotal recommends scaling the Gorouter while continuing to ensure that CPU utilization does not exceed the maximum recommended range.
Recommended thresholds:
Scale indicator: ≥ 60%
If alerting:
Yellow warning: ≥ 60%
Red critical: ≥ 70%
How to scale: Resolve high utilization by scaling the Gorouters horizontally or vertically by editing the Router VM in the Resource Config pane of the PAS tile.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:router

UAA Performance Scaling Indicator

There is one key capacity scaling indicator recommended for UAA performance.

UAA VM CPU Utilization

system.cpu.user of the UAA VM(s)

Description: CPU utilization of the UAA VM(s).
Purpose: High CPU utilization of the UAA VMs can increase latency and cause throughput (requests per second) to level off. Pivotal recommends keeping CPU utilization within a maximum range of 80-90% for best UAA performance.

If you want to increase throughput capabilities while keeping latency low, Pivotal recommends scaling the UAA VMs and ensuring that CPU utilization does not exceed the maximum recommended range.
Recommended thresholds:
Scale indicator: ≥ 80%
If alerting:
Yellow warning: ≥ 80%
Red critical: ≥ 90%
How to scale: Resolve high utilization by scaling UAA horizontally. To scale UAA, navigate to the Resource Config pane of the PAS tile and edit the number of UAA VM instances.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:uaa
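
Both CPU indicators reduce to a threshold check on system.cpu.user, differing only in the warning and critical levels. A minimal sketch; the component keys and status strings are illustrative, not part of any PCF API.

    # (yellow warning, red critical) thresholds for system.cpu.user, percent.
    THRESHOLDS = {
        "router": (60.0, 70.0),
        "uaa": (80.0, 90.0),
    }

    def cpu_status(component, cpu_user_pct):
        yellow, red = THRESHOLDS[component]
        if cpu_user_pct >= red:
            return "red critical: scale now"
        if cpu_user_pct >= yellow:
            return "yellow warning: plan to scale"
        return "ok"

    print(cpu_status("router", 65.0))  # yellow warning: plan to scale
    print(cpu_status("uaa", 92.0))     # red critical: scale now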

NFS/WebDAV Backed Blobstore

There is one key capacity scaling indicator recommended for an internal NFS/WebDAV-backed blobstore.

Note: This metric is relevant only if your deployment uses an internal NFS/WebDAV-backed blobstore rather than an external S3-compatible repository with no capacity constraints.

system.disk.persistent.percent of NFS server VM(s)

Description: If applicable, monitor the percentage of persistent disk used on the VM for the NFS Server job.
Purpose: If you do not use an external S3-compatible repository with no capacity constraints, you must monitor the PCF object store to ensure that you can continue to push new apps and buildpacks.

If you use an internal NFS/WebDAV-backed blobstore, consider scaling the persistent disk when it reaches 75% capacity.
Recommended thresholds: ≥ 75%
How to scale: Give your NFS Server additional persistent disk resources.
Additional details:
Origin: Firehose
Type: Gauge (%)
Applies to: cf:nfs_server