Key Capacity Scaling Indicators

This topic describes key capacity scaling indicators that operators monitor to determine when they need to scale their Pivotal Cloud Foundry (PCF) deployments.

Pivotal provides these indicators to operators as general guidance for capacity scaling. Each indicator is based on platform metrics from different components. This guidance is applicable to most PCF v1.11 deployments. Pivotal recommends that operators fine-tune the suggested alert thresholds by observing historical trends for their deployments.

Diego Cell Capacity Scaling Indicators

There are three key capacity scaling indicators recommended for Diego Cells. A combined calculation sketch for all three follows the third indicator below:

Diego Cell Memory Capacity


rep.CapacityRemainingMemory / rep.CapacityTotalMemory

Description: Percentage of remaining memory capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingMemory indicates the remaining amount in MiB of memory available for this cell to allocate to containers.
The metric rep.CapacityTotalMemory indicates the total amount in MiB of memory available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to suffer failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(30%)
How to scale: Scale up your Diego Cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Disk Capacity


rep.CapacityRemainingDisk / rep.CapacityTotalDisk

Description: Percentage of remaining disk capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingDisk indicates the remaining amount in MiB of disk available for this cell to allocate to containers.
The metric rep.CapacityTotalDisk indicates the total amount in MiB of disk available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to suffer failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(30%)
How to scale: Scale up your Diego Cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Container Capacity


rep.CapacityRemainingContainers / rep.CapacityTotalContainers

Description: Percentage of remaining container capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingContainers indicates the remaining number of containers this cell can host.
The metric rep.CapacityTotalContainers indicates the total number of containers this cell can host.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to suffer failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(30%)
How to scale: Scale up your Diego Cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
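
The three cell-capacity indicators above share the same derived calculation and differ only in the metric pair used. The following Python sketch is an illustrative, unofficial example of evaluating them together: it computes each remaining-capacity percentage per cell and compares the deployment-wide average with the 30% threshold. The snapshot structure and helper names are assumptions for illustration, not part of any PCF API.

# Illustrative sketch (not official tooling): evaluate the three Diego Cell
# capacity indicators from per-cell rep metrics already collected from the
# Firehose. The snapshot format and helper names are assumptions.

THRESHOLD_PCT = 30.0  # recommended threshold for a three-AZ deployment

CAPACITY_PAIRS = {
    "memory":     ("rep.CapacityRemainingMemory",     "rep.CapacityTotalMemory"),
    "disk":       ("rep.CapacityRemainingDisk",       "rep.CapacityTotalDisk"),
    "containers": ("rep.CapacityRemainingContainers", "rep.CapacityTotalContainers"),
}

def remaining_pct(cell, remaining_key, total_key):
    """Percentage of remaining capacity for a single cell."""
    total = cell[total_key]
    return 100.0 * cell[remaining_key] / total if total else 0.0

def evaluate_cells(cells):
    """cells: list of dicts, one per Diego cell, keyed by metric name."""
    for name, (remaining_key, total_key) in CAPACITY_PAIRS.items():
        per_cell = [remaining_pct(c, remaining_key, total_key) for c in cells]
        avg = sum(per_cell) / len(per_cell)
        status = "scale up Diego Cells" if avg < THRESHOLD_PCT else "OK"
        print(f"{name}: average remaining capacity {avg:.1f}% ({status})")

# Example with two cells (memory and disk gauges in MiB, containers as counts):
evaluate_cells([
    {"rep.CapacityRemainingMemory": 4096, "rep.CapacityTotalMemory": 16384,
     "rep.CapacityRemainingDisk": 20480, "rep.CapacityTotalDisk": 65536,
     "rep.CapacityRemainingContainers": 60, "rep.CapacityTotalContainers": 250},
    {"rep.CapacityRemainingMemory": 2048, "rep.CapacityTotalMemory": 16384,
     "rep.CapacityRemainingDisk": 30720, "rep.CapacityTotalDisk": 65536,
     "rep.CapacityRemainingContainers": 90, "rep.CapacityTotalContainers": 250},
])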

Firehose Performance Scaling Indicator

There is one key capacity scaling indicator recommended for Firehose performance.

Firehose Loss Rate

DopplerServer.doppler.shedEnvelopes / DopplerServer.listeners.totalReceivedMessageCount
Description: This derived value represents the Firehose loss rate, or the total messages dropped as a percentage of the total message throughput. Total messages include the combined stream of logs from all apps and the metrics data from Cloud Foundry components.
Purpose: Excessive dropped messages can indicate that the Dopplers are not processing messages fast enough.

The recommended scaling indicator is to look at the total messages dropped as a percentage of the total throughput and to scale if the derived loss rate grows greater than 0.1.
Recommended thresholds:
Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.05
Red critical: ≥ 0.1
How to scale: Scale up the Firehose log receiver and Dopplers.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Base metrics are emitted every 5 s
Applies to: cf:doppler
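
As an illustration of how this derived value can be computed, the sketch below takes two readings of the underlying counters, derives the loss rate over the interval between them, and applies the thresholds above. The sampling approach and function name are assumptions, not part of the platform.

# Illustrative sketch: derive the Firehose loss rate from two readings of the
# DopplerServer counters taken at the start and end of an observation window.

WARN = 0.05      # yellow warning
CRITICAL = 0.1   # red critical / scale indicator

def firehose_loss_rate(shed_start, shed_end, received_start, received_end):
    """Envelopes shed as a fraction of total messages received over the window."""
    dropped = shed_end - shed_start            # DopplerServer.doppler.shedEnvelopes delta
    received = received_end - received_start   # DopplerServer.listeners.totalReceivedMessageCount delta
    return dropped / received if received else 0.0

rate = firehose_loss_rate(shed_start=1_200, shed_end=1_950,
                          received_start=880_000, received_end=893_000)
if rate >= CRITICAL:
    print(f"loss rate {rate:.3f}: scale up the Dopplers and the Firehose log receiver")
elif rate >= WARN:
    print(f"loss rate {rate:.3f}: warning")
else:
    print(f"loss rate {rate:.3f}: OK")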

Scalable Syslog Performance Scaling Indicators

There are three key capacity scaling indicators recommended for Scalable Syslog performance.

Note: These scalable syslog scaling indicators are only relevant if your deployment contains apps using the scalable syslog drain binding feature, new in PCF v1.11.

Adapter Loss Rate

scalablesyslog.adapter.dropped / scalablesyslog.adapter.ingress
Description: The loss rate of the scalable syslog adapters, that is, the total messages dropped as a percentage of the total traffic coming through the scalable syslog adapters. Total messages include only logs for bound applications.

This loss rate is specific to the scalable syslog adapters and does not impact the Firehose loss rate. For example, you can suffer lossiness in syslog while not suffering any lossiness in the Firehose.
Purpose: A high adapter loss rate indicates that the syslog drains are not keeping up with the number of logs that a syslog-drain-bound app is producing. This likely means that the syslog-drain consumer is failing to keep up with the incoming log volume.

The recommended scaling indicator is to look at the maximum per-minute loss rate over a 5-minute window and to scale if the derived loss rate value grows greater than 0.1.
Recommended thresholds:
Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale: Performance test your syslog server, and review the logs of the syslog-consuming system for intake and other performance issues that indicate a need to scale the consuming system.
Additional details:
Origin: Firehose
Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:scalablesyslog
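
To make the windowed calculation concrete, the following sketch derives a per-minute loss rate from successive counter readings (the counters are emitted every 60 s) and takes the maximum over a 5-minute window. The sample format is an assumption for illustration; the 0.1 threshold comes from the guidance above.

# Illustrative sketch: per-minute adapter loss rate from successive counter
# readings, with the maximum taken over a rolling 5-minute window.

SCALE_AT = 0.1

def per_minute_loss_rates(samples):
    """samples: chronological list of (dropped, ingress) counter readings."""
    rates = []
    for (d0, i0), (d1, i1) in zip(samples, samples[1:]):
        ingress_delta = i1 - i0
        rates.append((d1 - d0) / ingress_delta if ingress_delta else 0.0)
    return rates

# Six one-minute readings of
# (scalablesyslog.adapter.dropped, scalablesyslog.adapter.ingress):
adapter_window = [(10, 50_000), (10, 101_000), (60, 149_000),
                  (6_200, 201_000), (6_300, 252_000), (6_350, 300_000)]
worst = max(per_minute_loss_rates(adapter_window))
if worst >= SCALE_AT:
    print(f"max per-minute adapter loss rate {worst:.3f}: investigate the syslog consumer")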

Reverse Log Proxy Loss Rate

loggregator.rlp.dropped / loggregator.rlp.ingress
Description: The loss rate of the reverse log proxies (RLP), that is, the total messages dropped as a percentage of the total traffic coming through the reverse log proxy. Total messages include only logs for bound applications.

This loss rate is specific to the scalable syslog RLP and does not impact the Firehose loss rate. For example, you can suffer lossiness in syslog while not suffering any lossiness in the Firehose.
Purpose: Excessive dropped messages can indicate that the RLP is overloaded and that the Traffic Controllers need to be scaled.

The recommended scaling indicator is to look at the maximum per-minute loss rate over a 5-minute window and to scale if the derived loss rate value grows greater than 0.1.
Recommended thresholds:
Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale: Scale up the number of Traffic Controller instances to further balance log load.
Additional details:
Origin: Firehose
Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:scalablesyslog
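
The same windowed calculation applies here, substituting the RLP counters; for example, reusing the per_minute_loss_rates sketch above:

# Six one-minute readings of (loggregator.rlp.dropped, loggregator.rlp.ingress):
rlp_window = [(0, 40_000), (0, 81_000), (500, 120_000),
              (5_100, 158_000), (5_200, 199_000), (5_250, 240_000)]
print(max(per_minute_loss_rates(rlp_window)))  # scale Traffic Controllers if >= 0.1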

Scalable Syslog Drain Bindings Count

scalablesyslog.scheduler.drains
Description: The number of scalable syslog drain bindings.
Purpose: Each scalable syslog adapter can handle approximately 200 drain bindings. The recommended initial configuration is a minimum of two scalable syslog adapters (to handle approximately 400 drain bindings). A new adapter instance should be added for each 200 additional drain bindings.

Therefore, the recommended initial scaling indicator is 350 (as a maximum value over a 1-hour window). This indicates the need to scale up to three adapters from the initial two-adapter configuration.
Recommended thresholds:
Scale indicator: ≥ 350
Consider this threshold to be dynamic. Adjust the threshold for your PCF deployment as adoption of scalable syslog increases or decreases.
How to scale: Increase the number of Scalable Syslog Adapter VMs in the Resource Config pane of the Elastic Runtime tile.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:scalablesyslog
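
The sizing arithmetic above can be expressed as a small check, sketched below. Generalizing the 350-binding indicator to other adapter counts as "current capacity minus a ~50-binding buffer" is an assumption for illustration, not official guidance.

# Illustrative sizing sketch: roughly 200 drain bindings per adapter, with a
# minimum of two adapters. The ~50-binding buffer generalizes the 350-binding
# scale indicator for the two-adapter case and is an assumption.

BINDINGS_PER_ADAPTER = 200
BUFFER = 50  # scale before the current adapters are fully subscribed

def should_add_adapter(max_drain_bindings_1h, current_adapters):
    """max_drain_bindings_1h: max of scalablesyslog.scheduler.drains over 1 hour."""
    scale_indicator = current_adapters * BINDINGS_PER_ADAPTER - BUFFER
    return max_drain_bindings_1h >= scale_indicator

print(should_add_adapter(350, current_adapters=2))  # True: add a third adapter
print(should_add_adapter(300, current_adapters=2))  # False: two adapters suffice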

Router Performance Scaling Indicator

There is one key capacity scaling indicator recommended for Router performance.

Router VM CPU Utilization

system.cpu.user of Gorouter VM(s)
Description: CPU utilization of the Gorouter VM(s).
Purpose: High CPU utilization of the Gorouter VMs can increase latency and cause throughput, or requests per second, to level off. Pivotal recommends keeping the CPU utilization within a maximum range of 60-70% for best Gorouter performance.

If you want to increase throughput capabilities while also keeping latency low, Pivotal recommends scaling the Gorouter while continuing to ensure that CPU utilization does not exceed the maximum recommended range.
Recommended thresholds:
Scale indicator: ≥ 60%
If alerting:
Yellow warning: ≥ 60%
Red critical: ≥ 70%
How to scale: Resolve high utilization by scaling the Gorouters horizontally or vertically. To do this, edit the Router VM in the Resource Config pane of the Elastic Runtime tile.
Additional details:
Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:router
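
As a minimal illustration of the alerting tiers, the sketch below classifies each Gorouter VM's system.cpu.user reading against the warning and critical thresholds. The input format and names are assumptions.

# Illustrative sketch: classify Gorouter CPU readings against the thresholds above.

WARN_PCT, CRITICAL_PCT = 60.0, 70.0

def classify(cpu_user_pct):
    if cpu_user_pct >= CRITICAL_PCT:
        return "red critical: scale Gorouters now"
    if cpu_user_pct >= WARN_PCT:
        return "yellow warning: plan to scale Gorouters"
    return "OK"

for vm, cpu in {"router/0": 45.2, "router/1": 63.8, "router/2": 72.1}.items():
    print(vm, classify(cpu))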

NFS/WebDAV Backed Blobstore

There is one key capacity scaling indicator for the NFS/WebDAV backed blobstore.

Note: This metric is relevant only if your deployment uses an internal NFS/WebDAV backed blobstore, rather than an external S3 repository for external storage with no capacity constraints.


system.disk.persistent.percent of NFS server VM(s)

Description: If applicable, monitor the percentage of persistent disk used on the VM for the NFS Server job.
Purpose: If you do not use an external S3 repository for external storage with no capacity constraints, you must monitor the PCF object store to ensure that you can continue to push new apps and buildpacks.

If you use an internal NFS/WebDAV backed blobstore, consider scaling the persistent disk when it reaches 75% capacity.
Recommended thresholds: ≥ 75%
How to scale: Give your NFS Server additional persistent disk resources.
Additional details:
Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (%)
Applies to: cf:nfs_server
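
A minimal check of this indicator might look like the following sketch, where the reading is the system.disk.persistent.percent value for the NFS Server VM; the function name and input format are assumptions.

# Illustrative sketch: flag the NFS Server persistent disk once it passes 75%.

SCALE_AT_PCT = 75.0

def needs_more_persistent_disk(persistent_disk_pct):
    return persistent_disk_pct >= SCALE_AT_PCT

print(needs_more_persistent_disk(78.4))  # True: add persistent disk to the NFS Server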