Pivotal Cloud Foundry v1.10

Key Capacity Scaling Indicators

This topic describes key capacity scaling indicators that operators monitor to determine when they need to scale their Pivotal Cloud Foundry (PCF) deployments.

Pivotal provides these indicators to operators as general guidance for capacity scaling. Each indicator is based on platform metrics from different components. This guidance is applicable to most PCF v1.10 deployments. Pivotal recommends that operators fine-tune the suggested alert thresholds by observing historical trends for their deployments.

Diego Cell Capacity Scaling Indicators

Currently, there are three key capacity scaling indicators recommended for Diego cells:

Diego Cell Memory Capacity


rep.CapacityRemainingMemory / rep.CapacityTotalMemory

Description: Percentage of remaining memory capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingMemory indicates the remaining amount in MiB of memory available for this cell to allocate to containers.
The metric rep.CapacityTotalMemory indicates the total amount in MiB of memory available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to sustain the failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(30%)
How to scale: Scale up your Diego Cells.
Additional Details:
Origin: Doppler/Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Disk Capacity


rep.CapacityRemainingDisk / rep.CapacityTotalDisk

Description: Percentage of remaining disk capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingDisk indicates the remaining amount in MiB of disk available for this cell to allocate to containers.
The metric rep.CapacityTotalDisk indicates the total amount in MiB of disk available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to sustain the failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(30%)
How to scale: Scale up your Diego Cells.
Additional Details:
Origin: Doppler/Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Container Capacity


rep.CapacityRemainingContainers / rep.CapacityTotalContainers

Description: Percentage of remaining container capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingContainers indicates the remaining number of containers this cell can host.
The metric rep.CapacityTotalContainers indicates the total number of containers this cell can host.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to sustain the failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(30%)
How to scale: Scale up your Diego Cells.
Additional Details:
Origin: Doppler/Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
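
The three Diego cell indicators above are derived the same way: divide the remaining-capacity gauge by the total-capacity gauge for each cell, then watch the average across all cells in the deployment. The following Python sketch illustrates that calculation against the recommended 30% threshold. It assumes the rep.* gauge values have already been collected from the Firehose into a per-cell mapping; the collection step, the mapping layout, and the function names are illustrative only.

# Minimal sketch: derive the three Diego cell capacity percentages and compare
# the deployment-wide averages against the recommended 30% threshold.
# Assumes `cells` maps cell name -> rep.* gauge values already read from the
# Firehose; how those values are collected is outside the scope of this sketch.

THRESHOLD_PCT = 30.0  # recommended threshold for a three-AZ deployment

CAPACITY_PAIRS = {
    "memory":     ("rep.CapacityRemainingMemory",     "rep.CapacityTotalMemory"),
    "disk":       ("rep.CapacityRemainingDisk",       "rep.CapacityTotalDisk"),
    "containers": ("rep.CapacityRemainingContainers", "rep.CapacityTotalContainers"),
}

def remaining_pct(cell_metrics, remaining_key, total_key):
    """Remaining capacity for one cell as a percentage of its total capacity."""
    total = cell_metrics[total_key]
    return 100.0 * cell_metrics[remaining_key] / total if total else 0.0

def deployment_averages(cells):
    """Average each derived percentage across all cells in the deployment."""
    averages = {}
    for name, (remaining_key, total_key) in CAPACITY_PAIRS.items():
        per_cell = [remaining_pct(c, remaining_key, total_key) for c in cells.values()]
        averages[name] = sum(per_cell) / len(per_cell)
    return averages

# Example with hypothetical gauge values for two cells:
cells = {
    "diego_cell/0": {"rep.CapacityRemainingMemory": 4096, "rep.CapacityTotalMemory": 16384,
                     "rep.CapacityRemainingDisk": 30720, "rep.CapacityTotalDisk": 65536,
                     "rep.CapacityRemainingContainers": 120, "rep.CapacityTotalContainers": 250},
    "diego_cell/1": {"rep.CapacityRemainingMemory": 2048, "rep.CapacityTotalMemory": 16384,
                     "rep.CapacityRemainingDisk": 10240, "rep.CapacityTotalDisk": 65536,
                     "rep.CapacityRemainingContainers": 60, "rep.CapacityTotalContainers": 250},
}

for name, avg in deployment_averages(cells).items():
    status = "below threshold, scale up Diego Cells" if avg < THRESHOLD_PCT else "OK"
    print(f"{name}: avg {avg:.1f}% remaining -- {status}")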

Firehose Performance Scaling Indicator

Currently, there is one key capacity scaling indicator recommended for Firehose performance.

Firehose Loss Rate

(DopplerServer.TruncatingBuffer.totalDroppedMessages + DopplerServer.doppler.shedEnvelopes) / DopplerServer.listeners.totalReceivedMessageCount
Description: This derived value represents the Firehose loss rate, or the total number of messages dropped as a percentage of the total message throughput.
Purpose: Excessive dropped messages can indicate that the Dopplers are not processing messages fast enough.

The recommended scaling indicator is to look at the total dropped as a percentage of the total throughput, and to scale when the derived loss rate value grows greater than 0.1.
Recommended thresholds: Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.05
Red critical: ≥ 0.1
How to scale: Scale up the Firehose log receiver and Dopplers.
Additional Details:
Origin: Doppler/Firehose
Type: Gauge (float)
Frequency: Base metrics are emitted every 5 s
Applies to: cf:doppler
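
To make the derivation concrete, the sketch below computes the loss rate from the three base counters and maps it to the alerting tiers listed above. It assumes the counter values have already been summed across Doppler instances; the function and variable names are illustrative only.

# Minimal sketch: derive the Firehose loss rate and map it to the alerting
# tiers above. Counter inputs are assumed to be totals summed across Dopplers.

YELLOW_WARNING = 0.05
RED_CRITICAL = 0.10  # also the scale indicator

def firehose_loss_rate(total_dropped, shed_envelopes, total_received):
    """(TruncatingBuffer.totalDroppedMessages + doppler.shedEnvelopes)
    / listeners.totalReceivedMessageCount"""
    if total_received == 0:
        return 0.0
    return (total_dropped + shed_envelopes) / total_received

def alert_level(loss_rate):
    if loss_rate >= RED_CRITICAL:
        return "red critical: scale up the Firehose log receiver and Dopplers"
    if loss_rate >= YELLOW_WARNING:
        return "yellow warning: loss rate is approaching the scale indicator"
    return "ok"

# Example with hypothetical counter values: (1200 + 300) / 20000 = 0.075
rate = firehose_loss_rate(total_dropped=1200, shed_envelopes=300, total_received=20000)
print(f"loss rate {rate:.3f}: {alert_level(rate)}")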

Router Performance Scaling Indicator

Currently, there is one key capacity scaling indicator recommended for Router performance.

Router VM CPU Utilization

system.cpu.user of Gorouter VM(s)
Description: CPU utilization of the Gorouter VM(s).
Purpose: High CPU utilization of the Gorouter VMs can increase latency and cause throughput, or requests per second, to level off. Pivotal recommends keeping the CPU utilization within a maximum range of 60-70% for best Gorouter performance.

If you want to increase throughput capabilities while also keeping latency low, Pivotal recommends scaling the Gorouter while continuing to ensure that CPU utilization does not exceed the maximum recommended range.
Recommended thresholds: Scale indicator: ≥ 60%
If alerting:
Yellow warning: ≥ 60%
Red critical: ≥ 70%
How to scale: Resolve high utilization by scaling the Gorouters horizontally or vertically (the Router VM in the Resource Config pane of the Elastic Runtime tile).
Additional Details:
Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:router
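
As a rough capacity-planning aid, the sketch below estimates how many Gorouter instances would bring the average system.cpu.user back under the 60% scale indicator. It assumes request load spreads evenly across instances, which is an approximation rather than a guarantee; the function name is illustrative only.

import math

# Rough sketch: estimate the Gorouter instance count needed to bring average
# CPU utilization back under the 60% scale indicator, assuming load spreads
# evenly across instances.

SCALE_INDICATOR_PCT = 60.0

def recommended_gorouter_count(current_instances, avg_cpu_pct, target_pct=SCALE_INDICATOR_PCT):
    """Treat current_instances * avg_cpu_pct as the total CPU load and spread
    it so that no instance carries more than target_pct."""
    total_load = current_instances * avg_cpu_pct
    return max(current_instances, math.ceil(total_load / target_pct))

# Example: 3 Gorouters averaging 75% CPU -> 3 * 75 / 60 = 3.75 -> 4 instances.
print(recommended_gorouter_count(current_instances=3, avg_cpu_pct=75.0))

Scaling vertically (a larger Router VM in Resource Config) is the alternative noted above when adding instances is not an option.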

NFS/WebDAV Backed Blobstore

Note: This metric is only relevant if your deployment does not use an external S3 repository for external storage with no capacity constraints.


system.disk.persistent.percent of NFS server VM(s)

Description: If applicable, monitor the percentage of persistent disk used on the VM for the NFS Server job.
Purpose: If you do not use an external S3 repository for external storage with no capacity constraints, you must monitor the PCF object store to ensure that you can continue to push new apps and buildpacks.

If you use an internal NFS/WebDAV backed blobstore, consider scaling the persistent disk when it reaches 75% capacity.
Recommended thresholds: ≥ 75%
How to scale: Give your NFS Server additional persistent disk resources.
Additional Details:
Origin: JMX Bridge or BOSH HM Forwarder
Type: Gauge (%)
Applies to: cf:nfs_server
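
For completeness, a short sketch of the disk check itself, assuming the system.disk.persistent.percent gauge is available per NFS Server VM as a percentage (per the Gauge (%) type above); the input mapping and function name are illustrative only.

# Minimal sketch: flag NFS Server VMs whose persistent disk usage has reached
# the recommended 75% threshold.

DISK_THRESHOLD_PCT = 75.0

def blobstore_disk_alerts(persistent_disk_pct_by_vm):
    """Return a message for each VM at or above the threshold."""
    return [
        f"{vm}: persistent disk at {pct:.0f}% -- add persistent disk capacity"
        for vm, pct in persistent_disk_pct_by_vm.items()
        if pct >= DISK_THRESHOLD_PCT
    ]

# Example with hypothetical gauge values:
print(blobstore_disk_alerts({"nfs_server/0": 78.0, "nfs_server/1": 52.0}))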