Key Capacity Scaling Indicators

This topic describes key capacity scaling indicators that operators monitor to determine when they need to scale their Pivotal Application Service (PAS) deployments.

Pivotal provides these indicators to operators as general guidance for capacity scaling. Each indicator is based on platform metrics from different components. This guidance is applicable to most PAS v2.4 deployments. Pivotal recommends that operators fine-tune the suggested alert thresholds by observing historical trends for their deployments.

Diego Cell Capacity Scaling Indicators

There are three key capacity scaling indicators recommended for Diego Cells:

Diego Cell Memory Capacity


rep.CapacityRemainingMemory / rep.CapacityTotalMemory

Description: Percentage of remaining memory capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingMemory indicates the remaining amount, in MiB, of memory available for this cell to allocate to containers.
The metric rep.CapacityTotalMemory indicates the total amount, in MiB, of memory available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to suffer failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs. A worked example of this calculation follows this indicator.
Recommended thresholds: < avg(35%)
How to scale: Scale up your Diego Cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
Alternative Metric: PCF Healthwatch expresses this indicator with the metric healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M.
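
As a worked illustration of this derived metric (not part of the platform itself), the following sketch averages the remaining-memory percentage across cells and applies the three-AZ reasoning: with three AZs, losing one AZ removes roughly one third of total capacity, so an average remaining capacity of about 35% or more leaves enough headroom to absorb that loss. The cell values and names below are hypothetical, and how you collect the gauges from the Firehose is deployment-specific; the same pattern applies to the disk and container capacity indicators that follow.

# Hypothetical sketch: derive the Diego Cell memory capacity indicator from
# per-cell rep.CapacityRemainingMemory / rep.CapacityTotalMemory gauge values.
cells = [
    {"remaining_mib": 14336, "total_mib": 32768},  # example cell readings
    {"remaining_mib": 9216, "total_mib": 32768},
    {"remaining_mib": 20480, "total_mib": 32768},
]

avg_remaining_pct = sum(
    100.0 * c["remaining_mib"] / c["total_mib"] for c in cells
) / len(cells)

# With three AZs, losing one AZ costs roughly 1/3 of capacity, so keep ~35% headroom.
THRESHOLD_PCT = 35.0
if avg_remaining_pct < THRESHOLD_PCT:
    print(f"Average remaining memory {avg_remaining_pct:.1f}% < {THRESHOLD_PCT}%: scale up Diego Cells")
else:
    print(f"Average remaining memory {avg_remaining_pct:.1f}%: OK")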

Diego Cell Disk Capacity


rep.CapacityRemainingDisk / rep.CapacityTotalDisk

Description: Percentage of remaining disk capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingDisk indicates the remaining amount, in MiB, of disk available for this cell to allocate to containers.
The metric rep.CapacityTotalDisk indicates the total amount, in MiB, of disk available for this cell to allocate to containers.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to suffer failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(35%)
How to scale: Scale up your Diego Cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
Alternative Metric: PCF Healthwatch expresses this indicator with the metric healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M.

Diego Cell Container Capacity


rep.CapacityRemainingContainers / rep.CapacityTotalContainers

Description: Percentage of remaining container capacity for a given cell. Monitor this derived metric across all cells in a deployment.

The metric rep.CapacityRemainingContainers indicates the remaining number of containers this cell can host.
The metric rep.CapacityTotalContainers indicates the total number of containers this cell can host.
Purpose: A best-practice deployment of Cloud Foundry includes three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to suffer failure of an entire AZ.

The recommended threshold assumes a three-AZ configuration. Adjust the threshold percentage if you have more or fewer AZs.
Recommended thresholds: < avg(35%)
How to scale: Scale up your Diego Cells.
Additional details:
Origin: Firehose
Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
Alternative Metric: PCF Healthwatch expresses this indicator with the metric healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M.

Firehose Performance Scaling Indicators

Pivotal recommends the following key capacity scaling indicators for monitoring Firehose performance.

Log Transport Loss Rate

loggregator.doppler.dropped{direction=ingress} / loggregator.doppler.ingress
Description: This derived value represents the loss rate occurring as messages are transported from the Loggregator Agent components through the Doppler components to the Firehose endpoints.

The metric loggregator.doppler.ingress represents the number of messages entering Dopplers for transport through the Firehose, and loggregator.doppler.dropped represents the number of messages dropped without delivery.

Messages include the combined stream of logs from all apps and the metrics data from Cloud Foundry components.

For more information about Loggregator components, see Loggregator Architecture.
Purpose: Excessive dropped messages can indicate that the Dopplers or Traffic Controllers are not processing messages quickly enough.

The recommended scaling indicator is a dropped message rate greater than 0.01. This rate is calculated by expressing the total number of dropped messages as a percentage of the total throughput, as shown in the sketch after this indicator.

Doppler emits two separate dropped metrics, one for ingress and one for egress, distinguished by a direction tag on the envelope. For this indicator, use the metric with a direction tag value of ingress.
Recommended thresholds: Scale indicator: ≥ 0.01
If alerting:
Yellow warning: ≥ 0.005
Red critical: ≥ 0.01
How to scale: Scale up the number of Traffic Controller and Doppler instances.

Note: At approximately 40 Doppler instances and 20 Traffic Controller instances, horizontal scaling is no longer useful for improving Firehose performance. To improve performance, add vertical scale to the existing Doppler and Traffic Controller instances by increasing CPU resources.

Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Base metrics are emitted every 5 s
Applies to: cf:doppler
Alternative Metrics: PCF Healthwatch expresses this indicator with the metrics healthwatch.Firehose.LossRate.1H and healthwatch.Firehose.LossRate.1M.
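
The loss-rate arithmetic itself is simple, as the hypothetical sketch below shows. It assumes you have already summed the loggregator.doppler.dropped counter (direction tag ingress only) and the loggregator.doppler.ingress counter over the same window; the function name and sample numbers are illustrative, not part of the platform.

# Hypothetical sketch: Firehose log transport loss rate over a sampling window.
def firehose_loss_rate(dropped_ingress_total, ingress_total):
    # dropped_ingress_total: sum of loggregator.doppler.dropped (direction=ingress)
    # ingress_total: sum of loggregator.doppler.ingress over the same window
    if ingress_total == 0:
        return 0.0
    return dropped_ingress_total / ingress_total

rate = firehose_loss_rate(dropped_ingress_total=1_200, ingress_total=200_000)  # example values
if rate >= 0.01:
    print(f"Loss rate {rate:.4f}: red critical - scale Doppler and Traffic Controller instances")
elif rate >= 0.005:
    print(f"Loss rate {rate:.4f}: yellow warning")
else:
    print(f"Loss rate {rate:.4f}: OK")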

Doppler Message Rate Capacity

loggregator.doppler.ingress (sum across instances) / current number of Doppler instances
Description: This derived value represents the average rate of envelopes (messages) per Doppler instance. Deriving a per-Doppler envelopes-per-second or envelopes-per-minute rate indicates when Doppler instances are at their recommended maximum load and need to be scaled.
Purpose: The recommended scaling indicator is to look at the average load on the Doppler instances and increase the number of Doppler instances when the derived rate reaches 16,000 envelopes per second, or 1 million envelopes per minute. A sketch of this derivation follows this indicator.
Recommended thresholds: Scale indicator: ≥ 16,000 envelopes per second (or 1 million envelopes per minute)
How to scale: Increase the number of Doppler VMs in the Resource Config pane of the PAS tile.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 5 s
Applies to: cf:doppler
Alternative Metric: PCF Healthwatch expresses this indicator with the metric healthwatch.Doppler.MessagesAverage.1M.
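
Because loggregator.doppler.ingress is a cumulative counter, the per-Doppler rate is derived by taking each instance's counter delta over the sampling interval, summing across instances, and dividing by the number of Doppler instances. The sketch below is a hypothetical illustration of that arithmetic with invented sample values.

# Hypothetical sketch: average envelopes per second per Doppler instance.
# ingress_prev / ingress_now: loggregator.doppler.ingress counter values for each
# Doppler instance, sampled interval_s seconds apart (the metric emits every 5 s).
ingress_prev = [4_100_000, 3_950_000, 4_020_000]
ingress_now = [4_180_000, 4_030_000, 4_105_000]
interval_s = 5

total_delta = sum(now - prev for now, prev in zip(ingress_now, ingress_prev))
per_doppler_rate = total_delta / interval_s / len(ingress_now)

if per_doppler_rate >= 16_000:  # roughly 1 million envelopes per minute
    print(f"{per_doppler_rate:,.0f} envelopes/s per Doppler: add Doppler instances")
else:
    print(f"{per_doppler_rate:,.0f} envelopes/s per Doppler: OK")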

Reverse Log Proxy Loss Rate

loggregator.rlp.dropped / loggregator.rlp.ingress
Description: The loss rate of the Reverse Log Proxies (RLPs), that is, the total messages dropped as a percentage of the total traffic coming through the Reverse Log Proxy. Total messages include only logs for bound applications.

This loss rate is specific to the RLP and does not impact the Firehose loss rate. For example, you can suffer lossiness in the RLP while not suffering any lossiness in the Firehose.
Purpose: Excessive dropped messages can indicate that the RLP is overloaded and that the Traffic Controllers need to be scaled.

The recommended scaling indicator is to look at the maximum per-minute loss rate over a 5-minute window and scale if the derived loss rate value grows greater than 0.1.
Recommended thresholds: Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale: Scale up the number of Traffic Controller instances to further balance log load.
Additional details:
Origin: Firehose
Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:loggregator
Alternative Metric: PCF Healthwatch expresses this indicator with the metric healthwatch.SyslogDrain.RLP.LossRate.1M.

Firehose Consumer Scaling Indicators

Pivotal recommends the following scaling indicators for monitoring the performance of consumers of the Firehose.

Slow Consumer Drops

doppler_proxy.slow_consumer
Description: Within PAS, metrics and logs enter the Firehose for transport and exit the platform through a consumer nozzle. If the consuming downstream system fails to keep up with the exiting stream of metrics, the Firehose is forced to close the connection to protect itself from back-pressure. The Firehose increments the slow_consumer counter for each connection that it closes because a consumer could not keep up.
Purpose: This metric indicates that a Firehose consumer, such as a monitoring tool nozzle, is ingesting the stream too slowly. If this number is anomalous, the downstream monitoring tool may not have all expected data, even though that data was successfully transported through the Firehose.
Recommended thresholds: Scale indicator: Pivotal recommends scaling when the rate of Firehose Slow Consumer Drops is anomalous for a given environment. One way to operationalize "anomalous" is sketched after this indicator.
How to scale: Scale up the number of nozzle instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the Firehose. If you use the same subscription ID on each nozzle instance, the Firehose evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the Firehose sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the Firehose sends one-third of the data to each instance. If you want to scale a nozzle, the number of nozzle instances should match the number of Traffic Controller instances.
Additional details:
Origin: Firehose
Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:doppler
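
Because the guidance is "anomalous for a given environment" rather than a fixed threshold, operators typically baseline the doppler_proxy.slow_consumer rate and alert on deviations from it. The sketch below shows one simple, hypothetical approach (mean plus a multiple of the standard deviation over recent samples); it is an illustration, not a Pivotal-prescribed algorithm, and the sample values are invented.

import statistics

# Hypothetical sketch: flag an anomalous slow-consumer drop rate.
# history: recent per-interval increases of the doppler_proxy.slow_consumer counter.
history = [0, 1, 0, 0, 2, 1, 0, 1, 0, 0]
current = 9  # latest per-interval increase

baseline = statistics.mean(history)
spread = statistics.pstdev(history) or 1.0  # avoid a zero-width band on flat history

if current > baseline + 3 * spread:
    print(f"slow_consumer rate {current} is anomalous (baseline {baseline:.1f}): scale nozzle instances")
else:
    print(f"slow_consumer rate {current} is within the normal range for this environment")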

Reverse Log Proxy Egress Dropped Messages

rlp.dropped, direction: egress
Description: Within PAS, logs and metrics enter Loggregator for transport and then egress through the Reverse Log Proxy (RLP). The RLP drops messages when consumers of the RLP, such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly.

Note: The rlp.dropped metric includes both ingress and egress directions. To differentiate between ingress and egress, refer to the direction tag on the metric.

Purpose: This metric indicates that a consumer of logs and metrics from the RLP, such as a monitoring tool nozzle, is ingesting RLP messages too slowly.
Recommended thresholds: Scale indicator: Scale when the rate of rlp.dropped metrics with the direction tag set to egress is continuously increasing.
How to scale: Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance.
Additional details:
Origin: Reverse Log Proxy
Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:loggregator_trafficcontroller

Doppler Egress Dropped Messages

doppler.dropped, direction: egress
Description: Within PAS, logs and metrics enter Loggregator for transport and then egress through Doppler. Doppler drops messages when consumers of the RLP, such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly.

Note: The doppler.dropped metric includes both ingress and egress directions. To differentiate between ingress and egress, refer to the direction tag on the metric.

Purpose: This metric indicates that a consumer of logs and metrics from the RLP, such as a monitoring tool nozzle, is ingesting too slowly.
Recommended thresholds: Scale indicator: Scale when the rate of doppler.dropped metrics with the direction tag set to egress is continuously increasing.
How to scale: Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance.
Additional details:
Origin: Doppler
Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:doppler

CF Syslog Drain Performance Scaling Indicators

There are three key capacity scaling indicators recommended for CF Syslog Drain performance.

Note: These CF Syslog Drain scaling indicators are only relevant if your deployment contains apps using the CF syslog drain binding feature.

Note: If you enable agent-based syslog on your deployment, the Adapter Loss Rate and CF Syslog Drain Bindings Count indicators are not relevant and should be ignored. For more information about the Adapter Loss Rate and CF Syslog Drain Bindings Count indicators, see Adapter Loss Rate and CF Syslog Drain Bindings Count. For more information about enabling agent-based syslog, see Loggregator Syslog Agent Increases Scale For Syslog Drains.

Adapter Loss Rate

cf-syslog-drain.adapter.dropped / cf-syslog-drain.adapter.ingress
Description: The loss rate of the Syslog Adapters, that is, the total messages dropped as a percentage of the total traffic coming through the Syslog Adapters. Total messages include only logs for bound applications.

This loss rate is specific to the Syslog Adapters and does not impact the Firehose loss rate. For example, you can suffer lossiness in syslog while not suffering any lossiness in the Firehose.
Purpose: Indicates that the syslog drains are not keeping up with the number of logs that a syslog-drain-bound app is producing. This likely means that the syslog drain consumer is failing to keep up with the incoming log volume.

The recommended scaling indicator is to look at the maximum per-minute loss rate over a 5-minute window and scale if the derived loss rate value grows greater than 0.1.
Recommended thresholds: Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale: Performance test your syslog server, and review the logs of the syslog-consuming system for intake issues and other performance issues that indicate a need to scale the consuming system.
Additional details:
Origin: Firehose
Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:cf-syslog
Alternative Metric: PCF Healthwatch expresses this indicator with the metric healthwatch.SyslogDrain.Adapter.LossRate.1M.

CF Syslog Drain Bindings Count

cf-syslog-drain.drain_adapter.drain_bindings (sum across instances) / number of Syslog Adapters
Description: The number of CF syslog drain bindings. CF syslog drain bindings enable app syslog drains by managing the connections between the Syslog Adapters and the Reverse Log Proxies (RLPs).
Purpose: Each syslog drain binding requires two Syslog Adapters. A configuration with two Syslog Adapters allows for approximately 500 syslog drain bindings. Pivotal recommends adding one Syslog Adapter instance for every 250 additional syslog drain bindings. A sketch of this sizing arithmetic follows this indicator.
Recommended thresholds: Scale indicator: ≥ 450 syslog drain bindings

Pivotal recommends scaling to three Syslog Adapters if a maximum of 450 syslog drain bindings is reached in a one-hour window.
How to scale: Increase the number of Syslog Adapter VMs in the Resource Config pane of the PAS tile.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:cf-syslog
Alternative Metric: PCF Healthwatch expresses this indicator with the metric healthwatch.SyslogDrain.Adapter.BindingsAverage.5M.
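
The guidance above (roughly 500 bindings on the baseline two Syslog Adapters, then one additional adapter per 250 additional bindings) translates into a simple sizing calculation. The helper below is a hypothetical sketch of that arithmetic, not an official sizing tool.

import math

# Hypothetical sketch: recommended Syslog Adapter count for a given number
# of CF syslog drain bindings, following the guidance above.
def recommended_adapters(drain_bindings):
    if drain_bindings <= 500:
        return 2  # two adapters cover roughly the first 500 bindings
    extra = drain_bindings - 500
    return 2 + math.ceil(extra / 250)  # one more adapter per 250 additional bindings

print(recommended_adapters(450))   # 2 -> at the 450 scale indicator, plan the third adapter
print(recommended_adapters(700))   # 3
print(recommended_adapters(1200))  # 5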

Syslog Agent Loss Rate

Note: The Syslog Agent Loss Rate indicator is only relevant if agent-based syslog is enabled on your deployment. Agent-based syslog is disabled by default. For more information about enabling agent-based syslog, see Loggregator Syslog Agent Increases Scale For Syslog Drains.

loggregator-agent.loggr-syslog-agent.dropped{direction:egress} / loggregator-agent.loggr-syslog-agent.ingress
Description: The loss rate of Syslog Agents, that is, the messages dropped as a percentage of total message traffic through Syslog Agents. The message traffic through Syslog Agents includes logs for bound apps.

The Syslog Agent loss rate does not affect the Firehose loss rate. Message loss can occur in Syslog Agents without message loss occurring in the Firehose.
Purpose: This metric indicates that the syslog drain consumer is ingesting logs from a syslog-drain-bound app too slowly.
Recommended thresholds: The recommended scaling indicator is the maximum Syslog Agent loss rate per minute within a five-minute window. Scale up if the maximum loss rate is greater than 0.1.

Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale: Review the logs of the syslog server for intake issues and other performance issues. Scale the syslog server if necessary.
Additional details:
Origin: Syslog Agent
Type: Counter (Integer)
Frequency: Emitted every 60 s

Log Cache Scaling Indicator

Pivotal recommends the following scaling indicator for monitoring the performance of Log Cache.

Log Cache Caching Duration

log_cache.cache_period
Description: This metric indicates the age in milliseconds of the oldest data point stored in Log Cache.
Purpose: Log Cache stores all messages that pass through the Firehose in an ephemeral in-memory store. The size of this store and the cache duration depend on the amount of memory available on the VM on which Log Cache runs. Some PAS features, such as App Autoscaler, rely on data being available in Log Cache.

Pivotal recommends scaling the VM on which Log Cache runs so that Log Cache can hold all messages that pass through Loggregator in the last 15 minutes, or 900,000 milliseconds. The threshold check is illustrated in the sketch after this indicator.
Recommended thresholds: Scale indicator: Scale the VM on which Log Cache runs when the cache period drops below 15 minutes, or 900,000 milliseconds. Typically, Log Cache runs on the Doppler VM.
How to scale: Scale up the number of Doppler VMs or choose a VM type for Doppler that provides more memory.
Additional details:
Origin: log-cache
Type: Gauge
Frequency: Emitted every 15 s
Applies to: cf:log-cache
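
The threshold is a unit conversion: 15 minutes is 15 × 60 × 1,000 = 900,000 milliseconds. The hypothetical check below compares a sampled log_cache.cache_period value against that target; the sample value is invented.

# Hypothetical sketch: check the Log Cache caching duration against the
# 15-minute (900,000 ms) target.
TARGET_MS = 15 * 60 * 1000  # 900,000 ms

cache_period_ms = 720_000  # example sampled value of log_cache.cache_period

if cache_period_ms < TARGET_MS:
    minutes = cache_period_ms / 60_000
    print(f"Cache period {minutes:.1f} min is below the 15-minute target: scale Doppler VMs or use a larger VM type")
else:
    print("Cache period meets the 15-minute target")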

Router Performance Scaling Indicator

There is one key capacity scaling indicator recommended for Router performance.

Router VM CPU Utilization

system.cpu.user of the Gorouter VM(s)
Description: CPU utilization of the Gorouter VM(s).
Purpose: High CPU utilization of the Gorouter VMs can increase latency and cause throughput, or requests per second, to level off. Pivotal recommends keeping the CPU utilization within a maximum range of 60-70% for best Gorouter performance.

If you want to increase throughput capabilities while also keeping latency low, Pivotal recommends scaling the Gorouter while continuing to ensure that CPU utilization does not exceed the maximum recommended range.
Recommended thresholds: Scale indicator: ≥ 60%
If alerting:
Yellow warning: ≥ 60%
Red critical: ≥ 70%
How to scale: Resolve high utilization by scaling the Gorouters horizontally or vertically. To scale, edit the Router VM in the Resource Config pane of the PAS tile.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:router

UAA Performance Scaling Indicator

There is one key capacity scaling indicator recommended for UAA performance.

UAA VM CPU Utilization

system.cpu.user of the UAA VM(s)
Description: CPU utilization of the UAA VM(s).
Purpose: High CPU utilization of the UAA VMs can increase latency and cause throughput, or requests per second, to level off. Pivotal recommends keeping the CPU utilization within a maximum range of 80-90% for best UAA performance.

If you want to increase throughput capabilities while keeping latency low, Pivotal recommends scaling the UAA VMs and ensuring that CPU utilization does not exceed the maximum recommended range.
Recommended thresholds: Scale indicator: ≥ 80%
If alerting:
Yellow warning: ≥ 80%
Red critical: ≥ 90%
How to scale: Resolve high utilization by scaling UAA horizontally or vertically. To scale UAA, navigate to the Resource Config pane of the PAS tile and edit the number of your UAA VM instances, or change the VM type to a type that uses more CPU cores.
Additional details:
Origin: Firehose
Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:uaa

NFS/WebDAV Backed Blobstore

There is one key capacity scaling indicator recommended for an internal NFS/WebDAV backed blobstore.

Note: This metric is only relevant if your deployment does not use an external S3 repository for external storage with no capacity constraints.


system.disk.persistent.percent of NFS server VM(s)

Description: If applicable, monitor the percentage of persistent disk used on the VM for the NFS Server job.
Purpose: If you do not use an external S3 repository for external storage with no capacity constraints, you must monitor the PAS object store to ensure that you can continue to push new apps and buildpacks.

If you use an internal NFS/WebDAV backed blobstore, consider scaling the persistent disk when it reaches 75% capacity.
Recommended thresholds: ≥ 75%
How to scale: Give your NFS Server additional persistent disk resources.
Additional details:
Origin: Firehose
Type: Gauge (%)
Applies to: cf:nfs_server