Key Capacity Scaling Indicators

This topic describes key capacity scaling indicators that operators monitor to determine when they need to scale their VMware Tanzu Application Service for VMs (TAS for VMs) deployments.

Overview

VMware provides these indicators to operators as general guidance for capacity scaling. Each indicator is based on platform metrics emitted by different components.

This guidance is applicable to most TAS for VMs deployments. VMware recommends that operators fine-tune the suggested alert thresholds by observing historical trends for their deployments.

For more information about accessing metrics used in these key capacity scaling indicators, see Overview of Logging and Metrics.

Diego Cell Capacity Scaling Indicators

There are three key capacity scaling indicators VMware recommends for a Diego Cell.


Diego Cell Memory Capacity

Description The Diego Cell Memory Capacity indicator is the percentage of remaining memory your Diego Cells can allocate to containers.
Divide the CapacityRemainingMemory metric by the CapacityTotalMemory metric to get this percentage.
The metric CapacityRemainingMemory is the remaining memory, in MiB, available to a Diego Cell.
The metric CapacityTotalMemory is the total memory, in MiB, available to a Diego Cell.
Source ID rep
Metrics CapacityRemainingMemory
CapacityTotalMemory
Recommended thresholds < average (35%)
This threshold assumes you have three AZs.
How to scale Deploy additional Diego Cells until the average free memory is at least 35%. This threshold assumes you have three AZs.
Additional details Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
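
The calculation in this table can be sketched in Python as follows. This is a minimal illustration, not part of the platform: the function names and sample readings are hypothetical, and how you retrieve CapacityRemainingMemory and CapacityTotalMemory depends on your monitoring tool. The sketch averages the per-cell percentages and compares the result with the 35% threshold.

```python
from statistics import mean

def percent_remaining(remaining_mib: float, total_mib: float) -> float:
    """Return remaining capacity as a percentage of total capacity."""
    if total_mib <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * remaining_mib / total_mib

def memory_should_scale(cells: list[tuple[float, float]],
                        threshold_pct: float = 35.0) -> bool:
    """cells holds one (CapacityRemainingMemory, CapacityTotalMemory) pair per Diego Cell."""
    average = mean(percent_remaining(r, t) for r, t in cells)
    return average < threshold_pct

# Hypothetical readings in MiB for three Diego Cells.
sample_cells = [(8192, 32768), (12288, 32768), (10240, 32768)]
print(memory_should_scale(sample_cells))
```

The same two functions apply to any remaining/total metric pair that is reported in matching units.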

Diego Cell Disk Capacity

Description The Diego Cell Disk Capacity indicator is the percentage of remaining disk capacity a given Diego Cell can allocate to containers.
Divide the CapacityRemainingDisk metric by the CapacityTotalDisk metric to get this percentage.
The metric CapacityRemainingDisk is the remaining amount of disk available, in MiB, for this Diego Cell.
The metric CapacityTotalDisk indicates the total amount of disk available, in MiB, for this Diego Cell.
Source ID rep
Metrics CapacityRemainingDisk
CapacityTotalDisk
Recommended thresholds < average (35%)
This threshold assumes you have three AZs.
How to scale Deploy additional Diego Cells until the average free disk is at least 35%. This threshold assumes you have three AZs.
Additional details Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells

Diego Cell Container Capacity

Description The Diego Cell Container Capacity indicator is the percentage of containers remaining that a given Diego Cell can host.
Divide the CapacityRemainingContainers metric by the CapacityTotalContainers metric to get this percentage.
The metric CapacityRemainingContainers is the remaining number of containers.
The metric CapacityTotalContainers is the total number of containers.
Source ID rep
Metrics CapacityRemainingContainers
CapacityTotalContainers
Recommended thresholds < average (35%)
This threshold assumes you have three AZs.
How to scale Deploy additional Diego Cells until the average free container capacity is at least 35%. This threshold assumes you have three AZs.
Additional details Type: Gauge (%)
Frequency: Emitted every 60 s
Applies to: cf:diego_cells
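
To estimate how many Diego Cells restore 35% average free capacity, the following sketch may help. It assumes that every cell has the same total capacity and that the total amount in use does not change when cells are added; both are simplifying assumptions, and the function name is hypothetical. Under those assumptions the average free fraction with N cells is 1 - used / (N x total), so N must be at least used / (0.65 x total).

```python
import math

def cells_needed(used_per_cell: list[float], cell_total: float,
                 target_free: float = 0.35) -> int:
    """Smallest cell count whose average free fraction is at least target_free.

    Assumes identical cells and a constant total amount in use.
    """
    total_used = sum(used_per_cell)
    needed = math.ceil(total_used / ((1.0 - target_free) * cell_total))
    # Never recommend fewer cells than are currently deployed.
    return max(needed, len(used_per_cell))

# Hypothetical usage in MiB on three cells of 32768 MiB each.
print(cells_needed([24576, 20480, 22528], 32768))
```

The input can be memory, disk, or container counts, as long as the usage and total values share the same unit.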

Firehose Performance Scaling Indicators

VMware recommends three key capacity scaling indicators for monitoring Firehose performance.


Log Transport Loss Rate

Description The Log Transport Loss Rate indicator is the rate of messages dropped between the Dopplers and the Firehose.
Divide the dropped{direction=ingress} metric by the ingress metric to get the loss rate.

The ingress metric is the number of messages entering the Dopplers. The dropped metric is the number of messages never delivered to the Firehose.

For more information about Loggregator components, see Loggregator Architecture.
Source ID doppler
Metrics dropped
ingress
Label {direction=ingress}
Dopplers emit two separate dropped metrics, one for ingress and one for egress. The envelopes have a direction label. For this indicator, use the metric whose direction label has a value of ingress.
Recommended thresholds Scale indicator: ≥ 0.01
If alerting:
Yellow warning: ≥ 0.005
Red critical: ≥ 0.01
Excessive dropped messages can indicate the Dopplers or Traffic Controllers are not processing messages quickly enough.
How to scale Scale up the number of Traffic Controller and Doppler instances.

Note: At approximately 40 Doppler instances and 20 Traffic Controller instances, horizontal scaling is no longer useful for improving Firehose performance. To improve performance, add vertical scale to the existing Doppler and Traffic Controller instances by increasing CPU resources.

Additional details Type: Gauge (float)
Frequency: Base metrics are emitted every 5 s
Applies to: cf:doppler
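
As an illustration of the loss rate and alert thresholds in this table, here is a minimal Python sketch. The function names and sample counts are hypothetical; the inputs stand for the dropped (direction ingress) and ingress values observed over the same interval.

```python
def loss_rate(dropped: float, ingress: float) -> float:
    """Fraction of messages dropped; 0.0 when no messages entered."""
    return dropped / ingress if ingress > 0 else 0.0

def classify(rate: float, warning: float = 0.005, critical: float = 0.01) -> str:
    """Map a loss rate to the alert levels listed in the table."""
    if rate >= critical:
        return "red"
    if rate >= warning:
        return "yellow"
    return "ok"

# Hypothetical counts for one interval.
print(classify(loss_rate(dropped=70, ingress=10_000)))
```

The warning and critical values are parameters so that you can substitute thresholds tuned to your own deployment.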

Doppler Message Rate Capacity

Description The Doppler Message Rate Capacity indicator is the average number of messages per Doppler instance. Divide the sum of ingress metrics across instances by the current number of Doppler instances to get this average.
Source ID doppler
Metrics ingress
Recommended thresholds Scale indicator: ≥ 16,000 envelopes per second (or 1 million envelopes per minute)
How to scale Increase the number of Doppler VMs in the Resource Config pane of the TAS for VMs tile.
Additional details Type: Gauge (float)
Frequency: Emitted every 5 s
Applies to: cf:doppler
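
The average described above, and a rough instance count that keeps it under the threshold, can be sketched as follows. The names and sample rates are hypothetical, and the instance estimate assumes that total ingress stays constant and spreads evenly across Dopplers.

```python
import math

SCALE_THRESHOLD = 16_000  # envelopes per second per Doppler, from the table

def average_rate(ingress_per_instance: list[float]) -> float:
    """Sum of ingress rates divided by the number of Doppler instances."""
    return sum(ingress_per_instance) / len(ingress_per_instance)

def needs_scaling(ingress_per_instance: list[float]) -> bool:
    return average_rate(ingress_per_instance) >= SCALE_THRESHOLD

def instances_to_stay_below(ingress_per_instance: list[float]) -> int:
    """Smallest instance count whose average falls strictly below the threshold."""
    total = sum(ingress_per_instance)
    return math.floor(total / SCALE_THRESHOLD) + 1

# Hypothetical per-instance ingress rates in envelopes per second.
rates_now = [18_000, 17_500, 16_500]
print(needs_scaling(rates_now), instances_to_stay_below(rates_now))
```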

Reverse Log Proxy Loss Rate

Description The Reverse Log Proxy Loss Rate indicator is the rate of bound app logs dropped from the Reverse Log Proxies (RLP). Divide the dropped metric by the ingress metric to get this indicator.

This loss rate is specific to the RLP and does not impact the Firehose loss rate.
Source ID rlp
Metrics ingress
dropped
Recommended thresholds Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
Excessive dropped messages can indicate that the RLP is overloaded and that the Traffic Controllers need to be scaled.
How to scale Scale up the number of Traffic Controller instances to further balance log load.
Additional details Type: Counter (Integer)
Frequency: Emitted every 60 s
Applies to: cf:loggregator

Firehose Consumer Scaling Indicators

VMware recommends the following scaling indicators for monitoring the performance of consumers of the Firehose.


Slow Consumer Drops

Description The Slow Consumer Drops indicator is the slow_consumer metric, which increments each time the Firehose closes a connection because a consumer could not keep up.
This indicator shows how fast a Firehose consumer, such as a monitoring tool nozzle, is ingesting data. If this number is anomalous, it may result in the downstream monitoring tool not having all expected data, even though that data was successfully transported through the Firehose.
Source ID doppler_proxy
Metrics slow_consumer
Recommended thresholds Scale indicator: VMware recommends scaling when the rate of Firehose Slow Consumer Drops is anomalous for a given environment.
How to scale Scale up the number of nozzle instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the Firehose. If you use the same subscription ID on each nozzle instance, the Firehose evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the Firehose sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the Firehose sends one-third of the data to each instance. If you want to scale a nozzle, the number of nozzle instances should match the number of Traffic Controller instances.
Additional details Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:doppler
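
This page does not define what counts as anomalous. One possible approach, shown here only as an assumption-laden sketch, is to compare the current slow_consumer rate with a historical baseline using a z-score. The function name, the limit of three standard deviations, and the sample data are all hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float,
                 z_limit: float = 3.0) -> bool:
    """Flag current when it exceeds the historical mean by more than z_limit standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > z_limit

# Hypothetical slow_consumer increments per interval.
past = [2, 3, 2, 4, 3, 2, 3]
print(is_anomalous(past, 12), is_anomalous(past, 3))
```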

Reverse Log Proxy Egress Dropped Messages

Description The Reverse Log Proxy Egress Dropped Messages indicator shows the number of messages dropped when consumers of the RLP, such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly. Within TAS for VMs, logs and metrics enter Loggregator for transport and then egress through the Reverse Log Proxy (RLP).
Source ID rlp
Metrics dropped
Label direction: egress
Recommended thresholds Scale indicator: Scale when the rate of rlp.dropped, direction: egress metrics is continuously increasing.
How to scale Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance.
Additional details Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:loggregator_trafficcontroller
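
A check for a continuously increasing drop rate could look like the following sketch. It assumes counter samples taken at a fixed interval with no counter resets, and it treats the rate as continuously increasing when every interval's increase is larger than the previous one over at least three intervals. The function names and the three-interval minimum are hypothetical choices.

```python
def interval_rates(counter_samples: list[int]) -> list[int]:
    """Per-interval increase of a monotonically increasing counter."""
    return [b - a for a, b in zip(counter_samples, counter_samples[1:])]

def continuously_increasing(counter_samples: list[int],
                            min_intervals: int = 3) -> bool:
    """True when each interval's increase exceeds the one before it."""
    rates = interval_rates(counter_samples)
    if len(rates) < min_intervals:
        return False
    return all(later > earlier for earlier, later in zip(rates, rates[1:]))

# Hypothetical readings of the dropped counter with direction egress.
print(continuously_increasing([0, 5, 15, 40, 90]))
print(continuously_increasing([0, 5, 10, 15, 20]))
```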

Doppler Egress Dropped Messages

Description The Doppler Egress Dropped Messages indicator shows the number of messages that the Dopplers drop when consumers of the RLP, such as monitoring tool nozzles, ingest the exiting stream of logs and metrics too slowly. For more information about how the Dopplers transport logs and metrics through Loggregator, see Loggregator Architecture.

Note: The doppler.dropped metric includes both ingress and egress directions. To differentiate between ingress and egress, refer to the direction tag on the metric.

Source ID doppler
Metrics dropped
egress
Label direction: egress
Recommended thresholds Scale indicator: Scale when the rate of doppler.dropped, direction: egress metrics is continuously increasing.
How to scale Scale up the number of nozzle instances. The number of nozzle instances should match the number of Traffic Controller instances. You can scale a nozzle using the subscription ID specified when the nozzle connects to the RLP. If you use the same subscription ID on each nozzle instance, the RLP evenly distributes data across all instances of the nozzle. For example, if you have two nozzle instances with the same subscription ID, the RLP sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the RLP sends one-third of the data to each instance.
Additional details Type: Counter
Frequency: Emitted every 5 s
Applies to: cf:doppler

Syslog Drain Performance Scaling Indicators

There is a single key capacity scaling indicator VMware recommends for Syslog Drain performance.

Note: This Syslog Drain scaling indicator is only relevant if your deployment contains apps using the syslog drain binding feature.


Syslog Agent Loss Rate

Description Divide the loggregator-agent.syslog_agent.dropped{direction:egress} metric by the loggregator-agent.loggr-syslog-agent.ingress{scope:all_drains} metric to get the rate of messages dropped as a percentage of total message traffic through Syslog Agents. The message traffic through Syslog Agents includes logs for bound apps.
A high Syslog Agent loss rate indicates that the syslog drain consumer is ingesting logs from a syslog-drain-bound app too slowly.
The Syslog Agent loss rate does not affect the Firehose loss rate. Message loss can occur in Syslog Agents without message loss occurring in the Firehose.
Source ID loggregator-agent
Metrics dropped
ingress
Label direction:egress
scope:all_drains
Recommended thresholds The scaling indicator VMware recommends is the minimum Syslog Agent loss rate per minute within a five-minute window. Scale up if the loss rate is 0.1 or greater for five minutes or longer.

Scale indicator: ≥ 0.1
If alerting:
Yellow warning: ≥ 0.01
Red critical: ≥ 0.1
How to scale Review the logs of the syslog server for intake issues and other performance issues. Scale the syslog server if necessary.
Additional details Type: Counter (Integer)
Frequency: Emitted every 60 s
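
The five-minute window rule can be sketched as follows, assuming you already have per-minute dropped and ingress counts for the Syslog Agents. The function names and sample values are hypothetical. Using the minimum per-minute rate in the window means that the check passes only when every minute in the window is at or above the threshold.

```python
def min_loss_rate(per_minute: list[tuple[float, float]]) -> float:
    """per_minute holds (dropped, ingress) counts for each minute in the window."""
    return min(d / i if i > 0 else 0.0 for d, i in per_minute)

def syslog_should_scale(per_minute: list[tuple[float, float]],
                        threshold: float = 0.1, window: int = 5) -> bool:
    """True when the loss rate stayed at or above threshold for the last window minutes."""
    if len(per_minute) < window:
        return False  # not enough history yet
    return min_loss_rate(per_minute[-window:]) >= threshold

# Hypothetical (dropped, ingress) counts for five consecutive minutes.
sustained = [(120, 1000), (150, 1000), (110, 1000), (130, 1000), (200, 1000)]
brief_dip = sustained[:-1] + [(50, 1000)]
print(syslog_should_scale(sustained), syslog_should_scale(brief_dip))
```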

Log Cache Scaling Indicator

VMware recommends the following scaling indicator for monitoring the performance of Log Cache.


Log Cache Caching Duration

Description The Log Cache Caching Duration indicator shows the age in milliseconds of the oldest data point stored in Log Cache.
Log Cache stores all messages that are passed through the Firehose in an ephemeral in-memory store. The size of this store and the cache duration are dependent on the amount of memory available on the VM where Log Cache runs. Typically, Log Cache runs on the Doppler VM.
Source ID log_cache
Metrics log_cache_cache_period
Recommended thresholds Scale indicator: Scale the VM on which Log Cache runs when the log_cache_cache_period metric drops below 900000 milliseconds (15 minutes).
How to scale Scale up the number of Doppler VMs or choose a VM type for Doppler that provides more memory.
Additional details Type: Gauge
Frequency: Emitted every 15 s
Applies to: cf:log-cache

Gorouter Performance Scaling Indicator

There is one key capacity scaling indicator VMware recommends for Gorouter performance.

Note: The following metric appears in the Firehose in two different formats. The table below lists both formats. For more information, see Duplicate Metrics Appear in the Firehose in VMware Tanzu Application Service for VMs v2.8 Release Notes.


Gorouter VM CPU Utilization

Description The Gorouter VM CPU Utilization indicator shows how much of a Gorouter VM’s CPU is being used. High CPU utilization of the Gorouter VMs can increase latency and cause requests per second to decrease.
Source ID cpu
Metrics user
Recommended thresholds Scale indicator: ≥ 60%
If alerting:
Yellow warning: ≥ 60%
Red critical: ≥ 70%
How to scale Scale the Gorouters horizontally or vertically by editing the Router VM in the Resource Config pane of the TAS for VMs tile. At greater than 8 CPUs, vertical scaling is no longer beneficial for increasing throughput.
Additional details Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:router

UAA Performance Scaling Indicator

There is one key capacity scaling indicator VMware recommends for UAA performance.

Note: The following metric appears in the Firehose in two different formats. The table below lists both formats. For more information, see Duplicate Metrics Appear in the Firehose in VMware Tanzu Application Service for VMs v2.8 Release Notes.


UAA VM CPU Utilization

Description The UAA VM CPU Utilization indicator shows how much of the UAA VM’s CPU is used. High CPU utilization of the UAA VMs can cause requests per second to decrease.
Source ID cpu
Metrics user
Recommended thresholds Scale indicator: ≥ 80%
If alerting:
Yellow warning: ≥ 80%
Red critical: ≥ 90%
How to scale Scale UAA horizontally or vertically. To scale UAA, navigate to the Resource Config pane of the TAS for VMs tile and edit the number of your UAA VM instances or change the VM type to one that provides more CPU cores.
Additional details Type: Gauge (float)
Frequency: Emitted every 60 s
Applies to: cf:uaa

NFS/WebDAV Backed Blobstore

There is one key capacity scaling indicator for external S3 external storage.

Note: This metric is only relevant if your deployment does not use an external S3 repository for external storage with no capacity constraints.

Note: The following metric appears in the Firehose in two different formats. The table below lists both formats. For more information, see Duplicate Metrics Appear in the Firehose in VMware Tanzu Application Service for VMs v2.8 Release Notes.


External S3 External Storage

Description The External S3 External Storage indicator shows the percentage of persistent disk used. If applicable: Monitor the percentage of persistent disk used on the VM for the NFS Server job.
If you do not use an external S3 repository for external storage with no capacity constraints, you must monitor the TAS for VMs object store so that you can continue to push new apps and buildpacks.
Source ID disk
Metrics persistent.percent
Recommended thresholds ≥ 75%
How to scale Give your NFS Server additional persistent disk resources. If you use an internal NFS/WebDAV backed blobstore, consider scaling the persistent disk when it reaches 75% capacity.
Additional details Type: Gauge (%)
Applies to: cf:nfs_server