Using HTTP Latency as a Scaling Metric

Page last updated:

This topic describes the HTTP latency metric and how to configure App Autoscaler to use this metric when scaling apps.

Overview of HTTP Latency

When an HTTP request is made to an app, the Gorouter in VMware Tanzu Application Service for VMs (TAS for VMs) generates several metrics. One of these metrics is gorouter.latency, or HTTP latency. The HTTP latency metric measures the length of time it takes to process an HTTP request, starting from when the Gorouter receives a request and ending when the Gorouter finishes processing the response from the app. This metric includes the length of time it takes all back-end endpoints to respond, including other apps and TAS for VMs components such as Cloud Controller and UAA. Long uploads, downloads, or app responses increase the time.

For example, you might have a Service Level Agreement (SLA) specifying that 95% of requests for an app must be processed in less than 300 milliseconds. To help achieve this, you can configure an autoscaling rule for Autoscaler to create additional instances of the app when the HTTP latency metric reaches 250 milliseconds.

You can configure Autoscaler to use HTTP latency as the scaling metric for an app in the following ways:

To monitor when Autoscaler scales an app based on changes in HTTP latency, see Reviewing Autoscaling Events for Changes in HTTP Latency below.

For information about use cases that may complicate or prevent you from configuring HTTP latency as the scaling metric for an app, see Special Considerations for Using HTTP Latency as a Scaling Metric below.

VMware recommends that you load-test your app to verify that the autoscaling rules you configured are effective. For more information, see Load-Testing Your App in Using Autoscaler in Production.

For more information about the HTTP latency metric, see Router Handling Latency in Key Performance Indicators. For more information about how TAS for VMs routes HTTP requests, see TAS for VMs Routing Architecture.

Configuring HTTP Latency as the Scaling Metric for an App Through the cf CLI

The procedures in this section describe how to configure Autoscaler to use HTTP latency as the scaling metric for an app through the cf CLI.

You can configure configure Autoscaler to use HTTP latency as the scaling metric for an app in the following ways:

For the procedures in this section, you must use the App Autoscaler CLI plugin. To download and install the App Autoscaler CLI plugin, see Install the App Autoscaler CLI Plugin in Using the App Autoscaler CLI.

Configure an Autoscaling Rule Using a Manifest File

You can configure autoscaling rules declaratively through a manifest file. This manifest file only configures Autoscaler, and does not interfere with any other existing app manifest files in your TAS for VMs deployment.

To configure an autoscaling rule that defines HTTP latency as its scaling metric using a manifest file:

  1. In a terminal window, target the space in which the app you want to scale is deployed by running:

    cf target -o ORG-NAME -s SPACE-NAME
    

    Where:

    • ORG-NAME is the name of the org containing the space in which the app you want to scale is deployed.
    • SPACE-NAME is the name of the space in which the app you want to scale is deployed.
  2. If the space in which the app you want to scale is deployed does not already have an Autoscaler service instance of Autoscaler deployed in it, create an Autoscaler service instance by running:

    cf create-service app-autoscaler PLAN-NAME SERVICE-NAME
    

    Where:

    • PLAN-NAME is the name of the service plan you want to use for the Autoscaler service instance.
    • SERVICE-INSTANCE-NAME is the name you want to give the Autoscaler service instance. For example, autoscaler.

      If there is already an Autoscaler service instance in the space in which the app you want to scale is deployed, skip this step.
  3. Bind the Autoscaler service instance you created in the previous step to the app you want to scale by running:

    cf bind-service APP-NAME SERVICE-INSTANCE-NAME
    

    Where:

    • APP-NAME is the name of the app you want to scale.
    • SERVICE-INSTANCE-NAME is the name of the Autoscaler service instance in the previous step.
  4. To create a manifest file for Autoscaler that configures an autoscaling rule with HTTP latency as its scaling metric, create a YAML file that includes the following configuration parameters:

    ---
    instance_limits:
      min: LOWER-SCALING-LIMIT
      max: UPPER-SCALING-LIMIT
    rules:
    - rule_type: http_latency
      rule_sub_type: PERCENTILE
      threshold:
        min: MINIMUM-LATENCY-THRESHOLD
        max: MAXIMUM-LATENCY-THRESHOLD
    scheduled_limit_changes: []
    

    Where:

    • LOWER-SCALING-LIMIT is the minimum number of instances you want Autoscaler to create for the app.
    • UPPER-SCALING-LIMIT is the maximum number of instances you want Autoscaler to create for the app.
    • PERCENTILE is the percentile that Autoscaler uses in scaling decisions. Valid values are avg_95th or avg_99th. This value configures Autoscaler to ignore HTTP requests that fall outside either the 95th or 99th percentile and average the latency of the remaining 95% or 99% of HTTP requests.
    • MINIMUM-LATENCY-THRESHOLD is the minimum HTTP latency threshold in milliseconds. If the average latency of HTTP requests falls below this number, Autoscaler scales the number of app instances down.
    • MAXIMUM-LATENCY-THRESHOLD is the maximum HTTP latency threshold in milliseconds. If the average latency of HTTP requests rises above this number, Autoscaler scales the number of app instances up. To avoid excessive cycling, VMware recommends that you configure a maximum threshold that is at least twice the value of the minimum threshold.

      The following example shows an Autoscaler manifest file with a percentile of 95%, a minimum HTTP latency threshold of 125 milliseconds, and a maximum HTTP latency threshold of 250 milliseconds:
    ---
    instance_limits:
      min: LOWER-SCALING-LIMIT
      max: UPPER-SCALING-LIMIT
    rules:
    - rule_type: http_latency
      rule_sub_type: PERCENTILE
      threshold:
        min: MINIMUM-LATENCY-THRESHOLD
        max: MAXIMUM-LATENCY-THRESHOLD
    scheduled_limit_changes: []
    
  5. Apply the autoscaling rule you configured in the previous step to the app you want to scale by running:

    cf configure-autoscaling APP-NAME MANIFEST-FILENAME
    

    Where:

    • APP-NAME is the name of the app.
    • MANIFEST-FILENAME is the filename of the manifest file you created in the previous step. For example, autoscaler.yml.

Configure an Autoscaling Rule Using CLI Commands

To configure an autoscaling rule that defines HTTP latency as its scaling metric using CLI commands:

  1. In a terminal window, target the space in which the app you want to scale is deployed by running:

    cf target -o ORG-NAME -s SPACE-NAME
    

    Where:

    • ORG-NAME is the name of the org containing the space in which the app you want to scale is deployed.
    • SPACE-NAME is the name of the space in which the app you want to scale is deployed.
  2. If the space in which the app you want to scale is deployed does not already have a service instance of Autoscaler deployed in it, create an Autoscaler service instance by running:

    cf create-service app-autoscaler PLAN-NAME SERVICE-INSTANCE-NAME
    

    Where:

    • PLAN-NAME is the name of the service plan you want to use for the Autoscaler service instance.
    • SERVICE-INSTANCE-NAME is the name you want to give the Autoscaler service instance. For example, autoscaler.

      If there is already an Autoscaler service instance in the space in which the app you want to scale is deployed, skip this step.
  3. Bind the Autoscaler service instance you created in the previous step to the app you want to scale by running:

    cf bind-service APP-NAME SERVICE-INSTANCE-NAME
    

    Where:

    • APP-NAME is the name of the app you want to scale.
    • SERVICE-INSTANCE-NAME is the name of the Autoscaler service instance in the previous step.
  4. Configure upper and lower scaling limits for the app by running:

    cf update-autoscaling-limits APP-NAME LOWER-SCALING-LIMIT UPPER-SCALING-LIMIT
    

    Where:

    • APP-NAME is the name of the app.
    • LOWER-SCALING-LIMIT is the minimum number of instances you want Autoscaler to create for the app.
    • UPPER-SCALING-LIMIT is the maximum number of instances you want Autoscaler to create for the app.
  5. Allow Autoscaler to begin making scaling decisions for the app by running:

    cf enable-autoscaling APP-NAME
    

    Where APP-NAME is the name of the app.

  6. Create an http_latency autoscaling rule by running:

    cf create-autoscaling-rule APP-NAME http_latency MINIMUM-LATENCY-THRESHOLD MAXIMUM-LATENCY-THRESHOLD --subtype PERCENTILE
    

    Where:

    • APP-NAME is the name of the app for which you want to create an autoscaling rule.
    • MINIMUM-LATENCY-THRESHOLD is the minimum HTTP latency threshold in milliseconds. If the average latency of HTTP requests falls below this number, Autoscaler scales the number of app instances down.
    • MAXIMUM-LATENCY-THRESHOLD is the maximum HTTP latency threshold in milliseconds. If the average latency of HTTP requests rises above this number, Autoscaler scales the number of app instances up. To avoid excessive cycling, VMware recommends that you configure a maximum threshold that is at least twice the value of the minimum threshold.
    • PERCENTILE is the percentile that Autoscaler uses in scaling decisions. Valid values are avg_95th or avg_99th. This value configures Autoscaler to ignore HTTP requests that fall outside either the 95th or 99th percentile and average the latency of the remaining 95% or 99% of HTTP requests.

      The following example command configures an http_latency autoscaling rule for the example-app app, with a minimum HTTP latency threshold of 125 milliseconds, a maximum HTTP latency threshold of 250 milliseconds, and a percentile of 95%:
    cf create-autoscaling-rule example-app http_latency 125 250 --subtype avg_95th
    

Configure HTTP Latency as the Scaling Metric for an App Through Apps Manager

To configure Autoscaler to use HTTP latency as the scaling metric for an app through Apps Manager:

  1. Log in to Apps Manager. For more information, see Logging In to Apps Manager.

  2. Select the org that contains the space in which the app you want to scale is deployed.

  3. Select the space in which the app you want to scale is deployed.

  4. Under Under Processes and Instances, click Manage Autoscaling. The Manage Autoscaling window appears.

  5. Next to Scaling Rules, click Edit. The Edit Scaling Rules window appears.

  6. Click Add rule. The Select type dropdown appears.

  7. From the Select type dropdown, select HTTP Latency. More configuration fields appear below.

  8. For Scale down if less than, enter in milliseconds the minimum HTTP latency threshold you want to configure. If the average latency of HTTP requests falls below this number, Autoscaler scales the number of app instances down.

  9. For Scale up if more than, enter in milliseconds the maximum HTTP latency threshold you want to configure. If the average latency of HTTP requests rises above this number, Autoscaler scales the number of app instances up. To avoid excessive cycling, VMware recommends that you configure a maximum threshold that is at least twice the value of the minimum threshold.

  10. Under Percent of traffic to apply, select either 95% or 99%. This configuration setting is the percentile that Autoscaler uses in scaling decisions. Depending on which option you select, Autoscaler ignores HTTP requests that fall outside either the 95th or 99th percentile and averages the latency of the remaining 95% or 99% of HTTP requests.

  11. Click Save.

Reviewing Autoscaling Events for Changes in HTTP Latency

When Autoscaler scales the number of app instances up after the HTTP latency metric increases above the maximum HTTP latency threshold, Autoscaler records an autoscaling event.

You can monitor the autoscaling events that Autoscaler records for changes in HTTP latency in the following ways:

Review Autoscaling Events for Changes in HTTP Latency Through the cf CLI

To review the autoscaling events that Autoscaler records for changes in HTTP latency through the cf CLI:

  1. In a terminal window, run:

    cf autoscaling-events APP-NAME
    

    Where APP-NAME is the name of the app for which you want to review autoscaling events.

    If Autoscaler has scaled the number of app instances up due to increases in the HTTP latency metric, the above command returns output that contains autoscaling events similar to the following example:

    Time                   Description
    2022-05-23T21:47:45Z   Scaled up from 10 to 11 instances. Current HTTP Latency of 1010.96ms is above upper threshold of 250.00ms.
    

Review Autoscaling Events for Changes in HTTP Latency Through Apps Manager

To review the autoscaling events that Autoscaler records for changes in HTTP latency through Apps Manager:

  1. Log in to Apps Manager. For more information, see Logging In to Apps Manager.

  2. Select the org that contains the space in which the app you want to scale is deployed.

  3. Select the space in which the app you want to scale is deployed.

  4. Under Under Processes and Instances, click Manage Autoscaling.

  5. Under Event History, click View More. A list of autoscaling events appears. If Autoscaler has scaled the number of app instances up due to increases in the HTTP latency metric, the list of autoscaling events includes events similar to the following example:

    Scaled up from 10 to 11 instances. Current HTTP Latency of 1010.96ms is above upper threshold of 250.00ms.
    

Special Considerations for Using HTTP Latency as a Scaling Metric

This section describes use cases that may complicate or prevent you from configuring HTTP latency as the scaling metric for an app. For more information, see the sections below:

Multiple Endpoints

In an app that exposes multiple endpoints, the value of the HTTP latency metric is the average HTTP latency across all app endpoints. If one or more endpoints process requests at a slower rate than the others, HTTP latency might not be an ideal scaling metric to use. Even a fast endpoint can cause the average HTTP latency to increase if it receives a large number of requests.

External Factors

Components or services that receive data from an app are known as downstream components. If any downstream components respond slowly to requests from an app, they may cause HTTP latency to increase. In this case, scaling the app up does not improve its performance. In fact, scaling the app up might increase HTTP latency, because requests from the additional app instances add a greater burden on the downstream component. The downstream component must be scaled up or improved before scaling the app up can improve its performance.

Other external factors, such as network congestion or database performance, can also cause HTTP latency to increase. In this case, scaling the app does not decrease HTTP latency that results from these external factors.

Container-to-Container Networking

Autoscaler can only use HTTP latency as a scaling metric for apps that receive requests directly through the Gorouter. Autoscaler does not currently support using HTTP latency as a scaling metric for apps that receive requests from other apps through container-to-container (C2C) networking or TCP routers.

If your app relies on back-end HTTP services that apps in your TAS for VMs deployment must access through C2C networking, the Gorouter does not generate HTTP events for those requests. As a result, Autoscaler cannot scale those HTTP services. In order for Autoscaler to scale them, you must either use a different default scaling metric or create a custom scaling metric for them.

Log Cache Eviction

Autoscaler retrieves HTTP metrics from Log Cache, which can hold a maximum of 100,000 envelopes per app by default. If your app receives a large number of HTTP requests or is configured to create very verbose logs, Log Cache might drop some of the timer envelopes that it holds. If Autoscaler can only retrieve some of the total timer envelopes it needs to calculate accurate metrics, then the HTTP latency metric may inaccurately represent the actual HTTP latency of the app or causes of decreased app performance. However, in most cases, the HTTP latency metric still approximates the actual HTTP latency of the app.

For more information, see Log Cache in Operating App Autoscaler.

Infrequent Requests

If an app receives requests infrequently and responds slowly, Autoscaler might continue scaling the app up because there are no other HTTP latency metrics to restore the average. In this case, Autoscaler usually stops scaling the app up after the original request falls outside of the metric collection interval.

For more information about how Autoscaler’s metric collection interval affects its scaling decisions, see How App Autoscaler Decides When to Scale in About App Autoscaler.