Configuring PAS for Upgrades

This topic describes several configuration options for Pivotal Application Service (PAS) that can help ensure successful upgrades. In addition to following the Upgrade Preparation Checklist, review the sections in this document to better understand how to prepare for PAS upgrades.

Limit PCF Component Instance Restarts

The max_in_flight variable limits how many instances of a component can restart simultaneously during updates or upgrades. Increasing the value of max_in_flight can make updates run faster, but setting it too high risks overloading VMs and causing failure. See Best Practices for guidance on setting max_in_flight values.

Values for max_in_flight can be any integer between 1 and 100, or a percentage of the total number of instances. For example, a max_in_flight value of 20% in a deployment with 10 Diego cell instances would make no more than two cell instances restart at once.
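The arithmetic behind a percentage value can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the rounding rule (round down, never below one instance) is an assumption, since this topic does not state how BOSH rounds fractional results.

```python
def max_in_flight_count(value, total_instances):
    """Resolve a max_in_flight setting to a number of instances.

    value is either an integer or a percentage string such as "20%".
    Assumption: percentages round down, with a floor of one instance.
    """
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        count = total_instances * percent // 100
    else:
        count = int(value)
    # At least one instance must be allowed to restart, and the limit
    # cannot exceed the number of instances that exist.
    return max(1, min(count, total_instances))


print(max_in_flight_count("20%", 10))  # 2
print(max_in_flight_count("10%", 20))  # 2
```

Both calls reproduce the examples in this topic: 20% of 10 Diego cells and 10% of 20 Diego cells each allow two cells to restart at once.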

Set max_in_flight

The max_in_flight variable is a system-wide value with optional component-specific overrides. You can override the default value for individual jobs using an API endpoint.

Use the max_in_flight API Endpoint

Use the max_in_flight API endpoint to configure the maximum number of component instances that can restart at a given time. This endpoint overrides product defaults. You can specify values as a percentage or an integer.

Use the string “default” as the max_in_flight value to force the component to use the deployment’s default value.

Note: The example below lists three JOB_GUIDs, each paired with one of the three types of value that max_in_flight accepts: an integer, a percentage, and the string “default”. The endpoint requires only one GUID.

    curl "https://EXAMPLE.com/api/v0/staged/products/PRODUCT-TYPE1-GUID/max_in_flight" \
    -X PUT \
    -H "Authorization: Bearer UAA_ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
          "max_in_flight": {
            "JOB_1_GUID": 1,
            "JOB_2_GUID": "20%",
            "JOB_3_GUID": "default"
          }
        }'
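If you script this call rather than typing it, it may help to validate the values before sending them. The following Python sketch builds the same PUT request as the curl example without sending it. The function name is hypothetical, the placeholders are the ones used above, and the validation rules simply mirror the value types described in this topic (an integer from 1 to 100, a percentage, or "default").

```python
import json
import urllib.request


def build_max_in_flight_request(base_url, product_guid, token, overrides):
    """Build, but do not send, the PUT request from the curl example."""
    for job_guid, value in overrides.items():
        is_int = (isinstance(value, int) and not isinstance(value, bool)
                  and 1 <= value <= 100)
        is_percent = (isinstance(value, str) and value.endswith("%")
                      and value[:-1].isdigit())
        if not (is_int or is_percent or value == "default"):
            raise ValueError(
                f"unsupported max_in_flight value for {job_guid}: {value!r}")
    body = json.dumps({"max_in_flight": overrides}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v0/staged/products/{product_guid}/max_in_flight",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values copied from the curl example above.
request = build_max_in_flight_request(
    "https://EXAMPLE.com", "PRODUCT-TYPE1-GUID", "UAA_ACCESS_TOKEN",
    {"JOB_1_GUID": 1, "JOB_2_GUID": "20%", "JOB_3_GUID": "default"},
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen once the placeholders are replaced with real values.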

Specific Guidance for Diego Cells

To upgrade Cloud Foundry, BOSH must drain all Diego cell VMs that host app instances. BOSH manages this process by upgrading a batch of cells at a time.

The number of cells that undergo upgrade simultaneously (either in a state of shutting down or coming back online) is controlled by the max_in_flight value of the Diego cell job. For example, if max_in_flight is set to 10% and your deployment has 20 Diego cell job instances, then the maximum number of cells that BOSH can upgrade at a single time is 2.

When BOSH triggers an upgrade, each Diego cell undergoing upgrade enters “evacuation” mode. Evacuation mode means that the cell stops accepting new work and signals the rest of the Diego system to schedule replacements for its app instances. This scheduling is managed by the Diego auctioneer process.

The evacuating cells continue to interact with the Diego system as replacements come online. A cell undergoing upgrade shuts down its original local app instances only after their replacements run successfully or after the evacuation process times out. This “evacuation timeout” defaults to 10 minutes.

If cell evacuation exceeds this timeout, then the cell stops its app instances and shuts down. The Diego system continues to re-emit start requests for the app replacements.
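To get a feel for how max_in_flight and the evacuation timeout interact, you can estimate an upper bound on the time spent evacuating cells. The Python sketch below is a rough model, not a description of how BOSH schedules work: it assumes every cell uses its full timeout, and it ignores the time needed to recreate and restart each VM. The function name is hypothetical.

```python
import math

EVACUATION_TIMEOUT_MINUTES = 10  # default evacuation timeout stated above


def worst_case_evacuation_minutes(cell_count, cells_in_flight,
                                  timeout_minutes=EVACUATION_TIMEOUT_MINUTES):
    """Upper bound on evacuation time if every cell hits the timeout.

    Assumption: at most cells_in_flight cells evacuate at once, so the
    cells are worked through in ceil(cell_count / cells_in_flight) rounds.
    """
    rounds = math.ceil(cell_count / cells_in_flight)
    return rounds * timeout_minutes


# 20 Diego cells with max_in_flight of 10% means 2 cells at a time.
print(worst_case_evacuation_minutes(20, 2))  # 100
```

In practice most cells finish evacuating well before the timeout, so treat the result as a ceiling for planning rather than a prediction.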

Prevent Overload

A potential issue arises if too many app instance replacements are slow to start or do not start successfully at all.

If too many app instances are starting concurrently, then the load of these starts on the rest of the system can cause other applications that are already running to crash and be rescheduled. These events can result in a cascading failure.

To prevent this issue, PCF provides two throttle configurations: the maximum number of in-flight Diego cell instances and the maximum number of starting containers.

The values of the above throttle configurations depend on the version of PCF that you have deployed and whether you have overridden the default values.

Refer to the following table for existing defaults and, if necessary, determine the override values in your deployment.

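| PCF Version            | Starting Container Count Maximum | Starting Container Count Overridable? | Maximum In Flight Diego Cell Instances | Maximum In Flight Diego Cell Instances Overridable? |
|------------------------|----------------------------------|---------------------------------------|----------------------------------------|-----------------------------------------------------|
| PCF 1.7.43 and earlier | No limit set                     | No                                    | 1 instance                             | No                                                  |
| PCF 1.7.44 to 1.7.49   | 200                              | No                                    | 1 instance                             | No                                                  |
| PCF 1.7.50+            | 200                              | No                                    | 1 instance                             | No                                                  |
| PCF 1.8.0 to 1.8.29    | No limit set                     | No                                    | 10% of total instances                 | No                                                  |
| PCF 1.8.30+            | 200                              | Yes                                   | 10% of total instances                 | No                                                  |
| PCF 1.9.0 to 1.9.7     | No limit set                     | No                                    | 4% of total instances                  | Yes                                                 |
| PCF 1.9.8+             | 200                              | Yes                                   | 4% of total instances                  | Yes                                                 |
| PCF 1.10.0 and later   | 200                              | Yes                                   | 4% of total instances                  | Yes                                                 |
| PCF 1.12.0 and later   | 200                              | Yes                                   | 4% of total instances                  | Yes                                                 |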

Best Practices

Set the max_in_flight variable low enough that the remaining component instances are not overloaded by typical use. If component instances are overloaded during updates, upgrades, or typical use, users may experience downtime.

Some more precise guidelines include:

  • For jobs with high resource usage, set max_in_flight low. For example, for Diego cells, a low max_in_flight value leaves enough non-migrating cells to pick up the work of cells that stop and restart during migration. If resource usage is already close to 100%, scale up your jobs before making any updates.
  • For quorum-based components (these are components with odd-numbered settings in the manifest), such as etcd, consul, and Diego BBS, set max_in_flight to 1. This preserves quorum and prevents a split-brain scenario from occurring as jobs restart.
  • For other components, set max_in_flight to the number of instances that you can afford to have down at any one time. The best values for your deployment vary based on your capacity planning. In a highly redundant deployment, you can make the number high so that updates run faster. If your components are at high utilization, however, you should keep the number low to prevent downtime.
  • Never set max_in_flight to a value greater than or equal to the number of instances you have running for a component.
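The guidelines above can be turned into a simple pre-upgrade check. The following Python sketch is one possible way to do that, not a Pivotal-provided tool. The function and the job names in the usage lines are hypothetical, and exempting single-instance jobs from the last rule is an assumption, because such a job cannot have a max_in_flight value below its instance count.

```python
def check_max_in_flight(job_name, instances, max_in_flight, quorum_based=False):
    """Return a list of warnings for one job; an empty list means no issues.

    max_in_flight is the resolved number of instances, not a percentage.
    """
    warnings = []
    if quorum_based and max_in_flight != 1:
        warnings.append(
            f"{job_name}: quorum-based components should use max_in_flight 1")
    # Assumption: a single-instance job is exempt, since 1 is the only
    # possible value for it.
    if instances > 1 and max_in_flight >= instances:
        warnings.append(
            f"{job_name}: max_in_flight ({max_in_flight}) must be lower "
            f"than the instance count ({instances})")
    return warnings


# Hypothetical job names and sizes, for illustration only.
print(check_max_in_flight("diego_cell", 20, 2))
print(check_max_in_flight("consul_server", 3, 2, quorum_based=True))
print(check_max_in_flight("router", 3, 3))
```

The first call returns no warnings, while the second and third each return one.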

Set a Maximum Number of Starting Containers

This section describes how to use the auctioneer job to configure the maximum number of app instances starting at a given time. This prevents Diego from scheduling too much new work for your platform to handle concurrently. A lower default can prevent server overload during cold start, which may be important if your infrastructure is not sized for a large number of concurrent cold starts.

The auctioneer only schedules a fixed number of app instances to start concurrently. This limit applies to both single and multiple Diego Cells. For example, if you set the limit to five starting instances, it does not matter if you have one Diego Cell with ten instances or five Diego Cells with two instances each. The auctioneer will not allow more than five instances to start at the same time.
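The example in the previous paragraph can be modeled with a short Python sketch. This is a simplified illustration and not the auctioneer's actual algorithm: it groups pending starts into fixed waves, whereas the real auctioneer admits new starts as earlier ones finish. The function name and the cell names are hypothetical.

```python
def start_waves(pending_per_cell, max_starting):
    """Group pending instance starts into waves no larger than max_starting.

    pending_per_cell maps a cell name to the number of instances waiting
    to start on it. The limit applies across all cells combined.
    """
    pending = [
        (cell, index)
        for cell, count in pending_per_cell.items()
        for index in range(count)
    ]
    return [pending[i:i + max_starting]
            for i in range(0, len(pending), max_starting)]


# One cell with ten instances versus five cells with two instances each.
one_cell = start_waves({"cell-0": 10}, max_starting=5)
five_cells = start_waves({f"cell-{n}": 2 for n in range(5)}, max_starting=5)
print([len(wave) for wave in one_cell])    # [5, 5]
print([len(wave) for wave in five_cells])  # [5, 5]
```

In both layouts no more than five instances start at the same time, which is the behavior the limit is meant to guarantee.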

If you are using a cloud-based IaaS, rather than a smaller on-premises solution, Pivotal recommends setting a larger default. By default, the maximum number of concurrently starting instances is 200.

You can configure the maximum number of starting instances in the Settings tab of the Pivotal Application Service (PAS) tile.

  1. Log in to Operations Manager.
  2. Click the PAS tile.
  3. Click Application instances in the sidebar.
  4. In the Max Inflight Container Starts field, enter the maximum number of instances that can start concurrently.
  5. Click Save.

Configure File Storage

This section describes critical factors to consider when evaluating the type of file storage to use in your Pivotal Cloud Foundry (PCF) deployment. The Pivotal Application Service (PAS) blobstore relies on the file storage system to read and write resources, app packages, and droplets.

During an upgrade of PCF, file storage with insufficient IOPS can negatively impact the performance and stability of your PCF deployment.

If disk processing takes longer than the evacuation timeout for Diego cells, then Diego cells and app instances may take too long to start up, resulting in a cascading failure.

However, the minimum required IOPS depends on a number of deployment-specific factors and configuration choices. Use this section as a guide when deciding on the file storage configuration for your deployment.

To see an example of system performance and IOPS load during an upgrade, refer to Upgrade Load Example: Pivotal Web Services.

Select Internal or External File Storage

When you deploy PCF, you can select internal file storage or external file storage, either network-accessible or IaaS-provided, as an option in the PAS tile.

Selecting internal storage causes PCF to deploy a dedicated virtual machine (VM) that uses either NFS or WebDAV for file storage. Selecting external storage allows you to configure file storage provided in a network-accessible location or by an IaaS, such as Amazon S3, Google Cloud Storage, or Azure Storage.

Whenever possible, Pivotal recommends using external file storage.

Calculate Potential Disk Load Requirements

As a best-effort calculation, estimate the total number of bits that must move during a system upgrade to determine how IOPS-performant your file storage needs to be.

  • Number of Diego Cells
    • As a first calculation, determine the number of Diego cells that your deployment currently uses. To view the number of Diego cell instances currently running in your deployment, see the Resource Config section of your PAS tile. If you expect to scale up the number of instances, use the anticipated scaled number.

      Note: If your deployment uses more than 20 Diego cells, you should avoid using internal file storage. Instead, you should always select external or IaaS-provided file storage.

  • Maximum In-Flight Load and Container Starts for Diego Cells
    • Operators can limit the number of containers and Diego cell instances that Diego starts concurrently. If operators impose no limits, your file storage may experience exceptionally heavy load during an upgrade.