Configuring PAS for Upgrades

Page last updated:

This topic describes several configuration options for Pivotal Application Service (PAS) that can help ensure successful upgrades. In addition to following the Upgrade Preparation Checklist, review this topic to better understand how to prepare for PAS upgrades.

Limit Pivotal Platform Component Instance Restarts

The max_in_flight variable limits how many instances of a component can restart simultaneously during updates or upgrades. Increasing the value of max_in_flight can make updates run faster, but setting it too high risks overloading VMs and causing failure. For guidance on setting max_in_flight values, see Best Practices.

Values for max_in_flight can be any integer between 1 and 100, or a percentage of the total number of instances. For example, a max_in_flight value of 20% in a deployment with 10 Diego Cell instances would make no more than two Diego Cell instances restart at once.

Set max_in_flight

The max_in_flight variable is a system-wide value with optional component-specific overrides. You can override the default value for individual jobs using an API endpoint.

Use the max_in_flight API Endpoint

Use the max_in_flight API endpoint to configure the maximum value for component instances that can start at a given time. This endpoint overrides product defaults. You can specify values as a percentage or an integer.

Use the string default as the max_in_flight value to force the component to use the deployment’s default value.

Note: The example below lists three JOB_GUIDs. These three GUIDs are examples of the three different types of values you can use to configure max_in_flight. The endpoint only requires one GUID.

    curl "" \
    -X PUT \
    -H "Authorization: Bearer UAA_ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
          "max_in_flight": {
            "JOB_1_GUID": 1,
            "JOB_2_GUID": "20%",
            "JOB_3_GUID": "default"

Specific Guidance for Diego Cells

To upgrade Pivotal Platform, BOSH must drain all Diego Cell VMs that host app instances. BOSH manages this process by upgrading a batch of Diego Cells at a time.

The number of Diego Cells that undergo upgrade simultaneously (either in a state of shutting down or coming back online) is controlled by the max_in_flight value of the Diego Cell job. For example, if max_in_flight is set to 10% and your deployment has 20 Diego Cell job instances, then the maximum number of Diego Cells that BOSH can upgrade at a single time is 2.

When BOSH triggers an upgrade, each Diego Cell undergoing upgrade enters “evacuation” mode. Evacuation mode means that the Diego Cell stops accepting new work and signals the rest of the Diego system to schedule replacements for its app instances. This scheduling is managed by the Diego auctioneer process. For more information, see How Diego Balances App Processes.

The evacuating Diego Cells continue to interact with the Diego system as replacements come online. The Diego Cell undergoing upgrade waits until either its app instance replacements run successfully before shutting down the original local instances, or for the evacuation process to time out. This “evacuation timeout” defaults to 10 minutes.

If Diego Cell evacuation exceeds this timeout, then the Diego Cell stops its app instances and shuts down. The Diego system continues to re-emit start requests for the app replacements.

Prevent Overload

A potential issue arises if too many app instance replacements are slow to start or do not start successfully at all.

If too many app instances are starting concurrently, then the load of these starts on the rest of the system can cause other apps that are already running to crash and be rescheduled. These events can result in a cascading failure.

To prevent this issue, Pivotal Platform provides two throttle configurations: the maximum number of in-flight Diego Cell instances and the maximum number of starting containers.

The values of the above throttle configurations depend on the version of Pivotal Platform that you have deployed and whether you have overridden the default values.

See the following table for existing defaults and, if necessary, determine the override values in your deployment.

Pivotal Platform Version Starting Container Count Maximum Starting Container Count Overridable? Maximum In-Flight Diego Cell Instances Maximum In-Flight Diego Cell Instances Overridable?
Pivotal Platform v1.7.43 and earlier No limit set No 1 instance No
Pivotal Platform v1.7.44 to v1.7.49 200 No 1 instance No
Pivotal Platform v1.7.50 + 200 No 1 instance No
Pivotal Platform v1.8.0 to v1.8.29 No limit set No 10% of total instances No
Pivotal Platform v1.8.30 + 200 Yes 10% of total instances No
Pivotal Platform v1.9.0 to v1.9.7 No limit set No 4% of total instances Yes
Pivotal Platform v1.9.8 + 200 Yes 4% of total instances Yes
Pivotal Platform v1.10.0 and later 200 Yes 4% of total instances Yes
Pivotal Platform v1.12.0 and later 200 Yes 4% of total instances Yes

Best Practices

Set the max_in_flight variable high enough that the remaining component instances are not overloaded by typical use. If component instances are overloaded during updates, upgrades, or typical use, users may experience downtime.

Some more precise guidelines include:

  • For jobs with high resource usage, set max_in_flight low. For example, for Diego Cells, max_in_flight allows non-migrating Diego Cells to pick up the work of Diego Cells stopping and restarting during migration. If resource usage is already close to 100%, scale up your jobs before making any updates.

  • Quorum-based components are components with odd-numbered settings in the manifest. For quorum-based components such as etcd and Diego BBS, set max_in_flight to 1. This preserves quorum and prevents a split-brain scenario from occurring as jobs restart. For more information, about split-brain scenarios, see Split-brain (computing) on Wikipedia.

  • For other components, set max_in_flight to the number of instances that you can afford to have down at any one time. The best values for your deployment vary based on your capacity planning. In a highly redundant deployment, you can make the number high so that updates run faster. However, if your components are at high utilization, you should keep the number low to prevent downtime.

  • Never set max_in_flight to a value greater than or equal to the number of instances you have running for a component.

Set a Maximum Number of Starting Containers

This section describes how to use the Diego Auctioneer job to configure the maximum number of app instances starting at a given time. This prevents Diego from scheduling too much new work for your platform to handle concurrently. A lower default can prevent server overload during cold start, which may be important if your infrastructure is not sized for a large number of concurrent cold starts.

The Diego Auctioneer only schedules a fixed number of app instances to start concurrently. This limit applies to both single and multiple Diego Cells. For example, if you set the limit to five starting instances, it does not matter if you have one Diego Cell with ten instances or five Diego Cells with two instances each. The auctioneer does not allow more than five instances to start at the same time.

If you are using a cloud-based IaaS, rather than a smaller on-premise solution, Pivotal recommends setting a larger default. By default, the maximum number of started instances is 200.

To configure the maximum number of started instances in the Settings tab of the PAS tile:

  1. Log in to Ops Manager.

  2. Click the PAS tile.

  3. Select App Containers.

  4. In the Max-in-flight container starts field, enter the maximum number of started instances.

  5. Click Save.

Configure File Storage

This section describes critical factors to consider when evaluating the type of file storage to use in your Pivotal Platform deployment. The PAS blobstore relies on the file storage system to read and write resources, app packages, and droplets. For more information, see Blobstore in Cloud Controller.

During an upgrade of Pivotal Platform, file storage with insufficient IOPS numbers can negatively impact the performance and stability of your Pivotal Platform deployment.

If disk processing time takes longer than the evacuation timeout for Diego Cells, then Diego Cells and app instances may take too long to start up, resulting in a cascading failure.

However, the minimum required IOPS depends upon a number of deployment-specific factors and configuration choices. Use this section as a guide when deciding on the file storage configuration for your deployment.

For an example of system performance and IOPS load during an upgrade, see Upgrade Load Example: Pivotal Web Services.

Select Internal or External File Storage

When you deploy Pivotal Platform, you can select internal file storage or external file storage, either network-accessible or IaaS-provided, as an option in the PAS tile.

Selecting internal storage causes Pivotal Platform to deploy a dedicated VM that uses either NFS or WebDAV for file storage. Selecting external storage allows you to configure file storage provided in network-accessible location or by an IaaS, such as Amazon S3, Google Cloud Storage, or Azure Storage.

Whenever possible, Pivotal recommends using external file storage.

Calculate Potential Disk Load Requirements

As a best-effort calculation, estimate the total number of bits needed to move during a system upgrade to determine how IOPS-performant your file storage needs to be.

  • Number of Diego Cells: As a first calculation, determine the number of Diego Cells that your deployment currently uses. To view the number of Diego Cell instances currently running in your deployment, see the Resource Config pane of the PAS tile. If you expect to scale up the number of instances, use the anticipated scaled number.

    Note: If your deployment uses more than 20 Diego Cells, you should avoid using internal file storage. Instead, you should always select external or IaaS-provided file storage.

  • Maximum In-Flight Load and Container Starts for Diego Cells: Operators can limit the number of containers and Diego Cell instances that Diego starts concurrently. If operators impose no limits, your file storage may experience exceptionally heavy load during an upgrade.