LATEST VERSION: 1.10 - CHANGELOG
Pivotal Cloud Foundry v1.10

Managing Diego Cell Limits During Upgrade

Cloud Foundry supports rolling upgrades in high availability environments. A rolling upgrade means that you can continue to operate an existing Cloud Foundry deployment and its running app instances while upgrading the platform.

Note: Rolling upgrade is not available in your deployment if you have not configured your deployment to be highly available. See High Availability in Cloud Foundry.

Diego Cell Upgrade Process

To upgrade Cloud Foundry, BOSH must drain all Diego cell VMs that host app instances. BOSH manages this process by upgrading a batch of cells at a time.

The number of cells that undergo upgrade simultaneously (either in a state of shutting down or coming back online) is controlled by the max_in_flight value of the Diego cell job. For example, if max_in_flight is set to 10% and your deployment has 20 Diego cell job instances, then the maximum number of cells that BOSH can upgrade at a single time is 2.

When BOSH triggers an upgrade, each Diego cell undergoing upgrade enters “evacuation” mode. Evacuation mode means that the cell stops accepting new work and signals the rest of the Diego system to schedule replacements for its app instances. This scheduling is managed by the Diego auctioneer process.

The evacuating cells continue to interact with the Diego system as replacements come online. The cell undergoing upgrade waits until either its app instance replacements run successfully before shutting down the original local instances, or for the evacuation process to time out. This “evacuation timeout” defaults to 10 minutes.

If cell evacuation exceeds this timeout, then the cell stops its app instances and shuts down. The Diego system continues to re-emit start requests for the app replacements.

Preventing Overload

A potential issue arises if too many app instance replacements are slow to start or do not start successfully at all.

If too many app instances are started at once, then the load of these starts on the rest of the system can cause other applications that are already running to crash and be rescheduled. These events can result in a cascading failure.

To prevent overload, Cloud Foundry provides two major throttle configurations:

  • The maximum number of starting containers that Diego can start in Cloud Foundry: This is a deployment-wide limit. The default value and ability to override this configuration depends on the version of Cloud Foundry deployed. For information about how to configure this setting, see the Setting a Maximum Number of Started Containers topic.

  • The max_in_flight setting for the Diego cell job configured in the BOSH manifest: This configuration, expressed as a percentage or an integer, sets the maximum number of job instances that can be upgraded simultaneously. For example, if your deployment is running 10 Diego cell job instances and the configured max_in_flight value is 20%, then only 2 Diego cell job instances can start up at a single time.

    To retrieve or override the existing max_in_flight value in Ops Manager Director, use the Ops Manager API. See the Ops Manager API documentation provided with your Ops Manager installation at https://YOUR-OPSMAN-FQDN/docs/.

The values of the above throttle configurations depend on the version of PCF that you have deployed and whether you have overridden the default values.

Refer to the following table for existing defaults and, if necessary, determine the override values in your deployment.

PCF Version Starting Container Count Maximum Starting Container Count Overridable? Maximum In Flight Diego Cell Instances Maximum In Flight Diego Cell Instances Overridable?
PCF 1.7.43 and earlier No limit set No 1 instance No
PCF 1.7.44 to 1.7.49 200 No 1 instance No
PCF 1.7.50 + 200 No 1 instance No
PCF 1.8.0 to 1.8.29 No limit set No 10% of total instances No
PCF 1.8.30 + 200 Yes 10% of total instances No
PCF 1.9.0 to 1.9.7 No limit set No 10% of total instances Yes
PCF 1.9.8 + 200 Yes 10% of total instances Yes
PCF 1.10.0 and later 200 Yes 10% of total instances Yes
Create a pull request or raise an issue on the source for this page in GitHub