Managing Diego Cell Limits During Upgrade
Cloud Foundry supports rolling upgrades in high availability environments. A rolling upgrade means that you can continue to operate an existing Cloud Foundry deployment and its running app instances while upgrading the platform.
Note: Rolling upgrades are available only if you have configured your deployment to be highly available. See High Availability in Cloud Foundry.
To upgrade Cloud Foundry, BOSH must drain all Diego cell VMs that host app instances. BOSH manages this process by upgrading a batch of cells at a time.
The number of cells that undergo upgrade simultaneously (either shutting down or coming back online) is controlled by the max_in_flight value of the Diego cell job. For example, if max_in_flight is set to 10% and your deployment has 20 Diego cell job instances, then the maximum number of cells that BOSH can upgrade at a single time is 2.
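The arithmetic behind max_in_flight can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and rounding down with a floor of one instance is an assumption of the sketch, not confirmed BOSH behavior.

```python
import math


def cells_in_flight(max_in_flight, total_cells):
    """Return how many cells may be upgraded at once.

    max_in_flight is either an integer count (2 or "2") or a percentage
    string ("10%"). Rounding down and allowing at least one instance are
    assumptions of this sketch.
    """
    text = str(max_in_flight).strip()
    if text.endswith("%"):
        count = int(text[:-1]) * total_cells // 100
    else:
        count = int(text)
    # Never exceed the number of cells, and always allow at least one.
    return max(1, min(count, total_cells))


def upgrade_batches(max_in_flight, total_cells):
    """Approximate number of batches needed to upgrade every cell."""
    return math.ceil(total_cells / cells_in_flight(max_in_flight, total_cells))


print(cells_in_flight("10%", 20))  # 2
print(upgrade_batches("10%", 20))  # 10
```

With 20 cells and a 10% setting, the sketch yields 2 cells per batch and roughly 10 batches for the whole upgrade.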
When BOSH triggers an upgrade, each Diego cell undergoing upgrade enters “evacuation” mode. Evacuation mode means that the cell stops accepting new work and signals the rest of the Diego system to schedule replacements for its app instances. This scheduling is managed by the Diego auctioneer process.
The evacuating cells continue to interact with the Diego system as replacements come online. A cell undergoing upgrade waits until either its app instance replacements are running successfully or the evacuation process times out, and only then shuts down its original local instances. This “evacuation timeout” defaults to 10 minutes.
If cell evacuation exceeds this timeout, then the cell stops its app instances and shuts down. The Diego system continues to re-emit start requests for the app replacements.
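The evacuation rule described above can be modeled as a small decision function. This is a simplified sketch, not Diego's implementation: the function name, the input shape, and the instance names are hypothetical, and only the 10-minute default comes from the text.

```python
EVACUATION_TIMEOUT_SECONDS = 10 * 60  # default stated above: 10 minutes


def evacuate(replacement_start_times, timeout=EVACUATION_TIMEOUT_SECONDS):
    """Model the evacuation of one cell.

    replacement_start_times maps an app instance name to the number of
    seconds after evacuation begins at which its replacement is running,
    or None if the replacement never starts.

    Returns (seconds until the cell shuts down,
             instances stopped without a running replacement).
    """
    unreplaced = sorted(
        name
        for name, started_at in replacement_start_times.items()
        if started_at is None or started_at > timeout
    )
    if unreplaced:
        # Timeout reached: the cell stops its remaining instances and shuts down.
        return timeout, unreplaced
    # Every replacement is running, so the cell can shut down immediately.
    return max(replacement_start_times.values(), default=0), []


print(evacuate({"web/0": 45, "web/1": 90}))       # (90, [])
print(evacuate({"web/0": 45, "worker/0": None}))  # (600, ['worker/0'])
```

In the second call, one replacement never starts, so the modeled cell holds its instances for the full timeout before shutting down.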
A potential issue arises if too many app instance replacements are slow to start or do not start successfully at all.
If too many app instances are starting concurrently, then the load of these starts on the rest of the system can cause other applications that are already running to crash and be rescheduled. These events can result in a cascading failure.
To prevent overload, Cloud Foundry provides two major throttle configurations:
The maximum number of starting containers that Diego can start in Cloud Foundry: This is a deployment-wide limit. The default value and the ability to override this configuration depend on the version of Cloud Foundry deployed. For information about how to configure this setting, see the Setting a Maximum Number of Starting Containers topic.
max_in_flight setting for the Diego cell job configured in the BOSH manifest: This configuration, expressed as a percentage or an integer, sets the maximum number of job instances that can be upgraded simultaneously. For example, if your deployment is running 10 Diego cell job instances and the configured max_in_flight value is 20%, then only 2 Diego cell job instances can start up at a single time.
To retrieve or override the existing max_in_flight value in Ops Manager Director, use the Ops Manager API. See the Ops Manager API documentation.
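To see how the two throttles bound the start load together, consider a rough estimate for a single upgrade batch. The function and its parameters are hypothetical, and the sketch assumes app instances are spread evenly across cells, which real deployments only approximate.

```python
def start_load(cells_in_flight, instances_per_cell, max_starting_containers=None):
    """Estimate replacement starts triggered by one upgrade batch.

    cells_in_flight: cells evacuating at once (bounded by max_in_flight).
    instances_per_cell: assumed average app instances per cell.
    max_starting_containers: deployment-wide limit, or None for "No limit set".

    Returns (replacements needed, starts allowed at once, starts left waiting).
    """
    needed = cells_in_flight * instances_per_cell
    if max_starting_containers is None:
        concurrent = needed
    else:
        concurrent = min(needed, max_starting_containers)
    return needed, concurrent, needed - concurrent


# Two cells in flight with about 150 instances each.
print(start_load(2, 150, max_starting_containers=200))  # (300, 200, 100)
print(start_load(2, 150))                               # (300, 300, 0)
```

Under these assumptions, lowering max_in_flight reduces how many replacements are requested at once, while the starting container limit caps how many of them start concurrently.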
The values of the above throttle configurations depend on the version of PCF that you have deployed and whether you have overridden the default values.
Refer to the following table for existing defaults and, if necessary, determine the override values in your deployment.
| PCF Version | Starting Container Count Maximum | Starting Container Count Overridable? | Maximum In Flight Diego Cell Instances | Maximum In Flight Diego Cell Instances Overridable? |
|---|---|---|---|---|
| PCF 1.7.43 and earlier | No limit set | No | 1 instance | No |
| PCF 1.7.44 to 1.7.49 | 200 | No | 1 instance | No |
| PCF 1.7.50+ | 200 | No | 1 instance | No |
| PCF 1.8.0 to 1.8.29 | No limit set | No | 10% of total instances | No |
| PCF 1.8.30+ | 200 | Yes | 10% of total instances | No |
| PCF 1.9.0 to 1.9.7 | No limit set | No | 4% of total instances | Yes |
| PCF 1.9.8+ | 200 | Yes | 4% of total instances | Yes |
| PCF 1.10.0 and later | 200 | Yes | 4% of total instances | Yes |
| PCF 1.12.0 and later | 200 | Yes | 4% of total instances | Yes |