Disaster Recovery in Pivotal Cloud Foundry
This document provides an overview of the options and considerations for disaster recovery in Pivotal Cloud Foundry (PCF).
Operators have a range of approaches for ensuring they can recover Pivotal Cloud Foundry, apps, and data in case of a disaster. The approaches fall into the following two categories:
- Using data from a backup to restore the data in the PCF Deployment. See Back up and Restore Using BOSH Backup and Restore (BBR) for more information.
- Recreating the data in PCF by automating the creation of state in PCF. See Disaster Recovery by Recreating the Deployment for more information.
BOSH Backup and Restore (BBR) is a CLI for orchestrating backing up and restoring BOSH deployments and BOSH Directors. BBR triggers the backup or restore process on the deployment or Director, and transfers the backup artifact to and from the deployment or Director.
Use BOSH Backup and Restore to reliably create backups of core PCF components and their data. See the BOSH Backup and Restore topic for more information about the framework.
Backing up PCF requires backing up the following components:
- Ops Manager settings
- BOSH Director, including CredHub and UAA
- Elastic Runtime
- Data services
For more information, see Backing up Pivotal Cloud Foundry with BBR. With these backup artifacts, operators can recreate PCF exactly as it was when the backup was taken.
The restore process involves creating a new PCF deployment starting with the Ops Manager VM. For more information, see Restoring Pivotal Cloud Foundry from Backup with BBR.
The time required to restore the data is proportionate to the size of the data because the restore process includes copying data. For example, restoring a 1 TB blobstore takes one thousand times as long as restoring a 1 GB blobstore.
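The linear relationship above can be turned into a rough planning estimate. The following is a minimal sketch, assuming a constant transfer rate and decimal units; the 100 MB/s throughput figure is a hypothetical placeholder, not a measured PCF value, and real restores include fixed overhead that this ignores.

```python
def estimate_restore_seconds(data_bytes: int, throughput_bytes_per_sec: float) -> float:
    """Estimate the copy time of a restore, assuming a constant transfer rate."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return data_bytes / throughput_bytes_per_sec


GB = 10**9
TB = 10**12
ASSUMED_THROUGHPUT = 100 * 10**6  # hypothetical 100 MB/s; measure your own environment

small = estimate_restore_seconds(1 * GB, ASSUMED_THROUGHPUT)
large = estimate_restore_seconds(1 * TB, ASSUMED_THROUGHPUT)
print(f"1 GB: {small:.0f} s, 1 TB: {large:.0f} s, ratio: {large / small:.0f}x")
```

Whatever throughput you assume, the ratio between the two estimates stays at one thousand, which is the point of the example in the text.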
Unlike other backup solutions, using BBR to back up PCF enables the following:
- Completeness: BBR supports backing up BOSH, including releases, CredHub, UAA, and service instances created with an on-demand service broker. With PCF v1.12, Ops Manager export no longer includes releases.
- Consistency: BBR provides referential integrity between the database and the blobstore because a lock is held while both the database and blobstore are backed up.
- Correctness: Using the BBR restore flow addresses C2C and routing issues that can occur during restore.
Apps are not affected during backups, but certain APIs are unavailable. The downtime occurs only while the backup is being taken, not while the backup is being copied to the jumpbox.
In a consistent backup, the blobs in the blobstore match the blobs in the Cloud Controller Database. To take a consistent backup, changes to the data are prevented during the backup. This means that the CF API, Routing API, Usage Service, Autoscaler, Notification Service, Network Policy Server, and CredHub are unavailable while the backup is being taken. UAA is in read-only mode during the backup.
Blobstores can be very large. To minimize downtime, only metadata about the blobs is taken during the backup. For example, in the case of internal blobstores (WebDAV/NFS), a list of hardlinks to the blobs is taken. After API access is restored, copies of the blobs are made.
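The two-phase idea described above (a fast hardlink snapshot while the API is locked, followed by a slow copy once the API is available again) can be illustrated with a small sketch. This is not how BBR is implemented; the function name and directory layout are hypothetical, and the sketch assumes the snapshot directory is on the same filesystem as the blobs, which hard links require.

```python
import os
import shutil
import tempfile


def snapshot_with_hardlinks(blob_dir: str, snapshot_dir: str) -> list[str]:
    """Hard-link every file under blob_dir into snapshot_dir; return relative paths."""
    linked = []
    for root, _dirs, files in os.walk(blob_dir):
        rel_root = os.path.relpath(root, blob_dir)
        target_root = os.path.join(snapshot_dir, rel_root)
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            # A hard link adds a second name for the same data; no bytes are copied.
            os.link(os.path.join(root, name), os.path.join(target_root, name))
            linked.append(os.path.normpath(os.path.join(rel_root, name)))
    return sorted(linked)


with tempfile.TemporaryDirectory() as work:
    blobs = os.path.join(work, "blobs")
    snapshot = os.path.join(work, "snapshot")
    backup = os.path.join(work, "backup")
    os.makedirs(os.path.join(blobs, "ab"))
    with open(os.path.join(blobs, "ab", "droplet"), "w") as f:
        f.write("blob contents")

    # Phase 1 (API unavailable): metadata-only snapshot, fast regardless of blob size.
    linked = snapshot_with_hardlinks(blobs, snapshot)

    # Phase 2 (API available again): the live blobstore may change...
    os.remove(os.path.join(blobs, "ab", "droplet"))
    # ...but the slow copy reads from the snapshot, which still holds the data.
    shutil.copytree(snapshot, backup)
    with open(os.path.join(backup, "ab", "droplet")) as f:
        restored = f.read()
```

Because the snapshot holds its own link to each blob, deleting the blob from the live directory after the lock is released does not affect what ends up in the backup.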
The following components and products do not yet support BBR:
- Data services: BBR is not yet supported in Pivotal’s flagship data services (MySQL, RabbitMQ, Redis, PCC). In the meantime, operators should use the automatic backups feature of each tile, available within Ops Manager.
- External blobstores: BBR only supports versioned S3-compatible external blobstores. Any other type of external blobstore is not supported by BBR, but BBR can be used to back up the rest of Elastic Runtime. Pivotal recommends that operators copy incompatible blobstores using IaaS tooling, in conjunction with backing up Elastic Runtime with BBR.
- External databases: BBR supports a defined list of external databases. For external databases not supported by BBR, Pivotal recommends that operators copy the database using IaaS tooling.
To address the limitations noted above, follow the guidelines below when using BBR to back up a PCF deployment in which Elastic Runtime is configured with an unsupported external blobstore or external database:
With Elastic Runtime configured with an internal database and an unsupported external blobstore, follow the PCF backup process using BBR and copy the external blobstore using your IaaS. Inconsistencies between the blobstore and database may result in apps failing to restart during the restore. You can push these apps again to restart them.
With Elastic Runtime configured with an unsupported external database and an unsupported external blobstore, follow the PCF backup process using BBR, but skip the backup of Elastic Runtime. Copy the external database and blobstore using your IaaS. Inconsistencies between the blobstore and database may result in apps failing to restart during the restore. You can push these apps again to restart them.
Pivotal recommends that you take backups in proportion to the rate of change of the data in PCF to minimize the number of changes lost if a restore is required. We suggest starting with backing up every 24 hours. If app developers make frequent changes, you should increase the frequency of backups.
Operators should retain backup artifacts based on how far back they need to be able to restore. For example, if backups are taken every 24 hours and operators must be able to restore PCF to its state from three days earlier, they should retain three sets of backup artifacts.
Artifacts should be stored in two data centers other than the PCF data center. When deciding the restore timeframe, you should take other factors such as compliance and auditability into account.
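The retention arithmetic above can be written down as a small helper. This is a minimal sketch; the function name is hypothetical, and it rounds up so that a restore window that is not an exact multiple of the backup interval is still covered.

```python
import math


def artifacts_to_retain(restore_window_hours: float, backup_interval_hours: float) -> int:
    """Number of backup sets needed to cover the restore window."""
    if restore_window_hours <= 0 or backup_interval_hours <= 0:
        raise ValueError("window and interval must be positive")
    return math.ceil(restore_window_hours / backup_interval_hours)


# Backups every 24 hours, restorable to three days prior.
daily = artifacts_to_retain(restore_window_hours=72, backup_interval_hours=24)
# The same window with backups every 12 hours needs twice as many sets.
twice_daily = artifacts_to_retain(restore_window_hours=72, backup_interval_hours=12)
print(daily, twice_daily)
```

Compliance or audit requirements may call for a longer window than operational needs alone, in which case the larger window should be used as the input.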
Pivotal strongly recommends that you encrypt artifacts and store them securely.
An alternative strategy for recovering PCF after a disaster is to have automation in place so that all the data can be recreated. This requires that every modification to PCF settings and state be automated, typically through use of a pipeline.
Recovery steps include creating a new PCF, recreating orgs, spaces, users, services, service bindings and other state, and re-pushing apps.
For more information about this approach, see the following Cloud Foundry Summit presentation: Multi-DC Cloud Foundry: What, Why and How?
To prevent app downtime, some Pivotal customers run active-active, where they run two or more identical PCF deployments in different data centers. If one PCF deployment becomes unavailable, traffic is seamlessly routed to the other deployment. To achieve identical deployments, all operations to PCF are automated so they can be applied to both PCF deployments in parallel.
Because all operations have been automated, the automation approach to disaster recovery is a viable option for active-active. Disaster recovery requires recreating PCF, then running all the automation to recreate state.
This option requires discipline to automate all changes to PCF. Some of the operations that need to be automated are the following:
- App push, restage, scale
- Org, space, and user create, read, update, and delete (CRUD)
- Service instance CRUD
- Service bindings CRUD
- Routes CRUD
- Security groups CRUD
- Quota CRUD
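One way to keep operations like those listed above automated is to describe the desired state declaratively and generate the cf CLI commands from it, so the same description can be replayed against a new deployment. The following is a minimal sketch under stated assumptions: the shape of the `state` dictionary and every name in it are hypothetical, only a few operations are covered, and the function only builds the command lists rather than running them.

```python
from typing import Dict, List


def recreate_commands(state: Dict) -> List[List[str]]:
    """Build the cf CLI commands that recreate orgs, spaces, services, and apps."""
    commands: List[List[str]] = []
    for org in state["orgs"]:
        commands.append(["cf", "create-org", org["name"]])
        for space in org["spaces"]:
            commands.append(["cf", "create-space", space["name"], "-o", org["name"]])
            commands.append(["cf", "target", "-o", org["name"], "-s", space["name"]])
            for svc in space.get("services", []):
                commands.append(
                    ["cf", "create-service", svc["offering"], svc["plan"], svc["instance"]]
                )
            for app in space.get("apps", []):
                # Push without starting so bindings are in place before first start.
                commands.append(["cf", "push", app["name"], "-p", app["path"], "--no-start"])
                for instance in app.get("bindings", []):
                    commands.append(["cf", "bind-service", app["name"], instance])
                commands.append(["cf", "start", app["name"]])
    return commands


# Hypothetical desired state; in practice this would live in version control.
example_state = {
    "orgs": [{
        "name": "example-org",
        "spaces": [{
            "name": "dev",
            "services": [
                {"offering": "example-offering", "plan": "example-plan", "instance": "example-db"}
            ],
            "apps": [
                {"name": "example-app", "path": "./example-app", "bindings": ["example-db"]}
            ],
        }],
    }]
}

commands = recreate_commands(example_state)
for command in commands:
    print(" ".join(command))
```

A pipeline could pass each command list to `subprocess.run` against every deployment it manages. Service instances that are provisioned asynchronously may need a wait step before binding, which this sketch omits.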
Human-initiated changes always make their way into the system. These changes can include quotas being raised, new settings being enabled, and incident responses. For this reason, Pivotal recommends taking backups even when using an automated disaster recovery strategy.
| |Restore the PCF Data|Recreate the PCF Data|
|---|---|---|
|Preconditions|IaaS prepared for PCF install|IaaS prepared for PCF install|
|RTO (Recovery Time Objective): Platform|Time to recreate PCF|Time to recreate PCF|
|RTO (Recovery Time Objective): Apps|Time to restore|Time until orgs, spaces, and other state have been recreated, plus time until apps have been re-pushed|
|RPO (Recovery Point Objective): Platform|Time of the last backup|Current time|
|RPO (Recovery Point Objective): Apps|Time of the last backup|Current time|
Instead of having a true active-active deployment across all layers, some Pivotal customers prefer to install a PCF or Elastic Runtime deployment on a backup site. The backup site resides on-premises, in a co-location facility, or in the public cloud. The backup site includes an operational deployment, with only the most critical apps ready to accept traffic should a failure occur in the primary data center. Disaster recovery in this scenario involves the following:
- Switching traffic to the passive PCF, making it active.
- Recovering the formerly-active PCF. Operators can choose to do this through automation, if that option is available, or by using BBR and the restore process.
The RTO and RPO for recreating the active PCF are the same as outlined in the table above.
Both the restore and recreate data disaster recovery options require standing up a new PCF, which can take hours. If you require a shorter RTO, several options involving pre-created standby hardware and PCF are available:
- Public cloud environment ready for PCF installation, no PCF installed. This saves both IaaS costs and PCF instance costs. For on-premises installations, this requires hardware on standby, ready to install on, which may not be a realistic option.
- PCF installed on standby hardware and kept up to date, VMs scaled down to zero (spin them up each time there is a platform update), no apps installed, no orgs or spaces defined.
- Bare minimum PCF install, either with no apps or with a small number of instances of each app in a stopped state. On recovery, push a small number of apps or start the existing apps, while simultaneously triggering automation to scale the platform to the primary node size, or to a smaller size if large percentages of loss are acceptable. This mode allows you to start sending some traffic immediately, while not paying for a full non-primary platform. This method requires data to be seeded, but it is usually acceptable to complete the data sync while the platform is scaling up.
- Non-primary deployment scaled to the primary node size, or to a smaller size if large percentages of loss are acceptable, with a small number of Diego cells (VMs). On failover, scale Diego cells up to primary node counts. This mode allows you to start sending most traffic immediately, while not paying for all the AIs of a fully fledged node. This method requires data to be available very quickly after a failure. It does not require real-time sync, but it does require near-real-time sync.
There is a tradeoff between cost and RTO: the less the replacement PCF needs to be deployed and scaled, the faster the restore.
BBR generates the backup artifacts required for PCF, but does not handle scheduling, artifact management, or encryption. The BBR team has created a starter Concourse pipeline to automate backups with BBR.
Also, Stark & Wayne’s Shield can be used as a front end management tool using the BBR plugin.
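Because BBR leaves artifact management to the operator, a backup pipeline usually needs a pruning step of its own. The following is a minimal sketch of choosing which artifacts to delete; the function name, artifact names, and timestamps are hypothetical, and the sketch only selects candidates rather than deleting anything.

```python
from datetime import datetime
from typing import Dict, List


def artifacts_to_prune(artifacts: Dict[str, datetime], keep: int) -> List[str]:
    """Return the names of all but the `keep` most recent artifacts, oldest first."""
    if keep < 0:
        raise ValueError("keep must not be negative")
    newest_first = sorted(artifacts, key=artifacts.get, reverse=True)
    return sorted(newest_first[keep:], key=artifacts.get)


# Hypothetical artifact directories mapped to the time each backup was taken.
example_artifacts = {
    "backup-day-1": datetime(2018, 1, 1, 2, 0),
    "backup-day-2": datetime(2018, 1, 2, 2, 0),
    "backup-day-3": datetime(2018, 1, 3, 2, 0),
    "backup-day-4": datetime(2018, 1, 4, 2, 0),
}

to_delete = artifacts_to_prune(example_artifacts, keep=3)
print(to_delete)
```

Keeping the selection separate from the deletion makes it easy to log or review the candidates before anything is removed from offsite storage.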
To ensure that backup artifacts are valid, the BBR tool creates checksums of the generated backup artifacts, and ensures that the checksums match the artifacts on the jumpbox.
However, the only way to be sure that the backup artifact can be used to successfully recreate PCF is to test it in the restore process. This is a cumbersome, dangerous process, so it should be done with care. For instructions, see Step 11: (Optional) Validate Your Backup of the Backing Up Pivotal Cloud Foundry with BBR topic.
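Operators who copy artifacts from the jumpbox to offsite storage may want to repeat a checksum comparison of their own after each transfer. The following is a minimal sketch of that check; it is not BBR's implementation, the function names are hypothetical, and the choice of SHA-256 is an assumption made for illustration.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_sha256: str) -> bool:
    """Return True if the file at `path` matches the recorded checksum."""
    return sha256_of_file(path) == expected_sha256.lower()
```

Record the digest when the artifact is first written, then call `verify_artifact` on the copy at each storage location. A matching checksum shows the file was not corrupted in transit; it does not show that the artifact can be restored.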