Restoring Pivotal Cloud Foundry from Backup with BBR

This topic describes the procedure for restoring your critical backend PCF components with BOSH Backup and Restore (BBR), a command-line tool for backing up and restoring BOSH deployments. To perform the procedures in this topic, you must have backed up Pivotal Cloud Foundry (PCF) by following the steps in the Backing Up Pivotal Cloud Foundry with BBR topic.

To view the BBR release notes, see the BOSH Backup and Restore Release Notes. To restore PCF manually, see the Restoring Pivotal Cloud Foundry Manually from Backup topic.

The procedures described in this topic prepare your environment for PCF, deploy Ops Manager, import your installation settings, and use BBR to restore your PCF components.

Warning: Restoring Pivotal Cloud Foundry (PCF) with BBR is a destructive operation. If the restore fails, the new environment may be left in an unusable state and require reprovisioning. Only perform the procedures in this topic for the purpose of disaster recovery, such as recreating PCF after a storage-area network (SAN) corruption.

Warning: When validating your backup, the VMs and disks from the backed-up BOSH Director should not be visible to the new BOSH Director. For this reason, Pivotal recommends that you deploy the new BOSH Director to a different IaaS network and account than the VMs and disks of the backed-up BOSH Director.

Note: BBR is a feature in PCF v1.11. You can only use BBR to back up PCF v1.11 and later. To restore earlier versions of PCF, perform the manual procedures.

Note: If you are restoring in order to validate a backup, look for notes marked Validation throughout the topic.

Compatibility of Restore

This section describes the restrictions for a backup artifact to be restorable to another environment. This section is for guidance only, and Pivotal highly recommends that operators validate their backups by using the backup artifacts in a restore.

Consult the following restrictions for a backup artifact to be restorable:

  • Topology: BBR requires the BOSH topology of a deployment to be the same in the restore environment as it was in the backup environment.
  • Naming of instance groups and jobs: For any deployment that implements the backup and restore scripts, the instance groups and jobs must have the same names.
  • Number of instance groups and jobs: For instance groups and jobs that have backup and restore scripts, there must be the same number of instances.
  • Limited validation: BBR puts the backed up data into the corresponding instance groups and jobs in the restored environment, but can’t validate the restore beyond that. For example, if the MySQL encryption key is different in the restore environment, the BBR restore might succeed although the restored MySQL database is unusable.

Note: A change in VM size or underlying hardware should not affect BBR’s ability to restore data, as long as there is adequate storage space to restore the data.

(Optional) Step 1: Prepare Your Environment

In the event of a disaster, you may lose not only your VMs and disks, but also your IaaS resources, such as networks and load balancers.

If you need to recreate your IaaS resources, prepare your environment for PCF by following the instructions specific to your IaaS in Installing Pivotal Cloud Foundry.

Note: The instructions for installing PCF on Amazon Web Services (AWS) and OpenStack combine the procedures for preparing your environment and deploying Ops Manager into a single topic. The instructions for the other supported IaaSes split these procedures into two separate topics.

If you recreate your IaaS resources, you must also add those resources to Ops Manager by performing the procedures in the (Optional) Step 3: Configure Ops Manager for New Resources section.

Step 2: Deploy Ops Manager and Import Installation Settings

  1. Perform the procedures for your IaaS to deploy Ops Manager. For instructions, see the topic for your IaaS in Installing Pivotal Cloud Foundry.

  2. Access your new Ops Manager by navigating to YOUR-OPS-MAN-FQDN in a browser.

  3. On the Welcome to Ops Manager page, click Import Existing Installation.

  4. In the import panel, perform the following tasks:

    • Enter your Decryption Passphrase.
    • Click Choose File and browse to the installation zip file that you exported in the Step 7: Export Installation Settings section of the Backing Up Pivotal Cloud Foundry with BBR topic.

  5. Click Import.

    Note: Some browsers do not provide feedback on the status of the import process, and may appear to hang. The import process takes at least 10 minutes, and takes longer the more tiles that were present on the backed-up Ops Manager.

  6. A Successfully imported installation message appears upon completion.

(Optional) Step 3: Configure Ops Manager for New Resources

If you recreated IaaS resources such as networks and load balancers by following the steps in the (Optional) Step 1: Prepare Your Environment section above, perform the following steps to update Ops Manager with your new resources:

  1. Enable the Ops Manager advanced mode following this Knowledge Base article.
  2. Navigate to the Ops Manager Installation Dashboard and click the Ops Manager Director tile.
  3. Click Create Networks and update the network names to reflect the network names for the new environment.
  4. If running on GCP, click Google Config and update the Project ID to reflect the new GCP project ID.
  5. Return to the Ops Manager Installation Dashboard and click the Elastic Runtime tile.
  6. Click Resource Config. If necessary for your IaaS, enter the name of your new load balancer in the Load Balancer column.
  7. If necessary, click Networking and update the load balancer SSL certificate and private key under Router SSL Termination Certificate and Private Key.
  8. If your environment has a new DNS address, update the old environment DNS entries to point to the new load balancer addresses. For more information, see the Step 4: Configure Networking section of the Using Your Own Load Balancer topic and follow the link to the instructions for your IaaS.
  9. If you are using Google Cloud Platform (GCP), navigate to the Google Config section of the Ops Manager Director tile and update the Default Deployment Tag to reflect the new environment.
  10. Make sure you disable the Ops Manager advanced mode, as recommended in the Knowledge Base article.

Step 4: Remove BOSH State File

  1. SSH into your Ops Manager VM. For more information, see the SSH into Ops Manager section of the Advanced Troubleshooting with the BOSH CLI topic.
  2. On the Ops Manager VM, delete the /var/tempest/workspaces/default/deployments/bosh-state.json file:
    $ rm /var/tempest/workspaces/default/deployments/bosh-state.json
    
  3. Navigate to YOUR-OPS-MAN-FQDN in a browser and log into Ops Manager.
  4. For each tile that requires one, upload the required stemcell.

    Warning: Do not click Apply Changes at this point.

  5. Perform the steps in the Applying Changes to Ops Manager Director topic to use the Ops Manager API to only deploy the Ops Manager Director.

    Validation: If your BOSH Director has an external hostname, you should change it in Ops Manager Director > Director Config > Director Hostname to ensure it does not conflict with the hostname of the backed-up Director.

Step 5: Transfer Artifacts to Jumpbox

In the Step 9: Back Up Your Elastic Runtime Deployment section of the Backing Up Pivotal Cloud Foundry with BBR topic, you moved the TAR and metadata files of the backup artifact off your jumpbox to your preferred storage space. Now you must transfer those files back to your jumpbox.

For instance, you could SCP the backup artifact to your jumpbox:

$ scp LOCAL_PATH_TO_BACKUP_ARTIFACT JUMPBOX_USER@JUMPBOX_ADDRESS:

Step 6: Retrieve BOSH Director Address and Credentials

Perform the following steps to retrieve the IP address of your BOSH Director and the credentials for logging in from the Ops Manager Director tile:

  1. Install the BOSH v2 CLI on a machine outside of your PCF deployment. You can use the jumpbox for this task.
  2. From the Installation Dashboard in Ops Manager, select Ops Manager Director > Status and record the IP address listed for the Director. You access the BOSH Director using this IP address.

  3. Click Credentials and record the Director credentials.
  4. From the command line, log into the BOSH Director using the IP address and credentials that you recorded:
    $ bosh -e DIRECTOR_IP \
    --ca-cert PATH-TO-BOSH-SERVER-CERT log-in
    Email (): director
    Password (): *******************
    Successfully authenticated with UAA
    Succeeded
    

Step 7: Restore the BOSH Director

  1. Navigate to the Ops Manager Installation Dashboard.
  2. Click the Ops Manager tile.
  3. Click the Credentials tab.
  4. Locate Bbr Ssh Credentials and click Link to Credential next to it.

    You can also retrieve the credentials using the Ops Manager API with a GET request to the following endpoint: /api/v0/deployed/director/credentials/bbr_ssh_credentials. For more information, see the Using the Ops Manager API topic.
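
    For example, a request using the same placeholders as the other API calls in this topic might look like the following (a sketch; OPS-MAN-FQDN and UAA-ACCESS-TOKEN come from the Using the Ops Manager API topic):

    $ curl "https://OPS-MAN-FQDN/api/v0/deployed/director/credentials/bbr_ssh_credentials" \
    -X GET \
    -H "Authorization: Bearer UAA-ACCESS-TOKEN"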

  5. Copy the value for private_key_pem, beginning with "-----BEGIN RSA PRIVATE KEY-----".

  6. SSH into your jumpbox.

  7. Run the following command to reformat the key and save it to a file called PRIVATE_KEY in the current directory, pasting in the contents of your private key for YOUR_PRIVATE_KEY:

    $ printf -- "YOUR_PRIVATE_KEY" > PRIVATE_KEY
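
    As an optional check (not part of the original procedure), you can confirm that the reformatted key parses as a valid private key before using it:

    $ ssh-keygen -y -f PRIVATE_KEY > /dev/null && echo "PRIVATE_KEY looks valid"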
    

  8. Ensure the BOSH Director backup artifact is in the folder you will run BBR from.

  9. Run the BBR restore command from your jumpbox to restore the BOSH Director:

    $ nohup bbr director \
      --private-key-path PRIVATE_KEY \
      --username bbr \
      --host HOST \
      restore \
        --artifact-path PATH_TO_DIRECTOR_BACKUP
    
    Use the optional --debug flag to enable debug logs. See the Logging section of the Backing Up Pivotal Cloud Foundry with BBR topic for more information.

    Replace the placeholder values as follows:

    • PATH_TO_DIRECTOR_BACKUP: This is the path to the Director backup you want to restore.
    • PRIVATE_KEY: This is the path to the private key file you created above.
    • HOST: This is the address of the BOSH Director. If the BOSH Director is public, this will be a URL, such as https://my-bosh.xxx.cf-app.com. Otherwise, it will be the BOSH_DIRECTOR_IP, which you retrieved in the Step 6: Retrieve BOSH Director Address and Credentials section.

Note: The BBR BOSH Director restore command can take at least 15 minutes to complete. Pivotal recommends that you run it independently of the SSH session, so that the process can continue running even if your connection to the jumpbox fails. The command above uses nohup but you could also run the command in a screen or tmux session.
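
For example, instead of nohup you could run the same restore inside a tmux session on the jumpbox (a sketch, assuming tmux is installed):

$ tmux new-session -s bbr-restore
$ bbr director \
  --private-key-path PRIVATE_KEY \
  --username bbr \
  --host HOST \
  restore \
    --artifact-path PATH_TO_DIRECTOR_BACKUP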

If the command completes successfully, continue to Step 8: Identify Your Deployment.

If the command fails, do the following:

  1. Ensure all the parameters in the command are set.
  2. Ensure the BOSH Director credentials are valid.
  3. Ensure the specified deployment exists.
  4. Ensure the source deployment is compatible with the target deployment.
  5. Ensure that the jumpbox can reach the BOSH Director.

Step 8: Identify Your Deployment

After logging in to your BOSH Director, run bosh deployments to identify the name of the BOSH deployment that contains PCF:

$ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate deployments

Name                     Release(s)
cf-example               push-apps-manager-release/661.1.24
                         cf-backup-and-restore/0.0.1
                         binary-buildpack/1.0.11
                         capi/1.28.0
                         cf-autoscaling/91
                         cf-mysql/35
                         ...

In the above example, the name of the BOSH deployment that contains PCF is cf-example.

Step 9: Remove Stale Cloud IDs for All Deployments

For every deployment in the BOSH Director, run the following command:

$ bosh -e DIRECTOR_IP -d DEPLOYMENT_NAME -n cck \
  --resolution delete_disk_reference \
  --resolution delete_vm_reference

This reconciles the BOSH Director’s internal state with the state in the IaaS. You can use the list of deployments returned in Step 8: Identify Your Deployment.
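
If the BOSH Director manages many deployments, one way to run the command against each of them is a shell loop over the deployment names (a sketch, assuming a BOSH CLI v2 version that supports the --column flag and prints one name per line when its output is piped):

$ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate \
  deployments --column=name | while read -r deployment_name; do
    bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate \
      -d "$deployment_name" -n cck \
      --resolution delete_disk_reference \
      --resolution delete_vm_reference
  done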

If the bosh cck command does not successfully delete disk references and you see a message similar to the following, perform the additional procedures in the Remove Unused Disks section below.

Scanning 19 persistent disks: 19 OK, 0 missing ...

Step 10: Redeploy Elastic Runtime

  1. Perform the following steps to determine which stemcell is used by Elastic Runtime:

    1. Navigate to the Ops Manager Installation Dashboard.
    2. Click the Pivotal Elastic Runtime tile.
    3. Click Stemcell and record the release number included in the displayed filename. In this example, the stemcell release number is 3421.9.

      You can also retrieve the stemcell release using the BOSH CLI:

      $ bosh -e DIRECTOR_IP deployments
      Using environment '10.0.0.5' as user 'director' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)

      Name                      Release(s)                           Stemcell(s)                                     Team(s)  Cloud Config
      cf-9cb6995b7d746cd77438   push-apps-manager-release/661.1.24   bosh-google-kvm-ubuntu-trusty-go_agent/3421.9   -        latest
      ...

  2. Download the stemcell from Pivotal Network.

  3. Run the following command to upload the stemcell used by Elastic Runtime:

    $ bosh -e BOSH_DIRECTOR_IP \
      -d DEPLOYMENT_NAME \
      --ca-cert PATH_TO_BOSH_SERVER_CERT \
      upload-stemcell \
      --fix PATH_TO_STEMCELL
    

  4. If you have any other tiles installed, ensure you upload their stemcells if they are different from the Elastic Runtime stemcell. Upload stemcells to the BOSH Director with bosh upload-stemcell --fix PATH_TO_STEMCELL, as in the command above.

  5. From the Ops Manager Installation Dashboard, navigate to Pivotal Elastic Runtime > Resource Config.

  6. Ensure the number of instances for MySQL Server is set to 1.

    Warning: Restore will fail if there is not exactly one MySQL Server instance deployed.

  7. Return to the Ops Manager Installation Dashboard and click Apply Changes to redeploy.

    Validation: If your Elastic Runtime uses an external blobstore, ensure that the Elastic Runtime tile is configured to use a different blobstore before clicking Apply Changes. Otherwise it will attempt to connect to the blobstore that the existing Elastic Runtime is using.

    Validation: Ensure your System Domain and Apps Domain under Pivotal Elastic Runtime > Domains are updated to refer to the validation environment.

Step 11: Restore Elastic Runtime

Validation: If your apps must not be running after a restore, run bosh stop on each diego_cell VM in the deployment.
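
For example, a minimal sketch of stopping all instances in the Diego cell instance group (assuming the instance group is named diego_cell, as in a default Elastic Runtime deployment):

$ bosh -e DIRECTOR_IP --ca-cert PATH_TO_BOSH_SERVER_CERT \
  -d DEPLOYMENT_NAME stop diego_cell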

  1. Run the BBR restore command from your jumpbox to restore Elastic Runtime:

    $ BOSH_CLIENT_SECRET=BOSH_PASSWORD \
      bbr deployment \
        --target BOSH_DIRECTOR_IP \
        --username BOSH_CLIENT \
        --deployment DEPLOYMENT_NAME \
        --ca-cert PATH_TO_BOSH_SERVER_CERT \
        restore \
          --artifact-path PATH_TO_ERT_BACKUP
    

    Replace the placeholder values as follows:

    • BOSH_CLIENT, BOSH_PASSWORD: Use the BOSH UAA user provided in Pivotal Ops Manager > Credentials > Uaa Bbr Client Credentials.

      You can also retrieve the credentials using the Ops Manager API with a GET request to the following endpoint: /api/v0/deployed/director/credentials/uaa_bbr_client_credentials. For more information, see the Using the Ops Manager API topic.

    • BOSH_DIRECTOR_IP: You retrieved this value in the Step 6: Retrieve BOSH Director Address and Credentials section.
    • DEPLOYMENT_NAME: You retrieved this value in the Step 8: Identify Your Deployment section.
    • PATH_TO_BOSH_SERVER_CERT: This is the path to the BOSH Director’s Certificate Authority (CA) certificate, if the certificate is not verifiable by the local machine’s certificate chain.
    • PATH_TO_ERT_BACKUP: This is the path to the Elastic Runtime backup you want to restore.

    Validation: If you ran bosh stop on each diego_cell before running bbr restore, you can now run cf stop on all apps and then run bosh start on each diego_cell. After this, all apps will be deployed in a stopped state.

  2. Perform the following steps after restoring Elastic Runtime:

    1. Retrieve the MySQL admin password by following one of the procedures below:
      • Log in to Ops Manager and navigate to Pivotal Elastic Runtime > Credentials > Mysql Admin Credentials.
      • Retrieve the credentials using the Ops Manager API by performing the following steps:
        1. Perform the procedures in the Using the Ops Manager API topic to authenticate and access the Ops Manager API.
        2. Use the GET /api/v0/deployed/products endpoint to retrieve a list of deployed products, replacing UAA-ACCESS-TOKEN with the access token recorded in the Using the Ops Manager API topic:
          $ curl "https://OPS-MAN-FQDN/api/v0/deployed/products" \
          -X GET \
          -H "Authorization: Bearer UAA-ACCESS-TOKEN"
        3. In the response to the above request, locate the product with an installation_name starting with cf- and copy its guid.
        4. Run the following curl command, replacing PRODUCT-GUID with the value of guid from the previous step:
          $ curl "https://OPS-MAN-FQDN/api/v0/deployed/products/PRODUCT-GUID/credentials/" \
          -X GET \
          -H "Authorization: Bearer UAA-ACCESS-TOKEN"
        5. Retrieve the MySQL admin password from the response to the above request.
    2. List the VMs in your deployment:
      $ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate \
      -d DEPLOYMENT_NAME \
      ssh
    3. Select the mysql VM to SSH into.
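      Alternatively, if you already know the instance name, you can SSH into it directly (assuming the MySQL instance group is named mysql, as in a default Elastic Runtime deployment):
      $ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate \
      -d DEPLOYMENT_NAME \
      ssh mysql/0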
    4. From the mysql VM, run the following command:
      $ sudo /var/vcap/packages/mariadb/bin/mysql -u root -p
      When prompted, enter the MySQL admin password.

    5. At the MySQL prompt, run the following command:
      mysql> use silk; drop table subnets; drop table gorp_migrations;
    6. Exit MySQL:
      mysql> exit
    7. Exit the mysql VM:
      $ exit
    8. List the VMs in your deployment:
      $ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate \
      -d DEPLOYMENT_NAME \
      ssh
    9. SSH onto each diego_database VM and run the following command:
      $ sudo monit restart silk-controller

    Restored apps will begin to start. The amount of time it takes for all apps to start depends on the number of app instances, the resources available to the underlying infrastructure, and the value of the Max Inflight Container Starts field in the Elastic Runtime tile.
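
    To monitor progress, you can list your apps and their running instance counts with the cf CLI:

    $ cf apps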

  3. If desired, scale the MySQL Server job back up to its previous number of instances by navigating to the Resource Config section of the Elastic Runtime tile. After scaling the job, return to the Ops Manager Installation Dashboard and click Apply Changes to deploy.

(Optional) Step 12: Restore On-Demand Service Instances

Note: These procedures restore the on-demand service instances but do not restore service instance data.

If you have on-demand service instances provisioned by an on-demand service broker, perform the following steps to restore them after successfully restoring PCF:

  1. Use the Cloud Foundry Command Line Interface (cf CLI) to target your PCF deployment:
    $ cf api api.YOUR-SYSTEM-DOMAIN
    
  2. Log in:
    $ cf login
    
  3. Perform the following steps to make a list of all the service instances provisioned by your on-demand service broker:
    1. List your service offerings:
      $ cf curl /v2/services
      
    2. Record the GUID of the on-demand service offering you want to restore by examining the value for guid under metadata:
      "metadata": {
        "guid": "ab2b01cc-2a22-525a-a333-e6e666a6aa66",
        "url": "/v2/services/ab2b01cc-2a22-525a-a333-e6e666a6aa66",
        "created_at": "2017-02-10T18:19:35Z",
        "updated_at": "2017-02-10T18:19:35Z"
      }
      
    3. List all service plans for the service offering, replacing SERVICE-OFFERING-GUID with the GUID obtained in the previous step:
      $ cf curl /v2/services/SERVICE-OFFERING-GUID/service_plans
      
    4. Record the GUID of each service plan by examining the value for guid under metadata.
    5. For each service plan, list all service instances:
      $ cf curl /v2/service_plans/SERVICE-PLAN-GUID/service_instances
      
    6. Record the GUID of each service instance by examining the value for guid under metadata.
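      For example, jq can extract the GUIDs from the cf curl response directly (a sketch, assuming jq is installed on the machine where you run the cf CLI):
      $ cf curl /v2/service_plans/SERVICE-PLAN-GUID/service_instances | \
      jq -r '.resources[].metadata.guid'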
  4. Perform the following steps to obtain the BOSH credentials used by your on-demand service broker:
    1. Navigate to https://YOUR-OPS-MAN-FQDN/api/v0/staged/products in a browser to obtain the product GUID of your tile.
    2. Navigate to https://YOUR-OPS-MAN-FQDN/api/v0/staged/products/PRODUCT-GUID/manifest to obtain your product’s staged manifest.
    3. Copy the manifest into a file on your local machine called manifest.json.
    4. Run the following command to find the name of the deployment’s on-demand broker instance group:
      $ cat manifest.json | jq '(.instance_groups[].name )' | grep on-demand-broker | grep -v -E "register|smoke"
      > redis-on-demand-broker
      
    5. Run the following command to extract the BOSH credentials:
      $ cat manifest.json | jq '(.instance_groups[] |
      select(.name == "redis-on-demand-broker").jobs[] |
      select(.name == "broker").properties.bosh.authentication.uaa )'
      
  5. SSH into your Ops Manager VM. For more information, see the SSH into Ops Manager section of the Advanced Troubleshooting with the BOSH CLI topic.
  6. Using the BOSH credentials retrieved above, authenticate with your BOSH Director by running the following commands with the BOSH CLI v2:
    $ export BOSH_CLIENT=YOUR-CLIENT-ID
    $ export BOSH_CLIENT_SECRET=YOUR-CLIENT-SECRET
    $ bosh alias-env director -e DIRECTOR-IP \
    --ca-cert /var/tempest/workspaces/default/root_ca_certificate
    
  7. Using the list of service instance GUIDs gathered above, deploy each instance with the following commands:
    $ bosh -e director manifest \
    -d service-instance_SERVICE-INSTANCE-GUID > /tmp/manifest.yml
    $ bosh -e director \
    -d service-instance_SERVICE-INSTANCE-GUID deploy /tmp/manifest.yml
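
    If you recorded many service instance GUIDs, you can loop over them instead (a sketch; replace the example GUID list with your own, and note that the -n flag skips the interactive deploy confirmation):

    $ for guid in SERVICE-INSTANCE-GUID-1 SERVICE-INSTANCE-GUID-2; do
        bosh -e director manifest -d service-instance_${guid} > /tmp/manifest.yml
        bosh -e director -d service-instance_${guid} -n deploy /tmp/manifest.yml
      done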
    
  8. After deploying all service instances, remove the manifest from tmp.
    $ rm /tmp/manifest.yml
    
  9. Any ERT apps bound to these services must be restarted to pick up the recreated service instances.
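    For example, cf services lists each service instance along with the apps bound to it, and cf restart restarts an individual app:
    $ cf services
    $ cf restart YOUR-APP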

Rolling Back Elastic Runtime in the Event of a Failed Upgrade

If you have previously backed up PCF using BBR, you can roll back Elastic Runtime (ERT) to an earlier deployment if the ERT upgrade fails. For instructions to roll back the ERT deployment, see the Rolling Back ERT Deployment to an Earlier Backup with BBR topic.

Remove Unused Disks

If bosh cck does not clean up all disk references, you must manually delete any disks left over from the previous deployment; otherwise, they can prevent the recreated deployments from working.

Warning: This is a very destructive operation.



To delete the disks, perform one of the following procedures:

  • Use the BOSH CLI to delete the disks by performing the following steps:
    1. Target the redeployed BOSH Director using the BOSH CLI by performing the procedures in Step 6: Retrieve BOSH Director Address and Credentials.
    2. List the deployments by running the following command:
      $ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate deployments
      
    3. Delete each deployment with the following command:
      $ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate -d DEPLOYMENT_NAME delete-deployment
      
  • Log in to your IaaS account and delete the disks manually. Run the following command to retrieve a list of disk IDs:
    $ bosh -e DIRECTOR_IP --ca-cert /var/tempest/workspaces/default/root_ca_certificate instances --details
    

Once the disks are deleted, continue with Step 9: Remove Stale Cloud IDs for All Deployments.
