Checking PCF State after a Power Failure on vSphere

This topic describes how to check Pivotal Cloud Foundry (PCF) state after a power failure in an on-premises vSphere installation.

If your company has a procedure for handling power failure scenarios and you would like to add steps for checking that PCF is in a good state, you can use this procedure as a template.

Overview

This section describes the process used by PCF to recover from power failures and exceptions to that process.

Automatic Recovery Process

When power returns after a failure, vSphere and PCF automatically do the following to recover your environment:

  1. vSphere High Availability (HA) recovers VMs.
  2. BOSH ensures the processes on those VMs are healthy, with the exception of the Ops Manager VM and the BOSH VM itself. PCF uses BOSH to deploy and manage its VMs. For more information, see BOSH.
  3. The Diego runtime of Pivotal Application Service (PAS) recovers apps that were running on the VMs. For more information, see Diego.

Scenarios that Require Manual Intervention

There are two scenarios that can require manual intervention when recovering your environment after a power failure:

  • If PAS is configured to use a MySQL cluster instead of a single node, the cluster does not recover automatically.
  • If you have Ops Manager v2.5.3 or earlier and encounter the following known issue in the BOSH Director: Monit inaccurately reports the health of UAA.

The procedure in this topic includes more detail about addressing these scenarios.

Checklist

Use the checklist in this section to ensure PCF is in a good state after a power failure. It includes links to sections that contain more detail about each phase.

This checklist assumes your PCF on vSphere installation is set up for vSphere HA and you have the BOSH Resurrector enabled.

Phase  Component        Action
1      vSphere          Ensure vSphere is Running
2      Ops Manager      Ensure Ops Manager is Running
3      BOSH Director    Ensure BOSH Director is Running
4      BOSH Director    Ensure BOSH Resurrector Finished Recovering
5      PAS              Ensure PAS VMs are Running (this may include manually recovering the MySQL cluster)
6      PAS              Ensure Apps Hosted on PAS are Running
7      PCF Healthwatch  Check the Healthwatch Dashboard

Phase 1: Ensure vSphere is Running

Ensure that vSphere is running and has fully recovered from the power failure. Check your internal vSphere monitoring dashboard.

Phase 2: Ensure Ops Manager is Running

To ensure Ops Manager is running, do the following:

  1. Open vCenter and navigate to the resource pool that hosts your PCF deployment.

  2. Select Related Objects, and then Virtual Machines.

  3. Locate the VM with the name OpsMan-VERSION, such as OpsMan-2.6.

  4. Review the State and Status columns for the Ops Manager VM. If Ops Manager is running, the columns display Powered On and Normal. If they do not, restart the VM.

Phase 3: Ensure BOSH Director is Running

To ensure BOSH Director is running, do the following:

  1. In a browser, navigate to the PCF Ops Manager UI and select the BOSH Director for vSphere tile.

    Note: If you do not know the URL of the Ops Manager VM, you can use the IP address from vCenter.

  2. Select Status.

  3. In the BOSH Director row, record the CID. The CID is the cloud ID and corresponds to the VM name in vSphere.

  4. Navigate to the vCenter resource pool or cluster that hosts your PCF deployment.

  5. Select Related Objects, and then Virtual Machines.

  6. Locate the VM with the name that corresponds to the CID value you recorded.

  7. Review the State and Status columns for the VM. If the State is not Powered On, restart the VM.

  8. If the VM is Powered On but Status does not display Normal, it may be due to the following known issue: Monit inaccurately reports the health of UAA. To resolve this issue, do the following:

    1. SSH into the BOSH Director VM using the instructions in SSH into the BOSH Director VM.
    2. Run the following command to see that all processes are running:

      monit summary
      
    3. If the uaa process is not running, run the following command:

      monit restart uaa
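
The process check in the steps above can be scripted. The following is a minimal sketch, not part of the official procedure. The function name monit_unhealthy is hypothetical, and the sketch assumes that each process line printed by monit summary has the form Process 'NAME' followed by a status.

```shell
# Reads `monit summary` output on stdin and prints the name of every
# process whose status is anything other than "running".
# Assumption: each process line looks like  Process 'NAME'   STATUS
monit_unhealthy() {
  awk '$1 == "Process" {
    name = $2
    gsub("\047", "", name)             # strip the quotes around NAME
    status = $3
    for (i = 4; i <= NF; i++) status = status " " $i
    if (status != "running") print name
  }'
}
```

On the BOSH Director VM, you could use it as follows. No output suggests that monit considers every process to be running:

      monit summary | monit_unhealthy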
      

Phase 4: Ensure BOSH Resurrector Finished Recovering

If enabled, the BOSH Resurrector re-creates any VMs that are still in a problematic state after vSphere HA recovers them.

To ensure BOSH Resurrector finished recovering, do the following:

  1. Log in to the Ops Manager VM with SSH using the instructions in Log in to the Ops Manager VM with SSH.

  2. Authenticate with the BOSH Director VM using the instructions in Authenticate with the BOSH Director VM.

  3. Run the following command to see if there is any currently running or queued Resurrector activity:

    bosh tasks --all -d ''
    

    Look for scan and fix in the task description. If no tasks are running, it is likely that the BOSH Resurrector has finished recovering. You can also run bosh tasks --recent --all -d '' to view finished tasks.
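
To avoid reading the task list by eye, you can filter it. The following is a minimal sketch under the assumption that the text scan and fix appears on the same output row as the task it describes. The function name resurrector_tasks is hypothetical.

```shell
# Reads `bosh tasks` table output on stdin and prints only the rows that
# describe Resurrector work. The exit status is 0 if at least one such
# row was found and non-zero otherwise.
# Assumption: the description text "scan and fix" appears on the row.
resurrector_tasks() {
  grep -i 'scan and fix'
}
```

For example, the following prints any matching tasks, or a short message if there are none:

    bosh tasks --all -d '' | resurrector_tasks || echo "No scan and fix tasks found"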

Phase 5: Ensure PAS VMs are Running

Note: You can also apply the steps in this section to any PCF services. To further ensure the health of PCF services, use the PCF Healthwatch dashboard and the documentation for each service.

To ensure PAS VMs are running, do the following:

  1. Run the following command to confirm that VMs are running:

    bosh vms
    

    BOSH lists VMs by deployment. The deployment with the cf- prefix is the PAS deployment.

  2. If the mysql VM is not running, it is likely because it is a cluster and not a single node. Clusters require manual intervention after an outage. See Manually Recover PAS MySQL (Clusters Only) to confirm and recover the cluster.

  3. If any other VMs are not running, run the following command:

    bosh cck -d DEPLOYMENT
    

    This command scans for problems and provides options for recovering VMs. For more information, see IaaS Reconciliation in the BOSH documentation.

  4. If you cannot get all VMs running, contact Pivotal Support for assistance. Provide the following information:

    • That you are using this checklist to recover from a power failure on vSphere
    • A list of failing VMs
    • Your PCF version
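
Steps 1 and 3 above can be partly scripted. The following is a minimal sketch, not an official tool. The function names pas_deployment and vms_not_running are hypothetical, and the sketch assumes that each instance row in the bosh vms output starts with an instance name of the form JOB/ID followed by its process state.

```shell
# Reads `bosh deployments` output on stdin and prints deployment names
# that start with "cf-" (the PAS deployment, per the step above).
pas_deployment() {
  awk '$1 ~ /^cf-/ { print $1 }'
}

# Reads `bosh vms` output on stdin and prints every instance whose
# process state is not "running".
# Assumption: instance rows start with JOB/ID followed by the state.
vms_not_running() {
  awk '$1 ~ /\// && $2 != "running" { print $1 }'
}
```

For example, the following lists the PAS instances that need attention. If it prints any instance other than mysql, run bosh cck against that deployment as described in step 3:

    DEPLOYMENT=$(bosh deployments | pas_deployment)
    bosh -d "$DEPLOYMENT" vms | vms_not_running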

Manually Recover PAS MySQL (Clusters Only)

To manually recover PAS MySQL, do the following:

  1. In a browser, navigate to the PCF Ops Manager UI and select the Pivotal Application Service tile.

  2. Select the Resource Config pane.

  3. Review the INSTANCES column of the MySQL Server job. If the number of instances is greater than 1, manually recover MySQL by following this procedure: Recovering From MySQL Cluster Downtime.
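
If you are already logged in to the BOSH Director from the command line, counting the mysql instances in the bosh vms output gives a rough cross-check of the instance count shown in Ops Manager. The following is a minimal sketch. The function name mysql_instance_count is hypothetical, and the sketch assumes that instance rows start with JOB/ID and that the job is named mysql.

```shell
# Reads `bosh vms` output on stdin and prints how many instances belong
# to the job named "mysql". Jobs whose names merely begin with "mysql"
# (for example a job called mysql_proxy) are not counted.
# Assumption: instance rows start with JOB/ID.
mysql_instance_count() {
  awk '$1 ~ /^mysql\// { n++ } END { print n + 0 }'
}
```

For example, where DEPLOYMENT is the name of the PAS deployment:

    bosh -d DEPLOYMENT vms | mysql_instance_count

A result greater than 1 suggests that MySQL is deployed as a cluster.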

Phase 6: Ensure Apps Hosted on PAS are Running

To ensure apps hosted on PAS are running, do the following:

  1. Check the status of an app your company runs on PCF. Run any health checks that the app has, or visit the URL of the app to confirm that it is working.

  2. Push an app to PCF.
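
If you prefer to check an app URL from the command line, the following minimal sketch classifies the HTTP status code that the app returns. The function name http_status_ok is hypothetical, and treating any 2xx or 3xx response as healthy is an assumption that may not suit every app.

```shell
# Succeeds if the given HTTP status code is in the 2xx or 3xx range.
# Assumption: any 2xx or 3xx response means the app is reachable.
http_status_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, where APP-URL is the URL of your app:

    http_status_ok "$(curl -s -o /dev/null -w '%{http_code}' APP-URL)" && echo "App is responding"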

Phase 7: Check the Healthwatch Dashboard

You can use PCF Healthwatch to further assess the state of PCF. For more information, see Using PCF Healthwatch.