Checking Ops Manager State after a Power Failure on vSphere

Warning: VMware Tanzu Application Service for VMs (TAS for VMs) v2.9 is no longer supported because it has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.

This topic describes how to check the state of Ops Manager after a power failure in an on-premises vSphere installation.

If you have a procedure at your company for handling power failure scenarios and want to add steps for checking the state of Ops Manager, use this procedure as a template.

Overview

This section describes the process used by Ops Manager to recover from power failures and exceptions to that process.

Automatic Recovery Process

When power returns after a failure, vSphere and Ops Manager automatically do the following to recover your environment:

  1. vSphere High Availability (HA) recovers VMs.

  2. BOSH ensures that the processes on those VMs are healthy, with the exception of the Ops Manager VM and the BOSH VM itself. Ops Manager uses BOSH to deploy and manage its VMs. For more information, see the BOSH documentation.

  3. The Diego runtime of VMware Tanzu Application Service for VMs (TAS for VMs) recovers apps that were running on the VMs. For more information, see Diego Components and Architecture.

Scenarios that Require Manual Intervention

Two scenarios exist that can require manual intervention when recovering your environment after a power failure:

  • If TAS for VMs is configured to use a MySQL cluster instead of a single node, the cluster does not recover automatically.

  • If you run Ops Manager v2.5.3 or earlier, you might encounter the following known issue in the BOSH Director: Monit inaccurately reports the health of UAA.

The procedure in this topic includes more detail about addressing these scenarios.

Checklist

Use the checklist in this section to ensure that Ops Manager is in a good state after a power failure.

This checklist assumes that your Ops Manager on vSphere installation is configured for vSphere HA and that you have the BOSH Resurrector enabled.

Phase  Component                Action
1      vSphere                  Ensure vSphere is Running
2      Ops Manager              Ensure Ops Manager is Running
3      BOSH Director            Ensure BOSH Director is Running
4      BOSH Director            Ensure BOSH Resurrector Finished Recovering
5      TAS for VMs              Ensure VMs for TAS for VMs are Running. This may include manually recovering the MySQL cluster.
6      TAS for VMs              Ensure Apps Hosted on TAS for VMs are Running
7      Ops Manager Healthwatch  Check the Healthwatch Dashboard

Phase 1: Ensure vSphere is Running

Ensure that vSphere is running and has fully recovered from the power failure. Check your internal vSphere monitoring dashboard.

Phase 2: Ensure Ops Manager is Running

To ensure that Ops Manager is running, do the following:

  1. Open vCenter and navigate to the resource pool that hosts your Ops Manager deployment.

  2. Select Related Objects > Virtual Machines.

  3. Locate the VM with the name OpsMan-VERSION, for example OpsMan-2.6.

  4. Review the State and Status columns for the Ops Manager VM. If Ops Manager is running, the columns show Powered On and Normal. If the columns do not show Powered On and Normal, restart the VM.

Phase 3: Ensure BOSH Director is Running

To ensure that BOSH Director is running, do the following:

  1. In a browser, navigate to the Ops Manager UI and select the BOSH Director for vSphere tile.

    Note: If you do not know the URL of the Ops Manager VM, you can use the IP address that you obtain from vCenter.

  2. Select Status.

  3. In the BOSH Director row, locate and record the CID. The CID is the cloud ID and corresponds to the VM name in vSphere.

  4. Navigate to the vCenter resource pool or cluster that hosts your Ops Manager deployment.

  5. Select Related Objects > Virtual Machines.

  6. Locate the VM with the name that corresponds to the CID value that you recorded.

  7. Review the State and Status columns for the VM. If the State column does not show Powered On, restart the VM.

  8. If the State column shows that the VM is Powered On but the Status column does not show Normal, the cause may be the following known issue: Monit inaccurately reports the health of UAA. To resolve this issue, do the following:

    1. SSH into the BOSH Director VM using the instructions in SSH into the BOSH Director VM.
    2. Run the following command to check whether all processes are running: monit summary
    3. If the uaa process is not running, run the following command: monit restart uaa
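The check in the sub-steps above can be sketched as a small filter over the output of monit summary. This is a minimal sketch, not part of the official procedure: the function name not_running_processes is hypothetical, and it assumes that monit summary prints one line per process in the form Process 'NAME' STATUS.

```shell
# Reads `monit summary` output on stdin and prints every process whose status
# is anything other than "running", one per line, as "name: status".
# Assumption: process lines look like   Process 'uaa'    running
not_running_processes() {
  awk -v q="'" '
    $1 == "Process" {
      name = $2
      gsub(q, "", name)        # strip the quotes around the process name
      $1 = ""; $2 = ""         # drop the first two fields, keep the status
      status = $0
      sub(/^ +/, "", status)   # trim the padding left by the dropped fields
      if (status != "running") print name ": " status
    }
  '
}

# Example, run on the BOSH Director VM:
#   monit summary | not_running_processes
```

If the filter prints nothing, every process that Monit lists reports as running. If it prints a line for uaa, continue with the restart in the last sub-step.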

Phase 4: Ensure BOSH Resurrector Finished Recovering

If enabled, the BOSH Resurrector re-creates any VMs that are still in a problematic state after vSphere HA recovers them.

To ensure that the BOSH Resurrector has finished recovering, do the following:

  1. Log in to the Ops Manager VM with SSH using the instructions in Log in to the Ops Manager VM with SSH.

  2. Authenticate with the BOSH Director VM using the instructions in Authenticate with the BOSH Director VM.

  3. Run the following command to see if there is any currently running or queued Resurrector activity:

    bosh tasks --all -d ''
    

    Review the task descriptions for scan and fix tasks. If no tasks are running, the BOSH Director has probably finished recovering. Run bosh tasks --recent --all -d '' to view finished tasks.
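To avoid re-running the command by hand, the check in step 3 can be wrapped in a small helper. This is a sketch only: count_scan_and_fix is a hypothetical name, and it assumes that Resurrector tasks contain the text scan and fix in their description, as described above.

```shell
# Counts the lines on stdin that mention a "scan and fix" task.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable in scripts that run with "set -e".
count_scan_and_fix() {
  grep -ci 'scan and fix' || true
}

# Example, after authenticating with the BOSH Director:
#   while [ "$(bosh tasks --all -d '' | count_scan_and_fix)" -gt 0 ]; do
#     sleep 30
#   done
```

A count of 0 means only that no scan and fix task is currently running or queued. Review the finished tasks before you treat the recovery as complete.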

Phase 5: Ensure the VMs for TAS for VMs are Running

Note: You can also apply the steps in this section to any Ops Manager services. To further ensure the health of Ops Manager services, use the Ops Manager Healthwatch dashboard and the documentation for each service.

To ensure that the VMs for TAS for VMs are running, do the following:

  1. Run the following command to confirm that VMs are running:

    bosh vms
    

    BOSH lists VMs by deployment. The deployment with the cf- prefix is the TAS for VMs deployment.

  2. If the mysql VM is not running, the likely cause is that MySQL is deployed as a cluster rather than a single node. Clusters require manual intervention after an outage. For instructions to confirm and recover the cluster, see Manually Recover MySQL (Clusters Only).

  3. If any other VMs are not running, run the following command:

    bosh cck -d DEPLOYMENT
    

    This command scans for problems and provides options for recovering VMs. For more information, see IaaS Reconciliation in the BOSH documentation.

  4. If you cannot get all VMs running, contact Support for assistance. Provide the following information:

    • A statement that you started this checklist to recover from a power failure on vSphere
    • A list of failing VMs
    • Your Ops Manager version
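In a large deployment, a filter over the bosh vms output makes it easier to spot the VMs that need attention and to build the list of failing VMs for Support. The following is a rough sketch under stated assumptions: vms_not_running is a hypothetical name, and it assumes that the piped output of bosh vms is tab-separated, with the instance name in the first column and the process state in the second.

```shell
# Reads `bosh vms` output on stdin and prints the instances whose process
# state is not "running".
# Assumption: rows are tab-separated, column 1 is the instance (NAME/ID)
# and column 2 is the process state.
vms_not_running() {
  awk -F '\t' '
    NF >= 2 {
      instance = $1; state = $2
      gsub(/^ +| +$/, "", instance)   # trim padding around each value
      gsub(/^ +| +$/, "", state)
      # Instance names contain a slash; this skips header and summary rows.
      if (instance ~ /\// && state != "running") print instance " (" state ")"
    }
  '
}

# Example:
#   bosh vms | vms_not_running
```

If your BOSH CLI formats its output differently, adjust the column numbers before relying on the result.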

Manually Recover MySQL (Clusters Only)

To manually recover MySQL, do the following:

  1. In a browser, navigate to the Ops Manager UI and select the VMware Tanzu Application Service for VMs tile.

  2. Select the Resource Config pane.

  3. Review the INSTANCES column of the MySQL Server job. If the number of instances is greater than 1, manually recover MySQL by following Recovering From MySQL Cluster Downtime.

Phase 6: Ensure Apps Hosted on TAS for VMs are Running

To ensure apps hosted on TAS for VMs are running, do the following:

  1. Check the status of an app that you run on TAS for VMs. Run any health checks that the app has or visit the URL of the app to verify that it is working.

  2. Push an app to TAS for VMs.
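Visiting the URL of an app can be scripted with curl. This is a minimal sketch: the function names status_ok and check_app_url and the example URL are hypothetical, and it assumes that a 2xx or 3xx response is enough to call the app healthy. An app with a dedicated health endpoint deserves a stricter check.

```shell
# Returns success (0) when an HTTP status code is in the 2xx or 3xx range.
status_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetches a URL and reports whether the app answered with a healthy status.
# curl prints 000 when it cannot connect, which status_ok treats as a failure.
check_app_url() {
  app_url="$1"
  app_code="$(curl --silent --output /dev/null --write-out '%{http_code}' --max-time 10 "$app_url")" || true
  if status_ok "$app_code"; then
    echo "OK   $app_url ($app_code)"
  else
    echo "FAIL $app_url ($app_code)"
    return 1
  fi
}

# Example (hypothetical URL):
#   check_app_url https://my-app.example.com
```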

Phase 7: Check the Healthwatch Dashboard

You can use Ops Manager Healthwatch to further assess the state of Ops Manager. For more information, see Using Ops Manager Healthwatch.