Recovering MySQL from Elastic Runtime Downtime

This topic describes the procedure for recovering a terminated Elastic Runtime cluster using a process known as bootstrapping.

About the BOSH CLI

This topic requires you to run commands from the Ops Manager Director using the BOSH Command Line Interface (CLI).

There are two major releases of the BOSH CLI, and the Ops Manager Director VM includes both versions. You can use either version of the BOSH CLI to interact with your deployment, using bosh commands for the old CLI and bosh2 commands for the new CLI.

For more information about the differences between the old and new versions of the BOSH CLI, see the BOSH documentation.

This topic provides example commands for both versions of the BOSH CLI. Pivotal recommends using bosh2 for compatibility with future PCF versions.
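
For example, the Director login step used later in this topic looks like this in each CLI, where DIRECTOR-URL, USERNAME, PASSWORD, and MY-ENV are placeholders for your own values:

    # BOSH CLI v1
    $ bosh target DIRECTOR-URL
    $ bosh login USERNAME PASSWORD

    # BOSH CLI v2
    $ bosh2 -e MY-ENV log-in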

See Advanced Troubleshooting with the BOSH CLI for more information.

When to Bootstrap

You must bootstrap a cluster that loses quorum. A cluster loses quorum when fewer than half of the nodes can communicate with each other for longer than the configured grace period. For example, a three-node cluster loses quorum if two of its nodes go offline, because the single remaining node cannot form a majority. If a cluster does not lose quorum, individual unhealthy nodes automatically rejoin the cluster after resolving the error, restarting the node, or restoring connectivity.

You can detect lost quorum through the following symptoms:

  • All nodes appear “Unhealthy” on the proxy dashboard, viewable at proxy-BOSH-JOB-INDEX.p-mysql.YOUR-SYSTEM-DOMAIN.
  • All responsive nodes report the value of wsrep_cluster_status as non-Primary:

    mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
    +----------------------+-------------+
    | Variable_name        | Value       |
    +----------------------+-------------+
    | wsrep_cluster_status | non-Primary |
    +----------------------+-------------+
    
  • All responsive nodes respond with ERROR 1047 when queried with most statement types:

    mysql> select * from mysql.user;
    ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
    

See the Cluster Scaling, Node Failure, and Quorum topic for more details about determining cluster state.

Follow the steps below, using either BOSH CLI v1 or BOSH CLI v2, to recover a cluster that has lost quorum.

Step 1: Choose the Correct Manifest (BOSH CLI v1)

  1. Log in to the BOSH Director by running bosh target DIRECTOR-URL followed by bosh login USERNAME PASSWORD.

  2. Run bosh deployments.

    $ bosh deployments
    Acting as user 'director' on 'p-bosh-30c19bdd43c55c627d70'
    
    +-------------------------+-------------------------------+----------------------------------------------+--------------+
    | Name                    | Release(s)                    | Stemcell(s)                                  | Cloud Config |
    +-------------------------+-------------------------------+----------------------------------------------+--------------+
    | cf-e82cbf44613594d8a155 | cf-autoscaling/28             | bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3140 | none         |
    |                         | cf-mysql/23                   |                                              |              |
    |                         | cf/225                        |                                              |              |
    |                         | diego/0.1441.0                |                                              |              |
    |                         | etcd/18                       |                                              |              |
    |                         | garden-linux/0.327.0          |                                              |              |
    |                         | notifications-ui/10           |                                              |              |
    |                         | notifications/19              |                                              |              |
    |                         | push-apps-manager-release/397 |                                              |              |
    +-------------------------+-------------------------------+----------------------------------------------+--------------+
    
  3. Download the manifest.

    $ bosh download manifest cf-e82cbf44613594d8a155 /tmp/cf.yml
    Acting as user 'director' on deployment 'cf-e82cbf44613594d8a155' on 'p-bosh-30c19bdd43c55c627d70'
    Deployment manifest saved to `/tmp/cf.yml'
    
  4. Set BOSH to use the deployment manifest you downloaded.

    $ bosh deployment /tmp/cf.yml
    
  5. Continue to Step 2: Run the Bootstrap Errand below.

Step 2: Run the Bootstrap Errand

Elastic Runtime versions 1.7.0 and later include a BOSH errand to automate the process of bootstrapping. The bootstrap errand automates the steps described in the Manual Bootstrapping section below. It finds the node with the highest transaction sequence number and asks it to start up by itself in bootstrap mode. Finally, it asks the remaining nodes to join the cluster.
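
For intuition, the selection logic that the errand automates, comparing Galera sequence numbers across nodes and picking the highest, can be sketched in shell. The sketch below is illustrative only and is not the errand's implementation; it assumes direct SSH access as root to each MySQL node, and the NODES list is a placeholder:

    # Illustrative sketch only. The errand itself runs
    # /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap, which also handles
    # crashed nodes and restart ordering.
    NODES="203.0.113.55 203.0.113.56 203.0.113.57"   # placeholder node IPs

    best_node=""
    best_seqno=-1
    for node in $NODES; do
      # Read the Galera sequence number recorded in each node's state file.
      seqno=$(ssh "$node" "awk '/seqno:/ {print \$2}' /var/vcap/store/mysql/grastate.dat")
      if [ "$seqno" -gt "$best_seqno" ]; then
        best_seqno=$seqno
        best_node=$node
      fi
    done
    echo "Bootstrap from $best_node (seqno $best_seqno)"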

In most cases, running the errand will recover your cluster. However, certain scenarios require additional steps. To determine which set of instructions to follow, you must determine the state of your Virtual Machines (VMs).

  1. Run bosh instances and examine the output.
    • If the output of bosh instances shows the state of the jobs as failing, proceed to Scenario 1.
      $ bosh instances
      [...]
      +--------------------------------------------------+---------+------------------------------------------------+------------+
      | Instance                                         | State   | Resource Pool                                  | IPs        |
      +--------------------------------------------------+---------+------------------------------------------------+------------+
      | mysql-partition-a813339fde9330e9b905/0           | failing | mysql-partition-a813339fde9330e9b905           | 203.0.113.55 |
      | mysql-partition-a813339fde9330e9b905/1           | failing | mysql-partition-a813339fde9330e9b905           | 203.0.113.56 |
      | mysql-partition-a813339fde9330e9b905/2           | failing | mysql-partition-a813339fde9330e9b905           | 203.0.113.57 |
      +--------------------------------------------------+---------+------------------------------------------------+------------+
      
    • If the output of bosh instances shows the state of jobs as unknown/unknown, proceed to Scenario 2.
      $ bosh instances
      +--------------------------------------------------+--------------------+------------------------------------------------+------------+
      | Instance                                         | State              | Resource Pool                                  | IPs        |
      +--------------------------------------------------+--------------------+------------------------------------------------+------------+
      | unknown/unknown                                  | unresponsive agent |                                                |            |
      +--------------------------------------------------+--------------------+------------------------------------------------+------------+
      | unknown/unknown                                  | unresponsive agent |                                                |            |
      +--------------------------------------------------+--------------------+------------------------------------------------+------------+
      | unknown/unknown                                  | unresponsive agent |                                                |            |
      +--------------------------------------------------+--------------------+------------------------------------------------+------------+
      

Scenario 1: Virtual Machines Running, Cluster Disrupted

In this scenario, nodes are up and running, but the cluster has been disrupted. You can run the bootstrap errand without recreating the VMs.

  1. Run bosh run errand bootstrap. The errand command prints the following message when finished running:

    Bootstrap errand completed

    [stderr]
    + echo 'Started bootstrap errand ...'
    + JOB_DIR=/var/vcap/jobs/bootstrap
    + CONFIG_PATH=/var/vcap/jobs/bootstrap/config/config.yml
    + /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
    + echo 'Bootstrap errand completed'
    + exit 0

    Errand `bootstrap' completed successfully (exit code 0)

    Note: Sometimes the bootstrap errand fails on the first try. If this happens, run the command again in a few minutes.

  2. If the errand fails, try performing the steps automated by the errand manually by following the Manual Bootstrapping procedure.

Scenario 2: Virtual Machines Terminated or Lost

In this scenario, severe circumstances such as power failure have terminated all of your VMs. You need to recreate the VMs before you can recover the cluster.

  1. If you enabled the VM Resurrector in Ops Manager, the system detects the terminated VMs and automatically attempts to recreate them. Run bosh tasks recent --no-filter to see the scan and fix job run by the VM Resurrector.

    $ bosh tasks recent --no-filter
    +-----+------------+-------------------------+----------+--------------------------------------------+---------------------------------------------------+
    | #   | State      | Timestamp               | User     | Description                                | Result                                            |
    +-----+------------+-------------------------+----------+--------------------------------------------+---------------------------------------------------+
    | 123 | queued     | 2016-01-08 00:18:07 UTC | director | scan and fix                               |                                                   |
    

    If you have not enabled the VM Resurrector, run the BOSH Cloudcheck command bosh cck to delete any placeholder VMs. When prompted, choose Delete VM reference by entering 3.

    $ bosh cck
    
    Acting as user 'director' on deployment 'cf-e82cbf44613594d8a155' on 'p-bosh-30c19bdd43c55c627d70'
    Performing cloud check...
    
    Director task 34
      Started scanning 22 vms
      Started scanning 22 vms > Checking VM states. Done (00:00:10)
      Started scanning 22 vms > 19 OK, 0 unresponsive, 3 missing, 0 unbound, 0 out of sync. Done (00:00:00)
         Done scanning 22 vms (00:00:10)
    
      Started scanning 10 persistent disks
      Started scanning 10 persistent disks > Looking for inactive disks. Done (00:00:02)
      Started scanning 10 persistent disks > 10 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
         Done scanning 10 persistent disks (00:00:02)
    
    Task 34 done
    
    Started   2015-11-26 01:42:42 UTC
    Finished  2015-11-26 01:42:54 UTC
    Duration  00:00:12
    
    Scan is complete, checking if any problems found.
    
    Found 3 problems
    
    Problem 1 of 3: VM with cloud ID `i-afe2801f' missing.
        1. Skip for now
        2. Recreate VM
        3. Delete VM reference
    Please choose a resolution [1 - 3]: 3
    
    Problem 2 of 3: VM with cloud ID `i-36741a86' missing.
        1. Skip for now
        2. Recreate VM
        3. Delete VM reference
    Please choose a resolution [1 - 3]: 3
    
    Problem 3 of 3: VM with cloud ID `i-ce751b7e' missing.
        1. Skip for now
        2. Recreate VM
        3. Delete VM reference
    Please choose a resolution [1 - 3]: 3
    
    Below is the list of resolutions you've provided
    Please make sure everything is fine and confirm your changes
    
        1. VM with cloud ID `i-afe2801f' missing.
         Delete VM reference
    
        2. VM with cloud ID `i-36741a86' missing.
         Delete VM reference
    
        3. VM with cloud ID `i-ce751b7e' missing.
         Delete VM reference
    
    Apply resolutions? (type 'yes' to continue): yes
    Applying resolutions...
    
    Director task 35
      Started applying problem resolutions
      Started applying problem resolutions > missing_vm 11: Delete VM reference. Done (00:00:00)
      Started applying problem resolutions > missing_vm 27: Delete VM reference. Done (00:00:00)
      Started applying problem resolutions > missing_vm 26: Delete VM reference. Done (00:00:00)
         Done applying problem resolutions (00:00:00)
    
    Task 35 done
    
    Started   2015-11-26 01:47:08 UTC
    Finished  2015-11-26 01:47:08 UTC
    Duration  00:00:00
    Cloudcheck is finished
    
  2. Run bosh instances and examine the output. The VMs transition from unresponsive agent to starting. Ultimately, two appear as failing. Do not proceed to the next step until all three VMs are in the starting or failing state.

    $ bosh instances
    [...]
    +--------------------------------------------------+----------+------------------------------------------------+------------+
    | mysql-partition-e97dae91e44681e0b543/0           | starting | mysql-partition-e97dae91e44681e0b543           | 203.0.113.60 |
    | mysql-partition-e97dae91e44681e0b543/1           | failing  | mysql-partition-e97dae91e44681e0b543           | 203.0.113.61 |
    | mysql-partition-e97dae91e44681e0b543/2           | failing  | mysql-partition-e97dae91e44681e0b543           | 203.0.113.62 |
    +--------------------------------------------------+----------+------------------------------------------------+------------+
    
  3. Complete the following steps to prepare your deployment for the bootstrap errand:

    1. Run bosh edit deployment to launch a vi editor and modify the deployment.
    2. Search for the jobs section: jobs.
    3. Search for the mysql-partition: mysql-partition.
    4. Search for the update section: update.
    5. Change max_in_flight to 3.
    6. Below the max_in_flight line, add a new line: canaries: 0.
    7. Set update.serial to false.
    8. Run bosh deploy.
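
     For reference, after steps 5 through 7 the update block of the mysql-partition job should resemble the following sketch. The partition name is an example from this topic; only the canaries, max_in_flight, and serial values change:

       jobs:
       - name: mysql-partition-a813339fde9330e9b905   # example name; yours differs
         # ... other job properties unchanged ...
         update:
           canaries: 0        # added
           max_in_flight: 3   # changed from 1
           serial: false      # changed from true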
  4. Run bosh run errand bootstrap.

  5. Run bosh instances and examine the output to confirm that the errand completes successfully. Some instances may still appear as failing.

  6. Complete the following steps to restore the BOSH configuration:

    1. Run bosh edit deployment.
    2. Re-set canaries to 1, max_in_flight to 1, and serial to true in the same manner as above.
    3. Run bosh deploy.
    4. Validate that all mysql instances are in running state.

    Note: You must reset the values in the BOSH manifest to ensure successful future deployments and accurate reporting of the status of your jobs.

  7. If this procedure fails, try performing the steps automated by the errand manually by following the Manual Bootstrapping procedure.

Step 1: Choose the Correct Manifest (BOSH CLI v2)

  1. Log in to the BOSH Director by running bosh2 -e MY-ENV log-in. Replace MY-ENV with the environment where you deployed the cluster.

    $ bosh2 -e gcp log-in
    

  2. Run bosh2 -e MY-ENV deployments. Replace MY-ENV with the environment where you deployed the cluster.

    $ bosh2 -e gcp deployments
    Using environment '192.168.56.6' as client 'admin'
    Name                    Release(s)                Stemcell(s)                                         Team(s)  Cloud Config
    cf                      binary-buildpack/1.0.9    bosh-warden-boshlite-ubuntu-trusty-go_agent/3363.9  -        latest
                            capi/1.21.0
                            cf-mysql/34
                            cf-smoke-tests/11
                            cflinuxfs2-rootfs/1.52.0
                            consul/155
                            diego/1.8.1
                            garden-runc/1.2.0
                            loggregator/78
                            nats/15
                            routing/0.145.0
                            statsd-injector/1.0.20
                            uaa/25
    service-instance        mysql/0.6.0               bosh-warden-boshlite-ubuntu-trusty-go_agent/3363.9  -        latest
    
    2 deployments
    
    Succeeded
    
  3. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT manifest > /tmp/MANIFEST.yml to download the manifest. Replace the example text with the following:

    • MY-ENV: the environment where you deployed the cluster
    • MY-DEPLOYMENT: the name of your deployment cluster
    • MANIFEST.yml: the name you want to give the manifest
    $ bosh2 -e gcp -d mysql manifest > /tmp/mysql.yml
    

Step 2: Run the Bootstrap Errand

Elastic Runtime versions 1.7.0 and later include a BOSH errand to automate the process of bootstrapping. The bootstrap errand automates the steps described in the Manual Bootstrapping section below. It finds the node with the highest transaction sequence number and asks it to start up by itself in bootstrap mode. Finally, it asks the remaining nodes to join the cluster.

In most cases, running the errand will recover your cluster. However, certain scenarios require additional steps. To determine which set of instructions to follow, you must determine the state of your Virtual Machines (VMs).

  1. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT instances and examine the output. Replace MY-ENV with the environment where you deployed the cluster and MY-DEPLOYMENT with the deployment cluster name.
    $ bosh2 -e gcp -d mysql instances
    
    • If the output of bosh2 instances shows the state of the jobs as failing, proceed to Scenario 1.
    • If the output of bosh2 instances shows the state of jobs as unknown/unknown, proceed to Scenario 2.

Scenario 1: Virtual Machines Running, Cluster Disrupted

In this scenario, nodes are up and running, but the cluster has been disrupted. You can run the bootstrap errand without recreating the VMs.

  1. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT run-errand bootstrap. Replace MY-ENV with the name of the environment where you deployed the cluster and MY-DEPLOYMENT with the deployment cluster name.

    $ bosh2 -e gcp -d mysql run-errand bootstrap
    

    Note: Sometimes the bootstrap errand fails on the first try. If this happens, run the command again in a few minutes.

  2. If the errand fails, try performing the steps automated by the errand manually by following the Manual Bootstrapping procedure.

Scenario 2: Virtual Machines Terminated or Lost

In this scenario, severe circumstances such as power failure have terminated all of your VMs. You need to recreate the VMs before you can recover the cluster.

  1. If you enabled the VM Resurrector in Ops Manager, the system detects the terminated VMs and automatically attempts to recreate them. Run bosh2 -e MY-ENV tasks to see the scan and fix job run by the VM Resurrector.

    $ bosh2 -e gcp tasks
    

    If you have not enabled the VM Resurrector, run the BOSH Cloudcheck command bosh2 -e MY-ENV -d MY-DEPLOYMENT cloud-check to delete any placeholder VMs. When prompted, choose Delete VM reference by entering 6.

    $ bosh2 -e gcp -d mysql cloud-check
    
    Using environment '192.168.56.6' as user 'director' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)
    
    Task 34
    
    19:19:12 | Scanning 21 VMs: Checking VM states (00:00:16)
    19:19:28 | Scanning 21 VMs: 19 OK, 2 unresponsive, 0 missing, 0 unbound (00:00:00)
    19:19:28 | Scanning 5 persistent disks: Looking for inactive disks (00:00:00)
    19:19:28 | Scanning 5 persistent disks: 5 OK, 0 missing, 0 inactive, 0 mount-info mismatch (00:00:00)
    
    Started  Fri Aug  4 19:19:12 UTC 2017
    Finished Fri Aug  4 19:19:28 UTC 2017
    Duration 00:00:16
    
    Task 34 done
    
    #  Type                Description
    1  unresponsive_agent  VM for 'uaa/0 (0)' with cloud ID 'vm-001' is not responding.
    2  unresponsive_agent  VM for 'mysql/0 (0)' with cloud ID 'vm-007' is not responding.
    
    2 problems
    
    1: Skip for now
    2: Reboot VM
    3: Recreate VM without waiting for processes to start
    4: Recreate VM and wait for processes to start
    5: Delete VM
    6: Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)
    
  2. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT instances and examine the output. Do not proceed to the next step until all VMs are in either the starting or failing state.

    $ bosh2 -e gcp -d mysql instances
    
  3. Complete the following steps to prepare your deployment for the bootstrap errand:

    1. Open /tmp/MANIFEST.yml in a text editor.
    2. Search for the jobs section: jobs.
    3. Search for the mysql-partition: mysql-partition.
    4. Search for the update section: update.
    5. Change max_in_flight to 3.
    6. Below the max_in_flight line, add a new line: canaries: 0.
    7. Set update.serial to false.
    8. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT deploy /tmp/MANIFEST.yml.
  4. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT run-errand bootstrap. Replace MY-ENV with the name of the environment where you deployed the cluster and MY-DEPLOYMENT with the deployment cluster name.

  5. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT instances and examine the output to confirm that the errand completes successfully. Some instances may still appear as failing.

  6. Complete the following steps to restore the BOSH configuration:

    1. Open /tmp/MANIFEST.yml in a text editor.
    2. Re-set canaries to 1, max_in_flight to 1, and serial to true in the same manner as above.
    3. Run bosh2 -e MY-ENV -d MY-DEPLOYMENT deploy /tmp/MANIFEST.yml.
    4. Validate that all mysql instances are in running state.

    Note: You must reset the values in the BOSH manifest to ensure successful future deployments and accurate reporting of the status of your jobs.

  7. If this procedure fails, try performing the steps automated by the errand manually by following the Manual Bootstrapping procedure.

Manual Bootstrapping

Note: The following steps are prone to user error and can result in lost data if followed incorrectly. Please follow the Run the Bootstrap Errand instructions above first, and only resort to the manual process if the errand fails to repair the cluster.

If the bootstrap errand cannot recover the cluster, you need to perform the steps automated by the errand manually.

  • If the output of bosh instances shows the state of the jobs as failing (Scenario 1), proceed directly to the manual steps below.
  • If the output of bosh instances shows the state of the jobs as unknown/unknown, perform Steps 1-3 of Scenario 2, substitute the manual steps below for Step 4, and then perform Steps 5-6 of Scenario 2.
  1. SSH to each node in the cluster and, as root, shut down the mariadb process.

    $ monit stop mariadb_ctrl
    

    Re-bootstrapping the cluster will not be successful unless all other nodes have been shut down.
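
    To confirm that the process has stopped on a node before moving on, you can check monit. Once the stop completes, the status line typically reads not monitored:

      $ monit summary | grep mariadb_ctrl
      Process 'mariadb_ctrl'              not monitored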

  2. Choose a node to bootstrap by locating the node with the highest transaction sequence number (seqno). You can obtain the seqno of a stopped node in one of two ways:

    • If a node shut down gracefully, the seqno is in the Galera state file of the node.
      $ cat /var/vcap/store/mysql/grastate.dat | grep 'seqno:'
      
    • If the node crashed or was killed, the seqno in the Galera state file of the node is -1. In this case, the seqno may be recoverable from the database.
      1. Run the following command to start up the database, log the recovered sequence number, and exit.
        $ /var/vcap/packages/mariadb/bin/mysqld --wsrep-recover
        
      2. Scan the error log for the recovered sequence number. The last number after the group id (uuid) is the recovered seqno:
        $ grep "Recovered position" /var/vcap/sys/log/mysql/mysql.err.log | tail -1
        150225 18:09:42 mysqld_safe WSREP: Recovered position e93955c7-b797-11e4-9faa-9a6f0b73eb46:15
        
        If the node never connected to the cluster before crashing, it may not have a group id (uuid in grastate.dat). In this case, you cannot recover the seqno. Unless all nodes crashed this way, do not choose this node for bootstrapping.
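
     For comparison with the crashed-node case above, the Galera state file on a cleanly stopped node records the actual seqno. The file resembles the following; the uuid and seqno values shown are examples:

       $ cat /var/vcap/store/mysql/grastate.dat
       # GALERA saved state
       version: 2.1
       uuid:    e93955c7-b797-11e4-9faa-9a6f0b73eb46
       seqno:   15
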
  3. Choose the node with the highest seqno value as the bootstrap node. If all nodes have the same seqno, you can choose any node as the bootstrap node.

    Note: Only perform these bootstrap commands on the node with the highest seqno. Otherwise, the node with the highest seqno will be unable to join the new cluster unless its data is abandoned. Its mariadb process will exit with an error. See the Cluster Scaling, Node Failure, and Quorum topic for more details on intentionally abandoning data.

  4. On the bootstrap node, update the state file and restart the mariadb process.

    $ echo -n "NEEDS_BOOTSTRAP" > /var/vcap/store/mysql/state.txt
    $ monit start mariadb_ctrl
    
  5. Check that the mariadb process has started successfully.

    $ watch monit summary
    

    It can take up to ten minutes for monit to start the mariadb process.
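
    When the bootstrapped node is ready, the mariadb_ctrl line in the monit summary output typically changes to running:

      Process 'mariadb_ctrl'              running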

  6. Once the bootstrapped node is running, start the mariadb process on the remaining nodes using monit.

    $ monit start mariadb_ctrl
    
  7. Verify that the new nodes have successfully joined the cluster. The following command displays the total number of nodes in the cluster:

    mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
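
    Assuming the default three-node cluster, the query above returns output resembling the following once all nodes have rejoined; a Value of 3 confirms that the cluster is complete:

    +--------------------+-------+
    | Variable_name      | Value |
    +--------------------+-------+
    | wsrep_cluster_size | 3     |
    +--------------------+-------+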
    
  8. Complete the following steps to restore the BOSH configuration:

    1. Run bosh edit deployment.
    2. Re-set canaries to 1, max_in_flight to 1, and serial to true in the same manner as above.
    3. Run bosh deploy.
    4. Validate that all mysql instances are in running state.

    Note: You must reset the values in the BOSH manifest to ensure successful future deployments and accurate reporting of the status of your jobs.
