Recovering From MySQL Cluster Downtime

This topic describes the procedure for recovering a terminated Elastic Runtime MySQL cluster using the bootstrapping process.

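You can bootstrap your cluster by using one of two methods:

  • Run the bootstrap errand. See Run the Bootstrap Errand below.
  • Bootstrap manually. See Bootstrap Manually below.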

Note: The procedures below assume you are using BOSH CLI v2 or later. For more information about BOSH v2, see Commands in the BOSH documentation.

When to Bootstrap

You must bootstrap a cluster that loses quorum. A cluster loses quorum when fewer than half of the nodes can communicate with each other for longer than the configured grace period. If a cluster does not lose quorum, individual unhealthy nodes automatically rejoin the cluster once the error is resolved, the node is restarted, or connectivity is restored.

You can detect lost quorum through the following symptoms:

  • All nodes appear “Unhealthy” on the proxy dashboard, viewable at proxy-BOSH-JOB-INDEX.p-mysql.YOUR-SYSTEM-DOMAIN.
  • All responsive nodes report the value of wsrep_cluster_status as non-Primary:

    mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
    +----------------------+-------------+
    | Variable_name        | Value       |
    +----------------------+-------------+
    | wsrep_cluster_status | non-Primary |
    +----------------------+-------------+
    
  • All responsive nodes respond with ERROR 1047 when queried with most statement types:

    mysql> select * from mysql.user;
    ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
    

See the Cluster Scaling, Node Failure, and Quorum topic for more details about determining cluster state.
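
The wsrep_cluster_status check above can be scripted. The following is a minimal sketch, assuming the mysql client is run with the -N and -B options so that it prints the variable name and value separated by a tab. The function name cluster_is_primary is hypothetical and is not part of the product.

```shell
# Hypothetical helper: reads the output of
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_status'"
# on stdin and succeeds only if the node reports Primary.
cluster_is_primary() {
  # Take the value column of the wsrep_cluster_status row.
  status=$(awk '$1 == "wsrep_cluster_status" { print $2 }')
  [ "$status" = "Primary" ]
}
```

For example, on a node where the mysql client can already authenticate, you might pipe the query output into cluster_is_primary and treat a non-zero exit status on every responsive node as a sign of lost quorum.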

Run the Bootstrap Errand

The following sections describe what the bootstrap errand is and how to use it based on the type of cluster failure.

About the Bootstrap Errand

The bootstrap errand automates the steps described in the Manual Bootstrapping section below. It finds the node with the highest transaction sequence number and asks it to start up by itself in bootstrap mode. Finally, it asks the remaining nodes to join the cluster.

In most cases, running the errand recovers your cluster. However, certain scenarios require additional steps.

Determine Type of Cluster Failure

To determine which set of instructions to follow, do the following:

  1. Run the following command.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
    

    Where:

    • YOUR-ENV is the environment where you deployed the cluster.
    • YOUR-DEPLOYMENT is the deployment cluster name.

    For example:

    $ bosh -e prod -d mysql instances
    

  2. Find and record the Process State for your MySQL instances. In the following example output, the MySQL instances are in the failing process state.

    Instance                                                             Process State  AZ             IPs
    backup-prepare/c635410e-917d-46aa-b054-86d222b6d1c0                  running        us-central1-b  10.0.4.47
    bootstrap/a31af4ff-e1df-4ff1-a781-abc3c6320ed4                       -              us-central1-b  -
    broker-registrar/1a93e53d-af7c-4308-85d4-3b2b80d504e4                -              us-central1-b  10.0.4.58
    cf-mysql-broker/137d52b8-a1b0-41f3-847f-c44f51f87728                 running        us-central1-c  10.0.4.57
    cf-mysql-broker/28b463b1-cc12-42bf-b34b-82ca7c417c41                 running        us-central1-b  10.0.4.56
    deregister-and-purge-instances/4cb93432-4d90-4f1d-8152-d0c238fa5aab  -              us-central1-b  -
    monitoring/f7117dcb-1c22-495e-a99e-cf2add90dea9                      running        us-central1-b  10.0.4.48
    mysql/220fe72a-9026-4e2e-9fe3-1f5c0b6bf09b                           failing        us-central1-b  10.0.4.44
    mysql/28a210ac-cb98-4ab4-9672-9f4c661c57b8                           failing        us-central1-f  10.0.4.46
    mysql/c1639373-26a2-44ce-85db-c9fe5a42964b                           failing        us-central1-c  10.0.4.45
    proxy/87c5683d-12f5-426c-b925-62521529f64a                           running        us-central1-b  10.0.4.60
    proxy/b0115ccd-7973-42d3-b6de-edb5ae53c63e                           running        us-central1-c  10.0.4.61
    rejoin-unsafe/8ce9370a-e86b-4638-bf76-e103f858413f                   -              us-central1-b  -
    smoke-tests/e026aaef-efd9-4644-8d14-0811cb1ba733                     -              us-central1-b  10.0.4.59
    

  3. Choose your scenario:

    • If your MySQL instances are in the failing state, continue to Scenario 1.
    • If your MySQL instances are in the - state, continue to Scenario 2.
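
The scenario choice above can be sketched as a small filter over the instances table. This sketch assumes that the instance name is the first whitespace-separated column and the Process State is the second, as in the example output above. The function name classify_mysql_failure is hypothetical.

```shell
# Hypothetical helper: reads the table printed by `bosh ... instances` on stdin
# and reports which recovery scenario the mysql rows point to.
classify_mysql_failure() {
  awk '
    $1 ~ /^mysql\// {
      total++
      if ($2 == "failing") failing++
      else if ($2 == "-") lost++
    }
    END {
      if (total == 0)            print "no mysql instances found"
      else if (failing == total) print "Scenario 1"
      else if (lost == total)    print "Scenario 2"
      else                       print "mixed states"
    }
  '
}
```

For example, you might pipe the output of bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances into classify_mysql_failure. A result of "mixed states" means the rows need to be inspected by hand.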

Scenario 1: Virtual Machines Running, Cluster Disrupted

In this scenario, the VMs are running, but the jobs are failing.

To bootstrap in this scenario, do the following:

  1. Run the bootstrap errand.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand bootstrap
    

    Note: The errand can run for a long time without returning any output.

    When the errand finishes, the command returns many lines of output. On success, the output ends with lines similar to the following:

    Bootstrap errand completed
    [stderr]
    echo 'Started bootstrap errand ...'
    JOB_DIR=/var/vcap/jobs/bootstrap
    CONFIG_PATH=/var/vcap/jobs/bootstrap/config/config.yml
    /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
    echo 'Bootstrap errand completed'
    exit 0
    Errand `bootstrap' completed successfully (exit code 0)
    

  2. If the errand fails, run the bootstrap errand command again after a few minutes. The bootstrap errand may not work immediately.

  3. If the errand fails after several tries, bootstrap your cluster manually. See Bootstrap Manually below.
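
The retry advice in steps 2 and 3 can be sketched as a generic wrapper. The function name retry, the attempt count, and the delay are assumptions chosen for illustration. They are not values recommended by the product.

```shell
# Hypothetical helper: run a command up to MAX times, sleeping DELAY seconds
# between failed attempts. Returns 0 on the first success and 1 if every
# attempt fails.
# Usage: retry MAX DELAY COMMAND [ARGS...]
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}
```

For example, retry 3 300 bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand bootstrap would try the errand up to three times with five minutes between attempts. If it still fails, continue to the manual procedure.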

Scenario 2: Virtual Machines Terminated or Lost

In severe circumstances, such as a power failure, it is possible to lose all your VMs. You must recreate them before you can begin recovering the cluster.

When MySQL instances are in the - state, the VMs are lost. The procedures in this scenario bring the instances from a - state to a failing state. You then run the bootstrap errand as in Scenario 1 above and restore the BOSH configuration.

To recover terminated or lost VMs, do the procedures in the sections below:

  1. Recreate the Missing VMs: Bring MySQL instances from a - state to a failing state.

  2. Run the Bootstrap Errand: Because your instances are now in the failing state, continue as in Scenario 1 above.

  3. Restore the BOSH Configuration: Unignore all instances and redeploy. This step is mandatory.

WARNING: If you do not bosh unignore your instances, your instances are not updated in future deploys. You must perform the procedure in the final section of Scenario 2, Restore the BOSH Configuration.

Recreate the Missing VMs

The procedure in this section uses BOSH to recreate the VMs, install software on them, and try to start the jobs.

The procedure below allows you to do the following:

  • Redeploy your cluster while expecting the jobs to fail.

  • Instruct BOSH to ignore the state of each instance in your cluster. This allows BOSH to deploy the software to each instance even if the instance is failing.

To recreate your missing VMs, do the following:

  1. If BOSH resurrection is enabled, disable it.

    bosh -e YOUR-ENV update-resurrection off
    
  2. Download the current manifest.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT manifest > /tmp/manifest.yml
    
  3. Redeploy and expect one of the MySQL VMs to fail. Deploying causes BOSH to create new VMs and install the software. The cluster is formed in a later step.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy /tmp/manifest.yml
    
  4. Run the following command and record the instance GUID of the VM that attempted to start. Your instance GUID is the string after mysql/ in your BOSH instances output.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
    
  5. Tell BOSH to ignore your MySQL instance. Ignoring the state allows BOSH to deploy software to the failed instance.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/INSTANCE-GUID
    

    Where:

    • YOUR-ENV is the environment where you deployed the cluster.
    • YOUR-DEPLOYMENT is the deployment cluster name.
    • INSTANCE-GUID is the string after mysql/ in your BOSH instances output. For example:
      $ bosh -e prod -d mysql ignore mysql/220fe72a-9026-4e2e-9fe3-1f5c0b6bf09b
      
  6. Repeat steps 3 through 5 until all MySQL instances have attempted to start.

  7. Re-enable BOSH resurrection if you disabled it in the first step.

    bosh -e YOUR-ENV update-resurrection on
    
  8. Verify that your MySQL instances have gone from the - state to the failing state.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
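
Steps 4 and 5 above require reading instance names out of the instances table. The following is a minimal sketch of that lookup, assuming the same column layout as the example output earlier in this topic. The function name mysql_instances_in_state is hypothetical.

```shell
# Hypothetical helper: reads `bosh ... instances` output on stdin and prints
# the mysql/GUID name of every mysql instance whose Process State equals $1.
mysql_instances_in_state() {
  awk -v want="$1" '$1 ~ /^mysql\// && $2 == want { print $1 }'
}
```

For example, piping the instances output into mysql_instances_in_state failing prints the names that can be passed to bosh ignore, and mysql_instances_in_state - prints the instances that have not yet attempted to start.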
    
Run the Bootstrap Errand

All MySQL instances have a failing process state, but they now have the MySQL code installed on them. In this section, the bootstrap process recovers the cluster.

To bootstrap, do the following:

  1. Run the bootstrap errand.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand bootstrap
    

    Note: The errand can run for a long time without returning any output.

    When the errand finishes, the command returns many lines of output. On success, the output ends with lines similar to the following:

    Bootstrap errand completed
    [stderr]
    echo 'Started bootstrap errand ...'
    JOB_DIR=/var/vcap/jobs/bootstrap
    CONFIG_PATH=/var/vcap/jobs/bootstrap/config/config.yml
    /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
    echo 'Bootstrap errand completed'
    exit 0
    Errand `bootstrap' completed successfully (exit code 0)
    

  2. If the errand fails, run the bootstrap errand command again after a few minutes. The bootstrap errand may not work immediately.

  3. Confirm in the shell output that the errand completed successfully, then continue to Restore the BOSH Configuration below.

    Note: After you complete the bootstrap errand, you may still see instances in the failing state. Continue to the next section anyway.

Restore the BOSH Configuration

WARNING: If you do not run bosh unignore on every ignored instance, those instances are not updated in future deploys.

To restore your BOSH configuration to its previous state, unignore each instance that you previously ignored:

  1. Set each ignored instance to unignore.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/INSTANCE-GUID
    
  2. Redeploy, using the manifest that you downloaded in Recreate the Missing VMs.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy /tmp/manifest.yml
    
  3. Validate that all mysql instances are in a running state.

    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
    

Bootstrap Manually

If the bootstrap errand is not able to automatically recover the cluster, you might need to do the steps manually.

WARNING: The following procedures are prone to user error and can result in lost data if followed incorrectly. Follow the procedure in Run the Bootstrap Errand above first, and only resort to the manual process if the errand fails to repair the cluster.

Do the procedures in the sections below to manually bootstrap your cluster.

Shut Down MariaDB

Do the following for each node in the cluster:

  1. SSH into the node. See the BOSH CLI v2 instructions for SSHing into BOSH-deployed VMs.

  2. Shut down the mariadb process on the node. Run the following command:

    monit stop mariadb_ctrl

Re-bootstrapping the cluster is not successful unless you shut down the mariadb process on all nodes in the cluster.

Choose Node to Bootstrap

To choose the node to bootstrap, you must find the node with the highest transaction sequence number (seqno).

Do the following to find the node with the highest seqno:

  1. Run the following command from the node:

    cat /var/vcap/store/mysql/grastate.dat | grep 'seqno:'

  2. If a node shut down gracefully, the seqno is in the Galera state file. Record the seqno and continue to step 3.

    If a node crashed or was killed, the seqno in the Galera state file is recorded as -1. In this case, the seqno might be recoverable from the database. Run the following command to start up the database, log the recovered seqno, and then exit:

    /var/vcap/packages/mariadb/bin/mysqld --wsrep-recover

    Scan the error log for the recovered seqno. It is the last number after the group id (uuid). For example:

    $ grep "Recovered position" /var/vcap/sys/log/mysql/mysql.err.log | tail -1
    150225 18:09:42 mysqld_safe WSREP: Recovered position e93955c7-b797-11e4-9faa-9a6f0b73eb46:15
    
    If the node never connected to the cluster before crashing, it may not even have a group id (uuid in grastate.dat). In this case, there is nothing to recover. Unless all nodes crashed this way, do not choose this node for bootstrapping.

  3. After determining the seqno for all nodes in your cluster, identify the node with the highest seqno. If all nodes have the same seqno, you can choose any node as the new bootstrap node.
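
The seqno comparison above can be sketched with a few small helpers. This sketch assumes that grastate.dat stores the value on a line of the form "seqno: N", as the grep command above implies, and that the recovered position is logged as UUID:SEQNO, as in the example log line. All function names are hypothetical.

```shell
# Hypothetical helpers for comparing Galera sequence numbers.

# Print the seqno recorded in a grastate.dat file (path given as $1).
grastate_seqno() {
  awk '$1 == "seqno:" { print $2 }' "$1"
}

# Print the seqno from a "Recovered position UUID:SEQNO" log line given as $1.
# The seqno is everything after the last colon.
recovered_seqno() {
  printf '%s\n' "${1##*:}"
}

# Read "NODE SEQNO" pairs on stdin, one per line, and print the node with the
# highest seqno. On a tie, the first node listed wins.
highest_seqno_node() {
  awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; node = $1 } END { print node }'
}
```

For example, after collecting one "NODE SEQNO" line per node into a file by hand, you might feed that file to highest_seqno_node to pick the bootstrap node. A node whose seqno is still -1 after recovery sorts below any node with a real seqno.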

Bootstrap the First Node

After determining the node with the highest seqno, do the following to bootstrap the node:

Note: Only run these bootstrap commands on the node with the highest seqno. Otherwise the node with the highest seqno is unable to join the new cluster unless its data is abandoned. Its mariadb process exits with an error.

  1. On the new bootstrap node, update the state file and restart the mariadb process. Run the following commands:

    echo -n "NEEDS_BOOTSTRAP" > /var/vcap/store/mysql/state.txt
    monit start mariadb_ctrl

  2. It can take up to ten minutes for monit to start the mariadb process. To check if the mariadb process has started successfully, run the following command:

    watch monit summary
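
As a non-interactive alternative to watching the summary, the check can be sketched as a filter over the monit summary output. This sketch assumes that monit prints one line per process in the form Process 'NAME' followed by the status, which may vary between monit versions. The function name mariadb_running is hypothetical.

```shell
# Hypothetical helper: reads `monit summary` output on stdin and succeeds only
# if the mariadb_ctrl process line reports the status "running".
mariadb_running() {
  awk '
    $1 == "Process" && $2 ~ /mariadb_ctrl/ {
      # The status can span several words, for example "not monitored".
      state = $3
      for (i = 4; i <= NF; i++) state = state " " $i
    }
    END { exit (state == "running" ? 0 : 1) }
  '
}
```

For example, a loop such as until monit summary | mariadb_running; do sleep 10; done would block until the process reports running. The ten-second interval is an arbitrary choice.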

Restart Remaining Nodes

  1. After the bootstrapped node is running, start the mariadb process on the remaining nodes with monit. On each remaining node, run the following command:

    monit start mariadb_ctrl

    If the node is prevented from starting by the Interruptor, do the manual procedure to force the node to rejoin the cluster, documented in the Pivotal Knowledge Base.

    WARNING: Forcing a node to rejoin the cluster is a destructive procedure. Only do the procedure with the assistance of Pivotal Support.

  2. If the monit start command fails, it might be because the node with the highest seqno is mysql/0. In this case, do the following:

    1. From the Ops Manager VM, use the BOSH CLI to make BOSH ignore mysql/0 during updates:

      bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/0

    2. Navigate to Ops Manager in a browser, log in, and click Apply Changes.
    3. When the deploy finishes, run the following command from the Ops Manager VM:

      bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/0

  3. Verify that the remaining nodes have successfully joined the cluster. SSH into the bootstrap node and run the following command to output the total number of nodes in the cluster:

    mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
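
The cluster size check can be scripted in the same way. The following is a minimal sketch, assuming the mysql client is run with the -N and -B options so that it prints the variable name and value separated by a tab. The function name cluster_size_is is hypothetical.

```shell
# Hypothetical helper: reads the output of
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
# on stdin and succeeds only if the reported size equals the expected size $1.
cluster_size_is() {
  size=$(awk '$1 == "wsrep_cluster_size" { print $2 }')
  [ "$size" = "$1" ]
}
```

For example, for a three-node cluster you might pipe the query output into cluster_size_is 3 and investigate any node that has not rejoined if the check fails.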
