
Bootstrapping

This topic describes how to bootstrap your MySQL cluster in the event of a cluster failure.

When to Bootstrap

Bootstrapping is required only when the cluster has lost quorum. To determine whether you need to bootstrap, check the state of your cluster. See Check Cluster State for more information.

Quorum is lost when fewer than half of the nodes can communicate with each other for longer than the configured grace period. In Galera terminology, a node that can communicate with the rest of the cluster and whose database is in a good state reports itself as synced.
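
For example, you can check whether a node reports itself as synced by querying a standard Galera status variable in the MySQL client; the output below shows a healthy node:

    mysql> SHOW STATUS LIKE 'wsrep_local_state_comment';
    +---------------------------+--------+
    | Variable_name             | Value  |
    +---------------------------+--------+
    | wsrep_local_state_comment | Synced |
    +---------------------------+--------+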

If quorum has not been lost, individual unhealthy nodes automatically rejoin the cluster once repaired, that is, once the error is resolved, the node is restarted, or connectivity is restored.

To check whether your cluster has lost quorum, look for the following symptoms:

  • All nodes appear “Unhealthy” on the proxy dashboard. For example, the dashboard reports 3 out of 3 nodes as unhealthy.
  • All responsive nodes report the value of wsrep_cluster_status as non-Primary in the MySQL client.

    mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
    +----------------------+-------------+
    | Variable_name        | Value       |
    +----------------------+-------------+
    | wsrep_cluster_status | non-Primary |
    +----------------------+-------------+
    
  • All responsive nodes respond with ERROR 1047 when using most statement types in the MySQL client:

    mysql> select * from mysql.user;
    ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
    

If your cluster has lost quorum, use the procedures in Bootstrapping below to manually bootstrap the cluster.

Keep in mind the following:

  • The start script bootstraps node 0 only on the initial deploy. If bootstrapping is necessary at a later date, you must do it manually.
  • Bootstrapping a single node creates a new one-node cluster that the other nodes can then join.

Bootstrapping

Before running the bootstrapping procedures below, you must SSH into the Ops Manager VM and log in to the BOSH Director. For more information, see Prepare to Use the BOSH CLI.

Note: The examples in these instructions reflect a three-node MySQL for Pivotal Cloud Foundry (PCF) deployment. The process for bootstrapping a deployment with two nodes plus an arbitrator is identical, but the output does not match the examples.

Assisted Bootstrap

MySQL for PCF v1.8.0 and later includes a BOSH errand that automates the bootstrapping process. You must still initiate the bootstrap process manually, but the errand reduces the number of manual steps required to complete it.

In most cases, running the errand is sufficient; however, some conditions require additional steps.

How It Works

The bootstrap errand automates the steps of the manual bootstrapping process documented below. It finds the node with the highest transaction sequence number (seqno), starts that node by itself (that is, in bootstrap mode), and then instructs the remaining nodes to join the cluster, as sketched below.
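
The following shell sketch illustrates the selection logic only; it is not the errand's actual implementation, and the node addresses are placeholders taken from the example output below:

    # Illustrative sketch: compare seqnos across nodes to find the bootstrap candidate.
    for ip in 192.0.2.55 192.0.2.56 192.0.2.57; do
      seqno=$(ssh "$ip" "grep 'seqno:' /var/vcap/store/mysql/grastate.dat | cut -d: -f2")
      echo "$ip seqno=$seqno"
    done
    # Bootstrap the node that reports the highest seqno, then start the others.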

Scenario 1: Virtual Machines Running, Cluster Disrupted

In this scenario, the nodes are up and running, but the cluster has been disrupted.

To determine whether the cluster has been disrupted, use the BOSH CLI to list the jobs and see if they are failing.

If you are using PCF v1.10, use the BOSH CLI v1 command bosh vms.

If you are using PCF v1.11 or later, use the BOSH CLI v2 command bosh2 -e YOUR-ENV instances.
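
For example (YOUR-ENV is your BOSH environment alias):

    $ bosh vms                      # BOSH CLI v1 (PCF v1.10)
    $ bosh2 -e YOUR-ENV instances   # BOSH CLI v2 (PCF v1.11 and later)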

The output will resemble the following:

+--------------------------------------------------+---------+------------------------------------------------+------------+
| Instance                                         | State   | Resource Pool                                  | IPs        |
+--------------------------------------------------+---------+------------------------------------------------+------------+
| cf-mysql-broker-partition-a813339fde9330e9b905/0 | running | cf-mysql-broker-partition-a813339fde9330e9b905 | 192.0.2.61 |
| cf-mysql-broker-partition-a813339fde9330e9b905/1 | running | cf-mysql-broker-partition-a813339fde9330e9b905 | 192.0.2.62 |
| mysql-partition-a813339fde9330e9b905/0           | failing | mysql-partition-a813339fde9330e9b905           | 192.0.2.55 |
| mysql-partition-a813339fde9330e9b905/1           | failing | mysql-partition-a813339fde9330e9b905           | 192.0.2.56 |
| mysql-partition-a813339fde9330e9b905/2           | failing | mysql-partition-a813339fde9330e9b905           | 192.0.2.57 |
| proxy-partition-a813339fde9330e9b905/0           | running | proxy-partition-a813339fde9330e9b905           | 192.0.2.59 |
| proxy-partition-a813339fde9330e9b905/1           | running | proxy-partition-a813339fde9330e9b905           | 192.0.2.60 |
+--------------------------------------------------+---------+------------------------------------------------+------------+

In this situation, run the bootstrap errand:

  1. Log in to the BOSH Director.
  2. Select the correct deployment.
  3. If you are using PCF v1.10, use the BOSH CLI v1 command bosh run errand bootstrap.

    If you are using PCF v1.11 or later, use the BOSH CLI v2 command bosh2 run-errand bootstrap.
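
    For example (YOUR-ENV and YOUR-DEP are placeholders for your environment alias and deployment name):

    $ bosh run errand bootstrap                            # BOSH CLI v1
    $ bosh2 -e YOUR-ENV -d YOUR-DEP run-errand bootstrap   # BOSH CLI v2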

You will see many lines of output, eventually followed by:

Bootstrap errand completed

[stderr]
+ echo 'Started bootstrap errand ...'
+ JOB_DIR=/var/vcap/jobs/bootstrap
+ CONFIG_PATH=/var/vcap/jobs/bootstrap/config/config.yml
+ /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
+ echo 'Bootstrap errand completed'
+ exit 0

Errand `bootstrap' completed successfully (exit code 0)

The errand does not always succeed on the first try. If it fails, wait a few minutes and run it again.

Scenario 2: Virtual Machines Terminated or Lost

In more severe circumstances, such as a power failure, all of your VMs can be lost. They must be recreated before you can begin to recover the cluster. In this scenario, the nodes appear as unknown/unknown in the BOSH output.

If you are using PCF v1.10, use the BOSH CLI v1 command bosh vms.

If you are using PCF v1.11 or later, use the BOSH CLI v2 command bosh2 -e YOUR-ENV instances.

The output will resemble the following:

+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| Instance                                         | State              | Resource Pool                                  | IPs        |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| unknown/unknown                                  | unresponsive agent |                                                |            |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| unknown/unknown                                  | unresponsive agent |                                                |            |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| unknown/unknown                                  | unresponsive agent |                                                |            |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| cf-mysql-broker-partition-e97dae91e44681e0b543/0 | running            | cf-mysql-broker-partition-e97dae91e44681e0b543 | 192.0.2.65 |
| cf-mysql-broker-partition-e97dae91e44681e0b543/1 | running            | cf-mysql-broker-partition-e97dae91e44681e0b543 | 192.0.2.66 |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+
| proxy-partition-e97dae91e44681e0b543/0           | running            | proxy-partition-e97dae91e44681e0b543           | 192.0.2.63 |
| proxy-partition-e97dae91e44681e0b543/1           | running            | proxy-partition-e97dae91e44681e0b543           | 192.0.2.64 |
+--------------------------------------------------+--------------------+------------------------------------------------+------------+

Recover Terminated or Lost VMs

To recover your VMs, perform the following steps:

  1. If you use the VM Resurrector, disable it.
  2. Run the BOSH Cloud Check interactive command.
    If you are using PCF v1.10, use the BOSH CLI v1 command bosh cck.
    If you are using PCF v1.11 or later, use the BOSH CLI v2 command bosh2 -e YOUR-ENV -d YOUR-DEP cck.
  3. When prompted, select Recreate VM. If this option fails, select Delete VM reference.

The output will resemble the following:

Acting as user 'director' on deployment 'cf-e82cbf44613594d8a155' on 'p-bosh-30c19bdd43c55c627d70'
Performing cloud check...

Director task 34
Started scanning 22 vms
Started scanning 22 vms > Checking VM states. Done (00:00:10)
Started scanning 22 vms > 19 OK, 0 unresponsive, 3 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 22 vms (00:00:10)

Started scanning 10 persistent disks
Started scanning 10 persistent disks > Looking for inactive disks. Done (00:00:02)
Started scanning 10 persistent disks > 10 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
Done scanning 10 persistent disks (00:00:02)

Task 34 done

Started   2015-11-26 01:42:42 UTC
Finished  2015-11-26 01:42:54 UTC
Duration  00:00:12

Scan is complete, checking if any problems found.

Found 3 problems

Problem 1 of 3: VM with cloud ID `i-afe2801f' missing.
1. Skip for now
2. Recreate VM
3. Delete VM reference
Please choose a resolution [1 - 3]: 2

Problem 2 of 3: VM with cloud ID `i-36741a86' missing.
1. Skip for now
2. Recreate VM
3. Delete VM reference
Please choose a resolution [1 - 3]: 2

Problem 3 of 3: VM with cloud ID `i-ce751b7e' missing.
1. Skip for now
2. Recreate VM
3. Delete VM reference
Please choose a resolution [1 - 3]: 2
  4. Re-enable the VM Resurrector if you want to continue to use it.

Do not proceed to the next step until all three VMs are in the starting or failing state.

Update the BOSH Configuration

In a standard deployment, BOSH is configured to manage the cluster in a specific manner. You must change that configuration for the bootstrap errand to do its work.

If you are using PCF v1.10, follow the BOSH CLI v1 process below.

If you are using PCF v1.11 or later, follow the BOSH CLI v2 process below.

BOSH CLI v1

  1. Log in to the BOSH Director.
  2. Target the correct deployment.
  3. Run bosh edit deployment:
    • Locate the jobs.mysql-partition.update section.
    • Change max_in_flight to 3.
    • Below the max_in_flight line, add a line: canaries: 0.
  4. Run bosh deploy.

BOSH CLI v2

  1. Log in to the BOSH Director.
  2. Download the current manifest using bosh2 -e MY-ENV -d MY-DEP manifest > /tmp/manifest.yml.
  3. Open this manifest to edit the following:
    • Locate the jobs.mysql-partition.update section.
    • Change max_in_flight to 3.
    • Below the max_in_flight line, add a line: canaries: 0.
  4. Run bosh2 -e MY-ENV -d MY-DEP deploy /tmp/manifest.yml to deploy the updated manifest.
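
In both procedures, the edited update block might look like the following sketch (the partition GUID varies by deployment):

    jobs:
    - name: mysql-partition-a813339fde9330e9b905
      update:
        canaries: 0        # added: do not wait for a canary instance
        max_in_flight: 3   # raised from 1: update all nodes at once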

Run the Bootstrap Errand

  1. If you are using PCF v1.10, use the BOSH CLI v1 command bosh run errand bootstrap. If you are using PCF v1.11 or later, use the BOSH CLI v2 command bosh2 run-errand bootstrap.

  2. Validate that the errand completes successfully.

    • Some instances may still appear as failing. It’s OK to proceed to the next step.

Restore the BOSH Configuration

To restore your BOSH configuration to its previous state, repeat the steps from Update the BOSH Configuration above, using the following values:

  • Set canaries to 1.
  • Set max_in_flight to 1.
  • Set serial to true in the same update section.

Then complete the following steps:

  1. Redeploy, using the same deploy command as above.
  2. Validate that all mysql instances are in the running state.
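
After the restore, the same update block might look like the following sketch:

    jobs:
    - name: mysql-partition-a813339fde9330e9b905
      update:
        canaries: 1
        max_in_flight: 1
        serial: true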

Note: It is critical that you complete all of the steps above. If you do not restore the original values in the BOSH manifest, the status of the jobs is not reported correctly, which can cause problems in future deploys.


Manual Bootstrap

If the bootstrap errand is unable to automatically recover the cluster, you may need to perform the steps manually. The following steps are prone to user error and can result in lost data if followed incorrectly. Follow the assisted bootstrap instructions above first, and resort to the manual process only if the errand fails to repair the cluster.

    1. SSH to each node in the cluster and, as root, shut down the mariadb process. To SSH into BOSH-deployed VMs, see the Advanced Troubleshooting with the BOSH CLI topic.

      $ monit stop mariadb_ctrl
      

      Re-bootstrapping the cluster will not be successful unless all other nodes have been shut down.

    2. Choose a node to bootstrap.

      Find the node with the highest transaction sequence number (seqno). You can obtain the seqno of a stopped node either by reading the node’s state file at /var/vcap/store/mysql/grastate.dat or by running the mysqld command with the --wsrep-recover flag.

      If a node shut down gracefully, the seqno is in the Galera state file.

      $ cat /var/vcap/store/mysql/grastate.dat | grep 'seqno:'
      

      If the node crashed or was killed, the seqno in the Galera state file shows as -1. In this case, the seqno may be recoverable from the database. The following command causes the database to start, log the recovered sequence number, and exit.

      $ /var/vcap/packages/mariadb/bin/mysqld --wsrep-recover
      

      Note: The Galera state file still reports seqno: -1 afterward.

      Scan the error log for the recovered sequence number. The last number after the group ID (UUID) is the recovered seqno:

      $ grep "Recovered position" /var/vcap/sys/log/mysql/mysql.err.log | tail -1
      150225 18:09:42 mysqld_safe WSREP: Recovered position e93955c7-b797-11e4-9faa-9a6f0b73eb46:15
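
      To extract just the seqno from that line, a convenience one-liner such as the following works with the log format shown above:

      $ grep "Recovered position" /var/vcap/sys/log/mysql/mysql.err.log | tail -1 | awk -F: '{print $NF}'
      15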
      

      If the node never connected to the cluster before crashing, it may not have a group ID (the uuid field in grastate.dat). In this case, there is nothing to recover. Unless all nodes crashed this way, do not choose this node for bootstrapping.

      Bootstrap the first node

      Use the node with the highest seqno value as the new bootstrap node. If all nodes have the same seqno, you can choose any node as the new bootstrap node.

      Note: Only perform these bootstrap commands on the node with the highest seqno. Otherwise the node with the highest seqno will be unable to join the new cluster (unless its data is abandoned). Its mariadb process will exit with an error. See cluster behavior for more details on intentionally abandoning data.

    3. On the new bootstrap node, update the state file and restart the mariadb process:

      $ echo -n "NEEDS_BOOTSTRAP" > /var/vcap/store/mysql/state.txt
      $ monit start mariadb_ctrl
      

      You can check that the mariadb process has started successfully by running:

      $ watch monit summary
      

      It can take up to 10 minutes for monit to start the mariadb process.
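
      When the process has started, the mariadb_ctrl entry in the monit summary output reports running, for example (other lines omitted):

      Process 'mariadb_ctrl'              running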

      Restart the remaining nodes

    4. After the bootstrapped node is running, start the mariadb process on the remaining nodes via monit.

      Start the mariadb process:

      $ monit start mariadb_ctrl
      

      If the Interruptor prevents the node from starting, perform the manual procedure to force the node to rejoin the cluster, as documented in the Pivotal Knowledge Base.

      WARNING: Forcing a node to rejoin the cluster is a destructive procedure. Only perform it with the assistance of Pivotal Support.

    5. Verify that the new nodes have successfully joined the cluster. The following command outputs the total number of nodes in the cluster:

      mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
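
      In a healthy three-node deployment, the output resembles the following (the value matches your node count):

      +--------------------+-------+
      | Variable_name      | Value |
      +--------------------+-------+
      | wsrep_cluster_size | 3     |
      +--------------------+-------+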
      