Recovering From MySQL Cluster Downtime
Page last updated:
Warning: Pivotal Cloud Foundry (PCF) v2.5 is no longer supported because it has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
This topic describes how to bootstrap your MySQL cluster in the event of a cluster failure.
You can bootstrap your cluster by using one of two methods:
Run the bootstrap errand. See Run the Bootstrap Errand below.
Bootstrap manually. See Bootstrap Manually below.
When to Bootstrap
You must bootstrap a cluster that loses quorum. A cluster loses quorum when less than half of the nodes can communicate with each other for longer than the configured grace period. If a cluster does not lose quorum, individual unhealthy nodes automatically rejoin the cluster after resolving the error, restarting the node, or restoring connectivity.
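The majority rule above can be expressed as simple arithmetic: a partition retains quorum only while it contains a strict majority of the cluster's nodes. A minimal sketch of that check, where `has_quorum` is an illustrative helper, not a product command:

```shell
#!/bin/bash
# Quorum rule of thumb: a partition keeps quorum only if it holds a
# strict majority of the cluster's nodes. An even split loses quorum.
# (has_quorum is an illustrative helper, not part of the product.)
has_quorum() {
  local reachable=$1 total=$2
  [ $((2 * reachable)) -gt "$total" ]
}

# In a 3-node cluster, 2 communicating nodes keep quorum; 1 does not.
has_quorum 2 3 && echo "2 of 3: quorum kept"
has_quorum 1 3 || echo "1 of 3: quorum lost"
```

Note that an even split (for example, 2 of 4 nodes) also loses quorum, which is why clusters are typically sized with an odd number of nodes.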
To check whether your cluster has lost quorum, look for the following symptoms:
- All nodes appear "Unhealthy" on the proxy dashboard, viewable at:

    ```
    https://BOSH-JOB-INDEX-proxy-p-mysql-ert.YOUR-SYSTEM-DOMAIN
    ```

- All responsive nodes report the value of `wsrep_cluster_status` as `non-Primary`:

    ```
    mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
    +----------------------+-------------+
    | Variable_name        | Value       |
    +----------------------+-------------+
    | wsrep_cluster_status | non-Primary |
    +----------------------+-------------+
    ```

- All unresponsive nodes respond with `ERROR 1047` when using most statement types in the MySQL client:

    ```
    mysql> select * from mysql.user;
    ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
    ```
For more information about checking the state of your cluster, see Check Cluster State.
Run the Bootstrap Errand
PCF includes a BOSH errand that automates the manual bootstrapping procedure in the Bootstrap Manually section below. For more information about BOSH errands, see Jobs in the BOSH documentation.
The errand finds the node with the highest transaction sequence number, starts it by itself (that is, in bootstrap mode), and then instructs the remaining nodes to join the cluster.
In most cases, running the errand recovers your cluster. Certain scenarios require additional steps.
Determine Type of Cluster Failure
To determine which set of instructions to follow:
1. Run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
    ```

    Where:
    - `YOUR-ENV` is the environment where you deployed the cluster.
    - `YOUR-DEPLOYMENT` is the deployment cluster name.

    For example:

    ```
    $ bosh -e prod -d mysql instances
    ```

2. Find and record the `Process State` for your MySQL instances. In the following example output, the MySQL instances are in the `failing` process state.

    ```
    Instance                                                             Process State  AZ             IPs
    backup-restore/c635410e-917d-46aa-b054-86d222b6d1c0                  running        us-central1-b  10.0.4.47
    bootstrap/a31af4ff-e1df-4ff1-a781-abc3c6320ed4                       -              us-central1-b  -
    broker-registrar/1a93e53d-af7c-4308-85d4-3b2b80d504e4                -              us-central1-b  10.0.4.58
    cf-mysql-broker/137d52b8-a1b0-41f3-847f-c44f51f87728                 running        us-central1-c  10.0.4.57
    cf-mysql-broker/28b463b1-cc12-42bf-b34b-82ca7c417c41                 running        us-central1-b  10.0.4.56
    deregister-and-purge-instances/4cb93432-4d90-4f1d-8152-d0c238fa5aab  -              us-central1-b  -
    monitoring/f7117dcb-1c22-495e-a99e-cf2add90dea9                      running        us-central1-b  10.0.4.48
    mysql/220fe72a-9026-4e2e-9fe3-1f5c0b6bf09b                           failing        us-central1-b  10.0.4.44
    mysql/28a210ac-cb98-4ab4-9672-9f4c661c57b8                           failing        us-central1-f  10.0.4.46
    mysql/c1639373-26a2-44ce-85db-c9fe5a42964b                           failing        us-central1-c  10.0.4.45
    proxy/87c5683d-12f5-426c-b925-62521529f64a                           running        us-central1-b  10.0.4.60
    proxy/b0115ccd-7973-42d3-b6de-edb5ae53c63e                           running        us-central1-c  10.0.4.61
    rejoin-unsafe/8ce9370a-e86b-4638-bf76-e103f858413f                   -              us-central1-b  -
    smoke-tests/e026aaef-efd9-4644-8d14-0811cb1ba733                     -              us-central1-b  10.0.4.59
    ```

3. Choose your scenario:
    - If your MySQL instances are in the `failing` state, continue to Scenario 1: VMs Running, Cluster Disrupted.
    - If your MySQL instances are in the `-` state, continue to Scenario 2: VMs Terminated or Lost.
Scenario 1: VMs Running, Cluster Disrupted
In this scenario, the VMs are running, but the cluster has been disrupted.
To bootstrap in this scenario:
1. Run the bootstrap errand:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand bootstrap
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.

    Note: The errand runs for a long time, during which no output is returned.

    The command returns many lines of output, eventually followed by:

    ```
    Bootstrap errand completed
    [stderr]
    + echo 'Started bootstrap errand ...'
    + JOB_DIR=/var/vcap/jobs/bootstrap
    + CONFIG_PATH=/var/vcap/jobs/bootstrap/config/config.yml
    + /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
    + echo 'Bootstrap errand completed'
    + exit 0
    Errand 'bootstrap' completed successfully (exit code 0)
    ```
If the errand fails, run the bootstrap errand command again after a few minutes. The bootstrap errand might not work the first time.
If the errand fails after several tries, bootstrap your cluster manually. See Bootstrap Manually.
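Because the errand may need more than one attempt, the wait-and-rerun loop can be scripted. This is a sketch, not part of the product; the `retry` helper and the `RETRY_DELAY` variable are illustrative names:

```shell
#!/bin/bash
# Sketch: rerun a command up to a fixed number of attempts, pausing
# between tries. retry and RETRY_DELAY are illustrative names.
retry() {
  local attempts=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$attempts" ]; then
      return 1  # give up after the final attempt fails
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-120}"  # wait a few minutes between attempts
  done
}

# Example (hypothetical environment and deployment names):
# retry 3 bosh -e prod -d mysql run-errand bootstrap
```

If the command still fails after the final attempt, `retry` returns a nonzero status, which is your cue to move on to the manual bootstrap procedure.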
Scenario 2: VMs Terminated or Lost
In severe circumstances, such as a power failure, it is possible to lose all your VMs. You must recreate them before you can begin recovering the cluster.
When MySQL instances are in the `-` state, the VMs are lost. The procedures in this scenario bring the instances from a `-` state to a `failing` state. Then you run the bootstrap errand as in Scenario 1 above and restore the configuration.
To recover terminated or lost VMs:
1. Bring MySQL instances from a `-` state to a `failing` state. See Recreate the Missing VMs.
2. Because your instances are now in the `failing` state, continue similarly to Scenario 1 above. See Run the Bootstrap Errand.
3. Go back to unignoring all instances and redeploy. This is a critical and mandatory step. See Restore the BOSH Configuration.

Warning: If you do not set each of your ignored instances to unignore, your instances are not updated in future deploys. You must do the procedure in the final section of Scenario 2, Restore the BOSH Configuration.
Recreate the Missing VMs
The procedure in this section uses BOSH to recreate the VMs, install software on them, and try to start the jobs.
The procedure below allows you to:
Redeploy your cluster while expecting the jobs to fail.
Instruct BOSH to ignore the state of each instance in your cluster. This allows BOSH to deploy the software to each instance even if the instance is failing.
To recreate your missing VMs:
1. SSH into the BOSH VMs using the procedure in the BOSH SSH section of the Advanced Troubleshooting with the BOSH CLI topic.

2. If BOSH resurrection is enabled, disable it by running:

    ```
    bosh -e YOUR-ENV update-resurrection off
    ```

    Where `YOUR-ENV` is the name of your environment.

3. To download the current manifest, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT manifest > /tmp/manifest.yml
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.
4. To redeploy your deployment, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy /tmp/manifest.yml
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.

    Note: Expect one of the MySQL VMs to fail. Deploying causes BOSH to create new VMs and install the software. Forming a cluster happens in a later step.
5. To view the instance GUID of the VM that attempted to start, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.

    Record the instance GUID, which is the string after `mysql/` in your BOSH instances output.

6. To instruct BOSH to ignore each MySQL VM, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/INSTANCE-GUID
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.
    - `INSTANCE-GUID` is the GUID of your instance that you recorded in the previous step.
7. Repeat steps 4 through 6 until all instances have attempted to start.
8. If you disabled BOSH resurrection in step 2, re-enable it by running:

    ```
    bosh -e YOUR-ENV update-resurrection on
    ```

    Where `YOUR-ENV` is the name of your environment.

9. To confirm that your MySQL instances have gone from the `-` state to the `failing` state, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.
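Because the ignore step has to be repeated once per failing instance, a small helper can pull the GUIDs out of saved `bosh instances` output. This is a sketch; `failing_mysql_guids` and the filenames are illustrative, and it assumes the four-column table layout shown earlier in this topic:

```shell
#!/bin/bash
# Sketch: extract the GUIDs of failing mysql instances from saved
# `bosh ... instances` output. failing_mysql_guids is an illustrative
# name; the awk pattern assumes the Instance and Process State columns
# come first, as in the example output above.
failing_mysql_guids() {
  awk '$1 ~ /^mysql\// && $2 == "failing" { sub(/^mysql\//, "", $1); print $1 }' "$1"
}

# Example usage (hypothetical environment, deployment, and filenames):
#   bosh -e prod -d mysql instances > /tmp/instances.txt
#   for guid in $(failing_mysql_guids /tmp/instances.txt); do
#     bosh -e prod -d mysql ignore "mysql/$guid"
#   done
```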
Run the Bootstrap Errand
After you recreate the VMs, all instances now have a `failing` process state and have the MySQL code installed. You must run the bootstrap errand to recover the cluster.
To bootstrap:
1. Run the bootstrap errand:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand bootstrap
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.

    Note: The errand runs for a long time, during which no output is returned.

    The command returns many lines of output, eventually with the following successful output:

    ```
    Bootstrap errand completed
    [stderr]
    echo 'Started bootstrap errand ...'
    JOB_DIR=/var/vcap/jobs/bootstrap
    CONFIG_PATH=/var/vcap/jobs/bootstrap/config/config.yml
    /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
    echo 'Bootstrap errand completed'
    exit 0
    Errand 'bootstrap' completed successfully (exit code 0)
    ```
If the errand fails, run the bootstrap errand command again after a few minutes. The bootstrap errand might not work immediately.
Confirm that the errand completes successfully in the shell output, then continue to Restore the BOSH Configuration.
Note: After you complete the bootstrap errand, you might still see instances in the `failing` state. Continue to the next section anyway.
Restore the BOSH Configuration
Warning: If you do not set each of your ignored instances to unignore, your instances are never updated in future deploys.

This procedure restores your BOSH configuration to its previous state by unignoring each instance that was previously ignored. To unignore these instances:
1. For each ignored instance, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/INSTANCE-GUID
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.
    - `INSTANCE-GUID` is the GUID of your instance.
2. To redeploy your deployment, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.
3. To validate that all `mysql` instances are in a `running` state, run:

    ```
    bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
    ```

    Where:
    - `YOUR-ENV` is the name of your environment.
    - `YOUR-DEPLOYMENT` is the name of your deployment.
Bootstrap Manually
If the bootstrap errand is not able to automatically recover the cluster, you might need to do the steps manually.
To manually bootstrap your cluster, follow the procedures below.
Warning: The following procedures are prone to user error and can result in lost data if followed incorrectly. Follow the procedure in Run the Bootstrap Errand first, and only resort to the manual process if the errand fails to repair the cluster.
Shut Down MySQL
To shut down MySQL, do the following for each node in the cluster:
1. SSH into the node using the procedure in the BOSH SSH section of the Advanced Troubleshooting with the BOSH CLI topic.

2. To shut down the `mysqld` process on the node, run:

    ```
    monit stop galera-init
    ```

Repeat this procedure for each node in the cluster.
Re-bootstrapping the cluster is not successful unless you shut down the `mysqld` process on all nodes in the cluster.
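The per-node shutdown above can also be driven from a jumpbox with `bosh ssh` instead of logging in to each VM by hand. A sketch, where `stop_all_mysql` is an illustrative helper and the environment, deployment, and instance names in the example are hypothetical:

```shell
#!/bin/bash
# Sketch: stop the mysqld process on every listed node over bosh ssh.
# stop_all_mysql is an illustrative helper, not a product command.
stop_all_mysql() {
  local env=$1 dep=$2; shift 2
  local inst
  for inst in "$@"; do
    # -c runs a single command on the remote VM and returns.
    bosh -e "$env" -d "$dep" ssh "$inst" -c 'sudo monit stop galera-init'
  done
}

# Example (hypothetical names):
# stop_all_mysql prod mysql mysql/0 mysql/1 mysql/2
```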
Choose Node to Bootstrap
To avoid losing data, you must bootstrap from the node in the cluster that has the highest transaction sequence number (`seqno`).

To find the `seqno` for each node in the cluster:
1. SSH into the node using the procedure in the BOSH SSH section of the Advanced Troubleshooting with the BOSH CLI topic.

2. To find the `seqno` value in the node's Galera state file, `grastate.dat`, run:

    ```
    cat /var/vcap/store/pxc-mysql/grastate.dat | grep 'seqno:'
    ```
3. Do one of the following, based on the last and highest `seqno` value in the `grep` output above:
    - If `seqno` is a positive integer, the node shut down gracefully. Record this number for comparison with the latest `seqno` of other nodes in the cluster.
    - If `seqno` is `-1`, the node crashed or was killed. Proceed as follows to possibly recover the `seqno` from the database:
        1. To temporarily start the database and append the last sequence number to its error log, run:

            ```
            /var/vcap/packages/pxc/bin/mysqld --defaults-file=/var/vcap/jobs/pxc-mysql/config/my.cnf --wsrep-recover
            ```

        2. Scan the end of the error log `mysql.err.log` for the recovered `seqno`. It appears after the group ID (a UUID) and a colon. For example, it is `15026` in the following output:

            ```
            $ grep "Recovered position" /var/vcap/sys/log/pxc-mysql/mysql.err.log | tail -1
            150225 18:09:42 mysqld_safe WSREP: Recovered position e93955c7-b797-11e4-9faa-9a6f0b73eb46:15026
            ```

        3. If the node never connected to the cluster before crashing, it might never have been assigned a group ID. In this case, there is nothing to recover. Do not choose this node for bootstrapping unless all the other nodes also crashed.
After you determine the `seqno` for all nodes in your cluster, identify the node with the highest `seqno`. If multiple nodes share the same highest `seqno`, and it is not `-1`, you can bootstrap from any of them.
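The comparison across nodes can be mechanized. A sketch under the assumption that you collect one `NODE SEQNO` line per node into a file; both helper names and the list format are illustrative:

```shell
#!/bin/bash
# Sketch: helpers for choosing the bootstrap candidate. get_seqno and
# pick_bootstrap_node are illustrative names, not product commands.

# Extract the numeric seqno value from a grastate.dat file.
get_seqno() {
  awk '/^seqno:/ { print $2 }' "$1"
}

# Given lines of "NODE SEQNO", print the node with the highest seqno.
# -1 entries (crashed nodes) sort below any committed sequence number.
pick_bootstrap_node() {
  sort -k2,2n "$1" | tail -1 | awk '{ print $1 }'
}
```

For example, given the lines `mysql/0 15026`, `mysql/1 -1`, and `mysql/2 15020`, `pick_bootstrap_node` selects `mysql/0`.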
Bootstrap the First Node
After determining the node with the highest `seqno`, you must bootstrap that node.
To bootstrap the node:
Note: Only run these bootstrap commands on the node with the highest `seqno`. Otherwise, the node with the highest `seqno` is unable to join the new cluster unless its data is abandoned. Its `mysqld` process exits with an error. For more information about intentionally abandoning data, see Architecture.
1. SSH into the node using the procedure in the BOSH SSH section of the Advanced Troubleshooting with the BOSH CLI topic.

2. To update the state file and restart the `mysqld` process on the new bootstrap node, run:

    ```
    echo -n "NEEDS_BOOTSTRAP" > /var/vcap/store/pxc-mysql/state.txt
    monit start galera-init
    ```

3. It can take up to ten minutes for `monit` to start the `mysqld` process. To check whether the `mysqld` process has started successfully, run:

    ```
    watch monit summary
    ```
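Instead of watching interactively, the wait can be scripted as a polling loop. This is a sketch: `wait_for_galera` is an illustrative name, and the grep pattern assumes `monit summary` prints a status line containing `galera-init` and `running`:

```shell
#!/bin/bash
# Sketch: poll monit until galera-init reports running, giving up
# after roughly ten minutes. wait_for_galera is an illustrative name;
# the grep pattern assumes monit's usual summary line format.
wait_for_galera() {
  local deadline=$((SECONDS + 600))
  while [ "$SECONDS" -lt "$deadline" ]; do
    if monit summary | grep -q "galera-init.*[Rr]unning"; then
      return 0
    fi
    sleep 10
  done
  return 1  # timed out; mysqld did not come up
}
```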
Restart Remaining Nodes
After the bootstrapped node is running, you must restart the remaining nodes.
To restart the nodes:
1. SSH into the node using the procedure in the BOSH SSH section of the Advanced Troubleshooting with the BOSH CLI topic.

2. To start the `mysqld` process with `monit`, run:

    ```
    monit start galera-init
    ```
If the node is prevented from starting by the Interruptor, do the manual procedure to force the node to rejoin the cluster in How to Manually Force a MySQL Node to Rejoin if a Node Cannot Rejoin the Cluster for P-MySQL 1.9 and 1.10 in the Pivotal Knowledge Base.
Warning: Forcing a node to rejoin the cluster is a destructive procedure. Only do the procedure with the assistance of Pivotal Support.
3. If the `monit start` command fails, it might be because the node with the highest `seqno` is `mysql/0`. In this case:

    1. To make BOSH ignore updating `mysql/0`, run:

        ```
        bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/0
        ```

        Where:
        - `YOUR-ENV` is the name of your environment.
        - `YOUR-DEPLOYMENT` is the name of your deployment.

    2. Navigate to Ops Manager in a browser, log in, and click Apply Changes.

    3. When the deploy finishes, run the following command from the Ops Manager VM:

        ```
        bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/0
        ```

        Where:
        - `YOUR-ENV` is the name of your environment.
        - `YOUR-DEPLOYMENT` is the name of your deployment.
4. To verify that the new nodes have successfully joined the cluster, SSH into the bootstrap node and run the following command to output the total number of nodes in the cluster:

    ```
    mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
    ```
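For a scripted check, the same query can be run non-interactively with the MySQL client's silent flags. A sketch, where `cluster_size` is an illustrative helper and client credentials are assumed to be already configured on the node:

```shell
#!/bin/bash
# Sketch: read wsrep_cluster_size non-interactively. cluster_size is an
# illustrative helper; -ss suppresses the column headers and table
# borders so only the "name<TAB>value" pair is printed, and awk keeps
# just the value.
cluster_size() {
  mysql -sse "SHOW STATUS LIKE 'wsrep_cluster_size';" | awk '{ print $2 }'
}

# Example: in a healthy 3-node cluster, compare against the expected size.
# [ "$(cluster_size)" = "3" ] && echo "all nodes joined"
```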