Troubleshooting and Diagnostics
This topic describes how to diagnose and troubleshoot problems with MySQL for Pivotal Cloud Foundry (PCF).
If your cluster is experiencing downtime or is in a degraded state, Pivotal recommends the following workflow for gathering information to diagnose the type of failure the cluster is experiencing.
- Use the mysql-diag tool to gather a summary of the network, disk, and replication state of each cluster node.
- Run download-logs against each node in your MySQL cluster, the MySQL proxies, and the MySQL backup-prepare node. It’s important to do this before attempting recovery, because any failures in the recovery procedure can result in logs being lost or made inaccessible.
- Depending on the reported state from
mysql-diag, the troubleshooting techniques listed below might allow you to recover your cluster.
- If you are uncertain about the recovery steps to take, submit a ticket through Pivotal Support
with the following information:
- Logs fetched by running the
- Output from
- Type of environment the MySQL deployment is running in (Elastic Runtime, MySQL for PCF, or other)
- Versions of the installed Ops Manager, Elastic Runtime, and MySQL for PCF
- Logs fetched by running the
For more diagnostic techniques, see Diagnostic Techniques.
After performing the procedure in Diagnose Problems, consult the following additional diagnostic techniques to learn more about your cluster.
If your cluster has lost quorum, you must bootstrap it.
To determine whether you need to bootstrap your cluster, check whether the cluster has lost quorum. For more information, see Bootstrapping for more information.
To check the cluster state, connect to each MySQL node using a MySQL client and check its status.
$ mysql -h NODE_IP -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_cluster_status";' +----------------------+---------+ | Variable_name | Value | +----------------------+---------+ | wsrep_cluster_status | Primary | +----------------------+---------+
If all nodes are in the
Primary component, you have a healthy cluster.
If some nodes are in a
Non-primary component, those nodes are not able to join the cluster.
To check how many nodes are in the cluster, perform the following command:
$ mysql -h NODE_IP -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_cluster_size";' +--------------------+-------+ | Variable_name | Value | +--------------------+-------+ | wsrep_cluster_size | 3 | +--------------------+-------+
If the value of
wsrep_cluster_size is equal to the expected number of nodes, then all nodes have joined the cluster.
Otherwise, check network connectivity between nodes and use
monit status to identify any issues preventing nodes from starting.
For more information, see the official Galera documentation for Checking Cluster Integrity.
To restart your cluster, see Bootstrapping.
If you see stale data in your cluster, check whether replication is functioning normally.
Perform the following steps to check the replication status:
Obtain the IP addresses of your MySQL server by performing the following steps:
- From the PCF Installation Dashboard, click the MySQL for Pivotal Cloud Foundry tile.
- Click the Status tab.
- Record the IP addresses for all instances of the MySQL Server job.
Obtain the admin credentials for your MySQL server by performing the following steps:
- From the MySQL for PCF tile, click the Credentials tab.
- Locate the Mysql Admin Password entry in the MySQL Server section and click Link to Credential.
- Record the values for
SSH into the Ops Manager VM. Because the procedures vary by IaaS, see SSH into Ops Manager for more information.
From the Ops Manager VM, place some data in the first node by performing the following steps, replacing
FIRST-NODE-IP-ADDRESSwith the IP address of the first node retrieved from the initial step and
identityvalue obtained from the second step. When prompted for a password, provide the
passwordvalue obtained from the second step.
- Create a dummy database in the first node:
$ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -e "create database verify_healthy;"
- Create a dummy table in the dummy database:
$ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "create table dummy_table (id int not null primary key auto_increment, info text) engine='InnoDB';"
- Insert some data into the dummy table:
$ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "insert into dummy_table(info) values ('dummy data'),('more dummy data'),('even more dummy data');"
- Query the table and verify that the three rows of dummy data exist on the first node:
mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;" Enter password: +----+----------------------+ | id | info | +----+----------------------+ | 4 | dummy data | | 7 | more dummy data | | 10 | even more dummy data | +----+----------------------+
- Create a dummy database in the first node:
Verify that the other nodes contain the same dummy data by performing the following steps for each of the remaining MySQL server IP addresses obtained above:
- Query the dummy table, replacing
NEXT-NODE-IP-ADDRESSwith the IP address of the MySQL server instance and
identityvalue obtained above. When prompted for a password, provide the
passwordvalue obtained above.
$ mysql -h NEXT-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;"
- Examine the output of the
mysqlcommand and verify that the node contains the same three rows of dummy data as the other nodes.
+----+----------------------+ | id | info | +----+----------------------+ | 4 | dummy data | | 7 | more dummy data | | 10 | even more dummy data | +----+----------------------+
- Query the dummy table, replacing
If each MySQL server instance does not return the same result, contact Pivotal Support before proceeding further or making any changes to your deployment. If each MySQL server instance returns the same result, then you can safely proceed to scaling down your cluster to a single node.
This section lists symptoms, solutions, and explanations for the following common problems:
- Cluster errors:
- User errors:
You see many replication errors in the MySQL logs.
Unless the GRA files show a clear execution error, such as running out of disk space, seeing many replication errors in your logs is a normal behavior. The MySQL for PCF team is developing a more advanced monitoring solution to detect the failure case and alert operators in the future.
Occasionally, replication errors in the MySQL logs resemble the following:
160318 9:25:16 [Warning] WSREP: RBR event 1 Query apply warning: 1, 16992456 160318 9:25:16 [Warning] WSREP: Ignoring error for TO isolated action: source: abcd1234-abcd-1234-abcd-1234abcd1234 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 246804 trx_id: -1 seqnos (l: 865022, g: 16992456, s: 16992455, d: 16992455, ts: 2530660989030983) 160318 9:25:16 [ERROR] Slave SQL: Error 'Duplicate column name 'number'' on query. Default database: 'cf_0123456_1234_abcd_1234_abcd1234abcd'. Query: 'ALTER TABLE ...'
This error occurs when an app issues an
ALTER TABLE command that fails to apply to the current schema.
This is typically the result of user error.
The MySQL node that receives the request processes it normally.
If it fails, the node sends the failure to the app and the app must determine the next step.
In a Galera cluster, however, all DDL is replicated and all replication failures are logged.
Therefore, the bad
ALTER TABLE command is run by both slave nodes.
If it fails, the slave nodes log it as a “replication failure” because they cannot tell the difference.
Cases where a valid DDL works on some nodes yet fails on others are rare. Usually those cases are limited to problems with running out of disk space or working memory.
An article from Percona suggests that the schemata can get out of sync. However, the author had to deliberately switch a node to RSU, which MySQL for PCF never does except during SST. The article offers a demonstration of what is possible, but does not explain how a customer may actually experience this issue in production.
Disregard the replication errors as a normal behavior. If you see the “ALTER TABLE” error, fix the user error that produced it.
You may encounter one or more of the following symptoms when MySQL has run out of space:
- MySQL gives an error that reports it is out of space
- Queries start failing
- Apps are unable to connect
- You are unable to SSH into a node
When MySQL runs out of space, it stops functioning normally.
- Use the mysql-diag tool to determine how much space you have left. If the disk is too full, the
- Redeploy with more persistent disk.
- After redeploying, you may encounter further errors. To troubleshoot these errors, see the other sections in Troubleshoot Problems.
A MySQL node is unable to rejoin a cluster.
Existing server nodes restarted with
monit should automatically join the cluster.
If a detached existing node fails to join the cluster, it might be because its sequence number (
seqno) is higher than those of the nodes with quorum.
seqno on the detached node indicates that it has recent changes to the data that the primary lacks.
Check this by looking at the node’s error log at
If the detached node has a higher sequence number than the primary component, perform one of the following procedures to restore the cluster:
- Shut down the healthy, primary nodes and bootstrap from the detached node to preserve its data. See Bootstrapping for more details.
Abandon the unsynchronized data on the detached node by performing the manual procedure to force the node to rejoin the cluster, documented in Pivotal Knowledge Base.
WARNING: Forcing a node to rejoin the cluster is a destructive procedure. Only perform it with the assistance of Pivotal Support.
It might also be possible to manually dump recent transactions from the detached node and apply them to the running cluster, but this process is error-prone.
A MySQL node has become unresponsive.
Unresponsive nodes stop responding to queries and, after timing out, leave the cluster.
Nodes are marked as unresponsive or inactive under either of the following circumstances:
- If they fail to respond to one node within 15 seconds
- If they fail to respond to all other nodes within 5 seconds
Unresponsive nodes that become responsive again will rejoin the cluster,
as long as they are on the same IP which is pre-configured in the
gcomm address on all of the other running nodes and a quorum is held by the remaining nodes.
All nodes suspend writes once they notice a problem with the cluster.
After a timeout period of 5 seconds, requests to non-quorum nodes fail.
Most clients return the error:
WSREP has not yet prepared this node for application use.
Some clients may instead return
Nodes who have reached quorum continue fulfilling write requests.
If deployed using a proxy, a continually inactive node causes the proxy to fail over, selecting a different MySQL node to route new queries to.
A node can become unresponsive for a number of reasons. Each reason corresponds to a different solution. Consult the following list:
If network latency causes a node to become unresponsive, the node drops but eventually rejoins.
The node will only rejoin automatically if only one node has left the cluster.
Consult your IaaS network settings to reduce your network latency.
If the MySQL process crashes, monit and BOSH will bring the process back. No further action is necessary.
If firewall rules change, it may prevent a node from reaching the rest of the cluster, causing the node to become unresponsive.
In this scenario, the logs show the node leaving the cluster but do not show network latency errors.
To confirm that the node is unresponsive due to a firewall rule change, try to SSH from a responsive node to the unresponsive node. If you can’t connect, the node is unresponsive due to a firewall rule change.
Change the firewall rules to enable the unresponsive node to rejoin the cluster.
If you cannot SSH onto a node, and you are not detecting either network latency or firewall issues, your node may be down due to VM failure.
To confirm that VM failure has caused the node to become unresponsive, perform the following steps:
- SSH into the Ops Manager Director. For more information, see the SSH into Ops Manager section of Advanced Troubleshooting with the BOSH CLI.
- Retrieve the IP address for the MySQL server by navigating to the MySQL for PCF tile and clicking the Status tab.
- Retrieve the VM credentials for the MySQL server by navigating to the MySQL for PCF tile and clicking the Credentials tab.
- From the Ops Manager Director VM, use the BOSH CLI to log in to the BOSH Director. For more information, see the Log in to the BOSH Director section of Advanced Troubleshooting with the BOSH CLI.
- From the Ops Manager VM, use the BOSH CLI to run
bosh cloudcheck. For more information, see the BOSH Cloudcheck section of Advanced Troubleshooting with the BOSH CLI.
bosh cloudcheck, identify the unresponsive node and recreate it.
WARNING: Recreating a node will clear its logs. Ensure the node is completely down before recreating it.
WARNING: Only recreate one node. Do not recreate the entire cluster. If more than one node is down, contact Pivotal Support.
If you have enabled the Interruptor, it may be preventing a node from starting.
You can confirm that the Interruptor has activated by examining
/var/vcap/sys/log/mysql/mysql.err.log on the failing node. The log contains the following message:
WSREP_SST: [ERROR] ##################################################################################### (20160610 04:33:21.338) WSREP_SST: [ERROR] SST disabled due to danger of data loss. Verify data and run the rejoin-unsafe errand (20160610 04:33:21.340) WSREP_SST: [ERROR] ##################################################################################### (20160610 04:33:21.341)
Perform the procedure in the Override the Interruptor section of Using the Interruptor to force the node to rejoin the cluster.
You have deleted a service plan accidentally.
A service plan is only unrecoverable if you perform all of the following steps in sequence:
- Click the trash-can icon in the Service Plan section of the MySQL for PCF configuration.
- Create a plan with the same name.
- Click Save.
- Return to the Ops Manager Installation Dashboard, and click Apply Changes.
The deploy eventually fails with the following error:
Server error, status code: 502, error code: 270012, message: Service broker catalog is invalid: Plan names must be unique within a service
After you have deployed with Apply Changes, the original plan cannot be recovered. For as long as service instances of that plan exist, you may not enter a new plan of the same name. At this point, the only workaround is to create a new plan with the same specifications, but specify a different name. Existing instances continue to appear under the old plan name, but new instances need to be created using the new plan name.
If you have performed steps 1 and 2 in the above sequence, do not click Save. Return to the Ops Manager Installation Dashboard. Any accidental changes are discarded.
If you have performed steps 1, 2, and 3 in the above sequence, do not click Apply Changes. Return to the Ops Manager Installation Dashboard and click the Revert button. Any accidental changes are discarded.