Troubleshooting and Diagnostics

This topic describes how to diagnose and troubleshoot problems with MySQL for Pivotal Cloud Foundry (PCF).

Diagnose Problems

If your cluster is experiencing downtime or is in a degraded state, Pivotal recommends the following workflow for gathering information to diagnose the type of failure the cluster is experiencing.

WARNING: Do not attempt to resolve cluster issues by reconfiguring the cluster, such as changing the number of nodes or networks. Only follow the diagnosis steps in this document. If you are unsure how to proceed, contact Pivotal Support.

  1. Use the mysql-diag tool to gather a summary of the network, disk, and replication state of each cluster node.
  2. Run download-logs against each node in your MySQL cluster, the MySQL proxies, and the MySQL backup-prepare node. Do this before attempting recovery because any failures in the recovery procedure can result in logs being lost or made inaccessible.

    Pivotal recommends that you use the -X flag to get the complete set of available logs. However, if your cluster processes a high volume of transactions, the complete set might be very large; in that case, omit the flag to fetch only the essential logs. See the example after this list.

  3. Depending on the reported state from mysql-diag, the troubleshooting techniques listed in Troubleshoot Common Problems below might allow you to recover your cluster.

  4. If you are uncertain about the recovery steps to take, submit a ticket through Pivotal Support with the following information:

    • Logs fetched by running the download-logs script
    • Output from mysql-diag
    • Type of environment the MySQL deployment is running in, for example, Pivotal Application Service (PAS), Elastic Runtime, or MySQL for PCF
    • Versions of the installed Ops Manager, PAS or Elastic Runtime, and MySQL for PCF
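
A minimal sketch of steps 1 and 2 above. The exact location and invocation of the mysql-diag tool and the download-logs script vary by MySQL for PCF version, so treat this as an outline rather than the authoritative commands:

$ mysql-diag
$ ./download-logs -X     # omit -X if the complete set of logs is too large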

Diagnostic Techniques

After performing the procedure in Diagnose Problems, consult the following additional diagnostic techniques to learn more about your cluster.

Check Cluster State

The mysql-diag tool reports the cluster’s state. However, you can also check the state manually by connecting to each MySQL node with a MySQL client and checking its status:

$ mysql -h NODE_IP -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_cluster_status";'
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

If all nodes are in the Primary component, you have a healthy cluster. If any nodes are in a non-Primary component, those nodes are unable to join the cluster.
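
To check the status of every node at once, you can run the same query in a loop from a host that can reach all of the nodes. A minimal sketch, using placeholder node IPs:

$ for NODE_IP in NODE-1-IP NODE-2-IP NODE-3-IP; do
    mysql -h "$NODE_IP" -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_cluster_status";'
  done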

To check how many nodes are in the cluster, run the following command:

$ mysql -h NODE_IP -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_cluster_size";'
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+

If the value of wsrep_cluster_size is equal to the expected number of nodes, then all nodes have joined the cluster. Otherwise, check network connectivity between nodes and use monit status to identify any issues preventing nodes from starting.
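
For example, you can use bosh ssh to reach a node that has not joined and check its processes with monit. A sketch using BOSH CLI v2 syntax; the environment alias, deployment name, and instance name are placeholders, and the monit path shown is typical for BOSH-deployed VMs:

$ bosh -e MY-ENV -d MY-MYSQL-DEPLOYMENT ssh mysql/0
$ sudo /var/vcap/bosh/bin/monit status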

For more information, see the official Galera documentation for Checking Cluster Integrity.

If none of your nodes are in the Primary component, then your cluster has lost quorum and must be bootstrapped.

For more information, see Bootstrapping.

Check Replication Status

If you see stale data in your cluster, check whether replication is functioning normally.

Perform the following steps to check the replication status:

  1. Obtain the IP addresses of your MySQL server by performing the following steps:

    1. From the PCF Installation Dashboard, click the MySQL for Pivotal Cloud Foundry tile.
    2. Click the Status tab.
    3. Record the IP addresses for all instances of the MySQL Server job.
  2. Obtain the admin credentials for your MySQL server by performing the following steps:

    1. From the MySQL for PCF tile, click the Credentials tab.
    2. Locate the Mysql Admin Password entry in the MySQL Server section and click Link to Credential.
    3. Record the values for identity and password.
  3. SSH into the Ops Manager VM. Because the procedures vary by IaaS, see SSH into Ops Manager for more information.

  4. From the Ops Manager VM, place some data in the first node by performing the following steps. Replace FIRST-NODE-IP-ADDRESS with the IP address of the first node recorded in step 1 and YOUR-IDENTITY with the identity value obtained in step 2. When prompted for a password, provide the password value obtained in step 2.

    1. Create a dummy database in the first node:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -e "create database verify_healthy;"
      
    2. Create a dummy table in the dummy database:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "create table dummy_table (id int not null primary key auto_increment, info text) engine='InnoDB';"
      
    3. Insert some data into the dummy table:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "insert into dummy_table(info) values ('dummy data'),('more dummy data'),('even more dummy data');"
      
    4. Query the table and verify that the three rows of dummy data exist on the first node:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;"
      Enter password:
      +----+----------------------+
      | id | info                 |
      +----+----------------------+
      |  4 | dummy data           |
      |  7 | more dummy data      |
      | 10 | even more dummy data |
      +----+----------------------+
      
  5. Verify that the other nodes contain the same dummy data by performing the following steps for each of the remaining MySQL server IP addresses obtained above:

    1. Query the dummy table, replacing NEXT-NODE-IP-ADDRESS with the IP address of the MySQL server instance and YOUR-IDENTITY with the identity value obtained above. When prompted for a password, provide the password value obtained above.
      $ mysql -h NEXT-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;"
      
    2. Examine the output of the mysql command and verify that the node contains the same three rows of dummy data as the other nodes.
      +----+----------------------+
      | id | info                 |
      +----+----------------------+
      |  4 | dummy data           |
      |  7 | more dummy data      |
      | 10 | even more dummy data |
      +----+----------------------+
      
  6. If any MySQL server instance does not return the same result, contact Pivotal Support before proceeding further or making any changes to your deployment. If each MySQL server instance returns the same result, replication is functioning normally and you can safely perform operations such as scaling the cluster down to a single node.
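
After you verify that replication is working, you can remove the dummy database so the test data does not remain in the cluster. For example:

$ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -e "drop database verify_healthy;"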

Troubleshoot Common Problems

This section lists symptoms, explanations, and solutions for the following common problems:

  • Many Replication Errors in Logs
  • No Space Left
  • Node Unable to Rejoin
  • Unresponsive Node(s)
  • Accidental Deletion of Service Plan

Many Replication Errors in Logs

Symptom

You see many replication errors in the MySQL logs, such as the following:

160318 9:25:16 [Warning] WSREP: RBR event 1 Query apply warning: 1, 16992456
160318 9:25:16 [Warning] WSREP: Ignoring error for TO isolated action: source: abcd1234-abcd-1234-abcd-1234abcd1234 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 246804 trx_id: -1 seqnos (l: 865022, g: 16992456, s: 16992455, d: 16992455, ts: 2530660989030983)
160318 9:25:16 [ERROR] Slave SQL: Error 'Duplicate column name 'number'' on query. Default database: 'cf_0123456_1234_abcd_1234_abcd1234abcd'. Query: 'ALTER TABLE ...'

Explanation and Solution

Replication errors in MySQL log files can occur for various reasons.

Sometimes you can ignore them and sometimes you should not. See the guidance below.

If you see replication errors accompanied by any of the following, take the corresponding action:

  • ALTER TABLE errors: Fix the ALTER TABLE error. This error can occur when an app issues an invalid data definition statement. Other nodes log this problem as a replication error because they fail to replicate the ALTER TABLE.
  • Increased persistent disk usage or running out of working memory: Decode the GRA_*.log files and look for errors. See How to decode Galera GRA log files in the Pivotal Support knowledge base. The GRA log files contain failing DDL statements.
  • No other problems: Ignore the errors.
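
The GRA_*.log files are written to the MySQL data directory on each node. A sketch for locating them; the path shown is typical for BOSH-deployed MySQL nodes and is an assumption, and the knowledge base article referenced above describes how to decode the files with mysqlbinlog:

$ ls -lt /var/vcap/store/mysql/GRA_*.log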

No Space Left

Symptom

You might encounter one or more of the following symptoms when MySQL has run out of space:

  • MySQL gives an error that reports it is out of space
  • Queries start failing
  • Apps are unable to connect
  • You are unable to SSH into a node

Explanation

When MySQL runs out of space, it stops functioning normally.

Solution

  1. Use the mysql-diag tool to determine how much disk space remains. If the disk is too full, mysql-diag does not work; in that case, check free space manually, as shown in the example after this list.
  2. Redeploy with more persistent disk.
  3. After redeploying, you might encounter further errors. To troubleshoot these errors, see the other sections in Troubleshoot Common Problems.
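
If mysql-diag cannot run, you can check the remaining space on each node’s persistent disk directly. A sketch, assuming the standard BOSH persistent disk mount point:

$ df -h /var/vcap/store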

Node Unable to Rejoin

Symptom

A MySQL node is unable to rejoin a cluster.

Explanation

Existing server nodes restarted with monit should automatically join the cluster. If a detached existing node fails to join the cluster, it might be because its sequence_number is higher than those of the nodes with quorum.

A higher sequence_number on the detached node indicates that it has recent changes to the data that the primary lacks. Check this by looking at the node’s error log at /var/vcap/sys/log/mysql/mysql.err.log.
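
A sketch for inspecting the state of the detached node; the grastate.dat path shown is typical for BOSH-deployed MySQL nodes and is an assumption, and a seqno of -1 in that file means the node did not shut down cleanly, so check the error log instead:

$ cat /var/vcap/store/mysql/grastate.dat                         # last recorded sequence number
$ grep WSREP /var/vcap/sys/log/mysql/mysql.err.log | tail -n 50  # recent cluster state messages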

Solution

If the detached node has a higher sequence number than the primary component, perform one of the following procedures to restore the cluster:

  • Shut down the healthy, primary nodes and bootstrap from the detached node to preserve its data. See Bootstrapping for more details.
  • Abandon the unsynchronized data on the detached node by performing the manual procedure to force the node to rejoin the cluster, documented in Pivotal Knowledge Base.

    WARNING: Forcing a node to rejoin the cluster is a destructive procedure. Only perform it with the assistance of Pivotal Support.

It might also be possible to manually dump recent transactions from the detached node and apply them to the running cluster, but this process is error-prone.

Unresponsive Node(s)

Symptom

A client connected to a MySQL cluster node reports: WSREP has not yet prepared this node for application use. Some clients might instead return an unknown error.

Explanation

If the client is connected to a MySQL cluster node that loses connection to the rest of the cluster (that is, becomes non-Primary), the node stops accepting writes. If the connection to this node is made through the proxy, the proxy automatically re-routes further connections to a different node.
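
You can confirm that a node is in this state by checking its wsrep_ready status variable. For example:

$ mysql -h NODE_IP -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_ready";'
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | OFF   |
+---------------+-------+

A value of OFF indicates that the node is not accepting queries.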

A node can become unresponsive for a number of reasons. Each reason corresponds to a different solution. Consult the following list:

Network Latency

If network latency causes a node to become unresponsive, the node drops but eventually rejoins.

The node rejoins automatically only if it is the only node that has left the cluster.

Consult your IaaS network settings to reduce your network latency.

MySQL Process Failure

If the MySQL process crashes, monit and BOSH restart it automatically. No further action is necessary.

Firewall Rule Change

A change in firewall rules might prevent a node from reaching the rest of the cluster, causing the node to become unresponsive.

In this scenario, the logs show the node leaving the cluster but do not show network latency errors.

To confirm that a firewall rule change caused the node to become unresponsive, try to SSH from a responsive node to the unresponsive node. If you cannot connect, a firewall rule change is the likely cause.
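
In addition to SSH, you can check from a responsive node whether the standard Galera ports on the unresponsive node are reachable. A sketch, assuming nc is installed on the VM:

$ nc -zv UNRESPONSIVE-NODE-IP 4567   # group communication
$ nc -zv UNRESPONSIVE-NODE-IP 4568   # incremental state transfer (IST)
$ nc -zv UNRESPONSIVE-NODE-IP 4444   # state snapshot transfer (SST)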

Change the firewall rules to enable the unresponsive node to rejoin the cluster.

VM Failure

If you cannot SSH into a node, and you do not detect either network latency or firewall issues, your node might be down due to VM failure.

To confirm that VM failure has caused the node to become unresponsive, perform the following steps:

  1. SSH into the Ops Manager VM. For more information, see the SSH into Ops Manager section in Advanced Troubleshooting with the BOSH CLI.
  2. Retrieve the IP address for the MySQL server by navigating to the MySQL for PCF tile and clicking the Status tab.
  3. Retrieve the VM credentials for the MySQL server by navigating to the MySQL for PCF tile and clicking the Credentials tab.
  4. From the Ops Manager VM, use the BOSH CLI to log in to the BOSH Director.
  5. Run bosh cloudcheck, following the procedure in BOSH Cloudcheck. If you are using PCF v1.10, use the BOSH CLI v1 commands. If you are using PCF v1.11 or later, use the BOSH CLI v2 commands. For a BOSH CLI v2 sketch, see the example after this list.
  6. In bosh cloudcheck, identify the unresponsive node and recreate it.

    WARNING: Recreating a node will clear its logs. Ensure the node is completely down before recreating it.

    WARNING: Only recreate one node. Do not recreate the entire cluster. If more than one node is down, contact Pivotal Support.
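
If you are using the BOSH CLI v2, the following sketch outlines steps 4 through 6; the environment alias and deployment name are placeholders:

$ bosh -e MY-ENV login
$ bosh -e MY-ENV deployments                        # find the name of the MySQL deployment
$ bosh -e MY-ENV -d MY-MYSQL-DEPLOYMENT cloud-check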

The Interruptor

If you have enabled the Interruptor, it might be preventing a node from starting.

You can confirm that the Interruptor has activated by examining /var/vcap/sys/log/mysql/mysql.err.log on the failing node. The log contains the following message:

WSREP_SST: [ERROR] ##################################################################################### (20160610 04:33:21.338)
WSREP_SST: [ERROR] SST disabled due to danger of data loss. Verify data and run the rejoin-unsafe errand (20160610 04:33:21.340)
WSREP_SST: [ERROR] ##################################################################################### (20160610 04:33:21.341)

Perform the procedure in the Override the Interruptor section of Using the Interruptor to force the node to rejoin the cluster.
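
The rejoin-unsafe errand is run with the BOSH CLI. A sketch using BOSH CLI v2 syntax with placeholder environment and deployment names; follow the Using the Interruptor documentation for the full procedure:

$ bosh -e MY-ENV -d MY-MYSQL-DEPLOYMENT run-errand rejoin-unsafe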

Accidental Deletion of Service Plan

Symptom

You have deleted a service plan accidentally.

Explanation

A service plan is only unrecoverable if you perform all of the following steps in sequence:

  1. Click the trash-can icon in the Service Plan section of the MySQL for PCF configuration.
  2. Create a plan with the same name.
  3. Click Save.
  4. Return to the Ops Manager Installation Dashboard, and click Apply Changes.

The deploy eventually fails with the following error:

Server error, status code: 502, error code: 270012, message: Service broker catalog is invalid: Plan names must be unique within a service

Solution

After you have deployed with Apply Changes, the original plan cannot be recovered. As long as service instances of that plan exist, you cannot create a new plan with the same name. At this point, the only workaround is to create a new plan with the same specifications but a different name. Existing instances continue to appear under the old plan name, but new instances must be created using the new plan name.

If you have performed steps 1 and 2 in the above sequence, do not click Save. Return to the Ops Manager Installation Dashboard. Any accidental changes are discarded.

If you have performed steps 1, 2, and 3 in the above sequence, do not click Apply Changes. Return to the Ops Manager Installation Dashboard and click the Revert button. Any accidental changes are discarded.