Troubleshooting and Diagnostics

This topic describes how to diagnose and troubleshoot problems with MySQL for Pivotal Cloud Foundry (PCF).

Diagnose Problems

If your cluster is experiencing downtime or is in a degraded state, Pivotal recommends the following workflow for gathering information to diagnose the type of failure the cluster is experiencing.

  1. Use the mysql-diag tool to gather a summary of the network, disk, and replication state of each cluster node.
  2. Run download-logs against each node in your MySQL cluster, the MySQL proxies, and the MySQL backup-prepare node. It’s important to do this before attempting recovery, because any failures in the recovery procedure can result in logs being lost or made inaccessible.
  3. Depending on the reported state from mysql-diag, the troubleshooting techniques listed below might allow you to recover your cluster.
  4. If you are uncertain about the recovery steps to take, submit a ticket through Pivotal Support with the following information:
    • Logs fetched by running the download-logs script
    • Output from mysql-diag
    • Type of environment the MySQL deployment is running in (Elastic Runtime, MySQL for PCF, or other)
    • Versions of the installed Ops Manager, Elastic Runtime, and MySQL for PCF

For more diagnostic techniques, see the following section.

Diagnostic Techniques

Check for Lost Quorum

If your cluster has lost quorum, you must bootstrap it.

To determine whether you need to bootstrap your cluster, check whether the cluster has lost quorum. For more information, see Bootstrapping.
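
As a quick check, you can query the wsrep_cluster_status variable on each node using the MySQL admin credentials. This is a minimal sketch of the check: a node that belongs to a healthy primary component reports Primary, while a node in a partition that has lost quorum reports non-Primary or Disconnected.

$ mysql -h NODE-IP-ADDRESS -u root -p -e 'SHOW STATUS LIKE "wsrep_cluster_status";'

If no node reports Primary, the cluster has lost quorum and must be bootstrapped.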

Check Replication Status

Perform the following steps to check the replication status:

  1. Obtain the IP addresses of your MySQL server by performing the following steps:

    1. From the PCF Installation Dashboard, click the MySQL for Pivotal Cloud Foundry tile.
    2. Click the Status tab.
    3. Record the IP addresses for all instances of the MySQL Server job.
  2. Obtain the admin credentials for your MySQL server by performing the following steps:

    1. From the MySQL for PCF tile, click the Credentials tab.
    2. Locate the Mysql Admin Password entry in the MySQL Server section and click Link to Credential.
    3. Record the values for identity and password.
  3. SSH into the Ops Manager VM. The procedure varies by IaaS; see SSH into Ops Manager for instructions.

  4. From the Ops Manager VM, place some data in the first node by performing the following steps. Replace FIRST-NODE-IP-ADDRESS with the IP address of the first node recorded in step 1 and YOUR-IDENTITY with the identity value recorded in step 2. When prompted for a password, provide the password value recorded in step 2.

    1. Create a dummy database in the first node:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -e "create database verify_healthy;"
      
    2. Create a dummy table in the dummy database:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "create table dummy_table (id int not null primary key auto_increment, info text) engine='InnoDB';"
      
    3. Insert some data into the dummy table:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "insert into dummy_table(info) values ('dummy data'),('more dummy data'),('even more dummy data');"
      
    4. Query the table and verify that the three rows of dummy data exist on the first node:
      $ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;"
      Enter password:
      +----+----------------------+
      | id | info                 |
      +----+----------------------+
      |  4 | dummy data           |
      |  7 | more dummy data      |
      | 10 | even more dummy data |
      +----+----------------------+
      
  5. Verify that the other nodes contain the same dummy data by performing the following steps for each of the remaining MySQL server IP addresses obtained above:

    1. Query the dummy table, replacing NEXT-NODE-IP-ADDRESS with the IP address of the MySQL server instance and YOUR-IDENTITY with the identity value obtained above. When prompted for a password, provide the password value obtained above.
      $ mysql -h NEXT-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;"
      
    2. Examine the output of the mysql command and verify that the node contains the same three rows of dummy data as the other nodes.
      +----+----------------------+
      | id | info                 |
      +----+----------------------+
      |  4 | dummy data           |
      |  7 | more dummy data      |
      | 10 | even more dummy data |
      +----+----------------------+
      
  6. If the MySQL server instances do not all return the same result, contact Pivotal Support before proceeding further or making any changes to your deployment. If each MySQL server instance returns the same result, you can safely proceed to scaling down your cluster to a single node.
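
Optionally, after you confirm that replication is healthy, you can drop the dummy database so that it does not remain in the cluster. The following cleanup command is a minimal sketch that reuses the placeholders and credentials from the steps above:

$ mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -e "drop database verify_healthy;"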

Check Cluster State

To check the cluster state, connect to each MySQL node using a MySQL client and check its status.

$ mysql -h NODE_IP -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_cluster_status";'
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

If all nodes are in the Primary component, you have a healthy cluster. If some nodes are in a Non-primary component, those nodes are not able to join the cluster.

To check how many nodes are in the cluster, run the following command:

$ mysql -h NODE_IP -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_cluster_size";'
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+

If the value of wsrep_cluster_size is equal to the expected number of nodes, then all nodes have joined the cluster. Otherwise, check network connectivity between nodes and use monit status to identify any issues preventing nodes from starting.
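
To dig into why a node has not joined, you can also check the BOSH-managed processes on that node with monit. The commands below are a sketch and assume you have an SSH session on the MySQL node; the exact process name (mariadb_ctrl in many releases) can vary by tile version.

$ monit summary
$ monit status mariadb_ctrl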

For more information, see the official Galera documentation for Checking Cluster Integrity.

To restart your cluster, see Bootstrapping.

Troubleshoot Problems

This section lists symptoms, explanations, and solutions for the following common problems:

Many Replication Errors in Logs

Symptom

You see many replication errors in the MySQL logs.

Explanation

Unless the GRA files show a clear execution error, such as running out of disk space, seeing many replication errors in your logs is normal behavior. The MySQL for PCF team is developing a more advanced monitoring solution to detect the failure case and alert operators in the future.

Occasionally, replication errors in the MySQL logs resemble the following:

160318 9:25:16 [Warning] WSREP: RBR event 1 Query apply warning: 1, 16992456
160318 9:25:16 [Warning] WSREP: Ignoring error for TO isolated action: source: abcd1234-abcd-1234-abcd-1234abcd1234 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 246804 trx_id: -1 seqnos (l: 865022, g: 16992456, s: 16992455, d: 16992455, ts: 2530660989030983)
160318 9:25:16 [ERROR] Slave SQL: Error 'Duplicate column name 'number'' on query. Default database: 'cf_0123456_1234_abcd_1234_abcd1234abcd'. Query: 'ALTER TABLE ...'

This error occurs when an app issues an ALTER TABLE command that fails to apply to the current schema. This is typically the result of user error.

The MySQL node that receives the request processes it normally. If it fails, the node sends the failure to the app, and the app must determine the next step. In a Galera cluster, however, all DDL is replicated and all replication failures are logged. Therefore, the failing ALTER TABLE command is also run by the slave nodes, and when it fails there, they log it as a “replication failure” because they cannot tell the difference.

Cases where a valid DDL works on some nodes yet fails on others are rare. Usually those cases are limited to problems with running out of disk space or working memory.
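
For illustration only, the following hypothetical sequence reproduces the kind of failure shown in the log excerpt above; the database and table names are placeholders. The second ALTER TABLE fails on every node because the column already exists, and the nodes that apply it through replication record the failure as a replication error:

$ mysql -h NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D YOUR-APP-DATABASE -e "alter table some_table add column number int;"
$ mysql -h NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D YOUR-APP-DATABASE -e "alter table some_table add column number int;"
ERROR 1060 (42S21) at line 1: Duplicate column name 'number'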

An article from Percona suggests that the schemas can get out of sync. However, the author had to deliberately switch a node to RSU (Rolling Schema Upgrade) mode, which MySQL for PCF never does except during an SST (State Snapshot Transfer). The article demonstrates what is possible, but does not explain how a customer would actually experience this issue in production.

Solution

Disregard these replication errors; they are normal behavior. If you see the “ALTER TABLE” error shown above, fix the user error that produced it.

Accidental Deletion of Service Plan

Symptom

You have deleted a service plan accidentally.

Explanation

A service plan is unrecoverable only if you perform all of the following steps in sequence:

  1. Click the trash-can icon in the Service Plan section of the MySQL for PCF configuration.
  2. Create a plan with the same name.
  3. Click Save.
  4. Return to the Ops Manager Installation Dashboard, and click Apply Changes.

The deployment eventually fails with the following error:

Server error, status code: 502, error code: 270012, message: Service broker catalog is invalid: Plan names must be unique within a service

Solution

After you have deployed with Apply Changes, the original plan cannot be recovered. For as long as service instances of that plan exist, you cannot create a new plan with the same name. At this point, the only workaround is to create a new plan with the same specifications but a different name. Existing instances continue to appear under the old plan name, but new instances must be created using the new plan name.
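
For example, if the deleted plan was named 100mb-dev, you might recreate it with the same specifications under a new name such as 100mb-dev-v2 and have developers create new instances against the new plan. The service name p-mysql and plan name below are placeholders; check cf marketplace for the actual names in your environment:

$ cf marketplace
$ cf create-service p-mysql 100mb-dev-v2 my-new-instance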

If you have performed steps 1 and 2 in the above sequence, do not click Save. Return to the Ops Manager Installation Dashboard. Any accidental changes are discarded.

If you have performed steps 1, 2, and 3 in the above sequence, do not click Apply Changes. Return to the Ops Manager Installation Dashboard and click the Revert button. Any accidental changes are discarded.

No Space Left

Symptom

You receive an error that reports MySQL has no space left.

Explanation

If a server simultaneously processes multiple large queries that require temporary tables, the queries may use more disk space than is allocated to the MySQL nodes, and they can fail with no space left errors.

You can check whether a query uses a temporary table by running the EXPLAIN command on it and looking for Using temporary in the output.
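
For example, the following hypothetical query groups a large table on a non-indexed column, and the EXPLAIN output shows Using temporary (often together with Using filesort) in the Extra column. The database and table names are placeholders:

$ mysql -h NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D YOUR-APP-DATABASE -e "explain select status, count(*) from orders group by status;"

You can also run SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables'; to see how often temporary tables have spilled to disk.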

Solution

Redeploy with more persistent disk to avoid these issues.

Node Unable to Rejoin

Symptom

A MySQL node is unable to rejoin a cluster.

Explanation

Existing server nodes restarted with monit should automatically join the cluster. If a detached existing node fails to join the cluster, it might be because its sequence number (seqno) is higher than those of the nodes with quorum.

A higher seqno on the detached node indicates that it has recent changes to the data that the primary lacks. Check this by looking at the node’s error log at /var/vcap/sys/log/mysql/mysql.err.log.
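
To compare sequence numbers, check the Galera state file and the error log on each node. The paths below are the typical locations in this deployment and may vary by release; a seqno of -1 in grastate.dat means the node shut down uncleanly, in which case the error log is the better source:

$ cat /var/vcap/store/mysql/grastate.dat
$ grep -i seqno /var/vcap/sys/log/mysql/mysql.err.log | tail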

Solution

If the detached node has a higher sequence number than the primary component, perform one of the following procedures to restore the cluster:

  • Shut down the healthy, primary nodes and bootstrap from the detached node to preserve its data. See Bootstrapping for more details.
  • Abandon the unsynchronized data on the detached node by performing the manual procedure to force the node to rejoin the cluster, documented in Pivotal Knowledge Base.

    WARNING: Forcing a node to rejoin the cluster is a destructive procedure. Only perform it with the assistance of Pivotal Support.

It might also be possible to manually dump recent transactions from the detached node and apply them to the running cluster, but this process is error-prone.

Unresponsive Node(s)

Symptom

A MySQL node has become unresponsive.

Explanation

A node can become unresponsive for a number of reasons:

  • Network latency
  • MySQL process failure
  • Firewall rule changes
  • VM failure
  • The Interruptor has prevented a node from starting

Unresponsive nodes stop responding to queries and, after timing out, leave the cluster.

Nodes are marked as unresponsive or inactive under either of the following circumstances:

  • They fail to respond to one node within 15 seconds
  • They fail to respond to all other nodes within 5 seconds

Unresponsive nodes that become responsive again rejoin the cluster, as long as they retain the same IP address that is pre-configured in the gcomm address on all of the other running nodes and the remaining nodes hold a quorum.

All nodes suspend writes once they notice a problem with the cluster. After a timeout period of 5 seconds, requests to non-quorum nodes fail. Most clients return the error WSREP has not yet prepared this node for application use. Some clients may instead return unknown error. Nodes that have reached quorum continue fulfilling write requests.
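
To check whether a particular node is accepting application traffic, query the wsrep_ready status variable. This is a minimal sketch: a node in the primary component reports ON, while a node returning the error above reports OFF.

$ mysql -h NODE-IP-ADDRESS -u root -pPASSWORD -e 'SHOW STATUS LIKE "wsrep_ready";'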

If the cluster is deployed with a proxy, a continually inactive node causes the proxy to fail over and route new queries to a different MySQL node.
