
Using the Interruptor

This topic explains how to use the Interruptor, a component of MySQL for Pivotal Cloud Foundry (PCF) that provides a solution for preventing data loss.

WARNING: Using the Interruptor is potentially destructive and should only be attempted by advanced users.

Overview

There are rare cases in which a MySQL node silently falls out of sync with the other nodes of the cluster. The Replication Canary closely monitors the cluster for this condition.

However, if the Replication Canary does not detect the failure, the Interruptor provides a solution for preventing data loss.

How the Interruptor Works

If the node receiving traffic from the proxy falls out of sync with the cluster, it generates a dataset that the other nodes do not have. If the same node later receives a transaction that is not compatible with the datasets of the other nodes, it discards its local dataset and adopts the datasets of the other nodes. This is generally the desired behavior. However, if data replication is not functioning across the cluster, the node could destroy valid data by discarding its local dataset.

When enabled, the Interruptor prevents the node from destroying its local dataset if there is a risk of losing valid data.

Note: If you receive a notification that the Interruptor has activated, contact Pivotal Support immediately. Support will work with you to determine the nature of the failure, and provide guidance regarding a solution.

An out-of-sync node employs one of two modes to catch up with the cluster:

  • Incremental State Transfer (IST): If a node has been out of the cluster for a relatively short period of time, for example during a reboot, the node invokes IST. This is not a dangerous operation, and the Interruptor does not interfere.
  • State Snapshot Transfer (SST): If a node has been unavailable for an extended amount of time, for example because of a hardware failure that requires physical repair, the node might invoke SST. In cases of failed replication, SST can cause data loss. When enabled, the Interruptor prevents this method of recovery.

For more information about these modes, see State Transfers in the Galera documentation.

Sample Notification Email

The Interruptor sends an email through the Pivotal Application Service (PAS) or Elastic Runtime notification service when it prevents a node from rejoining a cluster. See the following example:

Subject: CF Notification: p-mysql alert 100

This message was sent directly to your email address.

{alert-code 100}
Hello, just wanted to let you know that the MySQL node/cluster has gone down and has been disallowed from re-joining by the interruptor.

Enable the Interruptor

The Interruptor is deactivated by default. To enable it, perform the following steps:

  1. Navigate to the Ops Manager Installation Dashboard.
  2. Click the MySQL for PCF tile.
  3. Click Advanced Options.
  4. Under Enable optional protections, select the Prevent node auto re-join checkbox.
  5. Click Save.
  6. Return to the Ops Manager Installation Dashboard and click Apply Changes to redeploy the tile.

You can confirm that the Interruptor has activated by examining /var/vcap/sys/log/mysql/mysql.err.log on the failing node. The log contains the following message:

WSREP_SST: [ERROR] ##################################################################################### (20160610 04:33:21.338)
WSREP_SST: [ERROR] SST disabled due to danger of data loss. Verify data and run the rejoin-unsafe errand (20160610 04:33:21.340)
WSREP_SST: [ERROR] ##################################################################################### (20160610 04:33:21.341)
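
The log check described above can be sketched as a small shell function. This is only an illustration: the function name `interruptor_tripped` is hypothetical, while the default log path and the message text are the ones shown in this topic.

```shell
# Report whether a MySQL error log contains the Interruptor's
# "SST disabled" message. The function name is hypothetical.
# The default path is the one named in this topic; pass another
# path as the first argument to override it.
interruptor_tripped() {
  local log_file="${1:-/var/vcap/sys/log/mysql/mysql.err.log}"
  # -q: print nothing, exit 0 only if the message is present.
  # -F: treat the pattern as a fixed string, not a regex.
  grep -qF 'SST disabled due to danger of data loss' "$log_file"
}

# Example, run on the failing node:
#   if interruptor_tripped; then echo "Interruptor has activated"; fi
```

The function returns a non-zero status both when the message is absent and when the log file cannot be read, so a failed check does not by itself prove that the Interruptor is inactive.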

Override the Interruptor

In general, if the Interruptor has activated but the Replication Canary has not triggered, it is safe for the node to rejoin the cluster. You can check the health of the remaining nodes in the cluster by following the Check Replication Status instructions.
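
The following is a rough sketch of one kind of health check, not the linked Check Replication Status procedure. It assumes the `mysql` client on the node is already configured with credentials, and it relies on two standard Galera status variables, `wsrep_cluster_status` and `wsrep_local_state_comment`. The helper names `wsrep_value` and `node_looks_healthy` are hypothetical.

```shell
# Print the value of one status variable from tab-separated
# "name<TAB>value" lines, such as the output of:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"
# (helper name is hypothetical)
wsrep_value() {
  awk -F '\t' -v name="$1" '$1 == name { print $2 }'
}

# Succeed only if the status text on stdin describes a node that
# belongs to the primary component and reports itself as synced.
# (helper name is hypothetical)
node_looks_healthy() {
  local status
  status="$(cat)"
  [ "$(printf '%s\n' "$status" | wsrep_value wsrep_cluster_status)" = "Primary" ] &&
    [ "$(printf '%s\n' "$status" | wsrep_value wsrep_local_state_comment)" = "Synced" ]
}

# Example, run on each remaining node:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | node_looks_healthy \
#     && echo "node reports Primary and Synced"
```

A passing result is one data point only. It does not replace the Replication Canary or the documented status check.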

Before running the BOSH CLI commands below, you must SSH into the Ops Manager VM and log in to the BOSH Director. For more information, see Advanced Troubleshooting with the BOSH CLI.

If you are using PCF v1.10, follow the BOSH CLI v1 procedures to force a node to rejoin the cluster.

If you are using PCF v1.11 or later, follow the BOSH CLI v2 procedures to force a node to rejoin the cluster.

Force a Node to Rejoin the Cluster with the BOSH CLI v1

  1. List your deployments:
    $ bosh deployments
  2. From the output, locate the MySQL for PCF deployment and record its name.
  3. Download the manifest for the MySQL for PCF deployment, specifying the name of the deployment. For example:
    $ bosh download manifest p-mysql-deployment ./manifest.yml
  4. Set the BOSH CLI to the MySQL for PCF deployment:
    $ bosh deployment ./manifest.yml
  5. Run bosh run errand rejoin-unsafe to force a node to rejoin the cluster:

    $ bosh run errand rejoin-unsafe
    [...]
    [stdout]
    Started rejoin-unsafe errand ...
    Successfully repaired cluster
    rejoin-unsafe errand completed
    
    [stderr]
    None
    
    Errand `rejoin-unsafe' completed successfully (exit code 0)
    

If the rejoin-unsafe errand cannot make a node rejoin the cluster, log in to each node that has tripped the Interruptor and perform the manual procedure to force the node to rejoin the cluster, as documented in the Pivotal Knowledge Base.

WARNING: Forcing a node to rejoin the cluster is a destructive procedure. Only perform it with the assistance of Pivotal Support.
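
Steps 3 through 5 of the BOSH CLI v1 procedure can be chained in a small wrapper, sketched below. The function name `rejoin_unsafe_v1` is hypothetical, and the sketch assumes you are already logged in to the BOSH Director and have recorded the deployment name. The same warning applies: run it only with the assistance of Pivotal Support.

```shell
# Force a rejoin with the BOSH CLI v1 by chaining the documented
# commands. The function name is hypothetical.
# Usage: rejoin_unsafe_v1 DEPLOYMENT_NAME [MANIFEST_PATH]
rejoin_unsafe_v1() {
  local deployment="$1"
  local manifest="${2:-./manifest.yml}"
  if [ -z "$deployment" ]; then
    echo "usage: rejoin_unsafe_v1 DEPLOYMENT_NAME [MANIFEST_PATH]" >&2
    return 2
  fi
  # Each step runs only if the previous one succeeded, so the
  # errand never runs unless the manifest was downloaded and
  # set as the current deployment.
  bosh download manifest "$deployment" "$manifest" &&
    bosh deployment "$manifest" &&
    bosh run errand rejoin-unsafe
}

# Example:
#   rejoin_unsafe_v1 p-mysql-deployment
```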

Force a Node to Rejoin the Cluster with the BOSH CLI v2

  1. List your deployments. For example:
    $ bosh2 -e my-env deployments
  2. From the output, locate the MySQL for PCF deployment and record its name.
  3. Run bosh2 run-errand rejoin-unsafe to force a node to rejoin the cluster, specifying the MySQL for PCF deployment. For example:

    $ bosh2 -e my-env -d p-mysql-deployment run-errand rejoin-unsafe
    [...]
    [stdout]
    Started rejoin-unsafe errand ...
    Successfully repaired cluster
    rejoin-unsafe errand completed
    
    [stderr]
    None
    
    Errand `rejoin-unsafe' completed successfully (exit code 0)
    

If the rejoin-unsafe errand cannot make a node rejoin the cluster, log in to each node that has tripped the Interruptor and perform the manual procedure to force the node to rejoin the cluster, as documented in the Pivotal Knowledge Base.

WARNING: Forcing a node to rejoin the cluster is a destructive procedure. Only perform it with the assistance of Pivotal Support.
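
When scripting the BOSH CLI v2 procedure, it can help to confirm that the errand output contains the success line shown in the sample above. The sketch below does that. The function name `rejoin_unsafe_v2` is hypothetical, and the check assumes the errand prints "Successfully repaired cluster" as in the sample output. Run it only with the assistance of Pivotal Support.

```shell
# Run the rejoin-unsafe errand with the BOSH CLI v2 and verify
# that the output reports a repaired cluster.
# The function name is hypothetical.
# Usage: rejoin_unsafe_v2 ENVIRONMENT DEPLOYMENT_NAME
rejoin_unsafe_v2() {
  local env_alias="$1"
  local deployment="$2"
  local output
  if [ -z "$env_alias" ] || [ -z "$deployment" ]; then
    echo "usage: rejoin_unsafe_v2 ENVIRONMENT DEPLOYMENT_NAME" >&2
    return 2
  fi
  # Capture stdout and stderr so the full errand output can be
  # both shown and inspected.
  output="$(bosh2 -e "$env_alias" -d "$deployment" run-errand rejoin-unsafe 2>&1)" || {
    printf '%s\n' "$output" >&2
    return 1
  }
  printf '%s\n' "$output"
  # Succeed only if the sample success line is present.
  printf '%s\n' "$output" | grep -qF 'Successfully repaired cluster'
}

# Example:
#   rejoin_unsafe_v2 my-env p-mysql-deployment
```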
