
Troubleshooting MySQL for PCF

This topic provides operators with basic instructions for troubleshooting on-demand MySQL for PCF.

For information about temporary MySQL for PCF service interruptions, see Service Interruptions.

Troubleshooting Errors

This section provides information about how to troubleshoot specific errors or error messages.

Deprecated Service Bindings Found (Upgrade Error)

This problem happens when upgrading to MySQL for PCF v2.5 if there are service bindings that are still using IP addresses.

Symptom

You see in the Ops Manager installation pane that the upgrade process fails at the validate-before-upgrade-instances errand step.

The error output in the logs contains a table of bindings or service keys which must be updated.

Stdout    +---------------------------+--------------------------------------+------------------------+--------------------------+--------------------+-------------------+-----------------------------+
          |          SERVICE          |             SERVICE GUID             |          ORG           |          SPACE           | APP OR SERVICE KEY |       TYPE        |           REASON            |
          +---------------------------+--------------------------------------+------------------------+--------------------------+--------------------+-------------------+-----------------------------+
          | upgrade-outdated-instance | 34f26746-fb46-4f14-87bc-e1ddce26f340 | MYSQL-ORG-upgrade-test | MYSQL-SPACE-upgrade-test | cs-accept          | AppBinding        | no dns: hostname="10.0.8.5" |
          +---------------------------+--------------------------------------+------------------------+--------------------------+--------------------+-------------------+-----------------------------+
          | tlsDB                     | a999db0b-176e-4ac8-8342-d72b338d1f0c | MYSQL-ORG-upgrade-test | MYSQL-SPACE-upgrade-test | cs-accept-tls      | AppBinding        | no dns: hostname="10.0.8.6" |
          +---------------------------+--------------------------------------+------------------------+--------------------------+--------------------+-------------------+-----------------------------+
          | tlsDB                     | a999db0b-176e-4ac8-8342-d72b338d1f0c | MYSQL-ORG-upgrade-test | MYSQL-SPACE-upgrade-test | test-cli           | ServiceKeyBinding | no dns: hostname="10.0.8.6" |
          +---------------------------+--------------------------------------+------------------------+--------------------------+--------------------+-------------------+-----------------------------+

Errand 'validate-before-upgrade-instances' completed with error (exit code 1)
Exit code 1
Stderr     2018/12/18 02:51:51 Deprecated service instances found, try upgrade it manually before upgrading tile

Explanation

In MySQL for PCF v2.5, if you run upgrade-all-service-instances while there are service bindings that use IP addresses, app downtime can occur. This is because bindings are still directing apps to connect with IP addresses, but the TLS certificate is signed with a DNS hostname. This prevents apps that validate the server hostname in the TLS certificate from establishing a TLS connection to the database.

In MySQL for PCF v2.5, the validate-before-upgrade-instances errand checks that there are no service bindings or service keys using IP addresses, and, if there are, outputs the above error and prevents the upgrade-all-service-instances errand from running.

For general information about errands, including the upgrade-all-service-instances errand, see MySQL for PCF Errands.
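
To check whether an existing binding still points at an IP address rather than a DNS hostname, you can inspect the binding credentials that an app sees. A minimal sketch, where MY-APP is a placeholder app name:

  cf env MY-APP | grep '"hostname"'

If the hostname value is an IP address, such as 10.0.8.5, the binding is deprecated and must be re-created as described in Remove Deprecated Bindings.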

Solution

To solve this issue, do the following:

  1. Do the procedures in Remove Deprecated Bindings.
    After you restart all of your apps, your apps can connect to the database and resume service. However, your Ops Manager installation is still in a failed state.
  2. To finish the installation, do the following:
    1. In the Ops Manager Installation Dashboard, click Review Pending Changes.
    2. In the Review Pending Changes pane, ensure that the Validate no IP-based bindings in use before upgrade-all-service-instances and Upgrade all On-demand MySQL Service Instances errands are selected.
    3. After reviewing the pending changes, click Apply Changes to complete the installation.

Failed Install

  1. Certificate issues: The on-demand broker (ODB) requires valid certificates. Ensure that your certificates are valid and generate new ones if necessary. To generate new certificates, contact Pivotal Support.
  2. Deploy fails: Deploys can fail for a variety of reasons. View the logs using Ops Manager to determine why the deploy is failing.
  3. Networking problems:
    • Cloud Foundry cannot reach the MySQL for PCF service broker
    • Cloud Foundry cannot reach the service instances
    • The service network cannot access the BOSH director
  4. Register broker errand fails.
  5. The smoke test errand fails.
  6. Resource sizing issues: These occur when the resource sizes selected for a given plan are less than the MySQL for PCF service requires to function. Check your resource configuration in Ops Manager and ensure that the configuration matches that recommended by the service.
  7. Other service-specific issues.

Cannot Create or Delete Service Instances

If developers report errors such as:

Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: redis-acceptance, service-instance-guid: ae9e232c-0bd5-4684-af27-1b08b0c70089, broker-request-id: 63da3a35-24aa-4183-aec6-db8294506bac, task-id: 442, operation: create

Follow these steps:

  1. If the BOSH error shows a problem with the deployment manifest, open the manifest in a text editor to inspect it.

  2. To continue troubleshooting, log in to BOSH and target the MySQL for PCF service instance using the instructions in Parse a Cloud Foundry (CF) Error Message below.

  3. Retrieve the BOSH task ID from the error message and run the following command:

    bosh task TASK-ID
    
  4. If you need more information, access the broker logs and use the broker-request-id from the error message above to search the logs, as shown in the sketch after this list.
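
A minimal sketch of this flow, using the task-id and broker-request-id values from the example error message above (substitute your own values):

  # Show the full debug log for the failed BOSH task
  bosh task 442 --debug

  # After downloading the broker logs (see Access Broker Logs and VM(s) below),
  # search them for the broker-request-id from the error message
  grep 63da3a35-24aa-4183-aec6-db8294506bac broker.log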

Broker Request Timeouts

If developers report errors such as:

Server error, status code: 504, error code: 10001, message: The request to the service broker timed out: https://BROKER-URL/v2/service_instances/e34046d3-2379-40d0-a318-d54fc7a5b13f/service_bindings/aa635a3b-ef6d-41c3-a23f-55752f3f651b

Follow these steps:

  1. Confirm that Cloud Foundry (CF) is connected to the service broker.
  2. Check the BOSH queue size:
    1. Log into BOSH as an admin.
    2. Run bosh tasks.
  3. If there are a large number of queued tasks, the system might be under too much load. BOSH is configured with two workers and one status worker, which might not be sufficient for the level of load. Advise app developers to try again once the system is under less load. A sketch for counting queued tasks follows this list.
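
A minimal sketch for checking the queue depth; what counts as a large number depends on your environment:

  # List current tasks, then count how many are still waiting in the queue
  bosh tasks
  bosh tasks | grep -c queued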

Cannot Bind to or Unbind from Service Instances

Instance Does Not Exist

If developers report errors such as:

Server error, status code: 502, error code: 10001, message: Service broker error: instance does not exist

Follow these steps:

  1. Confirm that the MySQL for PCF service instance exists in BOSH and obtain its GUID by running the following cf CLI command:

    cf service MY-INSTANCE --guid
    
  2. Using the GUID obtained above, run the following BOSH CLI command:

    bosh -d service-instance_GUID vms
    

If the BOSH deployment is not found, it has been deleted from BOSH. Contact Pivotal support for further assistance.
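
The two steps above can be combined into a short sketch, where MY-INSTANCE is a placeholder service instance name:

  GUID="$(cf service MY-INSTANCE --guid)"
  bosh -d "service-instance_${GUID}" vms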

Other Errors

If developers report errors such as:

Server error, status code: 502, error code: 10001, message: Service broker error: There was a problem completing your request. Please contact your operations team providing the following information: service: example-service, service-instance-guid: 8d69de6c-88c6-4283-b8bc-1c46103714e2, broker-request-id: 15f4f87e-200a-4b1a-b76c-1c4b6597c2e1, operation: bind

To find out the exact issue with the binding process:

  1. Access the service broker logs.

  2. Search the logs for the broker-request-id string listed in the error message above.

  3. Contact Pivotal support for further assistance if you are unable to resolve the problem.

Cannot Connect to a Service Instance

If developers report that their app cannot use service instances that they have successfully created and bound:

Ask the user to send application logs that show the connection error. If the error is originating from the service, then follow MySQL for PCF-specific instructions. If the issue appears to be network-related, then:

  1. Check that application security groups are configured correctly. Access should be configured for the service network that the tile is deployed to, as shown in the sketch after this list.

  2. Ensure that the network the Pivotal Application Service (PAS) tile is deployed to has network access to the service network. You can find the network definition for this service network in the BOSH Director tile.

  3. In Ops Manager, open the service tile and note the service network configured in the Networks pane.

  4. In Ops Manager, open the PAS tile and note the network it is assigned to. Make sure that these two networks can access each other.
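
A minimal sketch for reviewing application security groups (step 1 above); MY-ASG is a placeholder group name:

  # List all security groups and where they are applied
  cf security-groups

  # Show the rules in a specific group to confirm that the service network CIDR
  # and the MySQL port (typically 3306) are allowed
  cf security-group MY-ASG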

Service instances can also become temporarily inaccessible during upgrades and VM or network failures. See Service Interruptions for more information.

Upgrade All Instances Fails

If the upgrade-all-service-instances errand fails, look at the errand output in the Ops Manager log.

If an instance fails to upgrade, debug and fix it before running the errand again to prevent any failure issues from spreading to other on-demand instances.

Once the Ops Manager log no longer lists the deployment as failing, re-run the errand to upgrade the rest of the instances.

Missing Logs and Metrics

If no logs are being emitted by the on-demand broker, check that your syslog forwarding address is correct in Ops Manager.

  1. Ensure you have configured syslog for the tile.

  2. Ensure that you have network connectivity between the networks that the tile is using and the syslog destination. If the destination is external, you need to use the public IP VM extension feature available in your Ops Manager tile configuration settings. A connectivity sketch appears later in this section.

  3. Verify that the Firehose is emitting metrics:

    1. Install the cf nozzle plugin. For instructions, see the firehose plugin GitHub repository.
    2. To find logs from your service in the cf nozzle output, run the following:

      cf nozzle -f ValueMetric | grep --line-buffered "on-demand-broker/MY-SERVICE"
      

If no metrics appear within five minutes, verify that the broker network has access to the Loggregator system on all required ports.
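
To check basic reachability from the broker VM to your syslog destination (step 2 above), a rough sketch, assuming nc is available on the VM and your syslog drain uses TCP; SYSLOG-HOST and SYSLOG-PORT are the address and port configured in the tile:

  # Run this from the broker VM (see Access Broker Logs and VM(s) below)
  nc -vz SYSLOG-HOST SYSLOG-PORT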

Contact Pivotal support if you are unable to resolve the issue.

Unable to Determine Leader and Follower (Errand Error)

This problem happens when the configure-leader-follower errand fails because it cannot determine the VM roles.

Symptom

The configure-leader-follower errand exits with 1 and the errand logs contain the following:

Unable to determine leader and follower based on transaction history.

Explanation

Something has happened to the instances, such as a failure or manual intervention. As a result, there is not enough information available to determine the correct state and topology without operator intervention to resolve the issue.

Solution

Use the inspect errand to determine which instance should be the leader. Then, using the orchestration errands and backup/restore, you can put the service instance into a safe topology, and then rerun the configure-leader-follower errand. This is shown in the example below.

This example shows one outcome that the inspect errand can return:

  1. Use the inspect errand to retrieve relevant information about the two VMs:
    $ bosh2 -e my-env -d my-dep run-errand inspect
    [...]
    Instance   mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
             2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
             2017/12/11 18:25:54
             Has Data: false
             Read Only: true
             GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-25
             Replication Configured: false

    Instance   mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
             2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
             2017/12/11 18:25:54
             Has Data: true
             Read Only: true
             GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-25
             Replication Configured: true

    2 errand(s)
    Succeeded
    In the above scenario, the first instance is missing data and does not have replication configured. The second instance has data and has replication configured. The instructions below resolve this by copying data to the first instance and resuming replication.
  2. Take a backup of the second instance using the Manual Backup steps.
  3. Restore the backup artifact to the first instance using the Manual Restore steps.

    At this point, the instances have equivalent data.
  4. Run the configure-leader-follower errand to reconfigure replication: bosh2 -e ENVIRONMENT -d DEPLOYMENT run-errand configure-leader-follower --instance=mysql/GUID-OF-LEADER
    $ bosh2 -e my-env -d my-dep \
      run-errand configure-leader-follower \
      --instance=mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
    
    Both Leader and Follower Instances Are Writable (Errand Error)

    This problem happens when the configure-leader-follower errand fails because both VMs are writable and the VMs might hold differing data.

    Symptom

    The configure-leader-follower errand exits with 1 and the errand logs contain the following:

    Both mysql instances are writable. Please ensure no divergent data and set one instance to read-only mode.

    Explanation

    MySQL for PCF tries to ensure that there is only one writable instance of the leader-follower pair at any given time. However, in certain situations, such as network partitions, or manual intervention outside of the provided bosh errands, it is possible for both instances to be writable. The service instances remain in this state until an operator resolves the issue to ensure that the correct instance is promoted and reduce the potential for data divergence.

    Solution

    1. Use the inspect errand to retrieve the GTID Executed set for each VM:
      $ bosh2 -e my-env -d my-dep run-errand inspect
      [...]
      Instance   mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
      Exit Code  0
      Stdout     -
      Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
               2017/12/11 18:25:54
               Has Data: true
               Read Only: false
               GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-23
               Replication Configured: false

      Instance   mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
      Exit Code  0
      Stdout     -
      Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
               2017/12/11 18:25:54
               Has Data: true
               Read Only: false
               GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-25
               Replication Configured: false

      2 errand(s)
      Succeeded
      If the GTID Executed sets for both instances are the same, continue to Step 2. If they are different, continue to Step 4.
    2. Look at the value of GTID Executed for both instances.
      • If the range after the GUID is equivalent, either instance can be made read-only, as described in Step 3.
      • If one instance has a range that is a subset of the other, the instance with the subset must be made read-only, as described in Step 3. To check whether one set is a subset of the other, see the sketch after this list.
    3. Based on the information you gathered in the step above, run the make-read-only errand to make the appropriate instance read-only: bosh2 -e ENVIRONMENT -d DEPLOYMENT run-errand make-read-only --instance=mysql/MYSQL-SUBSET-INSTANCE
      $ bosh2 -e my-env -d my-dep \
        run-errand make-read-only \
        --instance=mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
      [...]
      Succeeded
      
    4. If the GTID Executed sets are neither equivalent nor subsets, data has diverged and you must determine what data has diverged as part of the procedure below:
      1. Use the make-read-only errand to set both instances to read-only to prevent further data divergence. bosh2 -e ENVIRONMENT -d DEPLOYMENT run-errand make-read-only --instance=mysql/MYSQL-INSTANCE
        $ bosh2 -e my-env -d my-dep \
          run-errand make-read-only \
          --instance=mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
        [...]
        Succeeded
      2. Take a backup of both instances using the Manual Backup steps.
      3. Manually inspect the data on each instance to determine the discrepancies, and consolidate the data on the instance that is further ahead, that is, the instance with the higher GTID Executed set. This instance becomes the new leader.
      4. Migrate all appropriate data to the new leader instance.
      5. After putting all data on the leader, ssh onto the follower: bosh2 -e ENVIRONMENT -d DEPLOYMENT ssh mysql/GUID-OF-FOLLOWER
        $ bosh2 -e my-env -d my-dep ssh mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
      6. Become root with the command sudo su.
      7. Stop the mysql process with the command monit stop mysql.
      8. Delete the data directory of the follower with the command rm -rf /var/vcap/store/mysql.
      9. Start the mysql process with the command monit start mysql.
      10. Use the configure-leader-follower errand to copy the leader’s data to the follower and resume replication: bosh2 -e ENVIRONMENT -d DEPLOYMENT run-errand configure-leader-follower --instance=mysql/GUID-OF-LEADER
        $ bosh2 -e my-env -d my-dep \
          run-errand configure-leader-follower \
          --instance=mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
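
    To check whether one GTID Executed set is a subset of the other (step 2 above), you can use the MySQL GTID_SUBSET function. A minimal sketch, using the sets from the example output and placeholder connection details:

      # Returns 1 if the first set is a subset of the second, 0 otherwise
      mysql -h MYSQL-NODE-IP -u ADMIN-USER -p -e \
        "SELECT GTID_SUBSET('1d774323-de9e-11e7-be01-42010a001014:1-23', '1d774323-de9e-11e7-be01-42010a001014:1-25') AS is_subset;"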

    Both Leader and Follower Instances Are Read-Only

    In a leader-follower topology, the leader VM is writable and the follower VM is read-only. However, if both VMs are read-only, apps cannot write to the database.

    Symptom

    Developers report that apps cannot write to the database.

    Explanation

    This problem happens if the leader VM fails and the BOSH Resurrector is enabled. When the leader is resurrected, it is set as read-only.

    Solution

    1. Use the inspect errand to confirm that both VMs are in a read-only state: bosh2 -e ENVIRONMENT -d DEPLOYMENT run-errand inspect
    2. Examine the output and locate the information about the leader-follower MySQL VMs:
      Instance   mysql/4eexample54b-0704-47eb-8eef-eb2example724
      Exit Code  0
      Stdout     -
      Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://999.0.0.1:8443/status
               2017/12/11 18:25:54
               Has Data: true
               Read Only: true
               GTID Executed: 1d779999-de9e-11e7-be01-42010a009999:1-23
               Replication Configured: true

      Instance   mysql/e0exampleade-0114-4d49-a929-cexample8beda
      Exit Code  0
      Stdout     -
      Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://999.0.0.1:8443/status
               2017/12/11 18:25:54
               Has Data: true
               Read Only: true
               GTID Executed: 1d779999-de9e-11e7-be01-42010a009999:1-25
               Replication Configured: false

      2 errand(s)
      Succeeded
    3. If Read Only is set to true for both VMs, make the leader writable using the following command: bosh2 -e ENVIRONMENT -d DEPLOYMENT run-errand configure-leader-follower --instance=mysql/GUID-OF-LEADER

      For example, if the second instance above is the leader:
      $ bosh2 -e my-env -d my-dep \
        run-errand configure-leader-follower \
        --instance=mysql/e0exampleade-0114-4d49-a929-cexample8beda
      

    Persistent Disk is Full

    If your persistent disk is full, apps become inoperable. In this state, read, write, and Cloud Foundry Command-Line Interface (cf CLI) operations do not work.

    Symptom

    Developers report that read, write, and cf CLI operations do not work. Developers cannot upgrade to a larger MySQL service plan to free up disk space.

    Explanation

    This problem happens if your persistent disk is full. When you use the BOSH CLI to target your deployment, you see that instances are at 100% persistent disk usage.

    Available disk space can be increased by deleting audit log files. After deleting audit logs, you can then upgrade to a larger MySQL service plan.

    Solution

    The solution to this problem is to confirm that your persistent disk is full, and then delete an audit log file to free up disk space. When the disk is no longer full, the service instance can be upgraded to a larger plan to prevent the disk from becoming full again.

    Follow the steps below:

    1. To retrieve and record the GUID of your service instance, run the following command:
      cf service SERVICE-INSTANCE-NAME --guid

      Where SERVICE-INSTANCE-NAME is the name of your service instance.

      For example:
      $ cf service my-service-instance --guid
          12345678-90ab-cdef-1234-567890abcdef
      If you do not know the name of your service instance, you can list service instances in the space with cf services.
    2. To confirm that your persistent disk usage is at 100%, run the following command:
      bosh -d service-instance_SERVICE-INSTANCE-GUID instances --vitals

      Where SERVICE-INSTANCE-GUID is the GUID you recorded in the above step.

      For example:
       $ bosh -d service-instance_12345678-90ab-cdef-1234-567890abcdef instances --vitals
      Using environment 'https://10.0.0.6:25555' as client 'admin'
      
      Task 19243. Done
      
      Deployment 'service-instance_12345678-90ab-cdef-1234-567890abcdef'
      
      Instance                                    Process State  AZ  IPs         VM Created At                 Uptime           Load              CPU    CPU    CPU   CPU   Memory        Swap      System      Ephemeral   Persistent
                                                                                                                                (1m, 5m, 15m)     Total  User   Sys   Wait  Usage         Usage     Disk Usage  Disk Usage  Disk Usage
      mysql/ca0ed8b5-7590-4cde-bba8-7ca2935f2bd0  running        z3  10.0.18.20  Wed Sep 12 22:01:44 UTC 2018  35d 20h 54m 17s  0.02, 0.03, 0.00  -      10.2%  0.4%  0.2%  14% (1.1 GB)  0% (0 B)  54% (37i%)  11% (4i%)   7% (0i%)
      
      1 instances
      
      Succeeded
        
    3. To retrieve and record the instance ID of your service instance, follow the procedure below that corresponds to your VM topology:
      • If you are using single node MySQL VMs, to retrieve and record the instance ID, run the following command:
        bosh -d service-instance_SERVICE-INSTANCE-GUID instances

        Where SERVICE-INSTANCE-GUID is the GUID you recorded in step 1.

        For example:
        $ bosh -d service-instance_12345678-90ab-cdef-1234-567890abcdef
              instances
              Using environment '34.237.123.534' as client 'admin'
        
              Task 204. Done
        
              Deployment 'service-instance_12345678-90ab-cdef-1234-567890abcdef'
        
              Instance                                    Process State  AZ  IPs
              mysql/ca0ed8b5-7590-4cde-bba8-7ca2935f2bd0  running        z2  10.244.17.3
        
              1 instances
        
              Succeeded
            
        The instance ID is the value for Instance after mysql/.

        In the above output, the instance ID is ca0ed8b5-7590-4cde-bba8-7ca2935f2bd0.

      • If you are using leader-follower MySQL VMs, to retrieve and record the leader instance ID, run the following command:
        bosh -d service-instance_SERVICE-INSTANCE-GUID run-errand inspect

        Where SERVICE-INSTANCE-GUID is the GUID you recorded in step 1.

        For example:
          $ bosh -d service-instance_12345678-90ab-cdef-1234-567890abcdef run-errand inspect
          Instance   mysql/ca0ed8b5-7590-4cde-bba8-7ca2935f2bd0
          Exit Code  0
          Stdout     2018/04/03 18:08:46 Started executing command: inspect
                    2018/04/03 18:08:46
                    IP Address: 10.0.8.11
                    Role: leader
                    Read Only: false
                    Replication Configured: false
                    Replication Mode: async
                    Has Data: true
                    GTID Executed: 82ddc607-710a-404e-b1b8-a7e3ea7ec063:1-18
                    2018/04/03 18:08:46 Successfully executed command: inspect
          Stderr     -
          
          Instance   mysql/37e4b6bc-2ed6-4bd2-84d1-e59a91f5e7f8
          Exit Code  0
          Stdout     2018/04/03 18:08:46 Started executing command: inspect
                    2018/04/03 18:08:46
                    IP Address: 10.0.8.10
                    Role: follower
                    Read Only: true
                    Replication Configured: true
                    Replication Mode: async
                    Has Data: true
                    GTID Executed: 82ddc607-710a-404e-b1b8-a7e3ea7ec063:1-18
                    2018/04/03 18:08:46 Successfully executed command: inspect
        The leader instance ID is the value for Instance after mysql/ for the instance marked Role: leader.

        In the above output, the instance ID of the leader VM is ca0ed8b5-7590-4cde-bba8-7ca2935f2bd0.
    4. To BOSH SSH into your service instances, run the following command:
      bosh -d service-instance_SERVICE-INSTANCE-GUID ssh mysql/INSTANCE-ID

      Where:
      • SERVICE-INSTANCE-GUID is the GUID you recorded in step 1.
      • INSTANCE-ID is the instance ID you recorded in the above step.

      For example:
              $ bosh -d service-instance_12345678-90ab-cdef-1234-567890abcdef ssh mysql/ca0ed8b5-7590-4cde-bba8-7ca2935f2bd0
              Using environment 'https://10.0.0.6:25555' as client 'admin'
        
              Using deployment 'service-instance_12345678-90ab-cdef-1234-567890abcdef'
        
              Task 19244. Done
            
    5. To move to the directory where your audit log files are located, run the following command:
      cd /var/vcap/sys/log/mysql/mysql-audit-log
    6. Delete at least one audit log file with the rm command, as shown in the sketch after this list.
    7. Update your service instance to a larger plan. For more information, see Update a Service Instance to a Larger Plan.
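
    A rough sketch of steps 5 and 6, run after you have used BOSH SSH to reach the VM in step 4; the audit log file name is a placeholder:

      # Confirm that the persistent disk is full
      df -h /var/vcap/store

      # List the audit log files, largest first, and remove at least one
      cd /var/vcap/sys/log/mysql/mysql-audit-log
      ls -lhS
      rm AUDIT-LOG-FILE-NAME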

    Unresponsive Node in a Highly Available Cluster

    Symptom

    A client connected to a MySQL cluster node reports the following error:

    WSREP has not yet prepared this node for application use

    Some clients might instead return the following:
    unknown error

    Explanation

    If the client is connected to a MySQL cluster node and that node loses connection to the rest of the cluster, the node stops accepting writes. If the connection to this node is made through the proxy, the proxy automatically re-routes further connections to a different node.

    Solutions

    A node can become unresponsive for a number of reasons. For solutions, see the following:

    • Network Latency: If network latency causes a node to become unresponsive, the node drops but eventually rejoins. The node automatically rejoins only if one node has left the cluster. Consult your IaaS network settings to reduce your network latency.
    • MySQL Process Failure: If the MySQL process crashes, monit and BOSH should restore the process. If the process is not restored, consult error logs using the download-logs tool. For more information, see download-logs in the cf-mysql-release GitHub repository.
    • Firewall Rule Change: If your firewall rules change, it might prevent a node from communicating with the rest of the cluster. This causes the node to become unresponsive. In this case, the logs show the node leaving the cluster but do not show network latency errors.

      To confirm that the node is unresponsive because of a firewall rule change, SSH from a responsive node to the unresponsive node. If you cannot connect, the node is unresponsive due to a firewall rule change. Change your firewall rules to enable the unresponsive node to rejoin the cluster.
    • VM Failure: If you cannot SSH into a node and you are not detecting either network latency or firewall issues, your node might be down due to VM failure. To confirm that the node is unresponsive and re-create the VM, see Re-create a Corrupted VM in a Highly Available Cluster below.
    • Node Unable to Rejoin: If a detached existing node fails to join the cluster, its sequence_number might be higher than those of the nodes with quorum. A higher sequence_number on the detached node indicates that it has recent changes to the data that the primary component lacks. You can check this by looking at the node’s error log at /var/vcap/sys/log/pxc-mysql/mysql.err.log.

      To restore the cluster, do one of the following:
      • If the detached node has a higher sequence number than the primary component, do the procedures in Bootstrapping.
      • If bootstrapping does not restore the cluster, you can manually force the node to rejoin the cluster. This removes all of the unsynchronized data from the detached server node and creates a new copy of the cluster data on the node. For more information, see Force a Node to Rejoin a Highly Available Cluster Manually below.

        Warning: Forcing a node to rejoin the cluster is a destructive procedure. Only do this procedure with the assistance of Pivotal Support.

      Many Replication Errors in Logs for Highly Available Clusters

      This problem happens when there are errors in SQL statements.

      Symptom

      You see many replication errors in the MySQL logs, like the following:

      160318 9:25:16 [Warning] WSREP: RBR event 1 Query apply warning: 1, 16992456
      160318 9:25:16 [Warning] WSREP: Ignoring error for TO isolated action: source: abcd1234-abcd-1234-abcd-1234abcd1234 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 246804 trx_id: -1 seqnos (l: 865022, g: 16992456, s: 16992455, d: 16992455, ts: 2530660989030983)
      160318 9:25:16 [ERROR] Slave SQL: Error 'Duplicate column name 'number'' on query. Default database: 'cf_0123456_1234_abcd_1234_abcd1234abcd'. Query: 'ALTER TABLE ...'

      Explanation and Solution

       For solutions for replication errors in MySQL log files, see the following:

       • ALTER TABLE errors: Fix the ALTER TABLE error. This error can occur when an app issues an invalid data definition statement. Other nodes log this problem as a replication error because they fail to replicate the ALTER TABLE.
       • Increased persistent disk usage or running out of working memory: Decode the GRA_*.log files and look for errors. See How to decode Galera GRA log files in the Pivotal Support knowledge base. The GRA log files contain failing DDL statements.

       If you see replication errors, but no ALTER TABLE or persistent disk or memory issues, you can ignore the replication errors.

      Troubleshooting Components

      This section provides guidance on checking for and fixing issues in on-demand service components.

      BOSH Problems

      Large BOSH Queue

      On-demand service brokers add tasks to the BOSH request queue, which can back up and cause delay under heavy loads. An app developer who requests a new MySQL for PCF instance sees create in progress in the Cloud Foundry Command Line Interface (cf CLI) until BOSH processes the queued request.

      Ops Manager currently deploys two BOSH workers to process its queue. Future versions of Ops Manager will let users configure the number of BOSH workers.

      Configuration

      Service instances in failing state

      You might have configured a VM or disk type on the tile's plan page in Ops Manager that is too small for the MySQL for PCF service instance to start. See the tile-specific guidance on resource requirements.

      Authentication

      UAA Changes

      If you have rotated any UAA user credentials, you might see authentication issues in the service broker logs.

      To resolve this, redeploy the MySQL for PCF tile in Ops Manager. This provides the broker with the latest configuration.

      Note: You must ensure that any changes to UAA credentials are reflected in the Ops Manager credentials tab of the Pivotal Application Service (PAS) tile.

      Networking

      Common issues with networking include:

      • Latency when connecting to the MySQL for PCF service instance to create or delete a binding: Try again or improve network performance.
      • Firewall rules are blocking connections from the MySQL for PCF service broker to the service instance: Open the MySQL for PCF tile in Ops Manager and check the two networks configured in the Networks pane. Ensure that these networks allow access to each other.
      • Firewall rules are blocking connections from the service network to the BOSH Director network: Ensure that service instances can access the Director so that the BOSH agents can report in.
      • Apps cannot access the service network: Configure Cloud Foundry application security groups to allow runtime access to the service network, as shown in the sketch after this list.
      • Problems accessing BOSH's UAA or the BOSH Director: Follow network troubleshooting and check that the BOSH Director is online.
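
      For the case where apps cannot access the service network, a minimal sketch for creating and binding an application security group; the group name, CIDR, and port are placeholders and must match your service network:

        # rules.json contains, for example:
        # [{"protocol":"tcp","destination":"10.0.8.0/24","ports":"3306"}]
        cf create-security-group mysql-service-network rules.json
        cf bind-running-security-group mysql-service-network
        # Restart apps so that the new rules take effect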

      Validate Service Broker Connectivity to Service Instances

      To validate connectivity, do the following:

      1. To SSH into the MySQL for PCF service broker, run the following command:

        bosh -d service-instance_GUID ssh
        
      2. If no BOSH task ID appears in the error message, search the broker log for the broker-request-id from the error message.

      Validate App Access to Service Instance

      Use cf ssh to access the app container, and then try connecting to the MySQL for PCF service instance using the binding credentials in the VCAP_SERVICES environment variable, as shown in the sketch below.
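
      A minimal sketch, assuming bash is available in the app container; MY-APP is a placeholder, and HOSTNAME-FROM-BINDING comes from the app's VCAP_SERVICES binding:

        # View the binding credentials (hostname, port, username) for the app
        cf env MY-APP

        # Open a shell in the app container, then test TCP reachability of the instance
        cf ssh MY-APP
        (echo > /dev/tcp/HOSTNAME-FROM-BINDING/3306) && echo "port reachable"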

      Quotas

      Plan Quota issues

      If developers report errors such as:

      Message: Service broker error: The quota for this service plan has been exceeded. 
      Please contact your Operator for help.
      
      1. Check your current plan quota.
      2. To increase the plan quota, do the following:
        1. Log into Ops Manager.
        2. Reconfigure the quota on the plan page.
        3. Deploy the tile.
      3. Alternatively, find out who is using the plan quota and take the appropriate action.

      Global Quota Issues

      If developers report errors such as:

      Message: Service broker error: The quota for this service has been exceeded. 
      Please contact your Operator for help.
      
      1. Check your current global quota.
      2. To increase the global quota, do the following:
        1. Log into Ops Manager.
        2. Reconfigure the quota on the on-demand settings page.
        3. Deploy the tile.
      3. Alternatively, find out who is using the quota and take the appropriate action.

      Failing Jobs and Unhealthy Instances

      To determine whether there is an issue with the MySQL for PCF service deployment, inspect the VMs. To do so, run the following command:

      bosh -d service-instance_GUID vms --vitals
      

      For additional information, run the following command:

      bosh instances --ps --vitals
      

      If the VM is failing, follow the service-specific information. Any unadvised corrective actions (such as running BOSH restart on a VM) can cause issues in the service instance.

      A failing process or failing VM might come back automatically after a temporary service outage. See VM Process Failure and VM Failure.
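
      To check individual processes directly on a service instance VM, you can also run monit on the instance; a sketch with placeholder deployment and instance names, using the standard stemcell monit path:

        bosh -d service-instance_GUID ssh mysql/INSTANCE-ID \
          -c "sudo /var/vcap/bosh/bin/monit summary"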

      AZ or Region Failure

      Failures at the IaaS level, such as Availability Zone (AZ) or region failures, can interrupt service and require manual restoration. See AZ Failure and Region Failure.

      Techniques for Troubleshooting

      This section provides instructions for interacting with the on-demand service broker and on-demand service instance BOSH deployments, and for performing general maintenance and housekeeping tasks.

      Parse a Cloud Foundry (CF) Error Message

      Failed operations (create, update, bind, unbind, delete) result in an error message. You can retrieve the error message later by running the cf CLI command cf service INSTANCE-NAME.

      $ cf service myservice
      
      Service instance: myservice
      Service: super-db
      Bound apps:
      Tags:
      Plan: dedicated-vm
      Description: Dedicated Instance
      Documentation url:
      Dashboard:
      
      Last Operation
      Status: create failed
      Message: Instance provisioning failed: There was a problem completing your request.
           Please contact your operations team providing the following information:
           service: redis-acceptance,
           service-instance-guid: ae9e232c-0bd5-4684-af27-1b08b0c70089,
           broker-request-id: 63da3a35-24aa-4183-aec6-db8294506bac,
           task-id: 442,
           operation: create
      Started: 2017-03-13T10:16:55Z
      Updated: 2017-03-13T10:17:58Z
      

      Use the information in the Message field to debug further. Provide this information to Pivotal Support when filing a ticket.

      The task-id field maps to the BOSH task ID. For more information on a failed BOSH task, run bosh task TASK-ID.

      The broker-request-id maps to the portion of the On-Demand Broker log containing the failed step. Access the broker log through your syslog aggregator, or access BOSH logs for the broker by typing bosh logs broker 0. If you have more than one broker instance, repeat this process for each instance.

      Access Broker and Instance Logs and VMs

      Before following the procedures below, log into the cf CLI and the BOSH CLI.

      Access Broker Logs and VM(s)

      You can access logs using Ops Manager by clicking on the Logs tab in the tile and downloading the broker logs.

      To access logs using the BOSH CLI, do the following:

      1. Identify the on-demand broker (ODB) deployment by running the following command:

        bosh deployments
        
      2. View VMs in the deployment by running the following command:

        bosh -d DEPLOYMENT-NAME instances
        
      3. SSH onto the VM by running the following command:

        bosh -d service-instance_GUID ssh
        
      4. Download the broker logs by running the following command:

        bosh -d service-instance_GUID logs
        


      The archive generated by BOSH or Ops Manager includes the following logs:

       • broker.log: Requests to the on-demand broker and the actions the broker performs while orchestrating the request (e.g. generating a manifest and calling BOSH). Start here when troubleshooting.
       • broker_ctl.log: Control script logs for starting and stopping the on-demand broker.
       • post-start.stderr.log: Errors that occur during post-start verification.
       • post-start.stdout.log: Post-start verification.
       • drain.stderr.log: Errors that occur while running the drain script.

      Access Service Instance Logs and VMs

      1. To target an individual service instance deployment, retrieve the GUID of your service instance with the following cf CLI command:

        cf service MY-SERVICE --guid
        
      2. To view VMs in the deployment, run the following command:

        bosh -d DEPLOYMENT-NAME instances
        
      3. To SSH into a VM, run the following command:

        bosh -d service-instance_GUID ssh
        
      4. To download the instance logs, run the following command:

        bosh -d service-instance_GUID logs
        

      Retrieve Admin and Read-Only Admin Credentials for a Service Instance

      To retrieve the admin and read-only admin credentials for a service instance from BOSH CredHub, perform the following steps:

      1. Use the cf CLI to determine the GUID associated with the service instance for which you want to retrieve credentials. Run the following command:
        cf service SERVICE-INSTANCE --guid

        Where SERVICE-INSTANCE is the name of the service instance.

        For example:
        $ cf service my-service-instance --guid
            12345678-90ab-cdef-1234-567890abcdef
        If you do not know the name of the service instance, you can list service instances in the space with cf services.
      2. Perform the steps in Gather Credential and IP Address Information and SSH into Ops Manager of Advanced Troubleshooting with the BOSH CLI to SSH into the Ops Manager VM.
      3. From the Ops Manager VM, log in to your BOSH Director with the BOSH CLI. See Log in to the BOSH Director in Advanced Troubleshooting with the BOSH CLI.
      4. Follow the steps in Find the CredHub Credentials, and record the values for identity and password.
      5. Set the API target of the CredHub CLI to your BOSH CredHub server.

        Run the following command:
        credhub api https://BOSH-DIRECTOR:8844 --ca-cert=/var/tempest/workspaces/default/root_ca_certificate

        Where BOSH-DIRECTOR is the IP address of the BOSH Director VM.

        For example:
        $ credhub api https://10.0.0.5:8844 --ca-cert=/var/tempest/workspaces/default/root_ca_certificate
      6. Log in to CredHub.

        Run the following command:
        credhub login \
            --client-name=CREDHUB-CLIENT-NAME \
            --client-secret=CREDHUB-CLIENT-SECRET
        Where CREDHUB-CLIENT-NAME is the value of identity and CREDHUB-CLIENT-SECRET is the value of password that you recorded in step 4 above.

        For example:
        $ credhub login \
                --client-name=credhub \
                --client-secret=abcdefghijklm123456789
      7. Use the CredHub CLI to retrieve the credentials.
        • To retrieve the password for the admin user, run the following command:
          credhub get -n /p-bosh/service-instance_GUID/admin_password
          In the output, the password appears under value.

        • To retrieve the password for the read-only admin user, run the following command:
          credhub get -n /p-bosh/service-instance_GUID/read_only_admin_password
          In the output, the password appears under value.

        For example:
        $ credhub get -n /p-bosh/service-instance_70d30bb6-7f30-441a-a87c-05a5e4afff26/admin_password
            id: d6e5bd10-3b60-4a1a-9e01-c76da688b847
            name: /p-bosh/service-instance_70d30bb6-7f30-441a-a87c-05a5e4afff26/admin_password
            type: password
            value: UMF2DXsqNPPlCNWMdVMcNv7RC3Wi10
            version_created_at: 2018-04-02T23:16:09Z
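
        To print only the password value, or to use it to connect, a rough sketch; GUID is the service instance GUID recorded in step 1, and the admin account name of admin and the instance IP are assumptions to verify for your deployment:

          credhub get -n /p-bosh/service-instance_GUID/admin_password | grep 'value:'

          # Connect from a VM that can reach the service instance
          mysql -h MYSQL-INSTANCE-IP -u admin -p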

      Reinstall a Tile

      To reinstall the MySQL for PCF tile, see the Reinstalling MySQL for Pivotal Cloud Foundry version v2 and later Knowledge Base article.

      View Resource Saturation and Scaling

      To view usage statistics for any service, do the following:

      1. Run the following command:

        bosh -d DEPLOYMENT-NAME vms --vitals
        
      2. To view process-level information, run:

        bosh -d DEPLOYMENT-NAME instances --ps
        

      Identify Service Instance Owner

      To identify which apps are using a specific service instance based on the BOSH deployment name, do the following:

      1. Take the deployment name and strip the service-instance_ prefix, leaving you with the GUID.
      2. Log in to CF as an admin.
      3. Obtain a list of all service bindings by running the following:

        cf curl /v2/service_instances/GUID/service_bindings
        
      4. The output from the above curl gives you a list of resources, with each item referencing a service binding, which contains the APP-URL. To find the name, org, and space for the app, run the following:

        1. Run cf curl APP-URL and record the app name under entity.name.
        2. Run cf curl SPACE-URL to obtain the space, using the entity.space_url from the above curl. Record the space name under entity.name.
        3. Run cf curl ORGANIZATION-URL to obtain the org, using the entity.organization_url from the above curl. Record the organization name under entity.name.

      Note: When running cf curl, ensure that you query all pages, because the responses are limited to a certain number of bindings per page. The default is 50. To find the next page, curl the value under next_url. A paging sketch follows this note.
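
      A rough paging sketch, assuming jq is installed where you run the cf CLI; GUID is the service instance GUID:

        url="/v2/service_instances/GUID/service_bindings"
        while [ "$url" != "null" ]; do
          page="$(cf curl "$url")"
          # Print the app URL of every binding on this page
          echo "$page" | jq -r '.resources[].entity.app_url'
          url="$(echo "$page" | jq -r '.next_url')"
        done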

      Monitor Quota Saturation and Service Instance Count

      Quota saturation and total number of service instances are available through ODB metrics emitted to Loggregator. The metric names are shown below:

      • on-demand-broker/SERVICE-NAME-MARKETPLACE/quota_remaining: global quota remaining for all instances across all plans
      • on-demand-broker/SERVICE-NAME-MARKETPLACE/PLAN-NAME/quota_remaining: quota remaining for a particular plan
      • on-demand-broker/SERVICE-NAME-MARKETPLACE/total_instances: total instances created across all plans
      • on-demand-broker/SERVICE-NAME-MARKETPLACE/PLAN-NAME/total_instances: total instances created for a given plan

      Note: Quota metrics are not emitted if no quota has been set.
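
      For example, to watch the quota metrics in the Firehose, you can reuse the nozzle pattern shown in Missing Logs and Metrics above:

        cf nozzle -f ValueMetric | grep --line-buffered "quota_remaining"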

      Techniques for Troubleshooting Highly Available Clusters

      If your cluster is experiencing downtime or is in a degraded state, Pivotal recommends using the following workflow to gather information and diagnose the type of failure:

      1. Consult solutions for common errors. See Highly Available Cluster Troubleshooting Errors above.
      2. Use mysql-diag to view a summary of the network, disk, and replication state of each cluster node. Depending on the output from mysql-diag, you might be able to recover your cluster with the troubleshooting techniques in this section. For more information about mysql-diag, see Running mysql-diag.
      3. Run download-logs against each node in your MySQL cluster, MySQL proxies, and jumpbox VM. You must run download-logs before attempting recovery because any failures in the recovery procedure can result in logs being lost or made inaccessible.

        For more information, see download-logs in the cf-mysql-release GitHub repository.

        Note: Pivotal recommends that you use the -X flag to get the complete set of available logs. However, if your cluster processes a high volume of transactions, the complete set might be too large and you can omit this flag to fetch the essential set of logs.

      4. If you are uncertain about the recovery steps to take, submit a ticket through Pivotal Support. When you submit a ticket provide the following information:
        • mysql-diag output: The output summary of the network, disk, and replication state.
        • download-logs logs: Logs about your MySQL cluster, MySQL proxies, and jumpbox VM.
        • Deployment environment: The environment that MySQL for PCF is running in, such as Pivotal Application Service (PAS) or a service tile.
        • Version numbers: The versions of the installed Ops Manager, PAS, and MySQL for PCF.
      Warning: Do not attempt to resolve cluster issues by reconfiguring the cluster, such as changing the number of nodes or networks. Only follow the diagnosis steps in this document. If you are unsure how to proceed, contact Pivotal Support.

        Force a Node to Rejoin a Highly Available Cluster Manually

        If a detached node fails to rejoin the cluster after a configured grace period, you can manually force the node to rejoin the cluster. This procedure removes all the data on the node, forces the node to join the cluster, and creates a new copy of the cluster data on the node.

        Warning: If you manually force a node to rejoin the cluster, data stored on the local node is lost. Do not force nodes to rejoin the cluster if you want to preserve unsynchronized data. Only do this procedure with help from Pivotal Support.

        Before following this procedure, try to bootstrap the cluster. For more information, see Bootstrapping.

        To manually force a node to rejoin the cluster, do the following:
        1. To SSH into the node, follow the procedure in BOSH SSH.
        2. To become root, run the following command:
           sudo su
        3. To shut down the mysqld process on the node, run the following command:
            monit stop galera-init 
        4. To remove the unsynchronized data on the node, run the following command:
            rm -rf /var/vcap/store/pxc-mysql
        5. To prepare the node before restarting, run the following command:
            /var/vcap/jobs/pxc-mysql/bin/pre-start
        6. To restart the mysqld process, run the following command:
           monit start galera-init

        Re-create a Corrupted VM in a Highly Available Cluster

        To re-create a corrupted VM, do the following:
        1. To log in to the BOSH Director VM, do the following procedures:
          1. To gather the information needed to log in to the BOSH Director VM, see Gather Credential and IP Address Information.
          2. To log in to the Ops Manager VM, see Log in to the Ops Manager VM with SSH.
          3. To log in to the BOSH Director VM, see Log in to the BOSH Director VM.
        2. To identify and re-create the unresponsive node with bosh cloudcheck, do the procedure in BOSH Cloudcheck and run Recreate VM using last known apply spec.

          Warning: Re-creating a node will clear its logs. Ensure the node is completely down before re-creating it.

          Warning: Only re-create one node. Do not re-create the entire cluster. If more than one node is down, contact Pivotal Support.
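
        A minimal sketch of step 2; cloud-check prompts interactively for the resolution to apply, such as re-creating the VM:

          bosh -d service-instance_GUID cloud-check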

        Check Replication Status in a Highly Available Cluster

        If you see stale data in your cluster, you can check whether replication is functioning normally.

        To check the replication status, do the following:
        1. To log in to the BOSH Director VM, do the following procedures:
          1. To gather the information needed to log in to the BOSH Director VM, see Gather Credential and IP Address Information.
          2. To log in to the Ops Manager VM, see Log in to the Ops Manager VM with SSH.
        2. To create a dummy database in the first node, run the following command:
          mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -e "create database verify_healthy;"
          Where:
          • FIRST-NODE-IP-ADDRESS is the IP address of the first node you recorded in step 1.
          • YOUR-IDENTITY is the value of identity that you recorded in step 1.
        3. To create a dummy table in the dummy database, run the following command:
          mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "create table dummy_table (id int not null primary key auto_increment, info text) engine='InnoDB';"
        4. To insert data into the dummy table, run the following command:
          mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "insert into dummy_table(info) values ('dummy data'),('more dummy data'),('even more dummy data');"
        5. To query the table and verify that the three rows of dummy data exist on the first node, run the following:
          mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;"
          When prompted for a password, provide the password value recorded in step 1.
          The above command returns output similar to the following:
              +----+----------------------+
              | id | info                 |
              +----+----------------------+
              |  4 | dummy data           |
              |  7 | more dummy data      |
              | 10 | even more dummy data |
              +----+----------------------+
        6. To verify that the other nodes contain the same dummy data, do the following for each of the remaining MySQL server IP addresses:
          1. To query the dummy table, run the following command:
            mysql -h NEXT-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -D verify_healthy -e "select * from dummy_table;"
            When prompted for a password, provide the password value recorded in step 1.
          2. Verify that the node contains the same three rows of dummy data as the other nodes. The above command returns output similar to the following:
                +----+----------------------+
                | id | info                 |
                +----+----------------------+
                |  4 | dummy data           |
                |  7 | more dummy data      |
                | 10 | even more dummy data |
                +----+----------------------+

        7. If each MySQL server instance does not return the same result, contact Pivotal Support before proceeding further or making any changes to your deployment.
          If each MySQL server instance returns the same result, then you can safely proceed to scaling down your cluster to a single node.
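
        After you have verified replication, you can remove the dummy database created above; a sketch (the drop statement replicates to the other nodes):

          mysql -h FIRST-NODE-IP-ADDRESS -u YOUR-IDENTITY -p -e "drop database verify_healthy;"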

        Knowledge Base (Community)

        Find the answer to your question and browse product discussions and solutions by searching the Pivotal Knowledge Base.

        File a Support Ticket

        You can file a ticket with Pivotal Support. Be sure to provide the error message from cf service YOUR-SERVICE-INSTANCE.

        To expedite troubleshooting, provide your service broker logs and your service instance logs. If your cf service YOUR-SERVICE-INSTANCE output includes a task-id, provide the BOSH task output.