Monitoring the MySQL Service

This document describes the metrics that MySQL for PCF produces and explains the Replication Canary. For information about the Interruptor, see Using the Interruptor.

Metrics

MySQL emits a number of metrics that can be used to monitor the health and performance of the MySQL deployment. The Loggregator Firehose exposes these metrics.

The metrics polling interval defaults to 30 seconds. This can be changed by navigating to the Advanced Options configuration pane and entering a new value in Metrics polling interval (min: 10).

Third-party monitoring tools can consume MySQL for PCF’s metrics via a nozzle to monitor MySQL performance and health. For an example Datadog configuration that displays some of the significant metrics outlined below, see the CF Redis and MySQL example dashboards. Pivotal does not endorse or provide support for any third-party solution.
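
For example, you can watch the raw metric stream with the Firehose plugin for the cf CLI. This is a minimal sketch, assuming the community Firehose plugin is installed and you are logged in as an admin user; exact flags may vary by plugin version:

$ cf nozzle --no-filter | grep p-mysql    # stream the Firehose and keep only the MySQL metrics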

Key Performance Indicators

Key Performance Indicators (KPIs) for MySQL for PCF are metrics that operators find most useful for monitoring their MySQL service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

Pivotal provides the following KPIs as general alerting and response guidance for typical MySQL for PCF installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

MySQL for PCF KPIs

This section lists the KPIs that are specific for MySQL for PCF.

For a list of general KPIs that apply to all instances, and not specifically to MySQL for PCF, see BOSH System Metrics.

For a list of all MySQL for PCF component metrics, see All MySQL Metrics.

MySQL Server Availability

/p-mysql/available
Description The MySQL Server is currently responding to requests, which indicates that the server is running.

Use: This metric is especially useful in single-node mode, where cluster metrics are not relevant. If the server does not emit heartbeats, it is offline.

Origin: Firehose
Envelope Type: Gauge
Unit: boolean
Frequency: 30 s (default)
Recommended measurement Average over the last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Run mysql-diag and check the MySQL Server logs for errors.

Galera Cluster Node Readiness

/p-mysql/galera/wsrep_ready
Description Shows whether each cluster node can accept queries. Returns only 0 or 1. When this metric is 0, almost all queries to that node fail with the error:
ERROR 1047 (08S01) Unknown Command

Use: Discover when nodes of a cluster have been unable to communicate and, thus, unable to accept transactions.

Origin: Firehose
Envelope Type: Gauge
Unit: boolean
Frequency: 30 s (default)
Recommended measurement Average of values of each cluster node, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 1.0
Red critical: 0 (cluster is down)
Recommended response - Run mysql-diag and check the MySQL Server logs for errors.
- Make sure there has been no infrastructure event that affects intra-cluster communication.
- Ensure that wsrep_ready has not been set to off by using the query:
SHOW STATUS LIKE 'wsrep_ready';
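
For example, a quick check against a single node, assuming you can reach the node's MySQL port and have admin credentials (NODE_IP and the credentials are placeholders):

$ mysql -h NODE_IP -u root -p -e "SHOW STATUS LIKE 'wsrep_ready';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+

A value of OFF corresponds to a metric value of 0.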

Galera Cluster Size

/p-mysql/galera/wsrep_cluster_size
Description The number of cluster nodes with which each node is communicating normally.

Use: When running in a multi-node configuration, this metric indicates if each member of the cluster is communicating normally with all other nodes.

Origin: Firehose
Envelope Type: Gauge
Unit: count
Frequency: 30 s (default)
Recommended measurement (Average of the values of each node / cluster size), over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3.0 (availability compromised)
Red critical: < 1.0 (cluster unavailable)
Recommended response Run mysql-diag and check the MySQL Server logs for errors.
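
To verify the value a node is reporting, you can query it directly. This sketch assumes direct access to a node (NODE_IP is a placeholder); in a healthy three-node cluster, every node reports 3:

$ mysql -h NODE_IP -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"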

Galera Cluster Status

/p-mysql/galera/wsrep_cluster_status
Description Shows the primary status of the cluster component that the node is in.
Values are:
- Primary = 1
- Non-primary = 0
- Disconnected = -1
See: https://mariadb.com/kb/en/mariadb/galera-cluster-status-variables/

Use: Any value other than “Primary” indicates that the node is part of a nonoperational component. This occurs in cases of multiple membership changes that result in a loss of quorum.

Origin: Firehose
Envelope Type: Gauge
Unit: integer (see above)
Frequency: 30 s (default)
Recommended measurement Sum of the values of all nodes, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3
Red critical: < 1
Recommended response - Check the status of each node to ensure that all nodes are in working order and able to receive write-sets.
- Run mysql-diag and check the MySQL Server logs for errors.
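
Note that the underlying Galera status variable is a string; the metric maps Primary to 1, Non-primary to 0, and Disconnected to -1. To inspect the raw values on a node (NODE_IP is a placeholder):

$ mysql -h NODE_IP -u root -p -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"

Every node in a healthy cluster reports wsrep_cluster_status as Primary.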

Connections per Second

/p-mysql/net/connections
Description Connections per second made to the server.

Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue.

Origin: Firehose
Envelope Type: Gauge
Unit: count
Frequency: 30 s (default)
Recommended measurement (Average of all nodes / max connections), over last 1 minute
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 90%
Recommended response - Run mysql-diag and check the MySQL Server logs for errors.
- If connections approach 100% of max connections, apps may experience times when they cannot connect to the database. The connections per second for the cluster vary based on the number of application instances and app utilization. If this threshold is met or exceeded for an extended period of time, monitor app usage to ensure everything is behaving as expected.
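
As a rough spot check of connection utilization, you can compare currently open connections against the configured maximum. This is a sketch only, using Threads_connected (currently open connections) as a stand-in for the connection rate; NODE_IP and PASSWORD are placeholders:

$ CURRENT=$(mysql -h NODE_IP -u root -pPASSWORD -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';" | awk '{print $2}')
$ MAX=$(mysql -h NODE_IP -u root -pPASSWORD -N -e "SHOW VARIABLES LIKE 'max_connections';" | awk '{print $2}')
$ awk -v c="$CURRENT" -v m="$MAX" 'BEGIN { printf "connection utilization: %.1f%%\n", 100 * c / m }'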

Query Rate

/p-mysql/performance/questions
Description The number of statements executed by the server, excluding statements executed within stored programs.

Use: The cluster should always be processing some queries, even if only as part of the internal automation.

Origin: Firehose
Envelope Type: Gauge
Unit: count
Frequency: 30 s (default)
Recommended measurement Change in number between the current and previous polling period
Recommended alert thresholds Yellow warning: 0 for 90 s
Red critical: 0 for 120 s
Recommended response If the rate is ever zero for an extended time, run mysql-diag and investigate the MySQL server logs to understand why query rate changed and determine appropriate action.

MySQL CPU Busy Time

/p-mysql/performance/busy_time
Description Percentage of CPU time spent by MySQL on user activity (executing user code), as opposed to kernel activity processing system calls.

Use: This closely reflects the amount of server activity dedicated to app queries.

Origin: Firehose
Envelope Type: Gauge
Unit: percentage
Frequency: 30 s (default)
Recommended measurement Average over last 2 minutes
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 90%
Recommended response - If this metric meets or exceeds the recommended thresholds for extended periods of time, run SHOW PROCESSLIST and identify which queries or apps are using so much CPU. Optionally redeploy the MySQL jobs using VMs with more CPU capacity.
- Run mysql-diag and check the MySQL Server logs for errors.
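
To surface the longest-running statements, you can also query the process list through information_schema, which is easier to sort than SHOW PROCESSLIST output (NODE_IP is a placeholder):

$ mysql -h NODE_IP -u root -p -e "SELECT id, user, db, time, state, info FROM information_schema.PROCESSLIST ORDER BY time DESC LIMIT 10;"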

BOSH System Metrics

All BOSH-deployed components generate the following system metrics; these system metrics also serve as KPIs for the MySQL for PCF service.
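
You can spot-check all three of the system metrics below from the BOSH CLI, assuming it is targeted at your director, by using the vitals view:

$ bosh vms --vitals

The output lists memory, CPU, and persistent disk usage for every VM in the deployment.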

RAM

system.mem.percent
Description RAM being consumed by the MySQL cluster node.

Use: MySQL increases its memory usage as the data set increases. This is normal, as much of that RAM is used to buffer IO. As long as there is enough remaining RAM for other processes on the instance, the MySQL server should be OK.

Origin: JMX Bridge or BOSH HM
Type: percentage
Frequency: 60 s
Recommended measurement Average over the last 10 minutes
Recommended alert thresholds Yellow warning: > 95%
Red critical: > 99%
Recommended response Update the cluster to use VMs that offer more RAM.

CPU

system.cpu.percent
Description CPU time being consumed by the MySQL cluster.

Use: A node that experiences heavy context switching or high CPU usage will become unresponsive. This also affects the ability of the node to report metrics.

Origin: JMX Bridge or BOSH HM
Type: percentage
Frequency: 60 s
Recommended measurement Average over the last 10 minutes
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 90%
Recommended response Determine what is using so much CPU. If it is from normal processes, update the service instance to use a VM with larger CPU capacity.

Persistent Disk

persistent.disk.percent
Description Persistent disk being consumed by the MySQL cluster nodes.

Use: If the persistent disk fills up, MySQL will be unable to process queries and recovery is difficult.

Origin: JMX Bridge or BOSH HM
Type: percentage
Frequency: 60 s
Recommended measurement Average over the last 10 minutes
Recommended alert thresholds Yellow warning: > 75%
Red critical: > 90%
Recommended response Update the deployment with larger persistent disks. This process may take some time, as the data is copied from the original persistent disk to a new one.

MySQL-Specific Metrics

Data Source | Description | Metric Unit
/p-mysql/available Indicates if the local database server is available and responding. boolean
/p-mysql/innodb/buffer_pool_free The number of free pages in the InnoDB Buffer Pool. pages
/p-mysql/innodb/buffer_pool_total The total number of pages in the InnoDB Buffer Pool. pages
/p-mysql/innodb/buffer_pool_used The number of used pages in the InnoDB Buffer Pool. pages
/p-mysql/innodb/buffer_pool_utilization The utilization of the InnoDB Buffer Pool. fraction
/p-mysql/innodb/current_row_locks The number of current row locks. locks
/p-mysql/innodb/data_reads The rate of data reads. reads/second
/p-mysql/innodb/data_writes The rate of data writes. writes/second
/p-mysql/innodb/mutex_os_waits The rate of mutex OS waits. events/second
/p-mysql/innodb/mutex_spin_rounds The rate of mutex spin rounds. events/second
/p-mysql/innodb/mutex_spin_waits The rate of mutex spin waits. events/second
/p-mysql/innodb/os_log_fsyncs The rate of fsync writes to the log file. writes/second
/p-mysql/innodb/row_lock_time Time spent in acquiring row locks. milliseconds
/p-mysql/innodb/row_lock_waits The number of times per second a row lock had to be waited for. events/second
/p-mysql/net/connections The rate of connections to the server. connections/second
/p-mysql/net/max_connections The maximum number of connections that have been in use simultaneously since the server started. connections
/p-mysql/performance/com_delete The rate of delete statements. queries/second
/p-mysql/performance/com_delete_multi The rate of delete-multi statements. queries/second
/p-mysql/performance/com_insert The rate of insert statements. queries/second
/p-mysql/performance/com_insert_select The rate of insert-select statements. queries/second
/p-mysql/performance/com_replace_select The rate of replace-select statements. queries/second
/p-mysql/performance/com_select The rate of select statements. queries/second
/p-mysql/performance/com_update The rate of update statements. queries/second
/p-mysql/performance/com_update_multi The rate of update-multi statements. queries/second
/p-mysql/performance/created_tmp_disk_tables The rate of internal on-disk temporary tables created per second by the server while executing statements. tables/second
/p-mysql/performance/created_tmp_files The rate of temporary files created per second. files/second
/p-mysql/performance/created_tmp_tables The rate of internal temporary tables created per second by the server while executing statements. tables/second
/p-mysql/performance/kernel_time Percentage of CPU time spent in kernel space by MySQL. percent
/p-mysql/performance/key_cache_utilization The key cache utilization ratio. fraction
/p-mysql/performance/open_files The number of open files. files
/p-mysql/performance/open_tables The number of tables that are open. tables
/p-mysql/performance/qcache_hits The rate of query cache hits. hits/second
/p-mysql/performance/questions The rate of statements executed by the server. queries/second
/p-mysql/performance/slow_queries The rate of slow queries. queries/second
/p-mysql/performance/table_locks_waited The total number of times that a request for a table lock could not be granted immediately and a wait was needed. number
/p-mysql/performance/threads_connected The number of currently open connections. connections
/p-mysql/performance/threads_running The number of threads that are not sleeping. threads
/p-mysql/performance/busy_time Percentage of CPU time spent in user space by MySQL. percent
/p-mysql/performance/max_connections The maximum permitted number of simultaneous client connections. integer
/p-mysql/performance/open_files_limit The number of files that the operating system permits mysqld to open. integer
/p-mysql/performance/open_tables The number of tables that are open. integer
/p-mysql/performance/opened_tables The number of tables that have been opened. integer
/p-mysql/performance/opened_table_definitions The number of .frm files that have been cached. integer

Galera-Specific Metrics

Data Source | Description | Metric Unit
/p-mysql/galera/wsrep_ready Shows whether the node can accept queries. boolean
/p-mysql/galera/wsrep_cluster_size The current number of nodes in the Galera cluster. nodes
/p-mysql/galera/wsrep_cluster_status Shows the primary status of the cluster component that the node is in. Values are Primary = 1, Non-primary = 0, Disconnected = -1 (see https://mariadb.com/kb/en/mariadb/galera-cluster-status-variables/). integer
/p-mysql/galera/wsrep_local_recv_queue_avg Shows the average size of the local received queue since the last status query. float
/p-mysql/galera/wsrep_local_send_queue_avg Shows the average size of the local sent queue since the last status query. float
/p-mysql/galera/wsrep_local_index The node’s index in the cluster (zero-based). int
/p-mysql/galera/wsrep_local_state The node’s local state. int

For more information on monitoring PCF, see Monitoring Pivotal Cloud Foundry.

Replication Canary

MySQL for Pivotal Cloud Foundry (PCF) is a clustered solution that uses replication to provide benefits such as quick failover and rolling upgrades. This is more complex than a single node system with no replication. MySQL for PCF includes a Replication Canary to help with the increased complexity. The Replication Canary is a long-running monitor that validates that replication is working within the MySQL cluster.

How it Works

The Replication Canary writes to a private dataset in the cluster, and attempts to read that data from each node. It pauses between writing and reading to ensure that the writesets have been committed across each node of the cluster. The private dataset does not use a significant amount of disk capacity.
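
Conceptually, the check resembles the following sketch. This is illustrative only; the database, table, and variable names here are hypothetical, not the tile's actual implementation:

# Write a marker row to one node, wait, then confirm every node sees it.
TS=$(date +%s)
mysql -h "$WRITE_NODE" -u canary -p"$CANARY_PASS" canary_db \
  -e "INSERT INTO heartbeat (ts) VALUES ($TS);"
sleep "$READ_DELAY"   # allow the writeset to commit across the cluster
for NODE in $NODE1 $NODE2 $NODE3; do
  N=$(mysql -h "$NODE" -u canary -p"$CANARY_PASS" -N canary_db \
    -e "SELECT COUNT(*) FROM heartbeat WHERE ts = $TS;")
  [ "$N" -eq 1 ] || echo "replication failure detected on $NODE"
done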

When replication fails to work properly, the Canary detects that it cannot read the data from all nodes, and immediately takes two actions:

  • Emails a pre-configured address with a message that replication has failed. See the sample below.
  • Disables client access to the cluster.

Note: Malfunctioning replication exposes the cluster to the possibility of data loss. Because of this, both behaviors are enabled by default. It is critical that you contact Pivotal support immediately in the case of replication failure. Support will work with you to determine the nature of the cluster failure and provide guidance regarding a solution.

Sample Notification Email

If the Canary detects a replication failure, it immediately sends an email through the Elastic Runtime notification service. See the following example:

Subject: CF Notification: p-mysql Replication Canary, alert 417

This message was sent directly to your email address.

{alert-code 417}
This is an email to notify you that the MySQL service's replication canary has detected an unsafe cluster condition in which replication is not performing as expected across all nodes.

Cluster Access

Each time the Canary detects cluster replication failure, it instructs all proxies to disable connections to the database cluster. If the replication issue resolves, the Canary detects this and automatically restores client access to the cluster.

If you must restore access to the cluster regardless of the Replication Canary, contact Support.

Determine Proxy State

You can determine if the Canary disabled cluster access by using the Proxy API. See the following example:

ubuntu@ip-10-0-0-38:~$ curl -ku admin:PASSWORD_FROM_OPSMGR -X GET https://proxy-api-p-mysql.SYSTEM-DOMAIN/v0/cluster ; echo
{"currentBackendIndex":0,"trafficEnabled":false,"message":"Disabling cluster traffic","lastUpdated":"2016-07-27T05:16:29.197754077Z"}

Enable the Replication Canary

To enable the Replication Canary, follow the instructions below to configure both the Elastic Runtime tile and the MySQL for PCF tile.

Configure the Elastic Runtime Tile

Note: In a typical PCF deployment, these settings are already configured.

  1. In the SMTP Config section, enter a From Email that the Replication Canary can use to send notifications, along with the SMTP server configuration.
  2. In the Errands section, select the Notifications errand.

Configure the MySQL for PCF Tile

  1. In the Advanced Options section, select Enable replication canary.
  2. If you want the Replication Canary to send email but not disable access at the proxy, select Notify only.

    Note: Pivotal recommends leaving this checkbox unselected due to the possibility of data loss from replication failure.

  3. You can override the Replication canary time period. The Replication canary time period sets how frequently the canary checks for replication failure, in seconds. A lower value adds a small amount of load to the databases, but the canary reacts more quickly to replication failure. The default is 30 seconds.

  4. You can override the Replication canary read delay. The Replication canary read delay sets how long the canary waits to verify that data is replicating across each MySQL node, in seconds. Clusters under heavy load experience some small replication lag as writesets are committed across the nodes. The default is 20 seconds.

  5. Enter an Email address to receive monitoring notifications. Use a closely monitored email account. The purpose of the Canary is to escalate replication failure as quickly as possible.

  6. In the Resource Config section, ensure the Monitoring job has one instance.

Disable the Replication Canary

If you do not need the Replication Canary, for instance if you use a single MySQL node, follow this procedure to disable both the job and the resource configuration.

  1. In the Advanced Options section of the MySQL for PCF tile, select Disable Replication Canary.

  2. In the Resource Config pane, set the Monitoring job to zero instances.
