Monitoring the MySQL Service
Warning: MySQL for PCF v1.10 is no longer supported because it has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
This document describes the metrics that are produced by MySQL for PCF, and everything you need to know about the Replication Canary. For information about the Interruptor, see Using the Interruptor.
Metrics
MySQL emits metrics that can be used to monitor the health and performance of the MySQL deployment. The Loggregator Firehose exposes these metrics.
The metrics polling interval defaults to 30 seconds. You can change this interval by navigating to the Advanced Options configuration pane and entering a new value in Metrics polling interval (min: 10).
Third-party monitoring tools can consume MySQL for PCF’s metrics using an nozzle to monitor MySQL performance and health. For more information, see this section on nozzles. For an example Datadog configuration that displays some of the significant metrics outlined below, see the CF Redis and MySQL example dashboards. Pivotal does not endorse or provide support for any third party solution.
Key Performance Indicators
Key Performance Indicators (KPIs) for MySQL for PCF are metrics that operators find most useful for monitoring their MySQL service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.
Pivotal provides the following KPIs as general alerting and response guidance for typical MySQL for PCF installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.
MySQL for PCF KPIs
This section lists the KPIs that are specific for MySQL for PCF.
For a list of general KPIs that apply to all instances, and not specifically to MySQL for PCF, see BOSH Health Metrics.
For a list of all MySQL for PCF component metrics, see All MySQL Metrics.
Server Availability
/p-mysql/available | |
---|---|
Description | The MySQL Server is currently responding to requests, which indicates that the server is running. Use: This metric is especially useful in single-node mode, where cluster metrics are not relevant. If the server does not emit heartbeats, it is offline. Origin: Doppler/Firehose Type: boolean Frequency: 30 s (default) |
Recommended measurement | Average over the last 5 minutes |
Recommended alert thresholds | Yellow warning: N/A Red critical: < 1 |
Recommended response | Run mysql-diag and check the MySQL Server logs for errors. |
Galera WSREP Ready
/p-mysql/galera/wsrep_ready | |
---|---|
Description | Shows whether each cluster node can accept queries. Returns only 0 or 1. When this metric is 0, almost all queries to that node fail with the error:ERROR 1047 (08501) Unknown Command Use: Discover when nodes of a cluster have been unable to communicate and, thus, unable to accept transactions. Origin: Doppler/Firehose Type: boolean Frequency: 30 s (default) |
Recommended measurement | Average of values of each cluster node, over the last 5 minutes |
Recommended alert thresholds | Yellow warning: < 1.0 Red critical: 0 (cluster is down) |
Recommended response | - Run mysql-diag and check the MySQL Server logs for errors.- Make sure there has been no infrastructure event that affects intra-cluster communication. - Ensure that wsrep_ready has not been set to off by using the query:SHOW STATUS LIKE 'wsrep_ready'; |
Galera WSREP Cluster Size
/p-mysql/galera/wsrep_cluster_size | |
---|---|
Description | The number of cluster nodes with which each node is communicating normally. Use: When running in a multi-node configuration, this metric indicates if each member of the cluster is communicating normally with all other nodes. Origin: Doppler/Firehose Type: count Frequency: 30 s (default) |
Recommended measurement | (Average of the values of each node / cluster size), over the last 5 minutes |
Recommended alert thresholds | Yellow warning: < 3.0 (availability compromised) Red critical: < 1.0 (cluster unavailable) |
Recommended response | Run mysql-diag and check the MySQL Server logs for errors. |
Galera WSREP Cluster Status
/p-mysql/galera/wsrep_cluster_status | |
---|---|
Description | Shows the primary status of the cluster component that the node is in. Values are: - Primary = 1 - Non-primary = 0 - Disconnected = -1 See Galera Cluster Status Variables in the MariaDB documentation. Use: Any value other than “Primary” indicates that the node is part of a nonoperational component. This occurs in cases of multiple membership changes that result in a loss of quorum. Origin: Doppler/Firehose Type: integer (see above) Frequency: 30 s (default) |
Recommended measurement | Sum of each of the nodes, over the last 5 minutes |
Recommended alert thresholds | Yellow warning: < 3 Red critical: < 1 |
Recommended response | - Check node status to ensure that they are all in working order and able to receive write-sets. - Run mysql-diag and check the MySQL Server logs for errors. |
Persistent and Ephemeral Disk Used
/p-mysql/system/persistent_disk_used_percent /p-mysql/system/ephemeral_disk_used_percent |
|
---|---|
Description | The percentage of disk used on the persistent and ephemeral file systems. Use: MySQL cannot function correctly if there isn’t sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base. Origin: Doppler/Firehose Type: Percent Frequency: 30 s (default) |
Recommended measurement | Max of persistent disk used of all of nodes and Maximum ephemeral disk used of all nodes |
Recommended alert thresholds | Yellow warning: > 80% Red critical: > 90% |
Recommended response | - Audit plan allocation by your users, recommend deletion of unused service instances. - Redeploy with larger disks. |
Service Plans Allocated
/p-mysql/broker/disk_allocated_service_plans | |
---|---|
Description | The number of megabytes allocated by the broker for all service plans, current and allocated. Use: The service broker will not allow new service instances to be created if there isn’t sufficient unreserved space remaining on the persistent disk. Origin: Doppler/Firehose Type: Megabytes Frequency: 30 s (default) |
Recommended measurement | disk allocated / persistent disk size |
Recommended alert thresholds | Yellow warning: < 80% Red critical: < 90% |
Recommended response | - Audit plan allocation by your users, recommend deletion of unused service instances. - Redeploy with larger persistent disks |
Connections
/p-mysql/net/connections | |
---|---|
Description | Connections per second made to the server. Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue. Origin: Doppler/Firehose Type: count Frequency: 30 s (default) |
Recommended measurement | The maximum number of connections of any node / max connections , over last 1 minute |
Recommended alert thresholds | Yellow warning: > 80% Red critical: > 90% |
Recommended response | When connections per second approaches max connections , apps might experience times when they cannot connect to the database. The number of connections per second for the cluster varies based on application instances and app utilization. If this threshold is met or exceeded for an extended period of time, monitor app usage to ensure everything is behaving as expected. |
Questions
/p-mysql/performance/questions | |
---|---|
Description | The number of statements executed by the server, excluding statements executed within stored programs. Use: The cluster should always be processing some queries, if just as part of the internal automation. Origin: Firehose Envelope Type: Gauge Unit: count Frequency: 30 s (default) |
Recommended measurement | Change in number between the current and previous polling period |
Recommended alert thresholds | Yellow warning: 0 for 90 s Red critical: 0 for 120 s |
Recommended response | If the rate is ever zero for an extended time, run mysql-diag and investigate the MySQL server logs to understand why query rate changed and determine appropriate action. |
BOSH Health Metrics
All BOSH-deployed components generate the following component metrics; these component metrics also serve as KPIs for the MySQL for PCF service.
RAM
system.mem.percent | |
---|---|
Description | RAM being consumed by the MySQL cluster node. Use: MySQL increases its memory usage as the data set increases. This is normal, as much of that RAM is used to buffer IO. As long as there is enough remaining RAM for other processes on the instance, the MySQL server should be OK. Origin: JMX Bridge or BOSH HM Type: percentage Frequency: 60 s |
Recommended measurement | Average over the last 10 minutes |
Recommended alert thresholds | Yellow warning: > 95% Red critical: > 99% |
Recommended response | Update the cluster to use VMs that offer more RAM. |
CPU
system.cpu.percent | |
---|---|
Description | CPU time being consumed by the MySQL cluster. Use: A node that experiences context switching or high CPU usage will become unresponsive. This also affects the ability of the node to report metrics. Origin: JMX Bridge or BOSH HM Type: percent Frequency: 60 s |
Recommended measurement | Average over the last 10 minutes |
Recommended alert thresholds | Yellow warning: > 80% Red critical: > 90% |
Recommended response | Determine what is using so much CPU. If it is from normal processes, update the service instance to use a VM with larger CPU capacity. |
Persistent Disk
persistent.disk.percent | |
---|---|
Description | Persistent disk being consumed by the MySQL cluster nodes. Use: If the persistent disk fills up, MySQL will be unable to process queries and recovery is difficult. Origin: JMX Bridge or BOSH HM Type: percent Frequency: 60 s |
Recommended measurement | Average over the last 10 minutes |
Recommended alert thresholds | Yellow warning: > 75% Red critical: > 90% |
Recommended response | Update the deployment with larger persistent disks. This process might take some time, because the data is copied from the original persistent disk to a new one. For more information, see Redeploying with Larger Persistent Disks. |
MySQL-Specific Metrics
Data Source | Description | Metric Unit |
---|---|---|
/p-mysql/available |
Indicates if the local database server is available and responding. | boolean |
/p-mysql/system/persistent_disk_used |
The number of KB used on the persistent disk. | KB |
/p-mysql/system/persistent_disk_used_percent |
The percentage of persistent disk used by both the system and user applications. | percent |
/p-mysql/system/persistent_disk_free |
The number of KB available on the persistent disk. | KB |
/p-mysql/system/persistent_disk_inodes_used |
The number of inodes used on the persistent disk. | count |
/p-mysql/system/persistent_disk_inodes_used_percent |
The percentage of persistent disk inodes used by both the system and user applications. | percent |
/p-mysql/system/persistent_disk_inodes_free |
The number of inodes available on the persistent disk. | count |
/p-mysql/system/ephemeral_disk_used |
The number of KB used on the ephemeral disk. | KB |
/p-mysql/system/ephemeral_disk_used_percent |
The percentage of ephemeral disk used by both the system and user applications. | percent |
/p-mysql/system/ephemeral_disk_free |
The number of KB available on the ephemeral disk. | KB |
/p-mysql/system/ephemeral_disk_inodes_used |
The number of inodes used on the ephemeral disk. | count |
/p-mysql/system/ephemeral_disk_inodes_used_percent |
The percentage of ephemeral disk inodes used by both the system and user applications. | percent |
/p-mysql/system/ephemeral_disk_inodes_free |
The number of inodes available on the ephemeral disk. | count |
/p-mysql/broker/disk_allocated_service_plans |
The number of MB allocated by the broker for all service plans, current and allocated. | MB |
/p-mysql/innodb/buffer_pool_free |
The number of free pages in the InnoDB Buffer Pool. | pages |
/p-mysql/innodb/buffer_pool_total |
The total number of pages in the InnoDB Buffer Pool. | pages |
/p-mysql/innodb/buffer_pool_pages_data |
The number of pages in the InnoDB Buffer pool containing data. The number includes both dirty and clean pages. | pages |
/p-mysql/innodb/buffer_pool_used |
The number of used pages in the InnoDB Buffer Pool. | pages |
/p-mysql/innodb/buffer_pool_utilization |
The utilization of the InnoDB Buffer Pool. | fraction |
/p-mysql/innodb/data_reads |
The number of data reads. | reads/second |
/p-mysql/innodb/data_writes |
The number of data writes. | writes/second |
/p-mysql/innodb/mutex_os_waits |
The number of mutex OS waits. | events/second |
/p-mysql/innodb/mutex_spin_rounds |
The number of mutex spin rounds. | events/second |
/p-mysql/innodb/mutex_spin_waits |
The number of mutex spin waits. | events/second |
/p-mysql/innodb/os_log_fsyncs |
The number of fsync writes to the log file. | writes/second |
/p-mysql/innodb/row_lock_time |
Time spent in acquiring row locks. | milliseconds |
/p-mysql/innodb/row_lock_waits |
The number of times per second a row lock had to be waited for. | events/second |
/p-mysql/net/connections |
The number of connections to the server. | connection/second |
/p-mysql/net/max_connections |
The maximum number of connections that have been in use simultaneously since the server started. | connections |
/p-mysql/performance/com_delete |
The number of delete statements. | queries/second |
/p-mysql/performance/com_delete_multi |
The number of delete-multi statements. | queries/second |
/p-mysql/performance/com_insert |
The number of insert statements. | query/second |
/p-mysql/performance/com_insert_select |
The number of insert-select statements. | queries/second |
/p-mysql/performance/com_replace_select |
The number of replace-select statements. | queries/second |
/p-mysql/performance/com_select |
The number of select statements. | queries/second |
/p-mysql/performance/com_update |
The number of update statements. | queries/second |
/p-mysql/performance/com_update_multi |
The number of update-multi statements. | queries/second |
/p-mysql/performance/cpu_time |
Total CPU time used. Introduced with MariaDB 5.3; emitted in MySQL for Pivotal Cloud Foundry v1 but not in MySQL for Pivotal Platform v2. | numeric |
/p-mysql/performance/cpu_utilization_percent |
The percent of the CPU in use by all processes on the MySQL node. | percent utilization, from 0-100 |
/p-mysql/performance/created_tmp_disk_tables |
The number of internal on-disk temporary tables created each second by the server while executing statements. | table/second |
/p-mysql/performance/created_tmp_files |
The number of temporary files created each second. | files/second |
/p-mysql/performance/created_tmp_tables |
The number of internal temporary tables created each second by the server while executing statements. | tables/second |
/p-mysql/performance/open_files |
The number of open files. | files |
/p-mysql/performance/open_tables |
The number of tables that are open. | tables |
/p-mysql/performance/open_table_definitions |
The number of currently cached table definitions (FRM files). | count |
/p-mysql/performance/qcache_hits |
The number of query cache hits. | hits/second |
/p-mysql/performance/questions |
The number of statements executed by the server. | queries/second |
/p-mysql/performance/queries |
The number of statements executed by the server, excluding COM_PING and COM_STATISTICS . Differs from Questions in that it also counts statements executed within stored programs. |
queries/second |
/p-mysql/performance/slow_queries |
The number of slow queries. | queries/second |
/p-mysql/performance/table_locks_waited |
The total number of times that a request for a table lock could not be granted immediately and a wait was needed. | number |
/p-mysql/performance/threads_connected |
The number of currently open connections. | connections |
/p-mysql/performance/threads_running |
The number of threads that are not sleeping. | threads |
/p-mysql/performance/max_connections |
The maximum permitted number of simultaneous client connections. | integer |
/p-mysql/performance/open_files_limit |
The number of files that the operating system permits mysqld to open. | integer |
/p-mysql/performance/open_tables |
The number of tables that are open. | integer |
/p-mysql/performance/open_tables_definitions |
The number of currently cached table definitions (FRM files). | count |
/p-mysql/performance/opened_tables |
The number of tables that have been opened. | integer |
/p-mysql/performance/opened_table_definitions |
The number of FRM files that have been cached. | integer |
/p-mysql/performance/queries |
The number of statements executed by the service, which resets to zero when the MySQL process is restarted. | integer |
/p-mysql/performance/queries_delta |
The change in the /performance/queries metric since the last time it was emitted. |
integer greater than zero |
/p-mysql/performance/queries_delta |
The change in the /performance/queries metric since the last time it was emitted. |
integer greater than zero |
/p-mysql/variables/max_connections |
The maximum permitted number of simultaneous client connections. | connections |
/p-mysql/variables/open_files_limit |
The number of files that the operating system permits mysqld to open. | files |
/p-mysql/variables/read_only |
Whether the server is in read-only mode. | boolean |
Galera-Specific Metrics
Data Source | Description | Metric Unit |
---|---|---|
/p-mysql/galera/wsrep_ready |
Shows whether the node can accept queries. | boolean |
/p-mysql/galera/wsrep_cluster_size |
The current number of nodes in the Galera cluster. | nodes |
/p-mysql/galera/wsrep_cluster_status |
Shows the primary status of the cluster component that the node is in.
Possible values are:
|
State ID |
/p-mysql/galera/wsrep_flow_control_paused |
Fraction of time that replication was paused due to flow control since the
server started or last FLUSH STATUS .
This metric is a measure of how much replication lag is slowing down the cluster. |
float |
/p-mysql/galera/wsrep_flow_control_sent |
Number of FC_PAUSE or flow control pause events sent by this node. Unlike many status variables,
the counter for this metric does not reset every time you run the query. |
count |
/p-mysql/galera/wsrep_local_recv_queue_avg |
Shows the average size of the local received queue since the last status query. | float |
/p-mysql/galera/wsrep_local_send_queue_avg |
Shows the average size of the local sent queue since the last status query. | float |
/p-mysql/galera/wsrep_local_index |
This node index in the cluster (base 0). | int |
/p-mysql/galera/wsrep_local_state |
The local state of the node. Possible states include:
|
int |
For more information on monitoring PCF, see Monitoring Pivotal Cloud Foundry.
Replication Canary
MySQL for Pivotal Cloud Foundry (PCF) is a clustered solution that uses replication to provide benefits such as quick failover and rolling upgrades. This is more complex than a single node system with no replication. MySQL for PCF includes a Replication Canary to help with the increased complexity. The Replication Canary is a long-running monitor that validates that replication is working within the MySQL cluster.
How it Works
The Replication Canary writes to a private dataset in the cluster, and attempts to read that data from each node. It pauses between writing and reading to ensure that the writesets have been committed across each node of the cluster. The private dataset does not use a significant amount of disk capacity.
When replication fails to work properly, the Canary detects that it cannot read the data from all nodes, and immediately takes two actions:
- Emails a pre-configured address with a message that replication has failed. See the sample below.
- Disables client access to the cluster.
Note: Malfunctioning replication exposes the cluster to the possibility of data loss. Because of this, both behaviors are enabled by default. It is critical that you contact Pivotal support immediately in the case of replication failure. Support will work with you to determine the nature of the cluster failure and provide guidance regarding a solution.
Sample Notification Email
If the Canary detects a replication failure, it immediately sends an email through the Pivotal Application Service (PAS) or Elastic Runtime notification service. See the following example:
Subject: CF Notification: p-mysql Replication Canary, alert 417
This message was sent directly to your email address.
{alert-code 417}
This is an email to notify you that the MySQL service's replication canary has detected an unsafe cluster condition in which replication is not performing as expected across all nodes.
Cluster Access
Each time the Canary detects cluster replication failure, it instructs all proxies to disable connections to the database cluster. If the replication issue resolves, the Canary detects this and automatically restores client access to the cluster.
If you must restore access to the cluster regardless of the Replication Canary, contact Support.
Determine Proxy State
You can determine if the Canary disabled cluster access by using the Proxy API. See the following example:
ubuntu@ip-10-0-0-38:~$ curl -ku admin:PASSWORD_FROM_OPSMGR -X GET https://proxy-api-p-mysql.SYSTEM-DOMAIN/v0/cluster ; echo
{"currentBackendIndex":0,"trafficEnabled":false,"message":"Disabling cluster traffic","lastUpdated":"2016-07-27T05:16:29.197754077Z"}
Enable the Replication Canary
To enable the Replication Canary, follow the instructions below to configure both the PAS or Elastic Runtime tile and the MySQL for PCF tile.
Configure the PAS or Elastic Runtime Tile
Note: In a typical PCF deployment, these settings are already configured.
- In the SMTP Config section, enter a From Email that the Replication Canary can use to send notifications, along with the SMTP server configuration.
- In the Errands section, select the Notifications errand.
Configure the MySQL for PCF Tile
- In the Advanced Options section, select Enable replication canary.
- If you want to the Replication Canary to send email but not disable access at the proxy, select Notify only.
Note: Pivotal recommends leaving this checkbox unselected due to the possibility of data loss from replication failure.
You can override the Replication canary time period. The Replication canary time period sets how frequently the canary checks for replication failure, in seconds. This adds a small amount of load to the databases, but the canary reacts more quickly to replication failure. The default is 30 seconds.
You can override the Replication canary read delay. The Replication canary read delay sets how long the canary waits to verify data is replicating across each MySQL node, in seconds. Clusters under heavy load experience some small replication lag as writesets are committed across the nodes. The Default is 20 seconds.
Enter an Email address to receive monitoring notifications. Use a closely monitored email address account. The purpose of the Canary is to escalate replication failure as quickly as possible.
In the Resource Config section, ensure the Monitoring job has one instance.
Disable the Replication Canary
If you do not need the Replication Canary, for instance if you use a single MySQL node, follow this procedure to disable both the job and the resource configuration.
In the Advanced Options section of the MySQL for PCF tile, select Disable Replication Canary.
In the Resource Config pane, set the Monitoring job to zero instances.