LATEST VERSION: 2.5 - RELEASE NOTES

Monitoring and KPIs for On-Demand MySQL for PCF

Page last updated:

This topic explains how to monitor the health of the MySQL for Pivotal Cloud Foundry (PCF) service using the logs, metrics, and Key Performance Indicators (KPIs) generated by MySQL for PCF component VMs.

For general information about logging and metrics in PCF, see Logging and Metrics.

About Metrics

Metrics are regularly-generated log messages that report measured component states. The metrics polling interval is 30 seconds for MySQL instances and 60 seconds for the service broker.

Metrics are long, single lines of text that follow the format:

origin:"p.mysql" eventType:ValueMetric timestamp:1496776477930669688 deployment:"service-instance_2b5a001f-2bf3-460c-aee6-fd2253f9fb0c" job:"mysql" index:"b09df494-b731-4d06-a4b0-c2985ceedf4c" ip:"10.0.8.4" valueMetric:<name:"/p.mysql/performance/open_files" value:24 unit:"file" >

Key Performance Indicators

Key Performance Indicators (KPIs) for MySQL for PCF are metrics that operators find most useful for monitoring their MySQL service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

Pivotal provides the following KPIs as general alerting and response guidance for typical MySQL for PCF installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

KPIs for MySQL Service Instances

This section lists the KPIs that are specific for MySQL for PCF instances.

For a list of general KPIs that apply to all instances, and not specifically to MySQL for PCF instances, see BOSH System Health Metrics.

For a list of all MySQL for PCF component metrics, see All MySQL Metrics.

Server Availability


/p.mysql/available

Description MySQL Server is currently responding to requests, which indicates if the component is available.

Use: If the server does not emit heartbeats, it is offline.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s
Recommended measurement Average over last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Check the MySQL Server logs for errors. You can find the instance by targeting your MySQL deployment with BOSH and inspecting logs for the instance. For more information, see Failing Jobs and Unhealthy Instances.

Persistent and Ephemeral Disk Used


/p.mysql/system/persistent_disk_used_percent and /p.mysql/system/ephemeral_disk_used_percent

Description The percentage of disk used on the persistent and ephemeral file systems.

Use: MySQL cannot function correctly if there isn’t sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base.

Origin: Doppler/Firehose
Type: Percent
Frequency: 30 s (default)
Recommended measurement Max of persistent disk used of all of nodes
and
Maximum ephemeral disk used of all nodes
Recommended alert thresholds Yellow warning: > 40%

Red critical: > 45%
Recommended response Upgrade the service instance to a plan with larger disk capacity.

Connections


/p.mysql/net/connections

Description The rate of connections to the server, shown as connections per second.

Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s
Recommended measurement (Number of connections / Max connections) over last 1 minute
Recommended alert thresholds Yellow warning: > 80
Red critical: > 90
Recommended response When approaching 100% of max connections, apps may be experiencing times when they cannot connect to the database. The connections/second for a service instance vary based on application instances and app utilization. If this metric is met or exceeded for an extended period of time, monitor app usage to ensure everything is behaving as expected.

Questions


/p.mysql/performance/questions

Description The rate of statements executed by the server, shown as queries per second.

Use: The server should always be processing some queries, if just as part of the internal automation.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s
Recommended measurement Average over last 2 minutes
Recommended alert thresholds Yellow warning: 0 for 90 s
Red critical: 0 for 120 s
Recommended response Investigate the MySQL server logs, such as the audit log, to understand why query rate changed and determine appropriate action.

BOSH System Health Metrics

The BOSH layer that underlies PCF generates healthmonitor metrics for all VMs in the deployment. However, these metrics are not included in the Loggregator Firehose by default. To send BOSH HM metrics through the Firehose, install the open-source HM Forwarder.

All BOSH-deployed components generate the following system health metrics; these metrics also serve as KPIs for the MySQL for PCF service.

Persistent Disk


persistent.disk.percent

Description Persistent disk being consumed by the MySQL service instance.

Use: If the persistent disk fills up, MySQL will be unable to process queries and recovery is difficult.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 60 s (default)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 75
Red critical: > 90
Recommended response Update the service instance to use a plan with a larger persistent disk. This process may take some time, as the data is copied from the original persistent disk to a new one.

RAM


system.mem.percent

Description RAM being consumed by the MySQL service instance.

Use: MySQL increases its memory usage as the data set increases. This is normal, as much of that RAM is used to buffer IO. As long as there is enough remaining RAM for other processes on the instance, the MySQL server should be OK.

Origin: JMX Bridge or BOSH HM
Type: percentage
Frequency: 60 s (default)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 95
Red critical: > 99
Recommended response Update the service instance to a plan with more RAM.

CPU


system.cpu.percent

Description CPU time being consumed by the MySQL service.

Use: A node that experiences context switching or high CPU usage will become unresponsive. This also affects the ability of the node to report metrics.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 60 s (default)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 80
Red critical: > 90
Recommended response Determine what is using so much CPU. If it is from normal processes, update the service instance to use a plan with larger CPU capacity.

All MySQL Metrics

In addition to the above KPIs, the MySQL service emits the followings metrics that can be used for monitoring and alerting.

Data Source Description Metric Unit
/p.mysql/available Indicates if the local database server is available and responding. boolean
/p.mysql/follower/seconds_behind_master The number of seconds the follower VM is behind in applying writes from the leader VM to its own database. seconds
/p.mysql/follower/seconds_since_leader_heartbeat The number of seconds that elapses between the leader heartbeat and the replication of the heartbeat in the follower database. seconds
/p.mysql/innodb/buffer_pool_pages_free The amount of free space, measured in pages, in the InnoDB Buffer Pool. pages
/p.mysql/innodb/buffer_pool_pages_total The total amount of free space, measured in pages, in the InnoDB Buffer Pool containing data. pages
/p.mysql/innodb/buffer_pool_pages_data The number of pages in the InnoDB Buffer Pool containing data. The number includes both dirty and clean pages. pages
/p.mysql/innodb/buffer_pool_utilization The utilization of the InnoDB Buffer Pool. fraction
/p.mysql/innodb/data_read The amount of data read since the server started. bytes
/p.mysql/innodb/data_written The amount of data written since the server started. bytes
/p.mysql/innodb/os_log_fsyncs The number of fsync() writes done to the InnoDB redo log files. count
/p.mysql/innodb/row_lock_time Total time spent in acquiring row locks. milliseconds
/p.mysql/innodb/row_lock_waits The number of times per second a row lock had to be waited for since the server started. count
/p.mysql/innodb/row_lock_current_waits The number of row locks currently being waited for by operations on InnoDB tables. count
/p.mysql/net/connections The number of connection attempts to the server, both successful and unsuccessful, to the MySQL server. connections
/p.mysql/net/max_used_connections The maximum number of connections that have been in use simultaneously since the server started. connections
/p.mysql/performance/com_delete The number of delete statements since the server started or the last FLUSH STATUS. queries
/p.mysql/performance/com_delete_multi The number of delete-multi statements since the server started or the last FLUSH STATUS. Applies to DELETE statements that use multiple-table syntax. queries
/p.mysql/performance/com_insert The number of insert statements since the server started or the last FLUSH STATUS. queries
/p.mysql/performance/com_insert_select The number of insert-select statements since the server started or the last FLUSH STATUS. queries
/p.mysql/performance/com_replace_select The number of replace-select statements since the server started or the last FLUSH STATUS. queries
/p.mysql/performance/com_select The number of select statements since the server started or the last FLUSH STATUS. If a query result is returned from query cache, the server increments the qcache_hits status variable, not com_select. queries
/p.mysql/performance/com_update The number of update statements since the server started or the last FLUSH STATUS. queries
/p.mysql/performance/com_update_multi The number of update-multi statements since the server started or the last FLUSH STATUS. Applies to UPDATE statements that use multiple-table syntax. queries
/p.mysql/performance/cpu_utilization_percent The percent of the CPU in use by all processes on the MySQL node. percent utilization
/p.mysql/performance/created_tmp_disk_tables The number of internal on-disk temporary tables created each second by the server while executing statements. tables
/p.mysql/performance/created_tmp_files The number of temporary files created by mysqld. files
/p.mysql/performance/created_tmp_tables The number of internal temporary tables created by the server while executing statements. tables
/p.mysql/performance/kernel_time Percentage of CPU time spent in kernel space by MySQL. percent
/p.mysql/performance/key_cache_utilization The key cache utilization ratio. fraction
/p.mysql/performance/open_files The number of regular files currently open, which were opened by the server. files
/p.mysql/performance/open_tables The number of tables that are currently open. tables
/p.mysql/performance/open_table_definitions The number of currently cached table definitions or .frm files. integer
/p.mysql/performance/opened_tables The number of tables that have been opened. tables
/p.mysql/performance/opened_table_definitions The number of .frm files that have been cached. integer
/p.mysql/performance/qcache_hits The number of query cache hits. The query cache and qcache_hits metric is deprecated as of MySQL 5.7.20. hits
/p.mysql/performance/questions The number of statements executed by the server since the server started or the last FLUSH STATUS. This includes only statements sent to the server by clients and not statements executed within stored programs, unlike the queries variable. count
/p.mysql/performance/queries The number of statements executed by the server, excluding COM_PING and COM_STATISTICS. Differs from questions in that it also counts statements executed within stored programs. Not affected by FLUSH STATUS. count
/p.mysql/performance/queries_delta The change in the /performance/queries metric since the last time it was emitted. integer greater than or equal to zero
/p.mysql/performance/slow_queries The number of slow queries that have taken more than long_query_time seconds. queries
/p.mysql/performance/table_locks_waited The total number of times that a request for a table lock could not be granted immediately and a wait was needed. count
/p.mysql/performance/threads_connected The number of currently open connections. connections
/p.mysql/performance/threads_running The number of threads that are not sleeping. threads
/p.mysql/performance/user_time Percentage of CPU time spent in user space by MySQL. percent
/p.mysql/rpl_semi_sync_master_no_tx The number of transactions committed without follower acknowledgement. commits
/p.mysql/rpl_semi_sync_master_tx_avg_wait_time The average time the leader has waited for the follower to accept transactions. microseconds
/p.mysql/rpl_semi_sync_master_wait_sessions The current number of connections waiting for a sync commit. For more information about sync replication, see Understanding Synchronous Replication. sessions
/p.mysql/variables/max_connections The maximum permitted number of simultaneous client connections. connections
/p.mysql/variables/open_files_limit The number of files that the operating system permits mysqld to open. files
/p.mysql/variables/read_only Whether the server is in read-only mode boolean
Create a pull request or raise an issue on the source for this page in GitHub