Monitoring and KPIs for On-Demand MySQL for PCF

Page last updated:

This topic explains how to monitor the health of the MySQL for Pivotal Cloud Foundry (PCF) service using the logs, metrics, and Key Performance Indicators (KPIs) generated by MySQL for PCF component VMs.

For general information about logging and metrics in PCF, see Logging and Metrics.

About Metrics

Metrics are regularly-generated log messages that report measured component states. The default metrics polling interval is 30 seconds for MySQL instances and 60 seconds for the service broker.

You can configure the interval for MySQL instance in the Configure Monitoring pane in the MySQL for PCF tile. For more information, see Configure Monitoring.

Metrics are long, single lines of text that follow the format:

origin:"p.mysql" eventType:ValueMetric timestamp:1496776477930669688 deployment:"service-instance_2b5a001f-2bf3-460c-aee6-fd2253f9fb0c" job:"mysql" index:"b09df494-b731-4d06-a4b0-c2985ceedf4c" ip:"10.0.8.4" valueMetric:<name:"/p.mysql/performance/open_files" value:24 unit:"file" >

Access MySQL Metrics

To access metrics for MySQL for PCF service instances, you can use Loggregator’s Log Cache feature with the Log Cache CLI plugin. Log Cache is enabled by default in Pivotal Application Service (PAS) v2.2 and later.

To access metrics for on-demand service instances, do the following:

  1. To install the cf CLI plugin, run the following command:

    cf install-plugin -r CF-Community "log-cache"
    
  2. To access metrics for a service instance, run the following command:

    cf tail SERVICE-INSTANCE-NAME
    

    Two useful flags to append to this command are:

    • -f, --follow: Append metrics logs to stdout as they are generated, rather than returning a fixed number of metrics or metrics over a fixed interval.
    • --json: Output metrics logs as envelopes in JSON format.
      For a complete list of cf tail options, see the Log Cache CLI repository.

    Where SERVICE-INSTANCE-NAME is the name of your service instance.

    For example:

     $ cf tail -f my-instance | egrep 'connected|available|persistent_disk_used_percent'
     2019-05-17T11:25:59.48-0700 [my-instance] GAUGE /p.mysql/available:1.000000 boolean
     2019-05-17T11:26:29.49-0700 [my-instance] GAUGE /p.mysql/system/persistent_disk_used_percent:17.000000 percentage
     2019-05-17T11:26:29.49-0700 [my-instance] GAUGE /p.mysql/performance/threads_connected:**6**.000000 connection
     2019-05-17T11:26:59.50-0700 [my-instance] GAUGE /p.mysql/available:1.000000 boolean
     2019-05-17T11:27:29.50-0700 [my-instance] GAUGE /p.mysql/system/persistent_disk_used_percent:17.000000 percentage
     2019-05-17T11:27:29.50-0700 [my-instance] GAUGE /p.mysql/performance/threads_connected:**7**.000000 connection
    

    For more information about the metrics output, see the Key Performance Indicators and Component Metrics sections below.

For more information about how to enable Log Cache and about the cf tail command, see Enable Log Cache.

KPIs for MySQL Service Instances

Key Performance Indicators (KPIs) for MySQL for PCF are metrics that developers find most useful for monitoring their MySQL service instances to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

Pivotal provides the following KPIs as general alerting and response guidance for typical MySQL for PCF installations. Pivotal recommends that you continue to fine-tune the alert measures by observing historical trends. Pivotal also recommends that you expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from your installations.

For a list of all the MySQL for PCF component metrics, see Component Metrics.

Server Availability


/p.mysql/available

Description MySQL Server is currently responding to requests, which indicates if the component is available.

Use: If the server does not emit heartbeats, it is offline.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s
Recommended measurement Average over last 5 minutes
Recommended alert thresholds Yellow warning: N/A
Red critical: < 1
Recommended response Check the MySQL Server logs for errors. You can find the instance by targeting your MySQL deployment with BOSH and inspecting logs for the instance. For more information, see Failing Jobs and Unhealthy Instances.

If your service plan is an highly available (HA) cluster, you can also run mysql-diag to check logs for errors.

Persistent Disk Used


/p.mysql/system/persistent_disk_used_percent

Description The percentage of disk used on the persistent file system.

Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base.
Origin: Doppler/Firehose
Type: Percent
Frequency: 30 s (default)
Recommended measurement Maximum of persistent disk used of all of nodes
Recommended alert thresholds Single Node and Leader Follower:
  • Yellow warning: > 25%
  • Red critical: > 30%
Highly Available Cluster:
  • Yellow warning: > 80%
  • Red critical: > 90%
Recommended response Upgrade the service instance to a plan with larger disk capacity.

Ephemeral Disk Used


/p.mysql/system/ephemeral_disk_used_percent

Description The percentage of disk used on the ephemeral file system.

Use: MySQL cannot function correctly if there is not sufficient free space on the file systems. Use these metrics to ensure that you have disks large enough for your user base.
Origin: Doppler/Firehose
Type: Percent
Frequency: 30 s (default)
Recommended measurement Maximum ephemeral disk used of all nodes
Recommended alert thresholds Yellow warning: > 80%
Red critical: > 95%
Recommended response Upgrade the service instance to a plan with larger disk capacity.

CPU Utilization Percent


/p.mysql/performance/cpu_utilization_percent

Description CPU time being consumed by the MySQL service.

Use: A node that experiences context switching or high CPU usage will become unresponsive. This also affects the ability of the node to report metrics.
Origin: Doppler/Firehose
Type: Percent
Frequency: 30 s (default)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: > 80 Red critical: > 90
Recommended response Determine what is using so much CPU. If it is from normal processes, update the service instance to use a plan with larger CPU capacity.

Connections


/p.mysql/variables/max_connections

/p.mysql/net/max_used_connections

Description The maximum number of connections used over the maximum permitted number of simultaneous client connections.

Use: If the number of connections drastically changes or if apps are unable to connect, there might be a network or app issue.
Origin: Doppler/Firehose
Type: count
Frequency: 30 s
Recommended measurement max_used_connections / max_connections
Recommended alert thresholds Yellow warning: > 80 %
Red critical: > 90 %
Recommended response If this measurement mets or exceeds 80% with linear growth, increase the value of max_connections.

If this measurement mets or exceeds 80% with exponential growth, monitor app usage to ensure everything is behaving as expected.

When approaching 100% of max connections, apps may be experiencing times when they cannot connect to the database. The connections/second for a service instance vary based on app instances and app utilization.

Queries Delta


/p.mysql/performance/queries_delta

Description The number of statements executed by the server over the last 30 seconds.

Use: The server should always be processing some queries. If the server does not process any queries, the server is non-functional.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s
Recommended measurement Average over last 2 minutes
Recommended alert thresholds Red critical: 0
Recommended response Investigate the MySQL server logs, such as the audit log, to understand why query rate changed and determine appropriate action.

Highly Available Cluster WSREP Ready


/p.mysql/galera/wsrep_ready

Description Shows whether each cluster node can accept queries. Returns only 0 or 1. When this metric is 0, almost all queries to that node fail with the error:
ERROR 1047 (08501) Unknown Command

Use: Discover when nodes of a cluster have been unable to communicate and accept transactions.

Origin: Doppler/Firehose
Type: boolean
Frequency: 30 s (default)
Recommended measurement Average of values of each cluster node, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 1
Red critical: 0 (cluster is down)
Recommended response
  • Run mysql-diag and check the MySQL Server logs for errors.
  • Make sure there has been no infrastructure event that affects intra-cluster communication.
  • Ensure that wsrep_ready has not been set to off by using the query:
    SHOW STATUS LIKE 'wsrep_ready';.

Highly Available Cluster WSREP Cluster Size


/p.mysql/galera/wsrep_cluster_size

Description The number of cluster nodes with which each node is communicating normally.

Use: When running in a multi-node configuration, this metric indicates if each member of the cluster is communicating normally with all other nodes.

Origin: Doppler/Firehose
Type: count
Frequency: 30 s (default)
Recommended measurement (Average of the values of each node / cluster size), over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3 (availability compromised)
Red critical: < 1 (cluster unavailable)
Recommended response Run mysql-diag and check the MySQL Server logs for errors.

Highly Available Cluster WSREP Cluster Status


/p.mysql/galera/wsrep_cluster_status

Description Shows the primary status of the cluster component that the node is in.
Values are:
  • Primary: Node has a quorum.
  • Non-primary: Node has lost a quorum.
  • Disconnected: Node is unable to connect to other nodes.
Use: Any value other than Primary indicates that the node is part of a non-operational component. This occurs in cases of multiple membership changes that result in a loss of quorum.

Origin: Doppler/Firehose
Type: integer (see above)
Frequency: 30 s (default)
Recommended measurement Sum of each of the nodes, over the last 5 minutes
Recommended alert thresholds Yellow warning: < 3
Red critical: < 1
Recommended response
  • Check node status to ensure that they are all in working order and able to receive write-sets
  • Run mysql-diag and check the MySQL Server logs for errors

Component Metrics

In addition to the above KPIs, the MySQL service emits the followings metrics that can be used for monitoring and alerting:

MySQL Metrics

All MySQL for PCF service instances emit the following metrics:

/p.mysql/available
Description Indicates if the local database server is available and responding.
Unit boolean

/p.mysql/variables/max_connections
Description The maximum permitted number of simultaneous client connections.
Unit count

/p.mysql/variables/open_files_limit
Description The number of files that the operating system permits mysqld to open.
Unit files

/p.mysql/variables/read_only
Description Whether the server is in read-only mode
Unit boolean

/p.mysql/performance/questions
Description The number of statements executed by the server since the server started or the last FLUSH STATUS. This includes only statements sent to the server by clients and not statements executed within stored programs, unlike the queries variable.
Unit count

/p.mysql/performance/queries
Description The number of statements executed by the server, excluding COM_PING and COM_STATISTICS. Differs from questions in that it also counts statements executed within stored programs. Not affected by FLUSH STATUS.
Unit count

/p.mysql/performance/queries_delta
Description The change in the /performance/queries metric since the last time it was emitted.
Unit integer greater than or equal to zero

/p.mysql/innodb/buffer_pool_pages_free
Description The amount of free space, measured in pages, in the InnoDB Buffer Pool.
Unit count

/p.mysql/innodb/buffer_pool_pages_total
Description The total amount of free space, measured in pages, in the InnoDB Buffer Pool containing data.
Unit count

/p.mysql/innodb/buffer_pool_pages_data
Description The number of pages in the InnoDB Buffer Pool containing data. The number includes both dirty and clean pages.
Unit count

/p.mysql/innodb/row_lock_current_waits
Description The number of row locks currently being waited for by operations on InnoDB tables.
Unit count

/p.mysql/innodb/data_read
Description The amount of data read since the server started.
Unit bytes

/p.mysql/innodb/data_written
Description The amount of data written since the server started.
Unit bytes

/p.mysql/innodb/mutex_os_waits
Description The number of mutex OS waits.
Unit events/second

/p.mysql/innodb/mutex_spin_rounds
Description The number of mutex spin rounds.
Unit events/second

/p.mysql/innodb/mutex_spin_waits
Description The number of mutex spin waits.
Unit events/second

/p.mysql/innodb/os_log_fsyncs
Description The number of fsync() writes done to the InnoDB redo log files.
Unit count

/p.mysql/innodb/row_lock_time
Description Total time spent in acquiring row locks.
Unit milliseconds

/p.mysql/innodb/row_lock_waits
Description The number of times a row lock had to be waited for since the server started.
Unit count

/p.mysql/net/connections
Description The number of connection attempts to the server, both successful and unsuccessful, to the MySQL server.
Unit count

/p.mysql/net/max_used_connections
Description The maximum number of connections that have been in use simultaneously since the server started.
Unit count

/p.mysql/performance/com_delete
Description The number of delete commands since the server started or the last FLUSH STATUS.
Unit count

/p.mysql/performance/com_delete_multi
Description The number of delete-multi commands since the server started or the last FLUSH STATUS. Applies to DELETE statements that use multiple-table syntax.
Unit count

/p.mysql/performance/com_insert
Description The number of insert commands since the server started or the last FLUSH STATUS.
Unit count

/p.mysql/performance/com_insert_select
Description The number of insert-select commands since the server started or the last FLUSH STATUS.
Unit count

/p.mysql/performance/com_replace_select
Description The number of replace-select commands since the server started or the last FLUSH STATUS.
Unit count

/p.mysql/performance/com_select
Description The number of select commands since the server started or the last FLUSH STATUS. If a query result is returned from query cache, the server increments the qcache_hits status variable, not com_select.
Unit count

/p.mysql/performance/com_update
Description The number of update commands since the server started or the last FLUSH STATUS.
Unit count

/p.mysql/performance/com_update_multi
Description The number of update-multi commands since the server started or the last FLUSH STATUS. Applies to UPDATE statements that use multiple-table syntax.
Unit count

/p.mysql/performance/created_tmp_disk_tables
Description The number of internal on-disk temporary tables created by the server while executing statements.
Unit count

/p.mysql/performance/created_tmp_files
Description The number of temporary files created by mysqld.
Unit count

/p.mysql/performance/created_tmp_tables
Description The number of internal temporary tables created by the server while executing statements.
Unit count

/p.mysql/performance/cpu_utilization_percent
Description The percent of the CPU in use by all processes on the MySQL node.
Unit percent

/p.mysql/performance/open_files
Description The number of regular files currently open, which were opened by the server.
Unit count

/p.mysql/performance/open_tables
Description The number of tables that are currently open.
Unit count

/p.mysql/performance/opened_tables
Description The number of tables that have been opened.
Unit count

/p.mysql/performance/open_table_definitions
Description The number of currently cached table definitions or .frm files.
Unit count

/p.mysql/performance/opened_table_definitions
Description The number of .frm files that have been cached.
Unit count

/p.mysql/performance/qcache_hits
Description The number of query cache hits. The query cache and qcache_hits metric is deprecated as of MySQL 5.7.20.
Unit count

/p.mysql/performance/slow_queries
Description The number of queries that have taken more than long_query_time seconds.
Unit count

/p.mysql/performance/table_locks_waited
Description The total number of times that a request for a table lock could not be granted immediately and a wait was needed.
Unit count

/p.mysql/performance/threads_connected
Description The number of currently open connections.
Unit count

/p.mysql/performance/threads_running
Description The number of threads that are not sleeping.
Unit count

/p.mysql/rpl_semi_sync_master_tx_avg_wait_time
Description The average time the leader has waited for the follower to accept transactions.
Unit microseconds

/p.mysql/rpl_semi_sync_master_no_tx
Description The number of transactions committed without follower acknowledgement.
Unit count

/p.mysql/rpl_semi_sync_master_wait_sessions
Description The current number of connections waiting for a sync commit. For more information about sync replication, see About Synchronous Replication.
Unit count

Disk Metrics

All MySQL for PCF services emit the following disk usage metrics:

/p.mysql/system/persistent_disk_used_percent
Description The percentage of disk used on the persistent file system.
Unit percent

/p.mysql/system/persistent_disk_used
Description The amount of memory used on the persistent disk.
Unit KB

/p.mysql/system/persistent_disk_free
Description The amount of memory available on the persistent disk.
Unit KB

/p.mysql/system/persistent_disk_inodes_used_percent
Description The percentage of persistent disk inodes used by both the system and user applications.
Unit percent

/p.mysql/system/persistent_disk_inodes_used
Description The number of inodes used on the persistent disk.
Unit count

/p.mysql/system/persistent_disk_inodes_free
Description The number of inodes available on the persistent disk.
Unit count

/p.mysql/system/ephemeral_disk_used_percent
Description The percentage of disk used on the ephemeral file system.
Unit percent

/p.mysql/system/ephemeral_disk_used
Description The amount of memory used on the ephemeral disk.
Unit KB

/p.mysql/system/ephemeral_disk_free
Description The amount of memory available on the ephemeral disk.
Unit KB

/p.mysql/system/ephemeral_disk_inodes_used_percent
Description The percentage of ephemeral disk inodes used by both the system and user applications.
Unit percent

/p.mysql/system/ephemeral_disk_inodes_used
Description The number of inodes used on the ephemeral disk.
Unit count

/p.mysql/system/ephemeral_disk_inodes_free
Description The number of inodes available on the ephemeral disk.
Unit count

Leader-Follower Metrics

Leader-follower VMs emit the followings metrics:

/p.mysql/follower/is_follower
Description Shows whether a node is the follower VM.
Unit boolean

/p.mysql/follower/seconds_behind_master
Description The number of seconds the follower VM is behind in applying writes from the leader VM to its own database.
Unit seconds

/p.mysql/follower/seconds_since_leader_heartbeat
Description The number of seconds that elapses between the leader heartbeat and the replication of the heartbeat in the follower database.
Unit seconds

/p.mysql/follower/relay_log_space
Description The total size of all existing relay log files.
Unit bytes

/p.mysql/follower/slave_io_running
Description Shows whether the I/O thread has started and has connected to the leader VM.
Unit boolean

/p.mysql/follower/slave_sql_running
Description Shows whether the SQL thread has started.
Unit boolean

Highly Available Cluster Metrics

HA clusters emit the followings metrics:

/p.mysql/galera/wsrep_cluster_size
Description The current number of nodes in the HA cluster.
Unit count

/p.mysql/galera/wsrep_local_recv_queue
Description The current length of the local receive queue, in messages.
Unit count

/p.mysql/galera/wsrep_local_send_queue
Description The current length of the local send queue, in messages.
Unit count

/p.mysql/galera/wsrep_local_index
Description This node index in the cluster (base 0).
Unit count

/p.mysql/galera/wsrep_local_state
Description The local state of the node. Possible states include:
  • 1 = JOINING
  • 2 = DONOR/DESYNCED
  • 3 = JOINED
  • 4 = SYNCED
Unit integer

/p.mysql/galera/wsrep_ready
Description Shows whether the node can accept queries.
Unit boolean

/p.mysql/galera/wsrep_cluster_status
Description Shows the primary status of the cluster component that the node is in. Values are:
  • Primary: Node has a quorum.
  • Non-primary: Node has lost a quorum.
  • Disconnected: Node is unable to connect to other nodes.
Unit Status code

/p.mysql/galera/wsrep_flow_control_paused
Description Proportion of time, as a unit interval (0 to 1), that replication was paused due to flow control since the server started or last FLUSH STATUS. This metric is a measure of how much replication lag is slowing down the cluster.
Unit float

/p.mysql/galera/wsrep_flow_control_sent
Description Number of FC_PAUSE or flow control pause events sent by this node. Unlike many status variables, the counter for this metric does not reset every time you run the query.
Unit count

/p.mysql/galera/wsrep_flow_control_recv
Description Number of FC_PAUSE or flow control pause events received by this node. Unlike many status variables, the counter for this metric does not reset every time you run the query.
Unit count