Redis for PCF v1.10

Monitoring Redis for PCF

The Loggregator Firehose exposes Redis metrics. You can use third-party monitoring tools to consume Redis metrics to monitor Redis performance and health.

As an example of how to display KPIs and metrics, see the CF Redis example dashboard, which uses Datadog. Pivotal does not endorse or provide support for any third-party solution.

Metrics Polling Interval

The metrics polling interval defaults to 30 seconds. This can be changed by navigating to the Metrics configuration page and entering a new value in Metrics polling interval (min: 10).

Metrics Format

Metrics are emitted in the following format:

origin:"p-redis" eventType:ValueMetric timestamp:1480084323333475533 deployment:"cf-redis" job:"cf-redis-broker" index:"{redacted}" ip:"10.0.1.49" valueMetric:<name:"/p-redis/service-broker/dedicated_vm_plan/available_instances" value:4 unit:"" >
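
As a rough illustration of consuming this format, the following Python sketch parses a line like the one above into a dictionary. The function name parse_value_metric and the regular expression are illustrative assumptions, not part of Redis for PCF or Loggregator. A real Firehose consumer would typically receive structured envelopes rather than text, so treat this only as a way to inspect logged lines.

```python
import re

# Sample line copied from the format shown above.
SAMPLE = (
    'origin:"p-redis" eventType:ValueMetric timestamp:1480084323333475533 '
    'deployment:"cf-redis" job:"cf-redis-broker" index:"{redacted}" ip:"10.0.1.49" '
    'valueMetric:<name:"/p-redis/service-broker/dedicated_vm_plan/available_instances" '
    'value:4 unit:"" >'
)

# Matches key:"quoted value" or key:bare_value. The nested valueMetric:<...>
# block is flattened, because its inner keys (name, value, unit) do not clash
# with the outer keys.
FIELD_RE = re.compile(r'(\w+):(?:"([^"]*)"|([^\s<>"]+))')


def parse_value_metric(line):
    """Return the fields of one ValueMetric line as a dict (hypothetical helper)."""
    fields = {}
    for key, quoted, bare in FIELD_RE.findall(line):
        fields[key] = quoted if bare == "" else bare
    # Convert the numeric fields; the timestamp is kept as an integer as emitted.
    if "value" in fields:
        fields["value"] = float(fields["value"])
    if "timestamp" in fields:
        fields["timestamp"] = int(fields["timestamp"])
    return fields


metric = parse_value_metric(SAMPLE)
print(metric["name"], metric["value"])
```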

Key Performance Indicators

Key Performance Indicators (KPIs) for Redis for PCF are metrics that operators find most useful for monitoring their Redis service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

Pivotal provides the following KPIs as general alerting and response guidance for typical Redis for PCF installations. Pivotal recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. Pivotal also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

For a list of all other Redis metrics, see Other Redis Metrics.

Redis for PCF Service KPIs

Total Instances For On-Demand Service


total_instances

Description Total instances provisioned by application developers across all On-Demand Services and for a specific On-Demand plan.

Use: Track instance use by app developers.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Daily
Recommended alert thresholds Yellow warning: N/A
Red critical: N/A
Recommended response N/A

Quota Remaining For On-Demand Service


quota_remaining

Description Number of available instances across all On-Demand Services and for a specific On-Demand plan.

Use: Track remaining resources available for app developers.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Daily
Recommended alert thresholds Yellow warning: 3
Red critical: 0
Recommended response Increase quota allowed for the specific plan or across all on-demand services.

Quota Remaining For Shared-VM and Dedicated-VM Service


/p-redis/service-broker/dedicated_vm_plan/available_instances
/p-redis/service-broker/shared_vm_plan/available_instances

Description Number of available instances for the Dedicated-VM or Shared-VM service.

Use: Track remaining resources available for app developers.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Daily
Recommended alert thresholds Yellow warning: 2
Red critical: 0
Recommended response Increase VMs available for the Dedicated-VM service.
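
The quota thresholds for the on-demand service (yellow 3, red 0) and for the Shared-VM and Dedicated-VM services (yellow 2, red 0) can be expressed as a small check. This is a minimal sketch: the function name quota_alert_level is hypothetical, and it assumes that an alert fires when the remaining quota is at or below the threshold.

```python
def quota_alert_level(remaining, warning, critical=0):
    """Classify a remaining-quota reading (hypothetical helper).

    Assumption: a level is reached when `remaining` is at or below
    the corresponding threshold.
    """
    if remaining <= critical:
        return "red"
    if remaining <= warning:
        return "yellow"
    return "ok"


# On-demand service: yellow at 3, red at 0.
print(quota_alert_level(5, warning=3))
# Shared-VM and Dedicated-VM services: yellow at 2, red at 0.
print(quota_alert_level(2, warning=2))
```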

Redis KPIs

Percent of Persistent Disk Used


persistent.disk.percent

Description Percentage of persistent disk being used on a VM. The persistent disk is specified as an IaaS-specific disk type with a size. For example, pd-standard on GCP, or st1 on AWS, with disk size 5 GB. This metric is relevant to the health of the VM. As disk usage approaches 100%, the VM disk becomes unusable because no more files can be written.

Use: Redis is an in-memory data store that uses a persistent disk to back up and restore the dataset in case of upgrades and VM restarts.

Origin: JMX Bridge or BOSH HM
Type: percent
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: >75
Red critical: >90
Recommended response Ensure that the disk is at least 3.5x VM memory. If it is, then contact GSS. If it is not, then increase disk space.
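
The measurement and response above can be sketched as two small checks: one that averages persistent.disk.percent samples from the last 10 minutes against the >75 and >90 thresholds, and one that tests the 3.5x sizing rule. The function names are hypothetical, and collecting the samples is left to your monitoring tool.

```python
from statistics import mean


def disk_alert_level(percent_samples, warning=75.0, critical=90.0):
    """Classify the average of recent disk-usage samples (hypothetical helper).

    `percent_samples` is assumed to hold the readings from the last 10 minutes.
    """
    average = mean(percent_samples)
    if average > critical:
        return "red"
    if average > warning:
        return "yellow"
    return "ok"


def disk_is_large_enough(disk_bytes, vm_memory_bytes, factor=3.5):
    """Return True if the persistent disk is at least `factor` times VM memory."""
    return disk_bytes >= factor * vm_memory_bytes


print(disk_alert_level([70.0, 80.0, 85.0]))
print(disk_is_large_enough(disk_bytes=35 * 1024**3, vm_memory_bytes=10 * 1024**3))
```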

Used Memory Percent


info.memory.used_memory / info.memory.maxmemory

Description The ratio of these two metrics returns the percentage of available memory used:
  • info.memory.used_memory is a metric of the total number of bytes allocated by Redis using its allocator (either standard libc, jemalloc, or an alternative allocator such as tcmalloc).
  • maxmemory is a configuration option for the total memory made available to the Redis instance.
Use: This is a performance metric that is most critical for Redis instances with a maxmemory-policy of allkeys-lru.

Origin: Doppler/Firehose
Type: percentage
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Application-specific based on velocity of data flow. Some options are:

  1. Individual data points—Use if key eviction is in place, for example, in cache use cases.
  2. Average over last 10 minutes—Use if this gives you enough detail.
  3. Maximum of last 10 minutes
If key eviction is not in place, options 1 or 3 give more useful information to ensure that high usage triggers an alert.
Recommended alert thresholds Yellow warning: 80% Not applicable for cache usage. When used as a cache, Redis will typically use up to maxmemory and then evict keys to make space for new entries.

A different threshold might be appropriate for specific use cases of no key eviction, to allow for reaction time. Factors to consider:

  1. Traffic load on application: Higher traffic means that Redis memory will fill up faster.
  2. Average size of data added per transaction: The more data added to Redis in a single transaction, the faster Redis will fill up its memory.
Red critical: 90%. See warning-specific threshold information.
Recommended response No action is needed, assuming the maxmemory policy set meets your application's needs. If the maxmemory policy does not persist data as you wish, either coordinate a backup cadence or update your maxmemory policy if using the on-demand Redis service.
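
The ratio and the three measurement options above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the 80% and 90% thresholds fire at or above the stated value. Because Redis treats a maxmemory of 0 as "no limit", the percentage is undefined in that case and the helper returns None.

```python
from statistics import mean


def used_memory_percent(used_memory, maxmemory):
    """Return used_memory as a percentage of maxmemory, or None if unlimited."""
    if maxmemory <= 0:  # maxmemory 0 means no memory limit is configured
        return None
    return 100.0 * used_memory / maxmemory


def summarize(percent_samples, mode):
    """Apply one of the measurement options to samples from the last 10 minutes."""
    if mode == "last":  # option 1: individual data points
        return percent_samples[-1]
    if mode == "avg":   # option 2: average over the window
        return mean(percent_samples)
    if mode == "max":   # option 3: maximum over the window
        return max(percent_samples)
    raise ValueError("mode must be 'last', 'avg', or 'max'")


def memory_alert_level(percent, warning=80.0, critical=90.0):
    """Classify a percentage (assumption: thresholds are inclusive)."""
    if percent >= critical:
        return "red"
    if percent >= warning:
        return "yellow"
    return "ok"


samples = [70.0, 95.0, 75.0]
print(memory_alert_level(summarize(samples, "avg")))
print(memory_alert_level(summarize(samples, "max")))
```

With these samples, the average hides a spike that the maximum reports, which is why options 1 and 3 are preferred when key eviction is not in place.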

Connected Clients


info.clients.connected_clients

Description Number of clients currently connected to the Redis instance.

Use: Redis does not close client connections. They remain open until closed explicitly by the client or another script. Once the connected_clients reaches maxclients, Redis stops accepting new connections and begins producing ERR max number of clients reached errors.

Origin: Doppler/Firehose
Type: number
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: Application-specific. When connected clients reaches max clients, no more clients can connect. This alert should be at the level where it can tell you that your application has scaled to a certain level and may require action.
Red critical: Application-specific. When connected clients reaches max clients, no more clients can connect. This alert should be at the level where it can tell you that your application has scaled to a certain level and may require action.
Recommended response Increase max clients for your instance if using the on-demand service, or reduce the number of connected clients.

Blocked Clients


info.clients.blocked_clients

Description The number of clients currently blocked waiting for a blocking request they have made to the Redis server. Redis provides two types of primitive commands to retrieve items from lists: standard and blocking. This metric concerns the blocking commands.

Standard Commands

The standard commands (LPOP, RPOP, RPOPLPUSH) immediately return an item from a list. If there are no items available the standard pop commands return nil.

Blocking Commands

The blocking commands (BLPOP, BRPOP, BRPOPLPUSH) wait for an empty list to become non-empty. The client connection is blocked until an item is added to the lists it is watching. Only the client that made the blocking request is blocked, and the Redis server continues to serve other clients.

The blocking commands each take a timeout argument that is the time in seconds the server waits for a list before returning nil. A blocking command with timeout 0 waits forever. Multiple clients may be blocked waiting for the same list. For details of the blocking commands, see: https://redis.io/commands/blpop.

Use: Blocking commands can be useful to avoid clients regularly polling the server for new data. This metric tells you how many clients are currently blocked due to a blocking command.

Origin: Doppler/Firehose
Type: number
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Application-specific. Change from baseline may be more significant than actual value.
Recommended alert thresholds Yellow warning: The expected range of the blocked_clients metric depends on what Redis is being used for:
  • Many uses will have no need for blocking commands and should expect blocked_clients to always be zero.
  • If blocking commands are being used to force a recipient client to wait for a required input, a raised blocked_clients might suggest a problem with the source clients.
  • blocked_clients might be expected to be high in situations where Redis is being used for infrequent messaging.
If blocked_clients is expected to be non-zero, warnings could be based on change from baseline. A sudden rise in blocked_clients could be caused by source clients failing to provide data required by blocked clients.

Red critical: There is no blocked_clients threshold critical to the function of Redis. However a problem that is causing blocked_clients to rise might often cause a rise in connected_clients. connected_clients does have a hard upper limit and should be used to trigger alerts.
Recommended response Analysis could include:

  • Checking the connected_clients metric. blocked_clients would often rise in concert with connected_clients.
  • Establishing whether the rise in blocked_clients is accompanied by an overall increase in applications connecting to Redis, or by an asymmetry in clients providing and receiving data with blocking commands.
  • Considering whether a change in blocked_clients is most likely caused by oversupply of blocking requests or undersupply of data.
  • Considering whether a change in network latency is delaying the data from source clients.
In general, a rise or change in blocked_clients is more likely to suggest a problem in the network or infrastructure, or in the function of client applications, rather than a problem with the Redis service.
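
Because the guidance for blocked_clients is based on change from baseline rather than a fixed value, one possible check compares the current reading with the mean and standard deviation of historical samples. This is only a sketch under stated assumptions: the function name, the choice of a standard-deviation test, and the default of 3 sigmas are illustrative and are not prescribed by Redis for PCF.

```python
from statistics import mean, pstdev


def deviates_from_baseline(baseline_samples, current, sigmas=3.0):
    """Return True if `current` is far from the historical baseline.

    Hypothetical helper: "far" is assumed to mean more than `sigmas`
    population standard deviations away from the baseline mean.
    """
    center = mean(baseline_samples)
    spread = pstdev(baseline_samples)
    if spread == 0:
        # A perfectly flat baseline (for example, always zero blocked clients):
        # any different value counts as a change.
        return current != center
    return abs(current - center) > sigmas * spread


history = [4, 5, 6, 5, 4, 6]
print(deviates_from_baseline(history, 5))
print(deviates_from_baseline(history, 20))
```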

Memory Fragmentation Ratio


info.memory.mem_fragmentation_ratio

Description Ratio of the amount of memory allocated to Redis by the OS to the amount of memory that Redis is using.

Use: A memory fragmentation ratio less than one shows that the memory used by Redis is higher than the OS available memory. In other packagings of Redis, large values reflect memory fragmentation. For Redis for PCF, the instances only run Redis, meaning that no other processes will be affected by a high fragmentation ratio (e.g., 10 or 11).

Origin: Doppler/Firehose
Type: ratio
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: < 1. A ratio less than 1 indicates that the memory used by Redis is higher than the OS available memory, which can lead to performance degradation.
Red critical: Same as warning threshold.
Recommended response Restart the Redis server to normalize fragmentation ratio.
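
For reference, Redis reports mem_fragmentation_ratio as used_memory_rss divided by used_memory. The following sketch recomputes the ratio from those two values and applies the < 1 threshold above. The function names are hypothetical; in practice you would read the ratio directly from the emitted metric.

```python
def mem_fragmentation_ratio(used_memory_rss, used_memory):
    """Return used_memory_rss / used_memory, or None if used_memory is zero."""
    if used_memory == 0:
        return None
    return used_memory_rss / used_memory


def fragmentation_alert(ratio):
    """Return True when the ratio is below 1 (warning and critical are the same)."""
    return ratio is not None and ratio < 1


# The OS has allocated less memory to Redis than Redis reports using.
ratio = mem_fragmentation_ratio(used_memory_rss=80_000_000, used_memory=100_000_000)
print(ratio, fragmentation_alert(ratio))
```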

Instantaneous Operations Per Second


info.stats.instantaneous_ops_per_sec

Description The number of commands processed per second by the Redis server. The instantaneous_ops_per_sec is calculated as the mean of the recent samples taken by the server. The number of recent samples is hardcoded as 16 in the implementation of Redis.

Use: The higher the commands processed per second, the better the performance of Redis. This is because Redis is single threaded and the commands are processed in sequence. A higher throughput would thus mean faster response per request which is a direct indicator of higher performance. A drop in the number of commands processed per second as compared to historical norms could be a sign of either low command volume or slow commands blocking the system. Low command volume could be normal, or it could be indicative of problems upstream.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Every 30 seconds
Recommended alert thresholds Yellow warning: A drop in the count compared to historical norms could be a sign of either low command volume or slow commands blocking the system. Low command volume could be normal, or it could be indicative of problems upstream. Slow commands could be due to a latency issue, a large number of clients being connected to the same instance, memory being swapped out, etc. Thus, the count is possibly a symptom of compromised Redis performance. However, this is not the case when low command volume is expected.

Red critical: A very low count or a large drop from previous counts may indicate a downturn in performance that should result in an investigation, unless the low traffic is expected behavior.
Recommended response A drop in the count may be a symptom of compromised Redis performance. The following are possible responses:

  1. Identify slow commands using the slowlog:
    Redis logs all the commands that take more than a specified amount of time in slowlog. By default, this time is set to 20ms and the slowlog is allowed a maximum of 120 commands. For the purposes of slowlog, execution time is the time taken by Redis alone and does not account for time spent in I/O. So it would not log slow commands solely due to network latency.

    Given that typical commands, including network latency, take about 200 microseconds, a 20ms Redis execution time is 100 times slower. This could be indicative of memory management issues wherein Redis pages have been swapped to disk.

    To see all the commands with slow Redis execution times, type slowlog get in the redis-cli.
  2. Monitor client connections:
    Because Redis is single threaded, one process services requests from all clients. As the number of clients grows, the percentage of resource time given to each client decreases and each client spends an increasing time waiting for their share of Redis server time.

    Monitoring the number of clients may be important because there may be applications creating connections that you did not expect or your application may not be efficiently closing unused connections.

    The connected clients metric can be used to monitor this. This can also be viewed from the redis-cli using the command “info clients”.
  3. Limit client connections:
    The maxclients limit currently defaults to 1000 but, depending on the application, you may wish to limit this further. This can be done by running the command “config set maxclients ” in the redis-cli. Connections that exceed the limit will be rejected and closed immediately.

    Setting maxclients is important to limit the number of unintended client connections and should be set to around 110% to 150% of your expected peak number of connections. In addition, because an error message is returned for failed connection attempts, the maxclients limit warns you that a significant number of unexpected connections are occurring. This helps maintain optimal Redis performance.
  4. Improve memory management:
    Poor memory management can cause increased latency in Redis. If your Redis instance is using more memory than is available, the operating system will swap parts of the Redis process out of physical memory and onto disk. Swapping will significantly reduce Redis performance since reads from disk are about 5 orders of magnitude slower than reads from physical memory.
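
The sizing guidance for maxclients in response 3 (around 110% to 150% of the expected peak number of connections) can be sketched as a small calculation. The function name is hypothetical, and rounding up to the next whole connection is an assumption. Integer arithmetic is used to avoid floating-point rounding surprises.

```python
def suggested_maxclients_range(expected_peak_connections):
    """Return (low, high) bounds at 110% and 150% of the expected peak.

    Hypothetical helper; both bounds are rounded up to whole connections.
    """
    def ceil_percent(value, percent):
        # Ceiling of value * percent / 100 using integers only.
        return -(-value * percent // 100)

    low = ceil_percent(expected_peak_connections, 110)
    high = ceil_percent(expected_peak_connections, 150)
    return low, high


low, high = suggested_maxclients_range(400)
print(f"Set maxclients between {low} and {high}")
```

A value chosen from this range would then be applied with the config set maxclients command described above.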

Keyspace Hits / Keyspace Misses


info.stats.keyspace_hits / info.stats.keyspace_misses

Description Hit ratio, used to determine the share of key lookups that are successful.

Use: The hit ratio shows how often key lookups find the requested key. When Redis is used as a cache, a low hit ratio means that applications frequently have to fetch data from a slower resource.

Origin: Doppler/Firehose
Type: ratio
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Application-specific
Recommended alert thresholds Yellow warning: Application-specific. In general, depending on how an application is using the cache, an expected hit ratio value can vary between 60% and 99%. Also, the same hit ratio values can mean different things for different applications. Every time an application gets a cache miss, it will probably go and fetch the data from a slower resource. This cache miss cost can be different per application. The application developers might be able to provide a threshold that is meaningful for the app and its performance.

Red critical: Application-specific. See the warning threshold above.
Recommended response Application-specific. See the warning threshold above. Work with application developers to understand the performance and cache configuration required for their applications.
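
The 60% to 99% range above reads most naturally as the share of lookups that hit, that is, hits divided by hits plus misses. The following sketch computes the ratio that way; this interpretation and the function names are assumptions. Because keyspace_hits and keyspace_misses are cumulative counters, the second helper computes the ratio for the interval between two readings.

```python
def hit_ratio(keyspace_hits, keyspace_misses):
    """Return hits / (hits + misses), or None if there were no lookups."""
    total = keyspace_hits + keyspace_misses
    if total == 0:
        return None
    return keyspace_hits / total


def hit_ratio_between(prev_hits, prev_misses, hits, misses):
    """Hit ratio for the interval between two readings of the counters."""
    return hit_ratio(hits - prev_hits, misses - prev_misses)


print(hit_ratio(900, 100))
print(hit_ratio_between(100, 50, 190, 60))
```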

BOSH Health Monitor Metrics

The BOSH layer that underlies PCF generates healthmonitor metrics for all VMs in the deployment. However, these metrics are not included in the Loggregator Firehose by default. To get these metrics, do either of the following:

  • To send BOSH HM metrics through the Firehose, install the open-source HM Forwarder.
  • To retrieve BOSH health metrics outside of the Firehose, install the JMX Bridge for PCF tile.

Other Redis Metrics

Redis also exposes the following metrics. For more information, see the Redis documentation.

  • arch_bits
  • uptime_in_seconds
  • uptime_in_days
  • hz
  • lru_clock
  • client_longest_output_list
  • client_biggest_input_buf
  • used_memory_rss
  • used_memory_peak
  • used_memory_lua
  • loading
  • rdb_bgsave_in_progress
  • rdb_last_save_time
  • rdb_last_bgsave_time_sec
  • rdb_current_bgsave_time_sec
  • aof_rewrite_in_progress
  • aof_rewrite_scheduled
  • aof_last_rewrite_time_sec
  • aof_current_rewrite_time_sec
  • total_connections_received
  • total_commands_processed
  • instantaneous_ops_per_sec
  • total_net_input_bytes
  • total_net_output_bytes
  • instantaneous_input_kbps
  • instantaneous_output_kbps
  • rejected_connections
  • sync_full
  • sync_partial_ok
  • sync_partial_err
  • expired_keys
  • evicted_keys
  • keyspace_hits
  • keyspace_misses
  • pubsub_channels
  • pubsub_patterns
  • latest_fork_usec
  • migrate_cached_sockets
  • repl_backlog_active
  • repl_backlog_size
  • repl_backlog_first_byte_offset
  • repl_backlog_histlen
  • used_cpu_sys
  • used_cpu_user
  • used_cpu_sys_children
  • used_cpu_user_children
  • rdb_last_bgsave_status
  • aof_last_bgrewrite_status
  • aof_last_write_status