Monitoring Pivotal Cloud Cache Service Instances

PCC clusters and brokers emit service metrics. You can use any tool that has a corresponding Cloud Foundry nozzle to read and monitor these metrics in real time.

In the descriptions of the metrics, KPI stands for Key Performance Indicator.

Service Instance Metrics

Member Count


serviceinstance.MemberCount

Description: Returns the number of members in the distributed system.
Metric Type: number
Suggested measurement: Every second
Measurement Type: count
Warning Threshold: less than the manifest member count
Suggested Actions: Compare the emitted value with the expected member count, which is available in the BOSH manifest. If the two differ, the situation is critical and may lead to data loss; investigate the reasons for node failure by examining the service logs.
Why a KPI? Member loss, for any reason, can potentially cause data loss.
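
If you forward these metrics to Prometheus (see Monitoring PCC Service Instances with Prometheus, below), the warning condition can be expressed as an alerting rule. The following sketch is illustrative only: the metric name pcc_serviceinstance_member_count and the expected count of 4 are placeholders, because the name under which your nozzle or exporter exposes this metric depends on its configuration. Substitute the name actually emitted and the member count from your BOSH manifest.

    # Hypothetical Prometheus alerting rule; the metric name and the expected
    # member count (4) are placeholders for your deployment's values.
    groups:
    - name: pcc-service-instance
      rules:
      - alert: PCCMemberCountBelowExpected
        expr: pcc_serviceinstance_member_count < 4
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: PCC cluster has fewer members than the BOSH manifest declares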

Total Available Heap Size


serviceinstance.TotalHeapSize

Description: Returns the total available heap, in megabytes, across all instance members.
Metric Type: number
Suggested measurement: Every second
Measurement Type: pulse
Why a KPI? If the total heap size and used heap size are too close, the system might see thrashing due to GC activity. This increases latency.

Total Used Heap Size


serviceinstance.UsedHeapSize

Description: Returns the total heap used across all instance members, in megabytes.
Metric Type: number
Suggested measurement: Every second
Measurement Type: pulse
Why a KPI? If the total heap size and used heap size are too close, the system might see thrashing due to GC activity. This increases latency.

Total Available Heap Size as a Percentage


serviceinstance.UnusedHeapSizePercentage

Description: Returns the unused heap, as a percentage of the total available heap, across all instance members.
Metric Type: percent
Suggested measurement: Every second
Measurement Type: compound metric
Warning Threshold: 40%
Critical Threshold: 10%
Suggested Actions: If the drop is a spike caused by eviction catching up with the insert frequency, watch closely to ensure the metric does not fall to the critical threshold. If eviction is not configured, horizontal scaling is suggested.
Why a KPI? If the total heap size and used heap size are too close, the system might see thrashing due to GC activity. This increases latency.
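
The warning and critical thresholds for this metric translate directly into alerting rules. This sketch assumes the metric reaches Prometheus under the placeholder name pcc_serviceinstance_unused_heap_size_percentage; check your exporter's output for the actual name, and add the rules to a rule group such as the one shown in the Member Count example.

    # Hypothetical rules; the metric name is a placeholder.
    - alert: PCCUnusedHeapLow
      expr: pcc_serviceinstance_unused_heap_size_percentage < 40
      for: 5m
      labels:
        severity: warning
    - alert: PCCUnusedHeapCriticallyLow
      expr: pcc_serviceinstance_unused_heap_size_percentage < 10
      for: 5m
      labels:
        severity: critical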

Per Member Metrics

Memory Used as a Percentage


member.UsedMemoryPercentage

Description: RAM being consumed, expressed as a percentage.
Metric Type: percent
Suggested measurement: Average over last 10 minutes
Measurement Type: average
Warning Threshold: 75%
Critical Threshold: 85%

Count of Java Garbage Collections


member.GarbageCollectionCount

Description: The number of times that garbage has been collected.
Metric Type: number
Suggested measurement: Sum over last 10 minutes
Measurement Type: count
Warning Threshold: Dependent on the IaaS and app use case.
Critical Threshold: Dependent on the IaaS and app use case.
Suggested Actions: Check the number of queries run against the system; queries increase the deserialization of objects, which generates more garbage.
Why a KPI? If garbage collection is frequent, the system might see high CPU usage, which causes delays in the cluster.

CPU Utilization Percentage


member.HostCpuUsage

Description: This member’s process CPU utilization, expressed as a percentage.
Metric Type: percent
Suggested measurement: Average over last 10 minutes
Measurement Type: average
Warning Threshold: 85%
Critical Threshold: 95%
Suggested Actions: If this occurs without high GC activity, the system is reaching its limits. Horizontal scaling might help.
Why a KPI? High CPU usage causes delayed responses and can make the member unresponsive, which can cause it to be removed from the cluster, potentially leading to data loss.

Average Latency of Get Operations


member.GetsAvgLatency

Description: The average latency of cache get operations, in nanoseconds.
Metric Type: number
Suggested measurement: Average over last 10 minutes
Measurement Type: average
Warning Threshold: Dependent on the IaaS and app use case.
Critical Threshold: Dependent on the IaaS and app use case.
Suggested Actions: If this occurs without high GC activity, the system is reaching its limit. Horizontal scaling might help.
Why a KPI? This is a good indicator of the overall responsiveness of the system. If this number is high, the service administrator should diagnose the root cause.

Average Latency of Put Operations


member.PutsAvgLatency

Description: The average latency of cache put operations, in nanoseconds.
Metric Type: number
Suggested measurement: Average over last 10 minutes
Measurement Type: average
Warning Threshold: Dependent on the IaaS and app use case.
Critical Threshold: Dependent on the IaaS and app use case.
Suggested Actions: If this occurs without high GC activity, the system is reaching its limit. Horizontal scaling might help.
Why a KPI? This is a good indicator of the overall responsiveness of the system. If this number is high, the service administrator should diagnose the root cause.

JVM Pauses


member.JVMPauses

Description: The number of JVM pauses.
Metric Type: number
Suggested measurement: Sum over 2 seconds
Measurement Type: count
Warning Threshold: Dependent on the IaaS and app use case.
Critical Threshold: Dependent on the IaaS and app use case.
Suggested Actions: Check the size of the cached objects; if an object is larger than 1 MB, you may be hitting the JVM’s limits on garbage collecting large objects. Otherwise, you may be hitting the utilization limit on the cluster and need to scale up to add more memory to the cluster.
Why a KPI? During a JVM pause, the member stops responding to “are-you-alive” messages, which may cause it to be removed from the cluster.

File Descriptor Limit


member.FileDescriptorLimit

Description: The maximum number of open file descriptors allowed by the member’s host operating system.
Metric Type: number
Suggested measurement: Every second
Measurement Type: pulse
Why a KPI? If the number of open file descriptors exceeds the number available, the member stops responding and crashes.

Open File Descriptors


member.TotalFileDescriptorOpen

Description: The current number of open file descriptors.
Metric Type: number
Suggested measurement: Every second
Measurement Type: pulse
Why a KPI? If the number of open file descriptors exceeds the number available, the member stops responding and crashes.

Quantity of Remaining File Descriptors


member.FileDescriptorRemaining

Description: The number of available file descriptors.
Metric Type: number
Suggested measurement: Every second
Measurement Type: compound metric
Warning Threshold: 1000
Critical Threshold: 100
Suggested Actions: Scale horizontally to increase capacity.
Why a KPI? If the number of open file descriptors exceeds the number available, the member stops responding and crashes.

Threads Waiting for a Reply


member.ReplyWaitsInProgress

Description: The number of threads currently waiting for a reply.
Metric Type: number
Suggested measurement: Average over the past 10 seconds
Measurement Type: pulse
Warning Threshold: 1
Critical Threshold: 10
Suggested Actions: If the value does not average to zero over the sample interval, the member is waiting for responses from other members. There are two possible explanations: either another member is unhealthy, or the network is dropping packets. Check other members’ health, and check for network issues.
Why a KPI? Unhealthy members are removed from the cluster, possibly leading to data loss.
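
To check the “averages to zero” condition in Prometheus, you can average the gauge over a time window. The metric name below is a placeholder for whatever name your exporter emits, and the window is widened to one minute so that it spans several scrape intervals (the 10-second window suggested above may contain no samples at common scrape intervals):

    # Hypothetical PromQL expression; the metric name is a placeholder.
    avg_over_time(pcc_member_reply_waits_in_progress[1m]) > 0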

Gateway Sender and Gateway Receiver Metrics

These are metrics emitted through the CF Nozzle for gateway senders and gateway receivers.

Queue Size for the Gateway Sender

gatewaySender.<sender-id>.EventQueueSize
Description: The current size of the gateway sender queue.
Metric Type: number
Measurement Type: count

Events Received at the Gateway Sender

gatewaySender.<sender-id>.EventsReceivedRate
Description: A count of the events coming from the region to which the gateway sender is attached, counted since the last time the metric was checked. The first time it is checked, the count is the number of events since the gateway sender was created.
Metric Type: number
Measurement Type: count

Events Queued by the Gateway Sender

gatewaySender.<sender-id>.EventsQueuedRate
Description: A count of the events queued on the gateway sender from the region, counted since the last time the metric was checked. This count might be lower than the count of events received, as not all received events are queued. The first time it is checked, the count is the number of events since the gateway sender was created.
Metric Type: number
Measurement Type: count

Events Received by the Gateway Receiver

gatewayReceiver.EventsReceivedRate
Description: A count of the events received from the gateway sender that will be applied to the region on the gateway receiver’s site, counted since the last time the metric was checked. The first time it is checked, the count is the number of events since the gateway receiver was created.
Metric Type: number
Measurement Type: count

Disk Metrics

These are metrics emitted through the CF Nozzle for disks.

Average Latency of Disk Writes

diskstore.DiskWritesAvgLatency
Description: The average latency of disk writes, in nanoseconds.
Metric Type: number
Measurement Type: time in nanoseconds

Quantity of Bytes on Disk

diskstore.TotalSpace
Description: The total number of bytes on the attached disk.
Metric Type: number
Measurement Type: count

Quantity of Available Bytes on Disk

diskstore.UseableSpace
Description: The total number of bytes of available space on the attached disk.
Metric Type: number
Measurement Type: count

Experimental Metrics

These metrics are experimental: any of them may be renamed or removed without advance notice.

These experimental metrics are specific to a region (REGION-NAME):

  • region.REGION-NAME.BucketCount
  • region.REGION-NAME.CreatesRate
  • region.REGION-NAME.DestroyRate
  • region.REGION-NAME.EntrySize
  • region.REGION-NAME.FullPath
  • region.REGION-NAME.GetsRate
  • region.REGION-NAME.NumRunningFunctions
  • region.REGION-NAME.PrimaryBucketCount
  • region.REGION-NAME.PutAllRate
  • region.REGION-NAME.PutLocalRate
  • region.REGION-NAME.PutRemoteRate
  • region.REGION-NAME.PutsRate
  • region.REGION-NAME.RegionType
  • region.REGION-NAME.SystemRegionEntryCount
  • region.REGION-NAME.TotalBucketSize
  • region.REGION-NAME.TotalRegionCount
  • region.REGION-NAME.TotalRegionEntryCount

These experimental metrics are specific to a member:

  • member.AverageReads
  • member.AverageWrites
  • member.BytesReceivedRate
  • member.BytesSentRate
  • member.CacheServer
  • member.ClientConnectionCount
  • member.CreatesRate
  • member.CurrentHeapSize
  • member.DeserializationRate
  • member.DestroysRate
  • member.FunctionExecutionRate
  • member.GetRequestRate
  • member.GetsRate
  • member.MaxMemory
  • member.MemberUpTime
  • member.NumThreads
  • member.PDXDeserializationRate
  • member.PutAllRate
  • member.PutRequestRate
  • member.PutsRate
  • member.TotalBucketCount
  • member.TotalHitCount
  • member.TotalMissCount
  • member.TotalPrimaryBucketCount

Total Memory Consumption

The BOSH mem-check errand calculates and outputs the amount of memory used across all PCC service instances. This errand helps PCF operators monitor resource costs, which are based on memory usage.

From the director, run a BOSH command of the form:

bosh -d <service broker name> run-errand mem-check

For example, with a service broker named cloudcache-service-broker:

bosh -d cloudcache-service-broker run-errand mem-check

Here is an anonymized portion of example output from the mem-check errand for a two-cluster deployment:

           Analyzing deployment xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx1...
           JVM heap usage for service instance xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx1
           Used Total = 1204 MB
           Max Total = 3201 MB

           Analyzing deployment xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2...
           JVM heap usage for service instance xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2
           Used Total = 986 MB
           Max Total = 3201 MB

           JVM heap usage for all clusters everywhere:
           Used Global Total = 2390 MB
           Max Global Total = 6402 MB

Monitoring PCC Service Instances with Prometheus

Prometheus is one of several tools you can use to monitor service instances. It is a monitoring and alerting toolkit that supports metric scraping. You can use the Firehose exporter to export all the metrics from the Firehose and then graph them with Grafana to monitor your PCC cluster.

Follow the Prometheus deployment instructions to deploy Prometheus alongside your PCF deployment.

Prometheus can be deployed on any IaaS. Verify that the Firehose exporter job can reach your UAA VM; this might require opening firewall rules or allowing outgoing traffic from the VM.
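
If you run your own Prometheus server rather than relying on a generated configuration, a minimal scrape configuration for the Firehose exporter might look like the following sketch. The target address is a placeholder, and 9186 is a commonly used default port for the exporter; adjust both to match your deployment.

    # Hypothetical scrape configuration; replace the target with your exporter's address.
    scrape_configs:
    - job_name: firehose
      static_configs:
      - targets: ['FIREHOSE-EXPORTER-HOST:9186']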

Grafana Example

In Grafana, you can run queries on the specific metrics that are important to you and build a custom dashboard from them.
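
For example, a Grafana panel could chart used heap as a percentage of total heap for a service instance. The metric names below are placeholders corresponding to serviceinstance.UsedHeapSize and serviceinstance.TotalHeapSize; substitute the names your Firehose exporter actually exposes.

    # Hypothetical PromQL query for a Grafana panel; metric names are placeholders.
    100 * pcc_serviceinstance_used_heap_size / pcc_serviceinstance_total_heap_size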