Monitoring Pivotal Cloud Cache Service Instances
Warning: Pivotal Cloud Cache v1.7 is no longer supported because it has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
PCC clusters and brokers emit service metrics.
You can use any tool that has a corresponding Cloud Foundry nozzle to read and monitor these metrics in real time.
As an app developer, when you opt to use a data service, you should be prepared to:
- monitor the state of that service
- triage issues that occur with that service
- be notified of any concerns
If you believe an issue relates to the underlying infrastructure (network, CPU, memory, or disk),
you will need to capture evidence and notify your platform team. The metrics described in this
section can help in characterizing the performance and resource consumption of your service
instance.
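As an illustration only, the following sketch (not part of PCC) evaluates a couple of the service instance KPIs described below against their suggested thresholds. The metric names and thresholds come from this page; how you obtain the current values depends on the nozzle-based monitoring tool you use.

```python
# Illustrative sketch only: evaluate a few of the PCC service instance KPIs
# described on this page against their suggested thresholds. The metric
# values are assumed to have been collected already by whatever monitoring
# tool is attached to your Cloud Foundry nozzle.

def check_service_instance_kpis(metrics, expected_member_count):
    """Return a list of (severity, message) tuples for threshold breaches."""
    alerts = []

    # serviceinstance.MemberCount: warn if below the member count from the
    # BOSH manifest, since member loss can lead to data loss.
    members = metrics.get("serviceinstance.MemberCount")
    if members is not None and members < expected_member_count:
        alerts.append(("critical",
                       f"MemberCount {members} < expected {expected_member_count}"))

    # serviceinstance.UnusedHeapSizePercentage: warning at 40%, critical at 10%.
    unused_heap = metrics.get("serviceinstance.UnusedHeapSizePercentage")
    if unused_heap is not None:
        if unused_heap <= 10:
            alerts.append(("critical", f"unused heap at {unused_heap}%"))
        elif unused_heap <= 40:
            alerts.append(("warning", f"unused heap at {unused_heap}%"))

    return alerts


# Example with made-up values:
sample = {
    "serviceinstance.MemberCount": 3,
    "serviceinstance.UnusedHeapSizePercentage": 35.0,
}
for severity, message in check_service_instance_kpis(sample, expected_member_count=4):
    print(severity, message)
```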
Service Instance Metrics
In the descriptions of the metrics, KPI stands for Key Performance Indicator.
Member Count
serviceinstance.MemberCount
| Description | Returns the number of members in the distributed system. |
| Metric Type | number |
| Suggested measurement | Every second |
| Measurement Type | count |
| Warning Threshold | less than the manifest member count |
| Suggested Actions | This depends on the expected member count, which is available in the BOSH manifest. If the number expected is different from the number emitted, this is a critical situation that may lead to data loss, and the reasons for node failure should be investigated by examining the service logs. |
| Why a KPI? | Member loss due to any reason can potentially cause data loss. |
Total Available Heap Size
serviceinstance.TotalHeapSize
| Description | Returns the total available heap, in megabytes, across all instance members. |
| Metric Type | number |
| Suggested measurement | Every second |
| Measurement Type | pulse |
| Why a KPI? | If the total heap size and used heap size are too close, the system might see thrashing due to GC activity. This increases latency. |
Total Used Heap Size
serviceinstance.UsedHeapSize
| Description | Returns the total heap used across all instance members, in megabytes. |
| Metric Type | number |
| Suggested measurement | Every second |
| Measurement Type | pulse |
| Why a KPI? | If the total heap size and used heap size are too close, the system might see thrashing due to GC activity. This increases latency. |
Total Available Heap Size as a Percentage
serviceinstance.UnusedHeapSizePercentage
| Description | Returns the proportion of total available heap across all instance members, expressed as a percentage. |
| Metric Type | percent |
| Suggested measurement | Every second |
| Measurement Type | compound metric |
| Warning Threshold | 40% |
| Critical Threshold | 10% |
| Suggested Actions | If this is a temporary spike caused by eviction catching up with the insert frequency, keep a close watch to ensure the value does not reach the critical threshold. If there is no eviction, horizontal scaling is suggested. |
| Why a KPI? | If the total heap size and used heap size are too close, the system might see thrashing due to GC activity. This increases latency. |
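Because serviceinstance.UnusedHeapSizePercentage is a compound metric, it presumably corresponds to the relationship between the total and used heap metrics above; the exact derivation used by the emitted metric is not shown on this page. A minimal sketch of that relationship and of the suggested 40%/10% thresholds, under that assumption:

```python
# Sketch only: based on the descriptions above, the unused-heap percentage
# presumably corresponds to (TotalHeapSize - UsedHeapSize) / TotalHeapSize.
# This is an assumption for illustration, not the emitted metric itself.

def unused_heap_percentage(total_heap_mb, used_heap_mb):
    """Percentage of total available heap that is still unused."""
    return (total_heap_mb - used_heap_mb) / total_heap_mb * 100.0

def classify(unused_pct, warning=40.0, critical=10.0):
    """Map the percentage onto the suggested warning/critical thresholds."""
    if unused_pct <= critical:
        return "critical"
    if unused_pct <= warning:
        return "warning"
    return "ok"

# Example: 3201 MB total heap with 2400 MB used is about 25% unused -> "warning".
print(classify(unused_heap_percentage(3201, 2400)))
```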
Per Member Metrics
Memory Used as a Percentage
member.UsedMemoryPercentage
| Description | The percentage of RAM being consumed. |
| Metric Type | percent |
| Suggested measurement | Average over last 10 minutes |
| Measurement Type | average |
| Warning Threshold | 75% |
| Critical Threshold | 85% |
Count of Java Garbage Collections
member.GarbageCollectionCount
| Description | The number of times that garbage has been collected. |
| Metric Type | number |
| Suggested measurement | Sum over last 10 minutes |
| Measurement Type | count |
| Warning Threshold | Dependent on the IaaS and app use case. |
| Critical Threshold | Dependent on the IaaS and app use case. |
| Suggested Actions | Check the number of queries run against the system, which increases the deserialization of objects and increases garbage. |
| Why a KPI? | If the frequency of garbage collection is high, the system might see high CPU usage, which causes delays in the cluster. |
CPU Utilization Percentage
member.HostCpuUsage
| Description | This member's process CPU utilization, expressed as a percentage. |
| Metric Type | percent |
| Suggested measurement | Average over last 10 minutes |
| Measurement Type | average |
| Warning Threshold | 85% |
| Critical Threshold | 95% |
| Suggested Actions | If this is not accompanied by high GC activity, the system is reaching its limits. Horizontal scaling might help. |
| Why a KPI? | High CPU usage causes delayed responses and can also make the member non-responsive. This can cause the member to be kicked out of the cluster, potentially leading to data loss. |
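member.HostCpuUsage, like several other per-member KPIs on this page, has a suggested measurement of an average over the last 10 minutes. A minimal sketch of such a rolling average, assuming one sample per second (the sampling interval is an assumption, not something this page specifies):

```python
from collections import deque

# Sketch of a rolling average for per-member KPIs such as member.HostCpuUsage
# or member.UsedMemoryPercentage, which this page suggests evaluating as an
# average over the last 10 minutes. Assumes one sample per second; adjust
# window_size to match your actual scrape interval.

class RollingAverage:
    def __init__(self, window_size=600):
        self.samples = deque(maxlen=window_size)

    def add(self, value):
        self.samples.append(value)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else None


cpu = RollingAverage()
for sample in (82.0, 91.5, 88.0):      # made-up member.HostCpuUsage samples
    cpu.add(sample)

avg = cpu.average()
if avg is not None and avg >= 95:
    print(f"critical: average CPU {avg:.1f}%")
elif avg is not None and avg >= 85:
    print(f"warning: average CPU {avg:.1f}%")
```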
Average Latency of Get Operations
member.GetsAvgLatency
| Description | The average latency of cache get operations, in nanoseconds. |
| Metric Type | number |
| Suggested measurement | Average over last 10 minutes |
| Measurement Type | average |
| Warning Threshold | Dependent on the IaaS and app use case. |
| Critical Threshold | Dependent on the IaaS and app use case. |
| Suggested Actions | If this is not accompanied by high GC activity, the system is reaching its limit. Horizontal scaling might help. |
| Why a KPI? | It is a good indicator of the overall responsiveness of the system. If this number is high, the service administrator should diagnose the root cause. |
Average Latency of Put Operations
member.PutsAvgLatency
| Description | The average latency of cache put operations, in nanoseconds. |
| Metric Type | number |
| Suggested measurement | Average over last 10 minutes |
| Measurement Type | average |
| Warning Threshold | Dependent on the IaaS and app use case. |
| Critical Threshold | Dependent on the IaaS and app use case. |
| Suggested Actions | If this is not accompanied by high GC activity, the system is reaching its limit. Horizontal scaling might help. |
| Why a KPI? | It is a good indicator of the overall responsiveness of the system. If this number is high, the service administrator should diagnose the root cause. |
JVM pauses
member.JVMPauses
| Description | The number of JVM pauses. |
| Metric Type | number |
| Suggested measurement | Sum over 2 seconds |
| Measurement Type | count |
| Warning Threshold | Dependent on the IaaS and app use case. |
| Critical Threshold | Dependent on the IaaS and app use case. |
| Suggested Actions | Check the cached object size; if it is greater than 1 MB, you may be hitting the JVM's limitations in garbage collecting large objects. Otherwise, you may be hitting the utilization limit on the cluster and will need to scale up to add more memory to the cluster. |
| Why a KPI? | During a JVM pause, the member stops responding to "are-you-alive" messages, which may cause this member to be kicked out of the cluster. |
File Descriptor Limit
member.FileDescriptorLimit
| Description | The maximum number of open file descriptors allowed for the member's host operating system. |
| Metric Type | number |
| Suggested measurement | Every second |
| Measurement Type | pulse |
| Why a KPI? | If the number of open file descriptors exceeds the number available, the member stops responding and crashes. |
Open File Descriptors
member.TotalFileDescriptorOpen
| Description | The current number of open file descriptors. |
| Metric Type | number |
| Suggested measurement | Every second |
| Measurement Type | pulse |
| Why a KPI? | If the number of open file descriptors exceeds the number available, the member stops responding and crashes. |
Quantity of Remaining File Descriptors
member.FileDescriptorRemaining
| Description | The number of available file descriptors. |
| Metric Type | number |
| Suggested measurement | Every second |
| Measurement Type | compound metric |
| Warning Threshold | 1000 |
| Critical Threshold | 100 |
| Suggested Actions | Scale horizontally to increase capacity. |
| Why a KPI? | If the number of open file descriptors exceeds the number available, the member stops responding and crashes. |
Threads Waiting for a Reply
member.ReplyWaitsInProgress
| Description | The quantity of threads currently waiting for a reply. |
| Metric Type | number |
| Suggested measurement | Average over the past 10 seconds |
| Measurement Type | pulse |
| Warning Threshold | 1 |
| Critical Threshold | 10 |
| Suggested Actions | If the value does not average to zero over the sample interval, then the member is waiting for responses from other members. There are two possible explanations: either another member is unhealthy, or the network is dropping packets. Check other members' health, and check for network issues. |
| Why a KPI? | Unhealthy members are excluded from the cluster, possibly leading to data loss. |
Gateway Sender and Gateway Receiver Metrics
These are metrics emitted through the CF Nozzle for gateway senders and gateway receivers.
Queue Size for the Gateway Sender
gatewaySender.<sender-id>.EventQueueSize
| Description | The current size of the gateway sender queue. |
| Metric Type | number |
| Measurement Type | count |
Events Received at the Gateway Sender
gatewaySender.<sender-id>.EventsReceivedRate
| Description | A count of the events coming from the region to which the gateway sender is attached. It is the count since the last time the metric was checked. The first time it is checked, the count is of the number of events since the gateway sender was created. |
| Metric Type | number |
| Measurement Type | count |
Events Queued by the Gateway Sender
gatewaySender.<sender-id>.EventsQueuedRate
| Description | A count of the events queued on the gateway sender from the region. This quantity of events might be lower than the quantity of events received, as not all received events are queued. It is a count since the last time the metric was checked. The first time it is checked, the count is of the number of events since the gateway sender was created. |
| Metric Type | number |
| Measurement Type | count |
Events Received by the Gateway Receiver
gatewayReceiver.EventsReceivedRate
| Description | A count of the events received from the gateway sender which will be applied to the region on the gateway receiver's site. It is the count since the last time the metric was checked. The first time it is checked, the count is of the number of events since the gateway receiver was created. |
| Metric Type | number |
| Measurement Type | count |
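The gateway sender metric names embed the sender ID, as in gatewaySender.<sender-id>.EventQueueSize. A small illustrative sketch (the regex and the sender ID in the example are hypothetical, not part of PCC) for splitting such names so values can be grouped per sender:

```python
import re

# Sketch: gateway sender metric names embed the sender ID, for example
# gatewaySender.<sender-id>.EventQueueSize. This illustrative regex pulls
# the sender ID and the metric name apart so values can be grouped per sender.
SENDER_METRIC = re.compile(r"^gatewaySender\.(?P<sender_id>[^.]+)\.(?P<metric>.+)$")

def split_sender_metric(name):
    match = SENDER_METRIC.match(name)
    if match is None:
        return None
    return match.group("sender_id"), match.group("metric")

# Example with a hypothetical sender ID "send_to_east":
print(split_sender_metric("gatewaySender.send_to_east.EventQueueSize"))
# -> ('send_to_east', 'EventQueueSize')
```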
Disk Metrics
These are metrics emitted through the CF Nozzle for disks.
Average Latency of Disk Writes
diskstore.DiskWritesAvgLatency
| Description | The average latency of disk writes in nanoseconds. |
| Metric Type | number |
| Measurement Type | time in nanoseconds |
Quantity of Bytes on Disk
diskstore.TotalSpace
| Description | The total number of bytes on the attached disk. |
| Metric Type | number |
| Measurement Type | count |
Quantity of Available Bytes on Disk
diskstore.UseableSpace
| Description | The total number of bytes of available space on the attached disk. |
| Metric Type | number |
| Measurement Type | count |
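Taken together, diskstore.TotalSpace and diskstore.UseableSpace allow a simple disk-usage percentage. A minimal sketch, with an illustrative threshold that this page does not itself suggest:

```python
# Illustrative sketch: diskstore.TotalSpace and diskstore.UseableSpace (both
# in bytes) can be combined into a disk-usage percentage. The 85% threshold
# below is an illustrative choice, not one suggested by this page.

def disk_used_percentage(total_space_bytes, useable_space_bytes):
    return (total_space_bytes - useable_space_bytes) / total_space_bytes * 100.0

total = 10 * 1024 ** 3     # made-up 10 GiB attached disk
useable = 1 * 1024 ** 3    # made-up 1 GiB still available
used_pct = disk_used_percentage(total, useable)
print(f"{used_pct:.1f}% of the attached disk is in use")
if used_pct >= 85:
    print("consider adding disk capacity")
```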
Experimental Metrics
These metrics are experimental. Any of these metrics may be removed without advance notice, and the current name of any of these metrics may change. Expect changes to these metrics.
These experimental metrics are specific to a region (REGION-NAME):
region.REGION-NAME.BucketCount
region.REGION-NAME.CreatesRate
region.REGION-NAME.DestroyRate
region.REGION-NAME.EntrySize
region.REGION-NAME.FullPath
region.REGION-NAME.GetsRate
region.REGION-NAME.NumRunningFunctions
region.REGION-NAME.PrimaryBucketCount
region.REGION-NAME.PutAllRate
region.REGION-NAME.PutLocalRate
region.REGION-NAME.PutRemoteRate
region.REGION-NAME.PutsRate
region.REGION-NAME.RegionType
region.REGION-NAME.SystemRegionEntryCount
region.REGION-NAME.TotalBucketSize
region.REGION-NAME.TotalRegionCount
region.REGION-NAME.TotalRegionEntryCount
These experimental metrics are specific to a member:
member.AverageReads
member.AverageWrites
member.BytesReceivedRate
member.BytesSentRate
member.CacheServer
member.ClientConnectionCount
member.CreatesRate
member.CurrentHeapSize
member.DeserializationRate
member.DestroysRate
member.FunctionExecutionRate
member.GetRequestRate
member.GetsRate
member.MaxMemory
member.MemberUpTime
member.NumThreads
member.PDXDeserializationRate
member.PutAllRate
member.PutRequestRate
member.PutsRate
member.TotalBucketCount
member.TotalHitCount
member.TotalMissCount
member.TotalPrimaryBucketCount
Total Memory Consumption
The BOSH mem-check errand calculates and outputs the quantity of memory used across all PCC service instances. This errand helps PCF operators monitor resource costs, which are based on memory usage.
From the director, run a BOSH command of the form:
bosh -d <service broker name> run-errand mem-check
For example:
bosh -d cloudcache-service-broker run-errand mem-check
Here is an anonymized portion of example output from the mem-check errand for a two-cluster deployment:
Analyzing deployment xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx1...
JVM heap usage for service instance xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx1
Used Total = 1204 MB
Max Total = 3201 MB
Analyzing deployment xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2...
JVM heap usage for service instance xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2
Used Total = 986 MB
Max Total = 3201 MB
JVM heap usage for all clusters everywhere:
Used Global Total = 2390 MB
Max Global Total = 6402 MB
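If you want to track these numbers over time, the following is a minimal sketch of parsing the totals out of mem-check output, assuming it keeps the "<label> = <value> MB" layout shown in the example above; other layouts would need different parsing.

```python
import re

# Sketch: pull the per-deployment and global heap totals out of mem-check
# output that matches the example layout above. Assumes the "Used Total",
# "Max Total", "Used Global Total", and "Max Global Total" lines keep the
# "<label> = <value> MB" form shown.

TOTAL_LINE = re.compile(r"^\s*(Used|Max)( Global)? Total\s*=\s*(\d+)\s*MB\s*$")

def parse_mem_check(output):
    totals = {}
    for line in output.splitlines():
        match = TOTAL_LINE.match(line)
        if match:
            kind = match.group(1) + (match.group(2) or "")
            totals.setdefault(kind, []).append(int(match.group(3)))
    return totals

example = """\
Used Total = 1204 MB
Max Total = 3201 MB
Used Global Total = 2390 MB
Max Global Total = 6402 MB
"""
print(parse_mem_check(example))
# -> {'Used': [1204], 'Max': [3201], 'Used Global': [2390], 'Max Global': [6402]}
```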