Troubleshooting
Here are problems and fixes related to using PCC.
Acquire Artifacts for Troubleshooting
Gather GemFire Logs
GemFire statistics and log files may be obtained by using the gfsh export logs command.
See Export gfsh Logs for details.
View Statistics Files and Logs
You can visualize the performance of your cluster by downloading the statistics files from your servers. These files are located in the persistent store on each VM. To copy these files to your workstation, run the command:
bosh2 -e BOSH-ENVIRONMENT -d DEPLOYMENT-NAME scp server/0:/var/vcap/store/gemfire-server/statistics.gfs /tmp
Alternatively, use bosh ssh to access the PCC service instance GemFire server VMs and directly obtain the GemFire logs that reside within the directory /var/vcap/sys/log/gemfire-server.
See the Pivotal GemFire Installing and Running VSD topic for information about loading the statistics files into Pivotal GemFire VSD.
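When a cluster has several servers, the copy command above has to be repeated once per VM. The following is a minimal sketch of generating those commands, not an official tool. The function name stats_copy_commands, the server count argument, and the local file naming scheme are assumptions made for illustration. It uses bosh, as the later procedures in this topic do; substitute bosh2 if that is how the CLI is installed on your workstation.

```shell
# Print one "bosh scp" command per server VM, each writing to a distinct
# local file name so the statistics files do not overwrite each other.
# Arguments: BOSH environment, deployment name, number of server VMs,
# and an optional local destination directory (defaults to /tmp).
stats_copy_commands() {
  env_name=$1
  deployment=$2
  server_count=$3
  dest_dir=${4:-/tmp}
  i=0
  while [ "$i" -lt "$server_count" ]; do
    printf 'bosh -e %s -d %s scp server/%s:/var/vcap/store/gemfire-server/statistics.gfs %s/statistics-server-%s.gfs\n' \
      "$env_name" "$deployment" "$i" "$dest_dir" "$i"
    i=$((i + 1))
  done
}
```

Review the printed commands first, then pipe them to a shell to run them, for example:
stats_copy_commands BOSH-ENVIRONMENT DEPLOYMENT-NAME 3 /tmp | sh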
Acquire Thread Dumps
Thread dumps may be useful for debugging. Take at least three thread dumps on each VM, separating them by about one second.
To list your VMs, run:
bosh -e ENV -d DEPLOYMENT vms
See Acquire the Deployment Name for how to obtain the string to substitute for DEPLOYMENT.
Use the VM name in a bosh ssh command to SSH in to the PCC VM where you want to produce the thread dumps. PCC VMs can be referenced using a 0-based index, for example server/0 or locator/2:
bosh -e ENV -d DEPLOYMENT ssh server/0
Get into the BOSH Process Manager (bpm) shell by running
sudo /var/vcap/packages/bpm/bin/bpm shell JOB-NAME
where JOB-NAME is either gemfire-server or gemfire-locator, depending on which PCC VM you are on.
Find the process ID (PID) that is running the GemFire Java process by running
ps -aux | grep java
Typically, the PID is 1.
As you take multiple thread dumps, redirect the output of each to a uniquely named file. This example uses the file name threaddump1.txt:
/var/vcap/packages/jdk8/bin/jcmd 1 Thread.print > /tmp/threaddump1.txt
Files written to /tmp inside the bpm shell are accessible on the VM in the directory /var/vcap/data/gemfire-server/tmp or /var/vcap/data/gemfire-locator/tmp.
Move the files to the /tmp directory on the VM by running
mv /var/vcap/data/gemfire-server/tmp/threaddump1.txt /tmp/
or
mv /var/vcap/data/gemfire-locator/tmp/threaddump1.txt /tmp/
Files can be copied to your local machine using the bosh scp command. From your local machine, run:
bosh -d DEPLOYMENT scp VM:/tmp/threaddump1.txt .
For example:
$ bosh -d service-instance_1fd2850e-b754-4c5e-aa5c-ddb54ee301e6 scp server/0:/tmp/threaddump1.txt .
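The advice to take at least three thread dumps about one second apart can be scripted inside the bpm shell. This is a sketch under stated assumptions rather than a supported utility: the function name take_thread_dumps and the JCMD override variable are hypothetical, and the default jcmd path is the one shown in the procedure above, which may differ on your VM.

```shell
# Take COUNT thread dumps of process PID, pausing one second between them,
# and write each dump to a uniquely named file threaddumpN.txt in OUT_DIR.
# Intended to be run inside the bpm shell on the PCC VM.
# Arguments (all optional): PID (default 1), COUNT (default 3),
# OUT_DIR (default /tmp). Set JCMD to override the jcmd location.
take_thread_dumps() {
  pid=${1:-1}
  count=${2:-3}
  out_dir=${3:-/tmp}
  jcmd_bin=${JCMD:-/var/vcap/packages/jdk8/bin/jcmd}
  n=1
  while [ "$n" -le "$count" ]; do
    # Stop at the first failure so a wrong PID or path is noticed early.
    "$jcmd_bin" "$pid" Thread.print > "$out_dir/threaddump$n.txt" || return 1
    if [ "$n" -lt "$count" ]; then
      sleep 1
    fi
    n=$((n + 1))
  done
}
```

Calling take_thread_dumps with no arguments writes threaddump1.txt through threaddump3.txt to /tmp in the bpm shell. Each file then needs to be moved and copied as described in the steps above.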
Acquire the Deployment Name
The DEPLOYMENT name is needed in several troubleshooting procedures.
To acquire the DEPLOYMENT name:
Use the Pivotal Cloud Foundry CLI. Target the space where the service instance runs.
Discover the globally unique identifier (GUID) for the service instance:
cf service INSTANCE-NAME --guid
The output is the GUID. For example:
$ cf service dev-instance --guid
1fd2850e-b754-4c5e-aa5c-ddb54ee301e6
Prefix the GUID with the string service-instance_ to obtain the DEPLOYMENT name. For the example GUID, the DEPLOYMENT name is service-instance_1fd2850e-b754-4c5e-aa5c-ddb54ee301e6.
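The prefixing step is easy to get wrong by hand, so it can be captured in a small helper. This is only an illustrative sketch, and the function name deployment_name is an assumption, not part of any CLI.

```shell
# Build the BOSH deployment name for a service instance by prefixing
# the service instance GUID with "service-instance_".
deployment_name() {
  printf 'service-instance_%s\n' "$1"
}
```

With the CF CLI targeted at the right space, the two steps can be combined:
deployment_name "$(cf service INSTANCE-NAME --guid)"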
Troubleshooting for Operators
Smoke Test Failures
Error message: “Creating p-cloudcache SERVICE-NAME failed”
Cause of the Problem: The smoke tests could not create an instance of GemFire.
Action: To troubleshoot why the deployment failed, use the CF CLI to create a new service instance using the same plan and download the logs of the service deployment from BOSH.
Error message: “Deleting SERVICE-NAME failed”
Cause of the Problem: The smoke test attempted to clean up a service instance it created and failed to delete the service using the cf delete-service command.
Action: Run BOSH logs to view the logs on the broker or the service instance to see why the deletion may have failed.
Error message: “Cannot connect to the cluster SERVICE-NAME”
Cause of the Problem: The smoke test was unable to connect to the cluster.
Action: Review the logs of your load balancer, and review the logs of your CF Router to ensure the route to your PCC cluster is properly registered.
You also can create a service instance and try to connect to it using the gfsh CLI. This requires creating a service key.
Error message: “Could not perform create/put on Cloud Cache cluster”
Cause of the Problem: The smoke test was unable to write data to the cluster. The user may not have permissions to create a region or write data.
Error message: “Could not retrieve value from Cloud Cache cluster”
Cause of the Problem: The smoke test was unable to read back the data it wrote. Data loss can happen if a cluster member improperly stops and starts again or if the member machine crashes and is resurrected by BOSH.
Action: Run BOSH logs to view the logs on the broker to see if there were any interruptions to the cluster by a service update.
General Connectivity
Problem: Client-to-server communication
Cause of the Problem: PCC clients communicate with PCC servers on port 40404 and with locators on port 55221. Both of these ports must be reachable from the PAS network to the service network.
Problem: Membership port range
Cause of the Problem: PCC servers and locators communicate with each other using UDP and TCP. The current port range for this communication is 49152-65535.
Solution: If you have a firewall between VMs, ensure this port range is open.
Problem: Port range usage across a WAN
Cause of the Problem: Gateway receivers and gateway senders communicate across WAN-separated service instances. Each PCC service instance uses GemFire defaults for the gateway receiver ports. The default is the inclusive range of port numbers 5000 to 5499.
Solution: Ensure this port range is open when WAN-separated service instances will communicate.
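To confirm that a required TCP port is reachable, a quick probe from a machine on the originating network can help. The sketch below is one possible approach and makes several assumptions: the function names check_tcp_port and report_port are hypothetical, bash and the coreutils timeout command are assumed to be installed, and only TCP is tested, so it says nothing about the UDP traffic used for membership.

```shell
# Return success if a TCP connection to HOST:PORT can be opened within
# three seconds. Uses bash's /dev/tcp pseudo-device via "bash -c", so the
# surrounding script itself can be plain sh.
check_tcp_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Print a one-line result for HOST:PORT and return the probe's status.
report_port() {
  if check_tcp_port "$1" "$2"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 NOT reachable"
    return 1
  fi
}
```

For example, run report_port SERVER-IP 40404 and report_port LOCATOR-IP 55221 from the PAS network, or probe a gateway receiver port in the 5000 to 5499 range from the remote side of the WAN. SERVER-IP and LOCATOR-IP are placeholders for your own VM addresses.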
Troubleshooting for Developers
Problem: An error occurs when creating a service instance or when running a smoke test. The service creation issues an error message that starts with
Instance provisioning failed: There was a problem completing your request.
GemFire server logs at /var/vcap/sys/log/gemfire-server/gemfire/server-<N>.log will contain a disk-access error with the string
A DiskAccessException has occurred
and a stack trace similar to this one that begins with
org.apache.geode.cache.persistence.ConflictingPersistentDataException
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:743)
    at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:819)
    at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
    at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1178)
    at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1059)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3089)
Cause of the Problem: The PCC VMs are underprovisioned; the quantity of disk space is too small.
Solution: Use Ops Manager to provision VMs of at least the minimum size. See Configure Service Plans for minimum-size details.
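To check whether a server VM shows this symptom, the server logs can be searched for the error string quoted above. This is a minimal sketch, assuming the log files follow the server-<N>.log naming in a single directory. The function name logs_with_disk_access_error is hypothetical.

```shell
# List the server log files in LOG_DIR that contain the disk-access error
# string. Returns non-zero when no file matches.
# LOG_DIR defaults to the GemFire server log directory named in this topic.
logs_with_disk_access_error() {
  log_dir=${1:-/var/vcap/sys/log/gemfire-server/gemfire}
  grep -l 'A DiskAccessException has occurred' "$log_dir"/server-*.log
}
```

Run it on the server VM after connecting with bosh ssh. Any file names it prints are the logs to inspect for the accompanying stack trace.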