Troubleshooting Problems in PCF

Page last updated:

This guide provides help with resolving issues encountered during a Pivotal Cloud Foundry (PCF) installation.

Retrying the Deployment

Although an install or update can fail for many reasons, the system is self-healing, and can often automatically correct or work around hardware or network faults.

Click Install or Apply Changes again, and the system may resolve a problem on its own.

Some failures only produce generic errors like Exited with 1. In cases like this, where a failure is not accompanied by useful information, click Install or Apply Changes to retry.

When the system does provide informative evidence, review the Common Problems section at the end of this guide to see if your problem is covered there.

Common Issues

Compare evidence that you have gathered to the descriptions below. If your issue is covered, try the recommended remediation procedures.

BOSH Does Not Reinstall

You might want to reinstall BOSH for troubleshooting purposes. However, if PCF does not detect any changes, BOSH does not reinstall. To force a reinstall of BOSH, select BOSH Director > Resource Sizes and change a resource value. For example, you could increase the amount of RAM by 4 MB.

Creating Bound Missing VMs Times Out

This task happens immediately following package compilation, but before job assignment to agents. For example:

cloud_controller/0: Timed out pinging to f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)

This is most likely a NATS issue with the VM in question. To identify a NATS issue, inspect the agent log for the VM. Since the BOSH director is unable to reach the BOSH agent, you must access the VM using another method. You will likely also be unable to access the VM using TCP. In this case, access the VM using your virtualization console.

To diagnose:

  1. Access the VM using your virtualization console and log in.

  2. Navigate to the Credentials tab of the Pivotal Application Service (PAS) tile and locate the VM in question to find the VM credentials.

  3. Become root.

  4. Run cd /var/vcap/bosh/log.

  5. Open the file current.

  6. First, determine whether the BOSH agent and director have successfully completed a handshake, represented in the logs as a “ping-pong”:

    2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[],
    "reply_to"=>"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"}
    
    2013-10-03_14:35:48.60182 #[608] INFO: reply_to:   director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab:
    payload: {:value=>"pong"}
    

    This handshake must complete for the agent to receive instructions from the director.

  7. If you do not see the handshake, look for another line near the beginning of the file, prefixed INFO: loaded new infrastructure settings. For example:

    2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings:
    {"vm"=>{"name"=>"vm-4d80ede4-b0a5-4992-aea6a0386e18e", "id"=>"vm-360"},
    "agent_id"=>"56aea4ef-6aa9-4c39-8019-7024ccfdde4",
    "networks"=>{"default"=>{"ip"=>"192.0.2.19",
    "netmask"=>"255.255.255.0", "cloud_properties"=>{"name"=>"VMNetwork"},
    "default"=>["dns", "gateway"],
    "dns"=>["192.0.2.2", "192.0.2.17"], "gateway"=>"192.0.2.2",
    "dns_record_name"=>"0.nats.default.cf-d729343071061.microbosh",
    "mac"=>"00:50:56:9b:71:67"}}, "disks"=>{"system"=>0, "ephemeral"=>1,
    "persistent"=>{}}, "ntp"=>[], "blobstore"=>{"provider"=>"dav",
    "options"=>{"endpoint"=>"http://192.0.2.17:25250",
    "user"=>"agent", "password"=>"agent"}},
    "mbus"=>"nats://nats:nats@192.0.2.17:4222",
    "env"=>{"bosh"=>{"password"=>"$6$40ftQ9K4rvvC/8ADZHW0"}}}
     

This is a JSON blob of key/value pairs representing the expected infrastructure for the BOSH agent. For this issue, the following section is the most important:

"mbus"=>"nats://nats:nats@192.0.2.17:4222"

This key/value pair represents where the agent expects the NATS server to be. One diagnostic tactic is to try pinging this NATS IP address from the VM to determine whether you are experiencing routing issues.

Install Exits With a Creates/Updates/Deletes App Failure or With a 403 Error

Scenario 1: Your PCF install exits with the following 403 error when you attempt to log in to the Apps Manager:

{"type": "step_finished", "id": "apps-manager.deploy"}

/home/tempest-web/tempest/web/vendor/bundle/ruby/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:306:in
`fetch': 403 => Net::HTTPForbidden for https://login.api.example.net/oauth/authorizeresponse_type=code&client_id=portal&redirect_uri=https%3...
-- unhandled response (Mechanize::ResponseCodeError)

Scenario 2: Your PCF install exits with a creates/updates/deletes an app (FAILED - 1) error message with the following stack trace:

1) App CRUD creates/updates/deletes an app
   Failure/Error: Unable to find matching line from backtrace
   CFoundry::TargetRefused:
     Connection refused - connect(2)

In either of the above scenarios, ensure that you have correctly entered your domains in wildcard format:

  1. Browse to the Operations Manager fully qualified domain name (FQDN).

  2. Click the PAS tile.

  3. Select HAProxy and click Generate Self-Signed RSA Certificate.

  4. Enter your system and app domains in wildcard format, as well as optionally any custom domains, and click Save. Refer to PAS Cloud Controller for explanations of these domain values.

Rsa cert

Install Fails When Gateway Instances Exceed Zero

If you configure the number of Gateway instances to be greater than zero for a given product, you create a dependency on PAS for that product installation. If you attempt to install a product tile with an PAS dependency before installing PAS, the install fails.

To change the number of Gateway instances, click the product tile, then select Settings > Resource sizes > INSTANCES and change the value next to the product Gateway job.

To remove the PAS dependency, change the value of this field to 0.

Out of Disk Space Error

PCF displays an Out of Disk Space error if log files expand to fill all available disk space. If this happens, rebooting the PCF installation VM clears the tmp directory of these log files and resolves the error.

If users receive Out of Disk Space errors when trying to push apps, this can mean that Diego cells may be running out of disk space capacity.

To perform a detailed analysis of disk usage by containers and host VMs in your PAS deployment, see Examining GrootFS Disk Usage.

Installing BOSH Director Fails

If the DNS information for the PCF VM is incorrectly specified when deploying the PCF .ova file, installing BOSH Director fails at the “Installing Micro BOSH” step.

To resolve this issue, correct the DNS settings in the PCF Virtual Machine properties.

Deleting Ops Manager Fails

Ops Manager displays an error message when it cannot delete your installation. This scenario might happen if the BOSH Director cannot access the VMs or is experiencing other issues. To manually delete your installation and all VMs, you must do the following:

  1. Use your IaaS dashboard to manually delete the VMs for all installed products, with the exception of the Ops Manager VM.
  2. SSH into your Ops Manager VM and remove the installation.yml file from /var/tempest/workspaces/default/.

    Note: Deleting the installation.yml file does not prevent you from reinstalling Ops Manager. For future deploys, Ops Manager regenerates this file when you click Save on any page in the BOSH Director.

Your installation is now deleted.

Installing PAS Fails

If the DNS information for the PCF VM becomes incorrect after BOSH Director has been installed, installing PAS with Pivotal Operations Manager fails at the “Verifying app push” step.

To resolve this issue, correct the DNS settings in the PCF Virtual Machine properties.

Ops Manager Hangs During MicroBOSH Install or HAProxy States “IP Address Already Taken”

During an Ops Manager installation, you might receive the following errors:

  • The Ops Manager GUI shows that the installation stops at the “Setting MicroBOSH deployment manifest” task.
  • When you set the IP address for the HAProxy, the “IP Address Already Taken” message appears.

When you install Ops Manager, you assign it an IP address. Ops Manager then takes the next two consecutive IP addresses, assigns the first to MicroBOSH, and reserves the second. For example:

203.0.113.1 - Ops Manager (User assigned)
203.0.113.2 - MicroBOSH (Ops Manager assigned)
203.0.113.3 - Reserved (Ops Manager reserved)

To resolve this issue, ensure that the next two subsequent IP addresses from the manually assigned address are unassigned.

Poor PCF Performance

If you notice poor network performance by your PCF deployment and your deployment uses a Network Address Translation (NAT) gateway, your NAT gateway may be under-resourced.

Troubleshoot

To troubleshoot the issue, set a custom firewall rule in your IaaS console to route traffic originating from your private network directly to an S3-compatible object store. If you see decreased average latency and improved network performance, perform the solution below to scale up your NAT gateway.

Scale Up Your NAT Gateway

Perform the following steps to scale up your NAT gateway:

  1. Navigate to your IaaS console.
  2. Spin up a new NAT gateway of a larger VM size than your previous NAT gateway.
  3. Change the routes to direct traffic through the new NAT gateway.
  4. Spin down the old NAT gateway.

The specific procedures will vary depending on your IaaS. Consult your IaaS documentation for more information.

Common Issues Caused by Firewalls

This section describes various issues you might encounter when installing PAS in an environment that uses a strong firewall.

DNS Resolution Fails

When you install PCF in an environment that uses a strong firewall, the firewall might block DNS resolution. To resolve this issue, refer to the Troubleshooting DNS Resolution Issues section of the Preparing Your Firewall for Deploying PCF topic.

Create a pull request or raise an issue on the source for this page in GitHub