Pivotal Cloud Foundry v1.7

Pivotal Cloud Foundry Troubleshooting Guide


Note: Pivotal Cloud Foundry (PCF) for vCloud Air and vCloud Director is deprecated and availability is restricted to existing customers. Contact Support for more information.

This guide provides help with diagnosing and resolving issues encountered during a Pivotal Cloud Foundry (PCF) installation. For help troubleshooting issues that are specific to PCF deployments on VMware vSphere, refer to the topic on Troubleshooting Ops Manager for VMware vSphere.

An install or update can fail for many reasons. Fortunately, the system tends to heal or work around hardware or network faults. By the time you click the Install or Apply Changes button again, the problem may be resolved.

Some failures produce only generic errors like Exited with 1. In cases like this, where a failure is not accompanied by useful information, retry clicking Install or Apply Changes.

When the system does provide informative evidence, review the Common Issues section at the end of this guide to see if your problem is covered there.

Beyond whether products install successfully, an important area to consider when troubleshooting is communication between the VMs that Pivotal Cloud Foundry deploys. Depending on which products you install, this communication takes the form of messaging, routing, or both. If either fails, an installation can fail. For example, in an Elastic Runtime installation, the PCF VM pushes a test application to the cloud during post-installation testing. The installation fails if the resulting traffic cannot be routed to the HAProxy load balancer.

Viewing the Debug Endpoint

The debug endpoint is a web page that provides information useful in troubleshooting. If you have superuser privileges and can view the Ops Manager Installation Dashboard, you can access the debug endpoint.

  • In a browser, open the URL:

    https://OPS-MANAGER-FQDN/debug

The debug endpoint offers three links:

  • Files allows you to view the YAML files that Ops Manager uses to configure products that you install. The most important YAML file, installation.yml, provides networking settings and describes MicroBOSH, the VM whose BOSH Director component Ops Manager uses to install and update Elastic Runtime and other products.
  • Components describes the components in detail.
  • Rails log shows errors thrown by the VM where the Ops Manager web application (a Rails application) is running, as recorded in the production.log file. See the next section to learn how to explore other logs.

Logging Tips

Identifying Where to Start

This section contains general tips for locating where a particular problem is called out in the log files. Refer to the later sections for tips regarding specific logs (such as those for Elastic Runtime Components).

  • Start with the largest and most recently updated files in the job log
  • Identify logs that contain ‘err’ in the name
  • Scan the file contents for a “failed” or “error” string
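The tips above translate directly into a few shell commands. The sketch below assumes an unzipped job log directory; LOG_DIR is a placeholder path, not a name from the product:

```shell
# Placeholder: point LOG_DIR at an unzipped job log directory.
LOG_DIR=${LOG_DIR:-.}

# Largest files first:
ls -lS "$LOG_DIR"
# Most recently modified first:
ls -lt "$LOG_DIR"
# Logs with "err" in the name (no match is not an error here):
ls "$LOG_DIR" | grep -i err || true
# Files whose contents mention "failed" or "error", case-insensitive:
grep -rli -e failed -e error "$LOG_DIR" || true
```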

Viewing Logs for Elastic Runtime Components

To troubleshoot specific Elastic Runtime components by viewing their log files, browse to the Ops Manager interface and follow the procedure below.

  1. In Ops Manager, browse to the Pivotal Elastic Runtime > Status tab. In the Job column, locate the component of interest.
  2. In the Logs column for the component, click the download icon.


  3. Browse to the Pivotal Elastic Runtime > Logs tab.


  4. Once the zip file corresponding to the component of interest moves to the Downloaded list, click the linked file path to download the zip file.

  5. Once the download completes, unzip the file.

The contents of the log directory vary depending on which component you view. For example, the Diego cell log directory contains subdirectories for the metron_agent, rep, monit, and garden processes. To view the standard error stream for garden, download the Diego cell logs and open diego.0.job > garden > garden.stderr.log.

Viewing Web Application and BOSH Failure Logs in a Terminal Window

You can obtain diagnostic information from the Operations Manager by logging in to the VM where it is running. To log in to the Operations Manager VM, you need the following information:

  • The IP address of the PCF VM shown in the Settings tab of the Ops Manager Director tile.
  • Your import credentials. Import credentials are the username and password used to import the PCF .ova or .ovf file into your virtualization system.

Complete the following steps to log in to the Operations Manager VM:

  1. Open a terminal window.
  2. Run ssh IMPORT-USERNAME@PCF-VM-IP-ADDRESS to connect to the PCF installation VM.
  3. Enter your import password when prompted.
  4. Change directories to the home directory of the web application:

    cd /home/tempest-web/tempest/web/

  5. From this directory, you can verify that the web application is in the expected state.

    You can also verify that MicroBOSH is successfully installed. A successful MicroBOSH installation is required to install Elastic Runtime and any other products, such as databases and messaging services.

  6. Change directories to the BOSH installation log home:

    cd /var/tempest/workspaces/default/deployments/micro

  7. You may want to begin by running a tail command on the current log:

    tail LOG-FILE-NAME

    If you are unable to resolve an issue by viewing configurations, exploring logs, or reviewing common problems, you can troubleshoot further by running BOSH diagnostic commands with the BOSH Command Line Interface (CLI).
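A sketch of the usual first round of diagnostics with the v1 bosh CLI follows; BOSH-DIRECTOR-IP is a placeholder for the Director address shown in the Ops Manager Director tile:

```
$ bosh target BOSH-DIRECTOR-IP   # point the CLI at the BOSH Director
$ bosh login                     # authenticate with the Director credentials
$ bosh deployments               # list the deployments the Director manages
$ bosh vms                       # show each VM, its job name, and agent state
$ bosh cloudcheck                # scan for, and optionally repair, inconsistencies
```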

Note: Do not manually modify the deployment manifest. Operations Manager will overwrite manual changes to this manifest. In addition, manually changing the manifest may cause future deployments to fail.

Viewing the VMs in Your Deployment

To view the VMs in your PCF deployment, perform the following steps specific to your IaaS.

Amazon Web Services (AWS)

  1. Log in to the AWS Console.
  2. Navigate to the EC2 Dashboard.
  3. Click Running Instances.
  4. Click the gear icon in the upper right.
  5. Select the following boxes: job, deployment, director, index.
  6. Click Close.

OpenStack

  1. Install the novaclient.
  2. Point novaclient to your OpenStack installation and tenant by exporting the following environment variables:
    $ export OS_AUTH_URL=YOUR_KEYSTONE_AUTH_ENDPOINT
    $ export OS_TENANT_NAME=TENANT_NAME
    $ export OS_USERNAME=USERNAME
    $ export OS_PASSWORD=PASSWORD
    
  3. List your VMs by running the following command:
    $ nova list --fields metadata
    

vSphere

  1. Log in to vCenter.
  2. Select Hosts and Clusters.
  3. Select the top-level object that contains your PCF deployment. For example, select Cluster, Datastore, or Resource Pool.
  4. In the top tab, click Related Objects.
  5. Select Virtual Machines.
  6. Right click on the Table heading and select Show/Hide Columns.
  7. Select the following boxes: job, deployment, director, index.

Viewing Apps Manager Logs in a Terminal Window

The Apps Manager provides a graphical user interface to help manage organizations, users, applications, and spaces.

When troubleshooting Apps Manager performance, you might want to view the Apps Manager application logs. To view the Apps Manager application logs, follow these steps:

  1. Run cf login -a api.MY-SYSTEM-DOMAIN -u admin from a command line to log in to PCF using the UAA Administrator credentials. In Pivotal Ops Manager, refer to Pivotal Elastic Runtime > Credentials for these credentials.

    $ cf login -a api.example.com -u admin
    API endpoint: api.example.com
    
    Password>******
    Authenticating...
    OK
    
  2. Run cf target -o system -s apps-manager to target the system org and the apps-manager space.

    $ cf target -o system -s apps-manager
    
  3. Run cf logs apps-manager to tail the Apps Manager logs.

    $ cf logs apps-manager
    Connected, tailing logs for app apps-manager in org system / space apps-manager as
    admin...
    

Changing Logging Levels for the Apps Manager

The Apps Manager recognizes the LOG_LEVEL environment variable, which lets you filter the messages reported in the Apps Manager log files by severity level. The Apps Manager defines severity levels using the Ruby standard library Logger class.

By default, the Apps Manager LOG_LEVEL is set to info. The logs show more verbose messaging when you set the LOG_LEVEL to debug.

To change the Apps Manager LOG_LEVEL, run cf set-env apps-manager LOG_LEVEL with the desired severity level.

$ cf set-env apps-manager LOG_LEVEL debug
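As with any Cloud Foundry environment variable change, the new LOG_LEVEL takes effect only after the application restarts:

```
$ cf restart apps-manager
```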

You can set LOG_LEVEL to one of the six severity levels defined by the Ruby Logger class:

  • Level 5: unknown – An unknown message that should always be logged
  • Level 4: fatal – An unhandleable error that results in a program crash
  • Level 3: error – A handleable error condition
  • Level 2: warn – A warning
  • Level 1: info – General information about system operation
  • Level 0: debug – Low-level information for developers

Once set, the Apps Manager log files only include messages at the set severity level and above. For example, if you set LOG_LEVEL to fatal, the log includes fatal and unknown level messages only.

Common Issues

Compare evidence that you have gathered to the descriptions below. If your issue is covered, try the recommended remediation procedures.

BOSH Does Not Reinstall

You might want to reinstall BOSH for troubleshooting purposes. However, if PCF does not detect any changes, BOSH does not reinstall. To force a reinstall of BOSH, select Ops Manager Director > Resource Sizes and change a resource value. For example, you could increase the amount of RAM by 4 MB.

Creating Bound Missing VMs Times Out

This task happens immediately following package compilation, but before job assignment to agents. For example:

cloud_controller/0: Timed out pinging to f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)

This is most likely a NATS issue with the VM in question. To identify a NATS issue, inspect the agent log for the VM. Since the BOSH director is unable to reach the BOSH agent, you must access the VM using another method. You will likely also be unable to access the VM using TCP. In this case, access the VM using your virtualization console.

To diagnose:

  1. Access the VM using your virtualization console and log in.

  2. Navigate to the Credentials tab of the Elastic Runtime tile and locate the VM in question to find the VM credentials.

  3. Become root.

  4. Run cd /var/vcap/bosh/log.

  5. Open the file current.

  6. First, determine whether the BOSH agent and director have successfully completed a handshake, represented in the logs as a “ping-pong”:

    2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[],
    "reply_to"=>"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"}
    
    2013-10-03_14:35:48.60182 #[608] INFO: reply_to:   director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab:
    payload: {:value=>"pong"}
    

    This handshake must complete for the agent to receive instructions from the director.
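One quick way to check for the handshake, assuming the agent log file is named current as in the step above, is to grep for the ping and pong lines:

```shell
# Search the agent log for the ping/pong handshake lines.
grep -E '"method"=>"ping"|:value=>"pong"' current || echo "handshake not found"
```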

  7. If you do not see the handshake, look for another line near the beginning of the file, prefixed INFO: loaded new infrastructure settings. For example:

    2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings:
    {"vm"=>{"name"=>"vm-4d80ede4-b0a5-4992-aea6a0386e18e", "id"=>"vm-360"},
    "agent_id"=>"56aea4ef-6aa9-4c39-8019-7024ccfdde4",
    "networks"=>{"default"=>{"ip"=>"192.0.2.19",
    "netmask"=>"255.255.255.0", "cloud_properties"=>{"name"=>"VMNetwork"},
    "default"=>["dns", "gateway"],
    "dns"=>["192.0.2.2", "192.0.2.17"], "gateway"=>"192.0.2.2",
    "dns_record_name"=>"0.nats.default.cf-d729343071061.microbosh",
    "mac"=>"00:50:56:9b:71:67"}}, "disks"=>{"system"=>0, "ephemeral"=>1,
    "persistent"=>{}}, "ntp"=>[], "blobstore"=>{"provider"=>"dav",
    "options"=>{"endpoint"=>"http://192.0.2.17:25250",
    "user"=>"agent", "password"=>"agent"}},
    "mbus"=>"nats://nats:nats@192.0.2.17:4222",
    "env"=>{"bosh"=>{"password"=>"$6$40ftQ9K4rvvC/8ADZHW0"}}}
     

This is a blob of key/value pairs representing the expected infrastructure for the BOSH agent. For this issue, the following section is the most important:

"mbus"=>"nats://nats:nats@192.0.2.17:4222"

This key/value pair represents where the agent expects the NATS server to be. One diagnostic tactic is to try pinging this NATS IP address from the VM to determine whether you are experiencing routing issues.
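For example, using the mbus address above, you can check basic reachability from the VM, and then whether the NATS port accepts TCP connections (this assumes nc is available on the VM):

```
$ ping -c 3 192.0.2.17     # basic IP reachability
$ nc -zv 192.0.2.17 4222   # can the VM open a TCP connection to NATS?
```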

Install Exits With a Creates/Updates/Deletes App Failure or With a 403 Error

Scenario 1: Your PCF install exits with the following 403 error when you attempt to log in to the Apps Manager:

{"type": "step_finished", "id": "apps-manager.deploy"}

/home/tempest-web/tempest/web/vendor/bundle/ruby/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:306:in
`fetch': 403 => Net::HTTPForbidden for https://login.api.example.net/oauth/authorize?response_type=code&client_id=portal&redirect_uri=https%3...
-- unhandled response (Mechanize::ResponseCodeError)

Scenario 2: Your PCF install exits with a creates/updates/deletes an app (FAILED - 1) error message with the following stack trace:

1) App CRUD creates/updates/deletes an app
   Failure/Error: Unable to find matching line from backtrace
   CFoundry::TargetRefused:
     Connection refused - connect(2)

In either of the above scenarios, ensure that you have correctly entered your domains in wildcard format:

  1. Browse to the Operations Manager fully qualified domain name (FQDN).

  2. Click the Elastic Runtime tile.

  3. Select HAProxy and click Generate Self-Signed RSA Certificate.

  4. Enter your system and app domains in wildcard format, optionally enter any custom domains, and click Save. Refer to Elastic Runtime > Cloud Controller for explanations of these domain values.


Install Fails When Gateway Instances Exceed Zero

If you configure the number of Gateway instances to be greater than zero for a given product, you create a dependency on Elastic Runtime for that product installation. If you attempt to install a product tile with an Elastic Runtime dependency before installing Elastic Runtime, the install fails.

To change the number of Gateway instances, click the product tile, then select Settings > Resource sizes > INSTANCES and change the value next to the product Gateway job.

To remove the Elastic Runtime dependency, change the value of this field to 0.

Out of Disk Space Error

PCF displays an Out of Disk Space error if log files expand to fill all available disk space. If this happens, rebooting the PCF installation VM clears the tmp directory of these log files and resolves the error.
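To confirm that disk exhaustion, rather than something else, is the problem before rebooting, you can check usage from a shell on the VM:

```shell
# Show overall filesystem usage; a Use% near 100% confirms the error.
df -h /
# Show how much space the tmp directory is consuming.
du -sh /tmp
```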

Installing Ops Manager Director Fails

If the DNS information for the PCF VM is incorrectly specified when deploying the PCF .ova file, installing Ops Manager Director fails at the “Installing Micro BOSH” step.

To resolve this issue, correct the DNS settings in the PCF Virtual Machine properties.

Deleting Ops Manager Fails

Ops Manager displays an error message when it cannot delete your installation. This scenario might happen if the Ops Manager Director cannot access the VMs or is experiencing other issues. To manually delete your installation and all VMs, you must do the following:

  1. Use your IaaS dashboard to manually delete the VMs for all installed products, with the exception of the Ops Manager VM.
  2. SSH into your Ops Manager VM and remove the installation.yml file from /var/tempest/workspaces/default/.

    Note: Deleting the installation.yml file does not prevent you from reinstalling Ops Manager. For future deploys, Ops Manager regenerates this file when you click Save on any page in the Ops Manager Director.

Your installation is now deleted.

Installing Elastic Runtime Fails

If the DNS information for the PCF VM becomes incorrect after Ops Manager Director has been installed, installing Elastic Runtime with Pivotal Operations Manager fails at the “Verifying app push” step.

To resolve this issue, correct the DNS settings in the PCF Virtual Machine properties.

Cannot Attach Disk During MicroBOSH Deploy to vCloud

When attempting to attach a disk to a MicroBOSH VM, you might receive the following error: The requested operation cannot be performed because disk XXXXXXXXX was not created properly.

Possible causes and recommendations:

  • If the account used during deployment lacks permission to access the default storage profile, attaching the disk might fail.

  • vCloud Director can incorrectly report a successful disk creation even if the operation fails, resulting in subsequent error messages. To resolve this issue, redeploy MicroBOSH.

Ops Manager Hangs During MicroBOSH Install or HAProxy States “IP Address Already Taken”

During an Ops Manager installation, you might receive the following errors:

  • The Ops Manager GUI shows that the installation stops at the “Setting MicroBOSH deployment manifest” task.
  • When you set the IP address for the HAProxy, the “IP Address Already Taken” message appears.

When you install Ops Manager, you assign it an IP address. Ops Manager then takes the next two consecutive IP addresses, assigns the first to MicroBOSH, and reserves the second. For example:

203.0.113.1 - Ops Manager (User assigned)
203.0.113.2 - MicroBOSH (Ops Manager assigned)
203.0.113.3 - Reserved (Ops Manager reserved)

To resolve this issue, ensure that the two IP addresses that follow the manually assigned address are unassigned.

Poor PCF Performance

If you notice poor network performance by your PCF deployment and your deployment uses a Network Address Translation (NAT) gateway, your NAT gateway may be under-resourced.

Troubleshoot

To troubleshoot the issue, set a custom firewall rule in your IaaS console to route traffic originating from your private network directly to an S3-compatible object store. If you see decreased average latency and improved network performance, follow the procedure below to scale up your NAT gateway.

Scale Up Your NAT Gateway

Perform the following steps to scale up your NAT gateway:

  1. Navigate to your IaaS console.
  2. Spin up a new NAT gateway of a larger VM size than your previous NAT gateway.
  3. Change the routes to direct traffic through the new NAT gateway.
  4. Spin down the old NAT gateway.

The specific procedures will vary depending on your IaaS. Consult your IaaS documentation for more information.

Common Issues Caused by Firewalls

This section describes various issues you might encounter when installing Elastic Runtime in an environment that uses a strong firewall.

DNS Resolution Fails

When you install PCF in an environment that uses a strong firewall, the firewall might block DNS resolution. To resolve this issue, refer to the Troubleshooting DNS Resolution Issues section of the Preparing Your Firewall for Deploying PCF topic.
