Troubleshooting

PKS API is Slow or Times Out

Symptom

When you run PKS CLI commands, the PKS API times out or is slow to respond.

Explanation

The PKS API control plane VM requires more resources.
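
Before scaling up, you can confirm that the control plane VM is short on resources by checking its BOSH vitals. A minimal sketch, assuming your PKS control plane deployment is named pivotal-container-service-GUID (substitute your own environment alias and deployment name):

bosh -e MY-ENVIRONMENT -d pivotal-container-service-GUID vms --vitals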

Solution

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Select the Pivotal Container Service tile.

  3. Select the Resource Config page.

  4. For the Pivotal Container Service job, select a VM Type with greater CPU and memory resources.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

  8. Click Apply Changes.

All Cluster Operations Fail

Symptom

All PKS CLI cluster operations fail, including attempts to create or delete clusters with pks create-cluster and pks delete-cluster.

The output of pks cluster CLUSTER-NAME contains Last Action State: error, and the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms indicates that the Process State of at least one deployed node is failing.

Explanation

If any deployed master or worker node runs out of disk space in /var/vcap/store, all cluster operations, such as creating or deleting clusters, fail.
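
You can check the free space directly on a node with a BOSH SSH one-liner. A minimal sketch, assuming the master/0 instance of a cluster deployment (adjust the environment alias, deployment name, and instance as needed):

bosh -e MY-ENVIRONMENT -d SERVICE-INSTANCE ssh master/0 -c "df -h /var/vcap/store"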

Diagnostics

To confirm that there is a disk space issue, check recent BOSH activity for any disk space error messages.

  1. Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. See Managing PKS Deployments with BOSH for more information about logging in to the BOSH Director.

  2. In the BOSH command output, locate a task that attempted to perform a cluster operation, such as cluster creation or deletion.

  3. To retrieve more information about the task, run the following command:

    bosh -e MY-ENVIRONMENT task TASK-NUMBER
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • TASK-NUMBER is the number of the task that attempted to create the cluster.
      For example:
      $ bosh -e pks task 23
  4. In the output, look for the following text string:

    no space left on device
    
  5. Check the health of your deployed Kubernetes clusters by following the procedure in Verifying Deployment Health.

  6. In the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms, look for any nodes that display failing as their Process State. For example:

    Instance                                     Process State  AZ       IPs         VM CID                                   VM Type  Active
    master/3a3adc92-14ce-4cd4-a12c-6b5eb03e33d6  failing        az-1     10.0.11.10  vm-09027f0e-dac5-498e-474e-b47f2cda614d  small    true
    
  7. Make a note of the plan assigned to the failing node.

Solution

  1. In the PKS tile, locate the plan assigned to the failing node.

  2. In the plan configuration, select a larger VM type for the plan’s master or worker nodes or both.

    For more information about scaling existing clusters by changing the VM types, see Scale Vertically by Changing Cluster Node VM Sizes in the PKS Tile.

Cluster Creation Fails

Symptom

When creating a cluster, you run pks cluster CLUSTER-NAME to monitor the cluster creation status. In the command output, the value for Last Action State is error.

Explanation

There was an error creating the cluster.

Diagnostics

  1. Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. See Managing PKS Deployments with BOSH for more information about logging in to the BOSH Director.

  2. In the BOSH command output, locate the task that attempted to create the cluster.

  3. To retrieve more information about the task, run the following command:

    bosh -e MY-ENVIRONMENT task TASK-NUMBER
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • TASK-NUMBER is the number of the task that attempted to create the cluster.
      For example:
      $ bosh -e pks task 23

Use the BOSH logs for error diagnostics. If the issue that appears in the BOSH logs is related to using or managing Kubernetes, consult the Kubernetes documentation to troubleshoot it.
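
If the standard task output does not show the failure, you can retrieve the task's full debug log. A minimal sketch using the BOSH CLI's --debug flag:

bosh -e MY-ENVIRONMENT task TASK-NUMBER --debug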

For troubleshooting failed BOSH tasks, see the BOSH documentation.

Cluster Deletion Fails

Symptom

When attempting to delete a cluster using pks delete-cluster CLUSTER-NAME, the cluster is not deleted.

Explanation

There was an error deleting the cluster.

Solution

Log in to the BOSH Director and delete the BOSH deployment manually, then retry the pks delete-cluster operation.

  1. Log in to the BOSH Director and obtain the deployment name for the cluster you want to delete; a sketch for listing deployment names follows this procedure. For instructions, see Managing PKS Deployments with BOSH.

  2. Run the following BOSH command:

    bosh -e MY-ENVIRONMENT delete-deployment -d DEPLOYMENT-NAME
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • DEPLOYMENT-NAME is the name of your BOSH deployment.

      Note: If necessary, you can append the --force flag to delete the deployment.

  3. Run the following PKS command:

    pks delete-cluster CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your PKS cluster.
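
To list the deployment names referenced in step 1, you can ask the BOSH Director for its deployments. A minimal sketch; in PKS, cluster deployments typically appear with names of the form service-instance_GUID:

bosh -e MY-ENVIRONMENT deployments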

Cannot Re-Create a Cluster that Failed to Deploy

Symptom

After cluster creation fails, you cannot re-run pks create-cluster to attempt creating the cluster again.

Explanation

PKS does not automatically clean up the failed BOSH deployment. Running pks create-cluster using the same cluster name creates a name clash error in BOSH.

Solution

Log in to the BOSH Director and delete the BOSH deployment manually, then retry the pks delete-cluster operation. After cluster deletion succeeds, re-create the cluster.

  1. Log in to the BOSH Director and obtain the deployment name for the cluster you want to delete. For instructions, see Managing PKS Deployments with BOSH.

  2. Run the following BOSH command:

    bosh -e MY-ENVIRONMENT delete-deployment -d DEPLOYMENT-NAME
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • DEPLOYMENT-NAME is the name of your BOSH deployment.

      Note: If necessary, you can append the --force flag to delete the deployment.

  3. Run the following PKS command:

    pks delete-cluster CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your PKS cluster.

  4. To re-create the cluster, run the following PKS command:

    pks create-cluster CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your PKS cluster.

Cannot Access Add-On Features or Functions

Symptom

You cannot access a feature or function provided by a Kubernetes add-on.

Examples include the following:

  • You cannot access the Kubernetes Web UI (Dashboard) in a browser or using the kubectl command-line tool.
  • Pods cannot resolve DNS names, and error messages report the service kube-dns is invalid. If kube-dns is not deployed, the cluster typically fails to start.
  • Heapster does not start.

Explanation

The Kubernetes features and functions listed above are provided by the following PKS add-ons:

  • Kubernetes Dashboard: kubernetes-dashboard
  • DNS Resolution: kube-dns
  • Heapster: heapster

    Note: Heapster is deprecated in PKS v1.3, and Kubernetes has retired Heapster. For more information, see the kubernetes-retired/heapster repository in GitHub.

To enable these add-ons, Ops Manager must run scripts after deploying PKS. You must configure Ops Manager to automatically run these post-deploy scripts.
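
To check whether these add-ons are currently running in a cluster, you can list the system pods with kubectl. A minimal sketch, assuming kubectl is already configured to target the cluster:

kubectl get pods --namespace kube-system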

Solution

Perform the following steps to configure Ops Manager to run post-deploy scripts to deploy the missing add-ons to your cluster.

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Click the Ops Manager tile.

  3. Select Director Config.

  4. Select Enable Post Deploy Scripts.

    Note: This setting enables post-deploy scripts for all tiles in your Ops Manager installation.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

  8. Click Apply Changes.

  9. After Ops Manager finishes applying changes, enter pks delete-cluster on the command line to delete the cluster. For more information, see Deleting Clusters.

  10. On the command line, enter pks create-cluster to recreate the cluster. For more information, see Creating Clusters.

Resurrecting VMs Causes Incorrect Permissions in vSphere HA

Symptoms

The output of the bosh vms command alternates between showing that the VMs are failing and showing that the VMs are running. The operator must run the bosh vms command multiple times to observe this cycle.

Explanation

The VMs’ permissions are altered when a VM restarts, so operators must reset the permissions every time a VM reboots or is redeployed.

VMs cannot be successfully resurrected if the resurrection state of your VM is set to off or if vSphere HA restarts the VM before BOSH is aware that the VM is down. For more information about VM resurrection, see Resurrection in the Cloud Foundry BOSH documentation.

Solution

Run the following command on all of your master and worker VMs:

bosh --environment BOSH-DIRECTOR-NAME --deployment DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"

Where:

  • BOSH-DIRECTOR-NAME is your BOSH Director name.
  • DEPLOYMENT-NAME is the name of your BOSH deployment.
  • INSTANCE-GROUP-NAME is the name of the BOSH instance group you are referencing.

The above command, when applied to each VM, gives your VMs the correct permissions.
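
For example, with an environment alias of pks, an illustrative deployment name, and the master instance group, the invocation might look like the following (all names here are placeholders for your own values):

bosh --environment pks --deployment service-instance_EXAMPLE-GUID ssh master -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"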

Worker Node Hangs Indefinitely

Symptoms

After you make your selection in the Upgrade all clusters errand section, a worker node might hang indefinitely. For more information on monitoring the Upgrade all clusters errand using the BOSH CLI, see Upgrade the PKS Tile in Upgrading PKS.

Explanation

During the PKS tile upgrade process, worker nodes are cordoned and drained. The drain depends on Kubernetes being able to unschedule all pods. If Kubernetes cannot unschedule a pod, the drain hangs indefinitely. One reason Kubernetes might be unable to unschedule a pod is if a PodDisruptionBudget object has been configured to allow zero disruptions and only a single instance of the pod is scheduled.

In your spec file, the .spec.replicas configuration sets the total number of replicas available in your application. PodDisruptionBudget objects specify the number of replicas, proportional to that total, that must remain available in your application, regardless of downtime. Operators can configure PodDisruptionBudget objects for each application using its spec file.

Some apps deployed using Helm charts may have a default PodDisruptionBudget set. For more information on configuring PodDisruptionBudget objects using a spec file, see Specifying a PodDisruptionBudget in the Kubernetes documentation.
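
To see whether any PodDisruptionBudget in a cluster currently allows zero disruptions, you can list the PodDisruptionBudget objects. A minimal sketch, assuming kubectl access to the cluster; look for entries whose ALLOWED DISRUPTIONS column shows 0:

kubectl get poddisruptionbudgets --all-namespaces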

Solution

Configure .spec.replicas to be greater than the number of replicas that the PodDisruptionBudget object requires to remain available.

When the number of replicas configured in .spec.replicas is greater than the number of replicas set in the PodDisruptionBudget object, disruptions can occur.
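
One way to raise the replica count for a running Deployment is to scale it with kubectl; you can also set .spec.replicas in the spec file and re-apply it. A minimal sketch, where MY-DEPLOYMENT is a placeholder for your application's Deployment name:

kubectl scale deployment MY-DEPLOYMENT --replicas=2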

For more information, see How Disruption Budgets Work in the Kubernetes documentation. For more information about workload capacity and uptime requirements in PKS, see Prepare to Upgrade in Upgrading PKS.

Cannot Authenticate to an OpenID Connect-Enabled Cluster

Symptom

When you authenticate to an OpenID Connect-enabled cluster using an existing kubeconfig file, you see an authentication or authorization error.

Explanation

The users.user.auth-provider.config.id-token and users.user.auth-provider.config.refresh-token values in the kubeconfig file for the cluster may have expired.

Solution

  1. Upgrade the PKS CLI to v1.2.0 or later. To download the PKS CLI, navigate to Pivotal Network. For more information, see Installing the PKS CLI.

  2. Obtain a kubeconfig file that contains the new tokens by running the following command:

    pks get-credentials CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your cluster.

  3. Connect to the cluster using kubectl.
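
    For example, a quick way to confirm that the refreshed credentials work is to list the cluster's nodes (a minimal sketch, assuming kubectl now uses the new kubeconfig):

    kubectl get nodes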

If you continue to see an authentication or authorization error, verify that you have sufficient access permissions for the cluster.

Error: Failed Jobs

Symptom

In stdout or log files, you see an error message referencing post-start scripts failed or Failed Jobs.

Explanation

After deploying PKS, Ops Manager runs scripts to start a number of jobs. You must configure Ops Manager to automatically run these post-deploy scripts.
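
To see which jobs are failing on a deployment's VMs, you can list process-level state with BOSH. A minimal sketch (substitute your environment alias and deployment name):

bosh -e MY-ENVIRONMENT -d DEPLOYMENT-NAME instances --ps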

Solution

Perform the following steps to configure Ops Manager to run post-deploy scripts.

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Click the BOSH Director tile.

  3. Select Director Config.

  4. Select Enable Post Deploy Scripts.

    Note: This setting enables post-deploy scripts for all tiles in your Ops Manager installation.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

  8. Click Apply Changes.

  9. (Optional) If it is a new deployment of PKS, follow the steps below:

    1. On the command line, enter pks delete-cluster to delete the cluster. For more information, see Deleting Clusters.
    2. Enter pks create-cluster to recreate the cluster. For more information, see Creating Clusters.

Error: No Such Host

Symptom

In stdout or log files, you see an error message that includes lookup vm-WORKER-NODE-GUID on IP-ADDRESS: no such host.

Explanation

This error occurs on GCP when the Ops Manager Director tile is configured to use 8.8.8.8 as the DNS server. When this DNS server is in use, the master node cannot locate the route to the worker nodes.

Solution

Use the Google internal DNS address, 169.254.169.254, as the DNS server.

Error: FailedMount

Symptom

In Kubernetes log files, you see a Warning event from kubelet with FailedMount as the reason.

Explanation

A persistent volume fails to connect to the Kubernetes cluster worker VM.

Diagnostics

  • In your cloud provider console, verify that volumes are being created and attached to nodes.
  • From the Kubernetes cluster master node, check the controller manager logs for errors attaching persistent volumes.
  • From the Kubernetes cluster worker node, check the kubelet logs for errors attaching persistent volumes.
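
You can also surface FailedMount events across the cluster directly with kubectl. A minimal sketch, assuming kubectl access to the affected cluster:

kubectl get events --all-namespaces --field-selector reason=FailedMount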

Error: Login Failed

Symptom

The pks login command fails with the error “Credentials were rejected, please try again.”

Explanation

You may experience this issue when a large number of pods are running continuously in your PKS deployment.

As a result, binary logs that track pod information accumulate over time and fill up the persistent disk of the Pivotal Container Service VM.

Note: Binary logs on the PKS control plane are configured to be purged after a certain number of days.

Solution

  1. Check the total number of pods across your PKS deployments (a counting sketch follows this list).
  2. If there is a large number of pods, for example more than 1,000, check the amount of available persistent disk space on the Pivotal Container Service VM.
  3. If available disk space is low, increase the amount of persistent disk storage on the Pivotal Container Service VM depending on the number of pods in your PKS deployment. Refer to the table in the following section.
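
A rough way to count pods in a single cluster is to list them across all namespaces and count the lines; run this against each cluster and add up the totals. A minimal sketch, assuming kubectl access:

kubectl get pods --all-namespaces --no-headers | wc -l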

Storage Requirements for Large Numbers of Pods

If you expect the cluster workload to run a large number of pods continuously, then increase the size of the persistent disk storage allocated to the Pivotal Container Service VM as follows:

Number of Pods    Storage (Persistent Disk) Requirement
1,000 pods        20 GB
5,000 pods        100 GB
10,000 pods       200 GB
50,000 pods       1,000 GB

Please send any feedback you have to pks-feedback@pivotal.io.