Pivotal Container Service v1.2

Troubleshooting

PKS API is Slow or Times Out

Symptom

When you run PKS CLI commands, the PKS API times out or is slow to respond.
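A quick way to gauge the delay is to time any read-only PKS CLI call. For example (pks clusters is a standard PKS CLI command; any read-only command works):

    $ time pks clusters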

Explanation

The PKS API control plane VM requires more resources.

Solution

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Select the Pivotal Container Service tile.

  3. Select the Resource Config page.

  4. For the Pivotal Container Service job, select a VM Type with greater CPU and memory resources.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

    Note: In Ops Manager v2.2, the Review Pending Changes page is a Beta feature. If you deploy PKS to Ops Manager v2.2, you can skip this step.

  8. Click Apply Changes.

Cluster Creation Fails

Symptom

When creating a cluster, you run pks cluster CLUSTER-NAME to monitor the cluster creation status. In the command output, the value for Last Action State is error.
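For example (output abridged; the cluster name and field values are illustrative):

    $ pks cluster my-cluster
    Name:                     my-cluster
    Plan Name:                small
    Last Action:              CREATE
    Last Action State:        error
    Last Action Description:  Instance provisioning failed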

Explanation

There was an error creating the cluster.

Diagnostics

  1. Log in to the BOSH Director and run bosh tasks. The output lists the tasks that the BOSH Director has run. For more information about logging in to the BOSH Director, see Managing PKS Deployments with BOSH.

  2. In the BOSH command output, locate the task that attempted to create the cluster.

  3. To retrieve more information about the task, run the following command:

    bosh -e MY-ENVIRONMENT task TASK-NUMBER
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • TASK-NUMBER is the number of the task that attempted to create the cluster.
      For example:
      $ bosh -e pks task 23
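The task command also accepts flags that pull specific logs. For example, --debug retrieves the full debug log for the task, which typically contains the underlying error:

    $ bosh -e pks task 23 --debug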

See the BOSH documentation for more information about troubleshooting failed BOSH tasks.

Cluster Creation Fails with NSX-T

Symptom

Cluster creation fails on a PKS deployment that uses NSX-T, and you cannot re-run pks create-cluster.

Explanation

You cannot re-run pks create-cluster because NSX-T objects left over from the failed deployment have not been cleaned up.

Solution

If you are using PKS v1.1.5 or later, perform the following steps to clean up the NSX-T objects:

  1. Run the following command:

    bosh -e MY-ENVIRONMENT delete-deployment -d DEPLOYMENT-NAME
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • DEPLOYMENT-NAME is the name of your BOSH deployment.

      Note: If necessary, you can append the --force flag to force deletion of the deployment.

  2. Run the following command:

    pks delete-cluster CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your PKS cluster.
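For example, with a BOSH environment alias of pks, a cluster named my-cluster, and a BOSH deployment named service-instance_aa11bb22 (PKS cluster deployments follow the service-instance_GUID naming pattern; this GUID is hypothetical):

    $ bosh -e pks delete-deployment -d service-instance_aa11bb22
    $ pks delete-cluster my-cluster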

If you are using PKS v1.1.4 or earlier, contact VMware Customer Support to obtain a cleanup script.

Cannot Access Add-On Features or Functions

Symptom

You cannot access a feature or function provided by a Kubernetes add-on.

Examples include the following:

  • You cannot access the Kubernetes Web UI (Dashboard) in a browser or using the kubectl command-line tool.
  • Heapster does not start.
  • Pods cannot resolve DNS names, and error messages report the service kube-dns is invalid. If kube-dns is not deployed, the cluster typically fails to start.

Explanation

The Kubernetes features and functions listed above are provided by the following PKS add-ons:

  • Kubernetes Dashboard: kubernetes-dashboard
  • Heapster: heapster
  • DNS Resolution: kube-dns

To enable these add-ons, Ops Manager must run scripts after deploying PKS. You must configure Ops Manager to automatically run these post-deploy scripts.
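To see which of these add-ons are deployed, you can list the pods in the kube-system namespace; the add-ons above run there, with generated suffixes appended to the pod names:

    $ kubectl get pods --namespace kube-system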

Solution

Perform the following steps to configure Ops Manager to run post-deploy scripts to deploy the missing add-ons to your cluster.

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Click the BOSH Director tile.

  3. Select Director Config.

  4. Select Enable Post Deploy Scripts.

    Note: This setting enables post-deploy scripts for all tiles in your Ops Manager installation.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

    Note: In Ops Manager v2.2, the Review Pending Changes page is a Beta feature. If you deploy PKS to Ops Manager v2.2, you can skip this step.

  8. Click Apply Changes.

  9. After Ops Manager finishes applying changes, enter pks delete-cluster on the command line to delete the cluster. For more information, see Deleting Clusters.

  10. On the command line, enter pks create-cluster to recreate the cluster. For more information, see Creating Clusters.
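For example, the two commands above look like the following in practice (the cluster name, external hostname, and plan are hypothetical; reuse the settings from your original cluster):

    $ pks delete-cluster my-cluster
    $ pks create-cluster my-cluster --external-hostname my-cluster.example.com --plan small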

Resurrecting VMs Causes Incorrect Permissions in vSphere HA

Symptoms

Output from the bosh vms command alternates between showing that the VMs are failing and showing that they are running. You must run bosh vms multiple times to observe this cycle.

Explanation

A VM’s permissions are altered when vSphere HA restarts the VM, so you must reset the permissions every time a VM reboots or is redeployed.

VMs cannot be successfully resurrected if the resurrection state of your VM is set to off, or if vSphere HA restarts the VM before BOSH is aware that the VM is down. For more information about VM resurrection, see Resurrection in the Cloud Foundry BOSH documentation.

Solution

Run the following command on all of your master and worker VMs:

bosh -e BOSH-DIRECTOR-NAME -d DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"

Where:

  • BOSH-DIRECTOR-NAME is your BOSH Director name.
  • DEPLOYMENT-NAME is the name of your BOSH deployment.
  • INSTANCE-GROUP-NAME is the name of the BOSH instance group you are referencing.

Running this command on each master and worker VM restores the correct permissions.
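For example, to run the command against the master instance group (the environment alias and deployment name here are hypothetical):

    $ bosh -e pks -d service-instance_aa11bb22 ssh master -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"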

Worker Node Hangs Indefinitely

Symptoms

After you make your selection in the Upgrade all clusters errand section and apply changes, a worker node might hang indefinitely. For more information on monitoring the Upgrade all clusters errand using the BOSH CLI, see Upgrade the PKS Tile in Upgrading PKS.

Explanation

During the PKS tile upgrade process, worker nodes are cordoned and drained. The drain depends on Kubernetes being able to unschedule all pods. If Kubernetes is unable to unschedule a pod, the drain hangs indefinitely. One reason Kubernetes may be unable to unschedule a pod is that its PodDisruptionBudget object has been configured to allow zero disruptions while only a single instance of the pod is scheduled.

In your spec file, the .spec.replicas configuration sets the total number of replicas that run in your application. A PodDisruptionBudget object specifies the number of replicas, proportional to that total, that must remain available regardless of voluntary disruptions. Operators can configure a PodDisruptionBudget object for each application in its spec file.

Some apps deployed using Helm-Charts may have a default PodDisruptionBudget set. For more information on configuring PodDisruptionBudget objects using a spec file, see Specifying a PodDisruptionBudget in the Kubernetes documentation.

Solution

Configure .spec.replicas to be greater than the minimum number of replicas required by the PodDisruptionBudget object.

When the number of replicas configured in .spec.replicas is greater than the number of replicas required by the PodDisruptionBudget object, disruptions can occur and the drain can complete.
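As a minimal sketch, the following Deployment runs three replicas while its PodDisruptionBudget requires at least two to remain available, leaving room for one voluntary disruption during a drain. All names, labels, and the container image are hypothetical; policy/v1beta1 was the PodDisruptionBudget API version in the Kubernetes releases of this era:

    kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3                 # .spec.replicas exceeds the PDB minimum below
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: nginx:1.15
    ---
    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb
    spec:
      minAvailable: 2             # 3 replicas - 2 required = 1 allowed disruption
      selector:
        matchLabels:
          app: my-app
    EOF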

For more information, see How Disruption Budgets Work in the Kubernetes documentation. For more information about workload capacity and uptime requirements in PKS, see Prepare to Upgrade in Upgrading PKS.

Cannot Authenticate to an OpenID Connect-Enabled Cluster

Symptom

When you authenticate to an OpenID Connect-enabled cluster using an existing kubeconfig file, you see an authentication or authorization error.

Explanation

The users.user.auth-provider.config.id-token and users.user.auth-provider.config.refresh-token values in the kubeconfig file for the cluster may have expired.

Solution

  1. Upgrade the PKS CLI to v1.2.0 or later. To download the PKS CLI, navigate to Pivotal Network. For more information, see Installing the PKS CLI.

  2. Obtain a kubeconfig file that contains the new tokens by running the following command:

    pks get-credentials CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your cluster.

  3. Connect to the cluster using kubectl.
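For example (the cluster name is hypothetical; kubectl get nodes is just a convenient read-only request to confirm the new tokens work):

    $ pks get-credentials my-cluster
    $ kubectl get nodes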

If you continue to see an authentication or authorization error, verify that you have sufficient access permissions for the cluster.

Error: Failed Jobs

Symptom

In stdout or log files, you see an error message referencing post-start scripts failed or Failed Jobs.

Explanation

After deploying PKS, Ops Manager runs scripts to start a number of jobs. You must configure Ops Manager to automatically run these post-deploy scripts.

Solution

Perform the following steps to configure Ops Manager to run post-deploy scripts.

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Click the BOSH Director tile.

  3. Select Director Config.

  4. Select Enable Post Deploy Scripts.

    Note: This setting enables post-deploy scripts for all tiles in your Ops Manager installation.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

    Note: In Ops Manager v2.2, the Review Pending Changes page is a Beta feature. If you deploy PKS to Ops Manager v2.2, you can skip this step.

  8. Click Apply Changes.

  9. (Optional) If this is a new deployment of PKS, follow the steps below:

    1. On the command line, enter pks delete-cluster to delete the cluster. For more information, see Deleting Clusters.
    2. Enter pks create-cluster to recreate the cluster. For more information, see Creating Clusters.

Error: No Such Host

Symptom

In stdout or log files, you see an error message that includes lookup vm-WORKER-NODE-GUID on IP-ADDRESS: no such host.

Explanation

This error occurs on GCP when the Ops Manager Director tile uses 8.8.8.8 as the DNS server. When this DNS server is in use, the master node cannot locate the route to the worker nodes.

Solution

Use the Google internal DNS range, 169.254.169.254, as the DNS server.
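As a quick check, you can query the internal resolver directly from a VM in the same GCP network, substituting a worker-node hostname from the error message (GCP internal DNS resolves instance names only from within the network):

    $ nslookup vm-WORKER-NODE-GUID 169.254.169.254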

Error: FailedMount

Symptom

In Kubernetes log files, you see a Warning event from kubelet with FailedMount as the reason.

Explanation

A persistent volume fails to attach to the Kubernetes cluster worker VM.

Diagnostics

  • In your cloud provider console, verify that volumes are being created and attached to nodes.
  • From the Kubernetes cluster master node, check the controller manager logs for errors attaching persistent volumes.
  • From the Kubernetes cluster worker node, check kubelet for errors attaching persistent volumes.
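For the Kubernetes-side checks above, the following commands surface FailedMount events for an affected pod (the pod name and namespace are placeholders):

    $ kubectl describe pod POD-NAME --namespace MY-NAMESPACE
    $ kubectl get events --namespace MY-NAMESPACE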

Error: Duplicate Variable Name

Symptom

In PKS Broker log files, you see an error message that includes Duplicate variable name '/dns_api_tls_ca'.

Explanation

This error may occur if you use Ops Manager v2.1.7 or later with PKS v1.1.0.

Solution

PKS v1.1.0 does not support Ops Manager v2.1.7 or later. You must use Ops Manager v2.1.0 to v2.1.6 with PKS v1.1.0.

