General Troubleshooting
- PKS API is Slow or Times Out
- All Cluster Operations Fail
- Cluster Creation Fails
- Cannot Re-Create a Cluster that Failed to Deploy
- Cannot Access Add-On Features or Functions
- Resurrecting VMs Causes Incorrect Permissions in vSphere HA
- Worker Node Hangs Indefinitely
- Cannot Authenticate to an OpenID Connect-Enabled Cluster
- Login Failed Error: Credentials were rejected
- Login Failed Errors Due to Server State
- Error: Failed Jobs
- Error: No Such Host
- Error: FailedMount
- Error: Plan Not Found
PKS API is Slow or Times Out
Symptom
When you run PKS CLI commands, the PKS API times out or is slow to respond.
Explanation
The PKS API VM requires more resources.
Solution
1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
2. Select the Enterprise PKS tile.
3. Select the Resource Config page.
4. For the PKS API job, select a VM Type with greater CPU and memory resources.
5. Click Save.
6. Click the Installation Dashboard link to return to the Installation Dashboard.
7. Click Review Pending Changes and review the changes that you made. For more information, see Reviewing Pending Product Changes.
8. Click Apply Changes.
All Cluster Operations Fail
Symptom
All PKS CLI cluster operations fail, including attempts to create or delete clusters with pks create-cluster and pks delete-cluster.
The output of pks cluster CLUSTER-NAME contains Last Action State: error, and the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms indicates that the Process State of at least one deployed node is failing.
Explanation
If any deployed master or worker nodes run out of disk space in /var/vcap/store, all cluster operations, such as the creation or deletion of clusters, will fail.
Diagnostics
To confirm that there is a disk space issue, check recent BOSH activity for any disk space error messages.
1. Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. For more information about logging in to the BOSH Director, see Using BOSH Diagnostic Commands in Enterprise PKS.
2. In the BOSH command output, locate a task that attempted to perform a cluster operation, such as cluster creation or deletion.
3. To retrieve more information about the task, run the following command:

   bosh -e MY-ENVIRONMENT task TASK-NUMBER

   Where:
   - MY-ENVIRONMENT is the name of your BOSH environment.
   - TASK-NUMBER is the number of the task that attempted the cluster operation.

   For example:

   $ bosh -e pks task 23

4. In the output, look for the following text string:

   no space left on device

5. Check the health of your deployed Kubernetes clusters by following the procedure in Verifying Deployment Health.
6. In the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms, look for any nodes that display failing as their Process State. For example:

   Instance                                      Process State  AZ    IPs         VM CID                                   VM Type  Active
   master/3a3adc92-14ce-4cd4-a12c-6b5eb03e33d6   failing        az-1  10.0.11.10  vm-09027f0e-dac5-498e-474e-b47f2cda614d  small    true

7. Make a note of the plan assigned to the failing node.
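To confirm directly that the node has exhausted its persistent disk, you can also check free space in /var/vcap/store over BOSH SSH. This is a minimal sketch; MY-ENVIRONMENT, SERVICE-INSTANCE, and FAILING-INSTANCE are placeholders taken from the bosh tasks and bosh vms output above.

```bash
# Check free space on the persistent disk of the failing node.
# Replace the placeholders with your environment alias, the service-instance
# deployment name, and the failing instance from the bosh vms output.
bosh -e MY-ENVIRONMENT -d SERVICE-INSTANCE ssh FAILING-INSTANCE \
  -c "df -h /var/vcap/store"
```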
Solution
In the Enterprise PKS tile, locate the plan assigned to the failing node.
In the plan configuration, select a larger VM type for the plan’s master nodes, worker nodes, or both.
For more information about scaling existing clusters by changing the VM types, see Scale Vertically by Changing Cluster Node VM Sizes in the PKS Tile.
Cluster Creation Fails
Symptom
When creating a cluster, you run pks cluster CLUSTER-NAME to monitor the cluster creation status. In the command output, the value for Last Action State is error.
Explanation
There was an error creating the cluster.
Diagnostics
1. Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. For more information about logging in to the BOSH Director, see Using BOSH Diagnostic Commands in Enterprise PKS.
2. In the BOSH command output, locate the task that attempted to create the cluster.
3. To retrieve more information about the task, run the following command:

   bosh -e MY-ENVIRONMENT task TASK-NUMBER

   Where:
   - MY-ENVIRONMENT is the name of your BOSH environment.
   - TASK-NUMBER is the number of the task that attempted to create the cluster.

   For example:

   $ bosh -e pks task 23
BOSH logs are used for error diagnostics. However, if the issue you see in the BOSH logs is related to using or managing Kubernetes, consult the Kubernetes documentation to troubleshoot that issue.
For troubleshooting failed BOSH tasks, see the BOSH documentation.
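If the default task output does not show the underlying failure, the BOSH CLI can replay the task's debug-level log. A minimal sketch, using the same placeholders as above:

```bash
# Re-fetch the failed task with its full debug-level log.
bosh -e MY-ENVIRONMENT task TASK-NUMBER --debug
```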
Cannot Re-Create a Cluster that Failed to Deploy
Symptom
After cluster creation fails, you cannot re-run pks create-cluster to attempt to create the cluster again.
Explanation
Enterprise PKS does not automatically clean up the failed BOSH deployment. Running pks create-cluster using the same cluster name creates a name clash error in BOSH.
Solution
Log in to the BOSH Director and delete the BOSH deployment manually, then retry the pks delete-cluster operation. After cluster deletion succeeds, re-create the cluster.
1. Log in to the BOSH Director and obtain the deployment name for the cluster you want to delete. For instructions, see Using BOSH Diagnostic Commands in Enterprise PKS.
2. Run the following BOSH command:

   bosh -e MY-ENVIRONMENT delete-deployment -d DEPLOYMENT-NAME

   Where:
   - MY-ENVIRONMENT is the name of your BOSH environment.
   - DEPLOYMENT-NAME is the name of your BOSH deployment.

   Note: If necessary, you can append the --force flag to delete the deployment.
3. Run the following PKS command:

   pks delete-cluster CLUSTER-NAME

   Where CLUSTER-NAME is the name of your Enterprise PKS cluster.
4. To re-create the cluster, run the following PKS command:

   pks create-cluster CLUSTER-NAME

   Where CLUSTER-NAME is the name of your Enterprise PKS cluster.
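Taken together, a recovery from a failed create looks like the following sketch. All names are placeholders; PKS cluster deployments are typically named service-instance_UUID, and you can confirm the exact name from the bosh deployments output.

```bash
# List BOSH deployments to find the stale service-instance_UUID deployment
# left behind by the failed create.
bosh -e MY-ENVIRONMENT deployments --column=name

# Delete the stale deployment, then remove the cluster record from PKS.
bosh -e MY-ENVIRONMENT delete-deployment -d DEPLOYMENT-NAME
pks delete-cluster CLUSTER-NAME

# Re-create the cluster, supplying the same options (such as --plan and
# --external-hostname) that you used for the original create.
pks create-cluster CLUSTER-NAME
```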
Cannot Access Add-On Features or Functions
Symptom
You cannot access a feature or function provided by a Kubernetes add-on.
For example, pods cannot resolve DNS names, and error messages report that the CoreDNS service is invalid. If CoreDNS is not deployed, the cluster typically fails to start.
Explanation
Kubernetes features and functions are provided by Enterprise PKS add-ons.
DNS resolution, for example, is provided by the CoreDNS service.
To enable these add-ons, Ops Manager must run scripts after deploying Enterprise PKS. You must configure Ops Manager to automatically run these post-deploy scripts.
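For example, to check whether the CoreDNS add-on was deployed, you can query the kube-system namespace. This is a minimal check that assumes CoreDNS runs as a Deployment named coredns with the conventional k8s-app=kube-dns label; the names may differ in your cluster.

```bash
# Verify that the CoreDNS add-on is deployed and its pods are Running.
kubectl get deployment coredns --namespace kube-system
kubectl get pods --namespace kube-system -l k8s-app=kube-dns
```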
Solution
Perform the following steps to configure Ops Manager to run the post-deploy scripts that deploy the missing add-ons to your cluster:
1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
2. Click the BOSH Director tile.
3. Select Director Config.
4. Select Enable Post Deploy Scripts.
   Note: This setting enables post-deploy scripts for all tiles in your Ops Manager installation.
5. Click Save.
6. Click the Installation Dashboard link to return to the Installation Dashboard.
7. Click Review Pending Changes and review the changes that you made. For more information, see Reviewing Pending Product Changes.
8. Click Apply Changes.
9. After Ops Manager finishes applying changes, enter pks delete-cluster on the command line to delete the cluster. For more information, see Deleting Clusters.
10. On the command line, enter pks create-cluster to re-create the cluster. For more information, see Creating Clusters.
Resurrecting VMs Causes Incorrect Permissions in vSphere HA
Symptoms
Output resulting from the bosh vms command alternates between showing that the VMs are failing and showing that the VMs are running. The operator must run the bosh vms command multiple times to see this cycle.
Explanation
The VMs’ permissions are altered when a VM is restarted, so operators must reset the permissions every time a VM reboots or is redeployed.
VMs cannot be successfully resurrected if the resurrection state of your VM is set to off or if vSphere HA restarts the VM before BOSH is aware that the VM is down.
For more information about VM resurrection, see Resurrection in the BOSH documentation.
Solution
Run the following command on all of your master and worker VMs:
bosh -environment BOSH-DIRECTOR-NAME -deployment DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"
Where:
- BOSH-DIRECTOR-NAME is your BOSH Director name.
- DEPLOYMENT-NAME is the name of your BOSH deployment.
- INSTANCE-GROUP-NAME is the name of the BOSH instance group you are referencing.
The above command, when applied to each VM, gives your VMs the correct permissions.
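For example, to repair a single master node, the invocation might look like the following; the environment alias, deployment name, and instance name are placeholders for illustration only.

```bash
# Re-run the kube-controller-manager pre-start and kube-apiserver post-start
# scripts on one master instance to restore the expected permissions.
bosh -e MY-ENVIRONMENT -d DEPLOYMENT-NAME ssh master/0 \
  -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"
```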
Worker Node Hangs Indefinitely
Symptoms
After making your selection in the Upgrade all clusters errand section, the worker node might hang indefinitely. For more information about monitoring the Upgrade all clusters errand using the BOSH CLI, see Upgrade the PKS Tile in Upgrading Enterprise PKS (Flannel Networking).
Explanation
During the Enterprise PKS tile upgrade process, worker nodes are cordoned and drained. This drain is dependent on Kubernetes being able to unschedule all pods. If Kubernetes is unable to unschedule a pod, then the drain hangs indefinitely.
Kubernetes may be unable to unschedule the node if the PodDisruptionBudget object has been configured to permit zero disruptions and only a single instance of the pod has been scheduled.
In your spec file, the .spec.replicas configuration sets the total number of replicas that are available in your app. PodDisruptionBudget objects specify the number of replicas, proportional to the total, that must remain available in your app, regardless of downtime. Operators can configure PodDisruptionBudget objects for each app in its spec file.
Some apps deployed using Helm charts might have a default PodDisruptionBudget set. For more information about configuring PodDisruptionBudget objects in a spec file, see Specifying a PodDisruptionBudget in the Kubernetes documentation.
If .spec.replicas is configured correctly, you can also configure the default node drain behavior to prevent cluster upgrades from hanging or failing.
Solution
To resolve this issue, do one of the following:

- Configure .spec.replicas to be greater than the number of replicas set in the PodDisruptionBudget object. When the number of replicas configured in .spec.replicas is greater than the number set in the PodDisruptionBudget object, disruptions can occur. For more information, see How Disruption Budgets Work in the Kubernetes documentation. For more information about workload capacity and uptime requirements in Enterprise PKS, see Prepare to Upgrade in Upgrading Enterprise PKS (Flannel Networking).
- Configure the default node drain behavior by doing the following:
  1. Navigate to Ops Manager Installation > Enterprise PKS > Plans.
  2. Set the default node drain behavior by configuring the following fields:
     - Node Drain Timeout: Enter a timeout in minutes for the node to drain pods. You must enter a valid integer between 0 and 1440. If you set this value to 0, the node drain does not terminate.
     - Pod Shutdown Grace: Enter a timeout in seconds for the node to wait before it forces the pod to terminate. You must enter a valid integer between -1 and 86400. If you set this value to -1, the timeout is set to the default timeout specified by the pod.
     - Force node to drain even if it has running pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: If you enable this configuration, the node still drains when pods are not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet.
     - Force node to drain even if it has running DaemonSet-managed pods: If you enable this configuration, the node still drains when pods are managed by a DaemonSet.
     - Force node to drain even if it has running pods using emptyDir: If you enable this configuration, the node still drains when pods are using an emptyDir volume.
     - Force node to drain even if pods are still running after timeout: If you enable this configuration and pods on the worker node fail to drain within the timeout, the node forces running pods to terminate and the upgrade or scale continues.

     Warning: If you select Force node to drain even if pods are still running after timeout, the node kills all running workloads on pods. Before enabling this configuration, set Node Drain Timeout to greater than 0.

     Warning: If you deselect Force node to drain even if it has running DaemonSet-managed pods while Enable Metric Sink Resources, Enable Log Sink Resources, or Enable Node Exporter is selected, the upgrade will fail because all of these options deploy a DaemonSet in the pks-system namespace.
  3. Navigate to Ops Manager Installation Dashboard > Review Pending Changes, select the Upgrade all clusters errand, and click Apply Changes. The new behavior takes effect during the next upgrade, not immediately after applying your changes.

Note: You can also use the PKS CLI to configure node drain behavior. To configure the default node drain behavior with the PKS CLI, run pks update-cluster with an action flag. You can view the current node drain behavior with pks cluster --details. For more information, see Configure Node Drain Behavior in Upgrade Preparation Checklist for Enterprise PKS v1.7.
Cannot Authenticate to an OpenID Connect-Enabled Cluster
Symptom
When you authenticate to an OpenID Connect-enabled cluster using an existing kubeconfig file, you see an authentication or authorization error.
Explanation
The users.user.auth-provider.config.id-token and users.user.auth-provider.config.refresh-token values contained in the kubeconfig file for the cluster may have expired.
Solution
1. Upgrade the PKS CLI to v1.2.0 or later. To download the PKS CLI, navigate to VMware Tanzu Network. For more information, see Installing the PKS CLI.
2. Obtain a kubeconfig file that contains the new tokens by running the following command:

   pks get-credentials CLUSTER-NAME

   Where CLUSTER-NAME is the name of your cluster. For example:

   $ pks get-credentials pks-example-cluster
   Fetching credentials for cluster pks-example-cluster.
   Context set for cluster pks-example-cluster.

   You can now switch between clusters by using:
   $kubectl config use-context <cluster-name>

   Note: If your operator has configured Enterprise PKS to use a SAML identity provider, you must include an additional SSO flag to use the above command. For information about the SSO flags, see the section for the above command in PKS CLI. For information about configuring SAML, see Connecting Enterprise PKS to a SAML Identity Provider.
3. Connect to the cluster using kubectl.
4. If you continue to see an authentication or authorization error, verify that you have sufficient access permissions for the cluster.
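One way to verify both the refreshed credentials and your access permissions is kubectl's built-in authorization check. This is illustrative; substitute the resource and namespace you need to reach.

```bash
# Confirm that the refreshed token authenticates and that you are
# authorized to list pods in the default namespace.
kubectl auth can-i list pods --namespace default
kubectl get pods --namespace default
```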
Login Failed Error: Credentials were rejected
Symptom
The pks login command fails with the error “Credentials were rejected, please try again.”
Explanation
You may experience this issue when a large number of pods are running continuously in your Enterprise PKS deployment. As a result, the persistent disk on the PKS Database VM runs out of space.
Solution
- Check the total number of pods in your Enterprise PKS deployments.
- If there is a large number of pods, such as more than 1,000, check the amount of available persistent disk space on the PKS Database VM.
- If available disk space is low, increase the amount of persistent disk storage on the PKS Database VM depending on the number of pods in your Enterprise PKS deployment. Refer to the table in the following section.
Storage Requirements for Large Numbers of Pods
If you expect the cluster workload to run a large number of pods continuously, then increase the size of persistent disk storage allocated to the PKS Database VM as follows:
Number of Pods | Persistent Disk Requirements (GB) |
---|---|
1,000 pods | 20 |
5,000 pods | 100 |
10,000 pods | 200 |
50,000 pods | 1,000 |
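A rough way to gather both numbers from the command line is sketched below. The pod count must be repeated per cluster and summed, and the pivotal-container-service deployment and pks-db instance group names are assumptions; confirm the exact names with bosh deployments and bosh instances in your environment.

```bash
# Count the pods in the current cluster (repeat for each cluster and sum).
kubectl get pods --all-namespaces --no-headers | wc -l

# Check free persistent disk space on the PKS Database VM.
# The deployment and instance group names below are assumptions; confirm them
# with `bosh -e MY-ENVIRONMENT deployments` and the matching `instances` output.
bosh -e MY-ENVIRONMENT -d pivotal-container-service-GUID ssh pks-db/0 \
  -c "df -h /var/vcap/store"
```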
Login Failed Errors Due to Server State
Symptom
You encounter an error similar to one of the following when running a kubectl or cluster command:
- “Error: You must be logged in to the server (Unauthorized)”
- “Error: You are not currently authenticated. Please log in to continue”
Explanation
You may experience this issue when your authentication server or a host has the incorrect time.
Workaround
To refresh your credentials, run the following:
pks get-credentials
Solution
- To resolve the problem permanently, correct the time on the server with the incorrect time.
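A quick way to spot clock skew is to compare the UTC time each host reports against a trusted reference. The sketch below assumes a systemd-based VM; how you correct the time depends on whether the host uses NTP, chrony, or the hypervisor clock.

```bash
# Print the host's current UTC time so it can be compared with a trusted
# source; a noticeable offset indicates clock skew.
date -u

# On systemd-based VMs, check whether NTP synchronization is active.
timedatectl status
```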
Error: Failed Jobs
Symptom
In stdout or log files, you see an error message referencing post-start scripts failed or Failed Jobs.
Explanation
After deploying Enterprise PKS, Ops Manager runs scripts to start a number of jobs. You must configure Ops Manager to automatically run these post-deploy scripts.
Solution
Perform the following steps to configure Ops Manager to run post-deploy scripts:
1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
2. Click the BOSH Director tile.
3. Select Director Config.
4. Select Enable Post Deploy Scripts.
   Note: This setting enables post-deploy scripts for all tiles in your Ops Manager installation.
5. Click Save.
6. Click the Installation Dashboard link to return to the Installation Dashboard.
7. Click Review Pending Changes and review the changes that you made. For more information, see Reviewing Pending Product Changes.
8. Click Apply Changes.
9. (Optional) If this is a new deployment of Enterprise PKS, follow the steps below:
   - On the command line, enter pks delete-cluster to delete the cluster. For more information, see Deleting Clusters.
   - Enter pks create-cluster to re-create the cluster. For more information, see Creating Clusters.
Error: No Such Host
Symptom
In stdout or log files, you see an error message that includes lookup vm-WORKER-NODE-GUID on IP-ADDRESS: no such host.
Explanation
This error occurs on GCP when the Ops Manager Director tile uses 8.8.8.8 as the DNS server. When this DNS server is in use, the master node cannot locate the route to the worker nodes.
Solution
Use the Google internal DNS range, 169.254.169.254, as the DNS server.
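To confirm that worker-node hostnames resolve after the change, you can query the internal resolver directly. This is illustrative; vm-WORKER-NODE-GUID is the placeholder hostname from the error message.

```bash
# Query the GCP internal DNS server for a worker node's hostname.
nslookup vm-WORKER-NODE-GUID 169.254.169.254
```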
Error: FailedMount
Symptom
In Kubernetes log files, you see a Warning event from kubelet with FailedMount as the reason.
Explanation
A persistent volume fails to connect to the Kubernetes cluster worker VM.
Diagnostics
- In your cloud provider console, verify that volumes are being created and attached to nodes.
- From the Kubernetes cluster master node, check the controller manager logs for errors attaching persistent volumes.
- From the Kubernetes cluster worker node, check kubelet for errors attaching persistent volumes.
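The same information is visible from kubectl. For example, as a rough check (the pod name and namespace are placeholders):

```bash
# Show recent FailedMount warnings recorded as Kubernetes events.
kubectl get events --all-namespaces --field-selector reason=FailedMount

# Inspect the affected pod's events and volume configuration.
kubectl describe pod POD-NAME --namespace NAMESPACE
```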
Error: Plan Not Found
Symptom
You see a “plan not found” error after an active plan is deactivated.
Explanation
You may receive the error “plan UUID not found” if, after creating a cluster using a plan (such as Plan 1), you then deactivate the plan (Plan 1) from the PKS Tile in Ops Manager and then Save and Apply Changes with the Upgrade all clusters errand selected.
Ops Manager does not have the capability to check which clusters are using a particular plan. Only when you save the plan does the deployment process check whether the plan can be deactivated. The error message “plan UUID not found” is displayed in the Ops Manager logs.
Solution
- Do not disable or deactivate a plan that is in use by one or more clusters.
- Run the command pks cluster my-cluster --details to view which plan the cluster is using.
Please send any feedback you have to pks-feedback@pivotal.io.