Troubleshooting Healthwatch

This topic describes how to troubleshoot problems and known issues that may arise when deploying or operating Healthwatch, Healthwatch Exporter for VMware Tanzu Application Service for VMs (TAS for VMs), and Healthwatch Exporter for Tanzu Kubernetes Grid Integrated Edition (TKGI).

Accessing VM UIs for Troubleshooting

The sections below describe how to access the user interfaces (UIs) of the Prometheus and Alertmanager VMs for troubleshooting.

Access the Prometheus UI

The Prometheus UI allows you to view various processes on the VMs in the Prometheus instance that the Healthwatch tile deploys, including alerts that are currently firing and the health status of scrape targets. Because the Prometheus UI is not secure, the Healthwatch tile does not include it. However, you can access the Prometheus UI to troubleshoot the Prometheus instance.

To access the Prometheus UI:

  1. Run:

    bosh deployments
    

    This command returns a list of all BOSH deployments that are currently running.

  2. Record the name of your Healthwatch deployment.

  3. Run:

    bosh -d DEPLOYMENT-NAME ssh tsdb/0 --opts='-L 9090:localhost:9090'
    

    Where DEPLOYMENT-NAME is the name of your Healthwatch deployment that you recorded in the previous step.

  4. Navigate to the Ops Manager Installation Dashboard.

  5. Click the Healthwatch tile.

  6. Select the Credentials tab.

  7. In the Tsdb Client Mtls row, click Link to Credential.

  8. Record the certificate and private key for Tsdb Client Mtls.

  9. Add the certificate and private key for Tsdb Client Mtls that you recorded in the previous step to the keystore for your operating system.

  10. In a web browser, navigate to https://localhost:9090. If your browser prompts you to specify which certificate to use for mTLS, select the certificate you added to the keystore for your operating system in the previous step. The Prometheus UI appears.

Access the Alertmanager UI

The Alertmanager UI allows you to view which alerts are currently firing. Because the Alertmanager UI is not secure, the Healthwatch tile does not include it. However, you can access the Alertmanager UI to troubleshoot or silence alerts.

To access the Alertmanager UI:

  1. Run:

    bosh deployments
    

    This command returns a list of all BOSH deployments that are currently running.

  2. Record the name of your Healthwatch deployment.

  3. Run:

    bosh -d DEPLOYMENT-NAME ssh tsdb/0 --opts='-L 8080:localhost:10401'
    

    Where DEPLOYMENT-NAME is the name of your Healthwatch deployment that you recorded in the previous step.

  4. In a web browser, navigate to localhost:8080. The Alertmanager UI appears.
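
Both procedures above find the deployment name and then open an SSH tunnel to tsdb/0. The sketch below shows one way to script those two steps. The helper names are hypothetical, and the name pattern p-healthwatch2- followed by a hexadecimal suffix is an assumption about how the Healthwatch deployment is named on your foundation. Check it against the output of bosh deployments before relying on it.

```shell
# Read deployment names (one per line) on stdin and print the first one that
# looks like the Healthwatch tile deployment.
# ASSUMPTION: the name is "p-healthwatch2-" followed only by hex characters,
# which excludes the exporter deployments (for example p-healthwatch2-pas-exporter-...).
pick_healthwatch_deployment() {
  awk '$1 ~ /^p-healthwatch2-[0-9a-f]+$/ { print $1; exit }'
}

# open_tunnel DEPLOYMENT LOCAL_PORT REMOTE_PORT
# Runs the same bosh ssh command as the procedures above.
open_tunnel() {
  bosh -d "$1" ssh tsdb/0 --opts="-L $2:localhost:$3"
}

# Usage (not executed here):
#   name="$(bosh deployments --column=name | pick_healthwatch_deployment)"
#   open_tunnel "$name" 9090 9090     # Prometheus UI
#   open_tunnel "$name" 8080 10401    # Alertmanager UI
```

If the pattern matches nothing, pick_healthwatch_deployment prints an empty string, so confirm that the name is not empty before you open the tunnel.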

Troubleshooting Known Issues

The sections below describe how to troubleshoot known issues in Healthwatch and Healthwatch Exporter for TKGI.

“Unable to Render Templates” Error When Installing or Upgrading

When installing or upgrading to Healthwatch v2.1, you see the following error:

- Unable to render templates for job 'opsman-cert-expiration-exporter'. Errors are:
  - Error filling in template 'bpm.yml.erb' (line 9: Can't find property '["opsman_access_credentials.uaa_client_secret"]')

This error occurs if you upgraded from Ops Manager v2.3 or earlier to Ops Manager v2.4 through v2.7. To resolve this issue:

  1. SSH into the Ops Manager VM by following the procedure in Log In to the Ops Manager VM with SSH in Advanced Troubleshooting with the BOSH CLI in the Ops Manager documentation.

  2. Change the user to root.

  3. Open the Rails console by running:

    > cd /home/tempest-web/tempest/web; RAILS_ENV='production' TEMPEST_INFRASTRUCTURE='DEPLOYMENT-IAAS' TEMPEST_WEB_DIR='/home/tempest-web' SECRET_KEY_BASE='1234' DATA_ROOT='/var/tempest' LOG_DIR='/var/log/opsmanager' su tempest-web --command 'bundle exec rails console'
    

    Where DEPLOYMENT-IAAS is either google, aws, azure, vsphere, or openstack, depending on the IaaS of your Ops Manager deployment.

  4. Set the decryption passphrase by running:

    irb(main):001:0> EncryptionKey.instance.passphrase = 'DECRYPTION-PASSPHRASE'
    

    Where DECRYPTION-PASSPHRASE is the decryption passphrase you want to set.

  5. Update the UAA restricted view access client secret by running:

    irb(main):001:0> Uaa::UaaConfig.instance.update_attributes(restricted_view_api_access_client_secret: SecureRandom.hex)
    

  6. Exit the Rails console and restart the tempest-web service by running:

    irb(main):001:0> exit
    > service tempest-web restart
    

This issue is fixed in Ops Manager v2.8 and later.

Smoke Tests Errand Fails When Deploying Healthwatch

When you deploy Healthwatch, the Smoke Tests errand fails with the following error message:

querying for grafana up should be greater than 0

The Smoke Tests errand fails because the Prometheus instance fails to scrape metrics from the Grafana instance. Potential causes of this failure include:

  • There is a network issue between the Prometheus instance and Grafana instance.

  • The Grafana instance uses a certificate that does not match the certificate authority (CA) you configured in the Grafana Configuration pane in the Healthwatch tile. This could occur because the CA you configured in the Grafana Configuration pane is either a self-signed certificate or a different CA from the one that generated the certificate. As a result, the Prometheus instance does not trust the certificate that the Grafana instance uses. For more information about configuring a CA for the Grafana instance, see Grafana Configuration in Configuring Healthwatch.

To find out why the Prometheus instance fails to scrape metrics from the Grafana instance:

  1. Log in to one of the VMs in the Prometheus instance by following the procedure in BOSH SSH in Advanced Troubleshooting with the BOSH CLI in the Ops Manager documentation.

  2. View information about the Grafana instance scrape target by running:

    curl -vk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq '.data.activeTargets[] | select(.scrapePool == "grafana")'
    

    The lastError field in the command output describes the reason for the Prometheus instance failing to scrape the Grafana instance.

TKGI Metric Exporter VM Fails to Connect to the BOSH Director

When the TKGI metric exporter VM attempts to connect to the BOSH Director, you see the following error:

ERROR [context.UaaContext [ForkJoinPool-1-worker-3]] javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
ERROR [ingress.TokenCallCredentials [ForkJoinPool-1-worker-3]] Caught error retrieving UAA token: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
INFO  [ingress.EventStreamObserver [ForkJoinPool-1-worker-3]] io.grpc.StatusRuntimeException: UNAUTHENTICATED

This error appears when the TKGI metric exporter VM cannot verify that the certificate chain of the UAA server for the BOSH Director is valid. To enable the TKGI metric exporter VM to connect to the BOSH Director, you must correct any certificate chain errors.

To check for certificate chain errors in the UAA server for the BOSH Director:

  1. Log in to the TKGI metric exporter VM by following the procedure in BOSH SSH in Advanced Troubleshooting with the BOSH CLI in the Ops Manager documentation.

  2. View the certificate that the UAA server uses by running:

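    openssl s_client -connect BOSH-DIRECTOR-IP:8443
    

    Where BOSH-DIRECTOR-IP is the IP address of your BOSH Director.

  3. Copy the certificate from the command output, including the BEGIN CERTIFICATE and END CERTIFICATE lines, and save it as a cert.pem file.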

  4. Run:

    openssl verify cert.pem
    

    If the command returns an OK message, the certificate is trusted and has a valid certificate chain. If the command returns any other message, see the OpenSSL documentation to troubleshoot.
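
The procedure above does not say how to get the certificate out of the openssl s_client output. One possible approach is sketched below. The helper name extract_first_pem is hypothetical, and BOSH-DIRECTOR-IP in the usage comment stands for the address of your BOSH Director. The helper prints only the first PEM certificate block it finds on standard input.

```shell
# Print the first PEM certificate block found on stdin, including the
# BEGIN CERTIFICATE and END CERTIFICATE lines, and ignore everything else.
extract_first_pem() {
  awk '
    /-----BEGIN CERTIFICATE-----/ { printing = 1 }
    printing                      { print }
    /-----END CERTIFICATE-----/   { exit }
  '
}

# Usage (not executed here):
#   openssl s_client -connect BOSH-DIRECTOR-IP:8443 </dev/null 2>/dev/null \
#     | extract_first_pem > cert.pem
#   openssl verify cert.pem
```

Redirecting standard input from /dev/null makes openssl s_client exit after the handshake instead of waiting for input. If the certificate is signed by a CA that is not in the system trust store, openssl verify accepts a -CAfile option that names the CA certificate to verify against.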

BOSH Health Metrics Cause Errors When Two Healthwatch Exporter Tiles Are Installed

When you install both Healthwatch Exporter for TAS for VMs and Healthwatch Exporter for TKGI on the same foundation, the BOSH Director Status panel in the BOSH Director Health dashboard in the Grafana UI shows “Not Running”, and your BOSH Director deployment returns the following error:

Director responded with non-successful status code '401' response '{"code":600000,"description":"Require one of the scopes: bosh.admin, bosh.750587e9-eae5-494f-99c4-5ca429b13959.admin, bosh.teams.p-healthwatch2-pas-exporter-b3a337d7ec4cca94f166.admin"}'

This occurs because both Healthwatch Exporter tiles deploy a BOSH health metric exporter VM, and both BOSH health metric exporter VMs are named bosh-health-exporter. This causes the two sets of metrics to conflict with each other.

To address this, scale the BOSH health metric exporter VM down to zero instances in one of the Healthwatch Exporter tiles:

  1. Navigate to the Ops Manager Installation Dashboard.

  2. Click the Healthwatch Exporter for Tanzu Kubernetes Grid - Integrated tile or Healthwatch Exporter for Tanzu Application Service tile.

  3. Select Resource Config.

  4. In the Bosh Health Exporter row, select 0 from the Instances dropdown.

  5. Click Save.

  6. Return to the Ops Manager Installation Dashboard.

  7. Click Review Pending Changes.

  8. Click Apply Changes.

Troubleshooting Missing TKGI Cluster Metrics

The sections below describe how to troubleshoot missing TKGI cluster metrics in the Grafana UI.

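To find out why the Prometheus instance fails to scrape metrics from your TKGI clusters, see Diagnose Prometheus Scrape Job Failure below.

Potential causes of this failure include the known Kubernetes issue described in No Data on TKGI Kubernetes Nodes Dashboard below and the missing DNS entries described in Configure DNS for Your TKGI Clusters below.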

Diagnose Prometheus Scrape Job Failure

When the Kubernetes Nodes dashboard in the Grafana UI does not show metrics data, the Prometheus instance in the Healthwatch tile has failed to scrape metrics from on-demand Kubernetes clusters created through the TKGI API.

To find out why the Prometheus instance fails to scrape metrics from your TKGI clusters:

  1. Log in to one of the VMs in the Prometheus instance by following the procedure in BOSH SSH in Advanced Troubleshooting with the BOSH CLI in the Ops Manager documentation.

  2. View information about your Prometheus instance scrape targets by running:

    curl -vk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq .
    

  3. Find the scrape jobs for your TKGI clusters in the command output. The lastError field of each scrape job describes why the Prometheus instance failed to scrape that TKGI cluster.
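
With many scrape targets, the full pretty-printed output can be long. As a rough sketch, the filter below keeps only the lines that show the scrape pool, scrape URL, health, and last error of each target. It assumes the output was pretty-printed by jq as in the command above, with one field per line. The helper name scrape_fields is hypothetical.

```shell
# Keep only the scrapePool, scrapeUrl, health, and lastError lines from
# pretty-printed /api/v1/targets output read on stdin.
scrape_fields() {
  grep -E '"(scrapePool|scrapeUrl|health|lastError)":'
}

# Usage (not executed here): append the filter to the command above, for example
#   curl ... | /var/vcap/packages/prometheus_backup_jq/bin/jq . | scrape_fields
```

Because the filter works on text and not on the JSON structure, it is a reading aid only. For precise selection, a jq expression such as the one in Smoke Tests Errand Fails When Deploying Healthwatch is more reliable.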

No Data on TKGI Kubernetes Nodes Dashboard

If you are using TKGI v1.10.0 or v1.10.1, the Kubernetes Nodes dashboard in the Grafana UI might not show data for individual pods. This is due to a known issue in Kubernetes v1.19.6 and earlier and Kubernetes v1.20.1 and earlier.

To fix this issue, upgrade to TKGI v1.10.2.

Configure DNS for Your TKGI Clusters

When TKGI cluster discovery fails, you see the following error:

2020-05-20 19:24:02 ERROR k8s.K8sClient [parallel-1] Failed to make request
java.net.UnknownHostException: CLUSTER-NAME.ENVIRONMENT-DOMAIN

Where:

  • CLUSTER-NAME is the name of your TKGI cluster.
  • ENVIRONMENT-DOMAIN is the domain of your TKGI foundation.

This occurs because the TKGI API cannot access your TKGI clusters from the Internet. To resolve this issue, you must configure a DNS entry for the control plane of each of your TKGI clusters in the console for your IaaS.

To configure DNS entries for the control planes of your TKGI clusters:

  1. Find the IP addresses and hostnames of the control plane of each of your TKGI clusters. For more information, see Viewing Cluster Details in the TKGI documentation.

  2. Record the Kubernetes Master IP(s) and Kubernetes Master Host from the output you viewed in the previous step.

  3. In a web browser, log in to the user console for your IaaS.

  4. For each TKGI cluster, find and record the public IP address of the VM that has an internal IP address matching the Kubernetes Master IP(s) you recorded in step 2. For more information, see the documentation for your IaaS.

  5. For each TKGI cluster, create an A record in your DNS server that points to the public IP address of the control plane of the TKGI cluster that you recorded in the previous step. For more information, see the documentation for your IaaS:

    • AWS: For more information about configuring a DNS entry in the Amazon VPC console, see the AWS documentation.
    • Azure: For more information about configuring an A record in Azure DNS, see the Azure documentation.
    • GCP: For more information about adding an A record to Cloud DNS, see the GCP documentation.
    • OpenStack: For more information about configuring a DNS entry in the OpenStack internal DNS, see the OpenStack documentation.
    • vSphere: For more information about configuring a DNS entry in the vCenter Server Appliance, see the vSphere documentation.

  6. Wait for your DNS server to update.
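
To tell when the DNS update has taken effect, you can poll the hostname until it resolves. The sketch below is one way to do that. The helper name wait_for_dns is hypothetical. It relies on getent, which is available on most Linux systems; on other systems, nslookup or dig can fill the same role. Run it from a machine that uses the same DNS server as your TKGI foundation.

```shell
# wait_for_dns HOSTNAME [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as HOSTNAME resolves, or 1 after ATTEMPTS failed lookups.
wait_for_dns() {
  wfd_host="$1"
  wfd_attempts="${2:-30}"
  wfd_delay="${3:-10}"
  wfd_i=1
  while [ "$wfd_i" -le "$wfd_attempts" ]; do
    if getent hosts "$wfd_host" >/dev/null 2>&1; then
      echo "$wfd_host resolves"
      return 0
    fi
    # Sleep only if another attempt follows.
    if [ "$wfd_i" -lt "$wfd_attempts" ]; then
      sleep "$wfd_delay"
    fi
    wfd_i=$((wfd_i + 1))
  done
  echo "$wfd_host still does not resolve after $wfd_attempts attempts" >&2
  return 1
}

# Usage (not executed here):
#   wait_for_dns CLUSTER-NAME.ENVIRONMENT-DOMAIN
```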

Troubleshooting Healthwatch Exporter Tiles Using Grafana UI Dashboards

By default, the Grafana UI includes dashboards for Healthwatch Exporter tiles under the Healthwatch folder.

Viewing Healthwatch Exporter Tile Metrics

The Healthwatch - SLOs dashboard in the Grafana UI displays a row for each metric exporter VM you select from the corresponding metric exporter instance dropdown at the top of the page. Each row contains four panels:

  • Up: The current health of the Prometheus endpoint on the metric exporter VM. A value of 1 indicates that the Prometheus endpoint is healthy. A value of 0 or missing data indicates that either the Prometheus endpoint is unresponsive or the Prometheus instance failed to scrape the Prometheus endpoint. For more information, see Jobs and Instances in the Prometheus documentation.

  • Exporter SLO: The percentage of time that the Healthwatch Exporter tile was up and running over the selected time period.

  • Error Budget Remaining: How many minutes are left in the error budget before exceeding the selected Uptime SLO Target over the selected time period.

  • Minutes of Downtime: How many minutes the Healthwatch Exporter tile was down during the selected time period.
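
As a worked example of how these panels relate, the sketch below applies the usual error budget arithmetic: the budget is the share of the time period that the SLO target allows the tile to be down, and the remaining budget is that allowance minus the downtime already used. This illustrates the concept only and is not necessarily the exact query the dashboard runs. The helper name error_budget_remaining is hypothetical.

```shell
# error_budget_remaining SLO_PERCENT PERIOD_MINUTES DOWNTIME_MINUTES
# Prints the minutes of error budget left, to one decimal place.
error_budget_remaining() {
  awk -v slo="$1" -v period="$2" -v down="$3" \
    'BEGIN { printf "%.1f\n", (1 - slo / 100) * period - down }'
}

# A 99.9% target over 30 days (43200 minutes) allows 43.2 minutes of downtime.
# After 10 minutes of downtime, 33.2 minutes remain.
error_budget_remaining 99.9 43200 10   # prints 33.2
```

A negative result means the downtime has already exceeded the budget for the selected time period.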

Troubleshooting Healthwatch Exporter for TAS for VMs

The Healthwatch - Exporter Troubleshooting dashboard in the Grafana UI displays metrics that allow you to monitor the performance of each Healthwatch Exporter for TAS for VMs tile installed on your foundations. You can use these metrics to troubleshoot when you see inconsistent graphs for a particular metric type, or if a Healthwatch Exporter for TAS for VMs tile is not behaving as expected.

This dashboard contains the following panels:

  • Exporter Info: A listing of the healthwatch_pasExporter_status metric, showing runtime information for Healthwatch Exporter for TAS for VMs.

  • Exporter JVM Memory: A graph of the jvm_memory_bytes_used, jvm_memory_bytes_committed, and jvm_memory_bytes_init metrics, showing the number of used, committed, and initial bytes in a given Java virtual machine (JVM) memory area over the selected time period. You can use this graph to check for memory leaks.

  • Ephemeral Disk Usage: A gauge of the system_disk_ephemeral_percent metric, showing the percentage of the ephemeral disk used. You can use this gauge to determine whether the disk is reaching capacity.

  • Rate of Garbage Collection: A graph of the jvm_gc_collection_seconds_sum metric, showing the rate of JVM garbage collection over the selected time period. You can use this graph to determine whether the JVM garbage collection is functional.

  • Rate of Envelope Ingress: A graph of the healthwatch_pasExporter_ingress_envelopes metric, showing the rate of Loggregator envelope ingress over the selected time period. You can use this graph to check for spikes in the number of Loggregator envelopes that the metric exporter VMs receive.

  • CPU Usage: A graph of the cpu_usage_user metric, showing the percentage of CPU used over the selected time period. You can use this graph to determine whether the amount of CPU used by Healthwatch Exporter for TAS for VMs is reaching capacity.

  • Exporter VM Threads: A graph of the jvm_threads_current and jvm_threads_peak metrics, showing the current and peak thread counts of a given JVM over the selected time period. You can use this graph to check whether Healthwatch Exporter for TAS for VMs is leaking threads.