Troubleshooting Pivotal Healthwatch

This topic describes how to resolve common issues with Pivotal Healthwatch.

Insufficient Memory Resources

Insufficient capacity of Diego cells can cause issues when you install or upgrade Pivotal Healthwatch.

Error

The push-apps errand fails with a message similar to the following:

$ /var/vcap/packages/cf-cli/bin/cf start healthwatch-blue
Starting app healthwatch-blue in org system / space healthwatch as admin...
FAILED
InsufficientResources

Cause

Diego cells do not have enough resources available to run the Pivotal Healthwatch apps.

Solution

To resolve this issue, navigate to the Resource Config pane of the Tanzu Application Service for VMs (TAS for VMs) tile and increase the number of Diego Cell instances. Alternatively, if you do not require high availability, scale down the number of instances in the Component Config section of the Pivotal Healthwatch tile.
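To confirm that the Diego cells are the constraint, you can inspect per-instance memory usage with the BOSH CLI. This is a minimal sketch; it assumes a BOSH alias of MY-ENV, and GUID is the GUID of your TAS for VMs deployment:

    # Show vitals for every instance in the TAS for VMs deployment.
    # Check the memory usage of the diego_cell instances to see how
    # much capacity remains for the Pivotal Healthwatch apps.
    bosh -e MY-ENV -d cf-GUID instances --vitals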

Memory Limit Errors

Insufficient memory allocation can cause issues when you install or upgrade Pivotal Healthwatch.

Error

If an Ops Manager environment exceeds the total memory limit set for the healthwatch space in the system org, the Pivotal Healthwatch push-apps errand can fail. When this occurs, the error message looks similar to the following:

$ /var/vcap/packages/cf-cli/bin/cf start cf-health-check
Starting app cf-health-check in org system / space healthwatch as admin...
FAILED
Server error, status code: 400, error code: 100005, message: You have exceeded your organization's memory limit: app requested more memory than available

Cause

Your Ops Manager environment has an insufficient total memory quota set for the healthwatch space in the system org.

The issue should not occur if the Apps Manager errand has run in your environment. Because service tiles use the system org to run smoke tests, the Apps Manager errand sets the default system org quota to runaway. If the Apps Manager errand has not run or has failed, the default system quota may not be set correctly.
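To check whether the system org is using the runaway quota, you can run cf CLI commands like the following (quota names can vary by foundation):

    # Display the system org, including its assigned quota.
    cf org system

    # Inspect the runaway quota definition.
    cf quota runaway

    # If appropriate, assign the runaway quota to the system org.
    cf set-quota system runaway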

Solution

To resolve this issue, you can set the default memory quota for the healthwatch space in the system org to at least 24 GB and re-run the push-apps errand manually.
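For example, the following cf CLI commands create a 24 GB space quota and assign it to the healthwatch space. This is a sketch; the quota name healthwatch-quota is arbitrary, and you can change an existing space quota with cf update-space-quota instead:

    # Target the system org, which contains the healthwatch space.
    cf target -o system

    # Create a space quota with a 24 GB total memory limit.
    cf create-space-quota healthwatch-quota -m 24G

    # Assign the new quota to the healthwatch space.
    cf set-space-quota healthwatch healthwatch-quota

After the quota is in place, re-run the push-apps errand, either by applying changes in Ops Manager with the errand enabled or with bosh run-errand push-apps against your Pivotal Healthwatch deployment.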

Ops Manager Health Check Errors

The Ops Manager Health Check must be able to reach Ops Manager over the underlying network.

Error

This error appears as constantly failing Ops Manager Health Checks on the Dashboard and the Ops Manager Health Check History page, even though Ops Manager is running.

Cause

The opsmanager-health-check app attempts to connect to Ops Manager to verify that it is running. The app needs the correct network settings to reach the Ops Manager VM. If firewall rules prevent this network access, the check continually fails.

Solution

To determine if the opsmanager-health-check app can reach the Ops Manager VM:

  1. SSH into the running instance. Run:

    cf ssh opsmanager-health-check
    
  2. Check the network access by running:

    curl -k -v OPS-MANAGER-URL
    

    Where OPS-MANAGER-URL is the URL of your Ops Manager deployment.
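    To reduce the output to just the HTTP status code, you can run a variation like the following; any HTTP response, such as a 200 or a redirect to the login page, indicates that the network path to Ops Manager is open:

    # Print only the HTTP status code returned by Ops Manager.
    curl -k -s -o /dev/null -w "%{http_code}\n" OPS-MANAGER-URL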

If the steps above are not successful and you cannot modify network access to allow the opsmanager-health-check app to reach Ops Manager, the test cannot execute successfully. In that case, follow the procedure in Disable Ops Manager Continuous Validation Testing in Installing Pivotal Healthwatch.

CLI Command Health Check Errors

Below are suggestions for troubleshooting errors with the CLI Command Health Check.

Error

The CLI Command Health Check panel on the Healthwatch dashboard shows failures.

Solution

To troubleshoot these failures, examine the logs from the cf-health-check app in the healthwatch space under the system org. Look for JSON log entries where the status field does not equal "SUCCESS". These entries contain the output of the cf CLI, which you can use to begin troubleshooting.
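For example, the following commands target the space and filter the recent logs. This is a sketch; the exact JSON shape of the log entries can vary between Pivotal Healthwatch versions:

    # Target the space that contains the cf-health-check app.
    cf target -o system -s healthwatch

    # Show recent logs, hiding entries whose status field is SUCCESS.
    cf logs cf-health-check --recent | grep -v '"status":"SUCCESS"'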

Note: The CLI Command Health Check may fail during certain events such as TAS for VMs upgrades or BBR backups. For more information, see cf push Availability During TAS for VMs Upgrades in the TAS for VMs documentation.

BOSH Health Check Failing After Upgrade

Error

ERROR: Bosh health check failed to delete deployment "bosh-health-check": Deployment not found

Cause

In Pivotal Healthwatch v1.2.2 and later, Pivotal Healthwatch uses service broker UAA credentials for the BOSH Health Check. This causes a permission issue if the BOSH Health Check deployment already exists on the Director.

Solution

Manually delete the existing BOSH Health Check deployment.
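For example, using the BOSH CLI with an alias of MY-ENV:

    # Confirm that the stale deployment exists on the Director.
    bosh -e MY-ENV deployments

    # Delete the existing BOSH Health Check deployment.
    bosh -e MY-ENV -d bosh-health-check delete-deployment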

BOSH System Metrics Not Being Ingested

Error

Either the Smoke Test errand fails with the following error:

[Fail] Bosh metric ingestion [It] Ingests metrics from the director into mysql /var/vcap/packages/healthwatch-data/src/github.com/pivotal-cf/healthwatch-data/data-ingestion/smoketests/bosh_metrics_test.go:50

Or the healthwatch.ingestor.ingested.boshSystemMetrics metric has a value of 0.

Note: A symptom of this error is a red Job Health panel with no failing jobs noted on the Pivotal Healthwatch dashboard.
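If the cf log-stream plugin used later in this topic is installed, one way to watch the metric is to stream the Ingestor's own output and filter for it. This is a sketch; it assumes the Ingestor emits its metrics through Loggregator under the healthwatch-ingestor source name:

    # Stream output from the Healthwatch Ingestor and filter for the
    # BOSH system metrics ingestion counter.
    cf log-stream healthwatch-ingestor | grep boshSystemMetrics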

Cause

The Healthwatch Ingestor is not receiving BOSH system metrics. The following are likely causes:

  1. The Healthwatch Ingestor is not receiving any metrics from the Loggregator Reverse Log Proxy (RLP), including BOSH system metrics. This could be an issue with the Ingestor itself or with the Loggregator RLP. To determine whether the Ingestor is receiving any metrics, look at the Pivotal Healthwatch dashboard. If the Router panel graphs show no data, in addition to 0% Job Health, you are not getting data from the RLP.

  2. The Loggregator RLP does not contain BOSH Metrics due to a failure in the bosh-system-metrics-forwarder component.

  3. A bug causes the BOSH System Metrics Forwarder process to disconnect from the metrics stream emitted by the BOSH Director. This bug is present in Ops Manager versions earlier than v2.2.4 and v2.3.0 and Pivotal Application Service (PAS) versions earlier than v2.2.5 and v2.3.0.

Solution

If the Ingestor is not receiving any metrics from Loggregator:

  • Check the logs from the Healthwatch Ingestor to see any error messages by running:

    cf logs healthwatch-ingestor --recent
    
  • Restart the Healthwatch Ingestor by running:

    cf restart healthwatch-ingestor
    

If the BOSH System Metrics Forwarder is failing:

  • Upgrade Ops Manager and TAS for VMs to versions higher than those listed in the “Cause” section above. You can then validate that there are BOSH system metrics in the Loggregator RLP by running:

    cf log-stream bosh-system-metrics-forwarder
    

    This displays metrics such as system.healthy and system.cpu.user about every 30 seconds.

  • As a temporary fix, you can recreate the loggregator_trafficcontroller VMs.

    • Log in to the BOSH Director VM by following one of the procedures in Create a BOSH Alias in Advanced Troubleshooting with the BOSH CLI in the Ops Manager documentation.
    • Recreate the Loggregator Traffic Controller VMs by running:

      bosh -e ENVIRONMENT-URL-OR-IP -d cf-GUID recreate loggregator_trafficcontroller
      

      Where:

      • ENVIRONMENT-URL-OR-IP is the URL or IP address of the BOSH Director.
      • GUID is the GUID of your TAS for VMs deployment.

        For more information about recreating VMs, see Recreate in Commands in the BOSH documentation.
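      If you do not know the GUID, you can list the deployments on the Director and look for the name that begins with cf-:

        # The TAS for VMs deployment appears as cf- followed by its GUID.
        bosh -e ENVIRONMENT-URL-OR-IP deployments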

If the BOSH System Metrics Forwarder and the Healthwatch Ingestor are running as expected:

  • Check for clock drift. If the clock on the Healthwatch Forwarder VM is ahead of the rest of the environment, the errand looks for MySQL metric entries with timestamps that the metric sources cannot produce before the errand times out.
    To check for clock drift:
    1. SSH onto the Healthwatch Forwarder VM.
    2. Run date and see if the date is in the future compared to other machines in the environment.
    3. If the date is out of sync, investigate and sync your Network Time Protocol (NTP) servers.
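    As a sketch, you can also run the time check non-interactively from the BOSH CLI. The deployment and instance names below are placeholders; substitute the names shown by bosh deployments and bosh instances for your Pivotal Healthwatch deployment:

      # Print the UTC time on the Healthwatch Forwarder VM.
      bosh -e MY-ENV -d HEALTHWATCH-DEPLOYMENT ssh FORWARDER-INSTANCE -c 'date -u'

      # Compare against the clock on the machine you run this from.
      date -u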

No Metrics Being Ingested

Error

The healthwatch.ingestor.ingested metric has a value of 0.

Cause

The Healthwatch Ingestor is not receiving any metrics. This could be due to the Loggregator system being in a bad state or the Ingestor not reconnecting to the Loggregator RLP.

Solution

A few possible solutions are:

  • Restart the Healthwatch Ingestor by running:

    cf restart healthwatch-ingestor
    
  • Recreate the Loggregator Traffic Controller VMs where the Loggregator RLP processes reside.

    • Log in to the BOSH Director VM by following one of the procedures in Create a BOSH Alias in Advanced Troubleshooting with the BOSH CLI in the Ops Manager documentation.
    • Recreate the Loggregator Traffic Controller VMs by running:

      bosh -e ENVIRONMENT-URL-OR-IP -d cf-GUID recreate loggregator_trafficcontroller
      

      Where:

      • ENVIRONMENT-URL-OR-IP is the URL or IP address of the BOSH Director.
      • GUID is the GUID of your TAS for VMs deployment.
  • Check for clock drift. If the clock on the Healthwatch Forwarder VM is ahead of the rest of the environment, the errand looks for MySQL metric entries with timestamps that the metric sources cannot produce before the errand times out.
    To check for clock drift:

    1. SSH onto the Healthwatch Forwarder VM.
    2. Run date and see if the date is in the future compared to other machines in the environment.
    3. If the date is out of sync, investigate and sync your Network Time Protocol (NTP) servers.