Troubleshooting PCF Healthwatch
This topic describes how to resolve common issues with Pivotal Cloud Foundry (PCF) Healthwatch.
Insufficient capacity of the Diego cells can cause issues when you install or upgrade PCF Healthwatch.
The push-apps errand can fail if Diego cells do not have sufficient free memory to place the PCF Healthwatch applications. If this occurs, you see an error message like the following:
$ /var/vcap/packages/cf-cli/bin/cf start healthwatch-blue
Starting app healthwatch-blue in org system / space healthwatch as admin...
FAILED
InsufficientResources
Diego cells do not have enough available resources to place the PCF Healthwatch applications.
To resolve this issue, navigate to the Resource Config pane of the PAS or SRT tile and increase the number of Diego Cell instances. Alternatively, if you do not need high availability, scale down the number of instances in Healthwatch Component Config in the PCF Healthwatch tile.
Insufficient memory allocation can cause issues when you install or upgrade PCF Healthwatch.
If a PCF environment exceeds the total memory limit set for the healthwatch space in the system org, the PCF Healthwatch push-apps errand can fail. When this occurs, the error message looks similar to the following:
$ /var/vcap/packages/cf-cli/bin/cf start cf-health-check
Starting app cf-health-check in org system / space healthwatch as admin...
FAILED
Server error, status code: 400, error code: 100005, message: You have exceeded your organization's memory limit: app requested more memory than available
Your PCF environment has an insufficient total memory quota set for the healthwatch space in the system org.
This issue should not occur if the Apps Manager errand has run in your environment. Because service tiles use the system org to execute smoke tests, the Apps Manager errand sets the default system org quota to runaway. If the Apps Manager errand has not run or has failed, the default system org quota may not have been reset properly.
To resolve this issue, you can set the default memory quota for the healthwatch space in the system org to at least 24 GB and re-run the push-apps errand manually.
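As a rough sketch of the check behind this resolution, the following helpers compare a memory value against the 24 GB minimum. The function names to_mb and meets_minimum are hypothetical and not part of PCF Healthwatch or the cf CLI. The input format (a number followed by M or G) is an assumption based on how the cf CLI accepts quota memory limits, and the quota names in the comments are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of PCF Healthwatch or the cf CLI.

# Convert a memory value such as 24G or 10240M to megabytes.
to_mb() {
  local value="$1"
  case "$value" in
    *[Gg]) echo $(( ${value%[Gg]} * 1024 )) ;;
    *[Mm]) echo "${value%[Mm]}" ;;
    *) echo "unsupported unit: $value" >&2; return 1 ;;
  esac
}

# Succeed if the given memory value is at least 24 GB (24576 MB).
meets_minimum() {
  local mb
  mb="$(to_mb "$1")" || return 1
  [ "$mb" -ge 24576 ]
}

# Possible workflow with a logged-in cf CLI (quota names are placeholders):
#   cf org system                              # shows the quota assigned to the org
#   cf space healthwatch                       # shows the space quota, if one is set
#   cf update-quota <ORG_QUOTA_NAME> -m 24G    # raise an org quota
#   cf update-space-quota <SPACE_QUOTA_NAME> -m 24G
```

Whether the org quota or a space quota is the limiting one depends on your environment, so inspect both before changing either.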
The Ops Manager Health Check needs the ability to reach Ops Manager on the underlying network.
This error appears as constantly failing Ops Manager Health Checks on the Dashboard and the Ops Manager Health Check History page, even though Ops Manager is running.
The opsmanager-health-check application attempts to connect to Ops Manager to verify that it is running. This application needs the correct network settings to reach the Ops Manager VM. If firewall rules prevent this network access, the check fails continually.
To resolve this issue, confirm that the opsmanager-health-check application is attempting to reach the Ops Manager VM on a URL that is accessible from that instance. To verify this, run cf ssh opsmanager-health-check to SSH into the running instance. Then run curl -v $OPSMANAGER_URL to check the network access.
If you cannot modify network access to allow the opsmanager-health-check application to reach Ops Manager, this test cannot be executed properly and you should disable it. See Disable Ops Manager Continuous Validation Testing.
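When interpreting the curl result from the check described above, the exit code often narrows down the cause. The helper below is a minimal sketch: the function name explain_curl_exit is hypothetical, the exit codes are the ones documented in the curl man page, and the suggested causes are general hints rather than a PCF-specific diagnosis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a curl exit code to a likely cause.
explain_curl_exit() {
  case "$1" in
    0)     echo "reachable" ;;
    6)     echo "DNS lookup failed" ;;
    7)     echo "connection refused or rejected" ;;
    28)    echo "timed out, possibly dropped by a firewall" ;;
    35|60) echo "TLS handshake or certificate problem" ;;
    *)     echo "curl exit code $1, see the curl man page" ;;
  esac
}

# Possible use, pasted into the session opened by
# cf ssh opsmanager-health-check:
#   curl -sS -o /dev/null --max-time 10 "$OPSMANAGER_URL"
#   explain_curl_exit $?
```

A timeout usually points at packets being silently dropped, while an immediate refusal suggests the request reached a host that is not listening on that port.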
Below are suggestions for troubleshooting errors with the CLI Command Health Check.
The CLI Command Health Check panel on the Healthwatch dashboard shows failures.
To troubleshoot these failures, start by examining the logs from the cf-health-check app in the healthwatch space under the system org. Look for JSON log entries where the status field does not equal "SUCCESS". These log entries are the output of the cf-cli. Use this information to begin troubleshooting.
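The log search described above can be sketched as a small filter. The function name failed_checks is hypothetical, and the exact shape of the log lines is an assumption: the filter only relies on each entry containing a JSON status field on a single line, as the text describes.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read log lines on stdin and print those that carry a
# JSON "status" field whose value is not "SUCCESS".
# Returns 0 even when nothing matches, so it is safe under "set -e".
failed_checks() {
  grep '"status"' | grep -v '"status"[[:space:]]*:[[:space:]]*"SUCCESS"' || true
}

# Possible use with a logged-in cf CLI:
#   cf target -o system -s healthwatch
#   cf logs cf-health-check --recent | failed_checks
```

If the entries turn out to span multiple lines in your environment, a JSON-aware tool is a better fit than this line-based filter.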
Note: The CLI Command Health Check may fail during certain events on the foundation, such as PAS upgrades or BBR backups. For more information, see [cf push Availability During Pivotal Application Service Upgrades](https://docs.pivotal.io/pivotalcf/2-0/customizing/cfpush-availability-during-upgrade.html).
See the following error messages:
ERROR: Bosh health check failed to delete deployment “bosh-health-check”: Deployment not found
However, the bosh-health-check deployment does exist.
For PCF Healthwatch v1.1.8 and later, the BOSH Health Check uses the service broker UAA credentials. This can cause a permissions issue if the BOSH Health Check deployment already exists on the BOSH Director.
To resolve this issue, manually delete the existing BOSH Health Check deployment.
Either the Smoke Test errand fails with:
[Fail] Bosh metric ingestion [It] Ingests metrics from the director into mysql
/var/vcap/packages/healthwatch-data/src/github.com/pivotal-cf/healthwatch-data/data-ingestion/smoketests/bosh_metrics_test.go:50
or there is a complete lack of data in the Job Health and Job Vitals panels on the PCF Healthwatch dashboard.
Note: A symptom of this error is a red Job Health panel with no failing jobs noted on the PCF Healthwatch dashboard.
The Healthwatch Ingestor is not receiving BOSH system metrics. There are two likely causes:
- The Healthwatch Ingestor is not receiving any metrics from the Firehose, including BOSH system metrics. This could be an issue with the Ingestor itself or with the Loggregator Traffic Controller. To determine whether the Ingestor is receiving any metrics, look at the PCF Healthwatch dashboard. If the Router panel graphs show no data and Job Health is at 0%, you are not getting data from the Firehose.
- The Firehose does not contain BOSH metrics due to a failure in the bosh-system-metrics-forwarder component. A bug in earlier versions of Ops Manager causes the BOSH System Metrics Forwarder process to disconnect from the metrics stream emitted by the BOSH Director. This bug is present in Ops Manager versions earlier than v2.0.13 and v2.1.4.
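To check an Ops Manager version against the affected versions named above, a comparison along these lines may help. This is a sketch under assumptions: the function name forwarder_bug_status is hypothetical, and it reads the version statement as "v2.0 releases before v2.0.13 and v2.1 releases before v2.1.4". Any other release line is reported as unknown rather than guessed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an Ops Manager version as affected, fixed,
# or unknown with respect to the BOSH System Metrics Forwarder bug.
forwarder_bug_status() {
  local version="${1#v}"            # accept "v2.0.12" or "2.0.12"
  local major minor rest patch

  # Require at least MAJOR.MINOR.PATCH with numeric starts.
  case "$version" in
    [0-9]*.[0-9]*.[0-9]*) ;;
    *) echo "unknown"; return 0 ;;
  esac

  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"
  patch="${patch%%[!0-9]*}"         # drop suffixes such as "-build.7"

  case "$major.$minor" in
    2.0)
      if [ "$patch" -lt 13 ]; then echo "affected"; else echo "fixed"; fi ;;
    2.1)
      if [ "$patch" -lt 4 ]; then echo "affected"; else echo "fixed"; fi ;;
    *)
      echo "unknown" ;;             # release lines the text does not cover
  esac
}
```

For example, forwarder_bug_status 2.0.12 prints affected, while forwarder_bug_status v2.1.4 prints fixed.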
If the Ingestor is not receiving any metrics from the Firehose:
- Check the logs from the Healthwatch Ingestor for error messages:
cf logs healthwatch-ingestor --recent
- Restart the Healthwatch Ingestor:
cf restart healthwatch-ingestor
If the BOSH System Metrics Forwarder is failing:
- Upgrade Ops Manager to v2.0.13 or v2.1.4 or later. You can then validate that there are BOSH system metrics in the Firehose by running cf nozzle -n | grep system. This displays metrics such as system.cpu.user about every 30 seconds.
- As a temporary fix, you can recreate the loggregator_trafficcontroller VMs. After logging in to the BOSH Director, run the following command:
bosh -e <MY_ENV> -d cf-<guid> recreate loggregator_trafficcontroller
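After either fix, the Firehose validation mentioned in the steps above can be turned into a simple count. The sketch below assumes that each metric appears on its own line of the nozzle output and that BOSH system metric names start with system followed by a dot, as in system.cpu.user. The function name count_system_metrics is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines on stdin that mention a "system." metric.
# Prints 0 (and still returns 0) when there are no matches.
count_system_metrics() {
  grep -c 'system\.' || true
}

# Possible use with the Firehose plugin installed. The GNU timeout command
# stops the stream after 90 seconds, which should cover a few 30-second
# emission intervals:
#   timeout 90 cf nozzle -n | count_system_metrics
```

A result of 0 after a window of this length suggests that BOSH system metrics are still missing from the Firehose.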