Pivotal Healthwatch v1.7 Release Notes

v1.7.2

Release Date: March 11, 2020

Features

New features and changes in this release:

  • [Bug Fix] Fix the issue where BOSH health check failed after reinstallation.
  • [Bug Fix] Fix the issue where super metrics were lost when ingestor instances were overloaded by router latency metrics.

Known Issues

This release has the following known issues.

Healthwatch v1.7 and PAS v2.8 Compatibility Limitations

When you use Pivotal Healthwatch v1.7 with Pivotal Application Service (PAS) v2.8, the Log Cache Cache Duration chart is blank because the underlying metric name changed from log_cache.cache_period to log_cache.log_cache_cache_period. You must also leave the Metrics Troubleshooting errand disabled. This errand is disabled by default, so most likely no action is necessary. We recommend using Pivotal Healthwatch v1.8 with PAS v2.8.
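If you query this metric yourself, for example from a custom dashboard, the rename can be handled with a small lookup. The following is a minimal sketch, not Healthwatch code: the function name is hypothetical, and it assumes the new name applies to PAS v2.8 and later while earlier versions keep the old name.

```python
from typing import Tuple

OLD_NAME = "log_cache.cache_period"            # name before the rename
NEW_NAME = "log_cache.log_cache_cache_period"  # name reported with PAS v2.8


def log_cache_duration_metric(pas_version: Tuple[int, int]) -> str:
    """Return the metric name to query for a (major, minor) PAS version.

    Assumption: the rename takes effect at PAS v2.8 and stays in effect
    for later versions.
    """
    return NEW_NAME if pas_version >= (2, 8) else OLD_NAME
```

For example, log_cache_duration_metric((2, 7)) returns the old name and log_cache_duration_metric((2, 8)) returns the new one.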

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the app bosh-health-check, which deploys and deletes a VM every 10 minutes. vSphere versions earlier than v6.5 update 2 leave a namespace or folder and subfolders when the VM is deleted. The orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information about the vSAN known issue, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. Alternatively, stop the bosh-health-check app to slow the increase in the vSAN object count.
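To get a rough sense of how quickly orphaned folders can accumulate, the 10-minute cycle can be turned into a simple estimate. This is an illustrative sketch only: the function name is hypothetical, and it assumes one orphaned top-level folder per deleted VM, which may not match the number of vSAN objects actually left behind in your environment.

```python
CYCLE_MINUTES = 10  # from the note above: one VM is deployed and deleted every 10 minutes


def orphaned_folders(days: float, folders_per_cycle: int = 1) -> int:
    """Estimate the number of orphaned folders left behind after `days` days.

    Assumption: each deploy/delete cycle leaves `folders_per_cycle`
    top-level folders behind (default 1).
    """
    cycles = int(days * 24 * 60 // CYCLE_MINUTES)
    return cycles * folders_per_cycle
```

Under these assumptions, orphaned_folders(1) gives 144 folders per day and orphaned_folders(30) gives 4320 folders per 30 days.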

Indicator Protocol Beta Dashboard Displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load with the error "Error fetching graph data."

These charts are populated from Log Cache, which is part of Loggregator. They fail periodically because Log Cache times out while attempting to process the data.

No corrective action is required. The error usually resolves on its own.

Healthwatch Reports False Capacity Metrics for Isolation Segments Without Placement Tags

In PAS v2.8, a new feature in the Isolation Segment Tile allows you to deploy compute isolation segments without placement tags. This allows you to deploy a separate group of Diego Cells without isolating the Cell capacity from other apps. For more information about this feature, see Compute and Networking Isolation in Pivotal Isolation Segment v2.8 Release Notes.

If you deploy compute isolation segments without placement tags, Healthwatch cannot accurately measure and report on capacity. Capacity charts, calculated capacity metrics such as Free Chunks, and capacity alerts may incorrectly report a lower capacity than is available for apps.

v1.7.1

Release Date: December 18, 2019

Features

New features and changes in this release:

  • Remove the alert associated with the deprecated metric Reverse Log Proxy Loss Rate.
  • Add the Metrics Troubleshooting errand. This errand is for troubleshooting purposes only and is off by default.
  • Remove scientific notation when retrieving alert configurations from the Healthwatch API.
  • [Bug Fix] Fix infinite redirect at login.
  • [Bug Fix] Fix issue with multiple healthwatch_space_developer users on Healthwatch re-install.
  • [Bug Fix] Remove unnecessary migration error message during installation.
  • [Bug Fix] Fix documentation links from alerts that pointed at older versions of Healthwatch.
  • [Bug Fix] Fix issue with duplicate entries in free_chunk_configuration table.

  • Maintenance update of the following dependencies:

    • Spring Boot now 2.1.9

    • Indicator Protocol now 0.7.17

    • syslog-release now 11.x
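The scientific notation change in the list above concerns how numeric threshold values are rendered. As a general illustration of the technique, and not the Healthwatch implementation, the following sketch renders a float as a plain decimal string. The function name is hypothetical.

```python
from decimal import Decimal


def plain_number(value: float) -> str:
    """Render a float as a plain decimal string, never in scientific notation."""
    # repr() gives the shortest string that round-trips the float;
    # Decimal's "f" format then expands any exponent into plain digits.
    return format(Decimal(repr(value)), "f")
```

For example, str(1e-05) yields "1e-05", while plain_number(1e-05) yields "0.00001".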

Known Issues

This release has the following known issues.

BOSH Health Check Fails After Reinstallation

If Healthwatch is uninstalled and re-installed while the BOSH Health Check is running, then the BOSH Health Check fails to deploy, and reports an error in the Healthwatch UI.

To address this issue, manually delete the bosh-health-check deployment and restart the bosh-health-check app.

Healthwatch v1.7 and PAS v2.8 Compatibility Limitations

When you use Pivotal Healthwatch v1.7 with Pivotal Application Service (PAS) v2.8, the Log Cache Cache Duration chart is blank because the underlying metric name changed from log_cache.cache_period to log_cache.log_cache_cache_period. You must also leave the Metrics Troubleshooting errand disabled. This errand is disabled by default, so most likely no action is necessary. We recommend using Pivotal Healthwatch v1.8 with PAS v2.8.

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the app bosh-health-check, which deploys and deletes a VM every 10 minutes. vSphere versions earlier than v6.5 update 2 leave a namespace or folder and subfolders when the VM is deleted. The orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information about the vSAN known issue, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. Alternatively, stop the bosh-health-check app to slow the increase in the vSAN object count.

Indicator Protocol Beta Dashboard Displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load with the error "Error fetching graph data."

These charts are populated from Log Cache, which is part of Loggregator. They fail periodically because Log Cache times out while attempting to process the data.

No corrective action is required. The error usually resolves on its own.

Healthwatch Reports False Capacity Metrics for Isolation Segments Without Placement Tags

In PAS v2.8, a new feature in the Isolation Segment Tile allows you to deploy compute isolation segments without placement tags. This allows you to deploy a separate group of Diego Cells without isolating the Cell capacity from other apps. For more information about this feature, see Compute and Networking Isolation in Pivotal Isolation Segment v2.8 Release Notes.

If you deploy compute isolation segments without placement tags, Healthwatch cannot accurately measure and report on capacity. Capacity charts, calculated capacity metrics such as Free Chunks, and capacity alerts may incorrectly report a lower capacity than is available for apps.

v1.7.0

Release Date: September 20, 2019

Features

New features and changes in this release:

  • Remove the default critical and warning thresholds for alerts that have proven to be highly dependent on customer environments.

    • For customers doing a fresh install:
      • They will not receive alerts for metrics with highly variable thresholds, as designated by the Environment Specific Alert table. Customers who want to receive alerts for these metrics need to configure the alert thresholds explicitly through HAPI.
    • For customers upgrading:
      • If they have custom alert thresholds configured through HAPI for the affected metrics, the alert behavior is not affected by this change. If customers choose to forgo their custom thresholds and no longer monitor these metrics, instructions are provided here.
      • If they do not have custom alert thresholds configured, they will no longer receive alerts for the affected metrics. Current in-flight red/yellow alerts will be cleared by green alerts regardless of the current metric value.
  • Remove the metric, graph, and alert associated with Route Registration Messages Delta. This metric was removed in PAS v2.4, so related graphs and alerts should not display. The current associated alert will be resolved automatically.

  • [Bug Fix] Correctly handle rotation of root Certificate Authorities.

  • [Bug Fix] Reduce noisiness of system.healthy alerts when a BOSH VM is created or deleted.

  • [Bug Fix] If healthwatch-ingestor fails to receive data after 15 seconds, it will automatically reset its Spring Application Context to re-establish a Firehose connection. After 20 resets of the Spring Application Context, the app instance will purposely crash and let Diego re-schedule it, providing a fresh container and JVM instance.

  • [Bug Fix] Fix a healthwatch-ingestor crash in cases where GoRouter receives an HTTP request with a non-standard HTTP method, resulting in an HttpStartStop metric with a null HTTP method value.

  • [Bug Fix] Setting Redis Worker Count in the Healthwatch Component Config page of Ops Manager now changes the number of instances. Previously, changes to this field were not reflected in the Healthwatch deployment.

  • [Bug Fix] Delete orphaned cf-health-check smoke-test-app instances regularly. Previously, cf-health-check would occasionally fail to delete a smoke test and never cleaned it up.

  • [Bug Fix] Fix occasional inaccurate spikes in Log Transport Throughput graph.

  • Maintenance update of the following dependencies:

    • Spring Boot now 2.1.8
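The healthwatch-ingestor recovery behavior described in the list above (reset after 15 seconds without data, crash after 20 resets) can be modeled as a small state machine. The following is a hypothetical sketch for illustration, not the ingestor's actual code. The class and method names are invented, and it assumes that the reset counter is not cleared when data arrives again, which the release note does not specify.

```python
from dataclasses import dataclass

DATA_TIMEOUT_SECONDS = 15  # from the release note
MAX_CONTEXT_RESETS = 20    # from the release note


@dataclass
class IngestorWatchdog:
    """Hypothetical model of the reset-then-crash behavior."""

    last_data_at: float = 0.0
    resets: int = 0

    def on_data(self, now: float) -> None:
        # Data arrived from the Firehose: remember when.
        self.last_data_at = now

    def check(self, now: float) -> str:
        """Return the action taken at time `now`: "ok", "reset", or "crash"."""
        if now - self.last_data_at <= DATA_TIMEOUT_SECONDS:
            return "ok"
        if self.resets >= MAX_CONTEXT_RESETS:
            # Give up so that Diego can reschedule the instance
            # with a fresh container and JVM.
            return "crash"
        self.resets += 1
        # A reset restarts the wait for data.
        self.last_data_at = now
        return "reset"
```

In this model, a watchdog that never receives data returns "reset" for the first 20 timeouts and "crash" on the next one.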

Known Issues

This release has the following known issues.

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the app bosh-health-check, which deploys and deletes a VM every 10 minutes. vSphere versions earlier than v6.5 update 2 leave a namespace or folder and subfolders when the VM is deleted. The orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information about the vSAN known issue, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. Alternatively, stop the bosh-health-check app to slow the increase in the vSAN object count.

Indicator Protocol Beta Dashboard Displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load with the error "Error fetching graph data."

These charts are populated from Log Cache, which is part of Loggregator. They fail periodically because Log Cache times out while attempting to process the data.

No corrective action is required. The error usually resolves on its own.

Multiple healthwatch_space_developer CF Users on Healthwatch Re-install

When the PCF Healthwatch tile is re-installed, the push-apps errand creates a duplicate healthwatch_space_developer user because the pre-existing user is not deleted during the previous tile’s deletion.

This causes the cf-health-check to fail due to an invalid password for the healthwatch_space_developer user.

This issue is fixed in Healthwatch v1.7.1.

Reverse Log Proxy Loss Rate Alert

Occasionally, the Reverse Log Proxy Loss Rate alert fires, although the metric was removed in Healthwatch v1.5 and later.

Infinite Login Redirect When Using Private Domain Suffixes

In Healthwatch v1.7.0, certain private domain suffixes, such as .local or .a, result in an infinite redirect loop when you try to access the Healthwatch UI.

A workaround is to set the SKIP_CERT_VERIFY environment variable to true on the Healthwatch app. This is resolved in Healthwatch v1.7.1.

For the canonical list of public suffixes, see the Public Suffix List.
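To check whether a domain ends in a suffix that is not on the Public Suffix List, a simple lookup on the last label can serve as a first pass. This is a simplified sketch: the function name is hypothetical, the suffix set is a tiny illustrative sample rather than the real list, and real Public Suffix List matching also covers multi-label suffixes and wildcard rules. A negative result only suggests that a domain might be affected.

```python
# Tiny illustrative sample; load the real Public Suffix List in practice.
SAMPLE_PUBLIC_SUFFIXES = {"com", "org", "net", "io"}


def has_public_suffix(hostname: str, public_suffixes=SAMPLE_PUBLIC_SUFFIXES) -> bool:
    """Return True if the last label of `hostname` is in `public_suffixes`.

    Simplification: only single-label suffixes are checked.
    """
    last_label = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return last_label in public_suffixes
```

With this sample set, a host under .com passes the check, while hosts under .local or .a do not.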

Healthwatch Reports False Capacity Metrics for Isolation Segments Without Placement Tags

In PAS v2.8, a new feature in the Isolation Segment Tile allows you to deploy compute isolation segments without placement tags. This allows you to deploy a separate group of Diego Cells without isolating the Cell capacity from other apps. For more information about this feature, see Compute and Networking Isolation in Pivotal Isolation Segment v2.8 Release Notes.

If you deploy compute isolation segments without placement tags, Healthwatch cannot accurately measure and report on capacity. Capacity charts, calculated capacity metrics such as Free Chunks, and capacity alerts may incorrectly report a lower capacity than is available for apps.