PCF Healthwatch v1.6 Release Notes

v1.6.4

Release Date: February 28, 2020

Features

New features and changes in this release:

  • Remove alert associated with the removed metric Reverse Log Proxy Loss Rate. See Reverse Log Proxy Loss Rate.
  • [Bug Fix] Fix infinite redirect at login.
  • [Bug Fix] Remove the healthwatch_space_developer user when the PCF Healthwatch tile is uninstalled.
  • [Bug Fix] Fix the issue where Super Value Metrics were lost when large volumes of router latency metrics overloaded the ingestor instances.
  • [Bug Fix] Fix the installation failure caused by a duplicate entry in free_chunks_configuration, which made the Flyway migration fail during Apply Changes.

  • Maintenance update of the following dependencies:

    • Spring Boot now 2.1.9
    • Indicator Protocol now 0.7.17
    • Syslog release now 11.6.0

Known Issues

This release has the following known issues.

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the app bosh-health-check, which deploys and deletes a VM every 10 minutes. vSphere versions earlier than v6.5 update 2 leave a namespace or folder and subfolders when a VM is deleted. The orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information about the vSAN known issue, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. Or, stop the bosh-health-check to slow down the increase in vSAN object count.
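
As a rough illustration of how quickly the orphaned folders accumulate, the following minimal sketch estimates the count from the 10-minute interval stated above. The function name and the folders_per_vm parameter are hypothetical; the actual number of vSAN objects left behind per deleted VM is not stated in these notes.

```python
from datetime import timedelta

# Interval from the release note: bosh-health-check deploys and deletes
# one VM every 10 minutes.
CHECK_INTERVAL = timedelta(minutes=10)


def orphaned_folders(elapsed: timedelta, folders_per_vm: int = 1) -> int:
    """Estimate how many folders are left behind after `elapsed` time.

    folders_per_vm is an assumption for illustration only; the notes say
    each deleted VM leaves "a namespace or folder and subfolders" without
    giving an exact count.
    """
    cycles = elapsed // CHECK_INTERVAL  # completed deploy/delete cycles
    return cycles * folders_per_vm


print(orphaned_folders(timedelta(days=1)))   # 144
print(orphaned_folders(timedelta(days=30)))  # 4320
```

With one leftover folder per cycle, an affected vSphere version accumulates about 144 orphaned folders per day.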

Indicator Protocol Beta Dashboard Displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load with the following error: "Error fetching graph data."

The Indicator Protocol Beta Dashboard charts are populated using data from Log Cache, which is a component of Loggregator. The charts may fail to load if Log Cache times out while processing the data.

No corrective action is required. The issue usually resolves on its own.

Infinite Login Redirect When Using Private Domain Suffixes

Certain private domain suffixes, such as .local or .a, result in an infinite redirect loop when trying to access the Healthwatch UI.

A workaround is to set the SKIP_CERT_VERIFY environment variable to true on the Healthwatch app.

For the canonical list of public suffixes, see the Public Suffix List.

v1.6.3

Release Date: September 17, 2019

Features

New features and changes in this release:

  • Remove the metric, graph, and alert associated with Route Registration Messages Delta. This metric was removed in PAS 2.4, so related graphs and alerts should not display. Any currently firing alert for this metric resolves automatically.
  • [Bug Fix] Correctly handle rotation of root Certificate Authorities.
  • [Bug Fix] Correct the threshold for Syslog Adapter Capacity.
  • [Bug Fix] Reduce noisiness of system.healthy alerts when a BOSH VM is created or deleted.
  • [Bug Fix] If healthwatch-ingestor fails to receive data after 15 seconds, it will automatically reset its Spring Application Context to re-establish a Firehose connection. After 20 resets of the Spring Application Context, the app instance will purposely crash and let Diego re-schedule it, providing a fresh container and JVM instance.
  • [Bug Fix] Fix healthwatch-ingestor crash in cases where GoRouter receives an HTTP request with a non-standard HTTP method, resulting in an HttpStartStop metric with a null HTTP method value.
  • [Bug Fix] Setting Redis Worker Count in the Healthwatch Component Config page of Ops Manager now changes the number of instances. Previously, changes to this field were not reflected in the Healthwatch deployment.
  • [Bug Fix] Delete orphaned cf-health-check smoke-test-app instances regularly. Previously, cf-health-check would occasionally fail to delete a smoke test and never cleaned it up.
  • [Bug Fix] Fix occasional inaccurate spikes in Log Transport Throughput graph.

  • Maintenance update of the following dependencies:

    • Golang now 1.12.9
    • Java now 1.8.0_222-b10
    • Indicator Protocol now 0.7.16
    • Spring Boot now 2.1.7
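
The healthwatch-ingestor recovery behavior described in the bug fix list above (reset after 15 seconds without data, crash after 20 resets) can be sketched as a small state machine. This is a hypothetical model in Python, not the Healthwatch implementation, which is a Spring application. The class and method names are invented for illustration, and the sketch assumes the reset counter is never cleared, which the notes do not specify.

```python
import time

DATA_TIMEOUT_SECONDS = 15  # from the release note
MAX_RESETS = 20            # from the release note


class IngestorWatchdog:
    """Hypothetical model of the reset-then-crash behavior."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_data_at = clock()
        self.resets = 0

    def on_data(self):
        """Record that data arrived from the Firehose."""
        self._last_data_at = self._clock()

    def check(self):
        """Return 'ok', 'reset', or 'crash' for the current moment."""
        if self._clock() - self._last_data_at <= DATA_TIMEOUT_SECONDS:
            return "ok"
        if self.resets >= MAX_RESETS:
            # Assumption: after 20 resets the next timeout ends the process,
            # so the platform can schedule a fresh container.
            return "crash"
        self.resets += 1
        self._last_data_at = self._clock()  # restart the timer after a reset
        return "reset"
```

A caller would poll check() periodically, rebuild its connection on "reset", and exit the process on "crash" so that Diego reschedules the instance.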

Known Issues

This release has the following known issues.

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the app bosh-health-check, which deploys and deletes a VM every 10 minutes. vSphere versions earlier than v6.5 update 2 leave a namespace or folder and subfolders when a VM is deleted. The orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information about the vSAN known issue, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. Or, stop the bosh-health-check to slow down the increase in vSAN object count.

Indicator Protocol Beta Dashboard Displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load with the following error: "Error fetching graph data."

The Indicator Protocol Beta Dashboard charts are populated using data from Log Cache, which is a component of Loggregator. The charts may fail to load if Log Cache times out while processing the data.

No corrective action is required. The issue usually resolves on its own.

Multiple healthwatch_space_developer Users Created During Healthwatch Re-installation

When the PCF Healthwatch tile is re-installed, the push-apps errand creates a duplicate healthwatch_space_developer user because the pre-existing user was not deleted when the tile was previously deleted.

This causes the cf-health-check to fail due to an invalid password for the healthwatch_space_developer user.

Infinite Login Redirect When Using Private Domain Suffixes

Certain private domain suffixes, such as .local or .a, result in an infinite redirect loop when trying to access the Healthwatch UI.

A workaround is to set the SKIP_CERT_VERIFY environment variable to true on the Healthwatch app.

For the canonical list of public suffixes, see the Public Suffix List.

v1.6.2 – Withdrawn

This release has been removed from Pivotal Network.

Release Date: September 11, 2019

Features

See the release notes for v1.6.3.

Known Issues

Flyway migration fails during upgrade

PCF Healthwatch v1.6.2 contains a faulty Flyway migration that causes upgrades from PCF Healthwatch v1.5 to fail. Due to this issue, PCF Healthwatch v1.6.2 is no longer available on Pivotal Network.

Install or upgrade to PCF Healthwatch v1.6.3 instead.

Reverse Log Proxy Loss Rate Alert Fires

Occasionally, the Reverse Log Proxy Loss Rate alert fires even though the metric was removed in Healthwatch v1.5 and later.

Infinite login redirect when using private domain suffixes

Certain private domain suffixes, such as .local or .a, result in an infinite redirect loop when trying to access the Healthwatch UI.

A workaround is to set the SKIP_CERT_VERIFY environment variable to true on the Healthwatch app. A bug fix is included in v1.6.4.

For the canonical list of public suffixes, see https://publicsuffix.org/list/public_suffix_list.dat.

v1.6.1

Release Date: July 8, 2019

Features

New features and changes in this release:

Known Issues

This release has the following known issues.

Incorrect Upgrade Requirements

Tile metadata in PCF Healthwatch states that a user can upgrade directly from Healthwatch v1.3 to v1.6, but the statement is incorrect. To upgrade successfully to v1.6, you need PCF Healthwatch v1.5 or later.

The metadata statement is corrected in Healthwatch v1.6.2 and later.
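
The upgrade requirement above amounts to a minimum source version check. The following is a minimal sketch of that rule; the helper names are hypothetical and not part of Ops Manager or the tile.

```python
def parse_version(text: str) -> tuple:
    """Turn a string such as 'v1.5.2' into a tuple of ints, (1, 5, 2)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# From the release note: upgrading to v1.6 requires v1.5 or later.
MIN_SOURCE_FOR_1_6 = (1, 5)


def can_upgrade_to_1_6(current: str) -> bool:
    """Return True if `current` is a supported starting point for v1.6."""
    return parse_version(current)[:2] >= MIN_SOURCE_FOR_1_6


print(can_upgrade_to_1_6("v1.3.2"))  # False
print(can_upgrade_to_1_6("v1.5.0"))  # True
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically.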

Cell Health Check graph is not showing correctly

On the Compute Performance page, the Cell Health Check graph shows no data. Upgrading to Healthwatch v1.6.3 or later fixes this issue. Alerting on the underlying metric, rep.UnhealthyCell, is unaffected.

Occasional inaccurate spikes in Log Transport Throughput graph

The graph might spike inaccurately during a deployment upgrade or when the Loggregator system is overloaded.

Ineffectual “Redis Worker Count” property in tile configuration

Setting Redis Worker Count in the PCF Healthwatch tile does not change the number of instances of the healthwatch-worker app.

This issue is fixed in Healthwatch v1.6.3 and later.

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the app bosh-health-check, which deploys and deletes a VM every 10 minutes. vSphere versions earlier than v6.5 update 2 leave a namespace or folder and subfolders when a VM is deleted. The orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information about the vSAN known issue, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. Or, stop the bosh-health-check to slow down the increase in vSAN object count.

Indicator Protocol Beta Dashboard Displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load with the following error: "Error fetching graph data."

The Indicator Protocol Beta Dashboard charts are populated using data from Log Cache, which is a component of Loggregator. The charts may fail to load if Log Cache times out while processing the data.

No corrective action is required. The issue usually resolves on its own.

Multiple healthwatch_space_developer CF Users on Healthwatch Re-install

The push-apps errand, which runs during tile installation, creates a user for the cf-health-check test. This user is not deleted when the tile is deleted, so if the tile is re-installed, the cf-health-check fails because it uses an invalid password for the pre-existing healthwatch_space_developer user.

Reverse Log Proxy Loss Rate Alert Fires

Occasionally, the Reverse Log Proxy Loss Rate alert fires even though the metric was removed in Healthwatch v1.5 and later.

v1.6.0

Release Date: June 19, 2019

Features

New features and changes in this release:

  • Updates Healthwatch Charts with new features:
    • Drag a selection to zoom in.
    • Double click to zoom out.
    • More visibility around missing data.
    • Legend with filters.
  • Renames Log Transport Dropped Messages chart to Log Transport Dropped Ingress Messages. The Log Transport Dropped Ingress Messages chart graphs the doppler.dropped, direction: ingress metric.
  • Adds Log Transport Dropped Egress Messages chart on the Logging Performance page. This graphs the doppler.dropped, direction: egress metric. For more information about the Doppler Egress Dropped Messages KPI, see Doppler Egress Dropped Messages.
  • PCF Healthwatch apps connect to the internal MySQL database using TLS.
  • Increases logging around failed Cloud Foundry Command Line Interface (cf CLI) tests in the cf-health-check app.
  • Binary log retention for internal MySQL database changed from 7 days to 2 days. This reduces the amount of persistent storage used by the VM.
  • [Bug Fix] Fixes Log Transport Loss Rate alert markers rendering on multiple charts.
  • [Bug Fix] Fixes Healthwatch Has Missing or Incorrect Data by more robustly determining the deployment tag.
  • [Bug Fix] Diego Cell Capacity page graphs no longer show false drops in capacity caused by an occasional late metric. Previously, if Diego emitted a metric outside the standard one-minute window, the Diego Cell Capacity graphs showed a false drop.
  • [Bug Fix] Fixes Healthwatch Cannot Start if ‘0’ aliased certificate is present in indicator keystore.
  • [Bug Fix] Fixes regression where opsman-health-check doesn’t work for self-signed Ops Manager certificate.
  • [Bug Fix] BOSH Director stoplight correctly turns red when bosh-health-check fails.
  • [Bug Fix] Correctly account for half-hour timezones in the Healthwatch UI.
  • [Bug Fix] If healthwatch-ingestor fails to receive data after 15 seconds, it will automatically reset its Spring Application Context to re-establish a Firehose connection.

  • Maintenance update of the following dependencies:

    • pxc-release now v0.15.0
    • Golang now v1.12.5
    • Indicator Protocol now v0.7.14
    • Spring Boot now 2.1.5
    • Flyway Command-line and Library now v5.2.4
    • Redis now v3.2.13
    • CF CLI now v6.45.0
    • Libraries Updated:
      • com.google.protobuf:protobuf-java now v3.7.1
      • io.projectreactor.ipc:reactor-netty now v0.7.15.RELEASE
      • react-markdown now v4.0.8

Known Issues

This release has the following known issues.

Incorrect Upgrade Requirements

Tile metadata in PCF Healthwatch states that a user can upgrade directly from Healthwatch v1.3 to v1.6, but the statement is incorrect. To upgrade successfully to v1.6, you need PCF Healthwatch v1.5 or later.

The metadata statement is corrected in Healthwatch v1.6.2 and later.

Occasional inaccurate spikes in Log Transport Throughput graph

The graph might spike inaccurately during a deployment upgrade or when the Loggregator system is overloaded.

Ineffectual Redis Worker Count Property in Tile Configuration

Setting Redis Worker Count in the PCF Healthwatch tile does not change the number of instances of the healthwatch-worker app.

This issue is fixed in Healthwatch v1.6.3.

Log Transport Dropped Egress Messages Graph Not Displaying

If all values for loggregator.doppler.dropped.egress are 0, the Log Transport Dropped Egress Messages graph will not display.

This issue is fixed in Healthwatch v1.6.1.

Reverse Log Proxy Egress Dropped Messages Graph Not Displaying

If there are no cf-syslog-drain metrics emitted, the Reverse Log Proxy Egress Dropped Messages graph will not display.

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the app bosh-health-check, which deploys and deletes a VM every 10 minutes. vSphere versions earlier than v6.5 update 2 leave a namespace or folder and subfolders when the VM is deleted. The orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information about the vSAN known issue, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. Or, you can stop the bosh-health-check to slow down the increase in vSAN object count.

Indicator Protocol Beta Dashboard Displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load with the following error: "Error fetching graph data."

The Indicator Protocol Beta Dashboard charts are populated using data from Log Cache, which is a component of Loggregator. The charts may fail to load if Log Cache times out while processing the data.

No corrective action is required. The issue usually resolves on its own.

Multiple healthwatch_space_developer CF Users on Healthwatch Re-install

The push-apps errand, which runs during tile installation, creates a user for the cf-health-check test. This user is not deleted when the tile is deleted, so if the tile is re-installed, the cf-health-check fails because it uses an invalid password for the pre-existing healthwatch_space_developer user.

Reverse Log Proxy Loss Rate Alert Fires

Occasionally, the Reverse Log Proxy Loss Rate alert fires even though the metric was removed in Healthwatch v1.5 and later.