PCF Healthwatch v1.5 Release Notes

v1.5.5

Release Date: September 23, 2019

Features

New features and changes in this release:

  • [Feature Improvement] Remove the metric, graph, and alert associated with Route Registration Messages Delta when running PAS v2.4+. This metric was removed in PAS v2.4, so the related graph and alert should not display. The currently firing alert, if any, is resolved automatically.
  • [Bug Fix] Correct the threshold for Syslog Adapter Capacity.
  • [Bug Fix] Reduce noisiness of system.healthy alerts when a BOSH VM is created or deleted.
  • [Bug Fix] If healthwatch-ingestor fails to receive data after 15 seconds, it will automatically reset its Spring Application Context to re-establish a Firehose connection. After 20 resets of the Spring Application Context, the app instance will purposely crash and let Diego re-schedule it, providing a fresh container and JVM instance.
  • [Bug Fix] Fix healthwatch-ingestor crash in cases where GoRouter receives an HTTP request with a non-standard HTTP method, which results in an HttpStartStop metric with a null HTTP method value.
  • [Bug Fix] Setting Redis Worker Count in the Healthwatch Component Config page of Ops Manager successfully changes instance number. Previously, changes to this field were not reflected in the Healthwatch deployment.
  • [Bug Fix] Delete orphaned cf-health-check smoke-test-app instances regularly. Previously, cf-health-check would occasionally fail to delete a smoke test and never cleaned it up.
  • [Bug Fix] Fix occasional inaccurate spikes in Log Transport Throughput graph.
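The healthwatch-ingestor recovery policy described above (reset the connection after a silence timeout, then crash after repeated resets so the scheduler supplies a fresh instance) can be sketched as follows. This is an illustrative Python sketch, not the actual Java/Spring implementation; the class and method names are invented for the example:

```python
# Illustrative sketch of the ingestor's recovery policy — not Healthwatch
# source code. Thresholds match the behavior described in the bug fix above.
SILENCE_TIMEOUT_SECONDS = 15
MAX_RESETS_BEFORE_CRASH = 20

class FirehoseWatchdog:
    def __init__(self, reset_connection, crash):
        self.reset_connection = reset_connection  # re-establish the Firehose connection
        self.crash = crash                        # exit so the scheduler replaces the instance
        self.resets = 0
        self.seconds_since_data = 0

    def on_data(self):
        # Any received data resets the silence timer.
        self.seconds_since_data = 0

    def tick(self):
        """Called once per second by a timer."""
        self.seconds_since_data += 1
        if self.seconds_since_data >= SILENCE_TIMEOUT_SECONDS:
            self.seconds_since_data = 0
            self.resets += 1
            if self.resets >= MAX_RESETS_BEFORE_CRASH:
                self.crash()            # 20th reset: crash instead of resetting again
            else:
                self.reset_connection() # resets 1-19: rebuild the connection
```

The crash-on-repeated-failure step trades a brief outage for a clean container and JVM, which is often simpler than trying to repair a wedged connection in place.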

Known Issues

This release has the following known issues:

Reverse Log Proxy Egress Dropped Messages Graph Not Displaying

If no cf-syslog-drain metrics are emitted, the Reverse Log Proxy Egress Dropped Messages graph does not display.

v1.5.4

Release Date: July 3, 2019

Features

New features and changes in this release:

  • [Feature Improvement] Increases logging of failed Cloud Foundry Command Line Interface (cf CLI) tests in the cf-health-check app.
  • [Feature Improvement] Reduces binary log retention for internal MySQL database from 7 days to 2 days. This reduces the amount of persistent storage used by the VM.
  • [Bug Fix] Fixes rendering of Log Transport Loss Rate alert markers on multiple charts. For more information about this alert, see Log Transport Loss Rate.
  • [Bug Fix] Fixes Healthwatch Has Missing or Incorrect Data by more accurately determining the deployment tag.
  • [Bug Fix] The Diego Cell Capacity graphs do not show false drops in capacity due to occasionally late metrics. Previously, if Diego emitted a metric outside the standard one-minute window, Diego Cell Capacity graphs showed a false drop in capacity.
  • [Bug Fix] Fixes Healthwatch Cannot Start if a certificate with the alias ‘0’ is present in the indicator keystore.
  • [Bug Fix] Fixes regression where opsman-health-check does not work for self-signed Ops Manager certificates.
  • [Bug Fix] BOSH Director stoplight correctly turns red when bosh-health-check fails.
  • [Bug Fix] Correctly account for half-hour timezones in the PCF Healthwatch UI.
  • [Bug Fix] If healthwatch-ingestor fails to receive data after 15 seconds, it will automatically reset its Spring Application Context to re-establish a Firehose connection.

  • Maintenance update of the following dependencies:

    • pxc-release now v0.15.0
    • Golang now v1.12.5
    • Indicator Protocol now v0.7.14
    • Spring Boot now v2.1.5
    • Flyway Command-line and Library now v5.2.4
    • Redis now v3.2.13
    • CF CLI now v6.44.1
    • Libraries Updated:
      • com.google.protobuf:protobuf-java now v3.7.1
      • io.projectreactor.ipc:reactor-netty now v0.7.15.RELEASE
      • react-markdown now v4.0.8

Known Issues

This release has the following known issues:

Ineffectual Redis Worker Count Property in Tile Configuration

Setting Redis Worker Count in the PCF Healthwatch tile does not change the number of instances of the healthwatch-worker app.

This issue has been fixed in Healthwatch v1.5.5.

Reverse Log Proxy Egress Dropped Messages Graph Not Displaying

If no cf-syslog-drain metrics are emitted, the Reverse Log Proxy Egress Dropped Messages graph does not display.

Occasional inaccurate spikes in Log Transport Throughput graph

The graph might spike inaccurately during a deployment upgrade or when the Loggregator system is overloaded.

Log Transport Loss Rate Graph y-axis Breaks With Negative Values

If some metrics are missing from the time period used in a calculation, the calculated value can be negative. The y-axis does not handle negative values and breaks for as long as a negative value is in scope: the y-axis markers are skewed and appear much larger than the actual values. Values in the line, tooltips, and alerts all remain valid.

This issue resolves itself when no negative value is displayed on the graph.

v1.5.2

Release Date: March 29, 2019

Features

New features and changes in this release:

  • Adds Slow Consumers chart on the Logging Performance page that graphs the doppler_proxy.slow_consumer metric. For more information about the Slow Consumer Drops KPI, see Slow Consumer Drops.
  • Replaces Reverse Log Proxy Loss Rate chart on the Logging Performance page with a Reverse Log Proxy Egress Dropped Messages chart that graphs the rlp.dropped, direction: egress metric. For more information about the Reverse Log Proxy Egress Dropped Messages KPI, see Reverse Log Proxy Egress Dropped Messages.
  • Adds Log Cache Cache Duration chart on the Logging Performance page that graphs the log_cache.cache_period metric. For more information about the Log Cache Cache Duration KPI, see Key Capacity Scaling Indicators.
  • Updates Syslog Drain Binding Capacity chart on the Logging Performance page to calculate using the cf-syslog-drain.adapter.drain_bindings metric instead of the cf-syslog-drain.scheduler.drains metric. For more information about the CF Syslog Drain Bindings Count KSI, see CF Syslog Drain Bindings Count.
  • Adds the ability to delete alert configurations with HAPI. For more information, see Configuring PCF Healthwatch Alerts.
  • Displays a banner on the Healthwatch dashboard when the Redis queue size reaches a Critical state and Healthwatch might not be able to evaluate the health of the foundation.
  • Improves log output when bosh-health-check deployment creation or deletion fails.
  • Updates default threshold of locket.ActiveLocks alert configuration from 4 to 5.
  • Updates BOSH CLI to v5.4.0.
  • Updates Loggregator Agent to v3.0.

Known Issues

This release has the following known issues:

Ineffectual Redis Worker Count Property in Tile Configuration

Setting Redis Worker Count in the PCF Healthwatch tile does not change the number of instances of the healthwatch-worker app.

This issue has been fixed in Healthwatch v1.5.5.

Occasional inaccurate spikes in Log Transport Throughput graph

The graph might spike inaccurately during a deployment upgrade or when the Loggregator system is overloaded.

Healthwatch Has Missing or Incorrect Data

There are multiple symptoms of this error:

  • Graphs show missing data on multiple pages.
  • The Diego & Cloud Controller Synced Check graph is red, but the bbs.Domain.cf-apps metric is 1 in the Firehose.

In either case, this is caused by the Healthwatch push-apps errand incorrectly determining the deployment tag that metrics are emitted with.

This issue has been fixed in Healthwatch v1.5.3 and later.

Log Transport Loss Rate Alert Markers Render on the Wrong Charts

The Log Transport Loss Rate alert markers render on charts other than the Log Transport Loss Rate chart.

Active Locks Alert Threshold Changes When PAS Feature Disabled

Disable Zero Downtime App Deployments is an optional configuration in PAS v2.4 that changes the recommended alert threshold of the Active Locks KPI from 5 to 4. For more information about the Active Locks KPI, see Active Locks.

This configuration is in the Advanced Features pane of the PAS tile. The corresponding manifest property is advanced_features.properties.cloud_controller_temporary_disable_deployments.

If you select Disable Zero Downtime App Deployments in the PAS tile, or if you use PCF Healthwatch v1.5 with PAS v2.3, use the PCF Healthwatch API to update the Active Locks alert threshold to 4. To do this, send an API request with the following JSON body:

{"query":"origin == 'locket' and name == 'ActiveLocks'","threshold":{"critical":4,"type":"EQUALITY"}}

For more information, see Update Alert Configurations.
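As a sketch, the request body above can be built and validated in Python. The endpoint URL and token shown in the comment are placeholders, not confirmed values; see Update Alert Configurations for the actual endpoint and authentication details:

```python
import json

# Sketch only — not authoritative API usage. Build the alert-configuration
# payload described above; the endpoint URL and token in the comment below
# are placeholders for your environment.
payload = {
    "query": "origin == 'locket' and name == 'ActiveLocks'",
    "threshold": {"critical": 4, "type": "EQUALITY"},
}
body = json.dumps(payload)
print(body)

# Sending it might look like this (placeholders in caps):
#   import urllib.request
#   req = urllib.request.Request(
#       "https://HEALTHWATCH-API.SYSTEM-DOMAIN/ALERT-CONFIG-PATH",
#       data=body.encode(), method="POST",
#       headers={"Authorization": "Bearer UAA-TOKEN",
#                "Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```

Building the body from a dict rather than hand-escaping a string avoids the quoting errors that the escaped form invites.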

Action Required when Changing Metrics Deployment Name in PAS Tile

PAS v2.4 introduces the ability to uniquely identify metrics by tile. It uses cf-GUID as the value for deployment, which matches the BOSH deployment name. In PAS v2.3 and earlier, metrics have a deployment value of cf. The Advanced Features pane of the PAS tile includes a Use “cf” as deployment name in emitted metrics instead of unique name checkbox to override this new feature and revert to previous behavior.

If you change the value of the Use “cf” as deployment name in emitted metrics instead of unique name checkbox in the PAS v2.4 tile, you must run the Healthwatch Push Monitoring Components errand. Healthwatch does not detect the change in PAS configuration unless you run this errand.

For 24 hours after the configuration change, Healthwatch handles metric data in the following ways:

  • Data emitted with the previous cf tag is treated as an Isolation Segment.
  • Data emitted with the new cf-GUID tag is treated as the default CF deployment.

During this window, use the Isolation Segment dropdown on the Capacity and Routing detail pages to view the data from before the configuration change. You can toggle between cf and cf-GUID.

For more information, see the PAS 2.4 release notes.

Disk Slowly Fills When Using vSAN with Healthwatch Leads

The vSAN object count increases on vSphere versions earlier than v6.5 update 2.

Healthwatch deploys the application bosh-health-check, which deploys and deletes a VM every 10 minutes. On vSphere versions earlier than v6.5 update 2 (vSAN versions are tied to vSphere versions), deleting the VM leaves behind a namespace or folder and subfolders. These orphaned folders cause the vSAN object count to increase. This is a known issue for vSAN. For more information, see Deleted VMs leave components behind in GitHub.

To address the issue, update vSphere to v6.5 update 2 or later. If updating vSphere is not an option, stop bosh-health-check to slow the growth of the vSAN object count.

Healthwatch Periodically Registers a False Drop In Diego Cell Capacity

Healthwatch ingests metrics from Diego once per minute. Occasionally, Diego emits metrics to Healthwatch outside of the minute window. This causes Healthwatch to register a false drop in the Diego Cell Capacity metric.

If the drop in Diego Cell Capacity is not longer than one minute, it does not represent a true drop in Diego Cell Capacity and can be disregarded.
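The mechanism can be sketched with a toy example (illustrative only, not Healthwatch code): bucketing samples by minute shows how one late sample leaves its intended minute empty, which a graph renders as a drop to zero.

```python
from collections import defaultdict

# Illustrative only — not Healthwatch code. Group capacity samples into
# one-minute buckets; a sample that arrives after its minute has closed
# lands in a later bucket, leaving its own minute empty.
def bucket_by_minute(samples):
    """samples: list of (unix_timestamp, value) tuples."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // 60].append(value)
    return dict(buckets)

# Diego emits roughly once per minute, but the sample intended for
# minute 2 (the 120-179s window) arrives at t=185 instead:
samples = [(30, 100), (90, 100), (185, 100), (210, 100)]
buckets = bucket_by_minute(samples)
print(sorted(buckets))  # → [0, 1, 3] — minute 2 is missing, a false "drop"
```

Because the late sample still arrives, the gap lasts at most one window, which is why a sub-minute drop can safely be disregarded.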

Tiles without BPM cannot be installed with Indicator Protocol Enabled

If you enable Indicator Protocol and then redeploy a tile that does not include the BPM job on all of its VMs, the deployment fails while updating runtime configs during the next Apply Changes. Here is an example of the failure for MySQL:

```
Task 5612 | 17:42:32 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 5612 | 17:42:32 | Updating instance dedicated-mysql-broker: dedicated-mysql-broker/7bd716ce-c007-4f3f-ae3b-dac4a7b5a321 (0) (canary) (00:05:33)
                       L Error: 'dedicated-mysql-broker/7bd716ce-c007-4f3f-ae3b-dac4a7b5a321 (0)' is not running after update. Review logs for failed jobs: indicator-registration-agent
Task 5612 | 17:48:05 | Error: 'dedicated-mysql-broker/7bd716ce-c007-4f3f-ae3b-dac4a7b5a321 (0)' is not running after update. Review logs for failed jobs: indicator-registration-agent

Task 5612 Started  Mon May  6 17:42:24 UTC 2019
Task 5612 Finished Mon May  6 17:48:05 UTC 2019
Task 5612 Duration 00:05:41
Task 5612 error

Updating deployment:
  Expected task '5612' to succeed but state is 'error'

Exit code 1
===== 2019-05-06 17:48:05 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=10.0.0.5 --deployment=pivotal-mysql-528905099ffd0ea3a034 deploy /var/tempest/workspaces/default/deployments/pivotal-mysql-528905099ffd0ea3a034.yml"; Duration: 342s; Exit Status: 1
Exited with 1.
```

The error suggests reviewing the logs for indicator-registration-agent, but there are no logs in the /var/vcap/sys/log/indicator-registration-agent directory of the failing BOSH VM.

The radio button in question is in the Healthwatch tile configuration in Ops Manager.

This occurs because the BPM release and job are not included in the runtime config used to enable Indicator Protocol.

The registration agent depends on BPM to schedule the scraping of the file system for indicator documents.

The only known workaround is to not enable Indicator Protocol, or to disable it if it is already enabled.

Healthwatch Cannot Start if Multiple Root Certificate Authorities are Present in Ops Manager

Healthwatch does not start if there are multiple root certificate authorities (CAs) in Ops Manager. This issue occurs because Indicator Protocol cannot load a file that contains multiple certificates.

This issue occurs whether or not you select the radio button to enable Indicator Protocol.

This issue may occur after certificate rotation if you generate a new CA and do not delete the old CA.

The known workaround is to delete old CAs in Ops Manager. For more information, see Rotating Certificates.
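Before deleting anything, you can check whether a CA file actually contains multiple certificates. This is a quick diagnostic sketch, not an official tool; the example input is fake, truncated PEM data:

```python
# Quick diagnostic sketch (not an official tool): count the PEM
# certificates in a CA file. A count greater than 1 indicates the
# multiple-certificate condition described above.
def count_pem_certificates(pem_text):
    return pem_text.count("-----BEGIN CERTIFICATE-----")

# Example with two concatenated (fake, truncated) certificates:
two_cas = (
    "-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
    "-----BEGIN CERTIFICATE-----\nMIIC...\n-----END CERTIFICATE-----\n"
)
print(count_pem_certificates(two_cas))  # → 2
```

In practice you would read the exported CA file from disk and pass its contents to the function.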

Indicator Protocol Beta Dashboard displays Error Due to Log Cache

Occasionally, the Indicator Protocol Beta Dashboard charts fail to load and display the error "Error fetching graph data."

These charts are populated using Log Cache, which is part of Loggregator. The charts fail periodically when Log Cache times out while processing the data.

No corrective action is required; the charts typically recover on their own.