PCF Healthwatch v1.1 Release Notes

Releases

v1.1.9

Release Date: July 17, 2018

  • Because PCF now emits the UAA Request Latency metric, the Request Latency by UAA chart on the UAA details page displays data when running PAS v2.0.11 or later.
  • [Bug Fix] In prior versions of PCF Healthwatch v1.1.x, metrics emitted by PCF Healthwatch were not tagged with the Foundation name. This is now resolved.
  • [Bug Fix] Updated how the stemcell number is referenced in the tile manifest. Prior to this fix BOSH would misinterpret a stemcell number ending in 0 and incorrectly drop the 0.
  • PCF Healthwatch now leverages pre-compiled releases in order to reduce the deployment time necessary during the Apply Changes flow.
  • The default threshold values for Syslog Adapter Capacity shown in the UI have been modified: 250 critical and 200 warning.
  • The stemcell for v1.1.9 is now v3541.
  • Maintenance update of the following product dependencies:
    • Golang now v1.9.7
    • Java GRPC now v1.13.1
    • Spring now Brussels-SR11
    • Flyway Command-line now v5.1.3
    • CF CLI now v6.37.0
    • OpenJDK now v1.8.0_172-b11
    • Loggregator now v99.1
    • Consul now v196
    • CF-MySQL now 36.14.0
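
The stemcell fix above is consistent with a version string being read as a number somewhere along the way. The notes do not describe the mechanism, so the following is only a minimal sketch, assuming an unquoted value is parsed as a float. The version 3541.10 and both function names are hypothetical and used for illustration only.

```python
def parse_as_number(raw: str) -> str:
    """Simulate a reader that treats an unquoted version as a float."""
    return str(float(raw))


def parse_as_text(raw: str) -> str:
    """Simulate a reader that keeps a quoted version as a string."""
    return raw


version = "3541.10"  # hypothetical stemcell version ending in 0

print(parse_as_number(version))  # the trailing zero is lost
print(parse_as_text(version))    # the version is preserved exactly
```

Treating the version as text end to end avoids the loss, which is why quoting version-like values in manifests is a common precaution.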

Known Issues

  • If Healthwatch is upgraded while the BOSH Health Check runs, the BOSH Director SLI will show an error. Details on troubleshooting can be found here.
  • Does not cover monitoring of Isolation Segments.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.1.8

Release Date: May 14, 2018

  • [Bug Fix] Fixes a bug identified in prior versions where two-factor authentication being enabled for the BOSH Director caused the BOSH Health Check continuous validation test and the BOSH Task Check to fail.
  • Maintenance update of the following product dependencies:
    • PushApps now v0.0.53

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.1.9.
  • If Healthwatch is upgraded while the BOSH Health Check runs, the BOSH Director SLI will show an error. Details on troubleshooting can be found here.
  • Due to a known metric emission issue with PCF, the Request Latency by UAA chart will be blank on the UAA details page if not on PCF v2.0.11 or later.
  • Does not cover monitoring of Isolation Segments.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.1.7

Release Date: April 26, 2018

  • [Feature] For result metrics published as part of PCF Healthwatch continuous testing, Healthwatch now also publishes the configured test frequency along with the resulting metric. The frequency value is passed into the Firehose as a key-value tag.
  • Maintenance update of the following product dependencies:
    • Golang now v1.8.7
    • OpenJDK now v1.8.0.162
    • Spring now Brussels-SR9
    • CF CLI now v6.36.1
    • Flyway Command-line now v5.0.7
    • cloudfoundry-incubator/push-apps now v0.0.51

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.1.9.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • Due to a known metric emission issue with PCF, the Request Latency by UAA chart will be blank on the UAA details page.
  • Does not cover monitoring of Isolation Segments.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.1.6

Release Date: March 20, 2018

  • To support PCF Healthwatch v1.1 running on either PCF v2.0 or PCF v2.1, the Firehose Loss Rate calculation has been updated to adjust for the underlying source metric differences between PCF v2.0 and PCF v2.1.
  • [Bug Fix] Chrome v65.0.3325.162, the newest version at the time of this release, caused several charts in Healthwatch to display alert banding even when no threshold had been breached. This was due to a change in how Chrome handles SVG. v1.1.6 resolves this UI issue on newer Chrome browser versions.
  • The UAA details page is now also compatible with the PCF Small Footprint PAS tile.

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.1.9.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • Due to a known metric emission issue with PCF, the Request Latency by UAA chart will be blank on the UAA details page.
  • Does not cover monitoring of Isolation Segments.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.1.5

Release Date: March 5, 2018

  • [Bug Fix] Fixes a bug identified in v1.1.4 where the new Java-based Push Monitoring Components for PCF Healthwatch errand could fail in some environments that use self-signed certs.
  • [Bug Fix] Improves the performance of the Ingestor component (firehose nozzle). On some environments the Ingestors would experience high CPU and disconnect repeatedly from the firehose, risking loss of firehose-based data in Healthwatch.
    • As a result of this effort, the Ingestor and Loader have been combined and are no longer separate components. As of v1.1.5, the Ingestor now feeds directly to the datastore. These combined components still report on themselves in the same manner for product monitoring.
  • [Bug Fix] Product-created metrics derived from firehose-based metric data did not properly account for the null scenario that results from a complete loss of firehose data. These metrics were still calculated and published, producing a value of 0. For most produced metrics, a 0 value indicates an issue. However, for the logging Loss Rate metrics, where 0 is the healthy value, this could have hidden an underlying metric feed problem. As of v1.1.5, when the values needed for a metric assessment are not present, the related Healthwatch-produced metric is not calculated or published until the platform data flow is restored. For external metric consumers, a stop in production of these metrics therefore indicates a potential issue with Healthwatch itself.
    • Current Healthwatch metrics resulting from firehose data: Firehose Loss Rate, Adapter Loss Rate, Reverse Log Proxy Loss Rate, Syslog Drain Binding Capacity, Number of Available Free Chunks of Memory, Percentage of Memory Available, Total Memory Available, Percentage of Disk Available, Total Disk Available, Percentage of Cell Container Capacity Available, Total Cell Container Capacity Available
  • [Feature] A new page is now available that visualizes the recently published UAA KPI recommendations for PCF 2.0. It is available at healthwatch.<system>/uaa/details
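
The null-handling fix for firehose-derived metrics described above can be sketched as follows. This is an illustration only: the notes do not give the Healthwatch loss rate formula, so the calculation, the function names, and the metric name used here are hypothetical. The point shown is that missing source data yields no published value rather than a misleading 0.

```python
from typing import List, Optional, Sequence, Tuple


def loss_rate(sent: Sequence[float], received: Sequence[float]) -> Optional[float]:
    """Return the fraction of messages lost, or None when source data is missing."""
    if not sent or not received:
        return None  # complete loss of firehose data: nothing to assess
    total_sent = sum(sent)
    if total_sent == 0:
        return 0.0
    return max(0.0, (total_sent - sum(received)) / total_sent)


def publish(name: str, value: Optional[float], sink: List[Tuple[str, float]]) -> bool:
    """Append the metric to the sink only when a value could be calculated."""
    if value is None:
        return False  # skip publication until the data flow is restored
    sink.append((name, value))
    return True


sink: List[Tuple[str, float]] = []
publish("example.LossRate", loss_rate([100.0], [100.0]), sink)  # healthy 0.0 is published
publish("example.LossRate", loss_rate([], []), sink)            # no data, nothing published
print(sink)
```

With this shape, a consumer can tell a healthy 0 apart from an absent metric, which is the distinction the fix is meant to restore.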

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.1.9.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • UAA details page is not compatible with the PCF Small Footprint PAS tile.
  • Due to a known metric emission issue with PCF, the Request Latency by UAA chart will be blank on the UAA details page.
  • Does not cover monitoring of Isolation Segments.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.1.4

Release Date: February 8, 2018

  • [Bug Fix] Switches the servlet container from Tomcat to Jetty. This resolves reported issues with Healthwatch installation failing on a non-RFC 1918 network.
  • [Bug Fix] Switches the underlying push-apps errand script from a Bash script to Kotlin. This is expected to resolve issues on some Azure installations where the CLI was timing out before the Healthwatch push-apps errand could complete successfully.
  • [Bug Fix] Fixes an issue identified in v1.1.3 where the Logging Throughput and Loss Rate calculations were potentially underreporting.
  • [Bug Fix] Fixes an issue identified in v1.1.3 and earlier where the CF CLI credentials were visible in the push_apps script logs.
  • [Bug Fix] The CLI Command Health Check app did not declare the amount of memory it needed and therefore relied on the system default. In some environments this could leave too little memory available to start the app successfully. Other packaged test apps already declared the memory they needed. To resolve this, the CLI Command Health Check app now explicitly declares a 1GB memory allocation on push.
  • [Feature] The existing manifest capability to Disable Ops Manager Continuous Validation Testing has been exposed as a configuration property within the Ops Manager UI for the Healthwatch tile configuration. This enable/disable choice is available on the Health Check configuration screen within the Healthwatch tile settings screen. The default value is Enable.
    • Note: If you had previously turned this test off within the manifest prior to v1.1.4, please validate your setting is Disable before applying changes to upgrade to this release version.
  • [Feature] Healthwatch now creates and publishes 3 additional metrics regarding Capacity Available. These are useful for downstream consumers wanting to monitor against a given available capacity value, instead of, or in complement to, the percentage-based available capacity metrics already published.
  • [Feature] The main dashboard will now display the Foundation name. This displayed name will be the name configured, or the system domain if this default value was not updated.
  • [Security Feature] As Operators are allowed to define the name of the foundation, which is then published into the firehose as a tag on the Healthwatch emitted metrics, a sanitization method has been added to Metron Forwarder so that disallowed characters that could be problematic for other downstream firehose consumers cannot be published. Any disallowed characters are stripped from the passed foundation label value.
  • [Feature] UI Improvements:
    • The copy to clipboard user interaction was improved throughout the UI. Now Copied will display briefly when the copy icon is clicked.
    • Minor design update made to the layout of the Jobs detail page in order to improve overall readability of the information presented.
    • On the Job Instances detail page, the y-axis is now fixed at 0-100% for the BOSH metric line graphs. This makes the page consistent with the behavior of other product pages, and better emphasizes low versus high percentages when scanning across multiple charts.
    • On the test result detail pages for CLI Command Health, Canary App Health, BOSH Director Health, and Ops Manager Health, the end time for a particular test run is now displayed in the details table in UTC. By displaying this information-only timestamp in UTC, it is easier to leverage the information when searching through relevant logs. The primary UI interactions on these pages remain as-is, displaying in the user’s local time.
    • On the test result detail pages for CLI Command Health, Canary App Health, BOSH Director Health, and Ops Manager Health, the detailed test result table has been visually adjusted so that the information no longer needs to be truncated, and is easier to read. Test Results are now represented by a Pass/Fail/Didn’t Run/No Data icon, with a hover interaction available to confirm icon meaning.
    • Improvement made to queries updating the Capacity panel on the main dashboard. This panel could sometimes show an unexpected line drop in the final minute, although the details page already had the most recent value correctly displayed. This update reduces the likelihood of that behavior.
    • We have removed the previously stated limitation that the Google Chrome browser must be used for accessing the UI. The latest Mozilla Firefox browser also works well. Microsoft Edge 16 has one issue with broken tab switch navigation on the Capacity and Diego detail pages that looks to be resolved in the upcoming Edge 17.
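
The foundation label sanitization described in the Security Feature item above can be sketched as follows. The release notes do not list which characters are disallowed, so the allowlist here (letters, digits, dot, hyphen, underscore) and the function name are assumptions for illustration, not the behavior of Metron Forwarder itself.

```python
import re

# Hypothetical allowlist: the release notes do not state the real character rules.
_DISALLOWED = re.compile(r"[^A-Za-z0-9._-]")


def sanitize_foundation_label(label: str) -> str:
    """Strip every character outside the assumed allowlist from the label."""
    return _DISALLOWED.sub("", label)


print(sanitize_foundation_label('prod"uction-1 <x>'))
print(sanitize_foundation_label("pcf.downey.cfapps.com"))
```

Stripping, rather than rejecting, matches the behavior stated in the notes: disallowed characters are removed from the passed foundation label value.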

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.1.9.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • [Bug Identified] v1.1.4 introduced a new Java-based Push Monitoring Components for PCF Healthwatch errand. This errand will not execute in an environment leveraging self-signed certs where the CA for these certs was added to the BOSH Director Trusted Certs via Ops Manager in order to facilitate SSL validation. The resolution is to update the JVM’s trust store with these certs. This is resolved in v1.1.5.
  • Does not cover monitoring of Isolation Segments.
  • Does not include the recently published UAA KPI recommendations for PCF 2.0.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.1.3

Release Date: January 11, 2018

  • [Feature] PCF Healthwatch is now also compatible with the PCF Small Footprint PAS tile.
  • [Feature] Operators can now choose to change the default Foundation name that PCF Healthwatch passes into the firehose as part of the publication of the PCF Healthwatch Metrics.
    • Operators can optionally configure this name within the tile. Doing so will replace the default foundation name value of system domain. This updated foundation name is passed into the Firehose as a key-value tag. For example:
      origin:"healthwatch" eventType:ValueMetric timestamp:1515598485276671703 deployment:"cf" job:"healthwatch-forwarder" index:"07a5b686-ef82-4dd0-6413-466b" ip:"10.0.16.6" tags:<key:"foundation" value:"production-1" > valueMetric:<name:"SyslogDrain.Adapter.LossRate.1M" value:0 unit:"m" >
      origin:"healthwatch" eventType:ValueMetric timestamp:1515598485279815606 deployment:"cf" job:"healthwatch-forwarder" index:"07a5b686-ef82-4dd0-6413-466b" ip:"10.0.16.6" tags:<key:"foundation" value:"production-1" > valueMetric:<name:"SyslogDrain.RLP.LossRate.1M" value:0 unit:"m" >
      
  • [Feature] The error page is now more descriptive when the login error is the result of an invalid scope.
    • Users that receive the error message Error: User missing required scopes. when attempting to access the PCF Healthwatch UI will need to have the correct healthwatch.read scope added to their UAA user account.
  • [Feature] When using the copy+paste interaction on an unhealthy job, the job name is now also copied to the clipboard. Having job/vm-id in the clipboard makes pasting into the bosh2 CLI more useful.
  • [Bug Fix] Fixes an issue identified in v1.1.1 where the full root CA certificate was visible in the log output.
  • Product stemcell was updated to v3468
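
For downstream consumers, the foundation tag shown in the sample Firehose output above can be pulled out of the text form with a few regular expressions. This is a minimal sketch that assumes the exact textual layout of the sample lines; it is not a parser for the underlying envelope format, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

TAG_RE = re.compile(r'tags:<key:"([^"]+)" value:"([^"]*)" >')
METRIC_RE = re.compile(r'valueMetric:<name:"([^"]+)" value:(\S+) unit:"([^"]*)" >')


def parse_envelope(line: str) -> Optional[Dict[str, object]]:
    """Extract the metric name, value, unit, and foundation tag from one line."""
    metric = METRIC_RE.search(line)
    if metric is None:
        return None  # not a ValueMetric line in the expected layout
    tags = dict(TAG_RE.findall(line))
    return {
        "name": metric.group(1),
        "value": float(metric.group(2)),
        "unit": metric.group(3),
        "foundation": tags.get("foundation"),
    }


# First sample line from the v1.1.3 notes above.
sample = (
    'origin:"healthwatch" eventType:ValueMetric timestamp:1515598485276671703 '
    'deployment:"cf" job:"healthwatch-forwarder" index:"07a5b686-ef82-4dd0-6413-466b" '
    'ip:"10.0.16.6" tags:<key:"foundation" value:"production-1" > '
    'valueMetric:<name:"SyslogDrain.Adapter.LossRate.1M" value:0 unit:"m" >'
)

print(parse_envelope(sample))
```

An operator aggregating streams from several foundations could group or filter on the returned foundation value.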

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.1.9.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • [Bug Identified] The CF CLI credentials are visible in the push_apps script logs. This is fixed in v1.1.4.
  • [Bug Identified] An update to origin-based queries introduced in v1.1.3 caused a potential calculation issue to occur for the Logging Throughput and Logging Loss Rate calculations, as these can compute from multiple metrics with differing origins. This is resolved in v1.1.4.
  • Does not cover monitoring of Isolation Segments.
  • Supports only the Google Chrome browser when accessing the PCF Healthwatch UI.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.1.1

Release Date: December 22, 2017

  • [Feature] To support monitoring of Pivotal Cloud Foundry (PCF) v2.0, the following functionality has been added to PCF Healthwatch:
  • [Feature] Metrics published by PCF Healthwatch on a given PCF foundation are now identifiable to that foundation.
    • If PCF Healthwatch is installed on multiple foundations, the metrics PCF Healthwatch publishes are identifiable to their source PCF foundation. This enables operators who are aggregating data streams from multiple foundations to more easily recognize which foundation the PCF Healthwatch metrics of concern originated from.
    • By default, the value provided for a PCF foundation is the system domain of that foundation. The foundation value is passed into the Firehose as a key-value tag. For example:
      origin:"healthwatch" eventType:ValueMetric timestamp:1511211281010702574 deployment:"cf" job:"healthwatch-forwarder" index:"dbf89280-1b6b-46c7-4255-aaad" ip:"10.0.16.29" tags:<key:"foundation" value:"pcf.downey.cfapps.com" > valueMetric:<name:"health.check.OpsMan.probe.count" value:1 unit:"count" >
      origin:"healthwatch" eventType:ValueMetric timestamp:1511211286171879726 deployment:"cf" job:"healthwatch-forwarder" index:"05a557a0-0e38-4298-6adb-278d" ip:"10.0.16.29" tags:<key:"foundation" value:"pcf.downey.cfapps.com" > valueMetric:<name:"health.check.bosh.director.probe.available" value:1 unit:"Metric" >
      
  • [Feature] PCF Healthwatch now publishes operational metrics about itself so that its functionality and performance can also be monitored.

    For more information, see Monitoring PCF Healthwatch.

  • [Feature] Operators installing or upgrading PCF Healthwatch can now configure the desired number of Health Checkers in the Healthwatch Component Config section of the PCF Healthwatch tile.

  • [Feature] Operators who do not use Ops Manager for deployments can now turn off the default Ops Manager test suite. For more information, see Installing and Configuring PCF Healthwatch.

  • [Feature] UI Improvements:

    • The PCF Healthwatch dashboard has a new six-column default layout. If the width of your display is 1835 pixels or fewer, the dashboard shows three columns; you can resize them manually in the browser.
    • When an unhealthy job is flagged and becomes visible on the PCF Healthwatch dashboard, you can now click on that job name to go directly to the Job Instances Detail page for that specific job.
    • Tooltip interactions and handling of long deployment names were improved.
    • Breadcrumb navigation was added.
    • Panel titles now link to detail view pages.

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.1.9.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • [Bug Identified] The CF CLI credentials are visible in the push_apps script logs. This is fixed in v1.1.4.
  • [Bug Identified] The full root CA certificate is visible in the log output.
  • Is not compatible with the PCF Small Footprint PAS tile.
  • Does not cover monitoring of Isolation Segments.
  • Supports only the Google Chrome browser when accessing the PCF Healthwatch UI.
  • If using PCF Healthwatch v1.1 on PCF v2.1, the Number of Route Registration Messages Sent and Received comparison graph may incorrectly reflect an inflated gap.
  • Hides three of the PAS MySQL KPI charts at launch. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

New Features in v1.1

PCF Healthwatch v1.0 was available as a limited, closed-BETA release. The section below summarizes key differences between PCF Healthwatch v1.0 and v1.1. For more information about new features in v1.1, see v1.1.0 release notes.

  • [Feature] Manual plugin configurations are no longer required to ingest BOSH metrics into PCF Healthwatch. Stop using the prior plugins when you switch to v1.1.
    • Smoke tests now fail on lack of BOSH metrics.
  • Naming convention changes:
    • The healthwatch.health.check.AppsMan.available metric is now healthwatch.health.check.CanaryApp.available.
    • The healthwatch.health.check.AppsMan.responseTime metric is now healthwatch.health.check.CanaryApp.responseTime.
    • The data loader app deployed at installation was renamed from mysql-logqueue to loader.
  • Data convention change: In PCF Healthwatch v1.0, the default deployment value for all Healthwatch-created metrics was p-healthwatch. In PCF Healthwatch v1.1, the deployment value is the actual deployment value the metrics were created from. This is a necessary data structure change to prepare for the future capability of monitoring isolation segments. The origin of all Healthwatch-created metrics remains healthwatch.
  • Default installation configuration change:
    • Ingestor instance count now defaults to 4
    • MySQL Loader instance count now defaults to 4
  • [Feature] New Syslog Drain Binding Capacity metric represents the average number of drain bindings across Adapter instances. The Logging Performance page now displays this capacity ratio as an indicator for scaling Syslog Drain Adapters. This chart replaces the informational Count of Bindings chart used in v1.0.
  • [Feature] The following Router graphs are now multi-line so that the performance of the individual instances can be better represented:
    • Router Throughput
    • 502 Bad Gateways
    • All 5XX Errors
    • Number of Routes Registered
  • [Feature] A new metric that counts Available Free Chunks is now stored in the PCF Healthwatch datastore and forwarded into the Firehose for external consumption.
  • [Feature] PCF Healthwatch v1.1 uses the new Ops Manager feature for supporting colocated errands.
  • [Feature] PCF Healthwatch v1.1 is updated to reflect the Key Performance Indicators changes for PCF v2.0.
  • [Bug Fix] In v1.0, the Running App Instances stoplight would continue to show a data value during a complete disconnection from the firehose data stream, if valid data had previously been received. This has been corrected. In a scenario where the product suffers a complete loss of new data for more than 5 minutes, the stoplight will now display 0.
  • Product stemcell was updated to v3445.
  • MySQL version was updated to v36.10.0.
  • In PCF 2.0, Elastic Runtime (ERT) was renamed to Pivotal Application Service (PAS). All help text references to ERT have been updated to PAS.
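
The Syslog Drain Binding Capacity metric described above is an average of drain bindings across Adapter instances, which can be sketched in a few lines. The function names and sample counts are hypothetical. The 200 and 250 thresholds are the Syslog Adapter Capacity defaults quoted in the v1.1.9 notes; applying them to this average, and treating higher values as worse, are assumptions.

```python
from typing import Sequence

WARNING_THRESHOLD = 200   # default warning value quoted in the v1.1.9 notes
CRITICAL_THRESHOLD = 250  # default critical value quoted in the v1.1.9 notes


def binding_capacity(bindings_per_adapter: Sequence[int]) -> float:
    """Average number of drain bindings across Adapter instances."""
    if not bindings_per_adapter:
        raise ValueError("no Adapter instances reported")
    return sum(bindings_per_adapter) / len(bindings_per_adapter)


def status(average: float) -> str:
    """Classify the average, assuming higher averages mean less headroom."""
    if average >= CRITICAL_THRESHOLD:
        return "critical"
    if average >= WARNING_THRESHOLD:
        return "warning"
    return "ok"


average = binding_capacity([180, 220, 260])  # hypothetical per-instance counts
print(average, status(average))
```

Under these assumptions, a rising average suggests that more Syslog Drain Adapter instances may be needed.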