PCF Healthwatch v1.0 Release Notes

IMPORTANT: The Pivotal Cloud Foundry (PCF) Healthwatch tile is currently in beta and is intended for evaluation and test purposes only. Do not use this product in a PCF production environment.

Releases

v1.0.8 (Beta)

Release Date: October 26, 2017

  • Note: The healthwatch-bosh-plugin.json file has been updated. The new version is 1.0.0-beta.3. Updating to the new healthwatch-bosh-plugin.json file ensures that the default Log Level is set to error. It was previously set to debug, which could cause excessive log output.
  • Updates the required stemcell to v3421.
  • [Feature] The CLI Command Health Smoke Test suite now reports cf login success or failure as a distinct test event.
    • A new Healthwatch metric, Can Login (healthwatch.health.check.cliCommand.login), is available (1 = pass, 0 = fail).
    • The Healthwatch UI is updated to reflect the success or failure of the Login test portion of the larger test suite run
  • [Feature] PCF Healthwatch now ingests BOSH VM metrics outside of the core CF deployment
    • Previously, Healthwatch was limited to the core cf deployment for BOSH health metrics. If you had another tile installed, for example PCF JMX Bridge, that tile’s VM could be unhealthy without this surfacing in Healthwatch. Although Healthwatch does not yet explicitly monitor the operational KPIs of other service tiles, surfacing the BOSH VM health metrics for these other deployments gives operators better insight into the health of VMs that, when unhealthy, could impact the larger foundation.
  • [Bug Fix] The BOSH Deployment Occurring metric was not properly distinguishing between active, queued, and cancelling tasks. The metric now measures only active deployments.
  • [Feature] Eliminates usage of Ops Manager IP accessors in PCF Healthwatch. The necessary manifest values are now consumed via BOSH links.
  • [Feature] Log output from cf-health-check is now distinguished by which cf command was run. This makes it easier to debug each Healthwatch CLI command health test via cf logs cf-health-check --recent.
  • [Feature] It is no longer required to keep the MySQL Proxy job scaled to 2 instances. An operator installing Healthwatch can now set the MySQL Proxy instance count to 1 or more.
  • [Stability Improvement] The default memory limits of the apps that Healthwatch deploys have been lowered to better reflect the memory these apps actually need and to conserve quota space. The apps now deploy with the following default memory limits:
    • healthwatch-ingestor 128M
    • canary-health-check 1G
    • healthwatch-mysql-logqueue 64M
    • opsmanager-health-check 64M
    • healthwatch-aggregator 64M
    • bosh-task-check 64M
    • bosh-health-check 64M
    • cf-health-check 500M
    • healthwatch 4G
  • [Feature] UI Improvement: The tooltip styling and the information presented have been improved based on user feedback. The selected line is now highlighted. The tooltip on single-line and multi-line graphs now also shows the associated deployment. A tooltip is also now available on box graphs, showing the date and time of the assessments.
  • [Bug Fix] In v1.0.7, engaging the tooltip on a line chart could cause the chart to try to update the x-axis. This is now resolved.
  • [Bug Fix] The Canary App Response time is meant to measure the time to receive a success response. In prior versions, a response time was still being recorded for a failed response. This is now resolved.
  • [Stability Improvement] The Push Monitoring Components errand has been updated to Default (On) in the tile config, meaning it will always run on Apply Changes. This was done to reduce the likelihood of ingestor disconnection after an upgrade.
  • The prior known issue, in which the Adapter Loss Rate and Reverse Log Proxy Loss Rate metrics were incorrect (always 0) because of how these metrics were emitted, is now resolved. However, you must be on Elastic Runtime tile v1.11.14 or later to get the fix.
  • Note: Super metrics emitted by PCF Healthwatch now use a consistent tagging format:
    • Deployment = p-healthwatch
    • Job = specific to the app generating the metric
    • Origin = healthwatch
  • Note: Metric name change. The following Scalable Syslog super metrics output by Healthwatch have an updated naming convention:
    • healthwatch.ScalableSyslog.Adapter.LossRate.1M is now healthwatch.SyslogDrain.Adapter.LossRate.1M
    • healthwatch.ScalableSyslog.RLP.LossRate.1M is now healthwatch.SyslogDrain.RLP.LossRate.1M
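
Operators who reference these super metrics in external alerting or dashboard configuration may want to apply the rename mechanically. The following is a minimal Python sketch of that idea, not part of PCF Healthwatch: the function name migrate_metric_names and the sample list are hypothetical, and only the two old and new metric names come from the notes above.

```python
# Old-to-new metric names, copied from the v1.0.8 rename note above.
RENAMED_METRICS = {
    "healthwatch.ScalableSyslog.Adapter.LossRate.1M":
        "healthwatch.SyslogDrain.Adapter.LossRate.1M",
    "healthwatch.ScalableSyslog.RLP.LossRate.1M":
        "healthwatch.SyslogDrain.RLP.LossRate.1M",
}


def migrate_metric_names(names):
    """Return a new list in which renamed metrics use their v1.0.8 names.

    Names that were not renamed are passed through unchanged.
    """
    return [RENAMED_METRICS.get(name, name) for name in names]


if __name__ == "__main__":
    # Hypothetical list of metric names watched by an external tool.
    watched = [
        "healthwatch.ScalableSyslog.RLP.LossRate.1M",
        "healthwatch.health.check.cliCommand.login",
    ]
    for name in migrate_metric_names(watched):
        print(name)
```

A lookup table keeps the mapping explicit, so any metric that was not renamed is left exactly as it was.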

Known Issues

  • Note: You must repeat the Getting BOSH Health Metrics steps every time the BOSH Director is recreated. The BOSH Director is recreated when you upgrade Ops Manager or when you install, upgrade, or delete data service tiles (for example, MySQL, RabbitMQ, or Redis).
  • The Metron Forwarder is not currently in a highly available configuration.
  • Router latency measurement (the gorouter.latency metric) has a known issue present in PCF v1.11 that may impact PCF Healthwatch when run on PCF v1.11. This router issue is resolved in PCF v1.12.

v1.0.7 (Beta)

Release Date: October 10, 2017

  • Updates made to PCF Healthwatch to support PCF v1.12:
    • Logging Throughput, Dropped Messages, and Firehose Loss Rate have been updated to reflect correct metric math for PCF v1.12. The charts work appropriately for PCF v1.11.
    • Router File Descriptors and Router Exhausted Connections have been added as recommended Router metrics specific to PCF v1.12.
    • Router, Logging, and Diego Performance contextual help has been updated to show where calculations differ between PCF v1.11 and PCF v1.12.
  • Improvements to the UI:
    • On the main dashboard and all detail pages, any charts loading data now display a spinner until the requested chart data is successfully loaded.
    • If BOSH metrics have not yet been connected to PCF Healthwatch, the Job Health stoplight now shows 0% Healthy as Red.
    • Based on updated Router recommendations to monitor gorouter.ms_since_last_registry_update per router instance, the Time Since Last Route Register Received graph is now a per-instance multi-line graph.
  • The push apps errand now tries to run three times before failing.
  • When the BOSH Health Checker fails, PCF Healthwatch now logs the BOSH CLI output to ease debugging.
  • Property requires_product_versions: >= 1.11
  • Golang version: v1.8.4
  • MySQL version: v36.6
  • MySQL release ensures the superuser is admin. The root user does not have admin access.
  • Bug Fixed: The VM_credentials property was previously being duplicated. This is now resolved.
  • Bug Fixed: Corrected a naming mismatch in the run.sh file. The corrected file is now associated with current and former releases on Pivotal Network.
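
The retry behavior described above for the push apps errand can be pictured roughly as follows. This is an illustrative Python sketch only, not the errand's actual code: run_with_retries and the task callable are hypothetical names, and the only detail taken from these notes is the limit of three attempts.

```python
def run_with_retries(task, max_attempts=3):
    """Call task() until it succeeds or max_attempts attempts have failed.

    Returns the result of the first successful call. If every attempt
    raises, the last exception is re-raised to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            print(f"attempt {attempt} of {max_attempts} failed: {error}")
    raise last_error
```

With this shape, a transient failure on the first or second attempt does not fail the run, while a persistent failure still surfaces after the third attempt.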

Known Issues

  • Note: You must repeat the Getting BOSH Health Metrics steps every time the BOSH Director is recreated. The BOSH Director is recreated when you upgrade Ops Manager or when you install, upgrade, or delete data service tiles (for example, MySQL, RabbitMQ, or Redis).
  • The Adapter Loss Rate and Reverse Log Proxy Loss Rate metrics will be incorrect (always 0) until an issue with how these metrics are currently emitted is resolved.
  • The Metron Forwarder is not currently in a highly available configuration.
  • Router latency measurement (the gorouter.latency metric) has a known issue present in PCF v1.11 that may impact PCF Healthwatch when run on PCF v1.11. This router issue is resolved in PCF v1.12.

v1.0.6 (Beta)

Release Date: September 26, 2017

  • The PCF Healthwatch tile now includes smoke tests, which are executed using an errand upon installation and upgrade. These smoke tests validate the following:
    • UI is running.
    • Platform metrics are being ingested.
    • BOSH VM metrics are being ingested. Note: Because ingestion of BOSH metrics in v1.0 requires additional configuration, this portion of the tests does not result in a hard failure if the metrics are not being ingested. Instead, a soft failure allows the smoke tests to pass, but it writes out a message directing the operator to documentation on how to configure these metrics.
  • A buildpack dependency that could cause an installation issue in an air-gapped environment has been corrected.
  • Improvements to the UI:
    • On the Job Health and Vitals screen, if jobs do not report a Persistent Disk measure, the summary graph displays N/A and the related line chart does not appear. If a given job reports Persistent Disk, both the summary graph and the related line chart reflect the reported value.
    • Contextual Help interaction and content was added to the CLI Command Health, BOSH Director Health, Ops Manager Health, and Canary App Health secondary screens.
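
The distinction between a hard failure and a soft failure in the smoke tests described above can be sketched as follows. This is a hedged illustration in Python, not the tile's implementation: the Check class, the evaluate_smoke_tests function, and the warning text are all hypothetical. The only behavior taken from these notes is that an unconfigured BOSH metrics check produces a message rather than failing the run.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    passed: bool
    soft: bool = False  # a soft check warns instead of failing the suite


def evaluate_smoke_tests(checks):
    """Return (suite_passed, warnings) for a list of Check results.

    A failed hard check fails the suite. A failed soft check only adds
    a warning that points the operator at the documentation.
    """
    suite_passed = True
    warnings = []
    for check in checks:
        if check.passed:
            continue
        if check.soft:
            warnings.append(
                f"{check.name}: not passing; see the documentation "
                "on configuring these metrics"
            )
        else:
            suite_passed = False
    return suite_passed, warnings


if __name__ == "__main__":
    results = [
        Check("UI is running", passed=True),
        Check("Platform metrics are being ingested", passed=True),
        Check("BOSH VM metrics are being ingested", passed=False, soft=True),
    ]
    print(evaluate_smoke_tests(results))
```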

Known Issues

  • Router latency measurement (the gorouter.latency metric) has a known issue present in PCF v1.11 that may impact PCF Healthwatch. This router issue is expected to be resolved in PCF v1.12.
  • The Adapter Loss Rate and Reverse Log Proxy Loss Rate metrics will be incorrect (always 0) until an issue with how these metrics are currently emitted is resolved.
  • The Metron Forwarder is not currently in a highly available configuration.

v1.0.5 (Beta)

Release Date: September 14, 2017

  • The healthwatch-bosh-plugin has been updated. If this is your first installation of PCF Healthwatch, follow the standard instructions in Installing PCF Healthwatch. If you have installed an earlier version of PCF Healthwatch, see Upgrading PCF Healthwatch to v1.0.5.
  • PCF Healthwatch now reflects the occurrence of BOSH deployments in the visualizations of platform metrics and the continuous validation test results. The time periods reflected for a BOSH deployment are based on the BOSH deployment occurrence metric.
  • Chart Info content additions:
    • As a user hovers over a given chart within a panel, a help ? icon appears. Clicking on the help icon provides information about the metric(s) driving the chart.
  • Contextual Help panel content additions:
    • As a user hovers over a panel, an information icon appears on that given panel. Clicking on this icon opens a new tabbed screen with content relevant to the metrics displayed on the panel.
  • Improvements to the UI.
  • Prior known issues fixed:
    • On the Job Health screen, the Only Show Errors toggle feature now works as expected.

Known Issues

  • If an alpha version of PCF Healthwatch was installed, you must delete it before proceeding with the installation of this beta tile. The alpha tile was not upgradable.
  • Router latency measurement (the gorouter.latency metric) has a known issue present in PCF v1.11 that may impact PCF Healthwatch. This router issue is expected to be resolved in PCF v1.12.
  • The Adapter Loss Rate and Reverse Log Proxy Loss Rate metrics will be incorrect (always 0) until an issue with how these metrics are currently emitted is resolved.
  • The Metron Forwarder is not currently in a highly available configuration.

v1.0.4-alpha.7

Release Date: September 7, 2017

  • The healthwatch.read scope has been added to the UAA Admin user by default in ERT v1.11.9 and later. As a result of this change, the minimum compatible Elastic Runtime version has been raised to v1.11.9 or later.
  • Updates for adjusted PCF v1.11 monitoring recommendations:
    • The route_emitter.RouteEmitterSyncDuration metric should be monitored per job instead of globally as in PCF v1.10 and prior. PCF Healthwatch is updated to render the chart as a multi-line graph to highlight any problematic job(s).
    • The locket.ActiveLocks metric threshold should be ≠ 4 instead of > 4. The visual alert banding was updated to bring attention to any measurement less than or greater than 4.
  • The Job Health and Vitals dashboard is complete and now offers up to 24 hours of the recommended KPI metrics for monitoring the underlying VM health.
    • The Job Health screen now defaults to displaying jobs with errors. This filter feature can be toggled off to display the red and green health results of all jobs reporting in the deployment.
    • Each job group displayed on the Job Health screen now includes a link to a deeper health overview of that group. This overview includes the following:
      • A Metrics Summary, which is a scrollable and filterable data representation of all jobs in a group. For each job over the chosen time interval, the summary provides information about Count of Failed Health Checks, CPU, System Disk, Persistent Disk, and Ephemeral Disk.
      • A Health section with a representation of the health check at the individual job level. This display is driven by the filtered Metrics Summary.
      • A Job Vitals section displaying the recommended Job Vitals metrics for each filtered job. This display is driven by the filtered Metrics Summary.
  • Contextual Help content has been added to all main dashboard panels and some deep dive screens.
    • When a user hovers over any main dashboard panel, an information icon appears on that given panel. Clicking on this icon opens a new tabbed screen with content relevant to the metrics displayed on the panel.
  • Prior known issues fixed:
    • The bug identified in Alpha 6, a malformed URL in the BOSH Health Check test, has been resolved.
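
The change to the locket.ActiveLocks threshold noted above is easy to get wrong when porting alert rules, because the earlier rule ignores values below 4. A minimal Python sketch of the two rules follows; the function names are hypothetical and this is not how PCF Healthwatch itself evaluates the metric.

```python
EXPECTED_ACTIVE_LOCKS = 4  # value stated in the monitoring recommendation above


def active_locks_alert_earlier(value):
    """Earlier rule: alert only when the measurement is greater than 4."""
    return value > EXPECTED_ACTIVE_LOCKS


def active_locks_alert(value):
    """Updated rule: alert on any measurement other than 4."""
    return value != EXPECTED_ACTIVE_LOCKS


if __name__ == "__main__":
    for measurement in (3, 4, 5):
        print(
            measurement,
            active_locks_alert_earlier(measurement),
            active_locks_alert(measurement),
        )
```

A measurement of 3 is the case that separates the two rules: the earlier rule stays quiet, while the updated rule raises an alert.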

Known Issues

  • Bug identified: On the Job Health screen, the Only Show Errors toggle feature does not work as expected. This means that you cannot currently change the page to reflect all healthy and unhealthy jobs reporting. The bug will be fixed in the next release.
  • If an earlier alpha version of PCF Healthwatch was installed, you must delete it before proceeding with the installation of the newer tile. The tile is not upgradable at this time.
  • Router latency measurement (the gorouter.latency metric) has a known issue present in PCF v1.11 that may impact PCF Healthwatch. This router issue is expected to be resolved in PCF v1.12.
  • The Adapter Loss Rate and Reverse Log Proxy Loss Rate metrics will be incorrect (always 0) until an issue with how these metrics are currently emitted is resolved.
  • BOSH deployments are not yet reflected on any deep-dive screens.
  • Contextual help content additions are still in progress throughout the product.

v1.0.0-alpha.6

Release Date: August 24, 2017

  • The Ops Manager Health Check dashboard is complete and now offers up to 24 hours of test results for the Ops Manager continuous validation test.
  • The App Canary Health Check dashboard is complete and now offers up to 24 hours of test results for the App Canary continuous validation test.
  • The Diego Capacity dashboard is complete and now offers up to 24 hours of the recommended KPI and KSI metrics for monitoring PCF capacity. This dashboard is available through the Capacity panel on the main PCF Healthwatch dashboard.
  • The Diego App Instances dashboard is complete and now offers up to 24 hours of the recommended KPI and KSI metrics for monitoring PCF app instances. This dashboard is available through the App Instances panel on the main PCF Healthwatch dashboard.
  • The Diego Performance dashboard is complete and now offers up to 24 hours of the recommended KPI and KSI metrics for monitoring Diego health and performance. This dashboard is available through a tab selection from the Diego Capacity and Diego App Instances dashboards.
  • The Job Health and Vitals dashboard has been started.
  • The CLI Command Health Smoke Test feature is improved. When the cf-health-check app fails, a -1 (did not run) value is recorded for the cf push test. This helps operators to determine when the test itself fails.
  • Canary App Response Time values are now stored in milliseconds (ms) instead of nanoseconds (ns).
  • Improvements to the UI:
    • On deep-dive pages, the Last Update timestamp shown in local time now also includes a UTC reference.
    • The detailed results shown on the CLI Smoke Test and BOSH Director deep-dive pages are updated to display the results sorted in descending order.
  • Prior known issues fixed:
    • The Scalable Syslog Performance panel on the Logging Performance deep-dive page is now dynamic and appears only if the Scalable Syslog feature is being used in the foundation.
      • If Drain Binding Count is ≥ 1 within the 24-hour retention period, the panel is displayed.
      • If Drain Binding Count is = 0 within the 24-hour retention period, the panel is not rendered.
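
The display rule for the Scalable Syslog Performance panel can be read as a simple predicate over the retained samples. The Python sketch below is one interpretation, under the assumption that a single sample with a count of 1 or more inside the 24-hour window is enough to show the panel. The function name show_syslog_panel and the (timestamp, count) sample format are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=24)  # retention period stated in the notes above


def show_syslog_panel(samples, now):
    """Decide whether the panel should be displayed.

    samples is an iterable of (timestamp, drain_binding_count) pairs.
    Returns True if any sample inside the retention window has a
    count of 1 or more, and False otherwise.
    """
    cutoff = now - RETENTION
    return any(count >= 1 for timestamp, count in samples if timestamp >= cutoff)


if __name__ == "__main__":
    now = datetime(2017, 8, 24, 12, 0)
    samples = [
        (now - timedelta(hours=30), 2),  # outside the window, ignored
        (now - timedelta(hours=2), 0),
    ]
    print(show_syslog_panel(samples, now))
```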

Known Issues

  • Bug identified: The BOSH Health Check test will continue to fail in this version due to a malformed URL. The bug will be fixed in the next release.
  • If an earlier alpha version of PCF Healthwatch was installed, it must be deleted before proceeding with the installation of the newer tile. The tile is not upgradable at this time.
  • Router latency measurement (the gorouter.latency metric) has a known issue present in PCF v1.11 that may impact PCF Healthwatch. This router issue is expected to be resolved in PCF v1.12.
  • The Adapter Loss Rate and Reverse Log Proxy Loss Rate metrics will be incorrect (always 0) until an issue with how these metrics are currently emitted is resolved.
  • The following dashboards are not yet completed:
    • Job Health and Job Vitals
  • BOSH deployments are not yet reflected on deep-dive screens.
  • Contextual help content is not yet present.

v1.0.0-alpha.5

Release Date: August 8, 2017

  • PCF Healthwatch now forwards its product-generated metrics into the Firehose, allowing other existing Firehose consumers to access these operationally useful data points.
  • The URL for the Dashboard UI was updated to healthwatch.YOUR-SYSTEM-DOMAIN.
  • The BOSH Director Health Check dashboard is complete and now offers up to 24 hours of test results for the BOSH Director continuous validation test.
  • The Logging Performance dashboard is complete and now offers up to 24 hours of the recommended KPI and KSI metrics for monitoring the Logging system.
  • The Router Performance dashboard is complete and now offers up to 24 hours of the recommended KPI and KSI metrics for monitoring the Gorouter.
  • PCF Healthwatch now stores and displays VM ID instead of index to make VM info more usable for BOSH2 interactions.
  • PCF Healthwatch now points healthcheck apps at MySQL Proxy instead of a MySQL node, which resolves the issue in earlier alphas where downtime would be experienced during a PCF Healthwatch stemcell update.

Known Issues

  • If an earlier alpha version of PCF Healthwatch was installed, it must be deleted before proceeding with the installation of the newer tile. The tile is not upgradable at this time.
  • Router latency measurement (the gorouter.latency metric) has a known issue present in PCF v1.11 that may impact PCF Healthwatch. This router issue is expected to be resolved in PCF v1.12.
  • The Adapter Loss Rate and Reverse Log Proxy Loss Rate metrics will be incorrect (always 0) until an issue with how these metrics are currently emitted is resolved.
  • The Scalable Syslog Performance panel on the Logging Performance deep-dive screen is intended to be dynamic, appearing only if the Scalable Syslog feature is being used in the foundation. This panel will be dynamic in the next version (Alpha 6).
  • The healthwatch-forwarder resource is defaulted to 3 instances in the PCF Healthwatch tile configuration. This can be changed to 1 instance. The default will be corrected in the next version (Alpha 6).
  • Currently, 5 metrics have slight variations from the data points published in PCF Healthwatch Metrics. This will be corrected in the next version (Alpha 6):
    • healthwatch.scalablesyslog.rlp.lossRate.1M will be updated to healthwatch.ScalableSyslog.RLP.LossRate.1M.
    • healthwatch.scalablesyslog.adapter.LossRate.1M will be updated to healthwatch.ScalableSyslog.Adapter.LossRate.1M.
    • healthwatch.TotalPercentageAvailableDiskCapacity.5M will be updated to healthwatch.Diego.TotalPercentageAvailableDiskCapacity.5M.
    • healthwatch.TotalPercentageAvailableContainerCapacity.5M will be updated to healthwatch.Diego.TotalPercentageAvailableContainerCapacity.5M.
    • healthwatch.TotalPercentageAvailableMemoryCapacity.5M will be updated to healthwatch.Diego.TotalPercentageAvailableMemoryCapacity.5M.
  • The following dashboards are not yet completed as of this Alpha 5 version:
    • Diego: Capacity
    • Diego: App Instances
    • Diego: Diego Performance
    • Ops Manager Health Check
    • App Canary Health Check
    • Job Health and Job Vitals
  • BOSH deployments are not yet reflected on any deep-dive screens.
  • Contextual help content is not yet present.