PCF Healthwatch v1.2 Release Notes

Releases

v1.2.6

Release Date: November 30, 2018

  • [Bug Fix] The delete-space errand deletes the Application Security Group. This resolves an issue that prevented users from reinstalling the Healthwatch tile after deleting it.

  • [Bug Fix] The Healthwatch Ingestor reconnects to the Loggregator Firehose automatically after all Loggregator Traffic Controller VMs are stopped.

  • [Bug Fix] Healthwatch collects and alerts on the Galera Cluster Status Sum and Galera Total Percentage Healthy Nodes metrics only on Pivotal Application Service (PAS) configurations that use clustered MySQL. This prevents erroneous alerts on these metrics for PAS configurations that use single-node MySQL or an external database.

  • The following are maintenance upgrades to product dependencies:

    • CF CLI now v6.40.1
    • Golang now v1.10.5
    • Flyway Command-line and Library now v5.2.1
    • OpenJDK now v1.8.0_192-b03
    • Spring now Brussels-SR14

v1.2.5

Release Date: August 7, 2018

  • [Bug Fix] Fixes display of bar charts within PCF Healthwatch UI so that data remains within the table at different browser sizes.
  • Maintenance update of the following product dependencies:
    • Golang now v1.10.3

Known Issues

  • Due to a known issue with Pivotal Application Service for Windows, Windows-based Diego Cell instances do not emit BOSH VM health metrics through the firehose unless the foundation is on Ops Manager v2.1.4 or later. As a result, potentially unhealthy instances of these cells might not surface in PCF Healthwatch.
  • Currently, Windows-based Diego Cell instances created via Pivotal Application Service for Windows emit their platform metrics with a hard-coded deployment value of cf. This can affect the capacity values shown by PCF Healthwatch, or by other consumers of monitoring metrics, as follows:
    • If Isolation Segments are used in combination with isolated Windows-based Diego Cells: Any Windows-based cells that are isolated to a given isolation segment report as part of the core cf system deployment. As a result, the isolation segment capacity values are under-reported (they include only Linux-based cells and exclude Windows-based cells), and the core cf system deployment over-reports capacity (it includes Windows-based cells from isolation segments as part of the core cf system capacity).
    • If Isolation Segments are not used: The core cf system deployment correctly shows total capacity. However, Windows-based and Linux-based cells are grouped together in PCF Healthwatch capacity assessments such as capacity remaining and number of free chunks of memory.
  • Three of the PAS MySQL KPI charts are hidden. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.2.4

Release Date: July 17, 2018

  • [Bug Fix] Updates how the stemcell number is referenced in the tile manifest. Prior to this fix BOSH would misinterpret a stemcell number ending in 0 and incorrectly drop the 0.
  • Stemcell for v1.2.4 is now v3541
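The dropped trailing zero described above is typical of a version string being read as a number. The following is a minimal sketch only, assuming the stemcell version was interpreted as a floating-point value (the release note does not state the exact mechanism). The version 3541.10 is a made-up example.

```python
# Hypothetical stemcell version, used only for illustration.
version_text = "3541.10"

# Read as a number, the trailing zero carries no value and is lost.
as_number = float(version_text)
print(as_number)   # 3541.1

# Kept as a string, the version is preserved exactly.
as_string = str(version_text)
print(as_string)   # 3541.10
```

Treating version identifiers as strings end to end avoids this class of problem.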

Known Issues

  • Due to a known issue with Pivotal Application Service for Windows, Windows-based Diego Cell instances do not emit BOSH VM health metrics through the firehose unless the foundation is on Ops Manager v2.1.4 or later. As a result, potentially unhealthy instances of these cells might not surface in PCF Healthwatch.
  • Currently, Windows-based Diego Cell instances created via Pivotal Application Service for Windows emit their platform metrics with a hard-coded deployment value of cf. This can affect the capacity values shown by PCF Healthwatch, or by other consumers of monitoring metrics, as follows:
    • If Isolation Segments are used in combination with isolated Windows-based Diego Cells: Any Windows-based cells that are isolated to a given isolation segment report as part of the core cf system deployment. As a result, the isolation segment capacity values are under-reported (they include only Linux-based cells and exclude Windows-based cells), and the core cf system deployment over-reports capacity (it includes Windows-based cells from isolation segments as part of the core cf system capacity).
    • If Isolation Segments are not used: The core cf system deployment correctly shows total capacity. However, Windows-based and Linux-based cells are grouped together in PCF Healthwatch capacity assessments such as capacity remaining and number of free chunks of memory.
  • Three of the PAS MySQL KPI charts are hidden. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.2.3

Release Date: July 3, 2018

  • [Feature] The PCF Healthwatch dashboard UI now highlights Routing and Capacity when Isolation Segments are in use.
  • [Feature] Because the UAA Request Latency metric is now emitted in PCF, Healthwatch adds an out-of-the-box alert for monitoring UAA Request Latency.
  • The default threshold values for Number of Available Free Chunks of Cell Memory Alert have been modified: 6 GB critical and 12 GB warning.
    • If these thresholds have been previously modified, this update will not override that user configuration.
  • The default threshold values for Syslog Adapter Capacity Alert have been modified: 250 critical and 200 warning.
    • If these thresholds have been previously modified, this update will not override that user configuration.
  • PCF Healthwatch now leverages pre-compiled releases in order to reduce the deployment time necessary during the Apply Changes flow.
  • [Bug Fix] In prior versions of PCF Healthwatch v1.2.x, metrics emitted by PCF Healthwatch were not tagged with the Foundation name. This is now resolved.
  • Maintenance update of the following product dependencies:
    • Golang now v1.9.7
    • Java GRPC now v1.13.1
    • OpenJDK now v1.8.0_172-b11
    • CF CLI now v6.37.0
    • Flyway Command-line now v5.1.3
    • Redis now v3.2.12
    • Spring now Brussels-SR11
    • Loggregator now v101.11
    • CF-MySQL now 36.14.0
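The two default threshold changes above point in opposite directions: for free chunks of cell memory a lower value is worse, while for Syslog Adapter capacity a higher value is worse. The sketch below shows one way such a pair of thresholds could be evaluated. The function name classify, the inclusive comparisons, and the rule for inferring direction are assumptions for illustration, not Healthwatch's actual alerting logic.

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a metric value.

    Assumption: direction is inferred from the thresholds. If the
    critical threshold is below the warning threshold, lower values
    are worse; otherwise higher values are worse.
    """
    lower_is_worse = critical < warning
    if lower_is_worse:
        if value <= critical:
            return "critical"
        if value <= warning:
            return "warning"
    else:
        if value >= critical:
            return "critical"
        if value >= warning:
            return "warning"
    return "ok"


# Free chunks defaults from the note above: critical 6, warning 12.
print(classify(5, warning=12, critical=6))       # critical
print(classify(10, warning=12, critical=6))      # warning
# Syslog Adapter Capacity defaults: critical 250, warning 200.
print(classify(220, warning=200, critical=250))  # warning
print(classify(100, warning=200, critical=250))  # ok
```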

Known Issues

  • Due to a known issue with Pivotal Application Service for Windows, Windows-based Diego Cell instances do not emit BOSH VM health metrics through the firehose unless the foundation is on Ops Manager v2.1.4 or later. As a result, potentially unhealthy instances of these cells might not surface in PCF Healthwatch.
  • Currently, Windows-based Diego Cell instances created via Pivotal Application Service for Windows emit their platform metrics with a hard-coded deployment value of cf. This can affect the capacity values shown by PCF Healthwatch, or by other consumers of monitoring metrics, as follows:
    • If Isolation Segments are used in combination with isolated Windows-based Diego Cells: Any Windows-based cells that are isolated to a given isolation segment report as part of the core cf system deployment. As a result, the isolation segment capacity values are under-reported (they include only Linux-based cells and exclude Windows-based cells), and the core cf system deployment over-reports capacity (it includes Windows-based cells from isolation segments as part of the core cf system capacity).
    • If Isolation Segments are not used: The core cf system deployment correctly shows total capacity. However, Windows-based and Linux-based cells are grouped together in PCF Healthwatch capacity assessments such as capacity remaining and number of free chunks of memory.
  • Three of the PAS MySQL KPI charts are hidden. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.2.2

Release Date: May 25, 2018

  • [Bug Fix] Fixes a bug identified in prior versions where two-factor authentication being enabled for the BOSH Director caused the BOSH Health Check continuous validation test and the BOSH Task Check to fail.
  • [Feature] The Router & Capacity detail pages in the UI now support the visualization of platform metric data emitted from the firehose per Isolation Segment. When an Isolation Segment is present, graphs will indicate whether they are shared or isolated component metrics.

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.2.3.
  • Due to a known issue with Pivotal Application Service for Windows, Windows-based Diego Cell instances do not emit BOSH VM health metrics through the firehose unless the foundation is on Ops Manager v2.1.4 or later. As a result, potentially unhealthy instances of these cells might not surface in PCF Healthwatch.
  • Currently, Windows-based Diego Cell instances created via Pivotal Application Service for Windows emit their platform metrics with a hard-coded deployment value of cf. This can affect the capacity values shown by PCF Healthwatch, or by other consumers of monitoring metrics, as follows:
    • If Isolation Segments are used in combination with isolated Windows-based Diego Cells: Any Windows-based cells that are isolated to a given isolation segment report as part of the core cf system deployment. As a result, the isolation segment capacity values are under-reported (they include only Linux-based cells and exclude Windows-based cells), and the core cf system deployment over-reports capacity (it includes Windows-based cells from isolation segments as part of the core cf system capacity).
    • If Isolation Segments are not used: The core cf system deployment correctly shows total capacity. However, Windows-based and Linux-based cells are grouped together in PCF Healthwatch capacity assessments such as capacity remaining and number of free chunks of memory.
  • Three of the PAS MySQL KPI charts are hidden. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.2.1

Release Date: May 8, 2018

  • [Feature] PCF Healthwatch now comes with out of the box Alerting for the recommended UAA and Monitoring PCF Healthwatch Alerts, as well as alerts for Healthwatch continuous tests (Cloud Foundry CLI Health, Ops Manager Health, Canary App Health, BOSH Director Health).
  • [Feature] The Number of Available Free Chunks of Memory operational metric created by PCF Healthwatch now publishes with a new key-value tag that indicates the configured value for which this metric is calculated.
  • Alerts that are fired for BOSH Metrics on specific jobs now link more directly to the related Job Instances page in the Healthwatch UI.
  • Two new metrics for PCF Healthwatch itself have been added to indicate when the Redis workers need to be scaled: healthwatch.redis.valueMetricQueue.size and healthwatch.redis.counterEventQueue.size.
  • The CLI Health Receiving Logs test now checks for the expected log line multiple times within the timeout window. This mitigates an issue in some customer environments where the test indicated failure due to slow log delivery.
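The change to the Receiving Logs test amounts to polling until a deadline instead of checking once. The sketch below illustrates that pattern under stated assumptions: log_line_seen, fetch_logs, and the default timings are hypothetical names and values, not the actual Healthwatch test code.

```python
import time
from typing import Callable, Iterable


def log_line_seen(fetch_logs: Callable[[], Iterable[str]],
                  expected: str,
                  timeout_s: float = 10.0,
                  interval_s: float = 1.0) -> bool:
    """Poll fetch_logs until expected appears or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Check every line returned by this poll for the expected text.
        if any(expected in line for line in fetch_logs()):
            return True
        # Give up only after the whole timeout window has passed.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Simulated slow delivery: the line only shows up on the third poll.
polls = {"count": 0}

def slow_logs() -> list:
    polls["count"] += 1
    return ["canary ok"] if polls["count"] >= 3 else []

print(log_line_seen(slow_logs, "canary ok", timeout_s=1.0, interval_s=0.01))  # True
```

A single check at the start of the window would report failure in this simulation, even though the line arrives well within the timeout.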

Known Issues

  • [Bug Identified] Metrics emitted by PCF Healthwatch are not tagged with the Foundation name while being emitted back into the Firehose. This is resolved in v1.2.3.
  • The PCF Healthwatch UI does not yet support visualization of platform metric data emitted from the firehose per Isolation Segment. This feature will be available in an upcoming patch.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • Due to a known metric emission issue with PCF, the Request Latency by UAA chart will be blank on the UAA details page if not on PCF v2.1.3 or later.
  • Due to a known issue with Pivotal Application Service for Windows, Windows-based Diego Cell instances do not emit BOSH VM health metrics through the firehose unless the foundation is on Ops Manager v2.1.4 or later. As a result, potentially unhealthy instances of these cells might not surface in PCF Healthwatch.
  • Currently, Windows-based Diego Cell instances created via Pivotal Application Service for Windows emit their platform metrics with a hard-coded deployment value of cf. This can affect the capacity values shown by PCF Healthwatch, or by other consumers of monitoring metrics, as follows:
    • If Isolation Segments are used in combination with isolated Windows-based Diego Cells: Any Windows-based cells that are isolated to a given isolation segment report as part of the core cf system deployment. As a result, the isolation segment capacity values are under-reported (they include only Linux-based cells and exclude Windows-based cells), and the core cf system deployment over-reports capacity (it includes Windows-based cells from isolation segments as part of the core cf system capacity).
    • If Isolation Segments are not used: The core cf system deployment correctly shows total capacity. However, Windows-based and Linux-based cells are grouped together in PCF Healthwatch capacity assessments such as capacity remaining and number of free chunks of memory.
  • Three of the PAS MySQL KPI charts are hidden. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used

v1.2.0

Release Date: April 23, 2018

  • [Feature] To support monitoring of Pivotal Cloud Foundry (PCF) v2.1, the following functionality has been added to PCF Healthwatch:
    • The Doppler Message Rate Capacity operational metric created by PCF Healthwatch helps indicate that Doppler instances are nearing recommended maximum load and should be scaled up.
  • [Feature] PCF Healthwatch now comes with out of the box Alerting for the recommended PCF Key Performance Indicators and Key Scaling Indicators. Information on how to configure the alerting thresholds shipped out of the box can be found here.
    • Healthwatch publishes alerts to a new common publisher, PCF Event Alerts. Alerts can be sent by email to specific users and to distribution lists, and/or by webhook for integrations such as Slack. Out-of-the-box alerts that are not of interest can be unsubscribed from. Management of alert subscriptions and distribution of the alerts is managed via PCF Event Alerts.
    • Healthwatch ships out of the box alerts with preconfigured threshold values. For KPI suggested alerting thresholds that have been recommended as dynamic, or needing to be fine-tuned to a given foundation’s use cases, customers should further fine-tune the thresholds PCF Healthwatch has provided. In particular, consider fine-tuning these alerts to the environment’s use-cases: Number of Routes Registered, Router Handling Latency, Router Exhausted Connections, Router 502 Bad Gateways, and Router Server Errors. Baselines can be established for the environment by reviewing the respective graphs in the PCF Healthwatch UI.
  • [Feature] Customers can now configure the memory value that PCF Healthwatch will calculate the remaining Number of Free Chunks of Memory from. Information on how to configure this value can be found here.
    • PCF Healthwatch ships with an out of the box configuration of 4GB, which is the standard recommended value. If the standard size of apps pushed on a given deployment exceeds 4GB, this default value should be increased accordingly.
    • If Compute capacity has been isolated via PCF Isolation Segments, Healthwatch will default to 4GB per capacity deployment. This default value is configurable per deployment.
  • [Feature] To support current Alerting and Free Chunks configuration capability in the product, a new Admin User Scope healthwatch.admin has been added.
    • The UAA Admin User has this new scope included by default. Additional users can be provided with this scope. Only users that should be allowed to change settings in Healthwatch should be granted the healthwatch.admin scope. In v1.2, this scope allows a user to alter the default alerting threshold values for the entire product. In future versions of Healthwatch, this user scope may be allowed additional configuration capabilities. As such, this Admin scope should not be granted to any user who should retain read-only access.
  • [Feature] PCF Healthwatch now supports use of PCF Isolation Segments.
    • Platform metric data emitted from the firehose per Isolation Segment, such as Router, Diego Cell, and BOSH VM Health, is now stored identifiable to the isolated deployment(s).
    • If Compute capacity is isolated, PCF Healthwatch now creates and publishes relevant operational metrics per isolated deployment(s).
    • Alerting thresholds shipped out of the box, such as Router and Cell Capacity measures, can be fine-tuned per isolated deployment(s).
  • [Feature] Optional Data Migration Errand: a migration errand that will move data from PCF Healthwatch v1.1 to v1.2 has been made optional to allow the Operator to choose between a faster upgrade (no data migration) or retaining the past 24hrs of data (via running the migration). Information on how to run this optional errand can be found here.
    • If the migration errand is run, there are 2 metrics that will be impacted. In v1.1 gorouter.latency is pre-aggregated to maximum values; this will impact the accuracy of the new v1.2 aggregations only for v1.1 transferred data. In PCF 2.1, Route Emitter Messages sent is a new metric, therefore this older metric will not transfer and the relevant chart will be blank for the period before v1.2 installation.
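The configurable chunk size described above (4GB by default) can be made concrete with a small calculation. This is a sketch under an assumption: that the number of free chunks is the sum, over all Diego cells in a deployment, of how many whole chunks fit in each cell's remaining memory. The function name free_chunks and the cell names and values are hypothetical.

```python
from typing import Mapping


def free_chunks(remaining_mb_by_cell: Mapping[str, int],
                chunk_mb: int = 4096) -> int:
    """Count whole chunks of chunk_mb that fit in each cell, summed."""
    if chunk_mb <= 0:
        raise ValueError("chunk_mb must be positive")
    # Memory is not pooled across cells: an app instance must fit on one cell.
    return sum(remaining // chunk_mb
               for remaining in remaining_mb_by_cell.values())


# Hypothetical remaining memory per cell, in MB.
cells = {"cell-0": 9000, "cell-1": 3000, "cell-2": 3000}

print(free_chunks(cells))                 # 2 (only cell-0 fits 4 GB chunks)
print(free_chunks(cells, chunk_mb=2048))  # 6 (4 + 1 + 1)
print(sum(cells.values()) // 4096)        # 3 (naive pooled total overstates)
```

The last line shows why a per-cell count is more useful than total remaining memory: fragmented free memory cannot host an app instance larger than any single cell's remainder.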

Known Issues

  • The PCF Healthwatch UI does not yet support visualization of platform metric data emitted from the firehose per Isolation Segment. This feature will be available in an upcoming patch.
  • UAA and Monitoring PCF Healthwatch Alerts are not yet available. These will be available in a future patch version.
  • [Bug Identified] Where two-factor authentication is enabled for the BOSH Director, the BOSH Health Check continuous validation test and the BOSH Task Check may not work correctly. This manifests as the BOSH Health panel showing as red even though the Director is healthy.
  • Due to a known metric emission issue with PCF, the Request Latency by UAA chart will be blank on the UAA details page.
  • Due to a known issue with Pivotal Application Service for Windows, Windows-based Diego Cell instances do not emit BOSH VM health metrics through the firehose unless the foundation is on Ops Manager v2.1.4 or later. As a result, potentially unhealthy instances of these cells might not surface in PCF Healthwatch.
  • Currently, Windows-based Diego Cell instances created via Pivotal Application Service for Windows emit their platform metrics with a hard-coded deployment value of cf. This can affect the capacity values shown by PCF Healthwatch, or by other consumers of monitoring metrics, as follows:
    • If Isolation Segments are used in combination with isolated Windows-based Diego Cells: Any Windows-based cells that are isolated to a given isolation segment report as part of the core cf system deployment. As a result, the isolation segment capacity values are under-reported (they include only Linux-based cells and exclude Windows-based cells), and the core cf system deployment over-reports capacity (it includes Windows-based cells from isolation segments as part of the core cf system capacity).
    • If Isolation Segments are not used: The core cf system deployment correctly shows total capacity. However, Windows-based and Linux-based cells are grouped together in PCF Healthwatch capacity assessments such as capacity remaining and number of free chunks of memory.
  • Three of the PAS MySQL KPI charts are hidden. These charts will be available in a future patch version:
    • Query Rate
    • MySQL CPU Busy Time
    • Percentage of Max Connections Used
  • [Bug Identified] Metrics emitted by PCF Healthwatch and PCF Healthwatch KPIs are not tagged with the Foundation name while being emitted back into the Firehose.

New Features in v1.2

The section below summarizes key differences between PCF Healthwatch v1.1 and v1.2. For more information about new features in v1.2, see v1.2.0 release notes.

  • Data aggregation improvements: In PCF Healthwatch v1.2 all firehose emitted platform metrics that Healthwatch ingests are aggregated per pre-defined rules before being written to the datastore. This helps avoid the cost of storing raw data, and in the case of gauge values, can help decorate the data with additional points of interest.
    • Counter metrics: Max counter value received for the 1 minute aggregation window, from which a minute-to-minute rate is later derived. Unique to the metric name. Further unique to the individual metric emitter (per instance as applicable).
    • Gauge metrics: Received values for the 1 minute aggregation window; aggregated and stored with 5 calculated values per metric: avg, min, max, med, 95p. Unique to the metric name. Further unique to the individual emitter (per instance as applicable).
  • Data store architecture changes:
    • PCF Healthwatch v1.2 migrates the data store architecture from a clustered MySQL with Galera to a single-node MySQL leveraging Redis as a queue mechanism. More information on the new data flow and availability handling in PCF Healthwatch v1.2 can be found on the Architecture page. See the v1.2.0 release notes for information on the optional data migration errand.
    • The Metrics Forwarder component is now highly available and can be scaled as needed.
  • PCF Healthwatch is now able to support use of PCF Isolation Segments.
  • UI Updates:
    • Firehose Loss Rate has been renamed Log Transport Loss Rate, and the Red/Green threshold values have been tightened in accordance with the KPI recommendation updates for PCF v2.1.
    • New Doppler Message Rate chart available on the Logging Performance page in accordance with the KPI recommendation updates for PCF v2.1.
    • GoRouter Latency and UAA Latency charts now use gorouter.latency.95p and gorouter.latency.uaa.95p, respectively.
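The one-minute aggregation rules listed under "Data aggregation improvements" can be sketched as follows. This is an illustration only: the function names are hypothetical, the 95th percentile uses the nearest-rank method as an assumption (the notes do not say which method Healthwatch uses), and the rate is taken as the difference between consecutive window maxima without handling counter resets.

```python
import math
import statistics
from typing import Dict, Sequence


def aggregate_gauge(values: Sequence[float]) -> Dict[str, float]:
    """Reduce one window of gauge values to avg, min, max, med, 95p."""
    if not values:
        raise ValueError("window must contain at least one value")
    ordered = sorted(values)
    # Nearest-rank percentile: smallest value with >= 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "med": statistics.median(ordered),
        "95p": ordered[rank - 1],
    }


def aggregate_counter(values: Sequence[int]) -> int:
    """Keep only the maximum counter value seen in the window."""
    if not values:
        raise ValueError("window must contain at least one value")
    return max(values)


def counter_rate(previous_max: int, current_max: int) -> int:
    """Minute-to-minute rate from two consecutive window maxima."""
    return current_max - previous_max


# Hypothetical samples from a single emitter.
print(aggregate_gauge([10, 20, 30, 40, 100]))
# {'avg': 40.0, 'min': 10, 'max': 100, 'med': 30, '95p': 100}

first = aggregate_counter([100, 120, 150])
second = aggregate_counter([160, 210])
print(counter_rate(first, second))  # 60
```

In this scheme each aggregate would be keyed by metric name and by emitter instance, matching the "unique to the metric name" and "unique to the individual emitter" wording above.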