Configuring PCF Healthwatch Alerts

Warning: PCF Healthwatch v1.6 is no longer supported or available for download. PCF Healthwatch v1.6 has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.

This topic describes how to use the Pivotal Cloud Foundry (PCF) Healthwatch API to retrieve and configure alert configurations. It also provides information about configuring PCF Event Alerts to receive push notifications when a PCF Healthwatch alert occurs.

Note: PCF Healthwatch stores all data points for 25 hours and then prunes them. Any active alert that is pruned is reissued every 24 hours if the related metric has not yet returned to a normal state.

Currently, two main types of alert configurations are supported: Out-of-the-Box and Deployment-specific. Through this API, consumers can do the following:

  • View current alert configurations
  • Update Out-of-the-Box threshold values
  • Create or delete deployment-specific configurations based on existing Out-of-the-Box configurations

Prerequisites/Assumptions

The steps in this document assume that you can generate bearer tokens for a UAA client with the healthwatch.read (GET only) and healthwatch.admin (both GET and POST) scopes.

After creating a client that has the healthwatch.read or healthwatch.admin scope, run the following command to authenticate against UAA:

uaac token client get <my_healthwatch_admin_client> -s <my_healthwatch_admin_secret>

At this point you are properly authenticated and ready to start using the Healthwatch Alerts API.

Healthwatch API Status

Test the availability of the Healthwatch API by hitting the /info endpoint with a GET request:

curl https://healthwatch-api.SYSTEM-DOMAIN/info

The expected response is a 200/OK with the message "HAPI is happy".
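For scripted checks, the status test above can be wrapped in a small function. The following is a minimal sketch, not part of the Healthwatch tooling: the function name hapi_is_up is hypothetical, and it assumes that curl is installed and that the /info endpoint answers without authentication, as in the command above. It uses only the standard curl options -s, -o, and -w.

```shell
# hapi_is_up BASE_URL
# Hypothetical helper: succeeds only when GET BASE_URL/info returns HTTP 200.
hapi_is_up() {
  # -s silences progress output, -o /dev/null discards the response body,
  # and -w '%{http_code}' prints only the HTTP status code.
  status=$(curl -s -o /dev/null -w '%{http_code}' "$1/info")
  [ "$status" = "200" ]
}

# Example (replace SYSTEM-DOMAIN with your system domain):
#   hapi_is_up "https://healthwatch-api.SYSTEM-DOMAIN" && echo "Healthwatch API is up"
```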

View All Alert Configurations

GET /v1/alert-configurations

To view a list of alert configurations, send a GET request to the /v1/alert-configurations endpoint:

uaac curl https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations

This returns a JSON array of alert configurations:

[
    {
        "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
        "threshold": {
            "critical": 95,
            "warning": 85,
            "type": "UPPER"
        }
    },
    {
        "query": "origin == 'some_origin' and name == 'Another.Metric.Name'",
        "threshold": {
            "critical": 9,
            "warning": 28,
            "type": "LOWER"
        }
    },
    {
        "query": "origin == 'another_origin' and name == 'Some.Metric.Name'",
        "threshold": {
            "critical": 1,
            "type": "EQUALITY"
        }
    }
]

The query and threshold properties are covered in detail below.

View Specific Alert Configurations

GET /v1/alert-configurations?q=...

To narrow the results of a GET request, add a query to the URL in a parameter named q:

uaac curl "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'some_origin' and name == 'Some.Metric.Name'" 

This returns a JSON array of alert configurations, filtered against the provided query:

[
    {
        "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
        "threshold": {
            "critical": 95,
            "warning": 85,
            "type": "UPPER"
        }
    }
]
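The query in the example above contains spaces and quotation marks. If your HTTP client does not encode these characters for you, you can percent-encode the value of q before adding it to the URL. The following is a minimal sketch under that assumption: the urlencode helper is hypothetical, handles ASCII input only, and requires awk.

```shell
# urlencode STRING
# Hypothetical helper: percent-encodes every character except the
# unreserved set (letters, digits, ".", "_", "~", "-"). ASCII input only.
urlencode() {
  printf '%s' "$1" | awk '
    BEGIN {
      # Build a character-to-code lookup table for ASCII.
      for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i
    }
    {
      out = ""
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c ~ /[A-Za-z0-9._~-]/) out = out c
        else out = out sprintf("%%%02X", ord[c])
      }
      print out
    }'
}

# Example:
#   q=$(urlencode "origin == 'some_origin' and name == 'Some.Metric.Name'")
#   uaac curl "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=$q"
```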

Update Alert Configurations

POST /v1/alert-configurations

To update an existing alert configuration, make a POST request to the alert-configurations endpoint with the updated data:

uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"  \
      -H "Content-Type: application/json" \
      --data "{\"query\":\"origin == 'some_origin' and name == 'Some.Metric.Name'\",\"threshold\":{\"critical\":90,\"warning\":80,\"type\":\"UPPER\"}}"

See the following example output:

{
    "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "threshold": {
        "critical": 90,
        "warning": 80,
        "type": "UPPER"
    }
}

Warning: These alert configurations cannot be deleted. In order to revert your changes, update the alert back to its default values.
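Hand-escaping the JSON body in the --data argument is error-prone. The following sketch builds the same request body with printf instead. The alert_payload function is a hypothetical helper, not part of the Healthwatch API or uaac, and it assumes that the query contains no double quotes or backslashes.

```shell
# alert_payload QUERY CRITICAL WARNING TYPE
# Hypothetical helper: prints the JSON body for a POST to
# /v1/alert-configurations. Assumes QUERY has no double quotes or backslashes.
alert_payload() {
  printf '{"query":"%s","threshold":{"critical":%s,"warning":%s,"type":"%s"}}\n' \
    "$1" "$2" "$3" "$4"
}

# Example:
#   uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations" \
#         -H "Content-Type: application/json" \
#         --data "$(alert_payload "origin == 'some_origin' and name == 'Some.Metric.Name'" 90 80 UPPER)"
```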

Create Alert Configurations for Isolation Segments

POST /v1/alert-configurations

You can set specific thresholds for an isolation segment by extending an existing alert configuration with a deployment specifier. For example, to create an isolation segment alert configuration for the above alert, run the following:

uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"  \
      -H "Content-Type: application/json" \
      --data "{\"query\":\"origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'Some-Isolated-Deployment'\",\"threshold\":{\"critical\":55,\"warning\":45,\"type\":\"UPPER\"}}"

The created alert configuration is echoed back in the following response:

{
    "query": "origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'Some-Isolated-Deployment'",
    "threshold": {
        "critical": 55,
        "warning": 45,
        "type": "UPPER"
    }
}

Note: You can delete Isolation Segment alert configurations only if you created them with the above method.

Delete Isolation Segment Alert Configurations

To delete a user-created alert configuration for an isolation segment, add a query to the URL in a parameter named q.

DELETE /v1/alert-configurations?q=...

See the following example:

uaac curl -X DELETE "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'Some-Isolated-Deployment'"

This returns the number of deleted alert configurations. See the following example output:

1

Note: You can delete Isolation Segment alert configurations only if you created them through the Healthwatch API.

Disable Alerts on a Metric

PCF Healthwatch does not support disabling alerts on a specific metric. To ensure that PCF Healthwatch does not alert on a metric, update the threshold of the alert to a value that will never trigger an alert.

For more information about thresholds, see Thresholds.

For more information about updating alert configurations, see Update Alert Configurations.

Queries

The query field is used in three ways:

  • GET requests: The query filters the alert configurations being queried.
  • POST requests: The query specifies the alert configuration being updated.
  • DELETE requests: The query filters the alert configurations being deleted.

The query associated with an alert configuration denotes the trigger conditions for the alert.

A well-formed query is a valid Spring Expression Language (SpEL) expression that uses only the equality, "==", and conjunction, "and", operators. For example:

"origin == 'some_origin' and name == 'Some.Metric.Name'"  # valid
"origin  > 'some_origin' and name == 'Some.Metric.Name'"  # invalid (uses '>')
"origin == 'some_origin'  or name == 'Some.Metric.Name'"  # invalid (uses 'or')

The following fields can be queried:

  • origin (required)
  • name (required)
  • job
  • deployment
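The rules above can be checked locally before you send a request. The following is a rough sketch, not the validation that the API itself performs: the is_valid_query function is hypothetical, and it is stricter than SpEL because it expects exactly one space around each operator and single-quoted values.

```shell
# is_valid_query QUERY
# Hypothetical helper: succeeds when QUERY uses only the supported fields,
# only the "==" and "and" operators, and includes both origin and name.
is_valid_query() {
  # Every clause must look like: field == 'value', joined by " and ".
  printf '%s\n' "$1" | grep -Eq "^(origin|name|job|deployment) == '[^']*'( and (origin|name|job|deployment) == '[^']*')*\$" || return 1
  # origin and name are required.
  printf '%s\n' "$1" | grep -Eq "(^| and )origin == " || return 1
  printf '%s\n' "$1" | grep -Eq "(^| and )name == " || return 1
}

# Example:
#   is_valid_query "origin == 'some_origin' and name == 'Some.Metric.Name'" && echo "query looks valid"
```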

Thresholds

The threshold field contains a threshold type, as well as critical and warning threshold values. The type can be "UPPER", "LOWER", "EQUALITY", or "INEQUALITY".

Alert configurations whose thresholds are of the UPPER type trigger their alerts when the actual metric value is above the warning value, and again when above the critical value.

Thresholds with the LOWER type work the same way, except the alerts trigger when the metric falls below the thresholds.

The EQUALITY alerts trigger when the metric value is not exactly equal to the critical threshold. These alerts do not have warning thresholds.

The INEQUALITY alerts trigger when the metric value is exactly equal to the critical threshold. These alerts do not have warning thresholds.
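The four threshold types can be summarized in a short function. This is an illustrative sketch of the rules described above, not Healthwatch's implementation: the alert_level function is hypothetical, it requires awk, and it assumes strict comparisons, so a value exactly equal to an UPPER or LOWER threshold does not trigger an alert.

```shell
# alert_level VALUE TYPE CRITICAL [WARNING]
# Hypothetical helper: prints "critical", "warning", "ok", or "unknown".
alert_level() {
  awk -v v="$1" -v t="$2" -v c="$3" -v w="${4:-}" 'BEGIN {
    level = "ok"
    if (t == "UPPER") {
      # Triggers when the value rises above a threshold.
      if (v + 0 > c + 0) level = "critical"
      else if (v + 0 > w + 0) level = "warning"
    } else if (t == "LOWER") {
      # Triggers when the value falls below a threshold.
      if (v + 0 < c + 0) level = "critical"
      else if (v + 0 < w + 0) level = "warning"
    } else if (t == "EQUALITY") {
      # Triggers when the value is NOT equal to the critical threshold.
      if (v + 0 != c + 0) level = "critical"
    } else if (t == "INEQUALITY") {
      # Triggers when the value IS equal to the critical threshold.
      if (v + 0 == c + 0) level = "critical"
    } else {
      level = "unknown"
    }
    print level
  }'
}

# Example:
#   alert_level 90 UPPER 95 85
```

For example, with the UPPER thresholds of 95 and 85 shown earlier, a value of 90 is above the warning threshold but not the critical one.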

Supported Alerts

This section describes the default Warning and Critical alerts for PCF Healthwatch. Each alert includes recommended thresholds for metrics monitored by Healthwatch.

Pivotal recommends customizing the default thresholds for alerts indicated with a 1 in the following tables based on your environment. You can determine the best threshold for your environment by monitoring the metrics over time and noting the metric values that indicate acceptable and unacceptable system performance and health. For more information about updating alert configurations in PCF Healthwatch, see Update Alert Configurations.

By default, PCF Healthwatch includes the configurable alerts in the tables below. For more information about the metrics that PCF Healthwatch emits, see PCF Healthwatch Metrics.

Performance Alerts

Alert Metric Threshold
Active Locks Held Name: ActiveLocks
Origin: locket
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 4
Unit: Number
Assessment Window (minutes): 5
Active Presences Held¹ Name: ActivePresences
Origin: locket
Category: Compute Performance
Threshold Type: UPPER
Critical: 200
Warning: 150
Unit: Number
Assessment Window (minutes): 15
Auctioneer Time to Fetch Cell State Name: AuctioneerFetchStatesDuration
Origin: auctioneer
Category: Compute Performance
Threshold Type: UPPER
Critical: 5000000000
Warning: 2000000000
Unit: ns
Assessment Window (minutes): 5
App Instances Placement Failures Rate Name: AuctioneerLRPAuctionsFailed
Origin: auctioneer
Category: App Instances
Threshold Type: UPPER
Critical: 1
Warning: 0.5
Unit: Number
Assessment Window (minutes): 5
App Instance Starts Rate¹ Name: AuctioneerLRPAuctionsStarted
Origin: auctioneer
Category: App Instances
Threshold Type: UPPER
Critical: 100
Warning: 50
Unit: Number
Assessment Window (minutes): 5
Router Exhausted Connections¹ Name: backend_exhausted_conns
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 10
Warning: 5
Unit: Number
Assessment Window (minutes): 5
Number of Router 502 Bad Gateways¹ Name: bad_gateways
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 40
Warning: 30
Unit: Number
Assessment Window (minutes): 5
Task Placement Failures Rate Name: AuctioneerTaskAuctionsFailed
Origin: auctioneer
Category: App Instances
Threshold Type: UPPER
Critical: 1
Warning: 0.5
Unit: Number
Assessment Window (minutes): 5
BBS Time to Run LRP Convergence Name: ConvergenceLRPDuration
Origin: bbs
Category: Compute Performance
Threshold Type: UPPER
Critical: 20000000000
Warning: 10000000000
Unit: ns
Assessment Window (minutes): 15
Number of Crashed App Instances¹ Name: CrashedActualLRPs
Origin: bbs
Category: App Instances
Threshold Type: UPPER
Critical: 20
Warning: 10
Unit: Number
Assessment Window (minutes): 5
Cloud Controller and Diego in Sync Name: Diego.AppsDomainSynced
Origin: bbs
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
Rate of Change in Running App Instances¹ Name: Diego.LRPsAdded.1H
Origin: healthwatch
Category: App Instances
Threshold Type: UPPER
Critical: 100
Warning: 50
Unit: Number
Assessment Window (minutes): 5
Router File Descriptors Name: file_descriptors
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 60000
Warning: 50000
Unit: Number
Assessment Window (minutes): 5
PAS MySQL Galera Cluster Status Name: Galera.ClusterStatusSum
Origin: healthwatch
Category: MySQL
Threshold Type: LOWER
Critical: 0.9999
Warning: 2.9999
Unit: Number
Assessment Window (minutes): 5
PAS MySQL Galera Cluster Size Name: Galera.TotalPercentageHealthyNodes
Origin: healthwatch
Category: MySQL
Threshold Type: LOWER
Critical: 0.3332
Warning: 0.9999
Unit: Percent
Assessment Window (minutes): 5
Healthwatch BOSH Director Test Availability Name: health.check.bosh.director.probe.available
Origin: healthwatch
Category: BOSH Director
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 10
Ops Manager Test Availability Name: health.check.OpsMan.probe.available
Origin: healthwatch
Category: Ops Manager
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
Canary App Health Test Availability Name: health.check.CanaryApp.probe.available
Origin: healthwatch
Category: Canary App
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
CLI Health Test Availability Name: health.check.cliCommand.probe.available
Origin: healthwatch
Category: CLI
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
Healthwatch UI Availability Name: health.check.ui.available
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
Healthwatch Nozzle Disconnects Name: ingestor.disconnects
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 10
Warning: 5
Unit: Number
Assessment Window (minutes): 5
Healthwatch Ingestor Data Drops Name: ingestor.dropped
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 20
Warning: 10
Unit: Number
Assessment Window (minutes): 5
Healthwatch Ingestor Metrics Ingested Name: ingestor.ingested
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0.01
Warning: 0.01
Unit: Number
Assessment Window (minutes): 30
Healthwatch Ingestor BOSH System Metrics Ingested Name: ingestor.ingested.boshSystemMetrics
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0.01
Warning: 0.01
Unit: Number
Assessment Window (minutes): 30
Router Handling Latency¹ Name: latency
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 150
Warning: 100
Unit: ms
Assessment Window (minutes): 30
UAA Request Latency Name: latency.uaa
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 150
Warning: 100
Unit: ms
Assessment Window (minutes): 5
Locks Held by BBS Name: LockHeld
Origin: bbs
Category: Healthwatch
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
Locks Held by Auctioneer Name: LockHeld
Origin: auctioneer
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
More App Instances Than Expected Name: LRPsExtra
Origin: bbs
Category: App Instances
Threshold Type: UPPER
Critical: 10
Warning: 5
Unit: Number
Assessment Window (minutes): 5
Fewer App Instances Than Expected Name: LRPsMissing
Origin: bbs
Category: App Instances
Threshold Type: UPPER
Critical: 10
Warning: 5
Unit: Number
Assessment Window (minutes): 5
Healthwatch Super Metrics Published Name: metrics.published
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0
Warning: 20
Unit: Number
Assessment Window (minutes): 5
Time Since Last Route Register Received Name: ms_since_last_registry_update
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 30000
Warning: 30000
Unit: ms
Assessment Window (minutes): 5
PAS MySQL Server Availability Name: /mysql/available
Origin: mysql
Job: mysql, database
Category: MySQL
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
PAS MySQL Galera Cluster Node Readiness Name: /mysql/galera/wsrep_ready
Origin: mysql
Job: mysql, database
Category: MySQL
Threshold Type: LOWER
Critical: 1
Warning: 0.9999
Unit: Number
Assessment Window (minutes): 5
Cell Rep Time to Sync Name: RepBulkSyncDuration
Origin: rep
Category: Compute Performance
Threshold Type: UPPER
Critical: 10000000000
Warning: 5000000000
Unit: ns
Assessment Window (minutes): 15
Number of Router 5XX Server Errors¹ Name: responses.5xx
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 40
Warning: 30
Unit: Number
Assessment Window (minutes): 5
Route Emitter Time to Sync¹ Name: RouteEmitterSyncDuration
Origin: route_emitter
Category: Compute Performance
Threshold Type: UPPER
Critical: 10000000000
Warning: 5000000000
Unit: ns
Assessment Window (minutes): 15
Number of Route Registration Messages Sent and Received Name: RouteRegistration.MessagesDelta
Origin: healthwatch
Category: Routing
Threshold Type: UPPER
Critical: 50
Warning: 30
Unit: Number
Assessment Window (minutes): 5
BBS Time to Handle Requests Name: RequestLatency
Origin: bbs
Category: Compute Performance
Threshold Type: UPPER
Critical: 10000000000
Warning: 5000000000
Unit: ns
Assessment Window (minutes): 15
UAA Requests In Flight Name: server.inflight.count
Origin: uaa
Category: UAA
Threshold Type: UPPER
Critical: 200
Warning: 150
Unit: Number
Assessment Window (minutes): 5
VM CPU Name: system.cpu.user
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 95
Warning: 85
Unit: Percent
Assessment Window (minutes): 5
VM Ephemeral Disk Used Name: system.disk.ephemeral.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5
VM Persistent Disk Used Name: system.disk.persistent.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5
VM Disk Used Name: system.disk.system.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5
VM Health Check Recovery Name: system.healthy
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
Router Throughput¹ Name: total_requests
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 125000
Warning: 100000
Unit: Number
Assessment Window (minutes): 5
Number of Router Routes Registered¹ Name: total_routes
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 200
Warning: 100
Unit: Number
Assessment Window (minutes): 5
VM Memory Used Name: system.mem.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 95
Warning: 85
Unit: Percent
Assessment Window (minutes): 5
UAA Throughput Rate Name: uaa.throughput.rate
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 15000
Warning: 12000
Unit: Number
Assessment Window (minutes): 5
Unhealthy Cells² Name: UnhealthyCell
Origin: rep
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 5

1 Pivotal recommends customizing the default thresholds for these alerts based on your environment. You can determine the best threshold for your environment by monitoring the metrics over time and noting the metric values that indicate acceptable and unacceptable system performance and health. For more information about updating alert configurations in PCF Healthwatch, see Update Alert Configurations.
2 This alert is evaluated per cell. PCF Healthwatch issues a critical alert when any Diego Cell has been unhealthy for 5 minutes.

Scaling Alerts

Alert Metric Threshold
Number of Available Free Chunks of Cell Memory Name: Diego.AvailableFreeChunks
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 1
Warning: 2
Unit: Number
Assessment Window (minutes): 5
Number of Available Free Chunks of Cell Disk Name: Diego.AvailableFreeChunksDisk
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 50
Warning: 100
Unit: Number
Assessment Window (minutes): 5
Remaining Cell Disk Available Name: Diego.TotalAvailableDiskCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 6144
Warning: 12288
Unit: MBs
Assessment Window (minutes): 5
Remaining Cell Memory Available¹ Name: Diego.TotalAvailableMemoryCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 32768
Warning: 65536
Unit: MBs
Assessment Window (minutes): 5
Cell Container Capacity Available Name: Diego.TotalPercentageAvailableContainerCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 0.35
Warning: 0.35
Unit: Percent
Assessment Window (minutes): 30
Cell Disk Available Name: Diego.TotalPercentageAvailableDiskCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 0.35
Warning: 0.35
Unit: Percent
Assessment Window (minutes): 30
Cell Memory Available¹ Name: Diego.TotalPercentageAvailableMemoryCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 0.35
Warning: 0.35
Unit: Percent
Assessment Window (minutes): 30
Doppler Message Rate Capacity Name: Doppler.MessagesAverage.1M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 1000000
Warning: 800000
Unit: Number
Assessment Window (minutes): 60
Log Transport Loss Rate Name: Firehose.LossRate.1M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 0.01
Warning: 0.005
Unit: Percent
Assessment Window (minutes): 5
Redis Counter Event Queue Size Name: redis.counterEventQueue.size
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 10000
Unit: Number
Assessment Window (minutes): 5
Redis Value Metric Queue Size Name: redis.valueMetricQueue.size
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 10000
Unit: Number
Assessment Window (minutes): 5
Syslog Adapter Capacity Name: SyslogDrain.Adapter.BindingsAverage.5M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 500
Warning: 450
Unit: Number
Assessment Window (minutes): 60
Syslog Adapter Loss Rate Name: SyslogDrain.Adapter.LossRate.1M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 0.1
Warning: 0.01
Unit: Percent
Assessment Window (minutes): 5
Router Instance CPU Name: system.cpu.user
Origin: bosh-system-metrics-forwarder
Job: router
Category: Routing
Threshold Type: UPPER
Critical: 70
Warning: 60
Unit: Percent
Assessment Window (minutes): 5
UAA Instance CPU Name: system.cpu.user
Origin: bosh-system-metrics-forwarder
Job: uaa, control³
Category: UAA
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5

1 Pivotal recommends customizing the default thresholds for these alerts based on your environment. You can determine the best threshold for your environment by monitoring the metrics over time and noting the metric values that indicate acceptable and unacceptable system performance and health. For more information about updating alert configurations in PCF Healthwatch, see Update Alert Configurations.
3 This alert covers two jobs to accommodate the different job names used by PAS and SRT.

Service Level Indicators

Alert Metric Threshold
BOSH Director Health Name: health.check.bosh.director.success
Origin: healthwatch
Category: BOSH Director
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 10
Ops Manager Availability Name: health.check.OpsMan.available
Origin: healthwatch
Category: Ops Manager
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 10
Canary App Availability Name: health.check.CanaryApp.available
Origin: healthwatch
Category: Canary App
Threshold Type: LOWER
Critical: 0.5
Unit: Number
Assessment Window (minutes): 5
Canary App Response Time Name: health.check.CanaryApp.responseTime
Origin: healthwatch
Category: Canary App
Threshold Type: UPPER
Critical: 30000
Warning: 15000
Unit: ms
Assessment Window (minutes): 5
CF Push Time Name: health.check.cliCommand.pushTime
Origin: healthwatch
Category: CLI
Threshold Type: UPPER
Critical: 120000
Warning: 60000
Unit: ms
Assessment Window (minutes): 10
Can CF Login Name: health.check.cliCommand.login
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Push Name: health.check.cliCommand.push
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Start Name: health.check.cliCommand.start
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Stop Name: health.check.cliCommand.stop
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Delete Name: health.check.cliCommand.delete
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can Receive Logs Name: health.check.cliCommand.logs
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10

Errors

This section lists common error messages and their causes.


Error Message: “Unsupported condition in query expression”

Possible Cause: The query string contains a key other than name, origin, job, or deployment.


Error Message: “Unsupported query expression”

Possible Causes:

  • The query string does not include at least name and origin.
  • The query string uses an operator other than "and" or "==".
  • An expression in the query string is not in the format property == 'value' and ...

Error Message: “Invalid query expression”

Possible Cause: Invalid query string format


Error Message: “Invalid threshold type for metric 'some_origin.some_name'”

Possible Cause: The given threshold type does not match the type expected for the metric. See the tables in Supported Alerts.


Error Message: “Must provide a warning threshold for an Upper/Lower Threshold”

Possible Cause: The threshold is missing a value that its type requires.


Error Message: “name = 'some_name' and origin = 'some_origin' is not a supported alert configuration”

Error Message: “name = 'some_name', origin = 'some_origin', and job = 'some_job' is not a supported alert configuration”

Possible Cause: An alert configuration does not exist for the targeted metric

Walkthrough Example

A best-practice deployment of Cloud Foundry includes at least three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to withstand the failure of an entire AZ.

By default, Healthwatch sends an alert if the Diego.TotalPercentageAvailableMemoryCapacity.5M metric falls below 35%, or roughly one AZ in three.

However, if your environment has been scaled up to five AZs, you might want to adjust the alert threshold accordingly to 20%, or one AZ in five.

uaac token client get <my_healthwatch_admin_client> -s <my_healthwatch_admin_secret>
export token=$(uaac context | grep access_token | awk '{print $2}')

uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"  \
      -H "Content-Type: application/json" \
      --data "{\"query\":\"origin == 'healthwatch' and name == 'Diego.TotalPercentageAvailableMemoryCapacity.5M'\",\"threshold\":{\"critical\":0.2,\"warning\":0.3,\"type\":\"LOWER\"}}"

The response body contains the updated alert configuration. You can then confirm the change:

uaac curl "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'healthwatch' and name == 'Diego.TotalPercentageAvailableMemoryCapacity.5M'"
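The 20% figure in this walkthrough follows from simple arithmetic: with N equally sized AZs, losing one AZ removes 1/N of total capacity. The following sketch computes that fraction for any AZ count. The az_capacity_threshold function is hypothetical, requires awk, and assumes that all AZs have equal capacity. The default of 35% sits slightly above one third, so you might want to add a similar margin to the computed value.

```shell
# az_capacity_threshold AZ_COUNT
# Hypothetical helper: prints the fraction of total capacity that one AZ
# represents, assuming all AZs are the same size.
az_capacity_threshold() {
  awk -v n="$1" 'BEGIN {
    if (n < 1) exit 1   # guard against zero or negative counts
    printf "%.2f\n", 1 / n
  }'
}

# Example:
#   az_capacity_threshold 5
```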

Configure PCF Healthwatch Alert Notifications

You can configure PCF Event Alerts to receive push notifications when a PCF Healthwatch alert occurs. For example, if you configured a PCF Healthwatch alert for memory on a VM, you can use PCF Event Alerts to receive a message on Slack if memory on the VM exceeds the threshold defined in the PCF Healthwatch alert.

For more information about configuring PCF Event Alerts for PCF Healthwatch, see PCF Event Alerts.