Configuring Pivotal Healthwatch Alerts

This topic describes how to use the Pivotal Healthwatch API to retrieve and configure alert configurations. It also provides information about configuring Pivotal Event Alerts to receive push notifications when a Pivotal Healthwatch alert occurs.

Note: Pivotal Healthwatch stores all data points for 25 hours and then prunes them. Any active alerts that are pruned reissue an alert every 24 hours if the related metrics have not yet returned to a normal state.

Currently, two main types of alert configurations are supported: Out-of-the-Box and Deployment-specific. Through this API, consumers can do the following:

  • View current alert configurations
  • Enable/disable current alert configurations
  • Update Out-of-the-Box threshold values
  • Create or delete deployment-specific configurations based on existing Out-of-the-Box configurations

Prerequisites and Assumptions

The steps in this document assume that you can generate bearer tokens for a UAA client with the healthwatch.read (GET only) and healthwatch.admin (both GET and POST) scopes.

After creating a UAA client that has the healthwatch.read or healthwatch.admin scope, authenticate against UAA by running:

uaac token client get HEALTHWATCH-ADMIN-CLIENT -s HEALTHWATCH-ADMIN-CLIENT-SECRET

Where:

  • HEALTHWATCH-ADMIN-CLIENT is the UAA client with the healthwatch.read or healthwatch.admin scopes.

  • HEALTHWATCH-ADMIN-CLIENT-SECRET is the UAA client secret.

  • SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

At this point you are properly authenticated and ready to start using the Healthwatch Alerts API.

Healthwatch API Status

To test the availability of the Healthwatch API, send a GET request to the /info endpoint by running:

curl https://healthwatch-api.SYSTEM-DOMAIN/info

Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

The expected response is a 200/OK with the message "HAPI is happy".

View All Alert Configurations

GET /v1/alert-configurations

To view a list of alert configurations, send a GET request to the /v1/alert-configurations endpoint by running:

uaac curl https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations

Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

This returns a JSON array of alert configurations:

[
    {
        "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
        "enabled": true,
        "threshold": {
            "critical": 95,
            "warning": 85,
            "type": "UPPER"
        }
    },
    {
        "query": "origin == 'some_origin' and name == 'Another.Metric.Name'",
        "enabled": false,
        "threshold": {
            "critical": 0,
            "warning": 0,
            "type": "LOWER"
        }
    },
    {
        "query": "origin == 'another_origin' and name == 'Some.Metric.Name'",
        "enabled": true,
        "threshold": {
            "critical": 1,
            "type": "EQUALITY"
        }
    }
]

The query and threshold properties are described in detail in Queries and Thresholds below.
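As a rough illustration of working with the response above, the following Python sketch parses a saved copy of the array and lists the state of each alert configuration. The sample data is an abbreviated copy of the example response, and summarize is a hypothetical helper name, not part of the Healthwatch API.

```python
import json

# Abbreviated copy of the example response shown above.
SAMPLE = """
[
  {"query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
   "enabled": true,
   "threshold": {"critical": 95, "warning": 85, "type": "UPPER"}},
  {"query": "origin == 'another_origin' and name == 'Some.Metric.Name'",
   "enabled": true,
   "threshold": {"critical": 1, "type": "EQUALITY"}}
]
"""

def summarize(configs):
    """Return one (query, enabled, type, critical, warning) tuple per config."""
    rows = []
    for config in configs:
        threshold = config["threshold"]
        # Some thresholds (EQUALITY in the sample) have no warning value.
        rows.append((config["query"], config["enabled"], threshold["type"],
                     threshold["critical"], threshold.get("warning")))
    return rows

rows = summarize(json.loads(SAMPLE))
for row in rows:
    print(row)
```

Using threshold.get("warning") rather than direct indexing keeps the sketch from failing on configurations that have only a critical value.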

View Specific Alert Configurations

GET /v1/alert-configurations?q=…

To narrow the results of a GET request, add a query to the URL in a parameter named q:

uaac curl "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'some_origin' and name == 'Some.Metric.Name'"

Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

This returns a JSON array of alert configurations, filtered against the provided query:

[
    {
        "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
        "enabled": true,
        "threshold": {
            "critical": 95,
            "warning": 85,
            "type": "UPPER"
        }
    }
]
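The query in the example URL contains spaces and quotation marks. If your HTTP client does not encode these characters for you, one option is to percent-encode the q parameter yourself. The following Python sketch assumes that the API accepts a standard percent-encoded query string, which this topic does not state explicitly. The build_filter_url name and the sys.example.com domain are illustrative.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

def build_filter_url(system_domain, query):
    """Build a /v1/alert-configurations URL with a percent-encoded q parameter."""
    base = f"https://healthwatch-api.{system_domain}/v1/alert-configurations"
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base + "?" + urlencode({"q": query}, quote_via=quote)

query = "origin == 'some_origin' and name == 'Some.Metric.Name'"
url = build_filter_url("sys.example.com", query)
print(url)

# Decoding the URL again should give back the original query text.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded == query)
```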

Update Alert Configurations

POST /v1/alert-configurations

To update an existing alert configuration, make a POST request to the alert-configurations endpoint with the updated data:

uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"  \
      -H "Content-Type: application/json" \
      --data "{\"query\":\"origin == 'some_origin' and name == 'Some.Metric.Name'\",\"threshold\":{\"critical\":90,\"warning\":80},\"enabled\":true}"

Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

The response echoes the updated alert configuration. See the following example output:

{
    "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "enabled": true,
    "threshold": {
        "critical": 90,
        "warning": 80,
        "type": "UPPER"
    }
}

Warning: These alert configurations cannot be deleted. To revert your changes, update the alert back to its default values. For more information, see Supported Alerts below.
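Escaping the JSON request body by hand is error-prone. As a sketch, the following Python builds the same body with json.dumps, quotes each argument for the shell with shlex.quote, and prints a uaac curl command equivalent to the update command above. The update_command name and the sample values are illustrative.

```python
import json
import shlex

def update_command(system_domain, query, critical, warning, enabled=True):
    """Return a uaac curl command line that POSTs an alert configuration update."""
    body = json.dumps({
        "query": query,
        "threshold": {"critical": critical, "warning": warning},
        "enabled": enabled,
    })
    url = f"https://healthwatch-api.{system_domain}/v1/alert-configurations"
    # shlex.quote wraps each argument so the shell passes it through unchanged.
    args = ["uaac", "curl", "-X", "POST", url,
            "-H", "Content-Type: application/json", "--data", body]
    return " ".join(shlex.quote(arg) for arg in args)

command = update_command(
    "sys.example.com",
    "origin == 'some_origin' and name == 'Some.Metric.Name'",
    critical=90, warning=80)
print(command)
```

Review the printed command before running it, because the update takes effect as soon as the request succeeds.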

Create Alert Configurations for Isolation Segments

POST /v1/alert-configurations

Specific thresholds can be set for an isolation segment by extending existing alert configurations with a deployment specifier. For example, to create an isolation segment alert configuration for the above alert, run:

uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"  \
      -H "Content-Type: application/json" \
      --data "{\"query\":\"origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'ISOLATION-SEGMENT-NAME'\",\"threshold\":{\"critical\":55,\"warning\":45}}"

Where:

  • SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

  • ISOLATION-SEGMENT-NAME is the name of the isolation segment for which you want to create an alert configuration.

A new alert configuration is enabled by default. You can optionally include \"enabled\": false in the request body to override this behavior.

The threshold type is inferred from the deployment-agnostic alert configuration and cannot be changed.

The created alert configuration is echoed back in the following response:

{
    "query": "origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'ISOLATION-SEGMENT-NAME'",
    "enabled": true,
    "threshold": {
        "critical": 55,
        "warning": 45,
        "type": "UPPER"
    }
}

Note: You can delete isolation segment alert configurations only if you created them with the above method.
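The request body for an isolation segment differs from the deployment-agnostic one only in the added deployment clause. The following Python sketch derives it from an existing query. The isolation_segment_body name is hypothetical, and ISOLATION-SEGMENT-NAME remains a placeholder that you replace with your own segment name.

```python
import json

def isolation_segment_body(base_query, segment, critical, warning, enabled=None):
    """Extend a deployment-agnostic query with a deployment specifier."""
    body = {
        "query": f"{base_query} and deployment == '{segment}'",
        # No "type" key: it is inferred from the deployment-agnostic configuration.
        "threshold": {"critical": critical, "warning": warning},
    }
    if enabled is not None:
        # Omitting "enabled" keeps the default, which is an enabled alert.
        body["enabled"] = enabled
    return body

body = isolation_segment_body(
    "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "ISOLATION-SEGMENT-NAME", critical=55, warning=45)
print(json.dumps(body, indent=4))
```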

Delete Isolation Segment Alert Configurations

To delete a user-created alert configuration for an isolation segment, add a query to the URL in a parameter named q:

DELETE /v1/alert-configurations?q=...

See the following example:

uaac curl -X DELETE "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'ISOLATION-SEGMENT-NAME'"

Where:

  • SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

  • ISOLATION-SEGMENT-NAME is the name of the isolation segment for which you want to delete an alert configuration.

This returns the number of deleted alert configurations. See the following example output:

1

Note: You can delete isolation segment alert configurations only if you created them through the Healthwatch API.
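Because only deployment-specific alert configurations can be deleted, a client-side guard can catch a query that is missing its deployment clause before the request is sent. The following Python sketch shows one way to do this. The delete_url name is hypothetical, and the percent-encoding of the q parameter assumes that the API accepts standard URL encoding.

```python
import re
from urllib.parse import quote

def delete_url(system_domain, query):
    """Build a DELETE URL, refusing queries that lack a deployment clause."""
    # Only deployment-specific configurations can be deleted, so require one.
    if not re.search(r"\bdeployment\s*==\s*'[^']+'", query):
        raise ValueError("query must include a deployment == '...' clause")
    base = f"https://healthwatch-api.{system_domain}/v1/alert-configurations"
    return f"{base}?q={quote(query, safe='')}"

url = delete_url(
    "sys.example.com",
    "origin == 'some_origin' and name == 'Some.Metric.Name' "
    "and deployment == 'ISOLATION-SEGMENT-NAME'")
print(url)
```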

Queries

The query field is used in three ways:

  • GET requests: The query filters the alert configurations being queried.
  • POST requests: The query specifies the alert configuration being updated.
  • DELETE requests: The query filters the alert configurations being deleted.

The query associated with an alert configuration denotes the trigger conditions for the alert.

A well-formed query is a valid Spring Expression Language (SpEL) expression that uses only the equality, "==", and conjunction, "and", operators. For example:

"origin == 'some_origin' and name == 'Some.Metric.Name'"  # valid
"origin  > 'some_origin' and name == 'Some.Metric.Name'"  # invalid (uses '>')
"origin == 'some_origin'  or name == 'Some.Metric.Name'"  # invalid (uses 'or')

The following fields can be queried:

  • (Required) origin   
  • (Required) name       
  • job
  • deployment
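The rules above can be checked on the client before a request is sent. The following Python sketch is a rough approximation that uses a regular expression. It is not the SpEL parser that the API uses, and the parse_query and is_well_formed names are hypothetical.

```python
import re

ALLOWED_FIELDS = {"origin", "name", "job", "deployment"}
REQUIRED_FIELDS = {"origin", "name"}
CLAUSE = re.compile(r"^\s*(\w+)\s*==\s*'([^']*)'\s*$")

def parse_query(query):
    """Parse "field == 'value' and ..." into a dict, or raise ValueError."""
    fields = {}
    # Naive split: a value that itself contains " and " would be rejected.
    for clause in re.split(r"\s+and\s+", query.strip()):
        match = CLAUSE.match(clause)
        if not match:
            raise ValueError(f"unsupported clause: {clause!r}")
        field, value = match.groups()
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported field: {field}")
        fields[field] = value
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return fields

def is_well_formed(query):
    """Return True if parse_query accepts the query."""
    try:
        parse_query(query)
        return True
    except ValueError:
        return False

for example in (
    "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "origin  > 'some_origin' and name == 'Some.Metric.Name'",
    "origin == 'some_origin'  or name == 'Some.Metric.Name'",
):
    print(is_well_formed(example), example)
```

A check like this only catches obvious mistakes early. The API response remains the authority on whether a query is supported.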

Thresholds

The threshold field contains a threshold type, as well as critical and warning threshold values. The type can be "UPPER", "LOWER", "EQUALITY", or "INEQUALITY".

Alert configurations whose thresholds are of the UPPER type trigger their alerts when the actual metric value is above the warning value, and again when above the critical value.

Thresholds with the LOWER type work the same way, except the alerts trigger when the metric falls below the thresholds.

The EQUALITY alerts trigger when the metric value is not exactly equal to the critical threshold. These alerts do not have warning thresholds.

The INEQUALITY alerts trigger when the metric value is exactly equal to the critical threshold. These alerts do not have warning thresholds.
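The four threshold types can be summarized in a short function. The following Python sketch follows the descriptions above and assumes strict comparisons for "above" and "below". It ignores the assessment window, and the evaluate name is hypothetical. It illustrates the semantics only and is not how Healthwatch implements alerting.

```python
def evaluate(threshold, value):
    """Return "critical", "warning", or "ok" for a single metric value."""
    kind = threshold["type"]
    critical = threshold["critical"]
    warning = threshold.get("warning")
    if kind == "EQUALITY":
        # Triggers when the value is NOT equal to the critical threshold.
        return "ok" if value == critical else "critical"
    if kind == "INEQUALITY":
        # Triggers when the value IS equal to the critical threshold.
        return "critical" if value == critical else "ok"
    if kind == "UPPER":
        breached = lambda limit: value > limit
    elif kind == "LOWER":
        breached = lambda limit: value < limit
    else:
        raise ValueError(f"unknown threshold type: {kind}")
    if breached(critical):
        return "critical"
    if warning is not None and breached(warning):
        return "warning"
    return "ok"

upper = {"type": "UPPER", "critical": 95, "warning": 85}
for sample in (80, 90, 96):
    print(sample, evaluate(upper, sample))
```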

Supported Alerts

By default, Pivotal Healthwatch includes the configurable alerts described in this section. Each alert lists the default warning and critical thresholds recommended for a metric that Healthwatch monitors.

Performance Alerts

Alert Metric Threshold
Active Locks Held Name: ActiveLocks
Origin: locket
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 4
Unit: Number
Assessment Window (minutes): 5
Auctioneer Time to Fetch Cell State Name: AuctioneerFetchStatesDuration
Origin: auctioneer
Category: Compute Performance
Threshold Type: UPPER
Critical: 5000000000
Warning: 2000000000
Unit: ns
Assessment Window (minutes): 5
App Instances Placement Failures Rate Name: AuctioneerLRPAuctionsFailed
Origin: auctioneer
Category: App Instances
Threshold Type: UPPER
Critical: 1
Warning: .5
Unit: Number
Assessment Window (minutes): 5
Task Placement Failures Rate Name: AuctioneerTaskAuctionsFailed
Origin: auctioneer
Category: App Instances
Threshold Type: UPPER
Critical: 1
Warning: .5
Unit: Number
Assessment Window (minutes): 5
BBS Time to Run LRP Convergence Name: ConvergenceLRPDuration
Origin: bbs
Category: Compute Performance
Threshold Type: UPPER
Critical: 20000000000
Warning: 10000000000
Unit: ns
Assessment Window (minutes): 15
Cloud Controller and Diego in Sync Name: Diego.AppsDomainSynced
Origin: bbs
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
Router File Descriptors Name: file_descriptors
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 60000
Warning: 50000
Unit: Number
Assessment Window (minutes): 5
PAS MySQL Galera Cluster Status Name: Galera.ClusterStatusSum
Origin: healthwatch
Category: MySQL
Threshold Type: LOWER
Critical: 0.9999
Warning: 2.9999
Unit: Number
Assessment Window (minutes): 5
PAS MySQL Galera Cluster Size Name: Galera.TotalPercentageHealthyNodes
Origin: healthwatch
Category: MySQL
Threshold Type: LOWER
Critical: 0.3332
Warning: 0.9999
Unit: Percent
Assessment Window (minutes): 5
Healthwatch BOSH Director Test Availability Name: health.check.bosh.director.probe.available
Origin: healthwatch
Category: BOSH Director
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 10
Ops Manager Test Availability Name: health.check.OpsMan.probe.available
Origin: healthwatch
Category: Ops Manager
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
Canary App Health Test Availability Name: health.check.CanaryApp.probe.available
Origin: healthwatch
Category: Canary App
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
CLI Health Test Availability Name: health.check.cliCommand.probe.available
Origin: healthwatch
Category: CLI
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
Healthwatch UI Availability Name: health.check.ui.available
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
Healthwatch Nozzle Disconnects Name: ingestor.disconnects
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 10
Warning: 5
Unit: Number
Assessment Window (minutes): 5
Healthwatch Ingestor Data Drops Name: ingestor.dropped
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 20
Warning: 10
Unit: Number
Assessment Window (minutes): 5
Healthwatch Ingestor Metrics Ingested Name: ingestor.ingested
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0.01
Warning: 0.01
Unit: Number
Assessment Window (minutes): 30
Healthwatch Ingestor BOSH System Metrics Ingested Name: ingestor.ingested.boshSystemMetrics
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0.01
Warning: 0.01
Unit: Number
Assessment Window (minutes): 30
UAA Request Latency Name: latency.uaa
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 150
Warning: 100
Unit: ms
Assessment Window (minutes): 5
Locks Held by BBS Name: LockHeld
Origin: bbs
Category: Healthwatch
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
Locks Held by Auctioneer Name: LockHeld
Origin: auctioneer
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
More App Instances Than Expected Name: LRPsExtra
Origin: bbs
Category: App Instances
Threshold Type: UPPER
Critical: 10
Warning: 5
Unit: Number
Assessment Window (minutes): 5
Fewer App Instances Than Expected Name: LRPsMissing
Origin: bbs
Category: App Instances
Threshold Type: UPPER
Critical: 10
Warning: 5
Unit: Number
Assessment Window (minutes): 5
Healthwatch Super Metrics Published Name: metrics.published
Origin: healthwatch
Category: Healthwatch
Threshold Type: LOWER
Critical: 0
Warning: 20
Unit: Number
Assessment Window (minutes): 5
Time Since Last Route Register Received Name: ms_since_last_registry_update
Origin: gorouter
Category: Routing
Threshold Type: UPPER
Critical: 30000
Warning: 30000
Unit: ms
Assessment Window (minutes): 5
PAS MySQL Server Availability Name: /mysql/available
Origin: mysql
Job: mysql, database
Category: MySQL
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 5
PAS MySQL Galera Cluster Node Readiness Name: /mysql/galera/wsrep_ready
Origin: mysql
Job: mysql, database
Category: MySQL
Threshold Type: LOWER
Critical: 1
Warning: 0.9999
Unit: Number
Assessment Window (minutes): 5
Cell Rep Time to Sync Name: RepBulkSyncDuration
Origin: rep
Category: Compute Performance
Threshold Type: UPPER
Critical: 10000000000
Warning: 5000000000
Unit: ns
Assessment Window (minutes): 15
Number of Route Registration Messages Sent and Received Name: RouteRegistration.MessagesDelta
Origin: healthwatch
Category: Routing
Threshold Type: UPPER
Critical: 50
Warning: 30
Unit: Number
Assessment Window (minutes): 5
BBS Time to Handle Requests Name: RequestLatency
Origin: bbs
Category: Compute Performance
Threshold Type: UPPER
Critical: 10000000000
Warning: 5000000000
Unit: ns
Assessment Window (minutes): 15
UAA Requests In Flight Name: server.inflight.count
Origin: uaa
Category: UAA
Threshold Type: UPPER
Critical: 200
Warning: 150
Unit: Number
Assessment Window (minutes): 5
VM CPU Name: system.cpu.user
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 95
Warning: 85
Unit: Percent
Assessment Window (minutes): 5
VM Ephemeral Disk Used Name: system.disk.ephemeral.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5
VM Persistent Disk Used Name: system.disk.persistent.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5
VM Disk Used Name: system.disk.system.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5
VM Health Check Recovery Name: system.healthy
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: LOWER
Critical: 0.4
Warning: 0.6
Unit: Number
Assessment Window (minutes): 5
VM Memory Used Name: system.mem.percent
Origin: bosh-system-metrics-forwarder
Category: All Jobs
Threshold Type: UPPER
Critical: 95
Warning: 85
Unit: Percent
Assessment Window (minutes): 5
UAA Throughput Rate Name: uaa.throughput.rate
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 15000
Warning: 12000
Unit: Number
Assessment Window (minutes): 5
Unhealthy Cells [1] Name: UnhealthyCell
Origin: rep
Category: Compute Performance
Threshold Type: EQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 5

For more information about the metrics Pivotal Healthwatch emits, see Pivotal Healthwatch Metrics.

[1] Healthwatch alerts per Diego Cell for this metric. Healthwatch sends a critical alert when any Diego Cell has been unhealthy for 5 minutes.

Scaling Alerts

Alert Metric Threshold
Number of Available Free Chunks of Cell Memory Name: Diego.AvailableFreeChunks
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 1
Warning: 2
Unit: Number
Assessment Window (minutes): 5
Number of Available Free Chunks of Cell Disk Name: Diego.AvailableFreeChunksDisk
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 50
Warning: 100
Unit: Number
Assessment Window (minutes): 5
Remaining Cell Disk Available Name: Diego.TotalAvailableDiskCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 6144
Warning: 12288
Unit: MBs
Assessment Window (minutes): 5
Cell Container Capacity Available Name: Diego.TotalPercentageAvailableContainerCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 0.35
Warning: 0.35
Unit: Percent
Assessment Window (minutes): 30
Cell Disk Available Name: Diego.TotalPercentageAvailableDiskCapacity.5M
Origin: healthwatch
Category: Capacity
Threshold Type: LOWER
Critical: 0.35
Warning: 0.35
Unit: Percent
Assessment Window (minutes): 30
Doppler Message Rate Capacity Name: Doppler.MessagesAverage.1M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 1000000
Warning: 800000
Unit: Number
Assessment Window (minutes): 60
Log Transport Loss Rate Name: Firehose.LossRate.1M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 0.01
Warning: 0.005
Unit: Percent
Assessment Window (minutes): 5
Redis Counter Event Queue Size Name: redis.counterEventQueue.size
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 10000
Unit: Number
Assessment Window (minutes): 5
Redis Value Metric Queue Size Name: redis.valueMetricQueue.size
Origin: healthwatch
Category: Healthwatch
Threshold Type: UPPER
Critical: 10000
Unit: Number
Assessment Window (minutes): 5
Syslog Adapter Capacity Name: SyslogDrain.Adapter.BindingsAverage.5M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 500
Warning: 450
Unit: Number
Assessment Window (minutes): 60
Syslog Adapter Loss Rate Name: SyslogDrain.Adapter.LossRate.1M
Origin: healthwatch
Category: Logging
Threshold Type: UPPER
Critical: 0.1
Warning: 0.01
Unit: Percent
Assessment Window (minutes): 5
Router Instance CPU Name: system.cpu.user
Origin: bosh-system-metrics-forwarder
Job: router
Category: Routing
Threshold Type: UPPER
Critical: 70
Warning: 60
Unit: Percent
Assessment Window (minutes): 5
UAA Instance CPU Name: system.cpu.user
Origin: bosh-system-metrics-forwarder
Job: uaa, control [2]
Category: UAA
Threshold Type: UPPER
Critical: 90
Warning: 80
Unit: Percent
Assessment Window (minutes): 5

[2] This alert covers two jobs to accommodate the different job names used in PAS and SRT.

Service Level Indicators

Alert Metric Threshold
BOSH Director Health Name: health.check.bosh.director.success
Origin: healthwatch
Category: BOSH Director
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 10
Ops Manager Availability Name: health.check.OpsMan.available
Origin: healthwatch
Category: Ops Manager
Threshold Type: EQUALITY
Critical: 1
Unit: Number
Assessment Window (minutes): 10
Canary App Availability Name: health.check.CanaryApp.available
Origin: healthwatch
Category: Canary App
Threshold Type: LOWER
Critical: 0.5
Unit: Number
Assessment Window (minutes): 5
Canary App Response Time Name: health.check.CanaryApp.responseTime
Origin: healthwatch
Category: Canary App
Threshold Type: UPPER
Critical: 30000
Warning: 15000
Unit: ms
Assessment Window (minutes): 5
CF Push Time Name: health.check.cliCommand.pushTime
Origin: healthwatch
Category: CLI
Threshold Type: UPPER
Critical: 120000
Warning: 60000
Unit: ms
Assessment Window (minutes): 10
Can CF Login Name: health.check.cliCommand.login
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Push Name: health.check.cliCommand.push
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Start Name: health.check.cliCommand.start
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Stop Name: health.check.cliCommand.stop
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can CF Delete Name: health.check.cliCommand.delete
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10
Can Receive Logs Name: health.check.cliCommand.logs
Origin: healthwatch
Category: CLI
Threshold Type: INEQUALITY
Critical: 0
Unit: Number
Assessment Window (minutes): 10

Environment Specific Alerts

The alerts in the table below are highly variable depending on your environment. Pivotal recommends customizing their critical and warning thresholds based on your environment. These alerts are disabled by default (the enabled property on the alert configuration is false) and the default threshold values are 0.

You can determine the best threshold for your environment by monitoring the metrics over time and noting the metric values that indicate acceptable and unacceptable system performance and health. Once you have monitored your system and established baseline values, you can update the threshold values and enable these alerts.

For more information about updating alert configurations in Pivotal Healthwatch, see Update Alert Configurations.

Alert Metric Alert Type
Active Presences Held Name: ActivePresences
Origin: locket
Category: Compute Performance
Performance
App Instance Starts Rate Name: AuctioneerLRPAuctionsStarted
Origin: auctioneer
Category: App Instances
Performance
Router Exhausted Connections Name: backend_exhausted_conns
Origin: gorouter
Category: Routing
Performance
Number of Router 502 Bad Gateways Name: bad_gateways
Origin: gorouter
Category: Routing
Performance
Number of Crashed App Instances Name: CrashedActualLRPs
Origin: bbs
Category: App Instances
Performance
Rate of Change in Running App Instances Name: Diego.LRPsAdded.1H
Origin: healthwatch
Category: App Instances
Performance
Remaining Cell Memory Available Name: Diego.TotalAvailableMemoryCapacity.5M
Origin: healthwatch
Category: Capacity
Scaling
Cell Memory Available Name: Diego.TotalPercentageAvailableMemoryCapacity.5M
Origin: healthwatch
Category: Capacity
Scaling
Router Handling Latency Name: latency
Origin: gorouter
Category: Routing
Performance
Number of Router 5XX Server Errors Name: responses.5xx
Origin: gorouter
Category: Routing
Performance
Route Emitter Time to Sync Name: RouteEmitterSyncDuration
Origin: route_emitter
Category: Compute Performance
Performance
Router Throughput Name: total_requests
Origin: gorouter
Category: Routing
Performance
Number of Router Routes Registered Name: total_routes
Origin: gorouter
Category: Routing
Performance

Warning: Since these alerts have default threshold values of 0, ensure the values are updated before they are enabled.
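The monitoring step described above can be approached in many ways. One possible heuristic, which is an assumption of this example and not a Pivotal recommendation, is to take high percentiles of the values observed during normal operation as starting points for an UPPER-type alert. The following Python sketch uses a nearest-rank percentile. The function names and the baseline data are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def suggest_upper_thresholds(samples, warning_pct=95, critical_pct=99):
    """Suggest warning and critical values for an UPPER-type alert (heuristic only)."""
    return {"warning": percentile(samples, warning_pct),
            "critical": percentile(samples, critical_pct)}

# Hypothetical baseline: 100 observed values of some metric.
baseline = list(range(1, 101))
suggestion = suggest_upper_thresholds(baseline)
print(suggestion)
```

Treat the result as a first estimate to refine against what you know about acceptable performance in your environment.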

Errors

This section lists common error messages and their causes.


Error Message: “Unsupported condition in query expression”

Possible Cause: Key in query string is something other than name, origin, job, or deployment


Error Message: “Unsupported query expression”

Possible Causes:

  • Query string does not include at least name and origin
  • Query string uses an operator other than "and" or "=="
  • Expression in query string is not in the format property == 'value' and ...

Error Message: “Invalid query expression”

Possible Cause: Invalid query string format


Error Message: “Invalid threshold type for metric 'some_origin.some_name'”

Possible Cause: The given threshold type does not match the expected type (see the tables in Supported Alerts above)


Error Message: “Must provide a warning threshold for an Upper/Lower Threshold”

Possible Cause: Threshold missing required values for type


Error Message: “name = 'some_name' and origin = 'some_origin' is not a supported alert configuration”

Error Message: “name = 'some_name', origin = 'some_origin', and job = 'some_job' is not a supported alert configuration”

Possible Cause: An alert configuration does not exist for the targeted metric


Error Message: “A valid threshold is required when creating an alert configuration”

Possible Cause: An attempt to create a new alert configuration (for example, for an isolation segment) did not include valid threshold values.
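If you script against the API, a small lookup can turn these messages into hints for the operator. The following Python sketch matches fragments of the messages listed above. It assumes that the message text appears in the response as written here, because the format of error responses is not described in this topic. The likely_cause name is hypothetical.

```python
# Fragments of the error messages listed above, mapped to their likely causes.
ERROR_CAUSES = [
    ("Unsupported condition in query expression",
     "a key other than name, origin, job, or deployment"),
    ("Unsupported query expression",
     "missing name or origin, an unsupported operator, or a malformed expression"),
    ("Invalid query expression",
     "invalid query string format"),
    ("Invalid threshold type",
     "threshold type does not match the expected type"),
    ("Must provide a warning threshold",
     "threshold is missing values required for its type"),
    ("is not a supported alert configuration",
     "no alert configuration exists for the targeted metric"),
    ("A valid threshold is required",
     "a new alert configuration was submitted without a valid threshold"),
]

def likely_cause(message):
    """Return the likely cause for a known error message, or None."""
    for fragment, cause in ERROR_CAUSES:
        if fragment in message:
            return cause
    return None

print(likely_cause("Unsupported query expression"))
```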

Walkthrough Example

A best practice deployment of Cloud Foundry includes at least three availability zones (AZs). For these types of deployments, Pivotal recommends that you have enough capacity to withstand the failure of an entire AZ.

By default, Healthwatch sends an alert if the Diego.TotalPercentageAvailableMemoryCapacity.5M metric falls below 35%, or one in three.

However, if your environment has been scaled up to five AZs, you may want to adjust the alert configuration accordingly to 20%, or one in five, by running:

uaac token client get HEALTHWATCH-ADMIN-CLIENT -s HEALTHWATCH-ADMIN-CLIENT-SECRET

uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"  \
      -H "Content-Type: application/json" \
      --data "{\"query\":\"origin == 'healthwatch' and name == 'Diego.TotalPercentageAvailableMemoryCapacity.5M'\",\"threshold\":{\"critical\":0.2,\"warning\":0.3}}"

Where:

  • HEALTHWATCH-ADMIN-CLIENT is the UAA client with the healthwatch.admin scope, which POST requests require.

  • HEALTHWATCH-ADMIN-CLIENT-SECRET is the UAA client secret.

  • SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.
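The 20% figure used above follows from simple arithmetic: if capacity is spread evenly across AZs, losing one AZ removes one share of the total. The following Python sketch shows the calculation. The even spread is an assumption, and the az_loss_fraction name is hypothetical.

```python
def az_loss_fraction(az_count):
    """Fraction of total capacity lost when one of az_count equal AZs fails."""
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    return 1 / az_count

for azs in (3, 5):
    print(azs, round(az_loss_fraction(azs), 3))
```

For three AZs the fraction is about one third, which the 35% value mentioned above rounds up slightly. For five AZs the fraction is 0.2, which matches the critical value in the request.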

To simply disable the alert, run:

uaac token client get HEALTHWATCH-ADMIN-CLIENT -s HEALTHWATCH-ADMIN-CLIENT-SECRET

uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"  \
      -H "Content-Type: application/json" \
      --data "{\"query\":\"origin == 'healthwatch' and name == 'Diego.TotalPercentageAvailableMemoryCapacity.5M'\",\"enabled\": false}"

Where:

  • HEALTHWATCH-ADMIN-CLIENT is the UAA client with the healthwatch.admin scope, which POST requests require.

  • HEALTHWATCH-ADMIN-CLIENT-SECRET is the UAA client secret.

  • SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.

The response body contains the updated alert configuration. To confirm the change afterward, retrieve the alert configuration by running:

uaac curl "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'healthwatch' and name == 'Diego.TotalPercentageAvailableMemoryCapacity.5M'"

Configure Pivotal Healthwatch Alert Notifications

You can configure Pivotal Event Alerts to receive push notifications when a Pivotal Healthwatch alert occurs. For example, if you configured a Pivotal Healthwatch alert for memory on a VM, you can use Pivotal Event Alerts to receive a message on Slack if memory on the VM exceeds the threshold defined in the Pivotal Healthwatch alert.

For more information about configuring Pivotal Event Alerts for Pivotal Healthwatch, see Pivotal Event Alerts.