Configuring PCF Healthwatch Alerts
- Prerequisites/Assumptions
- Healthwatch API Status
- View All Alert Configurations
- View Specific Alert Configurations
- Update Alert Configurations
- Create Alert Configurations for Isolation Segments
- Delete Isolation Segment Alert Configurations
- Disable Alerts on a Metric
- Queries
- Thresholds
- Supported Alerts
- Errors
- Walkthrough Example
- Configure PCF Healthwatch Alert Notifications
Warning: PCF Healthwatch v1.6 is no longer supported or available for download. PCF Healthwatch v1.6 has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
This topic describes how to use the Pivotal Cloud Foundry (PCF) Healthwatch API to retrieve and configure alert configurations. It also provides information about configuring PCF Event Alerts to receive push notifications when a PCF Healthwatch alert occurs.
Note: PCF Healthwatch stores all data points for 25 hours and then prunes them. If an active alert is pruned, PCF Healthwatch reissues the alert every 24 hours until the related metrics recover to a normal state.
Currently, two main types of alert configurations are supported: Out-of-the-Box and Deployment-specific. Through this API, consumers can do the following:
- View current alert configurations
- Update Out-of-the-Box threshold values
- Create or delete deployment-specific configurations based on existing Out-of-the-Box configurations
Prerequisites/Assumptions
The steps in this document assume that you can generate bearer tokens for a UAA client with the healthwatch.read (GET only) and healthwatch.admin (both GET and POST) scopes.
After creating a UAA client that has the healthwatch.read or healthwatch.admin scope, run the following command to authenticate against UAA:
uaac token client get <my_healthwatch_admin_client> -s <my_healthwatch_admin_secret>
At this point you are properly authenticated and ready to start using the Healthwatch Alerts API.
Healthwatch API Status
Test the availability of the Healthwatch API by sending a GET request to the /info endpoint:
curl https://healthwatch-api.SYSTEM-DOMAIN/info
The expected response is 200 OK with the message "HAPI is happy".
View All Alert Configurations
GET /v1/alert-configurations
To view a list of alert configurations, send a GET request to the /v1/alert-configurations endpoint:
uaac curl https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations
This returns a JSON array of alert configurations:
[
  {
    "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "threshold": {
      "critical": 95,
      "warning": 85,
      "type": "UPPER"
    }
  },
  {
    "query": "origin == 'some_origin' and name == 'Another.Metric.Name'",
    "threshold": {
      "critical": 9,
      "warning": 28,
      "type": "LOWER"
    }
  },
  {
    "query": "origin == 'another_origin' and name == 'Some.Metric.Name'",
    "threshold": {
      "critical": 1,
      "type": "EQUALITY"
    }
  }
]
The query and threshold properties are covered in detail in the Queries and Thresholds sections below.
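As an illustration of consuming this response, the following Python sketch parses an abridged copy of the example array and produces one summary line per configuration. The summarize helper is hypothetical and not part of PCF Healthwatch. It assumes only the query and threshold fields shown above, and that the warning value may be absent.

```python
import json

# Sample response body, abridged from the example above.
RESPONSE_BODY = """
[
  {"query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
   "threshold": {"critical": 95, "warning": 85, "type": "UPPER"}},
  {"query": "origin == 'another_origin' and name == 'Some.Metric.Name'",
   "threshold": {"critical": 1, "type": "EQUALITY"}}
]
"""

def summarize(body):
    """Return one human-readable line per alert configuration (hypothetical helper)."""
    lines = []
    for config in json.loads(body):
        threshold = config["threshold"]
        # "warning" is optional: the EQUALITY example above omits it.
        warning = threshold.get("warning", "n/a")
        lines.append(
            f"{threshold['type']}: critical={threshold['critical']} "
            f"warning={warning} :: {config['query']}"
        )
    return lines

for line in summarize(RESPONSE_BODY):
    print(line)
```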
View Specific Alert Configurations
GET /v1/alert-configurations?q=...
To narrow the results of a GET request, add a query to the URL in a parameter named q:
uaac curl "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'some_origin' and name == 'Some.Metric.Name'"
This returns a JSON array of alert configurations, filtered by the provided query:
[
  {
    "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "threshold": {
      "critical": 95,
      "warning": 85,
      "type": "UPPER"
    }
  }
]
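The query in the example URL contains spaces, equals signs, and single quotes. If your HTTP client does not accept those characters unencoded, one option is to percent-encode the query before appending it. The sketch below uses Python's standard urllib.parse.quote. The filtered_url helper is hypothetical, and whether your client needs the encoded form is an assumption to verify in your environment.

```python
from urllib.parse import quote

BASE_URL = "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations"

def filtered_url(query, base_url=BASE_URL):
    """Build a URL with the query percent-encoded in the q parameter (hypothetical helper)."""
    # safe="" also encodes "=" and "'" so that no character is left ambiguous.
    return f"{base_url}?q={quote(query, safe='')}"

url = filtered_url("origin == 'some_origin' and name == 'Some.Metric.Name'")
print(url)
```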
Update Alert Configurations
POST /v1/alert-configurations
To update an existing alert configuration, send a POST request to the /v1/alert-configurations endpoint with the updated data:
uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations" \
-H "Content-Type: application/json" \
--data "{\"query\":\"origin == 'some_origin' and name == 'Some.Metric.Name'\",\"threshold\":{\"critical\":90,\"warning\":80,\"type\":\"UPPER\"}}"
The updated alert configuration is echoed back in the response. See the following example output:
{
  "query": "origin == 'some_origin' and name == 'Some.Metric.Name'",
  "threshold": {
    "critical": 90,
    "warning": 80,
    "type": "UPPER"
  }
}
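The hand-escaped --data string in the command above is easy to get wrong. As a sketch, the request body can instead be generated with Python's json module and then passed to --data. The alert_payload helper is hypothetical. It assumes only the query and threshold fields documented in this topic.

```python
import json

def alert_payload(query, threshold_type, critical, warning=None):
    """Serialize an alert configuration into a JSON request body (hypothetical helper)."""
    threshold = {"critical": critical, "type": threshold_type}
    if warning is not None:
        # Include a warning value only when the caller supplies one.
        threshold["warning"] = warning
    return json.dumps({"query": query, "threshold": threshold})

body = alert_payload(
    "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "UPPER", critical=90, warning=80,
)
print(body)
```

Because json.dumps handles the quoting, the single quotes inside the query need no manual escaping.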
Warning: These alert configurations cannot be deleted. To revert your changes, update the alert back to its default values.
Create Alert Configurations for Isolation Segments
POST /v1/alert-configurations
You can set specific thresholds for an isolation segment by extending an existing alert configuration with a deployment specifier. For example, to create an isolation segment alert configuration based on the alert above, run the following:
uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations" \
-H "Content-Type: application/json" \
--data "{\"query\":\"origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'Some-Isolated-Deployment'\",\"threshold\":{\"critical\":55,\"warning\":45,\"type\":\"UPPER\"}}"
The created alert configuration is echoed back in the following response:
{
  "query": "origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'Some-Isolated-Deployment'",
  "threshold": {
    "critical": 55,
    "warning": 45,
    "type": "UPPER"
  }
}
Note: You can delete Isolation Segment alert configurations only if you created them with the above method.
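The isolation segment query is the existing query with a deployment clause appended. A minimal Python sketch of that string construction follows. The for_deployment helper is hypothetical, and the rejection of single quotes in the deployment name is a precaution added for this sketch, not a documented Healthwatch rule.

```python
def for_deployment(base_query, deployment):
    """Append a deployment clause to an existing alert query (hypothetical helper)."""
    if "'" in deployment:
        # A quote would end the string literal early and corrupt the query.
        raise ValueError("deployment name must not contain a single quote")
    return f"{base_query} and deployment == '{deployment}'"

query = for_deployment(
    "origin == 'some_origin' and name == 'Some.Metric.Name'",
    "Some-Isolated-Deployment",
)
print(query)
```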
Delete Isolation Segment Alert Configurations
DELETE /v1/alert-configurations?q=...
To delete a user-created alert configuration for an isolation segment, send a DELETE request with a query in a URL parameter named q.
See the following example:
uaac curl -X DELETE "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'some_origin' and name == 'Some.Metric.Name' and deployment == 'Some-Isolated-Deployment'"
This returns the number of deleted alert configurations. See the following example output.
1
Note: You can delete Isolation Segment alert configurations only if you created them through the Healthwatch API.
Disable Alerts on a Metric
PCF Healthwatch does not support disabling alerts on a specific metric. To ensure that PCF Healthwatch does not alert on a metric, update the threshold of the alert to a value that will never trigger an alert.
For more information about thresholds, see Thresholds.
For more information about updating alert configurations, see Update Alert Configurations.
Queries
The query field is used in three ways:
- GET requests: The query filters the alert configurations being queried.
- POST requests: The query specifies the alert configuration being updated.
- DELETE requests: The query filters the alert configurations being deleted.
The query associated with an alert configuration denotes the trigger conditions for the alert.
A well-formed query is a valid Spring Expression Language (SpEL) expression that uses only the equality operator, "==", and the conjunction operator, "and". For example:
"origin == 'some_origin' and name == 'Some.Metric.Name'" # valid
"origin > 'some_origin' and name == 'Some.Metric.Name'" # invalid (uses '>')
"origin == 'some_origin' or name == 'Some.Metric.Name'" # invalid (uses 'or')
The following fields can be queried:
- origin (required)
- name (required)
- job
- deployment
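The rules above can be approximated on the client side before a request is sent. The following Python sketch checks a query against them with a regular expression. It is not the server's SpEL parser: parse_query is a hypothetical helper, and it assumes that every value is single-quoted and that no value contains the word "and" surrounded by spaces.

```python
import re

ALLOWED_FIELDS = {"origin", "name", "job", "deployment"}
REQUIRED_FIELDS = {"origin", "name"}
# One clause: a field name, the == operator, and a single-quoted value.
CLAUSE = re.compile(r"^(\w+)\s*==\s*'([^']*)'$")

def parse_query(query):
    """Split a query into {field: value}, raising ValueError if it breaks the rules."""
    fields = {}
    for clause in re.split(r"\s+and\s+", query.strip()):
        match = CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        field, value = match.groups()
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported field: {field!r}")
        fields[field] = value
    missing = REQUIRED_FIELDS - set(fields)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return fields

print(parse_query("origin == 'some_origin' and name == 'Some.Metric.Name'"))
```

Queries that use ">" or "or", omit origin or name, or reference any other field raise ValueError in this sketch.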
Thresholds
The threshold field contains a threshold type, as well as critical and warning threshold values. The type can be "UPPER", "LOWER", "EQUALITY", or "INEQUALITY".
Alert configurations whose thresholds are of the UPPER type trigger their alerts when the actual metric value is above the warning value, and again when it is above the critical value.
Thresholds with the LOWER type work the same way, except that the alerts trigger when the metric falls below the thresholds.
The EQUALITY alerts trigger when the metric value is not exactly equal to the critical threshold. In other words, the critical value is the expected healthy value. These alerts do not have warning thresholds.
The INEQUALITY alerts trigger when the metric value is exactly equal to the critical threshold. In other words, the critical value is the value that indicates a failure. These alerts do not have warning thresholds.
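The four threshold types can be summarized as a small decision function. The Python sketch below is illustrative only and is not Healthwatch's implementation. It assumes strict comparisons at the boundaries, treats a missing warning value as "no warning level", and ignores the assessment window over which Healthwatch evaluates each metric.

```python
def alert_level(threshold, value):
    """Return "critical", "warning", or "ok" for a metric value (illustrative only)."""
    kind = threshold["type"]
    critical = threshold["critical"]
    warning = threshold.get("warning")
    if kind == "EQUALITY":
        # Alerts when the value differs from the expected (critical) value.
        return "critical" if value != critical else "ok"
    if kind == "INEQUALITY":
        # Alerts when the value matches the critical value.
        return "critical" if value == critical else "ok"
    if kind == "UPPER":
        if value > critical:
            return "critical"
        return "warning" if warning is not None and value > warning else "ok"
    if kind == "LOWER":
        if value < critical:
            return "critical"
        return "warning" if warning is not None and value < warning else "ok"
    raise ValueError(f"unknown threshold type: {kind}")

upper = {"critical": 95, "warning": 85, "type": "UPPER"}
print(alert_level(upper, 90))  # prints "warning"
```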
Supported Alerts
This section describes the default Warning and Critical alerts for PCF Healthwatch. Each alert includes recommended thresholds for metrics monitored by Healthwatch.
Pivotal recommends customizing the default thresholds for alerts indicated with a 1 in the following tables based on your environment. You can determine the best threshold for your environment by monitoring the metrics over time and noting the metric values that indicate acceptable and unacceptable system performance and health. For more information about updating alert configurations in PCF Healthwatch, see Update Alert Configurations.
By default, PCF Healthwatch includes the configurable alerts in the tables below. For more information about the metrics that PCF Healthwatch emits, see PCF Healthwatch Metrics.
Performance Alerts
Alert | Metric | Threshold |
---|---|---|
Active Locks Held | Name: ActiveLocks; Origin: locket; Category: Compute Performance | Threshold Type: EQUALITY; Critical: 4; Unit: Number; Assessment Window (minutes): 5 |
Active Presences Held¹ | Name: ActivePresences; Origin: locket; Category: Compute Performance | Threshold Type: UPPER; Critical: 200; Warning: 150; Unit: Number; Assessment Window (minutes): 15 |
Auctioneer Time to Fetch Cell State | Name: AuctioneerFetchStatesDuration; Origin: auctioneer; Category: Compute Performance | Threshold Type: UPPER; Critical: 5000000000; Warning: 2000000000; Unit: ns; Assessment Window (minutes): 5 |
App Instances Placement Failures Rate | Name: AuctioneerLRPAuctionsFailed; Origin: auctioneer; Category: App Instances | Threshold Type: UPPER; Critical: 1; Warning: 0.5; Unit: Number; Assessment Window (minutes): 5 |
App Instance Starts Rate¹ | Name: AuctioneerLRPAuctionsStarted; Origin: auctioneer; Category: App Instances | Threshold Type: UPPER; Critical: 100; Warning: 50; Unit: Number; Assessment Window (minutes): 5 |
Router Exhausted Connections¹ | Name: backend_exhausted_conns; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 10; Warning: 5; Unit: Number; Assessment Window (minutes): 5 |
Number of Router 502 Bad Gateways¹ | Name: bad_gateways; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 40; Warning: 30; Unit: Number; Assessment Window (minutes): 5 |
Task Placement Failures Rate | Name: AuctioneerTaskAuctionsFailed; Origin: auctioneer; Category: App Instances | Threshold Type: UPPER; Critical: 1; Warning: 0.5; Unit: Number; Assessment Window (minutes): 5 |
BBS Time to Run LRP Convergence | Name: ConvergenceLRPDuration; Origin: bbs; Category: Compute Performance | Threshold Type: UPPER; Critical: 20000000000; Warning: 10000000000; Unit: ns; Assessment Window (minutes): 15 |
Number of Crashed App Instances¹ | Name: CrashedActualLRPs; Origin: bbs; Category: App Instances | Threshold Type: UPPER; Critical: 20; Warning: 10; Unit: Number; Assessment Window (minutes): 5 |
Cloud Controller and Diego in Sync | Name: Diego.AppsDomainSynced; Origin: bbs; Category: Compute Performance | Threshold Type: EQUALITY; Critical: 1; Unit: Number; Assessment Window (minutes): 5 |
Rate of Change in Running App Instances¹ | Name: Diego.LRPsAdded.1H; Origin: healthwatch; Category: App Instances | Threshold Type: UPPER; Critical: 100; Warning: 50; Unit: Number; Assessment Window (minutes): 5 |
Router File Descriptors | Name: file_descriptors; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 60000; Warning: 50000; Unit: Number; Assessment Window (minutes): 5 |
PAS MySQL Galera Cluster Status | Name: Galera.ClusterStatusSum; Origin: healthwatch; Category: MySQL | Threshold Type: LOWER; Critical: 0.9999; Warning: 2.9999; Unit: Number; Assessment Window (minutes): 5 |
PAS MySQL Galera Cluster Size | Name: Galera.TotalPercentageHealthyNodes; Origin: healthwatch; Category: MySQL | Threshold Type: LOWER; Critical: 0.3332; Warning: 0.9999; Unit: Percent; Assessment Window (minutes): 5 |
Healthwatch BOSH Director Test Availability | Name: health.check.bosh.director.probe.available; Origin: healthwatch; Category: BOSH Director | Threshold Type: LOWER; Critical: 0.4; Warning: 0.6; Unit: Number; Assessment Window (minutes): 10 |
Ops Manager Test Availability | Name: health.check.OpsMan.probe.available; Origin: healthwatch; Category: Ops Manager | Threshold Type: LOWER; Critical: 0.4; Warning: 0.6; Unit: Number; Assessment Window (minutes): 5 |
Canary App Health Test Availability | Name: health.check.CanaryApp.probe.available; Origin: healthwatch; Category: Canary App | Threshold Type: LOWER; Critical: 0.4; Warning: 0.6; Unit: Number; Assessment Window (minutes): 5 |
CLI Health Test Availability | Name: health.check.cliCommand.probe.available; Origin: healthwatch; Category: CLI | Threshold Type: LOWER; Critical: 0.4; Warning: 0.6; Unit: Number; Assessment Window (minutes): 5 |
Healthwatch UI Availability | Name: health.check.ui.available; Origin: healthwatch; Category: Healthwatch | Threshold Type: LOWER; Critical: 0.4; Warning: 0.6; Unit: Number; Assessment Window (minutes): 5 |
Healthwatch Nozzle Disconnects | Name: ingestor.disconnects; Origin: healthwatch; Category: Healthwatch | Threshold Type: UPPER; Critical: 10; Warning: 5; Unit: Number; Assessment Window (minutes): 5 |
Healthwatch Ingestor Data Drops | Name: ingestor.dropped; Origin: healthwatch; Category: Healthwatch | Threshold Type: UPPER; Critical: 20; Warning: 10; Unit: Number; Assessment Window (minutes): 5 |
Healthwatch Ingestor Metrics Ingested | Name: ingestor.ingested; Origin: healthwatch; Category: Healthwatch | Threshold Type: LOWER; Critical: 0.01; Warning: 0.01; Unit: Number; Assessment Window (minutes): 30 |
Healthwatch Ingestor BOSH System Metrics Ingested | Name: ingestor.ingested.boshSystemMetrics; Origin: healthwatch; Category: Healthwatch | Threshold Type: LOWER; Critical: 0.01; Warning: 0.01; Unit: Number; Assessment Window (minutes): 30 |
Router Handling Latency¹ | Name: latency; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 150; Warning: 100; Unit: ms; Assessment Window (minutes): 30 |
UAA Request Latency | Name: latency.uaa; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 150; Warning: 100; Unit: ms; Assessment Window (minutes): 5 |
Locks Held by BBS | Name: LockHeld; Origin: bbs; Category: Healthwatch | Threshold Type: EQUALITY; Critical: 1; Unit: Number; Assessment Window (minutes): 5 |
Locks Held by Auctioneer | Name: LockHeld; Origin: auctioneer; Category: Compute Performance | Threshold Type: EQUALITY; Critical: 1; Unit: Number; Assessment Window (minutes): 5 |
More App Instances Than Expected | Name: LRPsExtra; Origin: bbs; Category: App Instances | Threshold Type: UPPER; Critical: 10; Warning: 5; Unit: Number; Assessment Window (minutes): 5 |
Fewer App Instances Than Expected | Name: LRPsMissing; Origin: bbs; Category: App Instances | Threshold Type: UPPER; Critical: 10; Warning: 5; Unit: Number; Assessment Window (minutes): 5 |
Healthwatch Super Metrics Published | Name: metrics.published; Origin: healthwatch; Category: Healthwatch | Threshold Type: LOWER; Critical: 0; Warning: 20; Unit: Number; Assessment Window (minutes): 5 |
Time Since Last Route Register Received | Name: ms_since_last_registry_update; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 30000; Warning: 30000; Unit: ms; Assessment Window (minutes): 5 |
PAS MySQL Server Availability | Name: /mysql/available; Origin: mysql; Job: mysql, database; Category: MySQL | Threshold Type: EQUALITY; Critical: 1; Unit: Number; Assessment Window (minutes): 5 |
PAS MySQL Galera Cluster Node Readiness | Name: /mysql/galera/wsrep_ready; Origin: mysql; Job: mysql, database; Category: MySQL | Threshold Type: LOWER; Critical: 1; Warning: 0.9999; Unit: Number; Assessment Window (minutes): 5 |
Cell Rep Time to Sync | Name: RepBulkSyncDuration; Origin: rep; Category: Compute Performance | Threshold Type: UPPER; Critical: 10000000000; Warning: 5000000000; Unit: ns; Assessment Window (minutes): 15 |
Number of Router 5XX Server Errors¹ | Name: responses.5xx; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 40; Warning: 30; Unit: Number; Assessment Window (minutes): 5 |
Route Emitter Time to Sync¹ | Name: RouteEmitterSyncDuration; Origin: route_emitter; Category: Compute Performance | Threshold Type: UPPER; Critical: 10000000000; Warning: 5000000000; Unit: ns; Assessment Window (minutes): 15 |
Number of Route Registration Messages Sent and Received | Name: RouteRegistration.MessagesDelta; Origin: healthwatch; Category: Routing | Threshold Type: UPPER; Critical: 50; Warning: 30; Unit: Number; Assessment Window (minutes): 5 |
BBS Time to Handle Requests | Name: RequestLatency; Origin: bbs; Category: Compute Performance | Threshold Type: UPPER; Critical: 10000000000; Warning: 5000000000; Unit: ns; Assessment Window (minutes): 15 |
UAA Requests In Flight | Name: server.inflight.count; Origin: uaa; Category: UAA | Threshold Type: UPPER; Critical: 200; Warning: 150; Unit: Number; Assessment Window (minutes): 5 |
VM CPU | Name: system.cpu.user; Origin: bosh-system-metrics-forwarder; Category: All Jobs | Threshold Type: UPPER; Critical: 95; Warning: 85; Unit: Percent; Assessment Window (minutes): 5 |
VM Ephemeral Disk Used | Name: system.disk.; Origin: bosh-system-metrics-forwarder; Category: All Jobs | Threshold Type: UPPER; Critical: 90; Warning: 80; Unit: Percent; Assessment Window (minutes): 5 |
VM Persistent Disk Used | Name: system.disk.persistent.percent; Origin: bosh-system-metrics-forwarder; Category: All Jobs | Threshold Type: UPPER; Critical: 90; Warning: 80; Unit: Percent; Assessment Window (minutes): 5 |
VM Disk Used | Name: system.disk.system.percent; Origin: bosh-system-metrics-forwarder; Category: All Jobs | Threshold Type: UPPER; Critical: 90; Warning: 80; Unit: Percent; Assessment Window (minutes): 5 |
VM Health Check Recovery | Name: system.healthy; Origin: bosh-system-metrics-forwarder; Category: All Jobs | Threshold Type: LOWER; Critical: 0.4; Warning: 0.6; Unit: Number; Assessment Window (minutes): 5 |
Router Throughput¹ | Name: total_requests; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 125000; Warning: 100000; Unit: Number; Assessment Window (minutes): 5 |
Number of Router Routes Registered¹ | Name: total_routes; Origin: gorouter; Category: Routing | Threshold Type: UPPER; Critical: 200; Warning: 100; Unit: Number; Assessment Window (minutes): 5 |
VM Memory Used | Name: system.mem.percent; Origin: bosh-system-metrics-forwarder; Category: All Jobs | Threshold Type: UPPER; Critical: 95; Warning: 85; Unit: Percent; Assessment Window (minutes): 5 |
UAA Throughput Rate | Name: uaa.throughput.rate; Origin: healthwatch; Category: Healthwatch | Threshold Type: UPPER; Critical: 15000; Warning: 12000; Unit: Number; Assessment Window (minutes): 5 |
Unhealthy Cells² | Name: UnhealthyCell; Origin: rep; Category: Compute Performance | Threshold Type: EQUALITY; Critical: 0; Unit: Number; Assessment Window (minutes): 5 |
1 Pivotal recommends customizing the default thresholds for these alerts based on your environment. You can determine the best threshold for your environment by monitoring the metrics over time and noting the metric values that indicate acceptable and unacceptable system performance and health. For more information about updating alert configurations in PCF Healthwatch, see Update Alert Configurations.
2 This alert is evaluated per cell. PCF Healthwatch sends a critical alert when any Diego Cell has been unhealthy for 5 minutes.
Scaling Alerts
Alert | Metric | Threshold |
---|---|---|
Number of Available Free Chunks of Cell Memory | Name: Diego.AvailableFreeChunks; Origin: healthwatch; Category: Capacity | Threshold Type: LOWER; Critical: 1; Warning: 2; Unit: Number; Assessment Window (minutes): 5 |
Number of Available Free Chunks of Cell Disk | Name: Diego.AvailableFreeChunksDisk; Origin: healthwatch; Category: Capacity | Threshold Type: LOWER; Critical: 50; Warning: 100; Unit: Number; Assessment Window (minutes): 5 |
Remaining Cell Disk Available | Name: Diego.TotalAvailableDiskCapacity.5M; Origin: healthwatch; Category: Capacity | Threshold Type: LOWER; Critical: 6144; Warning: 12288; Unit: MB; Assessment Window (minutes): 5 |
Remaining Cell Memory Available¹ | Name: Diego.TotalAvailableMemoryCapacity.5M; Origin: healthwatch; Category: Capacity | Threshold Type: LOWER; Critical: 32768; Warning: 65536; Unit: MB; Assessment Window (minutes): 5 |
Cell Container Capacity Available | Name: Diego.TotalPercentageAvailableContainerCapacity.5M; Origin: healthwatch; Category: Capacity | Threshold Type: LOWER; Critical: 0.35; Warning: 0.35; Unit: Percent; Assessment Window (minutes): 30 |
Cell Disk Available | Name: Diego.TotalPercentageAvailableDiskCapacity.5M; Origin: healthwatch; Category: Capacity | Threshold Type: LOWER; Critical: 0.35; Warning: 0.35; Unit: Percent; Assessment Window (minutes): 30 |
Cell Memory Available¹ | Name: Diego.TotalPercentageAvailableMemoryCapacity.5M; Origin: healthwatch; Category: Capacity | Threshold Type: LOWER; Critical: 0.35; Warning: 0.35; Unit: Percent; Assessment Window (minutes): 30 |
Doppler Message Rate Capacity | Name: Doppler.MessagesAverage.1M; Origin: healthwatch; Category: Logging | Threshold Type: UPPER; Critical: 1000000; Warning: 800000; Unit: Number; Assessment Window (minutes): 60 |
Log Transport Loss Rate | Name: Firehose.LossRate.1M; Origin: healthwatch; Category: Logging | Threshold Type: UPPER; Critical: 0.01; Warning: 0.005; Unit: Percent; Assessment Window (minutes): 5 |
Redis Counter Event Queue Size | Name: redis.counterEventQueue.size; Origin: healthwatch; Category: Healthwatch | Threshold Type: UPPER; Critical: 10000; Unit: Number; Assessment Window (minutes): 5 |
Redis Value Metric Queue Size | Name: redis.valueMetricQueue.size; Origin: healthwatch; Category: Healthwatch | Threshold Type: UPPER; Critical: 10000; Unit: Number; Assessment Window (minutes): 5 |
Syslog Adapter Capacity | Name: SyslogDrain.Adapter.BindingsAverage.5M; Origin: healthwatch; Category: Logging | Threshold Type: UPPER; Critical: 500; Warning: 450; Unit: Number; Assessment Window (minutes): 60 |
Syslog Adapter Loss Rate | Name: SyslogDrain.Adapter.LossRate.1M; Origin: healthwatch; Category: Logging | Threshold Type: UPPER; Critical: 0.1; Warning: 0.01; Unit: Percent; Assessment Window (minutes): 5 |
Router Instance CPU | Name: system.cpu.user; Origin: bosh-system-metrics-forwarder; Job: router; Category: Routing | Threshold Type: UPPER; Critical: 70; Warning: 60; Unit: Percent; Assessment Window (minutes): 5 |
UAA Instance CPU | Name: system.cpu.user; Origin: bosh-system-metrics-forwarder; Job: uaa, control³; Category: UAA | Threshold Type: UPPER; Critical: 90; Warning: 80; Unit: Percent; Assessment Window (minutes): 5 |
1 Pivotal recommends customizing the default thresholds for these alerts based on your environment. You can determine the best threshold for your environment by monitoring the metrics over time and noting the metric values that indicate acceptable and unacceptable system performance and health. For more information about updating alert configurations in PCF Healthwatch, see Update Alert Configurations.
3 This alert covers two jobs to accommodate the different job names used by PAS and SRT.
Service Level Indicators
Alert | Metric | Threshold |
---|---|---|
BOSH Director Health | Name: health.check.bosh.director.success; Origin: healthwatch; Category: BOSH Director | Threshold Type: EQUALITY; Critical: 1; Unit: Number; Assessment Window (minutes): 10 |
Ops Manager Availability | Name: health.check.OpsMan.available; Origin: healthwatch; Category: Ops Manager | Threshold Type: EQUALITY; Critical: 1; Unit: Number; Assessment Window (minutes): 10 |
Canary App Availability | Name: health.check.CanaryApp.available; Origin: healthwatch; Category: Canary App | Threshold Type: LOWER; Critical: 0.5; Unit: Number; Assessment Window (minutes): 5 |
Canary App Response Time | Name: health.check.CanaryApp.responseTime; Origin: healthwatch; Category: Canary App | Threshold Type: UPPER; Critical: 30000; Warning: 15000; Unit: ms; Assessment Window (minutes): 5 |
CF Push Time | Name: health.check.cliCommand.pushTime; Origin: healthwatch; Category: CLI | Threshold Type: UPPER; Critical: 120000; Warning: 60000; Unit: ms; Assessment Window (minutes): 10 |
Can CF Login | Name: health.check.cliCommand.login; Origin: healthwatch; Category: CLI | Threshold Type: INEQUALITY; Critical: 0; Unit: Number; Assessment Window (minutes): 10 |
Can CF Push | Name: health.check.cliCommand.push; Origin: healthwatch; Category: CLI | Threshold Type: INEQUALITY; Critical: 0; Unit: Number; Assessment Window (minutes): 10 |
Can CF Start | Name: health.check.cliCommand.start; Origin: healthwatch; Category: CLI | Threshold Type: INEQUALITY; Critical: 0; Unit: Number; Assessment Window (minutes): 10 |
Can CF Stop | Name: health.check.cliCommand.stop; Origin: healthwatch; Category: CLI | Threshold Type: INEQUALITY; Critical: 0; Unit: Number; Assessment Window (minutes): 10 |
Can CF Delete | Name: health.check.cliCommand.delete; Origin: healthwatch; Category: CLI | Threshold Type: INEQUALITY; Critical: 0; Unit: Number; Assessment Window (minutes): 10 |
Can Receive Logs | Name: health.check.cliCommand.logs; Origin: healthwatch; Category: CLI | Threshold Type: INEQUALITY; Critical: 0; Unit: Number; Assessment Window (minutes): 10 |
Errors
This section lists common error messages and their causes.
Error Message: “Unsupported condition in query expression”
Possible Cause: A key in the query string is something other than name, origin, job, or deployment.
Error Message: “Unsupported query expression”
Possible Causes:
- The query string does not include at least name and origin.
- The query string uses an operator other than and or ==.
- An expression in the query string is not in the format property == 'value' and ...
Error Message: “Invalid query expression”
Possible Cause: The query string format is invalid.
Error Message: “Invalid threshold type for metric 'some_origin.some_name'”
Possible Cause: The given threshold type does not match the type expected for the metric. See the tables in Supported Alerts.
Error Message: “Must provide a warning threshold for an Upper/Lower Threshold”
Possible Cause: The threshold is missing a value that its type requires.
Error Message: “name = 'some_name' and origin = 'some_origin' is not a supported alert configuration”
Error Message: “name = 'some_name', origin = 'some_origin', and job = 'some_job' is not a supported alert configuration”
Possible Cause: An alert configuration does not exist for the targeted metric.
Walkthrough Example
A best-practice deployment of Cloud Foundry includes at least three availability zones (AZs). For these deployments, Pivotal recommends that you have enough capacity to withstand the failure of an entire AZ.
By default, Healthwatch sends an alert if the Diego.TotalPercentageAvailableMemoryCapacity.5M metric falls below 35%, or roughly one in three.
However, if your environment has been scaled up to five AZs, you may want to adjust the alert configuration accordingly to 20%, or one in five:
uaac token client get <my_healthwatch_admin_client> -s <my_healthwatch_admin_secret>
export token=$(uaac context | grep access_token | awk '{print $2}')
uaac curl -X POST "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations" \
-H "Content-Type: application/json" \
--data "{\"query\":\"origin == 'healthwatch' and name == 'Diego.TotalPercentageAvailableMemoryCapacity.5M'\",\"threshold\":{\"critical\":0.2,\"warning\":0.3,\"type\":\"LOWER\"}}"
The response body contains the updated alert configuration. You can then confirm the change:
uaac curl "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations?q=origin == 'healthwatch' and name == 'Diego.TotalPercentageAvailableMemoryCapacity.5M'"
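The one-in-three and one-in-five figures in this walkthrough follow from dividing total capacity by the number of AZs. The Python sketch below makes that arithmetic explicit. The az_loss_threshold helper is hypothetical and assumes equally sized AZs. Note that the 0.35 default sits slightly above the exact one-in-three value of about 0.333.

```python
def az_loss_threshold(az_count):
    """Fraction of total capacity that one of az_count equally sized AZs represents."""
    if az_count < 2:
        raise ValueError("surviving an AZ failure needs at least two AZs")
    return round(1 / az_count, 4)

for azs in (3, 5):
    print(azs, az_loss_threshold(azs))
```

For five AZs the result is 0.2, which matches the critical value used in the POST request above.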
Configure PCF Healthwatch Alert Notifications
You can configure PCF Event Alerts to receive push notifications when a PCF Healthwatch alert occurs. For example, if you configured a PCF Healthwatch alert for memory on a VM, you can use PCF Event Alerts to receive a message on Slack if memory on the VM exceeds the threshold defined in the PCF Healthwatch alert.
For more information about configuring PCF Event Alerts for PCF Healthwatch, see PCF Event Alerts.