Using PCF Healthwatch
Warning: PCF Healthwatch v1.6 is no longer supported or available for download. PCF Healthwatch v1.6 has reached the End of General Support (EOGS) phase as defined by the Support Lifecycle Policy. To stay up to date with the latest software and security updates, upgrade to a supported version.
This topic describes how to use Pivotal Cloud Foundry (PCF) Healthwatch.
The PCF Healthwatch Dashboard
The PCF Healthwatch Dashboard is located at healthwatch.SYSTEM-DOMAIN. The goal of the user interface is to provide an at-a-glance look into the health of the foundation.
The dashboard shows a set of panels that surface warning (orange) and critical (red) alerts for different parts
of the system. Behind the scenes, the healthwatch-alerts
app watches metrics to evaluate the health of the platform, and each metric belongs to a category covered by one of the display panels. When a metric reaches a configured alert threshold, an alert fires. The dashboard indicates the alert by changing the color of its category panel to orange or red, depending on the alert severity. Additionally, the alert displays in the Alert Stream panel on the right side of the dashboard.
To see and configure the thresholds for individual alerts, and the categories that the alerts map to, refer to the PCF Healthwatch API documentation.
Note that an out-of-the-box PCF Healthwatch installation will likely have noisy alerting, showing false positives. To ensure that alerts fire only in the event of meaningful problems, you need to tune some of the alerts to fit the size of your system.
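For example, you could read an alert's current configuration over the API before tuning it. The following is only a sketch: the healthwatch-api.SYSTEM-DOMAIN host, the /v1/alert-configurations path, and the metric.name parameter are assumptions here, so confirm them against the PCF Healthwatch API documentation for your version.
# Assumed host and endpoint; verify against the PCF Healthwatch API documentation.
# Requires a UAA token with the healthwatch.read or healthwatch.admin scope.
curl -G "https://healthwatch-api.SYSTEM-DOMAIN/v1/alert-configurations" \
  --data-urlencode 'metric.name=METRIC-NAME' \
  -H "Authorization: $(cf oauth-token)"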
Accessing PCF Healthwatch
You can access PCF Healthwatch and its data through the PCF Healthwatch UI or directly through the service datastore. In addition, PCF Healthwatch forwards the metrics that it creates into the Loggregator Firehose.
Access the PCF Healthwatch UI
To access the PCF Healthwatch UI, do the following:
- Navigate to healthwatch.SYSTEM-DOMAIN.
- When prompted to log in, enter the username and password of a UAA user that has either the healthwatch.read scope or the healthwatch.admin scope.
The UAA admin user has both the healthwatch.read and healthwatch.admin scopes by default. If you want to log in with another UAA user, make this user a member of the healthwatch.read or healthwatch.admin group. For more information, including considerations for which scope to grant, see Allow Additional Users to Access the PCF Healthwatch UI.
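For example, you can manage this group membership with the UAA CLI (uaac). The following is a minimal sketch; it assumes uaac is installed, that UAA is reachable at uaa.SYSTEM-DOMAIN, and that you have the UAA admin client secret.
# Sketch: grant an existing UAA user access to the PCF Healthwatch UI.
# Assumes the UAA CLI (uaac) is installed and the admin client secret is available.
uaac target https://uaa.SYSTEM-DOMAIN
uaac token client get admin -s UAA-ADMIN-CLIENT-SECRET
uaac member add healthwatch.read USER-NAME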
Access Data Through MySQL
You can access metrics data through the PCF Healthwatch datastore. See PCF Healthwatch Metrics for the description of available data points.
The table below provides login information.
Host | MySQL VM IP |
Port | 3306 |
Username | MySQL Admin Password credentials in the PCF Healthwatch tile |
Password | MySQL Admin Password credentials in the PCF Healthwatch tile |
Database | platform_monitoring |
Tables | value_metric_agg, counter_event_agg, super_value_metric, alert, and alert_configuration |
To access the MySQL datastore, you can do the following:
- Method 1: Use BOSH to SSH into the MySQL VM and run the mysql -u root -p command, as shown in the sketch after this list.
- Method 2: Assign an external IP to the MySQL VM and a firewall rule that opens ports 3306 and 3308, then access MySQL externally.
- Method 3: Open a tunnel into your IaaS network and connect externally through it.
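For example, Method 1 might look like the following. This is a sketch: the deployment name and the mysql instance group name are placeholders, so look up the real names with bosh deployments and bosh -d DEPLOYMENT instances.
# Sketch of Method 1; the deployment name and instance group below are placeholders.
bosh -d p-healthwatch-DEPLOYMENT-GUID ssh mysql/0
# On the VM, log in with the MySQL Admin Password credentials from the tile:
mysql -u root -p
# Then, inside the MySQL shell:
#   USE platform_monitoring;
#   SHOW TABLES;
#   SELECT * FROM value_metric_agg LIMIT 10;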
Access Super Metrics Through the Firehose
PCF Healthwatch forwards the super metrics that it creates into the Loggregator Firehose so that they can be picked up by existing Firehose consumers. Below is an example of product-generated metric output received through a Firehose nozzle.
origin:"healthwatch" eventType:ValueMetric timestamp:1502293311588995438 deployment:"cf" job:"healthwatch-forwarder" index:"06231d64-ad9f-4112-8423-6b41f44c0cf5" ip:"10.0.4.82" valueMetric:<name:"Firehose.LossRate.1H" value:0 unit:"hr">
Access Super Metrics Through the Log Cache API
Note: This feature is available in PAS 2.2.5 and later.
You can access Healthwatch data directly with PromQL queries to the Log Cache API. The following Log Cache endpoints are Prometheus compatible:
/api/v1/query
/api/v1/query_range
Query Format
To form the PromQL query for a metric, do the following:
- Remove healthwatch. from the beginning of the metric name. For example, change healthwatch.Diego_AvailableFreeChunks to Diego_AvailableFreeChunks.
- Replace all non-alphanumeric characters with _.
- Add {source_id="healthwatch-forwarder"} to the end. Log Cache requires source_id for all queries.
See Querying Prometheus for more information about query syntax.
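For example, applying these steps to the Firehose.LossRate.1H super metric shown in the Firehose example above can be scripted as follows. This is only a sketch; any string manipulation that performs the same two rewrites works.
# Sketch: turn a Healthwatch metric name into a Log Cache PromQL selector.
METRIC="healthwatch.Firehose.LossRate.1H"
QUERY="$(echo "$METRIC" | sed -e 's/^healthwatch\.//' -e 's/[^A-Za-z0-9]/_/g')"
echo "${QUERY}{source_id=\"healthwatch-forwarder\"}"
# Prints: Firehose_LossRate_1H{source_id="healthwatch-forwarder"}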
Example: Query recent values of Diego Available Free Chunks
This queries all values of Diego_AvailableFreeChunks
over the past two minutes:
curl "https://log-cache.SYSTEM-DOMAIN/api/v1/query" --data-urlencode 'query=Diego_AvailableFreeChunks{source_id="healthwatch-forwarder"}[2m]' -H "Authorization: $(cf oauth-token)"
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"deployment": "p-healthwatch-b8f99d6a724dbee699cc",
"index": "2336cbbe-e526-45d4-816e-dd2352d4fa0c",
"job": "healthwatch-forwarder",
"origin": "healthwatch"
},
"values": [
[
1536166538000000000,
"10"
],
[
1536166598000000000,
"8"
]
]
}
]
}
}
The returned values of [1536166538000000000, "10"] and [1536166598000000000, "8"] mean there were 10 free chunks two minutes ago and eight free chunks one minute ago.
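To work with only the timestamp and value pairs, you can pipe the response through jq, which this topic also uses in the Log Cache Data Retention section below. For example:
curl -s "https://log-cache.SYSTEM-DOMAIN/api/v1/query" --data-urlencode 'query=Diego_AvailableFreeChunks{source_id="healthwatch-forwarder"}[2m]' -H "Authorization: $(cf oauth-token)" | jq '.data.result[0].values'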
Example: Query the latest value for the CF Push Health Check
This queries the latest value stored in Log Cache for the cf push
Health Check:
curl "https://log-cache.SYSTEM-DOMAIN/api/v1/query" --data-urlencode 'query=health_check_cliCommand_push{source_id="healthwatch-forwarder"}' -H "Authorization: $(cf oauth-token)"
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"deployment": "p-healthwatch-b8f99d6a724dbee699cc",
"index": "2336cbbe-e526-45d4-816e-dd2352d4fa0c",
"job": "healthwatch-forwarder",
"origin": "healthwatch"
},
"value": [
1537826971,
"1"
]
}
]
}
}
The result value of “1” means that cf push
is currently working. Any other value means that cf push
has timed out or failed.
This result could be used by automation tools waiting for cf push
to become available.
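A minimal polling sketch for such a tool, assuming jq is installed and reusing the query above, might look like this:
# Sketch: block until the cf push health check reports 1. Assumes jq is installed.
until [ "$(curl -s "https://log-cache.SYSTEM-DOMAIN/api/v1/query" \
    --data-urlencode 'query=health_check_cliCommand_push{source_id="healthwatch-forwarder"}' \
    -H "Authorization: $(cf oauth-token)" | jq -r '.data.result[0].value[1]')" = "1" ]; do
  echo "cf push is not healthy yet; retrying in 60 seconds"
  sleep 60
done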
Example: Canary App Availability SLI
The Log Cache API can also execute the same query multiple times over a range:
curl "https://log-cache.SYSTEM-DOMAIN/api/v1/query_range?start=1537782809&end=1537804131&step=60s" --data-urlencode 'query=health_check_CanaryApp_available{source_id="healthwatch-forwarder"}' -H "Authorization: $(cf oauth-token)"
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"deployment": "cf-abc-123",
"index": "9496f02a-0a40-427a-b2e3-189e30064031",
"job": "healthwatch-forwarder",
"origin": "healthwatch"
},
"values": [
[ 1537790009, "1" ],
[ 1537790069, "1" ],
[ 1537790129, "1" ],
[ 1537790189, "1" ],
[ 1537790249, "1" ],
[ 1537790309, "0" ],
[ 1537790369, "1" ],
[ 1537790429, "1" ]
              ]
}
]
}
}
The query_range endpoint returns values over time at a fixed interval (step). This is useful for charting, but you can also use it to calculate uptime based on a critical threshold of v < 1:
results [1 1 1 1 1 0 1 1]
number_of_failed_results (v < 1) = 1
total_number_of_results = 8
uptime = (total_number_of_results - number_of_failed_results) / total_number_of_results
uptime = 7 / 8
uptime = 87.5%
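The same arithmetic can be automated with jq. The following sketch assumes the query_range response above has been saved to a file named canary.json and counts any value below 1 as a failure.
# Sketch: compute percent uptime from a saved query_range response.
jq -r '[.data.result[0].values[][1] | tonumber] as $v
  | (($v | length) - ([$v[] | select(. < 1)] | length)) / ($v | length) * 100' canary.json
# Prints: 87.5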
Log Cache Data Retention
These queries are subject to the retention time of the Log Cache service. You can discover the oldest metric stored with
the /v1/meta
Log Cache API endpoint.
curl "https://log-cache.SYSTEM-DOMAIN/v1/meta" -H "Authorization: $(cf oauth-token)" -k | jq .meta | grep -A 4 healthwatch-forwarder
To query data over a longer period of time, scale up the instance count or RAM of the Doppler VMs.
Alerting and Graphs
The rules for alerting and for bucketing graph data are not always the same. As a result, there are cases where a graph shows a failing indicator without a corresponding alert. This does not indicate a problem with alerting; the two types of information follow different rules.
A graph's time bucket turns red when one test in the time period fails. Alerting may have a higher threshold: alert thresholds vary by metric and are based on an average aggregation, so more than one failure in a period might be needed to trigger an alert.
Additionally, not all metrics have alerts associated with them, so a complete absence of alerts on a failing chart is not necessarily a problem.
For more information about alerting rules, see the alerts documentation.