Using Pivotal Healthwatch
This topic describes how to use Pivotal Healthwatch.
The Pivotal Healthwatch Dashboard
The Pivotal Healthwatch Dashboard is located at healthwatch.SYSTEM-DOMAIN, where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. The goal of the user interface is to provide an at-a-glance look into the health of the foundation.
The dashboard shows a set of panels that surface warning (orange) and critical (red) alerts for different parts of the system. Behind the scenes, the healthwatch-alerts app watches metrics to evaluate the health of the platform, and each metric belongs to a category covered by one of the display panels. When a metric reaches a configured alert threshold, an alert fires. The dashboard indicates the alert by changing the color of its category panel to orange or red, depending on the alert severity. Additionally, the alert displays in the Alert Stream panel on the right side of the dashboard.
To see and configure the thresholds for individual alerts and the categories to which the alerts are mapped, see Pivotal Healthwatch API.
An out-of-the-box Pivotal Healthwatch installation is likely to have noisy alerting, showing false positives. To ensure that alerts fire only in the event of meaningful problems, you must edit some of the alerts to fit the size of your system. For more information, see Update Alert Configurations in Configuring Pivotal Healthwatch Alerts.
Accessing Pivotal Healthwatch
You can access Pivotal Healthwatch and its data through the Pivotal Healthwatch UI or directly through the service datastore. In addition, Pivotal Healthwatch forwards the metrics that it creates into the Loggregator system.
Access the Pivotal Healthwatch UI
To access the Pivotal Healthwatch UI:
1. Navigate to healthwatch.SYSTEM-DOMAIN, where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.
2. When prompted to log in, enter the username and password of a UAA user that has either the healthwatch.read scope or the healthwatch.admin scope.

The UAA admin user has both the healthwatch.read and healthwatch.admin scopes by default. If you want to log in with another UAA user, make this user a member of the healthwatch.read or healthwatch.admin group. For more information, including considerations for which scope to grant, see Allow Additional Users to Access the Pivotal Healthwatch UI in Installing Pivotal Healthwatch.
Access Data Through MySQL
You can access metrics data through the Pivotal Healthwatch datastore. For descriptions of available data points, see Pivotal Healthwatch Metrics.
The table below provides login information.
URL | MySQL VM IP
Port | 3306
Username | MySQL Admin Password credentials in the Pivotal Healthwatch tile
Password | MySQL Admin Password credentials in the Pivotal Healthwatch tile
Database | platform_monitoring
Tables | value_metric_agg, counter_event_agg, super_value_metric, alert, and alert_configuration
To access the MySQL datastore, you can use one of the following methods:
- Use BOSH to SSH into the MySQL VM and run the mysql -u root -p command.
- Assign an external IP to the MySQL VM, create a firewall rule to open ports 3306 and 3308, and access MySQL externally.
- Open a tunnel into your IaaS network and connect externally through the tunnel.
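As a sketch of the first method, the snippet below composes the two commands you would run. It prints the commands rather than executing them, and the deployment name p-healthwatch and instance group mysql/0 are assumptions; confirm them with bosh deployments and bosh instances on your foundation.

```shell
# Sketch only: prints the commands for method 1 (BOSH SSH).
# The deployment name and instance group are assumptions.
DEPLOYMENT="p-healthwatch"                          # assumed tile deployment name
SSH_CMD="bosh -d ${DEPLOYMENT} ssh mysql/0"         # step 1: reach the MySQL VM
MYSQL_CMD="mysql -u root -p platform_monitoring"    # step 2: run on the VM itself
printf '%s\n%s\n' "$SSH_CMD" "$MYSQL_CMD"
```

When prompted for the password, use the MySQL Admin Password credentials from the Pivotal Healthwatch tile, as listed in the table above.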
Access Super Metrics Through the Loggregator Firehose
Pivotal Healthwatch forwards the super metrics that it creates into the Loggregator system so that they can be picked up by consumers of the Loggregator Firehose. Below is an example of a product-generated metric output received through a Firehose nozzle.
origin:"healthwatch" eventType:ValueMetric timestamp:1502293311588995438 deployment:"cf" job:"healthwatch-forwarder" index:"06231d64-ad9f-4112-8423-6b41f44c0cf5" ip:"10.0.4.82" valueMetric:<name:"Firehose.LossRate.1H" value:0 unit:"hr">
Access Super Metrics Through the Log Cache API
Note: This feature is available in Pivotal Application Service (PAS) v2.2.5 and later.
You can access Healthwatch data directly with PromQL queries to the Log Cache API. The following Log Cache endpoints are Prometheus compatible:
/api/v1/query
/api/v1/query_range
Query Format
To form the PromQL query for a metric:

1. Remove healthwatch. from the beginning of the metric name. For example, change healthwatch.Diego_AvailableFreeChunks to Diego_AvailableFreeChunks.
2. Replace all non-alphanumeric characters with _.
3. Add {source_id="healthwatch-forwarder"} to the end. Log Cache requires source_id for all queries.
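The renaming steps above can be scripted. The sketch below applies them to the Firehose.LossRate.1H metric name taken from the Firehose example earlier in this topic:

```shell
# Apply the three renaming steps to a Healthwatch metric name.
metric="healthwatch.Firehose.LossRate.1H"
name="${metric#healthwatch.}"                             # 1. strip the healthwatch. prefix
name="$(printf '%s' "$name" | sed 's/[^A-Za-z0-9]/_/g')"  # 2. non-alphanumerics -> _
query="${name}{source_id=\"healthwatch-forwarder\"}"      # 3. append the required source_id
echo "$query"   # -> Firehose_LossRate_1H{source_id="healthwatch-forwarder"}
```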
For more information about query syntax, see Querying Prometheus in the Prometheus documentation.
Example: Query Recent Values of Diego Available Free Chunks
The example curl command below queries all values of Diego_AvailableFreeChunks over the past two minutes:
curl "https://log-cache.SYSTEM-DOMAIN/api/v1/query" --data-urlencode 'query=Diego_AvailableFreeChunks{source_id="healthwatch-forwarder"}[2m]' -H "Authorization: $(cf oauth-token)"
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"deployment": "p-healthwatch-b8f99d6a724dbee699cc",
"index": "2336cbbe-e526-45d4-816e-dd2352d4fa0c",
"job": "healthwatch-forwarder",
"origin": "healthwatch"
},
"values": [
[
1536166538000000000,
"10"
],
[
1536166598000000000,
"8"
]
]
}
]
}
}
Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.
The returned values of [ 1536166538, "10" ] and [ 1536166598, "8" ] mean there were 10 free chunks two minutes ago and eight free chunks one minute ago.
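Note that the timestamps in the matrix response are in nanoseconds. As a small sketch (GNU date assumed), you can convert the first sample to a readable time:

```shell
# Convert the first nanosecond timestamp from the response above.
ts_ns=1536166538000000000
ts_s=$((ts_ns / 1000000000))                    # nanoseconds -> seconds
date -u -d "@${ts_s}" +"%Y-%m-%dT%H:%M:%SZ"     # GNU date; prints 2018-09-05T16:55:38Z
```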
Example: Query the Latest Value for the cf push Health Check
The example curl command below queries the latest value stored in Log Cache for the cf push Health Check:
curl "https://log-cache.SYSTEM-DOMAIN/api/v1/query" --data-urlencode 'query=health_check_cliCommand_push{source_id="healthwatch-forwarder"}' -H "Authorization: $(cf oauth-token)"
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"deployment": "p-healthwatch-b8f99d6a724dbee699cc",
"index": "2336cbbe-e526-45d4-816e-dd2352d4fa0c",
"job": "healthwatch-forwarder",
"origin": "healthwatch"
},
"value": [
1537826971,
"1"
]
}
]
}
}
Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.
A resulting value of 1 means that cf push is currently working. Any other value means that cf push has timed out or failed. This result could be used by automation tools waiting for cf push to become available.
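As a sketch of such automation, the snippet below extracts the value from a vector response like the one above and branches on it. The JSON is condensed from the example response on this page, and jq is assumed to be installed (it is also used in the retention check later in this topic); in practice, $response would come from the curl command.

```shell
# Extract the health-check value from a Log Cache vector response
# (condensed example data) and treat 1 as healthy.
response='{"status":"success","data":{"resultType":"vector","result":[{"value":[1537826971,"1"]}]}}'
value=$(printf '%s' "$response" | jq -r '.data.result[0].value[1]')
if [ "$value" = "1" ]; then
  echo "cf push is working"
else
  echo "cf push timed out or failed"
fi
```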
Example: Canary App Availability SLI
The Log Cache API can also execute the same query multiple times over a range, as in the example curl command below:
curl "https://log-cache.SYSTEM-DOMAIN/api/v1/query_range?start=1537782809&end=1537804131&step=60s" --data-urlencode 'query=health_check_CanaryApp_available{source_id="healthwatch-forwarder"}' -H "Authorization: $(cf oauth-token)"
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"deployment": "cf-abc-123",
"index": "9496f02a-0a40-427a-b2e3-189e30064031",
"job": "healthwatch-forwarder",
"origin": "healthwatch"
},
"values": [
[ 1537790009, "1" ],
[ 1537790069, "1" ],
[ 1537790129, "1" ],
[ 1537790189, "1" ],
[ 1537790249, "1" ],
[ 1537790309, "0" ],
[ 1537790369, "1" ],
[ 1537790429, "1" ]
]
}
]
}
}
Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.
The query_range endpoint returns values over time at a fixed interval (step). Besides charting, you can also use it to calculate uptime, based on a critical threshold of v < 1:
results [1 1 1 1 1 0 1 1]
number_of_failed_results (v < 1) = 1
total_number_of_results = 8
uptime = (total_number_of_results - number_of_failed_results) / total_number_of_results
uptime = 7 / 8
uptime = 87.5%
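The uptime arithmetic above can be reproduced directly from the query_range values. A minimal sketch:

```shell
# Compute uptime from the result values above, counting v < 1 as a failure.
results="1 1 1 1 1 0 1 1"
total=0; failed=0
for v in $results; do
  total=$((total + 1))
  if [ "$v" -lt 1 ]; then failed=$((failed + 1)); fi
done
uptime=$(awk -v t="$total" -v f="$failed" 'BEGIN { printf "%.1f", (t - f) * 100 / t }')
echo "uptime: ${uptime}%"    # -> uptime: 87.5%
```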
Log Cache Data Retention
These queries are subject to the retention time of the Log Cache service. Using the /v1/meta Log Cache API endpoint, you can discover the oldest metric stored in Log Cache by running:
curl "https://log-cache.SYSTEM-DOMAIN/v1/meta" -H "Authorization: $(cf oauth-token)" -k | jq .meta | grep -A 4 healthwatch-forwarder
Where SYSTEM-DOMAIN is the system domain URL configured in the TAS for VMs tile. For example, sys.example.com.
To query data over a longer period of time, scale up the instance count or RAM of the Doppler VM.
Alerting and Graphs
The rules for alerting and bucketing graph data are not always the same. As a result, there are cases where a graph has a failing indicator without an alert. This does not indicate a problem with alerting, as the two types of information have different behavior rules.
A graph’s time bucket turns red when one test in the time period fails. Alerting may have a higher threshold. Alert thresholds vary by metric and are based on an average aggregation, so more than one failure in a period might be necessary to trigger an alert.
Additionally, not all metrics have alerts associated with them, so a complete absence of alerts on a failing chart is not necessarily a problem.
For more information about alerting rules, see Thresholds in Configuring Pivotal Healthwatch Alerts.