Healthwatch Metrics
This topic lists the metrics created by Healthwatch, including Tanzu Application Service Exporter and Enterprise PKS Exporter.
Prometheus scrapes the /metrics endpoint for each of the Exporters.
The frequency with which it scrapes can be configured in the Healthwatch tile.
Note: For external monitoring consumers, Healthwatch exposes the metrics that it creates through scrapable metrics endpoints. The metrics endpoints are secured by mTLS certificates generated using the Ops Manager CA certificate.
Note: The VM name for each of these sections is shown in parentheses. All of these VMs are deployed from either the PAS Exporter tile or the PKS Exporter tile unless otherwise mentioned.
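As a rough illustration of what a scraper reads from a /metrics endpoint, the following Python sketch parses a few lines of the Prometheus text exposition format. The sample payload, its values, and the helper name `parse_exposition` are assumptions made for illustration, not output captured from a Healthwatch exporter. The parser handles only simple sample lines and ignores comment lines.

```python
import re

# Matches "name{labels} value [timestamp]"; the label block is optional.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# Matches one key="value" pair inside the label block.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative payload only: the values and HELP text are made up.
SAMPLE = """\
# HELP pas_sli_runs_total Total number of runs of the SLI test suite
# TYPE pas_sli_runs_total counter
pas_sli_runs_total 120
pas_sli_task_failures_total{task="push"} 3
bosh_sli_exporter_status{exported_job="bosh-health-exporter"} 1
"""


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for sample in parse_exposition(SAMPLE):
        print(sample)
```

A production consumer would normally rely on Prometheus itself or an existing client library rather than a hand-written parser; the sketch is only meant to show the shape of the data.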
Bosh Service Level Indicator Metrics
BOSH is the technology that Ops Manager uses to manage deployed VMs. If the BOSH Director is not responsive or functional, BOSH-managed VMs lose their resiliency. Healthwatch executes a continuous test suite to validate the functionality of the BOSH Director.
Three exporters handle SLI metrics for BOSH: the Bosh Deployments Exporter, the Bosh Health Exporter, and the PKS Exporter.
Bosh Deployments Exporter (bosh-deployments-exporter)
The Bosh Deployments Exporter periodically checks whether any BOSH deployments other than the one created by the Bosh Health Exporter are running.
Metric | Description |
---|---|
bosh_deployments_status | Bosh Deployments status, a 1 indicates a deployment is occurring on the director |
bosh_sli_duration_seconds_bucket{exported_job="bosh-deployments-exporter"} | Number of seconds it took for the SLI test suite to run, grouped by duration |
bosh_sli_duration_seconds_count{exported_job="bosh-deployments-exporter"} | Total number of metrics in all the buckets |
bosh_sli_duration_seconds_sum{exported_job="bosh-deployments-exporter"} | Total value of the metrics in all the buckets |
bosh_sli_exporter_status{exported_job="bosh-deployments-exporter"} | Exporter status, a 1 indicates the exporter is running and healthy |
bosh_sli_failures_total{exported_job="bosh-deployments-exporter"} | Total number of failures of the SLI test suite |
bosh_sli_run_duration_seconds{exported_job="bosh-deployments-exporter"} | Number of seconds it took for the SLI test suite to run |
bosh_sli_runs_total{exported_job="bosh-deployments-exporter"} | Total number of runs of the SLI test suite. Use bosh_sli_failures_total{exported_job="bosh-deployments-exporter"} / bosh_sli_runs_total{exported_job="bosh-deployments-exporter"} to get failure rate. |
bosh_sli_task_duration_seconds_bucket{exported_job="bosh-deployments-exporter"} | Number of seconds it took for a particular task to run, grouped by duration |
bosh_sli_task_duration_seconds_count{exported_job="bosh-deployments-exporter"} | Total number of metrics in all the buckets |
bosh_sli_task_duration_seconds_sum{exported_job="bosh-deployments-exporter"} | Total value of the metrics in all the buckets |
bosh_sli_task_run_duration_seconds{exported_job="bosh-deployments-exporter"} | Number of seconds it took for a particular task to run |
bosh_sli_task_runs_total{exported_job="bosh-deployments-exporter"} | Total number of runs for a particular SLI task. Use bosh_sli_task_failures_total{exported_job="bosh-deployments-exporter"} / bosh_sli_task_runs_total{exported_job="bosh-deployments-exporter"} to get failure rate. |
bosh_sli_task_failures_total{exported_job="bosh-deployments-exporter",task="tasks"} | Total number of failures for a bosh tasks command |
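The failure-rate expressions in the table divide one counter by another. Because counters only ever increase, a ratio over a recent window is often more useful than the lifetime ratio. The following Python sketch shows that arithmetic on two readings of each counter. The function name `failure_rate` and the sample numbers are hypothetical, and the sketch assumes the counters did not reset between the two readings.

```python
from typing import Optional


def failure_rate(
    failures_start: float,
    failures_end: float,
    runs_start: float,
    runs_end: float,
) -> Optional[float]:
    """Fraction of SLI runs that failed between two counter readings.

    Returns None when no runs happened in the window or when a counter
    went backwards (for example, after an exporter restart).
    """
    runs = runs_end - runs_start
    failures = failures_end - failures_start
    if runs <= 0 or failures < 0:
        return None
    return failures / runs


if __name__ == "__main__":
    # Hypothetical readings of *_failures_total and *_runs_total
    # taken at the start and end of a window.
    print(failure_rate(10, 13, 100, 160))
```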
Bosh Health Exporter (bosh-health-exporter)
The Bosh Health Exporter periodically creates and deletes a BOSH deployment.
Note: The BOSH Health Exporter deploys and deletes a VM named bosh-health-exporter as part of this test suite.
Metric | Description |
---|---|
bosh_sli_duration_seconds_bucket{exported_job="bosh-health-exporter"} | Number of seconds it took for the SLI test suite to run, grouped by duration |
bosh_sli_duration_seconds_count{exported_job="bosh-health-exporter"} | Total number of metrics in all the buckets |
bosh_sli_duration_seconds_sum{exported_job="bosh-health-exporter"} | Total value of the metrics in all the buckets |
bosh_sli_exporter_status{exported_job="bosh-health-exporter"} | Exporter status, a 1 indicates the exporter is running and healthy |
bosh_sli_failures_total{exported_job="bosh-health-exporter"} | Total number of failures of the SLI test suite |
bosh_sli_run_duration_seconds{exported_job="bosh-health-exporter"} | Number of seconds it took for the SLI test suite to run |
bosh_sli_runs_total{exported_job="bosh-health-exporter"} | Total number of runs of the SLI test suite. Use bosh_sli_failures_total{exported_job="bosh-health-exporter"} / bosh_sli_runs_total{exported_job="bosh-health-exporter"} to get failure rate. |
bosh_sli_task_duration_seconds_bucket{exported_job="bosh-health-exporter"} | Number of seconds it took for a particular task to run, grouped by duration |
bosh_sli_task_duration_seconds_count{exported_job="bosh-health-exporter"} | Total number of metrics in all the buckets |
bosh_sli_task_duration_seconds_sum{exported_job="bosh-health-exporter"} | Total value of the metrics in all the buckets |
bosh_sli_task_run_duration_seconds{exported_job="bosh-health-exporter"} | Number of seconds it took for a particular task to run |
bosh_sli_task_runs_total{exported_job="bosh-health-exporter"} | Total number of runs for a particular SLI task. Use bosh_sli_task_failures_total{exported_job="bosh-health-exporter"} / bosh_sli_task_runs_total{exported_job="bosh-health-exporter"} to get failure rate. |
bosh_sli_task_failures_total{exported_job="bosh-health-exporter",task="delete"} | Total number of failures for a bosh delete-deployment command |
bosh_sli_task_failures_total{exported_job="bosh-health-exporter",task="deploy"} | Total number of failures for a bosh deploy command |
bosh_sli_task_failures_total{exported_job="bosh-health-exporter",task="deployments"} | Total number of failures for a bosh deployments command |
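The `_bucket` series in these tables are Prometheus histogram buckets, which hold cumulative counts per upper bound (the `le` label). The following Python sketch estimates a quantile from such buckets by linear interpolation, in the same spirit as the PromQL `histogram_quantile` function. It is a simplified approximation, and the bucket boundaries and counts in the example are invented; the boundaries that the exporters actually use are not stated here.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to be the +Inf bucket holding the total
    count. Returns NaN when there are no observations.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][1] == 0:
        return math.nan
    rank = q * ordered[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite upper bound instead of infinity.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical cumulative buckets: 5 runs took <= 1 s, 8 took <= 5 s,
    # and all 10 took <= 10 s.
    example = [(1.0, 5), (5.0, 8), (10.0, 10), (math.inf, 10)]
    print(estimate_quantile(0.9, example))
```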
Platform Service Level Indicators
Healthwatch generates metrics that describe the health of several platform components. These metrics can be used to calculate percent availability and error budgets.
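As a minimal sketch of that calculation, the following Python functions derive percent availability and the remaining error budget from a run count and a failure count, such as the increase in a `*_runs_total` and `*_failures_total` pair over a reporting period. The function names, the 99.5% target, and the sample numbers are assumptions chosen for illustration, not values defined by Healthwatch.

```python
from typing import Optional


def availability(runs: float, failures: float) -> Optional[float]:
    """Fraction of SLI runs that succeeded, or None if nothing ran."""
    if runs <= 0:
        return None
    return 1.0 - failures / runs


def error_budget_remaining(
    runs: float, failures: float, slo_target: float
) -> Optional[float]:
    """Fraction of the error budget still unspent for the given SLO target.

    A result of 1.0 means no budget has been used, 0.0 means the budget is
    exactly spent, and a negative value means the SLO has been missed.
    """
    allowed_failures = (1.0 - slo_target) * runs
    if allowed_failures <= 0:
        return None
    return 1.0 - failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical period: 1000 test-suite runs, 2 of which failed,
    # measured against an assumed 99.5% availability target.
    print(availability(1000, 2))
    print(error_budget_remaining(1000, 2, 0.995))
```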
PAS SLI Exporter (pas-sli-exporter)
The Cloud Foundry Command Line Interface (cf CLI) enables developers to create and manage apps on Tanzu Application Service.
Healthwatch executes a continuous test suite to validate the core functions of the cf CLI.
The table below provides information about the cf CLI health smoke tests and the metrics that are generated for these tests.
Metric | Description |
---|---|
pas_sli_duration_seconds_bucket | Number of seconds it took for the SLI test suite to run, grouped by duration |
pas_sli_duration_seconds_count | Total number of metrics in all the buckets |
pas_sli_duration_seconds_sum | Total value of the metrics in all the buckets |
pas_sli_exporter_status | Exporter status, a 1 indicates the exporter is running and healthy |
pas_sli_failures_total | Total number of failures of the SLI test suite |
pas_sli_run_duration_seconds | Number of seconds it took for the SLI test suite to run |
pas_sli_runs_total | Total number of runs of the SLI test suite. Use pas_sli_failures_total / pas_sli_runs_total to get failure rate. |
pas_sli_task_duration_seconds_bucket | Number of seconds it took for a particular task to run, grouped by duration |
pas_sli_task_duration_seconds_count | Total number of metrics in all the buckets |
pas_sli_task_duration_seconds_sum | Total value of the metrics in all the buckets |
pas_sli_task_run_duration_seconds | Number of seconds it took for a particular task to run |
pas_sli_task_runs_total | Total number of runs for a particular SLI task. Use pas_sli_task_failures_total / pas_sli_task_runs_total to get failure rate. |
pas_sli_task_failures_total{task="delete"} | Total number of failures for a cf delete command on PAS |
pas_sli_task_failures_total{task="login"} | Total number of failures for a cf login command on PAS |
pas_sli_task_failures_total{task="logs"} | Total number of failures for a cf logs command on PAS |
pas_sli_task_failures_total{task="push"} | Total number of failures for a cf push command on PAS |
pas_sli_task_failures_total{task="setEnv"} | Total number of failures for a cf set-env command on PAS |
pas_sli_task_failures_total{task="start"} | Total number of failures for a cf start command on PAS |
pas_sli_task_failures_total{task="stop"} | Total number of failures for a cf stop command on PAS |
PKS SLI Exporter (pks-sli-exporter)
The PKS Command Line Interface (PKS CLI) allows the operator to create and manage Kubernetes clusters. Healthwatch executes a continuous test suite to validate the core functions of the PKS CLI.
The table below provides information about the PKS CLI health smoke tests and the metrics that are generated for these tests.
Metric | Description |
---|---|
pks_sli_duration_seconds_bucket | Number of seconds it took for the SLI test suite to run, grouped by duration |
pks_sli_duration_seconds_count | Total number of metrics in all the buckets |
pks_sli_duration_seconds_sum | Total value of the metrics in all the buckets |
pks_sli_exporter_status | Exporter status, a 1 indicates the exporter is running and healthy |
pks_sli_failures_total | Total number of failures of the SLI test suite |
pks_sli_run_duration_seconds | Number of seconds it took for the SLI test suite to run |
pks_sli_runs_total | Total number of runs of the SLI test suite. Use pks_sli_failures_total / pks_sli_runs_total to get failure rate. |
pks_sli_task_duration_seconds_bucket | Number of seconds it took for a particular task to run, grouped by duration |
pks_sli_task_duration_seconds_count | Total number of metrics in all the buckets |
pks_sli_task_duration_seconds_sum | Total value of the metrics in all the buckets |
pks_sli_task_run_duration_seconds | Number of seconds it took for a particular task to run |
pks_sli_task_runs_total | Total number of runs for a particular SLI task. Use pks_sli_task_failures_total / pks_sli_task_runs_total to get failure rate. |
pks_sli_task_failures_total{task="clusters"} | Total number of failures for a pks clusters command |
pks_sli_task_failures_total{task="get-credentials"} | Total number of failures for a pks get-credentials command |
pks_sli_task_failures_total{task="login"} | Total number of failures for a pks login command |
pks_sli_task_failures_total{task="plans"} | Total number of failures for a pks plans command |
Cert Expiration Exporter (cert-expiration-exporter)
Healthwatch exposes metrics about the expiration of certificates. For more information, see here.
The table below provides information about the metrics that are generated.
Metric | Description |
---|---|
ssl_certificate_expiry_seconds{exported_instance=~".*"} | Duration in seconds until the certificate expires |
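Because ssl_certificate_expiry_seconds reports a duration in seconds, a consumer usually converts it to days and compares it against a renewal threshold. The following Python sketch shows one way to do that. The 30-day threshold, the function name `expiring_within`, and the instance names are hypothetical examples rather than Healthwatch defaults.

```python
from typing import Dict, List, Tuple

SECONDS_PER_DAY = 86_400


def expiring_within(
    expiry_seconds: Dict[str, float], days: float
) -> List[Tuple[str, float]]:
    """Return (instance, days_left) pairs for certificates that expire
    in fewer than `days` days, soonest first."""
    threshold = days * SECONDS_PER_DAY
    soon = [
        (instance, seconds / SECONDS_PER_DAY)
        for instance, seconds in expiry_seconds.items()
        if seconds < threshold
    ]
    return sorted(soon, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical values keyed by the exported_instance label.
    readings = {
        "example-cert-a": 10 * SECONDS_PER_DAY,
        "example-cert-b": 90 * SECONDS_PER_DAY,
        "example-cert-c": 2 * SECONDS_PER_DAY,
    }
    for instance, days_left in expiring_within(readings, 30):
        print(f"{instance}: {days_left:.1f} days left")
```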
TSDB (tsdb)
App availability and responsiveness issues can significantly impact the experience of end users. Healthwatch allows operators to configure Canary URLs in the tile, and it exposes whether each URL is up, along with response time metrics.
Note: These metrics are created by the blackbox exporter job on the TSDB VM in the Healthwatch deployment.
Metric | Description |
---|---|
probe_dns_additional_rrs | Returns number of entries in the additional resource record list |
probe_dns_answer_rrs | Returns number of entries in the answer resource record list |
probe_dns_authority_rrs | Returns number of entries in the authority resource record list |
probe_dns_duration_seconds | Duration of DNS request by phase |
probe_dns_lookup_time_seconds | Returns the time taken for probe dns lookup in seconds |
probe_dns_serial | Returns the serial number of the zone |
probe_duration_seconds | Returns how long the probe took to complete in seconds |
probe_failed_due_to_regex | Indicates if probe failed due to regex |
probe_http_content_length | Length of http content response |
probe_http_duration_seconds | Duration of http request by phase, summed over all redirects |
probe_http_last_modified_timestamp_seconds | Returns the Last-Modified HTTP response header in unixtime |
probe_http_redirects | The number of redirects |
probe_http_ssl | Indicates if SSL was used for the final redirect |
probe_http_status_code | Response HTTP status code |
probe_http_uncompressed_body_length | Length of uncompressed response body |
probe_http_version | Returns the version of HTTP of the probe response |
probe_icmp_duration_seconds | Duration of icmp request by phase |
probe_icmp_reply_hop_limit | Replied packet hop limit (TTL for ipv4) |
probe_ip_addr_hash | Specifies the hash of IP address. It’s useful to detect if the IP address changes. |
probe_ip_protocol | Specifies whether probe ip protocol is IP4 or IP6 |
probe_ssl_earliest_cert_expiry | Returns earliest SSL cert expiry in unixtime |
probe_ssl_last_chain_expiry_timestamp_seconds | Returns last SSL chain expiry in unixtime |
probe_ssl_last_chain_info | Contains SSL leaf certificate information |
probe_success | Displays whether or not the probe was a success |
probe_tls_version_info | Returns the TLS version used, or NaN when unknown |
Super Value Metrics (svm-forwarder)
The following metrics are Healthwatch 1.x metrics that are made available in the Loggregator Firehose for external use. They are created in Prometheus and are made available by the SVM Forwarder VM. See the Release Notes for more information.
Metric | Description |
---|---|
Diego_AppsDomainSynced | Whether or not Cloud Controller and Diego are in sync |
Diego_AvailableFreeChunksDisk | Available free chunks of disk in Diego |
Diego_AvailableFreeChunks | Available free chunks of memory in Diego |
Diego_LRPsAdded_1H | Rate of change in running app instances in 1 hour intervals |
Diego_TotalAvailableDiskCapacity_5M | Remaining cell disk available in Diego in 5 minute intervals |
Diego_TotalAvailableMemoryCapacity_5M | Remaining cell memory available in Diego in 5 minute intervals |
Diego_TotalPercentageAvailableContainerCapacity_5M | Percentage of total available container capacity in Diego in 5 minute intervals |
Diego_TotalPercentageAvailableDiskCapacity_5M | Percentage of total available disk in the Diego cells in 5 minute intervals |
Diego_TotalPercentageAvailableMemoryCapacity_5M | Percentage of total available memory in the Diego cells in 5 minute intervals |
Doppler_MessagesAverage_1M | Average Doppler message rate in 1 minute intervals |
Firehose_LossRate_1H | Log transport loss rate in 1 hour intervals |
Firehose_LossRate_1M | Log transport loss rate in 1 minute intervals |
SyslogAgent_LossRate_1M | Syslog Agent loss rate in 1 minute intervals |
SyslogDrain_RLP_LossRate_1M | Reverse Log Proxy loss rate in 1 minute intervals |
bosh_deployment | Represents bosh_deployments_status, a 1 indicates a deployment is occurring on the director |
health_check_bosh_director_success | BOSH SLI test status, 1 indicates success |
health_check_CanaryApp_available | Whether the canary app is available |
health_check_CanaryApp_responseTime | Response time of the canary app |
health_check_cliCommand_delete | Can CF delete? |
health_check_cliCommand_login | Can CF login? |
health_check_cliCommand_logs | Can receive logs? |
health_check_cliCommand_probe_count | Number of Healthwatch CLI command health probe assessments completed in the measured time interval. |
health_check_cliCommand_pushTime | CF app push time |
health_check_cliCommand_push | Can CF push? |
health_check_cliCommand_start | Can CF start? |
health_check_cliCommand_stop | Can CF stop? |
health_check_cliCommand_success | Overall success of the CF CLI SLI tests |
uaa_throughput_rate | UAA throughput rate |
Healthwatch Component Monitoring Metrics
The following metrics exist for the purpose of monitoring the Healthwatch components.
PKS Exporter (pks-exporter)
The PKS Exporter makes BOSH System metrics for PKS available in Prometheus.
Metric | Description |
---|---|
healthwatch_boshExporter_ingressLatency_seconds_bucket | Number of seconds it took to process a batch of Loggregator envelopes, grouped by latency |
healthwatch_boshExporter_ingressLatency_seconds_count | Total number of metrics in all the buckets |
healthwatch_boshExporter_ingressLatency_seconds_sum | Total value of the metrics in all the buckets |
healthwatch_boshExporter_ingress_envelopes | Number of envelopes received by observability metrics agent |
healthwatch_boshExporter_metricConversion_seconds_bucket | Number of seconds it took to convert a bosh metric to a Prometheus gauge, grouped by duration |
healthwatch_boshExporter_metricConversion_seconds_count | Total number of metrics in all the buckets |
healthwatch_boshExporter_metricConversion_seconds_sum | Total value of the metrics in all the buckets |
healthwatch_boshExporter_status | Exporter status, a 1 indicates the exporter is running and healthy |
PAS Exporter VMs (pas-exporter-*)
The following exporters take metrics from the Firehose and make them accessible on a Prometheus-compatible /metrics endpoint.
Each of the following exporters handles a specific subset of the Firehose metrics. The names of the exporters correspond to the metrics they export.
PAS Counter Exporter (pas-exporter-counter)
The PAS Counter Exporter makes Loggregator Firehose Counter metrics available in Prometheus.
Metric | Description |
---|---|
healthwatch_pasExporter_counterConversion_seconds | Number of seconds it took to convert a counter envelope to a Prometheus counter |
healthwatch_pasExporter_evictedMetrics | Number of metrics evicted from exporter cache |
healthwatch_pasExporter_ingressLatency_seconds | Number of seconds it took to process a batch of Loggregator envelopes |
healthwatch_pasExporter_ingress_envelopes | Number of envelopes received by observability metrics agent |
healthwatch_pasExporter_status | Exporter status, a 1 indicates the exporter is running and healthy |
PAS Gauge Exporter (pas-exporter-gauge)
The PAS Gauge Exporter makes Loggregator Firehose Gauge metrics available in Prometheus.
Metric | Description |
---|---|
healthwatch_pasExporter_evictedMetrics | Number of metrics evicted from exporter cache |
healthwatch_pasExporter_gaugeConversion_seconds | Number of seconds it took to convert a gauge envelope to a Prometheus gauge |
healthwatch_pasExporter_ingressLatency_seconds | Number of seconds it took to process a batch of Loggregator envelopes |
healthwatch_pasExporter_ingress_envelopes | Number of envelopes received by observability metrics agent |
healthwatch_pasExporter_status | Exporter status, a 1 indicates the exporter is running and healthy |
PAS Timer Exporter (pas-exporter-timer)
The PAS Timer Exporter makes Loggregator Firehose Timer metrics available in Prometheus.
Metric | Description |
---|---|
healthwatch_pasExporter_evictedMetrics | Number of metrics evicted from exporter cache |
healthwatch_pasExporter_ingressLatency_seconds | Number of seconds it took to process a batch of Loggregator envelopes |
healthwatch_pasExporter_ingress_envelopes | Number of envelopes received by observability metrics agent |
healthwatch_pasExporter_status | Exporter status, a 1 indicates the exporter is running and healthy |
Prometheus Exposition
These metrics come from the majority of the exporter VMs and relate to the /metrics endpoint that Prometheus scrapes.
Metric | Description |
---|---|
healthwatch_prometheusExpositionLatency_seconds | Number of seconds it took to render Prometheus scrape page |
healthwatch_prometheusExposition_expiredMetrics | Number of metrics expired from exporter cache |
healthwatch_prometheusExposition_histogramMapConversion | Time it takes to convert histogram collection to a map |
healthwatch_prometheusExposition_metricMapConversion | Time it takes to convert metrics collection to a map |
healthwatch_prometheusExposition_metricSorting | Time it takes to sort metrics when rendering Prometheus exposition |
SVM Forwarder Monitoring Metrics (svm-forwarder)
The SVM Forwarder makes Healthwatch 1.x Super Value Metrics available in the Loggregator Firehose for external use. See the Release Notes for more information.
Metric | Description |
---|---|
failed_scrapes_total | Total number of failed scrapes for the target source_id |
last_total_attempted_scrapes | Count of attempted scrapes during last round of scraping |
last_total_failed_scrapes | Count of failed scrapes during last round of scraping |
last_total_scrape_duration | Time in milliseconds to scrape all targets in last round of scraping |
scrape_targets_total | Total number of scrape targets identified from prom scraper config files |