Monitoring App Metrics


This topic explains how to monitor the health of the App Metrics service using the logs, metrics, and Key Performance Indicators (KPIs) emitted by Tanzu Application Service and the App Metrics application itself.

For more information about monitoring TAS for VMs, see Monitoring TAS for VMs.

Healthwatch

The recommended way to monitor App Metrics is with Healthwatch. After you install Healthwatch, go to the JobHealth dashboard to view the App Metrics deployment, which is named appMetrics.

Healthwatch also supports alerting based on the VM persistent disk percentage (system.disk.persistent.percent) and VM health (system.healthy) metrics.

App Metrics

The App Metrics dashboard for the App Metrics application also displays the platform indicators as custom metrics. App Metrics also supports alerting based on dashboard indicators.

Key Performance Indicators

KPIs for App Metrics are the metrics that operators find most useful for monitoring their App Metrics service. KPIs are high-signal-value metrics that can indicate emerging issues.

VMware provides the following KPIs as general alerting and response guidance for typical App Metrics installations. VMware recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. VMware also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

BOSH Metrics

All BOSH-deployed components generate the following metrics. Monitor them to ensure that they are not consuming excess resources.


Log Store VMs (log-store-vms)

Metric: disk_persistent_percent
Description: Percentage of VM persistent disk used for Log Store.

Use: It is important to make sure that the system disks of the data services do not fill up and cause data loss and performance degradation.

Type: percent
PromQL Used: avg(avg_over_time(system_disk_persistent_percent{deployment=~'log-store-prod',job='log-store',source_id='bosh-system-metrics-forwarder'}[60s])) by (index)
Recommended alert thresholds: Yellow warning: > 70%
Red critical: > 85%
Recommended response: Scale up Log Store disks vertically as needed to prevent data loss. Scaling horizontally results in data loss.
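
The thresholds above can be read as a simple classification rule. The following is a minimal Python sketch of that rule, not part of App Metrics or Healthwatch. The function name classify_disk_usage and the sample readings are hypothetical; only the 70% and 85% thresholds come from this topic.

```python
def classify_disk_usage(percent_used, warning=70.0, critical=85.0):
    """Map a persistent-disk percentage to an alert level.

    The defaults mirror the Log Store thresholds in this topic:
    yellow above 70%, red above 85%.
    """
    if not 0.0 <= percent_used <= 100.0:
        raise ValueError("percent_used must be between 0 and 100")
    if percent_used > critical:
        return "red"
    if percent_used > warning:
        return "yellow"
    return "ok"


# Hypothetical per-VM readings keyed by BOSH instance index.
readings = {"0": 42.5, "1": 71.0, "2": 88.3}
levels = {index: classify_disk_usage(value) for index, value in readings.items()}
print(levels)
```

Passing warning=90.0 and critical=95.0 applies the same rule to the PostgreSQL VM thresholds.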

PostgreSQL VM (db-and-errand-runner)

Metric: disk_persistent_percent
Description: Percentage of VM persistent disk used for PostgreSQL.

Use: This disk stores custom indicator files, configured monitors, and triggered alerts. If the disk fills up, you cannot further customize dashboards and monitors, and new alert triggers are not displayed on metrics graphs.

PromQL Used: avg(avg_over_time(system_disk_persistent_percent{deployment=~'appMetrics-.*',job='db-and-errand-runner',source_id='bosh-system-metrics-forwarder'}[60s]))
Type: percent
Recommended alert thresholds: Yellow warning: > 90%
Red critical: > 95%
Recommended response: Scale up the disk as appropriate. Further customization is not available while scaling is occurring.

Application Metrics

All applications pushed using Cloud Foundry automatically emit the following application metrics. App Metrics is a single application and can therefore be monitored by App Metrics itself or by another application monitoring service. The following KPIs can indicate problems with App Metrics and are useful for monitoring any application. Non-routed applications return no data or all zeros for the Latency, Errors, and Traffic metrics.


Latency

Description: The amount of time to service a request.

Use: Slow feedback is a symptom of degraded performance.

PromQL Used: (sum(rate(http_duration_seconds_sum{source_id="$sourceId"}[60s])) by (process_type, source_id) / sum(rate(http_duration_seconds_count{source_id="$sourceId"}[60s])) by (process_type, source_id) * 1000)
Type: milliseconds
Recommended response: Scale up as appropriate.
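
The latency query divides the rate of total request time by the rate of request count, then converts seconds to milliseconds. Because both rates use the same 60-second window, the window length cancels out. The following Python sketch reproduces that arithmetic from two hypothetical counter samples. The function name and sample values are assumptions for illustration, and the sketch ignores the extrapolation and counter-reset handling that rate() performs.

```python
def average_latency_ms(sum_start, sum_end, count_start, count_end):
    """Average request latency in milliseconds over one window.

    sum_* are samples of a cumulative "seconds spent serving requests"
    counter, and count_* are samples of a cumulative request counter,
    both taken at the start and end of the same window.
    """
    delta_seconds = sum_end - sum_start
    delta_requests = count_end - count_start
    if delta_requests <= 0:
        # No requests in the window, so there is no latency to report.
        return None
    return delta_seconds / delta_requests * 1000.0


# Hypothetical samples: 3 seconds of request time spread over 60 requests.
latency = average_latency_ms(sum_start=120.0, sum_end=123.0,
                             count_start=1000, count_end=1060)
print(f"{latency:.1f} ms")
```

A result of None corresponds to the "no data" case that the topic describes for non-routed applications.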

Traffic

Description: The amount of time to service a request.

Use: Slow feedback is a symptom of degraded performance.

PromQL Used: (sum(rate(http_duration_seconds_sum{source_id="$sourceId"}[60s])) by (process_type, source_id) / sum(rate(http_duration_seconds_count{source_id="$sourceId"}[60s])) by (process_type, source_id) * 1000)
Type: milliseconds
Recommended response: Scale up as appropriate.

Errors

Description: The rate of failed requests, that is, the number of responses with a 500 status code.

Use: Any number of failures indicates a problem with the application or underlying infrastructure.

PromQL Used: sum((rate(http_total{source_id="$sourceId",status_code="500"}[60s:30s])) * 60) by (process_type, source_id)
Type: count
Recommended response: Investigate application metrics and logs, as well as the metrics.sys.DOMAIN/integration-status endpoint.
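
The error query multiplies a per-second rate by 60, which yields an approximate count of failed requests per minute. The following Python sketch shows that conversion for a single counter. The function name and sample values are hypothetical, and the counter-reset handling is a simplification of what rate() does.

```python
def errors_per_minute(count_start, count_end, window_seconds=60.0):
    """Approximate 500-status responses per minute from two counter samples."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    increase = count_end - count_start
    if increase < 0:
        # A counter that went down was reset (for example, the app restarted),
        # so treat the post-reset value as the increase for this window.
        increase = count_end
    per_second = increase / window_seconds
    return per_second * 60.0


# Hypothetical samples: the 500-status counter moved from 12 to 15 in 60 seconds.
rate = errors_per_minute(count_start=12, count_end=15)
print(f"{rate:.1f} errors/min, alert={rate > 0}")
```

The alert condition here is simply rate > 0, following the guidance that any number of failures indicates a problem.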

Saturation

Description: The amount of resources being used by the application.

Use: Saturation is made up of CPU, memory, and disk usage. Performance can degrade as the amount of a resource in use approaches its limit.

CPU PromQL Used: avg(avg_over_time(cpu{source_id="$sourceId"}[60s])) by (process_type, source_id)
CPU Type: percent
Memory PromQL Used: avg(memory{source_id="$sourceId"} / memory_quota{source_id="$sourceId"}) by (process_type, source_id) * 100
Memory Type: percent
Disk PromQL Used: avg(disk{source_id="$sourceId"} / disk_quota{source_id="$sourceId"}) by (process_type, source_id) * 100
Disk Type: percent
Disk Type: percent
Recommended alert thresholds for App Metrics: Yellow warning: > 80%
Red critical: > 90%
Recommended response: Scale up memory and disk quota on the app as appropriate and turn off the push-apps errand on the tile.
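
The memory and disk queries divide usage by quota for each app instance, average the ratios per process type, and multiply by 100. The following Python sketch mirrors that calculation on hypothetical samples. The function name, the tuple layout, and the byte values are assumptions for illustration only.

```python
from collections import defaultdict


def saturation_percent_by_process(samples):
    """Average used/quota ratio per process type, as a percentage.

    samples is an iterable of (process_type, used_bytes, quota_bytes),
    one tuple per app instance.
    """
    ratios = defaultdict(list)
    for process_type, used, quota in samples:
        if quota <= 0:
            # Skip instances without a usable quota to avoid dividing by zero.
            continue
        ratios[process_type].append(used / quota)
    return {ptype: sum(values) / len(values) * 100.0
            for ptype, values in ratios.items()}


MIB = 1024 * 1024
# Hypothetical instances: two "web" instances and one "worker" instance.
samples = [
    ("web", 256 * MIB, 1024 * MIB),
    ("web", 768 * MIB, 1024 * MIB),
    ("worker", 960 * MIB, 1024 * MIB),
]
print(saturation_percent_by_process(samples))
```

With these sample values, the worker process type sits above the 90% red threshold, while the web process type averages well below the 80% yellow threshold.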