Monitoring and Troubleshooting Apps with App Metrics

This topic describes how developers can monitor and troubleshoot their apps using App Metrics.

Overview

App Metrics helps you understand and troubleshoot the health and performance of your apps by offering the following indicators, data, and visualizations:

  • Latency: Response times for your app
  • Traffic: Number of requests made for your app
  • Errors: HTTP errors thrown by your app
  • Saturation (Container Metrics): Three charts measuring CPU, memory, and disk consumption percentages
  • Custom Metrics: User-customizable charts for measuring app performance, such as Spring Boot Actuator and Micrometer metrics, or user-defined custom business metrics
  • App Events: A chart of update, start, stop, crash, SSH, and staging failure events
  • Logs: A list of app logs that you can search, filter, and download

The following sections describe a standard workflow for using App Metrics to monitor or troubleshoot your apps.

View an App

In a browser, navigate to metrics.sys.DOMAIN and log in with your User Account and Authentication (UAA) credentials. Use the search bar to choose the app whose metrics or logs you want to view. App Metrics respects UAA permissions, so you can view any app that runs in a space you have access to.

App Metrics displays app data for a given time frame. To adjust it, see Change the Time Frame below.

Change the Time Frame

The charts show time along the horizontal axis. You can change the time frame for all charts and the logs by using the time selector options at the top of the window. You can select from several pre-set timescales or select a custom date range.

Zoom: From within any chart, click and drag to zoom in on areas of interest. This adjusts all of the charts, and the logs, to show data from that time frame.

Auto-Refreshing the Dashboard

Auto-refresh mode updates the metrics charts and logs on your dashboard at a timed interval as data is ingested.

To enable auto-refresh, click the REFRESH button next to the time selector options at the top right of the dashboard. This enables live updating of metrics and logs data for the currently selected time frame.

Note: The default auto-refresh interval is set to one minute and is currently not configurable.

View Metrics at the Process and App Instance Level

App Metrics relays metric data at the app process level to support in-depth troubleshooting, even across rolling deployments. You can view the metrics for a specific process and drill down into specific instances within that process. These correspond directly to the processes and app instances shown in Apps Manager.

By default, the dashboard displays metrics aggregated across all processes. To view metrics for a specific process, select a process type from the dropdown near the upper left of the dashboard.

With a specific process type selected, the metrics charts display aggregate data from all instances within that process type.

To view metrics for the individual instances within the selected process, select the “Instances” radio button at the upper right of the dashboard.

To view metrics for one or more specific app instances, select them from the legend along the bottom of any chart on the dashboard while the “Instances” radio button is selected.

Interpreting Metrics

The default metrics charts included with App Metrics provide high-level indicators of the Four Golden Signals for monitoring the health of apps running on distributed systems: Latency, Traffic, Errors, and Saturation.

The following sections explain how to use each of the charts on the dashboard to monitor and troubleshoot your app.

Network Metrics

Note: If apps are not configured for network traffic, they show No Data or zeros for the default Latency, Traffic, and Errors metrics.

Latency

  • Average latency of a request in milliseconds:

    A spike in response time means your users are waiting longer. Scaling app instances can spread that workload over more resources and result in faster response times.

Traffic

  • Number of network requests per minute:

    A spike in HTTP requests means more users are using your app. Scaling app instances can reduce the response time.

Errors

  • Number of network request errors per minute:

    A spike in HTTP errors means one or more 5xx errors have occurred. Check your app logs for more information.

Saturation (Container Metrics)

The following Container Metrics charts are available on the App Metrics dashboard to help monitor resource saturation:

  • CPU usage percentage:

    A spike in CPU might point to a process that is computationally heavy. Scaling app instances can relieve the immediate pressure, but you need to investigate the app to better understand and fix the root cause.

  • Memory usage percentage:

    A consistent, gradual increase in memory might mean a resource leak in the code. Scaling app memory can relieve the immediate pressure, but you need to find and resolve the underlying issue so that it does not occur again.

  • Disk usage percentage:

    A spike in disk might mean the app is writing logs to files instead of STDOUT, caching data to local disk, or serializing large sessions to disk.

Events

In addition, the Events chart helps you correlate these metrics with events for your app, including Crash, Fail (staging failures), Update, Stop, Start, and SSH.

Note: The SSH event corresponds to someone successfully using SSH to access a container that runs an instance of the app.

See the following topics for more information about app events:

Adding Custom Metrics Charts

You can add custom metrics charts to your dashboard, including Spring Boot Actuator and Micrometer metrics, by defining the custom metrics you want to monitor and including them in an indicator document for your app.

To get custom, Actuator, or Micrometer metrics into Metric Store, bind Metric Registrar to your app and register your endpoint. For more information, see Configuring the Metric Registrar.

If you want to view custom metrics, you can configure your apps to emit those metrics out of the Loggregator Firehose and then view these metrics on the App Metrics dashboard.

In addition, Spring Boot apps with actuators or Micrometer metrics implemented emit these metrics out of the box, without any changes to source code.

Create an Indicator Document

An indicator document is a YAML document that specifies which app you want to monitor and the indicators you want to use to monitor it.

There are three steps to creating an indicator document:

  1. Find the metric you want to monitor
  2. Write the PromQL query
  3. Add the PromQL to your indicator document

Find the Metric Name

First, verify that the metrics are being emitted. After you have configured Metric Registrar to scrape your metrics endpoint, you can check that endpoint for the metric names.

If you are using a Prometheus-style metrics endpoint, you can do so by requesting your app’s metrics endpoint at app.domain/metrics and looking for the desired metric.

To validate Spring Boot Actuator and Micrometer metrics, see Metrics in Spring Boot Actuator: Production-ready Features in the Spring Boot documentation.
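As a rough illustration of the check described above, the following Python sketch lists the metric names found in Prometheus-style text output. The sample payload, including its label and HELP text, is invented for illustration; only the metric name customMetricName500 is borrowed from the indicator document example in this topic, and the function name is hypothetical.

```python
import re


def metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line starts with the metric name, optionally followed by labels.
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return sorted(names)


# Invented sample of what a metrics endpoint might return.
sample = """\
# HELP customMetricName500 Hypothetical counter of 500 responses
# TYPE customMetricName500 counter
customMetricName500{path="/checkout"} 3
jvm_memory_used_bytes 1.2e+07
"""

print(metric_names(sample))  # ['customMetricName500', 'jvm_memory_used_bytes']
```

To run the same check against a live app, you could pass in the response body from app.domain/metrics instead of the sample string.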

Write a PromQL Query

After you have the metric name, write a PromQL query for visualizing the metric.

  1. Find additional example PromQL for any of the default charts on the dashboard by clicking Info in the upper right of the chart, or by visiting the PromQL Query Examples documentation.

  2. Use the PromQL Explorer to test out PromQL before putting it in an indicator document:

    1. Click the + button at the bottom right of the dashboard.
    2. Test out queries to see how the graph looks before placing it in an indicator document.

    Note: For non-admin users, PromQL queries should always include the source_id tag. App Metrics supports a $sourceId parameter in PromQL, which automatically resolves to the source ID of the current app. Example: cpu{source_id="$sourceId"}
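App Metrics resolves $sourceId for you on the dashboard. If you want to reuse the same query somewhere the parameter is not resolved, you would need to supply the app GUID yourself. The following Python sketch only illustrates the effect of that substitution; the function name is hypothetical, and the GUID is the sample value from the Log Store response example in this topic.

```python
from string import Template


def resolve_source_id(promql, source_id):
    """Replace the $sourceId placeholder in a query with a concrete app GUID."""
    # safe_substitute leaves any other "$" in the query (for example in a
    # regular expression) untouched instead of raising an error.
    return Template(promql).safe_substitute(sourceId=source_id)


app_guid = "50efa176-bd06-42d1-bac8-672aab387e75"  # sample GUID, not a real app
print(resolve_source_id('cpu{source_id="$sourceId"}', app_guid))
# cpu{source_id="50efa176-bd06-42d1-bac8-672aab387e75"}
```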

Add the PromQL to Your Indicator Document

After you have the PromQL ready, put it in an indicator document.

For example, if you have a custom metric customMetricName500 and want to graph the number of errors over a one-minute period, then your PromQL query is sum(avg_over_time(customMetricName500{source_id="$sourceId"}[1m])). The following is an example of the YAML for an indicator document:

---
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument

metadata:
  labels:
    deployment: "my deployment name"

spec:
  product:
    name: org,space,app-name
    version: 0.0.1

  indicators:
    - name: CustomErrorCount500
      promql: "sum(rate(customMetricName500{source_id='$sourceId'}[1m]))"
      documentation:
        title: "Custom Metric 500 Errors"
      presentation:
        units: "none"

The org,space,app-name in the example above determines which app these indicators are applied to. Replace org,space,app-name with the org, space, and app name of the app dashboard that you want to customize.

Indicator Document Schema

App Metrics uses a derivative version of the Indicator Protocol. For more information about the App Metrics-supported indicator document schema, see Indicator Document Template Reference.

Adding Monitoring and Alerting

You can add custom monitoring to your dashboard’s indicators by creating a custom monitor document for your app.

Creating a Monitor Document

Monitors are linked to specific indicators, so the first step to adding custom monitoring and alerting to your app is to verify the names of the indicators you would like to monitor.

You can view the indicator name of each chart on your app’s dashboard by hovering over the chart, clicking the kebab menu in the right-hand corner, and selecting Info.

The indicator name can correspond to one of your custom indicators or to one of the default indicator names: RequestCount, HttpLatency, ErrorCount, CPU, MemoryPercentage, and DiskPercentage.

Once you have the indicator names, you can create a monitor document that defines the thresholds for your indicator and the webhook to send alerts to. The following is an example of the YAML for a monitor document:

---
product: org,space,app-name

webhook_url: https://my-slack-webhook.com

monitors:
  - name: 500 Errors For Application
    indicator: ErrorCount
    warning:
       operator: gte
       threshold: 1.0
       duration: 1m
       only_every: 1h
    critical:
       operator: gte
       threshold: 2.0
       duration: 1m
       only_every: 15m
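Because each monitor must point at an existing indicator, it can help to check the names before you submit the document. The following Python sketch is one hypothetical way to do that: it compares each monitor's indicator against the default indicator names listed above plus any custom names you pass in. The function name and the sample monitors (other than the first) are invented for illustration, and the monitors are written as plain dictionaries instead of being parsed from YAML.

```python
# Default indicator names, as listed in this topic.
DEFAULT_INDICATORS = {
    "RequestCount", "HttpLatency", "ErrorCount",
    "CPU", "MemoryPercentage", "DiskPercentage",
}


def unknown_indicators(monitors, custom_indicators=()):
    """Return the names of monitors whose indicator is not a known indicator."""
    known = DEFAULT_INDICATORS | set(custom_indicators)
    return [m["name"] for m in monitors if m.get("indicator") not in known]


monitors = [
    {"name": "500 Errors For Application", "indicator": "ErrorCount"},
    {"name": "Custom 500s", "indicator": "CustomErrorCount500"},
    {"name": "Typo example", "indicator": "ErrorCounts"},  # deliberate typo
]

print(unknown_indicators(monitors, custom_indicators=["CustomErrorCount500"]))
# ['Typo example']
```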

The org,space,app-name in the example above determines which app these monitors are applied to. Replace it with the org, space, and app name of the app that you want to monitor.

The webhook_url value, https://my-slack-webhook.com in the example, is where alerts are sent when a threshold is surpassed. Slack is currently the only supported use case, but other webhook platforms may work if they accept a “text” payload.
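To check whether a webhook endpoint accepts a “text” payload before you rely on it for alerts, you could send it a test message yourself. The following Python sketch assumes the endpoint takes a JSON body with a single "text" field, as Slack incoming webhooks do. The exact message format that App Metrics sends is not described in this topic, so the message text and both function names here are hypothetical.

```python
import json
import urllib.request


def build_alert_payload(message):
    """Encode a message as a JSON body with a single "text" field."""
    return json.dumps({"text": message}).encode("utf-8")


def post_alert(webhook_url, message, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_alert_payload(message),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Show the body that would be sent; nothing is posted here.
print(build_alert_payload("Test alert: ErrorCount >= 1.0 for 1m").decode("utf-8"))
# {"text": "Test alert: ErrorCount >= 1.0 for 1m"}
```

Calling post_alert with your own webhook URL sends the test message. A 2xx status suggests that the platform accepts the payload shape.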

Monitor Document Schema

For more detailed information on the monitor document schema, see the Monitor Document Template Reference.

Custom Metric Demos

Logs

The Logs view displays app log data ingested from the Loggregator Reverse Log Proxy (RLP):

Note: Logs with non-UTF-8 characters or non-standard UUID app GUIDs are not stored.

You can interact with the Logs view in the following ways:

  • Keyword: Perform a keyword search. While you filter on keywords, the log results are reduced to only the log lines that contain the matching criteria. Matching terms are highlighted in blue.
  • Highlight: Enter a term to visually highlight within your search. The terms will be highlighted in orange within the current filter results.
  • Sources: Choose which sources to display logs from. For more information, see Log Types and Their Messages.
  • Download: Download a file containing logs for the current search.
  • Copy: Click the copy icon to copy the text of the log.

By default, the logs drawer displays the most recent 1,000 log lines. You can click SHOW 1000 MORE LOGS to load more.

Direct Data Access

You can query Metric Store and Log Store directly to access the raw data.

Metric Store API

To query Metric Store, see Using Metric Store.

Log Store API

Prerequisites

Authorization & Authentication

When querying the API over HTTPS, each request must set the Authorization header to a UAA-provided token.

Querying via HTTP Endpoints

GET /v1/sources/{sourceID}/logs

Issues a query against Log Store data.

Path Parameters:

  • sourceID – The app or component source ID. The app source ID is the same as the app GUID.

Query Parameters:

  • query is a PromQL label selector query for filtering logs on message, message_type, source_type, and instance_id.
    • message – RegEx to search the log message body
    • message_type – The file descriptor the log was written to, OUT or ERR
    • source_type – The source of the log, any subset of {"API","APP","CELL","HEALTH","LGR","RTR","SSH","STG"} connected by pipes, e.g. "APP|API".
    • instance_id – Filter based on the instance ID of the app or component that wrote the log
  • startTime is an optional UNIX timestamp in nanoseconds or RFC3339. Defaults to 10 minutes ago. Must be before end time.
  • endTime is an optional UNIX timestamp in nanoseconds or RFC3339. Defaults to now. Must be after start time.
  • limit is an optional maximum number of logs to return. Defaults to 100.
  • page is an optional number of the page of logs to be returned, must be >= 1. Defaults to 1.
  • order is an optional order in which the logs will be returned, asc or desc. Defaults to desc.

export SYSTEM_DOMAIN="<YOUR_SYSTEM_DOMAIN>"
export SOURCE_ID="$(cf app <YOUR_APP> --guid)"
# -G sends the URL-encoded data as GET query-string parameters
curl -G -H "Authorization: $(cf oauth-token)" \
     "https://log-store.$SYSTEM_DOMAIN/v1/sources/$SOURCE_ID/logs" \
     --data-urlencode 'query={message=~"Error.*"}'
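
The optional query parameters can be combined in a single request. The following Python sketch shows one way to assemble such a URL, including a start time expressed as a UNIX timestamp in nanoseconds. The system domain sys.example.com, the helper names, and the chosen parameter values are assumptions for illustration; the sketch only builds the URL and does not send the request or add the Authorization header.

```python
import calendar
from datetime import datetime, timezone
from urllib.parse import urlencode


def to_unix_nanos(moment):
    """Convert a timezone-aware datetime to a UNIX timestamp in nanoseconds."""
    seconds = calendar.timegm(moment.utctimetuple())
    return seconds * 1_000_000_000 + moment.microsecond * 1_000


def log_store_url(system_domain, source_id, **params):
    """Build the logs URL, omitting any parameter whose value is None."""
    query = {key: value for key, value in params.items() if value is not None}
    base = f"https://log-store.{system_domain}/v1/sources/{source_id}/logs"
    return f"{base}?{urlencode(query)}" if query else base


start = datetime(2020, 3, 24, 6, 50, tzinfo=timezone.utc)
url = log_store_url(
    "sys.example.com",                        # hypothetical system domain
    "50efa176-bd06-42d1-bac8-672aab387e75",   # sample app GUID
    query='{message=~"Error.*"}',
    startTime=to_unix_nanos(start),
    limit=50,
    order="asc",
)
print(url)
```

Letting urlencode escape the label selector avoids hand-encoding the braces and quotes in the query value.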

Response Body

{
  "metadata": {
    "count": 1,
    "links": {}
  },
  "items": [
    {
      "instance_id": "0",
      "message": "Error: Sample query didn't work",
      "message_type": "OUT",
      "source_id": "50efa176-bd06-42d1-bac8-672aab387e75",
      "source_type": "APP/PROC/WEB",
      "timestamp": "2020-03-24T06:57:29.788299446Z"
    }
  ]
}
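
A response in this shape can be flattened into one line per log entry. The following Python sketch does so using only the fields that appear in the example response above; the function name and the output layout are hypothetical choices.

```python
import json


def summarize_logs(response_body):
    """Return one summary line per log item in a Log Store response body."""
    payload = json.loads(response_body)
    return [
        f'{item["timestamp"]} [{item["source_type"]}/{item["instance_id"]}] '
        f'{item["message_type"]} {item["message"]}'
        for item in payload.get("items", [])
    ]


# The example response body from this topic.
sample_response = """
{
  "metadata": {"count": 1, "links": {}},
  "items": [
    {
      "instance_id": "0",
      "message": "Error: Sample query didn't work",
      "message_type": "OUT",
      "source_id": "50efa176-bd06-42d1-bac8-672aab387e75",
      "source_type": "APP/PROC/WEB",
      "timestamp": "2020-03-24T06:57:29.788299446Z"
    }
  ]
}
"""

for line in summarize_logs(sample_response):
    print(line)
# 2020-03-24T06:57:29.788299446Z [APP/PROC/WEB/0] OUT Error: Sample query didn't work
```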