PCF Metrics v1.5

Troubleshooting PCF Metrics

This topic describes how to resolve common issues experienced while operating or using Pivotal Cloud Foundry (PCF) Metrics.

Insufficient Resources

Error Insufficient Resources
Cause Your PCF deployment has insufficient Diego resources to handle the apps pushed as part of a PCF Metrics installation.

The PCF Metrics tile deploys the following apps:
App Memory Disk
metrics-queue* 512 MB 1 GB
logs-queue* 256 MB 1 GB
metrics-ingestor* 512 MB 1 GB
metrics 1 GB 2 GB
metrics-ui 256 MB 1 GB
metrics-alerting 1 GB 2 GB
*The number of instances of each of these apps depends on your sizing needs. You configure these instance counts in the Data Store pane of the tile.

Solution Increase the number of Diego cells so that your PCF deployment can support the apps pushed as part of the PCF Metrics installation:

  1. Navigate to the Resource Config section of the PAS tile.
  2. In the Diego Cell row, add another Instance.
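As a rough capacity check, you can total the memory the tile's apps request from Diego. A minimal shell sketch, assuming a hypothetical instance count of 2 for each scalable app (substitute the values from your Data Store pane):

```shell
# Hypothetical instance counts for the scalable apps; set these to match
# your Data Store pane configuration.
metrics_queue=2; logs_queue=2; metrics_ingestor=2

# Per-instance memory in MB, taken from the table above.
total_mb=$(( metrics_queue*512 + logs_queue*256 + metrics_ingestor*512 + 1024 + 256 + 1024 ))
echo "Total app memory requested: ${total_mb} MB"
```

Compare this total against the free memory across your Diego cells to decide how many cells to add.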

Missing Specific Logs

Error Logs are missing for a specific app, or a subset of logs is being skipped.
Cause PCF Metrics does not store logs with messages containing non-UTF-8 characters or logs with app GUIDs that are not standard UUIDs.
Solution Remove non-UTF-8 characters from log messages and ensure each log is created with a standard app GUID.
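You can pre-check a log message and an app GUID against these two constraints before investigating further. A minimal shell sketch, using a hypothetical message and GUID:

```shell
# Reject log messages containing non-UTF-8 bytes; iconv exits non-zero
# when the input is not valid UTF-8.
msg='order 1234 processed'
if printf '%s' "$msg" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  echo "message is valid UTF-8"
fi

# Verify an app GUID matches the standard UUID format (hypothetical GUID).
guid="0f4c7d9e-1b2a-4c3d-8e5f-6a7b8c9d0e1f"
if printf '%s' "$guid" | grep -Eiq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'; then
  echo "guid is a standard UUID"
fi
```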

High CPU on PostgreSQL VM

Error PostgreSQL VM CPU usage is over 80%.
Cause The PostgreSQL VM does not have enough CPU or storage allocated.
Solution Increase the size of the PostgreSQL VM:

  1. If the disk storage is below 85%, ssh to the PostgreSQL VM and check the load average (with `uptime` or `top`). If the load average is high, increase the CPU allocated for the PostgreSQL VM.
  2. If the disk storage is over 85%, increase the storage space allocated for the PostgreSQL VM.
  3. The CPU may remain high for a period of time after you increase resources.
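The load-average check in step 1 can be scripted. A minimal sketch, assuming a Linux VM (it reads /proc/loadavg):

```shell
# Compare the 1-minute load average against the CPU count; a load
# persistently above the CPU count suggests the VM is CPU-bound.
load=$(cut -d ' ' -f1 /proc/loadavg)
cpus=$(getconf _NPROCESSORS_ONLN)
if awk -v l="$load" -v c="$cpus" 'BEGIN { exit !(l+0 > c+0) }'; then
  echo "load average ${load} exceeds ${cpus} CPUs: consider adding CPU"
else
  echo "load average ${load} is within ${cpus} CPUs"
fi
```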

Too Many Clients Error

Error You encounter sorry, too many clients already errors when accessing the PCF Metrics UI in your browser.
Cause Your PostgreSQL is undersized for your logs load.
Solution Decrease the Logs Retention Window and increase the PostgreSQL Persistent Disk

  1. Navigate to the Metrics Components Config section of the PCF Metrics Tile.
  2. Decrease the Logs Retention Window to a smaller number to free up space in PostgreSQL.
  3. Navigate to the Resource Config section of the PCF Metrics Tile.
  4. Increase the Persistent Disk size of PostgreSQL Server to at least twice the current size.

Failed to Fetch Apps

Error Even though you entered the correct UAA credentials, the metrics app fails to fetch the list of apps.
Cause Browser plugins or cookies inject extraneous content into requests to the Cloud Controller API, causing it to reject the requests.
Solution Confirm the problem and clear the browser, as follows:
  1. Try the browser’s incognito mode to see if the metrics app is able to fetch the list of apps. If this works, the problem is likely cookies or plugins.
  2. Clear your browser cookies and disable plugins.

Redis Temporary Datastore Stops Accepting Metrics

Error You see both these problems:
  • Metrics stop appearing on the UI.
  • When you run cf metrics-ingestor logs, you see the following entry in the Ingestor logs:
    MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Cause The Redis datastore is full. The component is out of memory or persistent disk space.
Solution Confirm the problem and scale up Redis, as follows:
  1. On the Metrics tile, click the Status tab and look to see if the memory or persistent disk usage of the Redis job is over 80%.
  2. Scale up the Redis component. For more information, see Scale the Temporary Datastore (Redis).

Received No Results Back from MySQL - Failing

Error Received no results back from mysql - failing
Cause The Ingestor is not functioning properly.
Solution
  1. From the cf CLI, target the system org and metrics-v1-5 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-5
  2. Run cf apps to see if these apps are running:
    • metrics-ingestor
    • metrics-queue
  3. If the apps are not running, run the following commands to start them:
    $ cf start metrics-ingestor
    $ cf start metrics-queue
  4. Run the following commands and search the app logs for ERROR messages containing additional information:
    $ cf logs metrics-ingestor --recent
    $ cf logs metrics-queue --recent

    Note: In some cases, the apps cannot communicate due to TLS certificate verification failure. If your deployment uses self-signed certs, ensure the Disable SSL certificate verification for this environment box is selected in the PAS Networking pane.

Failed to Connect to MySQL

Error Failed to connect to mysql
Cause MySQL is not running properly.
Solution
  1. Check the logs of the MySQL Server and MySQL Proxy jobs for errors.
    • You can download the logs from the PCF Metrics tile under the Status tab.
  2. From the cf CLI, target the system org and metrics-v1-5 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-5
  3. Run the following command and ensure the security group can access the MySQL jobs:
    $ cf security-group metrics-api

    Note: PCF Metrics creates a default security group to allow all traffic to its apps.

Never Received App Logs

Error Never received app logs - something in the firehose -> PostgreSQL flow is broken
Cause Ingestor is not inserting logs correctly.
Solution
  1. From the cf CLI, target the system org and metrics-v1-5 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-5
  2. Run cf apps to see if these apps are running:
    • metrics-ingestor
    • logs-queue
  3. If the apps are not running, run the following commands to start them:
    $ cf start metrics-ingestor
    $ cf start logs-queue
  4. Run the following commands and search the app logs for ERROR messages containing additional information:
    $ cf logs metrics-ingestor --recent
    $ cf logs logs-queue --recent

    Note: In some cases, you might discover a failure to communicate with Loggregator in the form of a bad handshake error.

    Ensure the Loggregator Port setting in the PAS tile Networking pane is set to the correct value. For AWS, it is 4443. For all other IaaSes, it is 443.

    If the metrics-ingestor logs show Aggregation Stored Procedures key is not in redis, stop the metrics-ingestor application, restart Redis, and start the metrics-ingestor application again.

    Redis is used for metrics and logs aggregation. If there is an error loading the stored procedure, the ingestor will fail to ingest both logs and metrics.

Metrics and Events Not Available

Error Network metrics are not available.
Container metrics are not available.
App events are not available.
Cause PCF Metrics is misconfigured and the frontend API does not receive metrics and events from MySQL.
Solution
  1. From the cf CLI, target the system org and metrics-v1-5 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-5
  2. Run the following command to check the app logs and investigate the error:
    $ cf logs metrics --recent

Logs and Histograms Not Available

Error Logs are not available.
Histograms are not available.
Cause PCF Metrics is misconfigured and the frontend API does not receive logs from PostgreSQL.
Solution
  1. From the cf CLI, target the system org and metrics-v1-5 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-5
  2. Run the following command to check the app logs and investigate the error:
    $ cf logs metrics --recent

No Logs or Metrics in the UI

In some cases, the PCF Metrics UI might not display metrics and logs after successfully deploying.

Follow the steps in this section to help locate the app or component causing the problem.

Step 1: Check your Load Balancer Configuration

If you use a load balancer, the event-stream mechanism used by the Metrics UI might be blocked. See the Error/Cause/Solution below to resolve this issue.

If you do not use a load balancer, or this issue does not apply to your deployment, proceed to the next step.

Error In one reported case, a customer using an F5 load balancer could not see metrics and logs in the UI despite successful ingestion and no UI errors.
Cause The F5 load balancer blocked traffic of type text/event-stream.
Solution Configure the load balancer to allow event-stream traffic.

Step 2: Check the PCF Metrics Apps

  1. From Ops Manager, click the PAS Tile.

    1. Click the Credentials tab.
    2. Under the UAA job, next to Admin Credentials, click Link to Credential.
    3. Record the username and password for use in the next step.
  2. Log in to the Cloud Foundry Command Line Interface (cf CLI) using the credentials from the previous step.

    $ cf login -a https://api.YOUR-SYSTEM-DOMAIN -u admin -p PASSWORD

  3. When prompted, select the system org and the metrics-v1-5 space.

  4. Ensure that the output displays the following apps, each in a started state:

    • metrics-ingestor
    • metrics-queue
    • logs-queue
    • metrics
    • metrics-ui
    • metrics-alerting
  5. Check the logs of each app for errors using the following command:

    $ cf logs APP-NAME --recent
    If you do not see any output, or if you did not find any errors, proceed to the next step.
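The per-app log check in step 5 can be scripted. This sketch only prints the commands to run; execute the output in an environment where the cf CLI is installed and logged in:

```shell
# Apps deployed by the PCF Metrics tile (from the list above).
apps="metrics-ingestor metrics-queue logs-queue metrics metrics-ui metrics-alerting"

# Print the log-check command for each app rather than invoking cf
# directly, so the sketch runs anywhere.
for app in $apps; do
  echo "cf logs ${app} --recent"
done
```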

Step 3: Check the Metrics Ingestor

  1. To get a higher level of detail from the metrics-ingestor application, set the LOG_LEVEL env variable:
    $ cf set-env metrics-ingestor LOG_LEVEL DEBUG
  2. To apply this setting, restage the application:
    $ cf restage metrics-ingestor
  3. Run the following command to stream logs:
    $ cf logs metrics-ingestor
  4. Watch the logs emitted by the metrics-ingestor app for errors. See the list below for common errors:

    • Aggregation Stored Procedures key is not in redis: Redis might have been restarted or is in a bad state. Stop the metrics-ingestor app, restart Redis, and then start the metrics-ingestor app again.
    • Cannot connect to the firehose: PCF Metrics creates a UAA user to authenticate the connection to the Firehose. This user must have the doppler.firehose authority.
    • Could not find service with name: metrics-forwarder: The Metrics Forwarder Tile is not installed. Metrics will not display custom metrics without the Metrics Forwarder Tile but will otherwise function normally.
    • WebSocket Disconnects: If you see WebSocket disconnects logs in the Ingestor app, consider adding additional Ingestor instances. The Firehose might be dropping the Ingestor connection to avoid back pressure.
    • Redis errors: Investigate the Redis logs; for instructions, see Advanced Troubleshooting with the BOSH CLI. Many possible solutions start with restarting Redis.
  5. If the app is started and you do not find any errors, proceed to the next step.

Step 4: Check the Logs Queue

  1. To get a higher level of detail from the logs-queue application, set the LOG_LEVEL env variable:
    $ cf set-env logs-queue LOG_LEVEL DEBUG
  2. To apply this setting, restage the application:
    $ cf restage logs-queue
  3. Run the following command to stream logs:
    $ cf logs logs-queue
  4. Watch the logs emitted by the logs-queue app for errors.

    • A common error is that the app cannot connect to PostgreSQL due to the application security group (ASG) being deleted. This ASG allows the logs-queue application to create a network connection to the PostgreSQL VM. You can run cf security-group metrics-api to see if the ASG exists. If the ASG is not present, see Creating Application Security Groups to recreate it.
    • Could not find service with name: metrics-forwarder: The Metrics Forwarder Tile is not installed. Metrics will not display custom metrics without the Metrics Forwarder Tile but will otherwise function normally.
  5. If the app is started and you do not find any errors, proceed to the next step.

Step 5: Check MySQL

  1. From Ops Manager, select the PCF Metrics tile.

  2. Under the Status tab, record the IP of a MySQL Server node.

  3. Use bosh ssh to access the VM from the previous step. For instructions, see Advanced Troubleshooting with the BOSH CLI.

  4. Log in to MySQL by running mysql -u USERNAME -pPASSWORD (note: no space between -p and the password)

    Note: If you do not know the username and password, you can run cf env metrics-queue with the system org and the metrics-v1-5 space targeted.

  5. Verify that the database was bootstrapped correctly:

    1. Run show databases and check for a metrics database.
    2. If there is no metrics database, the Push PCF Metrics Components errand of the BOSH release might not have run or succeeded. Ensure the errand is selected in the tile configuration and update the tile.
  6. Run use metrics to select the metrics database:

    mysql> use metrics;

  7. Run show tables and ensure you see the following tables:

    mysql> show tables;
    +-----------------------------+
    | Tables_in_metrics           |
    +-----------------------------+
    | app_event                   |
    | app_metric                  |
    | app_metric_rollup           |
    | schema_version              |
    | app_metric_identifier       |
    +-----------------------------+
    

  8. Enter the following query several times to verify that the value returned does not decrease over time:

    mysql> select count(*) from metrics.app_metric_identifier where timestamp > ((UNIX_TIMESTAMP() - 60) * POW(10, 3));
    This query counts the metrics ingested over the last minute.

    1. If the query returns 0 or a consistently decreasing value, the problem is likely further up in ingestion; proceed to the next step.
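The WHERE clause works because the timestamp column appears to hold millisecond values, given the POW(10, 3) factor: the cutoff is the Unix time 60 seconds ago, multiplied by 1000. The equivalent computation in shell:

```shell
# Millisecond cutoff for "60 seconds ago", matching
# (UNIX_TIMESTAMP() - 60) * POW(10, 3) in the query above.
cutoff_ms=$(( ( $(date +%s) - 60 ) * 1000 ))
echo "cutoff: ${cutoff_ms}"
```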

Step 6: Check the Metrics Queue

  1. To get a higher level of detail from the metrics-queue application, set the LOG_LEVEL env variable:
    $ cf set-env metrics-queue LOG_LEVEL DEBUG
  2. To apply this setting, restage the application:
    $ cf restage metrics-queue
  3. Run the following command to stream logs:
    $ cf logs metrics-queue
  4. Watch the logs emitted by the metrics-queue app for errors.

    • A common error is that the app cannot connect to MySQL due to the application security group (ASG) being deleted. This ASG allows the metrics-queue application to create a network connection to the MySQL VM. You can run cf security-group metrics-api to see if the ASG exists. If the ASG is not present, see Creating Application Security Groups to recreate it.
    • Could not find service with name: metrics-forwarder: The Metrics Forwarder Tile is not installed. Metrics will not display custom metrics without the Metrics Forwarder Tile but will otherwise function normally.
  5. If the app is started and you do not find any errors, proceed to the next step.

MySQL Failure

In some cases, a MySQL server might fail to restart. The following two sections describe the known conditions that cause this failure as well as steps for diagnosing and resolving them.

Cause 1: Monit Timed Out

Diagnose

Follow these steps to see if a monit time-out caused the MySQL node restart to fail:

  1. Use bosh ssh to access the failing node, using the IP address in the Ops Manager Director tile Status tab. For instructions, see Advanced Troubleshooting with the BOSH CLI.
  2. Run monit summary and check the status of the galera-init job.
  3. If the status of the galera-init job is Execution Failed, open the following file: /var/vcap/sys/log/pxc-mysql/galera-init.log.
    1. If the last line of the log indicates that MySQL started without issue, such as in the example below, monit likely timed out while waiting for the job to report healthy. Follow the steps below to resolve the issue.
      {"timestamp":"1536851105.372446537","source":"/var/vcap/packages/galera-init/bin/galera-init","message":"/var/vcap/packages/galera-init/bin/galera-init.galera-init started","log_level":1,"data":{}}
      

Resolve

Run the following commands to return the galera-init job to a healthy state:

  1. Run monit unmonitor galera-init.
  2. Run monit monitor galera-init.
  3. Run monit summary and confirm that the output lists galera-init as running.

Cause 2: Bin Logs Filled up the Disk

Diagnose

  1. Use bosh ssh to access the failing node. For instructions, see Advanced Troubleshooting with the BOSH CLI.
  2. Open the following log file: /var/vcap/sys/log/pxc-mysql/mysql.err.log.
  3. If you see log messages that indicate insufficient disk space, the persistent disk is likely storing too many bin logs. Confirm insufficient disk space by doing the following:
    1. Run df -h and check whether the /var/vcap/store folder is at or over 90% usage.
    2. Navigate to /var/vcap/store/pxc-mysql, run ls -al, and check for many files named with the format mysql-bin.########.

In MySQL for PCF, the server node does not make use of these logs and you can remove all except the most recent bin log. Follow the steps below to resolve the issue.
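The disk-usage check above can be scripted. A minimal sketch using df; on the MySQL VM you would point it at /var/vcap/store (here / is used as a portable stand-in path):

```shell
# Flag a mount point at or above the 90% threshold described above.
# On the MySQL VM the path is /var/vcap/store; / is a stand-in here.
mount_point=/
usage=$(df -P "$mount_point" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
if [ "$usage" -ge 90 ]; then
  echo "${mount_point} is at ${usage}%: bin logs may be filling the disk"
else
  echo "${mount_point} is at ${usage}%: below the 90% threshold"
fi
```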

Resolve

  1. Log in to MySQL by running mysql -u USERNAME -pPASSWORD (note: no space between -p and the password)

    Note: If you do not know the username and password, you can run cf env metrics-queue with the system org and the metrics-v1-5 space targeted.

  2. Run use metrics;.
  3. Run the following command:
    mysql> PURGE BINARY LOGS BEFORE 'YYYY-MM-DD HH:MM:SS'; 
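Rather than typing a literal date, you can generate the PURGE statement for a chosen retention window. A sketch assuming GNU date and a hypothetical one-day window:

```shell
# Build a PURGE BINARY LOGS statement that keeps only the last day of
# bin logs; adjust the window to your needs. Requires GNU date (-d).
cutoff=$(date -u -d '1 day ago' '+%Y-%m-%d %H:%M:%S')
echo "PURGE BINARY LOGS BEFORE '${cutoff}';"
```

Paste the printed statement into the mysql prompt; the most recent bin log is always retained.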

Edit Your MySQL Server Configuration

  1. From Ops Manager, click the PCF Metrics Tile.
  2. Navigate to the Resource Config section of the PCF Metrics Tile.
  3. Increase the Persistent Disk size of MySQL Server to at least twice the current size.

Service metrics-forwarder does not exist

Error Service metrics-forwarder does not exist.
Cause The Metrics Forwarder Tile is not installed.
Solution Install the Metrics Forwarder Tile if you want custom metrics; otherwise, you can ignore the error. The service is optional.

Metrics API Unavailable

Error The metrics URL shows Metrics API Unavailable.
Cause The URL uses http.
Solution Go to the https version of the metrics URL.