
Troubleshooting PCF Metrics

This topic describes how to resolve common issues experienced while operating or using Pivotal Cloud Foundry (PCF) Metrics.

Errors during Deployment

The following sections describe errors that can cause a PCF Metrics tile deployment to fail and how to troubleshoot them.

Smoke Test Errors

PCF Metrics runs a set of smoke tests during installation to confirm system health. If the smoke tests discover any errors, you can find a summary of those errors at the end of the errand log output, including detailed logs about where the failure occurred.

The following tables describe common failures and how to resolve them.

Insufficient Resources

Error Insufficient Resources
Cause Your PCF deployment has insufficient Diego resources to handle the apps pushed as part of a PCF Metrics installation.

The PCF Metrics tile deploys the following apps:
App Memory Disk
metrics-ingestor* 512MB 1GB
mysql-logqueue* 1GB 1GB
elasticsearch-logqueue* 512MB 1GB
metrics-aggregator 256MB 1GB
metrics 1GB 1GB
worker-app-dev 1GB 1GB
worker-app-logs 1GB 1GB
worker-health-check 1GB 1GB
worker-reaper 1GB 1GB
*You may have more than one instance each of the Ingestor and Logqueue apps, depending on your sizing needs. You configure these instance counts in the Data Store pane of the tile.

Solution Increase the number of Diego cells so that your PCF deployment can support the apps pushed as part of the PCF Metrics installation:

  1. Navigate to the Resource Config section of the Elastic Runtime tile.
  2. In the Diego Cell row, add another Instance.

Nginx Load Balancer

Error The Smoke tests for Metrics UI errand failed.
Or, the Smoke tests for Metrics UI checkbox is not selected and installation was successful, but the UI keeps loading and the graphs do not populate with data.
Cause The Nginx proxy_buffering property is on, which causes Nginx to block server-sent event (SSE) traffic.
Solution
  1. From the cf CLI, target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  2. Confirm that the Smoke tests for Metrics UI errand was not run during installation by listing recent logs from the worker-app-logs and worker-app-dev apps:
    $ cf logs --recent worker-app-logs
    $ cf logs --recent worker-app-dev
    If neither log contains the text jobStarted, then the jobs are not queued because Nginx is blocking SSEs.
  3. Turn off the Nginx proxy_buffering property, as shown in the sketch below.
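In a typical Nginx configuration this is a one-line change in the server or location block that proxies the Metrics UI. The block below is a hypothetical sketch; the exact file, location, and upstream name depend on your Nginx setup:

    # Hypothetical reverse-proxy block for the Metrics UI; your actual
    # file, location, and upstream name will differ.
    location / {
        proxy_pass http://YOUR-METRICS-UI-BACKEND;  # assumption: your existing upstream
        proxy_buffering off;  # let SSE responses stream instead of being buffered
    }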

Failed querying mysql

Error Failed querying mysql
Cause The tile deployed without the necessary errands selected to keep the internal database schema in sync with apps.
Solution Re-deploy the tile with the following errands selected:
  • Database migrations for PCF Metrics
  • Push PCF Metrics Data components
  • Push PCF Metrics UI component

Received no results back from mysql - failing

Error Received no results back from mysql - failing
Cause The Ingestor is not functioning properly.
Solution
  1. From the cf CLI, target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  2. Run cf apps to see if these apps are running:
    • metrics-ingestor
    • mysql-logqueue
  3. If the apps are not running, run the following commands to start them:
    $ cf start metrics-ingestor
    $ cf start mysql-logqueue
  4. Run the following commands and search the app logs for ERROR messages containing additional information:
    $ cf logs metrics-ingestor --recent
    $ cf logs mysql-logqueue --recent

    Note: In some cases, the apps cannot communicate due to TLS certificate verification failure. If your deployment uses self-signed certs, ensure the Disable SSL certificate verification for this environment box is checked in the Elastic Runtime Networking pane.
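To narrow the log output to errors, you can pipe the same commands through a filter; the grep pattern is only a convenience, not part of the tile:

    $ cf logs metrics-ingestor --recent | grep -i error
    $ cf logs mysql-logqueue --recent | grep -i error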

Failed to connect to mysql

Error Failed to connect to mysql
Cause MySQL is not running properly.
Solution
  1. Check the logs of the MySQL Server and MySQL Proxy jobs for errors.
    • You can download the logs from the PCF Metrics tile under the Status tab.
  2. From the cf CLI, target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  3. Run the following command and ensure the security group can access the MySQL jobs:

    Note: PCF Metrics creates a default security group to allow all traffic to its apps.

    $ cf security-group metrics-api
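If cf security-group reports that the group does not exist, you can confirm by listing all security groups in the deployment; the grep filter is only a convenience:

    $ cf security-groups | grep metrics-api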

Failed to start elasticsearch client

Error Failed to start elasticsearch client
Cause Elasticsearch is not running correctly.
Solution
  1. Check the logs of the Elasticsearch Master, Elasticsearch Coordinator, and Elasticsearch Data jobs for errors.
    • You can download the logs from the PCF Metrics tile under the Status tab.
  2. From the cf CLI, target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  3. Run the following command and ensure the security group can access the Elasticsearch jobs:

    Note: PCF Metrics creates a default security group to allow all traffic to its apps.

    $ cf security-group metrics-api

Never received app logs

Error Never received app logs - something in the firehose -> elasticsearch flow is broken
Cause Ingestor is not inserting logs correctly.
Solution
  1. From the cf CLI, target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  2. Run cf apps to see if these apps are running:
    • metrics-ingestor
    • elasticsearch-logqueue
  3. If the apps are not running, run the following commands to start them:
    $ cf start metrics-ingestor
    $ cf start elasticsearch-logqueue
  4. Run the following commands and search the app logs for ERROR messages containing additional information:
    $ cf logs metrics-ingestor --recent
    $ cf logs elasticsearch-logqueue --recent

    Note: In some cases, you may discover a failure to communicate with Loggregator in the form of a bad handshake error.

    Ensure the Loggregator Port setting in the Elastic Runtime tile Networking pane is set to the correct value. For AWS, it is 4443. For all other IaaSes, it is 443.

Metrics and Events not available

Error Network metrics are not available.
Container metrics are not available.
App events are not available.
Cause PCF Metrics is misconfigured and the frontend API does not receive logs from MySQL.
Solution
  1. From the cf CLI, target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  2. Run the following command to check the app logs and investigate the error:
    $ cf logs metrics --recent

Logs and Histograms not available

Error Logs are not available.
Histograms are not available.
Cause PCF Metrics is misconfigured and the frontend API does not receive logs from Elasticsearch.
Solution
  1. From the cf CLI, target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  2. Run the following command to check the app logs and investigate the error:
    $ cf logs metrics --recent

Elasticsearch Instance does not Start

Error The Deployment fails because an Elasticsearch instance does not start.
Cause The instance may not start because its configured heap size is greater than that of the VM that hosts it.
Solution
  1. From the PCF Metrics tile in Ops Manager, select the Data Store settings pane.
  2. Record the value in the Elastic Search Heap Size field.
  3. Select the Resource Config pane and ensure the following jobs have RAM greater than or equal to the Elastic Search Heap Size (a quick verification sketch follows this procedure):
    • Elasticsearch Master
    • Elasticsearch Coordinator
    • Elasticsearch Data
  4. If any of the jobs do not have enough memory, do one of the following:
    • Give the job more RAM
    • Lower the Elastic Search Heap Size
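As a quick sanity check on a running node, you can compare the JVM heap ceiling with the VM RAM using the Elasticsearch _cat API; run this from one of the Elasticsearch VMs:

    $ curl 'localhost:9200/_cat/nodes?v&h=name,heap.max,ram.max'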

No Logs or Metrics in the UI

In some cases, the PCF Metrics UI may not display metrics and logs after successfully deploying.

Follow the steps in this section to help locate the app or component causing the problem.

Step 1: Check your Load Balancer Configuration

If you use a load balancer, the event-stream mechanism used by the Metrics UI might be blocked. Refer to the table below to resolve this error.

If you do not use a load balancer, or this issue does not apply to your deployment, proceed to Step 2: Check the PCF Metrics Apps.

Error Metrics and logs are not visible in the UI, despite successful ingestion and no UI errors reported. This has been observed with an F5 load balancer.
Cause The F5 load balancer blocked traffic of type text/event-stream.
Solution Configure the F5 load balancer to allow event-stream traffic.

Step 2: Check the PCF Metrics Apps

  1. From Ops Manager, click the Elastic Runtime Tile.

    1. Click the Credentials tab.
    2. Under the UAA job, next to Admin Credentials, click Link to Credential.
    3. Record the username and password for use in the next step.
  2. Log in to the Cloud Foundry Command Line Interface (cf CLI) using the credentials from the previous step.

    $ cf login -a https://api.YOUR-SYSTEM-DOMAIN -u admin -p PASSWORD

  3. When prompted, select the system org and the metrics-v1-3 space.

  4. Run cf apps and ensure that the output displays the following apps, each in a started state:

    • metrics-ingestor
    • mysql-logqueue
    • elasticsearch-logqueue
    • metrics-aggregator
    • metrics
    • worker-app-dev
    • worker-app-logs
    • worker-health-check
    • worker-reaper
  5. Check the logs of each app for errors using the following command:

    $ cf logs APP-NAME --recent
    If you do not see any output, or if you did not find any errors, proceed to Step 3: Check the Elasticsearch Cluster.
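To check all nine apps in one pass, a short shell loop works; the app names are taken from the list above, and the grep filter is only a convenience:

    $ for app in metrics-ingestor mysql-logqueue elasticsearch-logqueue \
        metrics-aggregator metrics worker-app-dev worker-app-logs \
        worker-health-check worker-reaper; do
        echo "=== $app ==="          # label each app's output
        cf logs "$app" --recent | grep -i error
      done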

Step 3: Check the Elasticsearch Cluster

  1. From Ops Manager, select the PCF Metrics tile.

  2. Under the Status tab, record the IP of an Elasticsearch Master node.

  3. Use bosh ssh to access the VM from the previous step. See the Advanced Troubleshooting with the BOSH CLI topic for instructions.

  4. Run the following command to list all the Elasticsearch indices:

    $ curl localhost:9200/_cat/indices?v | sort
    
    green open app_logs_1477512000 8 1 125459066 0 59.6gb 29.8gb
    green open app_logs_1477526400 8 1 129356671 0 59.1gb 29.5gb
    green open app_logs_1478174400 8 1 129747170 0 61.9gb 30.9gb
    . . .
    green open app_logs_1478707200 8 1 128392686 0 63.2gb 31.6gb
    green open app_logs_1478721600 8 1 102005754 0 53.5gb 26.5gb
    health status index pri rep docs.count docs.deleted store.size pri.store.size

    (Because of the sort, the header row appears after the data rows.)

    1. If the curl command does not return a success response, Elasticsearch may not be running correctly. Inspect the following logs for any failures or errors:
      • /var/vcap/sys/log/elasticsearch/elasticsearch.stdout.log
      • /var/vcap/sys/log/elasticsearch/elasticsearch.stderr.log
  5. Examine the status column of the output.

    1. If the status of any of the indices is not green, restart the Logqueue app:
      $ cf restart elasticsearch-logqueue
    2. Run the curl command periodically to see if the indices recover to a green status.
  6. Run the curl command several more times and examine the most recent index to see if the number of stored documents periodically increases.

    Note: The last row of the output corresponds to the most recent index. The sixth column displays the number of documents for the index.

    1. If all indices show a green status but the number of documents does not increase, the problem is likely further up in the ingestion path. Proceed to Step 4: Check the Elasticsearch Logqueue.
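One way to watch for growth is to sample the most recent index every 30 seconds and compare docs.count between samples; the interval is arbitrary, and the header is omitted so that sort and tail see only data rows:

    $ while true; do
        curl -s 'localhost:9200/_cat/indices' | sort | tail -n 1
        sleep 30
      done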

Step 4: Check the Elasticsearch Logqueue

  1. Run cf apps to see if the elasticsearch-logqueue app instances are started.

  2. If any instance of the app is stopped, run the following command to increase logging:

    $ cf set-env elasticsearch-logqueue LOG_LEVEL DEBUG

    1. Run the following command to stream logs:
      $ cf logs elasticsearch-logqueue
    2. In a different terminal window, run the following command:
      $ cf restage elasticsearch-logqueue
    3. Watch the logs emitted by the elasticsearch-logqueue app for errors.
      • A common error is that the app cannot connect to Elasticsearch because a user deleted the application security group (ASG) that PCF Metrics creates to allow the Logqueue app to connect to the Elasticsearch VMs. You can run cf security-group metrics-api to see if the ASG exists. If not, see the documentation on Creating Application Security Groups.
  3. If the app is started and you do not find any errors, proceed to Step 5: Check the Metrics Ingestor.

Step 5: Check the Metrics Ingestor

  1. Run cf apps to see if the metrics-ingestor app instances are started.
  2. If any of the app instances are stopped, run the following command to increase logging:

    $ cf set-env metrics-ingestor LOG_LEVEL DEBUG

    1. Run the following command to stream logs:
      $ cf logs metrics-ingestor
    2. In a different terminal window, run the following command:
      $ cf restage metrics-ingestor
    3. Watch the logs emitted by the metrics-ingestor app for errors. Refer to the list below for common errors:
      • Cannot connect to the firehose: PCF Metrics creates a UAA user to authenticate the connection to the firehose. This user must have the doppler.firehose authority.
      • Cannot connect to the logqueues: There may be a problem with the UAA, or it could be throttling traffic.
      • WebSocket Disconnects: If you see WebSocket disconnect messages in the Ingestor app logs, consider adding additional Ingestor instances (see the cf scale sketch after this list). The Firehose may be dropping the Ingestor connection to avoid back pressure.
  3. If the app is started and you do not find any errors, proceed to Step 6: Check MySQL.
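If you need to add Ingestor instances to address WebSocket disconnects, cf scale does this without a restage; scaling to two instances here is an arbitrary example:

    $ cf scale metrics-ingestor -i 2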

Step 6: Check MySQL

  1. From Ops Manager, select the PCF Metrics tile.

  2. Under the Status tab, record the IP of a MySQL Server node.

  3. Use bosh ssh to access the VM from the previous step. See the Advanced Troubleshooting with the BOSH CLI topic for instructions.

  4. Log in to MySQL by running mysql -u USERNAME -pPASSWORD (note that there is no space after -p), or run mysql -u USERNAME -p and enter the password at the prompt.

    Note: If you do not know the username and password, you can run cf env mysql-logqueue with the system org and the metrics-v1-3 space targeted.

  5. Verify that the database was bootstrapped correctly:

    1. Run show databases and check for a metrics database.
      If there is no metrics database, the migrate_db errand of the BOSH release may not have run or succeeded. Ensure the errand is selected in the tile configuration and update the tile.
  6. Run show tables and ensure you see the PCF Metrics tables listed under the Tables_in_metrics header of the output.

  7. Enter the following query several times to verify that the value returned does not decrease over time:

    mysql> select count(*) from metrics.app_metric_rollup where timestamp > ((UNIX_TIMESTAMP() - 60) * POW(10, 3));
    This query counts the metric rows written during the last minute, which approximates the rate at which metrics flow in. Timestamps are stored in milliseconds, which is why the cutoff multiplies by POW(10, 3).

    1. If the command returns 0 or a consistently decreasing value, the problem is likely further up in ingestion. Proceed to Step 7: Check the MySQL Logqueue.
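To sample the count repeatedly without retyping the query, a short loop works; USERNAME and PASSWORD are the credentials from the login step above, and the 30-second interval is arbitrary:

    $ while true; do
        mysql -u USERNAME -pPASSWORD -e "select count(*) \
          from metrics.app_metric_rollup \
          where timestamp > ((UNIX_TIMESTAMP() - 60) * POW(10, 3));"
        sleep 30
      done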

Step 7: Check the MySQL Logqueue

  1. Run cf apps to see if the mysql-logqueue app instances are started.

  2. If any instance of the app is stopped, run the following command to increase logging:

    $ cf set-env mysql-logqueue LOG_LEVEL DEBUG

    1. Run the following command to stream logs:
      $ cf logs mysql-logqueue
    2. In a different terminal window, run the following command:
      $ cf restage mysql-logqueue
    3. Watch the logs emitted by the mysql-logqueue app for errors.
      • A common error is that the app cannot connect to MySQL because a user deleted the application security group (ASG) that PCF Metrics creates to allow the Logqueue app to connect to the MySQL VMs. You can run cf security-group metrics-api to see if the ASG exists. If not, see the documentation on Creating Application Security Groups.
  3. If the app is started and you do not find any errors, proceed to Step 8: Check the Metrics Aggregator.

Step 8: Check the Metrics Aggregator

  1. Run cf apps to see if the metrics-aggregator app instances are started.

  2. If any instance of the app is stopped, run the following command to increase logging:

    $ cf set-env metrics-aggregator LOG_LEVEL DEBUG

    1. Run the following command to stream logs:
      $ cf logs metrics-aggregator
    2. In a different terminal window, run the following command:
      $ cf restage metrics-aggregator
    3. Watch the logs emitted by the metrics-aggregator app for errors.
      • A common error is that the app cannot connect to MySQL because a user deleted the application security group (ASG) that PCF Metrics creates to allow the aggregator app to connect to the MySQL VMs. You can run cf security-group metrics-api to see if the ASG exists. If not, see the documentation on Creating Application Security Groups.

MySQL Node Failure

In some cases, a MySQL server node may fail to restart. The following two sections describe the known conditions that cause this failure, as well as steps for diagnosing and resolving them. If neither of the listed causes applies, the final section provides instructions for re-deploying BOSH as a last resort to resolve the issue.

Cause 1: Monit Timed Out

Diagnose

Follow these steps to see if a monit time-out caused the MySQL node restart to fail:

  1. Use bosh ssh to access the failing node, using the IP address in the Ops Manager Director tile Status tab. See the Advanced Troubleshooting with the BOSH CLI topic for instructions.
  2. Run monit summary and check the status of the mariadb_ctrl job.
  3. If the status of the mariadb_ctrl job is Execution Failed, open the following file: /var/vcap/sys/log/mysql/mariadb_ctrl.combined.log.
    1. If the last line of the log indicates that MySQL started without issue, such as in the example below, monit likely timed out while waiting for the job to report healthy. Follow the steps below to resolve the issue.
      {"timestamp":"1481149250.288255692","source":"/var/vcap/packages/
      mariadb_ctrl/bin/mariadb_ctrl","message":"/var/vcap/packages/
      mariadb_ctrl/bin/mariadb_ctrl.mariadb_ctrl
      started","log_level":1,"data":{}}

Resolve

Run the following commands to return the mariadb_ctrl job to a healthy state:

  1. Run monit unmonitor mariadb_ctrl.
  2. Run monit monitor mariadb_ctrl.
  3. Run monit summary and confirm that the output lists mariadb_ctrl as running.

Cause 2: Bin Logs Filled up the Disk

Diagnose

  1. Use bosh ssh to access the failing node. See the Advanced Troubleshooting with the BOSH CLI topic for instructions.
  2. Open the following log file: /var/vcap/sys/log/mysql/mysql.err.log.
  3. If you see log messages that indicate insufficient disk space, the persistent disk is likely storing too many bin logs. Confirm insufficient disk space by doing the following:
    1. Run df -h.
      1. Confirm that the /var/vcap/store folder is at or over 90% usage.
    2. Navigate to /var/vcap/store/mysql and run ls -al.
      1. Ensure that you see many files named with the format mysql-bin.########.

In MySQL for PCF, the server node does not make use of these logs and you can remove all except the most recent bin log. Follow the steps below to resolve the issue.

Resolve

  1. Log in to MySQL by running mysql -u USERNAME -pPASSWORD (no space after -p), or run mysql -u USERNAME -p and enter the password at the prompt.

    Note: If you do not know the username and password, you can run cf env mysql-logqueue with the system org and the metrics-v1-3 space targeted.

  2. Run use metrics;.
  3. Run the following command:
    mysql> PURGE BINARY LOGS BEFORE 'YYYY-MM-DD HH:MM:SS'; 
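For example, to keep roughly the last three days of bin logs you can purge by relative time; the three-day interval is an arbitrary illustration, and MySQL never removes the bin log currently in use:

    mysql> PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;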

Re-deploy BOSH to Restart the Node

If troubleshooting based on the causes mentioned above did not resolve the issue with your failing MySQL node, you can follow the steps below to recover it. Pivotal recommends using this procedure only as a last resort, when no other potential solutions are available.

WARNING: This procedure is extremely costly in terms of time and network resources. The cluster takes a significant amount of time to replicate data from the rest of the cluster back into the rebuilt node, and the procedure consumes considerable network bandwidth because potentially hundreds of gigabytes of data may need to transfer.

Stop the Ingestor App

  1. From Ops Manager, click the Elastic Runtime Tile.
    1. Click the Credentials tab.
    2. Under the UAA job, next to Admin Credentials, click Link to Credential.
    3. Record the username and password for use in the next step.
  2. Log in to the cf CLI using the credentials from the previous step.
    $ cf login -a https://api.YOUR-SYSTEM-DOMAIN -u admin -p PASSWORD
  3. Target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  4. Stop data flow into the Galera cluster:
    $ cf stop metrics-ingestor

Edit Your Deployment Manifest

  1. Follow the steps in the Log in to BOSH section of the Advanced Troubleshooting with the BOSH CLI topic to target and log in to your BOSH Director. The steps vary slightly depending on whether your PCF deployment uses internal authentication or an external user store.
  2. Download the manifest of your PCF deployment:
    $ bosh download manifest YOUR-PCF-DEPLOYMENT YOUR-PCF-MANIFEST.yml
    

    Note: You must know the name of your PCF deployment to download the manifest. To retrieve it, run bosh deployments to list your deployments and locate the name of your PCF deployment.

  3. Open the manifest and set the number of instances of the failed server node to 0.
  4. Run bosh deployment YOUR-PCF-MANIFEST.yml to specify your edited manifest.
  5. Run bosh deploy to deploy with your manifest.
  6. Run bosh disks --orphaned to see the persistent disk or disks associated with the failed node.
    1. Record the CID of each persistent disk.
    2. Contact Pivotal Support to walk through re-attaching the orphaned disks to new VMs to preserve their data.
  7. Open the manifest and set the number of instances of the failed server node to 1.
  8. Run bosh deploy to deploy with your edited manifest.
  9. Wait for BOSH to rebuild the node.

MySQL SST Disabled Error

If you see the message below on a failing node in /var/vcap/sys/log/mysql/mysql.err.log, you can resolve the error by following the instructions in the Interruptor Logs section of the MySQL for PCF documentation.

WSREP_SST: [ERROR] ############################################################################## (20160610 04:33:21.338)
WSREP_SST: [ERROR] SST disabled due to danger of data loss. Verify data and bootstrap the cluster (20160610 04:33:21.340)
WSREP_SST: [ERROR] ############################################################################## (20160610 04:33:21.341)

Log Errors

Error The PCF Metrics UI does not show any new logs from Elasticsearch.
Cause The tile deployed with the Push PCF Metrics Data Components errand deselected.
Solution Restart the Elasticsearch Logqueue using the cf CLI as follows:
  1. Target the system org and metrics-v1-3 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-3
  2. Run the following command to restart the Logqueue application:
    $ cf restart elasticsearch-logqueue

Note: To avoid having to apply this fix in the future, select the checkbox to enable the Push PCF Metrics Data Components errand before your next tile update.

503 Errors

Error You encounter 503 errors when accessing the PCF Metrics UI in your browser.
Cause Your Elasticsearch nodes may have become unresponsive.
Solution Check the Elasticsearch index health by following the procedure below, and consider adding additional Elasticsearch nodes.

  1. Retrieve the IP address of your Elasticsearch master node by navigating to the Metrics tile in the Ops Manager Installation Dashboard, clicking the Status tab, and recording the IP address next to ElasticSearchMaster.
  2. SSH into the Ops Manager VM by following the instructions in SSH into Ops Manager.
  3. From the Ops Manager VM, use curl to target the IP address of your Elasticsearch master node. Follow the instructions in the Cluster Health topic of the Elasticsearch documentation.
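As a minimal check, the standard cluster health endpoint reports the overall status (green, yellow, or red); replace ES-MASTER-IP with the address recorded above:

    $ curl -s 'http://ES-MASTER-IP:9200/_cluster/health?pretty'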