PCF Metrics v1.4

Troubleshooting PCF Metrics

This topic describes how to resolve common issues experienced while operating or using Pivotal Cloud Foundry (PCF) Metrics.

Errors during Deployment

The following sections describe errors that cause a PCF Metrics tile installation to fail and how to troubleshoot them.

Smoke Test Errors

PCF Metrics runs a set of smoke tests during installation to confirm system health. If the smoke tests discover any errors, you can find a summary of those errors at the end of the errand log output, including detailed logs about where the failure occurred.

The following tables describe common failures and how to resolve them.

Insufficient Resources

Error Insufficient Resources
Cause Your PCF deployment has insufficient Diego resources to handle the apps pushed as part of a PCF Metrics installation.

The PCF Metrics tile deploys the following apps:
App                       Memory   Disk
metrics-ingestor*         256 MB   1 GB
mysql-logqueue*           512 MB   1 GB
elasticsearch-logqueue*   256 MB   1 GB
metrics                   1 GB     2 GB
metrics-ui                64 MB    1 GB
*You may have more than one instance each of the Ingestor and Logqueue apps, depending on your sizing needs. You configure these instance counts in the Data Store pane of the tile.

Solution Increase the number of Diego cells so that your PCF deployment can support the apps pushed as part of the PCF Metrics installation:

  1. Navigate to the Resource Config section of the Elastic Runtime tile.
  2. In the Diego Cell row, add another Instance.
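
Before adding cells, you can confirm that Diego capacity is actually the constraint. The following is a minimal check using the BOSH CLI, assuming a v1 CLI targeted at your Elastic Runtime deployment; it is a convenience sketch rather than part of the official procedure:

    # Show per-VM vitals, including memory and disk usage. Compare the free
    # capacity on the Diego cell rows against the app totals in the table above.
    $ bosh vms --vitals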

Failed Querying MySQL

Error Failed querying mysql
Cause The tile deployed without the necessary errands selected to keep the internal database schema in sync with apps.
Solution Re-deploy the tile with the following errands selected:
  • Migrate Old Data to 1.4 Errand
  • Push PCF Metrics Components Errand

Received No Results Back from MySQL - Failing

Error Received no results back from mysql - failing
Cause The Ingestor is not functioning properly.
Solution
  1. From the cf CLI, target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  2. Run cf apps to see if these apps are running:
    • metrics-ingestor
    • mysql-logqueue
  3. If the apps are not running, run the following commands to start them:
    $ cf start metrics-ingestor
    $ cf start mysql-logqueue
  4. Run the following commands and search the app logs for ERROR messages containing additional information:
    $ cf logs metrics-ingestor --recent
    $ cf logs mysql-logqueue --recent

    Note: In some cases, the apps cannot communicate due to TLS certificate verification failure. If your deployment uses self-signed certs, ensure the Disable SSL certificate verification for this environment box is selected in the Elastic Runtime Networking pane.
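
If the recent logs are long, piping them through grep is a quick way to surface only the error lines (a convenience sketch, not part of the official procedure):

    $ cf logs metrics-ingestor --recent | grep -i error
    $ cf logs mysql-logqueue --recent | grep -i error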

Failed to Connect to MySQL

Error Failed to connect to mysql
Cause MySQL is not running properly.
Solution
  1. Check the logs of the MySQL Server and MySQL Proxy jobs for errors.
    • You can download the logs from the PCF Metrics tile under the Status tab.
  2. From the cf CLI, target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  3. Run the following command and ensure the security group can access the MySQL jobs:
    $ cf security-group metrics-api

    Note: PCF Metrics creates a default security group to allow all traffic to its apps.

Failed to Start Elasticsearch Client

Error Failed to start elasticsearch client
Cause Elasticsearch is not running correctly.
Solution
  1. Check the logs of the Elasticsearch Master, Elasticsearch Coordinator, and Elasticsearch Data jobs for errors.
    • You can download the logs from the PCF Metrics tile under the Status tab.
  2. From the cf CLI, target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  3. Run the following command and ensure the security group can access the Elasticsearch jobs:
    $ cf security-group metrics-api

    Note: PCF Metrics creates a default security group to allow all traffic to its apps.

Never Received App Logs

Error Never received app logs - something in the firehose -> elasticsearch flow is broken
Cause Ingestor is not inserting logs correctly.
Solution
  1. From the cf CLI, target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  2. Run cf apps to see if these apps are running:
    • metrics-ingestor
    • elasticsearch-logqueue
  3. If the apps are not running, run the following commands to start them:
    $ cf start metrics-ingestor
    $ cf start elasticsearch-logqueue
  4. Run the following commands and search the app logs for ERROR messages containing additional information:
    $ cf logs metrics-ingestor --recent
    $ cf logs elasticsearch-logqueue --recent

    Note: In some cases, you might discover a failure to communicate with Loggregator in the form of a bad handshake error.

    Ensure the Loggregator Port setting in the Elastic Runtime tile Networking pane is set to the correct value. For AWS, it is 4443. For all other IaaSes, it is 443.
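
To confirm which Loggregator (Doppler) endpoint and port the platform advertises, you can query the Cloud Controller info endpoint. This is a convenience check, not part of the official procedure:

    # The doppler_logging_endpoint field shows the WebSocket URL and port the
    # Ingestor connects to, for example wss://doppler.YOUR-SYSTEM-DOMAIN:443.
    $ cf curl /v2/info | grep doppler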

Metrics and Events Not Available

Error Network metrics are not available.
Container metrics are not available.
App events are not available.
Cause PCF Metrics is misconfigured and the frontend API does not receive logs from MySQL.
Solution
  1. From the cf CLI, target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  2. Run the following command to check the app logs and investigate the error:
    $ cf logs metrics --recent

Logs and Histograms Not Available

Error Logs are not available.
Histograms are not available.
Cause PCF Metrics is misconfigured and the frontend API does not receive logs from Elasticsearch.
Solution
  1. From the cf CLI, target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  2. Run the following command to check the app logs and investigate the error:
    $ cf logs metrics --recent

No Logs or Metrics in the UI

In some cases, the PCF Metrics UI might not display metrics and logs after successfully deploying.

Follow the steps in this section to help locate the app or component causing the problem.

Step 1: Check Your Load Balancer Configuration

If you use a load balancer, the event-stream mechanism used by the Metrics UI might be blocked. See the table below to resolve this error.

If you do not use a load balancer, or this issue does not apply to your deployment, proceed to Step 2: Check the PCF Metrics Apps.

Error For one customer using an F5 load balancer, metrics and logs were not visible in the UI despite successful ingestion and no errors reported in the UI.
Cause The F5 load balancer blocked traffic of type text/event-stream.
Solution Configuring the F5 load balancer to allow event-stream traffic resolved the issue.

Step 2: Check the PCF Metrics Apps

  1. From Ops Manager, click the Elastic Runtime Tile.

    1. Click the Credentials tab.
    2. Under the UAA job, next to Admin Credentials, click Link to Credential.
    3. Record the username and password for use in the next step.
  2. Log in to the Cloud Foundry Command Line Interface (cf CLI) using the credentials from the previous step.

    $ cf login -a https://api.YOUR-SYSTEM-DOMAIN -u admin -p PASSWORD

  3. When prompted, select the system org and the metrics-v1-4 space.

  4. Run cf apps and ensure that the output displays the following apps, each in a started state:

    • metrics-ingestor
    • mysql-logqueue
    • elasticsearch-logqueue
    • metrics-aggregator
    • metrics
    • metrics-ui
  5. Check the logs of each app for errors using the following command:

    $ cf logs APP-NAME --recent
    If you do not see any output, or if you did not find any errors, proceed to Step 3: Check the Elasticsearch Cluster.

Step 3: Check the Elasticsearch Cluster

  1. From Ops Manager, select the PCF Metrics tile.

  2. Under the Status tab, record the IP of an Elasticsearch Master node.

  3. Use bosh ssh to access the VM from the previous step. For instructions, see Advanced Troubleshooting with the BOSH CLI.

  4. Run the following command to list all the Elasticsearch indices:

    $ curl ELASTICSEARCH-HOST-IP:9200/_cat/indices?v | sort
    
    health status index               pri rep docs.count docs.deleted store.size pri.store.size
    green  open   app_logs_1477512000   8   1  125459066            0     59.6gb         29.8gb
    green  open   app_logs_1477526400   8   1  129356671            0     59.1gb         29.5gb
    green  open   app_logs_1478174400   8   1  129747170            0     61.9gb         30.9gb
    . . .
    green  open   app_logs_1478707200   8   1  128392686            0     63.2gb         31.6gb
    green  open   app_logs_1478721600   8   1  102005754            0     53.5gb         26.5gb

    1. If the curl command does not return a success response, Elasticsearch might not be running correctly. Inspect the following logs for any failures or errors:
      • /var/vcap/sys/log/elasticsearch/elasticsearch.stdout.log
      • /var/vcap/sys/log/elasticsearch/elasticsearch.stderr.log
  5. Examine the status column of the output.

    1. If any of the indices are red, delete them using the following command:
      curl -X DELETE ELASTICSEARCH-HOST-IP:9200/INDEX
    2. Restart the elasticsearch-logqueue app:
      $ cf restart elasticsearch-logqueue
    3. From each of the Elasticsearch VMs, run the following command:
      $ monit restart all
    4. Check periodically to verify the indices gradually recover to a green status (a monitoring sketch appears after this step list).
  6. Run the curl command several more times and examine the most recent index to see if the number of stored documents periodically increases.

    Note: The last row of the output corresponds to the most recent index. The sixth column displays the number of documents for the index.

    1. If all indices show a green status, but the number of documents does not increase, there is likely a problem further up in ingestion. Proceed to Step 4: Check the Elasticsearch Logqueue.
  7. Check whether cluster-level shard allocation is enabled:

    $ curl localhost:9200/_cluster/settings

    Examine the value of cluster.routing.allocation.enable in the output:

    • "all" means shard allocation is enabled.
    • "none" means shard allocation is disabled.
  8. If shard allocation is disabled, re-enable it by running the following command:

    $ curl -XPUT localhost:9200/_cluster/settings -d '{
      "transient" : {
        "cluster.routing.allocation.enable" : "all"
      }
    }'

  9. Check whether a proxy is present in front of Elasticsearch and whether HTTP traffic is enabled:

    1. Use cf ssh to SSH into any app in the metrics-v1-4 space:
      $ cf ssh APP-NAME
    2. Run the curl command with the IP address of the Elasticsearch Master node:
      $ curl ELASTICSEARCH-MASTER-IP-ADDRESS
    3. If the curl command fails, talk to your system administrator about removing the proxy or green-listing the PCF Metrics apps.
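
The following is a convenience sketch for watching index recovery and document growth from the Elasticsearch VM. It assumes the watch utility is available on the VM; adjust the interval as needed:

    # Re-run the index listing every 30 seconds. The status column should move
    # from red or yellow to green, and docs.count for the most recent index
    # should keep growing while ingestion is healthy.
    $ watch -n 30 "curl -s ELASTICSEARCH-HOST-IP:9200/_cat/indices?v | sort | tail -n 5"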

Step 4: Check the Elasticsearch Logqueue

  1. Run cf apps to see if the elasticsearch-logqueue app instances are started.

  2. If any instance of the app is stopped, run the following command to increase logging:

    $ cf set-env elasticsearch-logqueue LOG_LEVEL DEBUG

    1. Run the following command to stream logs:
      $ cf logs elasticsearch-logqueue
    2. In a different terminal window, run the following command:
      $ cf restage elasticsearch-logqueue
    3. Watch the logs emitted by the elasticsearch-logqueue app for errors.
      • A common error is that the app cannot connect to Elasticsearch because a user deleted the application security group (ASG) that PCF Metrics creates to allow the Logqueue app to connect to the Elasticsearch VMs. You can run cf security-group metrics-api to see if the ASG exists. If not, see Creating Application Security Groups; a sketch for recreating and binding the ASG appears after this list.
  3. If the app is started and you do not find any errors, proceed to Step 5: Check the Metrics Ingestor.
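
If the ASG was deleted, the following sketch shows one way to recreate and bind a replacement with the cf CLI. The file name and rule are hypothetical; adjust the destination CIDR and ports to match your MySQL and Elasticsearch node addresses before applying:

    # metrics-api-rules.json (hypothetical file and rule):
    [
      { "protocol": "tcp", "destination": "10.0.16.0/24", "ports": "9200,3306" }
    ]

    $ cf create-security-group metrics-api metrics-api-rules.json
    $ cf bind-security-group metrics-api system metrics-v1-4
    $ cf restart elasticsearch-logqueue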

Step 5: Check the Metrics Ingestor

  1. Run cf apps to see if the metrics-ingestor app instances are started.
  2. If any of the app instances are stopped, run the following command to increase logging:

    $ cf set-env metrics-ingestor LOG_LEVEL DEBUG

    1. Run the following command to stream logs:
      $ cf logs metrics-ingestor
    2. In a different terminal window, run the following command:
      $ cf restage metrics-ingestor
    3. Watch the logs emitted by the metrics-ingestor app for errors. See the list below for common errors:
      • Cannot connect to the firehose: PCF Metrics creates a UAA user to authenticate the connection to the Firehose. This user must have the doppler.firehose authority (one way to check this appears after this list).
      • Cannot connect to the logqueues: There might be a problem with the UAA, or it could be throttling traffic.
      • WebSocket Disconnects: If you see WebSocket disconnect logs in the Ingestor app, consider adding additional Ingestor instances. The Firehose might be dropping the Ingestor connection to avoid back pressure.
  3. If the app is started and you do not find any errors, proceed to Step 6: Check MySQL.
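
One way to verify the doppler.firehose authority is with the UAA CLI (uaac) from a machine that can reach your UAA. This sketch assumes admin client credentials are available and that the Firehose identity used by PCF Metrics is a UAA client; if it is a user, inspect its groups with uaac user get instead. METRICS-FIREHOSE-CLIENT is a placeholder:

    $ uaac target https://uaa.YOUR-SYSTEM-DOMAIN
    $ uaac token client get admin -s ADMIN-CLIENT-SECRET
    $ uaac client get METRICS-FIREHOSE-CLIENT
    # Confirm that doppler.firehose appears in the authorities list.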

Step 6: Check MySQL

  1. From Ops Manager, select the PCF Metrics tile.

  2. Under the Status tab, record the IP of a MySQL Server node.

  3. Use bosh ssh to access the VM from the previous step. For instructions, see Advanced Troubleshooting with the BOSH CLI.

  4. Log in to mysql by running mysql -u USERNAME -pPASSWORD (there is no space between -p and the password).

    Note: If you do not know the username and password, you can run cf env mysql-logqueue with the system org and the metrics-v1-4 space targeted.

  5. Verify that the database was bootstrapped correctly:

    1. Run show databases and check for a metrics database.
      1. If there is no metrics database, the migrate_db errand of the BOSH release might not have run or succeeded. Ensure the errand is selected in the tile configuration and update the tile. A check of the migration history appears after this list.
  6. Run use metrics to select the metrics database:

    mysql> use metrics;

  7. Run show tables and ensure you see the following tables:

    mysql> show tables;
    +-------------------+
    | Tables_in_metrics |
    +-------------------+
    | app_event         |
    | app_metric        |
    | app_metric_rollup |
    | schema_version    |
    +-------------------+
    

  8. Enter the following query several times to verify that the value returned does not decrease over time:

    mysql> select count(*) from metrics.app_metric_rollup where timestamp > ((UNIX_TIMESTAMP() - 60) * POW(10, 3));
    This query counts the metric rollups written during the last minute, which shows whether metrics are flowing in.

    1. If the command returns 0 or a consistently decreasing value, the problem is likely further up in ingestion. Proceed to Step 7: Check the MySQL Logqueue.
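
If, during the database bootstrap check above, you suspect the migration errand did not run to completion, the schema_version table records the migration history applied to the metrics database. A minimal check (the exact columns depend on the migration tooling shipped with the release):

    mysql> select * from metrics.schema_version;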

Step 7: Check the MySQL Logqueue

  1. Run cf apps to see if the mysql-logqueue app instances are started.

  2. If any instance of the app is stopped, run the following command to increase logging:

    $ cf set-env mysql-logqueue LOG_LEVEL DEBUG

    1. Run the following command to stream logs:
      $ cf logs mysql-logqueue
    2. In a different terminal window, run the following command:
      $ cf restage mysql-logqueue
    3. Watch the logs emitted by the mysql-logqueue app for errors.
      • A common error is that the app cannot connect to MySQL because a user deleted the application security group (ASG) that PCF Metrics creates to allow the Logqueue app to connect to the MySQL VMs. You can run cf security-group metrics-api to see if the ASG exists. If not, see Creating Application Security Groups.
  3. If the app is started and you do not find any errors, proceed to Step 8: Check the Metrics Aggregator.

MySQL Node Failure

In some cases, a MySQL server node might fail to restart. The following two sections describe the known conditions that cause this failure and the steps for diagnosing and resolving them. If neither of the causes listed applies, the final section provides instructions for re-deploying BOSH as a last resort to resolve the issue.

Cause 1: Monit Timed Out

Diagnose

Follow these steps to see if a monit time-out caused the MySQL node restart to fail:

  1. Use bosh ssh to access the failing node, using the IP address in the Ops Manager Director tile Status tab. For instructions, see Advanced Troubleshooting with the BOSH CLI.
  2. Run monit summary and check the status of the mariadb_ctrl job.
  3. If the status of the mariadb_ctrl job is Execution Failed, open the following file: /var/vcap/sys/log/mysql/mariadb_ctrl.combined.log.
    1. If the last line of the log indicates that MySQL started without issue, such as in the example below, monit likely timed out while waiting for the job to report healthy. Follow the steps below to resolve the issue.
      {"timestamp":"1481149250.288255692","source":"/var/vcap/packages/
      mariadb_ctrl/bin/mariadb_ctrl","message":"/var/vcap/packages/
      mariadb_ctrl/bin/mariadb_ctrl.mariadb_ctrl
      started","log_level":1,"data":{}}

Resolve

Run the following commands to return the mariadb_ctrl job to a healthy state:

  1. Run monit unmonitor mariadb_ctrl.
  2. Run monit monitor mariadb_ctrl.
  3. Run monit summary and confirm that the output lists mariadb_ctrl as running.

Cause 2: Bin Logs Filled up the Disk

Diagnose

  1. Use bosh ssh to access the failing node. For instructions, see Advanced Troubleshooting with the BOSH CLI.
  2. Open the following log file: /var/vcap/sys/log/mysql/mysql.err.log.
  3. If you see log messages that indicate insufficient disk space, the persistent disk is likely storing too many bin logs. Confirm insufficient disk space by doing the following:
    1. Run df -h.
      1. Confirm that the /var/vcap/store volume is at or over 90% usage.
    2. Navigate to /var/vcap/store/mysql and run ls -al.
      1. Confirm that there are many files named with the format mysql-bin.########.

In MySQL for PCF, the server node does not use these bin logs, so you can remove all but the most recent one. Follow the steps below to resolve the issue.
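
To see how much space the bin logs occupy, you can total them directly (a convenience check, not part of the official procedure):

    $ du -ch /var/vcap/store/mysql/mysql-bin.* | tail -n 1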

Resolve

  1. Log in to mysql by running mysql -u USERNAME -pPASSWORD (there is no space between -p and the password).

    Note: If you do not know the username and password, you can run cf env mysql-logqueue with the system org and the metrics-v1-4 space targeted.

  2. Run use metrics;.
  3. Run the following command:
    mysql> PURGE BINARY LOGS BEFORE 'YYYY-MM-DD HH:MM:SS'; 
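
For example, to list the bin logs and purge everything written before a given point in time (the timestamp below is purely illustrative; choose one that keeps the most recent bin log):

    mysql> SHOW BINARY LOGS;
    mysql> PURGE BINARY LOGS BEFORE '2016-12-07 00:00:00';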

Re-deploy BOSH to Restart the Node

If troubleshooting based on the causes mentioned above did not resolve the issue with your failing MySQL node, you can follow the steps below to recover it. Pivotal recommends using this procedure only as a last resort if there are no other solutions available.

WARNING! This procedure is extremely costly in terms of time and network resources. The cluster takes a significant amount of time to replicate data back into the rebuilt node, and the process consumes considerable network bandwidth because potentially hundreds of gigabytes of data must be transferred.

Stop the Ingestor App

  1. From Ops Manager, click the Elastic Runtime Tile.
    1. Click the Credentials tab.
    2. Under the UAA job, next to Admin Credentials, click Link to Credential.
    3. Record the username and password for use in the next step.
  2. Log in to the cf CLI using the credentials from the previous step.
    $ cf login -a https://api.YOUR-SYSTEM-DOMAIN -u admin -p PASSWORD
  3. Target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  4. Stop data flow into the Galera cluster:
    $ cf stop metrics-ingestor

Edit Your Deployment Manifest

  1. Follow the steps in Log in to BOSH in Advanced Troubleshooting with the BOSH CLI to target and log in to your BOSH Director. The steps vary slightly depending on whether your PCF deployment uses internal authentication or an external user store.
  2. Download the manifest of your PCF deployment:
    $ bosh download manifest YOUR-PCF-DEPLOYMENT YOUR-PCF-MANIFEST.yml
    

    Note: You must know the name of your PCF deployment to download the manifest. To retrieve it, run bosh deployments to list your deployments and locate the name of your PCF deployment.

  3. Open the manifest and set the number of instances of the failed server node to 0.
  4. Run bosh deployment YOUR-PCF-MANIFEST.yml to specify your edited manifest.
  5. Run bosh deploy to deploy with your manifest.
  6. Run bosh disks --orphaned to see the persistent disk or disks associated with the failed node.
    1. Record the CID of each persistent disk.
    2. Contact Pivotal Support to walk through re-attaching the orphaned disks to new VMs to preserve their data.
  7. Open the manifest and set the number of instances of the failed server node to 1.
  8. Run bosh deploy to deploy with your edited manifest.
  9. Wait for BOSH to rebuild the node.

Log Errors

Error The PCF Metrics UI does not show any new logs from Elasticsearch.
Cause The tile deployed with the Push PCF Metrics Data Components errand deselected.
Solution Restart the Elasticsearch Logqueue using the cf CLI as follows:
  1. Target the system org and metrics-v1-4 space of your PCF deployment:
    $ cf target -o system -s metrics-v1-4
  2. Run the following command to restart the Logqueue app:
    $ cf restart elasticsearch-logqueue

Note: To avoid having to apply this fix in the future, select the checkbox to enable the Push PCF Metrics Data Components errand before your next tile update.

503 Errors

Error You encounter 503 errors when accessing the PCF Metrics UI in your browser.
Cause Your Elasticsearch nodes might have become unresponsive.
Solution Check the Elasticsearch index health by following the procedure below, and consider adding additional Elasticsearch nodes.

  1. Retrieve the IP address of your Elasticsearch master node by navigating to the Metrics tile in the Ops Manager Installation Dashboard, clicking the Status tab, and recording the IP address next to ElasticSearchMaster.
  2. SSH into the Ops Manager VM by following the instructions in SSH into Ops Manager.
  3. From the Ops Manager VM, use curl to target the IP address of your Elasticsearch master node. Follow the instructions in Cluster Health of the Elasticsearch documentation.
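
For example, the cluster health endpoint summarizes the overall state of the cluster. This is a convenience sketch run from the Ops Manager VM:

    # "status" should be green; yellow or red, or a large number of
    # unassigned_shards, indicates the cluster is struggling and may need
    # additional nodes.
    $ curl -s ELASTICSEARCH-MASTER-IP:9200/_cluster/health?pretty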

Failed to Fetch Apps

Error Even though you entered the correct UAA credentials, the metrics app fails to fetch the list of apps.
Cause Browser plugins or cookies inject extraneous content into requests to the Cloud Controller API, causing it to reject the request.
Solution Confirm the problem and clear the browser, as follows:
  1. Try the browser’s incognito mode to see if the metrics app is able to fetch the list of apps. If this works, the problem is likely cookies or plugins.
  2. Clear your browser cookies and disable any problematic plugins.

Redis Temporary Datastore Stops Accepting Metrics

Error You see both these problems:
  • Metrics stop appearing on the UI.
  • When you run cf logs metrics-ingestor --recent, you see the following entry in the Ingestor logs:
    MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Cause The Redis datastore is full. The component is out of memory or persistent disk space.
Solution Confirm the problem and scale up Redis, as follows:
  1. On the Metrics tile, click the Status tab and look to see if the memory or persistent disk usage of the Redis job is over 80%.
  2. Scale up the Redis component. For more information, see Scale the Temporary Datastore (Redis).