Troubleshooting PCF Metrics
- Insufficient Resources
- Missing Specific Logs
- High CPU on PostgreSQL VM
- Too Many Clients Error
- Failed to Fetch Apps
- Redis Temporary Datastore Stops Accepting Metrics
- No Logs or Metrics in the UI
- MySQL Failure
- Forward PCF Metrics Logs to a Syslog Endpoint
- Service metrics-forwarder does not exist
- Metrics API Unavailable
This topic describes how to resolve common issues experienced while operating or using Pivotal Cloud Foundry (PCF) Metrics.
Insufficient Resources
Error | Insufficient Resources |
---|---|
Cause | Your PCF deployment has insufficient Diego resources to handle the apps pushed as part of a PCF Metrics installation. The PCF Metrics tile deploys several apps, including those listed under Step 2: Check the PCF Metrics Apps. |
Solution | Increase the number of Diego cells so that your PCF deployment can support the apps pushed as part of the PCF Metrics installation. |
Missing Specific Logs
Error | Logs are missing for a specific application, or a subset of logs is being skipped. |
---|---|
Cause | PCF Metrics does not store logs whose messages contain non-UTF-8 characters, or logs whose application GUID is not a standard UUID. |
Solution | Remove non-UTF-8 characters from log messages and ensure that each log is created with a standard UUID as its application GUID. |
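
As a rough pre-check for the two conditions above, the following shell sketch tests whether a GUID has the standard 8-4-4-4-12 UUID layout and whether a log message is valid UTF-8. The function names are hypothetical, and the UTF-8 check assumes the iconv utility is installed. PCF Metrics itself may apply different or stricter rules.

```shell
# Hypothetical helpers; not part of PCF Metrics or the cf CLI.

# Succeed if the argument looks like a standard 8-4-4-4-12 hexadecimal UUID.
is_standard_uuid() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}

# Succeed if standard input is valid UTF-8 (iconv fails on invalid byte sequences).
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

For example, `printf '%s' "$LOG_MESSAGE" | is_valid_utf8` could be run against a sample message before the app emits it, where `LOG_MESSAGE` is a placeholder variable.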
High CPU on PostgreSQL VM
Error | PostgreSQL VM CPU is over 80% for an extended period of time. |
---|---|
Cause | The PostgreSQL VM does not have enough CPU allocated or enough space allocated for storage. |
Solution | Increase the size of the PostgreSQL VM. |
Too Many Clients Error
Error | You encounter "sorry, too many clients already" errors when accessing the PCF Metrics UI in your browser. |
---|---|
Cause | Your PostgreSQL instance is running out of disk space, which reduces performance and causes a spike in open connections. |
Solution |
|
Failed to Fetch Apps
Error | Even though you entered the correct UAA credentials, the metrics app fails to fetch the list of apps. |
---|---|
Cause | Browser plugins or cookies inject extraneous content into requests to the Cloud Controller API, causing it to reject the requests. |
Solution | Confirm the problem, then clear the browser's cookies and disable its plugins. |
Redis Temporary Datastore Stops Accepting Metrics
Error | The Redis temporary datastore stops accepting metrics. |
---|---|
Cause | The Redis datastore is full. The component is out of memory or persistent disk space. |
Solution | Confirm the problem and scale up Redis. |
Received No Results Back from MySQL - Failing
Error | Received no results back from mysql - failing |
---|---|
Cause | The Ingestor is not functioning properly. |
Solution |
Target the metrics space, start the Ingestor, and check its recent logs:
$ cf target -o system -s metrics-v1-6
$ cf start metrics-ingestor
$ cf logs metrics-ingestor --recent
Note: In some cases, the apps cannot communicate because TLS certificate verification fails. If your deployment uses self-signed certificates, ensure that the Disable SSL certificate verification for this environment box is selected in the PAS Networking pane.
|
Failed to Connect to MySQL
Error | Failed to connect to mysql |
---|---|
Cause | MySQL is not running properly. |
Solution |
Target the metrics space and confirm that the metrics-api application security group exists:
$ cf target -o system -s metrics-v1-6
$ cf security-group metrics-api
|
Never Received App Logs
Error | Never received app logs - something in the firehose -> PostgreSQL flow is broken |
---|---|
Cause | The Ingestor is not inserting logs correctly. |
Solution |
Target the metrics space, start the Ingestor, and check its recent logs:
$ cf target -o system -s metrics-v1-6
$ cf start metrics-ingestor
$ cf logs metrics-ingestor --recent
Note: In some cases, you might discover a failure to communicate with Loggregator in the form of a bad handshake error.
|
Metrics and Events Not Available
Error | Network metrics are not available. Container metrics are not available. App events are not available. |
---|---|
Cause | PCF Metrics is misconfigured and the frontend API does not receive logs from MySQL. |
Solution |
Target the metrics space and check the recent logs of the metrics app:
$ cf target -o system -s metrics-v1-6
$ cf logs metrics --recent
|
Logs and Histograms Not Available
Error | Logs are not available. Histograms are not available. |
---|---|
Cause | PCF Metrics is misconfigured and the frontend API does not receive logs from PostgreSQL. |
Solution |
Target the metrics space and check the recent logs of the metrics app:
$ cf target -o system -s metrics-v1-6
$ cf logs metrics --recent
|
No Logs or Metrics in the UI
In some cases, the PCF Metrics UI might not display metrics and logs after successfully deploying.
Follow the steps in this section to help locate the app or component causing the problem.
Step 1: Check your Load Balancer Configuration
If you use a load balancer, the event-stream mechanism used by the Metrics UI might be blocked. See the table below to resolve this error.
If you do not use a load balancer, or this issue does not apply to your deployment, proceed to the next step.
Error | In one deployment that used an F5 load balancer, metrics and logs were not visible in the UI even though ingestion succeeded and the UI reported no errors. |
---|---|
Cause | The F5 load balancer blocked traffic of type text/event-stream. |
Solution | Configure the F5 load balancer to allow event-stream traffic. |
Step 2: Check the PCF Metrics Apps
- From Ops Manager, click the PAS Tile.
- Click the Credentials tab.
- Under the UAA job, next to Admin Credentials, click Link to Credential.
- Record the username and password for use in the next step.
- Log in to the Cloud Foundry Command Line Interface (cf CLI) using the credentials from the previous step.
$ cf login -a https://api.YOUR-SYSTEM-DOMAIN -u admin -p PASSWORD
- When prompted, select the system org and the metrics-v1-6 space.
- Ensure that the output of cf apps displays the following apps, each in a started state:
  - metrics-ingestor
  - metrics-queue
  - logs-queue
  - metrics
  - metrics-ui
  - metrics-alerting
- Check the logs of each app for errors using the following command:
$ cf logs APP-NAME --recent
If you do not see any output, or if you did not find any errors, proceed to the next step.
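
The app-state check in this step can be sketched as a small filter over the output of `cf apps`. The function name is hypothetical, and the sketch assumes that each app row starts with the app name followed by its requested state, which may not hold for every cf CLI version.

```shell
# Names of the PCF Metrics apps listed in this step.
EXPECTED_APPS="metrics-ingestor metrics-queue logs-queue metrics metrics-ui metrics-alerting"

# Hypothetical helper: read `cf apps` output on stdin and print every expected
# app that is missing or whose requested state is not "started".
# Assumes column 1 is the app name and column 2 is the requested state.
report_unhealthy_apps() {
  awk -v expected="$EXPECTED_APPS" '
    BEGIN { n = split(expected, names, " ") }
    { state[$1] = $2 }
    END {
      for (i = 1; i <= n; i++) {
        app = names[i]
        if (!(app in state)) print app ": missing"
        else if (state[app] != "started") print app ": " state[app]
      }
    }'
}
```

With the system org and metrics-v1-6 space targeted, `cf apps | report_unhealthy_apps` prints nothing when all six apps are started.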
Step 3: Check the Metrics Ingestor
- To get a higher level of detail from the metrics-ingestor application, set the LOG_LEVEL env variable:
$ cf set-env metrics-ingestor LOG_LEVEL DEBUG
- To apply this setting, restage the application:
$ cf restage metrics-ingestor
- Run the following command to stream logs:
$ cf logs metrics-ingestor
- Watch the logs emitted by the metrics-ingestor app for errors. See the list below for common errors:
  - Aggregation Stored Procedures key is not in redis: Redis may have been restarted or is in a bad state. Stop the metrics-ingestor app. Restart Redis. Start the metrics-ingestor app again.
  - Cannot connect to the firehose: PCF Metrics creates a UAA user to authenticate the connection to the Firehose. This user must have the doppler.firehose authority.
  - Could not find service with name: metrics-forwarder: The Metrics Forwarder Tile is not installed. Metrics will not display custom metrics without the Metrics Forwarder Tile but will otherwise function normally.
  - WebSocket Disconnects: If you see WebSocket disconnect logs in the Ingestor app, consider adding Ingestor instances. The Firehose might be dropping the Ingestor connection to avoid back pressure.
  - Redis errors: Investigate the Redis logs. For instructions, see Advanced Troubleshooting with the BOSH CLI. Many possible solutions start with restarting Redis.
- If the app is started and you do not find any errors, proceed to the next step.
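
To scan a long Ingestor log for the common errors listed in this step, a filter along the following lines may help. It is only a sketch: the function name and the hint labels are hypothetical, and the match strings are copied from the list in this step, so real log lines that word the errors differently will not match.

```shell
# Hypothetical helper: read metrics-ingestor log lines on stdin and print a
# short hint for each line that contains one of the known error strings.
suggest_ingestor_fix() {
  while IFS= read -r line; do
    case "$line" in
      *"Aggregation Stored Procedures key is not in redis"*)
        echo "redis-key-missing: stop metrics-ingestor, restart Redis, start metrics-ingestor" ;;
      *"Cannot connect to the firehose"*)
        echo "firehose-auth: check that the UAA user has the doppler.firehose authority" ;;
      *"Could not find service with name: metrics-forwarder"*)
        echo "forwarder-missing: only custom metrics are affected" ;;
      *[Ww]eb[Ss]ocket*)
        echo "websocket-disconnect: consider adding Ingestor instances" ;;
    esac
  done
}
```

A possible invocation is `cf logs metrics-ingestor --recent | suggest_ingestor_fix | sort -u`, which collapses repeated hints into one line each.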
Step 4: Check the Logs Queue
- To get a higher level of detail from the logs-queue application, set the LOG_LEVEL env variable:
$ cf set-env logs-queue LOG_LEVEL DEBUG
- To apply this setting, restage the application:
$ cf restage logs-queue
- Run the following command to stream logs:
$ cf logs logs-queue
- Watch the logs emitted by the logs-queue app for errors.
  - A common error is that the app cannot connect to PostgreSQL because the application security group (ASG) was deleted. This ASG allows the logs-queue application to create a network connection to the PostgreSQL VM. You can run cf security-group metrics-api to see if the ASG exists. If the ASG is not present, see Creating Application Security Groups to recreate it.
  - Could not find service with name: metrics-forwarder: The Metrics Forwarder Tile is not installed. Metrics will not display custom metrics without the Metrics Forwarder Tile but will otherwise function normally.
- If the app is started and you do not find any errors, proceed to the next step.
Step 5: Check MySQL
- From Ops Manager, select the PCF Metrics tile.
- Under the Status tab, record the IP of a MySQL Server node.
- Use bosh ssh to access the VM from the previous step. For instructions, see Advanced Troubleshooting with the BOSH CLI.
- Log in to mysql by running the following command and entering the password when prompted:
$ mysql -u USERNAME -p
Note: If you do not know the username and password, you can run cf env metrics-queue with the system org and the metrics-v1-6 space targeted.
- Verify that the database was bootstrapped correctly:
  - Run show databases and check for a metrics database.
  - If there is no metrics database, the Push PCF Metrics Components Errand of the BOSH release might not have run or succeeded. Ensure the errand is selected in the tile configuration and update the tile.
- Run use metrics to select the metrics database:
mysql> use metrics;
- Run show tables and ensure you see the following tables:
mysql> show tables;
+-----------------------------+
| Tables_in_metrics           |
+-----------------------------+
| app_event                   |
| app_metric                  |
| app_metric_rollup           |
| schema_version              |
| app_metric_identifier       |
+-----------------------------+
- Enter the following query several times to verify that the value returned does not decrease over time:
mysql> select count(*) from metrics.app_metric_identifier where timestamp_minute > ((UNIX_TIMESTAMP() - 60) * POW(10, 3));
This query shows the rate at which metrics flowed in over the last minute.
  - If the query returns 0 or a consistently decreasing value, the problem is likely further up in ingestion; proceed to the next step.
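
When the count query from this step is run several times, the results can be judged with a small filter such as the one below. It is a sketch only: the function name and the labels it prints are hypothetical, and it treats "consistently decreasing" as every sample being lower than the one before it.

```shell
# Hypothetical helper: read one count per line (the results of repeated runs
# of the query, oldest first) and classify the trend.
classify_metric_counts() {
  awk '
    NR > 1 && ($1 + 0) >= prev { not_decreasing = 1 }
    { prev = $1 + 0; last = $1 + 0 }
    END {
      if (NR == 0) print "no-samples"
      else if (last == 0) print "zero"
      else if (NR > 1 && !not_decreasing) print "decreasing"
      else print "ok"
    }'
}
```

A result of `zero` or `decreasing` corresponds to the case where the problem is likely further up in ingestion.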
Step 6: Check the Metrics Queue
- To get a higher level of detail from the metrics-queue application, set the LOG_LEVEL env variable:
$ cf set-env metrics-queue LOG_LEVEL DEBUG
- To apply this setting, restage the application:
$ cf restage metrics-queue
- Run the following command to stream logs:
$ cf logs metrics-queue
- Watch the logs emitted by the metrics-queue app for errors.
  - A common error is that the app cannot connect to MySQL because the application security group (ASG) was deleted. This ASG allows the metrics-queue application to create a network connection to the MySQL VM. You can run cf security-group metrics-api to see if the ASG exists. If the ASG is not present, see Creating Application Security Groups to recreate it.
  - Could not find service with name: metrics-forwarder: The Metrics Forwarder Tile is not installed. Metrics will not display custom metrics without the Metrics Forwarder Tile but will otherwise function normally.
- If the app is started and you do not find any errors, proceed to the next step.
MySQL Failure
In some cases, a MySQL server might fail to restart. The following two sections describe the known conditions that cause this failure as well as steps for diagnosing and resolving them.
Cause 1: Monit Timed Out
Diagnose
Follow these steps to see if a monit time-out caused the MySQL node restart to fail:
- Use bosh ssh to access the failing node, using the IP address in the Ops Manager Director tile Status tab. For instructions, see Advanced Troubleshooting with the BOSH CLI.
- Run monit summary and check the status of the galera-init job.
- If the status of the galera-init job is Execution Failed, open the following file: /var/vcap/sys/log/pxc-mysql/galera-init.log.
  - If the last line of the log indicates that MySQL started without issue, such as in the example below, monit likely timed out while waiting for the job to report healthy. Follow the steps below to resolve the issue.
{"timestamp":"1536851105.372446537","source":"/var/vcap/packages/galera-init/bin/galera-init","message":"/var/vcap/packages/galera-init/bin/galera-init.galera-init started","log_level":1,"data":{}}
Resolve
Run the following commands to return the galera-init job to a healthy state:
- Run monit unmonitor galera-init.
- Run monit monitor galera-init.
- Run monit summary and confirm that the output lists galera-init as running.
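
Both the diagnosis and the final confirmation above depend on reading one job's status out of `monit summary`. The following sketch extracts that status. The function name is hypothetical, and the sketch assumes summary lines of the form `Process 'JOB-NAME' STATUS`, which may vary between monit versions.

```shell
# Hypothetical helper: read `monit summary` output on stdin and print the
# status text of the job named in $1.
# Assumes lines of the form: Process 'JOB-NAME'   STATUS
monit_job_status() {
  awk -v job="'$1'" '
    $1 == "Process" && $2 == job {
      $1 = ""; $2 = ""       # drop the "Process" and name fields
      sub(/^ +/, "")         # trim the leading separators left behind
      print
    }'
}
```

On the MySQL node, `monit summary | monit_job_status galera-init` would then print only the status of the galera-init job.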
Cause 2: Bin Logs Filled up the Disk
Diagnose
- Use bosh ssh to access the failing node. For instructions, see Advanced Troubleshooting with the BOSH CLI.
- Open the following log file: /var/vcap/sys/log/pxc-mysql/mysql.err.log.
- If you see log messages that indicate insufficient disk space, the persistent disk is likely storing too many bin logs. Confirm insufficient disk space by doing the following:
  - Run df -h.
    - Check that the /var/vcap/store folder is at or over 90% usage.
  - Navigate to /var/vcap/store/pxc-mysql and run ls -al.
    - Check that you see many files named with the format mysql-bin.########.
In MySQL for PCF, the server node does not make use of these logs and you can remove all except the most recent bin log. Follow the steps below to resolve the issue.
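
The disk-usage check from the Diagnose steps can also be scripted. The sketch below reads the POSIX-format output of `df -P` and succeeds when a given mount point is at or above a threshold. The function name is hypothetical, and the sketch assumes the standard `df -P` column order, with capacity in column 5 and the mount point in column 6.

```shell
# Hypothetical helper: read `df -P` output on stdin and succeed (exit 0) if the
# filesystem mounted at $1 is at or above $2 percent capacity.
mount_usage_at_least() {
  awk -v mount="$1" -v limit="$2" '
    $6 == mount {
      sub(/%/, "", $5)                 # "93%" -> "93"
      found = 1
      over = (($5 + 0) >= (limit + 0))
    }
    END { exit((found && over) ? 0 : 1) }'
}
```

For example, `df -P | mount_usage_at_least /var/vcap/store 90 && echo "persistent disk is nearly full"`.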
Resolve
- Log in to mysql by running the following command and entering the password when prompted:
$ mysql -u USERNAME -p
Note: If you do not know the username and password, you can run cf env metrics-queue with the system org and the metrics-v1-6 space targeted.
- Run use metrics;.
- Run the following command:
mysql> PURGE BINARY LOGS BEFORE 'YYYY-MM-DD HH:MM:SS';
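
The timestamp in the PURGE statement has to be filled in by hand. As one way to generate it, the sketch below prints the statement with a cutoff a given number of days before now. The function name and the choice of a cutoff in whole days are assumptions, and the `-d` option requires GNU date.

```shell
# Hypothetical helper: print a PURGE statement whose cutoff is $1 days before
# the current time. Requires GNU date for the -d option.
purge_statement_days_ago() {
  cutoff=$(date -d "$1 days ago" '+%Y-%m-%d %H:%M:%S') || return 1
  printf "PURGE BINARY LOGS BEFORE '%s';\n" "$cutoff"
}
```

Running `purge_statement_days_ago 1` prints a statement that can be pasted at the mysql prompt. Pick the number of days so that at least the most recent bin log is kept.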
Edit Your MySQL Server Configuration
- From Ops Manager, click the PCF Metrics Tile.
- Navigate to the Resource Config section of the PCF Metrics Tile.
- Increase the Persistent Disk size of MySQL Server to at least twice the current size.
Forward PCF Metrics Logs to a Syslog Endpoint
When using PCF Metrics 1.6 or later on PCF 2.4 or later, you can configure the tile to forward its logs to a syslog endpoint for troubleshooting. For more information, see the Forward PCF Metrics Logs to a Syslog Endpoint section of Configure the PCF Metrics Tile.
Service metrics-forwarder does not exist
Error | Service metrics-forwarder does not exist. |
---|---|
Cause | The Metrics Forwarder Tile is not installed. |
Solution | Install the Metrics Forwarder Tile if you want custom metrics. Otherwise, you can ignore the error because the service is optional. |
Metrics API Unavailable
Error | The Metrics URL shows Metrics API Unavailable. |
---|---|
Cause | The URL uses http instead of https. |
Solution | Go to the https version of the Metrics URL. |