Troubleshooting PCF Log Search

Page last updated:

This topic describes how to resolve common issues with missing logs the Pivotal Cloud Foundry (PCF) Log Search tile.

If not all logs expected are being received and displayed in the Log Search Kibana UI, the following items should be validated as part of the troubleshooting process.

Confirm that Shard Allocation is Enabled

  • Disabling the Enable Shard Allocation errand and applying changes that affect Elasticsearch nodes prevents new daily indexes from being created. When updating stemcells it is tempting to disable the Enable Shard Allocation errand to speed up the deployment. Unfortunately doing this leaves Elasticsearch in a “broken” state; specifically it is unable to create new daily indexes at midnight UTC.

  • If you suspect this has happened; check whether your cluster cluster.routing.allocation.enable setting has been set to something other than all by querying:

    If it has, you can correct this by re-enabling the errand in Ops Manager via the Log Search > Errand screen, and Apply Changes.

Confirm Correct Installation

  • Log Search (v1.0.2 and later) requires two important installation steps in order to receive all logs/metrics expected
    • Ensure NATS credentials from the Ops Manager tile were entered in Log Search tile.
    • Ensure System Logging was correctly configured in Elastic Runtime with IP information from Log Search tile.

Confirm that Resource Scale Matches Log Volume

  • The default resource configuration for the PCF Log Search tile is targeted primarily at consumption of platform logs. If you are also using the Experimental Firehose feature in order to consume application logs, this can put strain on the Log Parser and Elasticsearch data nodes if application logging volume is high.
    • In order to determine if you may need to scale your clusters to support application logging consumption, please Find the Rate of Incoming Messages and then look to the table shown in Scaling for a suggested amount to further scale up your resources.

      WARNING: If you recently upgraded to PCF v1.10 or later, you will likely need to scale up your Log Search cluster to accommodate higher Loggregator message throughput. See gRPC in Loggregator for more info.

Review Current Cluster Health

  • Log Search includes a visualization of current cluster health. It is accessible at logsearch_monitor.<SystemDomain>

  • Clusters that are in a red state are broken

    • The best way to troubleshoot the cluster health is to look at the logs for Elasticsearch Master and Elasticsearch Data nodes. This is found in Log Search > Status > Logs
  • Clusters that are in a yellow state are not replicating.

    • Verify there are multiple data nodes. If there are multiple data nodes in a yellow state, look at the logs for Elasticsearch Master and Elasticsearch Data nodes. This is found in Log Search > Status > Logs
    • If there is only a single data node, scale up the number of Elasticsearch Data nodes in Log Search > Resource Config

Check if Elasticsearch Data Nodes at Maximum Disk

  • This issue is most commonly caused by a Redis queuing issue, and is often evident in product logs

    • Example log message:

    Redis key size has hit a congestion threshold 1000000 suspending output for 10 seconds

  1. Restart the ingestor and wait a few minutes.
  2. If the cluster health continues to be unhealthy, scale up the size of the clusters.
    • Reference: Log Search Scaling Guide
    • Example: If the Elasticsearch data nodes are running near maximum on persistent disk, consider increasing the persistent disk or instance count to add more capacity.

Check if Parser VMs at Maximum CPU

  • As a general rule, use more Log parser nodes than Elasticsearch Data nodes.
  • Scale up parser nodes until the average CPU goes down.

Check for Multiple Elasticsearch Master VMs (v1.0.1 and earlier)

  • Ensure there is only one Elasticsearch Master VM. The product cannot support running with more than one Elasticsearch master VM. The ability to accidentally scale this was removed as of v1.0.2 and later.