Troubleshooting PCF Log Search

This topic describes how to resolve common issues with missing logs in the Pivotal Cloud Foundry (PCF) Log Search tile.

If the Log Search Kibana UI is not receiving and displaying all of the logs you expect, validate the following items as part of troubleshooting.

Ensure the required Enable Shard Allocation errand is enabled

  • When updating stemcells, it is tempting to disable the Enable Shard Allocation errand to speed up the deployment. However, disabling this errand and then applying changes that affect Elasticsearch nodes leaves Elasticsearch in a broken state: it is unable to create new daily indexes at midnight UTC.

  • If you suspect this has happened, check whether the cluster's cluster.routing.allocation.enable setting has been set to something other than all by querying:

    http://logsearch.<system_domain>/elasticsearch/_cluster/settings?pretty
    If it has, re-enable the errand in Ops Manager on the Log Search > Errands screen, and then click Apply Changes.
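
For example, you can check the current setting with curl. The response shown is only illustrative; depending on how allocation was disabled, the setting can appear under either the persistent or transient block, so verify against your own cluster:

    # Query the cluster-wide allocation setting:
    curl "http://logsearch.<system_domain>/elasticsearch/_cluster/settings?pretty"

    # Illustrative response when allocation has been disabled:
    # {
    #   "persistent" : {
    #     "cluster" : { "routing" : { "allocation" : { "enable" : "none" } } }
    #   },
    #   "transient" : { }
    # }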

Ensure the prescribed installation steps were followed

  • Log Search (v1.0.2 and later) requires two important installation steps in order to receive all expected logs and metrics:
    • Ensure the NATS credentials from the Ops Manager tile were entered in the Log Search tile.
    • Ensure System Logging was correctly configured in Elastic Runtime with the IP information from the Log Search tile.

Ensure resource scale matches expected log volume

  • The default resource configuration for the PCF Log Search tile is targeted primarily at consuming platform logs. If you also use the experimental Firehose feature to consume application logs, high application logging volume can strain the Log Parser and Elasticsearch data nodes.
    • To determine whether you need to scale your cluster to support application log consumption, follow Find the Rate of Incoming Messages, and then consult the table in Scaling for suggested resource increases. One way to estimate the rate is sketched below.
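
As a rough sketch, you can estimate the incoming message rate by counting the documents indexed over the last minute. The logstash-* index pattern and the @timestamp field are assumptions based on a typical Logstash-to-Elasticsearch setup; adjust them to match your deployment:

    # Count documents indexed in the last minute (the index pattern and
    # timestamp field are assumptions; adjust for your deployment):
    curl -XPOST "http://logsearch.<system_domain>/elasticsearch/logstash-*/_count?pretty" \
      -d '{"query": {"range": {"@timestamp": {"gte": "now-1m"}}}}'

    # Divide the returned "count" by 60 for an approximate messages-per-second rate.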

Review the current cluster health

  • Log Search includes a visualization of current cluster health, accessible at logsearch_monitor.<system_domain>. You can also query the cluster health directly, as shown in the example after this list.

  • Clusters that are in a red state are broken.

    • The best way to troubleshoot cluster health is to review the logs for the Elasticsearch Master and Elasticsearch Data nodes, found in Log Search > Status > Logs.
  • Clusters that are in a yellow state are not replicating.

    • Verify that there are multiple data nodes. If there are multiple data nodes and the cluster is still yellow, review the logs for the Elasticsearch Master and Elasticsearch Data nodes, found in Log Search > Status > Logs.
    • If there is only a single data node, scale up the number of Elasticsearch Data nodes in Log Search > Resource Config.
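
As a supplement to the dashboard, you can query the standard Elasticsearch health API through the same endpoint used earlier, assuming it is routable in your environment:

    # "status" is "green", "yellow", or "red"; a non-zero
    # "unassigned_shards" count usually explains a yellow or red state.
    curl "http://logsearch.<system_domain>/elasticsearch/_cluster/health?pretty"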

If Elasticsearch data nodes are consistently running at maximum capacity

  • This issue is most commonly caused by Redis queuing and is often evident in the product logs.

    • Example log message:

    Redis key size has hit a congestion threshold 1000000 suspending output for 10 seconds

  1. Restart the ingestor and wait a few minutes. To confirm that the Redis queue is actually backing up first, see the check below.
  2. If the cluster remains unhealthy, scale up the size of the cluster.
    • Reference: Log Search Scaling Guide
    • Example: If the Elasticsearch data nodes are running near maximum persistent disk usage, consider increasing the persistent disk size or the instance count to add capacity.
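
One hedged way to confirm the backlog is to check the queue length from the Redis/queue VM. The key name logstash is an assumption about the ingestor's configured Redis list; adjust it to match your configuration:

    # Run on the queue VM; the key name "logstash" is an assumption
    # about which Redis list the ingestor writes to:
    redis-cli LLEN logstash

    # A length near the congestion threshold (1000000 in the log
    # message above) means the parsers cannot keep up with ingestion.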

If Parser VMs are consistently running at maximum CPU

  • As a general rule, run more Log Parser nodes than Elasticsearch Data nodes.
  • Scale up the parser nodes until the average CPU usage goes down. One way to observe per-VM CPU load is shown below.
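
To observe CPU load across the parser VMs, one option is the BOSH CLI's vitals output. The deployment name logsearch is an assumption; run bosh deployments to find yours:

    # Show per-VM CPU, memory, and disk usage for the deployment
    # (the deployment name is an assumption; check `bosh deployments`):
    bosh -d logsearch vms --vitals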

If there are multiple Elasticsearch Master VMs (v1.0.1 and earlier)

  • Ensure there is only one Elasticsearch Master VM. The product does not support running more than one Elasticsearch Master VM. The ability to accidentally scale this was removed in v1.0.2. You can verify what the cluster sees with the query below.
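
To verify how many master-eligible nodes the cluster sees, you can use the standard _cat API through the same Elasticsearch endpoint used earlier:

    # Lists cluster nodes; the "master" column marks the elected
    # master (*) and any other master-eligible nodes (m):
    curl "http://logsearch.<system_domain>/elasticsearch/_cat/nodes?v"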