Loggregator Guide for Cloud Foundry Operators
This topic explains how operators of Cloud Foundry deployments can configure the Loggregator system to avoid data loss when handling high volumes of logging and metrics data.
To determine the message throughput and reliability rates of your Loggregator system, see the section below.
To measure the message throughput of the Loggregator system, you can monitor the total number of egress messages from all Metrons on your platform using the
If you do not use a monitoring platform, you can follow the instructions below to measure the overall message throughput of your Loggregator system:
- Log in to the Cloud Foundry Command Line Interface (cf CLI) with your admin credentials:
$ cf login
- Install the Cloud Foundry Firehose plugin.
- Install Pipe Viewer:
$ apt-get install pv
- Run the following command:
$ cf nozzle -n | pv -l -i 10 -r > /dev/null
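If Pipe Viewer is not available, the same measurement can be approximated with a few lines of Python. The following is a minimal sketch, not part of the Firehose plugin: the function name `line_rates` and its parameters are hypothetical, and it assumes that each line of `cf nozzle -n` output corresponds to one message.

```python
import time
from typing import Callable, Iterable, Iterator


def line_rates(
    lines: Iterable[str],
    interval: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[float]:
    """Yield the average lines per second for each elapsed interval.

    A rate is only reported when a line arrives, so a stalled stream
    produces no output until the next line is read.
    """
    window_start = clock()
    count = 0
    for _ in lines:
        count += 1
        now = clock()
        elapsed = now - window_start
        if elapsed >= interval:
            yield count / elapsed
            # Start a new measurement window.
            window_start = now
            count = 0
```

To use it, iterate over `line_rates(sys.stdin)` in a script that receives the output of `cf nozzle -n` on standard input, and print each yielded rate.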
To measure the message reliability rate of your Loggregator system, you can run black-box tests. If you want to use this method, see the open-source cf-logmon app and the configuration instructions provided in the README.md file.
Most Loggregator configurations are set to use preferred resource defaults. If you want to customize these defaults or plan the capacity of your Loggregator system, see the formulas below.
Doppler resources can require scaling to accommodate your overall log and metric volume. Elastic Runtime recommends the following formula for determining the number of Doppler instances you need to keep the loss rate below 1%:
Number of Doppler instances = Number of logs per second / 2,000
Because it can be challenging to understand the ratio of metrics to logs, Elastic Runtime also recommends monitoring and scaling Doppler based on its ingress traffic. To do this, you need to sum two metrics and convert the sum to a per-second rate:
Number of Doppler instances =
DopplerServer.listeners.receivedEnvelopes / 10,000
Using maximum values over a two-week period is a recommended approach for ingress-based capacity planning.
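The two Doppler formulas can be sketched as small helper functions. This is an illustration only: the function names are hypothetical, and rounding up to a whole instance with a minimum of one instance is an assumption, since the formulas do not say how to treat fractional results. The ingress input is the summed per-second rate described above.

```python
import math

LOGS_PER_DOPPLER = 2_000        # divisor from the log-volume formula
ENVELOPES_PER_DOPPLER = 10_000  # divisor from the ingress formula


def doppler_instances_from_logs(logs_per_second: float) -> int:
    """Doppler count from log volume (assumption: round up, at least 1)."""
    return max(1, math.ceil(logs_per_second / LOGS_PER_DOPPLER))


def doppler_instances_from_ingress(peak_envelopes_per_second: float) -> int:
    """Doppler count from peak ingress rate (assumption: round up, at least 1)."""
    return max(1, math.ceil(peak_envelopes_per_second / ENVELOPES_PER_DOPPLER))
```

For example, 9,000 logs per second gives 4.5, which this sketch rounds up to 5 instances. A two-week peak ingress of 25,000 envelopes per second gives 2.5, rounded up to 3.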
Traffic Controller resources are usually scaled in line with Doppler resources. Elastic Runtime recommends the following formula for determining the number of Traffic Controller instances:
Number of Traffic Controller instances = Number of Doppler instances / 4
In addition, Traffic Controller resources can require scaling to accommodate the number of your log streams and Firehose subscriptions.
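As a rough sketch, the Traffic Controller formula can be computed from a Doppler instance count as follows. The function name is hypothetical, and rounding up with a minimum of one instance is an assumption. The result is only a baseline, because the formula does not account for log streams or Firehose subscriptions.

```python
import math

DOPPLERS_PER_TRAFFIC_CONTROLLER = 4  # divisor from the formula above


def traffic_controller_instances(doppler_instances: int) -> int:
    """Baseline Traffic Controller count (assumption: round up, at least 1)."""
    return max(1, math.ceil(doppler_instances / DOPPLERS_PER_TRAFFIC_CONTROLLER))
```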
Syslog Adapter is a Loggregator component that manages user-provided syslog drains. This component should be scaled depending on the number of your drain bindings.
Note: A drain binding is a syslog destination associated with an app. Apps can have multiple bindings.
Elastic Runtime recommends the following formula for determining the number of Syslog Adapter instances:
Number of Syslog Adapter instances = Number of drain bindings / 500
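Because one app can have several bindings, the input to this formula is the total number of bindings, not the number of apps. The sketch below illustrates that distinction. The mapping of app names to drain URLs is a hypothetical input format, the function names are made up for this example, and rounding up with a minimum of one instance is an assumption.

```python
import math
from typing import Dict, List

BINDINGS_PER_ADAPTER = 500  # divisor from the formula above


def count_drain_bindings(drains_by_app: Dict[str, List[str]]) -> int:
    """Count every (app, drain) pair; an app with two drains counts twice."""
    return sum(len(drains) for drains in drains_by_app.values())


def syslog_adapter_instances(drain_bindings: int) -> int:
    """Syslog Adapter count (assumption: round up, at least 1)."""
    return max(1, math.ceil(drain_bindings / BINDINGS_PER_ADAPTER))
```

With this sketch, 1,200 drain bindings give 2.4, which rounds up to 3 Syslog Adapter instances.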
You can use the scheduler.adapters metrics to configure auto-scaling of Syslog Adapters.
See Configuring Logging in Elastic Runtime for more information about scaling the Loggregator system.
You can scale a nozzle using the subscription ID specified when the nozzle connects to the Firehose. If you use the same subscription ID on each nozzle instance, the Firehose evenly distributes data across all instances of the nozzle.
For example, if you have two nozzle instances with the same subscription ID, the Firehose sends half of the data to one nozzle instance and half to the other. Similarly, if you have three nozzle instances with the same subscription ID, the Firehose sends one-third of the data to each instance.
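The arithmetic behind this distribution can be sketched as follows. The function names are hypothetical, and the sketch assumes a perfectly even split, which real traffic only approximates.

```python
from fractions import Fraction


def share_per_nozzle(instances: int) -> Fraction:
    """Fraction of Firehose data each instance receives under an even split."""
    if instances < 1:
        raise ValueError("at least one nozzle instance is required")
    return Fraction(1, instances)


def rate_per_nozzle(total_envelopes_per_second: float, instances: int) -> float:
    """Expected per-instance rate when all instances share one subscription ID."""
    return float(total_envelopes_per_second * share_per_nozzle(instances))
```

A calculation like this can help estimate how much load each nozzle instance must sustain before you add or remove instances.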
If you want to scale a nozzle, the number of nozzle instances should match the number of Traffic Controller instances:
Number of nozzle instances = Number of Traffic Controller instances
Stateless nozzles should handle scaling gracefully. If a nozzle buffers or caches the data, the nozzle author must test the results of scaling the number of nozzle instances up or down.
The Traffic Controller alerts nozzles if they consume events too slowly. If a nozzle falls behind, Loggregator alerts the nozzle in two ways:
- TruncatingBuffer alerts: If the nozzle consumes messages more slowly than they are produced, the Loggregator system may drop messages. In this case, Loggregator sends the log message TB: Output channel too full. Dropped N messages, where N is the number of dropped messages. Loggregator also emits a CounterEvent with the name doppler_proxy.slow_consumer. The nozzle receives both messages from the Firehose, alerting the operator to the performance issue.
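A nozzle can watch for the TruncatingBuffer log message and extract the number of dropped messages. The following is a minimal sketch that assumes the message text matches the wording quoted above exactly. The function name is hypothetical, and it works on the plain message string rather than on any particular envelope type.

```python
import re
from typing import Optional

# Pattern built from the alert text quoted in this topic.
TB_PATTERN = re.compile(r"TB: Output channel too full\. Dropped (\d+) messages")


def dropped_message_count(log_message: str) -> Optional[int]:
    """Return N from a TruncatingBuffer alert, or None for any other message."""
    match = TB_PATTERN.search(log_message)
    return int(match.group(1)) if match else None
```

Summing the returned counts over time gives a rough view of how far a nozzle is falling behind.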
You can configure Elastic Runtime to forward log data from apps to an external aggregator service. Using Log Management Services explains how to bind apps to the external service and configure it to receive logs from Elastic Runtime.
When a Diego Cell emits app logs to Metron, Diego breaks up log messages greater than approximately 60 KiB into multiple envelopes.
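The effect on a consumer can be illustrated with a short sketch. This is not Diego's implementation: the exact limit and split points are not documented here, so the sketch assumes a fixed limit of 60 KiB (61,440 bytes) and splits on byte boundaries. The function name is hypothetical.

```python
from typing import List

ASSUMED_LIMIT_BYTES = 60 * 1024  # assumption: "approximately 60 KiB" taken as exactly 60 KiB


def split_log_message(payload: bytes, limit: int = ASSUMED_LIMIT_BYTES) -> List[bytes]:
    """Split a payload into consecutive chunks of at most `limit` bytes."""
    chunks = [payload[i:i + limit] for i in range(0, len(payload), limit)]
    # An empty payload still yields one (empty) chunk.
    return chunks or [payload]
```

Under these assumptions, a single 150,000-byte log line arrives as three envelopes, so consumers that expect one envelope per log line may need to reassemble long messages.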