Loggregator Guide for Cloud Foundry Operators
This topic explains how Cloud Foundry deployment operators can configure the Loggregator system to avoid data loss when the volume of logging and metrics data is high.
When the volume of log and metric data generated by Elastic Runtime components exceeds the storage buffer capacity of the Dopplers that collect it, data can be lost. Configuring System Logging in Elastic Runtime explains how to scale the Loggregator system to keep up with high stream volume and minimize data loss.
You can scale nozzles using the subscription ID, specified when the nozzle connects to the Firehose. If you use the same subscription ID on each nozzle instance, the Firehose evenly distributes events across all instances of the nozzle. For example, if you have two nozzles with the same subscription ID, the Firehose sends half of the events to one nozzle and half to the other. Similarly, if you have three nozzles with the same subscription ID, the Firehose sends each instance one-third of the event traffic.
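The even split described above can be illustrated with a toy model. The following is a minimal sketch, not the Firehose implementation: it assumes a simple rotation across instances, and the function name `distribute` and the instance names are hypothetical. It only shows that each event reaches exactly one nozzle instance in a group sharing a subscription ID, so each instance sees roughly an equal share.

```python
from itertools import cycle


def distribute(events, nozzle_instances):
    """Hand each event to exactly one nozzle instance, in rotation.

    Assumption: rotation stands in for whatever balancing the Firehose
    actually performs; only the resulting even split is taken from the text.
    """
    received = {name: 0 for name in nozzle_instances}
    for _event, name in zip(events, cycle(nozzle_instances)):
        received[name] += 1
    return received


if __name__ == "__main__":
    events = range(600)
    # Two instances with the same subscription ID: half of the events each.
    print(distribute(events, ["nozzle-0", "nozzle-1"]))
    # Three instances: one-third of the events each.
    print(distribute(events, ["nozzle-0", "nozzle-1", "nozzle-2"]))
```

Because each instance sees only part of the stream, a nozzle that aggregates or caches data per instance may produce different results after scaling.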
Stateless nozzles should handle scaling gracefully. If a nozzle buffers or caches the data, the nozzle author must test the results of scaling the number of nozzle instances up or down.
The Traffic Controller alerts nozzles if they consume events too slowly. If a nozzle falls behind, Loggregator alerts the nozzle in two ways:
TruncatingBuffer alerts: If the nozzle consumes messages more slowly than they are produced, the Loggregator system may drop messages. In this case, Loggregator sends the log message "TB: Output channel too full. Dropped (n) messages", where "n" is the number of dropped messages. Loggregator also emits a CounterEvent with the name TruncatingBuffer.DroppedMessages. The nozzle receives both messages from the Firehose, alerting the operator to the performance issue.
PolicyViolation error: The Traffic Controller periodically sends ping control messages over the Firehose WebSocket connection. If a client does not respond to a ping with a pong message within 30 seconds, the Traffic Controller closes the WebSocket connection with the WebSocket error code ClosePolicyViolation (1008). The nozzle should intercept this WebSocket close error, alerting the operator to the performance issue.
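A nozzle can recognize both alerts with a small amount of code. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the log wording is taken from the text above, and the pattern accepts the count with or without parentheses because the exact rendering is not confirmed here. It does not use any Loggregator client library.

```python
import re

# Assumed wording of the dropped-message log line, taken from the text above.
# The count is accepted with or without surrounding parentheses.
DROPPED_PATTERN = re.compile(
    r"TB: Output channel too full\. Dropped \(?(\d+)\)? messages"
)

# WebSocket close code for a policy violation, as defined in RFC 6455.
CLOSE_POLICY_VIOLATION = 1008


def dropped_message_count(log_line):
    """Return the dropped-message count reported in log_line, or None."""
    match = DROPPED_PATTERN.search(log_line)
    return int(match.group(1)) if match else None


def is_slow_consumer_close(close_code):
    """Return True if the WebSocket close code signals a policy violation."""
    return close_code == CLOSE_POLICY_VIOLATION
```

A nozzle could call `dropped_message_count` on each log message it receives and `is_slow_consumer_close` in its WebSocket close handler, then surface either condition to the operator.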
An operator can scale the number of nozzles in response to these alerts to minimize the loss of data.
You can configure Elastic Runtime to forward log data from components and apps to an external aggregator service instead of routing it to the Loggregator Firehose. Configuring System Logging in Elastic Runtime explains how to enable log forwarding by specifying the aggregator address, port, and protocol.
Using Log Management Services explains how to bind applications to the external service and configure it to receive logs from Elastic Runtime.
The Diego cell emits application logs as UDP messages to Metron. Because a single UDP message is limited in size, Diego breaks up log messages greater than approximately 60 KiB into multiple envelopes.
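The splitting behavior described above can be sketched as follows. This is an illustration only, not Diego's implementation: the function name `split_message` is hypothetical, and the 60 KiB limit is the approximate figure from the text rather than an exact threshold.

```python
# Approximate per-envelope limit taken from the text (60 KiB).
MAX_CHUNK_BYTES = 60 * 1024


def split_message(message, max_bytes=MAX_CHUNK_BYTES):
    """Split a bytes payload into consecutive chunks of at most max_bytes.

    An empty payload yields an empty list.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return [
        message[start:start + max_bytes]
        for start in range(0, len(message), max_bytes)
    ]


if __name__ == "__main__":
    payload = b"x" * (150 * 1024)  # a 150 KiB log message
    chunks = split_message(payload)
    print([len(chunk) for chunk in chunks])
```

A consumer that needs the original message must reassemble the pieces itself, since each chunk arrives as a separate envelope.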