Logs, Metrics, and Nozzles

This topic explains how to integrate PCF services with Cloud Foundry’s logging system, the Loggregator, by writing to and reading from its Firehose endpoint.

Overview

Cloud Foundry’s Loggregator logging system collects logs and metrics from PCF apps and platform components and streams them to a single endpoint, the Firehose. Your tile can integrate its service with the Loggregator system in two ways:

  • By sending your service component logs and metrics to the Firehose, to be streamed along with PCF core platform component logs and metrics.

  • By installing a nozzle on the Firehose that directs Firehose data to be consumed by external services or apps. A built-in nozzle can enable a service to:

    • Drain metrics to an external dashboard product, for system operators
    • Send HTTP request details to search or analysis tools
    • Drain app logs to an external system
    • Auto-scale itself based on Firehose metrics

Firehose-to-syslog is a real-world example of a nozzle that is used in production.

Firehose Communication

PCF components publish logs and metrics to the Firehose through Metron agent processes that run locally on the component VMs. Metron agents input the data to the Loggregator system by writing it to Loggregator’s etcd key-value store via a gRPC proxy. The topic Overview of the Loggregator System shows how logs and metrics travel from PCF system components to the Firehose.

Component VMs running PCF services can publish logs and metrics the same way, by including a Metron agent that writes to etcd. In PCF v1.10 and higher, components communicate with etcd only over encrypted HTTPS. Earlier versions of PCF allow both encrypted HTTPS and unencrypted HTTP communication with etcd.

Secure HTTPS Protocol: PCF 1.10+

To enable a service component to supply logs and metrics to the Firehose through encrypted communications, you need to include a Metron agent and a Consul agent in its template definitions.

The Metron definition includes double-paren properties defining a keypair for accessing etcd. The Consul definition includes double-paren properties for securely looking up the internal IP addresses of the etcd nodes at cf-etcd.service.cf.internal. This avoids hard-coding any etcd server addresses.

For example:

name: service
label: Service
templates:
  - name: consul
    release: consul
  - name: metron_agent
    release: loggregator
  - name: service
    release: service
manifest: |
  metron_agent:
    deployment: cf-my-service
    etcd:
      client_cert: (( ..cf.properties.cf_etcd_client_cert.cert_pem ))
      client_key: (( ..cf.properties.cf_etcd_client_cert.private_key_pem ))
  metron_endpoint:
    shared_secret: (( ..cf.doppler.shared_secret_credentials.password ))
  loggregator:
    etcd:
      require_ssl: true
      machines: ['cf-etcd.service.cf.internal']
      ca_cert: (( $ops_manager.ca_certificate ))
  consul:
    encrypt_keys:
    - (( ..cf.properties.consul_encrypt_key.value ))
    ca_cert: (( $ops_manager.ca_certificate ))
    agent_cert: (( ..cf.properties.consul_agent_cert.cert_pem ))
    agent_key: (( ..cf.properties.consul_agent_cert.private_key_pem ))
    agent:
      domain: cf.internal
      servers:
        lan: (( ..cf.consul_server.ips ))

Metron versions v72 and later do not use etcd to communicate with Loggregator, but the configuration above works with any version of Metron. If the Metron agent does not need values for etcd, it safely ignores them.

HTTP Protocol: PCF 1.9 and Earlier

In PCF v1.9, service components can send logs and metrics to the Firehose encrypted or unencrypted. In v1.8 and earlier releases, components only communicate their log and metrics data unencrypted.

To enable unencrypted communications with etcd, define a Metron agent and list the addresses of the etcd servers in the template definitions as follows:

name: service
label: Service
templates:
  - name: metron_agent
    release: loggregator
  - name: service
    release: service
manifest: |
  metron_agent:
    deployment: cf-my-service
  metron_endpoint:
    shared_secret: (( ..cf.doppler.shared_secret_credentials.password ))
  loggregator:
    etcd:
      machines: (( ..cf.etcd_server.ips ))

Nozzles

A nozzle is a component dedicated to reading and processing data that streams from the Firehose. A service tile can install a nozzle either as a managed service, with the package type bosh-release, or as an app pushed to Elastic Runtime, with the package type app.

Develop a Nozzle

Pivotal recommends developing a nozzle in Go, to leverage the NOAA library. NOAA does the heavy lifting of establishing an authenticated websocket connection to the logging system as well as de-serializing the protocol buffers.

Draining the logs consists of:

  1. Authenticating
  2. Establishing a connection to the logging system
  3. Forwarding events on to their ultimate destination

Authenticate against the API (https://github.com/cloudfoundry-community/go-cfclient) with a user in the doppler.firehose group:

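import (
  "log"

  "github.com/cloudfoundry-community/go-cfclient"
)

...

// The user must belong to the doppler.firehose group.
config := &cfclient.Config{
  ApiAddress:        apiUrl,
  Username:          username,
  Password:          password,
  SkipSslValidation: sslSkipVerify,
}

client, err := cfclient.NewClient(config)
if err != nil {
  log.Fatalf("could not create the Cloud Foundry client: %v", err)
}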

Using the client’s token, create a consumer and connect to the Firehose with a subscription ID. The ID is important, since the Firehose looks for connections that share the same ID and sends each event to only one of those connections. This is how a nozzle developer can prevent message loss during upgrades and other deployments: run at least two instances with the same subscription ID.

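token, err := client.GetToken()
if err != nil {
  log.Fatalf("could not obtain an auth token: %v", err)
}

// config here stands for the nozzle's own settings (traffic controller URL, TLS options).
// The nil argument means no proxy function is used.
firehoseConsumer := consumer.New(config.TrafficControllerURL, &tls.Config{
  InsecureSkipVerify: config.SkipSSL,
}, nil)
events, errors := firehoseConsumer.Firehose(firehoseSubscriptionID, token)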

The Firehose call returns two channels: one for events and one for errors.
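A nozzle typically reads both channels in a single loop. The following is a minimal sketch of that pattern, not code from NOAA: the channels carry plain strings and errors as stand-ins for real Firehose envelopes, and the drain function name is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// drain reads from both channels until the events channel is closed.
// It returns the events it saw and the errors it received before stopping.
// The string payload is a stand-in for a real Firehose envelope.
func drain(events <-chan string, errs <-chan error) (seen []string, failures []error) {
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return seen, failures // events channel closed: stop draining
			}
			seen = append(seen, ev)
		case err, ok := <-errs:
			if !ok {
				errs = nil // a nil channel is never selected again
				continue
			}
			failures = append(failures, err)
		}
	}
}

func main() {
	events := make(chan string, 2)
	errs := make(chan error, 1)
	events <- "LogMessage"
	events <- "ValueMetric"
	errs <- errors.New("connection reset")
	close(events)

	seen, failures := drain(events, errs)
	fmt.Println(len(seen), "events,", len(failures), "errors seen before shutdown")
}
```

Because select chooses at random among ready cases, the error may or may not be read before the closed events channel ends the loop. A production nozzle would more likely log each error as it arrives and decide whether to reconnect.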

The events channel receives six different types of events.

  • ValueMetric: Some platform metric at a point in time, emitted by platform components. For example, a Diego cell’s remaining memory capacity.
  • CounterEvent: An incrementing counter, emitted by platform components. For example, how many 2xx responses the router has sent out.
  • Error: An error.
  • HttpStartStop: HTTP request details, including both application and platform requests.
  • LogMessage: A log message for an individual app.
  • ContainerMetric: Application container information. For example, memory used.

For the full details on events, see the dropsonde protocol.

The above events show how this data targets two different personae: platform operators and application developers. Keep this in mind when designing an integration.
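As a rough illustration of routing by persona, the sketch below maps each event type to an audience. The EventType constants are local stand-ins for the dropsonde event types, and the grouping itself (in particular, treating Error and HttpStartStop as relevant to both personae) is an assumption made for the example, not a rule defined by the platform.

```go
package main

import "fmt"

// EventType is a local stand-in for the event type enum in the dropsonde protocol.
type EventType string

const (
	ValueMetric     EventType = "ValueMetric"
	CounterEvent    EventType = "CounterEvent"
	Error           EventType = "Error"
	HttpStartStop   EventType = "HttpStartStop"
	LogMessage      EventType = "LogMessage"
	ContainerMetric EventType = "ContainerMetric"
)

// audience reports which persona an event type mainly serves.
// The grouping is an illustrative assumption.
func audience(t EventType) string {
	switch t {
	case ValueMetric, CounterEvent:
		return "operator" // emitted by platform components
	case LogMessage, ContainerMetric:
		return "developer" // scoped to an individual app
	case HttpStartStop, Error:
		return "both" // covers application and platform activity
	default:
		return "unknown"
	}
}

func main() {
	stream := []EventType{LogMessage, ValueMetric, HttpStartStop, CounterEvent, ContainerMetric}
	counts := map[string]int{}
	for _, t := range stream {
		counts[audience(t)]++
	}
	fmt.Println(counts)
}
```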

The doppler.firehose scope gives a nozzle data for every application as well as for the platform. Any filtering based on the event payload is the nozzle implementor’s responsibility. An advanced integration could, for example, combine a service broker with a nozzle to:

  • Let application developers opt-in to logging (implementing filtering in the nozzle)
  • Establish SSO exchange for authentication such that developers can access logs only for their space’s apps

For a full working example (suitable as an integration starting point), see firehose-nozzle.

Deploy a Nozzle

Once you’ve built a nozzle, you can deploy it as either a managed service or as an app.

As a Managed Service

Visit managed service for more details on what it means to be a managed service.

See also this example nozzle BOSH release.

As an App

You can also deploy the nozzle as an app on Elastic Runtime. Visit the Tile Generator’s section on pushed applications for more details.

Example Nozzles

There are several open source examples you can use as a reference for building your nozzle:

firehose-nozzle

  • Example that simply writes to standard out
  • Useful starting point: scaffolding, tests, etc. are in place

example-nozzle

  • A single file implementation with no tests: as minimal as things can get

gcp-tools-release

  • In addition to Nozzle data, it drains component syslogs and health data
  • Shows how to do a bosh-addon (for additional data outside a nozzle)
  • Nozzle is managed via bosh
  • Raw logs and metrics data take different paths in the source

firehose-to-syslog

  • Includes implementation code that adds metadata (potentially needed for ACLs)
    • Application name
    • Space guid & name
    • Org guid & name
  • logsearch-for-cloudfoundry packages this nozzle as a BOSH release

splunk-firehose-nozzle and splunk-firehose-nozzle-release

  • Source code based on firehose-to-syslog
  • Packaged as a BOSH release

datadog-firehose-nozzle

  • Another real-world implementation

Log Format for PCF Components

Pivotal’s standard log format adheres to the RFC-5424 syslog protocol, with log messages formatted as follows:

<${PRI}>${VERSION} ${TIMESTAMP} ${HOST_IP} ${APP_NAME} ${PROD_ID} ${MSG_ID} ${SD-ELEMENT-instance} ${MESSAGE}

The Syslog Message Elements table immediately below describes each element of the log, and the Structured Instance Data Format table describes the contents of the structured data element that carries Cloud Foundry VM instance information.

Syslog Message Elements

This table describes each element of a standard PCF syslog message.

Syslog Message Element Meaning or Value
${PRI}

Priority value (PRI), calculated as 8 × Facility Code + Severity Code

Pivotal uses a Facility Code value of 1, indicating a user-level facility. This adds 8 to the RFC-5424 Severity Codes, resulting in the numbers listed in the table below.

If in doubt, default to 13, to indicate Notice-level severity.

${VERSION} 1
${TIMESTAMP}

The timestamp of when the log message is forwarded; typically slightly after it was generated. Example: 2017-07-24T05:14:15.000003Z

${HOST_IP} Internal IP address of origin server
${APP_NAME}

Process name of the program that generated the message. Prefixed with vcap. For example:

  • vcap.rep
  • vcap.garden
  • vcap.cloud_controller_ng

You can derive this process name from either the program name configured for the local Metron agent or the :progname that blackbox derives from the folder that syslog-release forwards logs into.

${PROD_ID} The Process ID of the syslog process doing the forwarding. If this is not easily available, default to - (hyphen) to indicate unknown.
${MSG_ID} The type of log message. If this is not easily available, default to - (hyphen) to indicate unknown.
${SD-ELEMENT-instance} Structured data (SD) relevant to PCF about the source instance (VM) that originates the log message. See the Structured Instance Data Format table below for content and format.
${MESSAGE} The log message itself, ideally in JSON

RFC-5424 Severity Codes

PCF components generate log messages with the following severity levels. The most common severity level is 13.

Severity Code    Meaning
8                Emergency: system is unusable
9                Alert: action must be taken immediately
10               Critical: critical conditions
11               Error: error conditions
12               Warning: warning conditions
13               Notice: normal but significant condition
14               Informational: informational messages
15               Debug: debug-level messages
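The codes in this table follow from the PRI formula with a Facility Code of 1. The sketch below works through the arithmetic; the priority function and the userFacility constant are illustrative names, not part of any PCF library.

```go
package main

import "fmt"

// userFacility is the Facility Code that the text says Pivotal uses (user-level).
const userFacility = 1

// priority computes PRI = 8 × Facility Code + Severity Code.
func priority(facility, severity int) int {
	return 8*facility + severity
}

func main() {
	// RFC 5424 severity names, indexed by their severity code (0 to 7).
	names := []string{
		"Emergency", "Alert", "Critical", "Error",
		"Warning", "Notice", "Informational", "Debug",
	}
	for severity, name := range names {
		fmt.Printf("%-13s severity=%d PRI=%d\n", name, severity, priority(userFacility, severity))
	}
}
```

For example, Notice has RFC 5424 severity 5, so its PRI is 8 × 1 + 5 = 13.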

Structured Instance Data Format

The RFC-5424 syslog protocol includes a structured data element that people can use as they see fit. Pivotal uses this element to carry VM instance information as follows:

SD-ELEMENT-instance element    Meaning
${ENTERPRISE_ID}               Your Enterprise Number, as listed by the Internet Assigned Numbers Authority (IANA)
${DIRECTOR}                    The BOSH director managing the deployment.
${DEPLOYMENT}                  BOSH spec.deployment value
${INSTANCE_GROUP}              BOSH instance_group, currently spec.job.name
${AVAILABILITY_ZONE}           BOSH spec.az value
${ID}                          BOSH spec.id value. This is a GUID, not an index. Necessary because BOSH Availability Zone index values are not always unique or sequential.

Making Sense of Metrics

Monitoring Pivotal Cloud Foundry has a great rundown of the various metrics and how to make them useful.
