
The Pivotal Cloud Ops Approach to Monitoring a Pivotal Cloud Foundry Deployment


The Pivotal Cloud Ops team monitors the health of its Cloud Foundry deployments using a customized Datadog dashboard. This topic describes each of the key metrics as they are rendered in the custom dashboard, and why the Cloud Ops team uses them for monitoring the health of a Cloud Foundry deployment.

Note: Pivotal does not officially support Datadog.

Cloud Ops’ practices are tailored to the specific details of the Cloud Foundry deployments they operate. Therefore, the descriptions here are meant to be informative examples rather than general prescriptions. Pivotal recommends that operators experiment with different combinations of metrics and alerts appropriate to their specific requirements.

The Cloud Ops team’s custom configuration of Datadog’s dashboards, alerts, and screenboards can be found in the Datadog Config repository.

Dashboard

BOSH Health Monitor


What we monitor: Health, broken down by component. Each row displays the average percentage of healthy instances for the relevant component over the last 5 minutes and over the last 24 hours.

For example, suppose that your Router has ten instances. If one instance becomes unhealthy, the stoplight turns red and shows 90%. (A sketch of this calculation appears after this table.)

We monitor health for the following components:

  • NATS
  • Doppler
  • Stats
  • HM9000
  • BOSH
  • NAT Box
  • ETCD
  • Router
  • API
  • UAA
Why we monitor it: To ensure that all VMs are functioning properly.
System metric: bosh.healthmonitor.system.healthy
Alerts triggered: None
Notes: Alerts generated from this metric are passed to a buffer queue in our alerting system, PagerDuty. Because BOSH restores systems quickly if they fail, we wait two minutes before forwarding any unresolved alerts to our operators.
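
The sketch below shows how the stoplight percentage in the example above could be derived from per-instance health values. The instance names and data are hypothetical, not actual BOSH Health Monitor output.

```python
# Minimal sketch: derive the "percent healthy" stoplight value from
# per-instance health samples. Instance names and values are hypothetical.

def percent_healthy(instance_health):
    """instance_health maps an instance name to True (healthy) or False."""
    if not instance_health:
        return 0.0
    healthy = sum(1 for ok in instance_health.values() if ok)
    return 100.0 * healthy / len(instance_health)

# Example from the text: ten Router instances, one unhealthy -> 90%.
router_instances = {"router/{}".format(i): True for i in range(10)}
router_instances["router/3"] = False
print(percent_healthy(router_instances))  # 90.0
```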

Requests per Second

What we monitor: Requests per second for each of the following components (a hedged query sketch follows this table):

  • Router
  • API
  • UAA
Why we monitor it: To track the flow of traffic through the components in the system.
System metric: cf.collector.router.requests (component: app/cloudcontroller/uaa)
Alerts triggered: None
Notes: None
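
For reference, here is a hedged sketch of pulling this metric with the official Datadog Python client. The query string, tag, and five-minute window are illustrative assumptions; adjust them to your own deployment and credentials.

```python
# Hedged sketch: query the router requests metric from the Datadog API for
# the last five minutes. The query string and tag are assumptions.
import os
import time

from datadog import api, initialize

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

now = int(time.time())
result = api.Metric.query(
    start=now - 300,
    end=now,
    query="avg:cf.collector.router.requests{component:cloudcontroller}",
)
for series in result.get("series", []):
    # Each point is a [timestamp, value] pair; print the most recent one.
    print(series["metric"], series["pointlist"][-1])
```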

NATS Traffic Delta

What we monitor: Delta of average NATS traffic over the last hour. The displayed metric is the difference between the average NATS traffic over the last 30 minutes and the average NATS traffic over the interval from 90 to 60 minutes prior (sketched below).
Why we monitor it: To detect significant drops in NATS traffic. A sudden drop might indicate a problem with the health of the NATS VMs.
System metric: aws.ec2.network_in
Alerts triggered: None
Notes: None
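
A minimal sketch of the delta calculation described above, assuming `samples` is a list of (unix timestamp, bytes-in) points retrieved elsewhere, for example from aws.ec2.network_in:

```python
# Minimal sketch: average NATS traffic over the last 30 minutes minus the
# average over the window from 90 to 60 minutes ago.
import time

def nats_traffic_delta(samples, now=None):
    now = now if now is not None else time.time()
    recent = [v for t, v in samples if now - 1800 <= t <= now]
    baseline = [v for t, v in samples if now - 5400 <= t <= now - 3600]
    if not recent or not baseline:
        return None  # not enough data to compare the two windows
    return sum(recent) / len(recent) - sum(baseline) / len(baseline)
```

A strongly negative delta corresponds to the sudden drop described above.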

ETCD Leader Uptime

What we monitor: Time since the ETCD leader was last down (a tracking sketch appears below).
Why we monitor it: When the ETCD leader goes down, it usually indicates a push failure.
System metric: cloudops_tools.etcd_leader_health
Alerts triggered: None
Notes: The cloudops_tools metrics are generated by an internal app that the Pivotal Cloud Ops team developed. These metrics are not available on other Cloud Foundry deployments.
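
Because the cloudops_tools app is internal, the sketch below only illustrates the general idea: track how long the current etcd leader has been in place and reset the clock when leadership changes. How you discover the current leader is omitted and depends on your etcd version.

```python
# Sketch only: track how long the current etcd leader has been in place.
import time

class LeaderUptimeTracker:
    def __init__(self):
        self._leader = None
        self._since = time.time()

    def observe(self, current_leader):
        """Call periodically with the current leader's member ID."""
        if current_leader != self._leader:
            # Leadership changed (or this is the first observation): reset.
            self._leader = current_leader
            self._since = time.time()
        return time.time() - self._since  # seconds of uninterrupted leadership
```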

SSH Attempts

What we monitor: Total SSH attempts. We log the count of connection attempts to our systems on the SSH port (port 22).
Why we monitor it: A spike in SSH attempts is a good indicator of SSH-cracking attacks.
System metric: cloudops_tools.ssh-abuse-monitor
Alerts triggered: None
Notes:
  • Diego cells send their iptables logs to Logsearch. A Cloud Ops internal app polls Logsearch for first packets and pushes the count to Datadog (a sketch of the counting step appears below).
  • The cloudops_tools metrics are generated by an internal app that the Pivotal Cloud Ops team developed. These metrics are not available on other Cloud Foundry deployments.
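
A minimal sketch of the counting step, assuming iptables log lines in the common `DPT=22 ... SYN` form; the real pipeline queries Logsearch rather than reading a local file.

```python
# Hedged sketch: tally iptables log lines that record a first packet
# (TCP SYN) to port 22. The log format and file path are assumptions.
import re

SYN_TO_SSH = re.compile(r"DPT=22\b.*\bSYN\b")

def count_ssh_attempts(log_lines):
    return sum(1 for line in log_lines if SYN_TO_SSH.search(line))

# Hypothetical usage:
# with open("/var/log/kern.log") as f:
#     print(count_ssh_attempts(f))
```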

The Router Status Column


App Instance Count

What we monitor: Count of running app instances.
Why we monitor it: Unexpected large fluctuations in app count can indicate malicious user behavior or Cloud Foundry component issues.
System metric: avg:cf.collector.HM9000.HM9000.NumberOfAppsWithAllInstancesReporting
Alerts triggered: running app number change rate (a change-rate check is sketched below)
Notes: Spikes in this metric might indicate the need to add more resources.
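
A minimal sketch of a change-rate check in the spirit of the alert above; the 20% threshold is a hypothetical value, not the one Cloud Ops uses.

```python
# Flag any sample whose count differs from the previous sample by more than
# a chosen percentage. Counts and threshold are hypothetical.

def change_rate_alerts(counts, threshold_pct=20.0):
    alerts = []
    for prev, curr in zip(counts, counts[1:]):
        if prev == 0:
            continue  # avoid dividing by zero on an empty platform
        change = 100.0 * abs(curr - prev) / prev
        if change > threshold_pct:
            alerts.append((prev, curr, round(change, 1)))
    return alerts

print(change_rate_alerts([4000, 4050, 3000]))  # [(4050, 3000, 25.9)] -> large drop
```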

Total Routes

What we monitor: Route count from the router, indicated as a delta over the last N minutes.
Why we monitor it: The count on all routers should be the same. If this count differs between routers, it usually indicates a NATS problem (a consistency check is sketched below).
System metric: cf.collector.router.total_routes
Alerts triggered: prod CF: Number of routes in the router’s routing table is too low
Notes: The router is the only point of access into all Cloud Foundry components and customer apps. Large spikes in this graph typically indicate a problem and could indicate a denial-of-service attack. For example, if the router goes down or does not have routes, the system is down and a large dip appears in the graph. However, some large spikes, such as those that occur during a marketing event, are expected. Small fluctuations are not reflected on the graph.
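
A minimal sketch of the cross-router consistency check: every router should report the same total_routes value, so any router that lags the others is worth investigating. Router names and counts are hypothetical.

```python
# Compare the route count each router reports against the highest value seen.

def routers_out_of_sync(route_counts, tolerance=0):
    """route_counts maps a router instance to the total_routes it reports."""
    if not route_counts:
        return []
    expected = max(route_counts.values())
    return [name for name, count in route_counts.items()
            if abs(count - expected) > tolerance]

print(routers_out_of_sync({"router/0": 1200, "router/1": 1200, "router/2": 950}))
# ['router/2'] -> investigate NATS connectivity on that router
```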

Router Dial Errors

What we monitor: Separate indicators monitor 5xx codes from the routers to backend CF components and user apps, respectively.
Why we monitor it: Indicates failures connecting to components.
System metric: avg:cloudops_tools.app_instance_monitor.router.dial.errors{domain:run.pivotal.io} / avg:cloudops_tools.app_instance_monitor.router.dial.errors{cf_component:false}
Alerts triggered:
  • No data for router dial errors
  • Router dial errors for console.run.pivotal.io
  • Too many router dial errors for cf components
Notes:
  • We investigate dial errors to admin domain apps, the Cloud Controller, UAA, Dopplers, and any other BOSH-deployed Cloud Foundry component. We expect dial errors from our large population of customer apps (4000+). 502s occur when customers push flawed apps or run dev iterations. 5xx messages in the 500/10 min range are normal. If this number jumped to 1000+/10 min, we would investigate. (These thresholds are sketched below.)
  • The cloudops_tools metrics are generated by an internal app that the Pivotal Cloud Ops team developed. These metrics are not available on other Cloud Foundry deployments.
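
The thresholds from the first note, expressed as a simple check. Both numbers come from the prose above and reflect Cloud Ops' traffic levels, so treat them as examples rather than recommendations.

```python
# Decide whether a 10-minute dial-error count warrants investigation.

def should_investigate(errors_last_10_min, is_cf_component):
    if is_cf_component:
        # Any dial error to a BOSH-deployed CF component gets investigated.
        return errors_last_10_min > 0
    # Customer apps: ~500/10 min is normal; 1000+/10 min is not.
    return errors_last_10_min >= 1000

print(should_investigate(3, is_cf_component=True))     # True
print(should_investigate(600, is_cf_component=False))  # False: normal range
```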

Router CPU

What we monitor: OS-level CPU usage.
Why we monitor it: Routers are multi-threaded and consume a large number of CPU cycles. If the routers are using too much CPU, we use BOSH to scale them (a scale-up check is sketched below).
System metric: bosh.healthmonitor.system.cpu.user{deployment:cf-cfapps-io2,job:}
Alerts triggered: None
Notes: In general, we add routers whenever doing so might resolve issues.
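
A minimal sketch of that scale-up decision. The 70% threshold and the one-instance step are hypothetical; the actual change is made by editing the BOSH manifest and redeploying.

```python
# If average user CPU across the routers stays above a threshold, add an instance.

def recommended_router_count(current_count, avg_cpu_user_pct, threshold=70.0):
    if avg_cpu_user_pct > threshold:
        return current_count + 1  # then update the manifest and run `bosh deploy`
    return current_count

print(recommended_router_count(4, 82.5))  # 5
```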

AWS Events

What we monitor: The feed of AWS EC2 events.
Why we monitor it: This feed contains important or critical information from our IaaS about virtual machines, RDS, and other resources (a hedged example of pulling scheduled EC2 events appears below).
System metric: N/A
Alerts triggered: None
Notes: Only applies to Cloud Foundry deployments on AWS.
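
For deployments on AWS, here is a hedged sketch of pulling scheduled EC2 events with boto3. It assumes AWS credentials are already configured and uses the DescribeInstanceStatus response fields.

```python
# Hedged sketch: list scheduled EC2 events (reboots, retirements, and so on).
import boto3

ec2 = boto3.client("ec2")
response = ec2.describe_instance_status(IncludeAllInstances=True)
for status in response["InstanceStatuses"]:
    for event in status.get("Events", []):
        print(status["InstanceId"], event["Code"], event.get("NotBefore"))
```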