Scaling Cloud Controller
This topic describes how and when to scale BOSH jobs in CAPI, and includes details about some key metrics, heuristics, and logs.
cloud_controller_ng
The cloud_controller_ng Ruby process is the primary job in CAPI. It, along with nginx_cc, powers the Cloud Controller API that all users of VMware Tanzu Application Service for VMs (TAS for VMs) interact with. In addition to serving external clients, cloud_controller_ng also provides APIs for internal components within TAS for VMs, such as Loggregator and Networking subsystems.
Note: Running bosh instances --vitals returns CPU values. The CPU User value corresponds with the system.cpu.user metric and is scaled by the number of CPUs. For example, on a 4-core api VM, a cloud_controller_ng process that is using 100% of a core is listed as using 25% in the system.cpu.user metric.
When to Scale
When determining whether to scale cloud_controller_ng, look for the following:
Key Metrics
Cloud Controller emits the following metrics:
- cc.requests.outstanding is at or consistently near 20.
- system.cpu.user is above 0.85 utilization of a single core on the API VM.
- cc.vitals.cpu_load_avg is 1 or higher.
- cc.vitals.uptime is consistently low, indicating frequent restarts, possibly due to memory pressure.
Heuristic Failures
The following behaviors may occur:
- Average response latency is elevated.
- Web UI responsiveness is degraded or requests time out.
- bosh instances --ps --vitals shows elevated CPU usage for the cloud_controller_ng job in the API instance group.
Relevant Log Files
You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log
/var/vcap/sys/log/cloud_controller_ng/nginx-access.log
How to Scale
Before and after scaling Cloud Controller API VMs, verify that the Cloud Controller database is not overloaded. All Cloud Controller processes are backed by the same database, so heavy load on the database impacts API performance regardless of the number of Cloud Controllers deployed. Cloud Controller supports both PostgreSQL and MySQL, so there is no specific scaling guidance for the database.
In TAS for VMs deployments with internal MySQL clusters, a single MySQL database VM with CPU usage over 80% can be considered overloaded. When this happens, the MySQL VMs must be scaled up to prevent the added load of additional Cloud Controllers from exacerbating the issue.
Cloud Controller API VMs should primarily be scaled horizontally. Scaling up the number of cores on a single VM is not effective, because Ruby's Global Interpreter Lock (GIL) limits the cloud_controller_ng process so that it can only effectively use a single CPU core on a multi-core machine.
Note: Since Cloud Controller supports both PostgreSQL and MySQL external databases, there is no absolute guidance on what a healthy database looks like. In general, high database CPU utilization is a good indicator of scaling issues, but always defer to the documentation specific to your database.
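If you manage an open source Cloud Foundry deployment directly with BOSH, a minimal ops file such as the sketch below can add API VMs; the instance group name api and the target count of 6 are assumptions, so adapt them to your manifest. In TAS for VMs, you scale these VMs from the Resource Config pane in Ops Manager instead.
```
# Hypothetical ops file: scale the Cloud Controller API instance group horizontally.
# Assumes the cf-deployment instance group name "api"; verify the name in your manifest.
- type: replace
  path: /instance_groups/name=api/instances
  value: 6
```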
cloud_controller_worker_local
This job, also called "local workers", is primarily responsible for handling files uploaded to the API VMs during cf push, such as packages and droplets, as well as resource matching.
When to Scale
When determining whether to scale cloud_controller_worker_local, look for the following:
Key Metrics
Cloud Controller emits the following metrics:
- cc.job_queue_length.cc-VM_NAME-VM_INDEX is continuously growing.
- cc.job_queue_length.total is continuously growing.
Heuristic Failures
The following behaviors may occur:
- cf push is intermittently failing.
- cf push average time is elevated.
Relevant Log Files
You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log
How to Scale
Because local workers are colocated with the Cloud Controller API job, they are scaled horizontally along with the API.
cloud_controller_worker
Colloquially known as "generic workers" or just "workers", this job and VM are responsible for handling asynchronous work, batch deletes, and other periodic tasks scheduled by the cloud_controller_clock.
When to Scale
When determining whether to scale cloud_controller_worker, look for the following:
Key Metrics
Cloud Controller emits the following metrics:
- cc.job_queue_length.cc-VM_TYPE-VM_INDEX is continuously growing. For example, cc.job_queue_length.cc-cc-worker-0.
- cc.job_queue_length.total is continuously growing.
Heuristic Failures
The following behaviors may occur:
- cf delete-org ORG_NAME appears to leave its contained resources around for a long time.
- Users report slow deletes for other resources.
- cf-acceptance-tests succeed generally, but fail during cleanup.
Relevant Log Files
You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_worker/cloud_controller_worker.log
How to Scale
The cc-worker VM can safely scale horizontally in all deployments, but if your worker VMs have CPU and memory headroom, you can also use the cc.jobs.generic.number_of_workers BOSH property to increase the number of worker processes on each VM.
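For example, in a BOSH-managed cf-deployment you could set this property with an ops file similar to the following sketch; the cc-worker instance group name, the cloud_controller_worker job name, and the worker count of 4 are assumptions to verify against your own manifest.
```
# Hypothetical ops file: run more generic worker processes on each cc-worker VM.
# Instance group and job names follow common cf-deployment conventions; verify them before deploying.
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc?/jobs/generic/number_of_workers
  value: 4
```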
cloud_controller_clock and cc_deployment_updater
The cloud_controller_clock job runs the Diego sync process and schedules periodic background jobs. The cc_deployment_updater job is responsible for handling v3 rolling app deployments. For more information, see Rolling App Deployments.
Note: Running bosh instances --vitals returns CPU values. The CPU User value corresponds with the system.cpu.user metric and is scaled by the number of CPUs. For example, on a 4-core api VM, a cloud_controller_ng process that is using 100% of a core is listed as using 25% in the system.cpu.user metric.
When to Scale
When determining whether to scale cloud_controller_clock and cc_deployment_updater, look for the following:
Key Metrics
Cloud Controller emits the following metrics:
- cc.Diego_sync.duration is continuously increasing over time.
- system.cpu.user is high on the scheduler VM.
Heuristic Failures
The following behaviors may occur:
- Diego domains are frequently unfresh. For more information, see Domain Freshness in Overview of Domains in the BBS Server repository on GitHub.
- The Diego Desired LRP count is larger than the total process instance count reported through the Cloud Controller APIs.
- Deployments are slow to increase and decrease instance count.
Relevant Log Files
You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_clock/cloud_controller_clock.log
/var/vcap/sys/log/cc_deployment_updater/cc_deployment_updater.log
How to Scale
Both of these jobs are singletons, so extra instances are for failover HA rather than scalability. Performance issues are likely due to database overloading or greedy neighbors on the scheduler VM.
blobstore_nginx
The internal WebDAV blobstore comes included with TAS for VMs by default. The platform uses it to store packages, staged droplets, buildpacks, and cached app resources. Files are typically uploaded to the internal blobstore through the Cloud Controller local workers and downloaded by Diego when app instances are started.
When to Scale
When determining whether to scale blobstore_nginx, look for the following:
Key Metrics
Cloud Controller emits the following metrics:
- system.cpu.user is consistently high on the singleton-blobstore VM.
- system.disk.persistent.percent is high, indicating that the blobstore is running out of room for additional files.
Heuristic Failures
The following behaviors may occur:
- cf push is intermittently failing.
- cf push average time is elevated.
- App droplet downloads are timing out or failing on Diego.
Relevant Log Files
You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/blobstore/internal_access.log
How to Scale
The internal WebDAV blobstore cannot be scaled horizontally, not even for availability purposes, because it relies on the singleton-blobstore VM's persistent disk for file storage. For this reason, it is not recommended for environments that require high availability. For these environments, use an external blobstore instead. For more information, see Cloud Controller Blobstore Configuration in the open source Cloud Foundry documentation and Blob Storage in the High Availability in TAS for VMs topic.
The internal WebDAV blobstore can be scaled vertically: increasing the number of CPUs or adding faster disk storage can improve its performance under high load.
High numbers of concurrent app container starts on Diego can put stress on the blobstore. This typically happens during upgrades in environments with a large number of apps and Diego Cells. If vertically scaling the blobstore or improving its disk performance is not an option, limiting the maximum number of concurrent app container starts can mitigate the issue. For more information, see starting_container_count_maximum in the auctioneer job in the BOSH documentation.
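As a sketch, in a BOSH-managed deployment this cap could be applied with an ops file like the one below; the scheduler instance group name, the auctioneer job name, and the value of 100 are assumptions based on common cf-deployment layouts, so verify them against your own manifest.
```
# Hypothetical ops file: cap concurrent app container starts across Diego.
# Assumes the auctioneer job runs on the "scheduler" instance group, as in cf-deployment.
- type: replace
  path: /instance_groups/name=scheduler/jobs/name=auctioneer/properties/diego?/auctioneer/starting_container_count_maximum
  value: 100
```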