High Availability in PAS
This topic describes the components that ensure high availability in Pivotal Application Service (PAS), how to scale those components vertically and horizontally, and the infrastructure required to support scaling component VMs for high availability.
This section describes the system components needed to ensure high availability.
During product updates and platform upgrades, the VMs in a deployment restart in succession, rendering them temporarily unavailable. During outages, VMs go down in a less orderly way. Spreading components across Availability Zones (AZs) and scaling them to a sufficient level of redundancy maintains high availability during both upgrades and outages and can ensure zero downtime.
Deploying PAS across three or more AZs and assigning multiple component instances to different AZ locations lets a deployment operate uninterrupted when entire AZs become unavailable. PAS maintains its availability as long as a majority of the AZs remain accessible. For example, a three-AZ deployment stays up when one entire AZ goes down, and a five-AZ deployment can withstand an outage of up to two AZs with no impact on uptime.
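The majority rule can be checked with simple arithmetic. The following Python sketch assumes, as described above, that availability requires a strict majority of AZs. It computes how many whole AZs a deployment can lose. The function name is illustrative and is not part of PAS or BOSH.

```python
def tolerated_az_failures(az_count: int) -> int:
    """Return how many whole AZs can fail while a strict majority remains accessible.

    Illustrative only: assumes availability requires more than half of the AZs.
    """
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    majority = az_count // 2 + 1  # smallest number of AZs that is more than half
    return az_count - majority


for azs in (1, 2, 3, 4, 5):
    print(azs, "AZs tolerate", tolerated_az_failures(azs), "AZ failure(s)")
```

For three AZs the result is 1 and for five AZs it is 2, which matches the examples above. A four-AZ deployment still tolerates only one AZ failure, so a fourth AZ adds no fault tolerance over three.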
Production environments should use a highly available, customer-provided load balancing solution that does the following:
- Provides load balancing to each of the PAS Router IP addresses
- Supports SSL termination with wildcard DNS location
- Adds the appropriate X-Forwarded-For and X-Forwarded-Proto HTTP headers to incoming requests
- (Optional) Supports WebSockets
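The header requirement above can be illustrated with a small function. This Python sketch shows only what the two forwarding headers carry; it is not how any particular load balancer is implemented, and the function name is hypothetical. For simplicity it treats header names case-sensitively, although HTTP header names are case-insensitive.

```python
def add_forwarding_headers(headers: dict[str, str], client_ip: str, scheme: str) -> dict[str, str]:
    """Return a copy of the request headers with forwarding headers set, as a load balancer might."""
    updated = dict(headers)
    existing = updated.get("X-Forwarded-For")
    # Append the client address to any existing proxy chain instead of overwriting it.
    updated["X-Forwarded-For"] = f"{existing}, {client_ip}" if existing else client_ip
    # Record the protocol the client used before SSL was terminated at the load balancer.
    updated["X-Forwarded-Proto"] = scheme
    return updated


request_headers = {"Host": "app.example.com"}
forwarded = add_forwarding_headers(request_headers, "203.0.113.7", "https")
print(forwarded["X-Forwarded-For"], forwarded["X-Forwarded-Proto"])  # 203.0.113.7 https
```

Appending to an existing X-Forwarded-For value preserves the chain of addresses when a request passes through more than one proxy before reaching the Gorouter.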
If you are deploying in lab and test environments, the use-haproxy.yml ops file enables HAProxy for your foundation.
For more information, see Using Your Own Load Balancer.
For storing blobs (large binary files), the best approach for high availability is to use external storage such as Amazon S3 or an S3-compatible service.
If you store blobs internally using WebDAV or NFS, these components run as single instances and you cannot scale them. For these deployments, use the high availability features of your IaaS to immediately recover your WebDAV or NFS server VM if it fails. Contact Support if you need assistance.
The singleton compilation components do not affect platform availability.
You can scale platform capacity vertically by adding memory and disk, or horizontally by adding more VMs running instances of PAS components. The nature of the apps you host on PAS determines whether you should scale vertically or horizontally.
For more information about scaling applications and maintaining app uptime, see Scaling an Application Using cf scale and Using Blue-Green Deployment to Reduce Downtime and Risk.
Scaling vertically means adding memory and disk to your component VMs.
To scale vertically, ensure that you allocate and maintain enough of the following:
- Free space on host Diego cell VMs so that apps expected to deploy can successfully be staged and run.
- Disk space and memory in your deployment so that if one host VM is down, all app instances can be placed on the remaining host VMs.
- Free space to handle one AZ going down if deploying in multiple AZs.
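The headroom requirements above can be estimated with rough arithmetic. The Python sketch below assumes a hypothetical inventory of Diego Cells, each with an AZ and an amount of memory available for apps, plus a single total of app memory. It compares aggregate memory only and ignores disk, per-instance placement, and fragmentation, so treat the result as a first approximation rather than a placement guarantee.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A hypothetical Diego Cell record: name, AZ, and memory available for apps."""
    name: str
    az: str
    memory_gb: int


def headroom_check(cells: list[Cell], app_memory_gb: int) -> dict[str, bool]:
    """Check whether aggregate app memory still fits after losing one cell or one AZ.

    Compares totals only; it ignores disk, per-instance placement, and fragmentation.
    """
    total = sum(cell.memory_gb for cell in cells)
    largest_cell = max(cell.memory_gb for cell in cells)

    # Sum memory per AZ to find the AZ whose loss removes the most capacity.
    memory_per_az: dict[str, int] = {}
    for cell in cells:
        memory_per_az[cell.az] = memory_per_az.get(cell.az, 0) + cell.memory_gb
    largest_az = max(memory_per_az.values())

    return {
        "one_cell_down": total - largest_cell >= app_memory_gb,
        "one_az_down": total - largest_az >= app_memory_gb,
    }


# Six 32 GB cells, two in each of three AZs: 192 GB in total.
cells = [Cell(f"diego_cell-{az}-{i}", az, 32) for az in ("z1", "z2", "z3") for i in range(2)]
print(headroom_check(cells, app_memory_gb=140))  # {'one_cell_down': True, 'one_az_down': False}
```

In this example, 140 GB of app memory survives the loss of one cell (160 GB remain) but not the loss of a whole AZ (128 GB remain), which shows why multi-AZ deployments need more reserved capacity than single-host failure alone would suggest.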
Scaling horizontally means increasing the number of VM instances dedicated to running a functional component of the system.
You can horizontally scale most PAS components to multiple instances to achieve the redundancy required for high availability.
You should also distribute the instances of multiply-scaled components across different AZs to minimize downtime during ongoing operation, product updates, and platform upgrades. If you use more than three AZs, ensure that you use an odd number of AZs.
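Distributing instances evenly across AZs can be modeled as a round-robin assignment. This Python sketch only illustrates the even-spread idea and does not reproduce BOSH's own placement logic; the AZ names z1, z2, and z3 and the function names are hypothetical.

```python
def spread_across_azs(instance_count: int, azs: list[str]) -> dict[str, int]:
    """Assign instances to AZs round-robin so per-AZ counts differ by at most one."""
    if not azs:
        raise ValueError("at least one AZ is required")
    placement = {az: 0 for az in azs}
    for i in range(instance_count):
        placement[azs[i % len(azs)]] += 1
    return placement


def survivors_after_worst_az_loss(placement: dict[str, int]) -> int:
    """Return the number of instances left if the AZ holding the most instances fails."""
    return sum(placement.values()) - max(placement.values())


placement = spread_across_azs(4, ["z1", "z2", "z3"])
print(placement)                                 # {'z1': 2, 'z2': 1, 'z3': 1}
print(survivors_after_worst_az_loss(placement))  # 2
```

Because no AZ holds more than its share of instances, losing any single AZ leaves the largest possible number of instances running.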
For more information regarding rolling app deployments, see Scaling Instances in PAS.
The table below provides the instance counts Pivotal recommends for a high-availability deployment and the minimum instances for a functional deployment:
|Pivotal Application Service (PAS) Job|Recommended Instance Count for HA|Minimum Instance Count|Notes|
|---|---|---|---|
|Diego Cell|≥ 3|1|The optimal balance between CPU and memory sizing and instance count depends on the performance characteristics of the apps that run on Diego Cells. Scaling vertically with larger Diego Cells creates larger points of failure: more apps go down when a Diego Cell fails. On the other hand, scaling horizontally decreases the speed at which the system rebalances apps. Rebalancing 100 Diego Cells takes longer and demands more processing overhead than rebalancing 20 Diego Cells.|
|Diego Brain|≥ 2|1|For high availability, use at least one instance per AZ, or at least two if you have only one AZ.|
|Diego BBS|≥ 2|1|For high availability in a multi-AZ deployment, use at least one instance per AZ. Scale Diego BBS to at least two instances for high availability in a single-AZ deployment.|
|MySQL Server|3|1|If you use an external database in your deployment, you can set the MySQL Server instance count to 0.|
|MySQL Proxy|2|1|If you use an external database in your deployment, you can set the MySQL Proxy instance count to 0.|
|NATS Server|≥ 2|1|In a high-availability deployment, you might run a single NATS instance if your deployment lacks the resources to deploy two stable NATS servers. Components using NATS are resilient to message failures, and the BOSH Resurrector recovers the NATS VM quickly if it becomes non-responsive.|
|Cloud Controller|≥ 2|1|Scale the Cloud Controller to accommodate the number of requests to the API and the number of apps in the system.|
|Clock Global|≥ 2|1|For a high-availability deployment, scale the Clock Global job to a value greater than 1 or to the number of AZs you have.|
|Router|≥ 2|1|Scale the Gorouter to accommodate the number of incoming requests. Additional instances increase available bandwidth. In general, this load is much less than the load on Diego Cells.|
|Doppler Server|≥ 2|1|Deploying additional Doppler servers splits traffic across them. For a high-availability deployment, Pivotal recommends at least two per AZ.|
|Loggregator Traffic Controller|≥ 2|1|Deploying additional Loggregator Traffic Controllers lets you direct traffic to them in a round-robin manner. For a high-availability deployment, Pivotal recommends at least two per AZ.|
|Syslog Scheduler|≥ 2|1|The Syslog Scheduler is a scalable component. For high availability, use at least one instance per AZ, or at least two instances if only one AZ is present.|
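To compare a planned deployment against the table, the recommendations can be encoded as data. In this Python sketch, the per-AZ rules are one reading of the Notes column: at least one instance per AZ for Diego Brain, Diego BBS, Clock Global, and Syslog Scheduler, and at least two per AZ for Doppler Server and Loggregator Traffic Controller. The job keys, the external_db flag, and the function names are illustrative and do not correspond to Ops Manager resource names or any API. The sketch also flags a single NATS instance, even though the table notes that one can be acceptable when resources are limited.

```python
# Baseline recommended HA instance counts, transcribed from the table.
BASE_RECOMMENDED = {
    "Diego Cell": 3,
    "Diego Brain": 2,
    "Diego BBS": 2,
    "MySQL Server": 3,
    "MySQL Proxy": 2,
    "NATS Server": 2,
    "Cloud Controller": 2,
    "Clock Global": 2,
    "Router": 2,
    "Doppler Server": 2,
    "Loggregator Traffic Controller": 2,
    "Syslog Scheduler": 2,
}

# Jobs whose notes ask for at least one instance per AZ.
ONE_PER_AZ = {"Diego Brain", "Diego BBS", "Clock Global", "Syslog Scheduler"}
# Jobs whose notes recommend at least two instances per AZ.
TWO_PER_AZ = {"Doppler Server", "Loggregator Traffic Controller"}
# Internal database jobs that an external database makes unnecessary.
INTERNAL_DB_JOBS = {"MySQL Server", "MySQL Proxy"}


def recommended_count(job: str, az_count: int) -> int:
    """Return the recommended instance count for a job given the number of AZs."""
    base = BASE_RECOMMENDED[job]
    if job in ONE_PER_AZ:
        return max(base, az_count)
    if job in TWO_PER_AZ:
        return max(base, 2 * az_count)
    return base


def ha_shortfalls(planned: dict[str, int], az_count: int, external_db: bool = False) -> dict[str, int]:
    """Return, for each job below its recommendation, how many instances are missing."""
    gaps = {}
    for job in BASE_RECOMMENDED:
        if external_db and job in INTERNAL_DB_JOBS:
            continue  # internal MySQL is not needed when an external database is used
        target = recommended_count(job, az_count)
        have = planned.get(job, 0)
        if have < target:
            gaps[job] = target - have
    return gaps


plan = {job: 2 for job in BASE_RECOMMENDED}
print(ha_shortfalls(plan, az_count=3))
```

With two instances of every job across three AZs, the sketch reports one missing instance each for Diego Cell, Diego Brain, Diego BBS, MySQL Server, Clock Global, and Syslog Scheduler, and four missing instances each for Doppler Server and Loggregator Traffic Controller.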
This section describes the surrounding infrastructure required to support scaling component VMs for high availability.
The BOSH Resurrector increases Pivotal Application Service (PAS) availability in the following ways:
- Reacts to hardware failure and network disruptions by recreating VMs on active, stable hosts
- Detects operating system failures by continuously monitoring VMs and recreating them as required
- Continuously monitors the BOSH Agent running on each VM and recreates the VMs as required
The BOSH Resurrector continuously monitors the status of all VMs in a PAS deployment. The Resurrector also monitors the BOSH Agent on each VM. If either the VM or the BOSH Agent fails, the Resurrector recreates the VM on another active host. To enable the BOSH Resurrector, see the Enable BOSH Resurrector section of the Using the BOSH Resurrector topic.
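The monitoring behavior described above can be modeled as a simple decision. The Python sketch below is a simplified, hypothetical model: the VmStatus fields and the heartbeat_timeout parameter are illustrative assumptions, not BOSH's actual data model or configuration, and the code does not communicate with BOSH. It only decides which VMs a resurrector would recreate, based on whether the VM is down or its agent has stopped reporting.

```python
from dataclasses import dataclass


@dataclass
class VmStatus:
    """Hypothetical snapshot of one VM, as a health monitor might track it."""
    name: str
    vm_running: bool       # whether the IaaS reports the VM as running
    last_heartbeat: float  # timestamp (seconds) of the last BOSH Agent heartbeat


def vms_to_recreate(vms: list[VmStatus], now: float, heartbeat_timeout: float) -> list[str]:
    """Return names of VMs that are down or whose agent has stopped reporting."""
    to_recreate = []
    for vm in vms:
        agent_silent = now - vm.last_heartbeat > heartbeat_timeout
        if not vm.vm_running or agent_silent:
            to_recreate.append(vm.name)
    return to_recreate


fleet = [
    VmStatus("router/0", vm_running=True, last_heartbeat=990.0),
    VmStatus("diego_cell/1", vm_running=True, last_heartbeat=900.0),  # agent silent for 100 s
    VmStatus("nats/0", vm_running=False, last_heartbeat=995.0),       # VM itself is down
]
print(vms_to_recreate(fleet, now=1000.0, heartbeat_timeout=60.0))  # ['diego_cell/1', 'nats/0']
```

Checking both conditions mirrors the two failure modes in the text: a VM that has disappeared, and a VM that is running but whose agent no longer responds.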
To configure your resource pools according to the requirements of your deployment, see the Ops Manager configuration topic for your IaaS.
Each IaaS has different ways of limiting resource consumption for scaling VMs. Consult your IaaS administrator to ensure that additional VMs and related resources, such as IPs and storage, are available when you scale.
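Before scaling, you can compare the resources a change needs against what remains of your IaaS quota. In this Python sketch, the resource names ("vms", "ips", "disk_gb") and all numbers are hypothetical; the real limits and current usage come from your IaaS administrator or your IaaS's own quota tooling.

```python
def quota_shortfalls(
    requested: dict[str, int], quota: dict[str, int], in_use: dict[str, int]
) -> dict[str, int]:
    """Return each resource whose requested addition exceeds the remaining quota, with the deficit."""
    shortfalls = {}
    for resource, amount in requested.items():
        remaining = quota.get(resource, 0) - in_use.get(resource, 0)
        if amount > remaining:
            shortfalls[resource] = amount - remaining
    return shortfalls


# Adding four Diego Cells, each assumed to need one VM, one IP, and 100 GB of disk.
requested = {"vms": 4, "ips": 4, "disk_gb": 400}
quota = {"vms": 40, "ips": 50, "disk_gb": 2000}
in_use = {"vms": 38, "ips": 30, "disk_gb": 1500}
print(quota_shortfalls(requested, quota, in_use))  # {'vms': 2}
```

Here the IP and disk quotas have room, but only two more VMs fit, so the VM quota must be raised by two before the scale-out can succeed.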
For information about configuring resource pools for Amazon Web Services, see Amazon EC2 FAQs in the Amazon documentation. For information about configuring resource pools for OpenStack, see Manage projects and users in the OpenStack documentation. For information about configuring resource pools for vSphere, see the Resource Config Page section of the Configuring BOSH Director on vSphere topic.
For database services deployed outside PAS, plan to use your infrastructure's high availability features and to configure backup and restore where possible. For more information about scaling internal database components, see Scaling Instances in PAS.
Note: Data services may have single points of failure depending on their configuration.
Contact Support if you need assistance.