Reference Architecture for Pivotal Cloud Foundry on vSphere
This guide presents reference architectures for Pivotal Cloud Foundry (PCF) on vSphere.
Pivotal validates the reference architectures described in this topic against multiple production-grade usage scenarios. These designs are sized for up to 1500 app instances.
This document does not replace the basic installation documentation, but gives proven examples of how to apply those instructions to real-world production environments.
|PCF Products Validated|Version|
|---|---|
|PCF Ops Manager|1.12.latest|
This recommended architecture includes VMware vSphere, NSX-v, and the ESG (Edge Services Gateway), a software-defined network services gateway that runs on VMware ESXi virtual hosts and combines routing, firewall, NAT/SNAT, and load balancing. In the absence of NSX, see below for architectures that do not rely on the ESG.
To use all features listed here, ESG requires at least Advanced licensing from VMware.
For more information about installing and configuring ESG for use with PCF on vSphere, see the Cookbook for Using Edge Services Gateway on PCF and VMware NSX.
The diagram below shows an architecture for one PCF installation in vSphere clusters, segmented with Resource Pools. More Resource Pools can be added to the existing Clusters to stack more PCF installations into the same capacity.
This design supports long-term use, capacity growth at the vSphere level, and maximum installation security through the ESG firewall. It allocates a minimum of three servers to each cluster, as recommended by vSphere, and spreads PCF components across three clusters (or another odd number of clusters), as recommended for High Availability.
To create a system following this architecture, perform the following steps:
In vCenter, create or identify three existing clusters.
Enable DRS on each cluster and set vMotion to fully automated. Populate each cluster with a Resource Pool for each PCF installation.
For compute, populate each cluster with three or more ESXi hosts, making nine or more hosts for each installation. All installations collectively draw from the same group of nine or more hosts.
In the PCF deployment, use Ops Manager to create three Availability Zones (AZs), each corresponding to one of the Resource Pools from each cluster.
For storage, add dedicated datastores to each PCF deployment following one of the two approaches, vertical or horizontal, as described in the Storage Configuration section of this topic.
Supply core networking for each deployment by configuring an ESG with the following subnets. See the Networking Design section of this topic for details.
- Elastic Runtime (ERT)
- Service tiles (one or more)
- Dynamic service tiles (a network managed entirely by BOSH Director)
- Container-to-Container Networking
Pivotal recommends NSX Logical Switches (vWires) for all networks used by PCF. This approach avoids VLAN consumption while benefiting from the overlay capability NSX enables. NSX can create a DPG (Distributed Port Group) on a DVS (Distributed Virtual Switch) for each interface provisioned on the ESG as shown in the Port Groups diagram below.
Alternatively, port groups on a DVS with VLANs tagged on each can be used for the networks above.
You can scale up this architecture to support additional PCF installations with the same capacity, keeping each resource-protected and separated.
To support more PCF installations, scale this architecture vertically by adding Resource Pools to existing clusters. To add capacity to all PCF installations, scale it horizontally by adding ESXi hosts to the existing clusters in sets of three, one per cluster.
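The layout and the two scaling directions described above can be sketched as a small model. This is only an illustration: the cluster, Resource Pool, and AZ names (`cluster-1`, `pcf-prod-rp1`, `pcf-prod-az1`) are hypothetical, and the code does not call vCenter or Ops Manager.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    hosts: int = 3  # this topic calls for three or more ESXi hosts per cluster
    resource_pools: list = field(default_factory=list)


def add_installation(clusters, installation):
    """Scale vertically: add one Resource Pool per cluster, return the AZ map."""
    azs = {}
    for index, cluster in enumerate(clusters, start=1):
        pool = f"{installation}-rp{index}"  # hypothetical naming scheme
        cluster.resource_pools.append(pool)
        # Each AZ corresponds to one Resource Pool in one cluster.
        azs[f"{installation}-az{index}"] = (cluster.name, pool)
    return azs


def add_host_set(clusters):
    """Scale horizontally: add one ESXi host to every cluster (a set of three)."""
    for cluster in clusters:
        cluster.hosts += 1


clusters = [Cluster(f"cluster-{n}") for n in (1, 2, 3)]
prod_azs = add_installation(clusters, "pcf-prod")
dev_azs = add_installation(clusters, "pcf-dev")
add_host_set(clusters)

total_hosts = sum(cluster.hosts for cluster in clusters)
print(prod_azs)
print("total hosts shared by all installations:", total_hosts)
```

In this sketch, adding an installation never changes the host count, and adding hosts never changes the AZ map, which mirrors the separation between vertical and horizontal scaling.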
In this architecture, multiple PCF installations can share host resources. You can use vCenter resource allocation shares to assign High, Normal, or Low priority to the pools used by each installation. When host resources keep up with demand, these share values make no difference, but when multiple installations compete for limited resources, you can prioritize a production installation over a development installation by assigning its resource pools a High share value setting.
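As a rough illustration of how share levels behave under contention, the sketch below divides a contended resource among sibling resource pools in proportion to their share level. vSphere weights the High, Normal, and Low levels in a 4:2:1 ratio. The pool names and the 100 GHz figure are hypothetical, and the model deliberately ignores reservations and limits.

```python
# Relative weights of the vSphere share levels (High:Normal:Low = 4:2:1).
SHARE_WEIGHT = {"High": 4, "Normal": 2, "Low": 1}


def split_under_contention(capacity, pools):
    """Divide a contended resource among sibling pools by share weight.

    Simplified model: reservations and limits are ignored, and every pool
    is assumed to demand more than its proportional slice.
    """
    total_weight = sum(SHARE_WEIGHT[level] for level in pools.values())
    return {
        name: capacity * SHARE_WEIGHT[level] / total_weight
        for name, level in pools.items()
    }


# Hypothetical pools in one cluster: production is favored over development.
pools = {"pcf-prod-rp1": "High", "pcf-dev-rp1": "Low"}
allocation = split_under_contention(100.0, pools)  # 100 GHz of contended CPU
print(allocation)
```

With these inputs the production pool is entitled to four times the development pool's slice. When there is no contention, the split does not apply and each pool simply gets what it asks for.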
Shared storage is a requirement for PCF. You can allocate networked storage to the host clusters following one of two common approaches, horizontal or vertical. The approach you follow should reflect how your data center arranges its storage and host blocks in its physical layout.
Horizontal: You grant all hosts access to all datastores and assign a subset to each installation.
For example, with six datastores ds01 through ds06, you grant all nine hosts access to all six datastores. You then provision your first PCF installation to use stores ds01 through ds03, and your second PCF installation to use ds04 through ds06.
Vertical: You grant each cluster its own dedicated datastores, creating a “cluster-aligned” storage strategy (vSphere VSAN is an example of storage that is aligned to the cluster).
From the example above, with six datastores ds01 through ds06, you assign datastores ds01 and ds02 to your first cluster, ds03 and ds04 to your second cluster, and ds05 and ds06 to your third cluster. You then provision your first PCF installation to use ds01, ds03, and ds05, and your second PCF installation to use ds02, ds04, and ds06. With this arrangement, all VMs in the same installation and cluster share a dedicated datastore.
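The difference between the two approaches can be sketched as two assignment functions. The sketch assumes six datastores named in sequence from ds01 to ds06, three clusters, and two installations. The cluster and installation names are hypothetical.

```python
DATASTORES = ["ds01", "ds02", "ds03", "ds04", "ds05", "ds06"]
CLUSTERS = ["cluster-1", "cluster-2", "cluster-3"]  # hypothetical names
INSTALLATIONS = ["pcf-1", "pcf-2"]                  # hypothetical names


def horizontal(datastores, installations):
    """All hosts see all datastores; each installation gets a contiguous subset."""
    size = len(datastores) // len(installations)
    return {
        inst: datastores[i * size:(i + 1) * size]
        for i, inst in enumerate(installations)
    }


def vertical(datastores, clusters, installations):
    """Each cluster owns its datastores; each installation takes one per cluster."""
    per_cluster = len(datastores) // len(clusters)
    by_cluster = {
        cluster: datastores[i * per_cluster:(i + 1) * per_cluster]
        for i, cluster in enumerate(clusters)
    }
    by_installation = {
        inst: [stores[i] for stores in by_cluster.values()]
        for i, inst in enumerate(installations)
    }
    return by_cluster, by_installation


horizontal_plan = horizontal(DATASTORES, INSTALLATIONS)
cluster_plan, vertical_plan = vertical(DATASTORES, CLUSTERS, INSTALLATIONS)
print(horizontal_plan)
print(cluster_plan)
print(vertical_plan)
```

In the vertical plan, each installation ends up with exactly one datastore in every cluster, which is what lets all VMs of one installation in one cluster share a dedicated datastore.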
Note: If a datastore is part of a vSphere Storage Cluster using sDRS (storage DRS), you must disable the s-vMotion feature of the sDRS function on any datastores used by PCF. Otherwise, s-vMotion (Storage vMotion) activity can rename independent disks and cause BOSH to malfunction. For more information, see How to Migrate PCF to a New Datastore in vSphere.
Pivotal recommends the following capacity allocation for PCF installations:
- For production use, at least 8 TB of data storage, either as one 8 TB store or a number of smaller volumes adding up to 8 TB. Heavily used development installations may require significantly more storage to accommodate new code and buildpacks.
- For small installations without many tiles, 4-6 TB.
The primary consumer of storage is the NFS/WebDAV blobstore.
Note: PCF does not currently support using vSphere Storage Clusters with the latest versions of PCF validated for the reference architecture. Datastores should be listed in the vSphere tile by their native name, not the cluster name created by vCenter for the storage cluster.
Pivotal recommends the following:
- Either block-based (Fibre Channel or iSCSI) or file-based (NFS) storage over high-speed carriers such as 8Gb FC or 10GigE
- Redundant storage for both ephemeral and persistent storage types used by PCF. All-flash-based storage and SSD-cached storage are highly desirable.
- Data de-duplication and compression, if done in hardware at the storage array
These technologies can dramatically reduce the need for storage capacity.
Using VMware NSX SDN (software-defined networking) provides the following benefits:
- Network fencing of the entire PCF installation
- Distributed, hyper-local firewall capability per installation through the built-in ESG Firewall
- High capacity, resilient, distributed load balancing per installation through the ESG Load Balancer
- Element obfuscation through the use of non-routed RFC-1918 networks behind the ESG and the use of SNAT/DNAT connections to expose only the endpoints of Cloud Foundry that require exposure
- High-repeatability of installations through the reuse of all network and addressing conventions on the right-hand side of the diagram, the Tenant Side
- Centrally-managed rule and ACL sharing through the NSX Manager Global Ruleset
- HA pairs of ESGs (optional) for extra levels of redundancy
- BOSH CPI can add and remove Gorouter members from load balanced pools in ESG (not an Ops Manager feature)
- ESG Security Group tagging managed by BOSH, to group like VMs per PCF installation into security groups
Warning: The overlay network IP range for Container-to-Container Networking must not conflict with any other IP addresses in your network.
Note: When using VMware NSX for vSphere 6.2.3+, the default VXLAN port of 4789 used by Container-to-Container Networking will not work. To resolve this issue, override the default by navigating to the Networking section of the Elastic Runtime tile and entering a different value in VXLAN Tunnel Endpoint Port.
Each PCF installation consumes four or more networks with the ESG, aligned to the following specific job types:
- Infrastructure: This small network hosts resources that interact with the IaaS layer and back-office systems, such as the cloud provider interface (CPI), BOSH, Ops Manager, and other utility VMs such as a jumpbox VM. Operators access these resources to manage a PCF installation.
- Deployment: Also known as the apps wire, this network has a large CIDR range. It hosts the Elastic Runtime tile, Diego Cells, and Windows Cells, and is the network onto which apps are deployed.
- Services: This network has a large CIDR range. It hosts service tiles that are installed using Ops Manager and managed by BOSH. One approach is to use this network for all PCF tiles except ERT.
- Dynamic Services: A single network granted to BOSH Director for use with service tiles that require an on-demand, or dynamic, address space for deployment. This is a special purpose network that is marked as “Services” with a checkbox in the vSphere Ops Manager Director tile. See About the On-Demand Services SDK for more information.
- IsoZones##: A single network granted to the PCF Isolation Segment tile for use to isolate Gorouters and Diego Cells into a network space independent of the ERT installation.
- Container-to-Container Networking: A single network dedicated to communication between containers.
Note: All of these networks are considered inside or tenant-side networks. They use non-routable RFC-1918 network space, which the ESG does not advertise to the outside, to make provisioning repeatable. The ESG routes between the tenant-side and service-provider-side networks and passes traffic between them using SNAT and DNAT.
For each PCF installation, you should provision an ESG with at least four routable IP addresses from the service provider:
- A static IP address by which NSX Manager manages the ESG
- A static IP address for use as egress SNAT. Traffic from the tenant side exits the Edge on this IP address
- A static IP address for DNATs to Ops Manager
- A static IP address for the load balancer VIP that balances to a pool of PCF Gorouters (HTTP/HTTPS)
In addition to these, many more uses for IP addresses on the routed side of the ESG exist. Pivotal recommends reserving a total of ten contiguous static IP addresses per ESG for future needs and flexibility. Examples include the following:
- Load balancer VIP for TCP Routers, if deploying TCP routing for non 80/443 access to apps
- Load Balancer VIP for DiegoBrains, if deploying in multiples
- Load Balancer VIP for MySQL proxies, if deploying in multiples
- Monitoring or metrics endpoint for platform monitoring
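A reservation of ten contiguous addresses can be planned as in the following sketch. The starting address comes from the 203.0.113.0/24 documentation range and is purely a placeholder for whatever routable block the service provider grants. The role labels paraphrase the four required uses listed above.

```python
import ipaddress

# The four uses this topic lists as required for every ESG.
REQUIRED_ROLES = [
    "NSX Manager management of the ESG",
    "Egress SNAT",
    "DNAT to Ops Manager",
    "Load balancer VIP for Gorouters",
]


def reserve_block(first_ip, size=10):
    """Return `size` contiguous addresses starting at `first_ip`."""
    start = ipaddress.ip_address(first_ip)
    return [start + offset for offset in range(size)]


# Placeholder start address (documentation range, not a real allocation).
block = reserve_block("203.0.113.10")
plan = dict(zip(REQUIRED_ROLES, block))
spare = block[len(REQUIRED_ROLES):]  # held back for TCP Routers, MySQL proxies, etc.

for role, address in plan.items():
    print(f"{address}  {role}")
print("spare addresses:", [str(address) for address in spare])
```

Keeping the spare addresses in the same contiguous block leaves room for the optional VIPs and endpoints without requesting a second allocation later.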
On the tenant side, each interface defined on the ESG acts as the IP gateway for the network used. Pivotal recommends allocating the following address ranges for the networks, and defining the gateway at 192.168.zzz.1 for each:
- Infrastructure network: 192.168.10.0/26
- Deployment network: 192.168.20.0/22
- Services network: 192.168.24.0/22
- Dynamic Services network: 192.168.28.0/22
- IsoZone## network: 192.168.32.0/22
- Container-to-Container Networking network: 10.255.0.0/16
Note: Review the container-to-container settings in your installation for any IP address range conflicts with the enterprise addressing scheme in use by your organization. The specified range cannot be used anywhere else in your deployment. Change the range to a non-conflicting range as needed.
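One way to review the ranges for conflicts is with Python's standard `ipaddress` module, as in the sketch below. The tenant ranges are the ones recommended in this topic. The `ENTERPRISE_RANGES` list is a hypothetical stand-in for the address space already in use in your organization.

```python
import ipaddress
from itertools import combinations

# Tenant-side networks that use an ESG interface as their gateway.
ESG_NETWORKS = {
    "Infrastructure": "192.168.10.0/26",
    "Deployment": "192.168.20.0/22",
    "Services": "192.168.24.0/22",
    "Dynamic Services": "192.168.28.0/22",
    "IsoZone": "192.168.32.0/22",
}
OVERLAY_NETWORK = {"Container-to-Container": "10.255.0.0/16"}

# Hypothetical ranges already used elsewhere in the enterprise (assumption).
ENTERPRISE_RANGES = ["10.0.0.0/12", "172.16.0.0/16"]


def parse(ranges):
    # Strict parsing rejects CIDRs with host bits set, such as 192.168.21.0/22.
    return {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}


def find_overlaps(networks):
    """Return every pair of named networks whose address ranges overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


esg = parse(ESG_NETWORKS)
all_networks = {**esg, **parse(OVERLAY_NETWORK)}

internal_conflicts = find_overlaps(all_networks)
enterprise_conflicts = [
    (name, cidr)
    for name, net in all_networks.items()
    for cidr in ENTERPRISE_RANGES
    if net.overlaps(ipaddress.ip_network(cidr))
]
# The first address of each ESG-routed network serves as its gateway.
gateways = {name: str(net.network_address + 1) for name, net in esg.items()}

print("internal conflicts:", internal_conflicts)
print("enterprise conflicts:", enterprise_conflicts)
print(gateways)
```

If either conflict list is non-empty, change the affected range before deploying, since the overlay range in particular cannot be reused anywhere else in the deployment.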
Pivotal recommends vSphere DVS for all Clusters used by PCF. NSX creates a DPG for each interface provisioned on the ESG.
NSX Logical Switches should be used on the Tenant Side of this design, which leverages vWires, reducing the dependency on available VLAN capacity.
You can move the function of routing the Tenant-side networks shown on the NSX reference diagram to a different service provided by NSX. This service is the DLR, or Distributed Logical Router, a high-performance East/West routing feature.
To do this, you establish a DLR instance in the NSX Manager and use it to be the gateway of each Logical Switch used by a PCF installation, in place of the ESG being that gateway. You still need a North/South gateway to the outside world, so the ESG is still required, but the jobs it’s handling are now firewalling, NAT/SNAT and Load Balancing only.
You must put an interface from the DLR and one on the inside of the ESG together in order to accomplish the North/South flow; typically this is done with a very small “glue net” running OSPF. Refer to VMware’s technical materials around the DLR and PCF for greater details.
Benefits of adding the DLR include:
- Optimized East/West routing amongst the Tenant-side Logical Switches
- Increased interface count for Tenant-side Logical Switches (from 9 total to 99)
Concerns with this approach include:
- Addition of the “glue net OSPF” relationship between DLR and ESG
- Increased management and troubleshooting burden with the addition of the above
- No improvement to North/South network traffic; it performs the same as without the DLR
The ESG can act as a stand-alone, one-armed load balancer.
This variant can improve performance and reduce dependence on the ESG that acts as NAT/SNAT, firewall, and router by moving the load balancing function to a separate ESG deployed exclusively for that purpose in each installation.
This variant divides the jobs between two ESGs per install rather than one. To implement this architecture, you place a single interface (internal) of a new ESG on the Deployment network, enable the load balancing function, and DNAT to it through the boundary ESG.
The reference architecture for deploying production PCF on vSphere without VMware NSX SDN technology follows the base architecture, but with the following differences.
- Load balancing is handled by an external service, such as a hardware appliance or a VM from a third party.
- An external service also performs SSL termination unless that is passed through to the Gorouters.
- You must set up firewalls for each zone or network inside the installation, rather than having the ESG fence all the inside networks.
- There is no network fencing of the PCF installation, so RFC-1918 non-routable networks are not used, DNAT/SNAT is not used, and the address space consumed is from routable ranges already established in the datacenter.
The more traditional approach without SDN is to deploy a single VLAN for use with all of a PCF installation, or a pair of VLANs, one for infrastructure and one for the rest of the PCF installation. As VLAN capacity can be limited and scarce, this design seeks to limit the need for VLANs to a functional minimum.
In this example, the firewall and load balancer functions run outside of vSphere, on generic devices that most datacenters provide. The PCF installation is bound to two port groups provided by a DVS on ESXi, each of which aligns to different job types:
- Infra: CPI, BOSH, and Ops Manager VMs that communicate with the IaaS layer
- PCF: The deployment network for all tiles, including ERT
In a typical installation, you assign each of these port groups to a VLAN from the datacenter pool, and a routable IP address segment. Routing functions are handled by switching layers outside of vSphere, such as a top-of-rack or end-of-row switch/router appliance.
It is valid to deploy more networks than these two, up to and including those shown in the original design, so deploy them if the resources are readily available. The main thing to keep in mind is that this is a requirement per PCF installation, so keep a count of how many of those overall you will require.
If you are working with three or more ESXi hosts and want to use fewer resources than the base architecture requires, Pivotal recommends setting up PCF in three clusters with one host in each. The key point is to try to get to three AZs if possible, as this reduces future difficulties when increasing the size of your PCF installation.
To reduce resource use even further, you can place all hosts into a single cluster with VMware DRS and high availability enabled. The following is an example of that approach, using resource pools in the single cluster to emulate a three-cluster design. While the resource pools do little more than organize the VMs, they enforce a habit of deploying PCF constructs in threes, as recommended by Pivotal.
You may be tempted to consider a two-cluster architecture. A two-cluster architecture may offer useful symmetry at the vSphere level, but PCF works best when it deploys resources in odd numbers. A two-cluster configuration forces the operator into aligning odd-numbered components into even-numbered containers, which does not work well for PCF internal voting algorithms. If you do not want to consume three clusters for PCF, using one cluster works better than using two.
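The mismatch between odd component counts and two clusters can be illustrated with a simple majority-quorum check. This topic does not specify which voting algorithms PCF components use, so the majority rule here is an assumption made for illustration, as is the even spread of members across clusters.

```python
def spread(members, clusters):
    """Distribute members across clusters as evenly as possible."""
    base, extra = divmod(members, clusters)
    return [base + (1 if index < extra else 0) for index in range(clusters)]


def survives_worst_cluster_loss(members, clusters):
    """True if a strict majority remains after losing the fullest cluster.

    Assumes majority-quorum voting, which is an illustrative assumption
    rather than a statement about any specific PCF component.
    """
    placement = spread(members, clusters)
    remaining = members - max(placement)
    quorum = members // 2 + 1
    return remaining >= quorum


two_clusters = survives_worst_cluster_loss(members=3, clusters=2)
three_clusters = survives_worst_cluster_loss(members=3, clusters=3)
print("3 members over 2 clusters keep quorum:", two_clusters)
print("3 members over 3 clusters keep quorum:", three_clusters)
```

With two clusters, one cluster necessarily holds two of the three members, so losing it leaves a single voter. With three clusters, any single cluster loss still leaves two of three.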
For a single-cluster deployment, follow the networking setup described in either the base or the without-NSX architecture above. The internal compute arrangement for a production PCF deployment does not affect its networking.
Pivotal recommends mapping all datastores used by PCF to all of the hosts in a single-cluster deployment.
To avoid downtime, some PCF customer scenarios demand a multi-datacenter architecture that spreads deployment resources across more than one physical location. A multi-datacenter architecture can support the hardware, power source, and geographic redundancy needed to guarantee high availability.
One strategy for high availability is to keep a record of how many hosts are in a cluster and deploy enough copies of a PCF component in that AZ to ensure survivability in a site loss. This means placing large, odd numbers of components in the cluster so that at least two components are left on either site in the event of a site outage. In a four-host cluster, this calls for five VMs, so each site has at least two. DRS anti-affinity rules can be used here, set at the IaaS level, to force like VMs apart for best effect.
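The five-VM figure can be reproduced with a small calculation. The sketch assumes the cluster's hosts are split evenly between two sites and that anti-affinity spreads like VMs evenly across hosts, which is modeled here as round-robin placement. The site names are placeholders.

```python
def site_counts(vm_count, hosts_per_site=2, sites=("east", "west")):
    """Count VMs per site when VMs are placed round-robin across hosts.

    Round-robin stands in for the even spread that DRS anti-affinity
    rules aim for; real placement may differ.
    """
    # Host order alternates between sites: east, west, east, west, ...
    hosts = [site for _ in range(hosts_per_site) for site in sites]
    counts = {site: 0 for site in sites}
    for vm in range(vm_count):
        counts[hosts[vm % len(hosts)]] += 1
    return counts


def smallest_odd_vm_count(min_per_site=2, hosts_per_site=2):
    """Smallest odd number of VMs that leaves `min_per_site` VMs on each site."""
    vm_count = 1
    while min(site_counts(vm_count, hosts_per_site).values()) < min_per_site:
        vm_count += 2  # stay on odd numbers
    return vm_count


needed = smallest_odd_vm_count()
print("VMs needed in a four-host stretched cluster:", needed)
print("placement:", site_counts(needed))
```

Three VMs would leave only one on the less populated site, and four is even, so five is the smallest count that satisfies both conditions under these assumptions.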
The two main ways of designing a multi-datacenter PCF architecture are with stretched clusters, in which single logical clusters combine components in multiple physical locations, and with East/West clusters, in which locally self-contained clusters are mirrored across multiple locations.
For this approach, you define logical clusters that contain components physically located in two or more sites. With four hosts, for example, build a four-host cluster with two hosts from an East datacenter and two from the West. Apply networking such that all hosts see the same networks through a stretched layer 2 application, or use NSX or another SDN solution to extend L2 networking via L3 connections.
PCF and BOSH treat the stretched cluster as an AZ, and make the same demands on it that they do with any other AZ. The hosting, networking, and storage components within the stretched cluster must perform with normal latency and connectivity.
For seamless operation, hosts must share all datastores, and you need to replicate storage across sites. Otherwise, vMotion cannot move VMs freely across hosts for maintenance or DRS.
A stretched version of the base architecture splits three clusters across two sites, yielding a 4×3×3 geometry:
- Four hosts per cluster (two from each site)
- Three clusters for PCF as AZs
- Three AZs mapped to PCF clusters
You can also deploy a stretched version of the single-cluster model. This may be the more practical approach to achieving HA, since any stretched deployment already requires so many resources from two sites.
As with any VMware installation, job scheduling works more efficiently when VMs have fewer cores, so you should configure many smaller Diego Cell VMs rather than a lower number of larger ones. If 2-core to 4-core VMs can handle your apps, favor them over 8- and 12-core options. This is especially important with stretched deployments.
Network traffic is a challenge with stretched clusters, since app traffic may enter at any connection point in either location, but can only leave through a designated gateway. The architect should consider that app traffic landing in the East might have to flow out of the West, a “trombone effect” that forces additional traffic across datacenter links.
For this approach, the architect assigns parallel capacity from two sites independently, and deploys clusters to PCF in matched pairs. This creates even numbers of clusters, which makes suboptimal use of resources in PCF.
East/West mirroring the base architecture yields a deployment with six total clusters, three from each side. This may seem like a lot of gear to apply to PCF, but in a Business Continuity and Disaster Recovery (BCDR) scenario, doubling everything is the point.
Combining the East/West multi-datacenter and single-cluster approaches creates a geometry with two clusters and pools in one cluster per site, or six AZs. Such a deployment only uses one cluster of capacity from each site, and does not scale readily. But drawing capacity from only one cluster makes it easy to provision with only a few hosts.
A multi-datacenter architecture makes replicating storage less critical. There are enough AZs from either side to survive a point failure, and you can recover the installation without vSphere HA enabled for the clusters.