Reference Architecture for Pivotal Cloud Foundry on vSphere

This guide presents reference architectures for Pivotal Cloud Foundry (PCF) on vSphere.

Overview

Pivotal validates the reference architectures described in this topic against multiple production-grade usage scenarios. These designs are sized for up to 1500 app instances.

This document does not replace the basic installation documentation, but gives proven examples of how to apply those instructions to real-world production environments.

PCF Products        Validated Version
PCF Ops Manager     1.11.latest
Elastic Runtime     1.11.latest

Base Reference Architecture

This recommended architecture includes VMware vSphere, NSX-v, and the ESG (Edge Services Gateway), a software-defined network services gateway that runs on VMware ESXi virtual hosts and combines routing, firewall, NAT/SNAT, and load balancing. If you do not have NSX, see below for architectures that do not rely on the ESG.

To use all features listed here, ESG requires at least Advanced licensing from VMware.

For more information about installing and configuring ESG for use with PCF on vSphere, see the NSX Edge Cookbook for Pivotal Cloud Foundry on vSphere.

The diagram below shows an architecture for one PCF installation in vSphere clusters, segmented with Resource Pools. You can add more Resource Pools to the existing clusters to stack additional PCF installations into the same capacity.

This design supports long-term use, capacity growth at the vSphere level, and maximum installation security through the ESG firewall. It allocates at least three servers to each cluster, as recommended by vSphere, and spreads PCF components across three clusters (or another odd number), as recommended for PCF.

Figure: vSphere overview architecture diagram

Installation

To create a system following this architecture, do the following:

  1. In vCenter, create or identify three existing clusters.

  2. Enable DRS on each cluster with the automation level set to fully automated. Populate each cluster with a Resource Pool for each PCF installation.

  3. For compute, populate each cluster with three or more ESXi hosts, for a total of nine or more hosts per installation. All installations collectively draw from the same set of nine or more hosts.

  4. In the PCF deployment, use Ops Manager to create three Availability Zones (AZs), each corresponding to one of the Resource Pools from each cluster.

  5. For storage, add dedicated datastores to each PCF deployment following one of the two approaches, vertical or horizontal, as described below.

  6. Supply core networking for each deployment by configuring an ESG with the following subnets. See below for details:

    • Infrastructure
    • Elastic Runtime (ERT)
    • Service tiles (one or more)
    • Dynamic service tiles (a network managed entirely by BOSH Director)
    • IsoZone##

    Pivotal recommends NSX Logical Switches (vWires) for all networks used by PCF. This approach avoids VLAN consumption while benefiting from the overlay capability NSX enables. NSX can create a DPG (Distributed Port Group) on a DVS (Distributed Virtual Switch) for each interface provisioned on the ESG as shown in the Port Groups diagram below.

    Alternatively, port groups on a DVS with VLANs tagged on each can be used for the networks above.
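The following sketch summarizes the AZ and network layout that steps 4 through 6 produce for a single installation. It is a minimal illustration in Python, expressed as plain data; the cluster, resource pool, and network names are hypothetical placeholders for the values you enter in the Ops Manager Director tile.

    # Minimal sketch of the AZ-to-cluster/resource-pool mapping implied by the
    # steps above. All names are hypothetical placeholders; the real values are
    # entered in the Ops Manager Director tile.

    INSTALLATION = "pcf-prod"

    availability_zones = [
        {"name": "az1", "cluster": "Cluster01", "resource_pool": f"{INSTALLATION}-rp"},
        {"name": "az2", "cluster": "Cluster02", "resource_pool": f"{INSTALLATION}-rp"},
        {"name": "az3", "cluster": "Cluster03", "resource_pool": f"{INSTALLATION}-rp"},
    ]

    networks = [
        "Infrastructure",
        "Elastic Runtime (ERT)",
        "Service tiles",
        "Dynamic service tiles",
        "IsoZone##",
    ]

    for az in availability_zones:
        print(f'{az["name"]}: cluster={az["cluster"]}, resource_pool={az["resource_pool"]}')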

Scaling

You can scale up this architecture to support additional PCF installations with the same capacity, keeping each installation resource-protected and separated.

To support more PCF installations, scale this architecture vertically by adding Resource Pools to existing clusters. To add capacity to all PCF installations, scale it horizontally by adding ESXi hosts to the existing clusters in sets of three, one per cluster.

Priority

In this architecture, multiple PCF installations can share host resources. You can use vCenter resource allocation shares to assign High, Normal, or Low priority to pools used by each installation. When host resources keep up with demand, these share values make no difference, but when multiple installations compete for limited resources, you can prioritize a production installation over a development installation (for example) by assigning its resource pools a High share value setting.
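As a rough illustration, the sketch below shows how proportional shares resolve contention between two resource pools. The pool names and capacity figure are hypothetical, and the weights simply reflect the 4:2:1 ratio of the High, Normal, and Low presets; shares have no effect until demand exceeds capacity.

    # Sketch (not a vSphere API call) of how share-based allocation works when
    # resource pools contend for the same hosts. Weights follow the 4:2:1 ratio
    # of the High/Normal/Low presets; reservations and limits are ignored here.

    SHARE_WEIGHTS = {"High": 4, "Normal": 2, "Low": 1}

    def contended_allocation(pools, capacity_ghz):
        """Split contended CPU capacity across pools in proportion to their shares."""
        total = sum(SHARE_WEIGHTS[level] for _, level in pools)
        return {name: capacity_ghz * SHARE_WEIGHTS[level] / total for name, level in pools}

    # Hypothetical pools: a production installation set to High, a dev one set to Low.
    print(contended_allocation([("pcf-prod-rp", "High"), ("pcf-dev-rp", "Low")], 100))
    # -> {'pcf-prod-rp': 80.0, 'pcf-dev-rp': 20.0}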

Storage Configuration

Shared storage is a requirement for PCF. You can allocate networked storage to the host clusters following one of two common approaches, horizontal or vertical. The approach you follow should reflect how your data center arranges its storage and host blocks in its physical layout:

  • Horizontal: You grant all hosts access to all datastores, and assign a subset to each installation. For example, with 6 datastores ds01 through ds06, you grant all nine hosts access to all six datastores, then provision PCF installation #1 to use stores ds01 through ds03, and installation #2 to use ds04 through ds06.

  • Vertical: You grant each cluster its own dedicated datastores, creating a “cluster-aligned” storage strategy. vSphere VSAN is an example of this architecture. With 6 datastores ds01 through ds06, for example, you assign datastores ds01 and ds02 to cluster 1, ds03 and ds04 to cluster 2, and ds05 and ds06 to cluster 3. Then you provision PCF installation #1 to use ds01, ds03, and ds05, and installation #2 to use ds02, ds04, and ds06. With this arrangement, all VMs in the same installation and cluster share a dedicated datastore.

Note: If a datastore is part of a vSphere Storage Cluster using sDRS (storage DRS), you must disable the s-vMotion feature on any datastores used by PCF. Otherwise, s-vMotion activity can rename independent disks and cause BOSH to malfunction. For more information, see How to Migrate PCF to a New Datastore in vSphere.
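The sketch below contrasts the two layouts using the six example datastores (ds01 through ds06) and the two installations described above; the datastore and installation names come from that example and are otherwise arbitrary.

    # Sketch of the horizontal vs. vertical datastore layouts described above.

    datastores = [f"ds{i:02d}" for i in range(1, 7)]          # ds01 .. ds06

    # Horizontal: every host sees every datastore; each installation uses a subset.
    horizontal = {
        "installation-1": datastores[:3],                     # ds01-ds03
        "installation-2": datastores[3:],                     # ds04-ds06
    }

    # Vertical: each cluster owns two datastores; each installation takes one per cluster.
    cluster_datastores = {
        "cluster-1": ["ds01", "ds02"],
        "cluster-2": ["ds03", "ds04"],
        "cluster-3": ["ds05", "ds06"],
    }
    vertical = {
        "installation-1": [ds[0] for ds in cluster_datastores.values()],  # ds01, ds03, ds05
        "installation-2": [ds[1] for ds in cluster_datastores.values()],  # ds02, ds04, ds06
    }

    print(horizontal)
    print(vertical)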

Storage Capacity and Type

Capacity

For production use, Pivotal recommends allocating at least 8 TB of data storage to a PCF installation, either as one 8 TB store or as several smaller volumes adding up to 8 TB. Small installations without many tiles can use 4 to 6 TB. Frequently used development PCF installations may require far more storage, because new code and buildpacks are pushed much more often. The primary consumer of storage is the NFS/WebDAV blobstore.

Note: At the time of publication, PCF does not support the use of vSphere Storage Clusters with the latest versions of PCF validated for this reference architecture. List datastores in the vSphere tile by their native names, not by the cluster name that vCenter creates for the storage cluster.

Type

Pivotal recommends either block-based (Fibre Channel or iSCSI) or file-based (NFS) storage over high-speed carriers such as 8 Gb FC or 10 GigE. Redundant storage is highly recommended for both the ephemeral and persistent storage types used by PCF. All-flash storage and SSD-cached storage are highly desirable, as are data de-duplication and compression performed in hardware at the storage array. These technologies can dramatically reduce the need for storage capacity.

Networking

Using VMware NSX SDN (software-defined networking) provides the following benefits:

  • Network fencing of the entire PCF installation
  • Distributed, hyper-local firewall capability per installation through the built-in ESG Firewall
  • High capacity, resilient, distributed load balancing per installation through the ESG Load Balancer
  • Element obfuscation through the use of non-routed RFC-1918 networks behind the ESG and the use of SNAT/DNAT connections to expose only the endpoints of Cloud Foundry that need exposure
  • High repeatability of installations through the reuse of all network and addressing conventions on the right-hand side of the diagram, the Tenant Side
  • Centrally-managed rule and ACL sharing via NSX Manager Global Ruleset
  • HA pairs of ESGs (optional) for extra levels of redundancy
  • BOSH CPI can add/remove Gorouter members from load balanced pools in ESG (not an Ops Manager feature)
  • ESG Security Group tagging managed by BOSH, to group like VMs per PCF installation into security groups

Note: When using VMware NSX for vSphere 6.2.3 or later, the default VXLAN port of 4789 used by Container-to-Container Networking does not work. To fix this issue, override the default by navigating to the Networking section of the Elastic Runtime tile and entering a different value in VXLAN Tunnel Endpoint Port.

Networking Design

Each PCF installation consumes four or more networks on the ESG, aligned to specific job types:

  • Infrastructure: This small network hosts resources that interact with the IaaS layer and back-office systems, such as the cloud provider interface (CPI), BOSH, Ops Manager, and utility VMs such as a jumpbox. Operators access these resources to manage a PCF installation.
  • Deployment: Also known as the apps wire, this network has a large CIDR range. It hosts the Elastic Runtime tile (ERT), Diego Cells, and Windows Cells, and is the network apps are deployed onto.
  • Services: This network has a large CIDR range. It hosts tiles (services) that are installed using Ops Manager and managed by BOSH. A simple approach is to use this network for all PCF tiles except ERT.
  • Dynamic Services: A single network granted to BOSH Director for use with tiles (services) that require an on-demand (dynamic) address space for deployment. This is a special purpose network that is marked as “Services” with a checkbox in the vSphere Ops Manager Director tile.
  • IsoZone##: A single network granted to the PCF Isolation Segment tile, used to isolate Gorouters and Diego Cells into a network space independent of the ERT installation.

Note: All of these networks are considered “inside” or “tenant-side” networks. They use non-routable RFC-1918 network space that the ESG does not advertise to the outside, which makes provisioning repeatable. The ESG routes between the tenant-side and service-provider-side networks and connects traffic through SNAT and DNAT.

For each PCF installation, provision an ESG with at least four routable IP addresses from the service provider:

  1. A static IP address by which NSX Manager manages the ESG
  2. A static IP address for use as egress SNAT. Traffic from the tenant side exits the Edge on this IP address
  3. A static IP address for DNATs to Ops Manager
  4. A static IP address for the load balancer VIP that balances to a pool of PCF Gorouters (HTTP/HTTPS)

In addition to these, there are many more uses for IP addresses on the routed side of the ESG. Pivotal recommends reserving a total of ten contiguous static IP addresses per ESG for future needs and flexibility. Examples include the following:

  • Load balancer VIP for TCP Routers, if deploying TCP routing for non-80/443 access to apps
  • Load balancer VIP for Diego Brains, if deploying in multiples
  • Load balancer VIP for MySQL proxies, if deploying in multiples
  • Monitoring or metrics endpoint for platform monitoring

On the tenant side, each interface defined on the ESG acts as the IP gateway for the network used. Pivotal recommends allocating the following address ranges for the networks, and defining the gateway at 192.168.zzz.1 for each:

  • Infrastructure network: 192.168.10.0/26
  • Deployment network: 192.168.20.0/22
  • CF Tiles network: 192.168.24.0/22
  • Dynamic Services network: 192.168.28.0/22
  • IsoZone## network: 192.168.32.0/22

Figure: vSphere exploded edge diagram
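As a quick check on the address plan above, the following sketch uses Python's ipaddress module to derive each network's ESG gateway (the .1 address) and usable host count. The CIDR ranges are the recommended ones; substitute your own plan as needed.

    # Sketch: derive gateways and usable host counts for the recommended
    # tenant-side networks. Adjust the ranges to your own addressing plan.

    import ipaddress

    tenant_networks = {
        "Infrastructure":   "192.168.10.0/26",
        "Deployment":       "192.168.20.0/22",
        "CF Tiles":         "192.168.24.0/22",
        "Dynamic Services": "192.168.28.0/22",
        "IsoZone##":        "192.168.32.0/22",
    }

    for name, cidr in tenant_networks.items():
        net = ipaddress.ip_network(cidr)
        gateway = net.network_address + 1      # ESG interface, e.g. 192.168.20.1
        usable = net.num_addresses - 2         # minus network and broadcast addresses
        print(f"{name:17} {cidr:18} gateway={gateway}  usable_hosts={usable}")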

Distributed Port Groups

Pivotal recommends vSphere DVS (Distributed Virtual Switching) for all clusters used by PCF. NSX creates a DPG (Distributed Port Group) for each interface provisioned on the ESG.

Use NSX Logical Switches (vWires) on the tenant side of this design to reduce the dependency on available VLAN capacity.

Figure: vSphere port groups diagram

High Performance Variants

One-Armed Load Balancing

The ESG can act as a stand-alone, one-armed load balancer.

This variant can improve performance and reduce dependence on the boundary ESG that provides NAT/SNAT, firewall, and routing functions, by moving the load balancing function to a separate ESG deployed exclusively for each installation.

In short, you divide the jobs between two ESGs per install rather than one. To implement this architecture, you place a single interface (internal) of a new ESG on the Deployment network, enable the load balancing function, and DNAT to it through the boundary ESG.

Reference Architecture Without VMware NSX

The reference architecture for deploying production PCF on vSphere without VMware NSX SDN technology follows the base architecture, but with the following differences.

Networking Features

  • Load balancing is handled by an external service, such as a hardware appliance or a VM from a third party.
  • An external service also performs SSL termination, unless SSL is passed through to the Gorouters.
  • You must set up firewalls for each zone or network inside the installation, rather than having the ESG fence all the inside networks.
  • There is no network fencing of the PCF installation, so RFC-1918 non-routable networks are not used, DNAT/SNAT is not used, and the address space consumed is from routable ranges already established in the datacenter.

Networking Design

The more traditional approach without SDN is to deploy a single VLAN for use with all of PCF, or possibly a pair of VLANs, one for infrastructure and one for the rest of PCF. Because VLAN capacity is frequently scarce, this design limits the need for VLANs to a functional minimum.

Figure: vSphere without NSX diagram

In this example, the firewall and load balancer functions run outside of vSphere, on generic devices that most datacenters provide. The PCF installation is bound to two port groups provided by a DVS on ESXi, each of which aligns to different job types:

  1. Infra: CPI, BOSH, and Ops Manager VMs that communicate with the IaaS layer
  2. PCF: The deployment network for all tiles, including ERT

In a typical installation, you assign each of these port groups to a VLAN out of the datacenter pool, and a routable IP address segment. Routing functions are handled by switching layers outside of vSphere, such as a top-of-rack (TOR) or end-of-row (EOR) switch/router appliance.

It is valid to deploy more networks than these two, up to and including those shown in the base design, so deploy them if the resources are readily available. Keep in mind that these networks are required for each PCF installation, so count how many VLANs and address segments you will need overall.
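As a simple way to keep that count, the sketch below multiplies the networks needed per installation by the number of installations you plan to run; both figures are hypothetical and should be replaced with your own.

    # Sketch: how many VLANs and routable subnets to reserve for the no-NSX design.

    networks_per_install = 2   # Infra + PCF at minimum; more if you mirror the base design
    installations = 3          # hypothetical: for example sandbox, non-prod, and prod

    print(f"VLANs/subnets to reserve: {networks_per_install * installations}")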

Reference Architecture Without Multiple Clusters

If you are working with three or more ESXi hosts and want to use fewer resources than the base architecture requires, Pivotal recommends setting up PCF in three clusters with one host in each. The key point is to reach three AZs if at all possible, because this reduces future difficulty in growing the PCF installation.

To reduce resource use even further, you can place all hosts into a single cluster with VMware DRS and HA (high availability) enabled. The following is an example of that approach, using resource pools in the single cluster to emulate (spoof) a three-cluster design. While the resource pools do little more than organize the VMs, this approach forces the habit of deploying PCF constructs in threes, which is desirable.

Figure: vSphere single cluster diagram
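Expressed as data, the spoofed three-AZ layout differs from the base design only in that every AZ points at the same cluster. A minimal sketch, with hypothetical names:

    # Sketch: three AZs emulated with resource pools inside one cluster.

    availability_zones = [
        {"name": "az1", "cluster": "Cluster01", "resource_pool": "pcf-az1-rp"},
        {"name": "az2", "cluster": "Cluster01", "resource_pool": "pcf-az2-rp"},
        {"name": "az3", "cluster": "Cluster01", "resource_pool": "pcf-az3-rp"},
    ]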

You may be tempted to consider a two-cluster architecture. A two-cluster architecture may offer useful symmetry at the vSphere level, but PCF works best when it deploys resources in odd numbers (1, 3, 5). A two-cluster configuration forces the operator to align odd-numbered components with even-numbered containers, which does not work well for PCF's internal voting algorithms. If you do not want to consume three clusters for PCF, using one cluster works better than using two.
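A small worked example makes the voting problem concrete. Assuming a quorum-based component that needs a strict majority of its nodes (for example, a three-node cluster), spreading it across two AZs always puts a majority in one AZ, so losing that AZ loses quorum; three AZs avoid this.

    # Sketch: does a quorum-based component survive the loss of its largest AZ?

    def survives_az_loss(nodes, azs):
        per_az = [nodes // azs + (1 if i < nodes % azs else 0) for i in range(azs)]
        remaining = nodes - max(per_az)        # lose the most heavily loaded AZ
        return remaining > nodes // 2          # a strict majority must remain

    print(survives_az_loss(nodes=3, azs=2))    # False: the 2-node AZ holds the majority
    print(survives_az_loss(nodes=3, azs=3))    # True: any single AZ loss leaves 2 of 3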

Networking Design

For a single-cluster deployment, follow the networking setup described in either the base architecture or the without-NSX architecture above. The internal compute arrangement of a production PCF deployment does not affect its networking.

Pivotal recommends mapping all datastores used by PCF to all of the hosts in a single-cluster deployment.

Multi-Datacenter Reference Architecture

To avoid downtime, some PCF customer scenarios demand a multi-datacenter architecture that spreads deployment resources across more than one physical location. A multi-datacenter architecture can support the hardware, power source, and geographic redundancy needed to guarantee high availability.

One strategy for high availability is to keep a record of how many hosts are in a cluster and deploy enough copies of each PCF component in that AZ to ensure survivability of a site loss. This means placing large, odd numbers of components in the cluster so that at least two components remain at either site in the event of a site outage. In a four-host cluster, this calls for five VMs, so that each site has at least two, if not a third. You can use DRS anti-affinity rules, set at the IaaS level, to force like VMs apart for best effect.
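The sizing rule above can be sketched as a small calculation: given an even host split across two sites and anti-affinity spreading like VMs across hosts, pick the smallest odd instance count whose worst-case split still leaves at least two instances after a full site outage. The function below is a hypothetical helper, not part of any PCF tooling.

    # Sketch: smallest odd instance count that survives losing one of two sites.

    def min_instances_for_site_loss(sites=2, survivors_needed=2):
        n = 1
        while True:
            worst_loss = -(-n // sites)        # ceiling division: largest share on one site
            if n % 2 == 1 and n - worst_loss >= survivors_needed:
                return n
            n += 1

    print(min_instances_for_site_loss())       # 5, matching the four-host, five-VM example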

The two main ways of designing a multi-datacenter PCF architecture are stretched clusters, in which single logical clusters combine components in multiple physical locations, and East/West clusters, in which locally self-contained clusters are mirrored across multiple locations.

Both of these approaches have their own caveats, and you can combine either with the without-NSX and single-cluster architectures described above.

Multi-Datacenter vSphere With Stretched Clusters

For this approach, you define logical clusters that contain components physically located in two or more sites. With four hosts, for example, you build a four-host cluster with two hosts in an East datacenter and two in a West datacenter. Apply networking such that all hosts see the same networks, either through a stretched Layer 2 implementation or by using NSX or another SDN solution to extend L2 networking over L3 connections.

Figure: vSphere multi-datacenter diagram

PCF and BOSH treat the stretched cluster as an AZ, and make the same demands on it that they do with any other AZ. The hosting, networking, and storage components within the stretched cluster must perform with normal latency and connectivity.

For seamless operation, hosts must share all datastores, and you need to replicate storage across sites. Otherwise, vMotion cannot move VMs freely across hosts for maintenance or DRS.

A stretched version of the base architecture splits three clusters across two sites, yielding a 4×3×3 geometry:

  • Four hosts per cluster (two from each site)
  • Three clusters for PCF as AZs
  • Three AZs mapped to PCF clusters

You can also deploy a stretched version of the single-cluster model. This may be the more practical approach to achieving HA, since any stretched deployment already requires so many resources from two sites.

As with any VMware installation, job scheduling works more efficiently when VMs have fewer cores, so configure many smaller Diego Cell VMs rather than a smaller number of larger ones. If 2-core to 4-core VMs can handle your apps, favor them over 8-core and 12-core options. This is especially important with stretched deployments.

Network traffic is a challenge with stretched clusters, since app traffic may enter at any connection point in either location, but can only leave through a designated gateway. The architect should consider that app traffic landing in the East might have to flow out of the West, a “trombone effect” that forces additional traffic across datacenter links.

Multi-Datacenter vSphere With Combined East/West Clusters

For this approach, the architect assigns parallel capacity from two sites independently, and deploys clusters to PCF in matched pairs. This creates even numbers of clusters, which makes suboptimal use of resources in PCF.

East/West mirroring the base architecture yields a deployment with six total clusters, three from each side. This may seem like a lot of gear to apply to PCF, but in a Business Continuity and Disaster Recovery (BCDR) scenario, doubling everything is the point.

Combining the East/West multi-datacenter and single-cluster approaches creates a geometry with two clusters, one per site, each divided into resource pools, for six AZs in total. Such a deployment uses only one cluster of capacity from each site and does not scale readily, but drawing capacity from only one cluster makes it easy to provision with only a few hosts.

A multi-datacenter architecture makes replicating storage less critical. There are enough AZs from either side to survive a point failure, and you can recover the installation without vSphere HA enabled for the clusters.
