Unlocking the Power of On-Demand RabbitMQ for PCF

This topic explains how to benefit from the two on-demand service plans.

Introduction

RabbitMQ for Pivotal Cloud Foundry (PCF) responds to the demands of PCF operators by offering a RabbitMQ on-demand cluster for their app developer teams, in addition to the single-node on-demand plan.

The on-demand cluster plan is designed for workloads that need the same resilience as the pre-provisioned offering but must also be isolated from other workloads. The platform operations team can configure a RabbitMQ for PCF cluster to meet their business requirements and empower app development teams to self-serve their own RabbitMQ cluster.

RabbitMQ for PCF also provides smoke tests for the on-demand plans so that operations teams can validate the app developer workflow for on-demand services. See Dedicated Instance Smoke Test Process.

With the on-demand cluster plan, platform operators can now offer their app developers three types of RabbitMQ for PCF service plans:

  • On-demand single node—For app developer teams requiring greater isolation than provided by the pre-provisioned approach. App development teams can have full access to their own message broker to adapt the runtime parameters to their requirements. For more information on these parameters, see Parameters and Policies in the RabbitMQ documentation.

  • On-demand cluster—For an increased level of message resilience and cluster availability, as well as the benefits of workload isolation mentioned above.

  • Pre-provisioned—For light to moderate messaging needs, this service is fully operated and managed by platform operators as a service.

Note: The RabbitMQ for PCF tile will only provide the on-demand service in the future. For more information, see Deciding Which Service Plan to Use below.

For information about the pre-provisioned plan, see Deploying the RabbitMQ Pre-Provisioned Service. For information on using pre-provisioned plans to isolate workloads, see Creating Isolation with the Tile Replicator.

Deciding Which Service Plan to Use

In the future, RabbitMQ for PCF will only provide the on-demand service because it is designed for independent, isolated RabbitMQ instances. The existing pre-provisioned offering has many RabbitMQ instances on a single VM, in a multi-tenancy model. In this model, a single misbehaving app can take down the entire cluster for everyone.

From research and feedback on the issues customers had when using the pre-provisioned service, Pivotal is making different design decisions for the on-demand service.

To help you decide which service to use, the table below describes the current feature discrepancies between the pre-provisioned and on-demand services, and Pivotal’s plans for addressing these discrepancies. Pivotal encourages you to give feedback on how to meet your use case requirements with the on-demand service.

Feature | Pre-Provisioned Service | On-Demand Service
--- | --- | ---
Configuration | Enabled via base-64 encoded text box | Plan to address
Plugins | Tier-1 plugins enabled using checkboxes in UI. | A selection of tier-1 plugins are enabled by default on all instances. See RabbitMQ Server Settings That Cannot Be Disabled.
RabbitMQ admin credentials to access RabbitMQ Management UI | Can set password using tile UI | Can access by creating a service key, see Create an Admin User for a Service Instance.
Erlang cookie | Operator can change, which has caused problems | This is managed by the service. No operator intervention needed.
RabbitMQ TLS versions | Available due to security concerns about the TLS packaged with the pre-provisioned service | All instances only have TLS v1.1 and TLS v1.2 available.
Resource sharing | Yes. Service instances share resources on the same VM and can affect one another. | No. On-demand ensures isolation between service instances by creating a separate VM per service instance.
External load balancer DNS name | Available | Plan to address. IaaS-specific load balancers can still be used.
Disk free alarm limit | Configurable | No plans to address. The default persistent disk size is controlled at the plan level and is set relative to memory. This removes the ability to mis-configure the alarm limit.
Load-balancing | Available using HAProxy | Uses client-side load-balancing.
RabbitMQ servers static IP | Exists | Evaluating based on customer needs and feedback
Policy for new instances | Configurable | Plan to address
Almost-instant provisioning | Instantly provisioned | No plans to address. Instant provisioning for on-demand is limited by the IaaS. This limitation will potentially be addressed by containers.
Individual instance upgrade | Possible when using tile replicator | Plan to address
Network partition behavior | Defaults to pause_minority. Configurable. | Defaults to pause_minority. Configurable for each plan.
TLS | TLS enabled between client and RabbitMQ broker, possible to configure peer validation | TLS between client and the RabbitMQ broker has been implemented.

On-Demand Single Node Plan Using RabbitMQ 3.7

This plan is designed to be simple to configure, deploy, and use. It gives app developer teams fast access to the power of the leading open source message broker backed by BOSH to meet all but the most demanding high availability app messaging requirements.

This plan can suit high-performance workloads requiring messaging resilience and asynchronous messaging replication. RabbitMQ copies messages to disk for resilience and allows asynchronous messaging replication through the RabbitMQ Federation plugin.

This plan offers:

  • Fast access to an isolated instance of RabbitMQ scoped for the app developer teams
  • Org and Space Administrator access to the RabbitMQ Management UI so app developer teams can have full control over the node
  • Updates and upgrades initiated and controlled by the operator to keep the instance up-to-date with the latest security patches and bug fixes
  • Message resilience provided through the RabbitMQ Federation plugin (for exchanges and queues) and the Shovel plugin

On-Demand Cluster Plan Using RabbitMQ 3.7

Like the single node plan, this plan is designed to be simple to configure, deploy, and use. It gives app developer teams fast access to the power of the leading open source message broker backed by BOSH to meet all but the most demanding high availability app messaging requirements.

This plan can suit high performance workloads requiring messaging resilience (copied to disk) and asynchronous messaging replication through the RabbitMQ Federation plugin. With this plan, however, you also scale out RabbitMQ for PCF to multiple nodes.

This plan offers:

  • Fast access to an isolated, clustered instance of RabbitMQ scoped to the app developer team Orgs and Spaces
  • Administrator access to the RabbitMQ Management UI to give app developer teams full control over the cluster
  • Updates and upgrades initiated and controlled by the operator to keep the instance up-to-date with the latest security patches and bug fixes.
  • Message resilience provided by mirroring queues across RabbitMQ nodes, and the option to use the Federation and Shovel plugins.

General Principles of the Cluster Plan

The following are some general principles to be aware of when configuring the cluster plan:

Designed for Consistency

RabbitMQ clustering is not primarily a solution for increased availability. Instead, it is designed for consistency and partition tolerance, as described in the CAP theorem. RabbitMQ clustering provides increased message consistency through queue mirroring. This means that the messages held in a mirror of a queue are the same as the messages held in its master. For more information, see Consistency or Availability Trade-off.

Other options can be used for availability requirements, such as the use of federation between exchanges or queues.

For a detailed description of distributed RabbitMQ brokers, see the RabbitMQ documentation.

Number of Nodes

Every node in the on-demand cluster maintains a complete database of all metadata, and all changes to the metadata are confirmed by every node in the cluster. Therefore, going beyond seven nodes can have a significant negative impact on performance. For optimum resilience and performance, Pivotal recommends three nodes for most workloads.

Network Latency

RabbitMQ clusters are only recommended for deployment in low latency networks, which normally means that it is not advisable to deploy these clusters across availability zones (AZs). The stability and performance of the RabbitMQ cluster is heavily influenced by the workload on the nodes, replication choices, and network latency.

For this reason, Pivotal recommends that you deploy RabbitMQ clusters into a single Ops Manager AZ. However, where different AZs are in the same data center, with reliable low latency links, spanning AZs can be used.

For cloud IaaS deployments, Pivotal does not recommend that deployments span regions. For example, in Amazon Web Services (AWS) terms, deploying a RabbitMQ cluster across AZs within a region should provide high enough network performance to prevent impacting cluster stability. However, deploying across AWS regions is likely to lead to cluster instability. For more information, see the AWS documentation.

Consistency or Availability Trade-off

In a distributed messaging system, a trade-off must be made between availability and consistency when a network partition event occurs and one or more nodes are not able to communicate with each other. The cluster plan lets operators decide how they want the RabbitMQ cluster to react in the event of a network partition.

Pivotal recommends keeping the default cluster partition option of pause_minority because this satisfies most use cases. Choosing the pause_minority partition-handling strategy favors message consistency over availability. For more information about the options for handling partitions, see the RabbitMQ documentation. For a detailed description of the options available in RabbitMQ for PCF, see Clustering and Network Partitions.

Here is an example of how pause_minority works. If you create a RabbitMQ cluster with three nodes and one node becomes unable to communicate with the other two, this node is in the minority. The node that is in the minority is paused, and the other two nodes continue serving traffic. If each of the nodes loses connectivity with the other two, then the entire cluster is paused to preserve data since no majority can be established. The cluster heals when two or more nodes are able to communicate with each other.
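The two scenarios above can be sketched as a small simulation. This is an illustrative model only, not RabbitMQ code: the function name and the node names are hypothetical, and it assumes a group of nodes keeps running only when it holds more than half of all cluster nodes.

```python
from typing import List, Set


def paused_nodes(cluster: Set[str], partitions: List[Set[str]]) -> Set[str]:
    """Return the nodes that a pause_minority strategy would pause.

    `cluster` is every node in the cluster; `partitions` lists the groups of
    nodes that can still reach each other. A group keeps running only if it
    holds a strict majority of the whole cluster.
    """
    paused: Set[str] = set()
    for group in partitions:
        # Half or fewer of all cluster nodes counts as a minority.
        if len(group) * 2 <= len(cluster):
            paused |= group
    return paused


nodes = {"rmq-0", "rmq-1", "rmq-2"}  # hypothetical node names

# One node is cut off: only that node pauses.
one_isolated = paused_nodes(nodes, [{"rmq-0", "rmq-1"}, {"rmq-2"}])

# Every node is cut off from the others: the whole cluster pauses.
all_isolated = paused_nodes(nodes, [{"rmq-0"}, {"rmq-1"}, {"rmq-2"}])
```

In the first case the two connected nodes form a majority and keep serving traffic. In the second case no group reaches a majority, so every node is paused until at least two nodes can communicate again.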

RabbitMQ Queue Availability

It is important to be aware that message queue availability is different from cluster availability. So, having cluster availability does not mean that all of the messages within the queues are also available.

By default, queues within a RabbitMQ cluster are located on a single node—the node on which they were first declared. However, queues can be configured to mirror across multiple nodes, so that any message published to the queue is replicated to all mirrors. Enabling mirroring can have a negative impact on queue performance because messages must be copied to all mirrors before being acknowledged.

Each mirrored queue consists of one master and one or more mirrors, with the oldest mirror being promoted to the new master if the old master disappears for any reason. Consumers are connected to the master regardless of which node they connect to, and mirrors drop messages that have been acknowledged at the master. Queue mirroring enhances queue availability, but does not distribute load across nodes because each of the participating nodes must still do all the work.

App developers must decide if they want to use queue mirroring and determine the policy they want to apply to their queues. These choices have significant impact on the availability of their queues. For more information, see the RabbitMQ documentation.
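As a rough illustration of what such a policy can look like, the sketch below builds the request that a developer with admin access might send to the RabbitMQ management HTTP API. The policy name, queue-name pattern, and mirror count are hypothetical choices, and the request is only constructed, not sent.

```python
import json
from urllib.parse import quote


def mirroring_policy_request(vhost: str, name: str, pattern: str, replicas: int):
    """Build the path and JSON body for a queue-mirroring policy.

    Nothing is sent here; the caller would PUT the body to the returned path
    on the management API using admin credentials.
    """
    # Virtual host names must be percent-encoded, so "/" becomes "%2F".
    path = "/api/policies/{}/{}".format(quote(vhost, safe=""), quote(name, safe=""))
    body = {
        "pattern": pattern,            # regex matched against queue names
        "apply-to": "queues",
        "definition": {
            "ha-mode": "exactly",      # keep a fixed number of replicas
            "ha-params": replicas,     # master plus mirrors, in total
            "ha-sync-mode": "automatic",
        },
    }
    return path, json.dumps(body)


# Hypothetical example: mirror every queue whose name starts with "orders."
# onto two nodes in the default virtual host.
path, body = mirroring_policy_request("/", "ha-orders", r"^orders\.", 2)
```

Mirroring to a subset of nodes instead of all nodes is one way to limit the performance cost described above, at the price of fewer copies of each message.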

Unlike the pre-provisioned plan, the cluster plan does not ship with a default load balancer. Therefore, developers must configure their app to use the array of hosts provided in VCAP_SERVICES. If developers enable queue mirroring, they must also ensure their apps have re-try logic and reconnection logic that iterates over the range of hosts provided. Most common RabbitMQ clients have this logic built into them. For more information, see the Spring Advanced Message Queuing Protocol (Spring AMQP) documentation.
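For clients that do not provide this logic, the idea can be sketched as follows. This is a minimal sketch under stated assumptions: the service label and the location of the hosts array inside the credentials are assumptions about the binding layout, so check them against your own VCAP_SERVICES, and `connect` is a hypothetical callable standing in for whatever your client library uses to open a connection.

```python
import json
from typing import Callable, List


def rabbitmq_hosts(vcap_services: str, label: str = "p.rabbitmq") -> List[str]:
    """Extract broker hosts from a VCAP_SERVICES JSON string.

    ASSUMPTION: the binding is listed under `label` and its credentials carry
    protocols -> amqp -> hosts. Verify this against your own binding.
    """
    services = json.loads(vcap_services)
    credentials = services[label][0]["credentials"]
    return list(credentials["protocols"]["amqp"]["hosts"])


def connect_with_failover(hosts: List[str], connect: Callable[[str], object],
                          rounds: int = 3):
    """Try each host in turn, repeating the whole list up to `rounds` times.

    `connect` is a hypothetical callable that opens a connection to one host
    and raises OSError (ConnectionError is a subclass) when the host is down.
    """
    last_error = None
    for _ in range(rounds):
        for host in hosts:
            try:
                return connect(host)
            except OSError as error:
                last_error = error  # remember the failure, try the next host
    raise ConnectionError("no RabbitMQ node reachable") from last_error
```

In an app, the JSON string would typically come from the VCAP_SERVICES environment variable. A production client would usually also wait between rounds rather than retrying immediately.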

Because the cluster plan is designed to enable app developer teams to self-serve, not having a load balancer in front of the RabbitMQ cluster has these benefits:

  • Manage resources better, as fewer VMs are needed.
  • Help with troubleshooting. Client IP is now the IP of the source container and not the HAProxy.
  • Reduce the number of hops between apps and broker. This helps with latency.
  • Determine queue placement. This makes sense for larger scale deployments.
  • Empower app developer teams to manage their cluster in the best way for their app.
  • Require re-try logic in an app if it needs HA access to a queue. Thus, all nodes can route to a queue if it is available.

Managing On-Demand Resources Through Plans

In configuring each plan, there are a number of operational controls that platform operations teams can use to manage the resources consumed by on-demand RabbitMQ:

  • Control Access—Operators can choose the app development orgs and spaces for which the plans are available and visible. Each plan can be enabled or disabled, and service access and visibility can either be global, or enabled per org and space through the command line.

    For example, you may decide to enable the single node on-demand plan across all app developer teams to meet their demand to isolate their workload. You may then choose to offer the on-demand cluster plan only to a subset of app developer teams who require the extra resources.

  • Set Quotas—You can set a global quota for all on-demand instances that takes precedence over each plan quota. This lets you guard against the risk of over-committing resources, but allows the flexibility of over-committing each plan, so you can meet the fluctuating demands of your app developers.

  • Control Resource Consumption—Each plan offers more fine-grained control over individual plan resource consumption. At the highest level, you can use the plan quota to control the number of instances that can be deployed within a foundation. For each plan, you can also configure the number of nodes that constitute a cluster (3, 5, or 7), the instance type, and persistent disk storage size to best suit your requirements.

  • Monitor—You can monitor the number of instances that have been deployed against the quota you have set so that you can plan future resource requirements.
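The interaction between the global quota and the plan quotas can be illustrated with a short sketch. The function, plan names, and numbers here are hypothetical and only model the rule described above, in which a new instance must fit within both the global quota and the quota of its plan.

```python
from typing import Dict


def can_provision(plan: str, deployed: Dict[str, int],
                  plan_quotas: Dict[str, int], global_quota: int) -> bool:
    """Return True if one more instance of `plan` fits within both quotas."""
    # The global quota caps the total across all on-demand plans.
    if sum(deployed.values()) >= global_quota:
        return False
    # The plan quota caps the instances of this one plan.
    return deployed.get(plan, 0) < plan_quotas[plan]


# Plan quotas add up to 15, which over-commits a global quota of 10.
plan_quotas = {"single-node": 10, "cluster": 5}
deployed = {"single-node": 8, "cluster": 2}

# The cluster plan is below its own quota, but the global quota is used up.
allowed = can_provision("cluster", deployed, plan_quotas, global_quota=10)
```

Here `allowed` is False: the cluster plan has room for three more instances, but the ten instances already deployed exhaust the global quota, which takes precedence.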

Customizing Plan Options

The RabbitMQ for PCF on-demand plans expose a number of configuration options. In most cases, the default configurations meet most app demands. However, it is important for an operations team to consider the options to ensure that they provide the best service to their app developers. This section explains these options.

Configuration Options

Single Node and Cluster Plans

  • Enable or disable the plan
  • Determine which orgs and spaces can see and access the plan
  • Set Service Instance Quota
  • Select AZ placement (where applicable)
  • Set RabbitMQ instance size (CPU and Memory)
  • Set persistent disk size (Persisted Message Store) for the RabbitMQ instance. Ensure the size of the persistent disk is at least twice as large as the instance memory.

Cluster Plan Only

Note: A load balancer, such as HAProxy, is not deployed with on-demand cluster plans.

Things That Are Preconfigured

The following are preconfigured for both the single node and the cluster plans:

  • RabbitMQ VM Type—When installing on PCF v2.0 or later, each RabbitMQ node is configured to have the following properties:

    • CPUs: 2
    • RAM: 8 GB
    • Ephemeral disk: 16 GB

    You can change these settings in the Service Plan Configuration page. Changing these settings affects all nodes.

  • Persistent Disk Type—When installing on PCF v2.0 or later, each RabbitMQ node is configured to have 30 GB of persistent disk space.


    You can change this setting in the Service Plan Configuration page. Pivotal recommends you set this value to be twice the amount of RAM of the selected RabbitMQ VM Type.

  • Metrics—Emitted to the Loggregator Firehose for all on-demand instances. The polling interval is set in the Ops Manager, in the Metrics polling interval field, in the Pre-Provisioned RabbitMQ tab of the RabbitMQ for PCF tile. Due to the impact of some of the cluster settings detailed below, Pivotal strongly recommends that you monitor the exposed metrics and configure alarms as recommended in Monitoring and KPIs for On-Demand RabbitMQ for PCF. See also Monitoring On-Demand RabbitMQ Clusters below.

  • Logs—RabbitMQ on-demand instance logs are forwarded using the same configuration as contained in the Syslog and Metrics tab of the RabbitMQ for PCF tile.

  • Disk free space limit—The disk free space limit is set to 150% of RAM of the instance type you select. For example, if you select an instance type with 10 GB of RAM, the disk free space limit is set to 15 GB. A cluster-wide alarm is triggered if the amount of free disk space drops below this, and all publishers are blocked. Instances must be configured to have persistent disks that are at least twice the size of instance RAM. For more information, see the RabbitMQ documentation.

  • Memory threshold for triggering flow control—Threshold at which flow control is triggered is set to 40% of the instance RAM. This means that when the alarm is triggered, all connections publishing messages are blocked cluster-wide until the alarm is cleared.

    For example, if you select an instance type with 10 GB of RAM, when more than 4 GB of memory is used, all publishing connections are blocked. For more information, see Memory Alarms in the RabbitMQ documentation.

  • Memory paging threshold—This is the level at which RabbitMQ tries to free up memory by instructing queues to page their contents out to disk. This is done to try to avoid reaching the high watermark and blocking publishers. This threshold is set to 50% of the configured high watermark. Because the high watermark is 40% of the instance RAM, paging starts at 20% of the instance RAM.

    For example, if you select an instance type with 10 GB of RAM, when more than 2 GB of memory is used, all queues start writing all queue contents to disk. For more information, see the RabbitMQ documentation.
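The arithmetic behind these preconfigured thresholds can be checked with a short calculation. The function and key names below are hypothetical; the percentages are the ones stated above.

```python
def plan_thresholds(ram_gb: float) -> dict:
    """Derive the preconfigured limits, in GB, from the instance RAM."""
    high_watermark = ram_gb * 40 / 100           # flow control starts here
    return {
        "disk_free_limit_gb": ram_gb * 150 / 100,
        "memory_high_watermark_gb": high_watermark,
        "memory_paging_threshold_gb": high_watermark * 50 / 100,
        "min_persistent_disk_gb": ram_gb * 2,    # recommended minimum disk size
    }


def disk_is_large_enough(ram_gb: float, persistent_disk_gb: float) -> bool:
    """Check the rule that persistent disk is at least twice the RAM."""
    return persistent_disk_gb >= plan_thresholds(ram_gb)["min_persistent_disk_gb"]


limits = plan_thresholds(10)
```

For an instance type with 10 GB of RAM this gives a disk free space limit of 15 GB, a memory high watermark of 4 GB, a paging threshold of 2 GB, and a minimum persistent disk of 20 GB. The default of 8 GB of RAM with a 30 GB persistent disk satisfies the sizing rule.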

Monitoring On-Demand RabbitMQ Clusters

  • It is important to monitor and compare the number of instances that have been deployed against the quota you set via the metric exposed on the Firehose.

  • Each instance is pre-configured to emit metrics to the Firehose and can be identified by the deployment tag, which has the service instance ID. It is important to monitor these metrics as recommended in Monitoring and KPIs for On-Demand RabbitMQ for PCF.

About Migrating a Pre-Provisioned Instance to an On-Demand Instance

Pivotal recommends the on-demand service for production workloads due to its workload isolation.

For instructions for developers about migrating, see Migrating From a Pre-Provisioned Instance to an On-Demand Instance. For how operators can turn off the pre-provisioned service, see Turning Off the Pre-Provisioned Service.

When migrating from a pre-provisioned to an on-demand offering, be aware of the following:

  • Pre-provisioned service instances have very low resource consumption, that is, a virtual host within an existing cluster. However, every on-demand service consists of dedicated VMs. Therefore, you must select an on-demand service plan that provides adequate resources, but avoids unnecessary resource consumption.

    • For example, you might select a single-node plan with a small VM for development purposes, but select a three-node cluster of large VMs for a mission-critical system.
  • Pivotal recommends that you define all required structures in your app to ensure they get defined if you connect to a new instance. These structures include:

    • Exchanges
    • Queues
    • Bindings
  • If your pre-provisioned instance uses any of the following, ensure that you apply them to the on-demand instance:

    • Policies
    • Virtual host-specific parameters, such as max_connections
    You can apply them using the RabbitMQ Management UI or using APIs.

  • You lose messages that have not been consumed when you delete your old service instance. If you do not want to lose messages, do one of the following:

    • Switch your producers to the new instance but keep the consumers bound to the old instance until the queues are empty.
    • Use shovel or federation plugins to consume messages from the old instance.
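The recommendation above to define exchanges, queues, and bindings in the app can be sketched as a start-up routine. This is a minimal sketch: the function and the topology names are hypothetical, and `channel` is assumed to be an open channel offering the exchange_declare, queue_declare, and queue_bind methods with the keyword arguments used by the pika client.

```python
from typing import Iterable, Tuple


def declare_topology(channel,
                     exchanges: Iterable[Tuple[str, str]],
                     queues: Iterable[str],
                     bindings: Iterable[Tuple[str, str, str]]) -> None:
    """Declare exchanges, queues, and bindings on an open AMQP channel.

    Declaring an entity that already exists with the same arguments is a
    no-op, so this can run on every app start, against the old instance or
    a freshly created one.
    """
    for name, kind in exchanges:
        channel.exchange_declare(exchange=name, exchange_type=kind, durable=True)
    for name in queues:
        channel.queue_declare(queue=name, durable=True)
    for queue, exchange, routing_key in bindings:
        channel.queue_bind(queue=queue, exchange=exchange, routing_key=routing_key)


# Hypothetical topology used only for illustration.
EXCHANGES = [("orders", "topic")]
QUEUES = ["orders.created"]
BINDINGS = [("orders.created", "orders", "order.created")]
```

An app would call `declare_topology(channel, EXCHANGES, QUEUES, BINDINGS)` right after connecting, so that a newly provisioned on-demand instance gets the same structures without manual setup. Policies and virtual host parameters are not covered by this routine and still need to be applied separately.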

Differences between Pre-Provisioned and On-Demand Service Instances

There are some differences between pre-provisioned and on-demand service instances that you should be aware of:

  • On-demand instances are not fronted by a load balancer. Therefore, ensure the following:

    • That you configure the RabbitMQ client in your app with all available nodes, not just one.
    • That your re-connection logic can handle a node failure. Pivotal recommends this behavior for Spring AMQP clients.
  • The instance you are migrating to might use a different version of RabbitMQ than your old instance. For more information, see the RabbitMQ for PCF Release Notes and the RabbitMQ Changelog.

  • Critical tier-1 plugins are enabled for on-demand. However, on-demand does not yet have the same plugins enabled as pre-provisioned. If you are missing a plugin, contact your Pivotal representative.

  • You might have configured your on-demand instance differently. For example, you might have changed:

    • VM sizing—CPU, RAM, and disk
    • RabbitMQ network partition behavior. Both offerings default to pause_minority.
  • If your RabbitMQ instance uses TLS, ensure that you enable TLS for on-demand instances. See Configure TLS for Your Service Instance.