Unlocking the Power of On-Demand RabbitMQ for PCF

Note: RabbitMQ for Pivotal Cloud Foundry (PCF) v1.13 is no longer supported because it has reached the End of General Support phase. To stay up to date with the latest software and security updates, upgrade to a supported version.

This topic explains how to benefit from the two on-demand service plans.

Introduction

RabbitMQ for Pivotal Cloud Foundry (PCF) responds to the demands of PCF operators to offer a RabbitMQ on-demand cluster for their application teams, in addition to the existing single-node on-demand plan.

The on-demand cluster plan is designed for workloads that need the same resilience as the pre-provisioned offering but must also be isolated. The platform operations team can configure a RabbitMQ for PCF cluster to meet its business requirements and empower app development teams to self-serve their own RabbitMQ clusters.

RabbitMQ for PCF also provides smoke tests for the on-demand plans so that operations teams can validate the app developer workflow for on-demand services. See Dedicated Instance Smoke Test Process.

Platform operators can now offer their app developers three types of RabbitMQ for PCF service plans:

  • Pre-provisioned—For light to moderate messaging needs, this service is fully operated and managed by platform operators as a service.

  • On-demand single node—For application teams requiring greater isolation than provided by the pre-provisioned approach. App development teams can have full access to their own message broker to adapt the runtime parameters to their requirements. For more information on these parameters, see Parameters and Policies in the RabbitMQ documentation.

  • On-demand cluster—For an increased level of message resilience and cluster availability, as well as the benefits of workload isolation mentioned above.

This topic explains how to benefit from the two on-demand plans above.

Note: The RabbitMQ for PCF tile will only provide the on-demand service in the future. For more information, see Deciding Which Service Plan to Use below.

For information about the pre-provisioned plan, see Deploying the RabbitMQ Pre-Provisioned Service. For information on using pre-provisioned plans to isolate workloads, see Creating Isolation with the Tile Replicator.

Deciding Which Service Plan to Use

In the future, RabbitMQ for PCF will only provide the on-demand service because it is designed for independent, isolated RabbitMQ instances. The existing pre-provisioned offering has many RabbitMQ instances on a single VM, in a multi-tenancy model. In this model, a single misbehaving app can take down the entire cluster for everyone.

From research and feedback on the issues customers had when using the pre-provisioned service, Pivotal is making different design decisions for the on-demand service.

To provide you enough visibility to decide which service to use, the table below describes the current feature discrepancies between the pre-provisioned and on-demand services, and Pivotal’s plans for addressing these discrepancies. Pivotal encourages you to give feedback on how to meet your use case requirements with the on-demand service.

| Feature | Pre-Provisioned Service | On-Demand Service |
|---|---|---|
| Configuration | Enabled via base-64 encoded text box | Plan to address |
| Plug-ins | Tier 1 plug-ins enabled via checkboxes in UI | Plan to address for tier 1 plugins and selected community plugins |
| RabbitMQ admin credentials to access management dashboard | Can set password via tile UI | Evaluating based on customer needs and feedback |
| Erlang cookie | Operator can change, which has caused problems | No plan to address. You can manage the Erlang cookie correctly with the on-demand configuration. |
| RabbitMQ TLS versions | Available due to security concerns about the TLS packaged with the pre-provisioned service | No plan to address. The security concerns are resolved by improved TLS in the on-demand service. |
| External load balancer DNS name | Available | Plan to address. IaaS-specific load balancers can still be used. |
| Disk free alarm limit | Configurable | No plans to address. The default persistent disk size is controlled at the plan level and is set relative to memory. This removes the ability to mis-configure the alarm limit. |
| HAProxy | Exists | No plans to address, for the following reasons: (1) Large RabbitMQ customers have their own hardware load balancers, and small customers appreciate fewer VMs. (2) Network-heavy workloads require scaling the load balancer VMs together with RabbitMQ nodes. (3) Operations teams incur more overhead to support PROXY protocol, causing further overhead at various other layers. |
| RabbitMQ servers static IP | Exists | Evaluating based on customer needs and feedback |
| Policy for new instances | Configurable | Plan to address |
| Almost-instant provisioning | Instantly provisioned | No plans to address. Instant provisioning for on-demand is limited by the IaaS. This limitation will potentially be addressed by containers. |
| Almost-guaranteed provisioning | Guaranteed provisioning | No plans to address due to infrequent failures |
| VM-visibility from Ops Manager | Available | Plan to address |
| Download logs from Ops Manager | Available | Plan to address |
| Individual instance upgrade | Possible when using tile replicator | Plan to address |
| Network partition behavior | Defaults to pause_minority, configurable | Plan to address |
| TLS | TLS enabled between client and broker | TLS between client and broker has been implemented. Client authentication via certificates will be implemented. Inter-node TLS will be implemented. |

On-Demand Single Node Plan Using RabbitMQ 3.7

This plan is designed to be simple to configure, deploy, and use. It gives application teams fast access to the power of the leading open source message broker backed by BOSH to meet all but the most demanding high availability app messaging requirements.

This plan can suit high-performance workloads requiring messaging resilience and asynchronous messaging replication. RabbitMQ copies messages to disk for resilience and allows asynchronous messaging replication through the RabbitMQ Federation plug-in.

This plan offers:

  • Fast access to an isolated instance of RabbitMQ scoped for the application teams
  • Org and Space Administrator access to the RabbitMQ Management UI so application teams can have full control over the node
  • Updates and upgrades initiated and controlled by the operator to keep the instance up-to-date with the latest security patches and bug fixes
  • Message resilience provided through the RabbitMQ Federation plugin (for exchanges and queues) and the Shovel plugin

On-Demand Cluster Plan Using RabbitMQ 3.7

Like the single node plan, this plan is designed to be simple to configure, deploy, and use. It gives application teams fast access to the power of the leading open source message broker backed by BOSH to meet all but the most demanding high availability app messaging requirements.

This plan can suit high performance workloads requiring messaging resilience (copied to disk) and asynchronous messaging replication through the RabbitMQ Federation plugin. With this plan, however, you also scale out RabbitMQ for PCF to multiple nodes.

This plan offers:

  • Fast access to an isolated, clustered instance of RabbitMQ scoped to the application team Orgs and Spaces
  • Administrator access to the RabbitMQ Management UI to give application teams full control over the cluster
  • Updates and upgrades initiated and controlled by the operator to keep the instance up-to-date with the latest security patches and bug fixes
  • Message resilience provided by mirroring queues across RabbitMQ nodes, and the option to use the Federation and Shovel plugins.

General Principles of the Cluster Plan

The following are some general principles to be aware of when configuring the cluster plan:

Designed for Consistency

RabbitMQ clustering is not primarily a solution for increased availability. Instead, it is designed for consistency and partition tolerance, as described in the CAP theorem. RabbitMQ clustering provides increased message consistency through queue mirroring. This means that the messages in a mirrored queue are the same on every node that hosts a copy of that queue. For more information, see Consistency or Availability Tradeoff.

Other options can be used for availability requirements, such as the use of federation between exchanges or queues.

For a detailed description of distributed RabbitMQ brokers, see the RabbitMQ documentation.

Number of Nodes

Every node in the on-demand cluster maintains a complete database of all metadata, and all changes to the metadata are confirmed by every node in the cluster. Therefore, going beyond seven nodes can have a significant negative impact on performance. For optimum resilience and performance, Pivotal recommends three nodes for most workloads.

Network Latency

RabbitMQ clusters are only recommended for deployment in low latency networks, which normally means that it is not advisable to deploy these clusters across availability zones (AZs). The stability and performance of the RabbitMQ cluster is heavily influenced by the workload on the nodes, replication choices, and network latency.

For this reason, Pivotal recommends that you deploy RabbitMQ clusters into a single Ops Manager AZ. However, where different AZs are in the same data center, with reliable low latency links, spanning AZs can be used.

For cloud IaaS deployments, Pivotal does not recommend that deployments span regions. For example, in Amazon Web Services (AWS) terms, deploying a RabbitMQ cluster across AZs within a region should provide high enough network performance to prevent impacting cluster stability. However, deploying across AWS regions is likely to lead to cluster instability. For more information, see the AWS documentation.

Consistency or Availability Tradeoff

In a distributed messaging system, a tradeoff must be made between availability and consistency when a network partition event occurs and one or more nodes are not able to communicate with each other. The cluster plan lets operators decide how they want the RabbitMQ cluster to react in the event of a network partition.

Pivotal recommends keeping the default cluster partition option of pause_minority because this satisfies most use cases. Choosing the pause_minority partition-handling strategy favors message consistency over availability. For more information about the options for handling partitions, see the RabbitMQ documentation. For a detailed description of the options available in RabbitMQ for PCF, see Clustering and Network Partitions.

Here is an example of how pause_minority works. If you create a RabbitMQ cluster with three nodes and one node becomes unable to communicate with the other two, this node is in the minority. The node that is in the minority is paused, and the other two nodes continue serving traffic. If each of the nodes loses connectivity with the other two, then the entire cluster is paused to preserve data since no majority can be established. The cluster heals when two or more nodes are able to communicate with each other.
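The pause_minority behavior in the example above can be modeled in a few lines. This is only an illustrative sketch of the majority rule, not RabbitMQ's implementation; the node names and the `paused_nodes` helper are hypothetical.

```python
from typing import List, Set


def paused_nodes(cluster: Set[str], partitions: List[Set[str]]) -> Set[str]:
    """Return the nodes that a pause_minority-style rule would pause.

    `partitions` lists the groups of nodes that can still reach each other.
    A group keeps running only if it holds a strict majority of the cluster.
    """
    majority = len(cluster) // 2 + 1
    paused: Set[str] = set()
    for group in partitions:
        if len(group) < majority:
            paused |= group
    return paused


nodes = {"rmq-0", "rmq-1", "rmq-2"}  # hypothetical node names

# One node is cut off from the other two: only that node pauses.
print(sorted(paused_nodes(nodes, [{"rmq-0", "rmq-1"}, {"rmq-2"}])))

# Every node is isolated: no majority exists, so the whole cluster pauses.
print(sorted(paused_nodes(nodes, [{"rmq-0"}, {"rmq-1"}, {"rmq-2"}])))
```

With three nodes the majority is two, which is why a single isolated node pauses while the remaining pair keeps serving traffic.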

RabbitMQ Queue Availability

It is important to be aware that message queue availability is different from cluster availability. So, having cluster availability does not mean that all of the messages within the queues are also available.

By default, queues within a RabbitMQ cluster are located on a single node—the node on which they were first declared. However, queues can be configured to mirror across multiple nodes, so that any message published to the queue is replicated to all mirrors. Enabling mirroring can have a negative impact on queue performance because messages must be copied to all mirrors before being acknowledged.

Each mirrored queue consists of one master and one or more mirrors, with the oldest mirror being promoted to the new master if the old master disappears for any reason. Consumers are connected to the master regardless of which node they connect to, and mirrors drop messages that have been acknowledged at the master. Queue mirroring enhances queue availability, but does not distribute load across nodes because each of the participating nodes must still do all the work.

App developers must decide if they want to use queue mirroring and determine the policy they want to apply to their queues. These choices have significant impact on the availability of their queues. For more information, see the RabbitMQ documentation.
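As one possible illustration of such a policy, the sketch below builds a request for the RabbitMQ management HTTP API (PUT /api/policies/{vhost}/{name}) that mirrors matching queues to exactly two nodes. The base URL, policy name, and queue-name pattern are hypothetical, the request is built but not sent, and authentication is left out.

```python
import json
import urllib.parse
import urllib.request


def build_mirror_policy_request(api_base, vhost, name, pattern, replicas=2):
    """Build (but do not send) a PUT request that defines a mirroring policy."""
    body = {
        "pattern": pattern,          # regular expression matched against queue names
        "apply-to": "queues",
        "definition": {
            "ha-mode": "exactly",    # keep a fixed number of replicas
            "ha-params": replicas,   # replica count, including the master
            "ha-sync-mode": "automatic",
        },
    }
    url = "{}/api/policies/{}/{}".format(
        api_base.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),  # the default vhost "/" becomes %2F
        urllib.parse.quote(name, safe=""),
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, policy name, and pattern, for illustration only.
request = build_mirror_policy_request(
    "https://rmq.example.com", "/", "ha-orders", r"^orders\."
)
print(request.get_method(), request.full_url)
```

Mirroring to a fixed number of nodes rather than to all nodes is one way to limit the performance cost of replication described above; whether it suits a given app is a decision for the app developers.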

Unlike the pre-provisioned plan, the cluster plan does not ship with a default load balancer. Therefore, developers must configure their app to use the array of hosts provided in VCAP_SERVICES. If developers enable queue mirroring, they must also ensure their apps have re-try logic and reconnection logic that iterates over the range of hosts provided. Most common RabbitMQ clients have this logic built into them. For more information, see the Spring AMQP documentation.
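For clients that do not have this logic built in, a minimal sketch of iterating over the hosts might look like the following. The service label `p.rabbitmq` and the `hostnames` credentials key are assumptions about the shape of VCAP_SERVICES, so check the actual binding of your instance. The `connect` callable is a hypothetical stand-in for whatever your client library uses to open a connection.

```python
import json
import time


def broker_hosts(vcap_services_json, label="p.rabbitmq"):
    """Collect every broker host listed for the bound service instances.

    Assumes each binding exposes a "hostnames" list in its credentials.
    """
    services = json.loads(vcap_services_json)
    hosts = []
    for instance in services.get(label, []):
        credentials = instance.get("credentials", {})
        hosts.extend(credentials.get("hostnames", []))
    return hosts


def connect_with_failover(hosts, connect, attempts=3, delay_seconds=1.0):
    """Try each host in turn, repeating the whole list up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        for host in hosts:
            try:
                return connect(host)
            except OSError as error:  # covers refused and timed-out connections
                last_error = error
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    raise ConnectionError("no broker host reachable: {}".format(last_error))


# Illustrative data only; a real app would read os.environ["VCAP_SERVICES"].
sample_vcap = json.dumps(
    {"p.rabbitmq": [{"credentials": {"hostnames": ["10.0.0.1", "10.0.0.2"]}}]}
)


def fake_connect(host):
    """Pretend the first node is down so the failover path is exercised."""
    if host == "10.0.0.1":
        raise ConnectionRefusedError("node down")
    return "connected to " + host


print(connect_with_failover(broker_hosts(sample_vcap), fake_connect, delay_seconds=0))
```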

Because the cluster plan is designed to enable application teams to self-serve, not having a load balancer in front of the RabbitMQ cluster has these effects:

  • Resources are easier to manage because fewer VMs are needed.
  • Troubleshooting is easier. The client IP is now the IP of the source container and not that of the HAProxy.
  • There are fewer hops between apps and the broker, which helps with latency.
  • Application teams can determine queue placement, which makes sense for larger scale deployments.
  • Application teams are empowered to manage their cluster in the best way for their app.
  • An app needs re-try logic if it requires HA access to a queue. Any node can route to a queue if the queue is available.

Managing On-Demand Resources Through Plans

In configuring each plan, there are a number of operational controls that platform operations teams can use to manage the resources consumed by on-demand RabbitMQ:

  • Control Access—Operators can choose the app development orgs and spaces for which the plans are available and visible. Each plan can be enabled or disabled, and service access and visibility can either be global, or enabled per org and space through the command line.

    For example, you may decide to enable the single node on-demand plan across all application teams to meet their demand to isolate their workload. You may then choose to offer the on-demand cluster plan only to a subset of application teams who require the extra resources.

  • Set Quotas—You can set a global quota for all on-demand instances that takes precedence over each plan quota. This lets you guard against the risk of over-committing resources, but allows the flexibility of over-committing each plan, so you can meet the fluctuating demands of your app developers.

  • Control Resource Consumption—Each plan offers more fine-grained control over individual plan resource consumption. At the highest level, you can use the plan quota to control the number of instances that can be deployed within a foundation. For each plan, you can also configure the number of nodes that constitute a cluster (3, 5, or 7), the instance type, and persistent disk storage size to best suit your requirements.

  • Monitor—You can monitor the number of instances that have been deployed against the quota you have set so that you can plan future resource requirements.
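The relationship between plan quotas and the global quota described in the list above can be sketched as a simple check. This only models the stated rule (the global quota takes precedence) and is not the service broker's code; the function name and the numbers are hypothetical.

```python
def can_create_instance(plan_count, plan_quota, total_count, global_quota):
    """Return True when a new instance fits both the plan and the global quota."""
    within_plan = plan_count < plan_quota
    within_global = total_count < global_quota
    return within_plan and within_global


# Two plans with a quota of 10 each but a global quota of 15: the plans are
# over-committed (10 + 10 > 15), and the global quota has the final say.
print(can_create_instance(plan_count=5, plan_quota=10, total_count=15, global_quota=15))
print(can_create_instance(plan_count=5, plan_quota=10, total_count=12, global_quota=15))
```

The first call is refused even though the plan has room, because the foundation has already reached the global quota. The second call succeeds.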

Customizing Plan Options

The RabbitMQ for PCF on-demand plans expose a number of configuration options. In most cases, the default configurations meet most app demands. However, it is important for an operations team to consider the options to ensure that they provide the best service to their app developers. This section explains these options.

Configuration Options

Single Node and Cluster Plans

  • Enable or disable the plan
  • Determine which orgs and spaces can see and access the plan
  • Set Service Instance Quota
  • Select AZ placement (where applicable)
  • Set RabbitMQ instance size (CPU and Memory)
  • Set persistent disk size (Persisted Message Store) for the RabbitMQ instance. Ensure the size of the persistent disk is at least twice as large as the instance memory.

Cluster Plan Only

Note: A load balancer, such as HAProxy, is not deployed with on-demand cluster plans.

Things That Are Preconfigured

The following are preconfigured for both the single node and the cluster plans:

  • RabbitMQ VM Type—When installing on PCF v2.0 or later, each RabbitMQ node is configured to have the following properties:

    • CPUs: 2
    • RAM: 8 GB
    • Ephemeral disk: 16 GB

    You can change these settings in the Service Plan Configuration page. Changing these settings affects all nodes.

  • Persistent Disk Type—When installing on PCF v2.0 or later, each RabbitMQ node is configured to have 30 GB of persistent disk space.


    You can change this setting in the Service Plan Configuration page. Pivotal recommends you set this value to be twice the amount of RAM of the selected RabbitMQ VM Type.

  • Metrics—Emitted to the Loggregator Firehose for all on-demand instances. The polling interval is set in the Ops Manager, in the Metrics polling interval field, in the Pre-Provisioned RabbitMQ tab of the RabbitMQ for PCF tile. Due to the impact of some of the cluster settings detailed below, Pivotal strongly recommends that you monitor the exposed metrics and configure alarms as recommended in Monitoring and KPIs for On-Demand RabbitMQ for PCF. See also Monitoring On-Demand RabbitMQ Clusters below.

  • Logs—RabbitMQ on-demand instance logs are forwarded using the same configuration as contained in the Syslog tab of the RabbitMQ for PCF tile.

  • Disk free space limit—The disk free space limit is set to 150% of RAM of the instance type you select. For example, if you select an instance type with 10 GB of RAM, the disk free space limit is set to 15 GB. A cluster-wide alarm is triggered if the amount of free disk space drops below this, and all publishers are blocked. Instances must be configured to have persistent disks that are at least twice the size of instance RAM. For more information, see the RabbitMQ documentation.

  • Memory threshold for triggering flow control—Threshold at which flow control is triggered is set to 40% of the instance RAM. This means that when the alarm is triggered, all connections publishing messages are blocked cluster-wide until the alarm is cleared.

    For example, if you select an instance type with 10 GB of RAM, when more than 4 GB of memory is used, all publishing connections are blocked. For more information, see Memory Alarms in the RabbitMQ documentation.

  • Memory paging threshold—This is the level at which RabbitMQ tries to free up memory by instructing queues to page their contents out to disk. This is done to try to avoid reaching the high watermark and blocking publishers. This threshold is set to 50% of the configured high watermark, which is 20% of configured memory.

    For example, if you select an instance type with 10 GB of RAM, when more than 2 GB of memory is used, all queues start writing all queue contents to disk. For more information, see the RabbitMQ documentation.
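The preconfigured limits above all derive from the RAM of the selected instance type. The following sketch reproduces that arithmetic using the percentages stated in this section; the function and key names are hypothetical.

```python
def plan_thresholds(ram_gb):
    """Derive the preconfigured limits from the instance RAM, in GB."""
    memory_high_watermark = 0.40 * ram_gb  # flow control starts at 40% of RAM
    return {
        "disk_free_limit_gb": 1.5 * ram_gb,                         # 150% of RAM
        "memory_high_watermark_gb": memory_high_watermark,
        "memory_paging_threshold_gb": 0.5 * memory_high_watermark,  # 50% of watermark
        "min_persistent_disk_gb": 2 * ram_gb,                       # at least twice RAM
    }


# The 10 GB instance used in the examples above.
for name, value in plan_thresholds(10).items():
    print(name, value)
```

For a 10 GB instance this gives a 15 GB disk free limit, a 4 GB flow control threshold, a 2 GB paging threshold, and a minimum persistent disk of 20 GB, matching the examples in the text.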

Monitoring On-Demand RabbitMQ Clusters

  • It is important to monitor and compare the number of instances that have been deployed against the quota you set via the metric exposed on the Firehose.

  • Each instance is pre-configured to emit metrics to the Firehose and can be identified by the deployment tag, which has the service instance ID. It is important to monitor these metrics as recommended in Monitoring and KPIs for On-Demand RabbitMQ for PCF.

About Migrating a Pre-Provisioned Instance to an On-Demand Instance

Pre-provisioned service instances have very low resource consumption because each one is only a vhost within an existing cluster. However, every on-demand service instance consists of dedicated VMs. Therefore, you must select an on-demand service plan that provides adequate resources but avoids unnecessary resource consumption.

For example, you might select a single-node plan with a small VM for development purposes, but select a three-node cluster of large VMs for a mission critical system.

  • Pivotal recommends that you define all required structures in your app to ensure they get defined if you connect to a new instance. These structures include:

    • Exchanges
    • Queues
    • Bindings
  • If your pre-provisioned instance uses any of the following, ensure that you apply them to the on-demand instance:

    • Policies
    • vhost specific parameters, such as max_connections
    You can apply them using the RabbitMQ Management Dashboard or using APIs.

  • You lose messages that have not been consumed when you delete your old service instance. If you do not want to lose messages, do one of the following:

    • Switch your producers to the new instance but keep the consumers bound to the old instance until the queues are empty.
    • Use shovel or federation plugins to consume messages from the old instance.
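As a sketch of the shovel option, the following builds a request for the RabbitMQ management HTTP API that declares a dynamic shovel moving the messages of one queue from the old instance to the new one. It assumes the Shovel and Shovel management plugins are enabled on the instance that receives the request. The URIs, shovel name, and queue name are hypothetical, and the request is built but not sent.

```python
import json
import urllib.parse
import urllib.request


def build_drain_shovel_request(api_base, vhost, name, old_uri, new_uri, queue):
    """Build (but do not send) a PUT request that declares a dynamic shovel."""
    body = {
        "value": {
            "src-uri": old_uri,      # AMQP URI of the pre-provisioned instance
            "src-queue": queue,
            "dest-uri": new_uri,     # AMQP URI of the on-demand instance
            "dest-queue": queue,
            # Remove the shovel once the messages present at start-up are moved.
            "src-delete-after": "queue-length",
        }
    }
    url = "{}/api/parameters/shovel/{}/{}".format(
        api_base.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(name, safe=""),
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical endpoints and names, for illustration only.
request = build_drain_shovel_request(
    "https://new-rmq.example.com",
    "/",
    "drain-orders",
    "amqp://user:pass@old-rmq.example.com",
    "amqp://user:pass@new-rmq.example.com",
    "orders",
)
print(request.get_method(), request.full_url)
```

One shovel is needed per queue that still holds unconsumed messages, and the destination queue should already exist or be declared by your app on the new instance.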

How to Migrate from a Pre-Provisioned Instance to an On-Demand Instance

To migrate from one service instance to another, do the following:

  1. Create an on-demand instance.

  2. Bind it to your application.

  3. Unbind the pre-provisioned instance from your application.

  4. Restart your application.

  5. When ready, delete your pre-provisioned instance.

  6. If you have service-keys for the old instance, make sure to re-create them using the new instance and replace the old credentials.
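Steps 1 through 5 map onto standard cf CLI commands. The sketch below only assembles and prints those commands so that you can review them before running anything. The service offering name `p.rabbitmq`, the plan name, and the app and instance names are assumptions; check `cf marketplace` for the values in your foundation.

```python
def migration_commands(app, old_instance, new_instance, plan, service="p.rabbitmq"):
    """Return the cf CLI commands for steps 1 to 5, in order, as argument lists."""
    return [
        ["cf", "create-service", service, plan, new_instance],  # step 1
        ["cf", "bind-service", app, new_instance],              # step 2
        ["cf", "unbind-service", app, old_instance],            # step 3
        ["cf", "restart", app],                                 # step 4
        ["cf", "delete-service", old_instance],                 # step 5
    ]


# Hypothetical names, for illustration only.
for command in migration_commands("orders-app", "rmq-shared", "rmq-dedicated", "single-node"):
    print(" ".join(command))
```

On-demand instances are created asynchronously, so wait until `cf service` reports that creation has succeeded before binding. Delete the old instance only after its queues have been drained, as described in the previous section.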

Differences between Pre-Provisioned and On-Demand Service Instances

There are some differences between pre-provisioned and on-demand service instances that you should be aware of:

  • On-demand instances are not fronted by a load balancer. Therefore, ensure the following:

    • That you configure the RabbitMQ client in your application with all available nodes, not just one.
    • That your re-connection logic can handle a node failure. Pivotal recommends this behavior for Spring AMQP clients.
  • The instance you are migrating to might use a different version of RabbitMQ than your old instance. For more information, see the RabbitMQ for PCF Release Notes and the RabbitMQ Changelog.

  • Critical Tier-1 plugins are enabled for on-demand. However, on-demand does not yet have the same plugins enabled as pre-provisioned. If you are missing a plugin, contact your Pivotal representative.

  • You might have configured your on-demand instance differently. For example, you might have changed:

    • VM sizing—CPU, RAM, and disk
    • RabbitMQ network partition behavior (both offerings default to pause_minority)
  • If your RabbitMQ instance uses TLS, ensure that you enable TLS for on-demand instances. See Configure TLS for Your Service Instance.