Upgrading RabbitMQ for PCF

Note: RabbitMQ for Pivotal Cloud Foundry v1.13 is no longer supported because it has reached the End of General Support phase. To stay up to date with the latest software and security updates, upgrade to a supported version.

RabbitMQ for PCF enables automated upgrades between versions of the product. In some versions, you might be required to take the RabbitMQ cluster offline. Whenever this is necessary, it is noted in the release notes for those versions.

The upgrade paths for each version are detailed in the Product Compatibility Matrix.

This topic applies to both the on-demand and pre-provisioned services.

About the Upgrade

This section contains information about upgrading RabbitMQ for PCF.

Warning: You must follow the upgrade procedure described in Upgrade the RabbitMQ for Pivotal Cloud Foundry Pre-Provisioned Service to successfully upgrade from RabbitMQ for PCF v1.12 to v1.13. If you do not follow the procedure, you will experience prolonged downtime and might experience data loss.

About Upgrading to RabbitMQ for PCF v1.13

As of v1.13, RabbitMQ for PCF no longer supports RabbitMQ v3.6. Manual steps are required to upgrade both the pre-provisioned and the on-demand service to RabbitMQ for PCF v1.13. This is due to incompatibilities between RabbitMQ versions v3.6 and v3.7, as well as a major Erlang update provided by RabbitMQ for PCF v1.13.

Due to these constraints, you cannot do a rolling upgrade of the cluster for this tile release.

There are two upgrade procedures for this release, depending on the type of service you are upgrading: one for the pre-provisioned service and one for the on-demand service. Both procedures are described below.

General Notes About the Upgrade Process

The following notes apply to upgrading to any version of RabbitMQ for PCF.

  • Upgrading to a newer version of the product does not cause any loss of data or configuration.

  • It might take busy RabbitMQ nodes a long time to shut down during the upgrade and you must not interrupt this process.

  • To benefit from rolling upgrades, configure your apps to reconnect after a node restarts. For more information, see Handling Node Restarts in Applications in the RabbitMQ documentation.

  • The benefit you get from stemcell rolling upgrades depends on how you have configured network partition handling and the Resource Config tab. An HAProxy instance count of 2 and a RabbitMQ node count of 3 are required for rolling stemcell upgrades. These counts are the default. For more information, see Clustering and Network Partitions.

  • Ops Manager ensures the instances are updated with the new packages and any configuration changes are applied automatically.

Release Policy

When a new version of RabbitMQ is released, a new version of RabbitMQ for PCF is released soon after.

For more information about the Pivotal Platform release policy, see Release Policy.

Downtime When Upgrading

A guide for downtime during upgrade deployments is shown in the table below. In some cases, the RabbitMQ cluster remains available during a tile upgrade, but individual queues on cluster nodes might be taken offline.

The length of the downtime depends on whether there is a stemcell update to replace the operating system image or whether the existing VM can just have the RabbitMQ software updated. Stemcell updates incur additional downtime while the IaaS creates the new VM.

The RabbitMQ cluster becomes unavailable only when upgrading between specific versions of Erlang or RabbitMQ. This is stated in the release notes for those versions.

Important: The following table is only a guide. Always check the release notes for the version you are upgrading to.

| Upgrade Type | Will Downtime Be Required for This Upgrade/Update? |
| --- | --- |
| Major tile version | The RabbitMQ cluster is taken offline for the duration of the upgrade. |
| Minor tile version | The RabbitMQ cluster is taken offline for the duration of the upgrade. |
| Patch tile version | Normally these are rolling deployments, with each node updated in turn. In these cases the cluster remains available, but individual queues might be taken offline as each node is restarted. Specific migration paths that require downtime are identified in the release notes for that version. |
| Stemcell-only patch tile version | When the patch update is only a new stemcell version, these are rolling deployments with each node updated in turn. In these cases the cluster remains available, but individual queues might be taken offline as each node is restarted. |

Prerequisite

Before you upgrade, you must ensure that the following directory does not exist:

/var/vcap/store/rabbitmq/Mnesia.rabbit@…

If this directory exists, you must move Mnesia.rabbit@… from /var/vcap/store/rabbitmq. If you do not move this directory, you might not be able to restart RabbitMQ successfully after the upgrade.

If you have already upgraded, moving this directory from /var/vcap/store/rabbitmq resolves the issue.
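A minimal sketch of this check, assuming you have SSHed into a RabbitMQ server VM. The glob stands in for the node-specific directory name, and the backup destination is a hypothetical path; override STORE and BACKUP for a dry run:

```shell
# Pre-upgrade check: move any stray Mnesia.rabbit@* directory out of the store.
# STORE and BACKUP default to assumed paths; override them to test elsewhere.
STORE="${STORE:-/var/vcap/store/rabbitmq}"
BACKUP="${BACKUP:-/var/vcap/store/mnesia-backup}"
found=0
for d in "$STORE"/Mnesia.rabbit@*; do
  [ -d "$d" ] || continue          # an unmatched glob stays literal; skip it
  found=1
  mkdir -p "$BACKUP"
  mv "$d" "$BACKUP"/
  echo "moved $d to $BACKUP"
done
[ "$found" -eq 0 ] && echo "no stray Mnesia directory found"
```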

Upgrade the RabbitMQ for PCF Pre-Provisioned Service

RabbitMQ for PCF v1.13 deploys RabbitMQ v3.7 to run the pre-provisioned service. Previous versions deploy RabbitMQ v3.6.

To upgrade the pre-provisioned service, also known as the multi-tenant service or p-rabbitmq, you must stop all but one node in the RabbitMQ cluster, perform the upgrade, and then restart the stopped nodes. This process incurs downtime for the duration of the upgrade and includes these two procedures:

  1. Prepare to Upgrade the Pre-Provisioned Service

  2. Upgrade the Pre-Provisioned Service

Prepare to Upgrade the Pre-Provisioned Service

Before installing RabbitMQ for PCF v1.13, do the following to prepare for the upgrade:

  1. Decide on the upgrade strategy.

    Pivotal recommends you try the strategies in the order shown below and only go on to the next strategy if the previous one is not possible.

    1. Drain your message queues as much as possible by stopping the producers and allowing the consumers to empty the queues.

      To find out which queues currently store the most messages, use the RabbitMQ Management Dashboard or the rabbitmqadmin list queues name message_bytes command.

      Note: Ideally, all queues in RabbitMQ should be empty or almost empty. Upgrading with more data stored in RabbitMQ significantly increases memory and disk requirements, and extends the upgrade duration.

    2. If it was not possible to drain the message queues, then disable automatic queue synchronization.

      Change any automatic synchronization policy you might have to be manual using ha-sync-mode: manual. This prevents data replication when shutting down nodes during upgrade. The remaining nodes do not create new replicas, therefore, memory and disk requirements are significantly lower. For more information about setting RabbitMQ policies, see the RabbitMQ documentation.

    3. If it was not possible to disable automatic queue synchronization, then ensure that your RabbitMQ cluster has sufficient RAM and disk space.

      As RabbitMQ nodes are shut down, messages are migrated onto the remaining nodes until all messages end up on the last remaining node. As a result, that node may need more RAM and disk space.

      The required amount of disk space and RAM depends on your environment. Pivotal recommends that you do the upgrade on a pre-production environment that is similar to your production environment to see if disk space and RAM estimates are sufficient.


      To increase RAM and disk space, do the following:

      1. Click the Resource Config tab.

      2. For the RabbitMQ node, select an appropriate VM Type and click Save.

      3. If you are using Ops Manager v2.3 or later, click Review Pending Changes. For more information about this Ops Manager page, see Reviewing Pending Product Changes.

      4. Click Apply Changes in the Ops Manager Dashboard.
  2. Ensure that your RabbitMQ cluster has sufficient persistent disk.


    Persistent disk usage might double during this upgrade as RabbitMQ creates a backup of the persistent database and performs the database migration. This is on top of the increased disk usage caused by non-empty queues and queue synchronization.


    Ideally, persistent disk usage should not exceed 40% of the disk size. For example:

    • You have RabbitMQ server nodes with 30 GB persistent disk.
    • Pre-upgrade persistent disk usage is 15 GB.
    • Persistent disk usage jumps to 30 GB during upgrade.
    • The file system reserves 5% of persistent disk for the root user.

    In this case, increase the persistent disk size to at least 32 GB, and preferably 40 GB. For more information, see Increasing Persistent Disk During Upgrade.

  3. Ensure there are no disk space or memory alarms:

    1. Browse to pivotal-rabbitmq.SYSTEM-DOMAIN, where SYSTEM-DOMAIN is your system domain. You can find this by clicking the PAS tile in the Ops Manager Installation Dashboard and then clicking the Domains tab.

    2. Log in with the admin credentials you specified in the Pre-Provisioned RabbitMQ tab when you configured the tile.

    3. In the Overview tab, check the Nodes section for alarm status. Make sure that all metrics are green.
  4. In the RabbitMQ Management UI, ensure the cluster is healthy.


    Do not rely on the BOSH instances output. That indicates the state of the Erlang VM, not RabbitMQ.

    WARNING: Do not attempt the upgrade if there are active alarms or if the cluster is not in a healthy state.
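The second preparation strategy above (disabling automatic queue synchronization) can be sketched with rabbitmqctl set_policy. The policy name ha-all, the match pattern, and the vhost / are assumptions; substitute the names from your own synchronization policy. The command is echoed here for review rather than executed:

```shell
# Hypothetical policy update: switch an existing HA policy to manual sync.
VHOST="/"
POLICY_NAME="ha-all"                                    # assumed policy name
PATTERN="^"                                             # assumed match pattern
DEFINITION='{"ha-mode":"all","ha-sync-mode":"manual"}'
# Echoed for review; drop the `echo` to apply it on a RabbitMQ node.
echo rabbitmqctl set_policy -p "$VHOST" "$POLICY_NAME" "$PATTERN" "$DEFINITION"
```

After the upgrade completes, the same command with ha-sync-mode set to automatic restores the original synchronization behavior.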

Upgrade the Pre-Provisioned Service

To upgrade the pre-provisioned service to RabbitMQ for PCF v1.13, do the following:

  1. Stage the tile. For instructions, see Download and Install RabbitMQ for PCF.


    Adjust the configuration settings if necessary. Pivotal recommends you do not make any adjustments unless they are to increase the disk or RAM.

  2. Use the BOSH CLI to stop the HAProxy node by running the following command:

    bosh -d p-rabbitmq-GUID stop rabbitmq-haproxy
    

    Where GUID is the RabbitMQ for PCF deployment GUID.

    • If you have multiple HAProxy nodes, stop all of them.

    • If you are using an external load balancer and are concerned about losing some messages, stop forwarding connections to RabbitMQ server nodes.

    At this point, apps can no longer communicate with the RabbitMQ service.

  3. Use the BOSH CLI to stop all but one of the RabbitMQ server nodes.


    Run this command on RabbitMQ server nodes until you have only one node running:

    bosh -d p-rabbitmq-GUID stop rabbitmq-server/NODE-INDEX
    

    Where:

    • GUID is the RabbitMQ for PCF deployment GUID.
    • NODE-INDEX is the index number of the node.

    To keep track of your progress, Pivotal recommends that you stop all nodes except for the node with index 0.

    Note: When shutting down the nodes you may see a post-deploy script failed error. You can safely ignore it and continue until there is only one node left.

  4. Run the following command to SSH into the RabbitMQ server that is currently running:

    bosh -d p-rabbitmq-GUID ssh rabbitmq-server/NODE-INDEX
    
  5. Verify the contents of the nodes_running_at_shutdown file by running the following:

    cat /var/vcap/store/rabbitmq/mnesia/db/nodes_running_at_shutdown
    

    There should be only one node listed. For example:

     bash$ cat /var/vcap/store/rabbitmq/mnesia/db/nodes_running_at_shutdown
    [rabbit@2d6a7c96149d5cb12be2c06bf9b19042].
    

    If there is more than one node listed, see How to identify the correct upgrading node during RabbitMQ Tile upgrade 1.12 to 1.13 in the Pivotal Support knowledge base before continuing to the next step.

  6. Turn off the errands.

    1. In the Ops Manager Dashboard, set all the errands to Off.

    2. If there is an INSTALL Spring Cloud Services section, expand it, and then set all the errands to Off.

    Note: The errands fail if you leave them on because the RabbitMQ HAProxy was stopped.

  7. If you are using Ops Manager v2.3 or later, click Review Pending Changes. For more information about this Ops Manager page, see Reviewing Pending Product Changes.

  8. Click Apply Changes.


    You should now have a single RabbitMQ node running with RabbitMQ v3.7.

  9. Start the RabbitMQ server nodes you stopped previously.

    1. For each node you shut down previously, run this command:

      bosh -d p-rabbitmq-GUID start rabbitmq-server/NODE-INDEX
      

      Where:

      • GUID is the RabbitMQ for PCF deployment GUID.
      • NODE-INDEX is the index number of the node.

      Note: To minimize downtime, you can run the bosh start command for multiple nodes in parallel. This is important if you have a lot of data in your queues during migration (not recommended) or a large number of vhosts.

    2. Make sure all nodes are running by checking the BOSH vms:

      bosh -d p-rabbitmq-GUID vms
      
  10. At this point, all nodes should be running. However, the cluster cannot serve any traffic until each node completes its internal datastore migrations.

    • The duration of this process depends on the number of messages, number of vhosts, and your hardware. You can assume about 1 minute for 1 GB of stored messages and 1 second for each vhost (even if all queues are empty).

    • You can monitor the progress by tailing the logs of your RabbitMQ nodes with the following command:

      bosh -d p-rabbitmq-GUID logs -f rabbitmq-server
      

      Where GUID is the RabbitMQ for PCF deployment GUID.

      If you see messages such as Started message store of type persistent or Setting permissions for 'broker' (repeated for different vhosts), wait until these processes finish before going to the next step.
  11. Start the HAProxy node.


    Run this command:

    bosh -d p-rabbitmq-GUID start rabbitmq-haproxy
    

    Where:

    • GUID is the RabbitMQ for PCF deployment GUID.
  12. Make sure all nodes are running by checking their status in the RabbitMQ Management Dashboard.

  13. Run the following errands with BOSH CLI commands. In the following commands, GUID is the RabbitMQ for PCF deployment GUID.

    • Broker Registrar:

       bosh -d p-rabbitmq-GUID run-errand broker-registrar
      
    • Smoke Tests:

       bosh -d p-rabbitmq-GUID run-errand smoke-tests
      
  14. If you disabled automatic synchronization policies in a previous step, you can now enable them using ha-sync-mode: automatic.

  15. Make sure that your apps reconnect to RabbitMQ.

    WARNING: Some RabbitMQ client libraries do not support automatic reconnection so you might have to restart some apps.
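The single-node check in step 5 can be sketched as a small helper that counts the rabbit@ entries in the shutdown file. The sample file below stands in for the real path, /var/vcap/store/rabbitmq/mnesia/db/nodes_running_at_shutdown:

```shell
# Count the rabbit@NODE entries in an Erlang list such as "[rabbit@a,rabbit@b].".
count_nodes() {
  grep -o 'rabbit@[^],]*' "$1" | wc -l | tr -d ' '
}

# Demonstrated on a sample file; exactly one node means it is safe to proceed.
printf '[rabbit@2d6a7c96149d5cb12be2c06bf9b19042].\n' > /tmp/nodes_running_at_shutdown
count_nodes /tmp/nodes_running_at_shutdown    # prints 1 for this sample
```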

Upgrade the RabbitMQ for PCF On-Demand Service

As of v1.13, RabbitMQ for PCF no longer provides on-demand service plans with RabbitMQ v3.6. Therefore, to upgrade to RabbitMQ for PCF v1.13, app developers must migrate their RabbitMQ v3.6 instances to RabbitMQ v3.7. Using blue-green app deployments, you can do this without downtime.

There are three main steps to upgrade the on-demand RabbitMQ for PCF v1.13:

  1. Migrate On-Demand Instances from RabbitMQ v3.6 to v3.7

  2. Ensure There Are No On-Demand Service Instances Running RabbitMQ v3.6

  3. Stage the Tile

For issues with upgrading the on-demand service, see the troubleshooting information for this release.

Migrate On-Demand Instances from RabbitMQ v3.6 to v3.7

To migrate on-demand instances from RabbitMQ v3.6 to v3.7, do the following:

  1. In your existing deployment of RabbitMQ for PCF v1.12, create on-demand RabbitMQ v3.7 service plans. Be sure to configure the v3.7 on-demand plans to match the configurations of your existing v3.6 on-demand plans.

  2. Migrate your on-demand service instances from RabbitMQ v3.6 to v3.7, using blue-green deployments. The process you use to do this depends on how much your apps can tolerate downtime.

    For guidance on how to do this migration, see the documentation on blue-green application deployments.
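A blue-green migration for a single app can be sketched as follows. All names (the app, the service instances, and the v3.7 plan) are placeholders, and the cf commands are wrapped in an echo so the sequence can be reviewed as a dry run first:

```shell
APP="my-app"                # placeholder app name
OLD_INSTANCE="rabbit-36"    # existing v3.6 service instance (placeholder)
NEW_INSTANCE="rabbit-37"    # replacement v3.7 service instance (placeholder)
PLAN="single-node-3.7"      # assumed name of a v3.7 on-demand plan

cf() { echo "cf $*"; }      # dry-run wrapper; remove it to run for real

cf create-service p.rabbitmq "$PLAN" "$NEW_INSTANCE"
cf bind-service "$APP" "$NEW_INSTANCE"
cf restage "$APP"                        # app picks up the new binding
cf unbind-service "$APP" "$OLD_INSTANCE"
cf delete-service "$OLD_INSTANCE" -f     # only after traffic has drained
```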

Ensure There Are No On-Demand Service Instances Running RabbitMQ v3.6

After migration there should be no on-demand service instances running RabbitMQ v3.6. You must verify this before upgrading from RabbitMQ for PCF v1.12 to v1.13.

Note: The upgrade fails if any on-demand service instances are running RabbitMQ v3.6.

To search for on-demand service instances running RabbitMQ v3.6, do the following:

  1. Target the CF deployment used in the environment you want to upgrade.

  2. Search for the on-demand service p.rabbitmq:

    cf curl /v2/services?q=label:p.rabbitmq
    
  3. In the output from step 2, find service_plans_url. This shows the URL for all service plans. Run cf curl against that URL. For example:

    cf curl /v2/services/SERVICE-GUID/service_plans
    

    This returns a list of all RabbitMQ for PCF on-demand service plans, which might include plans using RabbitMQ v3.6 and plans using RabbitMQ v3.7.

  4. In the list returned above:

    1. Locate the service plans using RabbitMQ v3.6.

    2. For each of these service plans, find the property service_instances_url and note the URL for that plan.
  5. Run cf curl against each URL found in step 4. For example:

    cf curl /v2/service_plans/SERVICE-PLAN-GUID/service_instances
    
  6. Ensure that the result for each service plan shows zero instances using RabbitMQ v3.6.
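The check in steps 5 and 6 comes down to reading total_results from each cf curl response. A minimal sketch, using a sample response body in place of a live cf curl call (the total_results field is part of the Cloud Controller v2 API response shape):

```shell
# Sample response standing in for:
#   cf curl /v2/service_plans/SERVICE-PLAN-GUID/service_instances
RESPONSE='{"total_results":0,"resources":[]}'

# Extract total_results with sed so no extra tooling is required.
COUNT=$(printf '%s' "$RESPONSE" | sed -n 's/.*"total_results":\([0-9]*\).*/\1/p')
echo "instances still on this plan: $COUNT"    # must be 0 before upgrading
```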

Stage the Tile

Do the following:

  1. Follow the steps in Download and Install RabbitMQ for PCF.

  2. Run the following errands with BOSH CLI commands. In the following commands, GUID is the RabbitMQ for PCF deployment GUID.

    • Register Broker:

      bosh -d p-rabbitmq-GUID run-errand register-broker
      

      This automatically deletes the deprecated plans that use RabbitMQ v3.6, but keeps plans that use RabbitMQ v3.7.

    • Smoke Tests:

      bosh -d p-rabbitmq-GUID run-errand smoke-tests
      