Troubleshooting Pivotal Cloud Foundry IPsec Add-On


This topic explains how to verify that strongSwan-based IPsec works with your Pivotal Cloud Foundry (PCF) deployment, and provides general recommendations for troubleshooting IPsec.

Verify that IPsec Works with PCF

To verify that IPsec works between two hosts, check that traffic is encrypted with tcpdump, perform a ping test, and review the logs, using the steps below.

  1. Check traffic encryption and perform the ping test. Select two hosts in your deployment with IPsec enabled and note their IP addresses. These are referenced below as IP-ADDRESS-1 and IP-ADDRESS-2.
    1. SSH into IP-ADDRESS-1.
      $ ssh IP-ADDRESS-1
    2. On the first host, run the following, and allow it to continue running.
      $ sudo tcpdump host IP-ADDRESS-2
    3. From a separate command line, run the following:
      $ ssh IP-ADDRESS-2
    4. On the second host, run the following:
      $ ping IP-ADDRESS-1
    5. Verify that the packet type is ESP. If the traffic shows ESP as the packet type, traffic is successfully encrypted. The output from tcpdump will look similar to the following:
      03:01:15.242731 IP IP-ADDRESS-2 > IP-ADDRESS-1: ESP(spi=0xcfdbb261,seq=0x3), length 100
  2. Open the /var/log/daemon.log file to obtain a detailed report, including information pertaining to the type of certificates you are using, and to verify that there is an established connection.
  3. Navigate to your Installation Dashboard, and click Recent Install Logs to view information regarding your most recent deployment. Search for “ipsec” and the status of the IPsec job.
  4. Run ipsec statusall to return a detailed status report on your connections. The typical path for this binary is /var/vcap/packages/strongswan-x.x.x/sbin, where x.x.x represents the version of strongSwan packaged into the IPsec add-on.
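For example, keeping the x.x.x placeholder (substitute the strongSwan version packaged in your deployment):

  $ /var/vcap/packages/strongswan-x.x.x/sbin/ipsec statusall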

If IPsec does not establish a secure connection, return to the Installing IPsec topic and review your installation.

If you encounter issues with installing IPsec, see the Troubleshoot IPsec section below.

Troubleshoot IPsec

IPsec Installation Issues

Symptom

Unresponsive applications or incomplete responses, particularly for large payloads

Explanation: Packet Loss

IPsec packet encryption increases the size of packet payloads on host VMs. If the size of the larger packets exceeds the maximum transmission unit (MTU) size of the host VM, packet loss may occur when the VM forwards those packets.

If your VMs were created with an Amazon PV stemcell, the default MTU value is 1500 for both host VMs and application containers. If your VMs were created with an Amazon HVM stemcell, the default MTU value is 9001. Garden containers default to an MTU of 1500.
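To check whether encrypted packets are exceeding the path MTU, you can probe between two hosts with don't-fragment pings; this is a quick diagnostic sketch using the Linux iputils ping:

  $ ping -M do -s 1472 IP-ADDRESS-2    # 1472 payload bytes + 28 header bytes = 1500; lower -s until replies succeed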

Solution

Implement an MTU difference of 100 between the host VM and the application containers it hosts, using one of the following approaches:

  • Decrease the MTU of the application containers to a value lower than the MTU of the VM for that container. In the Elastic Runtime tile configuration, click Networking and modify Applications Network Maximum Transmission Unit (MTU) (in bytes) before you deploy. Decrease it from the default value of 1454 to 1354.

  • Increase the MTU of the application container VMs to a value greater than 1500. Pivotal recommends a headroom of 100. Run ifconfig NETWORK-INTERFACE mtu MTU-VALUE to make this change, replacing NETWORK-INTERFACE with the network interface used to communicate with other VMs. For example: $ ifconfig NETWORK-INTERFACE mtu 1600
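To confirm the change took effect, you can read back the interface MTU; a quick check, assuming eth0 is the relevant interface:

  $ ip link show eth0 | grep -o "mtu [0-9]*"    # expect "mtu 1600" after the change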


Symptom

Unresponsive applications or incomplete responses, particularly for large payloads

Explanation: Network Degradation

IPsec data encryption increases the size of packet payloads. If the number of requests and the size of your files are large, the network may degrade.

Solution

Scale your deployment by allocating more processing power to your VMs, which also decreases packet encryption time. Another way to increase network performance is to compress data prior to encryption; this improves performance by reducing the amount of data transferred.
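For example, if your apps speak HTTP, requesting compressed responses reduces the number of bytes IPsec must encrypt; the endpoint below is hypothetical:

  $ curl --compressed http://my-app.example.com/large-report    # sends Accept-Encoding: gzip and decompresses the reply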

IPsec Runtime Issues

Symptom

Errors relating to IPsec, including symptoms of network partition. You may receive an error indicating that IPsec has stopped working.

For example, this error shows a symptom of IPsec failure, a failed clock_global-partition:

Failed updating job clock_global-partition-abf4378108ba40fd9a43 > clock_global-partition-abf4378108ba40fd9a43/0
(ddb1fbfa-71b1-4114-a82c-fd75867d54fc) (canary): Action Failed get_task: Task 044424f7-c5f2-4382-5d81-57bacefbc238
result: Stopping Monitored Services: Stopping service ipsec: Sending stop request to Monit: Request failed,
response: Response{ StatusCode: 503, Status: '503 Service Unavailable' } (00:05:22)

Explanation: Asynchronous monit Job Priorities

When a monit stop command is issued to the NFS mounter job, it hangs, preventing a shutdown of the PCF cluster.

This is not a problem with the IPsec add-on release itself. Rather, it is a known issue with the NFS mounter job and the monit stop script that can manifest itself after IPsec is deployed with PCF v1.7.

This issue occurs because monit job priorities are asynchronous: the order of job shutdown is arbitrary, so the IPsec job may be stopped first. After this happens, the VM loses network connectivity, and the NFS mounter job loses visibility of its associated storage. This causes the NFS mounter job to hang, which blocks the monit stop from completing. See the monit job details on GitHub for further information.
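To see which jobs monit is still trying to stop on the affected VM, you can bosh ssh to it and inspect the monitored process states; monit summary is a standard monit subcommand, and the path below is where monit lives on BOSH stemcells:

  $ sudo /var/vcap/bosh/bin/monit summary    # lists each monitored job and its current state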

Note: This issue affects deployments using CF v231 or earlier. In CF v232, the release uses an nginx blobstore instead of the NFS blobstore, so the error does not exist for PCF deployments using CF releases later than v231. The error also does not apply to PCF deployments that use WebDAV as their Cloud Controller blobstore.

Solution

  1. bosh ssh into the stuck instance:
    $ bosh ssh JOB INDEX
    
  2. Authenticate as root and use the sv stop agent command to kill the BOSH Agent:
    $ sudo su
    # sv stop agent
    
  3. From a machine with the BOSH CLI, run bosh cloudcheck to detect the missing VM:
    $ bosh cloudcheck
    VM with cloud ID `vm-3e37133c-bc33-450e-98b1-f86d5b63502a' missing:

    - Ignore problem
    - Recreate VM using last known apply spec
    - Delete VM reference (DANGEROUS!)
  4. Choose Recreate VM using last known apply spec.

  5. Continue with your deploy procedure.


Symptom

  • App fails to start with the following message:

    FAILED
    Server error, status code: 500, error code: 10001, message: An unknown error occurred.

    The Cloud Controller log shows that it is unable to communicate with Diego because getaddrinfo fails.

  • Deployment fails with an error message similar to the following:

    diego_database-partition-620982d595434269a96a/0 (a643c6c0-bc43-411b-b011-58f49fb61a6f)' is not running after update.
    Review logs for failed jobs: etcd

Explanation: Consul Split Brain

This error indicates a “split brain” issue with Consul.

Solution

Confirm this diagnosis by checking the peers.json file in /var/vcap/store/consul_agent/raft; a sample check appears after the steps below. If the file contains null, there may be a split brain. To fix this problem, follow these steps:

  1. Run monit stop consul_agent on all Consul servers.
  2. Run rm -rf /var/vcap/store/consul_agent/ on all Consul servers.
  3. Run monit start consul_agent on all Consul servers one at a time.
  4. Restart the consul_agent process on the Cloud Controller VM. You may need to restart consul_agent on other VMs, as well.
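The peers.json check referenced in the diagnosis above is a single command on each Consul server; a minimal sketch:

  $ cat /var/vcap/store/consul_agent/raft/peers.json    # output of "null" suggests split brain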


Symptom

You see that communication is not encrypted between two VMs.

Explanation: Error in Network Configuration

The IPsec BOSH job is not running on either VM. This can happen if both IPsec jobs crash, both fail to start, or the subnet configuration is incorrect. There is also a momentary gap between the time an instance is created and the time BOSH sets up IPsec; during this gap, data can be sent unencrypted. The length of the gap depends on the instance type, IaaS, and other factors. For example, on an AWS t2.micro instance, the time from networking start to IPsec connection was measured at 95.45 seconds.

Solution

Set up a networking restriction on host VMs to allow only the IPsec protocols and block normal TCP/UDP traffic. For example, in AWS, configure a network security group with the minimal settings shown below and block all other TCP and UDP ports.

Additional AWS Configuration

Type             Protocol  Port Range  Source
Custom Protocol  AH (51)   All         10.0.0.0/16
Custom Protocol  ESP (50)  All         10.0.0.0/16
Custom UDP Rule  UDP       500         10.0.0.0/16

Note: When configuring a network security group, remember that IPsec adds an additional layer on top of the original communication protocol. A connection targeting a port number, for example port 8080 over TCP, actually travels as IP protocol 50 (ESP) or 51 (AH) instead. As a result, traffic targeted at a blocked port may still be able to pass through.
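As an illustrative sketch, the rules in the table above could be applied with the AWS CLI; the security group ID below is hypothetical:

  $ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol 51 --cidr 10.0.0.0/16   # AH
  $ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol 50 --cidr 10.0.0.0/16   # ESP
  $ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol udp --port 500 --cidr 10.0.0.0/16   # IKE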


Symptom

You see unencrypted app messages in the logs.

Explanation: etcd Split Brain

Solution

  1. Check for an etcd split brain by connecting with bosh ssh into each etcd node and querying the member list:
    $ curl localhost:4001/v2/members

  2. Check whether the member lists are consistent across all etcd nodes (a one-pass comparison sketch appears after these steps). If a node lists only itself as a member, it has formed its own cluster and developed "split brain." To fix this issue, SSH into the split brain VM and run the following commands:

    $ sudo su -
    # monit stop etcd
    # rm -r /var/vcap/store/etcd
    # monit start etcd

  3. Check the logs to confirm the node rejoined the existing cluster.
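The member-list comparison from step 2 can be run across all nodes in one pass; a minimal sketch with hypothetical node IPs:

  $ for node in 10.0.1.20 10.0.1.21 10.0.1.22; do echo "== $node =="; curl -s http://$node:4001/v2/members; echo; done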


Symptom

IPsec deployment fails with an Error filling in template 'pre-start.erb' message:

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'consul_server-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #)
   - Unable to render jobs for instance group 'nats-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #)

Explanation: Typographical or Syntax Error in the Deployment Descriptor YAML

Solution

Check the deployment descriptor YAML syntax for the CA certificates entry:

releases:
- {name: ipsec, version: 1.0.0}

addons:
- name: ipsec-addon
  jobs:
  - name: ipsec
    release: ipsec
  properties:
    ipsec:
      ipsec_subnets:
      - 10.0.1.1/20
      no_ipsec_subnets:
      - 10.0.1.10/32  # bosh director
      instance_certificate: |
        -----BEGIN CERTIFICATE-----
        MIIEMDCCAhigAwIBAgIRAIvrBY2TttU/LeRhO+V1t0YwDQYJKoZIhvcNAQELBQAw
        ...
        -----END CERTIFICATE-----
      instance_private_key: |
        -----BEGIN EXAMPLE RSA PRIVATE KEY-----
        MIIEogIBAAKCAQEAtAkBjrzr5x9g0aWgyDEmLd7m9u/ZzpK7UScfANLaN7JiNz3c
        ...
        -----END EXAMPLE RSA PRIVATE KEY-----
      ca_certificates:
        - |
          -----BEGIN CERTIFICATE-----
          MIIEUDCCArigAwIBAgIJAJVLBeJ9Wm3TMA0GCSqGSIb3DQEBCwUAMB0xGzAZBgNV
          BAMMElBDRiBJUHNlYyBBZGRPbiBDQTAeFw0xNjA4MTUxNzQwNDVaFw0xOTA4MTUx
          ...
          -----END CERTIFICATE-----

In the example above, the value of the ca_certificates: key is a list of certificates, not a single certificate. The key must be followed by a line that starts with - and ends with |, and the lines after that contain the PEM-encoded certificate, indented beneath the - indicator.

The each_with_index error shown above is a hint that the - | YAML sequence syntax is missing. Use this syntax even when there is only one CA certificate, that is, a list with a single entry.
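One way to catch this class of error before deploying is to parse the manifest locally; a minimal sketch using Ruby's standard YAML library, with an assumed file name:

  $ ruby -ryaml -e 'm = YAML.load_file("ipsec-addon.yml"); p m["addons"][0]["properties"]["ipsec"]["ca_certificates"].class'   # prints Array when the list syntax is correct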
