IPsec Add-On for PCF v1.9

Troubleshooting the IPsec Add-on for PCF

This topic provides instructions to verify that strongSwan-based IPsec works with your Pivotal Cloud Foundry (PCF) deployment and general recommendations for troubleshooting IPsec.

Verify that IPsec Works with PCF

To verify that IPsec works between two hosts, you can check that traffic is encrypted in the deployment with tcpdump, perform the ping test, and check the logs with the steps below.

  1. Check traffic encryption and perform the ping test. Select two hosts in your deployment with IPsec enabled and note their IP addresses. These are referenced below as IP-ADDRESS-1 and IP-ADDRESS-2.
    1. SSH into IP-ADDRESS-1.
      $ ssh IP-ADDRESS-1
    2. On the first host, run the following, and allow it to continue running.
      $ tcpdump host IP-ADDRESS-2
    3. From a separate command line, run the following:
      $ ssh IP-ADDRESS-2
    4. On the second host, run the following:
      $ ping IP-ADDRESS-1
    5. Verify that the packet type is ESP. If the traffic shows ESP as the packet type, traffic is successfully encrypted. The output from tcpdump will look similar to the following:
      03:01:15.242731 IP IP-ADDRESS-2 > IP-ADDRESS-1: ESP(spi=0xcfdbb261,seq=0x3), length 100
  2. Open the /var/log/daemon.log file to obtain a detailed report, including information pertaining to the type of certificates you use, and to verify an established connection exists.
  3. Navigate to your Installation Dashboard, and click Recent Install Logs to view information regarding your most recent deployment. Search for “ipsec” and the status of the IPsec job.
  4. Run ipsec statusall to return a detailed status report regarding your connections. The typical path for this binary is /var/vcap/packages/strongswan-x.x.x/sbin, where x.x.x represents the version of strongSwan packaged into the IPsec add-on.
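
The tcpdump check in step 1 can be partly automated. The following is a minimal sketch, assuming tcpdump prints one packet per line in the format shown above. The function name count_non_esp is hypothetical and is not part of the IPsec add-on. A non-zero result is not necessarily a failure: IKE negotiation packets on UDP port 500, for example, are not ESP.

```shell
#!/bin/sh
# count_non_esp: read tcpdump output lines on stdin and print how many
# IP packet lines are NOT marked as ESP. Hypothetical helper for illustration.
count_non_esp() {
  # Count lines that look like IP packets but do not contain "ESP(".
  awk '/ IP / && index($0, "ESP(") == 0 { n++ } END { print n + 0 }'
}

# Usage sketch on the first host (IP-ADDRESS-2 as in the steps above):
#   tcpdump -l -c 20 host IP-ADDRESS-2 | count_non_esp
```

If the count stays at 0 while the ping from the second host is running, every captured IP packet was ESP.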

If you experience symptoms that IPsec does not establish a secure connection, return to the Installing the IPsec Add-on for PCF topic and review your installation.

If you encounter issues with installing IPsec, refer to the Troubleshoot IPsec section of this topic.

Troubleshoot IPsec

IPsec Installation Issues

Symptom

Unresponsive applications or incomplete responses, particularly for large payloads

Explanation: Packet Loss

IPsec packet encryption increases the size of packet payloads on host VMs. If the size of the larger packets exceeds the maximum transmission unit (MTU) size of the host VM, packet loss may occur when the VM forwards those packets.

If your VMs were created with an Amazon PV stemcell, the default MTU value is 1500 for both host VMs and the application containers. If your VMs were created with Amazon HVM stemcells, the default MTU value is 9001. Garden containers default to 1500 MTU.

Solution

Either decrease the MTU of the application containers or increase the MTU of the application container VMs. The MTU of the application containers must be at least 100 bytes lower than the MTU of the application container VMs.

Do one of the following:

  • Decrease the MTU of the application containers.

    1. In the Pivotal Application Service tile configuration, click Networking.
    2. Decrease Applications Network Maximum Transmission Unit (MTU) (in bytes) from the default value of 1454 to 1354.
  • Increase the MTU of the application container VMs.

    On a command line, run the following command:

    ifconfig NETWORK-INTERFACE mtu MTU-VALUE
    

    Where:

    • NETWORK-INTERFACE is the network interface used to communicate with other VMs.
    • MTU-VALUE is a value greater than 1500. Pivotal recommends a headroom of 100.

    For example:

    $ ifconfig example-interface mtu 1600
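
The headroom rule above can be checked with simple arithmetic. The following is a minimal sketch, assuming a Linux VM. The function names max_container_mtu and mtu_headroom_ok are hypothetical helpers, not part of PCF.

```shell
#!/bin/sh
# max_container_mtu: given the MTU of the application container VM, print the
# largest container MTU that keeps a headroom of 100.
max_container_mtu() {
  vm_mtu=$1
  echo $((vm_mtu - 100))
}

# mtu_headroom_ok: succeed if CONTAINER_MTU ($1) is at least 100 below VM_MTU ($2).
mtu_headroom_ok() {
  container_mtu=$1
  vm_mtu=$2
  [ $((vm_mtu - container_mtu)) -ge 100 ]
}

# Usage sketch: read the current VM MTU from sysfs on Linux
# (NETWORK-INTERFACE as described above), then check a candidate value:
#   vm_mtu=$(cat /sys/class/net/NETWORK-INTERFACE/mtu)
#   mtu_headroom_ok 1354 "$vm_mtu" && echo "headroom ok"
```

With a VM MTU of 1500, the default container MTU of 1454 leaves a headroom of only 46, while 1354 satisfies the rule.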
    


Symptom

Unresponsive applications or incomplete responses, particularly for large payloads

Explanation: Network Degradation

IPsec data encryption increases the size of packet payloads. If the number of requests and the size of your files are large, the network may degrade.

Solution

Scale your deployment by allocating more CPU or GPU capacity to your VMs, which also decreases packet encryption time. You can also improve network performance by compressing data before encryption, which reduces the amount of data transferred.
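
To judge whether compression is worthwhile for a given payload, you can compare its size before and after compression. The following is a minimal sketch that assumes gzip is available. The function name compressed_size is hypothetical. Data that is already compressed or encrypted typically does not shrink further.

```shell
#!/bin/sh
# compressed_size: read a payload on stdin and print its size in bytes
# after gzip compression.
compressed_size() {
  gzip -c | wc -c | tr -d ' '
}

# Usage sketch (payload.json is a hypothetical file name):
#   wc -c < payload.json
#   compressed_size < payload.json
```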


Symptom

IPsec fails to deploy with one of the following errors:

  • IPsec detected that the CA certificate and instance certificate have the same serial number. Please ensure that the serials are not the same and redeploy.

  • IPsec detected that the CA certificate and instance certificate have the same subject. Please ensure that the subjects are not the same and redeploy.

  • Invalid CA/instance certificate. This may be caused by a duplicate subject or serial number.

Explanation: Same Serial Number/Common Name

Every certificate in the certificate chain must have a unique serial number and common name, also called the subjectName.

For example, a root certificate cannot have the same common name as an intermediate certificate; an instance certificate cannot have the same serial number as an intermediate certificate.

Solution

Follow the steps in Procedure 1: Rotate the Instance Certificate and Instance Private Key to generate a new chain of certificates and replace the old chain.

When you generate the new certificates, ensure that no two certificates share a serial number or common name.
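
Before you redeploy, you can compare the serial numbers and subjects across the chain. The following is a minimal sketch that assumes openssl is installed. The function name find_duplicates and the certificate file names are hypothetical.

```shell
#!/bin/sh
# find_duplicates: read one value per line on stdin (for example serial
# numbers or subjects) and print each value that occurs more than once.
find_duplicates() {
  sort | uniq -d
}

# Usage sketch (ca.pem, intermediate.pem, and instance.pem are hypothetical names):
#   for f in ca.pem intermediate.pem instance.pem; do
#     openssl x509 -noout -serial -in "$f"
#   done | find_duplicates
# Repeat with -subject in place of -serial to compare subjects.
```

Empty output means that no value is repeated.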


IPsec Runtime Issues

Symptom

  • App fails to start with the following message:

    FAILED
    Server error,
    status code: 500,
    error code: 10001,
    message: An unknown error occurred.
    The Cloud Controller log shows it is unable to communicate with Diego due to getaddrinfo failing.

  • Deployment fails with a similar error message: 'diego_database-partition-620982d595434269a96a/0 (a643c6c0-bc43-411b-b011-58f49fb61a6f)' is not running after update. Review logs for failed jobs: etcd

Explanation: Consul Split Brain

This error indicates a “split brain” issue with Consul.

Solution

Confirm this diagnosis by checking the peers.json file from /var/vcap/store/consul_agent/raft. If it is null, then there may be a split brain. To fix this problem, follow these steps:

  1. Run monit stop consul_agent on all Consul servers.
  2. Run rm -rf /var/vcap/store/consul_agent/ on all Consul servers.
  3. Run monit start consul_agent on all Consul servers one at a time.
  4. Restart the consul_agent process on the Cloud Controller VM. You may need to restart consul_agent on other VMs, as well.
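
The peers.json check that confirms the diagnosis can be scripted. The following is a minimal sketch. The function name peers_is_null is hypothetical, and the check assumes that a null peer list appears in the file as the JSON literal null.

```shell
#!/bin/sh
# peers_is_null: succeed if the given peers.json file contains only the
# JSON literal null (ignoring whitespace).
peers_is_null() {
  [ "$(tr -d '[:space:]' < "$1")" = "null" ]
}

# Usage sketch on a Consul server VM:
#   if peers_is_null /var/vcap/store/consul_agent/raft/peers.json; then
#     echo "peers.json is null: possible split brain"
#   fi
```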


Symptom

You see that communication is not encrypted between two VMs.

Explanation: Error in Network Configuration

The IPsec BOSH job is not running on either VM. This can happen if both IPsec jobs crash, both IPsec jobs fail to start, or the subnet configuration is incorrect.

In addition, there is a momentary gap between the time an instance is created and the time BOSH sets up IPsec. During this gap, data can be sent unencrypted. The length of the gap depends on the instance type, IaaS, and other factors. For example, on a t2.micro on AWS, the time from networking start to IPsec connection was measured at 95.45 seconds.

Solution

Set up a networking restriction on host VMs to only allow IPsec protocol and block the normal TCP/UDP traffic. For example, in AWS, configure a network security group with the minimal networking setting as shown below and block all other TCP and UDP ports.

Additional AWS Configuration

Type             Protocol  Port Range  Source
Custom Protocol  AH (51)   All         10.0.0.0/16
Custom Protocol  ESP (50)  All         10.0.0.0/16
Custom UDP Rule  UDP       500         10.0.0.0/16

Note: When you configure a network security group, keep in mind that IPsec adds an additional layer to the original communication protocol. A connection that targets a specific port, for example TCP port 8080, is actually carried over IP protocol 50 or 51. As a result, traffic targeted at a blocked port may still get through.
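
The three inbound rules listed under Additional AWS Configuration can also be expressed with the AWS CLI. The following is a minimal sketch that only prints the commands for review and does not run them. The function name print_ipsec_sg_rules and the group ID placeholder are hypothetical. Check the commands against the AWS CLI documentation for your CLI version before you run them.

```shell
#!/bin/sh
# print_ipsec_sg_rules: print (not run) AWS CLI commands that add the inbound
# rules for AH, ESP, and UDP port 500 to a security group.
#   $1 = security group ID (placeholder), $2 = source CIDR
print_ipsec_sg_rules() {
  group_id=$1
  cidr=$2
  # IP protocol 51 is AH and 50 is ESP; neither has a port range.
  for proto in 51 50; do
    echo "aws ec2 authorize-security-group-ingress --group-id $group_id --protocol $proto --cidr $cidr"
  done
  # IKE key exchange uses UDP port 500.
  echo "aws ec2 authorize-security-group-ingress --group-id $group_id --protocol udp --port 500 --cidr $cidr"
}

# Usage sketch (SECURITY-GROUP-ID is a placeholder):
#   print_ipsec_sg_rules SECURITY-GROUP-ID 10.0.0.0/16
```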


Symptom

You see unencrypted app messages in the logs.

Explanation: etcd Split Brain

Solution

  1. Check for etcd split brain by using bosh ssh to connect to each etcd node and running the following command:
    $ curl localhost:4001/v2/members

  2. Check whether the member list is consistent across all etcd nodes. If a node has only itself as a member, it has formed its own cluster and developed "split brain." To fix this issue, SSH into the split brain VM and run the following commands:

    1. $ sudo su -

    2. # monit stop etcd

    3. # rm -r /var/vcap/store/etcd

    4. # monit start etcd

  3. Check the logs to confirm the node rejoined the existing cluster.
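
The member check in step 2 can be scripted by counting the entries in the response. The following is a minimal sketch, assuming the /v2/members response is a JSON object whose member entries each contain exactly one "id" key. The function name count_etcd_members is hypothetical.

```shell
#!/bin/sh
# count_etcd_members: read the JSON body returned by /v2/members on stdin
# and print the number of members it lists.
# Assumes each member object contains exactly one "id" key.
count_etcd_members() {
  grep -o '"id"' | wc -l | tr -d ' '
}

# Usage sketch on an etcd node:
#   n=$(curl -s localhost:4001/v2/members | count_etcd_members)
#   [ "$n" -le 1 ] && echo "node lists only $n member(s): possible split brain"
```

Compare the count on every etcd node. A node that reports fewer members than the others is a candidate for the fix above.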


Symptom

IPsec deployment fails with Error filling in template 'pre-start.erb'

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'consul_server-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #)
   - Unable to render jobs for instance group 'nats-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #)

Explanation: Typographical or Syntax Error in the Deployment Descriptor YAML

Solution

Check the deployment descriptor YAML syntax for the CA certificates entry:

releases:
- {name: ipsec, version: 1.9.9}

addons:
- name: ipsec-addon
  jobs:
  - name: ipsec
    release: ipsec
  properties:
    ipsec:
      ipsec_subnets:
      - 10.0.1.1/20
      no_ipsec_subnets:
      - 10.0.1.10/32  # bosh director
      instance_certificate: |
        -----BEGIN CERTIFICATE-----
        MIIEMDCCAhigAwIBAgIRAIvrBY2TttU/LeRhO+V1t0YwDQYJKoZIhvcNAQELBQAw
        ...
        -----END CERTIFICATE-----
      instance_private_key: |
        -----BEGIN EXAMPLE RSA PRIVATE KEY-----
        MIIEogIBAAKCAQEAtAkBjrzr5x9g0aWgyDEmLd7m9u/ZzpK7UScfANLaN7JiNz3c
        ...
        -----END EXAMPLE RSA PRIVATE KEY-----
      ca_certificates:
        - |
          -----BEGIN CERTIFICATE-----
          MIIEUDCCArigAwIBAgIJAJVLBeJ9Wm3TMA0GCSqGSIb3DQEBCwUAMB0xGzAZBgNV
          BAMMElBDRiBJUHNlYyBBZGRPbiBDQTAeFw0xNjA4MTUxNzQwNDVaFw0xOTA4MTUx
          ...
          -----END CERTIFICATE-----

In the example above, the values that appear after the ca_certificates: key are contained within a list and are not just a single certificate. This entry must be followed by a line starting with -, and ending with |. The lines following this contain the PEM encoded certificate(s).

The each_with_index error in the message shown above is a hint that the - | YAML sequence syntax is missing. Use this syntax even when there is only one CA certificate, that is, a list of one entry.
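
You can check a manifest for this mistake before you deploy. The following is a minimal sketch that looks only at the line after each ca_certificates: key. It is not a YAML parser, and the function name ca_certificates_is_list is hypothetical.

```shell
#!/bin/sh
# ca_certificates_is_list: read a manifest on stdin and succeed only if a
# ca_certificates: key is present and each one is immediately followed by a
# line that contains just "- |".
ca_certificates_is_list() {
  awk '
    # The previous line was the key, so this line must be the "- |" entry.
    expect { if ($0 !~ /^[ \t]*-[ \t]*[|][ \t]*$/) bad = 1; expect = 0 }
    # A bare "ca_certificates:" key: remember to check the next line.
    /^[ \t]*ca_certificates:[ \t]*$/ { expect = 1; seen = 1 }
    END { exit ((seen && !bad && !expect) ? 0 : 1) }
  '
}

# Usage sketch (manifest.yml is a hypothetical file name):
#   ca_certificates_is_list < manifest.yml || echo "ca_certificates is not a list"
```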


Symptom

Complete system outage with no warning.

Explanation: IPsec Certificates Might Have Expired

Expired IPsec certificates can cause a sudden system outage. For example, the self-signed certificates generated by the script provided in the installation instructions have a lifetime of 365 days. IPsec certificates expire if you do not rotate them within their lifetime.

Solution

Renew expired IPsec certificates. To avoid future downtime due to expired IPsec certificates, set a calendar reminder to rotate the certificates before they expire.

For how to renew certificates, see Renewing Expired IPsec Certificates. For how to rotate them, see Rotating IPsec Certificates.
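
To decide when to set the reminder, you can compute how many days a certificate has left. The following is a minimal sketch. The function name days_left and the file name instance.pem are hypothetical, and the usage lines assume openssl and GNU date are available.

```shell
#!/bin/sh
# days_left: given a certificate expiry time ($1) and the current time ($2),
# both in seconds since the epoch, print the whole days remaining.
# The result is truncated toward zero and is negative once the certificate
# has been expired for a day or more.
days_left() {
  not_after=$1
  now=$2
  echo $(( (not_after - now) / 86400 ))
}

# Usage sketch (instance.pem is a hypothetical file name; date -d is GNU date):
#   end=$(openssl x509 -noout -enddate -in instance.pem | cut -d= -f2)
#   days_left "$(date -d "$end" +%s)" "$(date +%s)"
```

As an alternative, openssl x509 -checkend 2592000 -noout -in instance.pem exits with a non-zero status if the certificate expires within the next 30 days.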


Symptom

BOSH shows a VM in a failing state. On the failing VM, monit summary shows ipsec with a status of Does not exist.

Explanation: IPsec Stopped or Crashed and Monit Cannot Automatically Bring It Back Up

IPsec must be the first process to start and the last process to stop. As a result, its start and stop scripts are located in pre-start and post-stop, which are BOSH concepts that Monit does not know about. Because Monit does not know what pre-start is, it cannot bring IPsec back up automatically. After you run pre-start manually, Monit is able to detect IPsec as healthy.

Solution

  1. From the failing VM, run the following command as root: /var/vcap/jobs/ipsec/bin/pre-start
  2. Run monit summary. If the restart is successful, Monit shows ipsec with a status of Running.