Troubleshooting

Upgrading Postgres

Older versions of Concourse using Postgres release 28 or previous might have difficulty upgrading smoothly. In these cases, during a BOSH deploy step you might encounter an error like this when BOSH attempts to update the DB instance:

Task TASK-NUMBER | TIMESTAMP | Updating instance db: db/UUID (0) (canary) (00:03:17)
        L Error: Action Failed get_task: Task UUID result: 1 of 2 pre-start scripts failed. Failed Jobs: postgres. Successful Jobs: bosh-dns.

To avoid this error, upgrade in two steps, using Postgres release 30 as an intermediary. For example, to upgrade from release 28 to release 36:

  1. Upgrade from release 28 to release 30
  2. Upgrade from release 30 to release 36
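
The rule above can be sketched as a small shell helper. This is an illustration only: the function name postgres_upgrade_path is hypothetical, and it assumes release versions are plain integers such as 28, 30, and 36.

```shell
# Hypothetical helper: print the Postgres releases to deploy, in order.
# Assumes integer release versions (an assumption, not part of this guide).
postgres_upgrade_path() {
  from="$1"
  to="$2"
  # Upgrading from release 28 or earlier straight past release 30 may fail,
  # so list release 30 as an intermediate deploy first.
  if [ "$from" -le 28 ] && [ "$to" -gt 30 ]; then
    echo 30
  fi
  echo "$to"
}
```

For example, postgres_upgrade_path 28 36 lists release 30 and then release 36, one per line, while postgres_upgrade_path 30 36 lists only release 36.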

However, if you've encountered the above error and try to redeploy with different versions, you might then encounter this error:

Task TASK-NUMBER | TIMESTAMP | Updating instance db: db/UUID (0) (canary) (00:01:22)
        L Error: 'db/UUID (0)' is not running after update. Review logs for failed jobs: postgres, pg_janitor, bosh-dns, bosh-dns-resolvconf, bosh-dns-healthcheck

The database instance is locked and must be unlocked before the deploy can continue.

Unlock the database instance

  1. SSH into the database VM:

    bosh -e BOSH-ENVIRONMENT-ALIAS -d DEPLOYMENT-NAME ssh db
    

    Where:

    • BOSH-ENVIRONMENT-ALIAS is your BOSH environment alias
    • DEPLOYMENT-NAME is your deployment name
  2. Switch to the root user:

    sudo su
    
  3. Find and remove the lock file named POSTGRES_UPGRADE_LOCK:

    find / -name "POSTGRES_UPGRADE_LOCK"
    

    For example, if the lock file is located in /var/vcap/store/postgres/:

    rm /var/vcap/store/postgres/POSTGRES_UPGRADE_LOCK
    
  4. Exit the SSH session:

    exit
    exit
    

You can now run your bosh deploy command again, and the database instance should update successfully.
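
The find and remove steps above can be combined into one helper, sketched here under some assumptions: the function name remove_upgrade_locks is hypothetical, it is meant to be run as root on the database VM, and it deletes every file named POSTGRES_UPGRADE_LOCK under the directory you pass (defaulting to /, as in the find command above).

```shell
# Hypothetical helper: find and delete POSTGRES_UPGRADE_LOCK files.
# Run as root on the database VM. The search root defaults to /.
remove_upgrade_locks() {
  search_root="${1:-/}"
  # -print shows each lock file path before rm deletes it.
  find "$search_root" -type f -name "POSTGRES_UPGRADE_LOCK" -print -exec rm -f {} +
}
```

Calling remove_upgrade_locks /var/vcap/store limits the search to one directory tree, which is faster than scanning the whole filesystem if you already know roughly where the lock file lives.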


Worker public key is no longer an array

In situations where a Concourse instance has multiple workers in different pools, a Concourse manifest might have more than one worker public key. Some users have operations files that append public keys to their manifest at /instance_groups/name=web/jobs/name=web/properties/worker_gateway/authorized_keys. Such an ops file fails to interpolate with v5.5.x, because the field is now a string instead of a list.

Instead of appending keys, you can concatenate the public keys in an ops file using a multi-line YAML string. Pivotal suggests using an ops file that looks like this:

---
- type: replace
  path: /instance_groups/name=web/jobs/name=web/properties/worker_gateway/authorized_keys?
  value: |
    ((first.public_key))
    ((second.public_key))
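
If you have more than two worker pools, the same ops file can be generated rather than written by hand. The following is a minimal sketch, not part of concourse-bosh-deployment: the function name write_worker_keys_ops and the output file name are hypothetical, and each argument after the file name is assumed to be a variable name such as first.public_key that is resolved at deploy time.

```shell
# Hypothetical helper: write an ops file that sets authorized_keys to a
# multi-line string containing one ((variable)) placeholder per worker key.
write_worker_keys_ops() {
  out="$1"
  shift
  {
    printf '%s\n' '---'
    printf '%s\n' '- type: replace'
    printf '%s\n' '  path: /instance_groups/name=web/jobs/name=web/properties/worker_gateway/authorized_keys?'
    printf '%s\n' '  value: |'
    # One indented placeholder line per key variable name.
    for var in "$@"; do
      printf '    ((%s))\n' "$var"
    done
  } > "$out"
}
```

Running write_worker_keys_ops worker-keys.yml first.public_key second.public_key writes a file with the same content as the ops file shown above.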

Missing variables interpolated by Credhub now error

Many Concourse operators use tools like Credhub for centralized credential management for their Concourse instances. In the concourse-bosh-deployment repository referenced in these upgrade guides, there are various examples where variables are used as placeholders which are meant to be replaced at deployment time. In the past, you could use these values as-is without specifying variables, and Credhub would seamlessly take over to interpolate anything that's missing at the time of deployment.

For example, if foo is a key in Credhub, an operator needs to pass ((foo)) to BOSH as-is, without interpolating a value for foo first. This results in something like this in your pipeline.yml:

put: some-bosh-deployment
params:
  ...
  vars:
    secret: "((/bosh-name/group/foo))"

In v5.5.x, this scenario fails with an error message stating that BOSH cannot find the variable /bosh-name/group/foo. To fix this, move the variable into a variables file:

---
secret: "((/bosh-name/group/foo))"

This strategy allows us to pass the variable ((/bosh-name/group/foo)) literally to the BOSH deployment. This way, it can be Credhub-managed within the foundation you're deploying.


Enable Certificate Rotation

As of this writing, concourse-bosh-deployment master contains an operations file, not included in the v5.5.x releases, that turns on the Let’s Encrypt ACME service. Using this operations file can help reduce the number of certificates you need to rotate. If you'd like to try this with your v5.5.x deployment, copy that ops file into your own repo like so:

---
- type: replace
  path: /instance_groups/name=web/jobs/name=web/properties/lets_encrypt?/enabled
  value: true

You may also have to remove the ops file that specifies web TLS certificates, because certificates cannot both rotate automatically and be specified by hand. To do this, remove the following operations file from the BOSH command that deploys your Concourse:

# remove this ops file as part of enabling Let's Encrypt ACME
concourse-bosh-deployment/cluster/operations/tls.yml
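
Put together, the deploy command might look like the sketch below. It is only an outline under stated assumptions: the function name deploy_with_lets_encrypt and the local path operations/lets-encrypt.yml (your copy of the ops file shown earlier) are hypothetical, and a real deployment needs its other -o, -l, and -v flags, which are passed through as extra arguments here.

```shell
# Hypothetical wrapper around bosh deploy.
# $1 = BOSH environment alias, $2 = deployment name, rest = your other flags.
deploy_with_lets_encrypt() {
  env_alias="$1"
  deployment="$2"
  shift 2
  # Note: no "-o concourse-bosh-deployment/cluster/operations/tls.yml" here.
  # operations/lets-encrypt.yml is an assumed path to your copied ops file.
  bosh -e "$env_alias" -d "$deployment" deploy \
    concourse-bosh-deployment/cluster/concourse.yml \
    -o operations/lets-encrypt.yml \
    "$@"
}
```

For instance, deploy_with_lets_encrypt BOSH-ENVIRONMENT-ALIAS DEPLOYMENT-NAME -l vars.yml runs the deploy with the Let's Encrypt ops file applied and a variables file of your own.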