Troubleshooting Router Error Responses

This topic helps operators determine whether 502 errors originate from the Pivotal Application Service (PAS) platform or from an application.

Points of Failure

502s can originate from several points of failure:

  1. Infrastructure
    • Load Balancer
    • Network
  2. Platform - PCF
    • Gorouter
    • Diego Cells
  3. Application

In the Infrastructure, 502s can occur in the following way:

  • From the Load Balancer, 502s can surface when the Gorouters are not receiving traffic at all.
    • This can be observed if the Load Balancer is logging 502s but the Gorouters are not.

In the Platform, 502s can occur in the following ways:

  • If the Gorouter is unable to connect to the application container:
    • TCP dial issues (the Gorouter cannot make an initial connection to the backend). The Gorouter retries TCP dial errors up to three times. If the dial still fails, a 502 is returned to the client and logged in access.log. This may be due to:
      • An application that is unresponsive (which indicates an issue with the application).
      • The Gorouter has a stale route (which indicates an issue with the platform).

        Note: If you experience intermittent misrouting due to stale routes, you can configure PAS to prune routes using TTL expiry for TLS backends. For more information, see the PAS v2.6 Release Notes.

      • The application container is corrupted (which indicates an issue with the platform).
      • These types of errors may look like this within the gorouter.log:
        [2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
        "error":"dial tcp getsockopt: connection refused"}}
  • If the Gorouter successfully dials the endpoint but an error occurs:
    • read: connection reset by peer errors can occur when the application closes the connection abruptly with a TCP RST packet instead of the expected FIN-ACK. This causes the Gorouter to retry the request on the next endpoint. Note: The Gorouter does not currently retry on write: connection reset by peer failures.
    • TLS handshake errors. When these errors occur, the Gorouter retries up to three times. If the handshake still fails, a 502 may be returned. These errors appear similar to the following in gorouter.log (and a 502 is logged in access.log):
      [2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":"
      "error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
  • If the Gorouter successfully connects to the endpoint, but an error occurs while the request is in transit (that is, the Gorouter has not received a response from the endpoint):
    • Prior to PCF 2.0, a bug caused a 502 to be logged for requests that clients canceled before the server responded with headers. In PCF 2.0 and later, a 499 is logged in this situation.
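
The dial-retry behavior described above can be sketched roughly as follows. This is a simplified illustration, not the Gorouter's actual implementation (the Gorouter is written in Go). It assumes that "up to three times" means three attempts in total, and the names route_request, dial, and refuse_everything are hypothetical.

```python
from itertools import cycle, islice
from typing import Callable, Sequence

MAX_DIAL_ATTEMPTS = 3  # assumption: "up to three times" means three attempts in total


def route_request(endpoints: Sequence[str], dial: Callable[[str], int]) -> int:
    """Return the backend's status code, or 502 once every dial attempt is refused.

    `dial` is a hypothetical callable that connects to one endpoint and
    returns the HTTP status of the response, raising ConnectionRefusedError
    when the TCP dial fails.
    """
    if not endpoints:
        raise ValueError("no endpoints registered for this route")
    # Walk the endpoint list round-robin so a single-endpoint route is still
    # retried, stopping after MAX_DIAL_ATTEMPTS attempts.
    for endpoint in islice(cycle(endpoints), MAX_DIAL_ATTEMPTS):
        try:
            return dial(endpoint)
        except ConnectionRefusedError:
            continue  # dial failed: try the next endpoint
    return 502  # every attempt failed at the TCP dial stage


def refuse_everything(endpoint: str) -> int:
    """Stand-in for a backend that never accepts connections."""
    raise ConnectionRefusedError(f"dial tcp {endpoint}: connection refused")


print(route_request(["10.0.0.1:61001", "10.0.0.2:61001"], refuse_everything))  # prints 502
```

Only dial failures are retried in this sketch. A response that the application itself returns, including an error response, is passed through unchanged.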

In an Application, 502s can occur in the following ways (Note: the Gorouter will not retry any error response that is returned by the application):

  • If 502s occur only for a particular application instance and not for all applications on the platform, the cause is likely application-related (for example, the application is overloaded, is unresponsive, or cannot connect to its database).
  • If all applications are experiencing 502s, the cause could be either a platform issue (possibly a misconfiguration) or an application issue (for example, all applications are unable to connect to an upstream database).
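
As a rough aid for applying this rule of thumb, the following sketch groups 502s by application. It assumes you have already extracted (app identifier, status code) pairs from the access logs. The function name scope_of_502s, the record shape, and the returned labels are all hypothetical and are not part of any PAS tooling.

```python
from typing import Iterable, Set, Tuple


def scope_of_502s(records: Iterable[Tuple[str, int]], all_apps: Set[str]) -> str:
    """Summarize how widely 502s are spread across applications.

    records:  (app_id, status_code) pairs taken from access log entries
    all_apps: identifiers of every application expected on the platform
    """
    failing = {app for app, status in records if status == 502}
    if not failing:
        return "no-502s"
    if failing >= all_apps:
        # Every app is affected: suspect the platform or a shared upstream dependency.
        return "all-apps"
    if len(failing) == 1:
        # Only one app is affected: most likely an application-related error.
        return "single-app"
    # Several but not all apps are affected: the rule of thumb does not settle this case.
    return "some-apps"


sample = [("app-a", 200), ("app-b", 502), ("app-b", 502), ("app-c", 200)]
print(scope_of_502s(sample, {"app-a", "app-b", "app-c"}))  # prints single-app
```

The result is only a starting point. An "all-apps" result still requires checking whether the applications share an upstream dependency before concluding that the platform is at fault.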

General Debugging Steps

Use the following general debugging steps for any issue that results in 502 errors:

  • Gather the Gorouter logs and Diego Cell logs from the time of the incident. To SSH into the router VM, see Advanced Troubleshooting with the BOSH CLI. To download the router VM logs from Ops Manager, see Monitoring PCF VMs from Ops Manager.
  • Review the logs and consider the following questions:
    1. Which errors are the Gorouters returning?
    2. Is the Gorouter’s routing table accurate (are the endpoints for the route as expected)?
    3. Do the Diego Cell logs show unexpected app crashes or restarts?
    4. Is the application healthy and handling requests successfully? Try using request tracing headers to verify.
  • Was there a recent platform change or upgrade that caused an increase in 502s?
  • Are any metrics spiking suspiciously? What do CPU and memory utilization look like?

Gorouter Error Classification Table

| Error Type | Status Code | Source of Issue | Evidence |
|---|---|---|---|
| Dial | 502 | Application or Platform | logs with error dial tcp |
| | 525 | Platform* | logs with error tls: first record does not look like a TLS handshake; backend_tls_handshake_failed metric increments |
| HostnameMismatch | 503 | Platform | logs with error x509: certificate is valid for <x> not <y>; backend_invalid_id metric increments |
| UntrustedCert | 526 | Platform | logs with an error prefix x509: certificate signed by unknown authority; backend_invalid_tls_cert metric increments |
| RemoteFailedCertCheck | 496 | Platform | logs with error remote error: tls: bad certificate |
| ContextCancelled | 499 | Client/App | logs with error context canceled |
| RemoteHandshakeFailure | 525 | Platform | logs with error remote error: tls: handshake failure; backend_tls_handshake_failed metric increments |

Note: The 499 status code is never returned to clients, only logged, because it occurs when the downstream client closes the connection before the Gorouter responds.

*Note: Any platform issue could be the result of a misconfiguration.

For each of the above errors, there will be a backend-endpoint-failure log line in gorouter.log and an error message in gorouter.err.log. Additionally, the access.log will record the request status codes.
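
When triaging many log lines, the table above can be applied mechanically. The following sketch maps an error string to the table's error type, status code, and source of issue by substring matching. The names Classification, RULES, and classify are hypothetical, substring matching is an assumption about how the messages appear in your logs, and the error type for the first 525 row is left as None because the table does not name it.

```python
from typing import List, NamedTuple, Optional, Tuple


class Classification(NamedTuple):
    error_type: Optional[str]  # None where the table gives no error type name
    status_code: int
    source: str


# (substring to look for, classification) pairs transcribed from the table.
RULES: List[Tuple[str, Classification]] = [
    ("tls: first record does not look like a TLS handshake",
     Classification(None, 525, "Platform")),
    # Matched without the "x509:" prefix so that spacing differences after
    # the colon in a logged message do not prevent a match.
    ("certificate is valid for",
     Classification("HostnameMismatch", 503, "Platform")),
    ("certificate signed by unknown authority",
     Classification("UntrustedCert", 526, "Platform")),
    ("remote error: tls: bad certificate",
     Classification("RemoteFailedCertCheck", 496, "Platform")),
    ("remote error: tls: handshake failure",
     Classification("RemoteHandshakeFailure", 525, "Platform")),
    ("context canceled",
     Classification("ContextCancelled", 499, "Client/App")),
    # Checked last because it is the least specific pattern.
    ("dial tcp",
     Classification("Dial", 502, "Application or Platform")),
]


def classify(error_message: str) -> Optional[Classification]:
    """Return the first matching table row, or None if nothing matches."""
    for needle, classification in RULES:
        if needle in error_message:
            return classification
    return None


print(classify("dial tcp getsockopt: connection refused").status_code)  # prints 502
```

A None result means that the message does not correspond to any row in the table and needs manual inspection.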