Troubleshooting Router Error Responses

Page last updated:

This topic helps operators to better understand if 502 errors are a result of the VMware Tanzu Application Service for VMs (TAS for VMs) tile or an app.

Points of Failure

There are different points of failure in which 502 errors can come from:

  1. Infrastructure
    • Load balancer
    • Network
  2. Platform - Ops Manager
    • Gorouter
    • Diego Cell(s)
  3. App

In the infrastructure, 502 errors can occur in the following way:

  • From the load balancer, 502 errors can surface when the Gorouters are not receiving traffic at all.
    • This can be observed if the load balancer is logging 502 errors but the Gorouters are not.

In the platform, 502 errors can occur in the following ways:

  • If the Gorouter is unable to connect to the app container:

    • If TCP cannot make an initial connection to the back end. The Gorouter retries TCP dial errors up to three times. If it still fails, a 502 is be returned to the client and logged to the access.log. This may be due to:
      • An app that is unresponsive, indicating an issue with the app
      • The Gorouter has a stale route, indicating an issue with the platform
      • The app container is corrupted, indicating a problem with the platform
      • These types of errors may look like this within the gorouter.log:
        [2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
        "error":"dial tcp getsockopt: connection refused"}}
  • If the Gorouter successfully dials the endpoint but an error occurs:

    • read: connection reset by peer errors can occur when the app closes the connection abruptly with a TCP RST packet and not the expected FIN-ACK. This causes the Gorouter to retry the next endpoint. The Gorouter does not currently retry on write: connection reset by peer failures.
    • TLS handshake errors. When these errors occur, the Gorouter retries up to three times. If it still fails, a 502 may be returned. These errors appear similar to the following in the gorouter.log, and a 502 error is logged in the access.log:
      [2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":"
      "error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
    • If the Gorouter successfully connects to the endpoint, but an error occurs while the request is in transport, such as if the Gorouter has not received a response from the endpoint.
    • Prior to PCF 2.0, there was a bug that logged a 502 error for requests canceled by clients before the server responded with headers. PCF 2.0 and beyond. If the same situation occurs, a 499 error is returned.

In an app, 502 errors can occur in the following ways:

Note: The Gorouter does retry any error response returned by the app.

  • If 502 errors only occur from one or more particular app instances and not all of the apps on the platform, it is likely an app-related error. The app may be overloaded, unresponsive, or unable to connect to the database.
  • If all apps are experiencing 502 errors, then it could either be a platform issue, such as a possible misconfiguration, or an app issue, such as all apps being unable to connect to an upstream database.

General Debugging Steps

Some general debugging steps for any issue resulting in 502 errors are as follows:

  • Gather the Gorouter logs and Diego Cell logs at the time of the incident. To SSH into the router VM, see Advanced Troublshooting with the BOSH CLI. To downlood the router VM logs from Ops Manager, see Monitoring VMs.

  • Review the logs and consider the following:

    1. Which errors are the Gorouters returning?
    2. Is the Gorouter’s routing table accurate (are the endpoints for the route as expected)?
    3. Do the Diego Cell logs have anything interesting about unexpected app crashes and/or restarts?
    4. Is the app healthy and handling requests successfully? (try using request tracing headers to verify)
  • Was there a recent platform change/upgrade that caused an increase in 502 errors?

  • Are there any suspicious metrics spiking? How is the CPU and memory utilization?

Gorouter Error Classification Table

Error Type Status Code Source of Issue Evidence
Dial 502 App or Platform logs with error dial tcp
525 Platform* - logs with error tls: first record does not look like a TLS handshake
- backend_tls_handshake_failed metric increments
HostnameMismatch 503 Platform - logs with error x509: certificate is valid for <x> not <y>
- backend_invalid_id metric increments
UntrustedCert 526 Platform - logs with an error prefix x509: certificate signed by unknown authority
- backend_invalid_tls_cert metric increments
RemoteFailedCertCheck 496 Platform logs with error remote error: tls: bad certificate
ContextCancelled 499 Client/App logs with error context canceled

Note: This status code is never returned to clients, only logged, as it occurs when the downstream client closes the connection before the Gorouter responds.

RemoteHandshakeFailure 525 Platform - logs with error remote error: tls: handshake failure
- backend_tls_handshake_failed metric increments

*Any platform issue could be the result of a misconfiguration

For each of the above errors, there is a backend-endpoint-failure log line in gorouter.log and an error message in gorouter.err.log. Additionally, the access.log records the request status codes.