Troubleshooting Router Error Responses

This topic helps operators understand and debug 502 errors that are a result of their infrastructure, Pivotal Application Service (PAS), or an app.

Overview

In your deployment, 502 errors can come from your infrastructure, PAS, or an app.

If you are unsure of the source of 502 errors, see General Debugging Steps below.

General Debugging Steps

Some general debugging steps for any issue resulting in 502 errors are as follows:

  1. Gather the Gorouter logs and Diego Cell logs at the time of the incident. To SSH into the router VM, see Advanced Troubleshooting with the BOSH CLI. To download the router VM logs from Ops Manager, see Monitoring VMs in Pivotal Platform.

  2. Review the logs and consider the following:

    1. Which errors are the Gorouters returning?
    2. Is Gorouter’s routing table accurate? Are the endpoints for the route as expected? For more information, see Dynamic Routing Table in the Gorouter documentation on GitHub.
    3. Do the Diego Cell logs show any unexpected app crashes or restarts?
    4. Is the app healthy and handling requests successfully? You can use request tracing headers to verify. For more information, see HTTP Headers for Zipkin Tracing in HTTP Routing.
  3. Consider the following:

    • Does your load balancer log 502 errors, but Gorouter does not? This means that traffic is not reaching Gorouter.
    • Was there a recent platform change or upgrade that caused an increase in 502 errors?
    • Are any metrics spiking suspiciously? What do CPU and memory utilization look like?

Diagnose Gorouter Errors

This section describes how to diagnose Gorouter errors.

Gorouter Cannot Connect to the App Container

If Gorouter cannot connect to the app container, you might see this error in the gorouter.log:

[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
"backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
{"ApplicationId":"","Addr":"10.0.32.15:60099","Tags":null,"RouteServiceUrl":""},
"error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}

If Gorouter cannot establish an initial TCP connection to the backend, it retries the dial up to three times. If the connection still fails, Gorouter returns a 502 to the client and records the request in the access.log.
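
When many of these entries need triage, it can help to extract the endpoint address and error text from each line. The following is a minimal Python sketch, assuming each gorouter.log line has the shape shown above (a bracketed timestamp followed by a JSON object). The function name parse_gorouter_line is hypothetical and is not part of any Gorouter tooling.

```python
import json


def parse_gorouter_line(line):
    """Return key fields from a '[timestamp] {json}' log line, or None."""
    # Assumption: the JSON payload starts at the first '{' on the line.
    start = line.find("{")
    if start == -1:
        return None
    try:
        entry = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    data = entry.get("data") or {}
    endpoint = data.get("route-endpoint") or {}
    return {
        "message": entry.get("message"),
        "timestamp": entry.get("timestamp"),
        "addr": endpoint.get("Addr"),
        "error": data.get("error"),
    }


# The sample entry from above, joined into a single line.
sample = (
    '[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,'
    '"message":"backend-endpoint-failed","source":"vcap.gorouter","data":'
    '{"route-endpoint":{"ApplicationId":"","Addr":"10.0.32.15:60099",'
    '"Tags":null,"RouteServiceUrl":""},'
    '"error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}'
)

parsed = parse_gorouter_line(sample)
print(parsed["addr"], "->", parsed["error"])
```

Grouping the parsed entries by addr is one way to see whether failures concentrate on a single backend, which would point toward an unresponsive app or a stale route.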

Any of the following can cause connection errors between Gorouter and the app container:

  • An app that is unresponsive, indicating an issue with the app.
  • A stale route in Gorouter, indicating an issue with the platform. For more information, see Diagnose Stale Routes below.
  • A corrupted app container, indicating a problem with the platform.

Gorouter Errors After Connecting

If Gorouter successfully dials the endpoint but an error occurs, you might see the following:

  • read: connection reset by peer errors. These can occur when the app closes the connection abruptly with a TCP RST packet and not the expected FIN-ACK. This causes Gorouter to retry the next endpoint. Gorouter does not currently retry on write: connection reset by peer failures.
  • TLS handshake errors. When these errors occur, Gorouter retries up to three times. If it still fails, Gorouter can return a 502. These errors appear similar to the following in the gorouter.log, and a 502 error is logged in the access.log:
    [2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":
    "backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
    {"ApplicationId":"","Addr":"10.0.16.17:61002","Tags":null,"RouteServiceUrl":""},
    "error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
    
  • If a client cancels a request before the server responds with headers, Gorouter returns a 499 error.

Diagnose Stale Routes in Gorouter

A stale route occurs when Gorouter contains out-of-date route information for a backend app. In nearly all cases, stale routes are self-correcting.

If SSL verification is enabled, when Gorouter detects that it is sending traffic to the wrong app, it prunes that backend app from its route table and terminates the connection. SSL verification from Gorouter to backends is always on in PAS v2.4.0 and later.

Causes of Stale Routes

When a route is unmapped or when an app container is deleted because the app is deleted or moved, a deregister message is sent to Gorouter. This message tells Gorouter to delete the route mapping to that container.

If Gorouter does not receive this deregister message, the route is now considered stale. Gorouter still attempts to send traffic to the app.

You are more likely to have stale routes when the following are true:

How to Locate Stale Routes

The following procedure helps you identify stale routes:

  1. Verify the state of the deployment. Run cf routes for all spaces and ensure the route is only mapped to the intended apps. Sometimes, there can be multiple routes using the same hostname and domain but with different paths. If the domain is shared, check all orgs as well.
  2. Examine the Gorouter routes table. It might be necessary to check multiple Gorouters, as it is possible that some received the proper deregister message and some did not.
    1. SSH to the VM where Gorouter is running.
    2. To print the entire Gorouter routes table, run: /var/vcap/jobs/gorouter/bin/retrieve-local-routes | jq .
    3. Find the entry for the suspected stale route. Note the values for address and private_instance_id.
  3. Cross-reference the Gorouter routes table entry with actual Long-Running Processes (LRPs):
    1. SSH into the Diego Cell whose IP address matches the IP address that you found in the routes table entry.
    2. To get information about all of the actual LRPs, run: cfdot actual-lrps | jq .
    3. Look through the actual LRPs to find the instance ID that you noted from the routes table. If that instance ID exists and the port in the route table does not exist in the ports section, then there is likely a stale route.

      Note: You might be tempted to use the CAPI endpoint GET /v3/processes/:guid/stats to find out information about the host and ports the app is using. However, it is an app developer endpoint and does not provide complete information for operators. Use the cfdot CLI on the Diego Cell to view the actual LRPs directly and all at once.
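
The cross-reference in step 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool. The address and private_instance_id fields come from the procedure above. The actual LRP field names used here (instance_guid, ports, host_port, host_tls_proxy_port) are assumptions about the cfdot output and should be checked against what your Diego Cell actually prints. The function name check_route_entry is hypothetical.

```python
def check_route_entry(route_entry, actual_lrps):
    """Compare one routes table entry against a list of actual LRP dicts.

    Returns "match", "likely-stale", or "instance-not-found".
    """
    # The routes table address is assumed to be "ip:port".
    _, _, port_text = route_entry["address"].rpartition(":")
    route_port = int(port_text)
    instance_id = route_entry["private_instance_id"]

    for lrp in actual_lrps:
        # Assumption: the LRP instance ID is stored under "instance_guid".
        if lrp.get("instance_guid") != instance_id:
            continue
        host_ports = set()
        for mapping in lrp.get("ports", []):
            # Assumption: host-side ports appear under these two keys.
            for key in ("host_port", "host_tls_proxy_port"):
                if key in mapping:
                    host_ports.add(mapping[key])
        # Instance exists: the route is suspect if its port is not listed.
        return "match" if route_port in host_ports else "likely-stale"
    return "instance-not-found"


route = {"address": "10.0.32.15:60099", "private_instance_id": "abc-123"}
lrps = [
    {"instance_guid": "abc-123",
     "ports": [{"container_port": 8080, "host_port": 61002}]},
]
print(check_route_entry(route, lrps))
```

The sketch reports a missing instance separately because the procedure above only describes the case where the instance ID exists but the port does not.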

How to Fix Stale Routes

The following procedure helps you fix stale routes:

  1. Ensure that SSL verification is enabled. For more information, see With TLS Enabled in HTTP Routing.

    Note: Using TLS to verify app identity depends on SSL verification. If you disable SSL verification, there is no way to avoid misrouting.

  2. If there is a stale route, then restarting Gorouter fixes the immediate issue. If you restart all of the Gorouters and see the same issue for the exact same route, then the issue is not a stale route.
  3. If Gorouter is continually missing deregister messages, it might be because either the NATS message bus or the Gorouters are overwhelmed. Look at the VM usage and consider scaling.

Gorouter Error Classification Table

Error Type | Status Code | Source of Issue | Evidence
Dial | 502 | App or Platform | Logs with error dial tcp
AttemptedTLSWithNonTLSBackend | 525 | Platform* | Logs with error tls: first record does not look like a TLS handshake or backend_tls_handshake_failed metric increments
HostnameMismatch | 503 | Platform | Logs with error x509: certificate is valid for <x> not <y> or backend_invalid_id metric increments
UntrustedCert | 526 | Platform | Logs with error prefix x509: certificate signed by unknown authority or backend_invalid_tls_cert metric increments
RemoteFailedCertCheck | 496 | Platform | Logs with error remote error: tls: bad certificate
ContextCancelled | 499 | Client/App | Logs with error context canceled
RemoteHandshakeFailure | 525 | Platform | Logs with error remote error: tls: handshake failure and backend_tls_handshake_failed metric increments

Note: The 499 status code for ContextCancelled appears in logs only. It is never returned to clients because it occurs when the downstream client closes the connection before Gorouter responds.

*Any platform issue could be the result of a misconfiguration.

For each of the above errors, there is a backend-endpoint-failed log entry in gorouter.log and an error message in gorouter.err.log. Additionally, the access.log records the request status codes. For more information, see the Gorouter documentation on GitHub.
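
The evidence column of the table above lends itself to a simple lookup. The following Python sketch maps the error text from a log entry to the error type and status code listed in the table by substring match. It is an illustration only: the name classify_error is hypothetical, and the matching is deliberately simpler than whatever Gorouter does internally.

```python
# (log substring, error type, status code), copied from the table above.
CLASSIFICATION = [
    ("tls: first record does not look like a TLS handshake",
     "AttemptedTLSWithNonTLSBackend", 525),
    ("x509: certificate is valid for", "HostnameMismatch", 503),
    ("x509: certificate signed by unknown authority", "UntrustedCert", 526),
    ("remote error: tls: bad certificate", "RemoteFailedCertCheck", 496),
    ("remote error: tls: handshake failure", "RemoteHandshakeFailure", 525),
    ("context canceled", "ContextCancelled", 499),
    ("dial tcp", "Dial", 502),
]


def classify_error(error_text):
    """Return (error type, status code) for a log error string, or None."""
    for needle, error_type, status in CLASSIFICATION:
        if needle in error_text:
            return error_type, status
    return None


print(classify_error("dial tcp 10.0.32.15:60099: getsockopt: connection refused"))
```

Errors that match no row return None, so they can be set aside for manual review.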

Diagnose App Errors

This section describes app-related 502 errors.

If 502 errors only occur in specific app instances and not all app instances on the platform, it is likely an app-related error. The app might be overloaded, unresponsive, or unable to connect to the database.

If all apps are experiencing 502 errors, then it could either be a platform issue, such as a misconfiguration, or an app issue, such as all apps being unable to connect to an upstream database.
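
One way to apply the distinction above is to count 502 responses per app instance. The Python sketch below assumes you have already reduced your access.log entries to (app, instance index, status code) tuples. That record shape and the function name summarize_502s are hypothetical, because this topic does not define the access.log format.

```python
from collections import Counter


def summarize_502s(records):
    """Count 502s per (app, instance) and report whether every app is hit."""
    seen_apps = set()
    failures = Counter()
    for app, instance, status in records:
        seen_apps.add(app)
        if status == 502:
            failures[(app, instance)] += 1
    affected_apps = {app for app, _ in failures}
    return {
        "failures_by_instance": dict(failures),
        "affected_apps": affected_apps,
        # True only when at least one app was seen and all of them had 502s.
        "all_apps_affected": bool(seen_apps) and affected_apps == seen_apps,
    }


# Hypothetical pre-parsed records for two apps.
records = [
    ("app-a", 0, 200),
    ("app-a", 1, 502),
    ("app-a", 1, 502),
    ("app-b", 0, 200),
]
summary = summarize_502s(records)
print(summary["failures_by_instance"], summary["all_apps_affected"])
```

If the failures cluster on a few instances, an app-related cause is more likely. If every app is affected, consider both a platform issue and a shared upstream dependency.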

Note: Gorouter does not retry any error response returned by the app.