k8s nginx ingress randomly returns 502 errors under load

We are using nginx (1.15.8.1) as an ingress controller on our k8s cluster (v1.17), managed by Rancher (2.5.7).

This has worked fine so far, but we recently set up a custom API pod that is exposed externally via ingress.

Now, when load testing the API by firing one request per second at it, it randomly returns "502 Bad Gateway" every couple of requests, though not at regular intervals:

    <html>
    <head><title>502 Bad Gateway</title></head>
    <body>
    <center><h1>502 Bad Gateway</h1></center>
    <hr><center>openresty/1.15.8.1</center>
    </body>
    </html>

The corresponding log entry on the ingress controller:

    2021/04/11 11:02:48 [error] 25430#25430: *55805583 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: xxx, server: xxx, request: "POST /api/v1/xxx HTTP/2.0", upstream: "http://xxx/api/v1/xxx", host: "xxx"

We ran the very same container as a plain Docker container in a non-K8s environment and never experienced this issue there, so at the moment I assume this is not a problem with the container/API implementation.

My thoughts were:

1- The service definition that abstracts the API pod cannot route the traffic to a pod and therefore replies with a 502.

-> For testing, I changed the ingress to point directly at the workload to see whether this was the issue. I got the same errors, so the service definition does not seem to be the problem.

2- A timeout in the ingress resource (because the POST replies with some latency) might cause the routing to fail.

-> I set the timeouts to

    nginx.ingress.kubernetes.io/proxy-connect-timeout: 1d
    nginx.ingress.kubernetes.io/proxy-read-timeout: 1d
    nginx.ingress.kubernetes.io/proxy-send-timeout: 1d

but ran into the same error again, so this does not seem to be the problem either.
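
For context, here is a minimal sketch of how these annotations were attached to the Ingress resource (host, service name, port and path are placeholders; note that ingress-nginx documents these timeout annotations as plain seconds, so a numeric value such as "3600" may be safer than a unit suffix like "1d"):

    apiVersion: networking.k8s.io/v1beta1      # Ingress API available on k8s 1.17
    kind: Ingress
    metadata:
      name: my-api-ingress                     # placeholder name
      annotations:
        kubernetes.io/ingress.class: nginx
        # ingress-nginx expects these values in seconds
        nginx.ingress.kubernetes.io/proxy-connect-timeout: "3600"
        nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    spec:
      rules:
      - host: api.example.com                  # placeholder host
        http:
          paths:
          - path: /api/v1
            backend:
              serviceName: my-api-service      # placeholder service name
              servicePort: 80                  # placeholder port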

3- An error in the API container (most unlikely) that sends back error responses.

-> Checking the container logs reveals that the failing request (which returned the 502, or something similar that is then reported at the ingress level as a 502) was never received by the API container. Furthermore, this doesn't make sense from a traffic routing perspective anyway.

4- As a final step I scaled the pod up to two replicas, and now everything works perfectly well.
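
For reference, the scale-up in test 4 boils down to raising the replica count on the deployment, roughly like this (all names and the image are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-api                             # placeholder deployment name
    spec:
      replicas: 2                              # two pods instead of one -- with this the 502s disappear
      selector:
        matchLabels:
          app: my-api
      template:
        metadata:
          labels:
            app: my-api
        spec:
          containers:
          - name: api
            image: registry.example.com/my-api:latest   # placeholder image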

Summary: from the results above I conclude that 1- the service definition cannot be the problem (test 1), 2- the ingress timeouts cannot be the problem (test 2), and 3- the API itself cannot be the problem (test 3).

Thinking about test 4, it points in the direction that container availability in the pod could be the problem, in the sense that the API drops requests (resulting in a 502) when the container is in a specific state (this would explain why scaling to two containers helps), OR that the traffic routing times out somewhere along the way.

Currently I don't have any further ideas on where to go from here. Any hint is appreciated.


Solution 1:

If you use the nginx ingress reverse proxy, it may be the cause. Check the proxy-next-upstream setting in the controller's ConfigMap and extend it to handle the http_502 case.
Also enable retry-non-idempotent if the requests that get the 502 are POST, LOCK or PATCH, and if it is safe for your app to retry them.
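
A minimal sketch of what that could look like in the controller's ConfigMap (the ConfigMap name and namespace depend on how the controller was installed; on Rancher/RKE clusters it is typically nginx-configuration in the ingress-nginx namespace):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nginx-configuration                # name/namespace depend on your installation
      namespace: ingress-nginx
    data:
      # default is "error timeout"; adding http_502 lets nginx retry another upstream on a 502
      proxy-next-upstream: "error timeout http_502"
      # also retry non-idempotent methods (POST, LOCK, PATCH) -- only if that is safe for your app
      retry-non-idempotent: "true"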

My guess: when the backend pod reaches its load limit (or is being recycled), it "rejects" new requests, and the nginx reverse proxy is more sensitive to this rejection. Without nginx-ingress, k8s handles the load balancing itself and copes with this better, queuing requests instead of rejecting them.
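
If that guess is right, one way to make the pod's availability explicit to Kubernetes is a readiness probe, so that an overloaded or recycling pod is removed from the Service endpoints instead of resetting connections. A minimal sketch, assuming the API exposes some health endpoint (the /healthz path and the port are hypothetical):

    # fragment of the API deployment's container spec
    containers:
    - name: api
      image: registry.example.com/my-api:latest    # placeholder image
      ports:
      - containerPort: 8080                        # placeholder port
      readinessProbe:
        httpGet:
          path: /healthz                           # hypothetical health endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 2                        # take the pod out of rotation after two failed checks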