How to Handle a Sudden Burst of New HTTPS Connections?

I've got a fleet of Java Vert.x servers behind a load balancer that handles spiky traffic. One minute it may be handling 150k requests/minute, the next 2 million requests/minute, then right back down to 150k. I'm finding that during these spikes the entire fleet may become unresponsive for minutes and drop connections, while CPU and memory pressure on any one box barely hits 50% utilization.

To pin down what exactly is causing the outages, I set up a single test server matching the specs of one in my production fleet to see how much I could throw at it before it gave out. The test uses 10 other machines, each of which opens 500 HTTPS connections to the server and sends 1 million requests at roughly 2 KB per request payload. That comes to 5k concurrent connections, 10 million requests in total, and roughly 20 GB of data transferred.
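
For context, each load-generating machine does roughly the following. This is a minimal sketch using the Vert.x WebClient rather than my actual test harness; the host name, endpoint path, and pacing are placeholders.

// Sketch of one load generator: 500 HTTPS connections, ~2 KB POSTs.
// Host name and endpoint path are placeholders, not the real test values.
import io.vertx.core.Vertx;
import io.vertx.core.buffer.Buffer;
import io.vertx.ext.web.client.WebClient;
import io.vertx.ext.web.client.WebClientOptions;

public class LoadGenerator {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        WebClientOptions options = new WebClientOptions()
            .setSsl(true)
            .setTrustAll(true)      // test rig only
            .setMaxPoolSize(500)    // 500 concurrent HTTPS connections per machine
            .setKeepAlive(true);
        WebClient client = WebClient.create(vertx, options);

        Buffer payload = Buffer.buffer(new byte[2 * 1024]);  // ~2 KB per request
        for (int i = 0; i < 1_000_000; i++) {                // paced in the real test
            client.post(443, "test-server.example.com", "/ingest")
                  .sendBuffer(payload, ar -> { /* record status and latency */ });
        }
    }
}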

Once the connections are open I can fire off about 700k requests per minute. I monitor the server's availability simply by hitting a health endpoint and recording the response time. Responses come back in tens of milliseconds, which I'm happy with.

But before the flood of data starts, these 10 machines must first establish their 5k connections. During that window the server is unresponsive and may even time out when I check the health endpoint. I believe this is what is causing the outages in my production fleet: the sudden surge of new connections. Once the connections are established, the server has no trouble handling all of the data coming in.

I've raised the nofile ulimit, net.core.netdev_max_backlog, net.ipv4.tcp_max_syn_backlog, and net.core.somaxconn, but the server still hangs when it receives a burst of 5k new connection requests within a few seconds of each other.

Is there anything I can do to establish new connections quicker?

Edit:

The actual server runs in a Docker container, and my net.* settings aren't being applied inside the container. Going to apply them there next and see if it makes a difference.

Edit Edit:

It's all in SSL. Making that many connections that quickly over plain HTTP is near instant, so I've got to figure out how to establish TLS connections faster.

Edit Edit Edit:

I found that the JDK's built-in SSL handler was the bottleneck. Switching to netty-tcnative (i.e. native OpenSSL) pretty much solved my problem with HTTPS.


Solution 1:

Thank you @MichaelHampton for your help.

I found a solution for my problem, and hopefully it may help others (particularly if you are using Java).

I have heard many suggestions to simply increase nofile to allow more connections, but I'd like to start by reiterating that the problem isn't that the server can't hold more connections; it's that it can't establish them quickly enough and ends up dropping them.

My first attempt was to increase the connection queue via net.ipv4.tcp_max_syn_backlog and net.core.somaxconn, and again in the application's server config where appropriate. For Vert.x that is server.setAcceptBacklog(...). This allowed more connections to sit in the queue, but it didn't make establishing them any faster. From a connecting client's point of view, connections were no longer reset due to queue overflow; establishing them just took much longer. For this reason, increasing the connection queue wasn't a real solution and just traded one problem for another.
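
For reference, that first attempt looked roughly like the fragment below (a sketch dropped into the existing server setup; the value 8192 is only an example, and the matching kernel limits have to be at least as large, since the kernel silently caps the listen backlog at net.core.somaxconn).

// Sketch of the "bigger accept queue" attempt (Vert.x 3.x). Example value only.
// net.core.somaxconn and net.ipv4.tcp_max_syn_backlog must also be raised,
// because the kernel truncates the listen backlog to somaxconn.
HttpServerOptions httpServerOptions = new HttpServerOptions()
    .setSsl(true)                  // key/cert options omitted here
    .setAcceptBacklog(8192);       // more connections can wait in the queue,
                                   // but they are not accepted any faster
HttpServer server = vertx.createHttpServer(httpServerOptions);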

To narrow down where in the connection process the bottleneck was, I ran the same benchmarks with HTTP instead of HTTPS and found that the problem went away completely. My particular problem was the TLS handshake itself and the server's ability to keep up with it.

With some more digging into my own application, I found that replacing Java's default SSLHandler with a native one (OpenSSL) greatly increased the speed of connecting via HTTPS.

Here are the changes I made for my specific application (using Vert.x 3.9.1).

  1. Add the netty-tcnative dependencies:
<!-- https://mvnrepository.com/artifact/io.netty/netty-tcnative -->
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-tcnative</artifactId>
    <version>2.0.31.Final</version>
    <classifier>osx-x86_64</classifier>
    <scope>runtime</scope>
</dependency>

<!-- https://mvnrepository.com/artifact/io.netty/netty-tcnative -->
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-tcnative</artifactId>
    <version>2.0.31.Final</version>
    <classifier>linux-x86_64-fedora</classifier>
    <scope>compile</scope>
</dependency>

The first dependency is for OS X, for testing at runtime. The second is for CentOS Linux at compile time. A plain linux-x86_64 classifier is also available for other distributions. I tried to use BoringSSL because the OpenSSL build I had available doesn't support ALPN, but after many hours I couldn't get it to work, so I've decided to live without HTTP/2 for now. With most connections only sending 1-2 small requests before disconnecting, that really isn't an issue for me anyway. If you can use BoringSSL instead, that's probably preferred.

  2. Because I am not using an uber version of the dependency, I needed to install the OS dependencies for CentOS. This was added to the Dockerfile:
RUN yum -y install openssl
RUN yum -y install apr
  3. To tell the Vert.x server to use OpenSSL instead of the Java implementation, set the OpenSSL engine options on the server options (even if just the default object); there's a fuller sketch after this list:
httpServerOptions.setOpenSslEngineOptions(new OpenSSLEngineOptions());
  4. Finally, in my run script, I added the io.netty.handler.ssl.openssl.useTasks=true option to Java. This tells the SSL handler to use tasks when handling requests so that it is non-blocking:
java -Dio.netty.handler.ssl.openssl.useTasks=true -jar /app/application.jar
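
To show step 3 in context, the server setup ends up looking roughly like the fragment below. This is a sketch: the certificate paths are placeholders and "router" stands in for whatever request handler you already have.

// Vert.x server using the native OpenSSL engine via netty-tcnative.
// Certificate/key paths and the router/handler are placeholders.
HttpServerOptions httpServerOptions = new HttpServerOptions()
    .setSsl(true)
    .setPemKeyCertOptions(new PemKeyCertOptions()
        .setCertPath("/path/to/cert.pem")        // placeholder
        .setKeyPath("/path/to/key.pem"))         // placeholder
    .setOpenSslEngineOptions(new OpenSSLEngineOptions());

vertx.createHttpServer(httpServerOptions)
     .requestHandler(router)                     // your existing handler/router
     .listen(443);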

After these changes I can establish connections much more quickly, with less overhead. What previously took tens of seconds and caused frequent connection resets now takes 1-2 seconds with no resets. It could be better, but it's a big improvement from where I was.

Solution 2:

Nice fix!

So it seems to be the SSL layer. It certainly has to do a lot more processing, in terms of network handshakes and cryptographic transformations, which take resources. Unless your SSL can offload some of the processing onto hardware, SSL can certainly increase the load on your servers, and as you found out, not all SSL libraries are created equal!

These problems are a great candidate for a front-end reverse proxy. Ideally it is placed in front of your application, handles all SSL connections from clients, and then speaks plain HTTP to your back end.

Your original application then has a bit less to do, as the front-end reverse proxy soaks up all the SSL work and TCP connection management.

Apache and NGINX can do this, and both have quite a few options for load balancing those connections to the least-loaded backend server.

You will find that NGINX can do SSL termination a lot faster than Java can, and even if Java could keep up, you are distributing the connection-management work across machines, reducing the load (memory/CPU/disk I/O) on your back-end servers. As a side effect, the back-end configuration becomes simpler.

The downside is that you are using plain HTTP between your proxy and applications, which in some ultra-secure environments is not desirable.

Good Luck!