Running some benchmarks using ab, and Tomcat starts to really slow down

I'm running some benchmarks using ApacheBench (ab) against a Java app running on Tomcat.

Say I run a test like:

ab -c 10 -n 10000 http://localhost:8080/hello/world

It will run just fine. If I follow it with:

ab -c 50 -n 50000 http://localhost:8080/hello/world

Again it runs fine, but if I run it a third time it starts to slow down after maybe 3,500 completed requests.

I need help debugging the root cause of this.

I ran top, and there are a few gigabytes of unused memory, so memory doesn't seem to be the issue.

The tomcat6 process does go to 70-80% CPU, or even 107%.

Restarting Tomcat seems to solve the issue, but at times a full server reboot is required.

This is a default Tomcat install, which has 200 threads allocated to it.
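For reference, that pool size comes from the HTTP connector's maxThreads setting in server.xml, which defaults to 200 when it isn't set explicitly. A quick way to check it, assuming a packaged tomcat6 install with its config under /etc/tomcat6 (adjust the path for your layout):

grep -iE 'maxThreads|Connector' /etc/tomcat6/server.xml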

Tomcat logs are empty.

Update

So I changed both tcp_tw_recycle and tcp_tw_reuse to 1, and netstat now shows a very low TIME_WAIT count.

Before changing tcp_tw_recycle/reuse, when I noticed things slowing down I ran netstat and saw 32,400 TCP connections in TIME_WAIT.
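For anyone following along, these are roughly the checks involved (standard Linux tooling):

netstat -an | grep -c TIME_WAIT                          # was around 32,400 before the change
sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_tw_reuse     # confirm the current values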

An update on running the benchmarks now: with the -k switch I'm seeing MUCH more throughput. But at some point things again start to slow down, though restarting Tomcat now brings things back to normal. Before, even if I restarted Tomcat, response times in ab would be very, very slow; now, after changing tcp_tw_recycle/reuse, a Tomcat restart brings things back. Running top shows Tomcat at only around 20% CPU, so it seems the problem is within Tomcat now, but how can I figure out what it is?


Solution 1:

There may be a few things going on here. Your command above translates to 50 concurrent connections, each issuing 1,000 requests. One thing to note is that, if I recall correctly, ApacheBench does not enable keep-alive by default. It may be worth adding it (pass -k to your command above). That will be more of a real-world test anyway, as most user agents use keep-alive by default, as does Tomcat. This should help the issue if my theories below are correct.
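For example, your second test with keep-alive enabled would look like:

ab -k -c 50 -n 50000 http://localhost:8080/hello/world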

1) I suspect you're slamming that thread pool with too many requests, since every request sets up and tears down its own connection. This is a pretty big hit to those threads, as well as to the TCP/IP stack on the system. Which leads me to...

2) You may be (ok, you probably are) running out of ephemeral ports and/or hitting TIME_WAIT sockets. If each request is indeed made over a new, unique connection, you're very likely running into a TIME_WAIT situation, with thousands of sockets in that state (have a look at netstat -an | grep -ic TIME_WAIT for a count of them during your load). These sockets will be ineligible for reuse unless you've enabled tcp_tw_reuse on your system. The fact that you're using localhost only makes this worse.
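To gauge how close you are to exhausting ephemeral ports, compare the range the kernel hands out against the number of sockets stuck in TIME_WAIT during the run (standard Linux paths; the limits vary by distro):

cat /proc/sys/net/ipv4/ip_local_port_range
netstat -an | grep -ic TIME_WAIT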

For more information on setting up time_wait reuse, have a look here. Also note that the thread correctly points out that tickling the fin_wait timeout in the context of TIME_WAIT is wrong and won't help you, so avoid that.

So take a look at tcp_tw_recycle/tcp_tw_reuse specifically and potentially tweak them. These will help you get through your tests, as will keep-alive.
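A minimal sketch of enabling them, assuming a Linux box where you can run sysctl as root (note that tcp_tw_recycle is known to cause trouble for clients behind NAT and was removed in later kernels, so treat it as a benchmarking-only tweak):

sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_tw_recycle=1

To persist the settings across reboots, add net.ipv4.tcp_tw_reuse = 1 and net.ipv4.tcp_tw_recycle = 1 to /etc/sysctl.conf and reload with:

sysctl -p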