Requests per second slower when using nginx for load balancing
Solution 1:
Concurrency was my first thought, since the default concurrency in ab is one and adding a load balancer will always increase the latency of an individual request. But you mentioned that you are setting concurrency to 100, so this shouldn't be the cause.
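For reference, a minimal sketch of the two runs I would compare (hostnames and paths here are placeholders, not taken from your setup): the same request count and concurrency, once through nginx and once straight at a back end.

    ab -n 1000 -c 100 http://loadbalancer.example.com/test.html   # through nginx
    ab -n 1000 -c 100 http://backend1.example.com/test.html       # straight at one Apache box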
The reverse proxy will likely be adding a header to each request, which makes the responses slightly larger when they come through nginx than when they don't. That is probably an imperceptible change if you are running the test over a Gigabit internal network, but if you are running it from your office or home, and particularly if you are using a small file for the test, the extra data could cause a measurable difference. Of course, small files are pretty normal on the web, so a small file might make for a more realistic benchmark.
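If you want to see how much extra data the proxy actually adds, you can compare response sizes directly. This is just a sketch using curl's standard --write-out variables, again with placeholder hostnames:

    curl -s -o /dev/null -w "direct:  %{size_header} header bytes, %{size_download} body bytes\n" http://backend1.example.com/test.html
    curl -s -o /dev/null -w "proxied: %{size_header} header bytes, %{size_download} body bytes\n" http://loadbalancer.example.com/test.html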
Caching can also make a difference between runs, depending on how your benchmark is being run. It will make your first run slower than all the runs after it, and this is compounded when load balancing because there are twice as many caches to warm up. If you tested nginx first, that could have caused a difference. You can mitigate this by turning off all caching or by ignoring the first run. It's pretty difficult to find and disable every cache, and some may not even be under your control, so I'd favour the ignoring-the-first-run method. You mentioned that you have done several runs with different values, but to avoid cache-based inaccuracies you need to run exactly the same benchmark two or more times in a row and ignore the first run.
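In practice that just means something like the following (same placeholder URL as above), throwing away the numbers from run 1:

    for i in 1 2 3; do
        echo "=== run $i ==="
        ab -n 1000 -c 100 http://loadbalancer.example.com/test.html | grep "Requests per second"
    done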
Another thing that can cause this sort of behaviour is a lock somewhere else in the system. By a "lock" I mean a resource that only one of the webservers can use at a time. An example would be storing PHP sessions in a MyISAM table in your database. Every request for a PHP page will either do a read on this table to look up the session or a write to create a new one. Because MyISAM tables have table-level locking, only one of your webservers can use the table at any given time, and since every page has to use it, this can completely negate the advantage of having two webservers. The faster the rest of your system is, the more relative effect a lock will have. It doesn't have to be a database, either; it could be a shared webroot on a SAN or NAS, so even static files are not immune to this kind of problem. You didn't mention any other systems in your original question, but this problem will very likely show up as your system grows.
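If you do end up storing sessions in the database, a quick check (the database and table names here are hypothetical) is to look at the table's engine and, if it is MyISAM, convert it to InnoDB so you at least get row-level locking instead of a table lock on every page view:

    mysql -e "SHOW TABLE STATUS WHERE Name = 'sessions';" myapp
    mysql -e "ALTER TABLE sessions ENGINE = InnoDB;" myapp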
Lastly, a bit (it turned into quite a lot) of general advice on benchmarking. The reason you get a particular speed (or number of requests per second for this kind of benchmarking) is always a single bottleneck. Apache benchmark will just keep requesting as fast as it can until some resource reaches 100% utilisation. That resource may be the CPUs in your webservers or the CPU in the reverse proxy server, but this is unlikely: disk access and network bandwidth (internal and external) are usually the first bottlenecks you run into, long before CPU speed becomes an issue. Even if you see a resource at 90% utilised, that is not the bottleneck. There will be another one somewhere at 100% that is stopping this one from going any higher than 90%. The one at 100% may be on a different system, and it may not be a system you own. It can be the network itself, which means a particular device that makes up the network: a switch, a NIC, or even the cables.
To find the true bottleneck, you should start at some value you can measure (say, the number of nginx workers currently active) and ask "Why isn't this going any higher?" If it has reached its maximum value, then you have found your bottleneck. If not, the next place to look is a connected resource. Whether you go upstream or downstream is a matter of gut instinct. Downstream, nginx will be asking for network slots to pass the requests to Apache, so ask yourself whether the number of open network connections is at its maximum. Then the NIC's bandwidth. Then the network's bandwidth. Then the Apache machine's NIC's bandwidth. You can skip some of these steps if the answer is obvious, but don't just go randomly guessing your way through the system. Make your search ordered and logical.
Sometimes the bottleneck you run into will be on the machine you are running ab on. When this happens, the benchmark is meaningless; all you have tested is the speed of the machine or network you are running ab from. You would get the same result benchmarking Google as you would your own site. To be sure you have a meaningful benchmark, you must find the bottleneck while the benchmark is running (or at the very least make sure it is not on the testing machine). To improve your site's benchmarks you need to find the bottleneck in the system and widen it, and that is easiest to do while the benchmark is running.
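What that looks like in practice is keeping an eye on every machine in the chain, including the one running ab, while the benchmark runs. A rough sketch with the usual Linux tools (names and availability vary by distro; ifstat and nginx's stub_status page are assumptions, not things you mentioned):

    vmstat 1             # CPU saturation, run queue, swap activity
    iostat -x 1          # per-disk utilisation and wait times
    ss -s                # socket and connection counts
    ifstat 1             # NIC throughput, to compare against the link speed
    curl -s http://127.0.0.1/nginx_status    # active connections, if stub_status is enabled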
Testing a large system like yours means that the number of places a bottleneck could hide is quite large. Sometimes it helps to narrow the benchmark down to just a few parts of the system. Cutting out nginx and going straight to Apache is one example of this, and running the benchmark from inside the same network as the webservers is another. But you can go further and benchmark individual components such as disk, network and RAM latency and throughput.
Unfortunately, not all resources report nice easy percentages the way CPU and RAM usage do. For instance, writing a large file to a disk you may get 40 MB/s, but writing lots of little files and reading them back simultaneously (such as PHP sessions stored on disk) you may only get 10 MB/s. To find the true size of a resource you must run benchmarks on each part of your system individually. Don't assume that you will get 1000 Mb/s over your internal network just because you have a Gigabit switch: IP, TCP and application-level overhead such as NFS headers can all reduce that figure, as can slower NICs and cables. Hardware errors can also affect all sorts of benchmarks while the hardware still functions, but at less than the manufacturer's specification.
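As a rough sketch of benchmarking components individually (the sizes and hostnames are placeholders, and iperf needs to be installed and already running as a server on the other machine):

    # Sequential disk write throughput -- the "large file" case
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct

    # Raw TCP throughput between two machines (run "iperf -s" on the other end first)
    iperf -c backend1.example.com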
The bottleneck may be on the nginx machine. If so, the reason for the load balanced solution being slower than the direct single server should be obvious. At this point, some of rmalayter's suggestions would be good to follow. Until you know where the bottleneck is, you're just guessing and so are we. If the bottleneck is elsewhere, you should probably find it and then come back here and look for or ask a more specific question.
Solution 2:
How big is the file content you're testing with?
Turn the logging level in nginx up to "warn" and check error.log. You will likely see warnings about proxied content being written to temporary files on disk. I suspect you need to increase the number/size of proxy_buffers, or turn proxy buffering off completely. The nginx defaults are too low to be useful for any reasonable modern content.
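Something along these lines in the proxy configuration is what I mean; the log path is just the common default, and treat the buffer sizes as a starting point to tune rather than drop-in values:

    error_log  /var/log/nginx/error.log warn;    # surfaces the "buffered to a temporary file" warnings

    proxy_buffers      32 16k;    # enough buffer space for a typical response
    proxy_buffer_size  16k;

    # ...or skip buffering entirely and stream responses straight to the client:
    # proxy_buffering off;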
With a similar configuration, I see 3700 requests/second for a static 57 kB HTML file coming from two back-end IIS boxes. All are single-CPU virtual machines with 2 GB of RAM. I have proxy_buffers set to "proxy_buffers 32 16k;". Obviously, if you're only seeing 50 requests per second with Apache, you are testing a dynamic page, right?