Amazon EC2 VPC: NAT instance download speed performance drop

I have a set of servers inside Amazon EC2 in a VPC. Inside this VPC I have a private subnet and a public subnet. In the public subnet I have set up a NAT machine on a t2.micro instance that basically runs this NAT script on startup, injecting rules into iptables. Downloading files from the internet from a machine inside the private subnet works fine.
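
The script boils down to something like this (a minimal sketch; the interface name and private-subnet CIDR below are placeholders, not my actual values):

#!/bin/bash
# Allow the instance to forward packets on behalf of the private subnet.
sysctl -w net.ipv4.ip_forward=1
# Masquerade private-subnet traffic leaving through the public interface.
iptables -t nat -A POSTROUTING -o eth0 -s 10.0.1.0/24 -j MASQUERADE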

However, I compared the download speed of a file hosted on an external high-bandwidth FTP server when downloading directly from my NAT machine versus from a machine inside my private subnet (via the same NAT machine). The difference was really significant: around 10 MB/s from the NAT machine vs. 1 MB/s when downloading from the machine inside the private subnet.

There is virtually no CPU usage on the NAT machine, so that cannot be the bottleneck. When trying the same test with bigger NAT machines (m3.medium with "moderate network performance" and m3.xlarge with "high network performance"), I also could not get download speeds greater than 2.5 MB/s.

Is this a general NAT problem that can (and should) be tuned? Where does the performance drop come from?

Update

After some testing I could narrow this problem down. When I use Ubuntu 12.04 or Amazon Linux NAT machines from 2013, everything runs smoothly and I get the full download speed, even on the smallest t2.micro instances. It does not matter whether I use PV or HVM machines. The problem seems to be kernel-related: these old machines have kernel version 3.4.x, whereas the newer Amazon Linux NAT machines and Ubuntu 14.x have kernel version 3.14.x. Is there any way to tune the newer machines?
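
For reference, the running kernel on each machine can be checked with:

uname -r
# prints the kernel version, e.g. 3.14.x on the newer Amazon Linux NAT AMIs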


We finally found the solution. You can fix the download speed by running the following on the NAT machine (as root):

ethtool -K eth0 sg off

This disables scatter-gather mode, which (as far as I understand it) stops offloading some of the network work to the network card itself. Disabling this option leads to higher CPU usage on the NAT machine, as the CPU now has to do that work itself. However, on a t2.micro machine we only saw around 5% CPU usage when downloading a DVD image.
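
You can check whether scatter-gather is currently enabled with the lowercase -k option:

ethtool -k eth0 | grep scatter-gather
# scatter-gather: on    <- should read "off" after applying the fix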

Note that this setting won't survive a reboot, so make sure to set it in rc.local, or at least before setting up the NAT rules.
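
One way to persist it is a line in /etc/rc.local (a sketch; assumes the public interface is eth0 and that your distribution executes /etc/rc.local at boot):

#!/bin/sh
# /etc/rc.local -- re-apply the offload setting on every boot, before NAT traffic flows
ethtool -K eth0 sg off
exit 0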


I also use NAT boxes in a similar setup in production, so I am very interested in your findings. I had not noticed anything similar before going to production, but maybe it's an issue that I simply haven't paid attention to before.

Let's do some science!

============================================================================

Theory: NAT boxes can download and upload faster than a client that is using the NAT.

Experiment: Match the questioner's experiment: t2.micros with the Amazon NAT 2014.09 AMI, two subnets, with the NAT's subnet routing to an IGW and the private subnet routing to the NAT. (Shared tenancy, General Purpose SSD.)

Procedure:

# install speedtest
$ sudo yum install python-pip -y --enablerepo=epel; sudo pip install speedtest-cli
# run against the same server
$ speedtest-cli --server 935 --simple
# run it many times
$ for ((n=0;n<10;n++)); do speedtest-cli --simple --server 935; done

Data (all figures are Mbit/s, as reported by speedtest-cli):

          NAT      Client
Download  727.38   157.99
Upload    250.50   138.91

Conclusion: OP is not lying.

============================================================================

Theory: Different kernel versions lead to different results.

Experiment: Set up 3 NAT boxes, each with magnetic EBS storage, m3.medium (no bursting), and dedicated tenancy. Run a speed test.

Procedure: Same as the last experiment. Also, set up a routing table for each NAT box, plus a blackhole routing table used to prove that the changes actually propagated when I swapped routing tables (see the CLI sketch after the list below):

  1. Using a NAT.
  2. curl google.com works.
  3. Switch to blackhole.
  4. Wait for curl google.com to fail on the client.
  5. Switch to new NAT.
  6. curl google.com works.

Here are my 3 NAT boxes:

2014.09: 3.14.20-20.44.amzn1.x86_64
2014.03: 3.10.42-52.145.amzn1.x86_64
2013.09: 3.4.62-53.42.amzn1.x86_64

Data:

All 3 boxes get very similar results when running speedtest-cli --server 935

          09/14   03/14   09/13
Download  355.51  356.55  364.04
Upload    222.59  212.45  252.69

From the client:

          09/14   03/14   09/13
Download  351.18  364.85  363.69
Upload    186.96  257.58  248.04

Conclusion: Is there degradation? No. Is there any difference between the kernel versions? No.

============================================================================

Theory: Dedicated versus shared tenancy makes a difference.

Experiment: 2 NAT boxes. Both using NAT 2014.09. One with shared tenancy, one with dedicated tenancy.

Data: Both boxes have similar performance:

          Shared NAT   Dedicated NAT
Download  387.67       387.26
Upload    296.27       336.89

They also have similar standard deviations:

$ python3
>>> import statistics
>>> shared_download = [388.25, 333.66, 337.44, 334.72, 338.38, 335.52, 333.73, 333.28, 334.43, 335.60]
>>> statistics.stdev(shared_download)
16.858005318937742
>>> dedicated_download = [388.59, 338.68, 333.97, 337.42, 326.77, 346.87, 336.74, 345.52, 362.75, 336.77]
>>> statistics.stdev(dedicated_download)
17.96480002671891

And when you run the 2x2 combinations of client and NAT tenancy:

          Sh. Client/Sh. NAT  Sh. Client/Ded. NAT  Ded. Client/Sh. NAT  Ded. Client/Ded. NAT
Upload          290.83               288.17               283.13               340.94
Download        260.01               250.75               248.05               236.06

Conclusion: Really unclear; shared versus dedicated doesn't seem to make a big difference.

Meta conclusions:

The test that's probably most worth redoing is OP's test with m3.mediums. I was able to duplicate the t2.micro results, but my m3.medium results seem to conflict with OP's m3.medium results.

I'd be interested in seeing your data on kernel versions as well.

Perhaps the most interesting part is that I was unable to get an m3.medium NAT to go quickly.