AWS: reduce NAT Gateway costs for a small system

I am setting up infrastructure for a startup that will not have much traffic for now, but should be able to scale when needed.

We are favouring a setup with a load balancer that distributes traffic to the frontend nodes in a dedicated private subnet (spanning 3 availability zones), which in turn make requests to the backend nodes in their own dedicated private subnet, which in turn make requests to MongoDB, managed via Atlas and VPC peering.

Each node requires internet access in order to provision itself. The backend nodes also make requests to third-party services and therefore require internet access while running as well.

I see three options:

  • set up a NAT Gateway for each private subnet in each availability zone. Depending on the region this comes to around $30 per subnet per availability zone. With 3 availability zones and 2 subnets this totals around $180 a month, which is actually more than we plan to spend on the EC2 instances while there is not much traffic and load on the system. We could probably cut that down to one NAT Gateway per availability zone shared by all the private subnets, but that is still around $90.

  • set up EC2 instances as NAT instances, which will probably be a little cheaper, but requires setup and maintenance.

  • just use one subnet, assign public IPs to each node, and use the internet gateway via route table entries. I don't think using dedicated private subnets will make much sense as the nodes should be able to connect with each other via the gateway anyway.

The last option will most likely be the cheapest, as one Elastic IP is free while attached to a running EC2 instance and dedicated gateways are not needed. However, I was wondering whether there is a significant downside or risk involved in doing so. We plan to return to the idea of dedicated subnets when there is a need for it (e.g. when there is significant traffic), but we would really like to keep costs as low as possible in the beginning.


Solution 1:

You seem to be laboring under some misunderstandings about network fundamentals in VPC.

set up a NAT Gateway for each private subnet in each availability zone.

For all practical purposes, this is never something you would actually need to do.

The maximum number of NAT Gateways you would ever need in a single VPC is one per AZ.

NAT Gateways are never placed on (any of) the subnet(s) they serve. NAT Gateways are placed on a public subnet, which has a default route pointing to the Internet Gateway. They then provide NAT services to instances on other subnets, where the NAT Gateway is specified as the default gateway for those subnets.

So the number of private subnets in an AZ has no relationship to the number of NAT Gateways. Unless you need Internet bandwidth in excess of 45 Gbit/s per AZ, you don't need multiple NAT Gateways.
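
To make the wiring concrete, here is a minimal boto3 sketch of that layout, assuming an existing public subnet and a private route table (the IDs below are placeholders, not real values): one NAT Gateway is created in the public subnet, and the private subnet's default route points at it.

    import boto3

    ec2 = boto3.client("ec2")

    # Placeholder IDs -- substitute your own resources.
    public_subnet_id = "subnet-0pub11c0example"       # public subnet (default route -> Internet Gateway)
    private_route_table_id = "rtb-0priv4te0example"   # route table serving a private subnet

    # The NAT Gateway lives in the PUBLIC subnet and needs an Elastic IP.
    eip = ec2.allocate_address(Domain="vpc")
    natgw = ec2.create_nat_gateway(
        SubnetId=public_subnet_id,
        AllocationId=eip["AllocationId"],
    )["NatGateway"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[natgw["NatGatewayId"]])

    # The PRIVATE subnet reaches the Internet through the gateway via its default route.
    ec2.create_route(
        RouteTableId=private_route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=natgw["NatGatewayId"],
    )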

Next, you technically need only one NAT Gateway per VPC. A NAT Gateway is a logical entity, not a physical one, so there is no known mechanism by which one can "fail" (except at initial creation, when there's a possibility of it failing to be defined). The things that suggest against sharing one NAT Gateway across availability zones are these:

  • You'll pay for cross-AZ traffic using the gateway. As long as this costs less than the additional gateways would, it can still make sense.
  • You'll see a slight uptick in latency, typically single-digit milliseconds, for cross-zone traffic using the NAT Gateway to access the Internet. This is a tradeoff, but may be insignificant.
  • The complete outage, failure, loss, or destruction of the AZ hosting the gateway will result in loss of use of the gateway across all AZs, but this has apparently never happened to date.

Next, using EC2 instances as NAT devices requires almost no setup. The stock AMI for NAT Instances is zero-config at the instance level. You can also build your own. EC2 Instance Recovery can repair a NAT Instance when the underlying hardware actually fails or the hypervisor becomes unresponsive (rare but possible).
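
As a sketch of how little setup is involved, the following boto3 snippet launches a NAT instance from a stock NAT AMI (the AMI, subnet, and security group IDs are placeholders) and performs the one step that is genuinely NAT-specific: disabling the source/destination check.

    import boto3

    ec2 = boto3.client("ec2")

    # Placeholder IDs -- substitute your own resources.
    nat_ami_id = "ami-0example0nat"                   # an amzn-ami-vpc-nat-* AMI for your region
    public_subnet_id = "subnet-0pub11c0example"
    nat_sg_id = "sg-0example0nat"                     # must allow traffic from the private subnets
    private_route_table_id = "rtb-0priv4te0example"

    inst = ec2.run_instances(
        ImageId=nat_ami_id,
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        NetworkInterfaces=[{
            "DeviceIndex": 0,
            "SubnetId": public_subnet_id,
            "Groups": [nat_sg_id],
            "AssociatePublicIpAddress": True,         # the NAT instance itself must reach the Internet
        }],
    )["Instances"][0]
    ec2.get_waiter("instance_running").wait(InstanceIds=[inst["InstanceId"]])

    # A NAT instance forwards traffic it neither sources nor terminates,
    # so EC2's source/destination check must be switched off.
    ec2.modify_instance_attribute(
        InstanceId=inst["InstanceId"],
        SourceDestCheck={"Value": False},
    )

    # Point the private subnet's default route at the instance.
    ec2.create_route(
        RouteTableId=private_route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=inst["InstanceId"],
    )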

I don't think using dedicated private subnets will make much sense as the nodes should be able to connect with each other via the gateway anyway.

That isn't really relevant either way. Dedicated subnets or a lack of them has no real impact on the way instances communicate with each other -- only the way they believe they are communicating. The "default router" on each subnet in a VPC is an imaginary device that exists for compatibility with the way IP over Ethernet works. When two instances in a VPC are allowed by Security Groups and Network ACLs to communicate, the way their actual traffic gets from one instance to another is identical, regardless of whether the two instances are on the same subnet or not.

Cross-subnet, an instance goes through the motions of arping the default gateway and sending traffic to it, meanwhile the hypervisor plays along but effectively ignores all of that and sends the traffic directly to the other instance's hypervisor. Within a subnet, the instance arps for its peer, the hypervisor spoofs that response (the ARP "who has" never appears at the target instance, yet the source instance sees the response the target never generated) and the node-to-node traffic follows exactly the same path as before.

We all did fine for years using EC2 instances as NAT Instances, because that was the only option -- NAT Gateway is a relatively new service. If you are trying to save costs, go with NAT Instances. Or use NAT Instances in all AZs except one, and a single NAT Gateway in that one AZ.

Add VPC endpoints for services that support them, like S3 and DynamoDB, since these endpoints allow you to access those services without a NAT device.
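
A minimal sketch of a Gateway endpoint for S3 (IDs and region are placeholders): the call adds a prefix-list route to the given route tables, so S3 traffic bypasses the NAT device entirely, and Gateway endpoints carry no hourly charge.

    import boto3

    ec2 = boto3.client("ec2")

    # Placeholder IDs -- substitute your own resources.
    vpc_id = "vpc-0example0"
    private_route_table_ids = ["rtb-0priv4te0example"]

    # Gateway endpoints (S3, DynamoDB) are configured per route table.
    ec2.create_vpc_endpoint(
        VpcId=vpc_id,
        ServiceName="com.amazonaws.us-east-1.s3",     # adjust to your region
        RouteTableIds=private_route_table_ids,
    )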

Solution 2:

You can route outbound traffic from several private subnets to the same NAT Gateway.

Sometimes I create a separate public "management" subnet where the NAT Gateway (or similar resources) lives, and give every private subnet that should be able to reach the internet a corresponding route to that NAT Gateway.

That layout makes it more obvious that the subnet serves a special purpose and may be shared by several other, separate subnets. Typically I assign such a subnet a tiny CIDR.
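
A short boto3 sketch of that fan-in, assuming the NAT Gateway already exists in the management subnet (all IDs are placeholders): each private subnet's route table simply gets the same default route.

    import boto3

    ec2 = boto3.client("ec2")

    # Placeholder IDs -- substitute your own resources.
    natgw_id = "nat-0example0mgmt"                    # NAT Gateway in the small public "management" subnet
    private_route_table_ids = [
        "rtb-0frontend0example",
        "rtb-0backend00example",
    ]

    # Every private subnet that should reach the Internet gets the same default route.
    for rtb_id in private_route_table_ids:
        ec2.create_route(
            RouteTableId=rtb_id,
            DestinationCidrBlock="0.0.0.0/0",
            NatGatewayId=natgw_id,
        )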