Pricing of Google App Engine Flexible env, a $500 lesson

I followed the Nodejs on App Engine Flexible env tutorial: https://cloud.google.com/nodejs/getting-started/hello-world

Having successfully deployed and tested the tutorial, I changed the code to experiment a little and successfully deployed it... and then left it running since this was a testing environment (not public).

A month later, I received a bill from Google for over $370!

In the transaction details I see the following:

Oct 1 – 31, 2017 App Engine Flex Instance RAM: 5948.774 Gibibyte-hours ([MYPROJECT]) $42.24

Oct 1 – 31, 2017 App Engine Flex Instance Core Hours: 5948.774 Hours ([MYPROJECT]) $312.91

How did this testing environment with almost 0 requests require about 6,000 hours of resources? At worst, I would have assumed that 720 hours running full-time for a month at $0.05 per hour would cost me ~$40. https://cloud.google.com/appengine/pricing

Can someone help shed light on this? I have not been able to find out why so many resources were needed.

Thanks for the help!

For more data, this is the traffic over the last month (basically 0): [image: Traffic Data]

And the instance data: [image: Instance Data]

UPDATE: Note that I did make one modification to the package.json: I added nodemon as a dependency and made it part of my "npm start" script. I doubt this explains the 6000 hours of resources, though:

  "scripts": {
    "deploy": "gcloud app deploy",
    "start": "nodemon app.js",
    "dev": "nodemon app js",
    "lint": "samples lint",
    "pretest": "npm run lint",
    "system-test": "samples test app",
    "test": "npm run system-test",
    "e2e-test": "samples test deploy"
  },

app.yaml (default, unchanged from the tutorial)

runtime: nodejs
env: flex

Solution 1:

After much back and forth with Google, and hours of reading blogs and looking at reports, I've finally found an explanation for what happened. I will post it here with my suggestions so that other people do not also fall victim to this problem.

Note, this may seem obvious to some, but as a new GAE user, all of this was brand new to me.

In short, deploying to GAE with "$ gcloud app deploy" creates a new version and sets it as the default but, more importantly, it does NOT remove the version that was deployed before it.

More info about versions and instances can be found here: https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine

So in my case, without knowing it, I had created multiple versions of my simple node app. These versions are still running in case one needs to switch following an error. But these versions also require instances, and the default, unless stated in the app.yaml, is 2 instances.
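One way to clean up leftover versions is to list them and delete every version that receives no traffic. The sketch below only builds the `gcloud app versions delete` command lines from a list of version records and does not run anything. The record shape (`service`, `id`, `traffic_split`) is an assumption modeled on what `gcloud app versions list --format=json` prints, and the sample version IDs are made up.

```python
import shlex


def cleanup_commands(versions):
    """Return one `gcloud app versions delete` command per idle version.

    `versions` is a list of dicts with the assumed keys `service`, `id`
    and `traffic_split` (fraction of traffic routed to that version).
    """
    commands = []
    for v in versions:
        if v["traffic_split"] > 0:
            continue  # keep anything that still serves traffic
        args = ["gcloud", "app", "versions", "delete", v["id"],
                "--service", v["service"], "--quiet"]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Hypothetical records, as four deploys without --version might leave them.
sample = [
    {"service": "default", "id": "20171001t101500", "traffic_split": 0.0},
    {"service": "default", "id": "20171003t090000", "traffic_split": 0.0},
    {"service": "default", "id": "20171010t120000", "traffic_split": 0.0},
    {"service": "default", "id": "20171020t140000", "traffic_split": 1.0},
]

for cmd in cleanup_commands(sample):
    print(cmd)
```

Printing the commands instead of executing them leaves room to review what would be deleted before pasting anything into a terminal.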

Google says:

App Engine by default scales the number of instances running up and down to match the load, thus providing consistent performance for your app at all times while minimizing idle instances and thus reducing cost.

However, in my experience, this was not the case. As I said earlier, I pushed my node app with nodemon, which seems to have been causing errors.

In the end, by following the tutorial and not shutting down the project, I had 4 versions, each with 2 instances running full-time for 1.5 months, serving 0 requests and generating lots of error messages. It cost me $500.
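The billed hours in the question line up with this explanation. Here is a quick back-of-the-envelope check, assuming October's 744 hours and one core per instance (the instance size is an assumption here, not something shown on the invoice):

```python
# Figures copied from the October 2017 invoice lines in the question.
BILLED_CORE_HOURS = 5948.774
CORE_COST_USD = 312.91
RAM_COST_USD = 42.24

HOURS_IN_OCTOBER = 31 * 24  # 744 hours

# Assumption: every instance has 1 core, so core-hours equal instance-hours.
always_on_instances = BILLED_CORE_HOURS / HOURS_IN_OCTOBER

# What 4 versions x 2 instances would use if they never stopped.
expected_hours = 4 * 2 * HOURS_IN_OCTOBER

# Price per core-hour implied by the invoice itself.
effective_core_rate = CORE_COST_USD / BILLED_CORE_HOURS

print(f"equivalent always-on instances: {always_on_instances:.2f}")
print(f"hours for 4 versions x 2 instances: {expected_hours}")
print(f"effective price per core-hour: ${effective_core_rate:.4f}")
print(f"core + RAM total: ${CORE_COST_USD + RAM_COST_USD:.2f}")
```

The billed hours come out at almost exactly 8 always-on instances, which is what 4 versions with 2 instances each would produce.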

RECOMMENDATIONS IF YOU STILL WANT TO USE GAE FLEX ENV:

  1. First and foremost, set up a billing budget & alerts so that you do not get surprised by an expensive invoice that is automatically charged to your CC: https://cloud.google.com/billing/docs/how-to/budgets

  2. In a testing env, you most likely do not need multiple versions, so while deploying use the following command:
    $ gcloud app deploy --version v1

  3. Update your app.yaml to force only 1 instance with minimal resources:

runtime: nodejs
env: flex

# This sample incurs costs to run on the App Engine flexible environment.
# The settings below are to reduce costs during testing and are not appropriate
# for production use. For more information, see:
# https://cloud.google.com/appengine/docs/flexible/nodejs/configuring-your-app-with-app-yaml
manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 0.5
  disk_size_gb: 10

  4. Set a daily spending limit


See this blog post for more info: https://medium.com/google-cloud/three-simple-steps-to-save-costs-when-prototyping-with-app-engine-flexible-environment-104fc6736495

I wish some of these steps had been included in the tutorial in order to protect those who are trying to learn and experiment, but they were not.

Google App Engine Flex env can be tricky if one does not know all these details. A friend pointed me to Heroku, which has both set pricing and Free/Hobby offers. I was able to quickly push a new node app there, and it worked like a charm! https://www.heroku.com/pricing

It "only" cost me $500 to learn this lesson, but I do hope this helps others looking at Google App Engine Flex Env.

Solution 2:

If you want to reduce your GAE costs please do not use manual_scaling as suggested in this article or the accepted answer!

The beautiful thing about Google App Engine is that it can scale up and down to hundreds of machines within milliseconds based on demand. And you only pay for instances that are running.

To be able to optimize your costs you need to understand the different scaling options and instance types:

1. App Engine flex vs standard:

The details about differences can be found here, but one important difference relevant for this question is:

[Standard is] Intended to run for free or at very low cost, where you pay only for what you need and when you need it. For example, your application can scale to 0 instances when there is no traffic.

2. Scaling Options:

  • Automatic scaling: Google will scale your app depending on demand and the configuration you provided.
  • Manual scaling: No scaling at all. GAE will run the exact number of instances you asked for, all the time (very misleading naming).
  • Basic scaling: It will scale up to the limit you set and will also scale down after a certain idle time.

3. Instance Types: There are 2 instance types, and they basically differ in the time it takes to spin up a new instance. F class instances (used in automatic scaling) can be created on demand within ~0.1 seconds, and B class instances (used in manual/basic scaling) within ~0.7 seconds.

Now that you understand the basics, let's go back to the accepted answer:

manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 0.5
  disk_size_gb: 10

This instructs GAE to run a custom instance class (more costly) all the time. Obviously this is not the cheapest option, because a B1/F1 instance type (which has lower specs) could be used instead, and it also keeps an instance running constantly.

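The cheapest option is to turn off the instance when there is no traffic. If you don't mind the ~0.1 second spin-up time, you could go with this instead (note that instance classes and scaling to 0 belong to the standard environment, so this means not using env: flex):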

instance_class: F1
automatic_scaling:
  max_instances: 1  # you can adjust this as you wish
  min_instances: 0  # will scale to 0 when there is no traffic, so it won't incur costs

This will fall within the free quotas Google provides, and it should not cost you anything if you don't have any real traffic.
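To see why scaling to zero matters, compare the monthly instance-hours of an always-on instance with one that only runs while serving traffic. This is a rough sketch: the $0.05 hourly rate is the figure quoted in the question (real rates differ by environment and instance class), the 2 busy hours per day are a made-up testing workload, and free quotas are ignored.

```python
HOURLY_RATE_USD = 0.05  # rate quoted in the question; an assumption for any other setup


def monthly_instance_hours(instances, hours_per_day, days=30):
    """Instance-hours accumulated over a month of `days` days."""
    return instances * hours_per_day * days


# manual_scaling with 1 instance: runs 24 h/day whether or not there is traffic.
always_on = monthly_instance_hours(instances=1, hours_per_day=24)

# automatic_scaling with min_instances: 0 and a hypothetical 2 busy hours per day.
scale_to_zero = monthly_instance_hours(instances=1, hours_per_day=2)

for label, hours in [("always on", always_on), ("scale to zero", scale_to_zero)]:
    cost = hours * HOURLY_RATE_USD
    print(f"{label}: {hours} instance-hours, about ${cost:.2f} before any free quota")
```

Under these assumptions the always-on instance accumulates 720 instance-hours a month against 60 for the one that scales down, so idle time dominates the bill of a test project.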

PS: It's also highly recommended to set up a daily spending limit in case you forget something running or have some costly settings somewhere (daily spending limits are deprecated but will be available until July 24, 2021, source).

Solution 3:

We had code deployed to GAE FE go absolutely nuts due to a cascading, exponential failure (bounced emails generated bounced-email emails, etc.), and we could NOT turn off the GAE instances that were bugged. Mailgun just would NOT let us disable the account: it said "Please wait up to 24 hours for the password change to go into effect", and revoking API keys did nothing. After 4+ hours and 1M+ emails sent, with the redis VM stopped, the DB down, and all the site's code reduced to a single "Down For Maintenance" static 503 page, the emails kept being sent.

I determined that GAE FE just simply does not end either docker VMs or Cloud Compute VMs (redis) that are under CPU load. Maybe never! Once we actually deleted the Compute VM (instead of "merely" stopping it), the emails instantly stopped.

But, our DB continued to get filled with "could not send email" notices for up to 2 more hours, despite the GAE app reporting 100% of the versions and instances to be "Stopped". I ended up having to change the Google Cloud SQL password.

We kept checking the bill, and the 7 rogue instances kept using up CPU and so we cancelled the card used on that account, and the site did, in fact, go down when the bill was past due, but so did the rogue instances. We never were able to resolve the situation with GAE email support.


Update (30 Sep 2020): This is still the worst moment of my 22-year career!! An entire company of 15 crack genius devs couldn't figure out how to turn off GAE. We knew customers were receiving MILLIONS of emails when one of my devs couldn't access her GMail account. Couldn't unplug it, couldn't turn it off. It was quite a "Terminator" moment!

It wouldn't have been nearly so bad, except for expenses, if MailGun had allowed us to actually disable the API access or change the password. But it would have still been bad expense-wise on GAE.

I no longer trust servers I can't issue a reboot on.

In the end, MailGun only charged us about $50. GAE, however... If I had just assumed "OK, mails stopped, we can stop", we could have ended up with a $20,000 excess bill! As it was, it "only" cost $1,500. And we never could get in contact with anyone to dispute it. So the CEO just ate it.