Vertex AI prediction - Autoscaling cannot set minimum node to 0

I am unclear about Vertex AI pricing for model predictions. In the documentation, under the heading "More about automatic scaling of prediction nodes", one of the points mentioned is:

"If you choose automatic scaling, the number of nodes scales automatically, and can scale down to zero for no-traffic durations"

The example provided in the documentation later also seems to suggest that during a period with no traffic, zero nodes are in use. However, when I create an Endpoint in Vertex AI, under the Autoscaling heading it says:

"Autoscaling: If you set a minimum and maximum, compute nodes will scale to meet traffic demand within those boundaries"

A value of 0 under "Minimum number of compute nodes" is not allowed; you have to enter 1 or greater. The help text also says:

"Default is 1. If set to 1 or more, then compute resources will continuously run even without traffic demand. This can increase cost but avoid dropped requests due to node initialization."

My question is: what happens when I select autoscaling with Minimum set to 1 and Maximum set to, say, 10? Does 1 node always run continuously, or does it scale down to 0 nodes under no traffic, as the documentation suggests?

To test this, I deployed an Endpoint with autoscaling (min and max both set to 1). When I sent a prediction request, the response was almost immediate, suggesting the node was already up. I tried again after about an hour and the response was again immediate, suggesting the node probably never shut down. Also, for strict latency requirements, is autoscaling to 0 nodes, if that is indeed possible, even practical? That is, what latency can we expect when starting up from 0 nodes?
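For reference, my deployment was along these lines. This is only a sketch: the helper function below is my own illustration (not part of any SDK), while the keyword arguments themselves match the `google-cloud-aiplatform` SDK's `Model.deploy` parameters; the machine type and resource names are placeholders.

```python
# Sketch of deploying a Vertex AI endpoint with autoscaling bounds.
# build_deploy_kwargs is an illustrative helper, not part of the SDK;
# the keys it returns match aiplatform.Model.deploy's signature.

def build_deploy_kwargs(min_nodes: int, max_nodes: int) -> dict:
    """Build autoscaling arguments for aiplatform.Model.deploy."""
    if min_nodes < 1:
        # The Vertex AI console (and API) reject a minimum of 0
        raise ValueError("Vertex AI requires min_replica_count >= 1")
    if max_nodes < min_nodes:
        raise ValueError("max_replica_count must be >= min_replica_count")
    return {
        "machine_type": "n1-standard-4",   # placeholder machine type
        "min_replica_count": min_nodes,    # at least this many nodes always run
        "max_replica_count": max_nodes,    # upper bound for scale-out
    }

kwargs = build_deploy_kwargs(1, 10)
# With the SDK this would be used roughly as:
#   from google.cloud import aiplatform
#   model = aiplatform.Model("projects/.../models/...")  # placeholder name
#   endpoint = model.deploy(**kwargs)
print(kwargs["min_replica_count"])  # 1
```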


Are you using an N1 or a non-N1 machine type? If you want to autoscale to zero, you must use a non-N1 machine type. See the second note under node allocation:

Note: Versions that use a Compute Engine (N1) machine type cannot scale down to zero nodes. They can scale down to 1 node, at minimum.
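To illustrate the restriction, here is a minimal sketch of the Version body you would send to AI Platform Prediction's `projects.models.versions.create` REST method. The check function is my own illustration of the documented rule, and the bucket path is a placeholder.

```python
# Sketch: AI Platform Prediction Version body with scale-to-zero.
# autoScaling.minNodes = 0 is accepted only for non-N1 (legacy MLS1)
# machine types; N1 types can scale down to 1 node at minimum.

N1_PREFIXES = ("n1-standard-", "n1-highmem-", "n1-highcpu-")

def min_nodes_allowed(machine_type: str, min_nodes: int) -> bool:
    """Illustrative check mirroring the documented restriction."""
    if machine_type.startswith(N1_PREFIXES):
        return min_nodes >= 1   # N1: cannot go below 1 node
    return min_nodes >= 0       # non-N1: 0 is allowed

version_body = {
    "name": "v1",
    "deploymentUri": "gs://your-bucket/model/",  # placeholder path
    "machineType": "mls1-c1-m2",                 # legacy type: may scale to 0
    "autoScaling": {"minNodes": 0},
}

assert min_nodes_allowed(version_body["machineType"],
                         version_body["autoScaling"]["minNodes"])
assert not min_nodes_allowed("n1-standard-4", 0)
```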

Update: AI Platform supports scaling to zero, while Vertex AI currently does not. The Vertex AI scaling documentation describes how nodes scale, but makes no mention of scaling down to zero. Here's a public feature request for anyone who wants to track this issue.

With regard to latency, the actual numbers will vary. However, one thing to note from the documentation is that the service may not be able to bring nodes online fast enough to keep up with large spikes in request traffic. If your traffic regularly has steep spikes, and if reliably low latency is important to your application, you may want to consider manual scaling.
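If you do go the manual-scaling route on AI Platform, the Version body pins a fixed node count via `manualScaling` instead of `autoScaling`. Again a sketch, with placeholder names and an example node count:

```python
# Sketch: fixed node count via manualScaling instead of autoScaling.
# Nodes run continuously, so there is no cold-start latency on spikes,
# at the cost of paying for idle capacity.
manual_version_body = {
    "name": "v1",
    "deploymentUri": "gs://your-bucket/model/",  # placeholder path
    "machineType": "n1-standard-4",              # example machine type
    "manualScaling": {"nodes": 5},               # provisioned for expected peak
}
print(manual_version_body["manualScaling"]["nodes"])  # 5
```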

Additional Reference: https://cloud.google.com/ai-platform/prediction/docs/machine-types-online-prediction#automatic_scaling