First of all, I'm assuming that by "draw calls" you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).

Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.

The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.

Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry as the GPU could otherwise handle.
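To put rough numbers on that, here is a toy cost model. It assumes a fixed CPU cost per draw call and a fixed GPU triangle throughput; the function names and both default figures are made up for illustration, not measurements from any real driver or card.

```python
import math

def frame_cost_ms(total_triangles, triangles_per_call,
                  cpu_us_per_call=20.0, gpu_triangles_per_us=500.0):
    """Return (cpu_ms, gpu_ms) for one frame under a toy cost model.

    Both default rates are illustrative assumptions, not measured values.
    """
    calls = math.ceil(total_triangles / triangles_per_call)
    cpu_ms = calls * cpu_us_per_call / 1000.0                  # time spent issuing calls
    gpu_ms = total_triangles / gpu_triangles_per_us / 1000.0   # time spent rendering
    return cpu_ms, gpu_ms

def limiting_side(cpu_ms, gpu_ms):
    # CPU and GPU run in parallel, so the slower side sets the frame time.
    return "CPU" if cpu_ms > gpu_ms else "GPU"

for batch in (2, 100, 50_000):
    cpu_ms, gpu_ms = frame_cost_ms(1_000_000, batch)
    side = limiting_side(cpu_ms, gpu_ms)
    print(f"{batch:>6} triangles/call: CPU {cpu_ms:.1f} ms, GPU {gpu_ms:.1f} ms, {side}-bound")
```

The GPU time is the same in every row because the triangle count doesn't change; only the CPU time spent issuing calls does, which is the whole argument for bigger batches.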

There are some real costs to making draw calls: each one requires setting up a bunch of state (which set of vertices to use, what shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating the calls that set state).
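One common way to cut down on state changes is to sort your draws by state before submitting them. The sketch below is only an illustration of that idea: the `Draw` record, the shader and texture names, and the counting function are all hypothetical, and a real renderer would track far more state than two fields.

```python
from collections import namedtuple

# Hypothetical draw record: the state it needs plus the mesh to render.
Draw = namedtuple("Draw", ["shader", "texture", "mesh"])

def count_state_changes(draws):
    """Count how often the shader or texture must be rebound."""
    changes = 0
    current_shader = current_texture = None
    for d in draws:
        if d.shader != current_shader:
            changes += 1
            current_shader = d.shader
        if d.texture != current_texture:
            changes += 1
            current_texture = d.texture
    return changes

draws = [
    Draw("lit", "brick", "wall_a"),
    Draw("unlit", "sky", "skybox"),
    Draw("lit", "wood", "crate"),
    Draw("lit", "brick", "wall_b"),
    Draw("unlit", "sky", "clouds"),
    Draw("lit", "wood", "table"),
]

# Group draws that share state so each shader/texture is bound once per group.
sorted_draws = sorted(draws, key=lambda d: (d.shader, d.texture))

print(count_state_changes(draws))         # submission order
print(count_state_changes(sorted_draws))  # sorted by state
```

Sorting only helps when draw order doesn't matter for correctness; transparent geometry, for example, usually has to be drawn in depth order.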

But the main cost of draw calls only applies if each call submits too little data, since this will cause you to be CPU-bound and stop you from utilizing the hardware fully.

Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.

You should read the nVidia presentation "Batch Batch Batch!"; it's fairly old but covers exactly this topic.


Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive, both because it implies the actual work is now being performed and because it can incur a switch from user to kernel mode on the CPU (and back again), which is not cheap.

Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and it may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
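As a rough mental model of that queue-then-flush behavior, here is a toy command buffer. The `CommandBuffer` class and the command strings are invented for illustration and are not how any real driver or Direct3D runtime is implemented; the point is only that recording a command is cheap and the work is deferred until the flush.

```python
class CommandBuffer:
    """Toy model: recording is cheap, the work happens at flush time."""

    def __init__(self):
        self.pending = []    # commands queued but not yet executed
        self.executed = []   # commands the (pretend) GPU has processed
        self.flushes = 0     # each flush stands in for a costly transition

    def record(self, command):
        # API calls just append to the queue; nothing is drawn yet.
        self.pending.append(command)

    def flush(self):
        # All deferred work is handed over in one go.
        self.flushes += 1
        self.executed.extend(self.pending)
        self.pending.clear()

buf = CommandBuffer()
buf.record("set_shader lit")
buf.record("set_vertex_buffer wall")
buf.record("draw 2 triangles")
print(len(buf.executed))  # nothing has run before the flush
buf.flush()
print(len(buf.executed), buf.flushes)
```

In this model a thousand recorded draws cost one flush, while flushing after every draw would pay the flush cost a thousand times.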

The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.

More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.

But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.