Why does GL divide `gl_Position` by W for you rather than letting you do it yourself?

Note: I understand the basic math. I understand that the typical perspective function in various math libraries produces a matrix that maps z values from -zNear to -zFar into the -1 to +1 range, but only after the result is divided by w.

The specific question is: what is gained by the GPU doing this for you rather than you having to do it yourself?

In other words, let's say the GPU did not magically divide gl_Position by gl_Position.w and that instead you had to do it manually, as in

attribute vec4 position;
uniform mat4 worldViewProjection;

void main() {
  gl_Position = worldViewProjection * position;
  
  // imaginary version of GL where we must divide by W ourselves
  gl_Position /= gl_Position.w;
}

What breaks in this imaginary GL because of this? Would it work, or is there something about passing in the value before it has been divided by w that provides extra needed info to the GPU?

Note that if I actually do this, perspective-correct texture mapping breaks.

"use strict";
var m4 = twgl.m4;
var gl = twgl.getWebGLContext(document.getElementById("c"));
var programInfo = twgl.createProgramInfo(gl, ["vs", "fs"]);

var bufferInfo = twgl.primitives.createCubeBufferInfo(gl, 2);

var tex = twgl.createTexture(gl, {
  min: gl.NEAREST,
  mag: gl.NEAREST,
  src: [  // a 2x2 checkerboard of white and gray texels
    255, 255, 255, 255,
    192, 192, 192, 255,
    192, 192, 192, 255,
    255, 255, 255, 255,
  ],
});

var uniforms = {
  u_diffuse: tex,
};

function render(time) {
  time *= 0.001;  // convert milliseconds to seconds
  twgl.resizeCanvasToDisplaySize(gl.canvas);
  gl.viewport(0, 0, gl.canvas.width, gl.canvas.height);

  gl.enable(gl.DEPTH_TEST);
  gl.enable(gl.CULL_FACE);
  gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);

  var projection = m4.perspective(
      30 * Math.PI / 180, 
      gl.canvas.clientWidth / gl.canvas.clientHeight, 
      0.5, 10);
  var eye = [1, 4, -6];
  var target = [0, 0, 0];
  var up = [0, 1, 0];

  var camera = m4.lookAt(eye, target, up);
  var view = m4.inverse(camera);
  var viewProjection = m4.multiply(projection, view);
  var world = m4.rotationY(time);

  uniforms.u_worldInverseTranspose = m4.transpose(m4.inverse(world));
  uniforms.u_worldViewProjection = m4.multiply(viewProjection, world);

  gl.useProgram(programInfo.program);
  twgl.setBuffersAndAttributes(gl, programInfo, bufferInfo);
  twgl.setUniforms(programInfo, uniforms);
  gl.drawElements(gl.TRIANGLES, bufferInfo.numElements, gl.UNSIGNED_SHORT, 0);

  requestAnimationFrame(render);
}
requestAnimationFrame(render);
body {  margin: 0; }
canvas { display: block; width: 100vw; height: 100vh; }
<script id="vs" type="notjs">
uniform mat4 u_worldViewProjection;
uniform mat4 u_worldInverseTranspose;

attribute vec4 position;
attribute vec3 normal;
attribute vec2 texcoord;

varying vec2 v_texcoord;
varying vec3 v_normal;

void main() {
  v_texcoord = texcoord;
  v_normal = (u_worldInverseTranspose * vec4(normal, 0)).xyz;
  gl_Position = u_worldViewProjection * position;
  // imaginary version of GL where we must divide by W ourselves
  // (this is what breaks the perspective-correct texture mapping)
  gl_Position /= gl_Position.w;
}
  </script>
  <script id="fs" type="notjs">
precision mediump float;

varying vec2 v_texcoord;
varying vec3 v_normal;

uniform sampler2D u_diffuse;

void main() {
  vec4 diffuseColor = texture2D(u_diffuse, v_texcoord);
  vec3 a_normal = normalize(v_normal);
  float l = dot(a_normal, vec3(1, 0, 0));
  gl_FragColor.rgb = diffuseColor.rgb * (l * 0.5 + 0.5);
  gl_FragColor.a = diffuseColor.a;
}
  </script>
  <script src="https://twgljs.org/dist/4.x/twgl-full.min.js"></script>
  <canvas id="c"></canvas>

But is that because the GPU actually needs z and w to be separate, or is it just GPU design, such that a different design could derive the info it needs if we did the w divide ourselves?

Update:

After asking this question I ended up writing an article that illustrates perspective-correct interpolation.


I'd like to extend BDL's answer. It is not only about perspective-correct interpolation; it is also about clipping. The space gl_Position is supposed to be provided in is called clip space, and that is before the division by w.

The (default) clip volume of OpenGL is defined in clip space as

-w <= x,y,z <= w   (with w varying per vertex)

After the division by w we get

-1 <= x,y,z <= 1   (in normalized device coordinates, NDC).

However, if you try to do the clipping after the division by w and check against that cube in NDC, you get a problem, because all clip-space points fulfilling

 w <= x,y,z <= -w   (in clip space; only possible when w is negative)

will also fulfill the NDC constraint.

The thing here is that points behind the camera will be transformed to somewhere in front of the camera, mirrored (since x/-1 is the same as -x/1). This also happens to the z coordinate. One might argue that this is irrelevant, because any point behind the camera is projected beyond (that is, farther away than) the far plane by the construction of the typical projection matrix, so it will lie outside of the viewing volume in either case.
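To make the mirroring concrete, here is a minimal sketch in plain JavaScript (the project function and the sample points are made up for illustration; the matrix entries follow the standard OpenGL-style perspective matrix, like the one m4.perspective builds in the question's code):

function project(fovY, aspect, zNear, zFar, p) {
  var f = 1 / Math.tan(fovY / 2);
  var rangeInv = 1 / (zNear - zFar);
  // clip-space result of (perspective matrix) * [x, y, z, 1]
  var clip = [
    f / aspect * p[0],
    f * p[1],
    (zNear + zFar) * rangeInv * p[2] + 2 * zNear * zFar * rangeInv,
    -p[2],  // w = -z_eye
  ];
  return {
    clip: clip,
    ndc: [clip[0] / clip[3], clip[1] / clip[3], clip[2] / clip[3]],
  };
}

// in front of the camera (eye space looks down -z): x_ndc ≈ 0.93, inside the cube
console.log(project(Math.PI / 6, 1, 0.5, 10, [0.5, 0, -2]).ndc);

// behind the camera: w is negative, and after the divide x flips sign (mirrored)
// while z lands beyond the far plane (z_ndc ≈ 1.63 > 1) instead of in front of
// the near plane
console.log(project(Math.PI / 6, 1, 0.5, 10, [0.5, 0, 2]).ndc);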

But if you have a primitive where at least one point is inside the view volume and at least one point is behind the camera, you have a primitive which should also intersect the near plane. However, after the division by w, it will intersect the far plane instead! So clipping in NDC space, after the division, is much harder to get right. I tried to visualize this in the drawing below:

[Figure: top-down view of eye space and NDC, with and without clipping. The drawing is to scale; the depth range of the projection is much shorter than anything one would typically use, to better illustrate the issue.]

Clipping is done as a fixed-function stage in hardware, and it has to happen before the division, hence you must provide the correct clip-space coordinates for it to work on.

(Note: actual GPUs might not use an extra clipping stage at all; they might instead use a clipless rasterizer, as speculated in Fabian Giesen's blog article. Algorithms for this exist, such as Olano and Greer (1997). However, these all work by doing the rasterization directly in homogeneous coordinates, so we still need w...)
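For the curious, here is a rough sketch of that idea in plain JavaScript (hypothetical helper names; the vertices are assumed to be already mapped to 2D homogeneous screen coordinates [x, y, w] with consistent winding and positive w; handling negative-w vertices is exactly the tricky part those articles discuss). The edge functions of a triangle are simply cross products of the homogeneous vertices, so the point-in-triangle test never divides by w:

function homogeneousEdges(v0, v1, v2) {  // each v is [x, y, w]
  function cross(a, b) {
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]];
  }
  // one edge function per triangle edge, taken in winding order
  return [cross(v0, v1), cross(v1, v2), cross(v2, v0)];
}

function insideTriangle(edges, px, py) {  // pixel center as homogeneous [px, py, 1]
  return edges.every(function (e) {
    return e[0] * px + e[1] * py + e[2] >= 0;
  });
}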


The reason is that not only gl_Position gets divided by the homogeneous coordinate w, but so do all the other interpolated varyings. This is called perspective-correct interpolation, and it requires the division to happen after the interpolation (and thus after rasterization). So doing the division in the vertex shader simply would not work. See also this post.
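To illustrate, here is a minimal sketch in plain JavaScript of what the rasterizer effectively computes for a varying v along a screen-space span (hypothetical function names; t is the screen-space interpolation parameter). Note that it needs the per-vertex clip-space w values, which are gone if the vertex shader has already divided them out to 1.0:

// perspective-correct: interpolate v/w and 1/w linearly in screen space,
// then divide per fragment
function perspectiveLerp(v0, w0, v1, w1, t) {
  var numerator   = (1 - t) * (v0 / w0) + t * (v1 / w1);
  var denominator = (1 - t) * (1 / w0) + t * (1 / w1);
  return numerator / denominator;
}

// affine (screen-linear): what you get when every w that reaches the
// rasterizer is 1.0
function affineLerp(v0, v1, t) {
  return (1 - t) * v0 + t * v1;
}

console.log(perspectiveLerp(0, 1, 1, 4, 0.5));  // 0.2 -- pulled toward the near vertex
console.log(affineLerp(0, 1, 0.5));             // 0.5 -- the screen-linear (broken) result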


It's even simpler: clipping happens after the vertex shading. If the vertex shader were allowed (or, more strongly, mandated) to do the perspective division, clipping would have to happen in the already-divided coordinates, where attributes are no longer linear along an edge, which would be very inconvenient. Vertex attributes are still linear in clip coordinates, which makes clipping child's play, instead of having to interpolate like this:

v' = 1.0f / (lerp(1.0 / v0, 1.0 / v1, t))

See how division-heavy that would be? In clip coordinates it is simply:

v' = lerp(v0, v1, t)

It is even better than that: the clipping limits in clip coordinates are:

-w <= x <= w

This means the distances to clip planes (left and right) are trivial to compute in clip coordinates:

x - w and w - x. It's just so much simpler and more efficient to clip in clip coordinates that it makes all the sense in the world to insist that vertex shader outputs are in clip coordinates, and then let the hardware do the clipping and the divide by the w coordinate, since there is no reason left to leave them to the user. It's also simpler in that this way we don't need a post-clip vertex shader (which would also have to include mapping into the viewport, but that is another story). The way they designed it is actually quite nice. :)
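As a minimal sketch of how cheap that is, here is the whole edge-versus-plane case in plain JavaScript (hypothetical helper; it clips one edge against the right plane x = w). The signed distances w - x are linear along the edge, so finding the intersection takes a single division, and every coordinate (and any attribute) is clipped with a plain lerp:

function clipEdgeRight(p0, p1) {  // p = [x, y, z, w] in clip space
  var d0 = p0[3] - p0[0];  // w - x: >= 0 means inside the right plane
  var d1 = p1[3] - p1[0];
  if (d0 >= 0 && d1 >= 0) return [p0, p1];  // fully inside
  if (d0 < 0 && d1 < 0) return null;        // fully outside
  var t = d0 / (d0 - d1);                   // where the edge crosses the plane
  var hit = p0.map(function (c, i) {        // plain lerp works component-wise
    return c + t * (p1[i] - c);
  });
  return d0 >= 0 ? [p0, hit] : [hit, p1];
}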