Why are homogenous coordinates needed in image projection?

It isn’t so much that homogeneous coordinates are needed. Rather, they make computation simpler by making many important transformations linear. They can then all be represented as $4\times4$ matrices and composition is just matrix multiplication. It’s unfortunate that this source presents the use of homogeneous coordinates as a mere trick since there’s a natural connection between them and projective geometry.

Even before projective transformations were introduced, you probably already had to deal with rotation, translation and scaling. For instance, the object-to-world and world-to-camera coordinate transformations often involve rotations and translations. Although rotations about the origin in $\mathbb R^3$ are linear transformations and so can be represented as $3\times 3$ matrices, neither translations nor rotations about an arbitrary point (which are just a composition of a rotation about the origin and a couple of translations) can be. By introducing homogeneous coordinates, you can represent all affine transformations of $\mathbb R^3$ as $4\times4$ matrices. This includes all of the linear transformations that were possible with $3\times3$ matrices as well as translations (and combinations thereof). For example, a translation by $(\Delta x,\Delta y,\Delta z)$ is represented by the matrix $$ \pmatrix{1&0&0&\Delta x \\ 0&1&0&\Delta y \\ 0&0&1&\Delta z \\ 0&0&0&1}. $$

Moving to perspective projection, we place the center of projection—pinhole or eye—at the origin in camera coordinates, locate the image plane perpendicular to the $z$-axis at $(0,0,f)$, with $f<0$, and direct the line of sight along the negative $z$-axis. (This arrangement keeps the image plane coordinate system right-handed.) The ray to a point $P$ then intersects the image plane at the point $\left(f\frac{P_x}{P_z},f\frac{P_y}{P_z},f\right)=\frac f{P_z}\left(P_x,P_y,P_z\right)$. This is just a form of scaling transformation, but unfortunately it’s non-linear because the scale factor depends on the point’s $z$-coordinate. That is, if you tried to write this transformation as a $3\times 3$ matrix, it’d be something like $$\pmatrix{\frac f{P_z}&0&0 \\ 0&\frac f{P_z}&0 \\ 0&0&\frac f{P_z}},$$ but that doesn’t give you a linear transformation because the entries of this matrix aren’t constants—one of the coordinates of the point being transformed appears in it.

Notice something interesting here: Homogeneous coordinates have the property that the tuples $(x_1, x_2, \dots, 1)$ and $(\alpha x_1, \alpha x_2, \dots, \alpha)$ represent the same point (for $\alpha\ne0$). If you interpret the coordinates of a point in $\mathbb R^3$ as the homogeneous coordinates of a point in the image plane, then the perspective projection is just the identity map! This is no coincidence. When you use homogeneous coordinates, you’ve actually moved from the Euclidean space $\mathbb R^{n+1}$ to the projective space $\mathbb P^n(\mathbb R)$ by identifying points in $\mathbb R^{n+1}$ that lie on the same line through the origin (minus the origin itself) with each other. I.e., all of the points that have the same projection are an equivalence class that is a single point in the projective space.

Taking advantage of this scaling property, we can write the coordinates of the projected point on the image plane as the homogeneous coordinates $\left(P_x,P_y,P_z,\frac{P_z}f\right)$, which can be obtained from $P=(P_x,P_y,P_z,1)$ by applying the linear transformation given by the matrix:$$ \pmatrix{ 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&\frac1f&0 }. $$ The upper-left submatrix is indeed the identity map, as noted above.

Using this representation of perspective projection, you can continue to combine all of the necessary image transformations by simply multiplying their matrices together. This representation has other advantages besides. This homogeneous perspective matrix can be modified slightly so that the transformed $z$-coordinate encodes depth for $z$-buffering. Also, since you can multiply by an arbitrary scale factor without changing its effect, it can be tweaked to make the view volume after projection a cube with sides at $-1$ and $+1$, which makes clipping easy. Historically, when floating-point operations were very expensive compared to integer operations, homogeneous coordinates allowed you to use integer (really rational) arithmetic by carrying a denominator along in the last coordinate. This isn’t as important any more, although being able to rescale the coordinates can still be handy.