Modern GPUs: How "intelligent" are they?
There are many resources available on 3D programming (OpenGL or DirectX) and the corresponding graphics pipelines, but I'm wondering at what level they are implemented on a modern GPU.
So far I've been able to find out that there has been a move from very specialized circuitry that implements the various stages of the graphics pipeline to a more general approach. This transformation has been partially reflected in the 3D APIs in the form of programmable shaders. Most transistors seem to be dedicated to massively parallel SIMD units that execute the actual shader instructions.
But what about the rest of the graphics pipeline? Is that still implemented in hardware?
Is a modern GPU (think Nvidia Fermi) basically a set of "stupid" SIMD arrays that are fed with instructions and data from the CPU and various caches, and all the actual logic that maps the graphics pipeline to those instructions happens in the graphics driver?
Or are there some controlling units somewhere in the GPU that translate the incoming high-level instruction and data streams (compiled shader programs, vertex data and attributes, and textures) into actual SIMD instructions and take care of synchronization, memory allocation etc.?
I suspect that the reality is somewhere in between those two extremes, and the answer would be rather lengthy and based on a lot of speculation (there has to be a reason why certain GPU vendors refuse to publish any documentation on their products, let alone driver source code...), but any hints in the right direction and useful resources would be greatly appreciated.
So far, I've found a series of blog posts that has been immensely useful in understanding more about modern GPUs, but I'm missing some kind of higher-level overview of the overall architecture: I can understand most of the concepts mentioned, but don't quite get how they fit together.
> So far I've been able to find out that there has been a move from very specialized circuitry that implements the various stages of the graphics pipeline to a more general approach. This transformation has been partially reflected in the 3D APIs in the form of programmable shaders. Most transistors seem to be dedicated to massively parallel SIMD units that execute the actual shader instructions.
Correct. Basically, due to the relatively large feature size on older GPUs, the only way to efficiently implement things like basic lighting, antialiasing, texture mapping, geometry, etc. was to use a "fixed function" pipeline. They sacrificed flexibility for the sake of performance because they didn't have enough chip density to be able to implement it using a more generic massively parallel SIMD architecture like current GPUs.
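To make "fixed function" versus "programmable" concrete, here is a minimal C++/OpenGL sketch, not anything from the original question or answer: the names and shader code are illustrative, it assumes a current GL context (compatibility profile, legacy headers), and error handling is omitted. The old path switches lighting on as hardware state; the modern path supplies a small program that the driver compiles for the SIMD units.

```cpp
// Sketch only: assumes a current OpenGL compatibility context; the function
// and variable names are illustrative, error handling is omitted.
#include <GL/gl.h>

// OpenGL 1.x style: lighting is hardware *state* you switch on.
// The math (per-vertex lighting) is baked into the chip/driver.
void fixed_function_lighting()
{
    const GLfloat light_pos[] = { 1.0f, 1.0f, 1.0f, 0.0f };  // directional light
    glEnable(GL_LIGHTING);
    glEnable(GL_LIGHT0);
    glLightfv(GL_LIGHT0, GL_POSITION, light_pos);
    // ...draw geometry; the pipeline applies the lighting for you.
}

// OpenGL 2.x+ style: the same effect is a small *program* you write yourself,
// which the driver translates into instructions for the SIMD cores.
const char* fragment_src = R"(
    #version 120
    varying vec3 normal;
    uniform vec3 light_dir;
    void main() {
        float diffuse = max(dot(normalize(normal), light_dir), 0.0);
        gl_FragColor = vec4(vec3(diffuse), 1.0);
    }
)";
```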
> Is a modern GPU (think Nvidia Fermi) basically a set of "stupid" SIMD arrays that are fed with instructions and data from the CPU and various caches, and all the actual logic that maps the graphics pipeline to those instructions happens in the graphics driver?
Certain things are still done in hardware; others aren't. For example, ROPs are still used in the very final stage to push pixel data into the VGA chipset. Note I'm using "VGA chipset" here as a generic term to refer to the mechanism that transmits a video signal to your monitor, regardless of whether it's truly "VGA" in any respect.
It is true, in general, that current GPU architectures such as Nvidia Fermi and AMD Southern Islands are, for the most part, massively parallel processors with a custom instruction set, where each individual "core" is extremely weak but there are a whole lot of cores (sometimes several thousand). But there is still graphics-specific hardware in there:
Hardware video decoding is often done, in large part, using fixed function chips. This is particularly true when DRM (Digital Restrictions Management) is involved. Sometimes "hardware" video decoding really means a firmware-guided set of instructions which are just served up as regular old tasks for the SIMD cores. It really depends.
With the exception of a very few compute-specific Nvidia boards (Tesla), almost all "generic SIMD" graphics cards have a complete array of hardware dedicated to video output. Video output is not the same as rendering; the fixed function output elements include LVDS/TMDS/HDMI/DisplayPort codecs, HDCP, and even audio processing (basically a little DSP), since HDMI supports audio.
"Graphics memory" is still stored on-board with the GPUs, so that they don't have to traverse the chatty and relatively high latency PCIe bus to hit system RAM, which itself is slower and takes longer to respond than the more expensive, higher quality, faster graphics memory (e.g. GDDR5) which comes in smaller capacities but higher speeds than system memory. The process of storing stuff in graphics memory and retrieving it from there to the GPU or to the CPU is still pretty much a fixed function operation. Some GPUs have their own sort of "IOMMU", but this memory management unit is distinct (separate) from the CPU. This is not true, however, for recent Intel GPUs integrated into their processors (Sandy and Ivy Bridge), where the memory architecture is almost entirely "coherent" (graphics memory is system memory) and reads from graphics memory are as cheap for the CPU as they are for the GPU.
> Or are there some controlling units somewhere in the GPU that translate the incoming high-level instruction and data streams (compiled shader programs, vertex data and attributes, and textures) into actual SIMD instructions and take care of synchronization, memory allocation etc.?
The "native" language of the SIMDs is almost always generated by the driver in software, and not by the GPU's own firmware. This is especially true for DirectX 9 / OpenGL 2.x level features. Shaders written in high level languages like HLSL, GLSL or OpenGL ARB shader assembler are eventually translated, by the driver, into GPU instructions by banging on certain registers and doing the required PCIe hoops in order to send over batch buffers of compute and/or render commands.
A few things, like hardware tessellation (DirectX 11 / OpenGL 4.0), are again pushed into the hardware in a fixed-function way, similar to how almost everything was done in the old days. This is because, again, performance constraints mean that the most efficient way to do these computations is dedicated circuitry, rather than having firmware or the driver "program" the SIMDs to do it.
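For example, this is roughly how an application drives the tessellator in OpenGL 4.x (a hedged sketch; prog, vao and the patch size are illustrative, and a 4.0+ context with a loader is assumed): the control and evaluation shaders are programmable and run on the SIMDs, but the primitive subdivision in between is dedicated hardware.

```cpp
// Sketch only: assumes an OpenGL 4.0+ context and a program object that
// already contains tessellation control + evaluation shaders ('prog' and
// 'vao' are illustrative names). Error handling omitted.
#include <glad/glad.h>   // or any other GL function loader

void draw_tessellated_patches(GLuint prog, GLuint vao, GLsizei vertex_count)
{
    glUseProgram(prog);
    glBindVertexArray(vao);

    // Tell the tessellator how many control points make up one patch.
    glPatchParameteri(GL_PATCH_VERTICES, 3);

    // The TCS/TES stages are programmable, but the primitive generator that
    // actually subdivides each patch is fixed-function on GL4/DX11 GPUs.
    glDrawArrays(GL_PATCHES, 0, vertex_count);
}
```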
> I suspect that the reality is somewhere in between those two extremes, and the answer would be rather lengthy and based on a lot of speculation (there has to be a reason why certain GPU vendors refuse to publish any documentation on their products, let alone driver source code...), but any hints in the right direction and useful resources would be greatly appreciated.
AMD and Intel have very robust documentation out in the open about their recent GPUs, as well as fully working open source graphics drivers for Linux (see the Mesa and Direct Rendering Manager projects). If you look at some of the code in these drivers, you'll laugh, because the graphics driver writers actually have to implement the geometry of things like drawing various shapes or patterns in "software" (while using hardware commands to submit the real legwork to the hardware for processing), because neither the GPU firmware nor the fixed function hardware is present anymore to process it fully in hardware :) It's kind of funny what they have to do to support OpenGL 1.x / 2.x on new hardware.
The evolution has kind of gone like this:
- Very long ago (before real-time 3D rendering was considered possible): Ray-tracing on the CPU was normal for non-real-time rendering. For simple graphics like you see in early versions of Windows, the CPU was fast enough to draw simple shapes (rectangles, characters of a font, shading patterns, etc.) without fixed function hardware, but it couldn't draw anything too complex.
- Long ago (OpenGL 1.x): almost everything implemented by solid state hardware; "electrically" fixed functions were the norm even for basic operations
- A while ago (OpenGL 2.x): A transition towards making GPUs more programmable had begun. "Fragment shaders" (aka pixel shaders) on hardware of that era can perform almost arbitrary calculations like a CPU, but they are limited by the architecture, which is still very much geared towards graphics. Hence, OpenCL / DirectCompute are not available on this hardware.
- Recently (OpenGL 3.x): The transition to general purpose GPUs is mostly complete, but they are, of course, optimized for workloads involving large matrices of data (think linear algebra) submitted in batches, rather than the long sequences of very small operations (1+1, 2*4, 5*6 in sequence, etc.) that CPUs handle efficiently. General purpose computing is available via OpenCL, CUDA, etc., but the hardware is still not a full-on "SIMD coprocessor" because (a) you still have to hammer hardware-specific registers to get to the GPU functionality; (b) reading from GPU VRAM is very slow because of the PCIe bus overhead (readback from the GPU is not well optimized on current architectures; see the sketch after this list); (c) the memory and cache architecture is not coherent with the CPU; and a lot of legacy fixed function hardware is still lying around.
- Present (OpenGL 4.x): Got rid of a lot of the legacy fixed function hardware. Improved the GPU read latency somewhat. IOMMUs allow for a (translated) hardware-assisted mapping between VRAM and system memory. Also introduced hardware tessellation, bringing back elements of fixed function.
- Future (HSA): The GPU is basically a co-processor. It is all but fully integrated with the CPU with very little impedance (for reads/writes) between the GPU and CPU, even for dedicated GPUs on the PCIe bus. Fully coherent memory architecture -- "mi memoria es su memoria" (my memory is your memory). Userspace programs can read from "VRAM" just like they read from system memory with no driver shim, and the hardware takes care of it. You have the CPU for "serial" processing (do this, then do that, then do this, then do that) for modest quantities of data, and the GPU for "parallel" processing (perform this operation on this huge dataset and divide it up how you see fit). The board that the GPU sits on might still have ROPs, HDMI codec, etc. but this stuff is necessary for display output, so you can either move that into the motherboard and have all GPUs be like Tesla cards, or leave it on the GPU if it results in reduced latency.
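To illustrate the readback cost mentioned in the OpenGL 3.x bullet, here is a hedged C++/OpenGL sketch (ssbo, count and the barrier bit are illustrative; a 4.3+ context with a loader is assumed): on a discrete card this call stalls while the data crosses PCIe, whereas on a coherent HSA-style design the same result would be an ordinary memory read.

```cpp
// Sketch only: assumes a current OpenGL 4.3+ context and a buffer object
// 'ssbo' that a compute pass has just written on the GPU; error handling
// omitted, names illustrative.
#include <glad/glad.h>   // or any other GL function loader
#include <vector>

std::vector<float> read_back_results(GLuint ssbo, size_t count)
{
    std::vector<float> results(count);

    // Make sure the GPU's shader writes are visible before we read.
    glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

    // This is the expensive part on a discrete card: the CPU waits while the
    // data is copied back across PCIe from VRAM into system memory. With a
    // fully coherent memory architecture, no such copy would be necessary.
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0,
                       count * sizeof(float), results.data());
    return results;
}
```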