__builtin_prefetch: how much does it read?

I think it just emits a single prefetch machine instruction, which basically fetches one cache line, whose size is processor-specific.

And you could use __builtin_prefetch (con[i+3].Pfrom) for instance. In my (limited) experience, in such a loop it is better to prefetch several elements ahead.
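To make the "several elements ahead" idea concrete, here is a minimal sketch. The question's struct is not shown, so the `struct conn` layout, the `weight` field, the function name `weighted_sum` and the distance `PREFETCH_AHEAD` are all assumptions made for illustration; only `__builtin_prefetch` and the `con[i+3].Pfrom` pattern come from the discussion above.

```c
#include <stddef.h>

/* Hypothetical element type: Pfrom is ASSUMED to be a pointer to the
   data that a later iteration will dereference. */
struct conn {
    const double *Pfrom;
    double weight;
};

/* Assumed prefetch distance (3, as in con[i+3]); tune it by measuring. */
enum { PREFETCH_AHEAD = 3 };

double weighted_sum(const struct conn *con, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* The prefetch itself is only a hint, but reading con[i+3].Pfrom
           is a real load, so stay inside the array. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(con[i + PREFETCH_AHEAD].Pfrom);
        sum += *con[i].Pfrom * con[i].weight;
    }
    return sum;
}
```

The bounds check matters because evaluating `con[i + PREFETCH_AHEAD].Pfrom` past the end of the array would be an out-of-bounds read, even though the prefetch instruction itself does not fault on a bad address.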

Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside a loop). Measure the performance gain if you need them, and enable GCC optimizations (at least -O2). If you are very lucky, a manual __builtin_prefetch could speed up your loop by 10 or 20% (but it could also slow it down).
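One way to do that measurement is to time the same loop with and without the prefetch. The following is only a rough sketch: the function names, the indirect-access workload and the prefetch distance of 8 are assumptions chosen for illustration, and `clock()` is a coarse timer, so use large inputs and repeat the runs before trusting any number.

```c
#include <stddef.h>
#include <time.h>

/* Sum data[idx[i]]; the indirection makes the access pattern hard for
   the hardware prefetcher to guess. */
long sum_plain(const long *data, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}

/* Same loop, with a prefetch hint for the element needed 8 iterations
   later (8 is an assumed distance, to be tuned). */
long sum_prefetch(const long *data, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&data[idx[i + 8]]);
        sum += data[idx[i]];
    }
    return sum;
}

/* Run fn once, store its result, and return the CPU time it used in seconds. */
double seconds_for(long (*fn)(const long *, const size_t *, size_t),
                   const long *data, const size_t *idx, size_t n,
                   long *result)
{
    clock_t t0 = clock();
    *result = fn(data, idx, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call `seconds_for` on both variants with the same arrays, check that the two results agree, and compare the timings when compiled with `-O2`.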

If such a loop is crucial to you, you might consider running it on GPUs with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA language, and tuning them to your particular hardware).

Also use a recent GCC compiler (the latest release at the time of writing is 4.6.2), because it is making a lot of progress in these areas.


(Added in January 2018:)

Both hardware (processors) and compilers have made a lot of progress regarding caches, so it seems that using __builtin_prefetch is less useful today (in 2018). Be sure to benchmark.


It reads one cache line. The cache line size may vary, but it is most likely 64 bytes on modern CPUs. If you need to read multiple cache lines, check out prefetch_range (a helper in the Linux kernel's <linux/prefetch.h>).
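Outside the kernel, the same idea can be sketched by issuing one prefetch per cache line of a buffer. This is an illustrative user-space version, not the kernel's implementation: the name `prefetch_lines`, its return value and the fixed 64-byte `CACHE_LINE` are assumptions, so adjust the line size for your CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMED cache line size; 64 bytes is common but not universal. */
#define CACHE_LINE 64

/* Issue one prefetch hint for every cache line overlapping
   [addr, addr + len). Returns the number of hints issued. */
size_t prefetch_lines(const void *addr, size_t len)
{
    if (len == 0)
        return 0;

    /* Round the start down to a line boundary so that an unaligned
       buffer still has its last line covered. */
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t issued = 0;

    for (uintptr_t a = start; a < end; a += CACHE_LINE) {
        __builtin_prefetch((const void *)a);
        issued++;
    }
    return issued;
}
```

For example, an 8-byte object that starts 60 bytes into a line straddles two lines, so two hints are issued for it.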