Why is a pthread mutex considered "slower" than a futex?
Futexes were created to improve the performance of pthread mutexes. NPTL uses futexes, LinuxThreads predated futexes, which I think is where the "slower" consideration comes. NPTL mutexes may have some additional overhead, but it shouldn't be much.
Edit: The actual overhead basically consists on:
- selecting the correct algorithm for the mutex type (normal, recursive, adaptive, error-checking; normal, robust, priority-inheritance, priority-protected), where the code heavily hints to the compiler that we are likely using a normal mutex (so it should convey that to the CPU's branch prediction logic),
- and a write of the current owner of the mutex if we manage to take it which should normally be fast, since it resides in the same cache-line as the actual lock which we have just taken, unless the lock is heavily contended and some other CPU accessed the lock between the time we took it and when we attempted to write the owner (this write is unneeded for normal mutexes, but needed for error-checking and recursive mutexes).
So, a few cycles (typical case) to a few cycles + a branch misprediction + an additional cache miss (very worst case).
The short answer to your question is that futexes are known to be implemented about as efficiently as possible, while a pthread mutex may or may not be. At minimum, a pthread mutex has overhead associated with determining the type of mutex and futexes do not. So a futex will almost always be at least as efficient as a pthread mutex, until and unless someone thinks up some structure lighter than a futex and then releases a pthreads implementation that uses that for its default mutex.
Technically speaking pthread mutexes are not slower or faster than futexes. pthread is just a standard API, so whether they are slow or fast depends on the implementation of that API.
Specifically in Linux pthread mutexes are implemented as futexes and are therefore fast. Actually, you don't want to use the futex API itself as it is very hard to use, does not have the appropriate wrapper functions in glibc and requires coding in assembly which would be non portable. Fortunately for us the glibc maintainers already coded all of this for us under the hood of the pthread mutex API.
Now, because most operating systems did not implement futexes then programmers usually mean by pthread mutex is the performance you get from usual implementation of pthread mutexes, which is, slower.
So it's a statistical fact that in most operating systems that are POSIX compliant the pthread mutex is implemented in kernel space and is slower than a futex. In Linux they have the same performance. It could be that there are other operating systems where pthread mutexes are implemented in user space (in the uncontended case) and therefore have better performance but I am only aware of Linux at this point.
Because they stay in userspace as much as possible, which means they require fewer system calls, which is inherently faster because the context switch between user and kernel mode is expensive.
I assume you're talking about kernel threads when you talk about POSIX threads. It's entirely possible to have an entirely userspace implementation of POSIX threads which require no system calls but have other issues of their own.
My understanding is that a futex is halfway between a kernel POSIX thread and a userspace POSIX thread.