Why does Monitor.PulseAll result in a "stepping stair" latency pattern in signaled threads?

One difference between these versions is that in the PulseAll case the threads immediately repeat the loop, locking the object again.

You have 12 cores, so 12 threads are running. Each one executes the loop body and enters the loop again, locking the object (one after another) and then entering the wait state. All that time the other threads wait. In the ManualEvent case you have two events, so threads don't immediately repeat the loop but get blocked on ARES events instead, which allows other threads to take lock ownership faster.

I've simulated similar behavior in the PulseAll version by adding a sleep at the end of the loop in ReadLastTimestampAndPublish. This lets other threads lock SyncObj sooner and seems to improve the numbers I'm getting from the program.

static void ReadLastTimestampAndPublish()
{
    while(true)
    {
        lock(SyncObj)
        {
            Monitor.Wait(SyncObj);
        }
        IterationToTicks.Add(Tuple.Create(Iteration, s.Elapsed.Ticks - LastTimestamp));
        Thread.Sleep(TimeSpan.FromMilliseconds(100));   // <===
    }
}

To start off, this is not an answer, merely my notes from looking at the SSCLI to find out exactly what is going on. Most of this is well above my head, but interesting nonetheless.

The trip down the rabbit hole starts with a call to Monitor.PulseAll, which is implemented in C#:

clr\src\bcl\system\threading\monitor.cs:

namespace System.Threading
{
    public static class Monitor 
    {
        // other methods omitted

        [MethodImplAttribute(MethodImplOptions.InternalCall)]
        private static extern void ObjPulseAll(Object obj);

        public static void PulseAll(Object obj)
        {
            if (obj==null) {
                throw new ArgumentNullException("obj");
            }

            ObjPulseAll(obj);
        } 
    }
}

InternalCall methods get routed in clr\src\vm\ecall.cpp:

FCFuncStart(gMonitorFuncs)
    FCFuncElement("Enter", JIT_MonEnter)
    FCFuncElement("Exit", JIT_MonExit)
    FCFuncElement("TryEnterTimeout", JIT_MonTryEnter)
    FCFuncElement("ObjWait", ObjectNative::WaitTimeout)
    FCFuncElement("ObjPulse", ObjectNative::Pulse)
    FCFuncElement("ObjPulseAll", ObjectNative::PulseAll)
    FCFuncElement("ReliableEnter", JIT_MonReliableEnter)
FCFuncEnd()

ObjectNative lives in clr\src\vm\comobject.cpp:

FCIMPL1(void, ObjectNative::PulseAll, Object* pThisUNSAFE)
{
    CONTRACTL
    {
        MODE_COOPERATIVE;
        DISABLED(GC_TRIGGERS);  // can't use this in an FCALL because we're in forbid gc mode until we setup a H_M_F.
        THROWS;
        SO_TOLERANT;
    }
    CONTRACTL_END;

    OBJECTREF pThis = (OBJECTREF) pThisUNSAFE;
    HELPER_METHOD_FRAME_BEGIN_1(pThis);
    //-[autocvtpro]-------------------------------------------------------

    if (pThis == NULL)
        COMPlusThrow(kNullReferenceException, L"NullReference_This");

    pThis->PulseAll();

    //-[autocvtepi]-------------------------------------------------------
    HELPER_METHOD_FRAME_END();
}
FCIMPLEND

OBJECTREF is some magic sprinkled on top of Object (the -> operator is overloaded), so OBJECTREF->PulseAll() is actually Object->PulseAll() which is implemented in clr\src\vm\object.h and just forwards the call on to ObjHeader->PulseAll:

class Object
{
  // snip   
  public:
  // snip
    ObjHeader   *GetHeader()
    {
        LEAF_CONTRACT;
        return PTR_ObjHeader(PTR_HOST_TO_TADDR(this) - sizeof(ObjHeader));
    }
  // snip
    void PulseAll()
    {
        WRAPPER_CONTRACT;
        GetHeader()->PulseAll();
    }
  // snip
}
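To make the forwarding chain above concrete, here is a minimal sketch of the two ideas involved: a wrapper whose overloaded -> returns the raw object pointer, and a header that is found by stepping back from the object's address. ToyHeader, ToyObject, ToyObjectRef and ToyAllocation are hypothetical names invented for this illustration; the layout is an assumption and is not the actual CLR implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-ins for ObjHeader / Object / OBJECTREF (illustrative only).
struct ToyHeader {
    int pulseAllCalls = 0;
    void PulseAll() { ++pulseAllCalls; }
};

struct ToyObject {
    int payload = 0;

    // Step back over the header that sits directly in front of the object.
    ToyHeader* GetHeader() {
        return reinterpret_cast<ToyHeader*>(
            reinterpret_cast<unsigned char*>(this) - sizeof(ToyHeader));
    }

    // Just forwards the call, like Object::PulseAll in the excerpt above.
    void PulseAll() { GetHeader()->PulseAll(); }
};

// Wrapper with an overloaded ->, so ref->PulseAll() ends up on ToyObject.
class ToyObjectRef {
public:
    explicit ToyObjectRef(ToyObject* p) : m_ptr(p) {}
    ToyObject* operator->() const {
        assert(m_ptr != nullptr);
        return m_ptr;
    }
private:
    ToyObject* m_ptr;
};

static_assert(sizeof(ToyHeader) % alignof(ToyObject) == 0,
              "object must stay aligned when placed right after the header");

// One buffer holding the header followed immediately by the object.
struct ToyAllocation {
    alignas(std::max_align_t)
    unsigned char storage[sizeof(ToyHeader) + sizeof(ToyObject)];
    ToyHeader* header;
    ToyObject* object;

    ToyAllocation()
        : header(new (storage) ToyHeader()),
          object(new (storage + sizeof(ToyHeader)) ToyObject()) {}
};
```

Calling ref->PulseAll() on a ToyObjectRef built from a ToyAllocation's object increments the counter in that allocation's header, which is the same shape as OBJECTREF->PulseAll() reaching ObjHeader->PulseAll().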

ObjHeader::PulseAll retrieves the SyncBlock, which uses AwareLock for entering and exiting the lock on the object. AwareLock (clr\src\vm\syncblk.cpp) uses a CLREvent (clr\src\vm\synch.cpp) created as a MonitorEvent (CLREvent::CreateMonitorEvent(SIZE_T)), which calls UnsafeCreateEvent (clr\src\inc\unsafe.h) or the hosting environment's synchronization methods.

clr\src\vm\syncblk.cpp:

void ObjHeader::PulseAll()
{
    CONTRACTL
    {
        INSTANCE_CHECK;
        THROWS;
        GC_TRIGGERS;
        MODE_ANY;
        INJECT_FAULT(COMPlusThrowOM(););
    }
    CONTRACTL_END;

    //  The following code may cause GC, so we must fetch the sync block from
    //  the object now in case it moves.
    SyncBlock *pSB = GetBaseObject()->GetSyncBlock();

    // GetSyncBlock throws on failure
    _ASSERTE(pSB != NULL);

    // make sure we own the crst
    if (!pSB->DoesCurrentThreadOwnMonitor())
        COMPlusThrow(kSynchronizationLockException);

    pSB->PulseAll();
}

void SyncBlock::PulseAll()
{
    CONTRACTL
    {
        INSTANCE_CHECK;
        NOTHROW;
        GC_NOTRIGGER;
        MODE_ANY;
    }
    CONTRACTL_END;

    WaitEventLink  *pWaitEventLink;

    while ((pWaitEventLink = ThreadQueue::DequeueThread(this)) != NULL)
        pWaitEventLink->m_EventWait->Set();
}

DequeueThread uses a Crst (clr\src\vm\crst.cpp), which is a wrapper around critical sections. m_EventWait is a manual-reset CLREvent.
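The shape of SyncBlock::PulseAll (dequeue one waiter under a lock, set its event, repeat) can be sketched in portable C++. This is only a rough model under stated assumptions: ToyManualEvent, ToyWaitQueue and RunPulseAllDemo are hypothetical names, a std::mutex stands in for the Crst, and a condition variable stands in for the OS event. It does not model the waiters re-acquiring the monitor afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical manual-reset event: once set, it stays set.
class ToyManualEvent {
public:
    void Set() {
        { std::lock_guard<std::mutex> g(m_); signaled_ = true; }
        cv_.notify_all();
    }
    void Wait() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return signaled_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Hypothetical wait queue: one event per blocked thread, guarded by a lock.
class ToyWaitQueue {
public:
    void Enqueue(ToyManualEvent* e) {
        std::lock_guard<std::mutex> g(lock_);
        waiters_.push_back(e);
    }
    ToyManualEvent* Dequeue() {
        std::lock_guard<std::mutex> g(lock_);
        if (waiters_.empty()) return nullptr;
        ToyManualEvent* e = waiters_.front();
        waiters_.pop_front();
        return e;
    }
    // Same loop shape as the excerpt: signal waiters one at a time.
    int PulseAll() {
        int signaled = 0;
        while (ToyManualEvent* e = Dequeue()) { e->Set(); ++signaled; }
        return signaled;
    }
private:
    std::mutex lock_;
    std::deque<ToyManualEvent*> waiters_;
};

// Starts `count` waiters, pulses them all, returns how many woke up.
int RunPulseAllDemo(int count) {
    assert(count >= 0);
    ToyWaitQueue queue;
    std::vector<ToyManualEvent> events(static_cast<std::size_t>(count));
    std::atomic<int> woken{0};
    std::vector<std::thread> threads;
    for (auto& e : events) {
        queue.Enqueue(&e);  // register the waiter before its thread blocks
        threads.emplace_back([&e, &woken] { e.Wait(); ++woken; });
    }
    int signaled = queue.PulseAll();
    for (auto& t : threads) t.join();
    assert(signaled == count);
    return woken.load();
}
```

Note that the events are set sequentially, so the waiters become runnable one after another rather than all at the same instant; in the real Monitor each of them would then still have to re-acquire the lock before returning from Wait.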

So, all of this is using OS primitives unless the default hosting provider is overriding things.