How to alpha blend RGBA unsigned byte color fast?

Use SSE - start around page 131.

The basic workflow

  1. Load 4 pixels from src (16 one-byte values): RGBA RGBA RGBA RGBA (streaming load)

  2. Load 4 more pixels, the ones you want to blend with, from srcByteTop: RGBx RGBx RGBx RGBx

  3. Do some swizzling so that the A term from step 1 fills every slot, i.e.

    xxxA xxxB xxxC xxxD -> AAAA BBBB CCCC DDDD

    In my solution below I opted instead to re-use your existing "maskcurrent" array, but having alpha integrated into the "A" field of step 1 will require fewer loads from memory and thus be faster. Swizzling in this case would probably be: AND with a mask to select A, B, C, D; shift right 8; OR with the original; shift right 16; OR again.

  4. Subtract the result of step 3 from a vector that is 255 in every slot, giving 255 - alpha in every slot

  5. Multiply step 1 by step 4 (source with 255 - alpha) and step 2 by step 3 (top layer with alpha).

    You should be able to use the "multiply and discard bottom 8 bits" SSE2 instruction for this.

  6. Add those two products together

  7. Store the result somewhere else (if possible) or on top of your destination (if you must)

Here is a starting point for you:

    //Define your image with __declspec(align(16)), i.e. __declspec(align(16)) char image[640*480]
    // so the first byte is aligned correctly for SIMD.
    // Stride must be a multiple of 16.

    for (int y = top ; y < bottom; ++y)
    {
        BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
        BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
        BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
        BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
        for (int x = left; x < right; x += 4)
        {
            //If you can't align, use _mm_loadu_si128()
            // Step 1
            __m128i src = _mm_load_si128(reinterpret_cast<__m128i*>(srcByte));
            // Step 2
            __m128i srcTop = _mm_load_si128(reinterpret_cast<__m128i*>(srcByteTop));

            // Step 3
            // Fill the 4 positions for each pixel with its mask value.
            // Note _mm_set_epi8 takes its arguments from the highest byte
            // down, so pixel 0's mask goes last.
            // Could do better with shifts and so on, but this is clear.
            __m128i mask = _mm_set_epi8(maskCurrent[3], maskCurrent[3], maskCurrent[3], maskCurrent[3],
                                        maskCurrent[2], maskCurrent[2], maskCurrent[2], maskCurrent[2],
                                        maskCurrent[1], maskCurrent[1], maskCurrent[1], maskCurrent[1],
                                        maskCurrent[0], maskCurrent[0], maskCurrent[0], maskCurrent[0]);

            // Step 4
            __m128i maskInv = _mm_subs_epu8(_mm_set1_epi8((char)255), mask);

            //Todo: multiply, with saturate - find correct instructions for steps 5 and 6
            //note you can use multiply and add _mm_madd_epi16
            //Scalar reference for what those steps need to compute:
            int alpha = *maskCurrent;
            int red   = (srcByteTop[R] * alpha + srcByte[R] * (255 - alpha)) / 255;
            int green = (srcByteTop[G] * alpha + srcByte[G] * (255 - alpha)) / 255;
            int blue  = (srcByteTop[B] * alpha + srcByte[B] * (255 - alpha)) / 255;
            CLAMPTOBYTE(red);
            CLAMPTOBYTE(green);
            CLAMPTOBYTE(blue);
            resultByte[R] = red;
            resultByte[G] = green;
            resultByte[B] = blue;
            //----

            // Step 7 - store result.
            //Store aligned if output is aligned on a 16 byte boundary
            _mm_store_si128(reinterpret_cast<__m128i*>(resultByte), result);
            //Slower version if you can't guarantee alignment
            //_mm_storeu_si128(reinterpret_cast<__m128i*>(resultByte), result);

            //Move pointers forward 4 places
            srcByte += bytepp * 4;
            srcByteTop += bytepp * 4;
            resultByte += bytepp * 4;
            maskCurrent += 4;
        }
    }
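The multiply steps left as a todo above can be done with plain SSE2 by unpacking the bytes to 16-bit lanes, multiplying with `_mm_mullo_epi16`, and packing back down. Below is a sketch of that idea as a standalone helper — `blend4` and `div255_epu16` are my own names, not from the code above — which assumes each pixel's alpha has already been broadcast to all four of its byte lanes (as `mask` is above) and uses an exact divide-by-255 rather than a plain shift:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Exact floor(x / 255) for 16-bit lanes holding products up to 255*255. */
static inline __m128i div255_epu16(__m128i x)
{
    __m128i t = _mm_add_epi16(x, _mm_srli_epi16(x, 8));
    t = _mm_add_epi16(t, _mm_set1_epi16(1));
    return _mm_srli_epi16(t, 8);
}

/* Blend 4 RGBA pixels: top*alpha + bottom*(255-alpha), per byte.
 * alpha must hold each pixel's alpha repeated in all 4 of its bytes. */
static __m128i blend4(__m128i bottom, __m128i top, __m128i alpha)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i inv  = _mm_sub_epi8(_mm_set1_epi8((char)255), alpha); /* 255 - a */

    /* Widen bytes to 16-bit lanes so the multiplies cannot overflow. */
    __m128i bl = _mm_unpacklo_epi8(bottom, zero), bh = _mm_unpackhi_epi8(bottom, zero);
    __m128i tl = _mm_unpacklo_epi8(top, zero),    th = _mm_unpackhi_epi8(top, zero);
    __m128i al = _mm_unpacklo_epi8(alpha, zero),  ah = _mm_unpackhi_epi8(alpha, zero);
    __m128i il = _mm_unpacklo_epi8(inv, zero),    ih = _mm_unpackhi_epi8(inv, zero);

    /* top*a + bottom*(255-a); weights sum to 255, so each lane <= 255*255. */
    __m128i lo = _mm_add_epi16(_mm_mullo_epi16(tl, al), _mm_mullo_epi16(bl, il));
    __m128i hi = _mm_add_epi16(_mm_mullo_epi16(th, ah), _mm_mullo_epi16(bh, ih));

    /* Divide by 255 and narrow back to bytes; packus saturates, so no
     * separate clamp is needed. */
    return _mm_packus_epi16(div255_epu16(lo), div255_epu16(hi));
}
```

Because the two products are weighted by alpha and 255 - alpha, their sum fits in 16 bits, and the saturating pack at the end replaces the explicit CLAMPTOBYTE calls.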

To find out which AMD processors will run this code (currently it uses SSE2 instructions), see Wikipedia's List of AMD Turion microprocessors. You could also look at other lists of processors on Wikipedia, but my research shows that AMD CPUs from around four years ago all support at least SSE2.

You should expect a good SSE2 implementation to run around 8-16 times faster than your current code. That is because we eliminate branches in the loop, process 4 pixels (12 colour channels) at once, and improve cache performance by using streaming instructions. As an alternative to SSE, you could probably make your existing code run much faster by eliminating the if checks you are using for saturation. Beyond that I would need to run a profiler on your workload.

Of course, the best solution is to use hardware support (i.e. code your problem up in DirectX) and have it done on the video card.


You can always blend the red and blue channels at the same time, since their masked bits cannot collide. You can also use this trick with the SIMD implementation mentioned before.

unsigned int blendPreMulAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
    unsigned int rb = (colora & 0xFF00FF) + ( (alpha * (colorb & 0xFF00FF)) >> 8 );
    unsigned int g = (colora & 0x00FF00) + ( (alpha * (colorb & 0x00FF00)) >> 8 );
    return (rb & 0xFF00FF) + (g & 0x00FF00);
}
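As a quick standalone check of the endpoints (my own harness, not part of the answer — the function is copied verbatim): colorb is assumed to be premultiplied, and alpha runs up to 0x100, so the shifts are exact at the extremes:

```c
/* Copied from above so this compiles on its own. */
unsigned int blendPreMulAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
    unsigned int rb = (colora & 0xFF00FF) + ( (alpha * (colorb & 0xFF00FF)) >> 8 );
    unsigned int g = (colora & 0x00FF00) + ( (alpha * (colorb & 0x00FF00)) >> 8 );
    return (rb & 0xFF00FF) + (g & 0x00FF00);
}
```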


unsigned int blendAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
    unsigned int rb1 = ((0x100 - alpha) * (colora & 0xFF00FF)) >> 8;
    unsigned int rb2 = (alpha * (colorb & 0xFF00FF)) >> 8;
    unsigned int g1  = ((0x100 - alpha) * (colora & 0x00FF00)) >> 8;
    unsigned int g2  = (alpha * (colorb & 0x00FF00)) >> 8;
    return ((rb1 | rb2) & 0xFF00FF) + ((g1 | g2) & 0x00FF00);
}

0 <= alpha <= 0x100
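Again as a sanity check (my harness, function copied verbatim): alpha 0 returns colora exactly, 0x100 returns colorb exactly, and 0x80 lands halfway:

```c
/* Copied from above so this compiles on its own. */
unsigned int blendAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
    unsigned int rb1 = ((0x100 - alpha) * (colora & 0xFF00FF)) >> 8;
    unsigned int rb2 = (alpha * (colorb & 0xFF00FF)) >> 8;
    unsigned int g1  = ((0x100 - alpha) * (colora & 0x00FF00)) >> 8;
    unsigned int g2  = (alpha * (colorb & 0x00FF00)) >> 8;
    return ((rb1 | rb2) & 0xFF00FF) + ((g1 | g2) & 0x00FF00);
}
```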


For people who want to divide by 255, I found an exact formula:

pt->r = (r+1 + (r >> 8)) >> 8; // fast way to divide by 255
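This is an exact floor division by 255 for any product of two bytes (x up to 255*255), which is the range that matters for blending. A small verification harness (mine, not the poster's):

```c
#include <stdint.h>

/* (x + 1 + (x >> 8)) >> 8 == x / 255 exactly for 0 <= x <= 255*255. */
static inline uint8_t div255(uint16_t x)
{
    return (uint8_t)((x + 1u + (x >> 8)) >> 8);
}
```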

Here are some pointers.

Consider using pre-multiplied foreground images as described by Porter and Duff. As well as potentially being faster, you avoid a lot of potential colour-fringing effects.

The compositing equation changes from

r =  kA + (1-k)B

... to ...

r =  A + (1-k)B

Alternatively, you can rework the standard equation to remove one multiply.

r =  kA + (1-k)B
==  kA + B - kB
== k(A-B) + B

I may be wrong, but I think you shouldn't need the clamping either...
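A scalar sketch of the reworked single-multiply form (the names are mine); because the result is a weighted mix of A and B it stays in [0, 255], which is why no clamping should be needed:

```c
#include <stdint.h>

/* r = k*(A - B)/255 + B : one multiply per channel, k in [0, 255]. */
static inline uint8_t blend_channel(uint8_t a, uint8_t b, uint8_t k)
{
    int d = (int)a - (int)b;              /* may be negative */
    return (uint8_t)(b + (k * d) / 255);  /* result stays in [0, 255] */
}
```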


I can't comment because I don't have enough reputation, but I want to say that Jasper's version will not overflow for valid input. Masking the multiplication result is necessary because otherwise the red+blue multiplication would leave bits in the green channel (this would also be true if you multiplied red and blue separately, you'd still need to mask out bits in the blue channel) and the green multiplication would leave bits in the blue channel. These are bits that are lost to right shift if you separate the components out, as is often the case with alpha blending. So they're not overflow, or underflow. They're just useless bits that need to be masked out to achieve expected results.

That said, Jasper's version is incorrect. It should be 0xFF-alpha (255-alpha), not 0x100-alpha (256-alpha). This would probably not produce a visible error.

I've found an adaptation of Jasper's code to be faster than my old alpha blending code, which was already decent, and am currently using it in my software renderer project. I work with 32-bit ARGB pixels:

Pixel AlphaBlendPixels(Pixel p1, Pixel p2)
{
    static const unsigned int AMASK = 0xFF000000;
    static const unsigned int RBMASK = 0x00FF00FF;
    static const unsigned int GMASK = 0x0000FF00;
    static const unsigned int AGMASK = AMASK | GMASK;
    static const unsigned int ONEALPHA = 0x01000000;
    unsigned int a = (p2 & AMASK) >> 24;
    unsigned int na = 255 - a;
    unsigned int rb = ((na * (p1 & RBMASK)) + (a * (p2 & RBMASK))) >> 8;
    unsigned int ag = (na * ((p1 & AGMASK) >> 8)) + (a * (ONEALPHA | ((p2 & GMASK) >> 8)));
    return ((rb & RBMASK) | (ag & AGMASK));
}
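To check that the packed arithmetic really matches per-channel math (with the same >> 8 approximation), here is a comparison harness of my own; `Pixel` is assumed to be a 32-bit unsigned ARGB value, and `RefBlend` is my straight per-channel version:

```c
#include <stdint.h>

typedef uint32_t Pixel;  /* assumption: 32-bit ARGB */

static Pixel AlphaBlendPixels(Pixel p1, Pixel p2)
{
    static const uint32_t AMASK = 0xFF000000;
    static const uint32_t RBMASK = 0x00FF00FF;
    static const uint32_t GMASK = 0x0000FF00;
    static const uint32_t AGMASK = AMASK | GMASK;
    static const uint32_t ONEALPHA = 0x01000000;
    uint32_t a = (p2 & AMASK) >> 24;
    uint32_t na = 255 - a;
    uint32_t rb = ((na * (p1 & RBMASK)) + (a * (p2 & RBMASK))) >> 8;
    uint32_t ag = (na * ((p1 & AGMASK) >> 8)) + (a * (ONEALPHA | ((p2 & GMASK) >> 8)));
    return ((rb & RBMASK) | (ag & AGMASK));
}

/* Per-channel reference using the same >> 8 approximation. */
static Pixel RefBlend(Pixel p1, Pixel p2)
{
    uint32_t a = p2 >> 24, na = 255 - a;
    uint32_t r = (na * ((p1 >> 16) & 0xFF) + a * ((p2 >> 16) & 0xFF)) >> 8;
    uint32_t g = (na * ((p1 >> 8) & 0xFF) + a * ((p2 >> 8) & 0xFF)) >> 8;
    uint32_t b = (na * (p1 & 0xFF) + a * (p2 & 0xFF)) >> 8;
    uint32_t A = (na * (p1 >> 24) + a * 256) >> 8;  /* ONEALPHA contributes a*256 */
    return (A << 24) | (r << 16) | (g << 8) | b;
}
```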