How to use hardware acceleration with ffmpeg

After some investigation I was able to implement the necessary HW accelerated decoding on OS X (VDA) and Linux (VDPAU). I will update the answer when I get my hands on the Windows implementation as well. Let's start with the easiest:

Mac OS X

To get HW acceleration working on Mac OS you just need to request the decoder by name: avcodec_find_decoder_by_name("h264_vda"); Note, however, that on Mac OS FFmpeg can only accelerate H.264 videos.
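For completeness, here is a minimal sketch of how the named decoder slots into the usual setup; the software fallback and the error handling are my own additions, not something the VDA path requires:

AVCodec* codec = avcodec_find_decoder_by_name("h264_vda");
if(!codec)
    codec = avcodec_find_decoder(AV_CODEC_ID_H264); // fall back to the SW decoder
AVCodecContext* context = avcodec_alloc_context3(codec);
// ... fill the context from the stream as usual ...
if(avcodec_open2(context, codec, nullptr) < 0)
{
    // handle the error
}

After that, decoding proceeds exactly as it does with the software decoder.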

Linux VDPAU

On Linux things are much more complicated (who is surprised?). FFmpeg has two HW accelerators on Linux: VDPAU (Nvidia) and VA-API (Intel), but only one HW decoder: for VDPAU. So it may seem perfectly reasonable to use the vdpau decoder just like in the Mac OS example above: avcodec_find_decoder_by_name("h264_vdpau");

You might be surprised to find out that it doesn't change anything and you get no acceleration at all. That's because it is only the beginning: you have to write much more code to get the acceleration working. Happily, you don't have to come up with a solution on your own: there are at least two good examples of how to achieve it: libavg and FFmpeg itself. libavg has a VDPAUDecoder class, which is perfectly clear and which I've based my implementation on. You can also consult ffmpeg_vdpau.c for another implementation to compare against. In my opinion the libavg implementation is easier to grasp, though.
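To give an idea of how much extra code is involved, below is a rough sketch of the callback wiring, loosely modelled on libavg's VDPAUDecoder but written against the refcounted get_buffer2 API; createRenderState(), render() and releaseRenderState() are hypothetical helpers standing in for the real surface management:

static void freeRenderState(void* opaque, uint8_t* data)
{
    // hand the vdpau_render_state and its VdpVideoSurface back to the pool
    static_cast<VdpauDecoder*>(opaque)->releaseRenderState(
        reinterpret_cast<vdpau_render_state*>(data));
}

static int getBuffer(AVCodecContext* context, AVFrame* frame, int /*flags*/)
{
    auto decoder = static_cast<VdpauDecoder*>(context->opaque);
    // allocate a vdpau_render_state backed by a VdpVideoSurface and hand it
    // to libavcodec through data[0]
    auto renderState = decoder->createRenderState();
    frame->buf[0] = av_buffer_create(reinterpret_cast<uint8_t*>(renderState),
        sizeof(*renderState), freeRenderState, decoder, 0);
    frame->data[0] = frame->buf[0]->data;
    return 0;
}

static void drawHorizBand(AVCodecContext* context, const AVFrame* frame,
    int offset[AV_NUM_DATA_POINTERS], int y, int type, int height)
{
    auto decoder = static_cast<VdpauDecoder*>(context->opaque);
    auto renderState = reinterpret_cast<vdpau_render_state*>(frame->data[0]);
    // this is where the decoding actually happens on the GPU: libavcodec has
    // filled renderState->info and render() feeds it to VdpDecoderRender
    decoder->render(renderState);
}

void VdpauDecoder::setupContext(AVCodecContext* context)
{
    context->opaque = this;
    context->get_buffer2 = getBuffer;
    context->draw_horiz_band = drawHorizBand;
    context->slice_flags = SLICE_FLAG_CODED_ORDER | SLICE_FLAG_ALLOW_FIELD;
}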

The only thing both of the aforementioned examples lack is proper copying of the decoded frame to main memory. Both examples use VdpVideoSurfaceGetBitsYCbCr, which killed all the performance I had gained on my machine. That's why you might want to use the following procedure to extract the data from the GPU instead:

bool VdpauDecoder::fillFrameWithData(AVCodecContext* context,
    AVFrame* frame)
{
    // the vdp_* functions are pointers previously resolved via
    // VdpGetProcAddress; m_VdpDevice and m_VdpMixer are members of this class
    VdpauDecoder* vdpauDecoder = static_cast<VdpauDecoder*>(context->opaque);
    VdpOutputSurface surface;
    vdp_output_surface_create(vdpauDecoder->m_VdpDevice, VDP_RGBA_FORMAT_B8G8R8A8,
        frame->width, frame->height, &surface);
    // with the h264_vdpau decoder, data[0] carries the vdpau_render_state
    auto renderState = reinterpret_cast<vdpau_render_state*>(frame->data[0]);
    VdpVideoSurface videoSurface = renderState->surface;

    bool result = false;
    // let the mixer convert the decoded YCbCr surface into the RGBA one
    auto status = vdp_video_mixer_render(vdpauDecoder->m_VdpMixer,
        VDP_INVALID_HANDLE,
        nullptr,
        VDP_VIDEO_MIXER_PICTURE_STRUCTURE_FRAME,
        0, nullptr,
        videoSurface,
        0, nullptr,
        nullptr,
        surface,
        nullptr, nullptr, 0, nullptr);
    if(status == VDP_STATUS_OK)
    {
        auto tmframe = av_frame_alloc();
        tmframe->format = AV_PIX_FMT_BGRA;
        tmframe->width = frame->width;
        tmframe->height = frame->height;
        if(av_frame_get_buffer(tmframe, 32) >= 0)
        {
            // this is the fast path that replaces VdpVideoSurfaceGetBitsYCbCr;
            // linesize is int[] while VDPAU wants uint32_t[], same layout in practice
            VdpStatus getBitsStatus = vdp_output_surface_get_bits_native(surface, nullptr,
                reinterpret_cast<void * const *>(tmframe->data),
                reinterpret_cast<const uint32_t *>(tmframe->linesize));
            if(getBitsStatus == VDP_STATUS_OK && av_frame_copy_props(tmframe, frame) == 0)
            {
                // replace the VDPAU render state with plain BGRA data
                av_frame_unref(frame);
                av_frame_move_ref(frame, tmframe);
                result = true;
            }
        }
        av_frame_free(&tmframe); // frees only the empty shell after the move above
    }
    vdp_output_surface_destroy(surface);
    return result;
}

While it has some "external" objects used inside, you should be able to understand it once you have implemented the "get buffer" part (for which the aforementioned examples are a great help). Also, I've used the BGRA format, which was more suitable for my needs; maybe you will choose another.
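For context, here is roughly where the call sits in a decode loop built on the API of that time; formatContext, codecContext and decoder stand for whatever you have set up earlier, so treat this as a sketch rather than a drop-in:

AVPacket packet;
AVFrame* frame = av_frame_alloc();
int gotFrame = 0;
while(av_read_frame(formatContext, &packet) >= 0)
{
    if(avcodec_decode_video2(codecContext, frame, &gotFrame, &packet) >= 0 && gotFrame)
    {
        // on success frame now holds plain BGRA data in main memory
        if(decoder->fillFrameWithData(codecContext, frame))
        {
            // ... consume the frame ...
        }
        av_frame_unref(frame);
    }
    av_free_packet(&packet);
}
av_frame_free(&frame);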

The problem with all of this is that you can't just get it working from FFmpeg: you need to understand at least the basics of the VDPAU API. I hope my answer will aid someone in implementing HW acceleration on Linux. I spent a lot of time on it myself before I realized that there is no simple, one-line way of implementing HW accelerated decoding on Linux.

Linux VA-API

Since my original question was regarding VA-API, I can't leave it unanswered. First of all, there is no decoder for VA-API in FFmpeg, so avcodec_find_decoder_by_name("h264_vaapi") doesn't make any sense: it returns nullptr. I don't know how much harder (or maybe simpler?) it is to implement decoding via VA-API, since all the examples I've seen were quite intimidating, so I chose not to use VA-API at all, even though I had to implement acceleration for an Intel card. Fortunately for me, there is a VDPAU library (driver?) which works over VA-API, so you can use VDPAU on Intel cards!
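You can verify the nullptr claim in one line:

// codec ends up as nullptr: FFmpeg registers no decoder under this name
AVCodec* codec = avcodec_find_decoder_by_name("h264_vaapi");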

I've used the following link to set it up on my Ubuntu machine.

Also, you might want to check the comments on the original question, where @Timothy_G mentioned some links regarding VA-API as well.