Difference between Frames and Packets in FFmpeg

To answer your first and third questions:

  • according to the documentation for AVPacket: "For video, it should typically contain one compressed frame. For audio it may contain several compressed frames."
  • the decode_video.c example gives this code, which reads all frames within a packet; you can also use it to count the frames:
static void decode(AVCodecContext *dec_ctx, AVFrame *frame, AVPacket *pkt,
                   const char *filename)
{
    char buf[1024];
    int ret;
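
    /* hand the whole compressed packet to the decoder */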
    ret = avcodec_send_packet(dec_ctx, pkt);
    if (ret < 0) {
        fprintf(stderr, "Error sending a packet for decoding\n");
        exit(1);
    }
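
    /* drain every frame the decoder will give us for the data sent so far;
       in general there may be any number of them, including zero */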
    while (ret >= 0) {
        ret = avcodec_receive_frame(dec_ctx, frame);
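        /* EAGAIN: the decoder needs more input before it can emit another
           frame; EOF: the decoder has been fully drained */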
        if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
            return;
        else if (ret < 0) {
            fprintf(stderr, "Error during decoding\n");
            exit(1);
        }
        printf("saving frame %3d\n", dec_ctx->frame_number);
        fflush(stdout);
        /* the picture is allocated by the decoder. no need to
           free it */
        snprintf(buf, sizeof(buf), filename, dec_ctx->frame_number);
        pgm_save(frame->data[0], frame->linesize[0],
                 frame->width, frame->height, buf);
    }
}

Simply put, a packet is a block of data.

This is generally determined by bandwidth. If the device has limited internet speed, or it's a phone with a choppy signal, the packet size will be smaller. If it's a desktop with a dedicated connection, the packet size could be quite a bit larger.

A frame could be thought of as one cell of animation, but typically these days, due to compression, it's not an actual keyframe image, just the changes since the last full keyframe. They'll send one keyframe, an actual image, every few seconds or so, but every frame in between is just data specifying which pixels have changed since that last image: the delta.
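
As a rough illustration, libav marks packets that carry a keyframe with the AV_PKT_FLAG_KEY flag, so you can tally keyframes versus delta frames while demuxing. A minimal sketch, assuming fmt_ctx is an AVFormatContext already opened with avformat_open_input() and video_idx is the index of the video stream (both names are just placeholders here):

#include <stdio.h>
#include <libavformat/avformat.h>

static void count_key_packets(AVFormatContext *fmt_ctx, int video_idx)
{
    AVPacket *pkt = av_packet_alloc();
    int keyframes = 0, deltas = 0;

    while (av_read_frame(fmt_ctx, pkt) >= 0) {
        if (pkt->stream_index == video_idx) {
            if (pkt->flags & AV_PKT_FLAG_KEY)
                keyframes++;   /* a full, standalone image */
            else
                deltas++;      /* only the changes since an earlier frame */
        }
        av_packet_unref(pkt);
    }
    printf("keyframe packets: %d, delta packets: %d\n", keyframes, deltas);
    av_packet_free(&pkt);
}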

So yeah, say your packet size is 1024 bytes: your resolution is then limited to however many pixels' worth of changes that stream can carry. They might send one frame per packet to keep it simple, but I don't think anything absolutely guarantees that, since the data stream is reconstructed from those packets, which often arrive out of order, and the frame deltas are only applied once all the packets are pieced back together.

Audio takes up much less space than video, so they might only need to send one audio packet for every 50 video packets.
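
You can see that ratio for yourself by tallying packets per stream while demuxing. A quick sketch along the same lines as above (fmt_ctx again stands in for an opened AVFormatContext, and the fixed-size array is just a shortcut for the sketch):

#include <stdio.h>
#include <libavutil/avutil.h>
#include <libavformat/avformat.h>

static void count_packets_per_stream(AVFormatContext *fmt_ctx)
{
    int counts[64] = {0};   /* assumes at most 64 streams, for brevity */
    AVPacket *pkt = av_packet_alloc();

    while (av_read_frame(fmt_ctx, pkt) >= 0) {
        if (pkt->stream_index < 64)
            counts[pkt->stream_index]++;
        av_packet_unref(pkt);
    }
    for (unsigned i = 0; i < fmt_ctx->nb_streams; i++)
        printf("stream %u (%s): %d packets\n", i,
               av_get_media_type_string(fmt_ctx->streams[i]->codecpar->codec_type),
               counts[i]);
    av_packet_free(&pkt);
}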

I know these guys did a few clips on their channel about video streams being recombined from packets -- https://www.youtube.com/watch?v=DkIhI59ysXI

Basically, frames are natural, while packets are artificial. 😉

Frames are substantial, packets are auxiliary – they help process a stream successively in smaller parts of acceptable sizes (instead of processing the stream as a whole). “Divide and conquer.”

[Diagram: a stream split into packets; Packet 0 and Packet 1 contain different numbers of encoded frames, and Frame 1 is spread across both packets.]

A packet has multiple frames, right?

A packet may have multiple (encoded) frames, or it may have only one, possibly incomplete.

Can a frame be part of only one packet?

No. It may be spread over several packets. See Frame 1 in the picture.

I'm referring to the case where half of the frame's data is in packet 1 and the other half in packet 2. Is that possible?

Yes. See Frame 1.
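
This is exactly why, when you decode a raw elementary stream in chunks, libav makes you run the bytes through a parser first: av_parser_parse2() buffers the input and only emits a packet once a complete encoded frame has been assembled, no matter how the data was split. A sketch of the feeding loop, modeled on the same decode_video.c example as above (parser, dec_ctx, frame, pkt, data, data_size and outfilename are all assumed to be set up elsewhere):

    /* feed arbitrary-sized chunks; av_parser_parse2() re-assembles them
       into whole encoded frames and puts each one into pkt */
    while (data_size > 0) {
        int used = av_parser_parse2(parser, dec_ctx,
                                    &pkt->data, &pkt->size, /* out: one frame */
                                    data, data_size,        /* in: raw chunk */
                                    AV_NOPTS_VALUE, AV_NOPTS_VALUE, 0);
        if (used < 0)
            break;               /* parse error */
        data      += used;
        data_size -= used;
        if (pkt->size)           /* a complete frame has been assembled */
            decode(dec_ctx, frame, pkt, outfilename);
    }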

How will we know how many frames are in a packet in LibAV?

The number of frames per packet may differ between multimedia files; it depends on how the particular stream was encoded.

Even within the same stream there may be packets with different numbers of (encoded) frames – compare Packet 0 and Packet 1.

There is no info in a packet about how many (encoded) frames it contains; the only way to find out is to decode it, as sketched below.

Frames in the same packet generally have different sizes (as in the picture above), so a packet is not an array of equally sized elements (frames).
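
A minimal way to count them, then, is to send the packet and count how many frames come back out. A sketch, with the caveat that decoders buffer internally, so frames belonging to one packet may only surface after later packets have been sent (dec_ctx, frame and pkt are assumed to be allocated and opened elsewhere):

#include <libavcodec/avcodec.h>

/* Returns how many decoded frames this packet released right away,
   or a negative error code. Due to decoder delay, the true count for
   a packet may be spread across later calls. */
static int frames_from_packet(AVCodecContext *dec_ctx, AVFrame *frame,
                              AVPacket *pkt)
{
    int count = 0;
    int ret = avcodec_send_packet(dec_ctx, pkt);
    if (ret < 0)
        return ret;

    for (;;) {
        ret = avcodec_receive_frame(dec_ctx, frame);
        if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
            return count;        /* no more frames available right now */
        if (ret < 0)
            return ret;          /* real decoding error */
        count++;
        av_frame_unref(frame);
    }
}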