FFMPEG multiple outputs performance (Single instance vs Multiple instances)

I am working on creating multiple encoded streams from a single input file (.mp4). The input stream has no audio. Each encoded stream is created by cropping a different part of the input and then encoding it at the same bit-rate, on a 32-core system.

Here are the scenarios I am trying, as explained in the ffmpeg wiki on creating multiple outputs: https://trac.ffmpeg.org/wiki/Creating%20multiple%20outputs

Scenario 1 (using a single ffmpeg instance)

ffmpeg -i input.mp4 \
  -filter:v crop=iw/2:ih/2:0:0 -c:v libx264 -b:v 5M out_1.mp4 \
  -filter:v crop=iw/2:ih/2:iw/2:0 -c:v libx264 -b:v 5M out_2.mp4 \
  -filter:v crop=iw/2:ih/2:0:ih/2 -c:v libx264 -b:v 5M out_3.mp4

In this case, I am assuming that ffmpeg decodes the input only once and the decoded frames are supplied to all the crop filters. Please correct me if that is not right.
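To make the decode-once behavior explicit, the three crops can also be written as one filtergraph using the split filter, so a single decoded stream demonstrably feeds all three branches. This is a sketch of Scenario 1 rewritten that way (file names and bitrates taken from the example above):

```shell
# Decode once, split the decoded frames three ways, crop each branch.
ffmpeg -i input.mp4 -filter_complex \
  "[0:v]split=3[a][b][c]; \
   [a]crop=iw/2:ih/2:0:0[v1]; \
   [b]crop=iw/2:ih/2:iw/2:0[v2]; \
   [c]crop=iw/2:ih/2:0:ih/2[v3]" \
  -map '[v1]' -c:v libx264 -b:v 5M out_1.mp4 \
  -map '[v2]' -c:v libx264 -b:v 5M out_2.mp4 \
  -map '[v3]' -c:v libx264 -b:v 5M out_3.mp4
```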

Scenario 2 (using multiple ffmpeg instances, i.e. three separate processes)

ffmpeg -i input.mp4 -filter:v crop=iw/2:ih/2:0:0 -c:v libx264 -b:v 5M out_1.mp4

ffmpeg -i input.mp4 -filter:v crop=iw/2:ih/2:iw/2:0 -c:v libx264 -b:v 5M out_2.mp4

ffmpeg -i input.mp4 -filter:v crop=iw/2:ih/2:0:ih/2 -c:v libx264 -b:v 5M out_3.mp4

In my case, I actually need to encode even more streams by cropping different sections of the input video. I am showing three here just to keep the example simple.

Now, in terms of fps, I see that Scenario 2 performs better. It also uses the CPU to its maximum (more than 95% utilization). Scenario 1 has a lower fps, and its CPU utilization is much lower (close to 65%). Also, in Scenario 1, as I increase the number of streams to be encoded, the CPU utilization does not increase linearly: it almost becomes 1.5x when I go from one stream to two, but after that the increments are very small (roughly 10%, and even less with more streams).

So my question is: I want to use a single ffmpeg instance because it avoids decoding the input multiple times, and because my input could be as big as 4K or even bigger. What should I do to get better CPU utilization (> 90%), and hence hopefully better fps? Also, why does the CPU utilization not increase linearly with the number of streams being encoded? Why doesn't a single ffmpeg instance perform as well as multiple instances? It seems to me that with a single ffmpeg instance, the encodes are not truly running in parallel.

Edit: Here's the simplest possible way I can reproduce and explain the issue in case things are not clear. Keep in mind that this is just for experimental purposes, to understand the issue.

Single Instance: ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - -c:v libx264 -x264opts threads=1 -b:v 1M -f null - -c:v libx264 -x264opts threads=1 -b:v 1M -f null -

Multiple Instances: ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - | ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - | ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null -

Note that I am limiting x264 to a single thread. In the single-instance case, I would expect ffmpeg to create one encoding thread for each x264 encode and execute them in parallel. But I see that only one CPU core is fully utilized, which makes me believe that only one encode session is running at a time. On the other hand, with multiple instances, I see that three CPU cores are fully utilized, which I take to mean that all three encodes are running in parallel.
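A side note on the multiple-instances experiment: the processes in a shell pipeline do run concurrently, but piping one ffmpeg into the next also wires each process's stdout to the next one's stdin, which is not what is intended here. A cleaner way to launch truly independent instances is shell background jobs:

```shell
# Launch three independent ffmpeg processes in the background, then wait.
# Unlike a pipeline, this does not connect one process's stdout to another's stdin.
ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - &
ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - &
ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - &
wait
```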

I really hope some experts can jump in and help with this.


A less obvious problem: depending on your input/output formats or filters, ffmpeg may need to do pixel format conversion internally, and in certain cases this becomes a bottleneck with parallel outputs if the conversion is done for each stream separately.

The idea is to do the pixel format conversion once if possible, like:

-filter_complex '[0:v]format=yuv420p, split=3[s1][s2][s3]' \
-map '[s1]' ... \
-map '[s2]' ... \
-map '[s3]' ...

Filters that are the same for all outputs should likewise be applied only once, before the split. Note that some filters require a specific pixel format.

For other causes see the small note at the bottom of the wiki:

Parallel encoding

Outputting and re-encoding multiple times in the same FFmpeg process will typically slow down to the "slowest encoder" in your list. Some encoders (like libx264) perform their encoding "threaded and in the background", so they will effectively allow for parallel encodings; however, audio encoding may be serial and become the bottleneck, etc. It seems that if you do have any encodings that are serial, they will be treated as "truly serial" by FFmpeg, and thus your FFmpeg may not use all available cores.