You might want to use a faster lossless codec that takes more temporary space, like ffvhuff or something, if you have fast disks. Otherwise, use -preset ultrafast with -crf 0 (same as -qp 0; it enables lossless mode).
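For example, something like this should give you a lossless x264 intermediate (the filenames are placeholders, and audio is just stream-copied):

    ffmpeg -i input.dv -c:v libx264 -preset ultrafast -qp 0 -c:a copy intermediate.mkv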

x264 slow only compresses a tiny bit better than superfast in lossless mode, BTW. Maybe if you had animation with some bit-identical blocks more than 1 frame back, so multiple ref frames could help, then higher settings would do more. My findings, for a 1h:22m NTSC DV deinterlace (720x480p60), all in yuv420: 27.9GB (superfast) vs 27.0GB (slow), 55.0GB for ffvhuff, and 37.6GB for ffv1 (default settings; somewhat smaller with even slower ffv1 compress/decompress settings). CABAC encode/decode at that bitrate takes a ton of CPU; I should have used ultrafast instead of superfast, or just -x264-params cabac=0.
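That last option would look something like this (a sketch, with made-up filenames):

    ffmpeg -i input.dv -c:v libx264 -preset superfast -qp 0 -x264-params cabac=0 -c:a copy lossless.mkv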

TL;DR: use libx264 -preset ultrafast or ffvhuff for intermediate files.

ffv1 or h.264 with CABAC aren't worth the encode/decode CPU time when file size doesn't really matter. And huffyuv doesn't do yuv 4:2:0; you need ffvhuff for that. Lagarith is GPL, but decoder-only in ffmpeg, and the encoder isn't ported to anything but Windows. (Its speed vs. compression tradeoff probably isn't too impressive vs. x264 either, except maybe for noiseless sources like animation, where the prediction / run-length stuff before the entropy coder will do well.)
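If you do go the ffvhuff route, it's just a codec swap (again a sketch, with placeholder filenames):

    ffmpeg -i input.dv -c:v ffvhuff -pix_fmt yuv420p -c:a copy intermediate.mkv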

Also, pretty sure you could use -vf drawtext during the bmp -> 2 sec video step. And why are you using hard-CBR (bitrate 35M, max-bitrate 35M) for that? Why not just lossless for that step, too?
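Something like this should turn a BMP into 2 seconds of lossless video with text burned in (a sketch: it assumes an ffmpeg build with libfreetype for drawtext, and the framerate, text, and filenames are made up; match the framerate to your camera footage):

    ffmpeg -loop 1 -framerate 30000/1001 -i still.bmp -t 2 \
      -vf "drawtext=text='Chapter 1':fontsize=48:fontcolor=white:x=(w-text_w)/2:y=h-2*text_h" \
      -c:v libx264 -preset ultrafast -qp 0 still.mkv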

It's not usually useful to specify -profile for x264. By default it sets the profile flags in the output to the lowest it can, given what it puts into the bitstream (i.e. it will set the profile to High if 8x8dct=1, and work out the level based on resolution, bitrate, and ref frames). edit: -profile also forces x264 to lower the ref frame count or other settings as needed to stay within the limits of the given profile. Still, it's rare to need it, unless you're targeting some HW decoder.
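If you are targeting a hardware decoder, it's just something like this (the profile/level values here are only examples, not recommendations for your particular player):

    ffmpeg -i input.mp4 -c:v libx264 -profile:v high -level 4.1 -preset slow -crf 18 output.mp4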

What IS useful on the lossy final encode is using -preset slower or something to have x264 use more CPU time, but get better quality at the same bitrate. This pretty much replaces having to tweak all the knobs, like ref frames, trellis, adaptive B-frames, motion search, etc.

To try to answer the actual question: the things that matter for being able to concat different h.264 streams (from your camera's encoder and from BMP -> x264) should just be resolution, colorspace (yuv420 vs. yuv444 or something), and probably some h.264 stream parameters, like interlaced or not.
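You can check those parameters on both files with ffprobe and compare the output (the filename is a placeholder):

    ffprobe -v error -select_streams v:0 \
      -show_entries stream=codec_name,profile,level,width,height,pix_fmt,field_order,r_frame_rate \
      -of default=noprint_wrappers=1 camera_clip.mp4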

If you're planning to keep the entire 35Mbit/s thing around, and don't want to get it down to a reasonable file size that you could send over the internet, then you'd want to match whatever your camera's hardware encoder is doing. Or you could run the whole thing through x264, which will take time, but with -preset veryslow you can probably do 10Mbit/s, or maybe even 5, with little noticeable quality loss. Try -crf 18; that's generally pretty much transparent. (Yeah, you can still easily notice the difference if you pause and flip back and forth between x264 and source frames.)
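i.e. something along these lines for the final pass (filenames assumed; audio just copied):

    ffmpeg -i whole_thing.mkv -c:v libx264 -preset veryslow -crf 18 -c:a copy final.mp4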

It should be possible to do the whole process in a single invocation of ffmpeg. The filtergraph that ultimately ends in the concat filter can have chains that generate 2 sec of repeated still image + silence, interspersed with chains that are just inputfile->output.
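A rough sketch of what that could look like for one still + one camera clip (the resolution, framerate, sample rate, and filenames are all assumptions; they have to match your camera's output for the concat filter to accept them):

    ffmpeg -loop 1 -framerate 30000/1001 -t 2 -i title.bmp -i clip1.mp4 \
      -filter_complex "[0:v]scale=1920:1080,setsar=1,format=yuv420p[still]; anullsrc=channel_layout=stereo:sample_rate=48000,atrim=duration=2[sil]; [still][sil][1:v][1:a]concat=n=2:v=1:a=1[v][a]" \
      -map "[v]" -map "[a]" -c:v libx264 -preset slow -crf 18 -c:a aac out.mp4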

(Or, if you really don't want to xcode the camera output, and can figure out what's different between your camera's output and x264's: one invocation of ffmpeg per still image, and then one final concat.)
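That final concat with no re-encode would be the concat demuxer with stream copy (a sketch; the filenames are made up, and it only works if the streams really are compatible):

    printf "file 'still1.mp4'\nfile 'clip1.mp4'\nfile 'still2.mp4'\n" > list.txt
    ffmpeg -f concat -safe 0 -i list.txt -c copy joined.mp4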

Hardware encoders in phones and cameras are usually pretty bad. The only way they look good is by throwing a lot of bitrate at the problem. x264 can usually do a lot better, esp. with -preset slow, slower, or veryslow. Obviously there will be another generation of loss, but you can typically cut the video bitrate down to 2Mbit/s for 1080p@24fps Hollywood movies with very sharp detail. Noisy handheld sources with high motion the whole time will take more bits, but -crf 18 is constant quality, so I'd recommend that as something that will be good enough for viewing even up close on a good monitor. I'd probably still save the original sources, though, since storage only gets cheaper, and you can't ever recover the original quality from the x264 output. Still, I'd just keep them to have them. If you're giving this file to anyone, or copying it over the internet, x264 -preset slow does good stuff. Even the default -preset medium is good. If you don't set a target bitrate or quality, the default is -crf 23, I think.

I'm not doing much to answer how to get x264 to make output you can concat with your camera's non-xcoded bitstream, since that's not something I've had to do, and it doesn't really interest me to find out, sorry. Mostly answering so anyone who wanted to follow your starting point won't be led to -crf 0 or something silly like that. :P