Faster forking of large processes on Linux?

What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process?

My problem is that the forking process is ~500 MB in size, and a simple benchmark achieves only about 50 forks/s from it (cf. ~1600 forks/s from a minimally sized process), which is too slow for the intended application.

Some googling turns up vfork as having been invented as the solution to this problem... but also warnings not to use it. Modern Linux seems to have acquired the related clone and posix_spawn calls; are these likely to help? What's the modern replacement for vfork?

I'm using 64bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).


On Linux, you can use posix_spawn(3) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.

See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(3), its advantages and some examples.

To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h>, and then simply call posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK) on the attributes object you pass to posix_spawn.
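A minimal sketch of that recipe, in case it helps. The helper name spawn_and_wait is made up for illustration, and the #ifdef guard is only there because POSIX_SPAWN_USEVFORK is a GNU extension that not every libc defines.

```c
#define _GNU_SOURCE            /* must come before <spawn.h> to expose POSIX_SPAWN_USEVFORK */
#include <assert.h>
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: spawn argv[0] (looked up in PATH), wait for it,
 * and return its exit status, or -1 on any failure. */
static int spawn_and_wait(char *const argv[])
{
    posix_spawnattr_t attr;
    pid_t pid;
    int status;
    int err;

    assert(argv != NULL && argv[0] != NULL);

    if (posix_spawnattr_init(&attr) != 0)
        return -1;
#ifdef POSIX_SPAWN_USEVFORK
    /* Ask the libc to use vfork-style process creation. */
    posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK);
#endif
    /* posix_spawnp returns an error number directly; it does not set errno. */
    err = posix_spawnp(&pid, argv[0], NULL, &attr, argv, environ);
    posix_spawnattr_destroy(&attr);
    if (err != 0) {
        fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
        return -1;
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For instance, calling spawn_and_wait with the argument vector {"sh", "-c", "exit 3", NULL} should return 3. As far as I know, glibc 2.24 and later ignore the flag because posix_spawn there always uses clone with CLONE_VM | CLONE_VFORK, so the flag mainly matters on older systems such as Lenny.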

I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.

benchmarking the various spawns over 1000 runs at 100M RSS
                            user     system      total        real
fspawn (fork/exec):     0.100000  15.460000  40.570000 ( 41.366389)
pspawn (posix_spawn):   0.010000   0.010000   0.540000 (  0.970577)

Outcome: I was going to go down the early-spawned helper subprocess route, as suggested by other answers here, but then I came across this on using huge page support to improve fork performance.

Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.
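For anyone who cannot preload libhugetlbfs, here is a rough sketch of a different route to the same effect: asking for huge pages directly with mmap and MAP_HUGETLB. This is not what I did above. It assumes a kernel with MAP_HUGETLB (2.6.32 or later, so not stock Lenny), a 2 MiB huge page size, and some huge pages reserved by the administrator. The function name alloc_prefer_huge is invented for the example. The reason huge pages help at all is that one huge-page mapping replaces 512 ordinary 4 KiB page table entries, so fork has far fewer entries to copy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages, the usual default on x86-64. */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/* Hypothetical helper: map at least `bytes` of anonymous memory, preferring
 * huge pages and falling back to normal pages if none are available.
 * Stores the mapped length (needed for munmap) and whether huge pages were used.
 * Returns NULL on failure. */
static void *alloc_prefer_huge(size_t bytes, size_t *mapped_len, int *got_huge)
{
    assert(bytes > 0 && mapped_len != NULL && got_huge != NULL);

    /* Huge-page mappings must be a multiple of the huge page size. */
    size_t len = (bytes + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
    void *p = MAP_FAILED;

#ifdef MAP_HUGETLB
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
    *got_huge = (p != MAP_FAILED);

    if (p == MAP_FAILED)   /* no huge pages reserved, or flag unsupported */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *mapped_len = len;
    return p;
}
```

Release the region with munmap(p, mapped_len). Unlike the libhugetlbfs approach, this only covers allocations you route through it explicitly, not every malloc in the program.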


Did you actually measure how much time forks take? Quoting the page you linked,

Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)

So the number of forks per second doesn't really show how big the overhead will be. You should measure the time consumed by forks and (as general advice) only by the forks you actually perform, rather than benchmarking maximum performance.

But if you do find that forking a large process is slow, you can spawn a small ancillary process early on, connect the master process to its input with a pipe, and send it the commands to exec. The small process then forks and execs these commands.
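One way to take that measurement is to time the fork() call itself from the parent's side, with a known amount of touched memory in the process. The following is only a sketch: mean_fork_seconds is a made-up name, and the ballast allocation stands in for whatever your real process holds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns the mean wall-clock time in seconds that the
 * parent spends inside fork(), with `resident_bytes` of touched heap in the
 * process. Returns -1.0 on error. */
static double mean_fork_seconds(size_t resident_bytes, int iterations)
{
    assert(resident_bytes > 0 && iterations > 0);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *ballast = malloc(resident_bytes);
    if (ballast == NULL)
        return -1.0;
    /* Touch one byte per page so every page is really resident;
     * volatile keeps the compiler from dropping the writes. */
    for (size_t off = 0; off < resident_bytes; off += page)
        ballast[off] = 1;

    double total = 0.0;
    for (int i = 0; i < iterations; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);                     /* child: leave immediately */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (pid < 0) {
            free((void *)ballast);
            return -1.0;
        }
        waitpid(pid, NULL, 0);            /* reap outside the timed region */
        total += (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
    free((void *)ballast);
    return total / iterations;
}
```

Comparing the result for a small ballast against one the size of your real process shows how much of the cost comes from copying page tables rather than from process creation in general.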

posix_spawn()

This function, as far as I understand, is implemented via fork/exec on desktop systems. On embedded systems, however (particularly those without an MMU), processes are spawned via a syscall, and posix_spawn or a similar function is the interface to it. Quoting the informative section of the POSIX standard that describes posix_spawn:

  • Swapping is generally too slow for a realtime environment.

  • Dynamic address translation is not available everywhere that POSIX might be useful.

  • Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.

Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.

I don't think that you will benefit from this function on a desktop system if your goal is to minimize time consumption.


If you know the number of subprocesses ahead of time, it might be reasonable to pre-fork your application on startup and then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program, it might be reasonable to fork a subprocess or two ahead of time for quick turnaround later. Neither of these options directly solves the problem, but if either approach suits your app, it might allow you to side-step the issue.
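To make the pre-fork idea concrete, here is a small sketch in which a child is forked early, blocks on a pipe, and execs whatever command line it eventually receives. The names standby_prepare and standby_launch are hypothetical, and passing the command as a single newline-terminated line to /bin/sh -c is just an assumption to keep the example short.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STANDBY_CMD_MAX 4096

/* Hypothetical handle for one pre-forked child. */
struct standby { pid_t pid; int cmd_fd; };

/* Fork a child now (ideally at startup or during a lull); it blocks until
 * a command arrives. Returns 0 on success, -1 on failure. */
static int standby_prepare(struct standby *s)
{
    int fds[2];

    assert(s != NULL);
    if (pipe(fds) < 0)
        return -1;
    s->pid = fork();
    if (s->pid < 0) { close(fds[0]); close(fds[1]); return -1; }

    if (s->pid == 0) {
        char cmd[STANDBY_CMD_MAX];
        size_t used = 0;

        close(fds[1]);
        /* Read byte by byte up to the terminating newline (or EOF). */
        while (used < sizeof cmd - 1 &&
               read(fds[0], &cmd[used], 1) == 1 && cmd[used] != '\n')
            used++;
        cmd[used] = '\0';
        close(fds[0]);
        if (used == 0)
            _exit(0);                     /* the parent never needed us */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);
    }
    close(fds[0]);
    s->cmd_fd = fds[1];
    return 0;
}

/* Hand the command to the waiting child; no fork happens at this point.
 * Returns the command's exit status, or -1 on failure. */
static int standby_launch(struct standby *s, const char *command)
{
    int status;
    size_t len = strlen(command);

    assert(len > 0 && len < STANDBY_CMD_MAX - 1);
    assert(strchr(command, '\n') == NULL);
    if (write(s->cmd_fd, command, len) != (ssize_t)len) return -1;
    if (write(s->cmd_fd, "\n", 1) != 1) return -1;
    close(s->cmd_fd);
    if (waitpid(s->pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The sketch uses a newline rather than end-of-file to mark the end of the command because, with several standby children, later ones inherit the write ends of earlier pipes, so closing the parent's copy alone would not produce EOF.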


I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/

Excerpt:

The system call clone() comes to the rescue. Using clone() we create a child process which has the following features:

  • The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is created. As a result of this, any change to any non-stack variable made by the child is visible by the parent process. This is similar to threads, and therefore completely different from fork(), and also very dangerous – we don’t want the child to mess up the parent.
  • The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().
  • The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().
  • The most important: This thread-like child process can call exec().

In a nutshell, by calling clone in the following way, we create a child process which is very similar to a thread but still can call exec():

pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
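Here is how I would flesh out the clone() call from the blog post into something compilable. Treat it as a sketch under assumptions rather than the blog's actual code: the names exec_entry and clone_exec_wait and the 64 KiB stack size are mine, and it assumes an architecture where the stack grows downward (such as x86-64), so the top of the buffer is passed to clone.

```c
#define _GNU_SOURCE            /* needed for clone() and CLONE_VM in <sched.h> */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* assumption: enough for execvp */

/* Entry function for the child. It shares the parent's memory, so it must
 * do as little as possible before exec replaces the process image. */
static int exec_entry(void *arg)
{
    char *const *argv = arg;
    execvp(argv[0], argv);
    _exit(127);                        /* exec failed */
}

/* Hypothetical helper: run argv via clone(CLONE_VM) + exec and wait for it.
 * Returns the exit status, or -1 on failure. */
static int clone_exec_wait(char *const argv[])
{
    int status;

    assert(argv != NULL && argv[0] != NULL);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Stack grows down: hand clone() the highest address of the buffer.
     * No page tables are copied because of CLONE_VM; SIGCHLD makes the
     * child waitable like an ordinary fork() child. */
    pid_t pid = clone(exec_entry, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, (void *)argv);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    /* The stack must stay valid until the child has exec'd or exited;
     * waiting for the child to finish is the simplest way to be sure. */
    if (waitpid(pid, &status, 0) < 0)
        status = -1;
    free(stack);
    return (status != -1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Note that the parent keeps running while the child is between clone() and exec, which is exactly the window in which the shared address space is dangerous. Freeing the stack only after waitpid is the conservative choice; a real popen-style implementation that returns before the child exits needs another way to know the child has exec'd.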

However, I think this approach may still be subject to the setuid problem:

http://ewontfix.com/7/ "setuid and vfork"

Now we get to the worst of it. Threads and vfork allow you to get in a situation where two processes are both sharing memory space and running at the same time. Now, what happens if another thread in the parent calls setuid (or any other privilege-affecting function)? You end up with two processes with different privilege levels running in a shared address space. And this is A Bad Thing.

Consider for example a multi-threaded server daemon, running initially as root, that’s using posix_spawn, implemented naively with vfork, to run an external command. It doesn’t care if this command runs as root or with low privileges, since it’s a fixed command line with fixed environment and can’t do anything harmful. (As a stupid example, let’s say it’s running date as an external command because the programmer couldn’t figure out how to use strftime.)

Since it doesn’t care, it calls setuid in another thread without any synchronization against running the external program, with the intent to drop down to a normal user and execute user-provided code (perhaps a script or dlopen-obtained module) as that user. Unfortunately, it just gave that user permission to mmap new code over top of the running posix_spawn code, or to change the strings posix_spawn is passing to exec in the child. Whoops.