FastCGI and Apache 500 error intermittently

I have a FastCGI (mod_fastcgi) problem. It happens every once in a while, and does not cause a complete server meltdown, just 500 errors. A couple of details: first, I am using APC, so PHP is in control of its own processes, not FastCGI. Also, I have the webroot set as:

/var/www/html

And the fcgi-bin inside:

/var/www/html/fcgi-bin

Here is the relevant entry from the Apache error_log:

[Fri Jan 07 10:22:39 2011] [error] [client 50.16.222.82] (4)Interrupted system call: FastCGI: comm with server "/var/www/html/fcgi-bin/php.fcgi" aborted: select() failed, referer: http://www.domain.com/

I also ran strace on the 'fcgi-pm' process. Here is a snippet from the trace around the time it bombs out:

21725 gettimeofday({1294420603, 14360}, NULL) = 0
21725 read(14, "C /var/www/html/fcgi-bin/php.fcgi - - 6503 38*", 16384) = 46
21725 alarm(131)                        = 0
21725 select(15, [14], NULL, NULL, NULL) = 1 (in [14])
21725 alarm(0)                          = 131
21725 gettimeofday({1294420603, 96595}, NULL) = 0
21725 read(14, "C /var/www/html/fcgi-bin/php.fcgi - - 6154 23*C /var/www/html/fcgi-bin/php.fcgi - - 6483 28*", 16384) = 92
21725 alarm(131)                        = 0
21725 select(15, [14], NULL, NULL, NULL) = 1 (in [14])
21725 alarm(0)                          = 131
21725 gettimeofday({1294420603, 270744}, NULL) = 0
21725 read(14, "C /var/www/html/fcgi-bin/php.fcgi - - 5741 38*", 16384) = 46
21725 alarm(131)                        = 0
21725 select(15, [14], NULL, NULL, NULL) = 1 (in [14])
21725 alarm(0)                          = 131
21725 gettimeofday({1294420603, 311502}, NULL) = 0
21725 read(14, "C /var/www/html/fcgi-bin/php.fcgi - - 6064 32*", 16384) = 46
21725 alarm(131)                        = 0
21725 select(15, [14], NULL, NULL, NULL) = 1 (in [14])
21725 alarm(0)                          = 131
21725 gettimeofday({1294420603, 365598}, NULL) = 0
21725 read(14, "C /var/www/html/fcgi-bin/php.fcgi - - 6179 33*C /var/www/html/fcgi-bin/php.fcgi - - 5906 59*", 16384) = 92
21725 alarm(131)                        = 0
21725 select(15, [14], NULL, NULL, NULL) = 1 (in [14])
21725 alarm(0)                          = 131
21725 gettimeofday({1294420603, 454405}, NULL) = 0

I noticed that the 'select()' call seems to stay the same regardless; however, the read() changes its return value from 46 to some other number while it is bombing out. Has anyone seen anything like this? Could this be some sort of file locking?

Thanks, Ben


Solution 1:

Synopsis

I have observed the very same behavior with Apache; it seems that this problem is not specific to lighttpd.

In my case, the symptoms were exactly the same; the Apache access logs were peppered with intermittent 500 response codes, and there were no corresponding entries in PHP's error log (and PHP error-reporting was configured to be maximally verbose).

I described the issue extensively on the Apache mailing list (search the list archives for the subject "Intermittent 500 responses in access.log without corresponding entries in error.log").

Root Cause

1100110's answer hints at the root cause, but I'll provide additional documentation, straight from Apache, as well as suggestions for eliminating the problem.

Here is the official word from Apache on this matter:

https://httpd.apache.org/mod_fcgid/mod/mod_fcgid.html :

Special PHP considerations

By default, PHP FastCGI processes exit after handling 500 requests, and they may exit after this module has already connected to the application and sent the next request. When that occurs, an error will be logged and 500 Internal Server Error will be returned to the client. This PHP behavior can be disabled by setting PHP_FCGI_MAX_REQUESTS to 0, but that can be a problem if the PHP application leaks resources. Alternatively, PHP_FCGI_MAX_REQUESTS can be set to a much higher value than the default to reduce the frequency of this problem. FcgidMaxRequestsPerProcess can be set to a value less than or equal to PHP_FCGI_MAX_REQUESTS to resolve the problem.

PHP child process management (PHP_FCGI_CHILDREN) should always be disabled with mod_fcgid, which will only route one request at a time to application processes it has spawned; thus, any child processes created by PHP will not be used effectively. (Additionally, the PHP child processes may not be terminated properly.) By default, and with the environment variable setting PHP_FCGI_CHILDREN=0, PHP child process management is disabled.

The popular APC opcode cache for PHP cannot share a cache between PHP FastCGI processes unless PHP manages the child processes. Thus, the effectiveness of the cache is limited with mod_fcgid; concurrent PHP requests will use different opcode caches.

There we have it.
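To make the failure mode concrete, here is a toy Python model of the recycling behavior described in the excerpt. It is my own sketch, not code from Apache or PHP: the function name failed_requests is hypothetical, and it assumes the worst case, in which the web server always hands the next request to a PHP process that has just exited. In practice the outcome depends on timing, so real failures are less regular.

```python
def failed_requests(total_requests, php_max_requests=500):
    """Worst-case toy model: return the 1-based request numbers that get a 500."""
    failed = []
    handled = 0  # requests served so far by the current PHP process
    for n in range(1, total_requests + 1):
        if php_max_requests > 0 and handled == php_max_requests:
            # The PHP process has reached its limit and exited, but the
            # web server still routed this request to it.
            failed.append(n)
            handled = 0  # a fresh process serves the requests that follow
            continue
        handled += 1
    return failed


print(failed_requests(1200, 500))  # [501, 1002]
print(failed_requests(1200, 0))    # [] (a limit of 0 disables recycling)
```

Even in this worst case only about one request in 500 fails, which matches the "intermittent" nature of the errors.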

Possible Solutions

Option 1

One solution is to set PHP_FCGI_MAX_REQUESTS to zero, but taking this measure introduces the potential for memory leaks to grow out of control.

The various bits of documentation that I have consulted do not make it clear whether PHP via Fast-CGI suffers from inherent memory-leaking (hence this built-in "process recycling" behavior) or if the risk is limited to poorly-written, "runaway" scripts.

In any case, there is risk inherent to setting PHP_FCGI_MAX_REQUESTS to zero, especially in a shared hosting environment.

Option 2

A second solution, as described in the excerpt above, is to set FcgidMaxRequestsPerProcess to a value less than or equal to PHP_FCGI_MAX_REQUESTS. The documentation omits an important point, however: the value must also be greater than zero (because zero means "unlimited" or "disable the check" in this context).

Given that the default value for FcgidMaxRequestsPerProcess is zero, and the default value for PHP_FCGI_MAX_REQUESTS is 500, any administrator who has not overridden these values will experience the intermittent 500 response codes. For this reason, I fail to understand why FcgidMaxRequestsPerProcess and PHP_FCGI_MAX_REQUESTS do not share the same default value. Perhaps this is because giving the two settings the same value yields the same net result as setting PHP_FCGI_MAX_REQUESTS to zero; the documentation is ambiguous in this regard.
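The rule for the two values can be written down as a small Python check. This is only a sketch of my reading of the documentation quoted above; the function name stale_process_possible is hypothetical, and it models nothing beyond the request-count limits.

```python
def stale_process_possible(fcgid_max_requests_per_process,
                           php_fcgi_max_requests=500):
    """Return True if PHP could exit on its own before mod_fcgid retires it."""
    if php_fcgi_max_requests == 0:
        # PHP never exits on its own, so this particular race cannot occur.
        return False
    if fcgid_max_requests_per_process == 0:
        # Zero means "unlimited": mod_fcgid never retires the process first.
        return True
    # mod_fcgid must retire the process at or before PHP's own limit.
    return fcgid_max_requests_per_process > php_fcgi_max_requests


print(stale_process_possible(0, 500))     # True  (both defaults)
print(stale_process_possible(500, 500))   # False
print(stale_process_possible(1000, 500))  # True
```

The first call uses the two default values and shows why an untouched configuration is exposed to the problem.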

Option 3

A third solution is to abandon Fast-CGI altogether, in favor of a comparable alternative, such as suPHP or plain-old CGI + SuExec. I have performed some basic, raw performance benchmarking across the various PHP modes, and my findings are as follows:

  1. Mod-PHP 77.7
  2. CGI 69.0
  3. suPHP 67.0
  4. Fast-CGI 55.7

Mod-PHP is the highest-performing, with a score of 77.7. The scores are arbitrary and serve only to demonstrate the relative variance in page-load-times across PHP modes.

If we assume that these benchmarks are fairly representative, then there seem to be very few reasons to cling to Fast-CGI, given this one (fairly serious) flaw in its implementation. The only substantial reason that comes to mind is op-code caching. My understanding is that PHP cannot utilize op-code caching via CGI or suPHP mode (because processes do not persist across requests).

While Fast-CGI does not take advantage of op-code caching (e.g., via APC) out-of-the-box, clever users have devised a method for rendering APC effective with Fast-CGI (via per-user caches): http://www.brandonturner.net/blog/2009/07/fastcgi_with_php_opcode_cache/ . There are several drawbacks, however:

  1. The memory (RAM) requirements are considerable, as there is a dedicated cache for each user. (For perspective, consider that in Mod-PHP mode, all users share a single cache.)
  2. Apache must use the older module, mod_fastcgi, instead of the newer equivalent, mod_fcgid. (For details, see the article cited in the paragraph above.)
  3. The configuration is rather complex.

On a related note, you said the following in your question:

First I am using APC so PHP is in control of its own processes, not FastCGI.

Unless you're using mod_fastcgi (and not mod_fcgid), and unless you've followed steps similar to those cited a few paragraphs above, APC is consuming resources without effect. As such, you may wish to disable APC.

Summary of Solution

Take one of the following three measures:

  1. Set the PHP_FCGI_MAX_REQUESTS environment variable to zero. (Introduces potential for memory leaks in PHP scripts to grow out of control.)
  2. Set FcgidMaxRequestsPerProcess to a value less than or equal to PHP_FCGI_MAX_REQUESTS, but greater than zero.
  3. Abandon Fast-CGI in favor of a comparable alternative, such as suPHP or plain-old CGI + SuExec.

Solution 2:

I read somewhere (dealing with lighttpd, not Apache) that a PHP FastCGI process cannot handle more than 500 requests. The 501st request will bomb.

Sorry I do not have more information than that, but it's at the very least worth a shot.

tl;dr try setting PHP_FCGI_MAX_REQUESTS to 500 and seeing if the problem clears itself up.

I found the information. It applies to lighttpd, and I do not know whether it applies to Apache.

Test it and I would love to hear if this is only an issue with lighttpd, or if it is a general issue.

Why is my PHP application returning an error 500 from time to time?

"This problem seems to stem from a little-known issue with PHP: PHP stops accepting new FastCGI connections after handling 500 requests; unfortunately, there is a potential race condition during the PHP cleanup code in which PHP can be shutting down but still have the socket open, so lighty can send request number 501 to PHP and have it "accepted", but then PHP appears to simply exit, causing a 500 return from lighty.

To limit this occurance, set PHP_FCGI_MAX_REQUESTS to 500."

--http://redmine.lighttpd.net/projects/1/wiki/Docs:PerformanceFastCGI