Webserver horribly slow, sometimes incredibly fast
I am running a small community (6,000+ members) on a non-virtual 64-bit Ubuntu 11.04 system.
I am not a Linux pro, not even advanced; I just tried to set up a webserver that does nothing special, really: delivering some dynamic PHP and RoR websites is its only task. So my configuration files may look horribly bad. I might also use the wrong vocabulary, so when in doubt, please ask.
With a current all-time record of 520 registered users (board accounts, not system users) online at the same time, the average server load is about 2.0-5.0. At more typical times (~250 users) the average load is about 0.4-0.8, sometimes a bit higher during some expensive searches. Everything fine.
From time to time, however, the load climbs up to 120 (120.0, not 12.0 ;) ). During this time it is hard to even connect via SSH, and when I do reach the server and use top/htop/iotop to see what is happening, I cannot identify any process causing high CPU load.
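For reference, the next time it happens I want to check for processes stuck in uninterruptible sleep (state D), since those count towards the load average without showing any CPU usage in top. A minimal sketch using standard procps tools:

    # list D-state processes, including the kernel function they wait in (wchan)
    ps -eo state,pid,user,wchan:32,cmd | awk '$1 == "D"'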
iotop reports a current read/write speed of approx. 70 kB/s, which is close to nothing, I think.
Memory usage peaks at ~12 GB of 16 GB, so swap remains empty.
Now the odd part (at least for me):
After waiting some minutes (since I always panic a bit when this happens, it feels like 5 minutes, but I suppose it is more like 20-30 minutes), the server is back to normal and everything continues as usual.
Another odd fact:
When I run hdparm -tT /dev/sda, I get an answer like:
/dev/sda:
Timing cached reads: 7180 MB in 2.00 seconds = 3591.13 MB/sec
Timing buffered disk reads: 348 MB in 3.02 seconds = 115.41 MB/sec
When I run the same command while the server is "frozen", the answer looks like this:
/dev/sda: <- takes about 5 minutes until this line appears
Timing cached reads: 7180 MB in 2.00 seconds = 3591.13 MB/sec <- 5 more minutes
Timing buffered disk reads: 348 MB in 3.02 seconds = 115.41 MB/sec <- another 5 minutes
So the values are the same, but the reported durations are completely wrong. Prefixing the command with time also confirms that ~15 minutes actually elapsed.
I searched dmesg, /var/log/messages and /var/log/syslog - nothing found.
/var/log/errors, however, tells me this:
Jul 4 20:28:30 localhost kernel: [19080.671415] INFO: task php5-fpm:27728 blocked for more than 120 seconds.
Jul 4 20:28:30 localhost kernel: [19080.671419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
multiple times. Now, that message tells me that a php5-fpm task was blocked - but not whether that is the cause of the "freeze" or just one of its results. Anyone?
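In case it helps: assuming the magic SysRq key is enabled (kernel.sysrq = 1), I could dump the stack traces of all blocked tasks into the kernel log during the next freeze, to see where php5-fpm is actually stuck. A sketch:

    # as root: ask the kernel to log every task in uninterruptible sleep
    echo w > /proc/sysrq-trigger
    # then read the stack traces
    dmesg | tail -n 100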
To cut a long story short: I don't even know where to start analyzing. So if you can give me any advice by looking at the following specs and configs, or want to ask me for more information, I'd be glad.
Specs:
- 6-core AMD Phenom(tm) II X6 1055T processor
- 16 GB RAM
- 2x 1.5 TB Seagate ST1500DL003-9VT16L via SATA 3, in a software RAID (I suppose)

Services (according to service --status-all, those with [ + ]):
- nginx webserver 1.0.14
- MySQL server 5.1.63
- Ruby on Rails 2.3.11 (passenger-nginx-module)
- php5-fpm 5.3.6-13ubuntu3.7
- SSH
- ido2db

Further services: default crontab + nightly backup, syslog-ng.
The website consists of 2 subdomains, forum. and www., where forum. is a phpBB 3.x PHP board and www. is a Ruby on Rails 2.3.11 application (the portal).
Mini-note: sometimes I notice that the forum is pretty slow, in contrast to the portal, which is always fast (except for this "freeze"). Both share the same database, but the portal uses it read-only.
The webserver is nginx, using the Phusion Passenger module to communicate with the Ruby application. For the forum, it communicates with php5-fpm via a socket:
Relevant nginx configuration parts (with my comments/questions starting with ;):
; in case of a freeze due to too high filesystem activity, maybe add a limit?
#worker_rlimit_nofile 50000;
user www-data;
; 6 cores, so I read 6 fits. maybe already wrong?
worker_processes 6;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    passenger_root /var/lib/gems/1.8/gems/passenger-3.0.11;
    passenger_ruby /usr/bin/ruby1.8;

    ; the forum once featured a chat, which worked without websockets,
    ; so it generated a hell of a lot of polling requests (deactivated now, freeze still happening)
    keepalive_timeout 65;
    keepalive_requests 50;

    gzip on;

    server {
        listen 80;
        server_name www.domain.tld;
        root /var/www/domain/rails/public;
        passenger_enabled on;
    }

    server {
        listen 80;
        server_name forum.domain.tld;

        location / {
            root /var/www/domain/forum;
            index index.php;
        }

        ; static stuff to be handled by nginx
        location ~* ^/style/.+.(jpg|jpeg|gif|css|png|js|ico|xml)$ {
            access_log off;
            expires 30d;
            root /var/www/domain/forum/;
        }

        ; now the PHP magic, note the "backend" fastcgi_pass
        location ~ .php$ {
            fastcgi_split_path_info ^(.+\.php)(.*)$;
            fastcgi_pass backend;
            fastcgi_index index.php;
            fastcgi_param SCRIPT_FILENAME /var/www/domain/forum$fastcgi_script_name;
            include fastcgi_params;
            fastcgi_param QUERY_STRING $query_string;
            fastcgi_param REQUEST_METHOD $request_method;
            fastcgi_param CONTENT_TYPE $content_type;
            fastcgi_param CONTENT_LENGTH $content_length;
            fastcgi_intercept_errors on;
            fastcgi_ignore_client_abort off;
            fastcgi_connect_timeout 60;
            fastcgi_send_timeout 180;
            fastcgi_read_timeout 180;
            fastcgi_buffer_size 128k;
            fastcgi_buffers 256 16k;
            fastcgi_busy_buffers_size 256k;
            fastcgi_temp_file_write_size 256k;
            fastcgi_max_temp_file_size 0;
        }

        location ~ /\.ht {
            deny all;
        }
    }

    ; the php5-fpm socket. I read that /dev/shm/ would be the fastest place for this. bad idea in general?
    upstream backend {
        server unix:/dev/shm/phpfpm;
    }
    ...
}
php5-fpm settings (I raised these values higher and higher because of php5-fpm error log messages; the freeze problem was there before as well):
listen = /dev/shm/phpfpm
user = www-data
group = www-data
pm = dynamic
; holy, 4000! well, shrinking this value to an earthly level gave me
; hundreds of 502 Bad Gateway errors, while these values were quite stable.
; since there are only max. 520 users online, I don't get why I would need
; as many children as configured here. due to keep-alive, maybe?
; asking questions is easier for me, since restarting the server will make
; my community members angry ;)
pm.max_children = 4000
pm.start_servers = 100
pm.min_spare_servers = 50
pm.max_spare_servers = 150
pm.max_requests = 10
pm.status_path = /status
ping.path = /ping
ping.response = pong
slowlog = log/$pool.log.slow
; should I use rlimit?
;rlimit_files = 1024
chdir = /
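Since pm.status_path is configured anyway, I suppose I could ask the pool directly how many children are busy during a freeze. A sketch using the cgi-fcgi tool (my assumption: it is in Ubuntu's libfcgi package and not installed by default):

    # query the php-fpm status page over the unix socket
    SCRIPT_NAME=/status SCRIPT_FILENAME=/status REQUEST_METHOD=GET \
      cgi-fcgi -bind -connect /dev/shm/phpfpm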
mysql/my.cnf
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
skip-external-locking
bind-address = 127.0.0.1
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
myisam-recover = BACKUP
; high number, but lower values give some phpBB errors.
max_connections = 450
table_cache = 512
; I read twice the CPU cores, bad?
thread_concurrency = 12
join_buffer_size = 2084K
concurrent_insert = 3
query_cache_limit = 64M
query_cache_size = 512M
query_cache_type = 1
log_error = /var/log/mysql/error.log
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 2
expire_logs_days = 10
max_binlog_size = 100M
low_priority_updates = 1

[mysqldump]
quick
quote-names
max_allowed_packet = 16M

[isamchk]
key_buffer = 16M

!includedir /etc/mysql/conf.d/
I already used smartctl; the HDDs seem to be fine. /proc/mdstat says:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md3 : active raid1 sda3[1]
      1459264192 blocks [2/1] [_U]

md1 : active raid1 sda1[0]
      3911680 blocks [2/1] [U_]

unused devices:
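If I read the [2/1] and [_U]/[U_] parts correctly, each mirror is currently running on only one of its two disks. A sketch to double-check the array state (mdadm should already be installed on a software-RAID system):

    # detailed per-array status: members, failed/removed devices, sync state
    mdadm --detail /dev/md1
    mdadm --detail /dev/md3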
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127727
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127727
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
I put some questions into the configuration files above as ;-comments. These are not (intentionally) directly related to the problem, but it would be nice to know whether they are indeed questionable or done right.
One additional fact: my MySQL database is 12 GB in size.
I don't know if that matters, but mytop sometimes shows me INSERT queries running for 4-5 seconds, some even for 20-30 seconds. It is just a feeling that I am unable to prove (because I don't know how), but when I disable the database, the freeze seems not to happen.
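To turn that feeling into something I can show, a sketch of what I plan to capture the next time it happens (assuming I can still get a shell and have the MySQL root password at hand):

    # snapshot of all running queries and their state
    mysql -u root -p -e 'SHOW FULL PROCESSLIST\G' > /tmp/processlist.txt
    # table-lock contention counters (relevant for MyISAM tables)
    mysql -u root -p -e 'SHOW GLOBAL STATUS LIKE "Table_locks%"'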
Example:
I created a dummy Rails application to watch the development log. The app makes some SQL queries, reads and inserts.
Quite often, the log looked like this:
DbTest Load (0.3ms)   SELECT * FROM `db_test` WHERE (`db_test`.`id` = 31722) LIMIT 1
SQL (0.1ms)   BEGIN
DbTest Update (0.3ms)   UPDATE `db_test` SET `updated_at` = '2012-07-04 23:32:34' WHERE `id` = 31722
<- now the log stands still for 5-60 seconds
SQL (49.1ms)   COMMIT
<- the SQL update time in the log does not include the freeze time
Rendering test/index
Completed in 96ms (View: 16, DB: 59) | 200 OK [http://localhost:9000/test]
The bad part is: this mini-freeze also happens only from time to time. Note: meanwhile I cannot even upload files via scp.
Googling for my server problem, I currently feel like I am running from bad to worse and back, due to my immense lack of knowledge regarding server configuration. It still makes me wonder why these problems appear at all, since 250 concurrent users is not such a high number, right?
So my questions:
What is wrong, and how do I fix it? ;) Or:
What information can I provide to make the situation clearer?
- Can you point at a critically bad configuration line that I should read up on in the documentation?
- Are there any tools I can run to spot possible bottlenecks?
- Any further advice? (Apart from "pay someone who knows what he is doing" - it is a private project, and the server costs enough already. :))
Thanks for your time and help.
Best Regards, Daniel
P.S.: I replaced the real domain in the config files with domain.tld, since I don't want to add even a percent more load to the server until it is fixed. Might be an exaggerated precaution...
P.P.S.: if I asked a complete duplicate question, sorry - my search results seemed to be quite specific in their own way.
Edit:
I just caught some iotop values of 99.99% while the system seemed frozen. Can this be a clue?
Edit2:
Now I just noticed that this even occurs with a load of 3-5. The iotop results for the RAID/mysql processes range anywhere from 0 to 99%... hmm.
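To get numbers instead of feelings next time, I will try to keep logging disk and CPU-wait statistics in the background, so a freeze gets captured. A sketch (iostat is in the sysstat package, vmstat in procps):

    # extended per-device stats and overall iowait, sampled every 5 seconds
    iostat -xk 5 >> /tmp/iostat.log &
    vmstat 5 >> /tmp/vmstat.log &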
You've looked at all sorts of metrics, but seem to have missed the ones I'd start with: what happens to your request times during the slowdown? While you'd expect everything to be slower, are there URLs with higher levels of access leading up to the events? Do the events follow any sort of pattern in relation to time?
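nginx can record per-request timings if you add $request_time to the log format. A minimal sketch - the format name and log path here are just placeholders:

    # in the http block of nginx.conf:
    #   log_format timed '$remote_addr [$time_local] "$request" $status $request_time';
    #   access_log /var/log/nginx/access-timed.log timed;
    # then list the 20 slowest requests:
    awk '{ print $NF, $0 }' /var/log/nginx/access-timed.log | sort -rn | head -20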
You seem to have high levels of concurrency, but parts of your MySQL configuration seem to be set up for MyISAM - InnoDB might be better for this setup. However, a slow mysqld will only indirectly affect load metrics (unless the 120 waiting processes are all mysqld?). Are you running a mix of engines? If you're sticking with MyISAM, then reduce the number of threads and increase the key_buffer_size. Regardless of which engine your tables use, change your long query time to zero (at least temporarily) and start parsing those log files with mysqldumpslow.
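For example, something like this (adjust the log path to whatever your my.cnf says; note that a changed long_query_time only applies to new connections):

    # log every query for a while, without restarting mysqld
    mysql -u root -p -e 'SET GLOBAL long_query_time = 0;'
    # later: summarise the slow log, top 10 entries sorted by total time
    mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log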
I wouldn't put much faith in hdparm's benchmarks - it's a very poor substitute for tools like bonnie++ and fio - but even the latter is difficult to use to model real application traffic.
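As a rough starting point, a small random read/write fio job is far closer to database-style I/O than hdparm's sequential test. A sketch - directory, size and job count are arbitrary; run it against the RAID filesystem, not the raw device:

    fio --name=randrw --directory=/var/tmp --size=1G --bs=4k \
        --rw=randrw --direct=1 --numjobs=4 --runtime=60 \
        --time_based --group_reporting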
The last time I had random freezing like that, it was a dodgy hard drive cable causing the drive to time out and need resetting occasionally; I would have expected errors of that severity to be reported quite loudly in dmesg though :S