Webserver horribly slow, sometimes incredibly fast
I am running a small community (6,000+ members) on a non-virtual 64-bit Ubuntu 11.04 system.
I am not a Linux pro, not even advanced; I just tried to set up a webserver that does nothing special, really: delivering some dynamic PHP and RoR websites is its only task. So my configuration files may look horribly bad. I might also use the wrong vocabulary, so when in doubt, please ask.
With a current all-time record of 520 registered users (board accounts, not system users) online at the same time, the average server load is about 2.0-5.0. At more typical times (~250 users) the average load is about 0.4-0.8, sometimes a bit higher during some expensive searches. Everything fine.
From time to time, however, the load climbs up to 120 (120.0, not 12.0 ;) ). During this time it is hard to even connect via SSH, and when I do reach the server and use top/htop/iotop to see what is happening, I cannot identify any process causing high CPU load.
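For reference, the next time it happens I want to check for processes stuck in uninterruptible sleep (state D), since those count towards the load average without showing any CPU usage in top. A minimal sketch using standard procps tools:

    # list D-state processes, including the kernel function they wait in (wchan)
    ps -eo state,pid,user,wchan:32,cmd | awk '$1 == "D"'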
iotop reports a current read/write speed of approx. 70 kB/s, which is close to nothing, I think.
Memory usage peaks at ~12 GB of 16 GB, so swap remains empty.
Now the odd part (at least for me):
After waiting some minutes (since I always panic a bit when this happens, it feels like 5 minutes, but I suppose it is more like 20-30 minutes), the server is back to normal and everything continues as usual.
Another odd fact:
When I run hdparm -tT /dev/sda, I get an answer like:
/dev/sda:
Timing cached reads: 7180 MB in 2.00 seconds = 3591.13 MB/sec
Timing buffered disk reads: 348 MB in 3.02 seconds = 115.41 MB/sec
When I run the same command while the server is "frozen", the answer looks like this:
/dev/sda: <- takes about 5 minutes until this line appears
Timing cached reads: 7180 MB in 2.00 seconds = 3591.13 MB/sec <- 5 more minutes
Timing buffered disk reads: 348 MB in 3.02 seconds = 115.41 MB/sec <- another 5 minutes
So the values are the same, but the reported durations are completely wrong. Prefixing the command with time also confirms that ~15 minutes actually elapsed.
I searched dmesg, /var/log/messages and /var/log/syslog - nothing found.
/var/log/errors, however, tells me this:
Jul 4 20:28:30 localhost kernel: [19080.671415] INFO: task php5-fpm:27728 blocked for more than 120 seconds.
Jul 4 20:28:30 localhost kernel: [19080.671419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
multiple times. Now, that message tells me that a php5-fpm task was blocked - but not whether that is the cause of the "freeze" or just one of its results. Anyone?
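In case it helps: assuming the magic SysRq key is enabled (kernel.sysrq = 1), I could dump the stack traces of all blocked tasks into the kernel log during the next freeze, to see where php5-fpm is actually stuck. A sketch:

    # as root: ask the kernel to log every task in uninterruptible sleep
    echo w > /proc/sysrq-trigger
    # then read the stack traces
    dmesg | tail -n 100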
To cut a long story short: I don't even know where to start analyzing. So if you can give me any advice by looking at the following specs and configs, or want to ask me for more information, I'd be glad.
Specs:
- 6-core AMD Phenom(tm) II X6 1055T processor
- 16 GB RAM
- 2x 1.5 TB Seagate ST1500DL003-9VT16L via SATA 3, in a software RAID (I suppose)

Services (according to service --status-all, those with [ + ]):
- nginx webserver 1.0.14
- MySQL server 5.1.63
- Ruby on Rails 2.3.11 (passenger-nginx-module)
- php5-fpm 5.3.6-13ubuntu3.7
- SSH
- ido2db

Further services: default crontab + nightly backup, syslog-ng.
The website consists of 2 subdomains, forum. and www., where forum. is a phpBB 3.x PHP board and www. is a Ruby on Rails 2.3.11 application (the portal).
Mini-note: sometimes I notice that the forum is pretty slow, in contrast to the portal, which is always fast (except for this "freeze"). Both share the same database, but the portal uses it read-only.
The webserver is nginx, using the Phusion Passenger module to communicate with the Ruby application. For the forum, it communicates with php5-fpm via a socket:
Relevant nginx configuration parts (with my comments/questions starting with ;):
; in case of a freeze due to too high filesystem activity, maybe add a limit?
#worker_rlimit_nofile 50000;
user www-data;
; 6 cores, so I read 6 fits. maybe already wrong?
worker_processes 6;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    passenger_root /var/lib/gems/1.8/gems/passenger-3.0.11;
    passenger_ruby /usr/bin/ruby1.8;

    ; the forum once featured a chat, which worked without websockets,
    ; so it generated a hell of a lot of polling requests (deactivated now, freeze still happening)
    keepalive_timeout 65;
    keepalive_requests 50;

    gzip on;

    server {
        listen 80;
        server_name www.domain.tld;
        root /var/www/domain/rails/public;
        passenger_enabled on;
    }

    server {
        listen 80;
        server_name forum.domain.tld;

        location / {
            root /var/www/domain/forum;
            index index.php;
        }

        ; static stuff to be handled by nginx
        location ~* ^/style/.+.(jpg|jpeg|gif|css|png|js|ico|xml)$ {
            access_log off;
            expires 30d;
            root /var/www/domain/forum/;
        }

        ; now the PHP magic, note the "backend" fastcgi_pass
        location ~ .php$ {
            fastcgi_split_path_info ^(.+\.php)(.*)$;
            fastcgi_pass backend;
            fastcgi_index index.php;
            fastcgi_param SCRIPT_FILENAME /var/www/domain/forum$fastcgi_script_name;
            include fastcgi_params;
            fastcgi_param QUERY_STRING $query_string;
            fastcgi_param REQUEST_METHOD $request_method;
            fastcgi_param CONTENT_TYPE $content_type;
            fastcgi_param CONTENT_LENGTH $content_length;
            fastcgi_intercept_errors on;
            fastcgi_ignore_client_abort off;
            fastcgi_connect_timeout 60;
            fastcgi_send_timeout 180;
            fastcgi_read_timeout 180;
            fastcgi_buffer_size 128k;
            fastcgi_buffers 256 16k;
            fastcgi_busy_buffers_size 256k;
            fastcgi_temp_file_write_size 256k;
            fastcgi_max_temp_file_size 0;
        }

        location ~ /\.ht {
            deny all;
        }
    }

    ; the php5-fpm socket. I read that /dev/shm/ would be the fastest place for this. bad idea in general?
    upstream backend {
        server unix:/dev/shm/phpfpm;
    }
    ...
}
php5-fpm settings (I raised these values higher and higher because of php5-fpm error log messages; the freeze problem was there before as well):
listen = /dev/shm/phpfpm
user = www-data
group = www-data
pm = dynamic
; holy, 4000! well, shrinking this value to an earthly level gave me
; hundreds of 502 Bad Gateway errors, while these values were quite stable.
; since there are only max. 520 users online, I don't get why I would need
; as many children as configured here. due to keep-alive, maybe?
; asking questions is easier for me, since restarting the server will make
; my community members angry ;)
pm.max_children = 4000
pm.start_servers = 100
pm.min_spare_servers = 50
pm.max_spare_servers = 150
pm.max_requests = 10
pm.status_path = /status
ping.path = /ping
ping.response = pong
slowlog = log/$pool.log.slow
; should I use rlimit?
;rlimit_files = 1024
chdir = /
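Since pm.status_path is configured anyway, I suppose I could ask the pool directly how many children are busy during a freeze. A sketch using the cgi-fcgi tool (my assumption: it is in Ubuntu's libfcgi package and not installed by default):

    # query the php-fpm status page over the unix socket
    SCRIPT_NAME=/status SCRIPT_FILENAME=/status REQUEST_METHOD=GET \
      cgi-fcgi -bind -connect /dev/shm/phpfpm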
mysql/my.cnf
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
skip-external-locking
bind-address = 127.0.0.1
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
myisam-recover = BACKUP
; high number, but lower values give some phpBB errors.
max_connections = 450
table_cache = 512
; I read twice the CPU cores, bad?
thread_concurrency = 12
join_buffer_size = 2084K
concurrent_insert = 3
query_cache_limit = 64M
query_cache_size = 512M
query_cache_type = 1
log_error = /var/log/mysql/error.log
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 2
expire_logs_days = 10
max_binlog_size = 100M
low_priority_updates = 1

[mysqldump]
quick
quote-names
max_allowed_packet = 16M

[isamchk]
key_buffer = 16M

!includedir /etc/mysql/conf.d/
I already used smartctl; the HDDs seem to be fine. /proc/mdstat says:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md3 : active raid1 sda3[1]
      1459264192 blocks [2/1] [_U]

md1 : active raid1 sda1[0]
      3911680 blocks [2/1] [U_]

unused devices:
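If I read the [2/1] and [_U]/[U_] parts correctly, each mirror is currently running on only one of its two disks. A sketch to double-check the array state (mdadm should already be installed on a software-RAID system):

    # detailed per-array status: members, failed/removed devices, sync state
    mdadm --detail /dev/md1
    mdadm --detail /dev/md3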
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127727
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127727
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
I put some questions into the configuration files above as ;-comments. These are not (intentionally) directly related to the problem, but it would be nice to know whether they are indeed questionable or done right.
One additional fact: my MySQL database is 12 GB in size.
I don't know if that matters, but mytop sometimes shows me INSERT queries running for 4-5 seconds, some even for 20-30 seconds. It is just a feeling that I am unable to prove (because I don't know how), but when I disable the database, the freeze seems not to happen.
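To turn that feeling into something I can show, a sketch of what I plan to capture the next time it happens (assuming I can still get a shell and have the MySQL root password at hand):

    # snapshot of all running queries and their state
    mysql -u root -p -e 'SHOW FULL PROCESSLIST\G' > /tmp/processlist.txt
    # table-lock contention counters (relevant for MyISAM tables)
    mysql -u root -p -e 'SHOW GLOBAL STATUS LIKE "Table_locks%"'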
Example:
I created a dummy Rails application to watch the development log. The app makes some SQL queries, reads and inserts.
Quite often, the log looked like this:
DbTest Load (0.3ms)   SELECT * FROM `db_test` WHERE (`db_test`.`id` = 31722) LIMIT 1
SQL (0.1ms)   BEGIN
DbTest Update (0.3ms)   UPDATE `db_test` SET `updated_at` = '2012-07-04 23:32:34' WHERE `id` = 31722
<- now the log stands still for 5-60 seconds
SQL (49.1ms)   COMMIT
<- the SQL update time in the log does not include the freeze time
Rendering test/index
Completed in 96ms (View: 16, DB: 59) | 200 OK [http://localhost:9000/test]
The bad part is: this mini-freeze also happens only from time to time. Note: meanwhile I cannot even upload files via scp.
Googling for my server problem, I currently feel like I am running from bad to worse and back, due to my immense lack of knowledge regarding server configuration. It still makes me wonder why these problems appear at all, since 250 concurrent users is not such a high number, right?
So my questions:
What is wrong, and how do I fix it? ;) Or:
What information can I provide to make the situation clearer?
- Can you point at a critically bad configuration line that I should read up on in the documentation?
- Are there any tools I can run to spot possible bottlenecks?
- Any further advice? (Apart from "pay someone who knows what he is doing" - it is a private project, and the server costs enough already. :))
Thanks for your time and help.
Best Regards, Daniel
P.S.: I replaced the real domain in the config files with domain.tld, since I don't want to add even a percent more load to the server until it is fixed. Might be an exaggerated precaution...
P.P.S.: if I asked a complete duplicate question, sorry - my search results seemed to be quite specific in their own way.
Edit:
I just caught some iotop values of 99.99% while the system seemed frozen. Can this be a clue?
Edit2:
Now I just noticed that this even occurs with a load of 3-5. The iotop results for the RAID/mysql processes range anywhere from 0 to 99%... hmm.
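To get numbers instead of feelings next time, I will try to keep logging disk and CPU-wait statistics in the background, so a freeze gets captured. A sketch (iostat is in the sysstat package, vmstat in procps):

    # extended per-device stats and overall iowait, sampled every 5 seconds
    iostat -xk 5 >> /tmp/iostat.log &
    vmstat 5 >> /tmp/vmstat.log &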
You've looked at all sorts of metrics, but seem to have missed the ones I'd start with: what happens to your request times during the slowdown? While you'd expect everything to be slower, are there URLs with higher levels of access leading up to the events? Do the events follow any sort of pattern in relation to time?
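nginx can record per-request timings if you add $request_time to the log format. A minimal sketch - the format name and log path here are just placeholders:

    # in the http block of nginx.conf:
    #   log_format timed '$remote_addr [$time_local] "$request" $status $request_time';
    #   access_log /var/log/nginx/access-timed.log timed;
    # then list the 20 slowest requests:
    awk '{ print $NF, $0 }' /var/log/nginx/access-timed.log | sort -rn | head -20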
You seem to have high levels of concurrency, but parts of your MySQL configuration seem to be set up for MyISAM - InnoDB might be better for this setup. However, a slow mysqld will only indirectly affect load metrics (unless the 120 waiting processes are all mysqld?). Are you running a mix of engines? If you're sticking with MyISAM, then reduce the number of threads and increase the key_buffer_size. Regardless of which engine your tables use, change your long query time to zero (at least temporarily) and start parsing those log files with mysqldumpslow.
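For example, something like this (adjust the log path to whatever your my.cnf says; note that a changed long_query_time only applies to new connections):

    # log every query for a while, without restarting mysqld
    mysql -u root -p -e 'SET GLOBAL long_query_time = 0;'
    # later: summarise the slow log, top 10 entries sorted by total time
    mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log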
I wouldn't put much faith in hdparm's benchmarks - it's a very poor substitute for tools like bonnie++ and fio - but even the latter is difficult to use to model real application traffic.
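As a rough starting point, a small random read/write fio job is far closer to database-style I/O than hdparm's sequential test. A sketch - directory, size and job count are arbitrary; run it against the RAID filesystem, not the raw device:

    fio --name=randrw --directory=/var/tmp --size=1G --bs=4k \
        --rw=randrw --direct=1 --numjobs=4 --runtime=60 \
        --time_based --group_reporting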
The last time I had random freezing like that, it was a dodgy hard drive cable causing the drive to time out and need resetting occasionally; I would have expected errors of that severity to be reported quite loudly in dmesg though :S