How to troubleshoot performance issues of PHP, MySQL and generic I/O
I have a WordPress based website running on a shared hosting. Its response time is very decent (around 2s to retrieve the HTML page and 5s to load all the resources).
I was planning to move it to a dedicated virtual server (Ubuntu 12.04 LTS), which should theoretically improve things and make them more consistent given its not shared. However I observed severe performance degredation, with the page taking 10seconds to be generated.
I ruled out network issues by editing /etc/hosts
on the server and mapping the domain to 127.0.0.1
. I used the Apache load tester ab
to get the HTML, so JS, CSS and images are all excluded. It still took 10 seconds.
I have Zpanel installed on the server which also uses MySQL, and its pages come up quite fast (1.5s) and also phpMyAdmin. Performing some queries on the wordpress database directly through phpMyAdmin returns them quite fast too, with query times in the 10 to 30 millisecond region.
Memory is also sufficient, with only 800Mb being used of the 1Gb physical memory available, so it doesn't seem to be a swap issue either. I have also installed APC to try to improve the PHP performance, but it didn't have any effect.
What else should I look for? What could be causing this degradation in performance? Could it be some kind of I/O issue since I am running on a cloud based virtual server?
I wish to be able to raise the issue with my provider but without showing actual data from some diagnosis I am afraid he will just blame my application.
UPDATE with sar
output (every second) when I did an HTTP request:
02:31:29 CPU %user %nice %system %iowait %steal %idle
02:31:30 all 0.00 0.00 0.00 0.00 0.00 100.00
02:31:31 all 2.22 0.00 2.22 0.00 0.00 95.56
02:31:32 all 41.67 0.00 6.25 0.00 2.08 50.00
02:31:33 all 86.36 0.00 13.64 0.00 0.00 0.00
02:31:34 all 75.00 0.00 25.00 0.00 0.00 0.00
02:31:35 all 93.18 0.00 6.82 0.00 0.00 0.00
02:31:36 all 90.70 0.00 9.30 0.00 0.00 0.00
02:31:37 all 71.05 0.00 0.00 0.00 0.00 28.95
02:31:38 all 14.89 0.00 10.64 0.00 2.13 72.34
02:31:39 all 2.56 0.00 0.00 0.00 0.00 97.44
02:31:40 all 0.00 0.00 0.00 0.00 0.00 100.00
02:31:41 all 0.00 0.00 0.00 0.00 0.00 100.00
UPDATE 2 After josten's suggestions.
I/O:
iotop
fails with OSError: Netlink error: No such file or directory (2)
and sar -d
also fails with Requested activities not available in file /var/log/sysstat/sa14
. I think this is because this is a virtual machine, just like iostat
also fails. Could it be the reason why %iowait
reported by sar 1 10
is always 0%?
CPU Load:
The process that is topping the CPU% in htop
is actually apache2
. I was expecting this to maybe be the database, but its not. It goes up to 94% for a few seconds when I do a fresh HTTP request. Seems this is the culprit.
I have done an strace -f -t
and one summary strace -c -f
. There seems to be an awful lot of lstat
calls (57786), with 2455 resulting in errors. No idea if this is normal.
Other than that the topmost call was wait4
which I presume is normal (its just waiting), and munmap
. Top 5 below.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
51.06 0.124742 897 139 6 wait4
14.90 0.036388 1 57786 2455 lstat
9.67 0.023622 13 1857 munmap
7.69 0.018790 37 514 brk
6.70 0.016361 481 34 clone
2.87 0.006999 74 94 12 select
strace
itself slowed down apache by a factor of 2. I am trying to understand the long trace now to see if there is anything indicative of what was causing the CPU to spike for a few seconds.
What is the typical time for lstat
for a good performing server? I wish to gather some information so that I can complain in a constructive manner to the provider if it is the storage access fault.
UPDATE Output of fio
random read test:
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.59
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r] [100.0% done] [12185K/0K /s] [2975 /0 iops] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=24264
read : io=131072KB, bw=10298KB/s, iops=2574 , runt= 12728msec
clat (usec): min=119 , max=162219 , avg=380.34, stdev=957.37
lat (usec): min=119 , max=162219 , avg=380.89, stdev=957.40
bw (KB/s) : min= 7200, max=13424, per=99.89%, avg=10285.72, stdev=1608.68
cpu : usr=2.80%, sys=18.65%, ctx=33511, majf=0, minf=23
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=32768/0/0, short=0/0/0
lat (usec): 250=45.57%, 500=37.17%, 750=3.41%, 1000=7.83%
lat (msec): 2=5.67%, 4=0.27%, 10=0.08%, 20=0.01%, 250=0.01%
Run status group 0 (all jobs):
READ: io=131072KB, aggrb=10297KB/s, minb=10545KB/s, maxb=10545KB/s, mint=12728msec, maxt=12728msec
The only hint I have now is that the CPU line of the fio
output seems to show quite a bit of activity when compared to other systems. I ran it on my local Ubuntu machine and the output was:
cpu : usr=0.19%, sys=0.59%, ctx=32923, majf=0, minf=23
The usr
percentage seems to be a small fraction of what is being reported on my server.
UPDATE Re PHP APC. Yes it is installed. Output from phpinfo:
APC Support enabled
Version 3.1.7
APC Debugging Disabled
MMAP Support Enabled
MMAP File Mask no value
Locking type pthread mutex Locks
Serialization Support php
Revision $Revision: 307215 $
Build Date May 2 2011 19:00:42
Is there any specific setting I should check for? These are the settings I have (local value, Master value):
apc.cache_by_default On On
apc.canonicalize On On
apc.coredump_unmap Off Off
apc.enable_cli Off Off
apc.enabled On On
apc.file_md5 Off Off
apc.file_update_protection 2 2
apc.filters no value no value
apc.gc_ttl 3600 3600
apc.include_once_override Off Off
apc.lazy_classes Off Off
apc.lazy_functions Off Off
apc.max_file_size 1M 1M
apc.mmap_file_mask no value no value
apc.num_files_hint 1000 1000
apc.preload_path no value no value
apc.report_autofilter Off Off
apc.rfc1867 Off Off
apc.rfc1867_freq 0 0
apc.rfc1867_name APC_UPLOAD_PROGRESS APC_UPLOAD_PROGRESS
apc.rfc1867_prefix upload_ upload_
apc.rfc1867_ttl 3600 3600
apc.serializer default default
apc.shm_segments 1 1
apc.shm_size 32M 32M
apc.slam_defense On On
apc.stat On On
apc.stat_ctime Off Off
apc.ttl 0 0
apc.use_request_time On On
apc.user_entries_hint 4096 4096
apc.user_ttl 0 0
apc.write_lock On On
UPDATE Increased apc.shm_size
to 96M. Cache full count is now 0, and there are 96.5% hits to the cache after a few refreshes of the website here and there. APC memory usage is 25.4MB free.
It seems to have reduced the loading time by 3 seconds or so, now down to around 4 to 5 seconds if I do a pure wget
from the server itself without getting any images etc. Still more than twice slower than the other hosting but definitely was an improvement.
I am still finding it strange why it was taking so long to render those pages when the server is totally idle (I don't have APC installed on my development PC and it doesn't have that kind of behaviour). And its still strange where those extra remaining seconds are being wasted.
This looks like other cases I've seen where Apache is spending a lot of its time compiling PHP. Have you made sure an opcode cache (e.g. APC) is installed? It'll show as a loaded module in the output of phpinfo()
, if that helps. Otherwise, to track what Apache's doing within mod_php, your best bet is going to be XHProf.
To anyone other than jbx arriving here via Google: the other answers are excellent, by the way. Go read them. But those answers, and jbx's responses to them, have helped me arrive at this conclusion.
You have to first determine what the issue is first; if it's PHP, MySQL, I/O, load, memory, CPU, kernel, etc. sar
logs the systems metrics; you'll have to catch it in the act. You can configure atop
to do process accounting which definitely helps.
To determine if it's I/O
Use tools such as iotop
and atop
to see what the disk usage is; these tools will also tell you what is causing the IO. Generally, if iowait is sustained over 10% this could be the issue.
sar
logs disk IO; so you can run sar -d
to see it (look at %util
column).
To determine if it's load
Use tools such as htop
, top
, uptime
; again tie this to the process running and find out more details about what the process is doing. Note that this reports the load on the scheduler; it doesn't reflect the CPU usage.
To determine if it's a CPU
sar
once again comes in to save the day; you can see this information with sar -P ALL
. You can also use mpstat -P ALL
for real-time data. Generally, the CPU is only an issue if all the CPU's are at 100%; 80%+ means they're being utilized (but not necessarily saturated).
To determine if it's the Memory (VM)
You'll want to use vmstat
; vmstat -S M 1
and observe the swap
, io
, and system
columns. Obviously a high amount of swapping can impact performance. There is also the system
section; a high amount of interrupts can also do the same.
To determine if it's interrupts
You can use vmstat -S M 1
. Unfortunately, it's hard to tell if interrupts are the issue if your system doesn't have a baseline on what is normal. A high amount of interrupts (which are caused from hardware requiring action from the kernel) will bring the system to a crawl. Failing NIC's are notorious for doing this.
To determine if it's the kernel
This is trickier but generally requires strace
, perf
, or sysdig
tools. One such tool is perf top
. strace
with a summary (-c
) is nice but it doesn't break it down relative to the system resources (so the data that is provided is only speculation); it's ideal to use perf top
to come to the conclusion that it's the kernel. You can also use stap
(SystemTap) if your machine supports it. I should also note that strace
will impact performance; you should use sysdig
if the system is at all important.
To determine if it's MySQL/PHP
You basically have to follow what I posted above (perf
for example can provide information on what command is causing high kernel time, iotop
, atop
, htop
can provide information relative to system resources on what is using them); basically, you're using the above tools to determine what is causing the load.
Once you've determined it to be MySQL
It could be a query that you're running (so you'll want to use EXPLAIN
on that query in MySQL). You'll also need to make sure that your database is optimized and that the queries you're executing are optimized. You'll also have to make sure that the table engine you're using is ideal for what you're doing (I've seen many large tables that MyISAM when they should be InnoDB). If you've determined that none of the above are the issue and still suspect MySQL you may want to archive data in the affected tables to reduce access (table scans) to that table. You may also want to verify constraint consistency, enable cache buffering, and ensure indexes are optimal.
A good tool to help in this process is mytop
; but all the information that mytop
provides is easily accessible in the mysql
client. Some useful statements to run:
-
SHOW FULL PROCESSLIST\G
to get a complete list of currently executing SQL statements as well as their status to the server. -
SHOW ENGINE INNODB STATUS\G
(InnoDB only) -
EXPLAIN EXTENDED <QUERY>
to explain a query that you see MySQL executing. -
SHOW GLOBAL STATUS\G
for a server-wide status
Once you've determined it to be PHP
You can use tools to profile your PHP code (such as xdebug
) and then open the generated profile in KCacheGrind
to see a performance analysis of the profiled PHP code.
If you find it's none of these you may just need to upgrade your server.
Look over the answer I gave to another question similar to this one for clues.
The thing is, if other pages outside of WordPress area are loading fine, but WordPress itself is choked, three things come to mind outside of the generic things I recommend.
- When you migrated your WordPress code to the new setup, did you make sure to correctly set all paths on the file system in
wp-config.php
? The reason being that sometimes WordPress can work despite incorrect paths if those are set in the MySQL DB for WordPress options. Making sure they are inwp-config.php
forces WordPress to use the correct directories & ensures temp & cache folders work as expected. - Database slowdown? That is the only other thing I can thing of that would be idiosyncratic to WordPress yet allow other pages to load. Are you sure your MySQL
my.cnf
is working as it should for the DB needs of your site? - Do you have a plugin or setting in your WordPress code that enables Gzip compression? In general Gzip compression should happen on the server side via Apache or Ngnix since they can hand Gzip compression more efficiently than PHP code. So if you have caching enabled in WordPress, disable it since PHP (which is what WordPress uses) is not great at Gzip compression.
In general I have setup tons of CMS sites—more recently WordPress—on cloud servers without issue. a 10 second page load is not a symptom of the cloud host being inadequate. I would recommend looking over the stuff I recommend here & in my other answer. And I would also recommend debugging by doing a clean WordPress install on the setup that is an issue & see how that reacts. If that works well in comparison to your full site, then it is clear there is some configuration issue in your sites specific code.
EDIT: Here is another idea. Do you have Apache authorization (htaccess) anywhere in your setup? Do you have it set to allow from localhost
? See below. Sometimes this setup works, but if Allow from localhost
is the first in the list of Allow
’s or the only item in the list of Allow
’s it can choke from reverse DNS weirdness. I would recommend trying to disable that—if you can—and see how quickly the site loads in comparison to it being enabled.
Order Deny,Allow
Deny from all
Allow from 127.0.0.1 ::1
Allow from localhost