How to troubleshoot performance issues of PHP, MySQL and generic I/O

I have a WordPress based website running on a shared hosting. Its response time is very decent (around 2s to retrieve the HTML page and 5s to load all the resources).

I was planning to move it to a dedicated virtual server (Ubuntu 12.04 LTS), which should theoretically improve things and make them more consistent given its not shared. However I observed severe performance degredation, with the page taking 10seconds to be generated.

I ruled out network issues by editing /etc/hosts on the server and mapping the domain to 127.0.0.1. I used the Apache load tester ab to get the HTML, so JS, CSS and images are all excluded. It still took 10 seconds.

I have Zpanel installed on the server which also uses MySQL, and its pages come up quite fast (1.5s) and also phpMyAdmin. Performing some queries on the wordpress database directly through phpMyAdmin returns them quite fast too, with query times in the 10 to 30 millisecond region.

Memory is also sufficient, with only 800Mb being used of the 1Gb physical memory available, so it doesn't seem to be a swap issue either. I have also installed APC to try to improve the PHP performance, but it didn't have any effect.

What else should I look for? What could be causing this degradation in performance? Could it be some kind of I/O issue since I am running on a cloud based virtual server?

I wish to be able to raise the issue with my provider but without showing actual data from some diagnosis I am afraid he will just blame my application.

UPDATE with sar output (every second) when I did an HTTP request:

02:31:29        CPU     %user     %nice   %system   %iowait    %steal     %idle
02:31:30        all      0.00      0.00      0.00      0.00      0.00    100.00
02:31:31        all      2.22      0.00      2.22      0.00      0.00     95.56
02:31:32        all     41.67      0.00      6.25      0.00      2.08     50.00
02:31:33        all     86.36      0.00     13.64      0.00      0.00      0.00
02:31:34        all     75.00      0.00     25.00      0.00      0.00      0.00
02:31:35        all     93.18      0.00      6.82      0.00      0.00      0.00
02:31:36        all     90.70      0.00      9.30      0.00      0.00      0.00
02:31:37        all     71.05      0.00      0.00      0.00      0.00     28.95
02:31:38        all     14.89      0.00     10.64      0.00      2.13     72.34
02:31:39        all      2.56      0.00      0.00      0.00      0.00     97.44
02:31:40        all      0.00      0.00      0.00      0.00      0.00    100.00
02:31:41        all      0.00      0.00      0.00      0.00      0.00    100.00

UPDATE 2 After josten's suggestions.

I/O:

iotop fails with OSError: Netlink error: No such file or directory (2) and sar -d also fails with Requested activities not available in file /var/log/sysstat/sa14. I think this is because this is a virtual machine, just like iostat also fails. Could it be the reason why %iowait reported by sar 1 10 is always 0%?

CPU Load:

The process that is topping the CPU% in htop is actually apache2. I was expecting this to maybe be the database, but its not. It goes up to 94% for a few seconds when I do a fresh HTTP request. Seems this is the culprit.

I have done an strace -f -t and one summary strace -c -f. There seems to be an awful lot of lstat calls (57786), with 2455 resulting in errors. No idea if this is normal. Other than that the topmost call was wait4 which I presume is normal (its just waiting), and munmap. Top 5 below.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.06    0.124742         897       139         6 wait4
 14.90    0.036388           1     57786      2455 lstat
  9.67    0.023622          13      1857           munmap
  7.69    0.018790          37       514           brk
  6.70    0.016361         481        34           clone
  2.87    0.006999          74        94        12 select

strace itself slowed down apache by a factor of 2. I am trying to understand the long trace now to see if there is anything indicative of what was causing the CPU to spike for a few seconds.

What is the typical time for lstat for a good performing server? I wish to gather some information so that I can complain in a constructive manner to the provider if it is the storage access fault.

UPDATE Output of fio random read test:

random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.59
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r] [100.0% done] [12185K/0K /s] [2975 /0  iops] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=24264
  read : io=131072KB, bw=10298KB/s, iops=2574 , runt= 12728msec
    clat (usec): min=119 , max=162219 , avg=380.34, stdev=957.37
     lat (usec): min=119 , max=162219 , avg=380.89, stdev=957.40
    bw (KB/s) : min= 7200, max=13424, per=99.89%, avg=10285.72, stdev=1608.68
  cpu          : usr=2.80%, sys=18.65%, ctx=33511, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=32768/0/0, short=0/0/0
     lat (usec): 250=45.57%, 500=37.17%, 750=3.41%, 1000=7.83%
     lat (msec): 2=5.67%, 4=0.27%, 10=0.08%, 20=0.01%, 250=0.01%

Run status group 0 (all jobs):
   READ: io=131072KB, aggrb=10297KB/s, minb=10545KB/s, maxb=10545KB/s, mint=12728msec, maxt=12728msec

The only hint I have now is that the CPU line of the fio output seems to show quite a bit of activity when compared to other systems. I ran it on my local Ubuntu machine and the output was:

cpu          : usr=0.19%, sys=0.59%, ctx=32923, majf=0, minf=23

The usr percentage seems to be a small fraction of what is being reported on my server.

UPDATE Re PHP APC. Yes it is installed. Output from phpinfo:

APC Support enabled
Version 3.1.7
APC Debugging   Disabled
MMAP Support    Enabled
MMAP File Mask  no value
Locking type    pthread mutex Locks
Serialization Support   php
Revision    $Revision: 307215 $
Build Date  May 2 2011 19:00:42

Is there any specific setting I should check for? These are the settings I have (local value, Master value):

apc.cache_by_default    On  On
apc.canonicalize    On  On
apc.coredump_unmap  Off Off
apc.enable_cli  Off Off
apc.enabled On  On
apc.file_md5    Off Off
apc.file_update_protection  2   2
apc.filters no value    no value
apc.gc_ttl  3600    3600
apc.include_once_override   Off Off
apc.lazy_classes    Off Off
apc.lazy_functions  Off Off
apc.max_file_size   1M  1M
apc.mmap_file_mask  no value    no value
apc.num_files_hint  1000    1000
apc.preload_path    no value    no value
apc.report_autofilter   Off Off
apc.rfc1867 Off Off
apc.rfc1867_freq    0   0
apc.rfc1867_name    APC_UPLOAD_PROGRESS APC_UPLOAD_PROGRESS
apc.rfc1867_prefix  upload_ upload_
apc.rfc1867_ttl 3600    3600
apc.serializer  default default
apc.shm_segments    1   1
apc.shm_size    32M 32M
apc.slam_defense    On  On
apc.stat    On  On
apc.stat_ctime  Off Off
apc.ttl 0   0
apc.use_request_time    On  On
apc.user_entries_hint   4096    4096
apc.user_ttl    0   0
apc.write_lock  On  On

UPDATE Increased apc.shm_size to 96M. Cache full count is now 0, and there are 96.5% hits to the cache after a few refreshes of the website here and there. APC memory usage is 25.4MB free.

It seems to have reduced the loading time by 3 seconds or so, now down to around 4 to 5 seconds if I do a pure wget from the server itself without getting any images etc. Still more than twice slower than the other hosting but definitely was an improvement.

I am still finding it strange why it was taking so long to render those pages when the server is totally idle (I don't have APC installed on my development PC and it doesn't have that kind of behaviour). And its still strange where those extra remaining seconds are being wasted.

This looks like other cases I've seen where Apache is spending a lot of its time compiling PHP. Have you made sure an opcode cache (e.g. APC) is installed? It'll show as a loaded module in the output of phpinfo(), if that helps. Otherwise, to track what Apache's doing within mod_php, your best bet is going to be XHProf.

To anyone other than jbx arriving here via Google: the other answers are excellent, by the way. Go read them. But those answers, and jbx's responses to them, have helped me arrive at this conclusion.

You have to first determine what the issue is first; if it's PHP, MySQL, I/O, load, memory, CPU, kernel, etc. sar logs the systems metrics; you'll have to catch it in the act. You can configure atop to do process accounting which definitely helps.

To determine if it's I/O

Use tools such as iotop and atop to see what the disk usage is; these tools will also tell you what is causing the IO. Generally, if iowait is sustained over 10% this could be the issue.

sar logs disk IO; so you can run sar -d to see it (look at %util column).

To determine if it's load

Use tools such as htop, top, uptime; again tie this to the process running and find out more details about what the process is doing. Note that this reports the load on the scheduler; it doesn't reflect the CPU usage.

To determine if it's a CPU

sar once again comes in to save the day; you can see this information with sar -P ALL. You can also use mpstat -P ALL for real-time data. Generally, the CPU is only an issue if all the CPU's are at 100%; 80%+ means they're being utilized (but not necessarily saturated).

To determine if it's the Memory (VM)

You'll want to use vmstat; vmstat -S M 1 and observe the swap, io, and system columns. Obviously a high amount of swapping can impact performance. There is also the system section; a high amount of interrupts can also do the same.

To determine if it's interrupts

You can use vmstat -S M 1. Unfortunately, it's hard to tell if interrupts are the issue if your system doesn't have a baseline on what is normal. A high amount of interrupts (which are caused from hardware requiring action from the kernel) will bring the system to a crawl. Failing NIC's are notorious for doing this.

To determine if it's the kernel

This is trickier but generally requires strace, perf, or sysdig tools. One such tool is perf top. strace with a summary (-c) is nice but it doesn't break it down relative to the system resources (so the data that is provided is only speculation); it's ideal to use perf top to come to the conclusion that it's the kernel. You can also use stap (SystemTap) if your machine supports it. I should also note that strace will impact performance; you should use sysdig if the system is at all important.

To determine if it's MySQL/PHP

You basically have to follow what I posted above (perf for example can provide information on what command is causing high kernel time, iotop, atop, htop can provide information relative to system resources on what is using them); basically, you're using the above tools to determine what is causing the load.

Once you've determined it to be MySQL

It could be a query that you're running (so you'll want to use EXPLAIN on that query in MySQL). You'll also need to make sure that your database is optimized and that the queries you're executing are optimized. You'll also have to make sure that the table engine you're using is ideal for what you're doing (I've seen many large tables that MyISAM when they should be InnoDB). If you've determined that none of the above are the issue and still suspect MySQL you may want to archive data in the affected tables to reduce access (table scans) to that table. You may also want to verify constraint consistency, enable cache buffering, and ensure indexes are optimal.

A good tool to help in this process is mytop; but all the information that mytop provides is easily accessible in the mysql client. Some useful statements to run:

SHOW FULL PROCESSLIST\G to get a complete list of currently executing SQL statements as well as their status to the server.
SHOW ENGINE INNODB STATUS\G (InnoDB only)
EXPLAIN EXTENDED <QUERY> to explain a query that you see MySQL executing.
SHOW GLOBAL STATUS\G for a server-wide status

Once you've determined it to be PHP

You can use tools to profile your PHP code (such as xdebug) and then open the generated profile in KCacheGrind to see a performance analysis of the profiled PHP code.

If you find it's none of these you may just need to upgrade your server.

Look over the answer I gave to another question similar to this one for clues.

The thing is, if other pages outside of WordPress area are loading fine, but WordPress itself is choked, three things come to mind outside of the generic things I recommend.

When you migrated your WordPress code to the new setup, did you make sure to correctly set all paths on the file system in wp-config.php? The reason being that sometimes WordPress can work despite incorrect paths if those are set in the MySQL DB for WordPress options. Making sure they are in wp-config.php forces WordPress to use the correct directories & ensures temp & cache folders work as expected.
Database slowdown? That is the only other thing I can thing of that would be idiosyncratic to WordPress yet allow other pages to load. Are you sure your MySQL my.cnf is working as it should for the DB needs of your site?
Do you have a plugin or setting in your WordPress code that enables Gzip compression? In general Gzip compression should happen on the server side via Apache or Ngnix since they can hand Gzip compression more efficiently than PHP code. So if you have caching enabled in WordPress, disable it since PHP (which is what WordPress uses) is not great at Gzip compression.

In general I have setup tons of CMS sites—more recently WordPress—on cloud servers without issue. a 10 second page load is not a symptom of the cloud host being inadequate. I would recommend looking over the stuff I recommend here & in my other answer. And I would also recommend debugging by doing a clean WordPress install on the setup that is an issue & see how that reacts. If that works well in comparison to your full site, then it is clear there is some configuration issue in your sites specific code.

EDIT: Here is another idea. Do you have Apache authorization (htaccess) anywhere in your setup? Do you have it set to allow from localhost? See below. Sometimes this setup works, but if Allow from localhost is the first in the list of Allow’s or the only item in the list of Allow’s it can choke from reverse DNS weirdness. I would recommend trying to disable that—if you can—and see how quickly the site loads in comparison to it being enabled.

Order Deny,Allow
Deny from all
Allow from 127.0.0.1 ::1
Allow from localhost

How to troubleshoot performance issues of PHP, MySQL and generic I/O

Related

Recent Posts