IO issues with InnoDB, review of assessment required

TL;DR - Having some IO based issues with InnoDB. Before I sign the form to request more hardware can you have a look and see if you agree with my assessment

Working for a client, I have inherited a large InnoDB MySQL database server. The data size of all tables (across about 180 schema) according to Information_Schema is about 34.9GB with a total size of about 38GB.

It is sitting on a Xen DomU (Debian) that is also running heartbeat and DRBD (in active/passive) to sync the data_dir partition across a gigibit crossover and to provide failover. NOTE: the web servers, caching servers and database servers are all Xen virtual servers sitting on two servers for redundancy and failover.

The total data size consists of about 25GB of BLOB data (Images, Videos, File Uploads etc) that is stored in the database to allow for multiple front end web servers that cache these files on their local filesystem. The caches on these servers is probably flushed on average about 10 times a week.

The rest of the data (about 9GB) is actual content that is updated often. There multiple web servers at the front end and also static file caches sitting in front of these. Still the web servers are doing about 500k hits a day. The web servers only ever get loaded when the databases are waiting on IO and seem to be able to handle a large amount of queries and connections before this happens.

The application is a PHP based CMS called eZ Publish. We are hosting about 90 sites off of this setup.

InnoTop is showing about 40 Million queries per day. On average about 500 qps. The database server has 2 CPUs, one is hardly ever used, running a 32Bit kernel and has 2GB of ram. 512MBs have been allocated to the InnoDB buffer pool. I am aware that this isn't much. The server doesn't have any more free.

Based on the below outputs and from reading online blogs and this Book I have concluded that the server is grossly under resourced and we would benefit a lot if we had enough memory to store the 9GB of data that is access frequently (if not all of the 35GB of data) and/or faster disks replicated over at least 10GB fiber channel.

Does that assessment seem accurate? Is there any settings I can change here that will give any performance benefit?

~$ vmstat 1 10

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  3     32 312280  26852 710860    0    0     2     1    8    2  3  1 86 10
 1 16     32  60964  27144 955584    0    0  1402 19095 2102 1921 31 13  0 55
 0 16     32 245340  27320 771756    0    0  1050 15640 1389 1990 11  8 31 49
 1 19     32 294744  27464 722744    0    0  1634 12470 2356 3082  0  1 49 49
 2  7     32 465268  27756 556484    0    0  2393  2982 3127 3339  8  6 29 57
 2  7     32 364820  28108 654308    0    0  1542 10695 2773 2614 12  7 21 60
 1 20     32 144328  28428 871120    0    0  1766  9515 3110 2882 29 10  4 57
 1 14     32 231284  28060 761364    0    0  1496 16435 2913 3058 25 13  3 59
 0  5     32 335140  28144 659352    0    0   986 21689 2197 2483  1  1 43 55
 0  8     32 367516  28200 627312    0    0   666  7098 1519 1602  0  0 50 49

~$ iostat -x

sda8 is the DRBD replicated filesystem

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.68    0.00    1.21    9.94    0.10   86.08

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda1              0.03     3.59  0.46  3.85    18.55    59.53    18.13     0.70  163.56   2.70   1.16
sda2              0.04     0.01  0.02  0.03     0.50     0.33    16.35     0.01  102.47   5.11   0.03
sda7              0.00     0.00  0.00  2.43     0.00     2.43     1.00     0.04   16.02  16.02   3.89
sda8            126.18    38.46 29.41 33.89  1244.76   578.80    28.81     0.24    3.75   3.33  21.05

mysql> SHOW ENGINE INNODB STATUS\G

Status: 
=====================================
101214 11:35:28 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 4 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 296294, signal count 169637
Mutex spin waits 0, rounds 3304249, OS waits 12485
RW-shared spins 397195, OS waits 245310; RW-excl spins 44610, OS waits 37492
------------
TRANSACTIONS
------------
Trx id counter 0 1849382073
Purge done for trx's n:o < 0 1849375557 undo n:o < 0 0
History list length 302
Total number of lock structs in row lock hash table 0
LIST OF TRANSACTIONS FOR EACH SESSION:

     -- TRIMMED OP --

--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 1
1588467 OS file reads, 491628 OS file writes, 166422 OS fsyncs
1 pending preads, 0 pending pwrites
191.95 reads/s, 28245 avg bytes/read, 3.25 writes/s, 2.75 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 3, free list len 6, seg size 10,
18975 inserts, 18793 merged recs, 16850 merges
Hash table size 2212699, used cells 1083042, node heap has 1459 buffer(s)
8141.96 hash searches/s, 1822.54 non-hash searches/s
---
LOG
---
Log sequence number 33 3880839250
Log flushed up to   33 3880839250
Last checkpoint at  33 3875354146
0 pending log writes, 0 pending chkp writes
123375 log i/o's done, 2.75 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 681812036; in additional pool allocated 1048576
Buffer pool size   32768
Free buffers       0
Database pages     31309
Modified db pages  3167
Pending reads 1
Pending writes: LRU 129, flush list 0, single page 0
Pages read 3360890, created 9011, written 408990
331.17 reads/s, 0.25 creates/s, 0.00 writes/s
Buffer pool hit rate 990 / 1000
--------------
ROW OPERATIONS
--------------
3 queries inside InnoDB, 0 queries in queue
4 read views open inside InnoDB
Main thread process no. 7916, id 2395159472, state: sleeping
Number of rows inserted 73371, updated 55517, deleted 47643, read 154375744
1.50 inserts/s, 1.25 updates/s, 0.00 deletes/s, 9428.64 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

mysql> SHOW STATUS;

+-----------------------------------+------------+
| Variable_name                     | Value      |
+-----------------------------------+------------+
| Aborted_clients                   | 700        | 
| Aborted_connects                  | 1          | 
| Binlog_cache_disk_use             | 0          | 
| Binlog_cache_use                  | 0          | 
| Bytes_received                    | 116        | 
| Bytes_sent                        | 177        | 
| Com_select                        | 1          | 
| Com_show_status                   | 1          | 
| Com_xxx (OP rest were 0)          | 0          | 
| Compression                       | OFF        | 
| Connections                       | 166487     | 
| Created_tmp_disk_tables           | 0          | 
| Created_tmp_files                 | 94         | 
| Created_tmp_tables                | 1          | 
| Delayed_errors                    | 0          | 
| Delayed_insert_threads            | 0          | 
| Delayed_writes                    | 0          | 
| Flush_commands                    | 1          | 
| Handler_write                     | 131        | 
| Handler_xxx (OP rest were 0)      | 0          | 
| Innodb_buffer_pool_pages_data     | 31224      | 
| Innodb_buffer_pool_pages_dirty    | 2942       | 
| Innodb_buffer_pool_pages_flushed  | 408710     | 
| Innodb_buffer_pool_pages_free     | 0          | 
| Innodb_buffer_pool_pages_latched  | 12         | 
| Innodb_buffer_pool_pages_misc     | 1544       | 
| Innodb_buffer_pool_pages_total    | 32768      | 
| Innodb_buffer_pool_read_ahead_rnd | 39211      | 
| Innodb_buffer_pool_read_ahead_seq | 2735       | 
| Innodb_buffer_pool_read_requests  | 279758942  | 
| Innodb_buffer_pool_reads          | 1341622    | 
| Innodb_buffer_pool_wait_free      | 0          | 
| Innodb_buffer_pool_write_requests | 1744350    | 
| Innodb_data_fsyncs                | 166166     | 
| Innodb_data_pending_fsyncs        | 2          | 
| Innodb_data_pending_reads         | 3          | 
| Innodb_data_pending_writes        | 0          | 
| Innodb_data_read                  | 3356102656 | 
| Innodb_data_reads                 | 1581795    | 
| Innodb_data_writes                | 491260     | 
| Innodb_data_written               | 877349888  | 
| Innodb_dblwr_pages_written        | 408828     | 
| Innodb_dblwr_writes               | 10622      | 
| Innodb_log_waits                  | 2          | 
| Innodb_log_write_requests         | 783217     | 
| Innodb_log_writes                 | 116708     | 
| Innodb_os_log_fsyncs              | 123325     | 
| Innodb_os_log_pending_fsyncs      | 1          | 
| Innodb_os_log_pending_writes      | 0          | 
| Innodb_os_log_written             | 362475008  | 
| Innodb_page_size                  | 16384      | 
| Innodb_pages_created              | 8991       | 
| Innodb_pages_read                 | 3350432    | 
| Innodb_pages_written              | 408828     | 
| Innodb_row_lock_current_waits     | 0          | 
| Innodb_row_lock_time              | 10732      | 
| Innodb_row_lock_time_avg          | 195        | 
| Innodb_row_lock_time_max          | 2426       | 
| Innodb_row_lock_waits             | 55         | 
| Innodb_rows_deleted               | 47499      | 
| Innodb_rows_inserted              | 73260      | 
| Innodb_rows_read                  | 153981836  | 
| Innodb_rows_updated               | 55359      | 
| Key_blocks_not_flushed            | 60         | 
| Key_blocks_unused                 | 85607      | 
| Key_blocks_used                   | 30452      | 
| Key_read_requests                 | 43862323   | 
| Key_reads                         | 102541     | 
| Key_write_requests                | 5199340    | 
| Key_writes                        | 95356      | 
| Last_query_cost                   | 0.000000   | 
| Max_used_connections              | 132        | 
| Ndb_cluster_node_id               | 0          | 
| Ndb_config_from_host              |            | 
| Ndb_config_from_port              | 0          | 
| Ndb_number_of_data_nodes          | 0          | 
| Not_flushed_delayed_rows          | 0          | 
| Open_files                        | 7266       | 
| Open_streams                      | 0          | 
| Open_tables                       | 12674      | 
| Opened_tables                     | 0          | 
| Prepared_stmt_count               | 0          | 
| Qcache_free_blocks                | 4709       | 
| Qcache_free_memory                | 16054648   | 
| Qcache_hits                       | 21241525   | 
| Qcache_inserts                    | 1070573    | 
| Qcache_lowmem_prunes              | 560462     | 
| Qcache_not_cached                 | 2537364    | 
| Qcache_queries_in_cache           | 12205      | 
| Qcache_total_blocks               | 31665      | 
| Questions                         | 27104225   | 
| Rpl_status                        | NULL       | 
| Select_full_join                  | 0          | 
| Select_full_range_join            | 0          | 
| Select_range                      | 0          | 
| Select_range_check                | 0          | 
| Select_scan                       | 1          | 
| Slave_open_temp_tables            | 0          | 
| Slave_retried_transactions        | 0          | 
| Slave_running                     | OFF        | 
| Slow_launch_threads               | 0          | 
| Slow_queries                      | 0          | 
| Sort_merge_passes                 | 0          | 
| Sort_range                        | 0          | 
| Sort_rows                         | 0          | 
| Sort_scan                         | 0          | 
| Ssl_xxx (OP removed as not used)  |            | 
| Table_locks_immediate             | 9749827    | 
| Table_locks_waited                | 359        | 
| Tc_log_max_pages_used             | 0          | 
| Tc_log_page_size                  | 0          | 
| Tc_log_page_waits                 | 0          | 
| Threads_cached                    | 0          | 
| Threads_connected                 | 20         | 
| Threads_created                   | 4862       | 
| Threads_running                   | 16         | 
| Uptime                            | 67476      | 
+-----------------------------------+------------+

mysql> SHOW VARIABLES

+---------------------------------+-----------------------------+
| Variable_name                   | Value                       |
+---------------------------------+-----------------------------+
| auto_increment_increment        | 1                           | 
| auto_increment_offset           | 1                           | 
| automatic_sp_privileges         | ON                          | 
| back_log                        | 50                          | 
| binlog_cache_size               | 32768                       | 
| bulk_insert_buffer_size         | 8388608                     | 
| completion_type                 | 0                           | 
| concurrent_insert               | 1                           | 
| connect_timeout                 | 10                          | 
| default_week_format             | 0                           | 
| delay_key_write                 | ON                          | 
| delayed_insert_limit            | 100                         | 
| delayed_insert_timeout          | 300                         | 
| delayed_queue_size              | 1000                        | 
| div_precision_increment         | 4                           | 
| engine_condition_pushdown       | OFF                         | 
| expire_logs_days                | 0                           | 
| flush                           | OFF                         | 
| flush_time                      | 0                           | 
| ft_boolean_syntax               | + -><()~*:""&|              | 
| ft_max_word_len                 | 84                          | 
| ft_min_word_len                 | 4                           | 
| ft_query_expansion_limit        | 20                          | 
| ft_stopword_file                | (built-in)                  | 
| group_concat_max_len            | 1024                        | 
| have_archive                    | YES                         | 
| have_bdb                        | NO                          | 
| have_blackhole_engine           | NO                          | 
| have_compress                   | YES                         | 
| have_crypt                      | YES                         | 
| have_csv                        | YES                         | 
| have_dynamic_loading            | YES                         | 
| have_example_engine             | NO                          | 
| have_federated_engine           | YES                         | 
| have_geometry                   | YES                         | 
| have_innodb                     | YES                         | 
| have_isam                       | NO                          | 
| have_merge_engine               | YES                         | 
| have_ndbcluster                 | DISABLED                    | 
| have_openssl                    | DISABLED                    | 
| have_query_cache                | YES                         | 
| have_raid                       | NO                          | 
| have_rtree_keys                 | YES                         | 
| have_symlink                    | YES                         | 
| init_connect                    |                             | 
| init_file                       |                             | 
| init_slave                      |                             | 
| innodb_additional_mem_pool_size | 1048576                     | 
| innodb_autoextend_increment     | 8                           | 
| innodb_buffer_pool_awe_mem_mb   | 0                           | 
| innodb_buffer_pool_size         | 536870912                   | 
| innodb_checksums                | ON                          | 
| innodb_commit_concurrency       | 0                           | 
| innodb_concurrency_tickets      | 500                         | 
| innodb_data_file_path           | ibdata1:10M:autoextend      | 
| innodb_data_home_dir            |                             | 
| innodb_doublewrite              | ON                          | 
| innodb_fast_shutdown            | 1                           | 
| innodb_file_io_threads          | 4                           | 
| innodb_file_per_table           | ON                          | 
| innodb_flush_log_at_trx_commit  | 1                           | 
| innodb_flush_method             |                             | 
| innodb_force_recovery           | 0                           | 
| innodb_lock_wait_timeout        | 50                          | 
| innodb_locks_unsafe_for_binlog  | OFF                         | 
| innodb_log_arch_dir             |                             | 
| innodb_log_archive              | OFF                         | 
| innodb_log_buffer_size          | 1048576                     | 
| innodb_log_file_size            | 5242880                     | 
| innodb_log_files_in_group       | 2                           | 
| innodb_log_group_home_dir       | ./                          | 
| innodb_max_dirty_pages_pct      | 90                          | 
| innodb_max_purge_lag            | 0                           | 
| innodb_mirrored_log_groups      | 1                           | 
| innodb_open_files               | 300                         | 
| innodb_rollback_on_timeout      | OFF                         | 
| innodb_support_xa               | ON                          | 
| innodb_sync_spin_loops          | 20                          | 
| innodb_table_locks              | ON                          | 
| innodb_thread_concurrency       | 8                           | 
| innodb_thread_sleep_delay       | 10000                       | 
| interactive_timeout             | 50                          | 
| join_buffer_size                | 536866816                   | 
| key_buffer_size                 | 132120576                   | 
| key_cache_age_threshold         | 300                         | 
| key_cache_block_size            | 1024                        | 
| key_cache_division_limit        | 100                         | 
| language                        | /usr/share/mysql/english/   | 
| large_files_support             | ON                          | 
| large_page_size                 | 0                           | 
| large_pages                     | OFF                         | 
| lc_time_names                   | en_US                       | 
| license                         | GPL                         | 
| local_infile                    | ON                          | 
| locked_in_memory                | OFF                         | 
| log                             | OFF                         | 
| log_bin                         | OFF                         | 
| log_bin_trust_function_creators | OFF                         | 
| log_error                       |                             | 
| log_queries_not_using_indexes   | OFF                         | 
| log_slave_updates               | OFF                         | 
| log_slow_queries                | OFF                         | 
| log_warnings                    | 1                           | 
| long_query_time                 | 10                          | 
| low_priority_updates            | OFF                         | 
| lower_case_file_system          | OFF                         | 
| lower_case_table_names          | 0                           | 
| max_allowed_packet              | 16776192                    | 
| max_binlog_cache_size           | 4294967295                  | 
| max_binlog_size                 | 1073741824                  | 
| max_connect_errors              | 50                          | 
| max_connections                 | 800                         | 
| max_delayed_threads             | 20                          | 
| max_error_count                 | 64                          | 
| max_heap_table_size             | 268435456                   | 
| max_insert_delayed_threads      | 20                          | 
| max_join_size                   | 18446744073709551615        | 
| max_length_for_sort_data        | 1024                        | 
| max_prepared_stmt_count         | 16382                       | 
| max_relay_log_size              | 0                           | 
| max_seeks_for_key               | 4294967295                  | 
| max_sort_length                 | 1024                        | 
| max_sp_recursion_depth          | 0                           | 
| max_tmp_tables                  | 32                          | 
| max_user_connections            | 0                           | 
| max_write_lock_count            | 4294967295                  | 
| multi_range_count               | 256                         | 
| myisam_data_pointer_size        | 6                           | 
| myisam_max_sort_file_size       | 2147483647                  | 
| myisam_recover_options          | OFF                         | 
| myisam_repair_threads           | 1                           | 
| myisam_sort_buffer_size         | 8388608                     | 
| myisam_stats_method             | nulls_unequal               | 
| ndb_autoincrement_prefetch_sz   | 32                          | 
| ndb_force_send                  | ON                          | 
| ndb_use_exact_count             | ON                          | 
| ndb_use_transactions            | ON                          | 
| ndb_cache_check_time            | 0                           | 
| net_buffer_length               | 16384                       | 
| net_read_timeout                | 30                          | 
| net_retry_count                 | 10                          | 
| net_write_timeout               | 60                          | 
| new                             | OFF                         | 
| old_passwords                   | OFF                         | 
| open_files_limit                | 65535                       | 
| optimizer_prune_level           | 1                           | 
| optimizer_search_depth          | 62                          | 
| pid_file                        | /var/run/mysqld/mysqld.pid  | 
| port                            | 3306                        | 
| preload_buffer_size             | 32768                       | 
| protocol_version                | 10                          | 
| query_alloc_block_size          | 8192                        | 
| query_cache_limit               | 2097152                     | 
| query_cache_min_res_unit        | 4096                        | 
| query_cache_size                | 67108864                    | 
| query_cache_type                | ON                          | 
| query_cache_wlock_invalidate    | OFF                         | 
| query_prealloc_size             | 8192                        | 
| range_alloc_block_size          | 2048                        | 
| read_buffer_size                | 131072                      | 
| read_only                       | OFF                         | 
| read_rnd_buffer_size            | 262144                      | 
| relay_log_purge                 | ON                          | 
| relay_log_space_limit           | 0                           | 
| rpl_recovery_rank               | 0                           | 
| secure_auth                     | OFF                         | 
| server_id                       | 0                           | 
| skip_external_locking           | ON                          | 
| skip_networking                 | OFF                         | 
| skip_show_database              | OFF                         | 
| slave_compressed_protocol       | OFF                         | 
| slave_load_tmpdir               | /tmp/                       | 
| slave_net_timeout               | 3600                        | 
| slave_skip_errors               | OFF                         | 
| slave_transaction_retries       | 10                          | 
| slow_launch_time                | 2                           | 
| socket                          | /var/run/mysqld/mysqld.sock | 
| sort_buffer_size                | 2097144                     | 
| sql_big_selects                 | ON                          | 
| sql_mode                        |                             | 
| sql_notes                       | ON                          | 
| sql_warnings                    | OFF                         | 
| ssl_ca                          |                             | 
| ssl_capath                      |                             | 
| ssl_cert                        |                             | 
| ssl_cipher                      |                             | 
| ssl_key                         |                             | 
| storage_engine                  | MyISAM                      | 
| sync_binlog                     | 0                           | 
| sync_frm                        | ON                          | 
| system_time_zone                | NZDT                        | 
| table_cache                     | 32362                       | 
| table_lock_wait_timeout         | 50                          | 
| table_type                      | MyISAM                      | 
| thread_cache_size               | 8                           | 
| thread_stack                    | 131072                      | 
| time_format                     | %H:%i:%s                    | 
| time_zone                       | SYSTEM                      | 
| timed_mutexes                   | OFF                         | 
| tmp_table_size                  | 134217728                   | 
| tmpdir                          | /tmp                        | 
| transaction_alloc_block_size    | 8192                        | 
| transaction_prealloc_size       | 4096                        | 
| tx_isolation                    | REPEATABLE-READ             | 
| updatable_views_with_limit      | YES                         | 
| version                         | 5.0.32-Debian_7etch10       | 
| version_comment                 | Debian etch distribution    | 
| version_compile_machine         | i486                        | 
| version_compile_os              | pc-linux-gnu                | 
| wait_timeout                    | 50                          | 
+---------------------------------+-----------------------------+

Some path settings were removed due to character limits.


On average about 500 qps

This is the first challenge. A single (fast) HDD goes only 200-250 IOPS.

in your iostat dump, we see sda8 is doing 165 IOPS, not bad at all for a replicated disk. Still, the average queue size is very small, and the average wait and average service time are little more than 3msec. In other words, the disc isn't a bottleneck, at least not at the moment you ran iostat.

Obviously, the RAM buffers and other InnoDB optimizations are already at work reducing drastically the amount of IO requests (otherwise, you'd be trying to do 500 requests, and sda8 wouldn't be able to service so fast).

So, what's your problem?

Edit:

ok, the numbers at heavy load paint a totally different picture, where the number of IOs pile up and response time suffers a lot. Definitely a case for some hardware redo.

first, i'd reevaluate DRBD. Three options:

  1. keep DRBD, but get better latency between hosts. either 10GbE, or some fancier connection. I'd say InfiniBand could be an option.
  2. replace DRBD with some external shared storage (SAN storage). One that could give a high number of IOPS. iSCSI on ethernet won't cut it. Either high-end FC, or shared SAS.
  3. consider DB replication instead of sharing storage. That way, storage would be local to the DB, and the latency would be local too. The replication happens 'after the fact', so it doesn't affect IOPS.

Then, no matter which solution you choose, you still need a lot more IOPS than what your current disks can provide.

There are two main ways to get high IOPS:

  1. Add more spindles. Usually in a RAID10, the IOPS you can get is roughly the sum of the IOPS of all active disks. That's why high-RPM disks are available in so low capacities, you usually want lots and lots of disks to get higer speed. 20-30 drives just to add up the spindles are not unheard of; that's the number you would need to get into the ~6000 IPS ballpark. (a 15kRPM disk gives 180-250 IOPS)
  2. Use SSD. A high-end flash drive can get 20,000 IOPS. Be sure to use 'server' SSDs, the ones used for workstations would die in a few months. See the difference between SLC and MLC, for example, Intel's X-25E vs. X-25M.

165 IOPS? Where do you read that from? I think you miscalculated by adding rrqm (read request merges) and wrqm (write request merges) instead of using r/s + w/s (real IOPS by disk). Additionally the disk is only at 21% utilization - that usually not a sign to add more spindles. Everything below 70% for a given LUN can be considered OK.

Apparently only a small portion of the queries are hitting the spindles.