MySQL Data Loss - post-mortem analysis - RackSpace Cloud Server

After a recent 'emergency migration' of a RackSpace cloud server, the MySQL databases on our server snapshot image proved to be days out of date relative to the backup date. And yet files that had been uploaded through the affected webapp were present on the file system: the metadata written to the database was lost, but the files themselves were backed up.

Once I managed to access the MySQL data files before the MySQL server started (the server was configured to start MySQL on boot), I could see that the modification times on ib_logfile1, ib_logfile0 and ibdata1 were days old.
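(For anyone who wants to repeat that check, a minimal sketch, assuming the stock Debian/Ubuntu data directory of /var/lib/mysql:)

# Stop mysql (or keep it from starting) before anything touches the datadir:
service mysql stop
# Look at the modification times of the InnoDB system tablespace and redo logs:
ls -l /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1
stat /var/lib/mysql/ibdata1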

As with this poster (mysql data loss after server crash), it is as if some caching controller had told the OS / MySQL server that it had committed data that was actually still in cache, and that data was lost instead of flushed.

I can't quite wrap my head around how the uploaded files got written but the database data did not. I would have thought that any cache would have been flushed system-wide, rather than process by process.

Any suggestions as to how this might have happened?

UPDATE TWO:

See my answer below that explains what happened.

UPDATE:

Config details, as requested.

RackSpace Cloud Server Details:
OS: Ubuntu 10.04 LTS (Lucid)
RAM: 1024 MB
Disk Space: 40 GB
Datacenter: ORD1
Service Level: unmanaged
root@restore-testing:~# dpkg -s mysql-server
...
Architecture: all
Source: mysql-dfsg-5.1
Version: 5.1.61-0ubuntu0.10.04.1
...
root@restore-testing:~# cat /etc/fstab
proc            /proc       proc    defaults    0 0
/dev/xvda1       /           ext3    defaults,errors=remount-ro,noatime    0 1
/dev/xvdc1       none        swap    sw          0 0

I can see this happening depending on the method InnoDB uses to flush data.

Please look into the innodb_flush_method used by your MySQL installation. Depending on the value set (O_DSYNC or O_DIRECT), InnoDB either buffers writes through both the OS cache and the InnoDB buffer pool, or through the InnoDB buffer pool alone. If the variable is set so that data is cached in the buffer pool only, I can easily see data vanishing if the OS restoration butchered the buffer pool in the process. I wrote a post on DBA StackExchange about this.
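If it helps, here is a quick way to check what those variables are set to on the running server (an empty innodb_flush_method simply means the platform default, fdatasync on Linux); the statements are plain MySQL, nothing specific to your build:

mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_method'"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"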

Here is another link about using MySQL in the Cloud vs. bare metal (Click Here). It names three potential problems/challenges that come with moving MySQL into a Cloud Environment:

  • Virtual IPs
  • Memory Configuration
  • Slow Disks

Even if these limitations have been overcome since that article, it is prudent to rethink where mission-critical data will reside. This is especially true given what just happened to your data.

BTW, StackOverflow has a nice post about the pros and cons of MySQL in the Cloud.

To push this point further from another angle: Cloud Environments provide geographic replication of a MySQL instance from East Coast to West Coast. When I personally did a 30-day evaluation of the XEROUND Database Service (I was provided with two Public IPs), I saw very bad intermittency (gaps of about 5-6 minutes) between the IPs. Could you imagine losing data during such a window because of a crash on either end? Your data loss was due to an emergency manual intervention.

RECOMMENDATION

IMHO I would switch your MySQL databases to bare-metal and use either DRBD or MySQL Replication for data redundancy. You can maintain all Cloud Services for Web and App Servers.
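As a very rough sketch of the replication half of that recommendation (the config file path is the Debian/Ubuntu convention; the account name, slave host and password are placeholders you would change):

# Master side: give the server an id and enable binary logging
cat > /etc/mysql/conf.d/replication.cnf <<'EOF'
[mysqld]
server-id        = 1
log_bin          = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
EOF
service mysql restart
# Create an account the slave will replicate through (placeholder host/password)
mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'slave.example.com' IDENTIFIED BY 'changeme'"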


While some settings of innodb_flush_method combined with certain hardware can result in data loss on a hardware failure, no combination of innodb_flush_method and innodb_flush_log_at_trx_commit explains how ib_logfile0 and ib_logfile1 could be days stale.

I had migrated servers at about the timestamp on the database files. I brought MySQL down cleanly on both servers and rsync'd /var/lib/mysql from one to the other. The webapps came up and checked out on the new server.

But what if I had forgotten to run monit unmonitor mysql on the target server, and monit had restarted MySQL? Maybe I had replaced the data and log files underneath a running MySQL server. Would MySQL continue blithely flushing data out to stale inodes?
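For reference, the migration should have looked roughly like this (new-server is a placeholder hostname, and the monit service name assumes mysql is monitored under that name); the unmonitor step on the target is the one I apparently skipped:

# On both servers: stop monit from restarting mysql behind your back, then stop mysql
monit unmonitor mysql
service mysql stop
# On the old server: copy the datadir across
rsync -a /var/lib/mysql/ root@new-server:/var/lib/mysql/
# On the new server only, once the copy is complete
service mysql start
monit monitor mysql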

A quick test later, and the answer is yes: MySQL doesn't notice when its data and log files are replaced out from under it. It keeps writing through its open file handles to the old, now-unlinked inodes, as long as the in-memory buffer pool can satisfy all queries. Given the size of our database (small) and query volume (low), the buffer pool would likely have continued to handle our requests for some time.
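The test itself was roughly the following, on a throwaway server (paths are the Ubuntu defaults; the database/table names and the rsync source are just placeholders):

service mysql start
mysql -e "CREATE DATABASE stale_test; CREATE TABLE stale_test.t (id INT) ENGINE=InnoDB; INSERT INTO stale_test.t VALUES (1)"
# Replace the data and log files under the running server, as the rsync would have done
rsync -a /path/to/another/datadir/ /var/lib/mysql/
# mysqld still holds its original file handles; the old files show up as deleted inodes
lsof -p $(pidof mysqld) | grep -i deleted
# Queries keep working out of the buffer pool, and writes keep flushing to the dead inodes
mysql -e "INSERT INTO stale_test.t VALUES (2); SELECT * FROM stale_test.t"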