Here at TrustIV, we use Microsoft Hyper-V as our virtualisation platform, and host various Linux systems on it. Hyper-V and Linux may seems like something of an odd couple, but they match up surprisingly well. Hyper-V is reasonably priced and very capable, and using Linux as the guest OS saves all kinds of licensing headaches. Microsoft also contributes various Hyper-V integration drivers to the Linux kernel, which smooth things further.
However, a couple of weeks ago, we had a power outage in our data centre, which caused some data corruption in a PostgreSQL database we use internally. This was slightly surprising, as PostgreSQL is designed to withstand this sort of abuse. This is the story of how we figured out the cause. As with my other technical posts, there's a short non-technical conclusion at the end.
When a database like PostgreSQL finishes a transaction with a COMMIT, it makes sure the data has been written to disk, and then tells the user that the transaction is complete. For this to work, you need an unbroken "chain of custody" from the user to the metal, where each component promises that the data has been committed. In our case, some "committed" transactions had not been written to disk, leading to inconsistent on-disk data. The PostgreSQL documentation warns you about this sort of thing. Somewhere between the user and the disk, something was lying, but what?
My first port of call was Hyper-V. I found a knowledge base article about similar problems with Active Directory guests on some versions of Hyper-V. Active Directory would ask Hyper-V to turn disk caching off. Hyper-V would try to turn disk caching off, fail, then lie and say it was successful. This seemed promising, but dig deeper and it doesn't match up. The bug is fixed in the version of Hyper-V we're using. Hyper-V now admits that it couldn't switch disk caching off, which forces AD to work around the problem.
So, if AD can work around caching, then why can't we? AD works around it by sending all requests with a special SCSI flag called FUA, which forces data to disk. Hyper-V recognises this flag (so long as you're using VHDX disks), and makes sure data is written to disk before reporting success. So, either we weren't using this flag, or our disks were lying about data being written to disk.
Since our disks are high-quality server-spec drives, I figured the problem was probably higher up in the food chain. When you commit a transaction in PostgreSQL on Linux, it tells the OS to do an fdatasync (or, at least that's the default - there are other options). From Linux's point of view, fdatasync is still a relatively high-level command - different file systems implement it in different ways. However, looking at the Linux kernel mailing list and the kernel source, ext4 has a "barrier" option, which is on by default, and makes sure that fdatasync requests use the FUA flag.
So, it looks like we can trace a COMMIT from the DB to the disk. Where's the missing link? The missing link was ext4. The PostgreSQL data drive was formatted as ext2, rather than ext4. It's an easy mistake to make, but ext2 doesn't support hardware write barriers, so our transactions were getting lost in various layers of cache.
The next step was to prove it. The PostgreSQL devs recommend that you should do a disk pull plug test on any hardware you'll be using to host a database. We skipped this step initially, as it seemed destructive, but in hindsight it would have been the lesser of two evils. We've got two Hyper-V servers, so we chose one as a "victim", migrated all-but-one of our VM images off it (live migrations are great), and used the remaining VM image to test what happens when we pull the plug. Sure enough, when we stored our data on an ext2 partition, we saw data loss. When we switched to ext4, our data stayed intact. It's important that when you do these tests, you physically pull the plug on the physical box - it's tempting to do something less destructive, but a real power outage won't be so kind.
So what did we learn? Well, for starters, we learned not to use ext2 to store PostgreSQL data. From a distance, it looks like a good fit (the DB does its own journaling, so why use anything fancier?), but lack of support for hardware write barriers is a killer. In practice, the journaling overhead of ext4 wasn't that severe, and you can always fall back to one of the weaker options like writeback, or no journaling.
But more importantly, we learned that you should never neglect destructive testing. It's easy to put off, as you don't want to destroy your stuff. But if you don't seek out the Chaos Monkey, the Chaos Monkey will seek you. The resilience of your infrastructure will be tested, whether you choose to test it or not.