Wednesday, May 11, 2011

Virtualization and Data Loss

Well, it had to happen to me eventually.  A physical server running VMware ESXi crashed and I lost a set of virtual servers that I had moved to it.

It seemed to result from a power hiccup.  Nearly everything important in the server room is on a UPS, except for this system.

This failure mode was new to me: VMware ESXi would not finish its boot, but complained about an invalid file (sorry, exact filename escapes me) and stopped.  (It looked an awful lot like a Windows boot failure I've seen in the past where a corrupted registry hive file prevented Windows from booting!)  I had to perform a VMware ESXi recovery installation, and that resulted in the ominous warning that one of my filesystems had an invalid partition table.

This particular VMware server has two VMFS filesystems on it (two separate hard drives to improve I/O performance for the VMs), and the second of the two filesystems was toast.

I hadn't considered the virtual machines on this VMware server to be irreplaceable, but they were valuable.  It took a couple of days of work to rebuild one of the lost VMs.  Another of the lost VMs caused a troublesome cascading failure: it provided an infrequently used web proxy whose loss caused unexpected software update failures elsewhere, and that took some time to diagnose as well.

In summary: I wish I had enough disk space everywhere to have backups of all the virtual machines, and I wish I had a good way to use apcupsd (or equivalent) to shut down ESXi servers nicely on power failures.
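For what it's worth, here is a rough sketch of how the apcupsd half of that wish might look. It assumes a machine running apcupsd where `apcaccess status` prints the usual `KEY : value` lines (including `STATUS` and `TIMELEFT`). The function names, the five-minute threshold, and the `shutdown_cmd` argument are all made up for illustration; in particular, the command that actually tells an ESXi host to shut down its guests and power off is left to the caller, since that depends on the setup.

```python
import subprocess


def parse_apcaccess(text):
    """Turn the 'KEY : value' lines printed by apcaccess into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as timestamps survive.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def should_shut_down(status, min_minutes=5.0):
    """Return True when the UPS is on battery and estimated runtime is low.

    min_minutes is an arbitrary illustrative threshold, not a recommendation.
    """
    on_battery = "ONBATT" in status.get("STATUS", "")
    try:
        # TIMELEFT normally looks like '12.3 Minutes'.
        minutes_left = float(status.get("TIMELEFT", "").split()[0])
    except (IndexError, ValueError):
        # Runtime unknown: err on the side of shutting down while on battery.
        return on_battery
    return on_battery and minutes_left < min_minutes


def check_and_shut_down(shutdown_cmd):
    """Poll apcupsd once and run shutdown_cmd if the UPS is running low.

    shutdown_cmd is a hypothetical argument list supplied by the caller,
    e.g. whatever remote command asks the ESXi host to power off cleanly.
    """
    result = subprocess.run(
        ["apcaccess", "status"], capture_output=True, text=True, check=True
    )
    if should_shut_down(parse_apcaccess(result.stdout)):
        subprocess.run(shutdown_cmd, check=True)
```

Something like `check_and_shut_down(...)` could then be run every minute from cron on the box that talks to the UPS. Keeping the decision in `should_shut_down` separate from the command that acts on it makes the logic easy to test without pulling the plug on anything.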
