I posted to the mailing list last week that Storm was going to be down for a while due to hardware problems. I’ve fixed the issues, and Storm has been running for the last few days without further problems. From what I can tell, the hardware failure was caused by a thunderstorm and an ethernet cable that wasn’t run through a surge protector, which resulted in a cooked motherboard. After replacing the motherboard and fixing a few issues within the VMs related to the move, everything is back online and running fine.
This isn’t the first time Storm’s crashed and left me wondering (or worrying) about the integrity of the VMs, especially the ones that run constantly. I do my best to take snapshots along the way, but I’m guilty of not always remembering. As far as I can tell, ESXi 4.0 doesn’t provide a clean way to automate snapshots of a VM, much less to take them conditionally, which I want because I don’t want my snapshot trees to grow unnecessarily large.
To solve the issue of automated snapshots, I wrote autoshot.sh and stuck it in a cronjob to be run regularly. Autoshot.sh uses ash, provided by Busybox on the ESXi 4.0 host. There were some limitations in what I could do, as Busybox is fairly restrictive, and finding that Python’s time.strptime() was broken (while the rest of the time module seems to work fine) made this script a bit longer than necessary (and pretty ugly as well). Autoshot.sh uses vim-cmd to get information on snapshots and to take them, and uses each VM’s logfile to determine its last power-off time. If the VM is powered on, or its last snapshot was taken before it was last powered off, autoshot.sh will take a new snapshot to save the new changes. If the VM is powered on, the snapshot includes the memory.
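The decision rule described above is easier to see in code. What follows is only a minimal sketch of that rule, not the actual contents of autoshot.sh: the function name `snapshot_action`, its three arguments, and the strings it prints are all made up for illustration, and it assumes the power state and both timestamps (as epoch seconds) have already been extracted from vim-cmd and the VM's logfile.

```shell
#!/bin/sh
# Hypothetical helper illustrating the snapshot decision rule.
# This is NOT taken from autoshot.sh; names and return strings are assumptions.
#
#   $1  power state: "on" or "off" (assumed to be normalized already)
#   $2  time of the last snapshot, in epoch seconds (0 if there is none)
#   $3  time of the last power-off, in epoch seconds (0 if unknown)
#
# Prints one of: snapshot-with-memory, snapshot, skip
snapshot_action() {
    state=$1
    last_snapshot=$2
    last_poweroff=$3

    if [ "$state" = "on" ]; then
        # A running VM always gets a new snapshot, including its memory.
        echo "snapshot-with-memory"
    elif [ "$last_snapshot" -lt "$last_poweroff" ]; then
        # Powered off, but it ran after the last snapshot: save the changes.
        echo "snapshot"
    else
        # Powered off and nothing has run since the last snapshot.
        echo "skip"
    fi
}
```

For example, `snapshot_action off 100 200` prints `snapshot`, because the last snapshot predates the last power-off, while `snapshot_action off 200 100` prints `skip`. The skip case is what keeps the snapshot tree from growing when a VM has simply been sitting powered off.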
If you find any bugs in autoshot.sh or notice that something on Storm isn’t running properly, let me know either through the comments or the mailing list.