The cloud is still fallible: Joyent’s entire US-East-1 data centre went down yesterday, with the finger of blame pointed at a particularly fat finger which resulted in a monumental operator error.
The problem has since been resolved and all virtual machines are back up and running. According to a post-mortem blog post from the Joyent team, customer downtime ranged from a minimum of 20 minutes to a maximum of 149 minutes, with over 90% of instances recovered within an hour.
The cause of the outage? A typo, put simply. Joyent picks up the story: “The command to reboot the select set of new systems that needed to be updated was mistyped, and instead specified all servers in the data centre.
“Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in the us-east-1 availability zone without delay.”
Whoops. It’s the classic admin’s worst nightmare. But what happens from here?
As Joyent CTO Bryan Cantrill wrote in an update on Hacker News: “While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a data centre.”
The promised post-mortem provides a step-by-step plan going forward: firstly, “dramatically improving” tooling so that input validation is far stricter and a single mistyped command cannot wipe out all servers; secondly, improving control plane recovery so all nodes can be rebooted safely without operator intervention; and finally, assessing a more aggressive migration of customer instances off legacy hardware.
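The first of those steps can be illustrated with a simple guard: a tool that resolves its target list up front and refuses a fleet-wide destructive action unless it has been explicitly confirmed. This is only a sketch of the general technique; the function name, flag and threshold below are illustrative assumptions, not Joyent's actual tooling.

```python
# Illustrative sketch of strict input validation for a destructive admin
# command. Joyent's real tooling is not public; the names and the
# `max_unconfirmed` threshold here are assumptions for the example.

def validate_reboot_request(targets, fleet_size, confirmed=False, max_unconfirmed=10):
    """Return the list of servers to reboot, or raise ValueError if the
    request is dangerously broad and has not been explicitly confirmed."""
    if targets == "all":
        # Placeholder for a real inventory lookup of every server.
        resolved = list(range(fleet_size))
    else:
        resolved = list(targets)

    if len(resolved) > max_unconfirmed and not confirmed:
        raise ValueError(
            f"Refusing to reboot {len(resolved)} servers without explicit "
            f"confirmation (unconfirmed limit: {max_unconfirmed})"
        )
    return resolved
```

With a guard like this, rebooting a handful of named servers proceeds as normal, while a mistyped "all" fails loudly instead of issuing a fleet-wide reboot, unless the operator deliberately passes the confirmation flag.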
David Coursey, of SiliconANGLE, had a different view. “To me, ‘fat fingers’ is a bit of a hero,” he wrote. “He helped who-ever-heard-of-Joyent join the big leagues of cloud vendors – Amazon, Rackspace, Microsoft, Google – who have suffered (usually worse) failures.
“Such a shutdown just shouldn’t be possible without a lot more authority than a single sysadmin should ever be allowed.”
Coursey added that Joyent shouldn’t paper over the cracks by dismissing the unfortunate sysadmin – and the IaaS provider appears to have heeded that advice.
“We will be working as diligently as we can, and as expediently as we can, to prevent an issue like this from happening again,” Joyent concluded.
For now, put this one alongside Mimecast in terms of shock outages.