Google’s infrastructure as a service (IaaS) offering Compute Engine lost connectivity across all regions for 18 minutes on April 11 after problems experienced with a bug in its network configuration management software.
In a status update discussing the outage and how it occurred posted yesterday, the search giant confirmed the event affected Compute Engine only, and bemoaned how its ‘canary step’ process – a configuration deployed to a single site to ensure there are no issues with upstream failures – had a software bug. The result was that the push system wrongly believed there were no issues with the new configuration, and therefore happily began its rollout resulting in dropped traffic.
“The Google engineers who had been investigating a localised failure of the asia-east1 VPN now knew that they had a widespread and serious problem,” a post from Benjamin Treynor Sloss, the interestingly titled ‘VP 24×7’ at Google, reads. “They did precisely what we train for, and decided to revert the most recent configuration changes made to the network even before knowing for sure what the problem was.
“This was the correct action, and the time from detection to decision to revert to the end of the outage was thus just 18 minutes,” Sloss adds.
Naturally Google has apologised to its customers, and has also thrown in discounts of service credits up to 25% of impacted Compute Engine and VPN applications for those affected ‘to underscore how seriously we are taking this event.’ Previously, Google’s cloud has fallen over due to a connectivity fault in February last year, and an issue with manual link activation back in November.
Google says that the latest issue is under control meaning there is no risk of a reoccurrence, while its engineering teams will be working on prevention, detection, and mitigation systems over the next ‘several weeks’ to aim to ensure this sort of thing won’t happen again.