Amazon Web Services' move to reboot instances, starting Friday and ending September 30, to apply a security fix highlights how the cloud is maturing and how best practices remain a work in progress.
The cloud services giant told EC2 customers by email that it was going to reboot instances in all zones in the days to come and users can't stop the process. Cloud providers all have some downtime for maintenance issues, but the scale of AWS' reboot — presumably over a flaw in the open source hypervisor Xen — has a lot of customers worried about downtime.
Amazon did say its update has nothing to do with the Bash bug reported on Thursday.
It's clear that communication and best practices for a situation where every instance has to be rebooted are still a work in progress. AWS is the largest cloud infrastructure player and is most likely the first to have to take such reboot measures at scale. Rest assured, AWS won't be the last.
As enterprises increasingly rely on cloud providers such as AWS, Microsoft Azure, Google and IBM's SoftLayer (along with HP, Oracle, Rackspace and others) for compute, there will have to be some standard way to handle these issues.
Update at 4:40pm ET: AWS said in a blog post that a small number of customers will be affected, but acknowledged the inconvenience. AWS said:
As we explained in emails to the small percentage of our customers who are affected and on our forums, the instances that need the update require a system restart of the underlying hardware and will be unavailable for a few minutes while the patches are being applied and the host is being rebooted.
While most software updates are applied without a reboot, certain limited types of updates require a restart. Instances requiring a reboot will be staggered so that no two regions or availability zones are impacted at the same time and they will restart with all saved data and all automated configuration intact. Most customers should experience no significant issues with the reboots. We understand that for a small subset of customers the reboot will be more inconvenient; we wouldn’t inconvenience our customers if it wasn’t important and time-critical to apply this update.
Customers who aren’t sure if they are impacted should go to the “Events” page on the EC2 console, which will list any pending instance reboots for their AWS account.
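For customers with many instances, the same check can be scripted rather than read off the console. The following is a minimal sketch, not AWS's own tooling: it assumes a response shaped like the one returned by the EC2 DescribeInstanceStatus API (for example via boto3's `describe_instance_status`), and the instance IDs, zones and dates in the sample are made up for illustration.

```python
from datetime import datetime, timezone

# Event codes that indicate a scheduled reboot (assumed to be the relevant ones here).
REBOOT_CODES = {"instance-reboot", "system-reboot"}


def pending_reboots(response):
    """Return (zone, instance_id, code, not_before) tuples for scheduled reboots."""
    rows = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            if event.get("Code") in REBOOT_CODES:
                rows.append((
                    status["AvailabilityZone"],
                    status["InstanceId"],
                    event["Code"],
                    event.get("NotBefore"),
                ))
    # Sort by zone, then instance ID, so each zone's reboots are listed together.
    return sorted(rows, key=lambda row: row[:2])


# Hypothetical sample data; a live call would look roughly like:
#   import boto3
#   response = boto3.client("ec2").describe_instance_status(IncludeAllInstances=True)
# (results may be paginated, which this sketch does not handle).
SAMPLE_RESPONSE = {
    "InstanceStatuses": [
        {"InstanceId": "i-0aaa", "AvailabilityZone": "us-east-1b",
         "Events": [{"Code": "system-reboot",
                     "NotBefore": datetime(2014, 9, 27, 2, 0, tzinfo=timezone.utc)}]},
        {"InstanceId": "i-0bbb", "AvailabilityZone": "us-east-1a",
         "Events": [{"Code": "instance-reboot",
                     "NotBefore": datetime(2014, 9, 26, 23, 0, tzinfo=timezone.utc)}]},
        {"InstanceId": "i-0ccc", "AvailabilityZone": "us-east-1a", "Events": []},
        {"InstanceId": "i-0ddd", "AvailabilityZone": "us-east-1b",
         "Events": [{"Code": "instance-retirement"}]},
    ]
}

if __name__ == "__main__":
    for zone, instance_id, code, not_before in pending_reboots(SAMPLE_RESPONSE):
        print(zone, instance_id, code, not_before)
```

Instances with no events, or with events other than reboots, are skipped, so the output is only the list an operator would need to plan around.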
RightScale's take on the situation is probably a good place to start for managing the reboot and pondering best practices. Ideally, Amazon would have a system in which customers who couldn't postpone a reboot would automatically be moved to another availability zone (at no charge) while the maintenance was underway.
Just as Microsoft eventually settled on Patch Tuesday as its schedule for fixing security issues, AWS will work out its own cadence.
The bind here is that AWS can't be too forthcoming about what exactly it is fixing until the work is done. Telegraphing a flaw and then having security compromised is a much larger problem than downtime. Some customers have praised AWS for the massive reboot operation while others have screamed.
The reboots are staggered so there is an opportunity for AWS customers to build redundancy. However, AWS may not be giving customers enough of a heads up on the reboot cadence.
This customer summarizes the problem well in the AWS forums:
This is most definitely a problem for us as well. We currently have more than 100 instances scheduled for reboot, and we too cannot scramble the staff on short notice to monitor the many of services that this will impact. In fact, entire service clusters of machines are scheduled to be rebooted at once, although they are in different AZs, amounting to an event equivalent to the loss of an entire Region. On two days' notice we can't possibly prepare for this.
AWS' response is that there is no way to reschedule the reboots because they are "very timely security and operational patches."
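The forum complaint comes down to maintenance windows that overlap across members of the same service cluster. A rough sketch of how a customer might spot that ahead of time is below. It assumes each instance's window is known as a start and end time (for example from the scheduled event's `NotBefore` and `NotAfter` fields); the function name, instance names and times are hypothetical.

```python
from datetime import datetime
from itertools import combinations


def overlapping_reboots(windows):
    """Return pairs of instance names whose reboot windows overlap.

    `windows` is a list of (name, start, end) tuples for one service cluster.
    """
    pairs = []
    for (a, a_start, a_end), (b, b_start, b_end) in combinations(windows, 2):
        # Two intervals overlap when each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            pairs.append((a, b))
    return pairs


# Hypothetical cluster of three web servers with made-up windows.
SAMPLE_WINDOWS = [
    ("web-1", datetime(2014, 9, 27, 2, 0), datetime(2014, 9, 27, 4, 0)),
    ("web-2", datetime(2014, 9, 27, 3, 0), datetime(2014, 9, 27, 5, 0)),
    ("web-3", datetime(2014, 9, 27, 6, 0), datetime(2014, 9, 27, 8, 0)),
]

if __name__ == "__main__":
    for first, second in overlapping_reboots(SAMPLE_WINDOWS):
        print(f"{first} and {second} are scheduled to reboot at overlapping times")
```

Any pair the check reports is a candidate for adding capacity elsewhere before the window opens, since the reboots themselves cannot be moved.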
It's a tight spot. As the cloud becomes the de facto computing system, these issues will need to be worked out. In the future, we're going to have a Patch Tuesday approach for the cloud.