Wordpress outage: avoiding future failure?

The Wordpress outage might teach SaaS/cloud players a valuable lesson in rollout management. Bob Warfield has the germ of an idea. Worth thinking about? You decide.

The Wordpress outage is getting lots of attention. Understandably. Bob Warfield weighs into the mix with his Dark Side post, arguing that a feathered approach to rollout might be a good idea:

When you roll out a code change, no matter how well-tested it is, don’t deploy to all the hotels.  A feathered release cycle delivers the code change to one hotel at a time, and waits an appropriate length of time to see that nothing catastrophic has occurred.  It’s amazing what a difference a day makes in understanding the potential pitfalls of a new release.  Given the operational flexibility of being able to manage multiple hotels, you can adopt all sorts of release feathering strategies.  Start with smaller customers, start with brand new customers, start with your freemium customers, and start out by beta testing customers are all possibilities that can result in considerable risk mitigation for the majority of your customer base.
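To make Bob's idea concrete, here is a minimal sketch of a feathered rollout. Everything in it is my own illustration rather than anything Bob or Wordpress has published: the `Hotel` class, the `feathered_rollout` function and the `is_healthy` callback are hypothetical names, and "start with the smallest hotel" is just one of the strategies he lists.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hotel:
    # Hypothetical model of a "hotel": a group of tenants sharing one deployment.
    name: str
    customer_count: int
    version: str = "1.0"


def feathered_rollout(
    hotels: List[Hotel],
    new_version: str,
    is_healthy: Callable[[Hotel], bool],
) -> List[str]:
    """Upgrade one hotel at a time, smallest first, and stop at the first failure."""
    upgraded = []
    for hotel in sorted(hotels, key=lambda h: h.customer_count):
        previous = hotel.version
        hotel.version = new_version
        # A real system would wait here (hours, perhaps a day) before checking.
        if not is_healthy(hotel):
            hotel.version = previous  # roll back only the hotel that failed
            break                     # hotels not yet reached stay on the old release
        upgraded.append(hotel.name)
    return upgraded


# Example: the health check fails on the mid-sized hotel, so the largest is never touched.
hotels = [Hotel("enterprise", 5000), Hotel("freemium", 50), Hotel("mid", 800)]
print(feathered_rollout(hotels, "1.1", lambda h: h.name != "mid"))  # ['freemium']
```

The point of the sketch is the early `break`: a bad release costs you one hotel, not the whole customer base.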

In comments I cheekily suggested that self-hosting is the alternative solution, but then I've had my own troubles in that regard. Bob was having none of that, instead suggesting:

Multi-tenant architectures need to know how to migrate users to new releases. Not all releases are so over-arching that everyone needs to instantaneously go there. Perhaps some could even be voluntary so that people could do the equivalent of what you suggest but without the hosting.

That flies in the face of the conventional SaaS wisdom of one code line, and there are serious problems with it if customers don’t move forward fairly rapidly. For that reason, I would suggest there be a window during which the customer has leeway to choose their migration point. If you release quarterly, perhaps the window is the first 2-4 weeks after the release is available.
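The migration window Bob describes can be sketched in a few lines. This is an assumption-laden illustration, not a description of any vendor's system: the function names, the "old"/"new" labels and the four-week default are all hypothetical, chosen from the upper end of his 2-4 week suggestion.

```python
from datetime import date, timedelta


def migration_deadline(release_date: date, window_weeks: int = 4) -> date:
    """Last day on which a customer may still choose to stay on the old release."""
    return release_date + timedelta(weeks=window_weeks)


def version_for_customer(
    today: date,
    release_date: date,
    opted_in: bool,
    window_weeks: int = 4,
) -> str:
    """Return which release a customer should be served on a given day."""
    # Customers who opted in move immediately; everyone else is moved
    # automatically once the window has closed.
    if opted_in or today > migration_deadline(release_date, window_weeks):
        return "new"
    return "old"


release = date(2010, 1, 1)
print(version_for_customer(date(2010, 1, 10), release, opted_in=False))  # old
print(version_for_customer(date(2010, 1, 10), release, opted_in=True))   # new
print(version_for_customer(date(2010, 2, 15), release, opted_in=False))  # new
```

The forced move at the end of the window is what keeps this from drifting too far from the one-code-line model: nobody stays behind for longer than a few weeks.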

There is a bigger picture here. The news that cloud systems go pear-shaped is about as newsworthy as 'dog bites man.' The fact it is Wordpress makes it marginally more newsworthy. As Bob says, mistakes happen. Let's get over it.

Google, Salesforce.com, NetSuite, Freshbooks and on it goes. All have had that dreaded 'unscheduled downtime' experience. So far it has been a case of 'it is how you communicate and handle it' that matters. I've tended to agree. As far as I am aware, there isn't a single recorded instance of business-critical data getting lost as a result. (Please correct me if I'm wrong.)

The elephant in the room is complexity. As applications become more complex, the potential for there to be an outage or serious bug issue increases. Ask SAP, Oracle, Infor, Lawson - any of them - about patch releases. Add in the fact that people are looking forward to a connected world where systems around the globe seamlessly spin up and present relevant data to us and it is not hard to see how one rogue or defective service could impact others.

There are no simple solutions to these levels of complexity, which of course serves well as a defense for the monolithic systems that some ERP players sell. But I agree with Bob. If we're going to be faced with a world where you have to factor in failure, then his idea has genuine merit.

If the SaaS/cloud players thought in those terms, then what would it mean for rollout management? What would it mean for the perception that rollouts occur and everyone has the same functionality, instantly? I'm thinking that, as in all emerging markets, things change.

An outage of the scale that took out Wordpress may be a good thing after all.