Flickr, Yahoo's photo-sharing site, went down today for what was meant to be two hours of maintenance and did not come back for about five. Although the outage was serious, and the site stayed down for hours beyond the original estimate, the problem was handled responsibly and well.
Here's why I respect the way Flickr handled this situation:
- The problem was defined honestly and clearly. There was no beating around the bush, no hiding, no pretending, just a straightforward presentation of facts.
- Status reports were honest. They screwed up their estimates, but kept customers informed about the real state of affairs. I don't know about you, but I like to know what's going on. Honest information makes the downtime a bit more palatable.
- They acknowledged the problem was more serious than anticipated. When a system goes down, and particularly when it stays down, users already know there's a problem. Open discussion of the issue builds trust with users, who are the people that matter most. Users may not be happy with the information, but you'll retain that all-important credibility.
- There was only one false start. After the 4:13pm announcement, where they first acknowledged their incorrect time estimate, there were no further communication errors. Yes, everyone knows "time estimates are tricky," to use their words, and many users will overlook the occasional estimating error for that reason. No one likes repeated wrong estimates, though, and Flickr got it right by ensuring there were no follow-on mistakes.
- Finally, repeating the first point, they were honest: Flickr acknowledged the problem, communicated clearly, and then fixed it. Done deal.
When an IT project is failing, it's often tempting to sweep the tough facts under the rug. This strategy almost always backfires, as the problem continues to grow and worsen. While I don't advocate baring one's IT soul indiscriminately, honesty is virtually always the very best policy.
Here's a complete copy of the Flickr blog announcement of the issue (emphasis added below):
UPDATE 4 7:34pm PST: And we’re back. Flickr is open again and ready to receive your photos. Get uploading!
UPDATE 3 7:08pm PST: Do you remember when we said we were almost back online? Well, that time we were joking, but this time is for real!
The latest estimate from our beautiful Ops team is 7:30pm PST.
UPDATE 2 6:07pm PST: We’re almost back folks. Just crossing the t’s and dotting the i’s before we throw the big switch. In the meantime, why not get outside and take some photos?
UPDATE 1 4:13pm PST: Anyone who’s ever worked in software probably knows that time estimates are tricky. Given that we’d prefer that Flickr be as close to 100% stable as we can make it before we go back online, we’re going to take more time to make sure that’s what happens.
It’s better to be safe than sorry when it comes to your precious photos, plus, there’s the added benefit of giving us all a chance to reflect on our serious Flickr addictions. Thank you again for your patience.
2:30pm PST: We started on a database upgrade and a few alters to the database structure last night. Given our scale, work like this takes a long time, and makes a definite impact on site performance.
You may have noticed today that the site is having lots of hiccups and that behaviour is generally pretty erratic. So, we’ve decided to take the site offline [to] help things settle down. We’re anticipating a couple of hours is all we need at this point, so, we’re hoping to be back online around 4:30 PST.
Sorry about this! It will be one of those massages that ‘hurts so good’ and we’ll post updates here as we have them.