Flickr down...outage handled responsibly

Flickr, Yahoo's photo-sharing site, went down today for what was supposed to be two hours of maintenance and stayed down for about five. Although the outage was serious, and the site stayed down for hours beyond the original estimate, the problem was handled responsibly and well.

Here's why I respect the way Flickr handled this situation:

  • The problem was defined honestly and clearly. There was no beating around the bush, no hiding, no pretending, just a straightforward presentation of facts.
  • Status reports were honest. They screwed up their estimates, but informed customers as to the real state of affairs. I don't know about you, but I like to know what's going on. Honest information makes the downtime a bit more palatable.
  • They acknowledged the problem was more serious than anticipated. When a system goes down, and particularly when it stays down, users already know there's a problem. Open discussion of the issue builds trust with users, who are the people that matter most. Users may not be happy with the information, but you'll retain that all-important credibility.
  • There was only one false start. After the 4:13pm announcement, where they first acknowledged their incorrect time estimates, there were no further communication errors. Everyone knows "time estimates are tricky," to use their words, and many users will overlook the occasional estimating error for that reason. No one likes repeated wrong estimates, though, and Flickr got it right by ensuring there were no follow-on time estimate mistakes.
  • Finally, repeating the first point, they were honest: Flickr acknowledged the problem, communicated clearly, and then fixed it. Done deal.

When an IT project is failing, it's often tempting to sweep the tough facts under the rug. This strategy almost always backfires, as the problem continues to grow and worsen. While I don't advocate baring one's IT soul indiscriminately, honesty is virtually always the very best policy.

Here's a complete copy of the Flickr blog announcement of the issue (emphasis added below):

UPDATE 4 7:34pm PST: And we’re back. Flickr is open again and ready to receive your photos. Get uploading!

UPDATE 3 7:08pm PST: Do you remember when we said we were almost back online? Well, that time we were joking, but this time is for real!

The latest estimate from our beautiful Ops team is 7:30pm PST.

UPDATE 2 6:07pm PST: We’re almost back folks. Just crossing the t’s and dotting the i’s before we throw the big switch. In the meantime, why not get outside and take some photos?

UPDATE 1 4:13pm PST: Anyone who’s ever worked in software probably knows that time estimates are tricky. Given that we’d prefer that Flickr be as close to 100% stable as we can make it before we go back online, we’re going to take more time to make sure that’s what happens.

It’s better to be safe than sorry when it comes to your precious photos, plus, there’s the added benefit of giving us all a chance to reflect on our serious Flickr addictions. Thank you again for your patience.

2:30pm PST: We started on a database upgrade and a few alters to the database structure last night. Given our scale, work like this takes a long time, and makes a definite impact on site performance.

You may have noticed today that the site is having lots of hiccups and that behaviour is generally pretty erratic. So, we’ve decided to take the site offline to help things settle down. We’re anticipating a couple of hours is all we need at this point, so, we’re hoping to be back online around 4:30 PST.

Sorry about this! It will be one of those massages that ‘hurts so good’ and we’ll post updates here as we have them.

Topic: Outage

  • Agreed

    Communication is the key to all phases of IT.
    • They did do it well

      Their blog was funny, but it conveyed the essential information. Amazing how unusual that is.
      • True...

        I love the line, "In the meantime, why not get outside and take some photos?" Good line. ;-)
  • RE: Flickr down...outage handled responsibly

    unfortunately it looks like the problem isn't over. i've been on the site this morning and the "hiccups" are back. this is probably another "lesson learned" for us in IT - don't take a prolonged outage then not fix the problem.
  • What free thing will the public demand for the outage

    it seems the thing to do.
  • RE: Flickr down...outage handled responsibly

    It wasn't completely down, just slow and not all functions available, and yes, they communicated and handled it well. Well done!
  • Frankly I do not understand what's the big deal..

    These services are nonessential, and nothing much is lost if they are down. I understand that if I pay for something I should be getting it as per the agreement, but I would not expect anything except a service credit equal to the time that the service was unavailable beyond the estimated outage.

    People just expect a lot from these services without any consideration of how it is done and what it takes to have a reliability index of 9999 for the network, especially a network that needs to handle a vast amount of traffic and data transfers. Just like with any service, there will be slip-ups, and as long as the issues are fixed and no user data is lost, I would let it be.

    If someone wants a completely reliable and secure service, maybe they need to try and host their own.
  • That's funny...

    I'm a paying Flickr customer and this is the first I heard about an outage. So much for the communication kudos. As others have stated, though, it isn't essential (at least how I'm using it), and a few hours over an entire year is not a big deal.
  • Sandbox testing first?

    It doesn't sound like they tested this "enhancement" in an offline sandbox first; if they had, this very likely never would've happened. If you're wondering what happens when site administrators push beta/alpha enhancements out to the public without internal QA first, take a look at the eBay forums. Trust me, this is not a desirable outcome. Next time, Flickr, try testing the enhancement offline in an internal sandbox before applying it to the actual public system.

    - John Musbach
  • so when

    did the site come back up??? i checked it before going to bed sunday eve and it was "still having hiccups"
    however, using it Saturday i never had issues, just sunday...all day, when i needed to upload pix from the holidays for family and friends