Microsoft: Here's what caused our cloud outage this week

By Mary Jo Foley | August 19, 2011, 4:48am PDT

Summary: Microsoft is informing customers hit by its Office 365 cloud outage this week that a “networking interruption” caused the problem, and that the company is planning to offer them a 25 percent credit for their trouble.

Microsoft officials are starting to share some details with customers and partners about what led to several cloud-service outages this week.

On August 17, many North American users of Microsoft Office 365 and SkyDrive were unable to access their email and calendars due to a three-plus-hour outage.

Some Dynamics CRM Online users also experienced service problems that day, but Microsoft execs are not saying the two sets of issues were due to the same root cause. The Dynamics CRM team has declined to provide information on what led to Wednesday’s outage or on how many users were affected. (Microsoft officials have said that Microsoft is planning to add CRM Online to the company’s hosted Office 365 suite — which currently includes Microsoft-hosted Exchange, SharePoint and Lync — before year-end.)

Update: The Dynamics team sent this update via a company spokesperson:

“The root cause of the Microsoft Dynamics CRM Online service has been identified as a site configuration issue. A configuration change was made in all data centers that should prevent this from happening again. This was not a complete outage and separate from any other service issue experienced by customers.”

While not sharing exact details, Microsoft officials are attributing the Office 365 problems to “a networking interruption” in one of its North American datacenters. One of my contacts said he believed faulty Cisco networking gear was the culprit, something a Microsoft spokesperson didn’t confirm (or deny) when I asked.

On August 18, Microsoft sent notes to Office 365 customers using the affected Microsoft-hosted services, informing them of its initial findings and its plans to credit affected users with 25 percent of their monthly invoices. Here is a copy of the note Microsoft e-mailed to customers:

Dear Customer:

The Office 365 team strives to provide exceptional service to all of our customers. On August 17, customers served from one of our North America data center lost access to email services included in the Office 365 suite. We apologize for the inconvenience this may have caused you and your employees.

We are committed to communicating with our customers in an open and honest manner about service issues and the steps we’re taking to prevent recurrences.

• What happened?

º Preliminary investigation indicates that a networking interruption in one of our North America data centers caused Office 365 Exchange Online to be inaccessible by some customers.
º This incident lasted from approximately 11:30 AM PDT to 2:40 PM PDT, during which time customers were not able to access the Outlook Web App or send and receive email through Exchange Online.
º The Service Health Dashboard was updated regularly during the event to notify customers of the problem, though there was a brief period of intermittent access issues to that dashboard.

• What actions have been taken to prevent a recurrence?

º The data center’s networking facilities have been remediated and we are investigating the root cause.
º We continue to monitor the overall network very closely to maintain high levels of service to customers.

We understand that any disruption in service may result in a disruption to your business. As a gesture of our commitment to ensuring the highest quality service experience Microsoft is proactively providing your organization a credit equal to 25% of your monthly invoice. The credit will appear on a future invoice, and you do not need to contact Microsoft to receive this credit. Please note, processing of the credit may take as long as 90 days.

If you have additional questions, please do not hesitate to contact us at the Office 365 community site.

Thank you for choosing Office 365 to host your business productivity applications. We appreciate your business.

Sincerely,

The Office 365 Team

Microsoft launched Office 365 at the end of June and has on-boarded a number of customers and partners since then. Microsoft also has moved some of its existing BPOS (Business Productivity Online Suite) users onto Office 365, but has advised the majority of BPOS users interested in Office 365 to wait until September, when the migration process will begin in earnest.

Mary Jo has covered the tech industry for more than 25 years for a variety of publications and Web sites, and is a frequent guest on radio, TV and podcasts, speaking about all things Microsoft-related. She is the author of Microsoft 2.0: How Microsoft plans to stay relevant in the post-Gates era (John Wiley & Sons, 2008).

Disclosure

Mary-Jo Foley

Freelance journalist/blogger Mary Jo Foley has nothing to disclose. WYSIWYG (what you see is what you get). I do not own Microsoft stock or stock in any of its partners or competitors. I have no business ventures that are sponsored by/funded by Microsoft or any of its partners or competitors.

Biography

Mary-Jo Foley

Mary Jo Foley has covered the tech industry for 25 years for a variety of publications, including ZDNet, eWeek and Baseline. She has kept close tabs on Microsoft strategy, products and technologies for the past 10 years. In the late 1990s, she penned the award-winning "At The Evil Empire" column for ZDNet, and more recently the Microsoft Watch blog for Ziff Davis.

Got a tip? Send her an email with your rants, rumors, tips and tattles. Confidentiality guaranteed.

52
Comments

the 25% credit is standard for Office365 - paying it without customers having to ask for it isn't...
@mary.branscombe That is not true; automatic remuneration is part of the SLA. Other cloud providers require you to ask for a refund and set a time limit.
0 Votes
+ -
Two hour, limited interruption...
GoodThings2Life 19th Aug
...really isn't that catastrophic, considering that most businesses hosting things on-site have that kind of downtime a lot more frequently. Plus, hardware failure is inevitable once in a while.

What's bigger news here is how they responded to it. They not only identified and resolved the issue, but they seem to have communicated the information quickly. The offer to discount the service for the inconvenience is a nice touch too.
0 Votes
+ -
I'd also point out...
GoodThings2Life 19th Aug
... that since it only affected access to Outlook Web App and sending/receiving new messages, anyone using ActiveSync connections on their phone, as well as the Outlook client, still had access to most of their information... an option certain other services wouldn't have had.
@GoodThings2Life - The outage did not just affect the Outlook Web App; the regular Outlook connection, which uses Outlook Anywhere (RPC over HTTP), and ActiveSync connections were down as well. Granted any old data was available to those users, but new mail could not be sent or received in any way as all connections to the Exchange portion of Office 365 were down. As for your comment about "other services," I assume this is a thinly veiled reference to Google Apps. While I agree that Office 365 is a better overall service, customers of the paid version of Google Apps get ActiveSync access and Outlook syncing using the Google Apps Outlook Sync application, so they would have the same access to old data as Office 365 users do in an outage situation.
0 Votes
+ -
actually
@JoeTierney 19th Aug
@GoodThings2Life anyone running Outlook and ActiveSync with Google Apps would have had that option as well.
@GoodThings2Life - The outage was actually a little over 3 hours, not 2. Also, the frustrating thing was that the "Service Health" console showed that everything was fine for the first hour of the outage, before finally being updated to show a problem. However, what really was bad for me personally was that I had just migrated a client from in-house Exchange to Office 365 over last weekend and then, just three days later, I was left answering questions about the viability of the service and my recommendation of it. When I forwarded my client the Microsoft apology letter this morning, the CEO responded that maybe Microsoft should change the name to Office 364.99.
@GoodThings2Life

No outage here to our Data Center's Exchange boxes over the past 3 years. I know a lot of companies that have been able to keep their fail-over data centers up continuously over the past few years. Sure, individual sites might go down, but most can keep the services out of a datacenter up pretty continuously.
@GoodThings2Life
I have email clients that haven't been down in years. Not sure where a few hours of downtime each month became the standard. But hey, cloud.
0 Votes
+ -
I must say...
wolf_z 19th Aug
...a credit for 25% of the monthly fee for a *three hour* outage is very generous.

How many hours in a month? 28*24 = 672 / 4 (25%) = 168 hours / 3 (hours out) = 56 *times* the number of hours out!

Can you imagine anyone who pays you 56 times what you lost as a thank you? (chuckle)
@wolf_z - Office 365 has a 99.99% uptime guarantee*, so after 446.4 minutes of downtime in a 31 day month, they are required to give a 25% discount. So while this outage was only 190 minutes and Microsoft is giving the discount proactively instead of waiting for customers to request it, I wouldn't say it was "very generous," especially considering they had already had a partial Exchange outage just 5 days earlier for a short amount of time. They made an uptime guarantee, so they have to live up to it.

The math goes:
60 min * 24 hrs * 31 days = 44,640 total minutes in August
44,640 min * .01% maximum downtime = 446.4 total minutes max downtime for a 31 day month

* - http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=8094

EDIT - I had a complete brain fart and did the math wrong. Yaksplat below is correct and the SLA is a 99.9% uptime guarantee. The math should be:

60 min * 24 hrs * 31 days = 44,640 total minutes in August
44,640 min * .001 maximum downtime = 44.64 total minutes max downtime for a 31 day month

@yaksplat - Thanks for the correction.
@reidjim76
check your math. .1% downtime = .001 or 44.64 minutes.
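
For anyone who wants to check the SLA arithmetic in this thread, here is a minimal Python sketch. It only reproduces the commenters' calculation (minutes in the month times the permitted downtime fraction); the function names are hypothetical, and the idea that exceeding the budget triggers a credit comes from the comments above, not from the SLA text itself.

```python
MINUTES_PER_DAY = 60 * 24


def downtime_budget_minutes(uptime_percent, days_in_month):
    """Minutes of downtime permitted in a month under an uptime guarantee."""
    total_minutes = MINUTES_PER_DAY * days_in_month
    return total_minutes * (100.0 - uptime_percent) / 100.0


def exceeds_budget(outage_minutes, uptime_percent, days_in_month):
    """True if a single outage alone uses up more than the monthly budget."""
    return outage_minutes > downtime_budget_minutes(uptime_percent, days_in_month)


if __name__ == "__main__":
    # 99.9% guarantee in a 31-day month, as in the corrected math above.
    budget = downtime_budget_minutes(99.9, 31)
    print(f"Allowed downtime: {budget:.2f} minutes")
    # The roughly 190-minute outage discussed in the thread.
    print("190-minute outage exceeds budget:", exceeds_budget(190, 99.9, 31))
```

With a 99.9% guarantee the budget for August works out to about 44.64 minutes, so a 190-minute outage is well past it; a 99.99% guarantee would allow only about 4.46 minutes.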
0 Votes
+ -
Welcome to the cloud, folks.
Userama 19th Aug
I'm sure that now the problem has been fixed, this will never happen again. (chuckle)
0 Votes
+ -
Agreed!
William Farrell 19th Aug
@Userama
A little sarcasm here? Since this is "the cloud", how could a network interruption at one data center bring down a service for any length of time? Shouldn't users have just been redirected to a different data center?
@thensley@...

"Let me explain. No, there is too much. Let me sum up."

It seems like they suffered from a "single point of failure" (kind of like what happened with Frontier.com webmail, where that service was unavailable for over 14 hours a couple of days ago and affected Frontier customers nationwide), and *that* is not a forgivable error in this day and age, where redundancy should be the default, not the exception.
0 Votes
+ -
RE: Microsoft: Here's what caused our cloud outage this week
john_gillespie@... Updated - 19th Aug
@thensley@...

So a Microsoft product works sometimes and not at others due to what they describe as something they had no control over. Is there news here? I think I would feel safer trusting my business with a service that always worked and had several layers of redundancy. This is not an Xbox; this is serious business.
There goes their promise of 99.98% uptime ^^
@Ambiorix2 Not really. In fact, the missing 0.02% of time is for events like these!
Availability of the system is much more important than an advertised SLA or a credit. Let's face it: Microsoft wants to be a services company, and the reality is they have priced Office 365 where they have because they simply don't have the proven track record of delivering software services for enterprises. The trade-off of the lower price is spotty performance.

Office 365 is appropriate for the SMB not larger companies.
@smtp22 Sorry, but you're painting a very rosy picture of in-house email services. The VAST majority of the clients I've worked with over the years could only dream of detecting a serious networking problem, working out a solution, applying said solution, testing that the solution worked and communicate with their users what's happening several times during the process ... all within 3 hours of the problem first arising. Most companies I know take MANY hours to fix such problems ... if not days.

No (sane) cloud operator claims to be 100% effective - hardware fails, lightning strikes, software crashes and humans make errors. Any expectation to the contrary is misguided.

As for your claim that Microsoft don't "have the proven track record of delivering software services for enterprises", I think that Windows Server, SQL Server, Exchange Server, SharePoint, etc. are evidence enough that Microsoft does build mission-critical software. They've been operating one of the world's biggest and most complex IT infrastructures for many years and are now offering their prowess for others to enjoy.
@bitcrazed

I do not think that the London Stock Exchange would agree with you. And, don't take my word for it, either:

http://blogs.computerworld.com/london_stock_exchange_to_abandon_failed_windows_platform

IIRC, this was one of Microsoft's "prized catches" at the time of implementation.
@fatman65535 Wow, one entity. I'm sure there are more who have decided against the Windows platform in their enterprise, but to claim Microsoft doesn't have a proven track record for delivering enterprise services is BS. Everyone has their bad experiences, but Microsoft powers a significant portion of the enterprise world, whether with on- or off-premise solutions.
@fatman65535: The LSE problem was not a Microsoft issue. The LSE replaced their Windows infrastructure with a Linux based infrastructure (codenamed Turquoise) which they also screwed up right royally:

http://www.zdnet.co.uk/news/infrastructure/2010/11/03/lse-trading-pool-crash-due-to-human-error-40090735/

There is no reason why one could not create a very high performance system based on Windows. Nasdaq, for example, has been running on NT+ since 2000:

http://technet.microsoft.com/en-us/library/cc723450.aspx

http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=49271
@smtp22 - The problem is that companies like Google have convinced the upper management business masses that the cloud is, now and forever, the right path to follow going forward. No company that relies on the internet and "rented space" in data centers should expect 100% uptime, because it's an impossible level of operation given how volatile and vulnerable the internet is to even the most simple failures in hardware.
0 Votes
+ -
I thought the point of cloud computing was that there isn't a single point of failure. So why are Microsoft and Amazon having outages once a month?
@olePigeon

You forget, with Windows you HAVE to reboot once a month or you will have server failures caused by counting overflows.
@jessepollard - utter cr@p. You may be describing your own experiences here, which speaks more to your abilities than reality, but, no, Windows does NOT have to be shut down once a month. Evidence: I have a machine in our office that is under heavy load and has been running for 9 months now without interruption.
@bitcrazed Windows requires monthly maintenance, which requires REBOOTING. Even if you have a machine that can ignore updates (i.e., not connected to the network), it becomes sluggish after about 8 weeks without a reboot.

9 weeks I can believe, even 3 months. But there is no windows based system that can run for 9 months without a reboot. This is why most IT depts schedule a 10+ min outage per system for maintenance every month.
0 Votes
+ -
That is a lie
Mister Spock 19th Aug
@jessepollard

What did you hope to accomplish by saying that?
@wackoae - Windows PCs do NOT REQUIRE rebooting if you turn off automatic updates.

Note: THIS IS NOT AN IDEAL/RECOMMENDED SCENARIO. Normally, we recommend staying up to date and running Windows Update once a month (at least).

The machine in question is manually updated only when absolutely necessary because it's performing a vital function and cannot go down right now. This machine is in the process of being replaced with a virtualized solution that will allow us to keep the service running even when individual VMs are restarted.

The point is, however, that Windows can quite happily run for many months, if not years, without requiring a reboot.

While there are indeed many (primarily in-house) apps that can cause a machine to become sluggish, that's not a fault of the OS - it's the fault of the app developers not designing their code to run on servers for extended periods of time (e.g. ensuring their code doesn't leak) and/or admins running untested apps on their servers.

Most IT departments schedule reboots shortly after Patch Tuesday every month to make sure that their servers can be updated and rebooted as necessary. This is good IT hygiene.
@olePigeon They ALL have had outages - even the sacred cow Google.
So...how much more reliable again is that cloud farm than automated backups onto an external drive every night?

Acronis True Home Image 2011: $4.99 w/MIR

HITACHI XL Desk 2TB USB 2.0 External Hard Drive: $79.99

People really believing Microsoft or Amazon can keep data available over the internet 24/7 like a local device is: Priceless
@jck - Anyone who believes they can keep an in-house Exchange infrastructure running 100% of the time is either spending an enormous sum for the privilege, or is smoking something illegal.

BTW - I don't know why you're promoting backup solutions when the problem being discussed was a networking infrastructure issue. You do have a clue, right?
Unfortunately, the whole idea of the cloud is flawed. When you already have an engineering solution (onsite hardware, onsite software = {the potential for} productive work) why change the formula to (onsite hardware, onsite software, asymmetric digital subscriber line, offsite hardware, offsite software, offsite data = {less - it doesn't matter how much less- potential for} productive work). Now, actually exploiting either potential is the problem at the moment, as it is Friday afternoon and I can't seem to find that catalyst....motivation.
0 Votes
+ -
My company has been using BPOS for about 6 months. We have had several outages during this time. Needless to say, the BPOS performance is unacceptable. Office 365 was our "light at the end of the tunnel". Microsoft's promise of increased reliability with the new (multi-tenant) architecture found in 365 was comforting. Needless to say, learning of this outage with 365 is extremely disheartening. It hasn't helped to see that the same types of hardware failures are afflicting 365. If a piece of hardware such as a router or server goes down, there HAVE to be backups to take on the load. In our last BPOS outage, the "backups" failed to come online. Looks like MS is still trying to figure this solution out. In addition, the health dashboard is apparently still not 100% functional with 365, despite the promise from MS to make it more responsive. Looks like Microsoft is not up to being a world-class datacenter operator like, say, Google.
"faulty Cisco networking gear was the culprit" - wtf? You're supposed to design your network to avoid that. MS - hire yourself a proper network architect.
0 Votes
+ -
@vgrig
They may have had Cisco themselves installing and configuring it, though I have no facts on the matter.
Stick the cloud where the sun don't shine. It has been hacked so many times it is a joke. Stop the cloud now; it does not let you save your files on your computer at all.
I know the root cause of the network problem! They were giving Steve B. a tour of the data center, he tripped over a network cable, pulling it out of its switch, but since it was Steve, everyone who noticed was too scared to say anything, for fear of being fired on the spot for pointing out Steve's clumsiness. The remedy is to keep Steve B. away from any and all computers.
@anothercanuck "The remedy is to keep Steve B. away from any and all computers." Great idea...Permanently!

A new press release will be forthcoming, "We are all in the cloud except when we are not" Signed: Bozo Ballmer
Whether you go to the cloud or remain on premise, outages are a part of any infrastructure, as much as humans falling sick is a fact of life. The difference is in the response: how much detail and/or compensation you get. MS is delivering on their financially-backed SLA by refunding 25% of your next month's bill, while other companies do not even do that.
You ought to be able to withstand complete vaporization of at least 2 geo-diverse data centers simultaneously without loss of more than the last couple of seconds' worth of data, if any. MS really needs to get these pseudo-cloud apps rearchitected and onto Azure. What's the point of having the world's best cloud platform if you're not going to use it?
If it is really due to a Cisco router problem, MS might need to have more influence over a network equipment company, especially since cloud service is about networking and the economy is poor.

Google bought Motorola Mobility, whose business also includes CMTS (cable modem termination systems). Although that is not core routing equipment like Cisco's or Juniper's, they might nevertheless have stumbled onto something.
0 Votes
+ -
100% Refund
wizardb@... 21st Aug
Open Office/Libre Office and when the cloud crashes or your internet goes down no big deal keep on working.
0 Votes
+ -
Horrific.
mrgoose 22nd Aug
As a non-US citizen, I find the thought of trusting my data to any foreign corporation, based 8000km away in a foreign jurisdiction, thoroughly horrific.

Best wishes, G.
This is a large corporation, and it can go down just as easily as the rest. When you're getting services hosted, what happens when your internet is disconnected, or in their case, their internet? Are you able to access your files? Does it affect your everyday business?

This is why I cannot rely on cloud computing. Not only is it not secure enough for me, but losing any type of service, compared with being able to store it directly on my network and computer, makes it worthless. On top of that, I didn't know 365 was a monthly pay service.

Yes, an outage of 3 hours does not seem that bad, but if it's a critical period of 3 hours it can affect a lot. IT is left trying to explain something that Microsoft will not release info about, so when someone calls complaining about service, they are wasting time troubleshooting an issue that they could do nothing to correct anyway.

If MS wants to become a reliable company, they need to make sure their communication with IT is better. When an issue occurs, post the issue; don't hide it until it becomes big and then reveal that there was an issue. I guess this is how they make their money, when people call in for support when it was an MS issue from the start. Thanks for the $250 for the support call; we will have this back up in a couple of hours, but until then let's have Tier 1 walk you through 3 hours of troubleshooting that is irrelevant to the problem.
0 Votes
+ -
It's basic math
scH4MMER 22nd Aug
@sharpear - I can't understand why people argue one way or the other about the Cloud. It is what it is... if you can avoid downtime issues and cover all your data integrity/disaster recovery bases by hiring your own staff and purchasing your own equipment, and if you can do it cheaper than the Cloud, then don't use the cloud.

I've seen enough catastrophes in businesses with their own IT -person- to know that current complaints about Cloud outages amount to a hill of beans for most companies.

That's not to let Microsoft off the hook... they've got to keep improving if they want to win over the bigger companies that can afford to cover their own bases.
How have we come to accept a 2-3 hour outage as not really that bad, as one commenter offers? A 3.36 hour outage a week translates into 98% annual uptime. That's 175 offline hours a year. While a 25% credit is admirable, it's a paltry portion of the total cost a business incurs from such an outage. Not only is this bad, it's preventable. Industry standard technology and best practices are readily available for guaranteed five nines availability and better. This should be the standard of uptime assurance delivered by Microsoft, or by Netflix, United Airlines, Amazon - name your outage. However, as long as consumers believe in the inevitability of downtime and do not challenge it, cloud service providers will have no reason to improve. Customers deserve better performance and better service from their cloud service providers.

Dave Laurello
CEO and Chairman
Stratus Technologies, Inc.
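
The uptime figures in the comment above can be checked with a short Python sketch. It assumes a 168-hour week, a 52-week year and a 365-day year; the function names are hypothetical and used only for illustration.

```python
HOURS_PER_WEEK = 7 * 24
WEEKS_PER_YEAR = 52
MINUTES_PER_YEAR = 365 * 24 * 60


def uptime_percent(outage_hours_per_week):
    """Uptime percentage implied by a recurring weekly outage."""
    return 100.0 * (1.0 - outage_hours_per_week / HOURS_PER_WEEK)


def offline_hours_per_year(outage_hours_per_week):
    """Total hours offline over a 52-week year."""
    return outage_hours_per_week * WEEKS_PER_YEAR


def offline_minutes_allowed_per_year(target_uptime_percent):
    """Yearly downtime budget, in minutes, for a target uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - target_uptime_percent) / 100.0


if __name__ == "__main__":
    print(f"Uptime: {uptime_percent(3.36):.1f}%")
    print(f"Offline hours per year: {offline_hours_per_year(3.36):.1f}")
    print(f"Five nines budget: {offline_minutes_allowed_per_year(99.999):.2f} minutes per year")
```

Under these assumptions, 3.36 hours a week is 2% of the week, which gives the 98% figure and just under 175 offline hours a year. "Five nines" (99.999%) leaves a little over five minutes of downtime for the whole year.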
0 Votes
+ -