Google blames Docs downtime on exposed memory bug

Written by David Meyer, Contributor

Google has explained the downtime that struck its Docs productivity apps last week, blaming it on a failed attempt to improve the products' functionality.

The hour-long outage struck on Wednesday, with Google offering no explanation for the incident at the time. On Friday, the company admitted it had brought the systems down during an upgrade.

"The outage was caused by a change designed to improve real time collaboration within the document list," engineering director Alan Warren wrote in a blog post. "Unfortunately this change exposed a memory management bug which was only evident under heavy usage."

Warren explained that machines look up Google's servers each time a Google Docs document is modified, and that the bug stopped those machines from recycling their memory properly after each lookup. He said this caused them to "eventually run out of memory and restart", during which time other machines had to pick up the load.

The chain reaction "meant that eventually the servers couldn't properly process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday," Warren said.
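The chain reaction Warren describes can be illustrated with a toy simulation. The sketch below is not Google's code: the machine count, load figures, memory limit and restart time are all invented for illustration, and the `Machine` class and `simulate` function are hypothetical names. It only shows how a per-request leak, combined with restarts that push load onto the surviving machines, can end with a large share of requests going unserved.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    leaked: float = 0.0  # memory lost to the leak so far (arbitrary units)
    down_for: int = 0    # ticks of restart time remaining; 0 means serving


def simulate(machines=4, total_load=40, leak_per_request=1,
             memory_limit=100, restart_ticks=3, max_rate=15, ticks=30):
    """Return the fraction of requests dropped at each tick.

    All parameters are illustrative assumptions, not figures from the outage.
    """
    # Stagger the starting leak so the machines do not all fail in the same tick.
    fleet = [Machine(leaked=i * 10.0) for i in range(machines)]
    dropped = []
    for _ in range(ticks):
        live = [m for m in fleet if m.down_for == 0]
        # Machines that are restarting move one tick closer to serving again.
        for m in fleet:
            if m.down_for > 0:
                m.down_for -= 1
        if not live:
            dropped.append(1.0)
            continue
        # The whole load is split across whichever machines are still up.
        share = total_load / len(live)
        served = min(share, max_rate)
        dropped.append((share - served) / share)
        for m in live:
            # The bug: memory is never recycled after a lookup.
            m.leaked += served * leak_per_request
            if m.leaked >= memory_limit:
                m.leaked = 0.0
                m.down_for = restart_ticks
    return dropped


if __name__ == "__main__":
    for tick, fraction in enumerate(simulate()):
        print(f"tick {tick:2d}: {fraction:.0%} of requests dropped")
```

With `leak_per_request=0` the same fleet never drops a request, which matches Warren's point that the bug was only evident under heavy usage: the faster requests arrive, the sooner each machine exhausts its memory and the more load its neighbours inherit.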

Warren pointed out that Google's systems had alerted the company just 60 seconds after the failure rate increased sharply, and the rollback of the upgrade in question began 23 minutes after the first alert. The rollback took 24 minutes, and capacity was restored around five minutes after that.
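Adding up the intervals Warren gave shows how they square with the description of an hour-long outage. The snippet below is simple arithmetic on the figures quoted above; the variable names are illustrative, and the final step uses the "around five minutes" figure as if it were exact.

```python
from datetime import timedelta

# Intervals as reported in Warren's post, measured from the spike in failures.
alert_delay = timedelta(seconds=60)        # monitoring raised the first alert
time_to_rollback = timedelta(minutes=23)   # rollback began after the first alert
rollback_duration = timedelta(minutes=24)  # time taken by the rollback itself
recovery = timedelta(minutes=5)            # approximate: "around five minutes"

total = alert_delay + time_to_rollback + rollback_duration + recovery
print(f"Failure spike to restored capacity: about {total.seconds // 60} minutes")
```

On those figures, roughly 53 minutes passed between the sharp rise in failures and the restoration of capacity.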

"Since resolution, we have been assembling and scrutinising the timeline of this event, and have assembled a list of steps which will both reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope which any single problem can affect," Warren wrote, noting that a full incident report would be released once the investigation was over.

Comments on Warren's post were mostly appreciative, but one commenter, 'bjbraams', asked whether any explanation would be forthcoming regarding a "blackout" that apparently hit the Documents List feature of Google Docs on 26 August.