Dropbox not hacked, just stupid

Summary: Far from suffering a hacking incident, the file hosting service fell victim to its update scripts and MySQL infrastructure.

Despite reports over the weekend claiming that Dropbox was hacked, the company has today released a post-mortem of its downtime, laying the blame on its own update scripts and database infrastructure.

The company said that the weekend's outage was due to a bug in an update script, which caused it to reinstall a number of machines carrying production traffic for photo sharing, camera uploads, and some APIs.

"On Friday at 5.30pm PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines. During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS," wrote Dropbox head of infrastructure Akhil Gupta.

"A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted, which resulted in the site going down.

"Your files were never at risk during the outage."

Gupta said that Dropbox was able to recover from backups and restore "most functionality" within three hours.

However, the company was not able to restore full functionality until Sunday afternoon, Pacific Time, due to the large size of the MySQL databases it uses.

Seemingly shocked by how long the standard tooling took to restore from MySQL backups, Dropbox said it has developed a tool that speeds up restoration by parallelising the replay of binary logs. The company plans to open source the tool in future.
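Dropbox has not yet published the tool, so its design is unknown. As a rough illustration of the general idea only, the sketch below replays binary logs for several schemas at the same time, assuming the schemas are independent enough that their binlog streams can be applied separately; the database names, binlog file names, option-file path, and worker count are all made up for the example.

    # parallel_binlog_replay.py -- illustrative sketch, not Dropbox's actual tool.
    # Replays MySQL binary logs for several databases in parallel, one worker per
    # database, by piping the output of mysqlbinlog into the mysql client.
    # Database names, binlog paths, and connection details are assumptions.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    BINLOGS = ["binlog.000101", "binlog.000102"]      # logs to replay, in order
    DATABASES = ["photos", "camera_uploads", "api"]   # assumed-independent schemas

    def replay_database(db):
        """Replay all binlogs for a single database, serially and in order."""
        # --database filters events to one schema; this split is only safe if
        # transactions never span schemas, which is assumed here.
        filter_cmd = ["mysqlbinlog", f"--database={db}"] + BINLOGS
        apply_cmd = ["mysql", "--defaults-file=/etc/mysql/restore.cnf", db]

        producer = subprocess.Popen(filter_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.run(apply_cmd, stdin=producer.stdout)
        producer.wait()
        if producer.returncode or consumer.returncode:
            raise RuntimeError(f"replay failed for {db}")
        return db

    if __name__ == "__main__":
        # Each database gets its own worker, so total restore time is bounded by
        # the largest schema rather than the sum of all of them.
        with ThreadPoolExecutor(max_workers=len(DATABASES)) as pool:
            for done in pool.map(replay_database, DATABASES):
                print(f"finished {done}")

Splitting the replay by database is only one way to parallelise; whatever approach Dropbox's tool takes, the speed-up comes from the same observation that a single serial replay of large binlogs is the bottleneck.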

To prevent the update script from reinstalling active machines in Dropbox's master and dual-slave database infrastructure, Gupta said that active machines will now be able to refuse such commands.

"Over the past few years, our infrastructure has grown rapidly to support hundreds of millions of users. We routinely upgrade and repurpose our machines. When doing so, we run scripts that remotely verify the production state of each machine," Gupta said.

"We've since added an additional layer of checks that require machines to locally verify their state before executing incoming commands. This enables machines that self-identify as running critical processes to refuse potentially destructive operations."

1775 Sec, the group claiming to be behind the supposed hacking of Dropbox, later said that it was all a hoax.

Topics: Cloud, Data Centers, Data Management, Security

About

Chris started his journalistic adventure in 2006 as the Editor of Builder AU after originally joining CBS as a programmer. After a Canadian sojourn, he returned in 2011 as the Editor of TechRepublic Australia, and is now the Australian Editor of ZDNet.

Talkback

3 comments
  • ooops!

    Hi :)
    Dohhh!

    Good to hear a new tool is going to be open-sourced so that any 'deliberate mistakes' can be spotted by anyone keen to see what all the fuss was about! :)
    Regards from
    Tom :)
    Tom6
  • A familiar story

    It's certainly not the first time I've heard (or experienced) a story about an organization's computers having problems that in retrospect turn out to have been directly related to some sort of system update or modification.

    Log: (programmer x) has modified current and active system parameters.

    Programmer x: What happened?!? Were we hacked? Do we need to call the hardware vendor?
    dqkennard
  • LVM snapshots for MySQL backups

    They should use LVM snapshots to store their MySQL backups. As Peter Zaitsev from Percona mentioned years ago (2006), it's almost as good as a hot backup. It's blazing fast to recover from, and the only requirement, flushing tables before taking the snapshot, is a small price to pay for faster recovery time (a rough sketch of that sequence follows after this comment).

    That said, I'm not surprised to see this kind of incident happening; people still aren't getting that this kind of move/change needs to be tested first. Basically, perform the upgrade/change on a test environment that is an exact replica of production (very easy these days with virtualisation), then learn about everything that's going to go wrong before it happens.

    XD
    Mr Critical
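For readers curious about the flush-and-snapshot sequence mentioned in the comment above, a minimal sketch follows: hold a database connection open, flush and lock the tables, take the LVM snapshot, then unlock. The volume name, snapshot size, credentials, and the use of the pymysql driver are all assumptions for illustration, not a production backup script.

    # lvm_mysql_snapshot.py -- rough sketch of the flush-and-snapshot idea from
    # the comment above; not a production backup script.  Volume names, sizes,
    # and credentials are illustrative assumptions.

    import subprocess
    import pymysql  # third-party driver, assumed installed (pip install pymysql)

    def snapshot_mysql(volume="/dev/vg0/mysql", snap_name="mysql-backup", size="10G"):
        conn = pymysql.connect(host="localhost", user="backup", password="secret")
        try:
            with conn.cursor() as cur:
                # Quiesce MySQL: flush tables to disk and block writes.  The lock
                # lasts only while this connection stays open, which is why the
                # snapshot is taken before unlocking.
                cur.execute("FLUSH TABLES WITH READ LOCK")
                subprocess.run(
                    ["lvcreate", "--snapshot", "--size", size,
                     "--name", snap_name, volume],
                    check=True,
                )
                cur.execute("UNLOCK TABLES")
        finally:
            conn.close()

    if __name__ == "__main__":
        snapshot_mysql()
        print("snapshot created; mount it read-only and copy the files off-host")

Because the snapshot is taken while writes are blocked, the copy on the snapshot volume is consistent, and the lock is held only for the moment it takes lvcreate to complete.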