
Dedupe processing chews your CPU

Written by Manek Dubash, Contributor

There's just too much stuff out there, and it's filling storage devices as fast as they can be upgraded -- 50 percent data growth a year, I'm told, is par for the course. We're talking about standard end-user data, what the storage industry refers to as unstructured data, as opposed to databases. It consists of everything from your last Word document to your machine's operating system: it's all got to be backed up and stored somewhere.

As a result, the storage vendors have come up with a number of cunning plans to reduce that volume of data, the most effective of which seems to be deduplication. This technique works particularly well for virtual machines, which account for a growing proportion of computers. That's because VMs tend to contain copies of the same stuff, so storing only one copy of, for example, Windows XP instead of hundreds saves you terabytes of expensive enterprise storage. There are other techniques too, of course, including compression.
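
To make the idea concrete, here's a minimal Python sketch of block-level deduplication: split the data into fixed-size blocks, fingerprint each one, and store any given block only once. The 4KB block size, SHA-256 fingerprints and in-memory dictionary are purely illustrative assumptions; commercial products use variable-length chunking and far more elaborate indexes.

# A minimal sketch of fixed-block deduplication, assuming 4 KB blocks and
# SHA-256 fingerprints; real products are considerably more sophisticated.
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not a product default

def dedupe(data: bytes):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}    # fingerprint -> block contents (the single stored copy)
    recipe = []   # ordered fingerprints needed to rehydrate the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # store the block only if it hasn't been seen
        recipe.append(fp)
    return store, recipe

def rehydrate(store, recipe) -> bytes:
    """Reassemble the original data from the recipe and the block store."""
    return b"".join(store[fp] for fp in recipe)

if __name__ == "__main__":
    # Two 'VM images' that share most of their contents, as VMs tend to do.
    vm1 = b"A" * 8192 + b"unique-to-vm1"
    vm2 = b"A" * 8192 + b"unique-to-vm2"
    store, recipe = dedupe(vm1 + vm2)
    assert rehydrate(store, recipe) == vm1 + vm2
    print(f"stored {len(store)} unique blocks for {len(recipe)} logical blocks")

Run against two near-identical 'VM images', the shared blocks get stored once, which is exactly where those terabyte savings come from.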

So what's the problem? The first is that you have to be reasonably sure that, when you're storing data for a long-ish period of time -- say five years or more -- the technology needed to rehydrate that deduped, compressed data will still be around. The longer you store it for, the higher the level of reassurance you'll seek. I've heard some users argue at this point that the best strategy is not to dedupe archived data at all but to rehydrate it, push it out to tape, and leave it there.

If it's encrypted, then you'll need a key management system too -- but let's not get into that right now.

The problem is that, whichever way you process your backed-up data, there's an underlying, unspoken assumption at work. Deduplication and compression can chew through lots of CPU, especially if throughput is high, and the assumption is that there'll always be enough processing power to do the job. After a recent conversation with one IT manager, I've started wondering how true that really is.
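
One rough way to test that assumption, rather than take it on faith, is to measure how fast a single core can compress representative data and compare that rate with the backup throughput you actually need. The sketch below does just that; the zlib level, the 32MB sample buffer and the 1GB/s target are illustrative assumptions, not recommendations, and real numbers will vary wildly with the data, the hardware and the compression settings.

# An illustrative (not rigorous) gauge of the CPU side of the assumption:
# time how fast one core compresses a sample buffer, then compare that rate
# with the backup throughput required.
import os
import time
import zlib

# 32 MB sample: half incompressible random bytes, half zeros.
payload = os.urandom(16 * 1024 * 1024) + b"\x00" * (16 * 1024 * 1024)

start = time.perf_counter()
compressed = zlib.compress(payload, level=6)
elapsed = time.perf_counter() - start

mb = len(payload) / (1024 * 1024)
rate = mb / elapsed  # MB/s that a single core manages on this sample
print(f"compressed {mb:.0f} MB to {len(compressed) / (1024 * 1024):.1f} MB "
      f"in {elapsed:.2f} s ({rate:.0f} MB/s on one core)")

# If the backup window demands, say, 1 GB/s of ingest, divide that target by
# the per-core rate to see roughly how many cores compression alone would eat.
target_mb_s = 1024  # hypothetical requirement for illustration
print(f"~{target_mb_s / rate:.1f} cores needed for {target_mb_s} MB/s at this rate")

If the answer comes back as a double-digit number of cores, the "there'll always be enough processing power" assumption starts to look distinctly shaky.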

Not every setup has vast reserves of CPU power. Not every datacentre is organised in such a way that there's always CPU to be reserved at the click of a portal button. And just because the latest servers have 12 cores, it doesn't follow that every datacentre is suddenly ordering them, let alone being full of them, or that the power they offer will be earmarked for as prosaic an application as backup.

So before going for a full-on deduping and compression regime in order to cut storage volumes, it would seem worth doing a full audit to check that any added CPU demand can be met, and that the environmental systems can cope with the added load.

It may well turn out that it's cheaper to outfit with a bunch of new 12-core behemoths than to buy more storage -- which admittedly isn't a viable long-term strategy -- but I'd be interested to hear how you've approached it.
