I found myself thinking about storage deduplication recently, and wondering why only large enterprises can afford the technology. Yet the signs are that it won't be long before it becomes commonplace.
As I mentioned in an earlier blog, deduplication is the process of storing only one copy of a lump of data - be it a file or a block - and when a duplicate of that data needs to be stored, creating a tiny pointer to it instead. Advocates reckon that compression levels of around 10:1 to 20:1 can be routinely achieved.
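The idea is simple enough to sketch in a few lines of code. The toy class below (my own illustration, not how any shipping product works - block size, SHA-256 fingerprints and the in-memory dictionaries are all my assumptions) splits data into fixed-size blocks, stores each unique block exactly once, and records a file as a list of tiny fingerprint "pointers":

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating store: each unique block is kept
    once, and a file is just a list of block fingerprints (pointers)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # fingerprint -> block bytes, stored only once
        self.files = {}    # filename -> list of fingerprints

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            fp = hashlib.sha256(block).hexdigest()
            # Store the block only if this fingerprint is new;
            # a duplicate costs us just the pointer.
            self.blocks.setdefault(fp, block)
            pointers.append(fp)
        self.files[name] = pointers

    def read(self, name):
        # Reassemble a file by following its pointers.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def ratio(self):
        # Logical bytes written versus physical bytes actually stored.
        logical = sum(len(self.blocks[fp])
                      for ptrs in self.files.values() for fp in ptrs)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical
```

Write the same backup set twice - say, last night's and tonight's largely unchanged copy - and `ratio()` climbs accordingly; that is where the 10:1 and 20:1 figures come from.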
The problem is that corporate data is growing at a rate that outstrips enterprises' ability and willingness to pay for storing it. Deduplication was developed as the idea that every storage problem can be solved by adding yet more storage started to become economically unviable. Even so, enterprises use it mostly in backup devices rather than in live, tier-one storage in production environments.
Arguments rage over whether deduplication should take place at the source or the target. Deduping at source saves the network from carrying lots of duplicate traffic but loads up the clients - and because the process is highly processor-intensive, the further you push it towards the source, the more that matters. It means you can't do it on every client: not all of them have the horsepower, and a 10-20 per cent CPU hike during backups can adversely affect the end-user experience.
As a result, many users deduplicate at the target, using dedicated CPU-heavy boxes, often acting as virtual tape libraries - storage that looks to the backup software like tape but in fact consists of disks. You pay for target-level deduping with extra network traffic, but you keep the CPU firepower under control in the datacentre. If your network can stand it, this is the route you're more likely to take.
What about the rest of us? Smaller businesses will have to wait. There are plenty of voices calling for a Linux-based deduping file system - there's even one, lessfs, under development - but it's still in beta, and I wouldn't trust my backups to a beta-level file system. Would you?
But the signs are that Linux will sprout at least one such add-on, and may eventually include the functionality in the kernel. The point is that enterprise-level storage is hugely expensive, because the data needs to be surrounded by management software, redundant components and other subsystems that ensure not a single bit is lost. That is what makes deduping, rather than buying more storage, economically viable.
For the rest of us, a 1TB disk costs well under a hundred quid and looks like the way to go, for now. But deduping could still cut your home or small-business storage requirements by a factor of 10 - though I suspect smaller businesses, with fewer users and so fewer duplicates, won't achieve the 20:1 ratios that larger ones do.
Even so, as more people and businesses of all sizes increasingly use, store, and need to back up space-hungry data such as video and audio, that will be worth having - and it's on its way...