The pendulum has shifted. We are in an era in which Storage Managers are in the ascendancy, while vendors must shape up to meet customer demands in order to survive the current economic plight. Long gone are the days of disk-happy vendors who could easily shift expensive boxes of FC disks, or Account Managers who boasted of the huge margins made selling skyscraper storage systems to clients facing an uphill struggle against constantly growing storage demands. With responses such as thin/dynamic/virtual provisioning arrays and automated storage tiering, vendors have taken a step towards giving customers solutions that enable them to use more of what they already have, as well as utilise cheaper disks. Another feature now starting to really prick the conscience of vendors, as customers become more savvy, is primary deduplication, or the more aptly termed ‘data reduction’. As this cost-saving surge continues, some vendors have cheekily tried to counteract it with sales pitches for exorbitantly priced Flash SSDs (which promise ten times the performance yet shamelessly sit on the back end of storage systems, dependent on the latency of their BEDs and RAID controllers) as a means to keep margins up. But not the WAFL kings NetApp….
Mention deduplication and you most likely think of backup environments, where redundant data is eliminated, leaving only one copy of the data and an index of the duplicates should they ever be required for restoration. With only the unique data stored, the immediate benefits of deduplication are obvious: a reduction in backup storage capacity, power, space and cooling requirements, as well as in the amount of data sent across the WAN for remote backups, replication and disaster recovery. Not only that, deduplication savings have also shifted the backup paradigm from tape to disk, allowing quicker restores and fewer media handling errors (and yes, I have made no secret of giving kudos to Data Domain in this respect). Shift this concept to primary storage, though, and you have a different proposition with different challenges and advantages.
Primary storage is accessed or written to constantly, necessitating that any deduplication process be fast enough to eliminate any potential overhead or delay to data access. Add to the equation that primary storage contains nowhere near the proportion of duplicate data found in backup data, and you also have a lesser yield in deduplication ratios. Despite this, NetApp have taken primary deduplication by the horns and are offering genuine data reduction, extending beyond the false marketing of archiving and tiering as data reduction techniques when in fact all they do is shove data onto different platforms.
Most vendors on the ‘data reduction’ bandwagon have gone with file-level deduplication, which looks at the file system itself, replacing identical files with one copy and links for the duplicates. There is no requirement for a file to be decompressed or reassembled upon end-user request, as the same data merely has numerous links pointing to it, so the main advantage is that data access should come without any added latency. In real terms, though, this minimalist approach doesn’t produce data reduction ratios that yield anything significant for the user to be particularly excited about.
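To make the file-level approach concrete, here is a toy Python sketch (names and structure are my own illustration, not any vendor’s implementation): files are hashed whole, the first copy of each unique hash is retained, and any identical file is recorded as a link back to that copy. A real filesystem would replace the duplicate with a hard link rather than a dictionary entry.

```python
import hashlib

def file_level_dedupe(files):
    """files: dict mapping file name -> file contents (bytes).

    Returns (store, links): `store` keeps one name per unique content
    hash; `links` maps each duplicate name to the retained copy's name.
    """
    store = {}   # sha256 digest -> name of the single retained copy
    links = {}   # duplicate name -> name of the retained copy
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in store:
            links[name] = store[digest]   # identical file: link only
        else:
            store[digest] = name          # first copy: keep the data
    return store, links
```

Note that a single changed byte gives a file a completely different hash, which is exactly why whole-file matching yields such modest ratios on primary data.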
On the flip side, what is referred to as sub-file-level deduplication takes an approach familiar to those who already use deduplication for their backups. Using hash-based technology, files are first broken into chunks. Each chunk of data is then assigned a unique identification, whereupon chunks with duplicate identifications are replaced with a pointer to the original chunk. Such an approach brings the added advantage of discovering duplicate patterns in random places, regardless of how the data is saved. With the addition of compression, end users can also significantly reduce the size of chunks. Of course this creates a catch-22: deduplication achieves better efficiency with smaller chunks, while compression is more effective with larger chunks. Hence why NetApp have yet to incorporate compression alongside their sub-file-level deduplication. Despite this, NetApp are showing results that, when put in a virtual context, are more than impressive.
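The chunk-and-pointer idea above can be sketched in a few lines of Python. This is a deliberately minimal model, assuming fixed-size chunks and a flat in-memory store; real systems deal with reference counting, persistence and hash-collision safety.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks for this sketch

def chunk_dedupe(data, chunk_size=CHUNK_SIZE):
    """Break a byte stream into chunks, store each unique chunk once,
    and represent the stream as an ordered list of chunk pointers."""
    store = {}      # chunk hash -> chunk bytes (stored exactly once)
    pointers = []   # the file becomes a list of chunk references
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk   # unique chunk: keep the data
        pointers.append(digest)     # duplicates cost a pointer only
    return store, pointers

def reassemble(store, pointers):
    """Follow the pointers to rebuild the original stream."""
    return b"".join(store[p] for p in pointers)
```

For a stream of four chunks where three are identical, only two chunks are physically stored, yet `reassemble` returns the original bytes, which is the whole trick: identical patterns are found wherever they occur, independent of file boundaries.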
One of the first major vendors to incorporate primary data deduplication, NetApp comfortably verify their ‘storage efficiency’ selling tag when put in the context of server and desktop virtualisation. One of the many benefits of VMware (or other server virtualisation platforms) is the ability to rapidly deploy new virtual machines from stored templates. Each of these VM templates includes a configuration file and several virtual disk files. It is these virtual disk files, containing the operating system, common applications and patches or updates, that are duplicated each time a cloned VM is deployed. Imagine now a deployment of 200 like-for-like VMs, apply NetApp’s primary deduplication process so that multiple machines end up sharing the same physical blocks in a FAS system, and you’ve got some serious reduction numbers and storage efficiency.

With reduction results of 75% to 90%, NetApp’s advantage comes from their long-established, snapshot-magic-producing WAFL (Write Anywhere File Layout) technology. With a built-in checksum for each block of data stored, WAFL already has block-based pointers. By running deduplication at scheduled times, all checksums are examined, with the filer doing a block-level comparison of blocks whenever checksums match. If a match is confirmed, one of the WAFL block-based pointers simply replaces the duplicated block. Because the operation is scheduled to run during quiet periods, the performance impact is not that intrusive, giving the NetApp solution significant storage savings, especially when similar operating systems and applications are grouped into the same datastores. Add to the mix that NetApp’s PAM (Performance Acceleration Module) is also dedupe-aware, so common block reads are quickly satisfied from cache, bringing even faster responses by not having to search through every virtual disk file (VMDK).
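To illustrate the scheduled, checksum-then-verify flow described above (and only to illustrate it; WAFL’s internals are far more involved, and every name here is my own), here is a toy Python model: writes always land in fresh physical blocks, and a later dedupe pass groups blocks by a cheap checksum, confirms candidate matches with a full comparison, then redirects duplicate pointers to a single physical block.

```python
import zlib

BLOCK = 4096  # mirrors WAFL's fixed 4KB block size

class BlockStore:
    """Toy model of scheduled block-level deduplication."""

    def __init__(self):
        self.blocks = []     # physical blocks, as written
        self.pointers = []   # logical block -> physical block index

    def write(self, data):
        # Writes are never deduped inline; that work is deferred.
        for i in range(0, len(data), BLOCK):
            self.blocks.append(data[i:i + BLOCK])
            self.pointers.append(len(self.blocks) - 1)

    def dedupe_pass(self):
        """The scheduled job: group by checksum, verify matches with a
        full comparison, and remap duplicate pointers to the first copy."""
        seen = {}   # checksum -> physical indices sharing that checksum
        remap = {}  # duplicate physical index -> retained index
        for idx, blk in enumerate(self.blocks):
            csum = zlib.crc32(blk)
            for cand in seen.setdefault(csum, []):
                if self.blocks[cand] == blk:  # checksum hit confirmed
                    remap[idx] = cand
                    break
            else:
                seen[csum].append(idx)        # genuinely new block
        self.pointers = [remap.get(p, p) for p in self.pointers]

    def physical_in_use(self):
        return len(set(self.pointers))
```

The point of the structure is visible even in the toy: the write path pays nothing, and the comparison work happens only when the pass is run, which is why scheduling it for quiet periods keeps the impact unintrusive.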
NetApp also ‘go further, faster’, so to speak, with their FlexClone technology, which rapidly deploys VM clones that arrive pre-deduplicated.
So while arguments may be raised that NetApp’s sub-file-level deduplication suffers from the physical-layer constraint of WAFL’s 4KB block size, or from their lack of compression, the truth is that they have deliberately avoided such alternatives. Had they opted for sliding-block chunking, where a window is passed along the file stream to seek out more naturally occurring chunk boundaries within the data, or added compression algorithms, the overhead that would come with such additions would render most of the advantages of primary dedupe worthless. Yes, Ocarina and Storwize have appliances that compress and decompress data as it is written and read, but what performance overhead do such technologies incur when hundreds of end users concurrently access the same email attachment? As for Oracle’s Solaris ZFS file system sub-file-level deduplication, which is yet to see the light of day, one wonders how much hot water it will get Oracle into should it turn out to be a direct rip-off of the NetApp model.
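For readers unfamiliar with the sliding-block alternative NetApp avoided, here is a simplified Python sketch of content-defined chunking. A hash is rolled along the stream and a chunk boundary is cut wherever the hash hits a fixed bit pattern, so boundaries follow the content rather than fixed offsets. This uses a naive polynomial hash purely for illustration, not a true Rabin fingerprint, and the constants are arbitrary.

```python
WINDOW = 16           # minimum chunk length for this sketch
MASK = (1 << 11) - 1  # cut when the low 11 bits of the hash are zero

def content_defined_chunks(data):
    """Cut chunk boundaries where a rolling hash matches a pattern,
    so an insertion early in the stream only disturbs nearby chunks."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * 31 + byte) & 0xFFFFFFFF
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])  # boundary found: cut here
            start, h = i + 1, 0
        # a real implementation would also expire bytes leaving the window
    if start < len(data):
        chunks.append(data[start:])           # trailing partial chunk
    return chunks
```

The payoff of this scheme is resilience to shifted data (an inserted byte realigns at the next content-defined boundary), but every one of those hash evaluations is overhead on the write path, which is precisely the cost NetApp’s fixed 4KB block approach sidesteps.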
The bottom line is that as long as the primary deduplication model you employ gives you reduction numbers worth the inevitable overhead, it’s more than a beneficial cost-saving feature. Furthermore, I’m the first to admit that NetApp certainly have their flaws, but when it comes to primary deduplication and the consequent data reduction, they really are making your storage more efficient.