Metadata holds key to future of storage

Storage needs are continuing to evolve, and metadata has an increasing role to play

There's a significant shift afoot in storage fundamentals, and it's not storage area networks (SANs) or network attached storage (NAS) -- although both will have critical roles in these new fundamentals. The shift involves the facilitating role that metadata will play in abstracting the specifics about data and where it's stored from the applications, end users, and operating systems requiring access to it. 
When vendors discuss metadata-driven storage, the phrase "storage virtualisation" invariably comes up. Vendors will tell you that, depending on what the goals of a particular metadata application are, the benefits of storage virtualisation can range from improvements in retrieval performance to searchability to ease of management to better allowance for heterogeneity at the hardware level (but usually not all of the above).

Return on investment theoretically comes in the form of increased productivity to both end users and those tasked with planning and managing enterprise storage. Storage virtualisation can result in capacity optimisations that bring hardware savings.

Generally speaking, the application of metadata to a technology implies that a richer set of descriptive attributes is replacing something more rudimentary. For example, we hear a lot about metadata in discussions about identity management. In that context, metadata often refers to other attributes that are associated with someone's identity besides their username and password. One such attribute could be a code that describes their purchasing authority.

In the context of storing information, metadata layers promise to turn static, monolithic repositories that house data on the operating system's or application's terms into malleable storage clouds that are more accommodating of the way end users prefer to organise information for retrieval and that have the flexibility IT managers need as their physical hardware needs evolve.

Where have we been hearing about metadata and storage recently? Perhaps the trend most worth watching out for is the consolidation wave that EMC may have kicked off upon announcing it would be acquiring content management solution provider Documentum. If there ever was a category of enterprise applications that delivers the benefits of virtualised storage, it's the document and content management category in which companies like Documentum, Veritas, OpenText, and FileNet have traditionally played. Metadata is one of the key enablers to the way in which document and content managers virtualise storage. From a data management perspective, the benefits of application-derived storage virtualisation are closely aligned with the benefits of the other storage technologies offered by companies like EMC. 

Even prior to the Documentum acquisition, EMC was already in the metadata-driven storage management business. In an interview earlier this year, EMC chief technology officer Mark Lewis described to me the main idea behind the company's Centera family of network attached storage. "[Another] way we store information is as objects -- objects you can think of as unstructured data. Databases consist of structured data, which means relational records that are usually fairly dynamic and that have highly relational characteristics. Unstructured data is a photograph. That's unstructured data, where you're storing a big object with a little bit of information around the object. It's usually what we call fixed content. It's a medical image or an email record or a document that's been scanned in. It's not relational, but you still want it to be a record. We have Content Addressable Storage (CAS), which fits into that market place for storing those records." That little bit of information stored with the big object is -- you guessed it -- metadata.
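
To make the "big object with a little bit of information around it" idea concrete, here is a minimal sketch of a content-addressed store. It is not a description of how Centera works internally: the class name, the method names, and the choice of SHA-256 as the addressing hash are all assumptions made for illustration.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store (hypothetical): objects are addressed by a hash
    of their content, and a small metadata record travels with each one."""

    def __init__(self):
        self._objects = {}   # address -> raw bytes (the "big object")
        self._metadata = {}  # address -> descriptive attributes

    def put(self, content: bytes, **metadata) -> str:
        # Assumption: SHA-256 stands in for whatever addressing scheme
        # a real product uses.
        address = hashlib.sha256(content).hexdigest()
        # Fixed content: identical bytes always map to the same address,
        # so storing the same object twice keeps a single copy.
        self._objects.setdefault(address, content)
        self._metadata.setdefault(address, {}).update(metadata)
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def describe(self, address: str) -> dict:
        # Return a copy so callers cannot alter the stored record.
        return dict(self._metadata[address])

    def find(self, **criteria) -> list:
        # Search by metadata instead of by file name or location.
        return [
            address
            for address, attributes in self._metadata.items()
            if all(attributes.get(k) == v for k, v in criteria.items())
        ]
```

A caller would store a scanned document with something like store.put(scan_bytes, kind="medical image") and later retrieve it either by the returned address or by searching on the attributes, without ever knowing where the bytes physically live.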

Storage companies like EMC and content management companies like Documentum are a natural fit, says Steve Weissman, president of the industry analyst and management-consulting firm Kinetic Information. "Broadly speaking, content management focuses on organising and facilitating access to information once it is collected; storage systems generally protect that information and ensure its availability when needed. Since both usually make extensive use of metadata to speed the sharing and retrieval process along, they therefore can play quite well together. But to be maximally effective, they need to be constructed and utilised in concert, not as individual pieces of infrastructure."

While the EMC move for Documentum may spark a wave of consolidation in the storage virtualisation area, there's one likely-to-be-metadata-driven blip on the radar -- WinFS, a new Microsoft file system that's expected to be a part of the next version of Windows (code-named Longhorn). Microsoft has been leaking bits of news about WinFS, the latest of which cleared up some confusion as to whether Longhorn would include support for NTFS, the file system supported in the various derivatives of Windows NT including NT itself, 2000, and XP.

Microsoft may not be using the word "metadata", but it's evident from the information Microsoft is sharing that metadata will play a role in WinFS. The dead giveaway is one of WinFS' lynchpins: the next version of Microsoft's SQL Server relational database (code-named Yukon). Microsoft has already said that the querying capabilities of SQL Server are a key building block of WinFS. Microsoft senior vice president Bob Muglia recently told CNET that WinFS also would incorporate the data labelling capabilities of Extensible Markup Language (XML). Said Muglia in that interview: "Think of WinFS as pulling together relational database technology, XML database technology, and file streaming that a file system has. It's a [storage] format that is agnostic, that is independent of the application."

Perhaps demonstrating the sort of versatility that a metadata layer can introduce into the storage food chain, IBM's recently introduced TotalStorage SAN File System (also known as Storage Tank) goes off on a completely different metadata vector. The idea behind Storage Tank is to virtualise storage, but not in the way you might think. Applications, operating systems and end users are indeed divorced from storage specifics. But instead of the perspective being one of content management, the perspective is storage management (although it's still capable of traditional content management).

IBM claims that the Storage Tank project will reach its shining moment when enterprises can take most or all of their current storage area networks (regardless of vendor or location) and merge them into a cloud (or "tank") of storage with a uniform interface that services all users, applications, and operating systems in utility-like fashion. What makes Storage Tank tick? Metadata.

According to a recent IBM press release, "IBM Research-designed software keeps track of descriptive information -- 'metadata' such as physical locations, file sizes or access permissions -- that accompanies the actual content within the files. Where most storage systems include this metadata in the storage system itself, Storage Tank spreads the information across servers on the network -- with the IBM software precisely monitoring the location of the metadata." In other words, to enable its distributed nature and its eventual ability to assemble a cloud from heterogeneous parts (not delivered yet), Storage Tank depends on system-level (as opposed to content-level) metadata information. The resulting infrastructure, says IBM, bears the on-demand characteristics of utility computing: no entity will run out of capacity, while the total cost of ownership is kept to a minimum since IT managers have a single point of management and don't have to overbuild silos of storage to accommodate the individual growth needs of each of those entities.
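
The general pattern of keeping system-level metadata apart from the data it describes can be sketched in a few lines. What follows is an illustration of that pattern only, not IBM's design: the catalogue class, its methods, and the device names are hypothetical, and the three tracked attributes are simply the ones the press release mentions (physical location, file size, access permissions).

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    device: str          # physical location: which device holds the bytes
    size: int            # file size in bytes
    readers: frozenset   # access permissions: users allowed to read


class MetadataCatalogue:
    """Hypothetical metadata service kept separate from the storage devices."""

    def __init__(self):
        self._records = {}  # logical path -> FileRecord

    def register(self, path, device, size, readers):
        self._records[path] = FileRecord(device, size, frozenset(readers))

    def locate(self, path, user):
        # Clients ask the catalogue where a file lives; they never
        # hard-code a device themselves.
        record = self._records[path]
        if user not in record.readers:
            raise PermissionError(f"{user} may not read {path}")
        return record.device

    def migrate(self, path, new_device):
        # Moving data to new hardware changes only the metadata record;
        # the logical path that users and applications see stays the same.
        self._records[path].device = new_device
```

The design choice worth noticing is in migrate: because every lookup goes through the catalogue, an administrator can move a file from one array to another without any application noticing. It also shows why the catalogue itself has to be highly available, since nothing can be found without it.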

As you can imagine, with a Storage Tank-like architecture, reliability is essential. An organisation could become paralysed if the metadata somehow becomes inaccessible. To guarantee its availability, Storage Tank relies on a cluster of Intel/Linux metadata servers (the minimum configuration consists of two servers and costs $90,000). As with most clustering technologies, redundancy of the metadata database is a part of the message behind Storage Tank's clustering technology, which can grow to as many as eight systems.

In fact, databases play a key role in metadata-driven systems. In an effort to expeditiously find what it's looking for, any storage infrastructure that depends on more than some basic information will require a robust, super-fast, secure and fault-tolerant database technology. If not done correctly, a layer of metadata (and the database that goes with it) can do more harm than good. Recall that, for the most part, metadata-driven virtualisation of storage takes the place of scenarios where users, applications and operating systems were more hardwired to storage -- scenarios where few compromises in performance and availability are made. But once layers of abstraction -- essentially translation layers -- are inserted into those scenarios, the potential for things to go awry increases. This embedding of database technology into the storage infrastructure is where the rocket scientists get involved, and the degree to which the rocket scientists succeed at dealing with the compromises introduced by additional layers of technology is what will separate the true winners from the rest of the pack.

For example, Microsoft has already made it clear that its forthcoming Yukon relational database technology is a key building block to WinFS. Although we most often hear about WinFS during discussions of Longhorn, Microsoft's ambitions for WinFS won't stop at the desktop. When you consider any of these storage virtualisation technologies and the voluminous amounts of metadata they will have to keep track of, it's not surprising, given the advertised advancements of its next-generation database technology (high availability, additional backup and restore capabilities, replication enhancements, and secure by default), that Microsoft's WinFS is waiting for Yukon. Likewise, neither EMC's content addressable storage nor IBM's Storage Tank technologies would be able to step on the field if their metadata databases weren't deeply embedded into the infrastructure and didn't bear some of the same design goals that Microsoft has for Yukon.

Embedding databases into the storage food chain isn't new. Much of what's being done today in the area of storage virtualisation resembles the architecture of IBM's 1970s-class System 38. The System 38 technology eventually evolved into the AS/400, which in turn evolved into IBM's current iSeries midrange systems. According to IBM iSeries senior technical staff member Amit Dave, "Storage virtualisation was a critical design criteria of the System 38, and integrating a database directly into the operating system played a pivotal role in achieving that objective." The benefits, according to Dave, were clear: "The idea was to eliminate all notions of a disk drive and to relieve the users of any concerns about data placement or storage management. Instead, users only had to concern themselves with creating a data template and the system would take care of the rest. The System 38 automatically pooled disk drives so that they appeared to the application as one virtual memory store. Applications, and therefore users, had no knowledge of what or how many disk drives a system had, so they didn't have to know how to store the data, how to allocate space, or how to spread that data across volumes."
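
The pooling Dave describes can be illustrated with a small sketch. It makes no claim about how the System 38 actually allocated space: the class, its first-fit placement policy, and the notion of abstract "units" of capacity are assumptions chosen to keep the example short.

```python
class StoragePool:
    """Hypothetical pool that presents several disks as one store."""

    def __init__(self, disk_capacities):
        self._free = list(disk_capacities)  # free units remaining per disk
        self._extents = {}                  # name -> [(disk index, units)]

    @property
    def free(self):
        # Callers see one number for the whole pool, never per-disk figures.
        return sum(self._free)

    def store(self, name, units):
        if name in self._extents:
            raise KeyError(f"{name} is already stored")
        if units > self.free:
            raise ValueError("not enough free space in the pool")
        placement = []
        remaining = units
        # Assumption: fill disks in order, spilling over to the next disk
        # when one runs out. The caller never chooses a disk.
        for disk, available in enumerate(self._free):
            if remaining == 0:
                break
            take = min(available, remaining)
            if take:
                self._free[disk] -= take
                placement.append((disk, take))
                remaining -= take
        self._extents[name] = placement
```

An application simply calls pool.store("orders", 6). Whether those six units land on one disk or are split across two is recorded in the pool's own metadata and is invisible to the caller, which is the point of the design.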

Compared to the more horizontally focused applications of today's storage virtualisation technologies, the System 38 focused primarily on one type of application (databases). But the design goals, and the ultimate benefits to the end user, are nearly identical. Without the metadata layer, according to Dave, much of it wouldn't have been possible. Says Dave, "Over the long haul, the concept of metadata can lead storage in any number of directions. The opportunities are enormous and endless."