Understanding, managing data lifecycles

In this issue of Industry Insider, Timothy Smith, our guest columnist from Hitachi Data Systems Australia/New Zealand, discusses the key ingredients of a good data lifecycle management solution. Smith explains why adopting a standards-based archival platform could prove crucial to your business.
Written by Timothy Smith, Contributor
Man has long felt the need to preserve the results of human activity in various physical forms including documents, books, photographs and paintings. The concept of an archive, starting possibly as far back as cave paintings, has evolved in modern times to identify, preserve and share those historical records.
What we have learned about preserving our output is being applied to organisational data in digital form and it has become a difficult problem with data growing at an exponential rate. But solving this problem is critical because this structured and unstructured data has become an integral part of the way we do business.
A recent study by Gartner Group identified the three key business issues concerning CIOs: cost/budget pressures, data security concerns, and the need for faster innovation. These issues have been further complicated by emerging regulations relating to the management of corporate information. You can no longer "just delete" old information, which compounds cost/budget pressures and data security concerns within already stretched IT environments.

The changing nature of our information is also exacerbating these pressures. To confront this situation, organisations first need to understand the nature of their data and its economic value, so that a strategy can be developed in line with the company's business model. Data can be divided into three categories -- structured, semi-structured and unstructured.

The traditional definition of structured data is data organised by the well-defined structure that databases provide. Databases are growing so fast that they are impeding application performance, stretching backup windows and artificially inflating the total cost of operations. Unstructured data typically comprises documents, spreadsheets, graphics, still and motion images (rich media) and a variety of other formats. The semi-structured classification applies to messages, including e-mail and other electronic forms, and often serves to provide an organisational framework for unstructured file attachments. It is estimated that up to 50 percent of the data residing in data centres falls into these last two categories.

Enterprise Storage Group (ESG), a storage-centric research company based in the United States, found there is a sea change occurring in the structure of information, a trend that ESG expects to accelerate over the next several years. Reference Information, defined as a digital asset retained for active reference and value, will become the primary form of information currency. Once organisations harness and leverage it, Reference Information will become enormously valuable. The problem is that, unlike Transactional Information (database information, textual or "flat" information), Reference Information is either unstructured or semi-structured in nature and dispersed across an organisation.

With the growth of e-mail usage and the digitisation of business information, this unstructured data comprising documents, spreadsheets, graphics and rich media is now overshadowing the growth of structured data and creating significant problems for companies grappling with storage capacity. More and more information generated by business activity is outside the structured bounds and retrieval mechanisms of databases, pushing the need to quickly catalogue, search, retrieve and replace this unstructured data out into the storage environment itself.

The importance of e-mail to an organisation and recent world events, such as the Enron scandal and the increased risk of terrorism, have translated into a need not just for more hardware but also for new regulations such as the Sarbanes-Oxley Act and Basel II. This complicates the situation further, as organisations are required to retain e-mail files for a prescribed period of time -- potentially indefinitely -- to comply with these types of regulations. Although there is presently no similar legislation for corporations in Australia, the Corporations Act is likely to be revised in future to address these disclosure requirements. In the meantime, we are beginning to see state government legislation targeting information security and business continuity across departments.


Although enterprises could define policies around e-mail retention to comply with regulatory requirements, this does not address the recurring problem of growing storage requirements. Instead of adding more disks to support storage growth, organisations need to adopt a long-term solution to manage their storage needs.

A key component in solving this problem is storage management archiving, or data lifecycle management, as it is known. This process matches the availability and retrieval time of data with the data's value throughout its useful life. This optimises the overall cost of storage and at the same time prioritises access to information that is either the most important or the most likely to be retrieved.

In adopting this notion of data lifecycle management, organisations need to elevate the efficiency and responsiveness of their storage environment. This need to manage data more efficiently is shared across a wide spectrum of customer segments, and emphasises the need for a general-purpose archival platform that can ensure the long-term preservation of corporate digital assets throughout their lifecycles.

The principles of data lifecycle management are similar to the real-life practice of using storage specialists to store physical documents in warehouses away from high-rent CBD (central business district) locations. Storing documents in these warehouses frees up office space for more "valuable" uses such as meeting rooms.

Data lifecycle management can alleviate the problems caused by the runaway growth of all data types. As an example, instead of arbitrarily imposing mailbox size limits, or restricting or prohibiting the use of attachments, e-mails are archived to a tamper-proof, disaster-proof secondary archival storage environment for a specified period and can be produced on demand in order to meet regulatory and business requirements.

The data storage paradigm within corporate environments is shifting in response to these issues. The archive as a metaphor for storage provides a context for defining the functions needed for an effective data lifecycle management solution. While it is fundamental that IT departments continue to ensure capacity requirements are met for critical applications, there is a further demand to manage digital assets more effectively by moving them to a different class of media based on their current value.

Documents, images and data records are tagged with a unique identifier before they are stored and "filed" in different classes of storage based on their current and projected values. The idea is to take advantage of waning requirements for retrieval time and availability by moving less valuable, less likely to be accessed data to less expensive storage. This requires a system with greater intelligence which can automatically move data within the overall storage environment based on a company's information retention policy.
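The policy-driven movement described above can be sketched in a few lines of Python. This is only an illustration: the tier names, the 30-day and 365-day thresholds, and the `Asset`, `choose_tier` and `apply_policy` names are all assumptions made for the example, and idle time stands in for the broader notion of data value.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical tiers and thresholds, chosen only for illustration.
TIER_RULES = [
    (30, "primary"),     # accessed within the last 30 days
    (365, "secondary"),  # accessed within the last year
]
DEFAULT_TIER = "archive"  # anything idle for longer than a year


@dataclass
class Asset:
    asset_id: str        # unique identifier assigned before storage
    last_accessed: date
    tier: str = "primary"


def choose_tier(asset: Asset, today: date) -> str:
    """Return the first tier whose idle-time window covers the asset."""
    idle_days = (today - asset.last_accessed).days
    for max_idle_days, tier in TIER_RULES:
        if idle_days <= max_idle_days:
            return tier
    return DEFAULT_TIER


def apply_policy(assets: list[Asset], today: date) -> list[tuple[str, str, str]]:
    """Move each asset to its policy tier and report (id, old, new) moves."""
    moves = []
    for asset in assets:
        target = choose_tier(asset, today)
        if target != asset.tier:
            moves.append((asset.asset_id, asset.tier, target))
            asset.tier = target
    return moves
```

Running `apply_policy` on a schedule is what makes the movement automatic; in practice the rules would come from the company's information retention policy rather than from constants in code.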

Securing and preserving data has become just the base level function required of storage. More and more of the information generated by business activity is outside the structured bounds and retrieval mechanisms of databases, pushing the need to quickly catalogue, search for and retrieve this unstructured data out into the storage environment itself. At the same time, solutions must encompass varying classes of storage devices and media arranged in tiers in order to balance the cost of storing any particular data asset with its current value from the time of creation to end-of-life.

The fundamental model for an archive solution needs to span industries and content types. This is true whether the intent is to preserve records in an immutable environment for regulatory purposes or to make your storage more efficient by prioritising retrieval of data that is most valuable and thus most likely to be needed and accessed.

Understanding how a data lifecycle management solution operates is a basic requirement for evaluating the many solutions offered in the market. Solutions should be based on a standards-based archival platform that can satisfy the requirements of e-mail, and a variety of other data types -- structured, semi-structured and unstructured, in both traditional corporate environments and specific vertical industry regulatory environments.

The International Organization for Standardization (ISO) reference model for an open archival information system (OAIS) provides a basic evaluation reference on how the data lifecycle is managed. OAIS is a proven foundation for archival systems. It is specifically applicable to organisations with the responsibility of making information available for the long term.

This reference model addresses the:

  • migration of digital information to new media and forms
  • data models used to represent the information
  • role of software in information preservation
  • exchange of digital information among archives

    OAIS identifies both internal and external interfaces to the archive functions and a number of high-level services at these interfaces. The eight functions of the OAIS model are:

    Preservation Planning: This is the foundation of the model: understanding the business-specific issues related to data and how the data's value varies over its useful lifetime. A combination of consulting and technology ensures the user community's data is preserved as required on the storage infrastructure, providing the availability and retrieval time appropriate to its value at any point throughout its useful life.


    Produce: The produce function is designed to handle the data assets produced by any manner of industry or activity. The archival needs of industries differ greatly -- traditional corporate environments require archival solutions for office documents, whereas healthcare organisations may need to archive multiple records for individual patients.

    Ingest: The ingest function prepares the contents for storage and management within the archive. Some actions include the creation of a digital signature or metadata to uniquely identify the object and to ensure it has not been tampered with during the move. Each ingest process is tailored to a particular kind of data asset being archived.
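As a rough illustration of the ingest step, the Python sketch below derives an identifier from the content and later checks that the content is unchanged. It makes simplifying assumptions: a plain SHA-256 digest stands in for the digital signature (a true signature would also involve a signing key), and the `ingest` and `verify` functions and the metadata field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def ingest(content: bytes, source_name: str) -> dict:
    """Build a metadata record whose ID is derived from the content itself."""
    digest = hashlib.sha256(content).hexdigest()
    return {
        "object_id": digest,  # content-derived unique identifier
        "source_name": source_name,
        "size_bytes": len(content),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(content: bytes, record: dict) -> bool:
    """Return True if the content still matches the digest taken at ingest."""
    return hashlib.sha256(content).hexdigest() == record["object_id"]
```

Because the identifier is computed from the bytes themselves, any change to the object during or after the move produces a different digest and `verify` returns `False`.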

    Data management: The metadata extracted during ingest is maintained in the data management mechanism, a searchable index that allows users to search and find archived data.
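One simple way to picture such a searchable index is an inverted index from metadata terms to object identifiers. The Python sketch below is a toy, in-memory version; the `MetadataIndex` class and its methods are assumptions for illustration, and a production archive would use a database or search engine instead.

```python
from collections import defaultdict


class MetadataIndex:
    """Tiny in-memory inverted index over metadata values."""

    def __init__(self) -> None:
        self._records: dict[str, dict[str, str]] = {}
        self._terms: dict[str, set[str]] = defaultdict(set)

    def add(self, object_id: str, metadata: dict[str, str]) -> None:
        """Store the metadata and index every word in its values."""
        self._records[object_id] = metadata
        for value in metadata.values():
            for term in value.lower().split():
                self._terms[term].add(object_id)

    def search(self, query: str) -> list[str]:
        """Return sorted IDs of objects whose metadata contains every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        matches = set.intersection(*(self._terms.get(t, set()) for t in terms))
        return sorted(matches)
```

A query such as `index.search("budget review")` returns only the objects whose metadata mentions both words, which is the kind of lookup users need to find archived data.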

    Archival storage: This function stores, maintains and retrieves data, manages the storage hierarchy including movement based on changes in data value and provides disaster recovery capabilities.

    Administration: Many of the routine administration functions are provided by automated tools for user/group management. These administrative functions include configuration management of system hardware and software, system engineering functions to monitor and improve archive operations, updating of archival and Hierarchical Storage Management (HSM) policies, and customer support. HSM technology can be used to archive e-mail messages and attachments that are over a certain size, over a certain age or have not been accessed in a certain period of time.
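The three HSM criteria for e-mail mentioned above (size, age and time since last access) might be expressed as a policy object along the following lines. This is a sketch under stated assumptions: the `Message` and `HsmPolicy` classes are hypothetical, and the 5 MB, 90-day and 30-day thresholds are placeholders rather than recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    message_id: str
    size_bytes: int
    received: datetime
    last_accessed: datetime


@dataclass
class HsmPolicy:
    # All thresholds are illustrative defaults, not recommendations.
    max_size_bytes: int = 5 * 1024 * 1024
    max_age: timedelta = timedelta(days=90)
    max_idle: timedelta = timedelta(days=30)

    def should_archive(self, msg: Message, now: datetime) -> bool:
        """Archive when any one of the three criteria is met."""
        return (
            msg.size_bytes > self.max_size_bytes
            or now - msg.received > self.max_age
            or now - msg.last_accessed > self.max_idle
        )
```

Keeping the thresholds in one policy object means an administrator can update the policy without touching the code that moves the messages.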

    Access control: The access control function helps consumers find information, limits access as required (such as enforcing read-only access for mandated retention periods) and delivers query responses to consumers. Using tamper-proof Write Once Read Many (WORM) functionality will ensure storage units comply with data retention regulations.
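The read-only enforcement for mandated retention periods can be illustrated with a small Python model. It is an in-memory stand-in only: real WORM behaviour is enforced by the storage device or media, and the `WormStore` and `RetentionError` names are invented for this sketch.

```python
from datetime import date


class RetentionError(Exception):
    """Raised when a write or delete would violate WORM or retention rules."""


class WormStore:
    """In-memory stand-in for write-once storage with retention periods."""

    def __init__(self) -> None:
        self._objects: dict[str, tuple[bytes, date]] = {}

    def write(self, object_id: str, content: bytes, retain_until: date) -> None:
        """Store an object once; any attempt to overwrite it is refused."""
        if object_id in self._objects:
            raise RetentionError(f"{object_id} already written; overwrite refused")
        self._objects[object_id] = (content, retain_until)

    def read(self, object_id: str) -> bytes:
        """Reads are always allowed."""
        return self._objects[object_id][0]

    def delete(self, object_id: str, today: date) -> None:
        """Allow deletion only once the retention period has passed."""
        _, retain_until = self._objects[object_id]
        if today < retain_until:
            raise RetentionError(f"{object_id} is retained until {retain_until}")
        del self._objects[object_id]
```

The point of the design is that the retention date travels with the object, so the store itself, rather than each application, decides whether a change is permitted.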

    Consume: Consumption is tailored to the intended use of the data assets. This involves integrating the archival system to the application ordinarily used to access the data. While it is important to provide a general interface for archived data retrieval, the real value is added by enabling the application and application user to continue working as always. The standard application interface and access approach should not change whether data is in primary, secondary or tertiary storage.

    Implementing a business-aligned, standards-based archival platform that embraces not only e-mail but a variety of other documents and data types, including structured and unstructured data, is the first step. It is essential that you then establish a practice of regular reviews to adjust for the changing nature of your business, ensuring that your archival platform has not become outdated. The regularity of your review process should be based on the rate of change your business undergoes; a formal review every six to 12 months will typically cover most organisations.

    The review process should at least include:

  • A complete policy review -- ensuring that the archival policies are still in line with your business.
  • A check that data value is still matched to the most cost-effective storage media.
  • A check that data performance and accessibility are not impacting business efficiency.
  • A review of new or amended compliance requirements introduced through Government legislation and corporate watchdogs.

    Understanding the nature and value of your data and managing it effectively throughout its lifecycle will assist significantly in addressing cost/budget pressures, data security concerns, and the need for faster innovation.

    Timothy Smith is marketing manager for Hitachi Data Systems Australia/New Zealand. If you would like to become a ZDNet Australia guest columnist, write in to Fran Foo, Editor of Insight, at fran.foo@zdnet.com.au.
