If you Google the term data dictionary, you will find a hundred or more definitions that all mean roughly the same thing. A "data dictionary" is data about data. It defines the items in your database(s), and it is where you look for the definitions, structures, uses, allowable content, and sometimes the business rules associated with the data. It is the roadmap to the data in your database.
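To make that definition concrete, here is a minimal sketch of what a single data dictionary entry might hold. The class name, the field names, and the sample values are all hypothetical, chosen only for illustration; a real dictionary would follow your organization's own standards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """One hypothetical data dictionary entry: data about a single data item."""
    name: str                            # field name as it appears in the database
    definition: str                      # what the field means, in plain language
    data_type: str                       # structure: how the value is stored
    unit: Optional[str] = None           # unit of measure, if any
    allowed_min: Optional[float] = None  # allowable content: lower bound
    allowed_max: Optional[float] = None  # allowable content: upper bound
    business_rule: str = ""              # optional rule attached to the field

    def allows(self, value: float) -> bool:
        """Return True if the value falls within the allowable range."""
        if self.allowed_min is not None and value < self.allowed_min:
            return False
        if self.allowed_max is not None and value > self.allowed_max:
            return False
        return True


# Illustrative entry only; the range and the rule are assumptions, not real standards.
co2 = DictionaryEntry(
    name="co2_level",
    definition="Carbon dioxide concentration in the sampled air",
    data_type="float",
    unit="ppm",
    allowed_min=0.0,
    allowed_max=1_000_000.0,
    business_rule="Hypothetical rule: readings must be reviewed before publication",
)
```

Even a record this small answers the questions that come up when someone new picks up the data: what the field means, what unit it is in, and what values are legal.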
I was recently asked the question: Does the ROI on the data dictionary that my organization religiously maintains justify its cost? I answered with a resounding "Yes!" Then I thought, why would anyone ask that question? After all, aren't the inherent benefits of a well-thought-out and well-maintained data dictionary obvious to all? Perhaps not - since the question is being raised.
Having spent a number of years of my career on the database/applications development side of the house, I find that a data dictionary has significant meaning to me. Not only do I know what it is, but I know how important it is to an organization and how difficult life can be without one when you need to combine data from multiple sources.
Based on the above, you might conclude that a data dictionary:
- Is common sense
- Is important
- Is worth the effort to create and maintain
- Is part of every organization
You may be surprised at how many organizations do not have data dictionaries, do not staff a dedicated data administration/metadata management team, have not updated their dictionaries since the application was first built, or depend on one or more individuals to carry this information about the data in their heads as institutional knowledge.
So what's the big deal with data dictionaries if so many organizations forgo them?
The big deal is the value they bring when you have to share data, whether internally or across organizations. Having one makes life so much easier, while not having one can result in chaos and misinformation.
Imagine this situation: The role of Organization O is to collect information about air quality so that it can make rulings, set standards, and influence legislation concerning the quality of air in the environment. Organization O has three departments. Department A is responsible for monitoring outdoor air quality. Department B is responsible for monitoring air quality in indoor environments, while Department C is responsible for monitoring air quality underground, such as in mines and sewers.
Each department is made up of brilliant scientists who are extremely familiar with the chemistry of air. Not all the scientists come from the same discipline, but all know air. Now these scientists begin to construct spreadsheets and databases to capture and manipulate data about air quality. What do you think the odds are that, in each and every spreadsheet and database that is created, the terminology used to name the fields is the same? Probably moderately high. Now what are the odds that these fields with the same names across databases are defined the same way and capture data in the same way? That's right - pretty low.
Now, try to combine the data from the disparate databases in the three departments into a data warehouse or central repository so that you can do statistics using a larger data set, and boy, do you have trouble (and a lot of work on your hands).
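The mismatch described above can be sketched in a few lines. This is only an illustration under stated assumptions: the department labels come from the scenario, but the field names, units, and definitions are invented, and `find_conflicts` is a hypothetical helper rather than part of any tool.

```python
def find_conflicts(dictionaries):
    """Return fields that share a name across departments but differ in metadata.

    `dictionaries` maps a department label to {field_name: {attribute: value}}.
    """
    # Regroup the metadata by field name so same-named fields sit side by side.
    by_field = {}
    for dept, fields in dictionaries.items():
        for name, meta in fields.items():
            by_field.setdefault(name, {})[dept] = meta

    conflicts = {}
    for name, per_dept in by_field.items():
        if len(per_dept) < 2:
            continue  # the field exists in only one department
        # Two entries agree only if every attribute has the same value.
        distinct = {tuple(sorted(meta.items())) for meta in per_dept.values()}
        if len(distinct) > 1:
            conflicts[name] = per_dept
    return conflicts


# Hypothetical per-department dictionaries; every value here is made up.
departments = {
    "A": {
        "co_level": {"unit": "ppm", "definition": "hourly average, outdoor"},
        "sample_time": {"unit": "UTC", "definition": "time the sample was taken"},
    },
    "B": {
        "co_level": {"unit": "mg/m3", "definition": "8-hour average, indoor"},
        "sample_time": {"unit": "UTC", "definition": "time the sample was taken"},
    },
    "C": {
        "co_level": {"unit": "ppm", "definition": "instantaneous reading, underground"},
        "sample_time": {"unit": "UTC", "definition": "time the sample was taken"},
    },
}

conflicts = find_conflicts(departments)
```

Note that a check like this is only possible once each department has written its definitions down. Without data dictionaries there is nothing to compare, and the conflicts surface later as bad statistics.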
Unfortunately, many of us are not in an organization at the time these databases are first developed, so we can't step in with good data administration practices in time to prevent such messes from happening. In fact, many of us get brought in to deal with the train wreck of disparate data sources under the guise of creating a data warehouse or defining the data to be used in a service as part of a service-oriented architecture (SOA) effort.
The good news is that it is never too late to create and maintain data dictionaries - with standards, administrative policies, and procedures governing data. The bad news is that it takes a considerable amount of time and effort to create them "after the fact," particularly when it is long after the fact and the people who actually know what the data was intended to convey have left the organization.
However, the effort always pays for itself over time. Creating data dictionaries will not boost profits overnight, nor will it suddenly allow you to do more with less or make your company 100 percent more efficient. But by working steadily on the project, you will aid decision-making in your organization by improving the quality of the data and exposing the data to its users (because they now know where it is and what it means), thus putting your organization in a position to build top-quality data warehouses or SOA services.
All of these goals are worthwhile and can add to the bottom line by helping you work smarter and faster. But how do you measure this as ROI? That's difficult, because many of the benefits are hard to measure, are intangible, or have to be measured over a very long period of time. Yet despite all this, it just makes good sense strategically for an organization, and strategic ROI is measured over a period of many years.
So going back to the question that started this article - yes, by golly, it is worth it! Can I measure it in terms of ROI well enough to convince someone stuck on pure numbers that they should give the project a go? Maybe not (lots of intangibles there), but I feel confident that I can argue the point that every penny spent on building and maintaining an accurate data dictionary is well spent. Don't believe me? Ask the poor guy working on the "new" data warehouse for a company that doesn't have a data dictionary. He will tell you how much not having one is costing you.