In honor of the Presidential Center dedication, ZDNet Government is proud to present Part 2 of our exclusive, 4-part in-depth special report on the George W. Bush Presidential Center and the 200 million email archive project.
I was recently asked in a radio interview whether the 200 million message email trove being archived is really that large. That number can be interpreted in different ways. To archivists, 200 million messages is a tremendous number of documents. To most IT professionals, that's a drop in the bucket for a medium-sized enterprise.
The archivists are dealing with about 200 million messages, which comes to roughly 80 terabytes. That's not a small amount of data. But when you consider that most IT operations dealing with anything resembling Big Data are working with multi-petabyte quantities, it's far from unmanageable.
It's just not really that many bits and bytes. You could actually load all of these messages into RAM and process them in real-time using something like SAP's HANA product. So, from a technical point of view, the Bush message archive isn't exactly a large data structure.
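To put a rough number on that, here's a back-of-the-envelope sketch in Python. It takes the round figures above at face value and assumes decimal terabytes (1 TB = 10^12 bytes); the function name is purely illustrative.

```python
# Back-of-the-envelope: average size per archived message.
# Assumptions: decimal units (1 TB = 10**12 bytes) and the article's
# round figures of 80 TB and 200 million messages.
TOTAL_BYTES = 80 * 10**12
MESSAGE_COUNT = 200_000_000


def average_message_bytes(total_bytes: int, message_count: int) -> float:
    """Return the mean number of bytes per message."""
    return total_bytes / message_count


avg = average_message_bytes(TOTAL_BYTES, MESSAGE_COUNT)
print(f"Average message size: {avg / 1000:.0f} KB")  # prints "Average message size: 400 KB"
```

That works out to roughly 400 KB per message on average, a figure that presumably includes attachments and other overhead rather than message text alone.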
But, from an archivist's point of view, it's huge because the archivists want to go through every single message and redact anything that is still considered a national security issue or a thorny political issue.
Think about 200 million messages. If you don't explore solving the problem using machine-based analysis, but instead expect individual humans in the National Archives and Records Administration to look at every single email message, it could be the end of time before they finish their work.
From a technical point of view, managing White House email is really a pretty simple thing. But, from a policy point of view, it's a very difficult thing. In my book and the various speeches I've given on this topic in D.C., I've always made it clear that archiving is a technical process, whereas retrieving what's been archived is a policy process.
In other words, it's up to us techies to make sure the data can be saved. But whether or not anyone gets to see that saved data has to be determined by laws, judges, and — courtesy of the Presidential Records Act — current and former presidents and vice presidents.
Quite obviously, not all email data is constrained by national security. Much of the data stored is also political in nature. That information may be suitable for safe public viewing from a national security perspective, but politically charged all the same.
That difference is where the push and pull over White House email has come from. Of course, the weird thing is that most recent White House administrations have claimed that solving the archiving challenge is a technical problem. Clearly, that's not the case.
From an IT geek perspective, email archiving is an activity that we do across enterprises every day. But from a "What do we want to show? How do we want to show it? How do we want to control our messaging?" perspective, it's a much bigger problem.
Even though the collection of 200 million email messages being archived is a boon for historians, it's far from the whole story.
Because I did so much research into the Bush administration email operation, I'm very well aware that those 200 million messages only represent a portion of the email traffic that went on during the Bush White House. The messages being discussed are only the official emails that went through the EOP (Executive Office of the President) email channels.
President Bush's team operated another email operation, based around the GWB43.com domain name. This operation wasn't run by the White House. Instead, it was run by an ISP located down in Chattanooga, Tennessee. While some conspiracy theorists might think that using GWB43 was a way for the Bushies to get around email requirements, the opposite was actually the truth.
There's a 1939 law, called the Hatch Act, that governs how White House email works. Yep, a law enacted long before anyone had even heard of email controls email in the most important office in the land.
In any case, the Hatch Act restricts government officials from using government resources to conduct political activities. This means any sort of communication about politics, campaigns, political strategy, and so on could not be conducted through official White House channels and was required, by law, to run through outside services, like our friends in Chattanooga.
Because of this, using what then-Deputy Press Secretary Dana Perino called "an abundance of caution," any email message, official or not, that might have had a political tinge was routed through GWB43 rather than through the EOP email servers.
None of these official emails, the ones that also contained political information, are available for archiving. In Where Have All The Emails Gone, I estimated that 103.6 million messages ran over the open Internet, through GWB43.com. None of these will be turned over to the archivists.
That means that the historical record being turned over to the archivists is missing a full third of the story.
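For anyone who wants to check the "full third" figure, here's a quick sketch in Python. It assumes the 200 million official messages and my estimated 103.6 million GWB43.com messages don't overlap and together make up the whole of the traffic; the function name is just for illustration.

```python
# Share of total email traffic that ran through GWB43.com.
# Figures from the article: about 200 million official (EOP) messages and
# an estimated 103.6 million messages via GWB43.com.
# Assumption: the two sets do not overlap and together form the total.
OFFICIAL_MESSAGES = 200.0e6
GWB43_MESSAGES = 103.6e6


def missing_share(archived: float, missing: float) -> float:
    """Return the fraction of all messages that is absent from the archive."""
    return missing / (archived + missing)


share = missing_share(OFFICIAL_MESSAGES, GWB43_MESSAGES)
print(f"Share missing from the archive: {share:.1%}")  # prints "Share missing from the archive: 34.1%"
```

Under those assumptions, a little over 34 percent of the traffic never reaches the archivists, which is where "a full third" comes from.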
I've always wanted to ensure that this very large (and completely undocumented) collection of political messages is also made available to the public, but it may well be lost to time.
Adding to the problem is the fact that many White House staffers had multiple email accounts. For example, then-Deputy Chief of Staff Karl Rove had an account on GWB43.com, the domain used for the political arm of White House operations. He also had an AOL account.
He would use each of those for different things. As you might imagine, most individuals had their own personal accounts, accounts for their work as political operators, and accounts for their work as public servants.
But let's just forget those hundred million or so political messages. Everyone else certainly has. Let's instead focus on what's involved in processing the 200 million messages that the Bush Presidential Center is willing to make available.
Next week in Part 3 of our Special Report: Hand-processing 200 million emails and how modern analytics techniques could provide innovative new applications for presidential email.