Special Report: G.W. Bush's 103.6 million missing email messages and the IT archiving challenge
Summary: In Part 2 of our 4-part Special Report, our resident presidential scholar David Gewirtz (who wrote the book on White House email) explores why a large part of the story will always be missing from the record books.
In honor of the Presidential Center dedication, ZDNet Government is proud to present Part 2 of our exclusive, 4-part in-depth special report on the George W. Bush Presidential Center and the 200-million-message email archive project.
The conflict between IT challenge and archiving challenge
I was recently asked in a radio interview about whether or not the 200 million message email trove being archived is really that large. That number can be interpreted in different ways. To archivists, 200 million messages is a tremendous number of documents. To most IT professionals, that's a drop in the bucket for a medium-sized enterprise.
There are about 200 million messages that the archivists are dealing with, which is roughly 80 terabytes. That's not a small amount of data. But when you consider that most IT operations dealing with anything resembling Big Data are working with multi-petabyte quantities, it's far from unmanageable.
It's just not really that many bits and bytes. You could actually load all of these messages into RAM and process them in real-time using something like SAP's HANA product. So, from a technical point of view, the Bush message archive isn't exactly a large data structure.
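For a rough sense of scale, the two figures quoted above can be turned into an average message size and a fraction of a petabyte. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10**12 bytes), and the variable and function names are made up for illustration.

```python
# Back-of-the-envelope sizing from the figures quoted in the article.
MESSAGES = 200_000_000        # messages in the official archive
ARCHIVE_BYTES = 80 * 10**12   # "roughly 80 terabytes", assuming 1 TB = 10**12 bytes
PETABYTE = 10**15             # decimal petabyte, same assumption


def average_message_bytes(total_bytes: int, messages: int) -> float:
    """Mean size of one message, attachments included."""
    return total_bytes / messages


avg = average_message_bytes(ARCHIVE_BYTES, MESSAGES)
fraction_of_pb = ARCHIVE_BYTES / PETABYTE

print(f"Average message size: {avg / 1000:.0f} KB")
print(f"Archive as a fraction of one petabyte: {fraction_of_pb:.2f}")
```

Under those assumptions the average works out to about 400 KB per message, and the whole archive is less than a tenth of a single petabyte.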
But, from an archivist's point of view, it's huge because the archivists want to go through every single message and redact anything that is still considered a national security issue or a thorny political issue.
Think about 200 million messages. If you don't explore solving the problem using machine-based analysis, but instead expect individual humans in the National Archives and Records Administration to look at every single email message, it could be the end of time before they finish their work.
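To make the manual-review problem concrete, here is a minimal sketch of the arithmetic. Only the 200 million message count comes from the article. The seconds per message, the number of reviewers, and the working hours per year are all hypothetical inputs chosen for illustration, not NARA figures.

```python
def years_to_review(messages: int,
                    seconds_per_message: float,
                    reviewers: int,
                    hours_per_year: float = 2000.0) -> float:
    """Estimate calendar years for a team to read every message by hand.

    All parameters except `messages` are assumptions supplied by the caller.
    """
    total_hours = messages * seconds_per_message / 3600
    return total_hours / (reviewers * hours_per_year)


# Hypothetical scenario: one minute per message, 100 full-time reviewers.
estimate = years_to_review(200_000_000, seconds_per_message=60, reviewers=100)
print(f"Estimated review time: {estimate:.1f} years")
```

Even with generous made-up staffing numbers like these, the estimate runs well over a decade, which is the point: the bottleneck is human attention, not storage.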
From a technical point of view, managing White House email is really a pretty simple thing. But, from a policy point of view, it's a very difficult thing. In my book and the various speeches I've given on this topic in D.C., I've always made it clear that archiving is a technical process, whereas retrieving what's been archived is a policy process.
In other words, it's up to us techies to make sure the data can be saved. But whether or not anyone gets to see that saved data has to be determined by laws, judges, and — courtesy of the Presidential Records Act — current and former presidents and vice presidents.
Quite obviously, not all email data is constrained by national security. Much of the data stored is also political in nature. That information may be safe for public viewing from a national security perspective, but it is politically charged all the same.
That's where the push and pull has come from with White House email — because of that difference. Of course, the weird thing is that most recent White House generations have claimed that solving the archiving challenge is a technical problem. Clearly that's not the case.
From an IT geek perspective, email archiving is an activity that we do across enterprises every day. But from a "What do we want to show? How do we want to show it? How do we want to control our messaging?" perspective, it's a much bigger problem.
But wait, there's more
Even though the collection of 200 million email messages being archived is a boon for historians, it's far from the whole story.
Because I did so much research into the Bush administration email operation, I'm very well aware that those 200 million messages only represent a portion of the email traffic that went on during the Bush White House. The messages being discussed are only the official emails that went through the EOP (Executive Office of the President) email channels.
President Bush's team operated another email operation, based around the GWB43.com domain name. This operation wasn't run by the White House. Instead, it was run by an ISP located down in Chattanooga, Tennessee. While some conspiracy theorists might think that using GWB43 was a way for the Bushies to get around email requirements, the opposite was actually the truth.
There's a 1939 law, called the Hatch Act, that governs how White House email works. Yep, a law enacted long before anyone had even heard of email controls email in the most important office in the land.
In any case, the Hatch Act restricts government officials from using government resources to conduct political activities. This means any sort of communication about politics, campaigns, political strategy, and so on could not be conducted through official White House channels and was required, by law, to run through outside services, like our friends in Chattanooga.
Because of this, using what then-Deputy Press Secretary Dana Perino called "an abundance of caution," any email message, official or not, that might have had a political tinge was not routed through the EOP email servers, but instead was routed through GWB43.
None of these official emails, the ones that also contained political information, are available for archiving. In Where Have All The Emails Gone, I estimated that 103.6 million messages ran over the open Internet, through GWB43.com. None of these will be turned over to the archivists.
That means that the historical record being turned over to the archivists is missing a full third of the story.
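The "full third" figure follows from the two counts already given: 200 million archived messages and my estimate of 103.6 million that went through GWB43.com. A quick sketch of that calculation, with illustrative names, looks like this:

```python
ARCHIVED = 200_000_000             # official EOP messages going to the archivists
UNARCHIVED_ESTIMATE = 103_600_000  # estimated GWB43.com messages (from the article)


def missing_share(archived: int, unarchived: int) -> float:
    """Fraction of all messages (archived plus unarchived) that is missing."""
    return unarchived / (archived + unarchived)


share = missing_share(ARCHIVED, UNARCHIVED_ESTIMATE)
print(f"Share of total traffic missing from the archive: {share:.1%}")
```

That comes to a little over 34 percent of the combined total, assuming the two pools do not overlap.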
I've always wanted to ensure that this very large (and completely undocumented) collection of political messages is also made available to the public, but the messages may well be lost to time.
Adding to the problem is the fact that many White House staffers had multiple email accounts. For example, then-Deputy Chief of Staff Karl Rove had an account on GWB43.com, the domain used for the political arm of the White House operations. He also had an AOL account.
He would use each of those for different things. As you might imagine, most individuals had their own personal accounts, accounts for their work as political operators, and accounts for their work as public servants.
But let's just forget those hundred million or so political messages. Everyone else certainly has. Let's instead focus on what's involved in processing the 200 million messages that the Bush Presidential Center is willing to make available.
Next week in Part 3 of our Special Report: Hand-processing 200 million emails and how modern analytics techniques could provide innovative new applications for presidential email.
Talkback
Don't forget to address the current administrations...
Hatch Act still exists
Is the Hatch Act to blame?
"the Hatch Act restricts government officials from using government resources to conduct political activities"
Does the Hatch Act (or any other statute) restrict government officials from using resources *outside of the government* to conduct official government business? I believe that Watergate, which brought down the Nixon administration, caused subsequent administrations to be cautious with regard to communications that could be archived.
Please - all his emails would take up about 10 gigs of space.
You must be a Democrat
Does that make you a Republican...
Not being able to compose a 3 sentence e-mail? That's not nearly as bad as
Obama was left with a completely blank face, and a bunch of "uhs, ahhs, ums, huhs".
what about Clinton emails
Oh C'mon
I'm pretty sure you win the argument because you used the word "t-bagger",
We owe so much to Gore, who invented the internet, and re-invented government. He also invented global warming.
Yet, Clinton and Gore are two of the people most to blame for the conditions that this country finds itself in, including the causes for 9/11, and the housing crash which led to so much devastation in the economy.
Besides being two of the most corrupt people in government ever, they are also a couple of the sleaziest people ever.
Would you like a "cigar" to go with your clueless adoration of the Clinton/Gore Sleazydency?
arpanet to internet
Spin or facts? You prefer to call it spin, yet, nothing I stated
The internet didn't take off until the early 1990s, and after the "Gore Bill", and, while there are some that want to give some kind of credit to Gore for ARPANET, or AII, the fact is that, he was just leading from behind, since whatever did come from the early "research" was already happening anyway, and the "information superhighway" was already years into development by the time Gore decided he wanted in on the action.
Gore is nothing but a shyster, and he is willing to lie, steal, and borrow, in order to put his face into the news and to try to profit from whatever comes out of it. Hence, he's also instrumental in getting "global warming" into the headlines, but his main intention, as in other endeavors he got involved in, was to try to get some gain out of it, and in the case of "global warming", he is one of the primary profiteers from sales of carbon credits. He became a multi-millionaire from that scheme, and he also profited from becoming an Apple board member, and purchasing 59,000 shares of Apple stock for a mere $440,000. Gore was and is nothing more than a shyster and a sleazebag, and a liar in the first degree.
National Center for Supercomputing Applications
Bunch of B.S....
Also, HTML, the language of the internet, was already in the works before Gore got involved, and the rest of the internet history would have occurred without Gore or any other high-placed government official.
Read (from Wikipedia):
"In 1980, physicist Tim Berners-Lee, who was a contractor at CERN, proposed and prototyped ENQUIRE, a system for CERN researchers to use and share documents. In 1989, Berners-Lee wrote a memo proposing an Internet-based hypertext system.[2] Berners-Lee specified HTML and wrote the browser and server software in the last part of 1990. In that year, Berners-Lee and CERN data systems engineer Robert Cailliau collaborated on a joint request for funding, but the project was not formally adopted by CERN. In his personal notes[3] from 1990 he listed[4] "some of the many areas in which hypertext is used" and put an encyclopedia first."
http://en.wikipedia.org/wiki/HTML
rules of the road
Far from happy with G.W. Bush..
Why does this need to be 4 parts?
No doubt, the whole 4 parts could be done in 1 or 2 paragraphs,