Special Report: G.W. Bush's 103.6 million missing email messages and the IT archiving challenge

Summary: In Part 2 of our 4-part Special Report, our resident presidential scholar David Gewirtz (who wrote the book on White House email) explores why a large part of the story will always be missing from the record books.

TOPICS: Government, Storage

In honor of the Presidential Center dedication, ZDNet Government is proud to present Part 2 of our exclusive, 4-part in-depth special report on the George W. Bush Presidential Center and the 200 million email archive project.

The conflict between IT challenge and archiving challenge

I was recently asked in a radio interview about whether or not the 200 million message email trove being archived is really that large. That number can be interpreted in different ways. To archivists, 200 million messages is a tremendous number of documents. To most IT professionals, that's a drop in the bucket for a medium-sized enterprise.

There are about 200 million messages that the archivists are dealing with, which is roughly 80 terabytes. That's not a small amount of data. But when you consider that most IT operations dealing with anything resembling Big Data are looking in the multi-petabyte quantities, it's far from unmanageable.
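To put those two figures side by side, here is a quick back-of-the-envelope calculation of the average size per message. It is only a sketch: it assumes decimal terabytes (10^12 bytes) and treats both numbers as the round approximations quoted above. The variable names are illustrative.

```python
# Back-of-the-envelope: average size of one archived message.
# Assumption: "terabyte" means 10**12 bytes (decimal), and both
# figures are the approximate ones quoted in the text.
TOTAL_MESSAGES = 200_000_000      # about 200 million messages
TOTAL_BYTES = 80 * 10**12         # roughly 80 terabytes

avg_bytes_per_message = TOTAL_BYTES / TOTAL_MESSAGES
avg_kb_per_message = avg_bytes_per_message / 1_000

print(f"Average message size: {avg_kb_per_message:.0f} KB")
```

Under those assumptions, the archive averages roughly 400 KB per message.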

It's just not really that many bits and bytes. You could actually load all of these messages into RAM and process them in real-time using something like SAP's HANA product. So, from a technical point of view, the Bush message archive isn't exactly a large data structure.

But, from an archivist's point of view, it's huge because the archivists want to go through every single message and redact anything that is still considered a national security issue or a thorny political issue.

Think about 200 million messages. If you don't explore solving the problem using machine-based analysis, but instead expect individual humans in the National Archives and Records Administration to look at every single email message, it could be the end of time before they finish their work.
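A rough sense of the scale of a purely manual review can be sketched in a few lines. The staffing numbers here are hypothetical assumptions chosen only for illustration (one minute per message, 100 full-time reviewers, 2,000 working hours a year). None of them comes from the National Archives, and only the message count comes from the text above.

```python
# Rough estimate of how long a purely manual review might take.
# Every staffing number below is a hypothetical assumption for
# illustration; only the message count comes from the article.
TOTAL_MESSAGES = 200_000_000       # from the article
SECONDS_PER_MESSAGE = 60           # assumed: one minute per message
REVIEWERS = 100                    # assumed: full-time reviewers
WORK_HOURS_PER_YEAR = 2_000        # assumed: about 40 hours x 50 weeks

total_hours = TOTAL_MESSAGES * SECONDS_PER_MESSAGE / 3_600
years_needed = total_hours / (REVIEWERS * WORK_HOURS_PER_YEAR)

print(f"Estimated review time: {years_needed:.1f} years")
```

Change any of the assumed values and the answer moves accordingly, but with these inputs the review would run well over a decade.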

From a technical point of view, managing White House email is really a pretty simple thing. But, from a policy point of view, it's a very difficult thing. In my book and the various speeches I've given on this topic in D.C., I've always made it clear that archiving is a technical process, whereas retrieving what's been archived is a policy process.

In other words, it's up to us techies to make sure the data can be saved. But whether or not anyone gets to see that saved data has to be determined by laws, judges, and — courtesy of the Presidential Records Act — current and former presidents and vice presidents.

Quite obviously, not all email data is constrained by national security. Much of the data stored is also political in nature. That information may be safe for public viewing from a national security perspective, but it is politically charged all the same.

That difference is where the push and pull over White House email has come from. Of course, the weird thing is that most recent White House administrations have claimed that solving the archiving challenge is a technical problem. Clearly that's not the case.

From an IT geek perspective, email archiving is an activity that we do across enterprises every day. But from a "What do we want to show? How do we want to show it? How do we want to control our messaging?" perspective, it's a much bigger problem.

But wait, there's more

Even though the collection of 200 million email messages being archived is a boon for historians, it's far from the whole story.

Because I did so much research into the Bush administration email operation, I'm very well aware that those 200 million messages only represent a portion of the email traffic that went on during the Bush White House. The messages being discussed are only the official emails that went through the EOP (Executive Office of the President) email channels.

President Bush's team operated another email operation, based around the GWB43.com domain name. This operation wasn't run by the White House. Instead, it was run by an ISP located down in Chattanooga, Tennessee. While some conspiracy theorists might think that using GWB43 was a way for the Bushies to get around email requirements, the opposite was actually the truth.

There's a 1939 law, called the Hatch Act, that governs how White House email works. Yep, a law enacted long before anyone had even heard of email controls how email is handled in the most important office in the land.

In any case, the Hatch Act restricts government officials from using government resources to conduct political activities. This means any sort of communication about politics, campaigns, political strategy, and so on could not be conducted through official White House channels and was required, by law, to run through outside services, like our friends in Chattanooga.

Because of this, using what then-Deputy Press Secretary Dana Perino called "an abundance of caution," any email message, official or not, that might have had a political tinge was routed not through the EOP email servers but through GWB43.

None of these official emails, the ones that also contained political information, are available for archiving. In Where Have All The Emails Gone, I estimated that 103.6 million messages ran over the open Internet, through GWB43.com. None of these will be turned over to the archivists.

That means that the historical record being turned over to the archivists is missing a full third of the story.
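The "full third" figure follows directly from the two message counts given above. As a quick check, assuming the two sets of messages don't overlap and taking both counts at face value:

```python
# Share of total estimated traffic that ran through GWB43.com.
# Both counts are the article's own approximate figures, and the
# two sets are assumed not to overlap.
ARCHIVED_MESSAGES = 200_000_000     # official EOP messages
UNARCHIVED_MESSAGES = 103_600_000   # author's GWB43.com estimate

total = ARCHIVED_MESSAGES + UNARCHIVED_MESSAGES
missing_share = UNARCHIVED_MESSAGES / total

print(f"Missing share: {missing_share:.1%}")
```

That comes to about 34 percent of the combined traffic, or just over a third.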

I've always wanted to ensure that this very large (and completely undocumented) collection of political messages is also made available to the public, but it may well be lost to time.

Adding to the problem is the fact that many White House staffers had multiple email accounts. For example, then-Deputy Chief of Staff Karl Rove had an account on GWB43.com, the domain used for the political arm of White House operations. He also had an AOL account.

He would use each of those for different things. As you might imagine, most individuals had their own personal accounts, accounts for their work as political operators, and accounts for their work as public servants.

But let's just forget those hundred million or so political messages. Everyone else certainly has. Let's instead focus on what's involved in processing the 200 million messages that the Bush Presidential Center is willing to make available.

Next week in Part 3 of our Special Report: Hand-processing 200 million emails and how modern analytics techniques could provide innovative new applications for presidential email.


David Gewirtz, Distinguished Lecturer at CBS Interactive, is an author, U.S. policy advisor, and computer scientist. He is featured in the History Channel special The President's Book of Secrets and is a member of the National Press Club.


  • Don't forget to address the current administration's...

    use of Private email to do public work and circumvent FOIA requests.
    • Hatch Act still exists

      I very strongly advocated for a Hatch Act change, but it's still the same old mess it's been. So, future administrations (and our current one) are subject to the same 1939 rules. Crazy, no?
      David Gewirtz
      • Is the Hatch Act to blame?

        From the article:
        "the Hatch Act restricts government officials from using government resources to conduct political activities"

        Does the Hatch Act (or any other statute) restrict government officials from using resources *outside of the government* to conduct official government business? I believe that Watergate, which brought down the Nixon administration, caused subsequent administrations to be cautious with regard to communications that could be archived.
        Rabid Howler Monkey
  • Please - all his emails would take up about 10 gigs of space.

    He could never manage to compose more than a 3 sentence email. On top of that, they would never be published, lest the public see all of his spelling and grammar mistakes. LOL
    • You must be a Democrat

      Since you could barely muster two sentences for your post. ;)
      William Farrel
      • Does that make you a Republican...

        since you only managed one?
    • Not being able to compose a 3 sentence e-mail? That's not nearly as bad as

      your current president, whose brain was stolen last week when someone stole his teleprompter.

      Obama was left with a completely blank face, and a bunch of "uhs, ahhs, ums, huhs".
  • what about Clinton emails

    Have you done any review of the Clinton emails, especially the Gore fundraising or the destruction of EPA emails?
    • Oh C'mon

      Do you really want to go there? Anything anyone would say, I am sure you'll have your t-bagger ways to twist it. Email was very scarce in the federal govt until Al Gore crusaded and made it mandatory for all agencies to embrace! Yes, Al Gore spearheaded the IT of Internet, Office and Email throughout the agencies during the Clinton era! Before Al Gore did this, every agency operated differently, using different software and standards. Heck, the IRS (which I worked for then) was using WordPerfect for their word processing... Al Gore ensured all agencies embraced the Internet and standardized everything to MS Office. Best thing that ever happened to us!
      • I'm pretty sure you win the argument because you used the word "t-bagger",

        and nothing else matters.

        We owe so much to Gore, who invented the internet, and re-invented government. He also invented global warming.

        Yet, Clinton and Gore are two of the people most to blame for the conditions that this country finds itself in, including the causes for 9/11, and the housing crash which led to so much devastation in the economy.

        Besides being two of the most corrupt people in government ever, they are also a couple of the sleaziest people ever.

        Would you like a "cigar" to go with your clueless adoration of the Clinton/Gore Sleazydency?
        • arpanet to internet

          Senator Albert Gore, Jr. began to craft the High Performance Computing and Communication Act of 1991 (commonly referred to as "The Gore Bill") after hearing the 1988 report toward a National Research Network submitted to Congress by a group chaired by Leonard Kleinrock, professor of computer science at UCLA. The bill was passed on 9 December 1991 and led to the National Information Infrastructure (NII) which Al Gore called the "information superhighway". ARPANET was the subject of two IEEE Milestones, both dedicated in 2009. adorne iterates spin. iterating is easier than thought that may induce doubt.
          • Spin or facts? You prefer to call it spin, yet, nothing I stated

            is false.

            The internet didn't take off until the early 1990s, and after the "Gore Bill", and, while there are some that want to give some kind of credit to Gore for ARPANET, or AII, the fact is that, he was just leading from behind, since whatever did come from the early "research" was already happening anyway, and the "information superhighway" was already years into development by the time Gore decided he wanted in on the action.

            Gore is nothing but a shyster, and he is willing to lie, steal, and borrow, in order to put his face into the news and to try to profit from whatever comes out of it. Hence, he's also instrumental in getting "global warming" into the headlines, but his main intention, as in other endeavors he got involved in, was to try to get some gain out of it, and in the case of "global warming", he is one of the primary profiteers from sales of carbon credits. He became a multi-millionaire from that scheme, and he also profited from becoming an Apple board member, and purchasing 59,000 shares of Apple stock for a mere $440,000. Gore was and is nothing more than a shyster and a sleazebag, and a liar in the first degree.
          • National Center for Supercomputing Applications

            Gore's legislation also helped fund the National Center for Supercomputing Applications at the University of Illinois, where a team of programmers, including Netscape founder Marc Andreessen, created the Mosaic Web browser, the commercial Internet's technological springboard. "If it had been left to private industry, it wouldn't have happened," Andreessen says of Gore's bill, "at least, not until years later." The University of Pennsylvania's Dave Farber says that without Gore the Internet "would not be where it is today." Joseph E. Traub, a computer science professor at Columbia University, claims that Gore "was perhaps the first political leader to grasp the importance of networking the country." In 2005, Al Gore won the Webby Lifetime Achievement Award "for three decades of contributions to the Internet".
          • Bunch of B.S....

            Before Gore, and even before the internet as we know it, there were companies out there already making banking and news and information available to everyday people. That was all occurring in the early 1980s, and I myself was involved with an on-line banking system, which was developed at Chemical Bank in NY (Now Chase). The online world was already a reality before Gore even thought of getting involved.

            Also, HTML, the language of the internet, was already in the works before Gore got involved, and the rest of the internet history would have occurred without Gore or any other high-placed government official.

            Read (from Wikipedia):

            "In 1980, physicist Tim Berners-Lee, who was a contractor at CERN, proposed and prototyped ENQUIRE, a system for CERN researchers to use and share documents. In 1989, Berners-Lee wrote a memo proposing an Internet-based hypertext system.[2] Berners-Lee specified HTML and wrote the browser and server software in the last part of 1990. In that year, Berners-Lee and CERN data systems engineer Robert Cailliau collaborated on a joint request for funding, but the project was not formally adopted by CERN. In his personal notes[3] from 1990 he listed[4] "some of the many areas in which hypertext is used" and put an encyclopedia first."

          • rules of the road

            Without the government funding of ARPANET and the standards to ensure interoperability, the internet would not exist in its current form. When ARPANET was shut down, the Defense Data Network continued. The High-Performance Computing and Communications Initiative spurred many significant technological developments and the creation of a high-speed fiber optic computer network that became the commercial internet. Before the funding of the National Information Infrastructure, connecting modem to modem was available for everyday people. They could even use hypertext!
  • Far from happy with G.W. Bush..

    but I'm sure his significance to the liberal agenda will be the best thing that they will never let anyone forget, no matter how much wrong they can do and blame him for.
  • Why does this need to be 4 parts?

    Part one and this part could easily have been one part. I don't see why it is split up. Anyway, I remember the hype going on that the Bush admin was using a GOP controlled server for email and that they supposedly LOST all the emails. And we never heard another peep until now.
    • No doubt, the whole 4 parts could be done in 1 or 2 paragraphs,

      but, some people feel so self-important with their supposed inside knowledge of things that occur in government, that they need to drag it out in order to attract more attention to their self-image.