Data-driven analysis debunks claims that NSA is out of control (Special Report)

Summary: If these numbers were reported in a corporate situation, they would be considered an absolute triumph of big data management and implementation. UPDATE: Response/corrections/clarifications from Washington Post reporter.

IMPORTANT UPDATE: Please see the end of this article for a detailed response by Barton Gellman of the Washington Post. He clarifies some of my statements, calls me out on some, and gives us a much better understanding of others. He posted this response to the comments, but I don't want it to get lost.

Just how heinous is the National Security Agency? If press reports and blog postings are to be believed, the NSA and the entire government surveillance apparatus of the United States are completely out of control and we're headed for a Gestapo-style state.

But is that really true? What does the data have to say about it?

Let's start with a basic problem. Big numbers are hard for people to visualize. Really, really, really big numbers are impossible to visualize.

The gotcha that comes out of this cognitive limitation is that it's possible to distort public perception by tossing out big-sounding numbers. Even if an attempt is made to put those numbers in perspective, most readers grab the most savory bit of information, usually from the headline, and that's what becomes their internal representation of the facts.

So let me summarize the results of my data-driven investigation, and then take you through the details:

  • Facebook captures almost 20 times as much data per day (for just its server logs, not counting everyone's posts) as the NSA captures in total.
  • The NSA's selection systems are actually insanely accurate. If you compared all the data that passes through their systems to a year's worth of time, their errors would amount to about a quarter of a millisecond.
  • The actual byte quantity of erroneous data the NSA records amounts to less than one MP3 track per week.
  • If these numbers were reported in a corporate situation, they would be considered an absolute triumph of big data management and implementation.

So, there you go. Headlines hyper-inflate the facts. Now let me take you through all the details. Let's start with what happened on Thursday.

Reports of broken privacy rules

On Thursday, Bart Gellman, a Pulitzer Prize-winning correspondent at the Washington Post, reported "NSA broke privacy rules thousands of times per year, audit finds." Disclosure: Bart used to write for one of my publications a decade or so ago.

According to the Post, an NSA audit described "2,776 incidents in the preceding 12 months of unauthorized collection, storage, access to or distribution of legally protected communications." This describes the period of about May 2011 to May 2012.

From this report sprang a hue and cry across the land, notably from the Electronic Frontier Foundation, who declared, "NSA Spying: The Three Pillars of Government Trust Have Fallen."

It's important to note, before I go further, that I have an incredible degree of respect for both Bart and the EFF. But, to paraphrase President Clinton, it's time to employ some arithmetic.

Volume of NSA data

Here's where the really big numbers come from. According to the NSA itself, in a document released to the public (PDF), the Internet as a whole carries 1,826 petabytes of information per day. Hang with me here. The numbers are not going to make much sense for a little while, but I'll knit them together so you can grasp the big picture.

Of that 1,826 petabytes, the NSA "touches" 1.6%, or just under 30 petabytes. While the NSA doesn't define "touches" in detail, we can assume from context that they mean the data briefly passes through their networks and/or data collection centers. I know you can't picture either 1,826 petabytes or 30 petabytes, but don't worry about that for now. Stick with me. This will make sense soon.

The NSA disclosed that of that 30 petabytes it "touches," only 0.025% is "selected for review". That number is about 7.3 terabytes. By "selected for review," we can fairly assume that about 7.3 terabytes is added to the NSA's global databases and may be examined by federal agents.
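
The funnel described above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the percentages quoted from the NSA document; the constant names and the choice of decimal units (1 PB = 1,000 TB) are assumptions made for illustration.

```python
# Figures quoted in the article from the NSA's public document.
INTERNET_PB_PER_DAY = 1826      # petabytes the Internet carries per day
TOUCHED_FRACTION = 0.016        # 1.6% is "touched"
SELECTED_FRACTION = 0.00025     # 0.025% of touched data is "selected for review"

TB_PER_PB = 1000                # decimal units (assumption for this sketch)


def nsa_daily_volumes():
    """Return (touched_pb, selected_tb) per day."""
    touched_pb = INTERNET_PB_PER_DAY * TOUCHED_FRACTION
    selected_tb = touched_pb * TB_PER_PB * SELECTED_FRACTION
    return touched_pb, selected_tb


touched_pb, selected_tb = nsa_daily_volumes()
print(f"touched:  {touched_pb:.1f} PB/day")
print(f"selected: {selected_tb:.1f} TB/day")
```

The two results land at roughly 29.2 petabytes and 7.3 terabytes, matching the "just under 30 petabytes" and "about 7.3 terabytes" figures in the text.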

I'll come back to the Washington Post's 2,776 "incidents" in a minute. First, let's get some picture of the difference between petabytes and terabytes.

Picturing the scale of data

The best way I've found to picture these data sizes is by comparing them to money. A single byte, roughly one character (like "B") could be compared to a penny. If one byte is one penny, then the 140 characters in a tweet is worth about $1.40 (140 pennies).

Okay, let's raise the stakes a bit. A kilobyte is roughly a thousand (I know, 1024, but work with me), about a thousand characters of text. So far, in this article, you've read about three times that many characters. In terms of pennies, a kilobyte would be about ten bucks, or just about the cost of two Subway sandwiches.

Following along, then, a megabyte is worth about a million pennies, or about $10,000, which is roughly the cost of a used 1998 Toyota Camry. A gigabyte (which in video form will hold just about one episode of a TV show) would be a billion pennies, or about $10 million, the price of a very fancy mansion.

Do you see how these numbers just get insanely bigger? When we go from a kilobyte (a thousand or so) to a gigabyte (a billion or so), we go from a few sandwiches to a Hollywood celebrity's mansion.

Hang with me. I'll bring this back to the NSA in a minute, but you still need to get the full picture. Let's punch it up. Let's go from a gigabyte to a terabyte. Let's say a terabyte is worth a trillion pennies. In dollars, that's $10 billion, which puts you in billionaire territory, roughly the net worth of Microsoft's Steve Ballmer, and about half the net worth of Jeff Bezos, who just bought the Washington Post for what, for him, is pocket change.

So a terabyte in money terms puts you in Mark Zuckerberg, Bruce Wayne, Lex Luthor territory. So what about a petabyte? We've been flinging the term petabyte around the news all last week, but how much is that? How can we picture it?

Let's use money again. If we're talking a penny a byte, a petabyte is one quadrillion pennies, or about $10 trillion. If it's hard picturing billionaire-level wealth, try this one out for size: $10 trillion is the entire gross domestic product of China and Japan...combined.

Okay, so let's go back to trying to picture what the NSA is doing, and doing wrong. Now that we have a frame of reference (ranging from the cost of a submarine sandwich to the total income of China and Japan combined), we can get a feel for the relationship of the terms the press is flinging around.
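
For readers who want to check the penny-per-byte ladder, here is a small sketch. It assumes decimal units (the article already rounds 1024 down to 1000), and the function and dictionary names are made up for this example.

```python
PENNIES_PER_DOLLAR = 100

# Decimal unit sizes in bytes (the article treats a kilobyte as about 1,000).
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
}


def dollars_at_penny_per_byte(num_bytes):
    """Convert a byte count to dollars, valuing each byte at one penny."""
    return num_bytes / PENNIES_PER_DOLLAR


print(f"a 140-character tweet = ${dollars_at_penny_per_byte(140):,.2f}")
for name, size in UNITS.items():
    print(f"1 {name:<9} = ${dollars_at_penny_per_byte(size):,.0f}")
```

Each step up the ladder multiplies the dollar figure by a thousand, which is why the comparisons jump from sandwiches to national economies so quickly.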

Parsing the NSA data flow using what we now understand

Let's start with the biggest number first. While the NSA "touches" about 30 petabytes (in the dollar analogy, about $300 trillion, or some 30 times the combined GDP of China and Japan), it only selects for review about 7.3 terabytes (about $73 billion, roughly the net worth of Bill Gates and Jeff Bezos combined).

By the way, as a reality check, according to Robert Johnson (Facebook Director of Engineering), back in 2011 Facebook collected 130 terabytes of log data each day. Facebook, just in terms of log data (not counting all the cat pictures and recipes everyone posts), gathers almost 20 times as much data each day as the NSA grabs of all data.
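
The Facebook comparison is a single division. A quick sketch, using the two daily figures from the article (the variable names are assumptions):

```python
FACEBOOK_LOG_TB_PER_DAY = 130   # 2011 figure attributed to Robert Johnson
NSA_SELECTED_TB_PER_DAY = 7.3   # the article's "selected for review" estimate

ratio = FACEBOOK_LOG_TB_PER_DAY / NSA_SELECTED_TB_PER_DAY
print(f"Facebook logs vs. NSA selected data: {ratio:.1f}x")
```

The ratio comes out a little under 18, which is where "almost 20 times" comes from.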

Now, let's look at the number 2,776, which is what has everyone all upset.

Before we start playing with this number, let's add one more fact. This number is over the course of a year, while the other data we're looking at is over the course of a day.

2,776 is the number of erroneous data accesses by the NSA that the Washington Post reported. First of all, how much data is that? Since we're talking about metadata, we're not talking full messages. A typical email header has about 4,500 bytes (or about 4K). Let's give the naysayers the benefit of the doubt and let each NSA error be 32K.

Putting it all into perspective

So now, we can start putting the heinousness in perspective. 32K times 2,776 errors is a little under 90 megabytes — or about the size of one Justin Bieber album downloaded as MP3s — per year.

To fit this into the daily numbers we've been working with, let's divide that 90 megabytes by 365. That gives us about 252K. In penny-per-byte terms, that's about $2,500 (or about the cost of one nicely equipped iMac).
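
Here is the same estimate as a short sketch. The 32K-per-incident size is this article's deliberately generous guess, not a measured value, and the names below are invented for the example.

```python
INCIDENTS_PER_YEAR = 2776          # from the Washington Post report
BYTES_PER_INCIDENT = 32 * 1024     # the article's assumed 32K per incident
DAYS_PER_YEAR = 365
PENNIES_PER_DOLLAR = 100

yearly_bytes = INCIDENTS_PER_YEAR * BYTES_PER_INCIDENT
daily_bytes = yearly_bytes / DAYS_PER_YEAR
daily_dollars = daily_bytes / PENNIES_PER_DOLLAR   # penny-per-byte analogy

print(f"per year: {yearly_bytes / 1024**2:.2f} MB")
print(f"per day:  {daily_bytes / 1024:.1f} K (about ${daily_dollars:,.0f})")
```

Without rounding the yearly total up to 90 megabytes first, the daily figure is roughly 243K rather than 252K, which still works out to about $2,500 in penny-per-byte terms.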

In terms of dollars, which is the analogy we've been using throughout this article, the NSA mistakenly grabs the penny-per-byte equivalent of one iMac each day, compared with the penny-per-byte equivalent of the combined net worth of Bill Gates and Jeff Bezos that it selects for review.

The bottom line is this: the NSA runs about 30 quadrillion bytes through its systems each day. It records about 7 trillion of those bytes. It mistakenly records less than a megabyte a day — less than one MP3 worth of data per day.

Let's put it another way. When we talk about our goals for measuring excellent data center high-availability performance, we look for "five nines" of service availability, meaning that uptime is 99.999 percent. In terms of operating time, five nines means the network will be down all of about 5 minutes and 15 seconds for the entire year.

If we picture the NSA's accuracy by comparing it to the commonly accepted IT goal of five nines of high availability (or about five and a quarter minutes per year), the NSA's error rate (about 252K of mistaken data out of the roughly 30 petabytes it touches each day, described in terms of time) would be 0.2649 milliseconds per year. That's not the Holy Grail of five nines of accuracy. That's more like eleven nines.
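
The availability comparison can be sketched as follows. It assumes the error ratio is taken against the roughly 30 petabytes touched per day, which is the basis that reproduces the 0.2649 millisecond figure; the helper names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(failure_fraction):
    """Scale a failure fraction to the equivalent seconds in one year."""
    return failure_fraction * SECONDS_PER_YEAR


def nines(failure_fraction):
    """Count the nines in the success rate, e.g. a 1e-5 failure rate -> 5."""
    # The tiny epsilon guards against floating-point error at exact powers of ten.
    return math.floor(-math.log10(failure_fraction) + 1e-9)


five_nines_seconds = downtime_seconds_per_year(1e-5)

# The article's figures: about 252K mistaken per day out of about 30 PB touched.
ERROR_BYTES_PER_DAY = 252_000
TOUCHED_BYTES_PER_DAY = 30 * 10**15
error_fraction = ERROR_BYTES_PER_DAY / TOUCHED_BYTES_PER_DAY
error_ms = downtime_seconds_per_year(error_fraction) * 1000

print(f"five nines: {five_nines_seconds:.0f} seconds of downtime per year")
print(f"NSA ratio:  {error_ms:.4f} ms per year, {nines(error_fraction)} nines")
```

Run as written, five nines comes to about 315 seconds per year, and the NSA ratio comes to about 0.2649 milliseconds, which a strict count places at eleven nines.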

These numbers don't look to me like a heinous disregard for privacy on the part of the NSA's coders and systems engineers. Instead, it looks to me more like a triumph of IT and database engineering.

Of course, information like that doesn't cause outrage, it doesn't sell newspapers, and it doesn't generate page views. It's just accurate. Looking at actual data rather than breathless hyperbole paints a far clearer picture of the activities of America's most advanced technical intelligence gathering operation.

They're not the enemy. If anything, they appear to be doing a darned good job protecting us without getting all up in your privacy junk.

The following was posted to the comments for this article by Barton Gellman. I'm thrilled he's participating in our conversation. Thanks, Bart, for joining us and sharing clarifications.

From the author of the Washington Post story (Barton Gellman)

I'm the author of The Washington Post story. There's a newsroom expression. "Danger: reporter doing math." I'm not going to audit David, but in any case the math won't be the problem here. The problem is that he misunderstands what he's counting. I don't blame him for that: This is a very complex set of legal, technical and operational questions. I have been following them closely since 2005, and devoted two chapters of my last book to them, and I still don't find them easy. No time for a treatise but a few quick points:

* The "compliance incidents" do not all involve collection. As the story and the documents note, they can take place anywhere along the spectrum of electronic surveillance: collection, retention, processing or distribution. Any of them can range from the minor, with little privacy impact, to the very serious.

* David assumes the surveillance is all about metadata. It is not. Much of it -- an unknown quantity, because the report does not break this down -- is content. As the story notes, the NSA does not "target" Americans for content collection but it does collect a great deal of American content "inadvertently," "incidentally" or deliberately when one party is known to be a foreign target overseas. Most of it stays in databases, and a single search can pull up gigabytes.

* A crucial point to understand: the last two categories of collection on Americans -- "incidental" and deliberate, when one party is overseas -- account for the highest volume of American data in NSA hands. They DO NOT COUNT as incidents. NONE of them are among the 2,776 incidents. As the NSA interprets the law, it is not a violation to collect, keep and process it. Until my story that had never been clear, and the White House still works hard to obscure the difference between forbidden and routine collection (including collection of content) from Americans. "Minimization" rules strip out identities by default, but there are many exceptions and requests from "customers" to unmask identities are readily granted.

* It is not possible to calculate or even estimate within several orders of magnitude the quantity of data involved in 2,776 incidents, nor the number of people affected, even if you know whether you're dealing with metadata or content. A small but unknown number of incidents -- those involving unlawful search terms but obtaining no results -- do not collect, process or disseminate any data at all and thus have zero privacy impact. Other incidents may involve only a few surveillance subjects but include large volumes of data, either because collection takes place over a span of time or because the previously collected data set is very large. One "incident" in the May 2012 report involved over 3,000 database files, and each file contained an unknown (but typically very large) number of records. Another episode -- not counted as an "incident" at all -- collected data on all calls from Washington, DC for an unknown period of time. There is no way to tell from the report alone, but based on the routine procedures and scale of NSA operations it is likely that some of these individual incidents (1 of 2,776) affected hundreds of thousands of people.

* By the way, as again the story notes, the 2,776 cover only Ft. Meade and nearby offices. There would be substantially more incidents in an audit that included the SIGINT Directorate's huge regional operations centers in Texas, Georgia, Colorado and Hawaii -- and the activities of other directorates such as Technology, and such as Information Assurance, that also touch enormous volumes of data.

* It's fair game to take a full data set and challenge a reporter's (or researcher's) analysis of the data. But this was not a full data set and it's a mistake for David to think he can suss out the whole story from the limited number of documents we posted alone. I drew upon other documents and filled the gaps with many hours of old-fashioned interviews. I took some primary material, combined it with other leads, and applied journalism in order to understand what the material says, what it doesn't say, and what inferences can and can't be drawn from it. That's among the reasons we don't just dump documents into the public domain. There are not many stories in the Snowden archive that can be told by documents alone.

* Despite all this, David is surely right to say the error rate is very low in percentage terms. That is important in assessing individual performance, and maybe that's the end of the story for you. That's your choice. For some people, the public policy question considers the absolute number as well. We might not accept the more mundane harm of 1 million lost airline bags a year, even if 99.9 percent of 1 billion bags checked annually made it to their destinations. Some systems have to be designed with less fault tolerance than others. That's a political and social decision, but we have been unable to debate it until the Snowden disclosures.

* Part of the importance of this story is that the government worked so hard to obscure it. In public releases of semi-annual reports to Congress, the administration blacked out ALL statistical data. (By the way, note that the tables in the 14-page document I posted are unclassified. In the DOJ/DNI report to Congress, they were marked Top Secret // Special Intelligence, which made public release impossible and restricted the readership in Congress.) Alongside the refusal to release any data, the government left the very strong impression that mistakes were vanishingly rare and abuse non-existent. That may depend on the definition of "abuse." Marcy Wheeler quotes a tv interview in which I discussed that and makes some additional points here.

Topics: Privacy, Big Data, Government, Government US, Storage

About

David Gewirtz, Distinguished Lecturer at CBS Interactive, is an author, U.S. policy advisor, and computer scientist. He is featured in the History Channel special The President's Book of Secrets and is a member of the National Press Club.

Talkback

98 comments
  • David, you think the Government is telling the truth?

    Did nobody ever tell you that Governments lie to cover their ass ?

    I believe those number as much as I believe a Government Fiscal forecast.
    Alan Smithie
    • It was a leaked internal report

      It wasn't supposed to be released. Or maybe you think Snowden is an NSA patsy?
      larry@...
      • Yeah right and IBM didn't sell Hollerith card machines

        To a well known evil regime to help them process personal data faster.
        Alan Smithie
        • RE: Yeah right and IBM didn't sell Hollerith card machines

          @alan,

          You mean the machines IBM sold to the Nazis so they could track those tattooed serial numbers they put on people's wrists? Which btw, were hollerith codes.. Go figure. Watson was such a great humanitarian.
          Sethgr
          • I read his biography years ago.

            Watson did not sell computing machinery to the Nazis because he WAS a Nazi. He did it because, despite being a brilliant businessman, he was a naive one-world peacenik who would have sold the machinery to ANYONE who had not started a war (and he sold it BEFORE Hitler started the war). Once the Nazis had the machines, and the war had started, neither Watson nor his company IBM had any more control over the applications of the machines.

            So yes, Watson was a humanitarian; a NAIVE and gullible humanitarian. And the Nazis would have found another way to keep track of those "undesirables" if they didn't have IBM machines; their spies could have stolen enough information to almost-duplicate them on their own (and who cares about patent rights of an enemy during a war?) in any event. Watson thought they were going to use the data for innocent purposes, as the United States government (Social Security and IRS) was doing.

            It is, reportedly, true that the unity of the COMPANY was so strong that RAF pilots who had worked for IBM often "missed" their IBM targets in Germany, but they found other militarily useful places to drop their bombs.

            For a REAL anti-Semitic American, look at Henry Ford, who financed the publication of the "Protocols of the Elders of Zion," a notorious forgery of what the "Jewish Conspiracy" was alleged to be planning, in a US English edition, and was awarded a medal by Hitler (before the war). And a bank owned by a US Senator was STILL trading with Hitler DURING the war, and had to be seized and shut down by the government. The Senator's name? Prescott Bush, father of one President and grandfather of another. Truthfully, many US industrialists saw the Fascists and Nazis as simply fiscal and economic conservatives, until the Pearl Harbor hit the fan, and a Republican administration in the 1930's might well have allied with them and adopted their policies toward workers (no unions wanted, just like today). So we could say that FDR not only saved us FROM the Fascists, but in his first two terms, saved us from BECOMING fascists. At least for the next 80 years.
            jallan32
          • Merchants don't care about good or evil, just about opportunities

            I think that every successful business person or entrepreneur will agree that the opportunity is the only thing that matters and that means to think different and play dumb in some things to get the deal.

            Capitalism is not humanitarian, it's about opportunities and selfishness and there's nothing wrong about it, who cares about people exporting hazardous garbage in Hindi, giving massive tumors to women and children, destroying oceans and killing people for petroleum or selling weapons to violent countries, ethics have no place in economics.

            Governments are like companies and run by merchants and merchants have no ethics.
            delimitaciones
          • Yeah right and IBM didn't sell Hollerith card machines

            The 3rd Reich didn't lose the war, they just changed venues.

            Facebook's data is also made available to the NSA & this entire article is just an attempt to cover it all up but what's new. First it's denied, then we'll talk about it, then we'll plant the seeds that it's no big deal, i.e. they can't understand the data they've collected yada yada yada. Google is the NSA for all intents and purposes, they were funded by them and their "do no harm" motto sounds like something from Orwell, and the 3rd folks than anything else because they do harm daily. YouTube censors on political grounds daily as well. These are just little old FELONIES but hey don't worry about it, is that what our author is attempting to convey here, steer the boat away from the felony rocks one of his readers might see. Thing is, most of his readers see right through it these days, thank God. Spin machines in high gear today but what's better is yesterday's NEWS said that most major media would start this denial cycle today. Good to know the author is reading his marching orders.
            netquestz
    • How about another analogy?

      Let's say that the 1,826 petabytes represents all of the windows in the world and the NSA “touching” represents an ability to peek in 1.6% of all of the windows in the world once per day. The 0.025% of data collected that is selected for review represents 7.3 trillion in data per day recorded and reviewed by government agents all of which belong to some private individual or corporation. The 2,776 incidents per minute represent each of the individual violations of privacy that were recorded and reviewed by mistake. Of course these violations of privacy are only violations when regarding US citizens. Foreign communications are not given the same consideration of right to privacy which is why foreign companies are pulling out of the US market for cloud and data services.

      So... the NSA has the capability of peeking in 1.6% of all the windows in the world and process all 1.6% of the total available data to create profiles that target just 0.025% of the total data to flag for recording and review. Now, as if the mere capability that anyone with access to the system could use whatever criteria they choose to set specific profiles for data collection isn't disturbing enough, the system is operated in total secrecy so that we don't know what criteria they may use or when they may change their criteria.

      Finally, even if we extend blind trust to the operators of such a system that they would only use this system in good and beneficial ways and would diligently guard it from misuse, they actually mistakenly, even with the best of intentions, violate an American citizen's rights 2,776 times per minute. So each day, US citizens' rights are violated approximately 4 million times per day (1.3% of the population); each month, 119 million violations occur (roughly one third of the population) and every citizen in the US has their rights violated 4.5 times per year.

      The NSA is a spy agency isn't it? I would say they are doing a fine job.
      techadmin.cc@...
      • correction...

        I hate that ZDnet did away with the ability to edit your own posts...

        My last paragraph was completely in error. I scrolled up to get the number of incidents and read "2,776 "incidents" in a minute" but did not read the beginning of the sentence. Apparently the NSA only violated 2,776 US citizens rights in a year.

        I suppose that is better, unless:

        A. You were one of the 2,776.
        B. You had your data peeked at or collected but not reviewed.
        C. Your rights not to have your data collected and reviewed were never recognized.
        techadmin.cc@...
      • How about another analogy?

        Excellent analogy.

        Couple of points.

        They have basically unlimited funding (the money of the people they commit the crimes against).

        Each and every single violation is a FELONY.

        But the author says to go back to sleep now, nothing to see here. They can't digest the data they've collected (yet), nothing to worry about, move along now.
        netquestz
    • The government has already got to him

      He'll say anything they tell him to now.
      John2219
      • The government has already got to him

        Yep. It's crazy. Been reading ZDNet since it was in print form but the last several years they've morphed from what used to be a reasonable tech rag to a political mouthpiece for whomever is in power at the moment.

        The good part is that despite his best efforts the readers are generally starting to get it en masse now. Even with all the paid bloggers reinforcing the stories they can't get anything to stick to the walls anymore.

        There actually may be hope of steering the ship back on track despite their best efforts.
        netquestz
  • error != "civil rights violation"

    When you collect data under defined circumstances it's inevitable that you'll overcollect and adjust. When police do a wiretap under a legitimate warrant they are supposed to be listening only for certain matters. They hear all of them, but discount all the irrelevant ones. It's the only way to get a wiretap.

    Same here. These systems are inevitably so complex that they will make errors. The only thing you can do is to set procedures so that the overcollected data is not used, and clearly that's what happened with the NSA. The very fact that they had the leaked report and the study that went into it shows that they're trying to follow the rules.
    larry@...
    • Nice story but...

      ...the NSA has already admitted to feeding data to other LE agencies for follow-up of non-intelligence-related issues where they think there may be evidence of criminal activity. So they don't just ignore overcollections.
      huntm856
      • that's not out of policy

        They do that on purpose; it's not an error. It's just more evidence that they're following a policy. They could be lying about their policies, but if you take them at their word then this is no evidence that they're going rogue with the data
        larry@...
    • but some errors are

      What you say is true, but that's not exclusively what we're talking about. What we are talking about is more along the lines of placing a wiretap and not telling a court about it for months. Which is a civil rights violation.

      Not every error is a civil rights violation. But some of them are.
      stimp
    • No, it's a violation

      'Intent' is no excuse for breaking the law. The next time you get pulled over for speeding, tell the officer you didn't mean to, and watch him laugh at you.
      akaltman@...
  • Wow

    I would expect this from most media, guess I thought ZDNet was above being a mouthpiece.
    Michael Alan Goff
    • Great sociological study

      Sounds like the NSA is going all out with their cover-up. Would you expect anything less?

      The question is how many suckers are out there that will believe it.

      Verizon got a huge kickback in government cloud contracts (billions) for cooperating. Maybe ZDnet wants a piece of the pie. Though I personally wouldn't jump to such harsh judgment without a little more evidence.
      Astringent
      • It isn't that

        I don't think they're being bought off or anything, I just think this is a case where the Government (whoever it is) gets shielded.
        Michael Alan Goff