Needles and haystacks: Balancing precision and recall

Attivio's Sid Probstein says for traditional search, the conventional model is to return many results to get high recall; somewhere in there are the right answers. But what if you want a more precise answer?
Written by Sid Probstein, Attivio, Contributor
Commentary - These terms are bandied about in almost every discussion about search. What do they really mean, and how do we evaluate them? Those of us immersed in the thick of search lore have lots of answers, but we certainly don't make them easy for the uninitiated to understand. There is quite a large wall around our little garden. Let's try to demystify the discussion a bit by focusing on the inherent tradeoff between the two concepts, in the context of real-world examples.

Let us begin by considering the problem of evaluating the accuracy of a spam detector for your email system. You test it on 1,000 known spam emails and it tags all but one as spam. Would you report that it is 99.9 percent accurate? No, because you still need to evaluate the performance of the detector on legitimate (non-spam) email. You perform the test on 1,000 legitimate emails and one is incorrectly detected as spam. Now the 99.9 percent figure holds for both tests. But if 500 of them had been falsely tagged, you certainly would not be happy with the detector's performance. What should its overall accuracy be?
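The two-sided evaluation above can be sketched in a few lines. The counts come from the example; the helper function is my own illustration, not part of any real spam detector:

```python
# Evaluate a detector separately on each class of input,
# using the counts from the spam example above.

def accuracy(correct, total):
    """Fraction of test cases the detector got right."""
    return correct / total

# Test 1: 1,000 known spam emails; all but one tagged as spam.
spam_accuracy = accuracy(999, 1000)        # 0.999

# Test 2: 1,000 legitimate emails; one wrongly tagged as spam.
legit_accuracy = accuracy(999, 1000)       # 0.999

# But if 500 legitimate emails had been falsely tagged:
bad_legit_accuracy = accuracy(500, 1000)   # 0.5

print(spam_accuracy, legit_accuracy, bad_legit_accuracy)
```

The point is that a single headline number can come from the first test alone, hiding a disastrous result on the second.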

The challenge lies in recognizing that there are two different classes of errors in a detection problem. In our example, the 500 emails incorrectly detected as spam are called false positives. The one spam that was not detected is called a false negative. The overall accuracy of the system really depends on how you weigh the importance of false positives and false negatives. If you were equally OK with receiving a spam email as with not receiving a legitimate email, then a single measure of accuracy would be fine for both tests. But as you might strongly contend, this is not true for spam. The cost of receiving a single spam email is much less than the cost of not receiving a legitimate email, and so the measure of acceptability for the two kinds of errors must be different.
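One common way to fold the two error types into a single score is to weight each by its cost. A minimal sketch, with illustrative weights of my own choosing (the article does not prescribe any particular numbers):

```python
# Combine the two error types into one cost-weighted score.
# For spam, a false positive (a legitimate email tagged as spam) is far
# more costly than a false negative (one spam slipping through).
# The 50:1 ratio below is purely illustrative.

def weighted_cost(false_positives, false_negatives,
                  fp_cost=50.0, fn_cost=1.0):
    return false_positives * fp_cost + false_negatives * fn_cost

# Detector A: 1 blocked legitimate email, 1 missed spam.
print(weighted_cost(1, 1))     # 51.0

# Detector B: 500 blocked legitimate emails, 1 missed spam.
print(weighted_cost(500, 1))   # 25001.0
```

Both detectors miss the same single spam, but the weighted score makes the 500 false positives dominate, matching the intuition in the paragraph above.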

There are other factors to include in your equation. One is expected outcome. How much email on average is really spam? If the answer is one percent (and we know of course it is not) then you would probably forgo the detection altogether if it gained you comfort in knowing your email would never be falsely denied. One in a hundred spam emails is a minor inconvenience. If the answer, however, is 99 percent (which is more likely the case) then your accuracy assessment “balance” will be different.

Another factor is the consequence of the action taken. If spam email is redirected to a special directory in your inbox, then the incorrectly tagged email can still be viewed; you just have to look in the special directory. But if your company's email server denies access before it ever reaches you, the consequence of not receiving a legitimate email is more severe. Your accuracy assessment will be different depending on which of these two actions is performed.

The subject matter itself is also important. To illustrate, let us consider a completely different problem from email spam detection: testing for pregnancy. A product states that it has a 99.8 percent accuracy rate (positive and negative). What does this mean? If it is used by 1,000,000 women and it turns out that 5,000 of them are actually pregnant (0.5 percent), then we can summarize the test results in a table.

Table A: Pregnancy test results for 1,000,000 women (5,000 actually pregnant)

                      Test positive    Test negative
Actually pregnant             4,990               10
Not pregnant                  1,990          993,010
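The table's entries follow directly from the stated figures. A quick sketch, assuming the 99.8 percent rate applies equally to pregnant and non-pregnant women:

```python
# Derive the confusion-matrix counts from prevalence and accuracy,
# using the figures in the pregnancy example.

population    = 1_000_000
pregnant      = 5_000        # 0.5 percent prevalence
accuracy_rate = 0.998        # assumed symmetric for both outcomes

true_positives  = round(pregnant * accuracy_rate)             # 4,990
false_negatives = pregnant - true_positives                   # 10

not_pregnant    = population - pregnant                       # 995,000
false_positives = round(not_pregnant * (1 - accuracy_rate))   # 1,990
true_negatives  = not_pregnant - false_positives              # 993,010

print(true_positives, false_negatives, false_positives, true_negatives)
```

Note how the low prevalence makes the false positives (1,990) outnumber the actual missed pregnancies (10) by a wide margin, even at 99.8 percent accuracy.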

What lessons can we learn from this table? Are we happy with the fact that 10 women are in for a big surprise? What about the 1,990? When should you start remodeling the spare bedroom?

Now consider identifying inappropriate use of PII (personally identifiable information, such as Social Security numbers) within your organization. According to a Ponemon Institute study, the average cost per incident where PII has been put at risk is $6.3 million to investigate the breach, notify customers, restore security infrastructure, and recover lost business. They calculate that this amounts to about $197 per individual record containing PII.[1] Failing to detect the inappropriate use of PII is a false negative, so determining its cost is fairly straightforward. The cost of a false positive amounts to very little by comparison. In a good monitoring or auditing system, false positives can be tagged as acceptable PII and subsequently ignored. Of course, if the number of false positives becomes fantastically large, then manageability itself becomes a cost.

Finally, there is the threshold tolerance factor. In certain cases, it is necessary to have a low amount of both types of errors, even zero. Reexamining our pregnancy example, a pregnancy test involves the measurement of the hCG hormone in a woman's system. The less expensive tests (including the home tests) check simply for its presence. The more accurate tests determine the specific amount of hCG, and they do so by measuring it in the blood, where the distribution of hCG is more stable. Of course, these tests require a doctor's visit and are more expensive. But they are also more accurate.

In reporting income for a public company, the costs of reporting too much income or reporting not enough income these days have become equally catastrophic. This is an extreme example where there is really only one acceptably correct answer, but it can be done because both the question and data are precise. There may be many, many rules and regulations that determine what constitutes revenue and how it is reported, but there is little that is “fuzzy” or “interpretive” about either (or shouldn’t be according to the tax department).

This fuzziness factor has a direct relationship with the threshold tolerance. Clearly a request for information about World War II is inherently less precise than asking for 2008 revenue. When is a piece of information about the war no longer a relevant answer? The query can be imprecise, the answer can be imprecise, or both. In the case of World War II, both are imprecise. This is also true for spam, although less so (what is spam, and what is legitimate advertising?). With our pregnancy example, the question is precise (you are either pregnant or not) but the test is not, having some margin of error.

So, you may be wondering what this has to do with precision and recall. Precision is the ratio of detected correct answers to the full set of detections. In other words, it is a measure of false positives (the higher the precision, the fewer the false positives). Recall is the ratio of detected correct answers to the true set of correct answers, and is likewise a measure of false negatives. Looking back at Table A in our pregnancy example, we detected 4,990 correct pregnancies, but we reported 6,980 pregnancies in total, so the precision is around 71.5 percent (4,990 divided by 6,980). The number of true pregnancies was 5,000, so the recall is 99.8 percent (4,990 divided by 5,000). In our spam detector example, we detected 999 correct spam emails, but we flagged 1,499 of them (999 + 500), so the precision is around 66.6 percent (999 divided by 1,499). The number of true spam emails was 1,000, so the recall is 99.9 percent (999 divided by 1,000).
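The arithmetic above is easy to verify. A short sketch using the counts from both examples:

```python
# Precision and recall computed from raw counts, using the
# figures from the pregnancy and spam examples.

def precision(true_positives, false_positives):
    """Detected correct answers / all detections."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Detected correct answers / all true correct answers."""
    return true_positives / (true_positives + false_negatives)

# Pregnancy test: 4,990 correct detections, 1,990 false positives, 10 misses.
print(round(precision(4990, 1990), 3))   # 0.715
print(round(recall(4990, 10), 3))        # 0.998

# Spam detector: 999 correct detections, 500 false positives, 1 miss.
print(round(precision(999, 500), 3))     # 0.666
print(round(recall(999, 1), 3))          # 0.999
```

Both systems have near-perfect recall; it is the precision figures that separate a tolerable detector from an annoying one.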

In the realm of information access and search, precision and recall drive the logic that determines the order in which results are returned for a query and how many of them should appear (the threshold). The logic amounts to a series of detection problems: for example, finding the correct spelling of a term in a query, deciding whether audio is noise or speech, determining if a noun phrase is worthy of being a navigator for exploration, or calculating a sentiment score on a document.

For traditional search (for the Web or the enterprise), the conventional model is to return many results to get high recall; somewhere in there are the right answers. When the queries are pretty fuzzy (recall our World War II example), this may be the right approach, because what is considered "right" is a personal choice. But for more exact questions, the balance between precision and recall should be adjusted accordingly, even to the extreme of providing the one correct answer (recall our financial reporting example), rather than leaving it to the user to decide whether to click on the next page of results.

[1] 2007 Annual Study: U.S. Cost of a Data Breach, Ponemon Institute, LLC, November 2007

Sid Probstein, currently CTO at Attivio, has more than 15 years experience leading successful engineering organizations and building complex, high-performance systems.
