Part 1 -- Introduction
Part 2 -- A Tale Of Two Weathermen
Part 3 -- A Model For Evaluating Intelligence
The fundamental problem with evaluating intelligence products is that intelligence, for the most part, is probabilistic. Even when an intelligence analyst thinks he or she knows a fact, it is still subject to interpretation or may have been the result of a deliberate campaign of deception.
The problem is exacerbated when making an intelligence estimate, where good analysts never express conclusions in terms of certainty. Instead, analysts typically use words of estimative probability (or, what linguists call verbal probability expressions) such as "likely" or "virtually certain" to express a probabilistic judgment. While there are significant problems with using words (instead of numbers or number ranges) to express probabilities, using a limited number of such words in a preset order of ascending likelihood currently seems to be considered the best practice by the National Intelligence Council (see page 5).
Intelligence products, then, suffer from two broad categories of error: problems of calibration and problems of discrimination. Anyone who has ever stepped on a scale only to find that they weigh significantly more or significantly less than expected understands the idea of calibration. Calibration is the act of adjusting a value to meet a standard.
In simple probabilistic examples, the concept works well. Consider a fair, ten-sided die. Each number, one through ten, has the same probability of coming up when the die is rolled (10%). If I asked you to tell me the probability of rolling a seven, and you said 10%, we could say that your estimate was perfectly calibrated. If you said the probability was only 5%, then we would say your estimate was poorly calibrated and we could "adjust" it to 10% in order to bring it into line with the standard.
Translating this concept into the world of intelligence analysis is incredibly complex. To have perfectly calibrated intelligence products, we would have to be able to say that, if a thing is 60% likely to happen, then it happens 60% of the time. Most intelligence questions (beyond the trivial ones), however, are unique, one of a kind. The exact set of circumstances that led to the question being asked in the first place and much of the information relevant to its likely outcome are impossible to replicate, making it difficult to keep score in a meaningful way.
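That said, the idea of "keeping score" on calibration can be made concrete when an analyst makes many forecasts over time. A minimal sketch (the forecast data and the exact-match binning scheme here are illustrative assumptions, not anything from the original): group forecasts by their stated probability, then compare each group's stated probability against the observed frequency of the predicted outcome.

```python
from collections import defaultdict

def calibration_table(forecasts):
    """Group (stated_probability, outcome) pairs by stated probability
    and report the observed frequency of the outcome in each group."""
    buckets = defaultdict(list)
    for prob, happened in forecasts:
        buckets[prob].append(happened)
    # For a perfectly calibrated forecaster, each group's observed
    # frequency would match its stated probability.
    return {p: sum(v) / len(v) for p, v in sorted(buckets.items())}

# Made-up forecasts for illustration: (stated probability, did it happen?)
forecasts = [(0.6, True), (0.6, False), (0.6, True), (0.6, True), (0.6, False),
             (0.9, True), (0.9, True), (0.9, True), (0.9, False)]

table = calibration_table(forecasts)
print(table)  # the 0.6 group came true 3 of 5 times, i.e. an observed 0.6
```

The sketch also shows why the problem is hard in practice: it only works when the forecaster makes enough comparable forecasts to fill each bucket, which is exactly what unique, one-of-a-kind intelligence questions deny us.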
The second problem facing intelligence products is one of discrimination. Discrimination is associated with the idea that the intel is either right or wrong. An analyst with a perfect ability to discriminate always gets the answer right, whatever the circumstance. While the ability to perfectly discriminate between right and wrong analytic conclusions might be a theoretical ideal, the ability to actually achieve such a feat exists only in the movies. Most complex systems exhibit a sensitive dependence on initial conditions that precludes accurate discrimination beyond trivially short time frames.
If it appears that calibration and discrimination are in conflict, they are. The better calibrated analysts are, the less willing they are to discriminate definitively between possible estimative conclusions. Likewise, the more willing an analyst is to discriminate between possible estimative conclusions, the less likely he or she is to properly calibrate the possibilities inherent in the intelligence problem.
For example, an analyst who says X is 60% likely to happen is still 40% "wrong" when X does happen, should an evaluator choose to focus on the analyst's ability to discriminate. Likewise, if the objective probability of X happening was only 60%, the analyst who flatly said X will happen is also 40% wrong (even though X does happen), should the evaluator choose to focus on the analyst's ability to calibrate.
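One standard way to make this tension measurable (not mentioned in the original, but common in the forecasting literature) is the Brier score, which charges a forecaster the squared distance between the stated probability and the actual 0-or-1 outcome. A hedged 60% forecast and a confident 100% forecast fare very differently depending on which way the event breaks:

```python
def brier(prob, happened):
    """Squared error between a stated probability and the 0/1 outcome.
    0.0 is a perfect score; 1.0 is the worst possible."""
    return (prob - (1.0 if happened else 0.0)) ** 2

# When X happens, the confident forecast looks better on that one event...
print(round(brier(0.6, True), 2))   # 0.16
print(round(brier(1.0, True), 2))   # 0.0

# ...but when X fails to happen, confidence is punished far more heavily.
print(round(brier(0.6, False), 2))  # 0.36
print(round(brier(1.0, False), 2))  # 1.0
```

Scored this way, neither the well-calibrated hedger nor the bold discriminator is "damned" by a single outcome; over many forecasts, the score rewards whichever blend of calibration and discrimination actually tracks reality.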
Failure to understand the tension between these two evaluative principles leaves the unwitting analyst open to a "damned if you do, damned if you don't" attack by critics of the analyst's estimative work. The problem only grows worse if you consider words of estimative probability instead of numbers.
All this, in turn, typically leads analysts to ask for what Philip Tetlock, in his excellent book Expert Political Judgment, called "adjustments" when being evaluated regarding the accuracy of their estimative products. Specifically, Tetlock outlines four key adjustments:
- Value adjustments -- mistakes made were the "right mistakes" given the cost of the alternatives
- Controversy adjustments -- mistakes were made by the evaluator and not the evaluated
- Difficulty adjustments -- mistakes were made because the problem was so difficult or, at least, more difficult than problems a comparable body of analysts typically faced
- Fuzzy set adjustments -- mistakes were made but the estimate was a "near miss" so it should get partial credit
This parade of horribles should not be construed as a defense of the school of thought that says that intelligence cannot be evaluated, that it is too hard to do. It is merely to show that evaluating intelligence products is truly difficult and fraught with traps to catch the unwary. Any system established to evaluate intelligence products needs to acknowledge these issues and, to the greatest degree possible, deal with them.
Many of the "adjustments", however, can also be interpreted as excuses. Just because something is difficult to do doesn't mean you shouldn't do it. An effective and appropriate system for evaluating intelligence is an essential step in figuring out what works and what doesn't, in improving the intelligence process. As Tetlock notes (p. 9), "The list (of adjustments) certainly stretches our tolerance for uncertainty: It requires conceding that the line between rationality and rationalization will often be blurry. But, again, we should not concede too much. Failing to learn everything is not tantamount to learning nothing."
Tomorrow: The Problems With Evaluating Process