Sources And Methods: Part 8 -- Batting Averages (Evaluating Intelligence)

Part 1 -- Introduction
Part 2 -- A Tale Of Two Weathermen
Part 3 -- A Model For Evaluating Intelligence
Part 4 -- The Problem With Evaluating Intelligence Products
Part 5 -- The Problem With Evaluating The Intelligence Process
Part 6 -- The Decisionmaker's Perspective
Part 7 -- The Iraq WMD Estimate And Other Iraq Pre-War Assessments

Despite good reasons to believe that the findings of the Iraq WMD National Intelligence Estimate (NIE) and the two pre-war Intelligence Community Assessments (ICAs) regarding Iraq can be evaluated as a group for insights into the quality of the analytic processes used to produce these products, several problems remain before we can determine the "batting average".

Assumptions vs. Descriptive Intelligence: The NIE drew its estimative conclusions from what the authors believed were the facts based on an analysis of the information collected about Saddam Hussein's WMD programs. Much of this descriptive intelligence (i.e. that information which was not proven but clearly taken as factual for purposes of the estimative parts of the NIE) turned out to be false. The ICAs, however, are largely based on a series of assumptions either explicitly or implicitly articulated in the scope notes to those two documents. This analysis, therefore, will only focus on the estimative conclusions of the three documents and not on the underlying facts.
Descriptive Intelligence vs. Estimative Intelligence: Good analytic tradecraft has always required analysts to clearly distinguish estimative conclusions from the direct and indirect information that supports those estimative conclusions. The inconsistencies in the estimative language, along with the grammatical structure of some of the findings, make this particularly difficult. For example, the Iraq NIE found: "An array of clandestine reporting reveals that Baghdad has procured covertly the types and quantities of chemicals and equipment sufficient to allow limited CW agent production hidden in Iraq's legitimate chemical industry." Clearly the information gathered suggested that the Iraqis had gathered the chemicals. What is not as clear is whether they were likely using them for limited CW production or whether they merely could use these chemicals for such purposes. A strict constructionist would argue for the latter interpretation whereas the overall context of the Key Judgments would suggest the former. I have elected to focus on the context to determine which statements are estimative in nature. This inserts an element of subjectivity into my analysis and may skew the results.
Discriminative vs. Calibrative Estimates: The language of the documents uses both discriminative ("Baghdad is reconstituting its nuclear weapons program") and calibrative language ("Saddam probably has stocked at least 100 metric tons ... of CW agents"). Given the seriousness of the situation in the US at that time, the purposes for which these documents were to be used, and the discussion of the decisionmaker's perspective in Part 6 of this series, I have elected to treat calibrative estimates as discriminative for purposes of evaluation.
Overly Broad Estimative Conclusions: Overly broad estimates are easy to spot. Typically these statements use highly speculative verbs such as "might" or "could". A good example of such a statement is the claim: "Baghdad's UAVs could threaten Iraq's neighbors, US forces in the Persian Gulf, and if brought close to, or into, the United States, the US homeland." Such alarmism seems silly today but it should have been seen as silly at the time as well. From a theoretical perspective, these types of statements tell the decisionmaker nothing useful (anything "could" happen; everything is "possible"). One option, then, is to mark these statements as meaningless and eliminate them from consideration. That, in my mind, would encourage this bad practice, so I intend to count these kinds of statements as false if they turned out to have no basis in fact (I would, under this same logic, have to count them as true if they turned out to be true, of course).
Weight of the Estimative Conclusion: Some estimates are clearly more fundamental to a report than others. Conclusions regarding direct threats to US soldiers, for example, should trump any minor and indirect consequences regarding regional instability identified in the reports. Engaging in such an exercise might be something appropriate for individuals directly involved in this process and in a better position to evaluate these weights. I, on the other hand, am looking for only the broadest possible patterns (if any) from the data. I have, therefore, decided to weigh all estimative conclusions equally.
Dealing with Dissent: There were several dissents in the Iraq NIE. While the majority opinion is, in some sense, the final word on the matter, an analytic process that tolerates formal dissent deserves some credit as well. Going simply with the majority opinion does not accomplish this. Likewise, eliminating the dissented opinion from consideration gives too much credit to the process. I have chosen to count those estimative conclusions with dissents as both true and false (for scoring purposes only).
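
The scoring rules above can be sketched in a few lines of Python. This is only an illustration, not the worksheet actually used for the counts in this post: the `Conclusion` record, its field names, and the three sample entries are hypothetical placeholders, and the sketch assumes that "both true and false" means a dissented conclusion adds one to each column.

```python
from dataclasses import dataclass


@dataclass
class Conclusion:
    """One estimative conclusion (hypothetical record layout)."""
    label: str
    turned_out_true: bool      # judged after the fact
    has_dissent: bool = False  # a formal dissent was recorded


def tally(conclusions):
    """Return (true_count, false_count), every conclusion weighted equally.

    Assumption: a conclusion with a dissent adds one to BOTH columns.
    """
    true_count = 0
    false_count = 0
    for c in conclusions:
        if c.has_dissent:
            true_count += 1
            false_count += 1
        elif c.turned_out_true:
            true_count += 1
        else:
            false_count += 1
    return true_count, false_count


# Placeholder data only; these are not findings from the NIE or the ICAs.
sample = [
    Conclusion("placeholder A", turned_out_true=True),
    Conclusion("placeholder B", turned_out_true=False),
    Conclusion("placeholder C", turned_out_true=False, has_dissent=True),
]
sample_true, sample_false = tally(sample)
print(sample_true, sample_false)
```

Calibrative and overly broad statements need no special handling in this sketch, because both are scored simply as true or false after the fact.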

Clearly, given the caveats and conditions under which I am attempting this analysis, I am looking only for broad patterns of analytic activity. My intent is not to spend hours quibbling about all of the various ways a particular judgment could be interpreted as true or false after the fact. My intent is to merely make the case that evaluating intelligence is difficult but, even with those difficulties firmly in mind, it is possible to go back, after the fact, and, if we look at a broad enough swath of analysis, come to some interesting conclusions about the process.

Within these limits, then, by my count, the Iraq NIE contained 28 (85%) false estimative conclusions and 5 (15%) true ones. This conclusion tracks quite well with the WMD Commission's own evaluation that the NIE was incorrect in "almost all of its pre-war judgments about Iraq's weapons of mass destruction." By my count, the Regional Consequences of Regime Change in Iraq ICA fares much better with a count of 23 (96%) correct estimative conclusions and only one (4%) incorrect one. Finally, the report on the Principal Challenges in Post-Saddam Iraq nets 15 (79%) correct analytic estimates to 4 (21%) incorrect ones. My conclusions are certainly consistent with the tone of the Senate Subcommittee Report.
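
For readers who want to check the arithmetic, here is a minimal sketch that recomputes each document's percentages from the raw counts given above. The dictionary keys are shortened labels of my own, and rounding to whole percentages is an assumption about how the figures were produced.

```python
# (correct, incorrect) counts per document, as reported in this post.
counts = {
    "Iraq WMD NIE": (5, 28),
    "Regional Consequences ICA": (23, 1),
    "Principal Challenges ICA": (15, 4),
}


def shares(correct, incorrect):
    """Return (percent correct, percent incorrect), rounded to whole numbers."""
    total = correct + incorrect
    return round(100 * correct / total), round(100 * incorrect / total)


percentages = {name: shares(c, i) for name, (c, i) in counts.items()}

for name, (pct_correct, pct_incorrect) in percentages.items():
    print(f"{name}: {pct_correct}% correct, {pct_incorrect}% incorrect")
```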

It is noteworthy that the Senate Subcommittee did not go to the same pains to compliment analysts on their fairly accurate reporting in the ICAs as the WMD Commission did to pillory the NIE. Likewise, there was no call from Congress to ensure that the process involved in creating the NIE was reconciled with the process used to create the ICAs, no laws proposed to take advantage of this largely accurate work, no restructuring of the US national intelligence community to ensure that the good analytic processes demonstrated in these ICAs would dominate the future of intelligence analysis.

The most interesting number, however, is the combined score for the three documents. Out of the 76 estimative conclusions made in the three reports, 43 (57%) were correct and 33 (43%) incorrect. Is this a good score or a bad score? Such a result is likely much better than mere chance, for example. For each judgment made there were likely many reasonable hypotheses considered. If there were only three reasonable hypotheses to consider in each case, the base rate would be 33%. On average, the analysts involved were able to nearly double that "batting average".
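
The comparison with chance can be made explicit. The sketch below takes the combined counts above and sets the overall accuracy against a base rate of one in k equally likely hypotheses. The choice of k is an assumption (the post uses k = 3 only as an example), and equal likelihood of the hypotheses is a simplification.

```python
correct, incorrect = 43, 33
total = correct + incorrect   # 76 estimative conclusions
accuracy = correct / total    # combined "batting average"


def base_rate(k):
    """Chance of picking the right hypothesis among k equally likely ones."""
    return 1 / k


# How many times better than chance the combined score is, for a few
# assumed values of k.
lift = {k: accuracy / base_rate(k) for k in (2, 3, 4)}

print(f"combined accuracy: {accuracy:.1%}")
for k, ratio in lift.items():
    print(f"k={k}: base rate {base_rate(k):.0%}, accuracy is {ratio:.2f}x chance")
```

With k = 3 the ratio comes out at roughly 1.7, which is the sense in which the analysts nearly doubled the base rate.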

Likewise, a score of 57% is consistent with both hard and anecdotal data on historical trends in analytic forecasting. Mike Lyden, in his thesis on Accelerated Analysis, calculated that, historically, US national security intelligence community estimates were correct approximately 2/3 of the time.

Former Director of the CIA, GEN Michael Hayden, made his own estimate of analytic accuracy in May of last year: "Some months ago, I met with a small group of investment bankers and one of them asked me, 'On a scale of 1 to 10, how good is our intelligence today?' I said the first thing to understand is that anything above 7 isn't on our scale. If we're at 8, 9, or 10, we're not in the realm of intelligence -- no one is asking us the questions that can yield such confidence. We only get the hard sliders on the corner of the plate."

Given these standards, 57%, while a bit low by historical measures, certainly seems to be within normal limits and, even more importantly, consistent with what the US has routinely expected from its intelligence community.

Tomorrow: Final Thoughts

Sources And Methods

Saturday, February 7, 2009
