Wednesday, August 7, 2013

Is Forensic Speaker Recognition The Next "Fingerprint?"

Take a fingerprint... for that matter, go ahead and take a palm print. Now, take a voiceprint. In this day and age, forensic biometric analysis is extraordinarily complex. In a world where we analyze everything from irises to earlobes, what can science tell us about voice?

One increasingly popular form of analysis is forensic speaker recognition (aka voice biometrics or biometric acoustics). Forensic speaker recognition (FSR) has clear potential as a supplementary analytic methodology, with applications in both law enforcement and counterterrorism (for details, see the last section of the 2012 book on FSR Applications to Law Enforcement and Counter-terrorism).

FSR serves one of two purposes: identification (1:N or N:1) or verification (1:1).
  • 1:N Identification -- Imagine you have a recording of a voice making threats over the phone. The speaker identification process allows you to compare your target voice against a database of acoustic recordings of known suspects, both to identify the speaker and to find other threats he or she might have made.
  • N:1 Identification -- Imagine you have a bunch of voice recordings and you want to know in which of them, if any, a certain speaker participates. 
  • 1:1 Verification -- Imagine you wish to grant someone access to a building or secure location by assessing whether or not they are who they say they are (this aspect of speaker recognition is less applicable to analysis and more applicable to security). 
That said, the CIA, the NSA and the Swiss IDIAP all turned to automatic speaker verification systems in 2003 to analyze the so-called Osama tapes (for details of the approach, see Graphing the Voice of Terror). This case provides an excellent opportunity to note the distinction between automatic speaker recognition performed by an algorithmic machine and aural speaker recognition performed by acoustic experts. 
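The three modes above differ only in what is compared against what. Here is a minimal sketch of that distinction, assuming each voice has already been reduced to a short numeric feature vector. The vectors, speaker names, call IDs, the cosine-similarity score and the 0.9 threshold are all made up for illustration; they are not how any of the systems mentioned here actually work.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(probe, enrolled):
    """1:N -- rank every enrolled speaker by similarity to one unknown voice."""
    scores = {name: cosine_similarity(probe, vec) for name, vec in enrolled.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def find_recordings(target, recordings, threshold=0.9):
    """N:1 -- list the recordings in which one known speaker seems to appear."""
    return [rec_id for rec_id, vec in recordings.items()
            if cosine_similarity(target, vec) >= threshold]

def verify(probe, claimed, threshold=0.9):
    """1:1 -- accept or reject a single claimed identity."""
    return cosine_similarity(probe, claimed) >= threshold

# Hypothetical feature vectors; a real system would derive these from audio.
enrolled = {
    "speaker_a": [0.9, 0.1, 0.2],
    "speaker_b": [0.1, 0.8, 0.3],
    "speaker_c": [0.2, 0.2, 0.9],
}
recordings = {
    "call_1": [0.88, 0.12, 0.18],
    "call_2": [0.15, 0.75, 0.35],
}
probe = [0.85, 0.15, 0.25]  # the unknown voice from the threatening call

ranking = identify(probe, enrolled)
print("Best 1:N match:", ranking[0][0])
print("N:1 hits:", find_recordings(enrolled["speaker_a"], recordings))
print("1:1 verified as speaker_a:", verify(probe, enrolled["speaker_a"]))
```

Note that identification returns a ranking while verification returns a yes/no answer, so the choice of threshold matters far more in the 1:1 case.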

The cornerstone methodology supporting forensic speaker recognition is voiceprint analysis, or spectrographic analysis, a process that visually displays the acoustic signal of a voice as a function of time (seconds or milliseconds) and frequency (hertz) such that all components are visible (formants, harmonics, fundamental frequency, etc.).
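To make "a function of time and frequency" concrete, here is a rough sketch of how a spectrogram can be computed: slice the signal into short frames, window each frame, and measure how much energy falls at each frequency. It uses only the Python standard library and a deliberately slow textbook transform. The sample rate, frame size and two-tone test signal are assumptions chosen for illustration, not real speech.

```python
import cmath
import math

def spectrogram(samples, sample_rate, frame_size, hop):
    """Return (times, freqs, magnitudes); magnitudes[t][k] is the strength
    of frequency freqs[k] in the frame that starts at times[t]."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    n_bins = frame_size // 2 + 1
    freqs = [k * sample_rate / frame_size for k in range(n_bins)]
    times, magnitudes = [], []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(n_bins):
            # Naive discrete Fourier transform of one frame at bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            row.append(abs(coeff))
        times.append(start / sample_rate)
        magnitudes.append(row)
    return times, freqs, magnitudes

def dominant_frequency(row, freqs):
    """Frequency of the strongest bin in one spectrogram frame."""
    return freqs[max(range(len(row)), key=row.__getitem__)]

# Hypothetical test signal: a 200 Hz tone followed by a 1000 Hz tone.
sample_rate = 8000
signal = ([math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(400)] +
          [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(400)])

times, freqs, mags = spectrogram(signal, sample_rate, frame_size=80, hop=80)
for t, row in zip(times, mags):
    print(f"{t * 1000:5.0f} ms  strongest component: {dominant_frequency(row, freqs):.0f} Hz")
```

A plotted spectrogram is simply this grid of magnitudes drawn as an image, with time on one axis, frequency on the other and darkness or color showing strength.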
(Note:  For those who are more acoustically inclined and would enjoy a well-written read on all things acoustic from military strategy to frog communication, Seth Horowitz's new book The Universal Sense: How Hearing Shapes the Mind comes with my highest recommendation.)
Spectrographic analysis differs from human speaker recognition in that it provides a more quantifiable comparison between two speech signals. Under favorable conditions, both approaches yield strong results: 85 percent identification accuracy (McGehee 1937), 96 percent accuracy (Espy-Wilson 2006), 98 percent accuracy (Clifford 1980), 100 percent accuracy (Bricker and Pruzansky 1966). These approaches, however, do not come without caveats.

Forensic speaker recognition has many limitations and is currently inadmissible in federal court as expert testimony. Bonastre et al. (2003) summarize these limitations quite well:
"The term voiceprint gives the false impression that voice has characteristics that are as unique and reliable as fingerprints... this is absolutely not the case."
The thing about voices is that they are susceptible to a myriad of factors such as psychological/emotional state, age, health, weather... the list goes on. From an application standpoint, the most prominent of these factors is intentional vocal disguise. There are a number of things people can intentionally do to their voices to drastically reduce the ability of a machine or human expert to identify them correctly (you would be amazed at how difficult, nearly impossible, it is to identify a whispered voice). Under these conditions, identification accuracy falls to 40 to 52 percent (Thompson 1987), 36 percent (Andruski 2007), or 26 percent (Clifford 1980).
Top: Osama bin Laden's "dirty" 2003 telephonic spectrogram
Bottom: Osama bin Laden's "clean" spectrogram
Source: Owl Investigations


More problematic still is communication by telephone. Much of the input law enforcement and national security analysts have to work with comes from telephone wiretaps or calls made from jail cells. Telephones, cellphones in particular, filter the acoustic signal so that all acoustic information below a certain frequency simply is not transmitted, and some of the key characteristics for voice identification lie within that lost frequency range.
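The filtering effect is easy to reproduce on a toy signal. The sketch below builds a crude "voice" from a 120 Hz fundamental plus three harmonics, removes everything below a cutoff, and checks what survives. The 300 Hz cutoff is an assumption (it is roughly the lower edge of the traditional narrowband telephone channel), and the brick-wall filter is a simplification of what real phone networks and codecs do.

```python
import cmath
import math

def dft(values):
    """Naive discrete Fourier transform (slow, but dependency-free)."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Inverse transform; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def remove_low_frequencies(samples, sample_rate, cutoff_hz):
    """Zero every frequency component below cutoff_hz (idealized high-pass)."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 mirror the lower bins, so use the absolute frequency.
        freq = min(k, n - k) * sample_rate / n
        if freq < cutoff_hz:
            spectrum[k] = 0
    return inverse_dft(spectrum)

def magnitude_at(samples, sample_rate, freq_hz):
    """Strength of a single frequency component in the signal."""
    return abs(sum(s * cmath.exp(-2j * math.pi * freq_hz * t / sample_rate)
                   for t, s in enumerate(samples)))

# Hypothetical voice-like signal: 120 Hz fundamental plus harmonics.
sample_rate = 8000
partials = [(120, 1.0), (240, 0.5), (360, 0.5), (480, 0.25)]  # (Hz, amplitude)
voice = [sum(a * math.sin(2 * math.pi * f * i / sample_rate) for f, a in partials)
         for i in range(400)]

phone = remove_low_frequencies(voice, sample_rate, cutoff_hz=300)

for f, _ in partials:
    before = magnitude_at(voice, sample_rate, f)
    after = magnitude_at(phone, sample_rate, f)
    print(f"{f:4d} Hz  before: {before:7.2f}  after: {after:7.2f}")
```

In this toy case the fundamental and the first harmonic vanish entirely while the higher harmonics pass through unchanged, which is the kind of loss an examiner has to work around with telephone recordings.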

While the forensic speaker recognition capability has come a long way since 2003, the consensus among the analytic community remains that it is not a stand-alone methodology, but rather a promising supplementary tool. Biometric analysis was also a topic brought to the Intelligence Technology panel of the 2013 Global Intelligence Forum. Of note was the expanding applicability and increasing capabilities of all biometric technologies.

Thus far, the Spanish Guardia Civil is the only law enforcement agency worldwide to have a fully operational acoustic biometric system (called SAIVOX, the Automatic System for the Identification of Voices). In the Spanish booking process, just as we take fingerprints, they take voice samples and add them to a corpus of over 3,500 samples linked with well-known criminals and certain types of crime.

In 2011, the FBI commissioned NIST to launch a program on "investigatory voice biometrics." The goal of the committee is to develop best practices and collection standards to launch an operational voice biometric system, modeled on the Spanish system, with corpora robust enough to serve as a useful tool in ongoing investigations. (This is an ongoing project and you can read the full report here.)

FSR is not a perfect methodology, but one that can add substantial value on a case-by-case basis. It is of high interest to the US national security and law enforcement analytic communities.

Additional reading:
Andruski, J., Brugnone, N., & Meyers, A. (2007). Identifying disguised voices through speakers' vocal pitches and formants. 153rd ASA meeting.
Bonastre, J. F., Bimbot, F., Boe, L. J., Campbell, J. P., Reynolds, D. A., & Magrin-Chagnolleau, I. (2003). Person authentication by voice: A need for caution. Eurospeech 2003.
Bricker, P. D., & Pruzansky, S. (1966). Effects of stimulus content and duration on talker identification. The Journal of the Acoustical Society of America, 40, 1441-1449.
Clifford, B. R. (1980). Voice identification by human listeners: On earwitness reliability. Law and human behavior, 4(4), 373-394.
Espy-Wilson, C. Y., Manocha, S., & Vishnubhotla, S. (2006). A new set of features for text-independent speaker identification.
McGehee, F. (1937). The reliability of the identification of the human voice. Journal of general psychology, 31, 53-65.
Parmar, P. (2012). Voice fingerprinting: A very important tool against crime. J Indian academy forensic med., 34(1), 70-73. ISSN: 0971-0973