Take a fingerprint... for that matter, go ahead and take a palm print. Now, take a voiceprint. In this day and age, forensic biometric analysis is extraordinarily complex. In a world where we analyze everything from irises to earlobes, what can science tell us about voice?
One increasingly popular form of analysis is forensic speaker recognition (aka voice biometrics or biometric acoustics). Forensic speaker recognition (FSR) has clear potential as a supplementary analytic methodology, with applications in both law enforcement and counterterrorism (for details, see the last section of the 2012 book on FSR Applications to Law Enforcement and Counter-terrorism).
The FSR process serves one of two purposes: identification (1:N or N:1) or verification (1:1).
- 1:N Identification -- Imagine you have a recording of a voice making threats over the phone. The speaker identification process allows you to compare your target voice against a database of acoustic recordings of known suspects, and so to identify other threats he or she might have made.
- N:1 Identification -- Imagine you have a bunch of voice recordings and you want to know in which of them, if any, a certain speaker participates.
- 1:1 Verification -- Imagine you wish to grant someone access to a building or secure location by assessing whether or not they are who they say they are (this aspect of speaker recognition is less applicable to analysis and more applicable to security).
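To make the three modes concrete, here is a minimal Python sketch. It assumes each recording has already been reduced to a fixed-length feature vector, and it scores similarity with plain cosine similarity. The vectors, the names (`suspect_a`, `call_17` and so on), and the 0.9 threshold are all hypothetical values chosen for illustration, not data or settings from any real FSR system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(probe, enrolled):
    """1:N -- rank every enrolled speaker by similarity to one probe voice."""
    scores = {name: cosine_similarity(probe, vec) for name, vec in enrolled.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def find_recordings(target, recordings, threshold=0.9):
    """N:1 -- list the recordings in which one target speaker seems to appear."""
    return [rec_id for rec_id, vec in recordings.items()
            if cosine_similarity(target, vec) >= threshold]


def verify(probe, claimed, threshold=0.9):
    """1:1 -- accept or reject a single claimed identity."""
    return cosine_similarity(probe, claimed) >= threshold


# Hypothetical feature vectors; a real system would extract these from audio.
enrolled = {
    "suspect_a": [0.9, 0.1, 0.3],
    "suspect_b": [0.2, 0.8, 0.5],
    "suspect_c": [0.4, 0.4, 0.9],
}
recordings = {
    "call_17": [0.88, 0.12, 0.31],
    "call_18": [0.1, 0.9, 0.4],
}
probe = [0.85, 0.15, 0.35]  # the threatening phone call

ranking = identify(probe, enrolled)
best_match = ranking[0][0]
matching_calls = find_recordings(enrolled["suspect_a"], recordings)
access_granted = verify(probe, enrolled["suspect_a"])
```

Note that identification returns a ranked list rather than a single answer, while verification reduces to one yes-or-no decision against a threshold. Where that threshold sits is a policy choice that trades false accepts against false rejects.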
(Note: For those who are more acoustically inclined and would enjoy a well-written read on all things acoustic from military strategy to frog communication, Seth Horowitz's new book The Universal Sense: How Hearing Shapes the Mind comes with my highest recommendation.)
"The term voiceprint gives the false impression that voice has characteristics that are as unique and reliable as fingerprints... this is absolutely not the case."
The thing about voices is that they are susceptible to a myriad of external factors such as psychological/emotional state, age, health, weather... the list goes on. From an application standpoint, the most prominent of these factors is intentional vocal disguise. There are a number of things people can intentionally do to their voices to drastically reduce the ability of a machine or human expert to identify them correctly (you would be amazed at how difficult, nearly impossible, it is to identify a whispered voice). Under these conditions, identification accuracy has been reported at 40 to 52 percent (Thompson 1987), 36 percent (Andruski 2007), and 26 percent (Clifford 1980).
Top: Osama bin Laden's "dirty" 2003 telephonic spectrogram. Bottom: Osama bin Laden's "clean" spectrogram. Source: Owl Investigations
More problematic still is communication by telephone. Much of the input that law enforcement and national security analysts have to work with comes from telephone wiretaps or calls made from jail cells. Telephones, cellphones in particular, filter the acoustic signal so that all acoustic information under a certain frequency simply does not get transmitted, and some of the key characteristics for voice identification lie within that lost frequency range.
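The filtering effect can be illustrated with a small Python sketch. It builds a toy "voice" from two tones, pushes it through an idealized band-pass filter, and measures what survives. The 300 to 3400 Hz passband is the conventional figure for narrowband telephony and is used here as an assumption, as are the 8000 Hz sample rate and the two tone frequencies; real phone channels and codecs behave less cleanly than this brick-wall filter.

```python
import cmath
import math

SAMPLE_RATE = 8000          # Hz; assumed narrowband telephone sample rate
PASSBAND = (300.0, 3400.0)  # Hz; assumed conventional telephone voice band


def dft(samples):
    """Naive discrete Fourier transform (fine for a few hundred samples)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def telephone_filter(samples, sample_rate=SAMPLE_RATE, passband=PASSBAND):
    """Zero every frequency bin outside the passband (idealized brick-wall filter)."""
    n = len(samples)
    low, high = passband
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if not (low <= freq <= high):
            spectrum[k] = 0
    return idft(spectrum)


def amplitude_at(samples, freq, sample_rate=SAMPLE_RATE):
    """Estimate the amplitude of a single sinusoidal component at freq (Hz)."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * freq * t / sample_rate)
              for t, s in enumerate(samples))
    return 2 * abs(acc) / n


# Toy signal: a 120 Hz tone standing in for a low vocal pitch,
# plus a weaker 1000 Hz tone standing in for higher-frequency detail.
N = 400
voice = [math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
         for t in range(N)]

filtered = telephone_filter(voice)

low_before = amplitude_at(voice, 120)
low_after = amplitude_at(filtered, 120)
high_after = amplitude_at(filtered, 1000)
```

In this toy case the 120 Hz component is present at full strength before filtering and essentially gone afterward, while the 1000 Hz component passes through unchanged. That is the sense in which low-frequency cues never reach the analyst.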