from newscientist.com
Every call into or out of US prisons is recorded. It can be important to know what’s being said, because some inmates use phones to conduct illegal business on the outside. But the recordings generate huge quantities of audio that are prohibitively expensive to monitor with human ears.
To help, one jail in the Midwest recently used a machine-learning system developed by London firm Intelligent Voice to listen in on the thousands of hours of recordings generated every month.
“No one at the prison spotted the code word until software started churning through calls”
The software saw the phrase “three-way” cropping up again and again in the calls – it was one of the most common non-trivial words or phrases used. At first, prison officials were surprised by the overwhelming popularity of what they thought was a sexual reference.
Then they worked out it was code. Prisoners are allowed to call only a few previously agreed numbers. So if an inmate wanted to speak to someone on a number not on the list, they would call their friends or parents and ask for a “three-way” with the person they really wanted to talk to – code for dialling a third party into the call. No one running the phone surveillance at the prison spotted the code until the software started churning through the recordings.
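To give a flavour of how a phrase like this surfaces, here is a minimal sketch of that kind of frequency analysis over call transcripts. The folder layout, stopword list and phrase lengths are illustrative assumptions, not Intelligent Voice’s implementation:

```python
# Illustrative sketch: count words and two-word phrases across transcripts
# so that an unusually common phrase like "three-way" stands out.
# The folder name and stopword list are hypothetical.
from collections import Counter
from pathlib import Path

STOPWORDS = {"the", "a", "and", "to", "i", "you", "it", "that", "of", "in"}

def ngrams(tokens, n):
    """Yield n-word phrases from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

counts = Counter()
for transcript in Path("call_transcripts").glob("*.txt"):  # hypothetical folder
    tokens = transcript.read_text().lower().split()
    for n in (1, 2):  # single words and two-word phrases
        for gram in ngrams(tokens, n):
            if not all(w in STOPWORDS for w in gram):  # skip all-trivial phrases
                counts[" ".join(gram)] += 1

# Phrases that dominate across thousands of calls surface immediately
for phrase, freq in counts.most_common(20):
    print(f"{freq:6d}  {phrase}")
```

Even this crude tally, run at scale, would push a code word used in call after call to the top of the list, which is the kind of pattern no human monitor listening to a handful of calls would notice.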
This story illustrates the speed and scale of analysis that machine-learning algorithms are bringing to the world. Intelligent Voice originally developed the software for use by UK banks, which must record their calls to comply with industry regulations. As with prisons, this generates a vast amount of audio data that is hard to search through.
The company’s CEO Nigel Cannings says the breakthrough came when he decided to see what would happen if he pointed a machine-learning system at a visual rendering of the voice data’s waveform – its pattern of spikes and troughs – rather than at the audio signal directly. It worked brilliantly.
Training his system on this visual representation let him harness powerful existing techniques designed for image classification. “I built this dialect classification system based on pictures of the human voice,” he says.
The trick let his system create its own models for recognising speech patterns and accents that were as good as the best hand-coded ones around, models built by dialect and computer science experts. “In our first run we were getting something like 88 per cent accuracy,” says Intelligent Voice developer Neil Glackin.
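The article doesn’t specify which visual representation or network the company used. A common way to realise the same idea is to render each recording as a mel spectrogram – a “picture of the human voice” – and train a standard image classifier on it. The sketch below makes that concrete; the file name, network and five dialect classes are hypothetical stand-ins:

```python
# Sketch of the general technique: render audio as an image, then reuse a
# standard image classifier. The representation (mel spectrogram) and the
# toy CNN here are illustrative choices, not Intelligent Voice's system.
import librosa
import numpy as np
import torch
import torch.nn as nn

def voice_to_image(path, sr=16000, n_mels=64):
    """Turn a recording into a 'picture of the human voice'."""
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    img = librosa.power_to_db(mel, ref=np.max)          # log scale
    img = (img - img.min()) / (img.max() - img.min())   # normalise to [0, 1]
    return torch.tensor(img, dtype=torch.float32).unsqueeze(0)  # 1 x n_mels x time

# Any off-the-shelf image classifier can now be trained on these pictures,
# for example to label the speaker's dialect. A toy CNN head:
classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((8, 8)),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 5),   # 5 hypothetical dialect classes
)

logits = classifier(voice_to_image("call.wav").unsqueeze(0))  # add batch dim
```

The appeal of this route is that image classification is one of the most mature corners of machine learning, so years of work on recognising objects in photos transfers directly to recognising accents in voices.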
The software then taught itself to transcribe speech by using recordings of US congressional hearings, matching up the audio with the transcripts.
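The article doesn’t say how the audio and transcripts were matched up. One standard technique for learning to transcribe from paired audio and text without word-level timings is connectionist temporal classification (CTC), sketched below; the model size, feature dimensions and sample sentence are stand-ins, not details from the story:

```python
# Illustrative sketch: learning transcription from (audio, transcript) pairs
# with CTC loss, which sums over all possible alignments so no word timings
# are needed. Whether Intelligent Voice used CTC is not stated in the article.
import torch
import torch.nn as nn

VOCAB = " abcdefghijklmnopqrstuvwxyz'"   # character set; index 0 is the CTC blank
char_to_idx = {c: i + 1 for i, c in enumerate(VOCAB)}

def encode(text):
    return torch.tensor([char_to_idx[c] for c in text.lower() if c in char_to_idx])

# A tiny acoustic model: spectrogram frames in, per-frame character logits out.
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),        # 64 = hypothetical mel bins per frame
    nn.Linear(128, len(VOCAB) + 1),       # +1 for the blank symbol
)
ctc = nn.CTCLoss(blank=0)

# One training step on a single (features, transcript) pair:
features = torch.randn(900, 64)           # 900 frames of audio (stand-in data)
target = encode("the committee will come to order")  # hypothetical transcript line

log_probs = model(features).log_softmax(dim=-1).unsqueeze(1)  # (time, batch=1, classes)
loss = ctc(log_probs,
           target.unsqueeze(0),
           input_lengths=torch.tensor([900]),
           target_lengths=torch.tensor([len(target)]))
loss.backward()
```

Congressional hearings suit this kind of training well: they supply hundreds of hours of clear speech already paired with official transcripts, for free.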
Cheap as chips
The power of machines that can listen and watch is not that they can do better than human ears or eyes. In fact, they perform much worse – especially when confronted with data from the real world. Their power, like all applications of computation, lies in speed, scale and the relative cheapness of processing.