For millions who can’t hear, lip reading offers a window into conversations that would be lost without it. But the practice is hard—and the results are often inaccurate.
Now, researchers are reporting a new artificial intelligence (AI) program that outperformed professional lip readers and the best AI to date, with just half the error rate of the previous best algorithm. If perfected and integrated into smart devices, the approach could put lip reading in the palm of everyone’s hand.
“It’s a fantastic piece of work,” says Helen Bear, a computer scientist at Queen Mary University of London who was not involved with the project.
Writing computer code that can read lips is maddeningly difficult. So in the new study, scientists turned to a form of artificial intelligence called machine learning, in which computers learn from data. They fed their system thousands of hours of videos along with transcripts, and had the computer solve the task for itself.
The researchers started with 140,000 hours of YouTube videos of people talking in diverse situations. Then, they designed a program that created clips a few seconds long with the mouth movement for each phoneme, or word sound, annotated. The program filtered out non-English speech, nonspeaking faces, low-quality video, and video that wasn’t shot straight ahead. Finally, they cropped the videos around the mouth. That yielded nearly 4000 hours of footage, including more than 127,000 English words.
The process and the resulting data set—seven times larger than anything of its kind—are “important and valuable” for anyone else who wants to train similar systems to read lips, says Hassan Akbari, a computer scientist at Columbia University who was not involved in the research.
The process relies in part on neural networks, algorithms containing many simple computing elements connected together that learn and process information in a way similar to the human brain. When the team fed the program unlabeled video, these networks produced cropped clips of mouth movements. The next program in the system, which also used neural networks, took those clips and came up with a list of possible phonemes and their probabilities for each video frame. A final set of algorithms took those sequences of possible phonemes and produced sequences of English words.
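The last two stages described above, per-frame phoneme probabilities followed by a step that turns phoneme sequences into words, can be illustrated with a toy sketch. This is not the researchers' method: the phoneme labels, the `LEXICON` dictionary, the `"sil"` silence marker, and the greedy longest-match decoding are all assumptions made up for illustration, and a real system would weigh many candidate sequences rather than take the single most likely phoneme per frame.

```python
from itertools import groupby

SILENCE = "sil"  # hypothetical label for frames with no speech

# Hypothetical toy lexicon mapping phoneme sequences to words.
LEXICON = {("HH", "AY"): "hi", ("B", "AY"): "bye"}


def frames_to_phonemes(frame_probs):
    """Take the most probable phoneme in each frame, merge consecutive
    repeats (one sound spans several frames), and drop silence."""
    best = [max(probs, key=probs.get) for probs in frame_probs]
    return [ph for ph, _ in groupby(best) if ph != SILENCE]


def phonemes_to_words(phonemes, lexicon=LEXICON):
    """Greedy longest-match lookup of phoneme runs in the lexicon."""
    words, i = [], 0
    max_len = max(len(key) for key in lexicon)
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + n])
            if key in lexicon:
                words.append(lexicon[key])
                i += n
                break
        else:
            i += 1  # no known word starts here; skip this phoneme
    return words


# Six video frames, each with made-up phoneme probabilities.
frames = [
    {"HH": 0.7, "B": 0.2, "AY": 0.1},
    {"HH": 0.6, "B": 0.3, "AY": 0.1},
    {"AY": 0.8, "HH": 0.1, "B": 0.1},
    {SILENCE: 0.9, "AY": 0.1},
    {"B": 0.5, "HH": 0.4, "AY": 0.1},
    {"AY": 0.9, "B": 0.1},
]

phonemes = frames_to_phonemes(frames)
words = phonemes_to_words(phonemes)
print(phonemes, words)
```

The example shows why the stages are separated: the first reduces noisy frame-level guesses to a short phoneme sequence, and the second applies knowledge of the vocabulary to that sequence.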
After training, the researchers tested their system on 37 minutes of video it had not seen before. The system misidentified only 41% of the words, they report in a paper posted this month to the website arXiv. That might not sound impressive, but the best previous computer method, which focuses on individual letters rather than phonemes, had a word error rate of 77%. In the same study, professional lip readers erred at a rate of 93% (though in real life they have context and body language to go on, which helps). The work was done by DeepMind, an artificial intelligence company based in London, which declined to comment on the record. […]
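For readers curious how a word error rate is typically computed, here is a minimal sketch using the standard definition: the smallest number of word substitutions, insertions, and deletions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. The function name and the example sentences are illustrative assumptions; the article does not say which scoring tool the researchers used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six in the reference.
rate = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{rate:.1%}")
```

Because insertions count as errors, the rate can exceed 100% when the output is much longer than the reference.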