Imagine you are watching a scary movie: the heroine creeps through a dark basement, on high alert. Suspenseful music plays in the background while some unseen, sinister creature lurks in the shadows… and then: BANG! It knocks over an object.
Such scenes would hardly be as captivating and scary without the intense, but perfectly timed sound effects, like the loud bang that sent our main character wheeling around in fear. Usually these sound effects are recorded by Foley artists in the studio, who produce the sounds using oodles of objects at their disposal. Recording the sound of glass breaking may involve actually breaking glass repeatedly, for example, until the sound closely matches the video clip.
In a more recent plot twist, researchers have created an automated program that analyzes the movement in video frames and creates its own artificial sound effects to match the scene. In a survey, the majority of people polled indicated that they believed the fake sound effects were real. The model, AutoFoley, is described in a study published June 25 in IEEE Transactions on Multimedia.
“Adding sound effects in post-production using the art of Foley has been an intricate part of movie and television soundtracks since the 1930s,” explains Jeff Prevost, a professor at the University of Texas at San Antonio who co-created AutoFoley. “Movies would seem hollow and distant without the controlled layer of a realistic Foley soundtrack. However, the process of Foley sound synthesis therefore adds significant time and cost to the creation of a motion picture.”
Intrigued by the thought of an automated Foley system, Prevost and his PhD student, Sanchita Ghose, set about creating a multi-layered machine learning program. They created two different models that could be used in the first step, which involves identifying the actions in a video and determining the appropriate sound.
The first machine learning model extracts image features (e.g., color and motion) from the frames of fast-moving action clips to determine an appropriate sound effect.
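To make that first step concrete, here is a minimal sketch of extracting color and motion features from a clip's frames. This is an illustration of the general idea, not the paper's actual network: the histogram-based color feature and frame-difference motion feature are assumptions standing in for what a learned model would compute.

```python
import numpy as np

def extract_features(frames, bins=8):
    """Illustrative sketch (not AutoFoley's network): per-clip color and
    motion features of the kind a model might extract from action frames."""
    frames = np.asarray(frames, dtype=np.float32)  # (T, H, W, 3), values in [0, 1]
    # Color feature: intensity histogram per channel, averaged over the clip.
    color = np.stack([
        np.histogram(frames[..., c], bins=bins, range=(0.0, 1.0), density=True)[0]
        for c in range(3)
    ])
    # Motion feature: mean absolute difference between consecutive frames.
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    return color, motion

# Tiny synthetic clip: 4 noise frames drifting brighter over time.
rng = np.random.default_rng(0)
clip = np.clip(
    rng.random((4, 16, 16, 3)) + np.linspace(0, 0.2, 4)[:, None, None, None],
    0, 1,
)
color, motion = extract_features(clip)
print(color.shape, motion.shape)  # (3, 8) (3,)
```

A fast-moving clip produces large frame-to-frame differences, so the motion feature alone already separates, say, a galloping horse from a ticking clock.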
The second model analyzes the temporal relationship of an object in separate frames. By using relational reasoning to compare different frames across time, the second model can anticipate what action is taking place in the video.
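The core of that relational-reasoning idea can be sketched as scoring every ordered pair of frames with one shared function and aggregating the scores. The toy relation function below is an assumption for illustration; the paper's actual architecture differs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-frame feature vectors for an 8-frame clip.
frame_feats = rng.random((8, 32))

def pairwise_relations(feats):
    """Sketch of temporal relational reasoning: compare every pair of frames
    with a shared function g, then sum over all pairs. The weights here are
    random stand-ins, not trained parameters."""
    w = rng.random((64, 16))  # shared weights of a toy relation function g
    scores = []
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            pair = np.concatenate([feats[i], feats[j]])  # (64,)
            scores.append(np.tanh(pair @ w))             # g(frame_i, frame_j)
    return np.sum(scores, axis=0)  # aggregate relations across time

relation_vec = pairwise_relations(frame_feats)
print(relation_vec.shape)  # (16,)
```

Because g sees frames from different points in time, the aggregated vector encodes how objects move between frames, which is what lets the model anticipate the action unfolding in the video.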
In a final step, sound is synthesized to match the activity or motion predicted by one of the models. Prevost and Ghose used AutoFoley to create sound for 1,000 short movie clips capturing a number of common actions, like falling rain, a galloping horse, and a ticking clock. […]
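As a rough picture of that final step, the sketch below maps a predicted action class to a procedurally generated waveform. AutoFoley's real synthesis is learned from data; the action names, sample rate, and waveform recipes here are assumptions for illustration only.

```python
import numpy as np

SR = 16_000  # sample rate in Hz; an assumption for this sketch

def synthesize(action, seconds=1.0):
    """Toy stand-in for the synthesis step: map a predicted action class
    to a simple procedural waveform (not AutoFoley's learned synthesis)."""
    t = np.linspace(0, seconds, int(SR * seconds), endpoint=False)
    if action == "ticking_clock":
        # One sharp, exponentially decaying click per second.
        wave = np.exp(-80 * (t % 1.0)) * np.sin(2 * np.pi * 2000 * t)
    elif action == "falling_rain":
        # Broadband noise roughly approximates rainfall.
        wave = np.random.default_rng(2).normal(0.0, 0.1, t.shape)
    else:
        wave = np.zeros_like(t)
    return wave.astype(np.float32)

audio = synthesize("ticking_clock", seconds=2.0)
print(audio.shape)  # (32000,)
```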
Read more: spectrum.ieee.org