IBM Corp. has teamed up with researchers from the Massachusetts Institute of Technology to create a new method for training “video recognition” deep learning models more efficiently.
Copyright by siliconangle.com
Deep learning is a branch of machine learning that aims to replicate how the human brain solves problems. It has led to major breakthroughs in areas such as language translation and image and voice recognition.
Video recognition is similar to image classification, in that the deep learning model tries to identify what’s going on in a video, including the objects and people it sees, what they’re doing and so on. The main difference between the two is that a video has many more moving parts than a single static image, so training deep learning models to understand videos takes much more time and effort.
“By one estimate, training a video recognition model can take up to 50 times more data and eight times more processing power than training an image classification model,” MIT explained in a blog post today.
Of course, no one likes devoting huge amounts of compute resources to such a task because it can often be prohibitively expensive. Moreover, the resources needed make it next to impossible to run video recognition models on low-powered mobile devices, where many AI applications are going.
Those problems are what inspired a research team led by Song Han, an assistant professor at MIT’s Department of Electrical Engineering and Computer Science, to come up with a more efficient model for video recognition training. The new technique dramatically reduces the size of video recognition models in order to speed up training times and improve performance on mobile devices.
“Our goal is to make AI accessible to anyone with a low-power device,” Han said. “To do that we need to design efficient AI models that use less energy and can run smoothly on edge devices where so much of AI is moving.”
Image classification models work by looking for patterns in the pixels of an image in order to build up a representation of what they see. With enough examples, the models can learn to recognize people, objects and the ways they relate to one another.
Video recognition works in a similar way, but the deep learning models go further by using “three-dimensional convolutions” to encode the passage of time in a sequence of images (video frames), which leads to bigger and more computationally intensive models. To reduce the calculations involved, Han and his colleagues designed an operation they call a “temporal shift module,” which shifts the feature maps of a selected video frame to its neighboring frames. By mingling spatial representations of the past, present and future, the model gets a sense of time passing without explicitly representing it. […]
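To make the shifting idea concrete, here is a minimal NumPy sketch of one way such a temporal shift could work. It is an illustration under stated assumptions, not the researchers’ implementation: the function name `temporal_shift`, the `shift_fraction` parameter, the choice to split the shifted channels evenly between the two time directions, and the zero-filling at the first and last frames are all assumptions made for this example.

```python
import numpy as np


def temporal_shift(features, shift_fraction=0.25):
    """Shift part of each frame's feature maps to neighboring frames.

    features: array of shape (T, C, H, W), i.e. frames, channels, height, width.
    shift_fraction: share of channels to shift (hypothetical parameter);
        half of them move one frame forward in time, half one frame backward.
    """
    num_channels = features.shape[1]
    fold = int(num_channels * shift_fraction / 2)  # channels per direction

    shifted = np.zeros_like(features)
    # First group of channels: each frame receives them from the previous frame.
    shifted[1:, :fold] = features[:-1, :fold]
    # Second group: each frame receives them from the next frame.
    shifted[:-1, fold:2 * fold] = features[1:, fold:2 * fold]
    # Remaining channels stay with their own frame.
    shifted[:, 2 * fold:] = features[:, 2 * fold:]
    return shifted


# Example: 8 frames, 16 channels, 4x4 feature maps of random values.
clip = np.random.rand(8, 16, 4, 4)
mixed = temporal_shift(clip)
```

In this sketch the shift only copies data between frames and performs no multiplications, so an ordinary two-dimensional convolution applied to `mixed` afterwards would see a blend of past, present and future features for each frame.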