Consulting Energy Research Telco

IBM and MIT break new ground in video recognition model training

IBM and MIT break new ground in video recognition model training

IBM Corp. has teamed up with researchers from the Massachusetts Institute of Technology to create a new method for training “ models more efficiently.

Copyright by

SwissCognitiveDeep learning is a branch of that aims to replicate how the human brain solves problems. It has led to major breakthroughs in areas such as language translation and image and voice recognition.

Video recognition is similar to image classification, in that the model basically tries to identify what’s going on in a video, including the objects and people it sees, what they’re doing and so on. The main difference between the two is that videos have a lot more moving parts than a simple, static image, and so training models to understand them takes much more time and effort.

“By one estimate, training a model can take up to 50 times more data and eight times more processing power than training an image classification model,” MIT explained in a blog post today.

Of course, no one likes devoting huge amounts of compute resources to such a task because it can often be prohibitively expensive. Moreover, the resources needed makes it next to impossible to run models on low-powered mobile devices, where many applications are going.

Those problems are what inspired a research team led by Song Han, an assistant professor at MIT’s Department of Electrical Engineering and Computer Science, to come up with a more efficient model for training. The new technique dramatically reduces the size of models in order to speed up training times and improve performance on mobile devices.

“Our goal is to make accessible to anyone with a low-power device,” Han said. “To do that we need to design efficient models that use less energy and can run smoothly on edge devices where so much of is moving.”

Image classification models work by looking for patterns in the pixels of an image in order to build up a representation of what they see. With enough examples, the models can learn to recognize people, objects and the ways they relate to one another.

Video recognition works in a similar way, but the models go further by using “three-dimensional convolutions” to encode the passage of time in a sequence of images (video frames), which leads to bigger and more computationally-intensive models. To reduce the calculations involved, Han and his colleagues designed an operation they call a “temporal shift module” which shifts the feature maps of a selected video frame to its neighboring frames. By mingling spatial representations of the past, present and future, the model gets a sense of time passing without explicitly representing it. […]

Read more –


Leave a Reply