Researchers create a mathematical framework to evaluate explanations of machine-learning models and quantify how well people understand them.


Copyright: – “Unpacking black-box models”


Modern machine-learning models, such as neural networks, are often referred to as “black boxes” because they are so complex that even the researchers who design them can’t fully understand how they make predictions.

To provide some insights, researchers use explanation methods that seek to describe individual model decisions. For example, they may highlight words in a movie review that influenced the model’s decision that the review was positive.

But these explanation methods don’t do any good if humans can’t easily understand them, or even misunderstand them. So, MIT researchers created a mathematical framework to formally quantify and evaluate the understandability of explanations for machine-learning models. This can help pinpoint insights about model behavior that might be missed if the researcher is only evaluating a handful of individual explanations to try to understand the entire model.

“With this framework, we can have a very clear picture of not only what we know about the model from these local explanations, but more importantly what we don’t know about it,” says Yilun Zhou, an electrical engineering and computer science graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of a paper presenting this framework.

Image: Courtesy of the researchers

Thank you for reading this post, don't forget to subscribe to our AI NAVIGATOR!


Zhou’s co-authors include Marco Tulio Ribeiro, a senior researcher at Microsoft Research, and senior author Julie Shah, a professor of aeronautics and astronautics and the director of the Interactive Robotics Group in CSAIL. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.

Understanding local explanations

One way to understand a machine-learning model is to find another model that mimics its predictions but uses transparent reasoning patterns. However, recent neural network models are so complex that this technique usually fails. Instead, researchers resort to using local explanations that focus on individual inputs. Often, these explanations highlight words in the text to signify their importance to one prediction made by the model.[…]

Read more: