Jürgen Schmidhuber in an interview with Dalith Steiger about how DanNet triggered the deep convolutional network revolution.
Q: Jürgen, now it’s 2021. A decade ago, something very important happened in the fields of deep learning and AI.
A: That’s true! In 2011, our DanNet triggered the deep convolutional network (CNN) revolution.
Q: Can you tell us more about the DanNet?
A: Sure. DanNet was named after my outstanding Romanian postdoc Dan Claudiu Cireșan (aka Dan Ciresan). It was developed at the Swiss AI Lab, IDSIA, in Lugano. In 2011, DanNet was the first pure deep convolutional neural network (CNN) to win computer vision contests. For a while, it enjoyed a monopoly. From 2011 to 2012, it won every contest it entered, winning four of them in a row (15 May 2011, 6 August 2011, 1 March 2012, 10 September 2012), driven by a very fast implementation based on graphics processing units (GPUs).
Q: How mind-blowing was the DanNet?
A: It seemed out of this world. Already in 2011, DanNet achieved the first superhuman performance in a vision challenge, although computing was still 100 times more expensive than today.
Q: What happened then?
A: In July 2012, our CVPR paper on DanNet hit the computer vision community. Many other researchers started working in this field, too. The similar AlexNet (citing DanNet) joined the party in Dec 2012. Our much deeper Highway Net (May 2015) and its special case ResNet (Dec 2015) further improved performance (a ResNet is a Highway Net whose gates are always open). Today, a decade after DanNet, everybody is using fast deep CNNs for computer vision.
Q: But you guys did not invent CNNs; you just made them very deep and fast, right? Where did those CNNs come from?
A: CNNs originated over four decades ago. The basic CNN architecture with convolutional layers and downsampling layers is due to Kunihiko Fukushima (1979). In 1987, Alex Waibel, working in Japan, combined NNs with convolutions with weight sharing and backpropagation, a technique from 1970. Yann LeCun’s team later contributed important improvements of CNNs, especially for images, e.g., (Sec. XVIII). The popular downsampling variant called “max-pooling” was introduced by Juyang Weng et al. (1993). In 2010, my team at the Swiss AI Lab IDSIA showed that unsupervised pre-training is not necessary to train deep NNs (a reviewer called this a “wake-up call to the machine learning community”; compare the survey blog post).
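For readers who want to see the two building blocks named in this answer, a convolutional layer with shared weights and a max-pooling downsampling layer, here is a minimal sketch in plain Python. It is an illustration only, not DanNet's code or any historical implementation; the function names `conv2d_valid` and `max_pool2d`, the toy image, and the edge-detecting kernel are assumptions made for this example.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1).

    Like most CNN code, this is technically a cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Downsample by keeping the maximum of each non-overlapping block."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Toy 5x5 image containing a vertical edge (hypothetical data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to a dark-to-bright step.
kernel = [[-1, 1], [-1, 1]]

features = conv2d_valid(image, kernel)  # 4x4 feature map
pooled = max_pool2d(features)           # 2x2 map after downsampling
```

The same four kernel weights are reused at every image position, which is what weight sharing means here; pooling then shrinks the feature map while keeping the strongest response in each block.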
Q: And what happened then?
A: One year later, our team with my postdocs Dan Cireșan & Ueli Meier and my PhD student Jonathan Masci (a fellow co-founder of NNAISENSE) greatly sped up the training of deep CNNs. Our fast GPU-based CNN of 1 February 2011, now known as the “DanNet,” was a practical breakthrough. Published later that year at IJCAI, it was much deeper and faster than earlier GPU-accelerated CNNs of 2006. DanNet showed that deep CNNs worked far better than the existing state of the art for recognizing objects in images.
Q: Tell us more about the superhuman result that DanNet achieved in the same year!
A: On a sunny day in Silicon Valley, at IJCNN 2011, DanNet blew away the competition and achieved the first superhuman visual pattern recognition in an international contest. Even the New York Times mentioned this. DanNet performed twice as well as human test subjects and three times better than the already impressive second-place entry by LeCun’s team.
Q: Awesome. How did the industry react?
A: In 2011, DanNet immediately attracted tremendous interest from industry, which further exploded when its temporary monopoly on winning computer vision competitions made DanNet the first deep CNN to win: a Chinese handwriting contest (ICDAR, May 2011), a traffic sign recognition contest (IJCNN, Aug 2011), an image segmentation contest (ISBI, May 2012), and a contest on object detection in large images (ICPR, Sept 2012). The last of these was actually a medical imaging contest on cancer detection. Our CNN image scanners were 1000 times faster than previous methods. The significance of these kinds of improvements in the health care industry is obvious. Today IBM, Siemens, Google, and many startups are pursuing this approach.
In 2011, we also introduced our deep neural nets to ArcelorMittal, the world’s largest steel producer, and were able to greatly improve steel defect detection. To the best of my knowledge, this was the first deep learning breakthrough in heavy industry.
Q: In Feb 2012, you had a technical report on DanNet which summarized some recent breakthroughs. What happened then?
A: In July 2012, DanNet was also presented at CVPR, the leading computer vision conference. This helped to spread the word in the computer vision community. As of 2020, the CVPR article was the most cited DanNet paper, albeit not the first.
Q: At some point, CNNs developed by other research groups also started winning something, correct?
A: Yes, after DanNet had won four image recognition competitions, the similar GPU-accelerated “AlexNet” won the ImageNet 2012 contest. Unlike DanNet, AlexNet used Christoph v. d. Malsburg’s rectified linear neurons (ReLUs) (1973) and a variant of Stephen J. Hanson’s stochastic delta rule (1990) called “dropout”. While both of these techniques helped, they are not really required to win vision contests. Back then, the only really important CNN-related task was to greatly accelerate known techniques for training CNNs through GPUs.
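As a rough illustration of the two techniques mentioned in this answer, the following plain-Python sketch shows a rectified linear unit and a simple form of dropout. It is not AlexNet's code. The function names, the toy inputs, and the "inverted" rescaling of surviving units at training time (a common modern convention) are assumptions made for this example.

```python
import random


def relu(x):
    """Rectified linear unit: pass positive inputs, clamp the rest to zero."""
    return max(0.0, x)


def dropout(activations, p_drop, rng):
    """Zero each unit with probability p_drop and rescale the survivors.

    The rescaling by 1 / (1 - p_drop) is an assumption of this sketch
    ("inverted dropout"); it keeps the expected activation unchanged.
    """
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                   # seeded for reproducibility
pre_activations = [-1.5, 0.2, 3.0, -0.1]  # hypothetical layer inputs
hidden = [relu(v) for v in pre_activations]
train_hidden = dropout(hidden, p_drop=0.5, rng=rng)
```

At test time, dropout is switched off and `hidden` would be used directly.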
Q: But you did not stop back then; you kept pushing the frontiers of what’s possible with deep neural networks. What was the next step?
A: Indeed, we continued to make CNNs and other neural nets even deeper and better. It should be mentioned that until 2015, deep networks had at most a few tens of layers, e.g., 20-30 layers. But in May 2015, our novel Highway Net was the first working extremely deep feedforward neural net with hundreds of layers. The Highway Net is based on the LSTM principle, which enables much deeper learning. Its special case called “ResNet” (the ImageNet 2015 winner of Dec 2015) is a Highway Net whose gates are always open. Highway Nets perform roughly as well as ResNets on ImageNet. Highway layers are also often used for natural language processing.
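The relationship between the two architectures can be sketched in a few lines of plain Python. The sketch assumes the general highway form y = H(x) · T(x) + x · C(x), with a transform gate T and a carry gate C, and reads "gates are always open" as T = C = 1, which leaves y = H(x) + x. The toy transform, the gate functions, and all names below are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, transform, t_gate, c_gate):
    """Elementwise y = H(x) * T(x) + x * C(x)."""
    h, t, c = transform(x), t_gate(x), c_gate(x)
    return [hi * ti + xi * ci for hi, ti, xi, ci in zip(h, t, x, c)]


def residual_layer(x, transform):
    """Elementwise y = H(x) + x, i.e. a plain skip connection."""
    return [hi + xi for hi, xi in zip(transform(x), x)]


def toy_transform(x):
    """Hypothetical stand-in for a learned nonlinear transform H."""
    return [math.tanh(2.0 * xi) for xi in x]


def learned_gate(x):
    """Hypothetical stand-in for a learned gate with values in (0, 1)."""
    return [sigmoid(xi - 1.0) for xi in x]


def carry_from_transform(x):
    """Carry gate coupled to the transform gate: C = 1 - T."""
    return [1.0 - ti for ti in learned_gate(x)]


def always_open(x):
    """A gate fixed at 1 for every unit."""
    return [1.0 for _ in x]


x = [0.5, -1.0, 2.0]
y_gated = highway_layer(x, toy_transform, learned_gate, carry_from_transform)
y_open = highway_layer(x, toy_transform, always_open, always_open)
y_residual = residual_layer(x, toy_transform)
```

With both gates fixed at 1, `y_open` equals `y_residual`, while `y_gated` shows the more general case in which the layer decides per unit how much to transform and how much to carry through.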
Q: What an incredible development. What else has changed in the past decade?
A: The original successes of DanNet required a precise understanding of GPUs’ inner workings. Today, convenient software packages shield the user from such details, and compute is roughly 100 times cheaper than 10 years ago when our results set the stage for the recent decade of deep learning. Many current commercial neural net applications are based on what started in 2011.
Q: Jürgen, thanks a lot for this interview on DanNet!
About the interviewee
The media have called Jürgen Schmidhuber the father of modern Artificial Intelligence. Since age 15, his main goal has been to build a self-improving AI smarter than himself, then retire. His lab’s deep learning neural networks such as LSTM have revolutionized machine learning, are now on 3 billion smartphones, and are used billions of times per day, e.g., for Facebook’s automatic translation (2017), Google’s speech recognition (since 2015), Apple’s Siri & QuickType, and Amazon’s Alexa. He also pioneered artificial curiosity and meta-learning machines that learn to learn. He is the recipient of numerous awards and chief scientist of the company NNAISENSE, which aims at building the first practical general purpose AI. He is also advising various governments on AI strategies.
Sources of Images:
- Credits: FAZ/Bieber
- Markus Bertschi – Jakob und Bertschi, www.jakobundbertschi.ch