A previous post [MIR] (2019) focused on our Annus Mirabilis 1990-1991 at TU Munich. Back then we published many of the basic ideas that powered the Artificial Intelligence Revolution of the 2010s through Artificial Neural Networks (NNs) and Deep Learning. The present post is partially redundant but much shorter (a 7 min read), focusing on the recent decade’s most important developments and applications based on our work, also mentioning related work, and concluding with an outlook on the 2020s, also addressing privacy and data markets.
Copyright by Jürgen Schmidhuber
1. The Decade of Long Short-Term Memory (LSTM)
Much of AI in the 2010s was about the NN called Long Short-Term Memory (LSTM) [LSTM1-13] [DL4]. The world is sequential by nature, and LSTM has revolutionized sequential data processing, e.g., speech recognition, machine translation, video recognition, connected handwriting recognition, robotics, video game playing, time series prediction, chat bots, healthcare applications, you name it. By 2019, [LSTM1] got more citations per year than any other computer science paper of the past millennium. Below I’ll list some of the most visible and historically most significant applications.
2009: Connected Handwriting Recognition. Enormous interest from industry was triggered right before the 2010s when out of the blue my PhD student Alex Graves won three connected handwriting competitions (French, Farsi, Arabic) at ICDAR 2009, the famous conference on document analysis and recognition. He used a combination of two methods developed in my research groups at TU Munich and the Swiss AI Lab IDSIA: LSTM (1990s-2005) [LSTM1-6] (which overcomes the famous vanishing gradient problem analyzed by my PhD student Sepp Hochreiter [VAN1] in 1991) and Connectionist Temporal Classification [CTC] (2006). CTC-trained LSTM was the first recurrent NN or RNN [MC43] [K56] to win international contests.
CTC-Trained LSTM also was the First Superior End-to-End Neural Speech Recognizer. Already in 2007, our team successfully applied CTC-LSTM to speech [LSTM4], also with hierarchical LSTM stacks [LSTM14]. This was very different from previous hybrid methods since the late 1980s which combined NNs and traditional approaches such as Hidden Markov Models (HMMs), e.g., [BW] [BRI] [BOU] [HYB12]. Alex kept using CTC-LSTM as a postdoc in Toronto [LSTM8].
CTC-LSTM has had massive industrial impact. By 2015, it dramatically improved Google’s speech recognition [GSR15] [DL4]. This is now on almost every smartphone. By 2016, more than a quarter of the power of all those Tensor Processing Units in Google’s datacenters was used for LSTM (and 5% for convolutional NNs) [JOU17]. Google’s on-device speech recognition of 2019 (no longer on the server) is still based on LSTM. See [MIR], Sec. 4. Microsoft, Baidu, Amazon, Samsung, Apple, and many other famous companies are using LSTM, too [DL4] [DL1].
2016: The First Superior End-to-End Neural Machine Translation was also Based on LSTM. Already in 2001, my PhD student Felix Gers showed that LSTM can learn languages unlearnable by traditional models such as HMMs [LSTM13]. That is, a neural “subsymbolic” model suddenly excelled at learning “symbolic” tasks! Compute still had to get 1000 times cheaper, but by 2016-17, both Google Translate [GT16] [WU] (which mentions LSTM over 50 times) and Facebook Translate [FB17] were based on two connected LSTMs [S2S], one for incoming texts, one for outgoing translations – much better than what existed before [DL4]. By 2017, Facebook’s users made 30 billion LSTM-based translations per week [FB17] [DL4]. Compare: the most popular YouTube video (the song “Despacito”) got only 6 billion clicks in 2 years. See [MIR], Sec. 4.
LSTM-Based Robotics. By 2003, our team used LSTM for Reinforcement Learning (RL) and robotics, e.g., [LSTM-RL]. In the 2010s, combinations of RL and LSTM have become standard. For example, in 2018, an RL-trained LSTM was the core of OpenAI’s Dactyl which learned to control a dextrous robot hand without a teacher [OAI1].
2018-2019: LSTM for Video Games. In 2019, DeepMind beat a pro player in the game of StarCraft, which is harder than Chess or Go [DM2] in many ways, using AlphaStar whose brain has a deep LSTM core trained by RL [DM3]. An RL-trained LSTM (with 84% of the model’s total parameter count) also was the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018) [OAI2] [OAI2a]. See [MIR], Sec. 4.
The 2010s saw many additional LSTM applications, e.g., [DL1]. LSTM was used for healthcare, chemistry, molecule design, lip reading [LIP1], stock market prediction, self-driving cars, mapping brain signals to speech (Nature, vol 568, 2019), predicting what’s going on in nuclear fusion reactors (same volume, p. 526), etc. There is not enough space to mention everything here.
2. The Decade of Feedforward Neural Networks
LSTM is an RNN that can in principle implement any program that runs on your laptop. The more limited feedforward NNs (FNNs) cannot (although they are good enough for board games such as Backgammon [T94] and Go [DM2] and Chess). That is, if we want to build an NN-based Artificial General Intelligence (AGI), then its underlying computational substrate must be something like an RNN. FNNs are fundamentally insufficient. RNNs relate to FNNs like general computers relate to mere calculators. Nevertheless, our Decade of Deep Learning was also about FNNs, as described next.
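The difference between the two network types can be illustrated with a toy task. The following is a minimal sketch, not taken from any of the cited papers: a tiny recurrent net of threshold units computes the parity of a bit sequence of arbitrary length with one fixed set of weights. The weights are hand-chosen assumptions for illustration (nothing is learned here), and the names step and recurrent_parity are hypothetical.

```python
def step(z):
    """Threshold unit: outputs 1 if its net input is positive, else 0."""
    return 1 if z > 0 else 0


def recurrent_parity(bits):
    """Parity of a bit sequence of any length, using one bit of recurrent state."""
    h = 0  # recurrent state, fed back into the net at every time step
    for x in bits:
        either = step(h + x - 0.5)     # OR(h, x)
        both = step(h + x - 1.5)       # AND(h, x)
        h = step(either - both - 0.5)  # XOR(h, x): OR but not AND
    return h


# The same three units handle short and long sequences alike.
short_result = recurrent_parity([1, 0, 1, 1])
long_result = recurrent_parity([1] * 1001)
```

An FNN has a fixed number of input units, so it would have to be rebuilt for every new sequence length, whereas the recurrent net simply runs for more time steps.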
2010: Deep FNNs Don’t Need Unsupervised Pre-Training! In 2009, many thought that deep FNNs cannot learn much without unsupervised pre-training [MIR] [UN0-UN5]. But in 2010, our team with my postdoc Dan Ciresan [MLP1] showed that deep FNNs can be trained by plain backpropagation [BP1] (compare [BPA] [BPB] [BP2] [R7]) and do not at all require unsupervised pre-training for important applications. Our system set a new performance record [MLP1] on the back then famous and widely used image recognition benchmark called MNIST. This was achieved by greatly accelerating traditional FNNs on highly parallel graphics processing units called GPUs. A reviewer called this a “wake-up call to the machine learning community.” Today, very few commercial NN applications are still based on unsupervised pre-training (used in my first deep learner of 1991). See [MIR], Sec. 19.
2011: CNN-Based Computer Vision Revolution. Our team in Switzerland (Dan Ciresan et al.) greatly sped up the convolutional NNs (CNNs) invented and developed by others since the 1970s [CNN1-4]. The first superior award-winning CNN, often called “DanNet,” was created in 2011 [GPUCNN1,3,5]. It was a practical breakthrough. It was much deeper and faster than earlier GPU-accelerated CNNs [GPUCNN]. Already in 2011, it showed that deep learning worked far better than the existing state-of-the-art for recognizing objects in images. In fact, it won 4 important computer vision competitions in a row between May 15, 2011, and September 10, 2012 [GPUCNN5] before a similar GPU-accelerated CNN of Univ. Toronto won the ImageNet 2012 contest [GPUCNN4-5] [R6].
At IJCNN 2011 in Silicon Valley, DanNet blew away the competition through the first superhuman visual pattern recognition in a contest. Even the New York Times mentioned this. It was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011), an image segmentation contest (ISBI, May 2012), a contest on object detection in large images (ICPR, 10 Sept 2012), and, at the same time, a medical imaging contest on cancer detection. (All before ImageNet 2012 [GPUCNN4-5] [R6].) Our CNN image scanners were 1000 times faster than previous methods [SCAN], with tremendous importance for health care etc. Today IBM, Siemens, Google and many startups are pursuing this approach. Much of modern computer vision is extending the work of 2011, e.g., [MIR], Sec. 19.
Already in 2010, we introduced our deep and fast GPU-based NNs to Arcelor Mittal, the world’s largest steel maker, and were able to greatly improve steel defect detection through CNNs [ST] (before ImageNet 2012). This may have been the first Deep Learning breakthrough in heavy industry, and helped to jump-start our company NNAISENSE. The early 2010s saw several other applications of our Deep Learning methods.
Through my students Rupesh Kumar Srivastava and Klaus Greff, the LSTM principle also led to our Highway Networks [HW1] of May 2015, the first working very deep FNNs with hundreds of layers. Microsoft’s popular ResNets [HW2] (which won the ImageNet 2015 contest) are a special case thereof. The earlier Highway Nets perform roughly as well as ResNets on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3].
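The "special case" relation can be made concrete in a few lines. The following is a minimal sketch, assuming the common formulation of a highway layer as y = H(x) * T(x) + x * C(x) with a transform gate T and a carry gate C. The tanh transformation, the weight matrices W_h and W_t, the gate bias and all function names are illustrative assumptions, not code from [HW1] or [HW2].

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def highway_layer(x, transform, transform_gate, carry_gate):
    """General highway layer: y = H(x) * T(x) + x * C(x), all element-wise."""
    return transform(x) * transform_gate(x) + x * carry_gate(x)


rng = np.random.default_rng(0)
dim = 4
W_h = rng.normal(size=(dim, dim))  # hypothetical weights of H
W_t = rng.normal(size=(dim, dim))  # hypothetical weights of the gate T
x = rng.normal(size=dim)


def H(v):
    return np.tanh(W_h @ v)  # nonlinear transformation H(x)


def T(v):
    return sigmoid(W_t @ v - 1.0)  # learned gate with values in (0, 1)


def coupled_carry(v):
    return 1.0 - T(v)  # common coupling C(x) = 1 - T(x)


def always_open(v):
    return np.ones_like(v)  # gate fixed at 1


def always_closed(v):
    return np.zeros_like(v)  # gate fixed at 0


# Gated highway layer: each unit mixes the transformed and the carried input.
y_highway = highway_layer(x, H, T, coupled_carry)

# Both gates always open: the layer reduces to the residual form H(x) + x.
y_residual = highway_layer(x, H, always_open, always_open)

# Transform gate closed, carry gate open: the input passes through unchanged.
y_identity = highway_layer(x, H, always_closed, always_open)
```

With learned gates, each unit can decide per input how much to transform and how much to carry. Fixing both gates at 1 removes that choice and leaves the residual layer.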
3. LSTMs & FNNs, especially CNNs. LSTMs v FNNs
In the recent Decade of Deep Learning, the recognition of static patterns (such as images) was mostly driven by CNNs (which are FNNs; see Sec. 2), while sequence processing (such as speech, text, etc.) was mostly driven by LSTMs (which are RNNs [MC43] [K56]; see Sec. 1). Often CNNs and LSTMs were combined, e.g., for video recognition. FNNs and LSTMs also invaded each other’s territories on occasion. Two examples:
1. Multi-dimensional LSTM [LSTM15] does not suffer from the limited fixed patch size of CNNs and excels at certain computer vision problems [LSTM16]. Nevertheless, most of computer vision is still based on CNNs.
2. Towards the end of the decade, despite their limited time windows, FNN-based Transformers [TR1] [TR2] started to excel at Natural Language Processing, a traditional LSTM domain (see Sec. 1). Nevertheless, there are still many language tasks that LSTM can rapidly learn to solve quickly [LSTM13] (in time proportional to sentence length) while plain Transformers can’t.
Business Week called LSTM “arguably the most commercial AI achievement” [AV1]. As mentioned above, by 2019, [LSTM1] got more citations per year than any other computer science paper of the past millennium [R5]. The record holder of the new millennium [HW2] is an FNN related to LSTM: ResNet [HW2] (Dec 2015) is a special case of our Highway Net (May 2015) [HW1], the FNN version of vanilla LSTM [LSTM2].
4. GANs: the Decade’s Most Famous Application of our Curiosity Principle (1990)
Another concept that became very popular in the 2010s is that of Generative Adversarial Networks (GANs), e.g., [GAN0] (2010) [GAN1] (2014). GANs are an instance of my popular adversarial curiosity principle from 1990 [AC90, AC90b] (see also survey [AC09]). This principle works as follows. One NN probabilistically generates outputs, another NN sees those outputs and predicts environmental reactions to them. Using gradient descent, the predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error. One net’s loss is the other net’s gain. GANs are a special case of this where the environment simply returns 1 or 0 depending on whether the generator’s output is in a given set [AC19]. (Other early adversarial machine learning settings [S59] [H90] neither involved unsupervised NNs nor were about modeling data nor used gradient descent [AC19].) Compare [SLG] [R2] [AC18] and [MIR], Sec. 5.
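The zero-sum nature of this setup can be sketched in a few lines. This is a toy illustration under strong simplifying assumptions, not the method of any cited paper: the "generator" is a single scalar mean mu plus Gaussian noise, the "predictor" is a single logistic unit with weight w and bias b, the given set is a cloud of one-dimensional samples, and the error is a cross-entropy chosen for convenience. All names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predictor_error(mu, w, b, real, noise):
    """Cross-entropy error of the predictor.

    The environment's reaction is 1 for samples from the given set (`real`)
    and 0 for the generator's outputs (`mu + noise`).
    """
    p_real = sigmoid(w * real + b)
    p_fake = sigmoid(w * (mu + noise) + b)
    return -np.mean(np.log(p_real)) - np.mean(np.log(1.0 - p_fake))


def error_gradients(mu, w, b, real, noise):
    """Analytic gradients of predictor_error with respect to mu, w and b."""
    fake = mu + noise
    p_real = sigmoid(w * real + b)
    p_fake = sigmoid(w * fake + b)
    grad_mu = np.mean(p_fake * w)
    grad_w = np.mean((p_real - 1.0) * real) + np.mean(p_fake * fake)
    grad_b = np.mean(p_real - 1.0) + np.mean(p_fake)
    return grad_mu, grad_w, grad_b


rng = np.random.default_rng(0)
real = rng.normal(3.0, 1.0, size=256)   # samples from the "given set"
noise = rng.normal(0.0, 1.0, size=256)  # makes the generator probabilistic
mu, w, b, lr = 0.0, 0.5, 0.0, 0.1

e0 = predictor_error(mu, w, b, real, noise)

# Predictor: one step of gradient DESCENT on the error.
_, grad_w, grad_b = error_gradients(mu, w, b, real, noise)
w, b = w - lr * grad_w, b - lr * grad_b
e1 = predictor_error(mu, w, b, real, noise)

# Generator: one step of gradient ASCENT on the very same error.
grad_mu, _, _ = error_gradients(mu, w, b, real, noise)
mu = mu + lr * grad_mu
e2 = predictor_error(mu, w, b, real, noise)
```

The predictor's step lowers the error (e1 is below e0), and the generator's step raises it again (e2 is above e1). Alternating the two steps yields the minimax game described above. Real systems use deep NNs for both players.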
5. Other Hot Topics of the 2010s: Deep Reinforcement Learning, Meta-Learning, World Models, Distilling NNs, Neural Architecture Search, Attention Learning, Fast Weights, Self-Invented Problems …
In July 2013, our Compressed Network Search [CO2] was the first deep learning model to successfully learn control policies directly from high-dimensional sensory input (video) using deep reinforcement learning (RL) (see survey in Sec. 6 of [DL1]), without any unsupervised pre-training (extending earlier work on large NNs with compact codes, e.g., [KO0] [KO2]; compare more recent work [WAV1] [OAI3]). This also helped to jump-start our company NNAISENSE.
A few months later, neuroevolution-based RL (see survey [K96]) also successfully learned to play Atari games [H13]. Soon afterwards, the company DeepMind also had a Deep RL system for high-dimensional sensory input [DM1] [DM2]. See [MIR], Sec. 8.
By 2016, DeepMind had a famous superhuman Go player [DM4]. The company was founded in 2010, by some counts the decade’s first year. The first DeepMinders with AI publications and PhDs in computer science came from my lab: a co-founder and employee nr. 1.
Our work since 1990 on RL and planning based on a combination of two RNNs called the controller and the world model [PLAN2-6] also has become popular in the 2010s. See [MIR], Sec. 11. (The decade’s end also saw a very simple yet novel approach to the old problem of RL [UDRL].)
For decades, few have cared for our work on meta-learning or learning to learn since 1987, e.g., [META1] [FASTMETA1-3] [R3]. In the 2010s, meta-learning has finally become a hot topic [META10] [META17]. Similar for our work since 1990 on Artificial Curiosity & Creativity [MIR] (Sec. 5, Sec. 6) [AC90-AC10] and Self-Invented Problems [MIR] (Sec. 12) in POWERPLAY style (2011) [PP] [PP1] [PP2]. See, e.g., [AC18].
Similar for our work since 1990 on Hierarchical RL, e.g. [HRL0] [HRL1] [HRL2] [HRL4] (see [MIR], Sec. 10), Deterministic Policy Gradients [AC90], e.g., [DPG] [DDPG] (see [MIR], Sec. 14), and Synthetic Gradients [NAN1-NAN4], e.g., [NAN5] (see [MIR], Sec. 15).
Similar for our work since 1991 on encoding data by factorial disentangled representations through adversarial NNs [PM2] [PM1] and other methods [LOC] (compare [IG] and [MIR], Sec. 7), and on end-to-end-differentiable systems that learn by gradient descent to quickly manipulate NNs with Fast Weights [FAST0-FAST3a] [R4], separating storage and control like in traditional computers, but in a fully neural way (rather than in a hybrid fashion [PDA1] [PDA2] [DNC] [DNC2]). See [MIR], Sec. 8.
Similar for our work since 2009 on Neural Architecture Search for LSTM-like architectures that outperform vanilla LSTM in certain applications [LSTM7], e.g., [NAS], and our work since 1991 on compressing or distilling NNs into other NNs [UN0] [UN1], e.g., [DIST2] [R4]. See [MIR], Sec. 2.
Already in the early 1990s, we had both of the now common types of neural sequential attention: end-to-end-differentiable “soft” attention (in latent space) [FAST2] through multiplicative units within networks [DEEP1-2] (1965), and “hard” attention (in observation space) in the context of RL [ATT0] [ATT1]. This led to lots of follow-up work. In the 2010s, many have used sequential attention-learning NNs. See [MIR], Sec. 9.
As mentioned in Sec. 21 of ref [MIR], surveys from the Anglosphere do not always make clear [DLC] that Deep Learning was invented where English is not an official language. It started in 1965 in Ukraine (back then the USSR) with the first nets of arbitrary depth that really learned [DEEP1-2] [R8]. Five years later, modern backpropagation was published “next door” in Finland (1970) [BP1]. The basic deep convolutional NN architecture (now widely used) was invented in the 1970s in Japan [CNN1], where NNs with convolutions were later (1987) also combined with “weight sharing” and backpropagation [CNN1a]. We are standing on the shoulders of these authors and many others – see 888 references in ref [DL1].
Of course, Deep Learning is just a small part of AI, in most applications limited to passive pattern recognition. We view it as a by-product of our research on more general artificial intelligence, which includes optimal universal learning machines such as the Gödel machine (2003-), asymptotically optimal search for programs running on general purpose computers such as RNNs, etc.
6. The Future of Data Markets and Privacy
AIs are trained by data. If it is true that data is the new oil, then it should have a price, just like oil. In the 2010s, the major surveillance platforms (e.g., Sec. 1) [SV1] did not offer you any money for your data and the resulting loss of privacy. The 2020s, however, will see attempts at creating efficient data markets to figure out your data’s true financial value through the interplay between supply and demand. Even some of the sensitive medical data will not be priced by governmental regulators but by patients (and healthy persons) who own it and who may sell parts thereof as micro-entrepreneurs in a healthcare data market [SR18] [CNNTV2].
Are surveillance and loss of privacy inevitable consequences of increasingly complex societies? Super-organisms such as cities and states and companies consist of numerous people, just like people consist of numerous cells. These cells enjoy little privacy. They are constantly monitored by specialized “police cells” and “border guard cells”: Are you a cancer cell? Are you an external intruder, a pathogen? Individual cells sacrifice their freedom for the benefits of being part of a multicellular organism.
Similar for super-organisms such as nations [FATV]. Over 5000 years ago, writing enabled recorded history and thus became its inaugural and most important invention. Its initial purpose, however, was to facilitate surveillance, to track citizens and their tax payments. The more complex a super-organism, the more comprehensive its collection of information about its components.
200 years ago, at least the parish priest in each village knew everything about all the village people, even about those who did not confess, because they appeared in the confessions of others. Also, everyone soon knew about the stranger who had entered the village, because some occasionally peered out of the window, and what they saw got around. Such control mechanisms were temporarily lost through anonymization in rapidly growing cities, but are now returning with the help of new surveillance devices such as smartphones as part of digital nervous systems that tell companies and governments a lot about billions of users [SV1] [SV2]. Cameras and drones [DR16] etc. are becoming tinier all the time and ubiquitous; excellent recognition of faces and gaits etc. is becoming cheaper and cheaper, and soon many will use it to identify others anywhere on earth – the big wide world will not offer any more privacy than the local village. Is this good or bad? Anyway, some nations may find it easier than others to become more complex kinds of super-organisms at the expense of the privacy rights of their constituents [FATV].
7. Outlook: 2010s v 2020s – Virtual AI v Real AI?
In the 2010s, AI excelled in virtual worlds, e.g., in video games, board games, and especially on the major WWW platforms (Sec. 1). Most AI profits were in marketing. Passive pattern recognition through NNs helped some of the most valuable companies such as Amazon & Alibaba & Google & Facebook & Tencent to keep you longer on their platforms, to predict which items you might be interested in, to make you click on tailored ads etc. However, marketing is just a tiny part of the world economy. What will the next decade bring?
In the 2020s, Active AI will more and more invade the real world, driving industrial processes and machines and robots, a bit like in the movies. (Better self-driving cars [CAR1] will be part of this, especially fleets of simple electric cars with small & cheap batteries [CAR2].) Although the real world is much more complex than virtual worlds, and less forgiving, the coming wave of “Real World AI” or simply “Real AI” will be much bigger than the previous AI wave, because it will affect all of production, and thus a much bigger part of the economy. That’s why NNAISENSE is all about Real AI.
Some claim that big platform companies with lots of data from many users will dominate AI. That’s absurd. How does a baby learn to become intelligent? Not “by downloading lots of data from Facebook” [NAT2]. No, it learns by actively creating its own data through its own self-invented experiments with toys etc., learning to predict the consequences of its actions, and using this predictive model of physics and the world to become a better and better planner and problem solver [AC90] [PLAN2-6].
We already know how to build AIs that also learn a bit like babies, using what I have called artificial curiosity since 1990 [AC90-AC10] [PP-PP2], and incorporating mechanisms that aid in reasoning [FAST3a] [DNC] [DNC2] and in the extraction of abstract objects from raw data [UN1] [OBJ1-3]. In the not too distant future, this will help to create what I have called in interviews see-and-do robotics: quickly teach an NN to control a complex robot with many degrees of freedom to execute complex tasks, such as assembling a smartphone, solely by visual demonstration, and by talking to it, without touching or otherwise directly guiding the robot – a bit like we’d teach a kid [FA18]. This will revolutionize many aspects of our civilization.
Sure, such AIs have military applications, too. But although an AI arms race seems inevitable [SPE17], almost all of AI research in the 2020s will be about making human lives longer & healthier & easier & happier [SR18]. Our motto is: AI For All! AI won’t be controlled by a few big companies or governments. Since 1941, every 5 years, compute has been getting 10 times cheaper [ACM16]. This trend won’t break anytime soon. Everybody will own cheap but powerful AIs improving her/his life in many ways.
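The compute trend quoted above amounts to a simple exponential. A minimal sketch, assuming the trend holds exactly and smoothly over time; the function name cheaper_factor and its parameters are illustrative.

```python
def cheaper_factor(years, factor=10.0, period=5.0):
    """Factor by which compute gets cheaper over `years`,
    assuming it gets `factor` times cheaper every `period` years."""
    return factor ** (years / period)


# Three 5-year periods, e.g. from 2001 to 2016.
fifteen_years = cheaper_factor(15)

# From 1941 to the end of the 2010s.
since_1941 = cheaper_factor(2019 - 1941)
```

Under this assumption, 15 years give a factor of 10 * 10 * 10 = 1000, which matches the "1000 times cheaper" mentioned in Sec. 1 for the period between the LSTM language learning results of 2001 and the LSTM-based translation systems of 2016.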
So much for now on the 2020s. In the more distant future, most self-driven & self-replicating & curious & creative & conscious AIs [INV16] will go where most of the physical resources are, eventually colonizing and transforming the entire visible universe [ACM16] [SA17] [FA15] [SP16], which may be just one of countably many computable universes [ALL1-3].