Within the field of AI research, machine learning has enjoyed remarkable success in recent years -- allowing computers to surpass or come close to matching human performance in areas ranging from facial recognition to speech and language recognition.
Machine learning is the process of teaching a computer to carry out a task, rather than programming it how to carry that task out step by step.
At the end of training, a machine-learning system will be able to make accurate predictions when given data.
That may sound dry, but those predictions could be answering whether a piece of fruit in a photo is a banana or an apple, if a person is crossing in front of a self-driving car, whether the use of the word book in a sentence relates to a paperback or a hotel reservation, whether an email is spam, or recognizing speech accurately enough to generate captions for a YouTube video.
Machine learning is typically split into supervised learning, where the computer learns by example from labeled data, and unsupervised learning, where the computer groups similar data and pinpoints anomalies.
Deep learning is a subset of machine learning, whose capabilities differ in several key respects from traditional shallow machine learning, allowing computers to solve a host of complex problems that couldn't otherwise be tackled.
An example of a simple, shallow machine-learning task might be predicting how ice-cream sales will vary based on outdoor temperature. Making predictions using only a couple of data features in this way is relatively straightforward, and can be carried out using a shallow machine-learning technique called linear regression with gradient descent.
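To make the ice-cream example concrete, here is a minimal sketch of linear regression trained with gradient descent. The temperature and sales figures are invented for illustration, and the function name, learning rate and step count are assumptions rather than anything prescribed by a particular library.

```python
# Made-up data for illustration: outdoor temperature (Celsius) and ice creams sold.
temperatures = [14.0, 16.0, 20.0, 23.0, 26.0, 30.0]
sales = [215.0, 245.0, 305.0, 350.0, 395.0, 455.0]


def fit_line(xs, ys, learning_rate=0.01, steps=2000):
    """Fit ys ~ slope * xs + intercept by gradient descent on mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    centered = [x - mean_x for x in xs]  # centering the feature keeps the descent stable
    slope, offset = 0.0, 0.0
    for _ in range(steps):
        # How far off is each current prediction?
        errors = [slope * x + offset - y for x, y in zip(centered, ys)]
        # Gradients of the mean squared error with respect to each parameter.
        grad_slope = (2.0 / n) * sum(e * x for e, x in zip(errors, centered))
        grad_offset = (2.0 / n) * sum(errors)
        # Step both parameters a little way downhill.
        slope -= learning_rate * grad_slope
        offset -= learning_rate * grad_offset
    # Convert the offset back to an intercept for the uncentered temperatures.
    return slope, offset - slope * mean_x


slope, intercept = fit_line(temperatures, sales)
predicted_sales_at_25 = slope * 25.0 + intercept
```

With only one input feature and two values to learn, a model like this trains in a fraction of a second, which is why it makes sense to try it before reaching for anything deeper.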
The issue is that swathes of problems in the real world aren't a good fit for such simple models. An example of one of these complex real-world problems is recognizing handwritten numbers.
To solve this problem, the computer needs to be able to cope with huge variety in how the data can be presented. Every digit between 0 and 9 can be written in myriad ways: the size and exact shape of each handwritten digit can be very different depending on who's writing and in what circumstance.
Coping with the variability of these features, and the even bigger mess of interactions between them, is where deep learning and deep neural networks become useful.
Neural networks are mathematical models whose structure is loosely inspired by that of the brain.
Each neuron within a neural network is a mathematical function that takes in data via an input, transforms that data into a more amenable form, and then spits it out via an output. You can think of neurons in a neural network as being arranged in layers, as shown below.
All neural networks have an input layer, where the initial data is fed in, and an output layer, which generates the final prediction. But in a deep neural network, there will be multiple "hidden layers" of neurons between these input and output layers, each feeding data into the next. Hence the term "deep" in "deep learning" and "deep neural networks": it is a reference to the large number of hidden layers -- typically greater than three -- at the heart of these neural networks.
This simplified diagram above hopefully helps to provide an idea of how a simple neural network is structured. In this example, the network has been trained to recognize handwritten figures, such as the number 2 shown here, with the input layer being fed values representing the pixels that make up an image of a handwritten digit, and the output layer predicting which handwritten number was shown in the image.
In the diagram above, each circle represents a neuron in the network, with the neurons organized into vertical layers.
As you can see, each neuron is linked to every neuron in the following layer, representing the fact that each neuron outputs a value into every neuron in the subsequent layer. The colors of the links in the diagram also vary. The different colors, black and red, represent the significance of the links between neurons. The red links are those of greater significance, meaning they will amplify the value as it passes between the layers. In turn, this amplification of the value can help activate the neuron that the value is being fed into.
A neuron can be said to have been activated when the sum of the values being input into this neuron passes a set threshold. In the diagram, the activated neurons are shaded red. What this activation means differs according to the layer. In "Hidden layer 1" shown in the diagram, an activated neuron might mean the image of the handwritten figure contains a certain combination of pixels that resemble the horizontal line at the top of a handwritten number 7. In this way, "Hidden layer 1" could detect many of the tell-tale lines and curves that will eventually combine together into the full handwritten figure.
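The idea of a neuron activating once its inputs pass a threshold can be sketched in a few lines. The function name, the pixel values, the link strengths and the threshold below are all hypothetical, chosen only to show the mechanism.

```python
def neuron_is_activated(inputs, link_strengths, threshold):
    """Return True when the inputs, scaled by their link strengths, sum past the threshold."""
    total = sum(value * strength for value, strength in zip(inputs, link_strengths))
    return total > threshold


# Hypothetical link strengths: the first two links are the more significant ones.
strengths = [1.5, 1.5, 0.2]

# Bright pixels on the significant links push the sum past the threshold.
bright_on_strong_links = neuron_is_activated([0.9, 0.8, 0.1], strengths, threshold=2.0)

# The same neuron stays quiet when the bright pixel sits on the weak link.
bright_on_weak_link = neuron_is_activated([0.1, 0.2, 0.9], strengths, threshold=2.0)
```

Networks used in practice usually swap the hard threshold for a smooth function, but the principle of a weighted sum deciding how strongly a neuron fires is the same.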
An actual neural network would likely have both more hidden layers and more neurons in each layer. For instance, a "Hidden layer 2" could be fed the small lines and curves identified by "Hidden layer 1", and detect how these combine to form recognizable shapes that make up digits, such as the entire bottom loop of a six. By feeding data forward between layers in this way, each subsequent hidden layer handles increasingly higher-level features.
As mentioned, the activated neuron in the diagram's output layer has a different meaning. In this instance, the activated neuron corresponds to the number the neural network estimates was shown in the image of a handwritten digit it was fed as an input.
As you can see, the output of one layer is the input of the next layer in the network, with data flowing through the network from the input to the output.
But how do these multiple hidden layers allow a computer to determine the nature of a handwritten digit? These multiple layers of neurons basically provide a way for the neural network to build a rough hierarchy of different features that make up the handwritten digit in question. For instance, if the input is an array of values representing the individual pixels in the image of the handwritten figure, the next layer might combine these pixels into lines and shapes, the layer after that might combine those shapes into distinct features like the loops in an 8 or upper triangle in a 4, and so on. By building a picture of which of these features are present in the image, modern neural networks can determine -- with a very high level of accuracy -- the number that corresponds to a handwritten digit. Similarly, different types of deep neural networks can be trained to recognize faces in an image or to transcribe speech from audio.
The process of building this increasingly complex hierarchy of features of the handwritten number out of nothing but pixels is learned by the network. The learning process is made possible by how the network is able to alter the importance of the links between the neurons in each layer. Each link has an attached value called a weight, which will modify the value spat out by a neuron as it passes from one layer to the next. By altering the value of these weights, and an associated value called bias, it is possible to emphasize or diminish the importance of links between neurons in the network.
For instance, in the case of the model for recognizing handwritten digits, these weights could be modified to stress the importance of a particular group of pixels that form a line, or a pair of intersecting lines that form a 7.
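Here is a rough sketch of how weights and biases shape the values passing between layers, with the output of each layer becoming the input of the next. The tiny network and every number in it are hand-picked assumptions for illustration, not values from a trained model, and the sigmoid function is just one common choice for squashing each neuron's output.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Each neuron takes a weighted sum of every input, adds its bias, then squashes it."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed data forward: the output of one layer is the input of the next."""
    values = inputs
    for weights, biases in layers:
        values = layer_forward(values, weights, biases)
    return values


# Hypothetical network: 3 inputs, 2 hidden neurons, 2 output neurons.
layers = [
    ([[2.0, -1.0, 0.5], [-1.5, 1.0, 2.0]], [0.1, -0.2]),  # hidden layer
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),             # output layer
]
outputs = network_forward([1.0, 0.0, 0.0], layers)
predicted_class = outputs.index(max(outputs))  # the most strongly activated output neuron
```

Raising a weight makes the neuron on the receiving end respond more strongly to that link, while lowering it mutes the link, which is exactly the lever training pulls.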
The model learns which links between neurons are important in making successful predictions during training. At each step during training, the network will use a mathematical function to determine how accurate its latest prediction was compared to what was expected. This function generates a series of error values, which in turn can be used by the system to calculate how the model should update the value of the weights attached to each link, with the ultimate aim of improving the accuracy of the network's predictions. The extent to which these values should be changed is calculated by an optimization algorithm, such as gradient descent, and those changes are pushed back throughout the network at the end of each training cycle in a step called backpropagation.
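The training cycle described above can be sketched on a toy problem. The example below is an assumption-laden stand-in for the handwritten-digit task: it trains a network with one small hidden layer to reproduce the XOR function, using a squared-error measure, gradient descent and backpropagation. The function name, layer size, learning rate and number of cycles are illustrative choices only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_xor(epochs=10000, learning_rate=0.5, hidden_size=3, seed=0):
    """Train a tiny 2-input network on XOR and return the mean squared error per cycle."""
    rng = random.Random(seed)
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
    w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden_size)]
    b_hidden = [0.0] * hidden_size
    w_out = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    b_out = 0.0
    history = []
    for _ in range(epochs):
        total_squared_error = 0.0
        for inputs, target in data:
            # Forward pass: input -> hidden layer -> output.
            hidden = [
                sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
                for ws, b in zip(w_hidden, b_hidden)
            ]
            output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
            # Error value: how far the prediction is from what was expected.
            error = output - target
            total_squared_error += error ** 2
            # Backpropagation: push the error back through the output, then the hidden layer.
            delta_out = error * output * (1.0 - output)
            delta_hidden = [
                delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(hidden_size)
            ]
            # Gradient descent: nudge every weight and bias against its gradient.
            for j in range(hidden_size):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                b_hidden[j] -= learning_rate * delta_hidden[j]
                for i in range(2):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * inputs[i]
            b_out -= learning_rate * delta_out
        history.append(total_squared_error / len(data))
    return history


history = train_xor()
```

Comparing history[0] with history[-1] shows how the error changes between the first and last training cycle. How low it ends up depends on the random starting weights and the settings chosen, which is one reason manual tuning is often part of the process.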
Over the course of many, many training cycles, and with the help of occasional manual parameter tuning, the network will continue to generate better and better predictions until it hits close to peak accuracy. At this point, for example, when handwritten digits could be recognized with more than 95 percent accuracy, the deep-learning model can be said to have been trained.
Essentially, deep learning allows machine learning to tackle a whole host of new complex problems -- such as image, language and speech recognition -- by allowing machines to learn how features in the data combine into increasingly higher-level, abstract forms. In facial recognition, for example, a network learns how pixels in an image create lines and shapes, how those lines and shapes create facial features, and how these facial features are arranged into a face.
If you are interested in learning more about neural networks, the video series below provides an excellent explanation.
Why is it called deep learning?
As mentioned, the depth refers to the number of hidden layers, typically more than three, used within deep neural networks.
How is deep learning being used?
It is used for many tasks: recognizing and generating images, speech and language, and, in combination with reinforcement learning, matching human-level performance in games ranging from the ancient, such as Go, to the modern, such as Dota 2 and Quake III.
Deep-learning systems are a foundation of modern online services. Such systems are used by Amazon to understand what you say -- both your speech and the language you use -- to the Alexa virtual assistant or by Google to translate text when you visit a foreign-language website.
Every Google search uses multiple machine-learning systems, from understanding the language in your query through to personalizing your results, so fishing enthusiasts searching for "bass" aren't inundated with results about guitars.
But beyond these very visible manifestations of machine and deep learning, such systems are starting to find a use in just about every industry. These uses include: computer vision for driverless cars, drones and delivery robots; speech and language recognition and synthesis for chatbots and service robots; facial recognition for surveillance in countries like China; in healthcare, helping radiologists to pick out tumors in x-rays, aiding researchers in spotting genetic sequences related to diseases, and identifying molecules that could lead to more effective drugs; allowing for predictive maintenance on infrastructure by analyzing IoT sensor data; underpinning the computer vision that makes the cashierless Amazon Go supermarket possible; and offering reasonably accurate transcription and translation of speech for business meetings -- the list goes on and on.
When should you use deep learning?
When your data is largely unstructured and you have a lot of it.
Deep learning algorithms can take messy and broadly unlabeled data -- such as video, images, audio recordings, and text -- and impose enough order upon that data to make useful predictions, building a hierarchy of features that make up a dog or cat in an image or of sounds that form a word in speech.
As mentioned, deep neural networks excel at making predictions based on largely unstructured data. That means they deliver best-in-class performance in areas such as speech and image recognition, where they work with messy data such as recorded speech and photographs.
Should you always use deep learning instead of shallow machine learning?
No, because deep learning can be very expensive from a computational point of view.
For non-trivial tasks, training a deep neural network will often require processing large amounts of data using clusters of high-end GPUs for many, many hours.
Given top-of-the-range GPUs can cost thousands of dollars to buy, or up to $5 per hour to rent in the cloud, it's unwise to jump straight to deep learning.
If the problem can be solved using a simpler machine-learning algorithm such as Bayesian inference or linear regression, one that doesn't require the system to grapple with a complex combination of hierarchical features in the data, then these far less computationally demanding options will be the better choice.
Deep learning may also not be the best choice for every prediction task. For example, if the dataset is small, then simple linear machine-learning models may sometimes yield more accurate results -- although some machine-learning specialists argue a properly trained deep-learning neural network can still perform well with small amounts of data.
One of the big drawbacks of deep neural networks is the amount of data they require to train, with Facebook recently announcing it had used one billion images to achieve record-breaking performance by an image-recognition system. When the datasets are this large, training systems also require access to vast amounts of distributed computing power. This points to another issue with deep learning: the cost of training. Due to the size of datasets and number of training cycles that have to be run, training often requires access to high-powered and expensive computer hardware, typically high-end GPUs or GPU arrays. Whether you're building your own system or renting hardware from a cloud platform, neither option is likely to be cheap.
Deep neural networks are also difficult to train, due to what is called the vanishing gradient problem, which can worsen the more layers there are in a neural network. As more layers are added, the vanishing gradient problem can result in it taking an unfeasibly long time to train a neural network to a good level of accuracy, as the improvement between each training cycle is so minute. The problem doesn't afflict every multi-layer neural network, only those that use gradient-based learning methods. That said, this problem can be addressed in various ways, such as by choosing an appropriate activation function or by training a system using a heavy-duty GPU.
As mentioned, deep neural networks are hard to train because of the number of layers in the neural network. With so many layers and links between neurons, the adjustments calculated at each step in the training process can shrink until they are too small to be useful -- the essence of the vanishing gradient problem.
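A little arithmetic shows why the adjustments shrink. The sketch below assumes a simplified chain of layers with one neuron each and every weight set to 1, so the error signal reaching the first layer is just the activation function's slope multiplied once per layer. The names are illustrative; ReLU is included as one widely used example of an activation function that avoids the shrinkage.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_slope(x):
    """Derivative of the sigmoid, which never exceeds 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def gradient_scale(num_layers, slope):
    """How much of the output's error signal survives the trip back through num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= slope
    return scale


SIGMOID_BEST_SLOPE = sigmoid_slope(0.0)  # 0.25, the steepest the sigmoid ever gets
RELU_ACTIVE_SLOPE = 1.0                  # slope of ReLU for any positive input

after_10_sigmoid_layers = gradient_scale(10, SIGMOID_BEST_SLOPE)  # below one millionth
after_10_relu_layers = gradient_scale(10, RELU_ACTIVE_SLOPE)      # undiminished
```

Even in the sigmoid's best case, ten layers cut the signal to less than a millionth of its original size, so the earliest layers barely learn from each training cycle.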
Another big issue is the vast quantities of data that are necessary to train deep learning neural networks, with training corpuses often measuring petabytes in size.
What deep learning techniques exist?
There are various types of deep neural network, with structures suited to different types of tasks. For example, Convolutional Neural Networks (CNNs) are typically used for computer vision tasks, while Recurrent Neural Networks (RNNs) are commonly used for processing language. Each has its own specializations. In CNNs, the initial layers are specialized for extracting distinct features from the image, which are then fed into a more conventional neural network to allow the image to be classified. Meanwhile, RNNs differ from a traditional feed-forward neural network in that they don't just feed data from one neural layer to the next but also have built-in feedback loops, where data output from one layer is passed back to the layer preceding it -- lending the network a form of memory. There is a more specialized form of RNN, the long short-term memory (LSTM) network, that includes what is called a memory cell and that is tailored to processing data with lags between inputs.
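Both ideas can be sketched in miniature. The first function below slides a small filter over an image to pick out a feature, which is roughly what the early layers of a CNN do (as in most deep-learning libraries, the filter is applied without flipping it). The second feeds a neuron's previous output back in alongside each new input, one common way of expressing the feedback loop in an RNN. The image, the filter and the weights are hypothetical values chosen for illustration.

```python
import math


def convolve2d(image, kernel):
    """Slide a small kernel over the image (no padding, one pixel at a time)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def rnn_run(sequence, w_input=0.5, w_recurrent=0.9):
    """Process a sequence step by step, feeding the previous state back in each time."""
    hidden = 0.0
    states = []
    for value in sequence:
        hidden = math.tanh(w_input * value + w_recurrent * hidden)
        states.append(hidden)
    return states


# A tiny image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge_kernel)  # responds only where dark meets bright

# A single input followed by silence: the state fades but does not vanish at once.
states = rnn_run([1.0, 0.0, 0.0])
```

The feature map lights up only at the boundary between dark and bright pixels, while the recurrent states stay above zero after the input has gone quiet, which is the "form of memory" in action.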
The most basic type of neural network is a multi-layer perceptron network, the type discussed above in the handwritten figures example, where data is fed forward between layers of neurons. Each neuron will typically transform the values it is fed using an activation function, which changes those values into a form that, at the end of the training cycle, will allow the network to calculate how far off it is from making an accurate prediction.
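For reference, here are three activation functions in common use, written out as a short sketch. Which one a given network uses is a design choice; nothing here is specific to the handwritten-digit example.

```python
import math


def relu(x):
    """Pass positive values through unchanged and clamp negative values to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any value into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash any value into the range -1 to 1."""
    return math.tanh(x)


# The same raw values transformed by each function.
raw_values = [-2.0, 0.0, 3.0]
transformed = {
    "relu": [relu(v) for v in raw_values],
    "sigmoid": [sigmoid(v) for v in raw_values],
    "tanh": [tanh(v) for v in raw_values],
}
```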
More recently, generative adversarial networks (GANs) are extending what is possible using neural networks. In this architecture, two neural networks do battle: the generator network tries to create convincing "fake" data and the discriminator attempts to tell the difference between fake and real data. With each training cycle, the generator gets better at producing fake data and the discriminator gains a sharper eye for spotting those fakes. By pitting the two networks against each other during training, both can achieve better performance. GANs have been used to carry out some remarkable tasks, such as turning dashcam videos from day to night or from winter to summer, as shown in the video below, and have applications ranging from turning low-resolution photos into high-resolution alternatives to generating images from written text. GANs have their own limitations, however, that can make them challenging to work with, although these are being tackled by developing more robust GAN variants.
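The back-and-forth between generator and discriminator can be sketched with a deliberately tiny example. Everything below is an assumption made for illustration: the "real" data is just numbers clustered around 4, the generator is a single scale-and-shift of random noise, and the discriminator is a single logistic neuron. It shows the alternating structure of the training loop and makes no claim to converge reliably, which echoes how awkward GANs can be to train.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to stay numerically stable for large inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, learning_rate=0.02, seed=0):
    rng = random.Random(seed)
    gen_scale, gen_shift = 1.0, 0.0    # generator: fake = gen_scale * noise + gen_shift
    disc_weight, disc_bias = 0.0, 0.0  # discriminator: P(real) = sigmoid(weight * x + bias)
    for _ in range(steps):
        real = rng.gauss(4.0, 0.5)     # hypothetical "real" data
        noise = rng.gauss(0.0, 1.0)
        fake = gen_scale * noise + gen_shift

        # Discriminator step: get better at scoring real high and fake low,
        # by raising log D(real) + log(1 - D(fake)).
        p_real = sigmoid(disc_weight * real + disc_bias)
        p_fake = sigmoid(disc_weight * fake + disc_bias)
        disc_weight += learning_rate * ((1.0 - p_real) * real - p_fake * fake)
        disc_bias += learning_rate * ((1.0 - p_real) - p_fake)

        # Generator step: adjust the fake data so the updated discriminator
        # rates it as more real, by raising log D(fake).
        p_fake = sigmoid(disc_weight * fake + disc_bias)
        grad_fake = (1.0 - p_fake) * disc_weight
        gen_scale += learning_rate * grad_fake * noise
        gen_shift += learning_rate * grad_fake
    return gen_scale, gen_shift, disc_weight, disc_bias


gen_scale, gen_shift, disc_weight, disc_bias = train_toy_gan()
```

Each pass through the loop gives the discriminator one chance to improve and then gives the generator one chance to respond, so neither network is ever trained in isolation.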
Where can you learn more about deep learning?
There's no shortage of courses out there that cover deep learning.
If you're just after a more detailed overview of deep learning, then Neural Networks and Deep Learning is an excellent free online book. If you are comfortable with high-school maths and the Python programming language, then Google's Colab project offers an interactive introduction to machine learning.
How much deep learning costs depends on your approach, but it will typically be hundreds of dollars upwards, depending on the complexity of the machine-learning task and your chosen method.
What hardware do you need for machine learning?
The first choice is whether you want to rent hardware in the cloud or build your own deep-learning rig. Answering this question comes down to how long you anticipate you will be training your deep-learning model. You will pay more over time if you stick with cloud services, so if you anticipate the training process will take more than a couple of months of intensive use then buying/building your own machine for training will likely be prudent.
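The rent-or-buy decision comes down to simple arithmetic, sketched below. The function name is made up for illustration, and the figures are assumptions: a hypothetical $7,500 outlay on hardware, the $5 per hour cloud rate quoted earlier, and no allowance for electricity or the rest of the machine unless you pass a running cost.

```python
def break_even_hours(purchase_price, cloud_rate_per_hour, running_cost_per_hour=0.0):
    """Hours of training after which buying hardware becomes cheaper than renting it."""
    saving_per_hour = cloud_rate_per_hour - running_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("renting is never more expensive per hour than running your own hardware")
    return purchase_price / saving_per_hour


# Hypothetical figures: $7,500 of hardware versus renting at $5 per hour.
hours = break_even_hours(purchase_price=7500.0, cloud_rate_per_hour=5.0)
days_of_continuous_training = hours / 24.0
```

On those assumed figures the break-even point is 1,500 hours, or about two months of round-the-clock training, which is in line with the rule of thumb above.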
If the cloud sounds suitable, then you can rent computing infrastructure tailored to deep learning from the major cloud providers, including AWS, Google Cloud, and Microsoft Azure. Each also offers automated systems that streamline the process of training a machine-learning model with offerings such as drag-and-drop tools, including Microsoft's Machine Learning Studio, Google's Cloud AutoML and AWS SageMaker.
That said, building your own machine won't be cheap. You'll need to invest in a decent GPU to train anything more than very simple neural networks, as GPUs can carry out a very large number of matrix multiplications in parallel, helping accelerate a crucial step during training.
If you're not planning on training a neural network with a large number of layers, you can opt for consumer-grade cards, such as the Nvidia GeForce GTX 1060, which typically sells for about £270, while still offering 1,280 CUDA cores.
More heavy-duty training, however, will require specialist equipment. One of the most powerful GPUs for machine learning is the Nvidia Tesla V100, which packs 640 AI-tailored Tensor cores and 5,120 general HPC CUDA cores. These cost considerably more than consumer cards, with prices for the PCI Express version starting at £7,500.
Building AI-specific workstations and servers can cost even more. For example, the deep-learning-focused DGX-1 sells for $149,000.
As well as a PCIe adapter, the Tesla V100 is available as an SXM module to plug into Nvidia's high-speed NVLink bus.
How long does it take to train a deep learning model?
The time taken to train a deep-learning model varies hugely, from hours to weeks or more, and is dependent on factors such as the available hardware, optimization, the number of layers in the neural network, the network architecture, the size of the dataset and more.
Which deep-learning software frameworks are available?
There is a wide range of deep-learning software frameworks, which allow users to design, train and validate deep neural networks, using a range of different programming languages.
A popular choice is Google's TensorFlow software library, which allows users to write in Python, Java, C++, and Swift. It can be used for a wide range of deep-learning tasks, such as image and speech recognition, and executes on a wide range of CPUs, GPUs, and other processors. It is well documented and has many tutorials and implemented models available.
Another popular choice, especially for beginners, is PyTorch, a framework that offers the imperative programming model familiar to developers and allows developers to use standard Python statements. It works with deep neural networks ranging from CNNs to RNNs and runs efficiently on GPUs.
Will neural networks and deep learning lead to general artificial intelligence?
At present, deep learning is used to build narrow AI, artificial intelligence that performs a particular task, be that captioning photos or transcribing speech.
There's no system so far that can be thought of as a general artificial intelligence, able to tackle the same breadth of tasks and with the same broad understanding as a human being. When such systems will be developed is unknown, with predictions ranging from decades upwards.