At CELUM we are always interested in modern technologies and useful applications thereof. Because of that, we have a dedicated team of researchers working on one of the most important topics nowadays: machine learning. In this blog series we want to give you an introduction to deep-learning with neural networks for computer vision tasks, and what we at CELUM have in store for you. In the first post of this series, we will walk through the basics of neural networks together. For that, we need some theoretical knowledge, so let’s get started!
Artificial neural networks are a machine learning paradigm trying to emulate the architecture and learning functionality of the human brain. They loosely follow their biological counterpart in that they are interconnected grids of small computational nodes, called neurons. The most important feature of neural networks is their ability to learn from input data, rather than being programmed explicitly. To understand how this is achieved, we will first have a look at the smallest but most important part of a neural network, the neuron.
Neurons are small nodes that compute an output based on incoming signals. The simplest kind of neuron is the perceptron, which we will use to illustrate the basic functionality. A perceptron takes weighted inputs and calculates an output based on the sum of the inputs x multiplied by their weights w. It also has a threshold: if the sum exceeds the threshold, the perceptron outputs a one, otherwise a zero. If we call the negative threshold the bias b, we can write a vectorized form of the neuron as: output = 1 if w · x + b > 0, otherwise output = 0.
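This perceptron rule fits in a few lines of Python (a sketch; the inputs, weights and bias here are made up for illustration):

```python
def perceptron(x, w, b):
    """Output 1 if the weighted input sum plus the bias is positive, else 0."""
    weighted_sum = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if weighted_sum > 0 else 0

# Two inputs with equal weights and a bias of -0.5 (i.e. a threshold of 0.5)
print(perceptron([1, 0], [0.4, 0.4], -0.5))  # 0.4 - 0.5 = -0.1 -> outputs 0
print(perceptron([1, 1], [0.4, 0.4], -0.5))  # 0.8 - 0.5 =  0.3 -> outputs 1
```

Note how the bias simply shifts the decision boundary: the same neuron fires or stays silent depending on where the threshold lies.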
To allow finer-grained outputs than just zero and one, different activation functions can be used to calculate the result. A general neuron can therefore be visualized as follows:
We can think of a single neuron as a decision maker. As an example, let’s make a perceptron decide if we want to go to a concert this Friday. We have some information for the neuron, which we will weight to let the perceptron know how important each criterion is to us. The following questions are asked and will be the inputs to the neuron. In the parentheses we list how important each criterion is to us, on a scale from zero to one. We set the threshold to 0.5 for this example. Possible answers are yes (1) and no (0).

- Is the weather going to be nice? (0.3)
- Is the concert affordable? (0.3)
- Do our friends want to go? (0.6)
Using the equation above, we can calculate whether we are going to the concert on Friday. Let’s say the weather on Friday will be nice (-> x1 = 1), but the concert is expensive (-> x2 = 0) and our friends don’t want to go (-> x3 = 0). If we weigh our inputs and sum them up, we receive x1(=1) * w1(=0.3) + x2(=0) * w2(=0.3) + x3(=0) * w3(=0.6) = 0.3. This result is smaller than our threshold, so we won’t go to the concert. The decision would change if our friends wanted to go, resulting in 0.3 + 0.6 = 0.9, which is greater than the threshold, meaning the neuron decided we should go to the concert.
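The concert decision can be reproduced in a few lines of Python (a sketch of the example above, with the same weights and threshold):

```python
def decide(x, w, threshold=0.5):
    """Perceptron decision: 1 (go) if the weighted input sum exceeds the threshold."""
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) > threshold else 0

weights = [0.3, 0.3, 0.6]          # weather, price, friends
print(decide([1, 0, 0], weights))  # sum = 0.3 -> 0, we stay home
print(decide([1, 0, 1], weights))  # sum = 0.9 -> 1, we go to the concert
```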
Neurons are connected to each other and arranged in layers. The inputs of the neurons in layer 2, for example, are the outputs of the neurons in layer 1. This means that if we change the weights of neurons in the first layer, all following layers are affected as well. There are different layer types for different applications; the simplest and most basic is the fully connected layer. Here, every neuron of a layer receives the outputs of all neurons of the previous layer as inputs. This results in a huge number of parameters, which is why networks built only from fully connected layers are rarely used for vision tasks nowadays. The following figure shows a simple network using fully connected layers for image classification.
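A fully connected layer can be sketched directly from the neuron above: every output neuron receives all inputs, each with its own weight. A minimal plain-Python sketch (the layer sizes, weights and biases are made up):

```python
def fully_connected(inputs, weights, biases, activation):
    """One layer: each output neuron weighs ALL inputs, adds its bias,
    and passes the sum through the activation function."""
    return [activation(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

step = lambda z: 1 if z > 0 else 0

# A layer with 2 neurons, each connected to 3 inputs -> 2*3 weights + 2 biases
weights = [[0.2, 0.8, -0.5],
           [1.0, -1.0, 0.3]]
biases = [0.0, -0.1]
print(fully_connected([1, 1, 0], weights, biases, step))  # [1, 0]
```

Even this tiny layer already needs 8 parameters; with image-sized inputs and many layers, the parameter count explodes, which is exactly the problem mentioned above.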
This is a very simple network with 4 layers that differentiates between 2 classes. Modern neural networks have millions of neurons in hundreds or even more than a thousand layers and can work with thousands of image classes. Now that we have a basic understanding of neurons and the weighted connections between them, I want to show the basic workflow when using neural networks and how to get them to learn from data. As an example, we will use the task of image classification.
Before we can talk about the workflow, we need to describe what image classification is. The goal of image classification is to automatically assign a query image to one of several predefined classes. If we build a network to tell kinds of animals apart, we first have to define the classes. Let’s say we want to differentiate between cats, dogs, horses and others. This means that we have 4 classes and 4 output neurons. The goal is to run an image through the neural network and get back a probability for each available class. The sum of the probabilities should be 100%. If we have an accurate network, a possible result for an image of a cat could be: cat 97%, dog 1.5%, horse 1% and others 0.5%.
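Probabilities that sum to 100% are commonly produced by applying a softmax function to the raw scores of the output neurons. A sketch (the raw scores here are made up):

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for the classes: cat, dog, horse, others
probs = softmax([4.0, 0.5, 0.1, -0.5])
print([round(p, 3) for p in probs])
print(sum(probs))  # 1.0, i.e. 100%
```

The exponential makes the largest score dominate, so a confident network produces one probability close to 100% and tiny values for the rest, just like the cat example above.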
Now that we have a basic understanding of the building blocks of neural networks and the task at hand, I will talk about the steps needed when using deep-learning.
The first step is to define a network architecture. To make this simpler, there are many deep-learning frameworks available (TensorFlow, Caffe…). Nowadays every major framework has higher-level APIs that make it possible to define networks layer by layer, by providing parameters like the number of neurons in a layer in function calls. When implementing neural networks from scratch, the dimensions of all weight, input and bias matrices have to be calculated manually, which is very time-consuming and error-prone. Modern frameworks calculate those dimensions in the background, thus reducing errors.
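To illustrate the bookkeeping such a framework takes off your hands, here is a plain-Python sketch (no real framework API; function and key names are my own) that infers all weight and bias shapes for fully connected layers when you only specify the number of neurons per layer:

```python
def infer_shapes(layer_sizes):
    """Given the neuron count of each layer, compute the weight-matrix and
    bias-vector shapes needed to connect consecutive fully connected layers."""
    shapes = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # Every one of the n_out neurons needs one weight per input neuron
        shapes.append({"weights": (n_out, n_in), "biases": (n_out,)})
    return shapes

# e.g. 784 input pixels, two hidden layers, 4 output classes
for layer in infer_shapes([784, 128, 64, 4]):
    print(layer)
```

This is exactly the kind of derivation you would otherwise do by hand for every layer; a framework does it automatically from the layer definitions.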
It is important to note that you don’t have to invent a new architecture for every task, as there are many papers describing highly accurate solutions for different problems. Once a suitable network architecture has been selected, it’s time for the most important step: the training. The following illustration shows where we are in the workflow.
Updating the weights of the neurons so that the network gives more accurate results is called training the network. A very important step here is to create a training dataset. Depending on the task, this set has to consist of hundreds to many thousands of images for every image class we want to predict, and every image of the training set has to have a label of the correct class associated with it. If we want our network to be able to recognize cats, for example, we have to provide training images with the correct label “cat”. The trick to training is that we can measure the current error of the network by defining a cost function and then try to minimize this error.
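A common choice of cost function for classification is cross-entropy: it is small when the network assigns high probability to the correct class and large otherwise. A sketch with made-up predictions:

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Cost is low when the probability assigned to the correct class is high."""
    return -math.log(predicted_probs[true_class])

# Predictions for (cat, dog, horse, others); the correct label is "cat" (index 0)
good = cross_entropy([0.97, 0.015, 0.01, 0.005], 0)
bad = cross_entropy([0.10, 0.60, 0.20, 0.10], 0)
print(round(good, 4), round(bad, 4))  # the wrong prediction costs far more
```

Minimizing this cost over all training images is what pushes the network toward the correct labels.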
The training itself is done by stochastic gradient descent. This means that we send a batch of images through the network and make predictions. Because we know what the network should output (thanks to the correct labels), we can calculate the error of the actual result via the cost function. We then compute the gradient of the cost with respect to the weights (using backpropagation) and update the weights a small step against the gradient, so that the following predictions are more accurate. This update operation is done for hundreds of iterations over many epochs, sometimes resulting in training times of more than a week per network. The final result of the training is a network model trained to recognize the image classes we want, with an accuracy depending on the quality of the training data and the chosen network architecture. Some of the most important steps are done; there is only one step remaining in the workflow:
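The training loop can be sketched in miniature. In this toy version a single sigmoid neuron stands in for the whole network and the data is made up; real networks backpropagate the gradient through many layers, but the update rule is the same idea:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Toy dataset: the label is 1 exactly when the second input is 1
data = [([1, 0], 0), ([0, 0], 0), ([1, 1], 1), ([0, 1], 1)]
w, b, lr = [0.0, 0.0], 0.0, 1.0   # weights, bias, learning rate

for epoch in range(200):          # many passes ("epochs") over the data
    random.shuffle(data)          # the "stochastic" part of SGD
    for x, label in data:
        pred = sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)
        error = pred - label      # gradient of the cross-entropy cost
        for i in range(len(w)):
            w[i] -= lr * error * x[i]   # step against the gradient
        b -= lr * error

predict = lambda x: round(sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b))
print(predict([1, 0]), predict([0, 1]))  # now matches the labels: 0 1
```

Each update nudges the weights in the direction that lowers the cost; repeated over many epochs, the predictions converge to the labels.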
Now that we have a trained model, we can make predictions for images which were not in the training dataset. This operation is called inference. The parameters of the network (weights, biases...) are frozen: the network only gives predictions and does not learn anymore. It is now possible to optimize the model structure to make predictions faster. A few libraries exist that optimize models by merging operations and layers and dropping unused weights, without lowering the accuracy of the network. The resulting models are much smaller (often only a few MB in size) and can therefore be deployed in apps or services. This concludes the deep-learning workflow!
In this post we talked about the basic functionality of neural networks and gave an overview of the workflow when using deep-learning in production. Of course, this post could only scratch the surface of deep-learning, but I hope you still learned something new! Now that the foundation is laid, look forward to the second part of this blog series, where we will show practical implementations of this technology here at CELUM. See you soon!
created by Fabian Eder, Student & AI Software Engineer
Deep-learning workflow images edited by Fabian Eder, © NVIDIA source: https://blogs.nvidia.com/wp-content/uploads/2016/08/ai_difference_between_deep_learning_training_inference.jpg