Decoding Deep Learning: Neural Networks for Regression Part II
In Part I, we discussed the architecture of a neural network in detail and then built three different models to test how the number of hidden layers affects the accuracy of the model. In general, increasing the number of hidden layers tends to improve accuracy. However, neural networks also have many other hyperparameters that need to be tuned, which makes configuring them a complex task.
Before we use the grid search method in the scikit-learn library to tune our neural network, we will first go over some concepts that will help you better understand the intricacies involved in neural networks.
Optimizers in Neural Networks
All machine learning algorithms have a corresponding loss function which they aim to minimize to improve the model accuracy. Neural networks are no exception. The process of minimizing a mathematical expression is called optimization.
How does a neural network minimize the loss?
By updating the weights. It should be noted that the neurons closer to the output layer have a much greater impact on the loss function as opposed to the neurons that are farther from the output layer (or at the beginning of the network).
Let’s try to understand, without going deep into mathematics, how optimizers work by illustrating one of the most common optimization algorithms used in data science.
Stochastic Gradient Descent (SGD)
SGD finds a local minimum of a differentiable function. It does this by finding the values of the function’s parameters (weights in our case) that minimize a cost function.
The weights are initialized at the beginning of the learning process. SGD then computes the partial derivative of the loss with respect to every weight in the network.
If the partial derivative is positive, SGD decreases the weight of that particular neuron in order to decrease the loss. On the contrary, if the partial derivative is negative, the weight of that particular neuron would be increased to reduce the loss.
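Each weight is then updated according to the rule:
$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}$$
Here $L$ is the loss and alpha ($\alpha$) is the learning rate, which determines how large a step the algorithm takes and therefore how fast or slow it finds the minimum. If the alpha value is large, the algorithm will converge faster, but there is a chance it might overshoot the best minimum. On the other hand, a small value of alpha avoids this problem but can take a long time to converge.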
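The update rule and the effect of the learning rate can be sketched in a few lines of Python. This is a minimal illustration, not the optimizer Keras uses: it runs plain gradient descent on a made-up one-weight loss, L(w) = (w - 3)^2, whereas SGD applies the same update with a gradient estimated from a random batch of training samples. The function names and alpha values are assumptions chosen for the example.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeatedly apply w <- w - alpha * dL/dw and return every visited weight."""
    w = w0
    history = [w]
    for _ in range(steps):
        # Positive gradient -> weight decreases; negative gradient -> weight increases.
        w = w - alpha * grad(w)
        history.append(w)
    return history


def toy_grad(w):
    """Derivative of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


# Same starting weight and step budget, three different learning rates.
small = gradient_descent(toy_grad, w0=0.0, alpha=0.01, steps=50)
good = gradient_descent(toy_grad, w0=0.0, alpha=0.1, steps=50)
large = gradient_descent(toy_grad, w0=0.0, alpha=1.1, steps=50)

for name, run in [("alpha=0.01", small), ("alpha=0.1", good), ("alpha=1.1", large)]:
    print(name, "final distance from minimum:", abs(run[-1] - 3.0))
```

On this toy loss, the smallest learning rate is still far from the minimum after 50 steps, the middle one gets very close, and the largest one overshoots on every step and moves further away.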
Neural networks learn by updating the weights through an optimization algorithm. It is a complicated process and hence the hidden layers in a neural network are known as a black box. But are there any specific guidelines as to what should be the initial values of the weights?
A weight determines the impact of a particular feature/input on the output. In general, the initial weights should be neither too small nor too large; otherwise the net might face the vanishing gradient problem (very slow learning) or the exploding gradient problem (divergence), respectively. Smaller weights are preferred since they result in a more robust model that is less prone to overfitting.
Secondly, the weights should not all be set to zero; otherwise, all the neurons would end up learning the same features during the training process. For the same reason, any strategy in which the weights are initialized with the same constant value would result in a poor predictive model.
Thirdly, the weights should exhibit high variance.
So how did we initialize our weights when we built our models? By default, Keras uses the glorot_uniform weight initialization technique if we do not specify one ourselves.
In deep learning, the weight values are almost always initialized by drawing random numbers either from the uniform distribution or the normal distribution. Numerous other techniques have been developed over the years and the weight initialization technique depends on the activation function used in each layer.
You can find all the weight initialization techniques available in Keras here: https://keras.io/api/layers/initializers/
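To make the idea of drawing initial weights from a uniform distribution concrete, here is a rough pure-Python sketch of Glorot uniform initialization. It samples each weight from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), where fan_in and fan_out are the number of inputs and outputs of the layer. The helper name and the layer sizes (13 inputs, 64 units) are assumptions made for illustration; in Keras the initializer is applied for you.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fan_in, fan_out = 13, 64  # hypothetical layer sizes
weights = glorot_uniform(fan_in, fan_out, rng)
limit = math.sqrt(6.0 / (fan_in + fan_out))

flat = [w for row in weights for w in row]
print("limit:", round(limit, 4))
print("smallest weight:", round(min(flat), 4), "largest weight:", round(max(flat), 4))
```

The weights are small, centered on zero, and all different from one another, which matches the guidelines above: not too large, not too small, and never a single constant value.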
Activation Functions
As discussed earlier, activation functions introduce non-linearity into a neural net, which is what makes it powerful enough to learn complex relationships. Here, we will go over three of the most widely used activation functions in deep learning:
1. Sigmoid / Logistic
Mathematically, a sigmoid function can be represented as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
It outputs a value between 0 and 1. It is used for binary classification problems and if you know logistic regression, then you will be familiar with this activation function already.
When we have a binary classification problem, we use a sigmoid function. When we have a multi-class classification problem, we use a softmax activation function in the final layer of the neural network. Hence, a softmax function is a generalization of the sigmoid function to a multi-class setting.
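The relationship between sigmoid and softmax can be checked numerically. The following is a small sketch using only the standard library; the function names and sample values are chosen for illustration. It also shows that a two-class softmax over the logits [x, 0] gives the same probability as sigmoid(x).

```python
import math


def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


x = 1.7  # arbitrary example input
print("sigmoid(x):", sigmoid(x))
print("two-class softmax:", softmax([x, 0.0]))
print("three-class softmax:", softmax([1.0, 2.0, 3.0]))
```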
2. Tanh
Mathematically, a tanh function can be represented as:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
It outputs a value between -1 and +1. It is similar to the sigmoid function except that it has the advantage of being zero centered, making it easier to model inputs that have strongly negative, strongly positive, or neutral values.
However, both sigmoid and tanh activation functions suffer from the problem of vanishing gradient, making them unsuitable in networks with many hidden layers.
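A rough way to see the vanishing gradient problem is to look at the derivatives of these two functions. The sketch below is a deliberate simplification: it multiplies the same local derivative once per layer and ignores the weights, which real backpropagation also multiplies in. The helper names and the depth of 10 layers are assumptions for illustration.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; its largest value is 0.25, reached at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh; equals 1 at x = 0 and shrinks quickly as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


def chained_gradient(local_grad, x, n_layers):
    """Product of n_layers identical local derivatives (weights ignored)."""
    return local_grad(x) ** n_layers


print("sigmoid, best case, 10 layers:", chained_gradient(sigmoid_grad, 0.0, 10))
print("tanh, saturated input, 10 layers:", chained_gradient(tanh_grad, 2.0, 10))
```

Even in the best case for sigmoid, the gradient reaching the first of ten layers is scaled by 0.25 ten times over, which leaves very little signal for updating the early weights.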
3. ReLU
Mathematically, a ReLU function can be represented as:
$$f(x) = \max(0, x)$$
It outputs zero if the input is negative; otherwise, it outputs the same value as the input. This has become one of the most popular activation functions in the field of deep learning since ReLU is computationally very efficient and allows the model to converge faster. It also largely avoids the vanishing gradient problem that is a drawback of the sigmoid and tanh activation functions.
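ReLU and its derivative are simple enough to write out directly. This is a small sketch with illustrative function names. For positive inputs the derivative is exactly 1, so multiplying it across many layers does not shrink the gradient.

```python
def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 3.5]
print("outputs:  ", [relu(v) for v in inputs])
print("gradients:", [relu_grad(v) for v in inputs])
print("gradient through 10 active layers:", relu_grad(3.5) ** 10)
```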
Which Activation Function should be used in hidden layers and output layer?
With such a wide variety of activation functions available, selecting the appropriate one can seem like a daunting task. As a general rule, you should use ReLU for the hidden layers. For the final output layer, use sigmoid for binary classification, softmax for multi-class classification, and a linear function for a regression problem.
Tuning Our Neural Network
Import the Required Libraries
(You can view and download the complete Jupyter Notebook here. And the dataset ‘housing.csv’ can be downloaded from here)
Define a model for KerasRegressor
List of Hyperparameter Values
Prepare the Grid
Perform Grid Search
The grid search took 152 minutes to train the neural network, so you might want to do a few chores while the model gets trained.
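To get a feel for why grid search is so slow, it helps to count how many models it has to train. The sketch below enumerates every combination in a small grid using the standard library. The hyperparameter names, their values, and the number of cross-validation folds are all hypothetical and are not the ones used in this post's notebook.

```python
from itertools import product

# Hypothetical hyperparameter values, chosen only for illustration.
param_grid = {
    "optimizer": ["sgd", "adam"],
    "activation": ["relu", "tanh"],
    "batch_size": [16, 32, 64],
    "epochs": [50, 100],
}
cv_folds = 5  # assumed number of cross-validation folds

# Every possible configuration is one element of the Cartesian product.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print("configurations:", len(combinations))
print("models trained with cross-validation:", total_fits)
print("first configuration:", combinations[0])
```

Each extra value added to any list multiplies the number of configurations, which is why even a modest grid can take hours when every fit is a full neural network training run.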
Now let’s view the best score and the best parameters learned by the network.
Best Score & Best Parameters
Don’t be surprised to see a negative value as the best score. Simply ignore the negative sign. GridSearchCV, by convention, always tries to maximize its score so loss functions like MSE have to be negated.
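The sign convention can be illustrated without any libraries. In the sketch below, the data and the two sets of candidate predictions are made up. The score is the negated MSE, similar in spirit to scikit-learn's `neg_mean_squared_error` scoring option, so the model with the smallest error ends up with the largest score.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mse_score(y_true, y_pred):
    """Negated MSE, so that a higher score always means a better model."""
    return -mean_squared_error(y_true, y_pred)


# Made-up targets and predictions from two hypothetical models.
y_true = [3.0, 5.0, 7.0]
candidates = {
    "model_a": [2.0, 5.0, 8.0],
    "model_b": [3.0, 5.5, 7.0],
}

scores = {name: neg_mse_score(y_true, preds) for name, preds in candidates.items()}
best = max(scores, key=scores.get)  # maximizing the score minimizes the MSE

print("scores:", scores)
print("best model:", best, "with MSE", -scores[best])
```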
Optimizing a neural network by fine-tuning its hyperparameters is a time-consuming process: there are many hyperparameters involved, and grid search has to train and evaluate every possible configuration. This blog was meant to give you an understanding of how neural networks minimize the loss function by updating the weights that are initialized at the start of the learning process. There are a number of optimization algorithms and weight initialization techniques that can be used. You were also introduced to the most common activation functions in deep learning: sigmoid, softmax, tanh, and ReLU. The end goal was to show how to tune the hyperparameters of a neural network in Python using Keras and scikit-learn.