Decoding Deep Learning: Neural Networks for Regression Part I
Neural networks and deep learning are all the rage these days. They are known to solve some of the most complex problems such as text translation, audio transcription, and object detection and have all the data scientists and machine learning engineers on the edge of their seats.
In this blog, we will build a simple artificial neural network to predict the median housing price. Since the median housing price is a continuous variable, this is a regression problem.
About the Dataset
The aim is to predict the median price of a house given a bunch of features such as its longitude, latitude, proximity to the ocean, number of bedrooms, household income, house age, etc. This is a very common use case in real estate to predict housing prices.
The target variable is median_house_value. All the variables in the dataset are numeric except one: ocean_proximity.
Neural networks require the input features to be numeric. For simplicity and convenience, we will drop the ocean_proximity and total_bedrooms features from our dataset.
Not sure how to impute missing values? Pay a visit to our blog "Is there a faster way to clean data?"
Neural Networks: Background Knowledge
A neural network consists of an input layer, one or more hidden layers, and an output layer. Information flows from the input to the output layer through the hidden layers.
The number of neurons in the input layer equals the number of features (input variables). You can have as many hidden layers as you wish, and the number of neurons in each hidden layer is also your choice. Any neural network with more than one hidden layer is considered a deep neural network. For a classification problem, the output layer has as many neurons as there are classes in the target variable (for binary classification, one neuron will suffice). For a regression problem, it has a single neuron, since it needs to output a single value.
Some of the most common terms that you need to acquaint yourself with are as follows:
Activation Function: An activation function decides whether a neuron will ‘fire’ or not, i.e., whether it will get activated. It does this by applying a non-linear transformation to the neuron's weighted input. Some of the most commonly used activation functions are Sigmoid, Softmax, and ReLU.
Batches: Rather than feeding the entire training set into the network at once, deep learning typically splits the training data into chunks, or batches, of equal size and updates the weights after each batch. This keeps memory use manageable, and the noise in the per-batch updates often helps the model generalize better.
Bias: Every neuron multiplies each input by a weight and then adds a bias term, which shifts the result. The weighted input plus the bias forms the linear component of the neuron's transformation.
Hidden Layer: These are the intermediate layers that do all the processing on the data.
Input Layer: This is the first layer of a neural network and as the name suggests, it receives the input.
Loss Function: A loss function, also known as an objective function or cost function, measures the error between the predicted and actual values. For any model, we want the difference between the predicted output and the actual value to be as low as possible, ideally zero. The loss therefore quantifies how wrong the model is, and training improves the model by adjusting the weights to reduce it.
Neuron: A neuron is the building block of a neural network, represented by a circle. It receives input, processes it, and generates an output. The input layer, hidden layers, and output layer are all composed of neurons.
Output Layer: This is the final layer of a neural network and as the name suggests, it generates the output.
Weight: Every neuron receives an input and multiplies it with a weight. The weights are initialized randomly and are subsequently updated using backpropagation. Inputs that matter more for the prediction end up with weights of larger magnitude, while less significant ones see their weights shrink toward zero.
The internal workings of a neural network (Image Source)
Each input X gets multiplied by a weight W that is randomly initialized at the beginning, the weighted inputs are summed, and a bias term b is added (W*X + b). The result is passed through an activation function. The output of one layer serves as the input for the next layer. The final output is compared with the ground truth, i.e., the actual target value. This comparison is done by a loss/cost function, and the objective is to minimize this function since it estimates the error between the predicted and actual values. The lower the error, the better our model performs.
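To make the forward pass concrete, here is a minimal sketch in plain Python. It is not the code used in this blog: the layer sizes, the weights, the target value, and the helper names `dense` and `relu` are all made up for illustration.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation=None):
    """Compute one fully connected layer.

    weights[j] holds the weights of neuron j (one weight per input),
    biases[j] is that neuron's bias term.
    """
    outputs = []
    for w_j, b_j in zip(weights, biases):
        # Weighted sum of the inputs plus the bias: W*X + b
        z = sum(w * x for w, x in zip(w_j, inputs)) + b_j
        outputs.append(activation(z) if activation else z)
    return outputs


# Hypothetical network: 2 input features, 2 hidden neurons, 1 output neuron.
x = [1.0, 2.0]
hidden = dense(x, weights=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, -1.0],
               activation=relu)
# The output layer has no activation because this is a regression network.
prediction = dense(hidden, weights=[[2.0, 3.0]], biases=[0.5])[0]

# Compare the prediction with a made-up ground-truth value.
y_true = 7.0
squared_error = (prediction - y_true) ** 2
print(hidden, prediction, squared_error)
```

The first hidden neuron receives a negative weighted sum, so ReLU switches it off; only the second one contributes to the prediction.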
The error is minimized by adjusting the weights, which were initialized randomly at the start. Working out how to adjust the weights is the most important and complex part of the learning process for a neural network. It is done through an algorithm called backpropagation, which computes how much each weight contributed to the error so that the optimizer can update it accordingly.
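For a single linear neuron, backpropagation reduces to applying the chain rule once, which makes the weight update easy to follow. The sketch below is an illustration under stated assumptions: the toy data, the starting weights, the learning rate, and the number of steps are all chosen for the example and are not taken from the housing model.

```python
# Toy data generated from y = 2x + 1 (illustrative, not the housing dataset).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0          # arbitrary starting weight and bias
learning_rate = 0.05     # assumed step size


def mse(w, b):
    """Mean squared error of the neuron y_hat = w * x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_loss = mse(w, b)
for _ in range(2000):
    # Gradients of the loss with respect to w and b (chain rule).
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Move each parameter a small step against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(round(w, 3), round(b, 3), final_loss)
```

In a multi-layer network the same idea applies, except that the gradients are propagated backwards layer by layer.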
What kind of activation function should you use?
Well, that depends on the type of problem you are trying to solve. For hidden layers, it is common to use the Rectified Linear Unit (ReLU). For the output layer, use a sigmoid function if you are trying to solve a binary classification problem and a softmax function if it’s a multi-class classification problem.
Since we are solving a regression problem, we won’t need any activation function in the output layer.
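The activation functions mentioned above are short enough to write out by hand. This is a minimal sketch for intuition only; deep learning libraries ship their own implementations, and the function name `linear` is just a label used here for "no activation".

```python
import math


def relu(z):
    """Typical choice for hidden layers: zero for negatives, identity otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes a value into (0, 1); used for binary classification outputs."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1 (multi-class)."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(z):
    """No activation: what a regression output neuron uses."""
    return z


print(relu(-3.0), relu(2.5))
print(sigmoid(0.0))
print(softmax([1.0, 2.0, 3.0]))
print(linear(-4.2))
```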
What kind of optimizer should you use?
There are numerous optimization techniques for minimizing the loss function, with stochastic gradient descent being a very popular one among data scientists. In deep learning, however, one of the most common optimizers is Adam, which adapts the step size for each weight individually and has proven quite effective across a wide range of problems.
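To show what the two update rules look like, here is a small sketch that minimizes a made-up one-parameter loss, f(theta) = (theta - 3)^2. The loss, the learning rate, and the step count are assumptions for illustration; the Adam constants are the commonly used defaults. Because this toy loss has no sampling noise, the `sgd` function is plain gradient descent.

```python
import math


def grad(theta):
    """Gradient of the toy loss f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def sgd(theta, steps, lr=0.1):
    """Gradient descent: step against the gradient with a fixed learning rate."""
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta


def adam(theta, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: scale each step using running averages of the gradient."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


theta_sgd = sgd(0.0, steps=1000)
theta_adam = adam(0.0, steps=1000)
print(theta_sgd, theta_adam)
```

Both optimizers end up close to the minimum at theta = 3 here; the practical differences show up on real networks with many weights and noisy gradients.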
What kind of loss function should you use?
If it’s a binary classification problem, you should use the binary cross-entropy loss function. In this blog, since we are solving a regression problem, we will use the mean squared error loss function.
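Both loss functions are simple averages over the training examples. The sketch below writes them out in plain Python; the sample values are made up, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; used for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood; used for binary classification."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log() never receives 0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
bce = binary_cross_entropy([1, 0], [0.5, 0.5])
print(mse, bce)
```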
Importing the Required Libraries
Reading the Dataset
We drop ocean_proximity because it is a categorical column, and total_bedrooms because it has missing values and a very low correlation with the target variable.
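A minimal sketch of this step using only the standard library is shown below. The two rows are made up and stand in for the real file, and the column names longitude and latitude are assumptions; only ocean_proximity, total_bedrooms, and median_house_value are named in the text. With pandas, the same step is a single `DataFrame.drop(columns=[...])` call.

```python
import csv
import io

# Made-up rows standing in for the real dataset file.
raw = io.StringIO(
    "longitude,latitude,total_bedrooms,ocean_proximity,median_house_value\n"
    "-122.2,37.8,129,NEAR BAY,452600\n"
    "-118.3,34.1,,INLAND,158700\n"
)

columns_to_drop = {"ocean_proximity", "total_bedrooms"}
rows = [
    # Keep only the remaining columns and convert them to numbers.
    {name: float(value) for name, value in row.items()
     if name not in columns_to_drop}
    for row in csv.DictReader(raw)
]

# Dimensions of the resulting table: (number of rows, number of columns).
shape = (len(rows), len(rows[0]))
print(shape)
```

Note that the second row has an empty total_bedrooms value; dropping the column sidesteps the missing value entirely.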
Viewing dimensions of the Resultant Dataset
Plotting Histogram of the Target Variable
It’s always a good idea to perform some exploratory data analysis on the target variable, to get a better understanding of how the variable is distributed. Some quick visualizations can give you meaningful insights about the range of the observations, any significant outliers, whether the observations are heavily skewed in one direction, and the central tendency of the data points.
The most commonly used approach for a continuous variable is to plot a histogram. A histogram is a bar plot where the axis representing the data variable is divided into a set of discrete bins and the count of observations falling within each bin is shown using the height of the corresponding bar (source).
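The binning behind a histogram can be sketched in a few lines. This is an illustration of the counting step only, not the plotting code used here; the helper name `histogram_counts` and the sample values are made up, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, not a bin of its own.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up house values (not taken from the dataset).
values = [100, 120, 150, 200, 210, 220, 300, 500]
counts = histogram_counts(values, n_bins=4)
print(counts)
```

Each count becomes the height of one bar; a long tail of nearly empty bins on one side is a quick sign of skew or outliers.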
Separate Dataset into Features & Target
Scaling the Features
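One common way to scale features is standardization, which rescales each column to zero mean and unit standard deviation. The sketch below assumes that method and uses only the standard library; the helper name `standardize` and the sample column are made up.

```python
import statistics


def standardize(column):
    """Rescale a feature column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(x - mean) / std for x in column]


scaled = standardize([10.0, 20.0, 30.0, 40.0])
print(scaled)
```

Scaling puts all features on a comparable range, which generally makes gradient-based training better behaved.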
Create Train, Validation & Test Sets
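A minimal sketch of a three-way split, assuming a 60/20/20 ratio and a fixed random seed. Both are illustrative choices, as is the helper name `train_val_test_split`; the integers stand in for rows of the dataset.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle the rows and cut them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The training set is used to fit the weights, the validation set to monitor the model during training, and the test set for the final evaluation.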
Building the Neural Network
We use the Sequential model to build our neural network, which stacks layers in the order they are added: the first layer is the input layer, the intermediate layers correspond to the hidden layers, and the last layer is the output layer.
We will test three different models with varying numbers of hidden layers and then compare the RMSE values to evaluate how the accuracy changes as the number of hidden layers increases.
For all three models, the activation function used for the hidden layers is ReLU, the number of neurons per hidden layer is 32, the optimizer is Adam, the loss function is mean squared error, the batch size is 32, and the number of epochs is 500.
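The configuration above can be sketched with Keras as follows. This is a reconstruction under assumptions, not the original code: the helper name `build_model` is made up, `n_features=7` is a placeholder for the number of columns in your feature matrix, and the training call is left as a comment because it depends on arrays from the earlier steps.

```python
from tensorflow import keras


def build_model(n_features, n_hidden_layers):
    """Stack n_hidden_layers ReLU layers of 32 neurons and one linear output."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for _ in range(n_hidden_layers):
        model.add(keras.layers.Dense(32, activation="relu"))
    # Single neuron, no activation: the network outputs one continuous value.
    model.add(keras.layers.Dense(1))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


# n_features=7 is an assumption; use the width of your own feature matrix.
models = {depth: build_model(n_features=7, n_hidden_layers=depth)
          for depth in (2, 3, 5)}

# Training would then look like this (X_train, y_train, X_val and y_val are
# placeholders for the arrays produced by the earlier steps):
# models[2].fit(X_train, y_train, validation_data=(X_val, y_val),
#               batch_size=32, epochs=500)
```

Building all three models through one function keeps everything except the depth identical, so any difference in RMSE can be attributed to the number of hidden layers.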
Model 1: 1 Input Layer, 2 Hidden Layers, 1 Output Layer
Model 2: 1 Input Layer, 3 Hidden Layers, 1 Output Layer
Model 3: 1 Input Layer, 5 Hidden Layers, 1 Output Layer
Comparing the three models, we see that increasing the number of hidden layers decreases the RMSE, meaning the predictions get closer to the actual values. Adding hidden layers does not always help, though: a deeper network can also overfit the training data, so caution needs to be exercised and the comparison should be made on data the model has not been trained on.
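For reference, RMSE is the square root of the mean squared error, which puts the error back in the same units as the target. A minimal sketch with made-up values (not results from the three models):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Errors of 1 and 7 give a mean squared error of 25, so the RMSE is 5.
score = rmse([10.0, 20.0], [11.0, 13.0])
print(score)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.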
This blog aims to introduce you to the fascinating and exhilarating world of deep learning and lay a strong foundation for the main concepts involved in building deep neural networks. We started out by discussing the theory behind neural network architecture and then built three different models with different numbers of hidden layers in Python using Keras. In the next part, we will discuss in detail the different activation functions used in neural networks and also perform hyperparameter optimization using grid search to identify the best performing model.