Regression [With Code]: A Lighthouse for Data Scientists
While people are often fascinated by more advanced algorithms, many real-life problems can be solved quickly with linear regression.
In this article, we will implement one of the simplest and most widely used supervised machine learning algorithms: linear regression.
Linear regression is a fairly simple and straightforward machine learning algorithm and is usually the first algorithm taught in data science. It lays the foundation for more advanced algorithms. Hence, enjoy the calmness of regression before chaos arrives.
We will first implement it in Python, using the pandas, scikit-learn, Seaborn, and Matplotlib libraries. Then, we will implement it on the AI & Analytics Engine, which requires absolutely no programming experience.
About the Dataset
The aim is to predict the median price of a house given a bunch of features such as its longitude, latitude, proximity to the ocean, number of bedrooms, household income, house age, etc. This is a very common use case in real estate to predict housing prices.
The target variable is median_house_value. All the variables in the dataset are numeric except one: ocean_proximity. Since it is a categorical variable, we will one-hot encode it to convert it into numeric.
Linear Regression: Background Knowledge
Linear Regression is a statistical model that studies the relationship between one or more input/independent variables and an output/dependent variable. Both the input variables and the output variable are numeric.
The term linear in linear regression suggests that the relationship between the input and output variable must be linear. In the case of multiple linear regression where there is more than one input variable, the output variable must be a linear combination of input variables.
We are interested in calculating the coefficients (weights) of the input variables to determine their impact on the output variable. This is done through a technique referred to as Ordinary Least Squares (OLS).
OLS tries to find a best-fitting line through the data points that minimizes the residual sum of squares (RSS).
RSS takes the distance between every data point (shown in blue) and the line (shown in red), squares it, and adds the results up. This is the quantity that OLS aims to minimize. Conceptually, numerous candidate lines are drawn and the RSS is computed for each one.
Note: the distance between a data point (shown in blue) and the line (shown in red) is called a residual.
The best-fitting line is the one that results in the minimum value for the sum of the squared residuals.
Linear Regression: Evaluation Metrics
We will analyze the performance of our linear regression model by calculating the Root Mean Squared Error (RMSE).
RMSE measures how spread out the residuals are. The more spread out the residuals, i.e., the greater the distances from the fitted line, the higher the RMSE and the worse the performance of the model.
Just as the performance of a classification model can be improved by tuning its hyperparameters, the performance of a regression model can be improved by preventing it from overfitting. This is achieved by adding a penalty term that shrinks the weights of large coefficients, reducing the variance of the model. As a result, the model's bias increases slightly: regularization trades a little bias for a reduction in variance.
There are two widely used regularization techniques:
1. Ridge Regression (L2)
Whenever there is multicollinearity in the data, the least-squares estimates remain unbiased, but their variances can be so large that the predicted values are not a true reflection of the actual values.
In such scenarios, ridge regression can be very effective at tuning the weights of the coefficients.
Loss function = OLS loss + lambda * sum(squared coefficient values)
Lambda is the parameter whose value we need to select that will minimize the loss function. The higher the value of lambda, the higher the penalty, and the greater the reduction in the weight of the coefficients.
Ridge regression retains all of the features, unlike Lasso Regression.
2. Lasso Regression (L1)
Lasso shrinks some of the coefficients to zero so some of the features get eliminated completely. It is a popular technique to do feature selection. For example, if we have a lot of features that are highly correlated, it will retain one and set all the other correlated variables to zero.
Loss function = OLS loss + lambda * sum(absolute coefficient values)
The difference here is that Lasso penalizes the absolute value of the regression coefficients whereas ridge penalizes their squared value.
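To make the two penalties concrete, here is a small sketch using scikit-learn's Ridge and Lasso estimators, in which the alpha parameter plays the role of lambda above. The synthetic data, with two nearly identical features and one irrelevant feature, is an assumption chosen purely to show the contrast: ridge keeps every feature, while lasso can drive some coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: two nearly identical (collinear) features and one
# irrelevant feature; only x1 actually drives the target
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([
    x1,
    x1 + rng.normal(scale=0.01, size=200),  # near-duplicate of x1
    rng.normal(size=200),                   # irrelevant noise feature
])
y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks coefficients, keeps all features
lasso = Lasso(alpha=0.1).fit(X, y)  # can set some coefficients exactly to zero

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```

With these data, lasso typically zeroes out the irrelevant feature (and often one of the correlated pair), whereas ridge assigns every feature a small non-zero weight.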
Importing the Required Libraries
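The exact import list below is an assumption based on the libraries named earlier; note that scikit-learn's IterativeImputer must be explicitly enabled via an import from sklearn.experimental.

```python
# Data handling and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# scikit-learn pieces used later: imputation, splitting, the model, the metric
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
```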
Reading the Dataset
Our dataset consists of 20,640 observations and 10 columns. Hence, we have 9 features and one target variable.
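The loading step might look like the following. The filename "housing.csv" is an assumption about where your copy of the data lives; two sample rows in the dataset's format are inlined here so the snippet runs on its own.

```python
import io
import pandas as pd

# Two inline rows mirroring the dataset's 10 columns; a stand-in for the file
csv_text = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
"""
df = pd.read_csv(io.StringIO(csv_text))  # with the real file: pd.read_csv("housing.csv")

print(df.shape)  # the full dataset would be (20640, 10)
```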
What’s the Target/ Response/ Dependent Variable?
Let’s have a look at our target variable ‘median_house_value’
Plotting Histogram of the Target Variable
A histogram will give us a good idea of the spread of our target variable.
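A sketch of that histogram; the handful of hard-coded prices below is a stand-in for the real median_house_value column.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Stand-in values; with the real data, use df["median_house_value"]
target = pd.Series([452600, 358500, 352100, 341300, 342200, 269700, 299200, 241400])

ax = target.plot(kind="hist", bins=5, edgecolor="black")
ax.set_xlabel("median_house_value")
ax.set_title("Distribution of the target variable")
plt.savefig("target_hist.png")
```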
Getting a ‘feel’ for our Data
It’s always a good idea to explore your dataset in detail before building a model. Let’s see the data types of all our columns
Nine of the columns are of type float64 and one is categorical (object).
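For example (shown here on a two-row stand-in frame; on the full dataset, df.dtypes lists nine float64 columns and one object column):

```python
import pandas as pd

# Minimal stand-in frame with both kinds of column
df = pd.DataFrame({
    "median_income": [8.3252, 8.3014],
    "median_house_value": [452600.0, 358500.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
})

print(df.dtypes)                 # per-column data types
print(df.dtypes.value_counts())  # how many columns of each type
```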
Looking at the Correlation of our Features with Target Variable
We can see the features that have a positive correlation with median_house_value and also the ones that have a negative correlation.
median_income has the strongest correlation in terms of magnitude with the target variable. This suggests that median_income is the most important predictor in determining the median house value.
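One way to inspect these correlations, sketched on a small synthetic frame (the numbers are illustrative, not the real data):

```python
import pandas as pd

# Illustrative numbers only; on the real data this would be
# df.corr(numeric_only=True)["median_house_value"]
df = pd.DataFrame({
    "median_income": [2.0, 4.5, 6.1, 8.3, 1.4],
    "housing_median_age": [41, 21, 35, 15, 52],
    "median_house_value": [150000, 280000, 360000, 450000, 120000],
})

corr = df.corr(numeric_only=True)["median_house_value"].sort_values(ascending=False)
print(corr)
```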
We can consider removing features that have a very low correlation with our target variable, such as total_bedrooms, population, and longitude, as this can sometimes improve the predictive performance of our model.
Hence, checking correlations is one way to perform feature selection.
Plotting a Heatmap to Study Correlation Among all the Variables
A heatmap can be a very nice way to visualize the relation of the variables among themselves. This can help us remove features that have high multicollinearity. However, if the features have high multicollinearity but they are also strongly correlated with the target variable, it’s wise to keep them in the feature space.
We can also quickly identify features that are weakly correlated with our target variable as they do not add any useful information that our model can learn from. It also helps to reduce the dimensionality of our feature space.
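A minimal sketch of such a heatmap with Seaborn, again on a small synthetic frame standing in for the full dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative stand-in for the full numeric feature set
df = pd.DataFrame({
    "median_income": [2.0, 4.5, 6.1, 8.3, 1.4],
    "total_rooms": [880, 7099, 1467, 1274, 1627],
    "median_house_value": [150000, 280000, 360000, 450000, 120000],
})

# annot=True prints each correlation value inside its cell
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", ax=ax)
plt.tight_layout()
plt.savefig("corr_heatmap.png")
```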
Check for Missing Values
Always begin by identifying the missing values present in your dataset as they can hinder the performance of your model.
The only missing values we have are present in the total_bedrooms column.
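The check itself is a one-liner; the stand-in frame below mirrors the real situation, with gaps only in total_bedrooms:

```python
import numpy as np
import pandas as pd

# Stand-in frame with missing values only in total_bedrooms
df = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "median_income": [8.3, 8.3, 7.2, 5.6],
})

print(df.isnull().sum())  # count of missing values per column
```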
Impute Missing Values
We begin by separating the features into numeric and categorical. The technique to impute missing values for numeric and categorical features is different.
For categorical features, the most frequent value occurring in the column is computed and the missing value is replaced with that.
For numeric features, there exists a range of different techniques such as calculating the mean value or building a model with the known values, and predicting the missing values. The Iterative Imputer used below does exactly that.
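A sketch of both strategies on tiny stand-in frames: SimpleImputer handles the most-frequent-value fill for categoricals, while IterativeImputer predicts each missing numeric value from the other numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Stand-in data with one gap in each block
num = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 220.0],
    "total_rooms": [880.0, 1200.0, 1500.0, 1800.0],
})
cat = pd.DataFrame({"ocean_proximity": ["INLAND", np.nan, "INLAND", "NEAR BAY"]})

# Numeric: model the missing value from the other numeric columns
num_imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(num), columns=num.columns
)
# Categorical: fill with the most frequent value in the column
cat_imputed = pd.DataFrame(
    SimpleImputer(strategy="most_frequent").fit_transform(cat), columns=cat.columns
)

print(num_imputed.isnull().sum().sum(), cat_imputed.isnull().sum().sum())
```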
The missing values have been imputed. Now we have zero missing values in our dataset.
One-Hot encoding the Categorical Variable ‘Ocean_Proximity’
Most machine learning algorithms cannot understand text data. Hence, categorical variables are converted into vectors and then fed into the model. Linear regression requires all input variables to be numeric.
Here, we one-hot encode the only categorical variable in our data: ocean_proximity. It has five unique values: <1H Ocean, Inland, Island, Near Bay, Near Ocean. Hence, the ocean_proximity feature gets mapped to five different features, each representing one of the five distinct values.
For example, a vector [0,0,0,1,0] means TRUE for Near Bay and FALSE for all the other four columns.
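The encoding step can be sketched with pandas' get_dummies; one row per unique category is used here purely for illustration:

```python
import pandas as pd

# One row per unique ocean_proximity value, purely for illustration
df = pd.DataFrame({
    "median_income": [8.3, 2.1, 4.4, 5.0, 3.9],
    "ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "<1H OCEAN", "NEAR OCEAN"],
})

# Each unique value of ocean_proximity becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["ocean_proximity"])
print(encoded.columns.tolist())
```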
Our One-Hot Encoded Dataset
Before one-hot encoding the categorical feature ocean_proximity, our data had 10 columns. After one-hot encoding, it has 14 columns. One-hot encoding increases the size of the feature space, since every unique value of a categorical column becomes its own feature.
Separating into Features & Target Variable
X has all our features whereas y has our target variable.
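The separation is a simple drop/select; the stand-in frame below takes the place of the one-hot encoded dataset:

```python
import pandas as pd

# Stand-in frame; with the real data, df is the one-hot encoded dataset
df = pd.DataFrame({
    "median_income": [8.3, 2.1, 4.4],
    "housing_median_age": [41, 21, 35],
    "median_house_value": [452600, 120000, 280000],
})

X = df.drop(columns="median_house_value")  # all features
y = df["median_house_value"]               # target variable
```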
Splitting into Training & Testing Data
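A common choice, assumed here, is to hold out 20% of the rows for testing, with a fixed random_state for reproducibility; the synthetic X and y stand in for the real ones:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# 80/20 train/test split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```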
Fitting the Linear Regression Model
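The fit itself is one line with scikit-learn. Synthetic data with a known, noise-free linear relationship is used here so the recovered coefficients can be checked; with the real data this would be model.fit(X_train, y_train).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known, noise-free linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)  # OLS fit
print(model.coef_, model.intercept_)
```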
Printing the y-intercept
Printing the Weight of the Coefficients
Plotting the Weight Coefficients for Easier Visualization
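The three steps above can be sketched together; the feature names and the synthetic fit are illustrative stand-ins for the real model:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

feature_names = ["median_income", "housing_median_age"]  # stand-in names

# Synthetic fit with known coefficients and intercept
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 4.0 * X[:, 0] + 1.5 * X[:, 1] + 2.0
model = LinearRegression().fit(X, y)

print("y-intercept:", model.intercept_)

# Pair each coefficient weight with its feature name
coef = pd.Series(model.coef_, index=feature_names)
print(coef)

# A horizontal bar chart makes comparing coefficient weights easy
ax = coef.plot(kind="barh")
ax.set_xlabel("coefficient weight")
plt.tight_layout()
plt.savefig("coefficients.png")
```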
Evaluating the Linear Regression Model
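The RMSE computation might look like this, shown end-to-end on synthetic data with a little noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

# RMSE: square root of the mean squared residual on the test set
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE:", rmse)
```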
Plotting the Histogram of Residuals
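A sketch of the residual histogram; with an intercept in the model, OLS residuals average to zero, and a roughly bell-shaped, zero-centred histogram suggests a reasonable linear fit. The synthetic fit below stands in for the real one.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic fit whose residuals we can inspect
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = X[:, 0] + rng.normal(scale=0.3, size=300)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

fig, ax = plt.subplots()
ax.hist(residuals, bins=20, edgecolor="black")
ax.set_xlabel("residual")
plt.savefig("residuals_hist.png")
```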
Linear Regression is one of the most basic machine learning algorithms, but it can also be one of the most powerful for regression problems. While people are often fascinated by more advanced algorithms such as support vector regressors or neural networks, many real-life problems can be solved quickly and efficiently using linear regression. This article gave you an overview of linear regression, implemented it in Python, and introduced ridge and lasso regression, the regularization techniques used to prevent a model from overfitting.