This post continues on from part one, breaking down the basics of linear regression and showing how we can expand on yesterday's work by applying a train-test split to our dataset.
These projects run Python notebooks in VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or can adapt the method to whatever works best for you).
Let's create the `regression-with-scikit-learn-part-two` directory by cloning the work we did yesterday. The required packages will already be available in our conda environment.
If you are unsure how to activate the conda virtual environment, please look to the prerequisites or resources section for links on conda fundamentals.
# Clone yesterday's work into a new `regression-with-scikit-learn-part-two` directory
$ git clone https://github.com/okeeffed/regression-with-scikit-learn.git regression-with-scikit-learn-part-two
$ cd regression-with-scikit-learn-part-two
At this stage, the file docs/linear_regression.ipynb already exists and we can work off this material.
Before we start, let's go over the basics of linear regression.
Linear regression basics
The equation that describes the fitted line is the following:

y = ax + b

This can be broken down as:

- y: the target variable
- x: the single feature
- a: the slope of the line
- b: the y-intercept

a and b are the parameters of the model.
To calculate the values of a and b, we need to define an error function (also known as the cost function or loss function) for any line and choose the line that minimizes the error function.
The aim is to minimize the vertical distance between the fitted line and each data point.
This distance is known as the residual. Because positive and negative residuals (from data points above and below the line) would cancel each other out, we instead minimize the sum of the squares of the residuals.
OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variable.
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface — the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.
To put that into some human speak: the axis of the dependent variable is our y-axis, so we sum the squared vertical distances between each data point in the set and the corresponding point on the regression line. The smaller those distances, the better the fit.
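As a quick illustration of why we square the residuals, here is a small sketch with made-up data points and an arbitrary candidate line y = 2x + 1 (all values here are purely illustrative):

```python
import numpy as np

# Made-up data points and an arbitrary candidate line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.5, 4.5, 7.5, 8.5])

predictions = 2 * x + 1
residuals = y - predictions  # vertical distances: [0.5, -0.5, 0.5, -0.5]

# Raw residuals from points above and below the line cancel out...
print(residuals.sum())  # 0.0, even though the fit is imperfect

# ...so OLS minimizes the sum of the squared residuals instead
print((residuals ** 2).sum())  # 1.0
```

A sum of zero would wrongly suggest a perfect fit, which is exactly why the squared version is used as the error function.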
When we call the fit method from our LinearRegression object, we are actually calculating the parameters of the line by performing OLS under the hood.
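We can sanity-check that claim by comparing LinearRegression against numpy's np.polyfit, which solves the same least-squares problem for a degree-one polynomial. The synthetic data below (slope 3, intercept 2, plus noise) is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus some noise (illustrative values)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=50)

reg = LinearRegression().fit(X, y)

# polyfit with deg=1 performs ordinary least squares for a straight line
a, b = np.polyfit(X.ravel(), y, deg=1)

print(reg.coef_[0], a)    # the slopes agree
print(reg.intercept_, b)  # the intercepts agree
```

Both approaches recover (near enough) the same slope and intercept because both minimize the same sum of squared residuals.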
Higher dimensions of linear regression
So far, our examples have worked in a single dimension that is easy to visualize, with y calculated from one feature on the x-axis (from our example yesterday, this was "Number Of Rooms (feature) vs Value Of House (target variable)").
However, in the real world, we often have more than one feature.
To calculate with multiple features (or dimensions), our linear regression equation becomes the following:

y = a1x1 + a2x2 + ... + anxn + b

That is, there is one coefficient for each feature, plus the intercept b.
In application, the Scikit-learn API can help us with this as we pass two arrays to the fit method:
- An array with the features.
- An array with the target variable.
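As a minimal sketch of the shapes involved (the numbers below are made up purely for illustration, chosen so y = 1·x1 + 2·x2 exactly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature array: one row per sample, one column per feature
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])  # shape (n_samples, n_features)

# Target array: one value per sample, here y = 1*x1 + 2*x2
y = np.array([5.0, 4.0, 11.0, 10.0])  # shape (n_samples,)

reg = LinearRegression().fit(X, y)
print(reg.coef_)       # one coefficient per feature
print(reg.intercept_)  # the single intercept b
```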
Let's do just that and see how it works.
Applying the train/test split to our dataset
In our file docs/linear_regression.ipynb, we can add the following:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# X (features) and y (target) carry over from our work in part one
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

reg_all = LinearRegression()
reg_all.fit(X_train, y_train)  # fit on the training data before predicting

y_pred = reg_all.predict(X_test)
print(reg_all.score(X_test, y_test)) # outputs 0.711226005748496
The default score method for linear regression returns R squared (the coefficient of determination). For more details, see the documentation.
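For intuition, R squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch with made-up true values and predictions:

```python
import numpy as np

# Hypothetical true values and predictions, purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 6.9, 9.1])

ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))  # 0.995
```

A value near 1 means the model explains most of the variance in the target; this is the same quantity score computes from X_test and y_test.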
Note: You will never use Linear Regression out of the box like this. You will almost always want to use regularization. We will dive into this in the next part.
Today's post covered the math that describes the line generated by a linear regression fit.
We then looked at how this calculation extends when more dimensions are added into the mix.
Finally, we demonstrated this with train_test_split and a LinearRegression object.
As noted in the last section, this is not how you would use Linear Regression in practice. You will (almost) always want to use regularization.