
Regression With Scikit Learn (Part 2)

Published: Aug 17, 2021

Last updated: Aug 17, 2021

This is Day 30 of the #100DaysOfPython challenge.

This post will continue on from part one and break down the basics of linear regression and also explain how we can take the work that we did and expand upon that to apply a train-test split to our dataset.

Source code can be found on my GitHub repo okeeffed/regression-with-scikit-learn-part-two.


Prerequisites

  1. Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post "The Definitive Guide to Conda Environments" on "Towards Data Science".
  2. Familiarity with JupyterLab. See here for my post on JupyterLab.
  3. These projects will also run Python notebooks on VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or alter the method for what works best for you).
  4. Read "Regression With Scikit Learn (Part One)".

Getting started

Let's create the regression-with-scikit-learn-part-two project by cloning the work we did yesterday. The packages required will be available in our conda environment.

If you are unsure on how to activate the conda virtual environment, please look to the prerequisites or resources section for links on conda fundamentals.

```shell
# Clone the work into the `regression-with-scikit-learn-part-two` directory
$ git clone https://github.com/okeeffed/regression-with-scikit-learn-part-two.git
$ cd regression-with-scikit-learn-part-two
```

At this stage, the file docs/linear_regression.ipynb already exists and we can work off this material.

Before we start, let's go over the basics of linear regression.

Linear regression basics

The equation that describes the regression line is the following:

Linear equation

y = ax + b

The statement can be broken down into the following:

| Symbol | Meaning                 |
| ------ | ----------------------- |
| y      | Target variable         |
| x      | Single feature          |
| a, b   | Parameters of the model |

To calculate the values of a and b, we need to define an error function (also known as the cost function or loss function) for any line and choose the line that minimizes the error function.

The aim is to minimize the vertical distance between the fitted line and each data point.

This distance is known as the residual. Because positive and negative residuals (from data points above and below the line) would cancel each other out, we use the sum of the squares of the residuals.

This will be our loss function, and the method is called Ordinary Least Squares (OLS).
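To make the loss concrete, here is a minimal sketch that evaluates the sum of squared residuals for a candidate line against some hypothetical data points (the data and the line parameters are made up for illustration):

```python
import numpy as np

# Hypothetical data points and a candidate line y = a*x + b
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
a, b = 2.0, 0.0

# Residuals: vertical distances between each point and the line
residuals = y - (a * x + b)

# Sum of squared residuals -- the quantity OLS minimizes
loss = np.sum(residuals ** 2)
print(loss)  # ~0.07
```

OLS picks the `a` and `b` that make this quantity as small as possible over all candidate lines.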

Wikipedia describes OLS as the following:

OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variable.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface — the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.

To put that into some human speak, the axis of the dependent variable is our y-axis, so we sum the squared vertical distances between each data point in the set and the corresponding point on the regression line. The smaller the distances, the better the fit.

When we call the fit method from our LinearRegression object, we are actually calculating the parameters of the line by performing OLS under the hood.
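We can verify this with a small sketch: fitting `LinearRegression` on some hypothetical data and comparing its parameters to the closed-form OLS solution computed with NumPy's least-squares solver (the data here is made up and exact, so both should recover the same line):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical 1D dataset: exactly y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Fit with scikit-learn
reg = LinearRegression()
reg.fit(X, y)

# Closed-form OLS: solve for [a, b] in y = a*x + b
A = np.hstack([X, np.ones((len(X), 1))])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(reg.coef_[0], reg.intercept_)  # ~2.0, ~1.0
print(a, b)                          # the same parameters
```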

Higher dimensions of linear regression

So far, our examples have worked in a dimension that is easy to understand, with y being calculated from one feature on the x-axis (from yesterday's example, this was "Number Of Rooms (feature) vs Value Of House (target variable)").

However, in the real world, we often have more than one feature.

To calculate multiple features (or dimensions), our linear regression equation becomes the following:

Linear equation for higher dimensions

y = a_1x_1 + a_2x_2 + ... + a_nx_n + b

In application, the Scikit-learn API can help us with this as we pass two arrays to the fit method:

  1. An array with the features.
  2. An array with the target variable.

Let's do just that and see how it works.

Applying the train/test split to our dataset

In our file docs/linear_regression.ipynb, we can add the following:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)

print(reg_all.score(X_test, y_test))
# outputs 0.711226005748496
```

The default score method for linear regression is R squared. For more details, see the documentation.
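R squared measures the proportion of the variance in the target that the model explains. As a sanity check, here is a sketch that computes it by hand for some hypothetical true and predicted values, and confirms it matches scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r_squared)                   # ~0.9925
print(r2_score(y_test, y_pred))    # same value
```

A value of 1.0 means a perfect fit; a value of 0.0 means the model does no better than always predicting the mean of the target.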

Note: You will never use Linear Regression out of the box like this. You will almost always want to use regularization. We will dive into this in the next part.


Today's post covered the math that describes the line generated by a linear regression fit.

We then spoke about how this calculation is worked out with more dimensions added into the mix.

Finally, we demonstrated this with a train_test_split and LinearRegression object.

As noted in the last section, this is not how you would use Linear Regression in practice. You will (almost) always want to use regularization.

This will be our topic in tomorrow's post.

Resources and further reading

Photo credit: deepakrautela


Dennis O'Keeffe

  • Melbourne, Australia

Hi, I am a professional Software Engineer. Formerly of Culture Amp, UsabilityHub, Present Company and NightGuru.
I am currently working on Visibuild.

