Logistic Regression Basics
This is a basic look at logistic regression, implemented on an example loaded from a CSV file. The CSV file with the data is not included, but this overview shows how to read the CSV in a particular way to give you your dependent and independent variables.
Evaluating the model's performance and reducing the set of independent variables to improve it are not covered in this basic overview.
The original text below includes mathematical formulas that do not render as mathematical expressions on the blog. Some familiarity with LaTeX will be required to interpret the expressions used.
Logistic Regression Intuition
This section can be quite difficult - there will be some math.
We already know about multiple linear regression and similar models (DV on the y axis, IV on the x axis).
What happens if we classify things along a graph, e.g. with 0 and 1 on the y axis and age on the x axis? The outcome is very black and white, but at the same time we can intuitively see some correlation.
In the example given above we wouldn't use a linear model, as you can imagine. Suppose instead that you were able to throw in probabilities between 0 and 1. Wherever the line lies between 0 and 1, its height at x[hat] could be read as a probability, and the parts above and below that range could be interpreted as 100% and 0% respectively. This would be a VERY basic but sensible attempt to describe the model.
The scientific approach
If we take the linear model $y = b_0 + b_1 x$, substitute it into the sigmoid function $p = \frac{1}{1 + e^{-y}}$ and solve for $y$, we get $\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x$. This last equation is the one for logistic regression.
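As a quick numerical check of these formulas, here is a minimal sketch in plain Python. The coefficients are hypothetical values picked only for illustration, not fitted from the example data. It shows that the sigmoid turns the linear score into a probability and that the log-odds turns it back.

```python
import math


def sigmoid(y):
    """Map a linear score y to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))


def logit(p):
    """Inverse of the sigmoid: the log-odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Hypothetical intercept and slope, chosen only for illustration.
b0, b1 = -4.0, 0.1

for age in (20, 30, 40, 50, 60):
    y = b0 + b1 * age   # linear part
    p = sigmoid(y)      # probability of the outcome being 1
    # logit(p) recovers the linear score, up to floating-point rounding.
    print(age, round(p, 3), round(logit(p), 3))
```

With these made-up coefficients the probability rises with age and crosses 0.5 at age 40, where the linear score is zero.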
Based on the above formula and plugging in the example data, we get the best-fitting curve.
If we now take particular ages along the x axis, such as 20, 30, 40, 50 etc., we can find the predicted probability for each one and, from it, the predicted value y[hat], which will be a 0 or a 1: the higher the probability, the higher the chance of a 1. Any probability that is less than 0.5 is projected down to 0, whereas anything else is projected up to 1.
After applying the model, we can start drawing conclusions.
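The 0.5 cut-off can be sketched in a couple of lines. The probabilities below are invented for illustration and the helper name `classify` is hypothetical.

```python
def classify(probabilities, threshold=0.5):
    """Project each probability down to 0 or up to 1."""
    return [0 if p < threshold else 1 for p in probabilities]


# Hypothetical predicted probabilities for ages 20, 30, 40 and 50.
p_hat = [0.12, 0.27, 0.50, 0.73]
print(classify(p_hat))  # [0, 0, 1, 1]
```

Note that a probability of exactly 0.5 lands on the 1 side here, matching the "less than 0.5" rule.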
Implementation in Python
Using our standard setup, we want to predict whether or not users made a purchase (in this example, of an SUV) using their age and salary.
For accurate predictions we use feature scaling, and we also split the data into a training set and a test set.
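A minimal sketch of the split and scaling steps with scikit-learn follows. Since the CSV is not included, the data here is synthetic: the column meanings (age, salary), the value ranges and the rule generating the labels are all assumptions made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the missing CSV (assumed columns: age, salary).
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)

# Made-up rule: older and better-paid users buy more often, plus noise.
noise = rng.normal(0, 1, size=400)
y = (0.1 * (age - 40) + (salary - 80_000) / 30_000 + noise > 0).astype(int)

# Hold out 25% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training set only, then reuse it on the test set,
# so that no information from the test set leaks into training.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, X_test.shape)
```

Scaling matters here because age and salary live on very different numeric ranges.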
Fitting the logistic regression model to the Training Set
In order to make a prediction on the X_test:
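Here is a minimal sketch of both steps, fitting the model and then predicting on X_test, using scikit-learn's `LogisticRegression`. The data is again a synthetic stand-in for the missing CSV, with assumed age and salary columns and a made-up labelling rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the missing CSV (assumed columns: age, salary).
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
noise = rng.normal(0, 1, size=400)
y = (0.1 * (age - 40) + (salary - 80_000) / 30_000 + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit the logistic regression model to the training set.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predict the class (0 or 1) for each test observation.
y_pred = classifier.predict(X_test)

# The underlying probabilities of class 1, before the cut-off is applied.
y_prob = classifier.predict_proba(X_test)[:, 1]

print(y_pred[:10])
print(y_prob[:10].round(2))
```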
Checking the fit predictions using the Confusion Matrix
We do this by making a confusion matrix from the true test labels and the predictions.
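As a small illustration of how such a matrix is read, the sketch below uses scikit-learn's `confusion_matrix` on six hand-written labels. These labels are invented for the example and are not results from the SUV data.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: what actually happened vs. what the model predicted.
y_test = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

# Correct predictions sit on the diagonal.
accuracy = cm.trace() / cm.sum()
print(round(accuracy, 3))  # 0.667
```

The off-diagonal entries are the mistakes: one non-buyer predicted as a buyer, and one buyer predicted as a non-buyer.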
Visualising the predictive power using a graph
There is a lot of code required to visualise this:
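The sketch below shows one common way to draw such a graph with matplotlib: predict a class for every point on a fine grid, colour the grid by prediction, then overlay the observations. It is an illustration under stated assumptions rather than the original code: the data is synthetic (assumed age and salary columns), and the output file name is hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove for an interactive window
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set (assumed columns: age, salary).
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=300)
salary = rng.integers(15_000, 150_001, size=300)
noise = rng.normal(0, 1, size=300)
y_set = (0.1 * (age - 40) + (salary - 80_000) / 30_000 + noise > 0).astype(int)
X_set = StandardScaler().fit_transform(
    np.column_stack([age, salary]).astype(float)
)

classifier = LogisticRegression(random_state=0).fit(X_set, y_set)

# Build a fine grid covering the data, with a margin of 1 on each side.
x1, x2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01),
)

# Predict a class for every grid point: these are the prediction regions.
grid_pred = classifier.predict(
    np.column_stack([x1.ravel(), x2.ravel()])
).reshape(x1.shape)

plt.contourf(x1, x2, grid_pred, alpha=0.5, cmap=ListedColormap(["red", "green"]))

# Overlay the actual observations, coloured by their true class.
for label, colour, name in ((0, "red", "did not buy"), (1, "green", "bought")):
    mask = y_set == label
    plt.scatter(X_set[mask, 0], X_set[mask, 1], c=colour,
                edgecolors="black", label=name)

plt.title("Logistic Regression (Training set)")
plt.xlabel("Age (scaled)")
plt.ylabel("Salary (scaled)")
plt.legend()
plt.savefig("logistic_regression_training.png")  # hypothetical file name
```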
How do we interpret the graph?
The red points are the training set observations where the dependent variable purchased = 0, and the green points are those where purchased = 1.
In our example, red are those who did not buy the SUV and green are those who did.
Looking at the two axes, those with lower salaries are mostly red, i.e. they didn't buy the SUV. We can see that those with higher salaries are more likely to have bought the SUV.
Another observation is that those older than average, even with a lower salary, were more likely to buy the SUV.
What is the point of the classifiers?
The goal is to classify the right users into the right categories. We do this by plotting the prediction regions: in the graph, the red region is where the classifier predicts that users don't buy the SUV, and the green region is where it predicts that they do.
Each data point is the actual result; the region is the classifier's estimate.
When we have a linear classifier, the boundary will always be a straight line.
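To see why, here is a short sketch for the two-feature case. It writes the features as $x_1$ and $x_2$ and adds a second coefficient $b_2$; these symbols are assumptions that extend the single-feature formula above. The boundary is where the classifier is undecided, i.e. where $p = 0.5$:

$$
p = 0.5
\iff \ln\left(\frac{p}{1 - p}\right) = 0
\iff b_0 + b_1 x_1 + b_2 x_2 = 0
\iff x_2 = -\frac{b_0 + b_1 x_1}{b_2} \quad (b_2 \neq 0)
$$

The last form is the equation of a straight line in the $(x_1, x_2)$ plane.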
Checking the results when applied to the Test Set
The results that we can see from this actually come from the same confusion matrix that we saw before.