Familiarity with Conda, the package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post "The Definitive Guide to Conda Environments" on Towards Data Science.
Familiarity with JupyterLab. See here for my post on JupyterLab.
These projects will also run Python notebooks in VSCode with the Jupyter extension. If you do not use VSCode, it is expected that you know how to run notebooks (or can adapt the steps to whatever works best for you).
Getting started
Let's create the supervised-learning-with-scikit-learn-template directory and install the required packages.
# Make the `supervised-learning-with-scikit-learn-template` directory
$ mkdir supervised-learning-with-scikit-learn-template
$ cd supervised-learning-with-scikit-learn-template
# Create a docs folder to place our notebook
$ mkdir docs
$ touch docs/supervised-learning-with-scikit-learn-template.ipynb
# Install our required dependencies
$ conda install scikit-learn pandas numpy matplotlib ipykernel
At this stage, we are ready to take a first look at some of the packages we will be using over the upcoming posts.
There will be more in-depth posts over the coming days with each package.
Today's post includes a short look at the iris dataset provided by scikit-learn.
With this in mind, we can now begin adding code to our notebook.
Writing our first notebook
We will write seven cells in the notebook:
Importing our required packages and setting a graph style.
Exploring the iris dataset.
Assigning the iris features and target to the X and y variables.
Creating and exploring the data frame.
Visualizing the output and making sense of the data.
Creating a k-nearest neighbors classifier.
Applying the classifier to some unlabelled data and assigning predicted classes to that data.
Importing required packages
In our file docs/supervised-learning-with-scikit-learn-template.ipynb, we can add the following:
# Importing our required libraries
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
We are using four main libraries here:
sklearn, which includes simple and efficient tools for predictive data analysis.
pandas, which gives us data frames for working with tabular data.
numpy, which gives us fast numerical arrays and array operations.
matplotlib, which we use for plotting.
Finally, we are updating the pyplot style to use ggplot for aesthetics. More on that can be found in the docs here.
Exploring the dataset
As a first look, we will explore the dataset with some helpful functions to get a better idea of what is happening.
# Exploring the Iris dataset
iris = datasets.load_iris()
type(iris) # sklearn.datasets.base.Bunch - a dictionary-like object with key-value pairs
print(iris.keys()) # dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
print(iris.feature_names) # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
type(iris.data) # numpy.ndarray
type(iris.target) # numpy.ndarray
iris.data.shape # (150, 4) - 150 rows and 4 columns
iris.target_names # array(['setosa', 'versicolor', 'virginica'], dtype='<U10') these will be encoded as 0, 1, 2
Some things to take away:
iris.data is our features for the data (also known as independent or predictor variables). There are 4 features (4 columns) in the data.
The features themselves can be explored with the feature_names property. In this data, the features are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).
We notice that the target is a vector of integers. Our three possible classes of setosa, versicolor and virginica will be encoded as 0, 1, 2.
iris.data.shape tells us that there are 150 rows of data to use as historical data, helping us find features which might be useful in identifying future entries.
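To double-check that encoding, here is a small sketch (assuming the standard iris dataset that ships with scikit-learn):
# Confirm how the classes are encoded and balanced
print(np.unique(iris.target))    # [0 1 2]
print(np.bincount(iris.target))  # [50 50 50] - 50 samples per class
print(iris.target_names[0])      # setosa - the species name behind label 0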
Assigning the iris dataset to a variable
The next step is to assign the data to more descriptive variable names.
Our features are assigned to X while the target variables are assigned to y.
# Setting our features to X and our target variables to y
X = iris.data
y = iris.target
These X and y variables are what we will feed into the scatter matrix and the classifier below.
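The scatter matrix in the next section expects a data frame named df, so we also need the "creating and exploring the data frame" cell from our list. Here is a minimal sketch of that cell, assuming the frame is built directly from X with the feature names as column headers:
# Create a data frame from the features so the columns carry their names
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())  # preview the first five rows of the data frame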
Visualizing the output
Finally, we can visualize the output by using a scatter matrix.
The matrix is a grid of scatter plots that shows the relationship between each pair of features. It allows us to explore many relationships in one chart.
# Help visualize the data.
# c sets the point colors, so we color by species (the target values).
# figsize is the size of the figure.
# s is the marker size and marker is the shape of the points.
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')
# The diagonal plots are histograms of the feature for that row and column.
# The off-diagonal plots are scatter plots of the column feature vs the row feature, colored by the target variable.
# We can see that petal width and petal length are highly correlated.
plt.show()
In our notebook, this will output the following scatter matrix:
Scatter matrix in VSCode
It is up to us to interpret the data.
On the diagonal, we can see histograms showing the distribution of the feature for that row and column.
The colors on the scatter plots are assigned by our target values. As we have three target classes, we get three different colors plotted out.
The rest are scatter plots of the column feature vs the row feature, colored by the target variable.
Something you will notice in the petal length vs petal width scatter plot (second from the bottom, on the right) is a near-linear grouping of points. This tells us that there is a strong correlation between the two features.
You can read more about interpreting scatter plots here.
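If you want to put a number on that relationship, here is a quick sketch using the data frame from earlier (assuming the column names match iris.feature_names):
# Pearson correlation between the two petal features
print(df[['petal length (cm)', 'petal width (cm)']].corr())
# The off-diagonal value is close to 1, matching the near-linear grouping in the plot.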
Constructing a classifier
There are different algorithms for classifying data. In our example, we will be going with k-nearest neighbors, an algorithm that creates prediction boundaries to label data based on the k closest data points.
We will do more of a deep dive on this classifier in another blog post. For now, we will see how to construct the classifier and train it against our labelled data.
from sklearn.neighbors import KNeighborsClassifier
# Set n_neighbors=6 to create boundaries based on the 6 closest neighbors.
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris.data, iris.target)
The KNeighborsClassifier docs are helpful for understanding more about the classifier and the available arguments.
In general, there are defaults for all possible arguments. Taken from the docs:
n_neighbors: int, default=5
Number of neighbors to use by default for kneighbors queries.
weights: {`uniform`, `distance`} or callable, default=`uniform`
weight function used in prediction. Possible values:
`uniform` : uniform weights. All points in each neighborhood are weighted equally.
`distance` : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
[callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
algorithm: {`auto`, `ball_tree`, `kd_tree`, `brute`}, default=`auto`
Algorithm used to compute the nearest neighbors:
`ball_tree` will use BallTree
`kd_tree` will use KDTree
`brute` will use a brute-force search.
`auto` will attempt to decide the most appropriate algorithm based on the values passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size: int, default=30
Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p: int, default=2
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric: str or callable, default=`minkowski`
the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.
metric_params: dict, default=None
Additional keyword arguments for the metric function.
n_jobs: int, default=None
The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. Doesn't affect fit method.
Again, we will deep dive into this in another post, but all you need to understand in our code is that we are overriding the default n_neighbors of 5 with 6, so predictions are made against the six closest neighbors.
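To make a couple of those arguments concrete, here is a hedged sketch of a classifier with some of them set explicitly (illustrative values only; it is not the configuration used in this post):
# Illustrative only: weight votes by distance instead of uniformly
knn_weighted = KNeighborsClassifier(
    n_neighbors=6,       # the same neighborhood size as our classifier
    weights='distance',  # closer neighbors get a larger say in the vote
    metric='minkowski',  # the default metric; with p=2 this is Euclidean distance
    p=2,
)
knn_weighted.fit(iris.data, iris.target)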
The knn.fit(iris.data, iris.target) invocation will train the classifier on the data. As soon as we have called fit, the classifier is ready to make predictions.
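Before moving on to genuinely unseen data, a quick sanity check can confirm the fitted classifier works (a small sketch; scoring on the training data is only a smoke test, not a proper evaluation):
# Predict on rows we trained on and compare against the known labels
print(knn.predict(iris.data[:5]))         # predictions for the first five rows
print(iris.target[:5])                    # the actual labels for those rows
print(knn.score(iris.data, iris.target))  # mean accuracy on the training data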
Predicting unlabeled data
To make predictions, we need to call predict on the classifier and pass it some unlabeled data.
We can use what we learned already about data frames to display that data as mapped to their features.
# A set of unlabeled data.
X_new = np.array([[5.6, 2.8, 3.9, 1.1], [5.7, 2.6, 3.8, 1.3], [4.7, 3.2, 1.3, 0.2]])
X_new.shape # (3, 4) - 3 data points and 4 features
# Showing the data frame
df_new = pd.DataFrame(X_new, columns=iris.feature_names)
print(df_new.head()) # print unlabeled data as a data frame
Finally, we can apply what we have done to predict the class of the unlabeled data.
prediction = knn.predict(X_new)
print(prediction) # [1 1 0] - the predicted class for each of the three data points
# The prediction is [1 1 0] which maps to [versicolor versicolor setosa]
Our prediction printed out [1 1 0], which when decoded and mapped back to our labels gives [versicolor versicolor setosa].
Therefore, our classifier has predicted that the first and second data points are versicolor and that the final data point is a setosa.
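If you would rather not do that mapping by hand, the target_names array can decode the predictions directly (a small sketch):
# Map the encoded predictions back to the species names
print(iris.target_names[prediction])  # ['versicolor' 'versicolor' 'setosa']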
Summary
Today's post set up a starting repository for all future posts on Machine Learning.
We then wrote a Python notebook whose cells show how to load and explore the iris dataset, how to visualize it, and how to create a classifier and apply it to unlabeled data.
Future posts will start to become more granular and dive deeper into particular topics around classifiers (and more machine learning applications).