Radial Basis Function Networks – Regression for ML

Machine learning is an expansive field – one often made better by techniques common to data science like regression.

In this tutorial, we’re going to explore the topic of neural networks and discover how we can make use of regression techniques when working with them. We’ll also dive into how to use Python to work with these concepts.

Let’s dive in and start learning!

Intro to Neural Networks and RBF Nets

Neural Networks are very powerful models for classification tasks. But what about regression? Suppose we had a set of data points and wanted to project that trend into the future to make predictions. Regression has many applications in finance, physics, biology, and many other fields.

Radial Basis Function Networks (RBF nets) are used for exactly this scenario: regression or function approximation. We have some data that represents an underlying trend or function and want to model it. RBF nets can learn to approximate the underlying trend using many Gaussians/bell curves.

Prerequisites

You can download the full code here.

Before we begin, please familiarize yourself with neural networks, backpropagation, and k-means clustering. Alternatively, if this is your first foray into the world of machine learning, we recommend first learning the fundamentals of Python. You can check out our tutorials, or try a robust curriculum like the Python Mini-Degree to do so.

For educators, we can also recommend Zenva Schools. Not only does the platform have tons of online courses for Python, but is suitable for K12 environments with classroom management tools, course plans, and more.


RBF Nets

An RBF net is similar to a 2-layer network. We have an input that is fully connected to a hidden layer. Then, we take the output of the hidden layer and perform a weighted sum to get our output.

But what is that inside the hidden layer neurons? That is a Gaussian RBF! This differentiates an RBF net from a regular neural network: we’re using an RBF as our “activation” function (more specifically, a Gaussian RBF).

Gaussian Distribution

The first question you may have is “what is a Gaussian?” It’s the most famous and important of all statistical distributions. A picture is worth a thousand words so here’s an example of a Gaussian centered at 0 with a standard deviation of 1.

This is the Gaussian or normal distribution! It is also called a bell curve sometimes. The function that describes the normal distribution is the following

\[ \mathcal{N}(x; \mu, \sigma^2) = \displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\displaystyle\frac{(x-\mu)^2}{2\sigma^2}}  \]

That looks like a really messy equation! And it is, so we’ll use $\mathcal{N}(x; \mu, \sigma^2)$ to represent it. If we look at it, we notice there is one input and two parameters. First, let’s discuss the parameters and how they change the Gaussian. Then we can discuss what the input means.

The two parameters are called the mean $\mu$ and standard deviation $\sigma$. In some cases, the standard deviation is replaced with the variance $\sigma^2$, which is just the square of the standard deviation. The mean of the Gaussian simply shifts the center of the Gaussian, i.e. the “bump” or top of the bell. In the image above, $\mu=0$, so the largest value is at $x=0$.

The standard deviation is a measure of the spread of the Gaussian. It affects the “wideness” of the bell. Using a larger standard deviation means that the data are more spread out, rather than closer to the mean.

Technically, the above function is called the probability density function (pdf) and it tells us the probability of observing an input $x$, given that specific normal distribution. But we’re only interested in the bell-curve properties of the Gaussian, not the fact that it represents a probability distribution.
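To make this concrete, here is a minimal sketch (the function name and sample values are ours, and it assumes NumPy is available) that evaluates the formula above and shows how changing $\mu$ and $\sigma$ moves and widens the bell:

import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the 1D Gaussian pdf N(x; mu, sigma^2) from the formula above."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-5, 5, 201)
standard = gaussian_pdf(x)              # centered at 0, sigma = 1
shifted  = gaussian_pdf(x, mu=2.0)      # the bump moves to x = 2
wider    = gaussian_pdf(x, sigma=2.0)   # same center, wider spread, lower peak

print(standard.max(), shifted.max(), wider.max())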

Gaussians in RBF nets

Why do we care about Gaussians? We can use a linear combination of Gaussians to approximate any function!

Source: https://terpconnect.umd.edu/~toh/spectrum/CurveFittingB.html

In the figure above, the Gaussians have different colors and are weighted differently. When we take the sum, we get a continuous function! To do this, we need to know where to place the Gaussian centers $c_j$ and their standard deviations $\sigma_j$.
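As a quick illustration of this idea – not the RBF net we build later, just a toy sketch with hand-picked centers, widths, and weights – summing a few weighted Gaussians already produces a smooth curve:

import numpy as np

def gaussian(x, c, s):
    # unnormalized Gaussian bump centered at c with width s
    return np.exp(-(x - c)**2 / (2 * s**2))

x = np.linspace(0, 1, 200)

# purely illustrative centers, widths, and weights
centers = [0.2, 0.5, 0.8]
widths  = [0.1, 0.15, 0.1]
weights = [1.0, -0.5, 0.75]

# the approximation is just a weighted sum of bell curves
approx = sum(w * gaussian(x, c, s) for w, c, s in zip(weights, centers, widths))
print(approx.shape)  # (200,) -- one smooth value per input point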

We can use k-means clustering on our input data to figure out where to place the Gaussians. The reasoning behind this is that we want our Gaussians to “span” the largest clusters of data since they have that bell-curve shape.

The next step is figuring out what the standard deviations should be. There are two approaches we can take: set each standard deviation to be that of the points assigned to a particular cluster $c_j$, or use a single standard deviation for all clusters, $\sigma_j = \sigma\ \forall j$, where $\sigma=\frac{d_\text{max}}{\sqrt{2k}}$, $d_\text{max}$ is the maximum distance between any two cluster centers, and $k$ is the number of cluster centers.

But wait, how many Gaussians do we use? Well that’s a hyperparameter called the number of bases or kernels $k$.

Backpropagation for RBF nets

K-means clustering is used to determine the centers $c_j$ for each of the radial basis functions $\varphi_j$. Given an input $x$, an RBF network produces a weighted sum output.

\begin{equation*}
F(x)=\displaystyle\sum_{j=1}^k w_j\varphi_j(x, c_j) + b
\end{equation*}

where $w_j$ are the weights, $b$ is the bias, $k$ is the number of bases/clusters/centers, and $\varphi_j(\cdot)$ is the Gaussian RBF:

\begin{equation*}
\varphi_j(x, c_j) = \exp\left(\displaystyle\frac{-||x-c_j||^2}{2\sigma_j^2} \right)
\end{equation*}

There are other kinds of RBFs, but we’ll stick with our Gaussian RBF. (Notice that we don’t have the constant up front, so our Gaussian is not normalized, but that’s ok since we’re not using it as a probability distribution!)

Using these definitions, we can derive the update rules for $w_j$ and $b$ for gradient descent. We use the quadratic cost function to minimize.

\begin{equation*}
C = \displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2
\end{equation*}

We can derive the update rule for $w_j$ by computing the partial derivative of the cost function with respect to each $w_j$. Since we will update the parameters one training example at a time (stochastic/online gradient descent), we drop the sum over $i$ and work with a single example $(x^{(i)}, y^{(i)})$.

\begin{align*}
\displaystyle\frac{\partial C}{\partial w_j} &= \displaystyle\frac{\partial C}{\partial F} \displaystyle\frac{\partial F}{\partial w_j}\\
&=\displaystyle\frac{\partial }{\partial F}[\displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2]~\cdot~\displaystyle\frac{\partial }{\partial w_j}[\displaystyle\sum_{j=1}^k w_j\varphi_j(x,c_j) + b]\\
&=-(y^{(i)}-F(x^{(i)}))~\cdot~\varphi_j(x,c_j)\\
w_j &\gets w_j + \eta~(y^{(i)}-F(x^{(i)}))~\varphi_j(x,c_j)
\end{align*}

Similarly, we can derive the update rules for $b$ by computing the partial derivative of the cost function with respect to $b$.

\begin{align*}
\displaystyle\frac{\partial C}{\partial b} &= \displaystyle\frac{\partial C}{\partial F} \displaystyle\frac{\partial F}{\partial b}\\
&=\displaystyle\frac{\partial }{\partial F}[\displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2]~\cdot~\displaystyle\frac{\partial }{\partial b}[\displaystyle\sum_{j=1}^k w_j\varphi_j(x,c_j) + b]\\
&=-(y^{(i)}-F(x^{(i)}))\cdot 1\\
b &\gets b + \eta~(y^{(i)}-F(x^{(i)}))
\end{align*}

Now we have our backpropagation rules!

RBF Net Code

Now that we have a better understanding of how we can use neural networks for function approximation, let’s write some code!

First, we have to define our “training” data and RBF. We’re going to code up our Gaussian RBF.

import numpy as np

def rbf(x, c, s):
    # Gaussian RBF with center c and width (standard deviation) s
    return np.exp(-1 / (2 * s**2) * (x-c)**2)

Now we’ll need to use the k-means clustering algorithm to determine the cluster centers. I’ve already coded up a function for you that gives us the cluster centers and the standard deviations of the clusters.

def kmeans(X, k):
    """Performs k-means clustering for 1D input
    
    Arguments:
        X {ndarray} -- A Mx1 array of inputs
        k {int} -- Number of clusters
    
    Returns:
        (ndarray, ndarray) -- A kx1 array of final cluster centers and
            a kx1 array of the cluster standard deviations
    """

    # randomly select initial clusters from input data
    clusters = np.random.choice(np.squeeze(X), size=k)
    prevClusters = clusters.copy()
    stds = np.zeros(k)
    converged = False

    while not converged:
        """
        compute distances for each cluster center to each point 
        where (distances[i, j] represents the distance between the ith point and jth cluster)
        """
        distances = np.squeeze(np.abs(X[:, np.newaxis] - clusters[np.newaxis, :]))

        # find the cluster that's closest to each point
        closestCluster = np.argmin(distances, axis=1)

        # update clusters by taking the mean of all of the points assigned to that cluster
        for i in range(k):
            pointsForCluster = X[closestCluster == i]
            if len(pointsForCluster) > 0:
                clusters[i] = np.mean(pointsForCluster, axis=0)

        # converge if clusters haven't moved
        converged = np.linalg.norm(clusters - prevClusters) < 1e-6
        prevClusters = clusters.copy()

    distances = np.squeeze(np.abs(X[:, np.newaxis] - clusters[np.newaxis, :]))
    closestCluster = np.argmin(distances, axis=1)

    clustersWithNoPoints = []
    for i in range(k):
        pointsForCluster = X[closestCluster == i]
        if len(pointsForCluster) < 2:
            # keep track of clusters with no points or 1 point
            clustersWithNoPoints.append(i)
            continue
        else:
            stds[i] = np.std(X[closestCluster == i])

    # if there are clusters with 0 or 1 points, take the mean std of the other clusters
    if len(clustersWithNoPoints) > 0:
        pointsToAverage = []
        for i in range(k):
            if i not in clustersWithNoPoints:
                pointsToAverage.append(X[closestCluster == i])
        pointsToAverage = np.concatenate(pointsToAverage).ravel()
        stds[clustersWithNoPoints] = np.mean(np.std(pointsToAverage))

    return clusters, stds

This code just implements the k-means clustering algorithm and computes the standard deviations. If a cluster has zero or one points assigned to it, we simply use the average standard deviation of the other clusters. (We can’t compute a standard deviation with no data points, and the standard deviation of a single data point is 0.)

We’re not going to spend too much time on k-means clustering. Visit the link at the top for more information.

Now we can get to the real heart of the RBF net by creating a class.

class RBFNet(object):
    """Implementation of a Radial Basis Function Network"""
    def __init__(self, k=2, lr=0.01, epochs=100, rbf=rbf, inferStds=True):
        self.k = k
        self.lr = lr
        self.epochs = epochs
        self.rbf = rbf
        self.inferStds = inferStds

        self.w = np.random.randn(k)
        self.b = np.random.randn(1)

We have options for the number of bases, learning rate, number of epochs, which RBF to use, and whether we want to use the standard deviations from k-means. We also initialize the weights and bias. Remember that an RBF net is a modified 2-layer network, so there’s only one weight vector and a single bias at the output node, since we’re approximating a 1D function (specifically, one output). If we had a function with multiple outputs (a function with a vector-valued output), we’d use multiple output neurons, our weights would be a matrix, and our bias a vector, as sketched below.
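As a hedged sketch of how the shapes would change in that vector-valued case (this multi-output variant is not implemented in this tutorial; the names are ours):

import numpy as np

k, n_outputs = 2, 3                  # hypothetical: 2 bases, 3 output dimensions
W = np.random.randn(k, n_outputs)    # weight matrix instead of a weight vector
b = np.random.randn(n_outputs)       # bias vector instead of a single bias

a = np.random.rand(k)                # activations of the k RBF neurons for one input
F = a.dot(W) + b                     # one predicted value per output dimension
print(F.shape)                       # (3,)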

Then, we have to write our fit function to compute our weights and biases. In the first few lines, we either use the standard deviations from the modified k-means algorithm, or we force all bases to use the same standard deviation computed from the formula. The rest is similar to backpropagation where we propagate our input going forward and update our weights going backward.

def fit(self, X, y):
    if self.inferStds:
        # compute stds from data
        self.centers, self.stds = kmeans(X, self.k)
    else:
        # use a fixed std 
        self.centers, _ = kmeans(X, self.k)
        dMax = max([np.abs(c1 - c2) for c1 in self.centers for c2 in self.centers])
        self.stds = np.repeat(dMax / np.sqrt(2*self.k), self.k)

    # training
    for epoch in range(self.epochs):
        for i in range(X.shape[0]):
            # forward pass
            a = np.array([self.rbf(X[i], c, s) for c, s in zip(self.centers, self.stds)])
            F = a.T.dot(self.w) + self.b

            loss = (y[i] - F).flatten() ** 2
            print('Loss: {0:.2f}'.format(loss[0]))

            # backward pass
            error = -(y[i] - F).flatten()

            # online update
            self.w = self.w - self.lr * a * error
            self.b = self.b - self.lr * error

For verbosity, we’re printing the loss at each step. Notice we’re also performing an online update, meaning we update our weights and biases each input. Alternatively, we could have done a batch update, where we update our parameters after seeing all training data, or minibatch update, where we update our parameters after seeing a subset of the training data.
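For comparison, here is a rough sketch of how the inner training loop could be rewritten as a batch update, accumulating the gradient over every example before taking a single step. This is not part of the tutorial code; it simply reuses the same attributes as our RBFNet class:

# hypothetical batch variant of the training loop inside fit
for epoch in range(self.epochs):
    grad_w = np.zeros_like(self.w)
    grad_b = np.zeros_like(self.b)
    for i in range(X.shape[0]):
        a = np.array([self.rbf(X[i], c, s) for c, s in zip(self.centers, self.stds)])
        error = -(y[i] - (a.T.dot(self.w) + self.b)).flatten()
        grad_w += a * error          # accumulate gradients over the whole dataset
        grad_b += error
    self.w = self.w - self.lr * grad_w / X.shape[0]   # one parameter step per epoch
    self.b = self.b - self.lr * grad_b / X.shape[0]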

Making a prediction is as simple as propagating our input forward.

def predict(self, X):
    y_pred = []
    for i in range(X.shape[0]):
        a = np.array([self.rbf(X[i], c, s) for c, s in zip(self.centers, self.stds)])
        F = a.T.dot(self.w) + self.b
        y_pred.append(F)
    return np.array(y_pred)

Notice that we’re allowing for matrix inputs, where each row is an example.

Finally, we can write code to use our new class. For our training data, we’ll be generating 100 samples from the sine function. Then, we’ll add some uniform noise to our data.

import matplotlib.pyplot as plt

# sample inputs and add noise
NUM_SAMPLES = 100
X = np.random.uniform(0., 1., NUM_SAMPLES)
X = np.sort(X, axis=0)
noise = np.random.uniform(-0.1, 0.1, NUM_SAMPLES)
y = np.sin(2 * np.pi * X)  + noise

rbfnet = RBFNet(lr=1e-2, k=2)
rbfnet.fit(X, y)

y_pred = rbfnet.predict(X)

plt.plot(X, y, '-o', label='true')
plt.plot(X, y_pred, '-o', label='RBF-Net')
plt.legend()

plt.tight_layout()
plt.show()

We can plot our approximated function against our real function to see how well our RBF net performed.

From our results, our RBF net performed pretty well! If we wanted to evaluate our RBF net more rigorously, we could sample more points from the same function, pass them through our RBF net, and use the summed Euclidean distance (or a related error metric) as in the sketch below.
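Here is a hedged sketch of such an evaluation, assuming the rbfnet object trained above and freshly sampled test points (the variable names are ours):

# evaluate on freshly sampled points from the same underlying function
X_test = np.sort(np.random.uniform(0., 1., 50))
y_test = np.sin(2 * np.pi * X_test)

y_test_pred = rbfnet.predict(X_test).flatten()

total_error = np.sum(np.abs(y_test - y_test_pred))       # summed distance
rmse = np.sqrt(np.mean((y_test - y_test_pred) ** 2))     # root-mean-square error
print('summed error: {0:.3f}, RMSE: {1:.3f}'.format(total_error, rmse))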

We can try messing around with some key parameters, like the number of bases. What if we increase the number of bases to 4?

Our results aren’t too great! This is because our original function is shaped the way that it is, i.e., it has two bumps. If we had a more complicated function, then we could use a larger number of bases. If we use too many bases, we’ll start overfitting!

Another parameter we can change is the standard deviation. How about we use a single standard deviation for all of our bases instead of each one getting its own?
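Trying that is just a matter of constructing the network with inferStds turned off, which makes fit use the shared $\sigma = d_\text{max}/\sqrt{2k}$ formula:

# use one shared standard deviation computed from d_max / sqrt(2k)
rbfnet = RBFNet(lr=1e-2, k=2, inferStds=False)
rbfnet.fit(X, y)
y_pred = rbfnet.predict(X)

plt.plot(X, y, '-o', label='true')
plt.plot(X, y_pred, '-o', label='RBF-Net (shared std)')
plt.legend()
plt.show()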

Our plot is much smoother! This is because the Gaussians that make up our reconstruction all have the same standard deviation.

There are other parameters we can change like the learning rate; we could use a more advanced optimization algorithm; we could try layering Gaussians; etc.

To summarize, RBF nets are a special type of neural network used for regression. They are similar to 2-layer networks, but we replace the activation function with a radial basis function, specifically a Gaussian radial basis function. We take each input vector and feed it into each basis. Then, we do a simple weighted sum to get our approximated function value at the end. We train these using backpropagation like any neural network! Finally, we implemented RBF nets in a class and used it to approximate a simple function.

RBF nets are a great example of neural models being used for regression!

Want to learn more about how Python can help your career? Check out this article! You can also expand your Python skills with the Python Mini-Degree or, for teachers in need of classroom resources, the Zenva Schools platform.


How to do Cluster Analysis with Python – Data Science

You can access the full course here: Data Insights with Cluster Analysis

Want more Python topics for a school environment? Check out the Zenva Schools platform, which offers online courses, classroom management tools, reporting, pre-made course plans, and more for teachers.

Part 1

In this video we are going to discuss Cluster Analysis. We will discuss the following topics:

  1. Intro to Cluster Analysis – what it is, what its different applications are, and the kinds of algorithms we can expect.
  2. K-means clustering
  3. Density-based Spatial Clustering of Applications with Noise (DBSCAN)
  4. Hierarchical Agglomerative Clustering (HAC)

k-means, DBSCAN and HAC are 3 very popular clustering algorithms which all take very different approaches to creating clusters.

Before diving in, you can also explore why you might want to learn these topics or just what career opportunities these skills will present you with!

Cluster Analysis

Imagine we have some data. In cluster analysis, we want to separate different groups based on the data, in an unsupervised manner – with no a priori information.

Flower petal length and width graph

Looking at a plot of the above data, we can say that it fits into 2 different groups – a cluster of points in the bottom left and a larger, elongated cluster on the top right. When we give this data to a clustering algorithm, it will create a split. Algorithms like k-means need to be told how many clusters we want. In some cases, we don’t need to specify the number of clusters. DBSCAN for instance is smart enough to figure out how many clusters there are in the data.

The data above is from the Iris dataset, created by the famous statistician R.A. Fisher, who collected measurements of 3 different species of flowers and plotted their properties such as petal width, petal length, sepal width, and sepal length. Since we are doing clustering, we have removed the class labels from the dataset, as recovering which data points belong together is exactly what the clustering algorithm is trying to do.
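If you want to reproduce this kind of experiment yourself, here is a minimal sketch using scikit-learn (assuming it is installed); we cluster the petal measurements without ever showing the algorithm the species labels:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data[:, 2:4]            # petal length and petal width only

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster assignment for each flower

print(labels[:10])               # cluster ids, not species labels
print(kmeans.cluster_centers_)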

Clustering

Clustering means grouping data into clusters so that the data in each cluster have similar attributes or properties. For example, the data in the small cluster in the above plot have small petal lengths and small petal widths.

There are several applications of clustering analysis used across a variety of fields:

  • Market analysis and segmentation
  • Medical imaging – X-rays, MRIs, fMRIs
  • Recommender systems – such as those used on Amazon.com
  • Geospatial data – latitude and longitude coordinates, etc.
  • Anomaly detection

People have used clustering algorithms to detect brain anomalies. We see below various brain images. C1 – C7 are various clusters. This is an example of clustering in the medical domain.

Brain images while brain is clustering info

Another example is a spatial analysis of user-generated ratings of venues on yelp.com

Spatial analysis with restaurant data clustered

Cluster Analysis

A set of points X1….Xn are taken in, analyzed and out comes a set of mappings from each point to a cluster (X1 -> C1, X2 -> C2 etc).

Visual representation of points going through cluster analysis

There are several such algorithms that will detect clusters. Some of these algorithms have additional parameters e.g. Number of clusters. These parameters vary for each algorithm.

The input however is a set of data points X1…Xn in any dimensionality i.e. 2D, 3D, 100D etc. For our purposes, we will stick with 2D points. It is hard to visualize data of higher dimensions though there are dimensionality reduction techniques that reduce say 100 dimensions to 2 so that they can be plotted.

The output is a cluster assignment where each point either belongs to a cluster or could be an outlier (noise).

Cluster analysis is a kind of unsupervised machine learning technique, as in general, we do not have any labels. There may be some techniques that use class labels to do clustering but this is generally not the case.

Summary

We discussed what clustering analysis is, various clustering algorithms, what are the inputs and outputs of these algorithms. We discussed various applications of clustering – not necessarily in the data science field.

Part 2

In this video, we will look at probably the most popular clustering algorithm i.e. k-means clustering.

Graph with blue and red points

This is a very popular, simple and easy-to-implement algorithm.

In this algorithm, we separate the data into k disjoint clusters. These clusters are defined such that they minimize the within-cluster sum-of-squares. We’ll discuss this more when we look at k-means convergence. Disjoint here means that 1 point cannot belong to more than 1 cluster.

There is only 1 parameter for this algorithm i.e. k (the number of clusters). We need to have an idea of k before we run the algorithm. Sometimes it is obvious how many clusters we should have, but sometimes it is not that clear. We will discuss later how to make this choice.

This algorithm is a baseline algorithm in clustering.

The cluster center/centroid is a point that represents the cluster. The figure above has a red and a blue cluster. X is the centroid – the average of the x and y coordinates. In the blue cluster the average of the x and y coordinates is somewhere in the middle represented by the X in the middle of the square.

K-means Clustering Algorithm

1. Randomly initialize the cluster centers. For example, in the above diagram, we pick 2 random points to initialize the clusters.
2. Assign each point to its nearest cluster center using a distance metric such as Euclidean distance.
3. Update the cluster centroids using the mean of the points assigned to them.
4. Go back to 2 until convergence (the cluster centroids stop moving or they move small imperceptible amounts).
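The four steps above translate almost directly into NumPy. Below is a rough, self-contained sketch for 2D points; the function name and defaults are ours, not from any library:

import numpy as np

def simple_kmeans_2d(X, k, tol=1e-6, max_iters=100):
    """Rough sketch of the four steps above for 2D points (X is an Nx2 array)."""
    # 1. randomly initialize the cluster centers from the data points
    centers = X[np.random.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = np.argmin(dists, axis=1)
        # 3. update each centroid to the mean of its assigned points
        new_centers = np.array([X[assignments == j].mean(axis=0) if np.any(assignments == j)
                                else centers[j] for j in range(k)])
        # 4. stop when the centroids barely move
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, assignments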

Let’s go through an example (fake data) and discuss some interesting things about this algorithm.

Fake data graph

Visually, we can see that there are 2 clusters present.

Let’s randomly assign the cluster centers.

Fake data graph with random cluster centers

Let’s now assign each point to the closest cluster.

Fake data graph with points assigned to cluster

The points are now colored blue or red depending on which centroid they are closer to.

Next we need to update our cluster centroids. For the blue centroid, we take the average of all the x-coordinates – this will be the new x-coordinate for the centroid. Similarly we look at all the y-coordinates for the blue points, take their average and this becomes the new y-coordinate for the centroid. Likewise for the red points.

When I do this, the centroids shift over.

Fake data graph with centroids shifted

Once again I need to figure out which centroid each point is close to, which gives me the following.

Fake data graph with points assigned to new centroids

Once again, I update my cluster centroids as before.

Fake data graph with centroids shifted again

If we try to do another shift, the centroids won’t move again. This is evident from the last 2 figures, where the same points are assigned to the same cluster centroids.

At this point, we say that k-means has converged.

Convergence

Convergence means that the cluster centroids either don’t move at all or move only a very small amount. We use a threshold value: if the centroids move less than that threshold, k-means has converged.

Math equation for determining k-means

Mathematically, k-means is guaranteed to converge in a finite number of iterations (assigning point to a cluster and shifting). It may take a long time, but will eventually converge. It does not say anything about best or optimal clustering, just that it will converge.

K-means is sensitive to where you initialize the centroids. There are a few techniques to do this:

  • Assign each cluster center to a random data point.
  • Choose k points to be farthest away from each other within the bounds of the data.
  • Repeat k-means over and over again and pick the average of the clusters.
  • Another advanced approach called k-means++ chooses the initial centers more carefully, spreading them out probabilistically. We won’t be getting into it though.
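In practice, libraries handle initialization for you. As a sketch with scikit-learn (assuming it is installed), you can pick between random initialization and k-means++ seeding, and run several restarts via n_init:

from sklearn.cluster import KMeans

# k-means++ seeding (the library default) with 10 random restarts
km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)

# plain random initialization, for comparison
km_rand = KMeans(n_clusters=3, init='random', n_init=10, random_state=0)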

Choosing k (How many clusters to use)

One way is to plot the data points and try different values to see what works the best. Another technique is called the elbow method.

Elbow method

Data graph with elbow point circled

Steps:

  • Choose some values of k and run the clustering algorithm
  • For each cluster, compute the within-cluster sum-of-squares between the centroid and each data point.
  • Sum up for all clusters, plot on a graph
  • Repeat for different values of k, keep plotting on the graph.
  • Then pick the elbow of the graph.

This is a popular method supported by several libraries.
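As a hedged sketch with scikit-learn, the within-cluster sum-of-squares of a fitted model is exposed as inertia_, so the elbow plot boils down to a short loop (this assumes X holds your data points and matplotlib is available):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)     # within-cluster sum-of-squares for this k

plt.plot(ks, wcss, '-o')
plt.xlabel('k')
plt.ylabel('within-cluster sum-of-squares')
plt.show()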

Advantages Of k-means

  • This is a widely known and used algorithm.
  • It’s also fairly simple to understand and easy to implement.
  • It is also guaranteed to converge.

Disadvantages of k-means

  • It is algorithmically slow, i.e., it can take a long time to converge.
  • It may converge to a local minimum rather than the optimal solution.
  • It’s also not very robust against varying cluster shapes e.g. It may not perform very well for elongated cluster shapes. This is because we use the same parameters for each cluster.

This was a quick overview of k-means clustering. Let’s now look at how it performs on different kinds of data sets.


Transcript 1

Hello, world, and thanks for joining me. My name is Mohit Deshpande. In this course, we’ll be learning about clustering analysis. In particular, we’re gonna use it in the context of data science, and we’re gonna analyze some data and see if we can segment out different kinds of customers so that we can provide them with all kinds of neat promotional discounts or special offers.

This is a very common thing that’s done for a lot of companies. We have a bunch of data. We wanna figure out unique groups of customers so that we can offer them special things.

In this course, we’ll be learning a little bit about three different kinds of clustering algorithms. First, I wanna introduce you to the concept of clustering, or what is clustering, and what are some other applications of it. Then we’ll get onto three different clustering algorithms, and what’s really neat is that they all approach clustering in a very different fashion.

So first, we’re gonna learn about a very popular kind of clustering called K-means clustering. Then we’ll look into a density-based clustering algorithm called DBSCAN. Then finally we’ll discuss a hierarchical clustering algorithm called hierarchical agglomerative clustering. And then we’ll see different kinds of data, where they work really well, and where they don’t work quite so well. So we’ll get a good idea of, at least some kind of notion, of which kind of clustering algorithms tend to work well on which kind of data.

Then, finally, we’re gonna tie everything together by looking at a real world data set and see if we can segment out different customers. We’ve been making courses since 2012, and we’re super-excited to have you on board. Online courses are a great way to learn new skills, and I take a lot of online courses myself.

Several courses consist mainly of video lessons that you can watch at your own pace and as many times as you want. So feel free to watch and re-watch and pause the video as many times as you want. We also have downloadable source code and project files and you’re getting everything that we build during the lesson.

It’s highly recommended that you code along with me. In my experience, that’s the best way to learn something is to get your feet wet. Lastly, we’ve seen the students who get the most out of these online courses are those who make a weekly plan and stick with it, depending, of course, on your own availability and learning style. Zenva, over the past six years, has taught all kinds of different topics on programming and game development, over 300,000 students, across 100 courses. The skills that they learned in these courses are completely transferrable. In fact, some of the students have used the skills that they’ve learned to advance their careers, to make a start-up, or to publish their own content from the skills that they’ve learned.

Thanks for joining, and I look forward to seeing the cool stuff you’ll be building. Now, without further ado, let’s get started.

Transcript 2

In this video, we are going to learn a little bit about cluster analysis. And this is a topic that we’re gonna be discussing over the duration of this course.

So just to give you an overview of the different things we’re gonna be covering, I’m gonna give you an introduction to cluster analysis, basically what is it and what are the different applications of it, as well as what kind of algorithms can we expect. And in fact, we’re gonna be covering three very popular algorithms, k-means clustering, DBSCAN, which stands for density-based spatial clustering of applications with noise, but usually we just call it DBSCAN, and then hierarchical agglomerative clustering, HAC.

These are three very popular clustering algorithms. And the interesting thing is they all take very different approaches to creating clusters. And we’re gonna get into all those in the subsequent videos. But first let’s talk a little bit about cluster analysis. And that’s what we’re gonna be focusing on primarily in this video, just to acquaint you with some of the terminology as well as some applications of cluster analysis for example. So clustering analysis, so imagine we have some data.

The whole point of clustering analysis is in an unsupervised way with no prior information, we want to be able to separate different groups based on the data that we have. And now sometimes these groups are predefined. You have the set of data like in this case, and you say, well, this seems, we plot this data, and you say, well, it seems to fit into two little groups. Here there’s a little clustering of points on the bottom left, and there’s a larger, kind of elongated cluster on the top right and so we might say, well, we can give a predefined number of clusters.

We want two clusters and we can give that to the clustering algorithms and then they’ll group these guys together. It’ll make a split and it actually, in some cases, we don’t need to specify the number of clusters. In fact, some algorithms, which is DBSCAN, are actually smart enough to be able to figure out how many clusters are based entirely on the data. But algorithms like k-means will actually need to be specified how many clusters that we have.

And so, for example, this data set is actually taken, it’s a very famous data set called the Iris Dataset, collected by Ronald Fisher, which is, and here is a quick historical side note, he’s probably the most important statistician of the 20th century. A lot of statistical techniques that we have that are used in all kinds of companies were originally some of his work, but he collected this dataset of flowers. He has 50 samples of each of three different species of flowers and he plots their measured properties like petal width, petal length, sepal width, and sepal length, and they’re all plotted out.

In this case, what I’ve actually done is removed the class labels, because usually when we’re doing clustering analysis, we don’t have the correct labels. In fact, that’s what the clustering is trying to give us. It’s trying to give us some notion that these things belong together and these other things belong together. So this is just a kind of data that you might expect with some clustering.

So clustering is taking our data and then putting it into groups such that the groups have some kind of similar properties or similar attributes here. So if we go back a slide here, so we have one cluster at the bottom left for example. That might be considered a cluster where the flowers in that cluster have a small petal length and a smaller petal width, for example. That’s an example of grouping, as I’m talking about.

And there’s so many different applications of clustering analysis, not just used for something like data science. But also things like medical imaging for things like x-rays or MRIs or FMRIs. They use clustering analysis. Recommender systems like those used on websites like Amazon.com and whatnot, they recommend you can use clustering analysis to help build recommendation systems. Geospatial data, of course, our data is in latitude and longitude coordinates and we can do some kind of clustering with that as well. And we can also use it for things like anomaly detection in our data. The top bullet point here is that we can use it for market analysis and segmentation, a very popular technique for that as well.

And so this kind of gives you a little bit of the applications of clustering analysis and certainly the algorithms that we’re gonna be learning are used in the field and can be used for your data as well. So for example, people have used clustering algorithms to actually do things like, to check brain anomalies, so here are just images I’ve taken of the brain and the different C-1 to C-7 are different clusters of the brain. They use a different clustering metric. And the result is, you can kind of segment out different kinds of brain anomalies. So it’s an application of clustering in the medical domain.

And another domain, what these guys have done is look at reviews on Yelp.com, and they’ve performed a spatial analysis of that. So there are a lot of different applications of clustering in many different fields.

So just a quick overview of clustering, but we can get a little bit more formal with this, with the algorithms. So you can think of clustering analysis on the bottom here, as just taking in a set of points, X1 to Xn. And then we churn it through a machine that is the cluster analysis, and then out comes an assignment that maps each point to a particular cluster. So on the bottom here, X1 is mapped to Cluster 1, X2 is mapped to Cluster 2, and there are a lot of different algorithms that exist to detect clusters.

In fact, like I’ve mentioned, we’re gonna be covering three of them, of the most popular ones. Now the parameters, some of these algorithms have additional parameters to them. So like I mentioned before, one of those parameters might be the number of clusters that you have that might have to be something that’s pre-defined that you would give to an algorithm. But there are sometimes other parameters as well, and they vary for each algorithm, so there’s no uniformness among the parameters. However, like you see in the chart here, the input to clustering algorithms is usually a set of data points, X1 to Xn, and they can be of any dimensionality, they can be in 2D, they can be 2D points, they can be 3D points, they can be 100 dimensional points.

But for our purposes, we’re just gonna stick with 2D points. It is certainly possible to do clustering on higher dimensionality data, but one of the issues is actually visualizing that higher dimensional data. Because we can’t really plot a hundred dimensional graph terribly easily, but there exist dimensionality reduction techniques that can compress 100 dimensions down into two so we can plot it, but that’s a little beyond the scope of the course here. So we’re mostly gonna be sticking with 2D.

And then the output, like I said, is a cluster assignment. So each data point belongs to a cluster, or actually in some cases, some algorithms have some notion of outliers or noise built in, so just DBSCAN for example. So a point may not necessarily belong to an exact cluster, it might also belong to, it might be a noise point or some people like to think of it as being a noise cluster, although that’s not quite correct. But that is the output, and this is kind of a generalization of clustering algorithms. They take in this set of input points and then they produce a mapping that maps each point to a cluster or perhaps to a noise point.

And you can think of clustering analysis as being a kind of unsupervised machine learning, per se. And it’s unsupervised because generally, with clustering analysis, we don’t have any labels for our data. We just get a set of points and we’re set to cluster them. And so this unsupervised, and there are techniques that might use class labels to help do some kind of clustering or provide some other auxiliary information, but in many, many cases, clustering analysis is done unsupervised. We don’t have correct labels.

Alright, so that is just a quick overview of clustering analysis, so basically, what is clustering analysis, what kind of inputs we give to a clustering algorithm, what kinds of outputs do we expect from a clustering algorithm, and also discuss a little bit about different applications of clustering algorithms used in a wide variety of fields. It doesn’t even have to be something like data science or computer science, it can be used in the medical field, in business, finance, and so learning this information, we can take this and apply it to many other fields. So that is just a quick introduction to clustering analysis.

Transcript 3

In this video we are going to look at probably the most popular clustering algorithm called K-means clustering. So K-means clustering, the point of it is to take our cluster data and then separate it into K disjoint clusters, and these clusters are defined so that they minimize what’s called the within-cluster sum-of-squares, but we’ll get to that a bit later. You’ll see what that means when we discuss a little bit more about K-means convergence.

So the point of K-means clustering is to separate the data into K disjoint clusters. What I mean by disjoint clusters means that one point cannot belong to more than one cluster, and the only parameter that we have to set for this algorithm is the number of clusters. So you’ll see an example of an algorithm where you need to have some notion of how many clusters you want your data to have before you run the algorithm. Later, we’re gonna discuss how you can select this value of K, because sometimes it’s quite obvious how many clusters you have, but many times it’s not quite so clear how many clusters you should have, and so we’ll discuss a couple techniques a bit later on how to choose this parameter K.

So K-means clustering is very popular, very well-known. It actually, the algorithm itself is quite simple. It’s just a couple lines of algorithm code and the code itself is writing, if you had to write it from scratch also, it’s not something that would take you a long time. It’s very popular. I mean, it’s taught in computer science curriculums very commonly. So knowing this algorithm is kind of the first step to getting more acquainted with clustering. It’s kind of the baseline algorithm, and we’ll move on to more complicated algorithms a bit later, and then one point of terminology here.

You’ll often hear something called a cluster center or a centroid, and really that’s just a point that represents a cluster, so in the figure that I have here on the bottom right, we have two clusters. We have a red cluster and a blue cluster, and the X is the centroid. In other words, the centroid is really like the average of the X coordinates and the average of the Y coordinates and then that’s the point. So you can see in the blue, the average, if I average the X coordinates is somewhere gonna be in the middle, and if I average the Y coordinates, that’s gonna be somewhere in the middle and we end up with some centroid that’s in the middle of that square, and so that’s what I mean by centroid. So that’s all we really need to know about K-means.

Now we can get to the algorithm, we can actually get to the algorithm. I’m going to go through an example with you so you can kinda see it progress step-by-step with just some fake data, and then we’ll discuss some more interesting things about it such as convergence, properties, you know, talk a little bit about what convergence is, and then we’ll also discuss a bit later how do you select the value of K and what not. So but we’re gonna get to that.

So the clustering algorithm, the first step to the clustering is you randomly initialize the cluster centers. So suppose I have some data, that I have two clusters of four like in the previous chart. I have two clusters, so what I do is I just pick two random points to initialize the clusters. Now, the algorithm doesn’t, there are many ways to do this, and we’ll discuss a couple different ways a bit later, and so after you pick, we randomly initialize our cluster centers, then for each point we assign it to its nearest cluster using some kind of like metric, like the distance formula for example, Euclidean distance, and then we update the cluster centroids by taking the mean of the points assigned to it.

So after each point is assigned to a centroid, we look at the centroids and see how many points are assigned to this one centroid, and then we just average all the points, and that’s gonna shift the centroid. We basically keep doing this.

After the centroids have shifted, we then reset all of our points and figure out where the near, who, assign them to the nearest cluster again, and we just keep doing this until the cluster centroids stop moving or they move a very, very small imperceptible amount. Now, let’s actually do an example of this with some data so you have a better understanding of what I mean by each of these steps. So here is our data and we wanna cluster it and we want it to, we’re setting K equals two. There’s two clusters, okay?

If we look at this data we can kinda see that, well, yeah, there are two clusters here. So here is just like randomly initialized the two clusters. There’s red and blue, and just placed ’em somewhere on the plane. So now what I want to do is for each point, I’m going to assign it to the closest cluster. So after I do that I get something like this. So now I have colored all the points red or blue depending on if it’s closer to the red centroid or closer to the blue centroid.

So after I’ve done this, now I need to update my cluster centroid. So how I do that is I look at all the X, for example, for the blue centroid, I’ll look at all the X coordinates and take the average and that’s my new X coordinate for the centroid. I look at all the Y coordinates of all the blue points and I take the average of those and that becomes the new Y coordinate for my blue centroid. So after I do this, the centroids kinda shift over and then I reset the colors of all the points, and now I just do the same thing again.

So now for each point I have to figure out which centroid is it close to. So after I do that I get this assignment, and now I, again, take the average of the X coordinates and the average of the Y coordinates for each cluster and that becomes the location of my new centroid, and I shift it over again, and so now it’s gonna be here, and now let’s do another, now I do this one more time, and you’ll see that now, where if I try to do another shift, that nothing will, the cluster centroids won’t actually move again because in the previous step we see that the same points are assigned to the same cluster centroids.

So in other words, my centroids aren’t gonna move, and at this point we say that K-means has converged. So I hope that this gives you a bit of an overview of a K-means here as in how do you actually run the algorithm in step. So like I mentioned convergence, just ignore that big equation. It looks kinda scary but this is just to illustrate a point here. So convergence means that the cluster centroids either don’t move entirely or they move a very, very small very small amount, and you usually just set a threshold value. So if they don’t move at least this much, then we say they’ve converged.

The reason that I put this equation up here is just to illustrate the point that mathematically K-means is guaranteed to converge in a finite number of iterations, and by iterations I mean assigning the points to a cluster and then shifting and then assigning the points to a cluster and shifting.

In a finite amount of iterations, it is mathematically guaranteed to converge. Now, finite number, that still means that it might take a long time, but it will eventually converge. Now, this convergence, that theorem of K-means, doesn’t say anything about converging to the best cluster or converging to the optimal clustering. It just says that it will converge. This is still a useful property to know about K-means because this might not be true of other algorithms.

So one other point that I want to discuss is that K-means is actually quite sensitive to where you initialize the centroids, and there are a couple techniques to do this. So one technique that people like to use is you can just assign a cluster to be a random data point. So you pick a data point and say that is my cluster center. Another interesting thing is to choose K points that are the farthest away from each other. So if I have two points, I’ll choose two points that are the farthest away from each other within the bounds of my data, between the bound of the smallest X coordinate and the largest X coordinate, smallest Y coordinate, largest Y coordinate, and another approach is to repeat my K-means many times over and over again.

You just run the algorithm over and over again using one of these initializations and just take the average of the clusters. There’s another way more advanced approach called K-means plus plus that does all kinds of analysis of variance, but we’re not gonna get into that. In fact, we don’t have to implement it either ’cause it’s usually embedded in many libraries. Okay, so the only thing that remains to talk about is how do you choose a value of K. Well, one other thing to do is just plot your data, try a couple K values, and see which clustering works the best.

Another technique that’s actually quite popular is called the elbow method, and what you do is you choose some values of K, you run the clustering algorithm, you compute this thing called a within-cluster-sum-of-squares. In other words, you take the centroid and each data point assigned to that centroid and compute the sum of the square differences between those guys and you sum up all that. So for each cluster so you get one of those, and then you sum it up for all the clusters and you plot that on a graph, and you can keep plotting this on a graph for different values of K and then you pick the elbow of this graph here and that’s, it’s not a very clear-cut way as to what value of K you should use, it just gives you kind of an idea of what the range of different K you should choose there.

So this is also very popular and, again, there are lots of libraries that can actually generate this for you. You can do this yourself as well. So this is just a technique that you might wanna use to figure out how many values of K you should use. Okay, so that pretty much covers K-means clustering.

So I wanna talk a little bit about some of the advantages and disadvantages of this, and then later we’re gonna actually see it run on different kinds of data, and we’ll see how well it performs there. So it’s widely known and used, very popular like I said. It’s a fairly simple algorithm actually. It’s really only like four steps, and in lines of code wise it’s not that long either. It’s in the algorithm, easy to implement. One big perk is that it’s guaranteed to converge. It’s pretty nice.

Disadvantages of K-means, it’s algorithmically pretty slow if you know anything about big O notation. It can take a long time for it to converge. Another issue is that it might not converge to the best solution. It’s guaranteed to converge, but it might not converge to the best solution, and it’s not very robust against clusters that have various shapes, like elongated clusters for example. It might not perform so well.

We’re actually gonna see this when we look at some fake data. We’ll see how well it performs against elongated data for example. And the reason for this is because we’re using the same parameters for each cluster when we do this. So anyway, that is just a quick overview of K-means clustering, and so now that we have this example we can go ahead and try to look at, analyze how it performs on different kinds of data sets.

Interested in continuing? Check out the full Data Insights with Cluster Analysis course, which is part of our Data Science Mini-Degree.

For teachers, you can also try Zenva Schools which offers online learning suitable for K12 school environments and offers a wide array of topics including Python, web development, Unity, and more.

Clustering with Gaussian Mixture Models – Data Science & ML

Gaussian Mixture Models are an essential part of data analysis – but do you know how they work?

In this article, we’ll seek to demystify how to analyze “clusters” of data and discuss Gaussian Mixture Models which can help us more efficiently and accurately sort out clusters of data points.

Let’s dive in!

What is Clustering?

Clustering is an essential part of any data analysis. Using an algorithm such as K-Means leads to hard assignments, meaning that each point is definitively assigned a cluster center. This leads to some interesting problems: what if the true clusters actually overlap? What about data that is more spread out; how do we assign clusters then?

Gaussian Mixture Models save the day! We will review the Gaussian or normal distribution method and the problem of clustering. Then we will discuss the overall approach of Gaussian Mixture Models. Training them requires using a very famous algorithm called the Expectation-Maximization Algorithm that we will discuss.

Download the full code here.

If you are not familiar with the K-Means algorithm or clustering, read about it here.


Gaussian Distribution

The first question you may have is “what is a Gaussian?”. It’s the most famous and important of all statistical distributions. A picture is worth a thousand words so here’s an example of a Gaussian centered at 0 with a standard deviation of 1.

This is the Gaussian or normal distribution! It is also called a bell curve sometimes. The function that describes the normal distribution is the following

\[ \mathcal{N}(x; \mu, \sigma^2) = \displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\displaystyle\frac{(x-\mu)^2}{2\sigma^2}}  \]

That looks like a really messy equation! And it is, so we’ll use $\mathcal{N}(x; \mu, \sigma^2)$ to represent it. If we look at it, we notice there is one input and two parameters. First, let’s discuss the parameters and how they change the Gaussian. Then we can discuss what the input means.

The two parameters are called the mean $\mu$ and standard deviation $\sigma$. In some cases, the standard deviation is replaced with the variance $\sigma^2$, which is just the square of the standard deviation. The mean of the Gaussian simply shifts the center of the Gaussian, i.e., the “bump” or top of the bell. In the image above, $\mu=0$, so the largest value is at $x=0$.

The standard deviation is a measure of the spread of the Gaussian. It affects the “wideness” of the bell. Using a larger standard deviation means that the data are more spread out, rather than closer to the mean.

What about the input? More specifically, the above function is called the probability density function (pdf) and it tells us the probability of observing an input $x$, given that specific normal distribution. Given the graph above, we see that observing an input value of 0 gives us a probability of about 40%. As we move away in either direction from the center, we are decreasing the probability. This is one key property of the normal distribution: the highest probability is located at mean while the probabilities approach zero as we move away from the mean.

(Since this is a probability distribution, the sum of all of the values under the bell curve, i.e., the integral, is equal to 1; we also have no negative values.)

Why are we using the Gaussian distribution? The Expectation-Maximization algorithm is actually broader than just the normal distribution, but what makes Gaussians so special? It turns out that many dataset distributions are actually Gaussian! We find these Gaussians in nature, mathematics, physics, biology, and just about every other field! They are ubiquitous! There is a famous theorem in statistics called the Central Limit Theorem that states that as we collect more and more samples from a dataset, they tend to resemble a Gaussian, even if the original dataset distribution is not Gaussian! This makes Gaussians very powerful and versatile!

Multivariate Gaussians

We’ve only discussed Gaussians in 1D, i.e., with a single input. But they can easily be extended to any number of dimensions. For Gaussian Mixture Models, in particular, we’ll use 2D Gaussians, meaning that our input is now a vector instead of a scalar. This also changes our parameters: the mean is now a vector as well! The mean represents the center of our data so it must have the same dimensionality as the input.

The variance changes less intuitively into a covariance matrix $\Sigma$. The covariance matrix, in addition to telling us the variance of each dimension, also tells us the relationship between the inputs, i.e., if we change x, how does y tend to change?

We won’t discuss the details of the multivariate Gaussian or the equation that generates it, but knowing what it looks like is essential to Gaussian Mixture Models since we’ll be using these.

The above chart has two different ways to represent the 2D Gaussian. The upper plot is a surface plot that shows our 2D Gaussian in 3D. The X and Y axes are the two inputs and the Z axis represents the probability. The lower plot is a contour plot. I’ve plotted these on top of each other to show how the contour plot is just a flattened surface plot where color is used to determine the height. The lighter the color, the larger the probability. The Gaussian contours resemble ellipses, so our Gaussian Mixture Model will look like it’s fitting ellipses around our data. Since the surface plot can get a little difficult to visualize on top of data, we’ll be sticking to the contour plots.
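If you would like to generate a similar contour plot yourself, here is a small sketch using SciPy’s multivariate normal; the mean and covariance values are purely illustrative:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mean = [0.0, 0.0]
cov = [[1.0, 0.6],      # positive off-diagonal: x and y tend to move together
       [0.6, 1.0]]

xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack((xs, ys))
density = multivariate_normal(mean, cov).pdf(grid)

plt.contour(xs, ys, density)    # elliptical contours, tilted by the covariance
plt.show()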

Gaussian Mixture Models

Now that we understand Gaussians, let’s discuss Gaussian Mixture Models (GMMs)! To motivate our discussion, let’s see some example data that we want to cluster.

We could certainly cluster these data using an algorithm like K-Means to get the following results.

In this case, K-Means works out pretty well. But let’s consider another case where we have overlap in our data. (The two Gaussians are colored differently)

In this case, it’s pretty clear from the elliptical shape of the clusters that these data were generated from 2D Gaussians. In fact, we know that these data follow the normal distribution, so using K-Means doesn’t seem to take advantage of that fact. Even though I didn’t tell you our data were normally distributed, remember that the Central Limit Theorem says that enough random samples from any distribution will look like the normal distribution.

Additionally, K-Means doesn’t take into account the covariance of our data. For example, the blue points seem to have a relationship between X and Y: larger X values tend to produce larger Y values. If we had two points that were equidistant from the center of the cluster, but one followed the trend and the other didn’t, K-Means would regard them as being equal, since it uses Euclidean distance. But it seems certainly more likely that the point that follows the trend should match closer to the Gaussian than the point that doesn’t.

Since we know these data are Gaussian, why not try to fit Gaussians to them instead of a single cluster center? The idea behind Gaussian Mixture Models is to find the parameters of the Gaussians that best explain our data.

This is what we call generative modeling. We are assuming that these data are Gaussian and we want to find parameters that maximize the likelihood of observing these data. In other words, we regard each point as being generated by a mixture of Gaussians and can compute that probability.

\begin{align}
p(x) = \displaystyle\sum_{j=1}^{k} \phi_j\mathcal{N}(x; \mu_j, \Sigma_j)\\
\displaystyle\sum_{j=1}^{k} \phi_j = 1
\end{align}

The first equation tells us that a particular data point $x$ is a linear combination of the $k$ Gaussians. We weight each Gaussian with $\phi_j$, which represents the strength of that Gaussian. The second equation is a constraint on the weights: they all have to sum up to 1. We have three different sets of parameters that we need to update: the weights for each Gaussian $\phi_j$, the means of the Gaussians $\mu_j$, and the covariances of each Gaussian $\Sigma_j$.
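To make the first equation concrete, here is a small sketch (with made-up weights, means, and covariances, not the tutorial’s actual parameters) that evaluates $p(x)$ as a weighted sum of Gaussian densities using SciPy.

import numpy as np
from scipy.stats import multivariate_normal

phis = [0.6, 0.4]                                   # mixture weights, must sum to 1
mus = [np.array([3.0, 3.0]), np.array([1.5, 1.5])]  # one mean per Gaussian
Sigmas = [np.eye(2), np.eye(2)]                     # one covariance per Gaussian

def mixture_pdf(x):
    # p(x) = sum_j phi_j * N(x; mu_j, Sigma_j)
    return sum(phi * multivariate_normal(mu, Sigma).pdf(x)
               for phi, mu, Sigma in zip(phis, mus, Sigmas))

print(mixture_pdf(np.array([2.0, 2.0])))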

If we try to directly solve for these, it turns out that we can actually find closed forms! But there is one huge catch: we have to know which Gaussian each point was generated by! In other words, if we knew exactly which Gaussian a particular point came from, then we could easily figure out the weights, means, and covariances! But we don’t know those assignments, and this one critical gap prevents us from solving GMMs using this direct technique. Instead, we have to come up with an iterative approach to estimate the weights, means, and covariances.

Expectation-Maximization

How do we learn the parameters? There’s a very famous algorithm called the Expectation-Maximization Algorithm, also called the EM algorithm for short (published in 1977 and cited over 50,000 times!), that we’ll use for updating these parameters. There are two steps in this algorithm, as you might guess: expectation and maximization. To explain these steps, I’m going to cover how the algorithm works at a high level.

The first part is the expectation step. In this step, we have to compute the probability that each data point was generated by each of the $k$ Gaussians. In contrast to the K-Means hard assignments, these are called soft assignments since we’re using probabilities. Note that we’re not assigning each point to a Gaussian, we’re simply determining the probability of a particular Gaussian generating a particular point. We compute this probability for a given Gaussian by computing $\phi_j\mathcal{N}(x; \mu_j, \Sigma_j)$ and normalizing by dividing by $\sum_{q=1}^{k} \phi_q\mathcal{N}(x; \mu_q, \Sigma_q)$. We’re directly applying the Gaussian equation, but multiplying it by its weight $\phi_j$. Then, to make it a probability, we normalize. In K-Means, the expectation step is analogous to assigning each point to a cluster.

The second part is the maximization step. In this step, we need to update our weights, means, and covariances. Recall in K-Means, we simply took the mean of the set of points assigned to a cluster to be the new mean. We’re going to do something similar here, except apply our expectations that we computed in the previous step. To update a weight $\phi_j$, we simply sum up the probability that each point was generated by Gaussian $j$ and divide by the total number of points. For a mean $\mu_j$, we compute the mean of all points weighted by the probability of that point being generated by Gaussian $j$. For a covariance $\Sigma_j$, we compute the covariance of all points weighted by the probability of that point being generated by Gaussian $j$. We do each of these for each Gaussian $j$. Now we’ve updated the weights, means, and covariances! In K-Means, the maximization step is analogous to moving the cluster centers.

Mathematically, at the expectation step, we’re effectively computing a matrix where the rows are the data points and the columns are the Gaussians. An element at row $i$, column $j$ is the probability that $x^{(i)}$ was generated by Gaussian $j$.

\[ W^{(i)}_j = \displaystyle\frac{\phi_j\mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\displaystyle\sum_{q=1}^{k}\phi_q\mathcal{N}(x^{(i)}; \mu_q, \Sigma_q)} \]

The denominator just sums over all values to make each entry in $W$ a probability. Now, we can apply the update rules.

\begin{align*}
\phi_j &= \displaystyle\frac{1}{N}\sum_{i=1}^N W^{(i)}_j\\
\mu_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j x^{(i)}}{\sum_{i=1}^N W^{(i)}_j}\\
\Sigma_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^N W^{(i)}_j}
\end{align*}

The first equation is just the sum of the probabilities of a particular Gaussian $j$ divided by the number of points. In the second equation, we’re just computing the mean, except we multiply by the probabilities for that cluster. Similarly, in the last equation, we’re just computing the covariance, except we multiply by the probabilities for that cluster.
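For readers who want to see these update rules as code, here is a compact sketch of a single EM iteration that follows the equations above directly; it is illustrative only (and uses SciPy for the Gaussian density), since in practice we’ll just let scikit-learn do this for us below.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phis, mus, Sigmas):
    # X is an (N, d) array; phis, mus, Sigmas are the current parameter estimates
    N, k = X.shape[0], len(phis)

    # E-step: W[i, j] = probability that Gaussian j generated point i
    W = np.zeros((N, k))
    for j in range(k):
        W[:, j] = phis[j] * multivariate_normal(mus[j], Sigmas[j]).pdf(X)
    W /= W.sum(axis=1, keepdims=True)   # normalize each row into probabilities

    # M-step: re-estimate the weights, means, and covariances
    Nj = W.sum(axis=0)                  # "effective" number of points per Gaussian
    new_phis = Nj / N
    new_mus = [W[:, j] @ X / Nj[j] for j in range(k)]
    new_Sigmas = [(W[:, j, None] * (X - new_mus[j])).T @ (X - new_mus[j]) / Nj[j]
                  for j in range(k)]
    return new_phis, new_mus, new_Sigmas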

Applying GMMs

Let’s apply what we learned about GMMs to our dataset. We’ll be using scikit-learn to run a GMM for us. In the ZIP file, I’ve saved some data in a numpy array. We’re going to extract it, create a GMM, run the EM algorithm, and plot the results!

First, we need to load the data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

X_train = np.load('data.npy')

Additionally, we can generate a plot of our data using the following code.

plt.plot(X_train[:,0], X_train[:,1], 'bx')
plt.axis('equal')
plt.show()

Remember that clustering is unsupervised, so our input is only a 2D point without any labels. We should get the same plot of the 2 Gaussians overlapping.

Using the GaussianMixture class of scikit-learn, we can easily create a GMM and run the EM algorithm in a few lines of code!

gmm = GaussianMixture(n_components=2)
gmm.fit(X_train)

After our model has converged, the weights, means, and covariances should be solved! We can print them out.

print(gmm.means_)
print('\n')
print(gmm.covariances_)
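As an aside (these lines aren’t in the original snippet), scikit-learn also exposes the learned mixture weights and can give us hard or soft cluster assignments:

print(gmm.weights_)                  # the mixture weights (the phi_j's)
labels = gmm.predict(X_train)        # hard assignment: most likely Gaussian per point
probs = gmm.predict_proba(X_train)   # soft assignments (the W matrix from the E-step)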

For comparison, I generated the original data according to the following Gaussians.

\begin{align*}
\mu_1 &= \begin{bmatrix}3 \\ 3\end{bmatrix}\\
\Sigma_1 &=
\begin{bmatrix}
1 & 0.6 \\
0.6 & 1
\end{bmatrix}\\
\mu_2 &= \begin{bmatrix}1.5 \\ 1.5\end{bmatrix}\\
\Sigma_2 &=
\begin{bmatrix}
1 & -0.7 \\
-0.7 & 1
\end{bmatrix}
\end{align*}

Our means should be pretty close to the actual means! (Our covariances might be a bit off due to the weights.) Now we can write some code to plot our mixture of Gaussians.

X, Y = np.meshgrid(np.linspace(-1, 6), np.linspace(-1,6))
XX = np.array([X.ravel(), Y.ravel()]).T
Z = gmm.score_samples(XX)
Z = Z.reshape((50,50))

plt.contour(X, Y, Z)
plt.scatter(X_train[:, 0], X_train[:, 1])

plt.show()

This code simply creates a grid of all X and Y coordinates between -1 and 6 (for both) and evaluates our GMM. Then we can plot our GMM as contours over our original data.

The plot looks just as we expected! Recall that with the normal distribution, we expect to see most of the data points around the mean and less as we move away. In the plot, the first few ellipses have most of the data, with only a few data points towards the outer ellipses. The darker the contour, the lower the score.

(The score isn’t quite a probability: it’s actually a log probability density. Remember that each point’s density is a weighted sum of Gaussians, and, in practice, we work with the logarithm for numerical stability, thus preventing underflow.)

To summarize, Gaussian Mixture Models are a clustering technique that allows us to fit multivariate Gaussian distributions to our data. GMMs work well when our data actually are Gaussian or when we suspect them to be. We also discussed the famous expectation-maximization algorithm, at a high level, to see how we can iteratively solve for the parameters of the Gaussians. Finally, we wrote code to train a GMM and plot the resulting Gaussian ellipses.

Gaussian Mixture Models are an essential part of data analysis and anomaly detection – which are important to learn when pursuing exciting developer careers! You can also find out more resources for exploring Python here.

]]>
Face Recognition with Eigenfaces – Computer Vision Tutorial https://gamedevacademy.org/face-recognition-with-eigenfaces/ Sun, 02 Oct 2022 02:49:02 +0000 https://pythonmachinelearning.pro/?p=1878 Read more]]> Face recognition – or the ability of computers to recognize faces and facial features – is an imminent concern to our future.

In this tutorial, we’re going to explore face recognition in-depth and learn how with techniques like eigenfaces, we can create our own software programs capable of identifying human faces. The applications for facial recognition are endless, and whether good or bad, it’s a skill that can help take your Python skills to the next level.

Let’s jump in and explore the world of computer vision and faces!

Intro & Project Files

Face recognition is ubiquitous in science fiction: the protagonist looks at a camera, and the camera scans his or her face to recognize the person. More formally, we can formulate face recognition as a classification task, where the inputs are images and the outputs are people’s names. We’re going to discuss a popular technique for face recognition called eigenfaces. And at the heart of eigenfaces is an unsupervised dimensionality reduction technique called principal component analysis (PCA), and we will see how we can apply this general technique to our specific task of face recognition.

Download the full code here.


Face Recognition

Before discussing principal component analysis, we should first define our problem. Face recognition is the challenge of classifying whose face is in an input image. This is different than face detection where the challenge is determining if there is a face in the input image. With face recognition, we need an existing database of faces. Given a new image of a face, we need to report the person’s name.

A naïve way of accomplishing this is to take the new image, flatten it into a vector, and compute the Euclidean distance between it and all of the other flattened images in our database.

There are several downsides to this approach. First of all, if we have a large database of faces, then doing this comparison for each face will take a while! Imagine that we’re building a face recognition system for real-time use! The larger our dataset, the slower our algorithm. But more faces will also produce better results! We want a system that is both fast and accurate. For this, we’ll use a neural network! We can train our network on our dataset and use it for our face recognition task.

There’s an issue with directly using a neural network: images can be large! If we had a single $m\times n$ image, we would have to flatten it out into a single $m \cdot n\times 1$ vector to feed into our neural network as input. For large image sizes, this might hurt speed! This is related to the second problem with using images as-is in our naïve approach: they are high-dimensional! (An $m\times n$ image is really an $m \cdot n\times 1$ vector) A new input might have a ton of noise and comparing each and every pixel using matrix subtraction and Euclidean distance might give us a high error and misclassifications!

These issues are why we don’t use the naïve method. Instead, we’d like to take our high-dimensional images and boil them down to a smaller dimensionality while retaining the essence or important parts of the image.

Dimensionality Reduction

The previous section motivates our reason for using a dimensionality reduction technique. Dimensionality reduction is a type of unsupervised learning where we want to take higher-dimensional data, like images, and represent them in a lower-dimensional space. Let’s use the following image as an example.

These plots show the same data, except the bottom chart zero-centers it. Notice that our data do not have any labels associated with them because this is unsupervised learning! In our simple case, dimensionality reduction will reduce these data from a 2D plane to a 1D line. If we had 3D data, we could reduce it down to a 2D plane or even a 1D line.

All dimensionality reduction techniques aim to find some hyperplane, a higher-dimensional generalization of a line, to project the points onto. We can imagine a projection as taking a flashlight perpendicular to the hyperplane we’re projecting onto and plotting where the shadows fall on that hyperplane. For example, in our above data, if we wanted to project our points onto the x-axis, then we pretend each point is a ball and our flashlight would point directly down or up (perpendicular to the x-axis) and the shadows of the points would fall on the x-axis. This is a projection. We won’t worry about the exact math behind this since scikit-learn can apply this projection for us.

In our simple 2D case, we want to find a line to project our points onto. After we project the points, then we have data in 1D instead of 2D! Similarly, if we had 3D data, we want to find a plane to project the points down onto to reduce the dimensionality of our data from 3D to 2D. The different types of dimensionality reduction are all about figuring out which of these hyperplanes to select: there are an infinite number of them!

Principal Component Analysis

One technique of dimensionality reduction is called principal component analysis (PCA). The idea behind PCA is that we want to select the hyperplane such that when all the points are projected onto it, they are maximally spread out. In other words, we want the axis of maximal variance! Let’s consider our example plot above. A potential axis is the x-axis or y-axis, but, in both cases, that’s not the best axis. However, if we pick a line that cuts through our data diagonally, that is the axis where the data would be most spread!

The longer blue axis is the correct axis! If we were to project our points onto this axis, they would be maximally spread! But how do we figure out this axis? We can borrow a term from linear algebra called eigenvectors! This is where eigenfaces gets its name! Essentially, we compute the covariance matrix of our data and consider that covariance matrix’s largest eigenvectors. Those are our principal axes and the axes that we project our data onto to reduce dimensions. Using this approach, we can take high-dimensional data and reduce it down to a lower dimension by selecting the largest eigenvectors of the covariance matrix and projecting onto those eigenvectors.
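To make that description concrete, here is a tiny sketch of the idea (we’ll use scikit-learn’s PCA class for the real thing later): zero-center the data, compute its covariance matrix, take the eigenvectors with the largest eigenvalues, and project onto them.

import numpy as np

def pca_project(X, n_components):
    X_centered = X - X.mean(axis=0)                # zero-center the data
    cov = np.cov(X_centered, rowvar=False)         # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh works for symmetric matrices
    order = np.argsort(eigvals)[::-1]              # sort eigenvectors by eigenvalue
    components = eigvecs[:, order[:n_components]]  # the principal axes
    return X_centered @ components                 # project onto those axes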

Since we’re computing the axes of maximum spread, we’re retaining the most important aspects of our data. It’s easier for our classifier to separate faces when our data are spread out as opposed to bunched together.

(There are other dimensionality reduction techniques, such as Linear Discriminant Analysis, that use supervised learning and are also used in face recognition, but PCA works really well!)

How does this relate to our challenge of face recognition? We can conceptualize our $m\times n$ images as points in $m \cdot n$-dimensional space. Then, we can use PCA to reduce our space from $m \cdot n$ dimensions into something much smaller. This will help speed up our computations and be robust to noise and variation.

Aside on Face Detection

So far, we’ve assumed that the input image is only that of a face, but, in practice, we shouldn’t require the camera images to have a perfectly centered face. This is why we run an out-of-the-box face detection algorithm, such as a cascade classifier trained on faces, to figure out what portion of the input image has a face in it. When we have that bounding box, we can easily slice out that portion of the input image and use eigenfaces on that slice. (Usually, we smooth that slice and perform an affine transform to de-warp the face if it appears at an angle.) For our purposes, we’ll assume that we have images of faces already.

Eigenfaces Code

Now that we’ve discussed PCA and eigenfaces, let’s code a face recognition algorithm using scikit-learn! First, we’ll need a dataset. For our purposes, we’ll use an out-of-the-box dataset by the University of Massachusetts called Labeled Faces in the Wild (LFW).

Feel free to substitute your own dataset! If you want to create your own face dataset, you’ll need several pictures of each person’s face (at different angles and lighting), along with the ground-truth labels. The wider variety of faces you use, the better the recognizer will do. The easiest way to create a dataset for face recognition is to create a folder for each person and put the face images in there. Make sure they are all the same size, and resize them so they aren’t overly large! Remember that PCA will reduce the image’s dimensionality when we project onto that space anyways, so using large, high-definition images won’t help and will slow down our algorithm. A good size is ~512×512 for each image. The images should all be the same size so you can store them in one numpy array with dimensions (num_examples, height, width). (We’re assuming grayscale images.) Then use the folder names as the class labels. Using this approach, you can use your own images.
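As a rough illustration of that folder-per-person layout, here is one possible loader (not part of the tutorial’s code) using Pillow; the folder name my_faces is just a placeholder, and the images are converted to grayscale and assumed to all be the same size.

import os
import numpy as np
from PIL import Image

def load_face_dataset(root='my_faces'):
    images, labels, names = [], [], sorted(os.listdir(root))
    for label, name in enumerate(names):
        person_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(person_dir)):
            img = Image.open(os.path.join(person_dir, fname)).convert('L')  # grayscale
            images.append(np.asarray(img))
            labels.append(label)
    X = np.array(images)   # shape: (num_examples, height, width)
    y = np.array(labels)
    return X, y, names     # names[i] is the person for class label i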

However, we’ll be using the LFW dataset. Luckily, scikit-learn can automatically load our dataset for us in the correct format. We can call a function to load our data. If the data aren’t available on disk, scikit-learn will automatically download them for us from the University of Massachusetts’ website.

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier


# Load data
lfw_dataset = fetch_lfw_people(min_faces_per_person=100)

_, h, w = lfw_dataset.images.shape
X = lfw_dataset.data
y = lfw_dataset.target
target_names = lfw_dataset.target_names

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

The argument to our function just prunes all people without at least 100 faces, thus reducing the number of classes. Then we can extract our dataset and other auxiliary information. Finally, we split our dataset into training and testing sets.

Now we can simply use scikit-learn’s PCA class to perform the dimensionality reduction for us! We have to select the number of components, i.e., the output dimensionality (the number of eigenvectors to project onto) that we want to reduce down to. Feel free to tweak this parameter to try to get the best result! We’ll use 100 components. Additionally, we’ll whiten our data, which is easy to do with a simple boolean flag! (Whitening just scales our resulting data to have unit variance, which has been shown to produce better results.)

# Compute a PCA 
n_components = 100
pca = PCA(n_components=n_components, whiten=True).fit(X_train)

# apply PCA transformation
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

We can apply the transform to bring our images down to a 100-dimensional space. Notice we’re not performing PCA on the entire dataset, only the training data. This is so we can better generalize to unseen data.
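If you’re curious how much information those 100 components retain, scikit-learn’s PCA exposes the fraction of variance explained by each component, so we can sum it up (this line isn’t in the original snippet):

# fraction of the original variance kept by our 100 components
print(pca.explained_variance_ratio_.sum())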

Now that we have a reduced-dimensionality vector, we can train our neural network!

# train a neural network
print("Fitting the classifier to the training set")
clf = MLPClassifier(hidden_layer_sizes=(1024,), batch_size=256, verbose=True, early_stopping=True).fit(X_train_pca, y_train)

To see how our network is training, we can set the verbose  flag. Additionally, we use early stopping.

Let’s discuss early stopping as a brief aside. Essentially, our optimizer will monitor the average accuracy for the validation set for each epoch. If it notices that our validation accuracy hasn’t increased significantly for a certain number of epochs, then we stop training. This is a regularization technique that prevents our model from overfitting!

Consider the above chart. We notice overfitting when our validation set accuracy starts to decline. At that point, we immediately stop training to prevent overfitting.

Finally, we can make a prediction and use a function to print out an entire report of quality for each class.

y_pred = clf.predict(X_test_pca)
print(classification_report(y_test, y_pred, target_names=target_names))

Here’s an example of a classification report.

                   precision    recall  f1-score   support

     Colin Powell       0.86      0.89      0.87        66
  Donald Rumsfeld       0.85      0.61      0.71        38
    George W Bush       0.88      0.94      0.91       177
Gerhard Schroeder       0.67      0.69      0.68        26
       Tony Blair       0.86      0.71      0.78        35

      avg / total       0.85      0.85      0.85       342

Notice there is no accuracy metric. Accuracy isn’t the most informative metric to begin with. Instead, we see precision, recall, f1-score, and support. The support is simply the number of times this ground truth label occurred in our test set, e.g., in our test set, there were actually 35 images of Tony Blair. The F1-score is just the harmonic mean of the precision and recall scores. Precision and recall are more specific measures than a single accuracy score. A higher value for both is better.

After training our classifier, we can give it a few images to classify.

# Visualization
def plot_gallery(images, titles, h, w, rows=3, cols=4):
    plt.figure()
    for i in range(rows * cols):
        plt.subplot(rows, cols, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i])
        plt.xticks(())
        plt.yticks(())

def titles(y_pred, y_test, target_names):
    for i in range(y_pred.shape[0]):
        pred_name = target_names[y_pred[i]].split(' ')[-1]
        true_name = target_names[y_test[i]].split(' ')[-1]
        yield 'predicted: {0}\ntrue: {1}'.format(pred_name, true_name)

prediction_titles = list(titles(y_pred, y_test, target_names))
plot_gallery(X_test, prediction_titles, h, w)

(plot_gallery  and titles  functions modified from the scikit-learn documentation)

We can see our network’s predictions and the ground truth value for each image.

Another interesting thing to visualize is the eigenfaces themselves. Remember that PCA produces eigenvectors. We can reshape those eigenvectors into images and visualize the eigenfaces.

These represent the “generic” faces of our dataset. Intuitively, these are vectors that represent directions in “face space” and are what our neural network uses to help with classification. Now that we’ve discussed the eigenfaces approach, you can build applications that use this face recognition algorithm!
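Here is a short sketch of that visualization: pca.components_ stores one eigenvector per row, and each row can be reshaped back into an h-by-w image and shown with the plot_gallery function from above.

eigenfaces = pca.components_.reshape((n_components, h, w))
eigenface_titles = ['eigenface {0}'.format(i) for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()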

We discussed a popular approach to face recognition called eigenfaces. The essence of eigenfaces is an unsupervised dimensionality reduction algorithm called Principal Component Analysis (PCA) that we use to reduce the dimensionality of images into something smaller. Now that we have a smaller representation of our faces, we apply a classifier that takes the reduced-dimension input and produces a class label. For our classifier, we used a neural network with a single hidden layer.

Face recognition is a fascinating example of merging computer vision and machine learning and many researchers are still working on this challenging problem today!

Nowadays, deep convolutional neural networks are used for face recognition, and skills in this area have vast implications for your development career. Try one out on this dataset!

]]>
Perceptrons: The First Neural Networks for Machine Learning https://gamedevacademy.org/perceptrons-the-first-neural-networks/ Wed, 28 Sep 2022 16:31:42 +0000 https://pythonmachinelearning.pro/?p=1620 Read more]]> Machine learning is not only a fascinating topic but one with a variety of approaches. Neural networks, one such approach, have come to the forefront largely due to their accessibility for those taking their first step into the machine learning world.

In this tutorial, we’re going to use Python and NumPy to explore the fundamentals of creating neural networks and combining them – using analogies for how our brains work to better understand them.

If you’re ready to start diving deep into the world of machine learning, let’s get started!

Intro & Project Files

Neural Networks have become incredibly popular over the past few years, and new architectures, neuron types, activation functions, and training techniques pop up all the time in research. But without a fundamental understanding of neural networks, it can be quite difficult to keep up with the flurry of new work in this area.

To understand the modern approaches, we have to understand the tiniest, most fundamental building block of these so-called deep neural networks: the neuron. In particular, we’ll see how to combine several of them into a layer and create a neural network called the perceptron. We’ll write Python code (using NumPy) to build a perceptron network from scratch and implement the learning algorithm.

For the completed code, download the ZIP file here.


Biological Neurons

Perceptrons, among the earliest artificial neuron models, actually date back to 1958. Frank Rosenblatt was a psychologist trying to solidify a mathematical model for biological neurons. To better understand the motivation behind the perceptron, we need a superficial understanding of the structure of biological neurons in our brains.

(Credit: https://commons.wikimedia.org/wiki/File:Neuron_-_annotated.svg)

Let’s consider a biological neuron. The point of this cell is to take in some input (in the form of electrical signals in our brains), do some processing, and produce some output (also an electrical signal). One very important thing to note is that the inputs and outputs are binary (0 or 1)! An individual neuron accepts inputs, usually from other neurons, through its dendrites. Although the image above doesn’t depict it, the dendrites connect with other neurons through a gap called the synapse that assigns a weight to a particular input. Then, all of these inputs are considered together when they are processed in the cell body, or soma.

Neurons exhibit an all-or-nothing behavior. In other words, if the combination of inputs exceeds a certain threshold, then an output signal is produced, i.e., the neuron “fires.” If the combination falls short of the threshold, then the neuron doesn’t produce any output, i.e., the neuron “doesn’t fire.” In the case where the neuron does fire, the output travels along the axon to the axon terminals. These axon terminals are connected to the dendrites of other neurons through the synapse.

Let’s take a moment to recap biological neurons. They take some binary inputs through the dendrites, but not all inputs are treated the same since they are weighted. We combine these weighted signals and, if they surpass a threshold, the neuron fires. This single output travels along the axon to other neurons. Now that we have this summary in mind, we can develop mathematical equations to roughly represent a biological neuron.

Artificial Neurons

Now that we have some understanding of biological neurons, the mathematical model should follow from the operations of a neuron.

In this model, we have n binary inputs (usually given as a vector) and exactly the same number of weights $W_1, \dots, W_n$. We multiply these together and sum them up. We denote this as z and call it the pre-activation.

\[ z = \displaystyle\sum_{i=1}^{n} W_i x_i = W^T x \]

(We can re-write this as an inner product for succinctness.) There is another term, called the bias, that is just a constant factor.

\[ z = \displaystyle\sum_{i=1}^{n} W_i x_i + b = W^T x + b \]

For mathematical convenience, we can actually incorporate it into our weight vector as $W_0$ and set $x_0 = +1$ for all of our inputs. (This concept of incorporating the bias into the weight vector will become clearer when we write code.)

\[ z = \displaystyle\sum_{i=0}^{n} W_i x_i = W^T x \]

After taking the weighted sum, we apply an activation function, $\sigma$, to this and produce an activation a. The activation function for perceptrons is sometimes called a step function because, if we were to plot it, it would look like a stair step.

\[
\sigma(q)=
\begin{cases}
1 & q\geq 0 \\
0 & q < 0
\end{cases}
\]

In other words, if the input is greater than or equal to 0, then we produce an output of 1. Otherwise, we produce an output of 0. This is the mathematical model for a single neuron, the most fundamental unit of a neural network.

\[ a = \sigma (W^T x) \]

Let’s compare this model to the biological neuron. The inputs are analogous to the dendrites, and the weights model the synapse. We combine the weighted inputs by summing and send that weighted sum to the activation function. This acts as our all-or-nothing response function where 0 means the neuron didn’t produce an output. Also note that our inputs and outputs are also binary, which is in accordance with the biological model.

Capabilities and Limitations of Perceptrons

Since the output of a perceptron is binary, we can use it for binary classification, i.e., an input belongs to only one of two classes. The classic examples used to explain what perceptrons can model are logic gates!

Let’s consider the logic gates in the figure above. A white circle means an output of 1 and a black circle means an output of 0, and the axes indicate inputs. For example, when we input 1 and 1 to an AND gate, the output is 1, the white circle. We can create perceptrons that act like gates: they take 2 binary inputs and produce a single binary output!

However, perceptrons are limited to solving problems that are linearly separable. If two classes are linearly separable, this means that we can draw a single line to separate the two classes. We can do this easily for the AND and OR gates, but there is no single line that can separate the classes for the XOR gate! This means that we can’t use our single-layer perceptron to model an XOR gate.

An intuitive way to understand why perceptrons can only model linearly separable problems is to look at the weighted sum equation (with the bias).

\[ \displaystyle\sum_{i=1}^{n} W_i x_i + b \]

This looks very similar to the equation of a line! (Or, more generally, a hyperplane.) Hence, we’re creating a line and saying that everything on one side of the line belongs to one class and everything on the other side belongs to the other class. This line is called the decision boundary, and, when we use a single-layer perceptron, we can only produce one decision boundary.

In light of this new information, it doesn’t seem like perceptrons are useful! But, in practice, many problems are actually linearly separable. Hope is not lost for non-linearly separable problems, however! It can be shown that organizing multiple perceptrons into layers and using an intermediate layer, or hidden layer, can solve the XOR problem! This is the foundation of modern neural networks!
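To make that claim concrete, here is a small sketch with hand-picked (not learned) weights showing that wiring three such step-activation neurons together (an OR, a NAND, and an AND) computes XOR, even though no single neuron can:

def neuron(W, x):
    # step-activation neuron; W[0] is the bias, x is a 2-element binary input
    return 1 if W[0] + W[1] * x[0] + W[2] * x[1] >= 0 else 0

W_or, W_nand, W_and = [-1, 2, 2], [3, -2, -2], [-3, 2, 2]   # hand-picked weights

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden = [neuron(W_or, x), neuron(W_nand, x)]   # a "hidden layer" of two perceptrons
    print(x, '->', neuron(W_and, hidden))           # prints the XOR truth table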

Single-Layer Perceptron Code

Now that we have a good understanding of how perceptrons work, let’s take one more step and solidify the math into code. We’ll use object-oriented principles and create a class. In order to construct our perceptron, we need to know how many inputs there are to create our weight vector. The reason we add one to the input size is to include the bias in the weight vector.

import numpy as np

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size):
        self.W = np.zeros(input_size+1)

We’ll also need to implement our activation function. We can simply return 1 if the input is greater than or equal to 0 and 0 otherwise.

def activation_fn(self, x):
    return 1 if x >= 0 else 0

Finally, we need a function to run an input through the perceptron and return an output. Conventionally, this is called the prediction. We add the bias into the input vector. Then we can simply compute the inner product and apply the activation function.

def predict(self, x):
    x = np.insert(x, 0, 1)
    z = self.W.T.dot(x)
    a = self.activation_fn(z)
    return a

All of these are functions of the Perceptron class that we’ll use for perceptron learning.

Perceptron Learning Algorithm

We’ve defined a perceptron, but how do perceptrons learn? Rosenblatt, the creator of the perceptron, also had some thoughts on how to train neurons based on his intuition about biological neurons. Rosenblatt intuited a simple learning algorithm. His idea was to run each example input through the perceptron and, if the perceptron fires when it shouldn’t have, inhibit it. If the perceptron doesn’t fire when it should have, excite it.

How do we inhibit or excite? We change the weight vector (and bias)! The weight vector is a parameter to the perceptron: we need to keep changing it until we can correctly classify each of our inputs. With this intuition in mind, we need to write an update rule for our weight vector so that we can appropriately change it:

\[ w \leftarrow w + \Delta w \]

We have to determine a good $\Delta w$ that does what we want. First, we can define the error as the difference between the desired output d and the predicted output y.

\[ e = d - y \]

Notice that when d and y are the same (both are 0 or both are 1), we get 0! When they are different, (0 and 1 or 1 and 0), we can get either 1 or -1. This directly corresponds to exciting and inhibiting our perceptron! We multiply this with the input to tell our perceptron to change our weight vector in proportion to our input.

\[ w \leftarrow w + \eta\cdot e\cdot x \]

There is a hyperparameter $\eta$ that is called the learning rate. It is just a scaling factor that determines how large the weight vector updates should be. This is a hyperparameter because it is not learned by the perceptron (notice there’s no update rule for $\eta$!), but we select this parameter.

(For perceptrons, the Perceptron Convergence Theorem says that a perceptron will converge, given that the classes are linearly separable, regardless of the learning rate. But for other learning algorithms, this is a critical parameter!)

Let’s take another look at this update rule. When the error  is 0, i.e., the output is what we expect, then we don’t change the weight vector at all. When the error is nonzero, we update the weight vector accordingly.

Perceptron Learning Algorithm Code

With the update rule in mind, we can create a function to keep applying this update rule until our perceptron can correctly classify all of our inputs. We need to keep iterating through our training data until this happens; one epoch is when our perceptron has seen all of the training data once. Usually, we run our learning algorithm for multiple epochs.

Before we code the learning algorithm, we need to make some changes to our init function to add the learning rate and number of epochs as inputs.

def __init__(self, input_size, lr=1, epochs=10):
    self.W = np.zeros(input_size+1)
    # add one for bias
    self.epochs = epochs
    self.lr = lr

Now we can create a function that, given inputs and desired outputs, runs our perceptron learning algorithm. We keep updating the weights for a number of epochs, and iterate through the entire training set. We insert the bias into the input when performing the weight update. Then we can create our prediction, compute our error, and perform our update rule.

def fit(self, X, d):
    for _ in range(self.epochs):
        for i in range(d.shape[0]):
            y = self.predict(X[i])
            e = d[i] - y
            self.W = self.W + self.lr * e * np.insert(X[i], 0, 1)

The entire code for our perceptron is shown below.

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=1, epochs=100):
        self.W = np.zeros(input_size+1)
        # add one for bias
        self.epochs = epochs
        self.lr = lr
    
    def activation_fn(self, x):
        #return (x >= 0).astype(np.float32)
        return 1 if x >= 0 else 0

    def predict(self, x):
        z = self.W.T.dot(x)
        a = self.activation_fn(z)
        return a

    def fit(self, X, d):
        for _ in range(self.epochs):
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)
                y = self.predict(x)
                e = d[i] - y
                self.W = self.W + self.lr * e * x

Now that we have our perceptron coded, we can try to give it some training data and see if it works! One easy set of data to give is the AND gate. Here’s a set of inputs and outputs.

if __name__ == '__main__':
    X = np.array([
        [0, 0],
        [0, 1],
        [1, 0],
        [1, 1]
    ])
    d = np.array([0, 0, 0, 1])

    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)

In just a few lines, we can start using our perceptron! At the end, we print the weight vector. Using the AND gate data, we should get a weight vector of [-3, 2, 1]. This means that the bias is -3 and the weights are 2 and 1 for $x_1$ and $x_2$, respectively.

To verify this weight vector is correct, we can try going through a few examples. If both inputs are 0, then the pre-activation will be -3+0*2+0*1 = -3. When applying our activation function, we get 0, which is exactly 0 AND 0! We can try this for other gates as well. Note that this is not the only correct weight vector. Technically, if there exists a single weight vector that can separate the classes, there exist an infinite number of weight vectors. Which weight vector we get depends on how we initialize the weight vector.
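As a quick sanity check (these lines aren’t in the original listing), we can append a loop to the main block that runs all four inputs through the trained perceptron; note that this final version of predict expects the bias to already be inserted.

    # sanity check: the trained perceptron should reproduce the AND truth table
    for x in X:
        print(x, '->', perceptron.predict(np.insert(x, 0, 1)))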

To summarize, perceptrons are the simplest kind of neural network: they take in an input, weight each input, take the sum of weighted inputs, and apply an activation function. Since they were modeled from biological neurons by Frank Rosenblatt, they take and produce only binary values. In other words, we can perform binary classification using perceptrons. One limitation of perceptrons is that they can only solve linearly separable problems. In the real world, however, many problems are actually linearly separable. For example, we can use a perceptron to mimic an AND or OR gate. However, since XOR is not linearly separable, we can’t use single-layer perceptrons to create an XOR gate. The perceptron learning algorithm fits the intuition by Rosenblatt: inhibit if a neuron fires when it shouldn’t have, and excite if a neuron does not fire when it should have. We can take that simple principle and create an update rule for our weights to give our perceptron the ability of learning.

Perceptrons are the foundation of neural networks so having a good understanding of them now will be beneficial when learning about deep neural networks! This will also help as you pursue exciting developer opportunities.

Want to learn more Python in general? Check out our free course on Kivy!

]]>
Text Classification with Naive Bayes – Python Tutorial https://gamedevacademy.org/text-classification-tutorial-with-naive-bayes/ Sat, 24 Sep 2022 13:36:53 +0000 https://pythonmachinelearning.pro/?p=1766 Read more]]> If asked how your email provider sorts spam emails from legitimate emails, what would you say? Do you think you could build a spam filter yourself?

Hidden throughout our lives is one of the most prevalent uses for machine learning: text classification. Text classification allows us to do everything from identifying spam emails to being able to convert hand-written words into their plain text, digital equivalent.

In this tutorial, we’re going to discover how we can use the Naive Bayes technique to build our own text classification and build a spam filter using Python.

If you’re ready to expand how you see the digital world and text, let’s jump in!

Intro to Text Classification

The challenge of text classification is to attach labels to bodies of text, e.g., tax document, medical form, etc. based on the text itself. For example, think of your spam folder in your email. How does your email provider know that a particular message is spam or “ham” (not spam)? We’ll take a look at one natural language processing technique for text classification called Naive Bayes.

Download the full code here.


Bayes Theorem

(credit: https://en.wikipedia.org/wiki/Bayes%27_theorem#/media/File:Bayes%27_Theorem_MMB_01.jpg)

Before we derive the algorithm, we need to discuss the fundamental rule that Naive Bayes uses: Bayes Theorem:

\[ p(A|B) = \displaystyle\frac{p(B|A)p(A)}{p(B)} \]

where $A$ and $B$ are events and $p(\cdot)$ is a probability.

Let’s take a second to break this down. On the left, we have the probability of an event $A$ happening given that event $B$ happens. We say this is equal to the probability of event $B$ happening given event $A$ times the probability that event $A$ happens overall. All of that is divided by the probability that event $B$ happens overall. An example of this might help shed some light on why this is an ingenious theorem.

The classic example used to illustrate Bayes Theorem involves medical testing. Let’s suppose that we were getting tested for the flu. When we get a medical test, there are really 4 cases to consider when we get the results back:

  • True Positive: The test says we have the flu and we actually have the flu
  • True Negative: The test says we don’t have the flu and we actually don’t have the flu
  • False Positive: The test says we have the flu and we actually don’t have the flu
  • False Negative: The test says we don’t have the flu and we actually do have the flu

Suppose we also know some information about the flu and our testing methodology: we know our test can correctly detect that a person has the flu 99.5% of the time (i.e., $p(\text{+}|\text{Flu})=0.995$) and correctly detect that a person does not have the flu 99.5% of the time (i.e., $p(\text{-}|\text{No Flu})=0.995$). These correspond to the true positive rate and true negative rate. We also know that this specific type of flu is rare and only affects 1% of people. Given this information, we can compute the probability that a randomly selected person who tests positive actually has this specific type of the flu. In other words, we want to compute the probability that the person has the specific type of flu, given that the person tested positive for it, i.e., event $A=\text{Flu}$ and $B=\text{+}$.

Let’s just substitute the problem specifics into Bayes Theorem.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{p(\text{+}|\text{Flu})p(\text{Flu})}{p(\text{+})} \]

Now let’s try to figure out specific values for the quantities on the right-hand side. The first quantity is $p(\text{+}|\text{Flu})$. This is the probability that someone tests positive given that they have the flu. In other words, this is the true positive rate: the probability that our test can correctly detect that a person has the flu! This number is 99.5%, or 0.995. The next quantity in the numerator is $p(\text{Flu})$. This is called the prior probability. In other words, it is the probability that any random person has the flu. We know from our problem that this number is 1%, or 0.01. Let’s substitute those values into the numerator.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{p(\text{+})} \]

Now we have to deal with the denominator: $p(\text{+})$. This is the probability that our test returns positive overall. We can’t quite use the information given in the problem as directly as before however. But first, why do we even need $p(\text{+})$? Recall that probabilities have to be between 0 and 1. Based on the above equation, if we left out the denominator, then we wouldn’t have a valid probability!

Anyways, when can our test return positive? Well, there are two cases: either our test returns positive and the person actually has the flu (true positive), or our test returns positive and the person does not have the flu (false positive). We can’t simply sum both of these cases to get the denominator. We have to weight them by their respective probabilities, i.e., the probability that any person has the flu overall and the probability that any person does not have the flu overall. Let’s expand the denominator.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{p(\text{+}|\text{Flu})p(\text{Flu})+p(\text{+}|\text{No Flu})p(\text{No Flu})} \]

Now let’s reason about these values. $p(\text{+}|\text{Flu})p(\text{Flu})$ is something we’ve seen before: it’s the numerator! Now let’s look at the next quantity: $p(\text{+}|\text{No Flu})p(\text{No Flu})$. We can compute the first term by taking the complement of the true negative rate: $p(\text{+}|\text{No Flu})=1-p(\text{-}|\text{No Flu})=0.005$. And $p(\text{No Flu})=1-p(\text{Flu})=0.99$ since they are complementary events. So now we can plug in all of our values and get a result.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{0.995 \cdot 0.01+0.005 \cdot 0.99}= 0.6678\]

This result is a little surprising! This is saying, despite our test’s accuracy, knowing someone tested positive means that there’s only a 67% chance that they actually have the flu! Hopefully, this example illustrated how to use Bayes Theorem.
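If you’d like to double-check that arithmetic, here is the same computation in a few lines of Python (just the problem’s numbers hard-coded, nothing more):

p_pos_given_flu, p_flu = 0.995, 0.01
p_pos_given_no_flu, p_no_flu = 1 - 0.995, 1 - 0.01

p_flu_given_pos = (p_pos_given_flu * p_flu) / (p_pos_given_flu * p_flu + p_pos_given_no_flu * p_no_flu)
print(p_flu_given_pos)   # roughly 0.6678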

Deriving Naive Bayes

Now let’s convert the Bayes Theorem notation into something slightly more machine learning-oriented.

\[ p(H|E) = \displaystyle\frac{p(E|H)p(H)}{p(E)} \]

where $H$ is the hypothesis and $E$ is the evidence. Now this might make more sense in the context of text classification: the probability that our hypothesis is correct given the evidence to support it is equal to the probability of observing that evidence given our hypothesis times the prior probability of the hypothesis divided by the probability of observing that evidence overall.

Let’s break this down again like we did for the original Bayes Theorem, except we’ll use the context of the text classification problem we’re trying to solve: spam detection. Our hypothesis $H$ is something like “this text is spam” and the evidence $E$ is the text of the email. So to restate, we’re trying to find the probability that our email is spam given the text in the email. The numerator is then the probability that we find these words in a spam email times the probability that any email is spam. The denominator is a bit tricky: it’s the probability that we observe those words overall.

There’s something a bit off with this formulation though: the evidence needs to be represented as multiple pieces of evidence: the words $w_1,\dots,w_n$. No problem! We can do that and Bayes Theorem still holds. We can also change hypothesis $H$ to a class $\text{Spam}$.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1,\dots,w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

Excellent! We can use a conditional probability formula to expand out the numerator.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1|w_2,\dots,w_n,\text{Spam})p(w_2|w_3,\dots,w_n,\text{Spam})\dots p(w_{n-1}|w_n,\text{Spam})p(w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

Not only does this look messy, it’s also quite messy to compute! Let’s think about the first term: $p(w_1|w_2,\dots,w_n,\text{Spam})$. This is the probability of finding the first word, given all of the other words and given that the email is spam. This is really difficult to compute if we have a lot of words!

Naive Bayes Assumption

To help us with that equation, we can make an assumption called the Naive Bayes assumption to help us with the math, and eventually the code. The assumption is that each word is independent of all other words. In reality, this is not true! Knowing what words come before/after do influence the next/previous word! However, making this assumption greatly simplifies the math and, in practice, works well! This assumption is why this technique is called Naive Bayes. So after making that assumption, we can break down the numerator into the following.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1|\text{Spam})p(w_2|\text{Spam})\dots p(w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

This looks better! Now we can interpret a term $p(w_1|\text{Spam})$ to mean the probability of finding word $w_1$ in a spam email. We can use a notational shorthand to symbolize product ($\Pi$).

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam})}{p(w_1,\dots,w_n)} \]

This is the Naive Bayes formulation! This returns the probability that an email message is spam given the words in that email. For text classification, however, we need an actual label, not a probability, so we simply say that an email is spam if $p(\text{Spam}|w_1,\dots,w_n)$ is greater than 50%. If not, then it is not spam. In other words, we choose “spam” or “ham” based on which one of these two classes has the higher probability! Actually, we don’t need probabilities at all. We can forget about the denominator since its only purpose is to scale the numerator.

\[ p(\text{Spam}|w_1,\dots,w_n) \propto p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam}) \]

(where $\propto$ signifies proportional to) That’s one extra thing we don’t have to compute! In this instance, we pick whichever class has the higher score since this is not a true probability anymore.

Numerical Stability

There’s one extra thing we’re going to do to help us with numerical stability. If we look at the numerator, we see we’re multiplying many probabilities together. If we do that, we could end up with really small numbers, and our computer might round down to zero! To prevent this, we’re going to look at the log probability by taking the log of each side. Using some properties of logarithms, we can manipulate our Naive Bayes formulation.

\begin{align*}
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam})\\
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam}) + \log \displaystyle\prod_{i=1}^np(w_i|\text{Spam})\\
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam}) + \displaystyle\sum_{i=1}^n \log p(w_i|\text{Spam})
\end{align*}

Now we’re dealing with additions of log probabilities instead of multiplying many probabilities together! Since log has really nice properties (monotonicity being the key one), we can still take the highest score to be our prediction, i.e., we don’t have to “undo” the log!

Dataset

We’ll be using the Enron email dataset for our training data. This is real email data from the Enron Corporation after the company collapsed. Before starting, download all of the numbered folders, i.e., enron1, enron2, etc., here and put all of them in a top-level folder simply called enron. Put this enron folder in the same directory as your source code so we can find the dataset!

A WORD OF WARNING!: Since this dataset is a real dataset of emails, it contains real spam messages. Your anti-virus may prune some of these emails because they are spam. Let your anti-virus prune as many as it wants. This will not affect our code as long as there are some spam and ham messages still there!

Naive Bayes Code

Here is the dataset-loading code:

import os
import re
import string
import math

DATA_DIR = 'enron'
target_names = ['ham', 'spam']

def get_data(DATA_DIR):
	subfolders = ['enron%d' % i for i in range(1,7)]

	data = []
	target = []
	for subfolder in subfolders:
		# spam
		spam_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'spam'))
		for spam_file in spam_files:
			with open(os.path.join(DATA_DIR, subfolder, 'spam', spam_file), encoding="latin-1") as f:
				data.append(f.read())
				target.append(1)

		# ham
		ham_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'ham'))
		for ham_file in ham_files:
			with open(os.path.join(DATA_DIR, subfolder, 'ham', ham_file), encoding="latin-1") as f:
				data.append(f.read())
				target.append(0)

	return data, target

This will produce two lists: the data list, where each element is the text of an email, and the target list, which is simply binary (1 meaning spam and 0 meaning ham). Now let’s create a class and add some helper functions for string manipulation.

class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""
    def clean(self, s):
        translator = str.maketrans("", "", string.punctuation)
        return s.translate(translator)

    def tokenize(self, text):
        text = self.clean(text).lower()
        return re.split(r"\W+", text)

    def get_word_counts(self, words):
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts

We have a function to clean up our string by removing punctuation, one to tokenize our string into words, and another to count up how many of each word appears in a list of words.
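To see what these helpers do, here is a quick made-up example of running a short string through them:

sd = SpamDetector()
words = sd.tokenize("Free money!!! Claim your FREE prize now.")
print(words)
# ['free', 'money', 'claim', 'your', 'free', 'prize', 'now']
print(sd.get_word_counts(words))
# {'free': 2.0, 'money': 1.0, 'claim': 1.0, 'your': 1.0, 'prize': 1.0, 'now': 1.0}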

Before we start the actual algorithm, let’s first understand the algorithm. For training, we need three things: the (log) class priors, i.e., the probability that any given message is spam/ham; a vocabulary of words; and word frequencies for spam and ham separately, i.e., the number of times a given word appears in spam and in ham messages. Given a list of input documents, we can write this algorithm.

  1. Compute log class priors by counting how many messages are spam/ham, dividing by the total number of messages, and taking the log.
  2. For each (document, label) pair, tokenize the document into words.
  3. For each word, add it to the vocabulary for spam/ham if it isn’t already there, and update its count. Also add that word to the global vocabulary.
def fit(self, X, Y):
    self.num_messages = {}
    self.log_class_priors = {}
    self.word_counts = {}
    self.vocab = set()

    n = len(X)
    self.num_messages['spam'] = sum(1 for label in Y if label == 1)
    self.num_messages['ham'] = sum(1 for label in Y if label == 0)
    self.log_class_priors['spam'] = math.log(self.num_messages['spam'] / n)
    self.log_class_priors['ham'] = math.log(self.num_messages['ham'] / n)
    self.word_counts['spam'] = {}
    self.word_counts['ham'] = {}

    for x, y in zip(X, Y):
        c = 'spam' if y == 1 else 'ham'
        counts = self.get_word_counts(self.tokenize(x))
        for word, count in counts.items():
            if word not in self.vocab:
                self.vocab.add(word)
            if word not in self.word_counts[c]:
                self.word_counts[c][word] = 0.0

            self.word_counts[c][word] += count

First, we can compute the log class priors by counting up how many spam/ham messages are in our dataset and dividing by the total number. Finally, we take the log.

Then we can iterate through our dataset. For each input, we get the word counts and iterate through each (word, frequency) pair. If the word isn’t in our global vocabulary, we add it. If it isn’t in the vocabulary for that particular class label, we also add it along with the frequency.

For example, suppose we had a “spam” message. We count up how many times each unique word appears in that spam message and add that count to the “spam” vocabulary. Suppose the word “free” appears 4 times. Then we add the word “free” to our global vocabulary and add it to the “spam” vocabulary with a count of 4.

We’re keeping track of the frequency of each word as it appears in either a spam or ham message. For example, we expect the word “free” to appear in both messages, but we expect it to be more frequent in the “spam” vocabulary than the “ham” vocabulary.

Now that we’ve extracted all of the data we need from the training data, we can write another function to actually output the class label for new data. To do this classification, we apply Naive Bayes directly. For example, given a document, we need to iterate each of the words and compute $\log p(w_i|\text{Spam})$ and sum them all up, and we also compute $\log p(w_i|\text{Ham})$ and sum them all up. Then we add the log class priors and check to see which score is bigger for that document. Whichever is larger, that is the predicted label!

To compute $\log p(w_i|\text{Spam})$, we take the number of times we’ve seen $w_i$ in “spam” messages and divide by the total count of all words across every “spam” message.

On additional note: remember that the log of 0 is undefined! What if we encounter a word that is in the “spam” vocabulary, but not the “ham” vocabulary? Then $p(w_i|\text{Ham})$ will be 0! One way around this is to use Laplace Smoothing. We simply add 1 to the numerator, but we also have to add the size of the vocabulary to the denominator to balance it.

def predict(self, X):
    result = []
    # Total number of words seen in each class (the denominators described above)
    total_spam_words = sum(self.word_counts['spam'].values())
    total_ham_words = sum(self.word_counts['ham'].values())
    for x in X:
        counts = self.get_word_counts(self.tokenize(x))
        spam_score = 0
        ham_score = 0
        for word, _ in counts.items():
            if word not in self.vocab: continue
            
            # add Laplace smoothing: +1 in the numerator, +|vocab| in the denominator
            log_w_given_spam = math.log( (self.word_counts['spam'].get(word, 0.0) + 1) / (total_spam_words + len(self.vocab)) )
            log_w_given_ham = math.log( (self.word_counts['ham'].get(word, 0.0) + 1) / (total_ham_words + len(self.vocab)) )

            spam_score += log_w_given_spam
            ham_score += log_w_given_ham

        # add the log class priors and pick whichever total score is larger
        spam_score += self.log_class_priors['spam']
        ham_score += self.log_class_priors['ham']

        if spam_score > ham_score:
            result.append(1)
        else:
            result.append(0)
    return result

In our case, the input can be a list of document texts; we return a list of predictions. Finally, we can use the class like this.

if __name__ == '__main__':
    X, y = get_data(DATA_DIR)
    MNB = SpamDetector()
    MNB.fit(X[100:], y[100:])

    pred = MNB.predict(X[:100])
    true = y[:100]

    accuracy = sum(1 for i in range(len(pred)) if pred[i] == true[i]) / float(len(pred))
    print("{0:.4f}".format(accuracy))

We’re reserving the first 100 for the testing set, “train” our Naive Bayes classifier, then compute the accuracy.

To recap, we reviewed Bayes Theorem and demonstrated how to use it with an example. Then we re-worked it using hypotheses and evidence instead of just events $A$ and $B$ to make it more specific to our task of spam detection. From there, we derived Naive Bayes by making the Naive Bayes Assumption that each word appears independently of all other words. Then we formulated a prediction equation/rule. Using the Enron dataset, we created a binary Naive Bayes classifier for detecting spam emails.

Naive Bayes is a simple text classification algorithm that uses basic probability laws and works quite well in practice! It is also an important skill set used in the professional industry, so having these skills down pat will have immense implications for your career as well!

]]>
An Introduction to Image Recognition https://gamedevacademy.org/image-recognition-guide/ Sat, 31 Oct 2020 15:00:35 +0000 https://pythonmachinelearning.pro/?p=2698 Read more]]>

You can access the full course here: Convolutional Neural Networks for Image Classification

Intro to Image Recognition

Let’s get started by learning a bit about the topic itself. Image recognition is, at its heart, image classification so we will use these terms interchangeably throughout this course. We see images or real-world items and we classify them into one (or more) of many, many possible categories. The categories used are entirely up to use to decide. For example, we could divide all animals into mammals, birds, fish, reptiles, amphibians, or arthropods. Alternatively, we could divide animals into carnivores, herbivores, or omnivores. Perhaps we could also divide animals into how they move such as swimming, flying, burrowing, walking, or slithering. There are potentially endless sets of categories that we could use.

Among categories, we divide things based on a set of characteristics. When categorizing animals, we might choose characteristics such as whether they have fur, hair, feathers, or scales. Maybe we look at the shape of their bodies or go more specific by looking at their teeth or how their feet are shaped. Once again, there are potentially endless characteristics we could choose to look for.

Analogies aside, the main point is that in order for classification to work, we have to determine a set of categories into which we can class the things we see and the set of characteristics we use to make those classifications. This allows us to then place everything that we see into one of the categories or perhaps say that it belongs to none of the categories. The more categories we have, the more specific we have to be. It’s easier to say something is either an animal or not an animal but it’s harder to say what group of animals an animal may belong to. However complicated, this classification allows us to not only recognize things that we have seen before, but also to place new things that we have never seen. Good image recognition models will perform well even on data they have never seen before (or any machine learning model, for that matter).

How do we Perform Image Recognition?

We do a lot of this image classification without even thinking about it. For starters, we choose what to ignore and what to pay attention to. This actually presents an interesting part of the challenge: picking out what’s important in an image. We see everything but only pay attention to some of that so we tend to ignore the rest or at least not process enough information about it to make it stand out. Knowing what to ignore and what to pay attention to depends on our current goal. For example, if we were walking home from work, we would need to pay attention to cars or people around us, traffic lights, street signs, etc. but wouldn’t necessarily have to pay attention to the clouds in the sky or the buildings or wildlife on either side of us. On the other hand, if we were looking for a specific store, we would have to switch our focus to the buildings around us and perhaps pay less attention to the people around us.

The same thing occurs when asked to find something in an image. We decide what features or characteristics make up what we are looking for and we search for those, ignoring everything else. This is easy enough if we know what to look for but it is next to impossible if we don’t understand what the thing we’re searching for looks like. 

This brings to mind the question: how do we know what the thing we're searching for looks like? There are two main mechanisms: either we see an example of what to look for and can determine what features are important from that (or are told what to look for verbally), or we already have an abstract understanding of what the thing we're looking for should look like. For example, if you've ever played "Where's Waldo?", you are shown what Waldo looks like so you know to look out for the glasses, red and white striped shirt and hat, and the cane. To the uninitiated, "Where's Waldo?" is a search game where you are looking for a particular character hidden in a very busy image. I'd definitely recommend checking it out. However, if we were given an image of a farm and told to count the number of pigs, most of us would know what a pig is and wouldn't have to be shown. That's because we've memorized the key characteristics of a pig: smooth pink skin, 4 legs with hooves, curly tail, flat snout, etc. We don't need to be taught because we already know.

This logic applies to almost everything in our lives. We learn fairly young how to classify things we haven’t seen before into categories that we know based on features that are similar to things within those categories. If we come across something that doesn’t fit into any category, we can create a new category. For example, there are literally thousands of models of cars; more come out every year. However, we don’t look at every model and memorize exactly what it looks like so that we can say with certainty that it is a car when we see it. We know that the new cars look similar enough to the old cars that we can say that the new models and the old models are all types of car. 

By now, we should understand that image recognition is really image classification; we fit everything that we see into categories based on characteristics, or features, that they possess. We’re intelligent enough to deduce roughly which category something belongs to, even if we’ve never seen it before. If something is so new and strange that we’ve never seen anything like it and it doesn’t fit into any category, we can create a new category and assign membership within that. The next question that comes to mind is: how do we separate objects that we see into distinct entities rather than seeing one big blur?

The somewhat annoying answer is that it depends on what we’re looking for. If we look at an image of a farm, do we pick out each individual animal, building, plant, person, and vehicle and say we are looking at each individual component or do we look at them all collectively and decide we see a farm? Okay, let’s get specific then. Let’s say we aren’t interested in what we see as a big picture but rather what individual components we can pick out. How do we separate them all? 

The key here is in contrast. Generally, we look for contrasting colours and shapes; if two items side by side are very different colours or one is angular and the other is smooth, there’s a good chance that they are different objects. Although this is not always the case, it stands as a good starting point for distinguishing between objects. 

Coming back to the farm analogy, we might pick out a tree based on a combination of browns and greens: brown for the trunk and branches and green for the leaves. Of course this is just a generality because not all trees are green and brown and trees come in many different shapes and colours but most of us are intelligent enough to be able to recognize a tree as a tree even if it looks different. We could find a pig due to the contrast between its pink body and the brown mud it's playing in. We could recognize a tractor based on its square body and round wheels. This is why colour-camouflage works so well; if a tree trunk is brown and a moth with wings the same shade of brown as the tree sits on the tree trunk, it's difficult to see the moth because there is no colour contrast.

Another amazing thing that we can do is determine what object we’re looking at by seeing only part of that object. This is really high level deductive reasoning and is hard to program into computers. This is one of the reasons it’s so difficult to build a generalized artificial intelligence but more on that later. As long as we can see enough of something to pick out the main distinguishing features, we can tell what the entire object should be. For example, if we see only one eye, one ear, and a part of a nose and mouth, we know that we’re looking at a face even though we know most faces should have two eyes, two ears, and a full mouth and nose. 

Although we don’t necessarily need to think about all of this when building an image recognition machine learning model, it certainly helps give us some insight into the underlying challenges that we might face. If nothing else, it serves as a preamble into how machines look at images. The main problem is that we take these abilities for granted and perform them without even thinking but it becomes very difficult to translate that logic and those abilities into machine code so that a program can classify images as well as we can. This is just the simple stuff; we haven’t got into the recognition of abstract ideas such as recognizing emotions or actions but that’s a much more challenging domain and far beyond the scope of this course.

How do Machines Interpret Images?

The previous topic was meant to get you thinking about how we look at images and contrast that against how machines look at images. We’ll see that there are similarities and differences and by the end, we will hopefully have an idea of how to go about solving image recognition using machine code. 

Let’s start by examining the first thought: we categorize everything we see based on features (usually subconsciously) and we do this based on characteristics and categories that we choose. The number of characteristics to look out for is limited only by what we can see and the categories are potentially infinite. This is different for a program as programs are purely logical. As of now, they can only really do what they have been programmed to do which means we have to build into the logic of the program what to look for and which categories to choose between

This is a very important notion to understand: as of now, machines can only do what they are programmed to do. If we build a model that finds faces in images, that is all it can do. It won’t look for cars or trees or anything else; it will categorize everything it sees into a face or not a face and will do so based on the features that we teach it to recognize. This means that the number of categories to choose between is finite, as is the set of features we tell it to look for. We can tell a machine learning model to classify an image into multiple categories if we want (although most choose just one) and for each category in the set of categories, we say that every input either has that feature or doesn’t have that feature. Machine learning helps us with this task by determining membership based on values that it has learned rather than being explicitly programmed but we’ll get into the details later.

Often the inputs and outputs will look something like this:

Input: [ 1 1 0 0 0 1 0 0 1 0 ] 

Output: [ 0 0 1 0 0 ]

In the above example, we have 10 features. A 1 means that the object has that feature and a 0 means that it does not so this input has features 1, 2, 6, and 9 (whatever those may be). We have 5 categories to choose between. A 1 in that position means that it is a member of that category and a 0 means that it is not so our object belongs to category 3 based on its features. This form of input and output is called one-hot encoding and is often seen in classification models. Realistically, we don't usually see exactly 1s and 0s (especially in the outputs). We should see numbers close to 1 and close to 0 and these represent certainties or percent chances that our outputs belong to those categories. For example, if the above output came from a machine learning model, it may look something more like this:

[ 0.01 0.02 0.95 0.01 0.01]

This means that there is a 1% chance the object belongs to the 1st, 4th, and 5th categories, a 2% chance it belongs to the 2nd category, and a 95% chance that it belongs to the 3rd category. It can be nicely demonstrated in this image:

Visual of how machine learning identifies objects
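To make those numbers concrete, here is a tiny sketch (using NumPy and the made-up confidences from the example above) of turning a model's output into a single predicted category:

import numpy as np

# Hypothetical model output: one confidence value per category
output = np.array([0.01, 0.02, 0.95, 0.01, 0.01])

# The predicted category is simply the position with the highest confidence
predicted = np.argmax(output)
print(predicted)  # 2, i.e. the 3rd category (counting from zero)

# The equivalent "hard" one-hot label
one_hot = np.zeros_like(output)
one_hot[predicted] = 1
print(one_hot)  # [0. 0. 1. 0. 0.]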

How do Machines Interpret Images?

This provides a nice transition into how computers actually look at images. To a computer, it doesn't matter whether it is looking at a real-world object through a camera in real time or whether it is looking at an image it downloaded from the internet; it breaks them both down the same way. Essentially, an image is just a matrix of bytes that represent pixel values. When it comes down to it, all data that machines read whether it's text, images, videos, audio, etc. is broken down into a list of bytes and is then interpreted based on the type of data it represents.

For images, each byte is a pixel value but there are up to 4 pieces of information encoded for each pixel. Grey-scale images are the easiest to work with because each pixel value just represents a certain amount of "whiteness". Because they are bytes, values range between 0 and 255 with 0 being the least white (pure black) and 255 being the most white (pure white). Everything in between is some shade of grey. With colour images, there are additional red, green, and blue values encoded for each pixel (so 4 times as much info in total). Each of those values is between 0 and 255 with 0 being the least and 255 being the most. If a model sees a bunch of pixels with very low values clumped together, it will conclude that there is a dark patch in the image and vice versa.
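As a quick illustration, here is a minimal sketch of what those pixel values look like as arrays (using NumPy, which is how image data is commonly handled in Python; the tiny sizes here are made up):

import numpy as np

# A tiny 5x5 grey-scale image: one byte (0-255) per pixel, 0 = black, 255 = white
grey = np.full((5, 5), 255, dtype=np.uint8)
print(grey.shape)    # (5, 5) -> 25 values

# A tiny 5x5 colour image: red, green, and blue bytes per pixel
# (some formats add a 4th alpha/transparency channel)
colour = np.zeros((5, 5, 3), dtype=np.uint8)
colour[0, 0] = [255, 0, 0]   # top-left pixel set to pure red
print(colour.shape)  # (5, 5, 3) -> 75 values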

Below is a very simple example. An image of a 1 might look like this:

Hand drawn 1 digit

And have this as the pixel values:

[[255, 255, 255, 255, 255],

 [255, 255, 0, 255, 255],

 [255, 255, 0, 255, 255],

 [255, 255, 0, 255, 255],

 [255, 255, 255, 255, 255]]

This is definitely scaled way down but you can see a clear line of black pixels in the middle of the image data (0) with the rest of the pixels being white (255).

Images have 2 dimensions to them: height and width. These are represented by rows and columns of pixels, respectively. In this way, we can map each pixel value to a position in the image matrix (2D array so rows and columns). Machines don’t really care about the dimensionality of the image; most image recognition models flatten an image matrix into one long array of pixels anyway so they don’t care about the position of individual pixel values. Rather, they care about the position of pixel values relative to other pixel values. They learn to associate positions of adjacent, similar pixel values with certain outputs or membership in certain categories. In the above example, a program wouldn’t care that the 0s are in the middle of the image; it would flatten the matrix out into one long array and say that, because there are 0s in certain positions and 255s everywhere else, we are likely feeding it an image of a 1. The same can be said with coloured images. If a model sees pixels representing greens and browns in similar positions, it might think it’s looking at a tree (if it had been trained to look for that, of course). 
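Here is a small sketch of that flattening step, reusing the 5x5 image of a 1 from above (again with NumPy; a real image would just be a much bigger array):

import numpy as np

image = np.array([[255, 255, 255, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255, 255, 255, 255]])

# Most simple models see the image as one long row of pixel values
flat = image.flatten()
print(flat.shape)               # (25,)
print(np.where(flat == 0)[0])   # positions of the black pixels: [ 7 12 17]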

This is also how image recognition models address the problem of distinguishing between objects in an image; they can recognize the boundaries of an object in an image when they see drastically different values in adjacent pixels. A machine learning model essentially looks for patterns of pixel values that it has seen before and associates them with the same outputs. It does this during training; we feed images and the respective labels into the model and over time, it learns to associate pixel patterns with certain outputs. If a model sees many images with pixel values that denote a straight black line with white around it and is told the correct answer is a 1, it will learn to map that pattern of pixels to a 1.

This is great when dealing with nicely formatted data. If we feed a model a lot of data that looks similar then it will learn very quickly. The problem then comes when an image looks slightly different from the rest but has the same output. Consider again the image of a 1. It could be drawn at the top or bottom, left or right, or center of the image. It could have a left or right slant to it. It could look like this: 1 or this l. This is a big problem for a poorly-trained model because it will only be able to recognize nicely-formatted inputs that are all of the same basic structure but there is a lot of randomness in the world. We need to be able to take that into account so our models can perform practically well. This is why we must expose a model to as many different kinds of inputs as possible so that it learns to recognize general patterns rather than specific ones. There are tools that can help us with this and we will introduce them in the next topic.
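As a crude preview of those tools (real pipelines use proper data-augmentation utilities; this is only an illustration with NumPy), we could generate extra training examples ourselves by shifting the digit around the frame:

import numpy as np

# the 5x5 "1" from the earlier sketch
image = np.array([[255, 255, 255, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255, 255, 255, 255]])

shifted_left = np.roll(image, -1, axis=1)  # slide every column one step to the left
shifted_down = np.roll(image, 1, axis=0)   # slide every row one step down
# Both still contain the same vertical stroke, just in a different position, so feeding
# them in as extra examples of a "1" makes the model less sensitive to exact position.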

Hopefully by now you understand how image recognition models identify images and some of the challenges we face when trying to teach these models. Models can only look for features that we teach them to and choose between categories that we program into them. To machines, images are just arrays of pixel values and the job of a model is to recognize patterns that it sees across many instances of similar images and associate them with specific outputs. We need to teach machines to look at images more abstractly rather than looking at the specifics to produce good results across a wide domain. Next up we will learn some ways that machines help to overcome this challenge to better recognize images. In the meantime, though, consider browsing our article on just what sort of job opportunities await you should you pursue these exciting Python topics!

 

Transcript

What is up, guys? Welcome to the first tutorial in our image recognition course. This is also the very first topic, and is just going to provide a general intro into image recognition. Now we’re going to cover two topics specifically here. One will be, “What is image recognition?” and the other will be, “What tools can help us to solve image recognition?”

The first part, which will be this video, will be all about introducing the problem of image recognition, talk about how we solve the problem of image recognition in our day-to-day lives, and then we’ll go onto explore this from a machine’s point of view. After that, we’ll talk about the tools specifically that machines use to help with image recognition. Specifically, we’ll be looking at convolutional neural networks, but a bit more on that later.

Let’s get started with, “What is image recognition?” Image recognition is seeing an object or an image of that object and knowing exactly what it is. At the very least, even if we don’t know exactly what it is, we should have a general sense for what it is based on similar items that we’ve seen. Essentially, we class everything that we see into certain categories based on a set of attributes. That’s why image recognition is often called image classification, because it’s essentially grouping everything that we see into some sort of a category.

Now the attributes that we use to classify images is entirely up to us. For example, if we’re looking at different animals, we might use a different set of attributes versus if we’re looking at buildings or let’s say cars, for example. If we’re looking at vehicles, we might be taking a look at the shape of the vehicle, the number of windows, the number of wheels, et cetera. If we’re looking at animals, we might take into consideration the fur or the skin type, the number of legs, the general head structure, and stuff like that. It’s entirely up to us which attributes we choose to classify items. And this could be real-world items as well, not necessarily just images.

Now, this allows us to categorize something that we haven’t even seen before. In fact, this is very powerful. We can take a look at something that we’ve literally never seen in our lives, and accurately place it in some sort of a category. We can often see this with animals. I highly doubt that everyone has seen every single type of animal there is to see out there. No doubt there are some animals that you’ve never seen before in your lives. But, you should, by looking at it, be able to place it into some sort of category. You should know that it’s an animal. You should have a general sense for whether it’s a carnivore, omnivore, herbivore, and so on and so forth.

Now, another example of this is models of cars. Now, every single year, there are brand-new models of cars coming out, some which we’ve never seen before. Some look so different from what we’ve seen before, but we recognize that they are all cars. We can take a look again at the wheels of the car, the hood, the windshield, the number of seats, et cetera, and just get a general sense that we are looking at some sort of a vehicle, even if it’s not like a sedan, or a truck, or something like that.

Now, how does this work for us? Well, a lot of the time, image recognition actually happens subconsciously. In fact, we rarely think about how we know what something is just by looking at it. We just kinda take a look at it, and we know instantly kind of what it is. And a big part of this is the fact that we don’t necessarily acknowledge everything that is around us. If we do need to notice something, then we can usually pick it out and define and describe it.

Take, for example, if you’re walking down the street, especially if you’re walking a route that you’ve walked many times. It’s highly likely that you don’t pay attention to everything around you. Maybe there’s stores on either side of you, and you might not even really think about what the stores look like, or what’s in those stores. However, when you go to cross the street, you become acutely aware of the other people around you, of the cars around you, because those are things that you need to notice. In fact, even if it’s a street that we’ve never seen before, with cars and people that we’ve never seen before, we should have a general sense for what to do. The light turns green, we go, if there’s a car driving in front of us, probably shouldn’t walk into it, and so on and so forth.

Now, this kind of process of knowing what something is is typically based on previous experiences. If we’d never come into contact with cars, or people, or streets, we probably wouldn’t know what to do. However, we’ve definitely interacted with streets and cars and people, so we know the general procedure. So, go on a green light, stop on a red light, so on and so forth, and that’s because that’s stuff that we’ve seen in the past. Even if we haven’t seen that exact version of it, we kind of know what it is because we’ve seen something similar before.

Now, sometimes this is done through pure memorization. Maybe we look at a specific object, or a specific image, over and over again, and we know to associate that with an answer. This is just kind of rote memorization. However, the more powerful ability is being able to deduce what an item is based on some similar characteristics when we’ve never seen that item before. And that’s really the challenge.

It’s easy enough to program in exactly what the answer is given some kind of input into a machine. You could just use like a map or a dictionary for something like that. However, the challenge is in feeding it similar images, and then having it look at other images that it’s never seen before, and be able to accurately predict what that image is. Now, this kind of a problem is actually two-fold. The problem is first deducing that there are multiple objects in your field of vision, and the second is then recognizing each individual object.

So, step number one, how are we going to actually recognize that there are different objects around us? Typically, we do this based on borders that are defined primarily by differences in color. This makes sense. If we’ve seen something that camouflages into something else, probably the colors are very similar, so it’s just hard to tell them apart, it’s hard to place a border on one specific item. However, if you see, say, a skyscraper outlined against the sky, there’s usually a difference in color. It’s very easy to see the skyscraper, maybe, let’s say, brown, or black, or gray, and then the sky is blue. So there’s that sharp contrast in color, therefore we can say, ‘Okay, there’s obviously something in front of the sky.’

Now, again, another example is it’s easy to see a green leaf on a brown tree, but let’s say we see a black cat against a black wall. We might not even be able to tell it’s there at all, unless it opens its eyes, or maybe even moves. Now, we don’t necessarily need to look at every single part of an image to know what some part of it is. Take, for example, if you have an image of a landscape, okay, so there’s maybe some trees in the background, there’s a house, there’s a farm, or something like that, and someone asks you to point out the house. Well, you don’t even need to look at the entire image, it’s just as soon as you see the bit with the house, you know that there’s a house there, and then you can point it out.

This is even more powerful when we don’t even get to see the entire image of an object, but we still know what it is. Take, for example, an image of a face. Let’s say we’re only seeing a part of a face. Specifically, we only see, let’s say, one eye and one ear. But we still know that we’re looking at a person’s face based on the color, the shape, the spacing of the eye and the ear, and just the general knowledge that a face, or at least a part of a face, looks kind of like that. Our brain fills in the rest of the gap, and says, ‘Well, we’ve seen faces, a part of a face is contained within this image, therefore we know that we’re looking at a face.’

That’s, again, a lot more difficult to program into a machine because it may have only seen images of full faces before, and so it gets a part of a face, and it doesn’t know what to do. No longer are we looking at two eyes, two ears, the mouth, et cetera. We’re only looking at a little bit of that.

Now, before we talk about how machines process this, I’m just going to kind of summarize this section, we’ll end it, and then we’ll cover the machine part in a separate video, because I do wanna keep things a bit shorter, there’s a lot to process here. So some of the key takeaways are the fact that a lot of this kinda image recognition classification happens subconsciously. We just look at an image of something, and we know immediately what it is, or kind of what to look out for in that image. Obviously this gets a bit more complicated when there’s a lot going on in an image.

Also, image recognition, the problem of it is kinda two-fold. The first is recognizing where one object ends and another begins, so kinda separating out the object in an image, and then the second part is actually recognizing the individual pieces of an image, putting them together, and recognizing the whole thing. Also, know that it’s very difficult for us to program in the ability to recognize a whole part of something based on just seeing a single part of it, but it’s something that we are naturally very good at.

Okay, so, think about that stuff, stay tuned for the next section, which will kind of talk about how machines process images, and that’ll give us insight into how we’ll go about implementing the model. Okay, so thanks for watching, we’ll see you guys in the next one.

What’s up guys? Welcome to the second tutorial in our image recognition course. Here we’re going to continue on with how image recognition works, but we’re going to explore it from a machine standpoint now. We just finished talking about how humans perform image recognition or classification, so we’ll compare and contrast this process in machines.

For starters, contrary to popular belief, machines do not have infinite knowledge of what everything they see is. So, let’s say we’re building some kind of program that takes images or scans its surroundings. Well, it’s going to take in all that information, and it may store it and analyze it, but it doesn’t necessarily know what everything it sees it. It might not necessarily be able to pick out every object.

Machines only have knowledge of the categories that we have programmed into them and taught them to recognize. And, actually, this goes beyond just image recognition, machines, as of right now at least, can only do what they’re programmed to do. So this means, if we’re teaching a machine learning image recognition model, to recognize one of 10 categories, it’s never going to recognize anything else, outside of those 10 categories.

Now, a simple example of this, is creating some kind of a facial recognition model, and its only job is to recognize images of faces and say, “Yes, this image contains a face,” or, “no, it doesn’t.” So basically, it classifies everything it sees into a face or not a face. Now, this means that even the most sophisticated image recognition models, the best face recognition models will not recognize everything in that image. It’s never going to take a look at an image of a face, or it may be not a face, and say, “Oh, that’s actually an airplane,” or, “that’s a car,” or, “that’s a boat or a tree.”

It’s just going to say, “No, that’s not a face,” okay? Because that’s all it’s been taught to do. It’s classifying everything into one of those two possible categories, okay? So even if something doesn’t belong to one of those categories, it will try its best to fit it into one of the categories that it’s been trained to do. So, essentially, it’s really being trained to only look for certain objects and anything else, just, it tries to shoehorn into one of those categories, okay? So that’s a very important takeaway, is that if we want a model to recognize something, we have to program it to recognize that, okay? Otherwise, it may classify something into some other category or just ignore it completely.

Now, to a machine, we have to remember that an image, just like any other data, is simply an array of bytes. So it’s really just an array of data. It doesn’t look at an incoming image and say, “Oh, that’s a two,” or “that’s an airplane,” or, “that’s a face.” It’s just an array of values. Even images – which are technically matrices, there they have columns and rows, they are essentially rows of pixels, these are actually flattened out when a model processes these images.

Generally speaking, we flatten it all into one long array of bytes. So, I say bytes because typically the values are between zero and 255, okay? So that’s a byte range, but, technically, if we’re looking at color images, each of the pixels actually contains additional information about red, green, and blue color values. Lucky for us, we’re only really going to be working with black and white images, so this problem isn’t quite as much of a problem. But realistically, if we’re building an image recognition model that’s to be used out in the world, it does need to recognize color, so the problem becomes four times as difficult.

Now, if an image is just black or white, typically, the value is simply a darkness value. I guess this actually should be a whiteness value because 255, which is the highest value as a white, and zero is black. And, that means anything in between is some shade of gray, so the closer to zero, the lower the value, the closer it is to black. And, the higher the value, closer to 255, the more white the pixel is.

Now, this is the same for red, green, and blue color values, as well. If we get a 255 in a red value, that means it’s going to be as red as it can be. If we get 255 in a blue value, that means it’s gonna be as blue as it can be. But, of course, there are combinations. So, for example, if we get 255 red, 255 blue, and zero green, we’re probably gonna have purple because it’s a lot of red, a lot of blue, and that makes purple, okay? So this is kind of how we’re going to get these various color values encoded into our images.

Now, machines don’t really care about seeing an image as a whole, it’s a lot of data to process as a whole anyway, so actually, what ends up happening is these image recognition models often make these images more abstract and smaller, but we’ll get more into that later. To process an image, they simply look at the values of each of the bytes and then look for patterns in them, okay?

So if we feed an image of a two into a model, it’s not going to say, “Oh, well, okay, I can see a two.” It’s just gonna see all of the pixel value patterns and say, “Oh, I’ve seen those before “and I’ve associated with it, associated those with a two. “So we’ll probably do the same this time,” okay? So they’re essentially just looking for patterns of similar pixel values and associating them with similar patterns they’ve seen before. In this way, image recognition models look for groups of similar byte values across images so that they can place an image in a specific category.

Again, coming back to the concept of recognizing a two, because we’ll actually be dealing with digit recognition, so zero through nine, we essentially will teach the model to say, “‘Kay, we’ve seen this similar pattern in twos. “We’ve seen this pattern in ones,” et cetera. So when it sees a similar patterns, it says, “Okay, well, we’ve seen those patterns “and associated it with a specific category before, “so we’ll do the same.”

Now, I should say actually, on this topic of categorization, it’s very, very rarely going to be the case that the model is 100% certain an image belongs to any category, okay? That’s why these outputs are very often expressed as percentages. So it might be, let’s say, 98% certain an image is a one, but it also might be, you know, 1% certain it’s a seven, maybe .5% certain it’s something else, and so on, and so forth. So it’s very, very rarely 100% it will, you know, we can get very close to 100% certainty, but we usually just pick the higher percent and go with that.

Now, an example of a color image would be, let’s say, a high green and high brown values in adjacent bytes, may suggest an image contains a tree, okay? Now, if many images all have similar groupings of green and brown values, the model may think they all contain trees. So it will learn to associate a bunch of green and a bunch of brown together with a tree, okay? So this is maybe an image recognition model that recognizes trees or some kind of, just everyday objects.

Now, the unfortunate thing is that can be potentially misleading. There are plenty of green and brown things that are not necessarily trees, for example, what if someone is wearing a camouflage tee shirt, or camouflage pants? Well, that’s definitely not a tree, that’s a person, but that’s kind of the point of wearing camouflage is to fool things or people into thinking that they are something else, in this case, a tree, okay? So really, the key takeaway here is that machines will learn to associate patterns of pixels, rather than an individual pixel value, with certain categories that we have taught it to recognize, okay?

Now, we can see a nice example of that in this picture here. So, there’s a lot going on in this image, even though it may look fairly boring to us. There’s the decoration on the wall. There’s the lamp, the chair, the TV, the couple of different tables. There’s a vase full of flowers. There’s a picture on the wall and there’s obviously the girl in front. And, the girl seems to be the focus of this particular image.

Now, we are kind of focusing around the girl’s head, but there’s also, a bit of the background in there, there’s also, you got to think about her hair, contrasted with her skin. There’s also a bit of the image, that kind of picture on the wall, and so on, and so forth. So there may be a little bit of confusion. It’s not 100% girl and it’s not 100% anything else. And, that’s why, if you look at the end result, the machine learning model, this is 94% certain that it contains a girl, okay? It’s, for a reason, 2% certain it’s the bouquet or the clock, even though those aren’t directly in the little square that we’re looking at, and there’s a 1% chance it’s a sofa.

Now, I know these don’t add up to 100%, it’s actually 101%. But, you’ve got to take into account some kind of rounding up. Also, this definitely demonstrates how a bigger image is broken down into many, many smaller images and ultimately is categorized into one of these categories. So, in this case, we’re maybe trying to categorize everything in this image into one of four possible categories, either it’s a sofa, clock, bouquet, or a girl. And, in this case, what we’re looking at, it’s quite certain it’s a girl, and only a lesser bit certain it belongs to the other categories, okay?

So again, remember that image classification is really image categorization. Machines can only categorize things into a certain subset of categories that we have programmed it to recognize, and it recognizes images based on patterns in pixel values, rather than focusing on any individual pixel, ‘kay? So when we come back, we’ll talk about some of the tools that will help us with image recognition, so stay tuned for that.

Otherwise, thanks for watching! See you guys in the next one!

Interested in continuing?  Check out the full Convolutional Neural Networks for Image Classification course, which is part of our Machine Learning Mini-Degree.

]]>
Classification with Support Vector Machines https://gamedevacademy.org/classification-with-support-vector-machines/ Sun, 06 Sep 2020 02:21:57 +0000 https://pythonmachinelearning.pro/?p=1952 Read more]]> One of the most widely-used and robust classifiers is the support vector machine. Not only can it efficiently classify linear decision boundaries, but it can also classify non-linear boundaries and solve linearly inseparable problems. We’ll be discussing the inner workings of this classification jack-of-all-trades. We first have to review the perceptron so we can talk about support vector machines. Then we’ll derive the support vector machine problem for both linearly separable and inseparable problems. We’ll discuss the kernel trick, and, finally, we’ll see how varying parameters affects the decision boundary on the most popular classification dataset: the iris dataset.

Download the full code here.


Perceptron Review

Before continuing on to discuss support vector machines, let’s take a moment to recap the perceptron.

The perceptron takes a weighted sum of its inputs and applies an activation function. To train a perceptron, we adjust the weights of the weighted sum. The activation function can be any number of things, such as the sigmoid, hyperbolic tangent (tanh), or rectified linear unit (ReLU). After applying the activation function, we get an activation out, and that activation is compared to the actual output to measure how well our perceptron is doing. If it didn’t correctly classify our data, then we adjust the weights. We keep iterating over our training data until the perceptron can correctly classify each of our examples (or we hit the maximum number of epochs).
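As a quick refresher in code, a single perceptron's forward pass is just a weighted sum followed by an activation; here is a minimal sketch (using a simple step activation and hand-picked weights, not the training loop):

import numpy as np

def perceptron_output(x, w, b):
    # Weighted sum of the inputs, then a step activation
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked weights that happen to implement a logical AND of two binary inputs
w, b = np.array([1.0, 1.0]), -1.5
print([perceptron_output(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 0, 0, 1]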

We trained our perceptron to solve logic gates but came to an important realization: the perceptron can only solve linear problems! In other words, the perceptron’s weights create a line (or hyperplane)! This is the reason we can’t use a single perceptron to solve the XOR problem.

Let’s discuss just linear problems for now. One of the most useful properties of the perceptron is the perceptron convergence theorem: for a linearly separable problem, the perceptron is guaranteed to find an answer in a finite amount of time.

However, there is one big catch: it finds the first line that correctly classifies all examples, not the best line. For any problem, if there is a single line that can correctly classify all training examples, there are an infinite number of lines that can separate the classes! These separating lines are also called decision boundaries because they determine the class based on which side of the boundary an example falls on.

Let’s see an example to make this more concrete. Suppose we had the given data for a binary classification problem. If we used a perceptron, we might get a decision boundary that looks like this.

This isn’t the best decision boundary! The line is really close to all of our green examples and far from our magenta examples. If we get new examples, then we might have an example that’s really close to the decision boundary, but on the magenta side. If I didn’t draw that line, we would certainly think that the new point would be a green point. But, since it is on the other side of the decision boundary, even though it is closer to the green examples, our perceptron would classify it as a magenta point. This is not good!

If this decision boundary is bad, then where, among the infinite number of decision boundaries, is the best one? Our intuition tells us that the best decision boundary should probably be oriented in the exact middle of the two classes of data.

The dashed line is the decision boundary. This seems like a better fit! Now, if we have a new example that’s really close to this decision boundary, we still can classify it correctly! But how do we find this best decision boundary?

Support Vector Machines

The goal of support vector machines (SVMs) is to find the optimal line (or hyperplane) that maximally separates the two classes! (SVMs are used for binary classification, but can be extended to support multi-class classification). Mathematically, we can write the equation of that decision boundary as a line.

    \[ g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = 0 \]

Note that we set this equal to zero: the decision boundary is exactly the set of points where g(\mathbf{x}) = 0. Depending on the sign of g for a particular point \mathbf{x}, we classify that point into one of the two classes. We're using vector notation to be as general as possible, but this works for a simple 2D (one input) case as well.

If we do some geometry, we can figure out that the distance from any point to the decision boundary is the following

    \[ r = \displaystyle\frac{g(\mathbf{x})}{||\mathbf{w}||} \]

Our goal is to maximize r for the points closest to the optimal decision boundary. These points are so important that they have a special name: support vectors!
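As a tiny numerical sketch (with made-up numbers), the distance formula above is easy to check in code:

import numpy as np

w = np.array([1.0, 1.0])   # a hypothetical boundary: x1 + x2 - 1 = 0
b = -1.0
x = np.array([2.0, 2.0])   # a point on the positive side

g = np.dot(w, x) + b             # g(x) = 3.0
r = g / np.linalg.norm(w)        # signed distance to the boundary, about 2.12
print(g, r)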

We can actually simplify this goal a little bit by considering only the support vectors. Notice that the numerator just tells us which class the point belongs to (we're assuming the two classes are 1 and -1), but the denominator doesn't change. If we scale \mathbf{w} and b so that g(\mathbf{x}) = \pm 1 exactly at the support vectors, we can take the absolute value of each side to get rid of the numerator.

    \[ |r| = \displaystyle\frac{1}{||\mathbf{w}||} \]

where \mathbf{w} defines the optimal decision boundary (later we'll show that the bias is easy to solve for if we know \mathbf{w}). We can simplify even further! Maximizing \frac{1}{||\mathbf{w}||} is equivalent to minimizing ||\mathbf{w}||. Minimizing ||\mathbf{w}|| directly involves an awkward square root, so we minimize the square instead: \min \frac{1}{2}\mathbf{w}^T\mathbf{w}. (The constant out front is there so it can nicely cancel out later!)

However, we need more constraints, else we could just make ||\mathbf{w}||=0! That wouldn’t solve anything! The other constraints come from our need to correctly classify the examples!

    \[ y_i (\mathbf{w}^T\mathbf{x}_i + b) \geq 1~~~~\forall i \]

where y_i is the ground truth and we iterate over our training set. To see why this is correct, let’s split it into the two classes 1 and -1:

    \begin{align*} \mathbf{w}^T\mathbf{x}_i + b &\geq 1~~~~~~\forall y_i = 1\\ \mathbf{w}^T\mathbf{x}_i + b &\leq -1 ~~~~\forall y_i = -1 \end{align*}

We can compress the two into the single equation above. After we’ve considered all of this, we can formally state our optimization problem! (In the constraints, the 1 was moved over to the other side of the inequality.)

    \begin{align*} & \text{minimize} & & \frac{1}{2}\mathbf{w}^T\mathbf{w} \\ & \text{subject to} & & y_i (\mathbf{w}^T\mathbf{x}_i + b) -1 \geq 0~~~~\forall i \end{align*}

This is called the primal problem. This is a run-of-the-mill optimization problem, so we can use the technique of Lagrange Multipliers to solve this problem.

    \[ \mathcal{L} = \displaystyle\frac{1}{2}\mathbf{w}^T\mathbf{w} -   \displaystyle\sum_{i=1}^N \alpha_i [y_i (\mathbf{w}^T\mathbf{x}_i + b) - 1 ] \]

where the \alpha_i‘s are the Lagrange multipliers. To solve this, we have to compute the partial derivatives with respect to our weights and bias, set them to zero, and solve! I’ll skip over the derivation and just give the solutions.

    \begin{align*} \mathbf{w} &= \displaystyle\sum_{i=1}^N \alpha_i y_i\mathbf{x}_i\\ \displaystyle\sum_{i=1}^N \alpha_i y_i &= 0 \end{align*}

The first equation is \frac{\partial\mathcal{L}}{\partial\mathbf{w}} and the second equation is \frac{\partial\mathcal{L}}{\partial b}. These solutions tell us some useful things about the weights and Lagrange multipliers. In particular, they give some constraints on the Lagrange multipliers. These \alpha_i‘s also tell us something very important about our SVM: they indicate the support vectors! If a particular point \mathbf{x}_i is a support vector, then its corresponding Lagrange multiplier \alpha_i will be greater than 0! If it is not a support vector, then it will be equal to 0!

However, we still don’t have enough information to solve our problem. As it turns out, there is a corresponding problem called the dual problem that we can solve instead.

    \begin{align*} & \text{maximize} & & \displaystyle\sum_{i=1}^N \alpha_i - \displaystyle\frac{1}{2} \displaystyle\sum_{i=1}^N \displaystyle\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\ & \text{subject to} & & \displaystyle\sum_{i=1}^N \alpha_i y_i = 0~~~~\forall i\\ &&& \alpha_i \geq 0~~~~\forall i \end{align*}

This is something that we can solve! Notice that it’s only in terms of the Lagrange multipliers! Everything else is known! We usually use a quadratic programming solver to do this for us because it is infeasible to solve by-hand for large numbers of points. But we would solve for this by setting each \frac{\partial\mathcal{L}}{\partial\alpha_i} = 0 and solving.

After we’ve solved for the \alpha_i‘s, we can find the optimal line using the following equations.

    \begin{align*} \mathbf{w} &= \displaystyle\sum_{i=1}^N \alpha_i y_i\mathbf{x}_i\\ b &= \underset{i\in N}{\text{average}} \frac{1}{y_i} - \mathbf{w}^T\mathbf{x}_i \end{align*}

The first is from the primal problem, and the second is just solving for the bias from the decision boundary equation.

SVMs for Logic Gates

Let’s take a break from the math and apply support vector machines to a simple logic gate, like what we did for perceptrons. In particular, let’s train an SVM to solve the logic AND gate.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y = np.array([0, 0, 0, 1])

clf = svm.SVC(kernel='linear', C=1e6)
clf.fit(X, y)

We’re building a linear decision boundary. Ignore the other parameter C; we’ll discuss that later. Now we can use some plotting code (source) to show the decision boundary and support vectors.

plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

# plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

# plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
# plot support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100, linewidth=1, facecolors='none')
plt.show()

Before we plot this, let’s try to predict what our decision boundary and surface will look like. Here’s the picture of the logic gates again.

Where will the decision boundary be? Which points will be the support vectors? The decision boundary will be a diagonal line between the two classes. The support vectors will be (1,1), (0,1), and (1,0) since they are closest to that boundary.

This matches our intuition! So SVMs can certainly solve linearly separable problems, but what about non-linearly separable problems?
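We can also connect the fitted model back to the math. scikit-learn exposes the support vectors directly, along with dual_coef_, which holds the products \alpha_i y_i, so we can rebuild \mathbf{w} ourselves (a quick sanity check on the clf we just trained):

# The three support vectors we predicted above
print(clf.support_vectors_)

# Rebuild w = sum_i alpha_i * y_i * x_i from the dual solution
w_rebuilt = np.dot(clf.dual_coef_, clf.support_vectors_)
print(w_rebuilt, clf.coef_)   # for a linear kernel these should match
print(clf.intercept_)         # the bias b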

SVMs for Linearly Inseparable Problems

Suppose we had the following linearly inseparable data.

There is no line that can correctly classify each point! Can we still use our SVM? We can, but with a modification. We have to add slack variables \xi_i. These measure how many misclassifications there are. We also want to minimize the sum of all of the slack variables. Intuitively, this corresponds to minimizing the number of incorrect classifications. We can reformulate our primal problem.

    \begin{align*} & \text{minimize} & & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\displaystyle\sum_{i=1}^N \xi_i \\ & \text{subject to} & & y_i (\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \geq 0~~~~\forall i\\ &&& \xi_i \geq 0~~~~\forall i \end{align*}

where we introduce a new hyperparameter C that measures the tradeoff between the two objectives: largest margin of separation and smallest number of incorrect classifications. And, from there, go to our corresponding dual problem.

    \begin{align*} & \text{maximize} & & \displaystyle\sum_{i=1}^N \alpha_i - \displaystyle\frac{1}{2} \displaystyle\sum_{i=1}^N \displaystyle\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\ & \text{subject to} & & \displaystyle\sum_{i=1}^N \alpha_i y_i = 0~~~~\forall i\\ &&& 0 \leq \alpha_i \leq C~~~~\forall i \end{align*}

This looks almost the same as before! The change is that our \alpha_i‘s are also bounded above by C. After solving for our \alpha_i‘s, we can solve for our weights and bias exactly the same as in our linearly separable case!

The Kernel Trick

One last topic to discuss is the kernel trick. Instead of having a linear decision boundary, we can have a nonlinear decision boundary. The idea behind the kernel trick is to apply a nonlinear kernel to our inputs \mathbf{x}_i to transform them into a higher-dimensional space where we can find a linear decision boundary.

Consider the above figure. The left is our 2D dataset that can’t be separated using a line. However, if we use some kernel function \varphi(\mathbf{x}_i) to project all of our points into a 3D space, then we can find a plane that separates our examples. The intuition behind this is that higher dimensional spaces have extra degrees of freedom that we can use to find a linear plane! There are many different choices of kernel functions: radial basis functions, polynomial functions, and others.
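To see the kernel trick pay off, we can revisit the XOR problem that a single perceptron (and a linear SVM) cannot solve. A minimal sketch, reusing the imports from earlier and a large C so the solver tries to classify every point correctly:

# XOR is not linearly separable, but an RBF kernel handles it
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

xor_clf = svm.SVC(kernel='rbf', C=1e6)
xor_clf.fit(X_xor, y_xor)
print(xor_clf.predict(X_xor))   # expected: [0 1 1 0]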

SVM for The Iris Dataset

One of the most famous datasets in all of machine learning is the iris dataset. It has 150 data points across 3 different types of flowers. The features that were collected were sepal length/width and petal length/width. Our goal is to use an SVM to correctly classify an input into the correct flower and to draw the decision boundary.

Since the iris dataset has 4 features, let’s consider only the first two features so we can plot our decision regions on a 2D plane. First, let’s load the iris dataset, create our training and testing data, and fit our SVM. We’ll change some parameters later, but let’s use a linear SVM.

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

C = 1.0 
clf = svm.SVC(kernel='linear', C=C)
clf.fit(X, y)

Now we can use some auxiliary functions (source) to plot our decision regions.

ax = plt.gca()
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

# Build the plotting grid from the two features we kept
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
ax.set_xticks(())
ax.set_yticks(())

plt.show()

Additionally, we’re going to print the classification report to see how well our SVM performed.

from sklearn.metrics import classification_report
print(classification_report(y, clf.predict(X), target_names=iris.target_names))

Now let’s run our code to see a plot and classification metrics!

Additionally, we can try using an RBF kernel and changing our C value. Recall that C controls the tradeoff between large margin of separation and a lower incorrect classification rate.

C = 1.0 
clf = svm.SVC(kernel='rbf', C=C)

Try varying different parameters to get the best classification score – and feel free to add all this to your own coding portfolio as well!
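If you'd like a more systematic way of trying parameters, scikit-learn's GridSearchCV will fit a model for every combination in a grid and report the best one via cross-validation; here is a minimal sketch (the particular values below are just illustrative):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)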

To summarize, Support Vector Machines are very powerful classification models that aim to find a maximal margin of separation between classes. We saw how to formulate SVMs using the primal/dual problems and Lagrange multipliers. We also saw how to account for incorrect classifications and incorporate that into the primal/dual problems. Finally, we trained an SVM on the iris dataset.

Support Vector Machines are one of the most flexible non-neural models for classification; they’re able to model linear and nonlinear decision boundaries for linearly separable and inseparable problems.

]]>
An Introduction to Machine Learning https://gamedevacademy.org/what-is-machine-learning/ Fri, 20 Dec 2019 16:00:56 +0000 https://pythonmachinelearning.pro/?p=2651 Read more]]>

You can access the full course here: Machine Learning for Beginners with TensorFlow

Intro to Machine Learning

Now that we know what the course is all about, let’s learn a bit about the main topic: machine learning. What is machine learning? Machine learning is the study of statistics and algorithms aimed at performing a task without being explicitly programmed to. Theoretically, a machine is said to have learned if it produces “better” results over time without modifications to its programming. Practically, this means writing an algorithm, feeding it some data, and letting it interpret the data to find some pattern to solve a problem. 

Behind the scenes, machine learning models consist of layers of connected nodes (often called neurons). We will cover model structure in greater detail later but it is important to know now that each node has one or more values assigned to it that, when passed through a function of our choice, produce some output. Through training, a model can change these values to produce more accurate outputs rather than having us, as the coders, explicitly change the algorithm or values manually. In this way, the model is "learning" as it is improving its results over time using the same algorithm. It should also be noted that some models do not undergo training and are only used to find patterns in data. We will explain the differences in the section on common machine learning models.

So what makes machine learning so special? Why not just hard code the algorithms ourselves? The main reason we use machine learning is to help find patterns in data that we wouldn’t otherwise be able to see. If we cannot find patterns in data, we cannot hard code algorithms to look for them as we wouldn’t know what to tell the algorithm to do. It is this pattern recognition that allows machine learning models to solve recognition, classification, and prediction problems such as speech recognition, image classification, and stock market prediction. Machine learning also helps to customize user experience and tailor solutions to the user based on their previous habits such as with responsive game AIs, health apps, and text suggestion.

Transcript

What is up, guys? And welcome to the first tutorial in our machine learning course! This’ll also be the very first topic and, as you can see, it is on an Intro to Machine Learning. This is a good way to get some conceptual background info about what machine learning is, kind of how it works, and also go over some practical examples before we start writing any actual code.

So, what topics will we be covering here? Well, we’re gonna divide ourselves into three subtopics. We’ll start with “What is Machine Learning?”, then “What can we do with Machine Learning?”, and we’ll finish up with “What types of Machine Learning there are out there”. Now, I’m going to devote a separate tutorial to each of these three topics just because I’m trying to keep things a bit shorter. There’s a lot to digest within each of these so, we don’t want anything running too, too long.

Okay, so, for starters, we’ll cover, What Is Machine Learning? I think we can safely do that in this video, too. Now, there is a technical definition here; Machine Learning is the study of statistics and algorithms that’s aimed at performing a task without being explicitly programmed to. That’s a lot to take in. It’s quite wordy and is very technical, uses a lot of jargon. So, let’s try to break that down a bit. I’ve got a, hopefully a bit more of an easy definition to understand here and that is that Machine Learning is finding patterns in data that help to solve a problem without us necessarily writing the algorithm to find the patterns and solve the problem. Now, realistically, these are both kind of describing the same thing. There are a couple of common themes in there.

So, the first is that we’re aimed at performing a task or solving some kind of a problem. Well, that’s kind of obvious. I mean, that’s what software is supposed to help us to do is perform some task or solve a problem easier than we would otherwise be able to do. But the second thing that’s really important is that it’s not programmed explicitly to necessarily solve that problem or to improve over time, okay? So, the learning aspect of it is actually not something that we, ourselves, necessarily program in. It kind of figures out what it should and shouldn’t be doing based on the data that it sees.

So, usually Machine Learning does involve performing a task better over time, otherwise there’s not really an aspect of learning involved. Performance in most of the models, not all of them however, is linked to this concept of training. Now, we’ll go into training in greater detail later but for now this is essentially feeding data into a model to increase its performance over time without modifying the algorithm itself.

Now, that’s a very important aspect of it because, otherwise, it’s not really Machine Learning. If we’re actually making changes to the algorithm to increase performance, then it’s not learning itself. That’s us manually changing things. There’s no Machine Learning there. If a machine produces better results after training with the same algorithm, it’s said to have learned. So, if the model starts out at the very beginning with maybe, 20% accuracy, that means, it’s only getting 20% of the answers correct, and then, later on, that bumps up to, I don’t know maybe 80 or 90% and then we have a good model and that model is said to have learned, a lot.

The Key Takeaway here is that Machine Learning Models perform better over time without any changes made to the algorithms themselves. It’s really just putting in a bunch of data into the model and then having the model kind of perform better over time without any changes to the actual algorithm.

Okay! So, that’s all we’re actually gonna do here. Hopefully, just take a second to digest that. Hopefully it wasn’t too much and when we come back we’ll talk about some kinda things that we can do with Machine Learning, where we might see some practical applications of Machine Learning in real life, hopefully that will help to clear up any questions that you might have, okay? So, stay tuned for that. Thanks for watching! See you guys in the next one.

Interested in continuing? Check out the full Machine Learning for Beginners with TensorFlow course, which is part of our Machine Learning Mini-Degree.

]]>
How to Build a Spam Detector https://gamedevacademy.org/spam-text-classification-tutorial/ Fri, 15 Mar 2019 04:00:28 +0000 https://pythonmachinelearning.pro/?p=2252 Read more]]>

 

You can access the full course here: Build a Spam Detector AI with Text Classification

Transcript Part 1

Hello everybody my name is Mohit Deshpande, and in this course we’ll be building an AI that would be able to determine if an input email is spam or not.

And so you can see in some of the scores that we’re gonna be achieving as we build this AI, 99, 98% accuracy at determining if an email is spam or not. And just to kinda mess with it, I tried to, you know, I tried to give an AI an email that I just kind of fabricated up. Here’s the text of the email and you can see it correctly determined that this email is spam and we’re gonna be building this AI. So we’re gonna be learning a lot of things.

In particular, we have to look at text classification and how, you know, we’re gonna discuss some of the challenges of text classification and we’re gonna discuss one group of algorithms called Naive Bayes that are gonna be helpful for our problem. And then we’re also going to discuss this term frequency and inverse document frequency because those provide us some improvements that we can use to help bring up the accuracy of our AI, and we’re gonna be looking at one dataset in particular, called the Enron dataset, that I’m going to kind of go over a little bit in the course.

It’s a publicly available dataset. You can download it; I’ve provided it in the source code. And we’re gonna be using that to tell our AI what is a spam email and what is not a spam email.

So we’ve been making video courses since 2012, we’re super excited to have you on board. Online courses are a fantastic way to learn a new skill, and I take a lot of them myself. ZENVA courses consist mainly of video lessons that you can watch and re-watch at your own pace as many times as you want. We also have downloadable source code and project files that contain everything that we build in a lesson. It’s highly recommended that you code along with me; in my experience it’s the best way to learn something.

And finally, we’ve seen that students who make the most out of these online courses are the same students that create a plan and stick with it, depending on your own availability and learning style, of course. And remember that these videos you can watch and re watch as many times as you want, so that really gives you more flexibility, so you can adapt to how you learn. At ZENVA we’ve taught programming and game development to over 200,000 students, over 50 courses since 2012. (now 350,000+)

Some of these students have used the the skills that they’ve learned in these courses to advance their own careers, start a company or publish their own apps and games. Thanks again for joining and I look forward to seeing all the cool stuff that you’ll be building. Now without further ado, let’s get started.

Transcript 2

Hello everybody. In this video, I just wanna introduce you guys to the problem of text classification. I just wanna give this brief overview of what it is and a little bit about kind of a specific thing, and we’re gonna go through an example and then we’ll move on to how we can actually perform this kind of classification.

So this is the problem of text classification, and actually it’s sometimes also called document classification. The challenge with document classification is: given an input document with some text in it, we want to be able to put it into one, or in some cases more, different bins. For example, if I were given some kind of document, maybe I wanna know whether it’s an invoice. Here’s a possible bin it could go into. Or maybe it’s a receipt. So here’s another possible bin it could go into, and so on and so on.

Just looking at the text of a document, I want to be able to tell what kind of document it is and, as you can probably guess, this requires supervised learning, but it’s a bit more challenging than what we’ve seen before because instead of just having those X’s and O’s on that nice two dimensional plane, we’re dealing with text data, and computers are really great at handling numerical data but with text data they’re not as good.

Humans, on the other hand, we can look at text data and it’s fine. Looking at large amounts of numerical data for humans can be kind of tedious and error prone. But for computers, they love that stuff. So, we’re gonna discuss a bit later how we can take textual data and convert it into some kind of numerical representation so that we can work with it using learning algorithms. We should be able to work with it once it’s numerical data ’cause learning algorithms operate on numerical data. And so we have to find some way to convert the words into a numeric representation so that we can work with them a little bit better.
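
As a rough illustration of what that conversion can look like (a simple bag-of-words approach; the course may use a different representation), here is a minimal sketch with scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up example documents.
docs = ["free money now", "meeting notes attached"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # each document becomes a vector of word counts

# On older scikit-learn versions this method is called get_feature_names().
print(vectorizer.get_feature_names_out())
print(X.toarray())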

Now in particular, a good example that’s used for text classification is, suppose I have an email and I basically wanna categorize it as spam or not spam, also called ham. But if I’m given an email, I want to determine if it’s spam or not spam and a lot of email services already have a built in way to do this. For example, like Gmail or Outlook, Microsoft’s stuff, all these companies already have ways that given a new input email, detect whether it’s spam or ham. This is a good example to use because it’s simple enough that there are only two categories. This is really just a binary problem, whether it’s spam or not.

And so, it’s a bit easier to understand than maybe this top example… you have a document or wanna put it into different classes because it might be the case that it belongs in more than one of these classes. I could get a receipt that’s also an invoice or I can mix and merge these things but with something like email classification, like spam filtering, it’s either spam or not spam. There’s no combination.

Just to recap, we want to look at the text of the email, the words in this email, because some words are more indicative of an email being spam than others. And so we want to look at the text of this email, including other information, and we want to build and construct an AI that, given a new email message, can determine if it’s spam or not and then route it to either your inbox or your spam folder. Like I mentioned before, supervised learning is a great way to accomplish this task. I’m gonna give our AI lots of labeled examples of messages that are spam and messages that are not spam, and then it should be able to learn what kind of words or what kind of other characteristics spam messages have and then apply them to a new input document.
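
To make that supervised learning setup concrete, here is a minimal sketch (our own illustration with made-up messages; the course builds a fuller version on the Enron dataset) of training a Naive Bayes classifier on labeled spam and ham examples:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up labeled examples: 1 = spam, 0 = ham.
emails = ["free money, claim your prize now", "meeting moved to 3pm, agenda attached"]
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = MultinomialNB()
clf.fit(X, labels)

# Route a new message based on what the model learned from the labeled examples.
new_email = vectorizer.transform(["win a free prize"])
print("spam" if clf.predict(new_email)[0] == 1 else "ham")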

Like I mentioned, in order to do this, we have to have a numerical representation but let’s assume we have this right now. Assume it’s already there. We’ll deal with that a bit later. I also wanna mention that text classification isn’t just used for spam filtering.

Actually, there’s a ton of different applications. Spam filtering is just one of them. But there’s also a sentiment analysis. So, sentiment analysis, which is given the text of a document, or actually, it’s very popularly used for social media, given a tweet or a Facebook post, or something like that, I want to be able to determine the sentiment of the creator. I wanna be able to tell is the sender angry or upset or other things like that. And there’s certain words and phrasings that are used in that sense and we can try to predict the sentiment of the sender, for example. This is especially popular in social media so you can mine lots of social media data. There’s been a lot of work there and then try to run sentiment analysis on that.

Another thing it’s also used for is book classification. Book classification, and what I mean by that is, given a book, maybe its title or something like that, we wanna be able to tell what genre it is, for example. Given some information about the book, like the title, the author or something like that, we wanna be able to tell what genre it is, and this is super useful because then you don’t have to have people manually make these decisions themselves, because it might get tedious or maybe they have more important things to do. This is something that we can delegate to a machine learning algorithm that learns from previous examples. Give it a book, and it can tell what genre it is, for example, or even categorize it further.

Another thing that I wanna mention, another popular use of this is called readability. Readability. And with readability, it is more like given a passage of text, I wanna determine things like what level of reading or comprehension level you need to understand this passage or more accurately, given some of the words in this passage, what is the expected reading level, for example. So, you might notice that if you have an elementary school or a primary school reading level, the words might be just one or two syllables and the sentence structure is fairly simple but as you go up to more scientific writing or graduate school writing for example, then you notice that there are longer words.

There are more complicated words. Maybe the sentence structure is different. The structure varies, it’s more complicated. With readability assessment, we can look at the text; we can look at the words and the sentence structure to assess a passage and determine what level of reading comprehension would be assigned to that passage of text. So that’s where I’m gonna stop here.

So just to recap, text classification or document classification is this problem of given some text input, I want to assign some label to it. I can assign labels like invoice, receipt, medical record and so on. Or the example that’s nice to look at is email filtering and that is given an email, I just want it to say whether it’s spam or not spam and I can use the words inside of the email to help me make this determination. I mentioned that there are plenty of different applications of text classification. I mentioned sentiment analysis in social media, book classification to tag books what genres and readability to determine to what level we can tag a passage like primary, elementary school up to scientific, college, grad school sort of thing. And that is text classification.

Transcript 3

Hello everybody, my name is Mohit Deshpande and in this video, I just want to introduce this algorithm that we can use called Naive Bayes, and we’re going to kind of be discussing it over the next few videos.

In this particular video I’m gonna write down and explain the actual equation that we use and we’ll look at how we can do a concrete example and see how Naive Bayes works there. And then I have some other, just kinda wrap-up stuff to discuss Naive Bayes. And so, like we were mentioning, with textual data we have to have some kind of numeric representation for it, and we want to learn some information from it; we’re going to be using spam detection, for example, because it’s a nice example and the equation turns out to be fairly simple and easy to understand. It’s pretty intuitive.

So we’re gonna stick with this notion of spam filtering or spam detection. So spam filtering, or detection. And so with Naive Bayes and with spam filtering, it’s kind of logical to assume that spam messages tend to have a different word distribution than messages that are not spam. What I mean by this is, like for example, one example we’ll look at is the word “free”, or something, right.

So the word “free” is probably going to be more common in spam messages that are advertising like “free money” or “free something or the other”. For example, the word “free” is more associated with spam messages than it is with ham or not spam messages. And so we can use that, we can word this as probabilities, to better assess, given an input text, whether it’s spam or not based on the words that are in that email. So I wanna write down the equation.

So Naive Bayes centers around this thing of probability in many fields called Bayes theorem. And so let me write it down. I’m gonna write it down first and then I’m going to discuss it intuitively. It might seem a little scary at first but each part intuitively makes sense that you’ll see. So, the probability that a message is spam knowing that we have some word, W, in it, is equal to the probability that of finding that same word, W, in a spam message times the probability that we actually have a spam message all divided by the same thing up here. The probability of finding that word in a spam message times the probability that message is spam to begin with plus the probability of finding that word in something that is not spam.

And this symbol by the way just means “not”. Times the probability that a message is not spam. So this seems kind of big and tedious but I just want to, we’ll kind of go through it in part. And so, I should mention that these Ps by the way, this means “probability” and you can intuitively think of probability as like chances or odds, like if I roll a die, if I roll a six-sided die, then what is the probability that it lands on a five? Well, it’s just one out of six because I have one outcome that I want which is landing with a five side up, and there’s six possible outcomes.
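
For reference, the equation being described here is Bayes’ theorem applied to spam filtering; writing S for “the message is spam” and W for “the message contains the word W”, it is

\[ P(S \mid W) = \displaystyle\frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid \neg S)\,P(\neg S)} \]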

And so, intuitively, this is also how the probabilities here work: here’s the outcome we want, and then here are all possible outcomes. And so you can kind of intuitively think of this as being, you know, probability, but that’s all the Ps mean, probability. You might see it written with a capital P or as Pr; I’m just gonna write it like this.

So, and this bar means “given”. And what I mean by given is that this is a conditional probability because it depends on what word that we know here. So, anyway, I just wanna kinda go through this intuitively. It seems a little scary at first but let’s get through this. So this first term here, this is what we want to find. We want to know, what is the probability that this message that we received is spam, given that it has some word, W, in it. Maybe I should make this rule. So this is spam. This is ham, or not spam. And W then is just a word, any word. So, so this is, let me get back here.

So this is a probability that their input message is spam given that it has some word in it. And this is equal to the probability that we find that word in a spam message times the probability, this probability, S, means what’s the probability that this message is even spam to begin with.

So we can have like general statistics as to you know, what percent of input email, what percent of email that you get is spam, now versus ham. And so this is the probability that any input email that you get to begin with is spam or not. And this is if it’s not spam. So what’s probably gonna happen, and we’re also gonna see what happens in practice, is that this is actually pretty high; you probably get more spam than you do ham messages. So, expect this to be kind of close to one.

And so anyway, this is the probability that we find a word in that spam message. And this is like what I mentioned with the word “free”. We expect the word “free” to be found, it’s more likely to find that word in a spam message than a ham message. And so this is what this top quantity kind of looks at is what are the odds of finding this word, W, in a spam message. Then we divide by this sum, which is all possible outcomes. A possible outcome is that this word is in a spam message and probably spam.

But then there’s also maybe a smaller likelihood that this word is in a message that’s not spam; you wanna also know, like, what are the odds of finding this word in a message that’s not spam? So you would look at a word and see: is it more likely to fall into a spam message or is it more likely to fall into a not spam message? And so we can make a determination based on which of those two outcomes has a higher probability. So, if this output probability says it’s more likely to be a spam message, then we’re gonna assign it to be a spam message. And so, that’s how we can make a determination.

And I, we’re going to go through an example, a more concrete example of this. But intuitively this is like saying that the probability of a spam message with some given word is the likelihood, first of all that we have a spam message to begin with, times the probability of that word being in a spam message, divided by the probability that we actually encounter this word in the first place. And that’s what this is trying to say.

So this is a word given that it’s in spam and then times the probability of spam plus the word is in not spam. So there are really only two outcomes, right? Either the word is in a spam message or it’s not in a spam message. And so this is what this is trying to account for. And we’ll try to find the probability that this is in a spam message. So this is high. What does that mean? That means that the probability that this input email is spam given this word is in it, means that this email is probably spam. And so that’s kind of, I’m gonna stop right here because I don’t want to complicate this further. We’re actually going to go through an example of this in the next video. But I’m gonna stop right here and just kinda reexplain this one more time.

So Naive Bayes centers around this principal of Bayes Theorem and intuitively I can say what this says is what we wanna find is the probability that any input email is spam given that it has some word in it. We say that this is a logical assumption to make because there are a lot of, you know, we look at the content of an email to determine whether it’s spam or ham. So we wanna say what the probability of finding you know, finding if this is a spam message given that it has a word in it, a particular word, W, in it. And so that’s equal to the probability that this word appears in spam messages times the overall probability that we even get spam messages to begin with.

And then this bottom is kind of, like I said, separating this into two parts, where you know, this is W in spam, this is W in ham or not spam, messages.

So these are kinda the two things, outcomes that we can get for W. It’s also commonly called our evidence. Basically, you know, so that’s what we’re trying. So we can find all of these numbers based on our data. We can look at each word in our all the emails in our training data and see, you know, what’s the likelihood that is in a spam message? Oh hey, I found this word in all these spam messages this many times, and then I can, you know, figure out these probabilities. So all these things I can calculate from my data set. And then given a new input image, or a new input text I should say, I can determine and compute this probability. And if it’s super-high then I know that this message is spam.

So I’m gonna stop right here and in the next video I’m actually gonna go through a more concrete example so that we can understand this a bit better.

Transcript 4

Hello everybody. My name is Mohit Deshpande and in this video, I want to go through a more concrete example of applying this Naive Bayes approach to seeing if there is Spam or not.

So I wanna substitute concrete values for these so we can kinda see; we’re trying to make the values such that it’s gonna be obvious that this is a good approach and that it works. And so let’s kinda use an example. So I’m gonna give you some of these numbers. So first, I’ll give you the probability that we have a message and it’s Spam. And this is something that you can, again, compute from your data set, but I’m just gonna use an overall statistic.

So there’s been like studies to try to find this value, and it’s been shown that about 86% of messages that you get are Spam. And so logically, if 86% of messages you get are Spam, then that means that the other 14% must be messages that are not Spam. So I can write that the probability of not Spam is then 0.14, and this follows logically ’cause a message is either Spam or not Spam. Now suppose that the word we’re looking at is the word, free. And so now we’re still missing some values. So we have like here, and here, and here.

So we got three values down, we still need three more. And we really just need two more because these two are the same here. Actually these two are the same here as well. And so let’s suppose that we’re looking at the word, free, because, just like I mentioned, Spam messages tend to have the word, free, in there, and regular emails that you get, Ham emails, generally don’t have the word, but they could. So let’s make this like really obvious. Let’s suppose that the probability that we’d find the word, free, in a Spam message is something really high, like 0.96.

And so what this is saying is that the odds that we find the word, free, in a Spam message is 96%. It means that, if we know that the message is Spam, then the word, free, appears in it 96% of the time. What we wanna find is whether this new message that we received is Spam, given that it has the word free in it.

So based on just looking at this number, you’re probably already thinking that, well, this is probably gonna be a Spam message, because in our data set, for example, this is such a high probability. But let’s suppose that the probability of finding the word, free, in a message that’s not Spam is something like 0.02. So in very few cases, 2% of the time, we find the word free in something that is not Spam.

And it’s important to note that these two things are not related. So you can’t do the same thing like you did here. You can’t be like, well hey, if the probability of free in a Spam message is 0.96, then why isn’t this 0.04? Because 0.04 is actually the probability of not finding free in a Spam message, so these two things are not related at all. So now we actually have enough values to plug this in and find our answer.

So just kinda off the bat, I’ll try to make this obvious, but you can probably tell that this message is gonna be Spam, because the probability that you find the word, free, in a Spam message is 96%. So finding the word, free, is a pretty good indicator that your message is Spam, and so let’s actually go through a computation and find this probability, and we expect it to be really high. So let’s plug in values.

So first I wanna find the probability that the message is Spam given that it has the word, free, in it. Well that’s equal to the probability of finding the word, free, in Spam messages, times the overall probability that any given input is Spam to begin with, divided by the same thing here, plus the probability that I find the word, free, in a message that is not Spam, times the probability that this is not Spam.

And so I can plug in values here. So the probability of finding the word, free, in a Spam message is 0.96, times the probability that I actually have a Spam message, 0.86, divided by the same quantity here (these two are the same because this is one possible outcome and the bottom must cover all possible outcomes), plus the probability that I find the word, free, in a Ham message, which was only 0.02, times the probability of actually getting a message that is not Spam, 0.14.

So when I compute all of these out, I get 0.9966, and now that we have this value, we know, from our data set or from wherever you find these values, that we’re almost positive this input email is a Spam message, and we just did that by knowing these four values. And these values are actually things that we’d learn from our data set, given our input data set of labeled examples of Spam and Ham messages.
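
Just to check the arithmetic, here is a tiny sketch (our own, using the numbers from this example) of the same computation in Python:

# Values from the example above; in practice these would be learned from a data set.
p_spam = 0.86         # P(S): probability any message is spam
p_ham = 0.14          # P(not S)
p_free_spam = 0.96    # P(free | S): chance "free" appears in a spam message
p_free_ham = 0.02     # P(free | not S)

# Bayes' theorem: P(S | free)
numerator = p_free_spam * p_spam
denominator = p_free_spam * p_spam + p_free_ham * p_ham
p_spam_given_free = numerator / denominator

print(round(p_spam_given_free, 4))       # ~0.9966
print(round(1 - p_spam_given_free, 4))   # ~0.0034, the chance it's ham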

So we can compute this probability; we can know the probability of receiving a Spam message based on our data set. Likewise, we also know this. And then we can find this probability for our word, or in this case these words, and there’s actually a problem with this approach that I’m gonna address in the next video: we don’t really want to look at just a single word, we want to look at all the words in our email, and I’m gonna talk about why it’s called Naive Bayes in the next video because of the assumption that we make there.

But yeah, so we can see that this is actually a pretty good way to determine whether an email is Spam or not: by looking at what’s the likelihood that I find this word, or any word, in a Spam message, or what’s the likelihood that I find it in a message that isn’t Spam, and then computing that probability. And so suppose that, instead of free, I had something like, I don’t know, report or research or something like that.

Now what’s probably gonna happen is that the probability that I find the word, research, for example, in a Spam message is gonna be pretty low, or at least lower than the probability of finding the word, research, in a Ham message, a message that is not Spam, from my university account, for example. So based on that, this probability would be pretty low: the probability that this message is Spam, given that it has the word, research, in it, for example. That probability might be low, and if it’s low then I can conclude that this message is not Spam.

And these two things, the probability that it’s Spam given that it has the word, free, in it and the probability that it’s not Spam given that it has the word, free, in it, are indeed related, by the way, because if the probability that a message is Spam given it has the word, free, is 0.9966, then the probability that it’s not Spam, or it’s Ham, is gonna be 1 minus that, about 0.0034, and so that’s a very low probability. And so I wanna pick which one of these gets me the highest probability; in this case, it’s far more likely that this input email is Spam instead of Ham, and so I can route it to my Spam folder.

And I should mention that the algorithms and techniques that companies use are proprietary, and they’re probably far more complicated than this, but this approach is actually not that bad. It’s certainly easier to understand than some of the more advanced techniques, so we’ll just be looking at this, and so I’m gonna stop right here and just do a quick recap. So we actually ran through a computation using the Naive Bayes technique to determine if a message was Spam, given that it has the word, free, in it.

And I kinda gave you some numbers here: the probability of getting any message that’s Spam is pretty high and, consequently, the probability that you’re getting a message that is Ham is pretty low. And then we look at our data set and say, well, based on my experience, have I seen the word, free, in Spam messages or Ham messages? And based on that, by looking at how many times I’ve seen this word in Spam messages versus Ham messages, I can compute this probability, and we found that it’s almost 100% likely that this message is Spam. And so that’s how we can do a computation with Naive Bayes.

All these things are things that we can learn, or that algorithms can automatically compute for us. The problem with this is that we’re only looking at a single word, but we wanna be looking at emails, or sequences of words. So we wanna be able to consider a sequence of words, and I’m gonna address how we can do that in the next video.

Interested in continuing? Check out the full Build a Spam Detector AI with Text Classification course.  You can also check out our full Machine Learning Mini-Degree for further Python development skills.

]]>