How to do Cluster Analysis with Python – Data Science

You can access the full course here: Data Insights with Cluster Analysis

Want more Python topics for a school environment? Try the Zenva Schools platform, which offers online courses, classroom management tools, reporting, pre-made course plans, and more for teachers.

Part 1

In this video we are going to discuss Cluster Analysis. We will discuss the following topics:

  1. Intro to Cluster Analysis – what it is, what its different applications are, and the kinds of algorithms we can expect.
  2. K-means clustering
  3. Density-based Spatial Clustering of Applications with Noise (DBSCAN)
  4. Hierarchical Agglomerative Clustering (HAC)

k-means, DBSCAN, and HAC are three very popular clustering algorithms which all take very different approaches to creating clusters.

Before diving in, you can also explore why you might want to learn these topics or just what career opportunities these skills will present you with!

Cluster Analysis

Imagine we have some data. In cluster analysis, we want to separate different groups based on the data, in an unsupervised manner – with no a priori information.

Flower petal length and width graph

Looking at a plot of the above data, we can say that it fits into 2 different groups – a cluster of points in the bottom left and a larger, elongated cluster on the top right. When we give this data to a clustering algorithm, it will create a split. Algorithms like k-means need to be told how many clusters we want. In some cases, we don’t need to specify the number of clusters. DBSCAN for instance is smart enough to figure out how many clusters there are in the data.

The data above is from the Iris dataset. This was created by the famous statistician R.A. Fisher, who collected measurements of 3 different species of flowers and plotted their properties such as petal width, petal length, sepal width, and sepal length. Since we are doing clustering, we have removed the class labels from the data set, as that is what the clustering algorithm is trying to give us: a notion of which data points belong together.

Clustering

Clustering means grouping data into clusters so that the data in each cluster have similar attributes or properties. For example, the data in the small cluster in the above plot have small petal length and small petal width.

There are several applications of clustering analysis used across a variety of fields:

  • Market analysis and segmentation
  • Medical imaging – X-rays, MRIs, fMRIs
  • Recommender systems – such as those used on Amazon.com
  • Geospatial data – latitude and longitude coordinates, etc.
  • Anomaly detection

People have used clustering algorithms to detect brain anomalies. We see below various brain images. C1 – C7 are various clusters. This is an example of clustering in the medical domain.

Brain images with clusters C1–C7 highlighted

Another example is a spatial analysis of user-generated ratings of venues on yelp.com.

Spatial analysis with restaurant data clustered

Cluster Analysis

A set of points X1…Xn is taken in and analyzed, and out comes a set of mappings from each point to a cluster (X1 -> C1, X2 -> C2, etc.).

Visual representation of points going through cluster analysis

There are several such algorithms that will detect clusters. Some of these algorithms have additional parameters, e.g., the number of clusters. These parameters vary for each algorithm.

The input, however, is a set of data points X1…Xn in any dimensionality, i.e., 2D, 3D, 100D, etc. For our purposes, we will stick with 2D points. It is hard to visualize data of higher dimensions, though there are dimensionality reduction techniques that can reduce, say, 100 dimensions to 2 so that the data can be plotted.

The output is a cluster assignment where each point either belongs to a cluster or could be an outlier (noise).

Cluster analysis is a kind of unsupervised machine learning technique, as in general, we do not have any labels. There may be some techniques that use class labels to do clustering but this is generally not the case.
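To make this input-to-output mapping concrete, here is a tiny sketch (our own toy data, using scikit-learn – not code from the course). The points and parameter values are assumptions chosen only for illustration.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])  # four 2D points, two obvious groups
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # one cluster label per point, e.g. [0 0 1 1]
print(DBSCAN(eps=0.5, min_samples=2).fit_predict(X))   # same idea; a label of -1 would mark a noise point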

Summary

We discussed what cluster analysis is, various clustering algorithms, and the inputs and outputs of these algorithms. We also discussed various applications of clustering – not necessarily in the data science field.

Part 2

In this video, we will look at probably the most popular clustering algorithm: k-means clustering.

Graph with blue and red points

This is a very popular, simple and easy-to-implement algorithm.

In this algorithm, we separate data into k disjoint clusters. These clusters are defined such that they minimize the within-cluster sum-of-squares. We’ll discuss this more when we look at k-means convergence. Disjoint here means that one point cannot belong to more than one cluster.

There is only one parameter for this algorithm: k, the number of clusters. We need to have an idea of k before we run the algorithm. Sometimes it is obvious how many clusters we should have, but sometimes it is not that clear. We will discuss later how to make this choice.

This algorithm is a baseline algorithm in clustering.

The cluster center/centroid is a point that represents the cluster. The figure above has a red and a blue cluster, and each X marks a centroid – the average of the x and y coordinates of the cluster’s points. In the blue cluster, that average falls somewhere in the middle, represented by the X in the middle of the square.

K-means Clustering Algorithm

1. Randomly initialize the cluster centers. For example, in the above diagram, we pick 2 random points to initialize the clusters.
2. Assign each point to its nearest cluster center using a distance metric such as Euclidean distance.
3. Update each cluster centroid using the mean of the points assigned to it.
4. Go back to step 2 until convergence (the cluster centroids stop moving or move only small, imperceptible amounts).
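To make these steps concrete, here is a minimal from-scratch sketch in NumPy (our own illustration – the function name, defaults, and the lack of an empty-cluster guard are assumptions, not code from the course):

import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    # X is an (n_points, n_dims) array; returns the final centroids and labels
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids at k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (no empty-cluster guard here, for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once no centroid moves more than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels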

Let’s go through an example (fake data) and discuss some interesting things about this algorithm.

Fake data graph

Visually, we can see that there are 2 clusters present.

Let’s randomly assign the cluster centers.

Fake data graph with random cluster centers

Let’s now assign each point to the closest cluster.

Fake data graph with points assigned to cluster

The points are now colored blue or red depending on which centroid they are closer to.

Next we need to update our cluster centroids. For the blue centroid, we take the average of all the x-coordinates – this will be the new x-coordinate for the centroid. Similarly we look at all the y-coordinates for the blue points, take their average and this becomes the new y-coordinate for the centroid. Likewise for the red points.

When I do this, the centroids shift over.

Fake data graph with centroids shifted

Once again I need to figure out which centroid each point is close to, which gives me the following.

Fake data graph with points assigned to new centroids

Once again, I update my cluster centroids as before.

Fake data graph with centroids shifted again

If we try to do another shift, the centroids won’t move again. This is evident from the last 2 figures, where the same points are assigned to the same centroids.

At this point, we say that k-means has converged.

Convergence

Convergence means that the cluster centroids either don’t move at all or move only a very, very small amount. We use a threshold value: if no centroid moves by at least that much, k-means has converged.

Formally, k-means seeks the assignment of points to clusters C1…Ck that minimizes the within-cluster sum-of-squares:

\[ \displaystyle\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2 \]

where $\mu_j$ is the centroid of cluster $C_j$.

Mathematically, k-means is guaranteed to converge in a finite number of iterations (an iteration being assigning points to clusters and shifting the centroids). It may take a long time, but it will eventually converge. Note that this says nothing about the best or optimal clustering – just that the algorithm will converge.

K-means is sensitive to where you initialize the centroids. There are a few techniques to do this:

  • Assign each cluster center to a random data point.
  • Choose k points to be farthest away from each other within the bounds of the data.
  • Repeat k-means over and over again and pick the average of the clusters.
  • Another, more advanced approach called k-means++ chooses the initial centers probabilistically, weighting points by their squared distance from the centers already chosen. We won’t be getting into it though – it’s built into many libraries.

Choosing k (How many clusters to use)

One way is to plot the data points and try different values to see what works the best. Another technique is called the elbow method.

Elbow method

Data graph with elbow point circled

Steps:

  • Choose some values of k and run the clustering algorithm
  • For each cluster, compute the within-cluster sum-of-squares: the sum of the squared distances between the centroid and each data point assigned to it.
  • Sum up for all clusters, plot on a graph
  • Repeat for different values of k, keep plotting on the graph.
  • Then pick the elbow of the graph.

This is a popular method supported by several libraries.
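For example, a minimal sketch of the elbow method with scikit-learn might look like this (our own illustration; X is assumed to be an (n_samples, 2) array of points):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum-of-squares for this k

plt.plot(ks, inertias, 'o-')
plt.xlabel('k')
plt.ylabel('within-cluster sum-of-squares')
plt.show()

The elbow is the value of k where the curve’s decrease flattens out noticeably.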

Advantages of k-means

  • It is a widely known and used algorithm.
  • It’s also fairly simple to understand and easy to implement.
  • It is also guaranteed to converge.

Disadvantages of k-means

  • It is algorithmically slow, i.e., it can take a long time to converge.
  • It may also converge only to a local minimum, i.e., not the optimal solution.
  • It’s also not very robust against varying cluster shapes, e.g., it may not perform very well for elongated cluster shapes. This is because we use the same parameters for each cluster.

This was a quick overview of k-means clustering. Let’s now look at how it performs on different kinds of data sets.


Transcript 1

Hello, world, and thanks for joining me. My name is Mohit Deshpande. In this course, we’ll be learning about clustering analysis. In particular, we’re gonna use it in the context of data science, and we’re gonna analyze some data and see if we can segment out different kinds of customers so that we can provide them with all kinds of neat promotional discounts or special offers.

This is a very common thing that’s done for a lot of companies. We have a bunch of data. We wanna figure out unique groups of customers so that we can offer them special things.

In this course, we’ll be learning a little bit about three different kinds of clustering algorithms. First, I wanna introduce you to the concept of clustering, or what is clustering, and what are some other applications of it. Then we’ll get onto three different clustering algorithms, and what’s really neat is that they all approach clustering in a very different fashion.

So first, we’re gonna learn about a very popular kind of clustering called K-means clustering. Then we’ll look into a density-based clustering algorithm called DBSCAN. Then finally we’ll discuss a hierarchical clustering algorithm called hierarchical agglomerative clustering. And then we’ll see different kinds of data, where they work really well, and where they don’t work quite so well. So we’ll get a good idea of, at least some kind of notion, of which kind of clustering algorithms tend to work well on which kind of data.

Then, finally, we’re gonna tie everything together by looking at a real world data set and see if we can segment out different customers. We’ve been making courses since 2012, and we’re super-excited to have you on board. Online courses are a great way to learn new skills, and I take a lot of online courses myself.

Several courses consist mainly of video lessons that you can watch at your own pace and as many times as you want. So feel free to watch and re-watch and pause the video as many times as you want. We also have downloadable source code and project files and you’re getting everything that we build during the lesson.

It’s highly recommended that you code along with me. In my experience, the best way to learn something is to get your feet wet. Lastly, we’ve seen that the students who get the most out of these online courses are those who make a weekly plan and stick with it, depending, of course, on your own availability and learning style. Zenva, over the past six years, has taught all kinds of different topics on programming and game development to over 300,000 students across 100 courses. The skills that they learned in these courses are completely transferable. In fact, some of the students have used the skills that they’ve learned to advance their careers, to make a start-up, or to publish their own content.

Thanks for joining, and I look forward to seeing the cool stuff you’ll be building. Now, without further ado, let’s get started.

Transcript 2

In this video, we are going to learn a little bit about cluster analysis. And this is a topic that we’re gonna be discussing over the duration of this course.

So just to give you an overview of the different things we’re gonna be covering, I’m gonna give you an introduction to cluster analysis, basically what is it and what are the different applications of it, as well as what kind of algorithms can we expect. And in fact, we’re gonna be covering three very popular algorithms, k-means clustering, DBSCAN, which stands for density-based spatial clustering of applications with noise, but usually we just call it DBSCAN, and then hierarchical agglomerative clustering, HAC.

These are three very popular clustering algorithms. And the interesting thing is they all take very different approaches to creating clusters. And we’re gonna get into all those in the subsequent videos. But first let’s talk a little bit about cluster analysis. And that’s what we’re gonna be focusing on primarily in this video, just to acquaint you with some of the terminology as well as some applications of cluster analysis for example. So clustering analysis, so imagine we have some data.

The whole point of clustering analysis is that in an unsupervised way, with no prior information, we want to be able to separate different groups based on the data that we have. And now sometimes these groups are predefined. You have a set of data like in this case, you plot it, and you say, well, it seems to fit into two little groups. Here there’s a little clustering of points on the bottom left, and there’s a larger, kind of elongated cluster on the top right, and so we might say, well, we can give a predefined number of clusters.

We want two clusters and we can give that to the clustering algorithms and then they’ll group these guys together. It’ll make a split and, in some cases, we don’t even need to specify the number of clusters. In fact, some algorithms, such as DBSCAN, are actually smart enough to figure out how many clusters there are based entirely on the data. But algorithms like k-means will actually need to be told how many clusters we have.

And so, for example, this data set is actually taken, it’s a very famous data set called the Iris Dataset, collected by Ronald Fisher, who is, and here is a quick historical side note, probably the most important statistician of the 20th century. A lot of statistical techniques that we have that are used in all kinds of companies were originally some of his work, but he collected this dataset of flowers. He had 50 samples of each of three different species of flowers and he plotted their measured properties like petal width, petal length, sepal width, and sepal length, and they’re all plotted out.

In this case, what I’ve actually done is removed the class labels, because usually when we’re doing clustering analysis, we don’t have the correct labels. In fact, that’s what the clustering is trying to give us. It’s trying to give us some notion that these things belong together and these other things belong together. So this is just a kind of data that you might expect with some clustering.

So clustering is taking our data and then putting it into groups such that the groups have some kind of similar properties or similar attributes here. So if we go back a slide here, so we have one cluster at the bottom left for example. That might be considered a cluster where the flowers in that cluster have a small petal length and a smaller petal width, for example. That’s an example of grouping, as I’m talking about.

And there’s so many different applications of clustering analysis, not just used for something like data science. But also things like medical imaging for things like x-rays or MRIs or FMRIs. They use clustering analysis. Recommender systems like those used on websites like Amazon.com and whatnot, they recommend you can use clustering analysis to help build recommendation systems. Geospatial data, of course, our data is in latitude and longitude coordinates and we can do some kind of clustering with that as well. And we can also use it for things like anomaly detection in our data. The top bullet point here is that we can use it for market analysis and segmentation, a very popular technique for that as well.

And so this kind of gives you a little bit of the applications of clustering analysis and certainly the algorithms that we’re gonna be learning are used in the field and can be used for your data as well. So for example, people have used clustering algorithms to actually do things like check for brain anomalies, so here are just images I’ve taken of the brain, and C-1 to C-7 are different clusters of the brain. They use a different clustering metric. And the result is, you can kind of segment out different kinds of brain anomalies. So it’s an application of clustering in the medical domain.

And another domain, what these guys have done is look at reviews on Yelp.com, and they’ve performed a spatial analysis of that. So there are a lot of different applications of clustering in many different fields.

So just a quick overview of clustering, but we can get a little bit more formal with this, with the algorithms. So you can think of clustering analysis on the bottom here, as just taking in a set of points, X1 to Xn. And then we churn it through a machine that is the cluster analysis, and then out comes an assignment that maps each point to a particular cluster. So on the bottom here, X1 is mapped to Cluster 1, X2 is mapped to Cluster 2, and there are a lot of different algorithms that exist to detect clusters.

In fact, like I’ve mentioned, we’re gonna be covering three of them, of the most popular ones. Now the parameters, some of these algorithms have additional parameters to them. So like I mentioned before, one of those parameters might be the number of clusters that you have that might have to be something that’s pre-defined that you would give to an algorithm. But there are sometimes other parameters as well, and they vary for each algorithm, so there’s no uniformness among the parameters. However, like you see in the chart here, the input to clustering algorithms is usually a set of data points, X1 to Xn, and they can be of any dimensionality, they can be in 2D, they can be 2D points, they can be 3D points, they can be 100 dimensional points.

But for our purposes, we’re just gonna stick with 2D points. It is certainly possible to do clustering on higher dimensionality data, but one of the issues is actually visualizing that higher dimensional data. Because we can’t really plot a hundred dimensional graph terribly easily, but there exist dimensionality reduction techniques that can compress 100 dimensions down into two so we can plot it, but that’s a little beyond the scope of the course here. So we’re mostly gonna be sticking with 2D.

And then the output, like I said, is a cluster assignment. So each data point belongs to a cluster, or actually in some cases, some algorithms have some notion of outliers or noise built in, such as DBSCAN for example. So a point may not necessarily belong to an exact cluster, it might also be a noise point, or some people like to think of it as belonging to a noise cluster, although that’s not quite correct. But that is the output, and this is kind of a generalization of clustering algorithms. They take in this set of input points and then they produce a mapping that maps each point to a cluster or perhaps to a noise point.

And you can think of clustering analysis as being a kind of unsupervised machine learning, per se. And it’s unsupervised because generally, with clustering analysis, we don’t have any labels for our data. We just get a set of points and we’re set to cluster them. And so this is unsupervised, and there are techniques that might use class labels to help do some kind of clustering or provide some other auxiliary information, but in many, many cases, clustering analysis is done unsupervised. We don’t have correct labels.

Alright, so that is just a quick overview of clustering analysis, so basically, what is clustering analysis, what kind of inputs we give to a clustering algorithm, what kinds of outputs do we expect from a clustering algorithm, and also discuss a little bit about different applications of clustering algorithms used in a wide variety of fields. It doesn’t even have to be something like data science or computer science, it can be used in the medical field, in business, finance, and so learning this information, we can take this and apply it to many other fields. So that is just a quick introduction to clustering analysis.

Transcript 3

In this video we are going to look at probably the most popular clustering algorithm called K-means clustering. So K-means clustering, the point of it is to take our cluster data and then separate it into K disjoint clusters, and these clusters are defined so that they minimize what’s called the within-cluster sum-of-squares, but we’ll get to that a bit later. You’ll see what that means when we discuss a little bit more about K-means convergence.

So the point of K-means clustering is to separate the data into K disjoint clusters. What I mean by disjoint clusters means that one point cannot belong to more than one cluster, and the only parameter that we have to set for this algorithm is the number of clusters. So you’ll see an example of an algorithm where you need to have some notion of how many clusters you want your data to have before you run the algorithm. Later, we’re gonna discuss how you can select this value of K, because sometimes it’s quite obvious how many clusters you have, but many times it’s not quite so clear how many clusters you should have, and so we’ll discuss a couple techniques a bit later on how to choose this parameter K.

So K-means clustering is very popular, very well-known. The algorithm itself is actually quite simple. It’s just a couple lines of algorithm code, and the code itself, if you had to write it from scratch, is not something that would take you a long time. It’s very popular. I mean, it’s taught in computer science curriculums very commonly. So knowing this algorithm is kind of the first step to getting more acquainted with clustering. It’s kind of the baseline algorithm, and we’ll move on to more complicated algorithms a bit later, and then one point of terminology here.

You’ll often hear something called a cluster center or a centroid, and really that’s just a point that represents a cluster, so in the figure that I have here on the bottom right, we have two clusters. We have a red cluster and a blue cluster, and the X is the centroid. In other words, the centroid is really like the average of the X coordinates and the average of the Y coordinates and then that’s the point. So you can see in the blue, the average, if I average the X coordinates is somewhere gonna be in the middle, and if I average the Y coordinates, that’s gonna be somewhere in the middle and we end up with some centroid that’s in the middle of that square, and so that’s what I mean by centroid. So that’s all we really need to know about K-means.

Now we can get to the algorithm, we can actually get to the algorithm. I’m going to go through an example with you so you can kinda see it progress step-by-step with just some fake data, and then we’ll discuss some more interesting things about it such as convergence, properties, you know, talk a little bit about what convergence is, and then we’ll also discuss a bit later how do you select the value of K and what not. So but we’re gonna get to that.

So the clustering algorithm, the first step to the clustering is you randomly initialize the cluster centers. So suppose I have some data with two clusters, like in the previous chart. I have two clusters, so what I do is I just pick two random points to initialize the clusters. Now, there are many ways to do this, and we’ll discuss a couple different ways a bit later, and so after we randomly initialize our cluster centers, then for each point we assign it to its nearest cluster using some kind of metric, like the distance formula for example, Euclidean distance, and then we update the cluster centroids by taking the mean of the points assigned to them.

So after each point is assigned to a centroid, we look at the centroids and see how many points are assigned to this one centroid, and then we just average all the points, and that’s gonna shift the centroid. We basically keep doing this.

After the centroids have shifted, we then reset all of our points and assign them to the nearest cluster again, and we just keep doing this until the cluster centroids stop moving or they move a very, very small imperceptible amount. Now, let’s actually do an example of this with some data so you have a better understanding of what I mean by each of these steps. So here is our data and we wanna cluster it, and we’re setting K equals two. There’s two clusters, okay?

If we look at this data we can kinda see that, well, yeah, there are two clusters here. So here I’ve just randomly initialized the two clusters. There’s red and blue, and just placed ’em somewhere on the plane. So now what I want to do is for each point, I’m going to assign it to the closest cluster. So after I do that I get something like this. So now I have colored all the points red or blue depending on if it’s closer to the red centroid or closer to the blue centroid.

So after I’ve done this, now I need to update my cluster centroid. So how I do that is I look at all the X, for example, for the blue centroid, I’ll look at all the X coordinates and take the average and that’s my new X coordinate for the centroid. I look at all the Y coordinates of all the blue points and I take the average of those and that becomes the new Y coordinate for my blue centroid. So after I do this, the centroids kinda shift over and then I reset the colors of all the points, and now I just do the same thing again.

So now for each point I have to figure out which centroid is it close to. So after I do that I get this assignment, and now I, again, take the average of the X coordinates and the average of the Y coordinates for each cluster and that becomes the location of my new centroid, and I shift it over again, and so now it’s gonna be here, and now let’s do another, now I do this one more time, and you’ll see that now, where if I try to do another shift, that nothing will, the cluster centroids won’t actually move again because in the previous step we see that the same points are assigned to the same cluster centroids.

So in other words, my centroids aren’t gonna move, and at this point we say that K-means has converged. So I hope that this gives you a bit of an overview of a K-means here as in how do you actually run the algorithm in step. So like I mentioned convergence, just ignore that big equation. It looks kinda scary but this is just to illustrate a point here. So convergence means that the cluster centroids either don’t move entirely or they move a very, very small very small amount, and you usually just set a threshold value. So if they don’t move at least this much, then we say they’ve converged.

The reason that I put this equation up here is just to illustrate the point that mathematically K-means is guaranteed to converge in a finite number of iterations, and by iterations I mean assigning the points to a cluster and then shifting and then assigning the points to a cluster and shifting.

In a finite amount of iterations, it is mathematically guaranteed to converge. Now, finite number, that still means that it might take a long time, but it will eventually converge. Now, this convergence, that theorem of K-means, doesn’t say anything about converging to the best cluster or converging to the optimal clustering. It just says that it will converge. This is still a useful property to know about K-means because this might not be true of other algorithms.

So one other point that I want to discuss is that K-means is actually quite sensitive to where you initialize the centroids, and there are a couple techniques to do this. So one technique that people like to use is you can just assign a cluster to be a random data point. So you pick a data point and say that is my cluster center. Another interesting thing is to choose K points that are the farthest away from each other. So if I have two points, I’ll choose two points that are the farthest away from each other within the bounds of my data, between the bound of the smallest X coordinate and the largest X coordinate, smallest Y coordinate, largest Y coordinate, and another approach is to repeat my K-means many times over and over again.

You just run the algorithm over and over again using one of these initializations and just take the average of the clusters. There’s another, more advanced approach called K-means plus plus that does a smarter, distance-weighted initialization of the centers, but we’re not gonna get into that. In fact, we don’t have to implement it either ’cause it’s usually embedded in many libraries. Okay, so the only thing that remains to talk about is how do you choose a value of K. Well, one thing to do is just plot your data, try a couple K values, and see which clustering works the best.

Another technique that’s actually quite popular is called the elbow method, and what you do is you choose some values of K, you run the clustering algorithm, you compute this thing called a within-cluster-sum-of-squares. In other words, you take the centroid and each data point assigned to that centroid and compute the sum of the square differences between those guys and you sum up all that. So for each cluster so you get one of those, and then you sum it up for all the clusters and you plot that on a graph, and you can keep plotting this on a graph for different values of K and then you pick the elbow of this graph here and that’s, it’s not a very clear-cut way as to what value of K you should use, it just gives you kind of an idea of what the range of different K you should choose there.

So this is also very popular and, again, there are lots of libraries that can actually generate this for you. You can do this yourself as well. So this is just a technique that you might wanna use to figure out how many values of K you should use. Okay, so that pretty much covers K-means clustering.

So I wanna talk a little bit about some of the advantages and disadvantages of this, and then later we’re gonna actually see it run on different kinds of data, and we’ll see how well it performs there. So it’s widely known and used, very popular like I said. It’s a fairly simple algorithm actually. It’s really only like four steps, and lines-of-code wise it’s not that long either, so it’s easy to implement. One big perk is that it’s guaranteed to converge. It’s pretty nice.

Disadvantages of K-means: it’s algorithmically pretty slow, if you know anything about big O notation. It can take a long time for it to converge. Another issue is that it might not converge to the best solution. It’s guaranteed to converge, but it might not converge to the best solution, and it’s not very robust against clusters that have various shapes, like elongated clusters for example. It might not perform so well.

We’re actually gonna see this when we look at some fake data. We’ll see how well it performs against elongated data for example. And the reason for this is because we’re using the same parameters for each cluster when we do this. So anyway, that is just a quick overview of K-means clustering, and so now that we have this example we can go ahead and try to look at, analyze how it performs on different kinds of data sets.

Interested in continuing? Check out the full Data Insights with Cluster Analysis course, which is part of our Data Science Mini-Degree.

For teachers, you can also try Zenva Schools, which offers online learning suitable for K-12 school environments and covers a wide array of topics including Python, web development, Unity, and more.

Clustering with Gaussian Mixture Models – Data Science & ML

Gaussian Mixture Models are an essential part of data analysis – but do you know how they work?

In this article, we’ll seek to demystify how to analyze “clusters” of data and discuss Gaussian Mixture Models which can help us more efficiently and accurately sort out clusters of data points.

Let’s dive in!

What is Clustering?

Clustering is an essential part of any data analysis. Using an algorithm such as K-Means leads to hard assignments, meaning that each point is definitively assigned a cluster center. This leads to some interesting problems: what if the true clusters actually overlap? What about data that is more spread out; how do we assign clusters then?

Gaussian Mixture Models save the day! We will review the Gaussian (normal) distribution and the problem of clustering. Then we will discuss the overall approach of Gaussian Mixture Models. Training them requires a very famous algorithm called the Expectation-Maximization Algorithm, which we will discuss.

Download the full code here.

If you are not familiar with the K-Means algorithm or clustering, read about it here.


Gaussian Distribution

The first question you may have is “what is a Gaussian?”. It’s the most famous and important of all statistical distributions. A picture is worth a thousand words so here’s an example of a Gaussian centered at 0 with a standard deviation of 1.

This is the Gaussian or normal distribution! It is also called a bell curve sometimes. The function that describes the normal distribution is the following

\[ \mathcal{N}(x; \mu, \sigma^2) = \displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\displaystyle\frac{(x-\mu)^2}{2\sigma^2}}  \]

That looks like a really messy equation! And it is, so we’ll use $\mathcal{N}(x; \mu, \sigma^2)$ to represent it. If we look at it, we notice there is one input and two parameters. First, let’s discuss the parameters and how they change the Gaussian. Then we can discuss what the input means.

The two parameters are called the mean $\mu$ and standard deviation $\sigma$. In some cases, the standard deviation is replaced with the variance $\sigma^2$, which is just the square of the standard deviation. The mean of the Gaussian simply shifts the center of the Gaussian, i.e., the “bump” or top of the bell. In the image above, $\mu=0$, so the largest value is at $x=0$.

The standard deviation is a measure of the spread of the Gaussian. It affects the “wideness” of the bell. Using a larger standard deviation means that the data are more spread out, rather than closer to the mean.

What about the input? More specifically, the above function is called the probability density function (pdf) and it tells us the probability density of observing an input $x$, given that specific normal distribution. Given the graph above, we see that an input value of 0 gives us a density of about 0.4. As we move away in either direction from the center, the density decreases. This is one key property of the normal distribution: the density is highest at the mean and approaches zero as we move away from the mean.

(Since this is a probability distribution, the sum of all of the values under the bell curve, i.e., the integral, is equal to 1; we also have no negative values.)
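For a quick sanity check, here is a minimal sketch (our own, assuming SciPy and Matplotlib are available) that reproduces the standard bell curve above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x, loc=0, scale=1))  # mean 0, standard deviation 1
plt.xlabel('x')
plt.ylabel('density')
plt.show()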

Why are we using the Gaussian distribution? The Expectation-Maximization algorithm is actually more broad than just the normal distribution, but what makes Gaussians so special? It turns out that many dataset distributions are actually Gaussian! We find these Gaussians in nature, mathematics, physics, biology, and just about every other field! They are ubiquitous! There is a famous theorem in statistics called the Central Limit Theorem that states that as we sum (or average) more and more samples from a dataset, the result tends to resemble a Gaussian, even if the original dataset distribution is not Gaussian! This makes Gaussians very powerful and versatile!

Multivariate Gaussians

We’ve only discussed Gaussians in 1D, i.e., with a single input. But they can easily be extended to any number of dimensions. For Gaussian Mixture Models, in particular, we’ll use 2D Gaussians, meaning that our input is now a vector instead of a scalar. This also changes our parameters: the mean is now a vector as well! The mean represents the center of our data so it must have the same dimensionality as the input.

The variance changes less intuitively into a covariance matrix $\Sigma$. The covariance matrix, in addition to telling us the variance of each dimension, also tells us the relationship between the inputs, i.e., if we change x, how does y tend to change?

We won’t discuss the details of the multivariate Gaussian or the equation that generates it, but knowing what it looks like is essential to Gaussian Mixture Models since we’ll be using these.

The above chart has two different ways to represent the 2D Gaussian. The upper plot is a surface plot that shows our 2D Gaussian in 3D. The X and Y axes are the two inputs and the Z axis represents the probability. The lower plot is a contour plot. I’ve plotted these on top of each other to show how the contour plot is just a flattened surface plot where color is used to determine the height. The lighter the color, the larger the probability. The Gaussian contours resemble ellipses so our Gaussian Mixture Model will look like it’s fitting ellipses around our data. Since the surface plot can get a little difficult to visualize on top of data, we’ll be sticking to the contour plots.

Gaussian Mixture Models

Now that we understand Gaussians, let’s discuss Gaussian Mixture Models (GMMs)! To motivate our discussion, let’s see some example data that we want to cluster.

We could certainly cluster these data using an algorithm like K-Means to get the following results.

In this case, K-Means works out pretty well. But let’s consider another case where we have overlap in our data. (The two Gaussians are colored differently)

In this case, it’s pretty clear from the elliptical shape of the clusters that these data are generated from 2D Gaussians. In fact, we know that these data follow the normal distribution, so using K-Means doesn’t seem to take advantage of that fact. Even though I didn’t tell you our data were normally distributed, remember that the Central Limit Theorem says that aggregating enough random samples from any distribution will tend to look like the normal distribution.

Additionally, K-Means doesn’t take into account the covariance of our data. For example, the blue points seem to have a relationship between X and Y: larger X values tend to produce larger Y values. If we had two points that were equidistant from the center of the cluster, but one followed the trend and the other didn’t, K-Means would regard them as being equal, since it uses Euclidean distance. But it seems certainly more likely that the point that follows the trend should match closer to the Gaussian than the point that doesn’t.

Since we know these data are Gaussian, why not try to fit Gaussians to them instead of a single cluster center? The idea behind Gaussian Mixture Models is to find the parameters of the Gaussians that best explain our data.

This is what we call generative modeling. We are assuming that these data are Gaussian and we want to find parameters that maximize the likelihood of observing these data. In other words, we regard each point as being generated by a mixture of Gaussians and can compute that probability.

\begin{align}
p(x) = \displaystyle\sum_{j=1}^{k} \phi_j\mathcal{N}(x; \mu_j, \Sigma_j)\\
\displaystyle\sum_{j=1}^{k} \phi_j = 1
\end{align}

The first equation tells us that a particular data point $x$ is a linear combination of the $k$ Gaussians. We weight each Gaussian with $\phi_j$, which represents the strength of that Gaussian. The second equation is a constraint on the weights: they all have to sum up to 1. We have three different parameters that we need to update: the weights for each Gaussian $\phi_j$, the means of the Gaussians $\mu_j$, and the covariances of each Gaussian $\Sigma_j$.

If we try to directly solve for these, it turns out that we can actually find closed forms! But there is one huge catch: we have to know which Gaussian each point was generated by! In other words, if we knew exactly which Gaussian a particular point was taken from, then we could easily figure out the weights, means, and covariances! But this one critical flaw prevents us from solving GMMs using this direct technique. Instead, we have to come up with a better approach to estimate the weights, means, and covariances.

Expectation-Maximization

How do we learn the parameters? There’s a very famous algorithm called the Expectation-Maximization Algorithm, also called the EM algorithm for short (published in 1977, with over 50,000 citations!), that we’ll use for updating these parameters. There are two steps in this algorithm as you might think: expectation and maximization. To explain these steps, I’m going to cover how the algorithm works at a high level.

The first part is the expectation step. In this step, we have to compute the probability that each data point was generated by each of the $k$ Gaussians. In contrast to the K-Means hard assignments, these are called soft assignments since we’re using probabilities. Note that we’re not assigning each point to a Gaussian, we’re simply determining the probability of a particular Gaussian generating a particular point. We compute this probability for a given Gaussian by computing $\phi_j\mathcal{N}(x; \mu_j, \Sigma_j)$ and normalizing by dividing by $\sum_{q=1}^{k} \phi_q\mathcal{N}(x; \mu_q, \Sigma_q)$. We’re directly applying the Gaussian equation, but multiplying it by its weight $\phi_j$. Then, to make it a probability, we normalize. In K-Means, the expectation step is analogous to assigning each point to a cluster.

The second part is the maximization step. In this step, we need to update our weights, means, and covariances. Recall in K-Means, we simply took the mean of the set of points assigned to a cluster to be the new mean. We’re going to do something similar here, except apply our expectations that we computed in the previous step. To update a weight $\phi_j$, we simply sum up the probability that each point was generated by Gaussian $j$ and divide by the total number of points. For a mean $\mu_j$, we compute the mean of all points weighted by the probability of that point being generated by Gaussian $j$. For a covariance $\Sigma_j$, we compute the covariance of all points weighted by the probability of that point being generated by Gaussian $j$. We do each of these for each Gaussian $j$. Now we’ve updated the weights, means, and covariances! In K-Means, the maximization step is analogous to moving the cluster centers.

Mathematically, at the expectation step, we’re effectively computing a matrix where the rows are the data point and the columns are the Gaussians. An element at row $i$, column $j$ is the probability that $x^{(i)}$ was generated by Gaussian $j$.

\[ W^{(i)}_j =  \displaystyle\frac{\phi_j\mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\displaystyle\sum_{q=1}^{k}\phi_q\mathcal{N}(x^{(i)}; \mu_q, \Sigma_q)} \]

The denominator just sums over all values to make each entry in $W$ a probability. Now, we can apply the update rules.

\begin{align*}
\phi_j &= \displaystyle\frac{1}{N}\sum_{i=1}^N W^{(i)}_j\\
\mu_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j x^{(i)}}{\sum_{i=1}^N W^{(i)}_j}\\
\Sigma_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^N W^{(i)}_j}
\end{align*}

The first equation is just the sum of the probabilities for a particular Gaussian $j$ divided by the number of points. In the second equation, we’re just computing the mean, except we weight each point by the probabilities for that cluster. Similarly, in the last equation, we’re just computing the covariance, except we weight by the probabilities for that cluster.
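To tie the E-step and M-step together, here is a minimal NumPy/SciPy sketch of one EM iteration (our own illustration – the function name and array shapes are assumptions, not the article’s code):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mus, sigmas):
    # X: (N, d) data; phi: (k,) weights; mus: (k, d) means; sigmas: (k, d, d) covariances
    N, k = len(X), len(phi)
    # E-step: W[i, j] is the probability that point i was generated by Gaussian j
    W = np.column_stack([phi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=sigmas[j]) for j in range(k)])
    W /= W.sum(axis=1, keepdims=True)  # normalize each row into probabilities
    # M-step: re-estimate weights, means, and covariances from the soft assignments
    Nj = W.sum(axis=0)  # effective number of points per Gaussian
    phi = Nj / N
    mus = (W.T @ X) / Nj[:, None]
    sigmas = np.stack([((X - mus[j]).T * W[:, j]) @ (X - mus[j]) / Nj[j] for j in range(k)])
    return phi, mus, sigmas

Repeating em_step until the parameters stop changing is the whole training loop; in practice we’ll let scikit-learn do this for us below.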

Applying GMMs

Let’s apply what we learned about GMMs to our dataset. We’ll be using scikit-learn to run a GMM for us. In the ZIP file, I’ve saved some data in a numpy array. We’re going to extract it, create a GMM, run the EM algorithm, and plot the results!

First, we need to load the data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

X_train = np.load('data.npy')

Additionally, we can generate a plot of our data using the following code.

plt.plot(X_train[:, 0], X_train[:, 1], 'bx')
plt.axis('equal')
plt.show()

Remember that clustering is unsupervised, so our input is only 2D points without any labels. We should get the same plot of the 2 overlapping Gaussians as before.

Using the GaussianMixture class of scikit-learn, we can easily create a GMM and run the EM algorithm in a few lines of code!

gmm = GaussianMixture(n_components=2)
gmm.fit(X_train)

After our model has converged, the weights, means, and covariances should be solved! We can print them out.

print(gmm.means_)
print('\n')
print(gmm.covariances_)

For comparison, I generated the original data according to the following Gaussians.

\begin{align*}
\mu_1 &= \begin{bmatrix}3 \\ 3\end{bmatrix}\\
\Sigma_1 &=
\begin{bmatrix}
1 & 0.6 \\
0.6 & 1
\end{bmatrix}\\
\mu_2 &= \begin{bmatrix}1.5 \\ 1.5\end{bmatrix}\\
\Sigma_2 &=
\begin{bmatrix}
1 & -0.7 \\
-0.7 & 1
\end{bmatrix}
\end{align*}

Our means should be pretty close to the actual means! (Our covariances might be a bit off due to the weights.) Now we can write some code to plot our mixture of Gaussians.

X, Y = np.meshgrid(np.linspace(-1, 6), np.linspace(-1, 6))  # 50x50 grid (linspace defaults to 50 points)
XX = np.array([X.ravel(), Y.ravel()]).T  # flatten the grid into a (2500, 2) array of points
Z = gmm.score_samples(XX)  # weighted log probability of each grid point under the GMM
Z = Z.reshape((50, 50))

plt.contour(X, Y, Z)
plt.scatter(X_train[:, 0], X_train[:, 1])

plt.show()

This code simply creates a grid of all X and Y coordinates between -1 and 6 (for both) and evaluates our GMM. Then we can plot our GMM as contours over our original data.

The plot looks just as we expected! Recall that with the normal distribution, we expect to see most of the data points around the mean and less as we move away. In the plot, the first few ellipses have most of the data, with only a few data points towards the outer ellipses. The darker the contour, the lower the score.

(The score isn’t quite a probability: it’s actually a weighted log probability. Remember that each point is generated by a weighted sum of Gaussians, and, in practice, we apply a logarithm for numerical stability, thus preventing underflow.)

To summarize, Gaussian Mixture Models are a clustering technique that allows us to fit multivariate Gaussian distributions to our data. GMMs work well when our data are actually Gaussian or when we suspect them to be. We also discussed the famous expectation-maximization algorithm, at a high level, to see how we can iteratively solve for the parameters of the Gaussians. Finally, we wrote code to train a GMM and plot the resulting Gaussian ellipses.

Gaussian Mixture Models are an essential part of data analysis and anomaly detection – which are important to learn when pursuing exciting developer careers! You can also find out more resources for exploring Python here.

Face Recognition with Eigenfaces – Computer Vision Tutorial

Face recognition – or the ability of computers to recognize faces and facial features – is an increasingly important concern for our future.

In this tutorial, we’re going to explore face recognition in-depth and learn how with techniques like eigenfaces, we can create our own software programs capable of identifying human faces. The applications for facial recognition are endless, and whether good or bad, it’s a skill that can help take your Python skills to the next level.

Let’s jump in and explore the world of computer vision and faces!

Intro & Project Files

Face recognition is ubiquitous in science fiction: the protagonist looks at a camera, and the camera scans his or her face to recognize the person. More formally, we can formulate face recognition as a classification task, where the inputs are images and the outputs are people’s names. We’re going to discuss a popular technique for face recognition called eigenfaces. And at the heart of eigenfaces is an unsupervised dimensionality reduction technique called principal component analysis (PCA), and we will see how we can apply this general technique to our specific task of face recognition.

Download the full code here.


Face Recognition

Before discussing principal component analysis, we should first define our problem. Face recognition is the challenge of classifying whose face is in an input image. This is different than face detection where the challenge is determining if there is a face in the input image. With face recognition, we need an existing database of faces. Given a new image of a face, we need to report the person’s name.

A naïve way of accomplishing this is to take the new image, flatten it into a vector, and compute the Euclidean distance between it and all of the other flattened images in our database – the smallest distance gives our match.
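As a minimal sketch of this naïve approach (our own toy code; the function name, argument names, and shapes are assumptions):

import numpy as np

def naive_recognize(new_face, db_faces, db_names):
    # new_face: (h, w) grayscale image; db_faces: (n, h, w); db_names: list of n names
    query = new_face.ravel().astype(float)
    flat_db = db_faces.reshape(len(db_faces), -1).astype(float)
    dists = np.linalg.norm(flat_db - query, axis=1)  # Euclidean distance to every stored face
    return db_names[int(dists.argmin())]  # name of the closest match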

There are several downsides to this approach. First of all, if we have a large database of faces, then doing this comparison for each face will take a while! Imagine that we’re building a face recognition system for real-time use! The larger our dataset, the slower our algorithm. But more faces will also produce better results! We want a system that is both fast and accurate. For this, we’ll use a neural network! We can train our network on our dataset and use it for our face recognition task.

There’s an issue with directly using a neural network: images can be large! If we had a single $m\times n$ image, we would have to flatten it out into a single $m \cdot n\times 1$ vector to feed into our neural network as input. For large image sizes, this might hurt speed! This is related to the second problem with using images as-is in our naïve approach: they are high-dimensional! (An $m\times n$ image is really an $m \cdot n\times 1$ vector.) A new input might have a ton of noise, and comparing each and every pixel using matrix subtraction and Euclidean distance might give us a high error and misclassifications!

These issues are why we don’t use the naïve method. Instead, we’d like to take our high-dimensional images and boil them down to a smaller dimensionality while retaining the essence or important parts of the image.

Dimensionality Reduction

The previous section motivates our reason for using a dimensionality reduction technique. Dimensionality reduction is a type of unsupervised learning where we want to take higher-dimensional data, like images, and represent them in a lower-dimensional space. Let’s use the following image as an example.

These plots show the same data, except the bottom chart zero-centers it. Notice that our data do not have any labels associated with them because this is unsupervised learning! In our simple case, dimensionality reduction will reduce these data from a 2D plane to a 1D line. If we had 3D data, we could reduce it down to a 2D plane or even a 1D line.

All dimensionality reduction techniques aim to find some hyperplane, a higher-dimensional generalization of a line, to project the points onto. We can imagine a projection as taking a flashlight perpendicular to the hyperplane we’re projecting onto and plotting where the shadows fall on that hyperplane. For example, in our above data, if we wanted to project our points onto the x-axis, then we pretend each point is a ball and our flashlight would point directly down or up (perpendicular to the x-axis) and the shadows of the points would fall on the x-axis. This is a projection. We won’t worry about the exact math behind this since scikit-learn can apply this projection for us.

In our simple 2D case, we want to find a line to project our points onto. After we project the points, then we have data in 1D instead of 2D! Similarly, if we had 3D data, we want to find a plane to project the points down onto to reduce the dimensionality of our data from 3D to 2D. The different types of dimensionality reduction are all about figuring out which of these hyperplanes to select: there are an infinite number of them!

Principal Component Analysis

One technique of dimensionality reduction is called principal component analysis (PCA). The idea behind PCA is that we want to select the hyperplane such that when all the points are projected onto it, they are maximally spread out. In other words, we want the axis of maximal variance! Let’s consider our example plot above. A potential axis is the x-axis or y-axis, but, in both cases, that’s not the best axis. However, if we pick a line that cuts through our data diagonally, that is the axis where the data would be most spread!

The longer blue axis is the correct axis! If we were to project our points onto this axis, they would be maximally spread! But how do we figure out this axis? We can borrow a term from linear algebra called eigenvectors! This is where eigenfaces gets its name! Essentially, we compute the covariance matrix of our data and take the eigenvectors of that covariance matrix with the largest eigenvalues. Those are our principal axes, and we project our data onto those axes to reduce dimensions. Using this approach, we can take high-dimensional data and reduce it down to a lower dimension by selecting the largest eigenvectors of the covariance matrix and projecting onto them.

Since we’re computing the axes of maximum spread, we’re retaining the most important aspects of our data. It’s easier for our classifier to separate faces when our data are spread out as opposed to bunched together.
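Here is a minimal from-scratch sketch of that idea (our own illustration on random toy data – scikit-learn’s PCA class does this, and more, for us later):

import numpy as np

X = np.random.randn(200, 5)  # toy data: 200 points, 5 features
X_centered = X - X.mean(axis=0)  # zero-center the data first
cov = np.cov(X_centered, rowvar=False)  # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh, since the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]  # sort axes by decreasing variance
components = eigvecs[:, order[:2]]  # keep the top-2 principal axes
X_reduced = X_centered @ components  # project the data onto those axes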

(There are other dimensionality techniques, such as Linear Discriminant Analysis, that use supervised learning and are also used in face recognition, but PCA works really well!)

How does this relate to our challenge of face recognition? We can conceptualize our $m\times n$ images as points in $m \cdot n$-dimensional space. Then, we can use PCA to reduce our space from $m \cdot n$ dimensions into something much smaller. This will help speed up our computations and be robust to noise and variation.

Aside on Face Detection

So far, we’ve assumed that the input image is only that of a face, but, in practice, we shouldn’t require the camera images to have a perfectly centered face. This is why we run an out-of-the-box face detection algorithm, such as a cascade classifier trained on faces, to figure out what portion of the input image has a face in it. When we have that bounding box, we can easily slice out that portion of the input image and use eigenfaces on that slice. (Usually, we smooth that slice and perform an affine transform to de-warp the face if it appears at an angle.) For our purposes, we’ll assume that we have images of faces already.
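As a brief sketch of that detection step (our own example, assuming the opencv-python package; the file name 'photo.jpg' is a placeholder):

import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
gray = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE)
for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    face = gray[y:y+h, x:x+w]  # slice out the detected face region for eigenfaces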

Eigenfaces Code

Now that we’ve discussed PCA and eigenfaces, let’s code a face recognition algorithm using scikit-learn! First, we’ll need a dataset. For our purposes, we’ll use an out-of-the-box dataset by the University of Massachusetts called Labeled Faces in the Wild (LFW).

Feel free to substitute your own dataset! If you want to create your own face dataset, you’ll need several pictures of each person’s face (at different angles and lighting), along with the ground-truth labels. The wider the variety of faces you use, the better the recognizer will do. The easiest way to create a dataset for face recognition is to create a folder for each person and put the face images in there. Resize the images so they aren’t too large! Remember that PCA will reduce the image’s dimensionality when we project onto that space anyway, so using large, high-definition images won’t help and will slow down our algorithm. A good size is ~512×512 for each image. The images should all be the same size so you can store them in one numpy array with dimensions (num_examples, height, width). (We’re assuming grayscale images.) Then use the folder names to disambiguate classes. Using this approach, you can use your own images.
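
Here’s a sketch of that folder-per-person approach; it assumes the Pillow library and a root folder layout like faces/<person_name>/*.jpg, both of which are our own choices rather than anything required by the tutorial:

import os
import numpy as np
from PIL import Image

def load_face_folders(root):
    """Load faces/<person>/*.jpg into (images, labels, names)."""
    images, labels, names = [], [], []
    for label, person in enumerate(sorted(os.listdir(root))):
        names.append(person)
        person_dir = os.path.join(root, person)
        for fname in sorted(os.listdir(person_dir)):
            img = Image.open(os.path.join(person_dir, fname)).convert('L')  # grayscale
            images.append(np.asarray(img, dtype=np.float32))
            labels.append(label)
    # np.stack works here because every image is the same size
    return np.stack(images), np.array(labels), names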

However, we’ll be using the LFW dataset. Luckily, scikit-learn can automatically load our dataset for us in the correct format. We can call a function to load our data. If the data aren’t available on disk, scikit-learn will automatically download them for us from the University of Massachusetts’ website.

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier


# Load data
lfw_dataset = fetch_lfw_people(min_faces_per_person=100)

_, h, w = lfw_dataset.images.shape
X = lfw_dataset.data
y = lfw_dataset.target
target_names = lfw_dataset.target_names

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

The argument to our function just prunes all people without at least 100 faces, thus reducing the number of classes. Then we can extract our dataset and other auxiliary information. Finally, we split our dataset into training and testing sets.

Now we can simply use scikit-learn’s PCA class to perform the dimensionality reduction for us! We have to select the number of components, i.e., the output dimensionality (the number of eigenvectors to project onto) that we want to reduce down to. Feel free to tweak this parameter to try to get the best result! We’ll use 100 components. Additionally, we’ll whiten our data, which is easy to do with a simple boolean flag! (Whitening just scales our resulting components so the data have unit variance, which has been shown to produce better results.)

# Compute a PCA 
n_components = 100
pca = PCA(n_components=n_components, whiten=True).fit(X_train)

# apply PCA transformation
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

We can apply the transform to bring our images down to a 100-dimensional space. Notice we’re not performing PCA on the entire dataset, only the training data. This is so we can better generalize to unseen data.

Now that we have a reduced-dimensionality vector, we can train our neural network!

# train a neural network
print("Fitting the classifier to the training set")
clf = MLPClassifier(hidden_layer_sizes=(1024,), batch_size=256, verbose=True, early_stopping=True).fit(X_train_pca, y_train)

To see how our network is training, we can set the verbose  flag. Additionally, we use early stopping.

Let’s discuss early stopping as a brief aside. Essentially, our optimizer will monitor the average accuracy for the validation set for each epoch. If it notices that our validation accuracy hasn’t increased significantly for a certain number of epochs, then we stop training. This is a regularization technique that prevents our model from overfitting!
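
scikit-learn lets us tune this behavior directly. As a sketch, the relevant MLPClassifier parameters look like this (the validation_fraction and n_iter_no_change values shown are just the library defaults):

# train a neural network with explicit early-stopping settings
clf = MLPClassifier(
    hidden_layer_sizes=(1024,),
    batch_size=256,
    early_stopping=True,      # hold out part of the training data as a validation set
    validation_fraction=0.1,  # fraction of training data reserved for validation
    n_iter_no_change=10,      # stop if the validation score stalls for this many epochs
    verbose=True,
).fit(X_train_pca, y_train)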

Imagine a chart of training and validation accuracy over the training epochs. We notice overfitting when our validation set accuracy starts to decline while the training accuracy keeps improving. At that point, we immediately stop training to prevent overfitting.

Finally, we can make a prediction and use a function to print out an entire report of quality for each class.

y_pred = clf.predict(X_test_pca)
print(classification_report(y_test, y_pred, target_names=target_names))

Here’s an example of a classification report.

                   precision    recall  f1-score   support

     Colin Powell       0.86      0.89      0.87        66
  Donald Rumsfeld       0.85      0.61      0.71        38
    George W Bush       0.88      0.94      0.91       177
Gerhard Schroeder       0.67      0.69      0.68        26
       Tony Blair       0.86      0.71      0.78        35

      avg / total       0.85      0.85      0.85       342

Notice there is no accuracy metric. Accuracy isn’t the most informative kind of metric to begin with. Instead, we see precision, recall, f1-score, and support. The support is simply the number of times this ground-truth label occurred in our test set, e.g., in our test set, there were actually 35 images of Tony Blair. The F1-score is computed from the precision and recall scores. Precision and recall are more specific measures than a single accuracy score, and a higher value for both is better.
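
For reference, if we count true positives ($TP$), false positives ($FP$), and false negatives ($FN$) for a given class, these metrics are defined as:

\[ \text{precision} = \displaystyle\frac{TP}{TP+FP} \qquad \text{recall} = \displaystyle\frac{TP}{TP+FN} \qquad F_1 = \displaystyle\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]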

After training our classifier, we can give it a few images to classify.

# Visualization
def plot_gallery(images, titles, h, w, rows=3, cols=4):
    plt.figure()
    for i in range(rows * cols):
        plt.subplot(rows, cols, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i])
        plt.xticks(())
        plt.yticks(())

def titles(y_pred, y_test, target_names):
    for i in range(y_pred.shape[0]):
        pred_name = target_names[y_pred[i]].split(' ')[-1]
        true_name = target_names[y_test[i]].split(' ')[-1]
        yield 'predicted: {0}\ntrue: {1}'.format(pred_name, true_name)

prediction_titles = list(titles(y_pred, y_test, target_names))
plot_gallery(X_test, prediction_titles, h, w)

(plot_gallery  and titles  functions modified from the scikit-learn documentation)

We can see our network’s predictions and the ground truth value for each image.

Another interesting thing to visualize is the eigenfaces themselves. Remember that PCA produces eigenvectors. We can reshape those eigenvectors into images and visualize the eigenfaces, as shown below.
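
Here’s a sketch of that visualization, reusing our plot_gallery function; this mirrors the approach in the scikit-learn documentation:

# reshape each principal component back into an image and plot it
eigenfaces = pca.components_.reshape((n_components, h, w))
eigenface_titles = ['eigenface {0}'.format(i) for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()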

These represent the “generic” faces of our dataset. Intuitively, these are vectors that represent directions in “face space” and are what our neural network uses to help with classification. Now that we’ve discussed the eigenfaces approach, you can build applications that use this face recognition algorithm!

We discussed a popular approach to face recognition called eigenfaces. The essence of eigenfaces is an unsupervised dimensionality reduction algorithm called Principal Components Analysis (PCA) that we use to reduce the dimensionality of images into something smaller. Now that we have a smaller representation of our faces, we apply a classifier that takes the reduced-dimension input and produces a class label. For our classifier, we used a single-layer neural network.

Face recognition is a fascinating example of merging computer vision and machine learning and many researchers are still working on this challenging problem today!

Nowadays, deep convolutional neural networks are used for face recognition, and these skills have big implications for your development career. Try one out on this dataset!

Text Classification with Naive Bayes – Python Tutorial https://gamedevacademy.org/text-classification-tutorial-with-naive-bayes/ Sat, 24 Sep 2022 13:36:53 +0000

If asked how your email sorted spam emails from legitimate emails, what would you say? Do you think you could build a spam filter yourself?

Hidden throughout our lives is one of the most prevalent uses for machine learning: text classification. Text classification allows us to do everything from identifying spam emails to being able to convert hand-written words into their plain text, digital equivalent.

In this tutorial, we’re going to discover how we can use the Naive Bayes technique to build our own text classifier and spam filter using Python.

If you’re ready to expand how you see the digital world and text, let’s jump in!

Intro to Text Classification

The challenge of text classification is to attach labels to bodies of text, e.g., tax document, medical form, etc. based on the text itself. For example, think of your spam folder in your email. How does your email provider know that a particular message is spam or “ham” (not spam)? We’ll take a look at one natural language processing technique for text classification called Naive Bayes.

Download the full code here.


Bayes Theorem

(credit: https://en.wikipedia.org/wiki/Bayes%27_theorem#/media/File:Bayes%27_Theorem_MMB_01.jpg)

Before we derive the algorithm, we need to discuss the fundamental rule that Naive Bayes uses: Bayes Theorem:

\[ p(A|B) = \displaystyle\frac{p(B|A)p(A)}{p(B)} \]

where $A$ and $B$ are events and $p(\cdot)$ is a probability.

Let’s take a second to break this down. On the left, we have the probability of an event $A$ happening given that event $B$ happens. We say this is equal to the probability of event $B$ happening given event $A$ times the probability that event $A$ happens overall. All of that is divided by the probability that event $B$ happens overall. An example of this might help shed some light on why this is an ingenious theorem.

The classic example used to illustrate Bayes Theorem involves medical testing. Let’s suppose that we were getting tested for the flu. When we get a medical test, there are really 4 cases to consider when we get the results back:

  • True Positive: The test says we have the flu and we actually have the flu
  • True Negative: The test says we don’t have the flu and we actually don’t have the flu
  • False Positive: The test says we have the flu and we actually don’t have the flu
  • False Negative: The test says we don’t have the flu and we actually do have the flu

Suppose we also know some information about the flu and our testing methodology: we know our test can correctly detect that a person has the flu 99.5% of the time (i.e., $p(\text{+}|\text{Flu})=0.995$) and correctly detect that a person does not have the flu 99.5% of the time (i.e., $p(\text{-}|\text{No Flu})=0.995$). These correspond to the true positive rate and true negative rate. We also know that this specific type of flu is rare and only affects 1% of people. Given this information, we can compute the probability that a randomly selected person who tests positive actually has this specific type of flu. Specifically, we want to compute the probability that the person has the flu, given that the person tested positive for it, i.e., event $A=\text{Flu}$ and $B=\text{+}$.

Let’s just substitute the problem specifics into Bayes Theorem.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{p(\text{+}|\text{Flu})p(\text{Flu})}{p(\text{+})} \]

Now let’s try to figure out specific values for the quantities on the right-hand side. The first quantity is $p(\text{+}|\text{Flu})$. This is the probability that someone tests positive given that they have the flu. In other words, this is the true positive rate: the probability that our test can correctly detect that a person has the flu! This number is 99.5%, or 0.995. The next quantity in the numerator is $p(\text{Flu})$. This is called the prior probability. In other words, it is the probability that any random person has the flu. We know from our problem that this number is 1%, or 0.01. Let’s substitute those values into the numerator.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{p(\text{+})} \]

Now we have to deal with the denominator: $p(\text{+})$. This is the probability that our test returns positive overall. We can’t use the information given in the problem quite as directly as before, however. But first, why do we even need $p(\text{+})$? Recall that probabilities have to be between 0 and 1. Based on the above equation, if we left out the denominator, then we wouldn’t have a valid probability!

Anyways, when can our test return positive? Well, there are two cases: either our test returns positive and the person actually has the flu (true positive), or our test returns positive and the person does not have the flu (false positive). We can’t simply add the two conditional probabilities to get the denominator; we have to weight them by their respective prior probabilities, i.e., the probability that any person has the flu overall and the probability that any person does not have the flu overall. Let’s expand the denominator.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{p(\text{+}|\text{Flu})p(\text{Flu})+p(\text{+}|\text{No Flu})p(\text{No Flu})} \]

Now let’s reason about these values. $p(\text{+}|\text{Flu})p(\text{Flu})$ is something we’ve seen before: it’s the numerator! Now let’s look at the next quantity: $p(\text{+}|\text{No Flu})p(\text{No Flu})$. We can compute the first term by taking the complement of the true negative rate: $p(\text{+}|\text{No Flu})=1-p(\text{-}|\text{No Flu})=0.005$. And $p(\text{No Flu})=1-p(\text{Flu})=0.99$ since they are complementary events. So now we can plug in all of our values and get a result.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{0.995 \cdot 0.01+0.005 \cdot 0.99}= 0.6678\]

This result is a little surprising! Despite our test’s accuracy, knowing someone tested positive means that there’s only about a 67% chance that they actually have the flu! Hopefully, this example illustrated how to use Bayes Theorem.
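
If you’d like to double-check that arithmetic yourself, it only takes a few lines of Python (the numbers are taken straight from the example above):

# plug the example's numbers into Bayes Theorem
p_pos_given_flu = 0.995                 # true positive rate
p_neg_given_no_flu = 0.995              # true negative rate
p_flu = 0.01                            # prior probability of the flu

p_pos = p_pos_given_flu * p_flu + (1 - p_neg_given_no_flu) * (1 - p_flu)
print(p_pos_given_flu * p_flu / p_pos)  # ~0.6678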

Deriving Naive Bayes

Now let’s convert the Bayes Theorem notation into something slightly more machine learning-oriented.

\[ p(H|E) = \displaystyle\frac{p(E|H)p(H)}{p(E)} \]

where $H$ is the hypothesis and $E$ is the evidence. Now this might make more sense in the context of text classification: the probability that our hypothesis is correct given the evidence to support it is equal to the probability of observing that evidence given our hypothesis times the prior probability of the hypothesis divided by the probability of observing that evidence overall.

Let’s break this down again like we did for the original Bayes Theorem, except we’ll use the context of the text classification problem we’re trying to solve: spam detection. Our hypothesis $H$ is something like “this text is spam” and the evidence $E$ is the text of the email. So to restate, we’re trying to find the probability that our email is spam given the text in the email. The numerator is then the probability that we find these words in a spam email times the probability that any email is spam. The denominator is a bit tricky: it’s the probability that we observe those words overall.

There’s something a bit off with this formulation though: the evidence needs to be represented as multiple pieces of evidence: the words $w_1,\dots,w_n$. No problem! We can do that and Bayes Theorem still holds. We can also change hypothesis $H$ to a class $\text{Spam}$.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1,\dots,w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

Excellent! We can use a conditional probability formula to expand out the numerator.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1|w_2,\dots,w_n,\text{Spam})p(w_2|w_3,\dots,w_n,\text{Spam})\dots p(w_{n-1}|w_n,\text{Spam})p(w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

Not only does this look messy, it’s also quite messy to compute! Let’s think about the first term: $p(w_1|w_2,\dots,w_n,\text{Spam})$. This is the probability of finding the first word, given all of the other words and given that the email is spam. This is really difficult to compute if we have a lot of words!

Naive Bayes Assumption

To help us with that equation, we can make an assumption called the Naive Bayes assumption to help us with the math, and eventually the code. The assumption is that each word is independent of all other words. In reality, this is not true! Knowing what words come before/after do influence the next/previous word! However, making this assumption greatly simplifies the math and, in practice, works well! This assumption is why this technique is called Naive Bayes. So after making that assumption, we can break down the numerator into the following.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1|\text{Spam})p(w_2|\text{Spam})\dots p(w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

This looks better! Now we can interpret a term $p(w_1|\text{Spam})$ to mean the probability of finding word $w_1$ in a spam email. We can use the product symbol ($\Pi$) as a notational shorthand.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam})}{p(w_1,\dots,w_n)} \]

This is the Naive Bayes formulation! This returns the probability that an email message is spam given the words in that email. For text classification, however, we need an actual label, not a probability, so we simply say that an email is spam if $p(\text{Spam}|w_1,\dots,w_n)$ is greater than 50%. If not, then it is not spam. In other words, we choose “spam” or “ham” based on which one of these two classes has the higher probability! Actually, we don’t need probabilities at all. We can forget about the denominator since its only purpose is to scale the numerator.

\[ p(\text{Spam}|w_1,\dots,w_n) \propto p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam}) \]

(where $\propto$ signifies proportional to) That’s one extra thing we don’t have to compute! In this instance, we pick whichever class has the higher score since this is not a true probability anymore.

Numerical Stability

There’s one extra thing we’re going to do to help us with numerical stability. If we look at the numerator, we see we’re multiplying many probabilities together. If we do that, we could end up with really small numbers, and our computer might round down to zero! To prevent this, we’re going to look at the log probability by taking the log of each side. Using some properties of logarithms, we can manipulate our Naive Bayes formulation.

\begin{align*}
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam})\\
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam}) + \log \displaystyle\prod_{i=1}^np(w_i|\text{Spam})\\
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam}) + \displaystyle\sum_{i=1}^n \log p(w_i|\text{Spam})
\end{align*}

Now we’re dealing with additions of log probabilities instead of multiplying many probabilities together! Since log has really nice properties (monotonicity being the key one), we can still take the highest score to be our prediction, i.e., we don’t have to “undo” the log!
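
Here’s a quick sketch showing why this matters: a raw product of many small probabilities underflows to zero, while the sum of their logs stays perfectly representable:

import math

probs = [1e-5] * 100                # 100 small word probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                      # 0.0 -- the product underflowed!

print(sum(math.log(p) for p in probs))  # about -1151.29, no problem at all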

Dataset

We’ll be using the Enron email dataset for our training data. This is real email data from the Enron Corporation after the company collapsed. Before starting, download all of the numbered folders, i.e., enron1, enron2, etc., here and put all of them in a top-level folder simply called enron. Put this enron folder in the same directory as your source code so we can find the dataset!

A WORD OF WARNING!: Since this dataset is a real dataset of emails, it contains real spam messages. Your anti-virus may prune some of these emails because they are spam. Let your anti-virus prune as many as it wants. This will not affect our code as long as there are some spam and ham messages still there!

Naive Bayes Code

Here is the dataset-loading code:

import os
import re
import string
import math

DATA_DIR = 'enron'
target_names = ['ham', 'spam']

def get_data(DATA_DIR):
	subfolders = ['enron%d' % i for i in range(1,7)]

	data = []
	target = []
	for subfolder in subfolders:
		# spam
		spam_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'spam'))
		for spam_file in spam_files:
			with open(os.path.join(DATA_DIR, subfolder, 'spam', spam_file), encoding="latin-1") as f:
				data.append(f.read())
				target.append(1)

		# ham
		ham_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'ham'))
		for ham_file in ham_files:
			with open(os.path.join(DATA_DIR, subfolder, 'ham', ham_file), encoding="latin-1") as f:
				data.append(f.read())
				target.append(0)

	return data, target

This will produce two lists: the data list, where each element is the text of an email, and the target list, which is simply binary (1 meaning spam and 0 meaning ham). Now let’s create a class and add some helper functions for string manipulation.

class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""
    def clean(self, s):
        translator = str.maketrans("", "", string.punctuation)
        return s.translate(translator)

    def tokenize(self, text):
        text = self.clean(text).lower()
        return re.split(r"\W+", text)

    def get_word_counts(self, words):
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts

We have a function to clean up our string by removing punctuation, one to tokenize our string into words, and another to count up how many of each word appears in a list of words.
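
For example, here’s what those helpers produce on a spammy-looking string (the sample sentence is our own):

detector = SpamDetector()
words = detector.tokenize("Free money!!! Click now for FREE money.")
print(words)
# ['free', 'money', 'click', 'now', 'for', 'free', 'money']
print(detector.get_word_counts(words))
# {'free': 2.0, 'money': 2.0, 'click': 1.0, 'now': 1.0, 'for': 1.0}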

Before we write the actual code, let’s first understand the algorithm. For training, we need three things: the (log) class priors, i.e., the probability that any given message is spam/ham; a vocabulary of words; and word frequencies for spam and ham separately, i.e., the number of times a given word appears in spam and in ham messages. Given a list of input documents, we can write this algorithm.

  1. Compute log class priors by counting how many messages are spam/ham, dividing by the total number of messages, and taking the log.
  2. For each (document, label) pair, tokenize the document into words.
  3. For each word, add it to the vocabulary for spam/ham if it isn’t already there, and update its count. Also add that word to the global vocabulary.
def fit(self, X, Y):
    self.num_messages = {}
    self.log_class_priors = {}
    self.word_counts = {}
    self.vocab = set()

    n = len(X)
    self.num_messages['spam'] = sum(1 for label in Y if label == 1)
    self.num_messages['ham'] = sum(1 for label in Y if label == 0)
    self.log_class_priors['spam'] = math.log(self.num_messages['spam'] / n)
    self.log_class_priors['ham'] = math.log(self.num_messages['ham'] / n)
    self.word_counts['spam'] = {}
    self.word_counts['ham'] = {}

    for x, y in zip(X, Y):
        c = 'spam' if y == 1 else 'ham'
        counts = self.get_word_counts(self.tokenize(x))
        for word, count in counts.items():
            if word not in self.vocab:
                self.vocab.add(word)
            if word not in self.word_counts[c]:
                self.word_counts[c][word] = 0.0

            self.word_counts[c][word] += count

First, we can compute the log class priors by counting up how many spam/ham messages are in our dataset and dividing by the total number. Finally, we take the log.

Then we can iterate through our dataset. For each input, we get the word counts and iterate through each (word, frequency) pair. If the word isn’t in our global vocabulary, we add it. If it isn’t in the vocabulary for that particular class label, we also add it along with the frequency.

For example, suppose we had a “spam” message. We count up how many times each unique word appears in that spam message and add that count to the “spam” vocabulary. Suppose the word “free” appears 4 times. Then we add the word “free” to our global vocabulary and add it to the “spam” vocabulary with a count of 4.

We’re keeping track of the frequency of each word as it appears in either a spam or ham message. For example, we expect the word “free” to appear in both spam and ham messages, but we expect it to be more frequent in the “spam” vocabulary than the “ham” vocabulary.

Now that we’ve extracted all of the data we need from the training data, we can write another function to actually output the class label for new data. To do this classification, we apply Naive Bayes directly. Given a document, we iterate over each of its words, compute $\log p(w_i|\text{Spam})$, and sum them all up; we do the same for $\log p(w_i|\text{Ham})$. Then we add the log class priors and check which score is bigger for that document. Whichever is larger, that is the predicted label!

To compute $\log p(w_i|\text{Spam})$, the numerator is how many times we’ve seen $w_i$ in a “spam” message divided by the total count of all words in every “spam” message.

One additional note: remember that the log of 0 is undefined! What if we encounter a word that is in the “spam” vocabulary, but not the “ham” vocabulary? Then $p(w_i|\text{Ham})$ will be 0! One way around this is to use Laplace Smoothing. We simply add 1 to the numerator, but we also have to add the size of the vocabulary to the denominator to balance it.
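
Written out, with $V$ as our vocabulary and $\text{count}(w_i, \text{Spam})$ as the number of times $w_i$ appears across all spam messages, the smoothed estimate is:

\[ p(w_i|\text{Spam}) = \displaystyle\frac{\text{count}(w_i, \text{Spam}) + 1}{\displaystyle\sum_{w \in V}\text{count}(w, \text{Spam}) + |V|} \]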

def predict(self, X):
    result = []
    # total number of words seen in each class (the denominators of the likelihoods)
    spam_total_words = sum(self.word_counts['spam'].values())
    ham_total_words = sum(self.word_counts['ham'].values())
    for x in X:
        counts = self.get_word_counts(self.tokenize(x))
        spam_score = 0
        ham_score = 0
        for word, _ in counts.items():
            if word not in self.vocab: continue

            # add Laplace smoothing
            log_w_given_spam = math.log( (self.word_counts['spam'].get(word, 0.0) + 1) / (spam_total_words + len(self.vocab)) )
            log_w_given_ham = math.log( (self.word_counts['ham'].get(word, 0.0) + 1) / (ham_total_words + len(self.vocab)) )

            spam_score += log_w_given_spam
            ham_score += log_w_given_ham

        spam_score += self.log_class_priors['spam']
        ham_score += self.log_class_priors['ham']

        if spam_score > ham_score:
            result.append(1)
        else:
            result.append(0)
    return result

In our case, the input can be a list of document texts; we return a list of predictions. Finally, we can use the class like this.

if __name__ == '__main__':
    X, y = get_data(DATA_DIR)
    MNB = SpamDetector()
    MNB.fit(X[100:], y[100:])

    pred = MNB.predict(X[:100])
    true = y[:100]

    accuracy = sum(1 for i in range(len(pred)) if pred[i] == true[i]) / float(len(pred))
    print("{0:.4f}".format(accuracy))

We reserve the first 100 documents for the testing set, “train” our Naive Bayes classifier on the rest, then compute the accuracy on the held-out documents.

To recap, we reviewed Bayes Theorem and demonstrated how to use it with an example. Then we re-worked it using hypotheses and evidence instead of just events $A$ and $B$ to make it more specific to our task of spam detection. From there, we derived Naive Bayes by making the Naive Bayes Assumption that each word appears independently of all other words. Then we formulated a prediction equation/rule. Using the Enron dataset, we created a binary Naive Bayes classifier for detecting spam emails.

Naive Bayes is a simple text classification algorithm that uses basic probability laws and works quite well in practice! It is also an important skill set used in the professional industry, so having these skills down pat will have immense implications for your career as well!

An Introduction to Machine Learning https://gamedevacademy.org/what-is-machine-learning/ Fri, 20 Dec 2019 16:00:56 +0000

You can access the full course here: Machine Learning for Beginners with TensorFlow

Intro to Machine Learning

Now that we know what the course is all about, let’s learn a bit about the main topic: machine learning. What is machine learning? Machine learning is the study of statistics and algorithms aimed at performing a task without being explicitly programmed to. Theoretically, a machine is said to have learned if it produces “better” results over time without modifications to its programming. Practically, this means writing an algorithm, feeding it some data, and letting it interpret the data to find some pattern to solve a problem. 

Behind the scenes, machine learning models consist of layers of connected nodes (often called neurons). We will cover model structure in greater detail later, but it is important to know now that each node has one or more values assigned to it that, when combined with a function of our choice, produce some output. Through training, a model can change these values to produce more accurate outputs, rather than having us, as the coders, explicitly change the algorithm or values manually. In this way, the model is “learning” as it improves its results over time using the same algorithm. It should also be noted that some models do not undergo training and are only used to find patterns in data. We will explain the differences in the section on common machine learning models.

So what makes machine learning so special? Why not just hard code the algorithms ourselves? The main reason we use machine learning is to help find patterns in data that we wouldn’t otherwise be able to see. If we cannot find patterns in data, we cannot hard code algorithms to look for them as we wouldn’t know what to tell the algorithm to do. It is this pattern recognition that allows machine learning models to solve recognition, classification, and prediction problems such as speech recognition, image classification, and stock market prediction. Machine learning also helps to customize user experience and tailor solutions to the user based on their previous habits such as with responsive game AIs, health apps, and text suggestion.

Transcript

What is up, guys? And welcome to the first tutorial in our machine learning course! This’ll also be the very first topic and, as you can see, it is on an Intro to Machine Learning. This is a good way to get some conceptual background info about what machine learning is, kind of how it works, and also go over some practical examples before we start writing any actual code.

So, what topics will we be covering here? Well, we’re gonna divide ourselves into three subtopics. We’ll start with “What is Machine Learning?”, then “What can we do with Machine Learning?”, and we’ll finish up with “What types of Machine Learning there are out there”. Now, I’m going to devote a separate tutorial to each of these three topics just because I’m trying to keep things a bit shorter. There’s a lot to digest within each of these so, we don’t want anything running too, too long.

Okay, so, for starters, we’ll cover, What Is Machine Learning? I think we can safely do that in this video, too. Now, there is a technical definition here; Machine Learning is the study of statistics and algorithms that’s aimed at performing a task without being explicitly programmed to. That’s a lot to take in. It’s quite wordy and is very technical, uses a lot of jargon. So, let’s try to break that down a bit. I’ve got a, hopefully a bit more of an easy definition to understand here and that is that Machine Learning is finding patterns in data that help to solve a problem without us necessarily writing the algorithm to find the patterns and solve the problem. Now, realistically, these are both kind of describing the same thing. There are a couple of common themes in there.

So, the first is that we’re aimed at performing a task or solving some kind of a problem. Well, that’s kind of obvious. I mean, that’s what software is supposed to help us to do is perform some task or solve a problem easier than we would otherwise be able to do. But the second thing that’s really important is that it’s not programmed explicitly to necessarily solve that problem or to improve over time, okay? So, the learning aspect of it is actually not something that we, ourselves, necessarily program in. It kind of figures out what it should and should be doing, should and shouldn’t be doing, based on the data that it sees.

So, usually Machine Learning does involve performing a task better over time, otherwise there’s not really an aspect of learning involved. Performance in most of the models, not all of them however, is linked to this concept of training. Now, we’ll go into training in greater detail later but for now this is essentially feeding data into a model to increase a performance over time without modifying the algorithm itself.

Now, that’s a very important aspect of it because, otherwise, it’s not really Machine Learning. If we’re actually making changes to the algorithm to increase performance then, it’s not learning itself. That’s us manually changing things. There’s no Machine Learning there. If a machine produces the best results after training with the same algorithm, it’s said to have learned. So, if the model starts out at the very beginning with maybe, 20% accuracy, that means, it’s only getting 20% of the answers correct, and then, later on, that bumps up to, I don’t know maybe 80 or 90% and then we have a good model and that model is set to have learned, a lot.

The Key Takeaway here is that Machine Learning Models perform better over time without any changes made to the algorithm themselves. It’s really just putting in a bunch of data into the model and then having the model kind of perform better over time without any changes to the actual algorithm.

Okay! So, that’s all we’re actually gonna do here. Hopefully, just take a second to digest that. Hopefully it wasn’t too much and when we come back we’ll talk about some kinda things that we can do with Machine Learning, where we might see some practical applications of Machine Learning in real life, hopefully that will help to clear up any questions that you might have, okay? So, stay tuned for that. Thanks for watching! See you guys in the next one.

Interested in continuing? Check out the full Machine Learning for Beginners with TensorFlow course, which is part of our Machine Learning Mini-Degree.

A Comprehensive Guide to Neural Networks https://gamedevacademy.org/neural-networks-guide/ Fri, 18 Jan 2019 23:00:41 +0000

You can access the latest Machine Learning courses here: Machine Learning Mini-Degree

Transcript 1

Hello everybody. My name is Mohit Deshpande. And before we get into our main topic of neural networks, I first wanna talk a little bit about where they come from.

In this video, I just wanna very briefly just go over kind of the inspiration for neurons, and this topic of neural networks. Because they haven’t been, this is not a new topic, neural networks. They’ve been around for decades. And even some of the more advanced techniques, like convolutional neural networks, have still been around for decades. But with advancements in research for having bigger and deeper neural networks, then that’s kind of started to bring these back into focus. That, along with GPUs, being great for these computationally intensive tasks, and a bunch of other factors, have sort of brought them up to the forefront. And you read all about this, these advancements in AI. And everyone’s running to use AI now.

So I just wanna talk a little bit about where the inspiration for neural networks comes from, ’cause it’s actually fairly natural. In fact, it comes from our own brains. And so I just wanna talk a little bit about how biological neurons work in your brain and then we can actually, you can see that we can take that model, a biological model, and actually formalize it into a mathematical model. And then when we have that mathematical model, then we can start saying, well, okay, if I have one neuron, what happens when I have four of them? And what happens if I layer them, right? And then that kind of helps build up from just a single neuron over to a full, deep neural network.

And so that’s kind of where I’m gonna be starting at, just from a single neuron. We’ll start talking a little about that. Then we’ll maybe use two neurons, and have a little tiny neural network. And then we’ll see what happens when we expand on that. Anyway, biological neurons. I’m gonna try to draw, that came out kinda weird, but, suppose this is your brain. Right? Not a very good picture, but I guess it serves the purpose.

So suppose this is your brain here, right? And so what we know is that at this point in time, there are different regions of your brain that correspond to different tasks. For example, I happen to know that the occipital lobe is kinda towards the back of the brain, and the occipital lobe deals with things like vision-based tasks, dealing with visual cues and stuff like that. So a lot of the vision stuff, that you see, and how you can make sense of what you see, happens kind of towards the back of, towards the back of your brain. So let me take… And there are all sorts of different regions, and they can change, and there’s all this neuroscience behind this, but I’m not gonna talk about that. But let’s kind of zoom in here.

So suppose I just zoom into this, and then let’s kinda make a bigger picture. Okay, so suppose I zoom into just one particular portion. You can see that the brain is actually composed of these things called neurons. And they have kind of this nice structure in the sense that they’re just a cell in your body, like any other cell. And just like any other cell, they have a nucleus. So here is the, here is the nucleus of your neuron. And the nucleus of the cell does all sorts of different functions. It pretty much is like the little mini brain of the cell. It just has a lot of the information. And it has the purpose of the cell, and how it’s suppose to carry out tasks, and DNA, and all sorts of other stuff. But anyway, that’s not particularly what we want.

Particularly what we’re interested in is let’s first get the definition of a neuron, and then see if we can model it mathematically. So there are kinda two big portions of these neurons. They’re called the dendrites and the axon. And so let’s suppose that my dendrites are over on this side. And then my axon, is over on this side. So the dendrites are like little… Let me see if I can draw a few of them here. So they’re kind of like little connections here. And they actually go to other, they go to other neurons.

So these dendrites are actually incoming connections from other neurons. See if I can kinda draw a few here. I’m just gonna, stop right there. Okay, so these are the dendrites. Let me actually do this in a different color. These are the dendrites, basically, so dendrites. And they’re kinda like incoming connections from another neuron, because all these neurons are kinda interconnected. And so these dendrites are… If there was another neuron up here, it would be connected to this through the dendrites.

And so what this neuron does, is it takes the information that it gets from these dendrites, and then does some kind of task, and then produces an output. And so that output is called the axon. And so I can kinda continue this drawing here. And you produce a single output here, but the output may go into different neurons. So here, is the axon. And so the axon, is just kinda the single output of the biological neuron.

And so then this becomes, so then this goes to other neurons. So other neurons. So then this output goes out to other neurons. And this is from… This should be to other neurons, this is from, other neurons this way. And so each one of these biological neurons are connected, they’re connected to each other in complicated ways that we don’t fully understand yet. But these dendrites are connections from other neurons. And the axon produces a single output that goes to other neurons. And you can see that this is kinda like the singular building block of our brain.

So we can take these neurons, these biological neurons, and kinda stack them together. And so maybe I make a copy of this, put it over here. And then it becomes, these connections map up to these dendrites. And I can stack them in all sorts of, organize them in all sorts of different ways and, I get a brain, or I get a portion of the brain. And then I connect those portions of the brain and we’re using other neurons to other portions of the brain, and then you keep doing that until you get a fully functioning brain. So these were kind of like the building blocks.

There’s kinda one question that we have to address is that, when you take the information from the dendrites, what does it actually do with that information? And so that’s where we kinda get into what’s called the firing rate, I guess, of the neuron, or also called the activation of the neuron, is that it takes these inputs, it performs some operation on them and it only fires, fires, meaning that it produces a particular output, it produces an output based on these inputs. And it performs some operation based on each single input that it gets. And then it must look at all of the neurons that it gets, kind of as a whole and then it produces some output.

And that’s not, still, there’s lots of stuff going on. And so don’t think that neural networks are actually a representation of this, ’cause the brain is far more complicated than any of our existing models of neural networks. But it’s mathematically fairly reasonable. And especially when you discuss things like convolutional networks, you can actually see, you can see the super position of the activation layers for the convolutional networks and then the activation layers of the occipital lobe, and when you look at them, they actually look pretty similar. So that kinda means that we’re on the right track.

But anyway, this is really all we need. And this is kind of the singular building block of neural networks. And we’re gonna see that this is enough that we can model this mathematically into something that we call the perceptron. And you’re gonna see what happens when we can stack these perceptrons together, then we get what’s called a multilayer perceptron, or MLP. And so these are just kinda the singular building blocks of our brain.

And we’re gonna formalize them into a mathematical definition so that we can use them as the singular building block for our neural networks. Okay, so this part, I’m going to stop right here, and I’ll just do a quick recap. So in this video we discussed kind of the biological inspiration for neurons and neural networks. So our brain is kinda split up to different portions, and it’s composed of these building blocks that we call neurons. And they kind of have three big portions, they have three big parts.

There’s the dendrites, which are the inputs to this neuron, quote, unquote, “inputs.” And there’s the nucleus, which actually does some of the processing. It takes the inputs of the dendrites, does some kind of processing task, and then produces an output. And then an output travels along the axon. And the axon then goes as the input to other neurons. And you get this kind of cycle over and over again. So these are kinda the fundamental building blocks of our brain. And we’re gonna model these mathematically, so that we can use them as the fundamental building blocks for our neural networks.

Transcript 2

Hello everybody, my name is Mohit Deshpande, and in this video, I want to take our model of a biological neuron and then model it more formally using mathematics into what we call a perceptron.

If you remember this picture, we kind of had, I’ll go back here. If you remember, with this picture we had these biological neurons and like I said, there are kind of three big portions of this. There are dendrites, which were different inputs from other neurons, and we could have any number of inputs from other neurons, and then we had the nucleus, which does some kind of processing task and it has to consider all of the neurons. All the input neurons, I should say. And then, it produces some output. We can model that mathematically into something called a perceptron. I’m gonna draw a big circle here, and that’s gonna represent our cell body kind of thing. This perceptron, it kind of represents this. Just kind of two portions of this.

First, let’s go from left to right, naturally. So, if you remember with the biological neurons, we have inputs from other neurons through these dendrites. Suppose we have some input here, so suppose that this is connect with, let’s say, it’s connected with four other neurons. Let’s just kind of start small. I’m gonna call my input X. X1, X2, X3, and X4. These are the four other neurons that this one happens to be connected to. Then in here, there’s actually something inside there called a synapse, and it’s kind of like a transfer of neurotransmitters, but again, I’m not gonna talk much about that.

Basically what the synapse gives us is, well, here’s my synapse I’m gonna denote with a little circle. Synapse gives us little weights. We call them weights. Here’s W1, here’s weight two. Actually, let me do these in color. Here’s weight one, weight two, weight three and weight four. Now that I have these weights, we model this with a multiplicative interaction. What I mean by that is what actually goes into this neuron is actually X1, X2, X3, X4, each of these times that respective weight, so it’s gonna be X1 W1, X2 W2, X3 times W3, and then X4 times W4.

These are the actual inputs that go into our neuron or into our perceptron. This kind of models this side here with these dendrites. Now, we actually have to do something with the nucleus. Actually, just to make this point clear, what I’m gonna do is I’m just put these in a different color again. This will then be X1 W1, X2 W2, X3 W3, and then X4 W4. When I get these different inputs, kind of a natural thing to do would be to take the sum of all of these inputs.

What’s really input into this neuron is gonna be X1 W1 plus X2 W2 plus X3 W3 plus X4 W4. One extra little term that’s added is plus B, and this B is called the bias. This is just a different term that we add into our model and it turns out that it works really well. You remember, with all these weights, these weights are things that we’re actually learning. X is just the inputs. The weights are what we wanna learn.

Because each of these are multiplied by an input, we wanna have one other parameter that’s kind of independent of all of the inputs so to speak. And that’s the bias, because we don’t multiply it by any of the inputs. This kind of serves as a global term across all of these weight inputs. If this produces a certain value that’s kind of all off by five or something, I’ll just change my bias to five or minus five and then adjust this whole thing. It’s just kind of another additional adjustment parameter. Actually, I can represent this more succinctly using a summation notation. I can do sum I, X of I, W sub I, and then plus the bias. The bias isn’t actually in this sum, it’s just kind of extra, an extra thing. This is just a more concise way of representing this sum.

I didn’t wanna have to write this out all the time. I can just write this. That’s kind of one of the big portions of our network. When I take all these inputs, I take the sum of them, and so what I end up with is the sum over I of XI WI plus the bias, B. This is what happens when I take my inputs, and each input here is weighted by some weight, W. Then what we do is, inside, we take the sum of all of these. Notice when we take the sum, we basically get a value. We just get a value, or, depending on how you think about this, you can get a vector back depending on what these inputs are.

When you get these inputs, you take this thing called a weighted sum and then you add a bias. There’s one extra thing that we have to do, remember? I mentioned that there’s some other processing that we had to do and then we have to decide whether or not this perceptron or this neuron actually fires quote unquote.

We have to also decide what value we actually fire, because the output of this is gonna be the input into some other perceptron, if we had this layered, for example, which we will. So, the question is, how do we take this input and what do we do with it, basically? Usually what we do is, we kind of cut this a little bit, we cut this up and we usually have some function that I’m gonna call G. We have some function, G, and G actually produces the output.

Remember, the perceptron takes in, or any neuron, really, takes in a single, you know, actually, it outputs a single value. It takes in any number of inputs, but the output is always just kind of one thing. But then, these one things can go into any number of different perceptrons, and it’s really the same value that’s kind of the input. If I had more perceptrons that were here, this G would then go all the way over, the output of this perceptron would go into all of them. But G is just some function that we apply to this that decides what output we get. And G is interchangeable.

I’m gonna talk a little bit about some of the values. We take this sum, then we apply G to it. So, G is this. We apply G, the weighed sum, XI WI plus B, and then G outputs some value, and that’s the value what we give to any other perceptrons that happen to be one layer deeper than this. G is what we call an activation function.

So, G is an activation function. Activation function. It’s also called a non-linearity. A non-linearity, because G is usually a non-linear function. The reason that G is not a linear function is because we might wanna model things that are not linear and so we can’t model things that are not linear by taking the linear combination of a ton of linear things. You know, if you do that, then the output is linear, so you need to add some non-linearities so that when you apply them, you can actually produce something that’s non-linear.

I’m not gonna talk too much about that, but there are all sorts of different options for this. And actually, there’s still people inventing activation functions. There’s some recent activation functions that people have been inventing. There’s all sorts of different kinds.

There’s a sigmoid, which was one of the first ones but then we kind of moved away from it ’cause it had some issues. There’s tanh, which is hyperbolic tangent. Probably one of the better ones is ReLU, or a rectified linear unit. There is kind of a variant of that called a LeakyReLU, a LeakyReLU that tries to address some of the issues with the original ReLU. Probably one of the more simple ones is, there’s just, like, a step function, for example, and there’s so many more different kinds of activation functions. It’s really just kind of picking one, seeing if it works well. If it doesn’t, then maybe try something else.

I will say that ReLUs tend to work really well, so I would recommend that you look at that. Sigmoid is probably something to avoid, unless you’re using it for an output or you’re doing something very specific with that, but internally, for perceptrons, you kind of want to avoid sigmoid. ReLU or a LeakyReLU works well.
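
To make these concrete, here’s a minimal numpy sketch of the activation functions mentioned here:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                     # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)             # zero for negatives, identity otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope instead of a hard zero

def step(z):
    return np.where(z > 0, 1.0, 0.0)      # fires (1) only when z is positive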

There are others, of course, that work well, and people are inventing these all the time. Anyway, that is really our model of our perceptron. That’s just kind of a single perceptron. What we’re going to be doing, I’m gonna do a quick recap, but what we’re going to be doing is seeing what happens when we, we’re gonna use an example with these perceptrons. We’re gonna kind of go through an example of how we would use them, and we’ll see much later on what happens when you’re just dealing with one perceptron.

What happens when you stack them together? What happens when you have them in layers? How many layers should I have? How many perceptrons should I have in each layer? Well, we will answer all these questions. I’ll just do a quick recap. The perceptron is supposed to be one of the simplest models of a biological neuron, and so what it does is, basically, our inputs are X1, X2, X3, X4.

We can have any number of inputs, but each input is multiplied by a corresponding weight, W1, W2, and so on. Then, the actual input to this is then X1 W1, X2 W2, X3 W3, and so on. What we do with that is we take the sum of all those, so you get X1 W1 plus X2 W2 plus X3 W3 plus so on and so on and then we can add in this additional term called a bias, and the bias is added because it tends to work out better. And it’s good, because it’s not tied to a particular input, which is also nice. It’s gonna help out.

After we have this weighted sum, then we actually have to figure out what output we wanna produce, and to figure that out we use G, which is an activation function. We apply G to this weighted sum here, and we can produce N, we can produce an output. I guess I should use some shorthand here, in the sense that this is usually, like, Z. This is actually just gonna be G of Z that we apply and then we get an output, A, which is an activation.

That’s kind of a shorthand notion that I’ll be using kind of going forward. This is a perceptron model, and G is just an activation function or a non-linearity. It’s also called that. There’s lots of different examples you can pick for that. Probably the simplest is step function. We’re gonna be looking at that in the future. So, I’m gonna stop right here, but in case you don’t fully understand this, ’cause this is kind of all, this is a little abstract. We’ll actually take this idea of a perceptron, and let’s use this in a more concrete example.

Transcript 3

Hello everybody. My name is Mohit Deshpande. And in this video, I want to kind of go through an example of using this perceptron model.

And so, you know, this might seem kind of kind of complicated, because I’m using all this summation. And these activation functions. It might seem kind of complicated, but let’s actually do an example of this. Let’s start with a small example, and then, we can see how we can scale this up. So in particular, what I want to model with this perceptron is called an AND gate. And let’s just model an AND gate. Perceptron example. So, if you remember what AND does is AND let me actually make our grid here. So if you remember AND, what I mean by AND gate is just something that, you know is similar to in programming an AND. So if you remember, if I want to take the AND return to one, only if both of the inputs are also one. And so, I can kind of draw this graphically.

So if I suppose that this is X1, which is one of my inputs. And this is X2, which is my other input. Then, I should be able to fill this out. And so here is my zero. And so, what I want you guys to do is to take a second and see if you can fill out this graph. And so let me actually get you started. Here. So. Let’s suppose that green is zero. So I’ll just kind of model this as something like being. Well, actually, here. So green, we’ll kind of keep this simple. So this means. So this means zero. And this color circle filled in means one.

And so let me get you started. If X1 is zero, and X2 is zero, if I take zero and zero, that gets me zero. So this coordinate here at the origin is then going to be a circle. Just a non filled in circle. And so. There are two other things. There are three things you can fill out right here, so here’s. Here’s a point you have to fill out. Here’s a point you have to fill out. And here’s a point here that you have to fill out. What if both of these inputs are one?

And so why don’t you guys take a second, and no. For each one of these three other points, you figure out whether it should return zero or one. Assuming that X1 or X2 are the inputs. Why don’t you take a second to fill out this graph, and we’ll be right back with the answer. Okay, so. If one of these is one and the other is zero, then this should produce zero at this point here. If X2 is one, but X1 is zero, then this should produce a zero right here. But, if X1 and X2 are both one, then I can produce a one right here.

So I’m going to kind of fill this in. And so here is my Here is my graph right here. And so I want to build a perceptron that models this exact thing. And you know, if you can see that what I ultimately want is kind of a way to look at this line here so that everything here or actually at this point here is one, and all the points kind of here are zero. There’s only really one point here, because we’re considering only binary values.

But yeah, this kind of a line that I want to create. Let’s say we can build a perceptron model. And so let’s you know, do our little perceptron model. And then, you know remember we have our g function here, and then we actually. How many inputs do we have? Well we have two. X1 and X2. Actually, here. Here’s what I’ll do. I’ll kind of move this around so we can I’ll look at this a bit better. What I’ve done here. Here’s my g, and then I produced some output. Then I have two inputs. I have X1 and X2. Both of these go into now the drawing, but actually what I have, is I have this weight. Weight one. And weight two.

And then, by the way, I also have a bias. So what I want to do is find values for weight one, weight two, and the bias b, so that when both inputs are one, the output is also one.

But then, you know, the question you might be asking is, "Well, what's this activation function?" I can define it really quickly. So for my activation function g — let me use a different character for its input, so it's not as confusing.

Let's suppose g of a is equal to one if a is greater than zero, and equal to zero if a is less than or equal to zero. So that defines our function g. What we want to do is take the weighted sum — X1 times W1, plus X2 times W2, plus the bias — apply g to it, and that produces our output value. Now, we have to initialize these values to something, so I suggest we initialize everything to something simple, like one or zero.

Let's initialize everything to one. So let my bias be one, let weight one be one, and let weight two be one — let me use a different color so it's more obvious. Now let's try this with different inputs. Suppose that X1 equals zero and X2 equals zero. Then what's my output? Let's actually compute it. Both of the inputs are zero, so we expect the output to also be zero. Let's see what happens.

Substituting values here, this is going to be zero times one, plus zero times one, plus one. When I do this sum, I get zero plus zero plus one, which equals one. And remember, I have to take g of that. So what's g of one? One is greater than zero, so the output is one. So right off the bat, we're not at a good start: we produce a weighted sum of one, we take g of one, and we get one. But that's not the output that we want — we want the output to be zero here. Right?

So we know that our initial assignment of weights is not correct. In fact, it's too high — the value is too big. Intuitively, we know we should be decreasing some of these values. And to make this go along a little quicker, let's decrease the bias, because when we do this weighted sum, it's really the bias term that we can use to bring the whole weighted sum down. So let's set the bias equal to zero, for example.

Now let's try this again. With my bias at zero, I get zero times one, plus zero times one, plus zero, and that gets me a value of zero. So when I apply g of zero, I get — hey, I get zero! So this seems to work for this case. But that's not the only case we should be checking; we should check all the other cases too. So now let's try one and one. I do one times one, plus one times one, plus zero — and what does that get me? Well, that gets me a value of two.

And when I apply g of two — well, two is greater than zero, so I get one. Okay, so, almost done. Now let's try a case where one of the inputs is zero: zero times one, plus one times one, plus zero. This should produce zero, right? But we get a value of one, and wait a minute — g of one is equal to one. So this isn't quite right again. The value is too high, so we need to decrease our bias again.

For the sake of simplicity, it makes intuitive sense to decrease the bias and not the weights, since we're treating the bias as more of a global parameter. So anyway, a bias of zero isn't quite right; let's decrease it again. Let me change my bias, and I'm just going to jump right to it: let's make our bias minus one point five, and try recomputing. We'll redo these computations and get rid of all this old work here.

So let's try it with all of our inputs. Each case is now the weighted sum minus one point five. First: zero times one, plus zero times one, minus one point five is minus one point five. That's below zero, so when we apply g to it, we get zero. And that's exactly what we expect, because this value is less than zero.

Now let's try this: one times one is one, plus one times one is one — that's two. Two minus one point five is zero point five. When I apply g to zero point five, I get a value of one, and that's exactly what we expect. So now let's try the other two cases. This one is going to be equal to one, and one minus one point five is minus zero point five. That's below zero, so I output zero. And it's the same for this one: it's also minus zero point five, so I output a zero.

So it turns out that this configuration of parameters is the right answer: weight one equals one, weight two equals one, and the bias is minus one point five. This configuration works well for the AND gate — and remember, it works for the AND gate specifically. It won't work for other gates; we'd have to change the parameters if we want this working for other gates. But this example shows you how it all works. So I'll stop right here and give you some motivation for how we scale these up.

So, in this video, we built a working AND gate using a single perceptron. By setting these weights and this bias, we get a working AND gate, and the goal was to find out what values of weights and bias produce the correct output. We found, basically through trial and error, that weight one should be one, weight two should be one, and the bias should be minus one point five. So now we have our single perceptron.
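
Here's a minimal Python sketch of the perceptron we just built by hand (the function names are my own, just for illustration):

def g(a):
    # Step activation: output 1 if the weighted sum is positive, else 0.
    return 1 if a > 0 else 0

def and_gate(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    # Weighted sum of the inputs plus the bias, passed through g.
    return g(x1 * w1 + x2 * w2 + b)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', and_gate(x1, x2))
# Prints 0 for (0,0), (0,1), (1,0) and 1 for (1,1) -- the AND truth table.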

Another question is: well, what happens when we scale this up? Let's use more neurons, and let's have them oriented in layers — let's have more layers. So the question is what happens when we take this single perceptron and expand it out, adding more perceptrons, to construct our first neural network: the multilayer perceptron.

Transcript 4

Hello, everybody, my name is Mohit Deshpande, and in this video, I want to scale up our single perceptron into a multilayer perceptron, or neural network. So I'm gonna generalize this notion: what happens when we have multiple neurons, and what happens when we have more layers, for example.

So, let's see how we can model this neural network. We can model it like our perceptron, but with more than just a single perceptron. In effect, we have three different kinds of layers in a neural network, if we're generalizing. We have an input layer — and so here is my layer; this dot, dot, dot just means we can have any number of inputs, so we'll have X1, X2, up through Xn.

So I have this as my input layer here, and between my input and my output, I can have just an output layer — that's kinda what we had when we were building our AND gate. We had two neurons here as the input layer, just X1 and X2, and then the output layer was just a single neuron, Y. The inputs are actually represented as neurons as well; they just hold a single value.

And so here is my input layer, and then I can have an output layer here: Y1, Y2, up through Ym, let's say. Suppose there are N neurons in the input layer and M neurons in the output layer.

The way the neural network works is that, between any two consecutive layers, every neuron is connected to every neuron in the other layer. So, for example, X1 would be connected to Y1, X2 would be connected to Y1, all of these would be connected to Y1, and Xn would be connected to Y1. And then for the others, like Y2: X1 is connected to Y2, X2 is connected to Y2, and so on, all the way to Ym, where X1 is connected to Ym, X2 is connected to Ym, and so on. Let me actually label this as well — this is the output.

Every neuron in this input layer is connected to every neuron in the output layer, and that's what we call fully connected. This is why these are also called fully connected layers, very commonly abbreviated as fc layers: every neuron is connected to every other neuron between the two layers. And so this is how we have an input and an output.

So this is a very simple one-layer network; there are none of what we call hidden layers. We can extend it by adding hidden layers, which go in between the input and the output. The hidden layer, again, can have any number of neurons, so maybe this is H1, for hidden layer, down through H sub p. So there are N inputs, p neurons in the hidden layer, and M neurons in the output layer — but remember that layers are only fully connected between two consecutive layers.

What I mean by this is that X1 is now not directly connected to Y1. Instead, X1 is connected to H1, X2 is connected to H1, and so on; and again with H2, X1 is connected to H2, X2 is connected to H2, and so on, until we get to the very last hidden neuron, H sub p. All of these are connected here, and then H1 is connected to Y1, Y2, Y3, and so on, and the same for the other hidden neurons, so we get this nice pattern. So, in between two consecutive layers, all the neurons in those two layers are connected to each other — that's what we call fully connected — and this middle layer is called the hidden layer.

And actually, we can have more than one hidden layer. We can have two hidden layers, and then all the neurons in the first hidden layer are connected to all the neurons in the second hidden layer, which are connected to all the neurons in the output — we can stack these as deep as we want. Now let's consider a more concrete problem, with the dataset called the MNIST handwritten digits. This is kinda like the "Hello, World" of neural networks and deep learning. We have these handwritten digits as images, which we can flatten out into just a list of numbers, and the goal is, given a digit, to be able to tell whether it's a zero, one, two, three, four, five, six, seven, eight, or nine.

Using MNIST, we actually have 10 output classes, and those classes are zero, one, two, all the way up to nine — all 10 digits. The goal with the MNIST dataset is to pick which one of these digits is the correct one, given some input digit. And so the goal of building this neural network is to train it on lots of examples of MNIST digits, so that when it sees a new digit it's never seen before, it can correctly predict what kind of digit it is — whether it's a zero, four, seven, or nine.

So you might be asking: how would we actually do this computation? What happens is, we take our image and flatten it out into one giant list of numbers, called a vector, and that vector works as the input.

Again, we apply the same perceptron idea, where all these connections that I draw in here are weights — they're all different weights. There's also a bias here, and then there's another bias there. These weights and biases are all parameters that we want our neural network to learn; the actual learning happens in these weights and biases. Initially, they're just set to random values, so when we run examples through our neural network, we're gonna get some pretty bad outputs at first. But as it sees more and more examples and learns, the outputs are gonna get better and better, and we can see that visually.

We can visualize something like our accuracy, and it should be going up: we should be getting more accurate as we see more examples. So, how this computation actually works is this. We take our input and flatten it out into a vector, and then we apply that perceptron rule: we take this particular value here, multiply it by this weight, and it goes into the input of H1. Then we take the same X1, multiply it by this other weight, and it goes into H2, and so on. The hidden layer computes its own values and passes them to the output layer, and then the output layer computes its values. At the end of the output layer, we get probabilities, basically.

We get a probability for each class, which means that when I input an image, I get a list of 10 probabilities if I'm using MNIST, and each of those probabilities is the likelihood that my input falls in that class. Suppose I run a particular example through my neural network and, in the output distribution, the probability for "five" is 95%, for example. Then I know I should select five for this digit, because it has the highest probability.

It's very likely that this input is a five, and that's what the output layer decides. So basically, what we have going on here is called a forward pass: you take the input and feed it through, and you just keep passing it on from one layer to the next.

You start with the input, pass it on to the hidden layer, the hidden layer passes it on to the output layer, and the output layer finally computes the result. You keep passing the activations from one layer to the other, and eventually, at the end, you get a probability distribution — if you're using a softmax output — and you pick the most likely class. That's how that works.
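
As a rough sketch of that forward pass in Python — the layer sizes and the ReLU/softmax choices here are my own assumptions, just to make the idea concrete:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 784)), np.zeros(64)   # hidden layer: 784 inputs -> 64 neurons
W2, b2 = rng.normal(size=(10, 64)), np.zeros(10)    # output layer: 64 -> 10 classes

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def forward(x):
    h = np.maximum(0, W1 @ x + b1)   # hidden activations (ReLU stands in for g)
    return softmax(W2 @ h + b2)      # probability distribution over the 10 digits

x = rng.normal(size=784)             # stand-in for one flattened 28x28 image
probs = forward(x)
print(probs.argmax(), probs.round(2))  # most likely digit and all 10 probabilities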

So, I’m a little out of time, so I’m gonna stop right here and just do a quick recap.

So, with these neural networks, in this video we saw what happens when we scale up: instead of a single perceptron with two inputs, what happens when we have more inputs — N inputs, specifically — and what happens when we have more layers. This helps answer that question: we have an input layer, any number of hidden layers, and then an output layer.

And then we have this forward pass, also called feedforward, where you take the input and send it to the first hidden layer, which computes its activations using the same perceptron rule and passes them on to the next hidden layer; that layer computes its activations, and you keep passing them along until you get to the output layer. The output layer's job is to take the activations from the last hidden layer and build a probability distribution over all the possible classes.

So, in the case of MNIST, like I mentioned, with handwritten digits there are 10 digits, so it will produce a distribution over all 10 of them. It'll tell you something like: there's a 97% chance that the input you gave me is an eight. Then I just pick the one with the highest probability — the highest likelihood. That's what the output layer does, and eventually I get a single value back: I give it an image, and the output is "I am this percent confident that this is an eight," for example. And that's how this feedforward neural network works.

The real question is: where does the learning happen? What are we actually learning? Well, in all these connections between the layers, we have weights and biases — exactly like our single perceptron, except instead of two weights, we have however many we need to go from one layer to the next. We have a weight matrix and a bias vector, and those are the things we want to learn. So how do we actually learn them?

I'm gonna give you an intuitive look at how we do that with gradient descent very soon. Basically, the idea is that we want to minimize our error and increase our accuracy, so we make little changes to these weights and biases — little changes that decrease the error and increase the accuracy.
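
As a tiny, hedged preview of that idea — a made-up one-weight "network", not the full backpropagation algorithm:

# Fit a single weight w so that w * x matches a target y.
x, y = 2.0, 6.0        # one training example: we want w * 2 == 6, i.e. w == 3
w, lr = 0.0, 0.1       # initial weight and learning rate (step size)

for step in range(20):
    error = w * x - y          # how far off we currently are
    grad = 2 * error * x       # derivative of the squared error with respect to w
    w -= lr * grad             # a little change to w that decreases the error
print(round(w, 3))             # approaches 3.0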

Interested in continuing? Try our latest courses on machine learning in our Machine Learning Mini-Degree.

]]>
The Complete Programming and Full-Stack Bundle – 20 Course Smart Curriculum https://gamedevacademy.org/the-complete-programming-and-full-stack-bundle-20-course-smart-curriculum/ Sat, 24 Nov 2018 00:30:23 +0000 https://pythonmachinelearning.pro/?p=2220 Read more]]> Go from beginner to full-stack developer!

The Complete Programming and Full-Stack Bundle is the world’s most effective way to go from beginner to professional coder. Whether your goal is to advance your career, start your own business or expand your existing skill-set, our 20-course Smart Curriculum has something in store for you.

This bundle is suitable both for absolute beginners and more advanced developers. Projects cover a wide range of different topics including:

  • Crafting interactive websites and web applications with Bootstrap, Angular, React, Node and Express
  • Coding in Python and building smart Machine Learning and AI applications
  • Building games with Unity, Phaser and JavaScript
  • Creating stunning Virtual Reality games and applications
  • Game artwork creation with Blender and Gimp

Access The Complete Programming and Full-Stack Bundle on Zenva Academy

]]>
An Introduction to AI https://gamedevacademy.org/what-is-artificial-intelligence/ Tue, 10 Apr 2018 07:57:37 +0000 https://pythonmachinelearning.pro/?p=2192 Read more]]>

You can access the full course here: The Complete Artificial Neural Networks Developer Course

Why do we even have artificial intelligence? Computers are really dumb machines! When we write code and programs, we give the computer a very explicit set of instructions that it isn't allowed to deviate from. Inside of our program, we must handle every single point of error or user input. Humans, on the other hand, have the capability to learn and generalize. For example, if you were handed a pen from a brand new pen company, you would immediately recognize what it was and how to use it. In other words, you didn't need to be taught how to use that particular pen; you generalized your knowledge of pens.

Good old-fashioned AI (GOFAI) began in the 1960s at the Massachusetts Institute of Technology (MIT) under the pioneer of artificial intelligence, Marvin Minsky. All of these AI algorithms were based on search, not actual learning. During this era, your goal was to rephrase your AI problem as a search problem, then use standard search algorithms, such as breadth-first/depth-first search, uniform cost search, A*, etc., to solve your problem and arrive at an answer. For game AI, adversarial search algorithms, such as minimax, were used. Knowledge-based systems, such as MYCIN, used a series of questions and logic rules to solve the problem. In all of these cases, there was no learning: they either looked through the entire search space or used pre-defined rules.

The modern approach to AI is focused on learning instead of search. Our AI algorithms are data-driven: we provide them with many examples to show them how to perform our task or solve our problem. What we’re learning are the parameters of our AI algorithm. We can use these parameters to generalize our algorithm to new data that it hasn’t seen before.

One example of machine learning is image captioning. The goal of this task is to generate a novel caption (i.e., not retrieved/fetched) given an image. In this particular case, we're merging image and text modalities into a single AI algorithm.

We can even perform the opposite task and generate an image given a text description!

In another example, we can use AI to “re-paint” provided images in the style of a particular artist. This is called Neural Style. For example, we provide any image to this algorithm and then a reference image of an artist, such as Van Gogh’s Starry Night, and the algorithm will modify our image to look like it was painted in the style of Starry Night!

In another example of merging images and text, we can train a model to perform visual question answering. In other words, we can provide an image and submit text queries to it, and the model will attempt to answer the query. For example, if we gave an image of giraffes, we can ask the model “how many giraffes are in the image?” and it would attempt to answer by returning a number.

There are many more examples of the kinds of amazing tasks that AI and neural networks can accomplish, and we’ll be learning about the fundamentals that allow us to train neural networks so they can perform these tasks.

 

Transcript

In this video, I just want to introduce the concepts of artificial intelligence and machine learning, to help give some background information. So, first of all, what is artificial intelligence, and why do we actually need it?

To think about this, you have to accept the fact that computers are actually really dumb machines. I mean, if you think about it, whenever you write a program, you're giving the computer an explicit set of instructions that it must follow until the end of the program, and these instructions are very, very explicit — you say "let this variable equal four" or something like that. So the instructions you're giving it are really explicit; they can't be vague in any sense, they have to be very clear.

And another problem with this is that you have to account for every possible scenario in your code — think of this as tons of switch cases and tons of else blocks. So when the user gives some kind of input, you have to do lots of checking to make sure it's within the parameters of the program, and so on. You really have to account for all these cases when you're dealing with these programs.

And this behavior isn't really human-like. Humans can deal with vagueness; they have so much more computing power mentally, and they're able to handle a lot of these cases. They don't have to think about every possible case in order to do a task, and they don't have to be given incredibly explicit instructions — sure, it helps, but humans can handle vagueness actually quite well. So that's the goal we're trying to get to with artificial intelligence: can we build something that can think and reason similarly to humans?

So to begin our story, we first have to talk about good old-fashioned AI. This actually started during the 1960s, led by Marvin Minsky at MIT, where he started MIT's Artificial Intelligence group. And the thing with good old-fashioned AI is that it's not really learning of any kind. The whole point of good old-fashioned AI is that it was all about search — and by search I mean things like graph search. I'll share an example of that in the next slide.

With these different kinds of search problems, the question in good old-fashioned AI was: could you formulate your artificial intelligence problem as a search problem? If you could, then you could use good old-fashioned AI to solve it. There's really no learning going on, because in these systems you're just enumerating all possible states and then searching — and searching isn't really learning. We'll see this in the next slide. So there were different kinds of graph search, and then there was also adversarial search.

Adversarial search was particularly useful for playing games like PAC-MAN. If you remember PAC-MAN, there's a little PAC-MAN character who has to eat all the pellets while avoiding the ghosts. This is an example of an adversarial search problem, because your goal is not only to eat all the pellets but also to avoid the ghosts. And then there were knowledge-based systems, where the computer asks you a ton of questions that you answer, and it finds the answer from its rules. But the whole point is that there's really no learning going on — even though this is still technically under the branch of artificial intelligence.

Anyway, let's look at one of these search problems. Here is a little crudely drawn map. Suppose you're in Seattle and you want to get to New York City, and you're booking flights. The edges between these different cities represent some kind of cost — maybe it's airfare, or, if you want to drive there for some reason, distances — or some other kind of cost metric.

This is an example of a search problem where you want to go from one city to another — actually a pretty classic search problem — and you want to minimize the total path cost. You might think, oh, I can just go from Seattle to Washington D.C. to New York City; that's only two edges. But that might not be the most optimal path. Instead, maybe we can go from Seattle to Chicago to New York City, or Seattle to Chicago to Columbus to New York City, and it ends up costing us less overall than going through Washington D.C.

So this is an example of a kind of problem that we can frame as a search problem, and then we can use old-fashioned AI to solve it using different search algorithms: depth-first search, breadth-first search, A*, and so on.
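
To make that concrete, here's a small sketch of uniform cost search over a map like this one; the cities match the example, but the edge costs are made up:

import heapq

graph = {
    'Seattle': {'Chicago': 4, 'Washington DC': 9},
    'Chicago': {'Columbus': 2, 'New York City': 5},
    'Columbus': {'New York City': 2},
    'Washington DC': {'New York City': 3},
    'New York City': {},
}

def uniform_cost_search(start, goal):
    frontier = [(0, start, [start])]          # (cost so far, city, path taken)
    visited = set()
    while frontier:
        cost, city, path = heapq.heappop(frontier)  # cheapest partial path first
        if city == goal:
            return cost, path
        if city in visited:
            continue
        visited.add(city)
        for nxt, edge_cost in graph[city].items():
            heapq.heappush(frontier, (cost + edge_cost, nxt, path + [nxt]))
    return None

print(uniform_cost_search('Seattle', 'New York City'))
# With these made-up costs: (8, ['Seattle', 'Chicago', 'Columbus', 'New York City'])

Note that there's no learning here at all: the algorithm just explores the space of possible routes, which is exactly the GOFAI framing described above.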

So anyway, what we discussed was good old-fashioned AI, and as I mentioned, it's more focused on search than on actual learning. Now machine learning comes into the picture, and this is where we're actually taking in examples and learning from those examples. Machine learning is technically a subfield of AI — probably the largest subfield — and some people have gone on to call it "true AI" or "pure AI". This is the modern approach that people have been using for at least the past few decades: you take examples, you learn from those examples, and that way you can generalize better.

I'm going to give you lots of examples of this. And the last point I mentioned is that there are parameters that you learn; you use these parameters so that when you get new data, you can do whatever task you want — whether that's classification or regression, and we'll talk about those later. But let's actually look at some of the cool tasks that you can do. I've selected a couple of research papers that show the different cool things you can do with AI.

Here's an example from the Vision Group at Stanford, posted a while back. What we can do is train a network to generate a caption given an image — and note that these captions aren't just being retrieved; new captions are actually being generated. This is a really cool example of taking two different modalities of data, images and text, and blending them together to get something where you can automatically caption images.

Here's another popular paper, colloquially known as Neural Style. This is where you can take a painting by a famous artist — like this one, Van Gogh's Starry Night — and a picture that you've taken, and using this Neural Style approach, it will actually change your image so that it looks as if it was painted by that particular artist. You can see here's the picture, and then, based on these different reference paintings, we can repaint it. This is another really cool example of artificial intelligence.

Here's also a really neat example, published pretty recently, called StackGAN — GAN stands for Generative Adversarial Network. This is doing the inverse of what we saw just a couple slides ago: instead of generating text from an image, this is generating an image from text. You give StackGAN a text description, and it will actually generate a picture. You can see here some of the text descriptions that people have given and some of the images that have been generated — this last row shows StackGAN's results, and they actually look pretty realistic. So again, this is one of the cool things you can do with artificial intelligence.

Another task is called visual question answering, where you have an image, you can ask a question about it, and it will produce a text result. They actually have a web demo where you can go and upload your own images. For example, I asked "how many giraffes are in this image?" and it produced three. I can also ask, you know, "what is the cat sleeping on?" and it produces the right text, and so on.

This is another example by Google, from their Google Neural Machine Translation group. This is really cool because you can translate between two languages — say, Japanese to Korean — without having explicit training examples between them; instead, the system goes through a kind of intermediary representation. It's more complicated than this; I'm just trying to give a top-level overview.

Okay, and this is probably fairly famous — it was on the news — this is AlphaGo. This is where you train a reinforcement learning agent to learn to play Go, and it turns out it actually plays really well, as you've probably seen in the news.

Okay, so here are the technologies that were used in the examples I discussed. Neural networks are obviously a big portion, and there were tons of networks used: all the image-based tasks used Convolutional Neural Networks, and all the text-based tasks used Recurrent Neural Networks. And I made a small reference to Generative Adversarial Networks.

There are a lot of technologies going on beyond that, but all of them center around this concept of neural networks. So if you have a really good understanding of neural networks, you can gain a cursory understanding of the other kinds of networks fairly easily, because they all rely on the same core principles. That's where I'm gonna stop this video. Just to recap: we discussed a little bit about artificial intelligence — the old-fashioned way of doing it, using just search, as well as the new way, using learning — and we're gonna get much more in-depth into neural networks.

Interested in continuing? Check out the full The Complete Artificial Neural Networks Developer Course, which is part of our Machine Learning Mini-Degree.

]]>
Free Ebook – Machine Learning For Human Beings https://gamedevacademy.org/free-ebook-machine-learning-for-human-beings/ Thu, 04 Jan 2018 04:03:14 +0000 https://pythonmachinelearning.pro/?p=2125 Read more]]>

We are excited to announce the launch of our free ebook Machine Learning for Human Beings, authored by Mohit Deshpande, a researcher in the field of computer vision and machine learning, in collaboration with Pablo Farias Navarro, founder of Zenva.

In over 100 pages you will learn the basics of Machine Learning – text classification, clustering and even face recognition and learn to implement these algorithms using Python! This ebook covers both theoretical and practical aspects of Machine Learning, so that you have a strong foundation and understand what happens under the hood. Some of the topics covered in the book are:

  • Overview of Machine Learning – Supervised vs Unsupervised Learning
  • Text Classification with Naive Bayes
  • Data Clustering with K-Means
  • Clustering with Gaussian Mixture Models
  • Face Recognition with Eigenfaces
  • Dimensionality Reduction
  • Classification with Support Vector Machines
  • Reinforcement Learning using the OpenAI library

This book is provided at no cost in PDF format.

Download the ebook here

]]>
Dimensionality Reduction https://gamedevacademy.org/dimensionality-reduction/ Sat, 04 Nov 2017 01:41:13 +0000 https://pythonmachinelearning.pro/?p=1987 Read more]]> Dimensionality Reduction is a powerful technique that is widely used in data analytics and data science to help visualize data, select good features, and to train models efficiently. We use dimensionality reduction to take higher-dimensional data and represent it in a lower dimension. We’ll discuss some of the most popular types of dimensionality reduction, such as principal components analysis, linear discriminant analysis, and t-distributed stochastic neighbor embedding. We’ll use these techniques to project the MNIST handwritten digits dataset of images into 2D and compare the resulting visualizations.

Download the full code here.

Handwritten Digits: The MNIST Dataset

Before discussing the motivation behind dimensionality reduction, let’s take a look at the MNIST dataset. We’ll be using it as a running example.

The MNIST handwritten digits dataset consists of binary images of a single handwritten digit (0-9) of size 28×28. The provided training set has 60,000 images, and the testing set has 10,000 images.

We can think of each digit as a point in a higher-dimensional space. If we take an image from this dataset and rasterize it into a 784×1 vector, then it becomes a point in 784-dimensional space. That's impossible to visualize in that higher space!
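
In code, that rasterization is just a reshape; here's a quick numpy sketch with a random stand-in image:

import numpy as np

image = np.random.rand(28, 28)   # stand-in for one 28x28 MNIST digit
vector = image.reshape(784)      # now a single point in 784-dimensional space
print(vector.shape)              # (784,)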

These kinds of higher dimensions are quite common in data science. Each dimension represents a feature. For example, suppose we wanted to build a naive dog breed classifier. Our features may be something like height, weight, length, fur color, and so on. Each one of these becomes a dimension in the vector that represents a single dog. That vector is then a point in a higher-dimensional space, just like our MNIST dataset!

Dimensionality Reduction

Dimensionality reduction is a type of learning where we want to take higher-dimensional data, like images, and represent them in a lower-dimensional space. Let’s use the following data as an example.

(These plots show the same data, except the bottom chart zero-centers it.)

With these data, we can use a dimensionality reduction to reduce them from a 2D plane to a 1D line. If we had 3D data, we could reduce them down to a 2D plane, and then to a 1D line.

Most dimensionality reduction techniques aim to find some hyperplane, which is just a higher-dimensional version of a line, to project the points onto. We can imagine a projection as taking a flashlight perpendicular to the hyperplane we’re projecting onto and plotting where the shadows fall on that hyperplane. For example, in our above data, if we wanted to project our points onto the x-axis, then we pretend each point is a ball and our flashlight would point directly down or up (perpendicular to the x-axis) and the shadows of the points would fall on the x-axis. This is what we call a projection. We won’t worry about the exact math behind this since scikit-learn can apply this projection for us.

In our simple 2D case, we want to find a line to project our points onto. After we project the points, then we have data in 1D instead of 2D! Similarly, if we had 3D data, we would want to find a plane to project the points down onto to reduce the dimensionality of our data from 3D to 2D. The different types of dimensionality reduction are all about figuring out which of these hyperplanes to select: there are an infinite number of them!
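
Even though scikit-learn applies the projection for us, here's a small numpy sketch of the idea, using a made-up diagonal line through the origin:

import numpy as np

points = np.array([[2.0, 1.9], [1.0, 1.1], [3.0, 2.8]])  # made-up 2D data
direction = np.array([1.0, 1.0]) / np.sqrt(2)            # unit vector along the line

coords_1d = points @ direction               # each point's "shadow" position along the line
shadows_2d = np.outer(coords_1d, direction)  # the same shadows, as 2D coordinates
print(coords_1d)                             # our data, reduced from 2D to 1D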

Principal Component Analysis

One technique of dimensionality reduction is called principal component analysis (PCA). The idea behind PCA is that we want to select the hyperplane such that, when all the points are projected onto it, they are maximally spread out. In other words, we want the axis of maximal variance! Let’s consider our example plot above. A potential axis is the x-axis or y-axis, but, in both cases, that’s not the best axis. However, if we pick a line that has the same diagonal orientation as our data, that is the axis where the data would be most spread!

The longer blue axis is the correct axis! (The shorter blue axis is for visualization only and is perpendicular to the longer one.) If we were to project our points onto this axis, they would be maximally spread! But how do we figure out this axis? We can use a linear algebra concept called eigenvectors! Essentially, we compute the covariance matrix of our data and consider that covariance matrix’s largest eigenvectors. Those are our principal axes and the axes that we project our data onto to reduce dimensions.

Using this approach, we can take high-dimensional data and reduce it down to a lower dimension by selecting the largest eigenvectors of the covariance matrix and projecting onto those eigenvectors.
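
Here's a from-scratch numpy sketch of that recipe on made-up data (scikit-learn's PCA does this, plus a lot more, for us):

import numpy as np

X = np.random.rand(100, 5)                 # made-up data: 100 samples, 5 features
Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices like covariances
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]  # the two largest eigenvectors
X_reduced = Xc @ top2                      # project onto the principal axes
print(X_reduced.shape)                     # (100, 2)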

Linear Discriminant Analysis

Another type of dimensionality reduction technique is called linear discriminant analysis (LDA). Similar to PCA, we want to find the best hyperplane and project our data onto it. However, there is one big distinction: LDA is supervised! With PCA, we were using eigenvectors from our data to figure out the axis of maximum variance. However, with LDA, we want the axis of maximum class separation! In other words, we want the axis that separates the classes with the maximum margin of separation.

The following figure shows the difference between PCA and LDA.

(Source: https://algorithmsdatascience.quora.com/PCA-4-LDA-Linear-Discriminant-Analysis)

With LDA, we choose the axis so that Class 1 and Class 2 are maximally separated, i.e., the distance between their means is maximal. We must have class labels for LDA because we need to compute the mean of each class to figure out the optimal plane.

It’s important to note that LDA does make some assumptions about our data. In particular, it assumes that the data for our classes are normally distributed (Gaussian distribution). We can still use LDA on data that isn’t normally distributed, but we may not find an optimal hyperplane. Another assumption is that the covariances of each class are the same. In reality, this also might not be the case, but LDA will still work fairly well. We should keep these assumptions in mind when using LDA on any set of data.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

A more recent dimensionality reduction technique that's been widely adopted is t-Distributed Stochastic Neighbor Embedding (t-SNE), introduced by Laurens van der Maaten and Geoffrey Hinton (2008). t-SNE fundamentally differs from PCA and LDA because it is probabilistic! Both PCA and LDA are deterministic, but t-SNE is stochastic.

At a high level, t-SNE aims to minimize the divergence between two distributions: the pairwise similarity of the points in the higher-dimensional space and the pairwise similarity of the points in the lower-dimensional space.

To measure similarity in the higher-dimensional space, we use a Gaussian centered on each point; in the lower-dimensional space, t-SNE instead uses the Student's t-distribution (with one degree of freedom, also called the Cauchy distribution), which looks similar to a Gaussian but has heavier tails. To compute the similarity between two points x_i and x_j in the higher-dimensional space, we use conditional probabilities:

    \[ p_{j|i} = \displaystyle\frac{\exp(-||x_i-x_j||^2/2\sigma_i^2)}{\sum_{k\neq i} \exp(-||x_i-x_k||^2/2\sigma_i^2)} \]

where the sum runs over all other points k\neq i. In other words, this equation is telling us the likelihood that x_i would choose x_j as its neighbor. Notice the Gaussian is centered around x_i: intuitively, the farther x_j is from x_i, the smaller the probability becomes.

Similarly, we can compute the same kind of quantity for the points in the lower-dimensional space — but here t-SNE uses the heavy-tailed Student's t-distribution instead of a Gaussian:

    \[ q_{j|i} = \displaystyle\frac{(1+||y_i-y_j||^2)^{-1}}{\sum_{k\neq i} (1+||y_i-y_k||^2)^{-1}} \]

Now how do we measure the divergence between two distributions? We simply use the Kullback-Leibler divergence (KLD).

    \[ C = \sum_i KL(P_i || Q_i) = \sum_i\sum_j p_{j|i} \log \displaystyle\frac{p_{j|i}}{q_{j|i}} \]

This is our cost function! Now we can use a technique like gradient descent to train our model.

There’s just one last thing to figure out: \sigma_i. We can’t use the same \sigma for all points! Denser regions should have a smaller \sigma, and sparser regions should have a larger \sigma. We solidify this intuition into a mathematical term called perplexity. Think of it as a measure of the effective number of neighbors, similar to the k of k-nearest neighbors.

    \begin{align*} Perplexity(P_i) &= 2^{H(P_i)}\\ H(P_i) &= -\sum_j p_{j|i} \log_2 p_{j|i} \end{align*}

t-SNE performs a binary search for the \sigma_i that produces a distribution P_i with the perplexity specified by the user: perplexity is a hyperparameter. Values between 5 and 50 tend to work the best.
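
A small sketch of that idea: given one point's squared distances to the other points and a candidate \sigma, we can compute the perplexity that t-SNE's binary search compares against the user's setting (the numbers here are made up):

import numpy as np

sq_dists = np.array([0.5, 1.0, 2.0, 4.0])   # ||x_i - x_k||^2 for the other points
sigma = 1.0                                  # one candidate value in the search

p = np.exp(-sq_dists / (2 * sigma ** 2))
p /= p.sum()                                 # the conditional distribution P_i
entropy = -np.sum(p * np.log2(p))            # H(P_i), in bits
print(2 ** entropy)                          # Perplexity(P_i) = 2^H(P_i)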

In practice, t-SNE is very resource-intensive so we usually use another dimensionality reduction technique, like PCA, to reduce the input space into a smaller dimensionality (like maybe 50 dimensions), and then use t-SNE.

t-SNE, as we'll see, tends to produce the best visualizations of the three techniques we've discussed, because its KLD cost function directly encourages points that are similar in the high-dimensional space to stay close together in the low-dimensional space.

Dimensionality Reduction Visualizations

Now that we’ve discussed a few popular dimensionality reduction techniques, let’s apply them to our MNIST dataset and project our digits onto a 2D plane.

First, we need to import numpy, matplotlib, and scikit-learn and get the data. Scikit-learn ships with a small handwritten digits dataset (8×8 images, a scaled-down stand-in for the full 28×28 MNIST images), so we don't have to deal with downloading and uncompressing anything ourselves! Additionally, I've provided a function that will produce a nice visualization of our data.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import manifold, datasets, decomposition, discriminant_analysis

digits = datasets.load_digits()
X = digits.data
y = digits.target
n_samples, n_features = X.shape

def embedding_plot(X, title):
    # Rescale the 2D embedding to the unit square so the plots are comparable.
    x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure()
    ax = plt.subplot(aspect='equal')
    # Color each point by its digit label (y is defined above).
    sc = ax.scatter(X[:, 0], X[:, 1], lw=0, s=40, c=y / 10.)

    # Overlay digit thumbnails, skipping any point that lands too close
    # to a thumbnail we've already drawn.
    shown_images = np.array([[1., 1.]])
    for i in range(X.shape[0]):
        if np.min(np.sum((X[i] - shown_images) ** 2, axis=1)) < 1e-2:
            continue
        shown_images = np.r_[shown_images, [X[i]]]
        ax.add_artist(offsetbox.AnnotationBbox(
            offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r), X[i]))

    plt.xticks([]), plt.yticks([])
    plt.title(title)

Using any of the dimensionality reduction techniques that we’ve discussed in scikit-learn is trivial! We can get PCA working in just a few lines of code!

X_pca = decomposition.PCA(n_components=2).fit_transform(X)
embedding_plot(X_pca, "PCA")
plt.show()

Below is the resulting plot.

After taking a closer look at this plot, we notice something spectacular: similar digits are grouped together! If we think about it, this result makes sense. If the digits looked similar, when we rasterized them into a vector, the points must have been relatively close to each other. So when we project them down to the lower-dimensional space, we also expect them to be somewhat close together as well. However, PCA doesn’t know anything about the class labels: it’s going off of just the axis of maximal variance. Maybe LDA will work better since we can separate the classes in the lower-dimensional space better.

Now let’s use LDA to visualize the same data. Just like PCA, using LDA in scikit-learn is very easy! Notice that we have to also give the class labels since LDA is supervised!

X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
embedding_plot(X_lda, "LDA")

plt.show()

Below is the resulting plot from LDA.

This looks slightly better! We notice that the clusters are a bit farther apart. Consider the 0’s: they’re almost entirely separated from the rest! We also have clusters of 4’s at the top left, 2’s and 3’s at the right, and 6’s in the center. This is doing a better job at separating the digits in the lower-dimensional space. We can attribute this improvement to having information about the classes: remember that LDA is supervised! By knowing the correct labels, we can choose hyperplanes that better separate the classes. Let’s see how t-SNE compares to LDA.

(We may get a warning about collinear points. The reason for the warning is because we have to take a matrix inverse. Don’t worry too much about it since we’re just using this for visualization.)

Finally, let's use t-SNE to visualize the MNIST data. We're initializing the embedding with PCA (in accordance with van der Maaten's recommendations). Unlike LDA, t-SNE is completely unsupervised.

X_tsne = manifold.TSNE(n_components=2, init='pca').fit_transform(X)
embedding_plot(X_tsne, "t-SNE")

plt.show()

Below is the plot from t-SNE.

This looks the best out of all three! There are distinct clusters of points! Notice that almost all of the digits are separated into their own clusters. This is a direct result of minimizing the divergence of the two distributions: points that are close to each other in the high-dimensional space will be close together in the lower-dimensional space!

To summarize, we discussed the problem of dimensionality reduction, which is to reduce high-dimensional data into a lower dimensionality. In particular, we discussed Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA), and t-Distributed Stochastic Neighbor Embedding (t-SNE). PCA tries to find the hyperplane of maximum variance: when the points are projected onto the optimal hyperplane, they have maximum variance. LDA finds the axis of largest class separation: when the points are projected onto the optimal hyperplane, the means of the classes are maximally apart. Remember that LDA is supervised and requires class labels! Finally, we discussed an advanced technique called t-SNE, which actually performs optimization across two distributions to produce a lower-dimensional embedding so that the points in the lower dimensionality are pairwise representative of how they appear in the higher dimensionality.

Dimensionality reduction is an invaluable tool in the toolbox of a data scientist and has applications in feature selection, face recognition, and data visualization.

]]>