How to do Cluster Analysis with Python – Data Science
https://gamedevacademy.org/cluster-analysis-python-tutorial/

You can access the full course here: Data Insights with Cluster Analysis

Want more Python topics for a school environment? Check out the Zenva Schools platform, which offers online courses, classroom management tools, reporting, pre-made course plans, and more for teachers.

Part 1

In this video we are going to discuss Cluster Analysis. We will discuss the following topics:

  1. Intro to Cluster Analysis – what it is, what its different applications are, and the kinds of algorithms we can expect.
  2. K-means clustering
  3. Density-based Spatial Clustering of Applications with Noise (DBSCAN)
  4. Hierarchical Agglomerative Clustering (HAC)

k-means, DBSCAN, and HAC are three very popular clustering algorithms that all take very different approaches to creating clusters.

Before diving in, you can also explore why you might want to learn these topics or just what career opportunities these skills will present you with!

Cluster Analysis

Imagine we have some data. In cluster analysis, we want to separate the data into different groups in an unsupervised manner – with no a priori information.

Flower petal length and width graph

Looking at a plot of the above data, we can say that it fits into 2 different groups – a cluster of points in the bottom left and a larger, elongated cluster on the top right. When we give this data to a clustering algorithm, it will create a split. Algorithms like k-means need to be told how many clusters we want. In some cases, we don’t need to specify the number of clusters. DBSCAN for instance is smart enough to figure out how many clusters there are in the data.

The data above is from the Iris data set. It was created by the famous statistician R.A. Fisher, who collected measurements from 3 different species of flowers and plotted their measured properties such as petal width, petal length, sepal width, and sepal length. Since we are doing clustering, we have removed the class labels from the data set, as that is what the clustering algorithm is trying to recover: which data points belong together.

Clustering

Clustering means grouping data into clusters so that the data in each cluster has similar attributes or properties. For example, the data in the small cluster in the above plot has a small petal length and a small petal width.

There are several applications of clustering analysis used across a variety of fields:

  • Market analysis and segmentation
  • Medical imaging – X-rays, MRIs, fMRIs
  • Recommender systems – such as those used on Amazon.com
  • Geospatial data – latitude and longitude coordinates, etc.
  • Anomaly detection

People have used clustering algorithms to detect brain anomalies. We see below various brain images. C1 – C7 are various clusters. This is an example of clustering in the medical domain.

Brain images while brain is clustering info

Another example is a spatial analysis of user-generated ratings of venues on yelp.com

Spatial analysis with restaurant data clustered

Cluster Analysis

A set of points X1…Xn is taken in and analyzed, and out comes a set of mappings from each point to a cluster (X1 -> C1, X2 -> C2, etc.).

Visual representation of points going through cluster analysis

There are several such algorithms that will detect clusters. Some of these algorithms have additional parameters, e.g., the number of clusters. These parameters vary for each algorithm.

The input however is a set of data points X1…Xn in any dimensionality i.e. 2D, 3D, 100D etc. For our purposes, we will stick with 2D points. It is hard to visualize data of higher dimensions though there are dimensionality reduction techniques that reduce say 100 dimensions to 2 so that they can be plotted.

The output is a cluster assignment where each point either belongs to a cluster or could be an outlier (noise).

Cluster analysis is a kind of unsupervised machine learning technique, as in general, we do not have any labels. There may be some techniques that use class labels to do clustering but this is generally not the case.
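To make this input/output contract concrete, here is a minimal sketch using scikit-learn on some made-up 2D points (the data and parameter values are illustrative only): KMeans must be told the number of clusters, while DBSCAN infers it and labels noise points with -1.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Made-up 2D points X1...Xn: one row per point, one column per dimension
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# k-means: the number of clusters is a required parameter
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed; outliers/noise get the label -1
dbscan_labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)

print(kmeans_labels)   # one cluster id per input point, e.g. [0 0 0 1 1 1]
print(dbscan_labels)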

Summary

We discussed what cluster analysis is, the various clustering algorithms, and the inputs and outputs of these algorithms. We also discussed various applications of clustering – not necessarily just in the data science field.

Part 2

In this video, we will look at probably the most popular clustering algorithm i.e. k-means clustering.

Graph with blue and red points

This is a very popular, simple and easy-to-implement algorithm.

In this algorithm, we separate data into k disjoint clusters. These clusters are defined such that they minimize the within-cluster sum-of-squares. We’ll discuss this more when we look at k-means convergence. Disjoint here means that one point cannot belong to more than one cluster.

There is only 1 parameter for this algorithm i.e. k (the number of clusters). We need to have an idea of k before we run the algorithm. Sometimes it is obvious how many clusters we should have, but sometimes it is not that clear. We will discuss later how to make this choice.

This algorithm is a baseline algorithm in clustering.

The cluster center/centroid is a point that represents the cluster. The figure above has a red and a blue cluster. X is the centroid – the average of the x and y coordinates. In the blue cluster the average of the x and y coordinates is somewhere in the middle represented by the X in the middle of the square.

K-means Clustering Algorithm

1. Randomly initialize the cluster centers. For example, in the above diagram, we pick 2 random points to initialize the clusters.
2. Assign each point to its nearest cluster using a distance metric such as Euclidean distance.
3. Update the cluster centroids using the mean of the points assigned to it.
4. Go back to step 2 until convergence (the cluster centroids stop moving, or move only imperceptibly small amounts). A minimal code sketch of these steps follows below.
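Here is a minimal NumPy sketch of those four steps on some made-up 2D data. It is a sketch only – a real project would normally use sklearn.cluster.KMeans, and this version ignores edge cases such as a centroid losing all of its points.

import numpy as np

rng = np.random.default_rng(0)
# Two made-up blobs of 2D points
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])
k = 2

# Step 1: randomly initialize the centroids from the data points
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign each point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 3: move each centroid to the mean of the points assigned to it
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Step 4: stop once the centroids barely move (convergence)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)   # should land near (0, 0) and (3, 3)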

Let’s go through an example (fake data) and discuss some interesting things about this algorithm.

Fake data graph

Visually, we can see that there are 2 clusters present.

Let’s randomly assign the cluster centers.

Fake data graph with random cluster centers

Let’s now assign each point to the closest cluster.

Fake data graph with points assigned to cluster

The points are now colored blue or red depending on which centroid they are closer to.

Next we need to update our cluster centroids. For the blue centroid, we take the average of all the x-coordinates – this will be the new x-coordinate for the centroid. Similarly we look at all the y-coordinates for the blue points, take their average and this becomes the new y-coordinate for the centroid. Likewise for the red points.

When I do this, the centroids shift over.

Fake data graph with centroids shifted

Once again I need to figure out which centroid each point is close to, which gives me the following.

Fake data graph with points assigned to new centroids

Once again, I update my cluster centroids as before.

Fake data graph with centroids shifted again

If we try to do another shift, the centroids won’t move again. This is evident from the last 2 figures where the same points are assigned to the cluster centroids.

At this point, we say that k-means has converged.

Convergence

Convergence means that the cluster centroids don’t move at all, or move only a very small amount. We use a threshold value: if the centroids move less than that threshold, we say k-means has converged.

Math equation for determining k-means

Mathematically, k-means is guaranteed to converge in a finite number of iterations (an iteration being one round of assigning points to clusters and shifting the centroids). It may take a long time, but it will eventually converge. Note that this guarantee says nothing about reaching the best or optimal clustering – only that the algorithm will converge.

K-means is sensitive to where you initialize the centroids. There are a few techniques to do this:

  • Assign each cluster center to a random data point.
  • Choose k points to be farthest away from each other within the bounds of the data.
  • Repeat k-means over and over again and pick the average of the clusters.
  • Another, more advanced approach called k-means++ picks initial centroids that are well spread out. We won’t be getting into the details, though – it’s built into many libraries (see the sketch below).
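In practice you rarely implement the initialization yourself. As a rough sketch (the parameter values are illustrative), scikit-learn’s KMeans exposes an init argument ('k-means++' or 'random') and an n_init argument that reruns the algorithm several times and keeps the run with the lowest within-cluster sum-of-squares.

from sklearn.cluster import KMeans

# 'k-means++' spreads the initial centroids out; n_init=10 runs k-means
# ten times from different initializations and keeps the best result.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # X is assumed to be an (n_samples, 2) array of points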

Choosing k (How many clusters to use)

One way is to plot the data points and try different values to see what works the best. Another technique is called the elbow method.

Elbow method

Data graph with elbow point circled

Steps:

  • Choose some values of k and run the clustering algorithm
  • For each cluster, compute the within-cluster sum-of-squares between the centroid and each data point.
  • Sum up for all clusters, plot on a graph
  • Repeat for different values of k, keep plotting on the graph.
  • Then pick the elbow of the graph.

This is a popular method supported by several libraries.
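As a rough sketch of the elbow method with scikit-learn (assuming X holds your 2D points; inertia_ is the library’s name for the within-cluster sum-of-squares):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 10)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)   # within-cluster sum-of-squares for this k

plt.plot(ks, wcss, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('within-cluster sum-of-squares')
plt.show()   # pick the k at the "elbow" of the curve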

Advantages Of k-means

  • It is a widely known and widely used algorithm.
  • It’s also fairly simple to understand and easy to implement.
  • It is also guaranteed to converge.

Disadvantages of k-means

  • It is algorithmically slow, i.e., it can take a long time to converge.
  • It is only guaranteed to converge to a local minimum, not necessarily to the optimal solution.
  • It’s also not very robust against varying cluster shapes, e.g., it may not perform very well for elongated cluster shapes. This is because we use the same parameters for each cluster.

This was a quick overview of k-means clustering. Let’s now look at how it performs on different kinds of data sets.


Transcript 1

Hello, world, and thanks for joining me. My name is Mohit Deshpande. In this course, we’ll be learning about clustering analysis. In particular, we’re gonna use it in the context of data science, and we’re gonna analyze some data and see if we can segment out different kinds of customers so that we can provide them with all kinds of neat promotional discounts or special offers.

This is a very common thing that’s done for a lot of companies. We have a bunch of data. We wanna figure out unique groups of customers so that we can offer them special things.

In this course, we’ll be learning a little bit about three different kinds of clustering algorithms. First, I wanna introduce you to the concept of clustering, or what is clustering, and what are some other applications of it. Then we’ll get onto three different clustering algorithms, and what’s really neat is that they all approach clustering in a very different fashion.

So first, we’re gonna learn about a very popular kind of clustering called K-means clustering. Then we’ll look into a density-based clustering algorithm called DBSCAN. Then finally we’ll discuss a hierarchical clustering algorithm called hierarchical agglomerative clustering. And then we’ll see different kinds of data, where they work really well, and where they don’t work quite so well. So we’ll get a good idea of, at least some kind of notion, of which kind of clustering algorithms tend to work well on which kind of data.

Then, finally, we’re gonna tie everything together by looking at a real world data set and see if we can segment out different customers. We’ve been making courses since 2012, and we’re super-excited to have you on board. Online courses are a great way to learn new skills, and I take a lot of online courses myself.

Several courses consist mainly of video lessons that you can watch at your own pace and as many times as you want. So feel free to watch and re-watch and pause the video as many times as you want. We also have downloadable source code and project files and you’re getting everything that we build during the lesson.

It’s highly recommended that you code along with me. In my experience, that’s the best way to learn something is to get your feet wet. Lastly, we’ve seen the students who get the most out of these online courses are those who make a weekly plan and stick with it, depending, of course, on your own availability and learning style. Zenva, over the past six years, has taught all kinds of different topics on programming and game development, over 300,000 students, across 100 courses. The skills that they learned in these courses are completely transferrable. In fact, some of the students have used the skills that they’ve learned to advance their careers, to make a start-up, or to publish their own content from the skills that they’ve learned.

Thanks for joining, and I look forward to seeing the cool stuff you’ll be building. Now, without further ado, let’s get started.

Transcript 2

In this video, we are going to learn a little bit about cluster analysis. And this is a topic that we’re gonna be discussing over the duration of this course.

So just to give you an overview of the different things we’re gonna be covering, I’m gonna give you an introduction to cluster analysis, basically what is it and what are the different applications of it, as well as what kind of algorithms can we expect. And in fact, we’re gonna be covering three very popular algorithms, k-means clustering, DBSCAN, which stands for density-based spatial clustering of applications with noise, but usually we just call it DBSCAN, and then hierarchical agglomerative clustering, HAC.

These are three very popular clustering algorithms. And the interesting thing is they all take very different approaches to creating clusters. And we’re gonna get into all those in the subsequent videos. But first let’s talk a little bit about cluster analysis. And that’s what we’re gonna be focusing on primarily in this video, just to acquaint you with some of the terminology as well as some applications of cluster analysis for example. So clustering analysis, so imagine we have some data.

The whole point of clustering analysis is in an unsupervised way with no prior information, we want to be able to separate different groups based on the data that we have. And now sometimes these groups are predefined. You have the set of data like in this case, and you say, well, this seems, we plot this data, and you say, well, it seems to fit into two little groups. Here there’s a little clustering of points on the bottom left, and there’s a larger, kind of elongated cluster on the top right and so we might say, well, we can give a predefined number of clusters.

We want two clusters and we can give that to the clustering algorithms and then they’ll group these guys together. It’ll make a split and it actually, in some cases, we don’t need to specify the number of clusters. In fact, some algorithms, which is DBSCAN, are actually smart enough to be able to figure out how many clusters are based entirely on the data. But algorithms like k-means will actually need to be specified how many clusters that we have.

And so, for example, this data set is actually taken, it’s a very famous data set called the Iris Dataset, collected by Ronald Fisher, which is, and here is a quick historical side note, he’s probably the most important statistician of the 20th century. A lot of statistical techniques that we have that are used in all kinds of companies were originally some of his work, but he collected this dataset of flowers. He has 50 samples each of three different species of flowers and he plots their measured properties like petal width, petal length, sepal width, and sepal length, and they’re all plotted out.

In this case, what I’ve actually done is removed the class labels, because usually when we’re doing clustering analysis, we don’t have the correct labels. In fact, that’s what the clustering is trying to give us. It’s trying to give us some notion that these things belong together and these other things belong together. So this is just a kind of data that you might expect with some clustering.

So clustering is taking our data and then putting it into groups such that the groups have some kind of similar properties or similar attributes here. So if we go back a slide here, so we have one cluster at the bottom left for example. That might be considered a cluster where the flowers in that cluster have a small petal length and a smaller petal width, for example. That’s an example of grouping, as I’m talking about.

And there’s so many different applications of clustering analysis, not just used for something like data science. But also things like medical imaging for things like x-rays or MRIs or FMRIs. They use clustering analysis. Recommender systems like those used on websites like Amazon.com and whatnot, they recommend you can use clustering analysis to help build recommendation systems. Geospatial data, of course, our data is in latitude and longitude coordinates and we can do some kind of clustering with that as well. And we can also use it for things like anomaly detection in our data. The top bullet point here is that we can use it for market analysis and segmentation, a very popular technique for that as well.

And so this kind of gives you a little bit of the applications of clustering analysis and certainly the algorithms that we’re gonna be learning are used in the field and can be used for your data as well. So for example, people have used clustering algorithms to actually do things like, to check brain anomalies, so here are just images I’ve taken of the brain and the different C-1 through C-7 are different clusters of the brain. They use a different clustering metric. And the result is, you can kind of segment out different kinds of brain anomalies. So it’s an application of clustering in the medical domain.

And another domain, what these guys have done is look at reviews on Yelp.com, and they’ve performed a spatial analysis of that. So there are a lot of different applications of clustering in many different fields.

So just a quick overview of clustering, but we can get a little bit more formal with this, with the algorithms. So you can think of clustering analysis on the bottom here, as just taking in a set of points, X1 to Xn. And then we churn it through a machine that is the cluster analysis, and then out comes an assignment that maps each point to a particular cluster. So on the bottom here, X1 is mapped to Cluster 1, X2 is mapped to Cluster 2, and there are a lot of different algorithms that exist to detect clusters.

In fact, like I’ve mentioned, we’re gonna be covering three of them, of the most popular ones. Now the parameters, some of these algorithms have additional parameters to them. So like I mentioned before, one of those parameters might be the number of clusters that you have that might have to be something that’s pre-defined that you would give to an algorithm. But there are sometimes other parameters as well, and they vary for each algorithm, so there’s no uniformness among the parameters. However, like you see in the chart here, the input to clustering algorithms is usually a set of data points, X1 to Xn, and they can be of any dimensionality, they can be in 2D, they can be 2D points, they can be 3D points, they can be 100 dimensional points.

But for our purposes, we’re just gonna stick with 2D points. It is certainly possible to do clustering on higher dimensionality data, but one of the issues is actually visualizing that higher dimensional data. Because we can’t really plot a hundred dimensional graph terribly easily, but there exist dimensionality production techniques that can press 100 dimensions down into two so we can plot it, but that’s a little beyond the scope of the course here. So we’re mostly gonna be sticking with 2D.

And then the output, like I said, is a cluster assignment. So each data point belongs to a cluster, or actually in some cases, some algorithms have some notion of outliers or noise built in, so just DBSCAN for example. So a point may not necessarily belong to an exact cluster, it might also belong to, it might be a noise point or some people like to think of it as being a noise cluster, although that’s not quite correct. But that is the output, and this is kind of a generalization of clustering algorithms. They take in this set of input points and then they produce a mapping that maps each point to a cluster or perhaps to a noise point.

And you can think of clustering analysis as being a kind of unsupervised machine learning, per se. And it’s unsupervised because generally, with clustering analysis, we don’t have any labels for our data. We just get a set of points and we’re set to cluster them. And so this unsupervised, and there are techniques that might use class labels to help do some kind of clustering or provide some other auxiliary information, but in many, many cases, clustering analysis is done unsupervised. We don’t have correct labels.

Alright, so that is just a quick overview of clustering analysis, so basically, what is clustering analysis, what kind of inputs we give to a clustering algorithm, what kinds of outputs do we expect from a clustering algorithm, and also discuss a little bit about different applications of clustering algorithms used in a wide variety of fields. It doesn’t even have to be something like data science or computer science, it can be used in the medical field, in business, finance, and so learning this information, we can take this and apply it to many other fields. So that is just a quick introduction to clustering analysis.

Transcript 3

In this video we are going to look at probably the most popular clustering algorithm called K-means clustering. So K-means clustering, the point of it is to take our cluster data and then separate it into K disjoint clusters, and these clusters are defined so that they minimize what’s called the within-cluster sum-of-squares, but we’ll get to that a bit later. You’ll see what that means when we discuss a little bit more about K-means convergence.

So the point of K-means clustering is to separate the data into K disjoint clusters. What I mean by disjoint clusters means that one point cannot belong to more than one cluster, and the only parameter that we have to set for this algorithm is the number of clusters. So you’ll see an example of an algorithm where you need to have some notion of how many clusters you want your data to have before you run the algorithm. Later, we’re gonna discuss how you can select this value of K, because sometimes it’s quite obvious how many clusters you have, but many times it’s not quite so clear how many clusters you should have, and so we’ll discuss a couple techniques a bit later on how to choose this parameter K.

So K-means clustering is very popular, very well-known. It actually, the algorithm itself is quite simple. It’s just a couple lines of algorithm code and the code itself is writing, if you had to write it from scratch also, it’s not something that would take you a long time. It’s very popular. I mean, it’s taught in computer science curriculums very commonly. So knowing this algorithm is kind of the first step to getting more acquainted with clustering. It’s kind of the baseline algorithm, and we’ll move on to more complicated algorithms a bit later, and then one point of terminology here.

You’ll often hear something called a cluster center or a centroid, and really that’s just a point that represents a cluster, so in the figure that I have here on the bottom right, we have two clusters. We have a red cluster and a blue cluster, and the X is the centroid. In other words, the centroid is really like the average of the X coordinates and the average of the Y coordinates and then that’s the point. So you can see in the blue, the average, if I average the X coordinates is somewhere gonna be in the middle, and if I average the Y coordinates, that’s gonna be somewhere in the middle and we end up with some centroid that’s in the middle of that square, and so that’s what I mean by centroid. So that’s all we really need to know about K-means.

Now we can get to the algorithm, we can actually get to the algorithm. I’m going to go through an example with you so you can kinda see it progress step-by-step with just some fake data, and then we’ll discuss some more interesting things about it such as convergence, properties, you know, talk a little bit about what convergence is, and then we’ll also discuss a bit later how do you select the value of K and what not. So but we’re gonna get to that.

So the clustering algorithm, the first step to the clustering is you randomly initialize the cluster centers. So suppose I have some data, that I have two clusters of four like in the previous chart. I have two clusters, so what I do is I just pick two random points to initialize the clusters. Now, the algorithm doesn’t, there are many ways to do this, and we’ll discuss a couple different ways a bit later, and so after you pick, we randomly initialize our cluster centers, then for each point we assign it to its nearest cluster using some kind of like metric, like the distance formula for example, Euclidean distance, and then we update the cluster centroids by taking the mean of the points assigned to it.

So after each point is assigned to a centroid, we look at the centroids and see how many points are assigned to this one centroid, and then we just average all the points, and that’s gonna shift the centroid. We basically keep doing this.

After the centroids have shifted, we then reset all of our points and figure out where the near, who, assign them to the nearest cluster again, and we just keep doing this until the cluster centroids stop moving or they move a very, very small imperceptible amount. Now, let’s actually do an example of this with some data so you have a better understanding of what I mean by each of these steps. So here is our data and we wanna cluster it and we want it to, we’re setting K equals two. There’s two clusters, okay?

If we look at this data we can kinda see that, well, yeah, there are two clusters here. So here is just like randomly initialized the two clusters. There’s red and blue, and just placed ’em somewhere on the plane. So now what I want to do is for each point, I’m going to assign it to the closest cluster. So after I do that I get something like this. So now I have colored all the points red or blue depending on if it’s closer to the red centroid or closer to the blue centroid.

So after I’ve done this, now I need to update my cluster centroid. So how I do that is I look at all the X, for example, for the blue centroid, I’ll look at all the X coordinates and take the average and that’s my new X coordinate for the centroid. I look at all the Y coordinates of all the blue points and I take the average of those and that becomes the new Y coordinate for my blue centroid. So after I do this, the centroids kinda shift over and then I reset the colors of all the points, and now I just do the same thing again.

So now for each point I have to figure out which centroid is it close to. So after I do that I get this assignment, and now I, again, take the average of the X coordinates and the average of the Y coordinates for each cluster and that becomes the location of my new centroid, and I shift it over again, and so now it’s gonna be here, and now let’s do another, now I do this one more time, and you’ll see that now, where if I try to do another shift, that nothing will, the cluster centroids won’t actually move again because in the previous step we see that the same points are assigned to the same cluster centroids.

So in other words, my centroids aren’t gonna move, and at this point we say that K-means has converged. So I hope that this gives you a bit of an overview of a K-means here as in how do you actually run the algorithm in step. So like I mentioned convergence, just ignore that big equation. It looks kinda scary but this is just to illustrate a point here. So convergence means that the cluster centroids either don’t move entirely or they move a very, very small very small amount, and you usually just set a threshold value. So if they don’t move at least this much, then we say they’ve converged.

The reason that I put this equation up here is just to illustrate the point that mathematically K-means is guaranteed to converge in a finite number of iterations, and by iterations I mean assigning the points to a cluster and then shifting and then assigning the points to a cluster and shifting.

In a finite amount of iterations, it is mathematically guaranteed to converge. Now, finite number, that still means that it might take a long time, but it will eventually converge. Now, this convergence, that theorem of K-means, doesn’t say anything about converging to the best cluster or converging to the optimal clustering. It just says that it will converge. This is still a useful property to know about K-means because this might not be true of other algorithms.

So one other point that I want to discuss is that K-means is actually quite sensitive to where you initialize the centroids, and there are a couple techniques to do this. So one technique that people like to use is you can just assign a cluster to be a random data point. So you pick a data point and say that is my cluster center. Another interesting thing is to choose K points that are the farthest away from each other. So if I have two points, I’ll choose two points that are the farthest away from each other within the bounds of my data, between the bound of the smallest X coordinate and the largest X coordinate, smallest Y coordinate, largest Y coordinate, and another approach is to repeat my K-means many times over and over again.

You just run the algorithm over and over again using one of these initializations and just take the average of the clusters. There’s another way more advanced approach called K-means plus plus that does all kinds of analysis of variance, but we’re not gonna get into that. In fact, we don’t have to implement it either ’cause it’s usually embedded in many libraries. Okay, so the only thing that remains to talk about is how do you choose a value of K. Well, one other thing to do is just plot your data, try a couple K values, and see which clustering works the best.

Another technique that’s actually quite popular is called the elbow method, and what you do is you choose some values of K, you run the clustering algorithm, you compute this thing called a within-cluster-sum-of-squares. In other words, you take the centroid and each data point assigned to that centroid and compute the sum of the square differences between those guys and you sum up all that. So for each cluster so you get one of those, and then you sum it up for all the clusters and you plot that on a graph, and you can keep plotting this on a graph for different values of K and then you pick the elbow of this graph here and that’s, it’s not a very clear-cut way as to what value of K you should use, it just gives you kind of an idea of what the range of different K you should choose there.

So this is also very popular and, again, there are lots of libraries that can actually generate this for you. You can do this yourself as well. So this is just a technique that you might wanna use to figure out how many values of K you should use. Okay, so that pretty much covers K-means clustering.

So I wanna talk a little bit about some of the advantages and disadvantages of this, and then later we’re gonna actually see it run on different kinds of data, and we’ll see how well it performs there. So it’s widely known and used, very popular like I said. It’s a fairly simple algorithm actually. It’s really only like four steps, and in lines of code wise it’s not that long either. It’s in the algorithm, easy to implement. One big perk is that it’s guaranteed to converge. It’s pretty nice.

Disadvantages of K-means, it’s algorithmically pretty slow if you know anything about big O notation. It can take a long time for it to converge. Another issue is that it might not converge to the best solution. It’s guaranteed to converge, but it might not converge to the best solution, and it’s not very robust against clusters that have various shapes, like elongated clusters for example. It might not perform so well.

We’re actually gonna see this when we look at some fake data. We’ll see how well it performs against elongated data for example. And the reason for this is because we’re using the same parameters for each cluster when we do this. So anyway, that is just a quick overview of K-means clustering, and so now that we have this example we can go ahead and try to look at, analyze how it performs on different kinds of data sets.

Interested in continuing? Check out the full Data Insights with Cluster Analysis course, which is part of our Data Science Mini-Degree.

For teachers, you can also try Zenva Schools which offers online learning suitable for K12 school environments and offers a wide array of topics including Python, web development, Unity, and more.

Clustering with Gaussian Mixture Models – Data Science & ML
https://gamedevacademy.org/clustering-with-gaussian-mixture-models/

Gaussian Mixture Models are an essential part of data analysis – but do you know how they work?

In this article, we’ll seek to demystify how to analyze “clusters” of data and discuss Gaussian Mixture Models which can help us more efficiently and accurately sort out clusters of data points.

Let’s dive in!

What is Clustering?

Clustering is an essential part of any data analysis. Using an algorithm such as K-Means leads to hard assignments, meaning that each point is definitively assigned a cluster center. This leads to some interesting problems: what if the true clusters actually overlap? What about data that is more spread out; how do we assign clusters then?

Gaussian Mixture Models save the day! We will review the Gaussian or normal distribution method and the problem of clustering. Then we will discuss the overall approach of Gaussian Mixture Models. Training them requires using a very famous algorithm called the Expectation-Maximization Algorithm that we will discuss.

Download the full code here.

If you are not familiar with the K-Means algorithm or clustering, read about it here.


Gaussian Distribution

The first question you may have is “what is a Gaussian?”. It’s the most famous and important of all statistical distributions. A picture is worth a thousand words so here’s an example of a Gaussian centered at 0 with a standard deviation of 1.

This is the Gaussian or normal distribution! It is also called a bell curve sometimes. The function that describes the normal distribution is the following

\[ \mathcal{N}(x; \mu, \sigma^2) = \displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\displaystyle\frac{(x-\mu)^2}{2\sigma^2}}  \]

That looks like a really messy equation! And it is, so we’ll use $\mathcal{N}(x; \mu, \sigma^2)$ to represent that equation. If we look at it, we notice there is one input and two parameters. First, let’s discuss the parameters and how they change the Gaussian. Then we can discuss what the input means.

The two parameters are called the mean $\mu$ and standard deviation $\sigma$. In some cases, the standard deviation is replaced with the variance $\sigma^2$, which is just the square of the standard deviation. The mean of the Gaussian simply shifts the center of the Gaussian, i.e., the “bump” or top of the bell. In the image above, $\mu=0$, so the largest value is at $x=0$.

The standard deviation is a measure of the spread of the Gaussian. It affects the “wideness” of the bell. Using a larger standard deviation means that the data are more spread out, rather than closer to the mean.

What about the input? More specifically, the above function is called the probability density function (pdf), and it tells us how likely we are to observe an input $x$ under that specific normal distribution. Given the graph above, we see that an input value of 0 gives a density of about 0.4 – the peak of the curve. As we move away from the center in either direction, the density decreases. This is one key property of the normal distribution: the density is highest at the mean and approaches zero as we move away from the mean.

(Since this is a probability distribution, the sum of all of the values under the bell curve, i.e., the integral, is equal to 1; we also have no negative values.)
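A quick sketch with SciPy confirms both properties – the peak of the standard normal pdf sits at the mean with a value of roughly 0.4, and the total area under the curve is 1:

import numpy as np
from scipy.stats import norm

x = np.linspace(-4, 4, 401)
pdf = norm.pdf(x, loc=0, scale=1)   # N(x; mu=0, sigma^2=1)

print(pdf.max())           # ~0.3989, the peak at the mean x = 0
print(np.trapz(pdf, x))    # ~1.0, the area under the bell curve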

Why are we using the Gaussian distribution? The Expectation-Maximization algorithm is actually more broad than just the normal distribution, but what makes Gaussians so special? It turns out that many dataset distributions are actually Gaussian! We find these Gaussians in nature, mathematics, physics, biology, and just about every other field! They are ubiquitous! There is a famous theorem in statistics called the Central Limit Theorem that states that sums and averages of more and more samples tend to resemble a Gaussian, even if the original dataset distribution is not Gaussian! This makes Gaussians very powerful and versatile!

Multivariate Gaussians

We’ve only discussed Gaussians in 1D, i.e., with a single input. But they can easily be extended to any number of dimensions. For Gaussian Mixture Models, in particular, we’ll use 2D Gaussians, meaning that our input is now a vector instead of a scalar. This also changes our parameters: the mean is now a vector as well! The mean represents the center of our data so it must have the same dimensionality as the input.

The variance changes less intuitively into a covariance matrix $\Sigma$. The covariance matrix, in addition to telling us the variance of each dimension, also tells us the relationship between the inputs, i.e., if we change x, how does y tend to change?

We won’t discuss the details of the multivariate Gaussian or the equation that generates it, but knowing what it looks like is essential to Gaussian Mixture Models since we’ll be using these.

The above chart has two different ways to represent the 2D Gaussian. The upper plot is a surface plot that shows our 2D Gaussian in 3D. The X and Y axes are the two inputs and the Z axis represents the probability density. The lower plot is a contour plot. I’ve plotted these on top of each other to show how the contour plot is just a flattened surface plot, where color is used to indicate height. The lighter the color, the larger the density. The Gaussian contours resemble ellipses, so our Gaussian Mixture Model will look like it’s fitting ellipses around our data. Since the surface plot can get a little difficult to visualize on top of data, we’ll be sticking to the contour plots.
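If you want to reproduce a contour plot like this yourself, here is a small sketch using SciPy’s multivariate normal with a made-up mean vector and covariance matrix (the positive off-diagonal entry tilts the ellipses so that x and y rise together):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])         # covariance between x and y

x, y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
pos = np.dstack((x, y))              # shape (100, 100, 2)
z = multivariate_normal(mean, cov).pdf(pos)

plt.contour(x, y, z)                 # elliptical contours of the 2D Gaussian
plt.axis('equal')
plt.show()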

Gaussian Mixture Models

Now that we understand Gaussians, let’s discuss Gaussian Mixture Models (GMMs)! To motivate our discussion, let’s see some example data that we want to cluster.

We could certainly cluster these data using an algorithm like K-Means to get the following results.

In this case, K-Means works out pretty well. But let’s consider another case where we have overlap in our data. (The two Gaussians are colored differently)

In this case, it’s pretty clear that these data are generated from Gaussians from the elliptical shape of the 2D Gaussian. In fact, we know that these data follow the normal distribution so using K-Means doesn’t seem to take advantage of that fact.  Even though I didn’t tell you our data were normally distributed, remember that the Central Limit Theorem says that enough random samples from any distribution will look like the normal distribution.

Additionally, K-Means doesn’t take into account the covariance of our data. For example, the blue points seem to have a relationship between X and Y: larger X values tend to produce larger Y values. If we had two points that were equidistant from the center of the cluster, but one followed the trend and the other didn’t, K-Means would regard them as being equal, since it uses Euclidean distance. But it seems certainly more likely that the point that follows the trend should match closer to the Gaussian than the point that doesn’t.

Since we know these data are Gaussian, why not try to fit Gaussians to them instead of a single cluster center? The idea behind Gaussian Mixture Models is to find the parameters of the Gaussians that best explain our data.

This is what we call generative modeling. We are assuming that these data are Gaussian and we want to find parameters that maximize the likelihood of observing these data. In other words, we regard each point as being generated by a mixture of Gaussians and can compute that probability.

\begin{align}
p(x) = \displaystyle\sum_{j=1}^{k} \phi_j\mathcal{N}(x; \mu_j, \Sigma_j)\\
\displaystyle\sum_{j=1}^{k} \phi_j = 1
\end{align}

The first equation tells us that the density at a particular data point $x$ is a weighted combination of the $k$ Gaussians. We weight each Gaussian with $\phi_j$, which represents the strength of that Gaussian. The second equation is a constraint on the weights: they all have to sum up to 1. We have three different sets of parameters that we need to update: the weights for each Gaussian $\phi_j$, the means of the Gaussians $\mu_j$, and the covariances of each Gaussian $\Sigma_j$.

If we try to directly solve for these, it turns out that we can actually find closed forms! But there is one huge catch: we would have to know which Gaussian generated each point! In other words, if we knew exactly which Gaussian a particular point was taken from, then we could easily figure out the weights, means, and covariances! But this one critical flaw prevents us from solving GMMs using this direct technique. Instead, we have to come up with a better approach to estimate the weights, means, and covariances.
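The mixture density itself is easy to evaluate once the parameters are known. Here is a tiny sketch with made-up weights, means, and covariances for $k = 2$, directly implementing the weighted sum above:

import numpy as np
from scipy.stats import multivariate_normal

phis = np.array([0.6, 0.4])                          # weights, must sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # means
sigmas = [np.eye(2), np.eye(2)]                      # covariance matrices

def mixture_density(x):
    """p(x) = sum_j phi_j * N(x; mu_j, Sigma_j)"""
    return sum(phi * multivariate_normal(mu, sigma).pdf(x)
               for phi, mu, sigma in zip(phis, mus, sigmas))

print(mixture_density(np.array([0.5, 0.5])))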

Expectation-Maximization

How do we learn the parameters? There’s a very famous algorithm called the Expectation-Maximization Algorithm, also called the EM algorithm for short, (written in 1977 with over 50,000 paper citations!) that we’ll use for updating these parameters. There are two steps in this algorithm as you might think: expectation and maximization. To explain these steps, I’m going to cover how the algorithm works at a high level.

The first part is the expectation step. In this step, we have to compute the probability that each data point was generated by each of the $k$ Gaussians. In contrast to the K-Means hard assignments, these are called soft assignments since we’re using probabilities. Note that we’re not assigning each point to a Gaussian, we’re simply determining the probability of a particular Gaussian generating a particular point. We compute this probability for a given Gaussian by computing $\phi_j\mathcal{N}(x; \mu_j, \Sigma_j)$ and normalizing by dividing by $\sum_{q=1}^{k} \phi_q\mathcal{N}(x; \mu_q, \Sigma_q)$. We’re directly applying the Gaussian equation, but multiplying it by its weight $\phi_j$. Then, to make it a probability, we normalize. In K-Means, the expectation step is analogous to assigning each point to a cluster.

The second part is the maximization step. In this step, we need to update our weights, means, and covariances. Recall in K-Means, we simply took the mean of the set of points assigned to a cluster to be the new mean. We’re going to do something similar here, except apply our expectations that we computed in the previous step. To update a weight $\phi_j$, we simply sum up the probability that each point was generated by Gaussian $j$ and divide by the total number of points. For a mean $\mu_j$, we compute the mean of all points weighted by the probability of that point being generated by Gaussian $j$. For a covariance $\Sigma_j$, we compute the covariance of all points weighted by the probability of that point being generated by Gaussian $j$. We do each of these for each Gaussian $j$. Now we’ve updated the weights, means, and covariances! In K-Means, the maximization step is analogous to moving the cluster centers.

Mathematically, at the expectation step, we’re effectively computing a matrix where the rows are the data points and the columns are the Gaussians. The element at row $i$, column $j$ is the probability that $x^{(i)}$ was generated by Gaussian $j$.

\[ W^{(i)}_j =  \displaystyle\frac{\phi_j\mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\displaystyle\sum_{q=1}^{k}\phi_q\mathcal{N}(x^{(i)}; \mu_q, \Sigma_q)} \]

The denominator just sums over all values to make each entry in $W$ a probability. Now, we can apply the update rules.

\begin{align*}
\phi_j &= \displaystyle\frac{1}{N}\sum_{i=1}^N W^{(i)}_j\\
\mu_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j x^{(i)}}{\sum_{i=1}^N W^{(i)}_j}\\
\Sigma_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^N W^{(i)}_j}
\end{align*}

The first equation is just the sum of the probabilities for a particular Gaussian $j$ divided by the number of points. In the second equation, we’re just computing the mean, except we weight each point by the probability of it being generated by Gaussian $j$. Similarly, in the last equation, we’re just computing the covariance, except we weight by those same probabilities.
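Putting the two steps together, here is a rough NumPy/SciPy sketch of one EM iteration for a GMM. The function and variable names are mine; in practice scikit-learn’s GaussianMixture, used below, does all of this for you.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phis, mus, sigmas):
    # W[i, j] = probability that point i was generated by Gaussian j
    k = len(phis)
    W = np.column_stack([
        phis[j] * multivariate_normal(mus[j], sigmas[j]).pdf(X)
        for j in range(k)
    ])
    return W / W.sum(axis=1, keepdims=True)    # normalize each row

def m_step(X, W):
    N, k = W.shape
    phis = W.sum(axis=0) / N                   # updated weights
    mus, sigmas = [], []
    for j in range(k):
        w = W[:, j]
        mu = (w[:, None] * X).sum(axis=0) / w.sum()           # weighted mean
        diff = X - mu
        # weighted covariance: sum_i w_i (x_i - mu)(x_i - mu)^T / sum_i w_i
        sigma = np.einsum('n,ni,nj->ij', w, diff, diff) / w.sum()
        mus.append(mu)
        sigmas.append(sigma)
    return phis, mus, sigmas

Iterating e_step and m_step until the parameters stop changing is, at a high level, what GaussianMixture.fit does under the hood.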

Applying GMMs

Let’s apply what we learned about GMMs to our dataset. We’ll be using scikit-learn to run a GMM for us. In the ZIP file, I’ve saved some data in a numpy array. We’re going to extract it, create a GMM, run the EM algorithm, and plot the results!

First, we need to load the data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Load the 2D points saved in the ZIP file (one row per point)
X_train = np.load('data.npy')

Additionally, we can generate a plot of our data using the following code.

# Scatter-plot the raw, unlabeled 2D points
plt.plot(X_train[:, 0], X_train[:, 1], 'bx')
plt.axis('equal')
plt.show()

Remember that clustering is unsupervised, so our input is only a 2D point without any labels. We should get the same plot of the 2 Gaussians overlapping.

Using the GaussianMixture class of scikit-learn, we can easily create a GMM and run the EM algorithm in a few lines of code!

# Fit a 2-component Gaussian mixture to the data using the EM algorithm
gmm = GaussianMixture(n_components=2)
gmm.fit(X_train)

After our model has converged, the weights, means, and covariances should be solved! We can print them out.

print(gmm.means_)
print('\n')
print(gmm.covariances_)

For comparison, I generated the original data according to the following Gaussians.

\begin{align*}
\mu_1 &= \begin{bmatrix}3 \\ 3\end{bmatrix} &
\Sigma_1 &= \begin{bmatrix} 1 & 0.6 \\ 0.6 & 1 \end{bmatrix}\\
\mu_2 &= \begin{bmatrix}1.5 \\ 1.5\end{bmatrix} &
\Sigma_2 &= \begin{bmatrix} 1 & -0.7 \\ -0.7 & 1 \end{bmatrix}
\end{align*}

Our means should be pretty close to the actual means! (Our covariances might be a bit off due to the weights.) Now we can write some code to plot our mixture of Gaussians.

# Evaluate the trained GMM on a 50x50 grid covering the data range
# (np.linspace defaults to 50 points, hence the reshape below)
X, Y = np.meshgrid(np.linspace(-1, 6), np.linspace(-1, 6))
XX = np.array([X.ravel(), Y.ravel()]).T
Z = gmm.score_samples(XX)    # weighted log probability at each grid point
Z = Z.reshape((50, 50))

# Draw the GMM contours over the original data points
plt.contour(X, Y, Z)
plt.scatter(X_train[:, 0], X_train[:, 1])

plt.show()

This code simply creates a grid of all X and Y coordinates between -1 and 6 (for both) and evaluates our GMM. Then we can plot our GMM as contours over our original data.

The plot looks just as we expected! Recall that with the normal distribution, we expect to see most of the data points around the mean and less as we move away. In the plot, the first few ellipses have most of the data, with only a few data points towards the outer ellipses. The darker the contour, the lower the score.

(The score isn’t quite a probability: it’s actually a weighted log probability. Remember that each point is generated by a weighted sum of Gaussians, and, in practice, we apply a logarithm for numerical stability, thus preventing underflow.)

To summarize, Gaussian Mixture Models are a clustering technique that allows us to fit multivariate Gaussian distributions to our data. GMMs work well when our data actually is Gaussian, or when we suspect it to be. We also discussed the famous Expectation-Maximization algorithm, at a high level, to see how we can iteratively solve for the parameters of the Gaussians. Finally, we wrote code to train a GMM and plot the resulting Gaussian ellipses.

Gaussian Mixture Models are an essential part of data analysis and anomaly detection – which are important to learn when pursuing exciting developer careers! You can also find out more resources for exploring Python here.

A Bite-Sized Guide to Pandas
https://gamedevacademy.org/bite-size-pandas-tutorial/

You can access the full course here: Bite-Sized Pandas

Part 1

In this lesson we’ll learn about DataFrames (the data structure pandas works with) and learning how to access data in the DataFrame using pandas.

We’ll open Anaconda Navigator and load our environment file.

Anaconda Navigator Environments window

Import new environments window for Anaconda

Now we need to copy the files we’ll be working with into your working directory – maybe create a folder specifically for this course. The files we’ll use are flights.csv, ReadMe.csv, Terms.csv and Tracks.xlsx.

Various csv files in folder

Now we can get started with pandas!

In Anaconda Navigator make sure you have the pandas environment selected before launching Spyder.

Anaconda Navigator with pandas environment selected and Spyder being launched

Save your new file in the same directory as the spreadsheets so they’ll be easy to access.

intro.py file being saved to folder

Now we’ll start by importing pandas, and then numpy, which we’ll use to populate data.

import pandas as pd
import numpy as np

Creating a DataFrame:

A DataFrame is like a spreadsheet – a 2D table with rows and columns, very similar to one you’d open in Excel. Using pandas gives us a lot more power in how we can work with them.

Let’s start by creating a DataFrame from data we have.

Remembering it’s a 2D table, a way we can define a DataFrame is to make a dictionary first, then give that to pandas to convert into a DataFrame. So let’s make a python dictionary. Usually we use df to mean DataFrame:

df_data = {

}

Within this dictionary, each key will be a column, with their values being the row data in that column.

To see how this looks, let’s populate a column with some random values using numpy.

df_data = {
'col1': np.random.rand(5)
}

This will generate a single column with 5 rows. Let’s convert it to a DataFrame to see what it looks like before we add more columns.

df = pd.DataFrame(df_data)
print(df)

If you run this, you’ll see our DataFrame printed to the console populated with 5 random values thanks to numpy.

Data Frame code ran in Spyder

Let’s go ahead and make another couple of columns and run it again:

df_data = {
'col1': np.random.rand(5),
'col2': np.random.rand(5),
'col3': np.random.rand(5)
}

Extra columns being run with Pandas

Now you can see how we can give data to pandas in the form of dictionaries and how they’ll look when converted into DataFrames, including how pandas automatically indexes our rows for us.

Pandas data frame with row numbers circled

Fetching Rows:

We can deal with it much like you would a list, using indices to return the values we want. Try printing only the contents of the first 2 rows.

print(df[:2])

Fetch script by indices run within SPyder

Remember: indices start at 0 and go up to but don’t include the final stated index. So if we only wanted the first row, we would use [:1].

Fetch script in Spyder finding first row of Pandas data frame

Fetching Columns:

This is only slightly different – instead of numbered indices, we refer to the column’s name. Change your print command to fetch all the data from the first column only.

print(df['col1'])

Fetch script finding first column in Pandas data frame

You’ll see it also helpfully returns the type of data contained in that column’s rows.

Datatype reference circled for Python command

Documentation:

Go to https://pandas.pydata.org/ and click on the documentation link.

Pandas website with documentation link circled

Pandas documentation

Here we have all the documentation we need to understand and use pandas. If you’re ever stuck, try here!

Under User Guide, click on IO Tools (Text, CSV, HDF5, …), and you’ll find all the functions that we can use to read CSV files.

Pandas documentation with IO Tools pointed to

Pandas documentation for IO Tools

Now if you click on any function, say, pandas.read_csv()

Pandas documentation for pandas.read_csv()

Pandas documentation for pandas.read_csv()

You can see all of its specific documentation, explaining what you can do and how. In this case, pandas.read_csv() starts with a filepath_or_buffer parameter, and then has all kinds of other options you could use to customize that function.

Elsewhere in the User Guide there are sections that can also help you with using pandas to read from JSON, Excel, HDF5, and more! There’s also a search bar under the Table of Contents so you can search for specific function names, so if you ever get stuck or want to improve your pandas skills, this is a great place to start!

Challenge:

This challenge will help you get used to using the pandas documentation to solve problems.

Your aim is to select multiple columns from the DataFrame we made – columns 1 and 2.

Clue: When looking in the documentation, the answer will be somewhere near the top of the page called Indexing and Selecting Data. Try to figure out how to do this on your own before reading on.

Solution:

In case you didn’t find it, the part of the documentation you need is under the subtitle Basics and begins “You can pass a list of columns to [] to select columns in that order.”

Pandas User Guide link pointed to in Table of Contents

Pandas table of contents in documentation with Indexing and selecting data link highlighted

Pandas indexing and selecting data documentation

Example of indexing and selecting data in Pandas

In the above example, columns B and A are selected and the data in their rows are swapped.

So, to get the data from multiple columns, instead of a string, we can use a list:

print(df[['col1', 'col2']])

Pandas printing 2 columns of data

Part 2

In this lesson we’ll learn how to read data from CSV and Excel into Pandas and save as a Pandas data file.

There’s a function in Pandas that we can use for this that’s easy to use and handles all of the formatting needed – all we have to do is give it a file name!

Make a new file in Spyder in the same directory as the project files.

Start by loading Pandas:

import pandas as pd

Next we’ll have Pandas read an Excel spreadsheet. Open Tracks.xlsx in Excel and let’s have a look at what we’ll be working with first.

Excel Data for music

We can see that it’s a list of songs with various related data, such as album ID, Artist and price.

The Pandas function we’re using is read_excel, and we want it to load the file Tracks.xlsx.

tracks = pd.read_excel('Tracks.xlsx')

We may also need to choose which sheet within the Excel Workbook we want to load, as we can only load one sheet per DataFrame. If you have multiple sheets, you’ll need to create multiple DataFrames, each utilizing this function. If we don’t specify which we want, it’ll select the first sheet by default.
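As a quick sketch (the sheet name here is an assumption about the workbook), you can select a sheet by name rather than by index, or pass sheet_name=None to get every sheet back at once as a dictionary of DataFrames:

import pandas as pd

# Select a single sheet by its name (assumed to be 'Tracks' here)
tracks = pd.read_excel('Tracks.xlsx', sheet_name='Tracks')

# Or load every sheet at once: sheet_name=None returns a dict that maps
# each sheet name to its own DataFrame.
all_sheets = pd.read_excel('Tracks.xlsx', sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(name, sheet_df.shape)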

Excel with one Tracks sheet

Excel with two data sheets

We can specify which sheet either by using the actual name of the sheet or using a zero-indexed integer. In this case there is only one sheet in the workbook so it’s not necessary, but it’s worth practicing!

tracks = pd.read_excel('Tracks.xlsx', sheet_name=0)
print(tracks)

Printing tracks data with Pandas and Python

There’s a lot of data here! In addition to printing our sheet, this tells us how many rows and columns the sheet has.

Printed tracks with number of rows and columns mentioned

Pandas seems to have removed the middle columns and rows, but it hasn’t actually. This is purely for display purposes to make it easier for us given the quantity of data we’re working with – the data is still there, don’t worry! This is obviously very convenient given that our sheet has over 3,500 rows!

Pandas showing some data was cut with pink bar

If you’re ever concerned, you can ask for those missing columns and rows. Let’s do that now by printing the columns.

print(tracks.columns)

Printing columns labels with Pandas

We can now see that it has all the columns we wanted. We can make extra sure by asking for the contents of an individual column. Let’s print out the milliseconds column:

print(tracks['Milliseconds'])

Printing milliseconds column with Pandas

And there it is, abbreviated as before with dots in the middle, and with some helpful information at the bottom: the name of the column, the number of rows, and the data type.

Name and length of Milliseconds column circled

Now let’s do the same with a CSV spreadsheet – flights.csv. Have a look at it before if you like. It contains lots of information about flights during 2017. Instead of read_excel, we’re using read_csv – other than that it’s exactly the same:

flights = pd.read_csv('flights.csv')
print(flights)

Run this. You may have to wait a little – it’s a big file!

Pandas printing the flights data

You’ll notice that we have a similar issue. With 600,000 rows and 25 columns, the data we actually have printed out is reduced nicely. That’s good, but there seems to be a problem – the data isn’t matching up with the column names.

Year data highlighted to show how Pandas condensed data

This is because when we load a CSV like this in Pandas, unlike with Excel files, it will try to use the first column as the index, which is why everything’s been offset by one. To resolve this, we need to add a parameter:

flights = pd.read_csv('flights.csv', index_col=False)
print(flights)

Year column from flight data highlighted in teal

Fixed! Once again, if we want to check on those apparently missing rows and columns we can.

print(flights.columns)

Column names printed from flights data with Pandas
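
One more quick sanity check, if you want it, is to peek at just the first few rows instead of the whole truncated print-out (a minimal sketch):

# head() returns the first rows (5 by default); shape reports (rows, columns)
print(flights.head())
print(flights.shape)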

 

Transcript 1

Hello world and thanks for joining me. My name is Mohit Deshpande and in this course, we’re going to be learning how to manage and analyze data using PANDAS, a library called PANDAS for data analysis.

So we’re gonna learn a lot about how we can read-in data from sources, and then manipulate it so we can use it for further data analysis. So some of the things that we’re going to be learning about in this course, is we’re gonna learn about DataFrames, which are how we can store data in PANDAS so we can later use them for any kind of analysis. We’ll learn how to read information from CSV files and Excel files. We’ll learn all about how to select, sort, filter our data and then we’re also gonna get into how we can do different kinds of grouping and data aggregation as well.

So we’re gonna be learning a lot of different things that center around this library data science, and data analysis library called PANDAS, and we’re gonna really learn how to use this, so that we can do further data analysis on the data that we have.

We’ve been making courses since 2012, and we’re super excited to have you on board. Online courses are a fantastic way to learn new skills, and I take a lot of courses myself. Zenva courses consist mainly of video lessons that you can watch and re-watch as many times as you want. We also have downloadable source code and project files and they contain everything that we build in the lessons. It’s highly, highly recommended that you code along with me. In my experience, it’s the best way to learn something is to kinda get your feet wet or get your hands dirty.

And lastly, we’ve seen that the students who get the most out of these online courses are those that make some kind of weekly plan, and stick with it, depending of course on your own availability and learning style. So Zenva over the past six years has taught all different kinds of topics on programming and game development. Over 300,000 students, over 100 courses. The skills that they learn in these courses are completely transferrable to other context and domain. In fact, some of the students have used the skills that they learned in these courses to advance their own careers, to make a startup, or to publish their own content from the skills that they’ve learned.

Thanks again for joining, and I look forward to seeing all the cool stuff you’ll be building. Now without further Ado, let’s get started.

Transcript 2

In this video we’re going to get started with Pandas, and so learn a little about what the fundamental data structure for Pandas is as well as learning how to access a data using a thing called data frame. But first of all what we’ll need to do is download the source code and we’ll need to copy them into whatever working directory on your computer. ‘Cause inside here we have these files that we’ll need. So you’ll have to copy these files over into your working directory. So I’ve already done this and created a folder called Pandas and I’ve copied over all of these things that we’ll need.

You’ll want to open up your Anaconda Navigator and make sure you have the right environment selected and then we’re going to launch Spyder. Okay, so lets get started. So we’ll need to import Pandas, import Pandas as pd. And I’m also going to import NumPy as np just so that we’re going to use it to populate data.

We’re gonna talk a little bit about Pandas, what the fundamentals data structure behind Pandas is. And the thing is called a DataFrame. And a DataFrame you can think of it as just being a single spreadsheet. A 2D table with rows and columns. So let’s just create a DataFrame from just some data that we have. So remember it’s a 2D table. So how I can define a dataframe is I can use a dictionary first, then give it to Pandas and say Hey, can you convert this dictionary into a dataframe. Each of the keys are going to be columns and the values are going to be rows. Column one, now I can just populate it with some random values, np.random.rand five.

Essentially what I have done is created a single column and it has five rows. So let’s just create this dataframe first so that we can see it. df equals, I can just create one by saying pd.DataFrame and I just pass it in this dictionary. Pass in now I’ve created a DataFrame. Let’s see what this looks like. I can run this guy and you’ll see I have column one and then just some random garbage values. So let’s go a head and create another column, create another one and you’ll see now we have three columns. And so this is just how we can give into or you can give data to Pandas into a dataframe just by using this dictionary where the keys are the columns and the values are going to be the actual values for that column.

We’ll see if we can fetch some rows and how we can fetch some columns. Index it like you would a list. So I’ll say, let’s get some, I’ll run this and you’ll see that we’ll get the first two rows because remember that this goes, we start at zero and we go up two but not including this index. So we get zero and one. Instead we need to see what the column name is. So let’s do col1 and what that will do is when I run it print out all of the rows the entire column, all the rows for this particular column and it even goes so far as to tell me the data type. In order to fetch multiple columns we actually use a list inside here.

Suppose I want to fetch the first two column I can say something like print df and then inside of here instead of doing just quotes here I can do a list, I’m indexing and I’m giving it a list to index on. So I can say col and col2. So now if you see we have extracted two columns.

Transcript 3

In this video we are going to learn how we can read data from CSV and excel into pandas, into a panda’s data frame. First thing only to do is import pandas. Alright so let’s load in, let’s read an Excel spreadsheet, and to give you an idea of the kind of the data that we’re going to be loading I have the spreadsheet opened up in excel so it’s just a list of different songs.

So I’ll say tracks equals and then I just call pd.read_excel and then I can just give it a file name, so I know this that the file is called Tracks.xlsx, so this will just load our data, so it’s really this simple to load data into Panda. So let’s print this guy out and see for ourselves, so you can see that I have some data here, and it’s actually already telling me how many rows and columns I have. So because we can’t see all of the columns here let’s print out the columns just so we can verify that they’re there, so I comment this out. I’m going to say tracks.columns and we can print out.

We can print out all of the columns, so I can run this and see that we can see all of the columns that are being printed out, and additionally what I can do, and now that I have this information you know I can do something like, let’s print out all of the entire column that’s milliseconds and just do something like this, and then it’ll print out all of the milliseconds and it’s giving me some useful information, like the name of the column is, how many rows we have and then what the data type.

What we can do is see how we can read a CSV, so I can load this guy up just by saying pd.read_csv and I have to give it the CSV file, so flights.csv and then we can do the same thing let’s just print this out just so that we can have some idea of what’s going on. But you can see that we have 600,000 rows and then 25 columns, so it’s a pretty big data set. If we can see if I can expand this out a little bit, alright so it’ll say year, month, but wait a minute this isn’t quite right, cause this should be the year, so it seems all the columns are offset by one. So this isn’t good, and this is because when we’re loading something like this in Panda is what it’s going to try to do is find, use this first column as the index, and we don’t want it to do that.

We want to just have natural, natural indexes, so just zero, one, two, three and so on, and so on. So what I can do is just use a parameter here and say index_col=False, and so now let me run this. Alright so now let’s see what we have. So okay, this seems to be, this seems to be promising alright, so this is the correct column for the year.

This is the correct column for the months, so month one being January, and now you got the indexes are correct so it’s zero up to 59,999 because remember it’s zero index. Flights.columns and we can see all of the columns, so that’s how we can read excel and CSV spreadsheets in Pandas.

Interested in continuing? Check out the full Bite-Sized Pandas course, which is part of our Bite-Sized Coding Academy.

]]>
A Bite-Sized Guide to SQL https://gamedevacademy.org/bite-size-sql-tutorial/ Fri, 30 Aug 2019 15:00:44 +0000 https://pythonmachinelearning.pro/?p=2564 Read more]]>

You can access the full course here: Bite-Sized SQL

Part 1

In this lesson you will learn how to setup the development environment.

For managing our dependencies we are going to use this tool called “Anaconda.” Essentially you can use Anaconda to manage our dependencies and update them very easily.

You can download Anaconda from here: https://www.anaconda.com/

The direct link is here: Anaconda Download

Choose the version of Anaconda you need based on the operating system you are using.

Anaconda website with Download button pointed to

Once you have Anaconda downloaded, go ahead and follow the setup wizard's instructions.

Once Anaconda is installed go ahead and open the Anaconda Navigator application.

Once Anaconda is opened on your system it should look similar to this:

Anaconda Navigator screenshot

This already has all our dependencies bundled into these things called environments.

So select the Environments Tab from the Anaconda menu.

Anaconda Navigator with Environments -> deeplearning selected

At the bottom you can see a tool bar to create and import environments.

Anaconda Navigator environments Create environment button

You can change your environment by clicking on one of them in the menu. So if you clicked on the deeplearning one, you will then see all of the installed packages listed on the right.

Anaconda environment changed to deeplearning

In the projects that we will be working on for this course, you will be provided with an environment file. All you will need to do is import the environment file provided, and Anaconda will create the environment and download and install all the dependency packages.

You can download the environment file from the Course Home page (NOT AVAILABLE IN FREE WEBCLASS).

The file is called “my-awesome-project.zip” and once you have it downloaded you will need to extract it. Once extracted you will have one single file called “environment.yml”

environment.yml file for Anaconda environment

The environment.yml file lists all of the Python package dependencies and their version numbers.

All we need to do now is import the environment file. From the Anaconda Navigator Environments tab, click Import; that will pull up a window where you can select the environment file, give it a name, and then click the Import button.

Import new environment window for Anaconda

After it's done loading, you will see it in the list of environments:

Anaconda Navigator environments window

If you click on the green arrow next to my-awesome-project, a drop-down menu will appear; from this menu you can open the environment in a terminal or with Python.

Right-clicked environment with Open options highlighted

Part 2

What is a Database

A high level definition of a database is that it is a highly efficient way of storing data for querying and data management. Querying is asking questions about the data I have. Another way to look at databases visually is to think of them as a spreadsheet workbook such as below.

Excel sheet with various data

A workbook may have many sheets, such as Track and Album, representing data about music tracks and the albums they are part of. In database lingo, these would correspond to two different tables.

Excel music data sheet with column titles highlighted

In the worksheet above, the header row (highlighted) tells us what fields (similar to columns in a database table) are present in the data. The worksheet rows below the header contain the actual entries (similar to rows in a database table).

Here are the header row (columns) and entries (rows) from the Track sheet (table).

Excel data sheet with one row circled

Track, Name, Album, Media, Genre, Composer etc. represent database columns e.g. the Composer column contains names of all the composers for the various tracks. The data in the entries below the columns (such as the red highlighted entry) represent rows.

Conceptually, we can think of data in a database as being organized as in the spreadsheet, where each sheet represents a table, the sheet header represents the columns, and the sheet entries represent rows. Databases are really fast and efficient at storing and retrieving millions of rows, and we will soon see an experiment that demonstrates the huge time difference between using a database to answer a query versus using a CSV or text file to answer the same query.

Managing Data

We manage data in a database using SQL (Structured Query Language), also pronounced sequel. We can use SQL to query the database for data, add rows, change or delete rows. We can also use it to create new tables.
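
As a rough preview of what managing data with SQL looks like in practice, here is a minimal sketch using Python's built-in sqlite3 module. The database file, table, and column names below are made up for illustration – they are not the course files:

import sqlite3

# Hypothetical database file, table, and columns – for illustration only
conn = sqlite3.connect('sketch.db')
cur = conn.cursor()

# Create a table, add a row, query it, then change it
cur.execute('CREATE TABLE IF NOT EXISTS tracks (id INTEGER PRIMARY KEY, name TEXT, composer TEXT)')
cur.execute('INSERT INTO tracks (name, composer) VALUES (?, ?)', ('Some Song', 'Some Composer'))
cur.execute('SELECT name, composer FROM tracks WHERE composer = ?', ('Some Composer',))
print(cur.fetchall())
cur.execute('UPDATE tracks SET name = ? WHERE name = ?', ('Renamed Song', 'Some Song'))

conn.commit()
conn.close()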

Kinds of Databases

There are several kinds of databases. We will be looking at SQLite (https://www.sqlite.org/index.html), which is really simple and easy to set up and use out-of-the-box. Its size and compactness make it very popular in mobile application development on platforms such as iOS and Android. Different databases may have their own proprietary operators. However, the SQL we are going to write here for SQLite is transferable to larger databases such as MySQL as well. MySQL is very popular in industry applications.

Why Databases, Experiment

It's worthwhile discussing why we need a database for storage, with SQL for querying, as opposed to just using a CSV file. Let's look at an experiment comparing these approaches: two Python scripts retrieve the same data (a list of orders), one from a SQLite database and one from a CSV file. Both the database (a single table) and the CSV file have the same order information (approximately 500,000 entries – a small number in comparison to large commercial databases).

Example.db highlights in Terminal

data.csv – order data in CSV format

example.db – order data in SQLite format

db.py – script that retrieves data from example.db

txt.py – script that retrieves data from data.csv

The query we will be using is a classic, commonplace query used all the time when buying from a website: fetch the order whose id is <order_id>. We will be using Python to ask both the text file and the database for this information. In the case of the database, we will use Python's SQLite API. We will time both approaches using the UNIX time command. The results are below.

CSV file

$ time python3 txt.py

CSV file query time results

The query takes roughly 2000 ms.

SQLite Database

$ time python3 db.py

SQLite database query time results

The SQLite query takes about 64 ms. We can see the huge difference in retrieval time between a text file and a database for just 500,000 rows – and in industry, databases with millions of rows are common. Databases store data very efficiently, allowing us to perform very powerful queries against it. This is why we use databases.
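
The two scripts themselves are not reproduced here, but a rough sketch of what they might look like is below. The order id, CSV column name, and table/column names are assumptions made for illustration:

import csv
import sqlite3

ORDER_ID = '123456'  # hypothetical order id we are looking up

# txt.py-style approach: scan the CSV row by row until the id matches
def find_order_in_csv(path='data.csv'):
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            if row.get('order_id') == ORDER_ID:  # assumed column name
                return row
    return None

# db.py-style approach: let SQLite look the row up directly
def find_order_in_db(path='example.db'):
    conn = sqlite3.connect(path)
    cur = conn.execute('SELECT * FROM orders WHERE order_id = ?', (ORDER_ID,))  # assumed table/column
    result = cur.fetchone()
    conn.close()
    return result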

 

Transcript 1

Hello, world, and thanks for joining me. My name is Mohit Deshpande, and in this course we’ll be learning about querying databases. The databases are ubiquitous and pretty much everything that we do nowadays, every time you go to a website, odds are there’s a database running somewhere on the backend that’s helping manage all of your data. Definitely anytime you log into the website, there’s databases going on that has all your login credentials and managing all that information.

So they’re ubiquitous, especially in a company, any kind of large company you’ve heard of, they definitely have and use a database on a regular basis. It’s kind of the backbone infrastructure of their entire operation, and we’re gonna be learning how we can query these databases.

So, primarily we’re gonna be focused on how we can query data. In other words, how can we ask questions about the data that we have stored in our database. So we can do interesting things like ask our database to retrieve a list of all of our purchases that we’ve made over the past six months that have cost more than $35. This is a kind of example query that you might hear about when you visit any kind of retail site, and this is what we’re gonna be learning about.

We’ll learn about how to query data, and we’ll learn about how we can do different kinds of sorting of the data that we already have. We’ll learn about grouping and aggregation, and then finally how we can take data from multiple, we can take different kinds of data and merge it into one combined information across different databases.

We’ve been making courses since 2012, and we’re super excited to have you on board. Online courses are a fantastic way to learn new skills, and I take a lot of online courses myself. And the courses consist mainly of video lessons that you can watch and rewatch as many times as you want. We also have downloadable source code and pocket files and they contain everything that we’re working on during the lessons.

It’s highly recommended that you code along with me. In my experience, that’s the best way to learn something – to get your feet wet, so to speak. And lastly, we’ve seen that students who get the most out of these online courses are those who make a weekly plan and stick to it, depending on your own availability and learning style, of course.

So Zenva, over the past six years or so, has taught all different kinds of topics on programming and game development to over 300,000 students over 100 courses. The skills that they’ve learned in these courses and guides are completely transferrable to other domains, as well. In fact, some of the students have used these skills to advance their own careers, to start a company, or to publish their own content from the skills that they’ve learned. Thanks again for joining, and I look forward to seeing all the cool stuff that you’ll be doing.

Now without further ado, let’s get started.

Transcript 2

So for managing our dependencies, we’re gonna use this really awesome tool called Anaconda. Anyway, we go anaconda.com. On the right here there’s this green button called Downloads. So we go to Downloads. You wanna download the right one. This is the latest version of Python that we have. So we’ll click Downloads and that’ll install both of the steps and it will install Anaconda.

And we’ll get this application called Anaconda Navigator. It’ll look kinda something like this. You might not have the same packages as I do installed. This already has all of our dependencies bundled into these things called environments – that I’ve already mentioned.

So let’s go ahead and go over to the environments. And you’ll see. I already have one for deep learning that I like to use. Here is list of environments. And bottom here, there’s buttons to create and import an environments. You can change which environment you are using. In other words, you’re changing all the dependencies that you have just by clicking on one of these guys. And you’ll see that we will have changed our environment.

So now they are here all. I go installed. Here are all of the packages. All the dependencies and packages, Python packages that I use for deep learning for example. We’ll have an environment file for you to download. So in this case, this is a Zip file called my-awesome-project.zip. And let me open that guy up. So you’ll want to extract this guy here.

Okay so inside of this Zip file there’ll be a single a file called environment.yml. In Anaconda navigator, we’ll import. And then it’ll pull up this little dialogue box that says File to import from. Just click on this little folder icon. And then navigate to the environment.yml file. And I just click on this. Click open and you’ll see it’s loaded up with the appropriate name, and I can just click import and in just a minute or so, all update dependencies and packages will be downloaded and installed in my environment.

One other thing that all discussed before we go is – you see this little green arrow. If we click on that, you’ll see we’ve got a couple of options here we would say. Open Terminal or open with Python. Those that either savvy with the Terminal, notice when I open Terminal, what will happen is I get a Command Prompt that’s already set up to use my-awesome-project. You can use this for. If you doing any kind of command line stuff with Python. In our environment, is where you want all of the dependencies to be working. All you have to do is click the green arrow and go to Terminal.

Transcript 3

So first of all, what is a database? A database is a very efficient way to store highly structured data for querying and data management. What I mean by querying is really just question and answering, so I have a ton of data and I wanna ask a question about it. You can conceptually think of the data stored inside of a database as being kind of like in a workbook like this. With a workbook, we have these different, down here at the bottom, you can see we have two different sheets.

In database lingo, these would be considered two different tables. We have a header row here that tells me what each of the entries looks like, just like we would expect in a regular table. Database lingo also uses the same conceptualization of rows and columns. So a row would be a slice this way, and rows are just entries inside of these tables. And then columns represent slices across all of the rows. So here’s a Composer column and inside of this column, it has all of the different composers.

There are a ton of different kinds of databases. The one that we’re gonna be looking at is called SQLite. So I have a little experiment, an example.db SQLite file, and I have a data.csv, which is just a text version of this database, and two Python scripts just to make it fair. And this db.py, what it’s going to do is run the query against this database, and the txt.py is gonna run a query against the CSV file which is this plain text. So the query that we’re asking – so this database, by the way, I should mention that the database and CSV contain a list of different orders.

And so the question I’m gonna ask is if it would fetch me all the information you can about the order whose ID is blank, and whatever the order ID I chose. So we’re gonna time both of these and see how long it takes.

So let’s first time the TXT file. So there’s UNIX command time that I can just run, time python3 txt.py. So if you run this guy, the real-time is how much time, like a wall clock, it’s called wall clock time, so how much time has elapsed overall. So we see it’s about two seconds. And we can convert this into milliseconds so it’ll be useful. So it’s about 2,000 milliseconds. So that’s how long it would take. And I should mention that the number of rows we have in this table that we’re running the query against is a little over half a million rows, which we’ll keep that in mind.

Now I’m gonna show you how fast we can get this working using the database. So I can run this guy, and you see this? Not even one second, not even .1 seconds, but it ends up taking 64 milliseconds. So you can this huge difference between using a TXT file and using a database.

Interested in continuing? Check out the full Bite-Sized SQL course, which is part of our Bite-Sized Coding Academy.

]]>
A Bite-Sized Guide to NumPy https://gamedevacademy.org/bite-size-numpy-tutorial/ Fri, 28 Jun 2019 15:00:27 +0000 https://pythonmachinelearning.pro/?p=2513 Read more]]>

You can access the full course here: Bite-Sized NumPy

Part 1

In this lesson you will learn how to download and install Anaconda and Jupyter Notebooks on your Mac.

What is Anaconda?

  • Anaconda is a distribution software that provides everything a user would need to start Python development.
  • It includes:
    • Python language
    • Libraries
    • Editors (such as Jupyter Notebooks)
    • Package manager

What are Jupyter Notebooks?

  • A Jupyter Notebook is an open source web application that allows users to share documents with text, live code, images, and more.
  • We will use this to write code as it provides an interactive and easy to use interface.

Download and Install Anaconda

Anaconda is available for download from the following link: https://www.anaconda.com/distribution/#download-section

Here is the Direct Link: Anaconda Download MAC

Anaconda website with Mac download selected

Once you download the file go ahead and open it, and follow the instructions in the installation wizard. 

Anaconda3 Installer window on Introduction

Just follow through the prompts and choose where you want to install Anaconda on your system.

Anaconda3 Install wizard preparing for installation

You do not need to install Microsoft Visual Code. So just hit the Continue button at that prompt.

Anaconda installation wizard when skipping Microsoft VSCode

Once Anaconda is installed on your Mac system go ahead and open the Anaconda-Navigator from your applications menu.

Anaconda-Navigator icon on desktop

Once Anaconda is opened it may take a few moments to initialize.

Once it’s open, you will see a few options to choose from, but we will be using the Jupyter Notebook launch button. So go ahead and select the Launch button.

Anaconda Navigator with Jupyter Notebooks selected

Once the Jupyter Notebook loads it should look like this:

Jupyter Notebooks home screen on Files tab

There will be a list of files and directories, and you can go to the Desktop then Jupyter Notebooks.

Desktop/Jupyter Notebooks directory

From here navigate to the New button and select Python 3.

Jupyter Notebooks New button with Python 3 selected

This will open up a new Jupyter notebook and in the cells, you can write in the code.

These cells can contain text, or code. So we will choose code from the drop down menu.

New Jupyter Notebook file

So we now have Anaconda downloaded and installed for Mac.

Part 2

In this lesson you will learn how to download and install Anaconda and Jupyter Notebooks on Windows.

What is Anaconda?

  • Anaconda is a distribution software that provides everything a user would need to start Python development.
  • It includes:
    • Python language
    • Libraries
    • Editors (such as Jupyter Notebooks)
    • Package manager

What are Jupyter Notebooks?

  • A Jupyter Notebook is an open source web application that allows users to share documents with text, live code, images, and more.
  • We will use this to write code as it provides an interactive and easy to use interface.

Download and Install Anaconda

Anaconda is available for download from the following link: https://www.anaconda.com/distribution/#download-section

Here is the Direct Link: Anaconda Download Windows

Anaconda website with Windows download selected

Once you download the file go ahead and open it, and follow the instructions in the installation wizard. 

Anaconda 5.2.0 setup screen on Windows

Just follow through the prompts and choose where you want to install Anaconda on your system.

Destination Folder chosen for Anaconda during installation

You do not need to select any of the Advanced Options – just select the Install button without checking any of them.

Anaconda advanced options not selected during setup

You can Skip the install of Microsoft VSCode.

Once Anaconda is installed on your Windows system go ahead and open the Anaconda-Navigator.

Once Anaconda is opened it may take a few moments to initialize.

Once it's open, you will see a few options to choose from, but we will be using the Jupyter Notebook launch button. So go ahead and select the Launch button.

Anaconda Navigator with Jupyter Notebooks selected

Once the Jupyter Notebook loads it should look like this:

Jupyter Notebooks home screen on Files tab

There will be a list of files and directories, and you can go to the Desktop then Jupyter Notebooks.

Desktop/Jupyter Notebooks directory

From here navigate to the New button and select Python 3.

Jupyter Notebooks New button with Python 3 selected

This will open up a new Jupyter notebook and in the cells, you can write in the code.

These cells can contain text, or code. So we will choose code from the drop down menu.

New Jupyter Notebook file

So we now have Anaconda downloaded and installed for Windows.

Part 3

What are Numpy Arrays?

  • Numpy arrays are like lists with extra functionality
  • Often used in data science and machine learning
  • Very powerful with many built-in functions written for maximum efficiency

The different ways in which NumPy arrays can be created:

  • Pre-assigned values
  • Zeroes
  • Ones
  • Range of values
  • Linspace

Open Jupyter Notebook in the Anaconda Navigator menu by selecting the Launch button.

Anaconda Navigator with Jupyter Notebooks selected

Start a new Jupyter Notebook using Python 3.

Jupyter Notebooks New button with Python 3 selected

Rename this to “Creating Numpy Array.”

Jupyter Notebook renaming window

We will start with the first cell being a Header so select Markdown from the drop down menu:

Markdown selected for first cell

The Header will say “Creating Numpy Arrays”

So for starters, you may have seen lists in Python before. In this example we will create “list_1.”

See the code below and follow along:

list_1 = [1, 2, 3]
print(list_1)

So we will print list_1 and then select Run from the menu:

Jupyter Notebooks with Run button selected and numbers printed

NumPy arrays are set up not so differently. A NumPy array takes in a list as an argument and converts that list into a NumPy-specific array, which attaches a bunch of extra functionality to it. However, before we can actually do anything, we'll want to add numpy as an import. The reason we need to do this is that numpy isn't part of Python's built-in library. So we simply add the numpy import to cell two.

See the code below and follow along:

import numpy as np

Make sure you run cell 2.

Cell 2 with Importing NumPy script

This gives us access to that library.

So let’s say we want to create a numpy based on list1 that we created in cell 1.

See the code below and follow along:

numpy_1 = np.array(list_1)
print(numpy_1)

Run the cell.

Run button pointed to in Jupyter Notebooks

We again got 1, 2, 3 printed out just like before.

You will notice the format is a little different.

The first cell produced a list, but the third cell produced a NumPy array. Earlier, it was mentioned that we can create arrays of zeros or ones. This is very often a starting point in topics such as machine learning: if we know that we want an array of, say, size 20, we might initialize it to 20 zeros or 20 ones, and then go and modify each element.

The way we do this is by calling the zeros or ones function. So we could create, for example, a zeros array. See the code below and follow along:

zeros_array = np.zeros(5)
print(zeros_array)

Run the cell.

Jupyter Notebooks with Cell 4 run

So we get an array of five zeroes.

We can do the same thing for ones. See the code below and follow along:

ones_array = np.ones(10)
print(ones_array)

Run the cell.

Jupyter Notebooks with Cell 5 run

So we have the array of ones that was printed out for us.
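
Since these zero- and one-filled arrays are usually just placeholders, you can then assign into individual elements. A small sketch, continuing from the cells above:

placeholder = np.zeros(5)
placeholder[0] = 3.5   # overwrite the first element
placeholder[-1] = 7    # overwrite the last element
print(placeholder)     # prints the array with the first and last elements changed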

What if we want a range of values, with maybe a start point, an end point, and a step interval? Let's start by specifying just the end point. See the code below and follow along:

range_array_1 = np.arange(5)
print(range_array_1)

Run the cell.

Jupyter Notebooks with cell 6 run

So we get five elements printed out, but this might not be what we expected: the “5” we passed to “arange” is the upper bound, so it stops before it reaches five. We were only specifying the end point.

So if we want to specify a start point as well, let’s say we want the elements five to 10. See the code below and follow along:

range_array_2 = np.arange(5, 11)
print(range_array_2)

Run the cell.

Jupyter Notebooks with cell 7 run

Now let’s say we want to step by twos. So maybe we want to start at zero go until 20 and we want to skip every other element. We can do that be specifying a third parameter being the step. See the code below and follow along:

range_array_3 = np.arange(0, 20, 2)
print(range_array_3)

Run the cell.

Jupyter Notebook with cell 8 run

The final thing we will cover is the linspace function. Linspace is short for linear space. With linspace we basically give a lower bound, an upper bound, and the number of elements we want it to populate between those two bounds. See the code below and follow along:

linspace_array = np.linspace(0, 10, 5)
print(linspace_array)

Run the cell.

Jupyter Notebook with Cell 9 run

So what this has done is created an array of five elements evenly spaced between zero and ten. Again ten inclusive, zero inclusive.

This will be really useful to us when it comes to graphing values, because we can just specify an upper bound, a lower bound, and how many points we want in between the two.
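
For example, here is a minimal sketch of that graphing use case. It assumes matplotlib is installed (which is not covered in this course):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)  # 50 evenly spaced points between 0 and 10
y = x ** 2                  # some function of x to plot
plt.plot(x, y)
plt.show()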

Transcript 1

We’ll simply search for download Anaconda, and we’ll want to select this first link here, this is conda.io. So what we wanna do is download Anaconda here, so we’ll click on this link and this brings us to the Anaconda website.

So there’s a couple of options to choose from depending on whether you’re using Mac, Windows, or Linux. So, we’ll just simply click on this link (unless you really want Python 2.7). I would actually recommend that you use Python 3.6 anyway. So, we’re gonna go ahead and download this. By the way, if you are using a PC, then actually feel free to skip to the next section. We’ll be showing you the same stuff for PC.

Alright, so it looks like that’s finished downloading. We’ll simply double click on this, and it should bring up an installation wizard. So let’s go and bring up the installer. So, we’ll just go ahead and continue. Alright, so it just took a few minutes, and it’s done now. We’re not gonna bother installing Microsoft Visual Studio Code or anything like that, so let’s not worry. We’ll just continue and go ahead and close this; we can simply move that installer to trash.

Okay, so now what we’ll want to do is just go into our applications, and you should see Anaconda-Navigator. So, we are going to select this guy here. There are a few options to choose from, but we’ll definitely want Jupyter notebook here. This is version 5.5.0 at this time. So let’s go ahead and actually just launch this. What this’ll do is start a new instance of Terminal, and actually, weirdly enough, we can go to the Terminal itself and just type Jupyter notebook, and we’ll start notebook that way, but we might as well use Anaconda. It’s a nice piece of software to use anyway.

So as you can see here, this actually brings up a list of a bunch of my files and directories. So what we’re gonna do is actually go to Desktop here. This is just my file system on my computer, and we’re just gonna go into Jupyter Notebooks. So from here, I can simply select a new notebook. I’m going to go for Python 3. Otherwise, you could do like Text File, Folder, Terminal. What we’ll actually want is a Python 3 notebook.

This now opens up a new Jupyter notebook and a new tab here, and if we navigate back here, you can see that there is a new notebook IPYNB. This is a Jupyter notebook here. Now, we haven’t given it a name. This is running right now so we can go ahead and give this a name if we want up here. And then, we will just write a code in these little cells. So these cells can contain text or code which have we choose. We can choose Code, Markdown, NBConvert, or Heading. Typically, we’ll just go with Markdown for regular text, and we’ll just write- but we’ll choose the Code one for some Python code that we can actually write and run.

So let’s say for example, we just wanted to actually just print out a variable. I could create a variable code like int1 and I could set this equal to five, and then I could print int1. Okay, If I go to “Run this Cell”, then it simply runs all of the code that’s in this particular cell and starts a new one.

Transcript 2

Anaconda downloads page. If you’re not sure where to find this, then you can simply search for Download Anaconda. Okay, we’ll go to this first one. Then, we’re going to go down to Download Anaconda for free, and it should take you right to that page here. So what we’ll want is the Windows Installer, so let’s go ahead and select that. I would recommend going with the later version 3.6, as there are a bunch of tools that aren’t included in 2.7. If you’re really dead set on 2.7 you can use that, but I would, again, recommend 3.6.

So let’s just go ahead and download that, and I’m just gonna actually pause the recording on my end and cut back once this is done.

Alright, and we are back. So it finally finished downloading (it took a few minutes), and I just opened up the executable file. Alright, and it looks like we’re finally done with the installation, so let’s just go ahead and finish this off. We’re gonna skip this, we don’t really want to install Microsoft Visual Studio code right now. We’ll just uncheck these boxes, unless you do indeed want to learn more about these.

So let’s go ahead and finish that up, and now let’s open up Anaconda, start a new Jupyter notebook, and then I’ll just really quickly explain what they’re all about. So we’re just gonna search for Anaconda, and we’ll want this guy here, the Anaconda Navigator. So let’s just go ahead and select it, and we’re gonna go for a Jupyter notebook here. So let’s launch that up, and it will launch a new directory kinda thing in the browser here.

As you can see, it says home and gives us a list of files and directories in the directory that Anaconda is saved in. Okay, so what we’re gonna do for now is just navigate to the correct directory that you want, and we’re just gonna select a new Python 3 notebook. So what this’ll do is it’ll start a new Jupyter notebook. This is what a default one looks like; you can give it a title by clicking up here, and there’s a few options to choose from as far as actions go. You can see that there’s this cell here, and the way Jupyter notebooks work is that basically we write and run code in individual cells. That being said, the code that we write in one cell persists into the next cell.

Let’s say, for example, I’m just gonna create a quick markdown just to show you what this is. It’s just to hold some text, so it’s kind of like a text holder. Okay, we run this cell to complete it, okay? Because this is just text, there’s nothing to run. But let’s say I created a variable called variable one, I set its value equal to five, I run this, and unless there’s a problem with the code, it should compile and run just fine. Now if I go to print variable one, that code should persist between the cells, so if I run this, it will execute the code found here, and it will start me a new cell.

That’s how Jupyter notebooks work. It’s a really nice, clean way to organize things into individual cells.

Transcript 3

Okay, so for starters, you might have seen lists in Python before. For example, we could create list_1, we go to “Run the Cell”, we just get the simple array.

Well NumPy arrays are set up not so differently, so we’ll simply add the import of NumPy. We can also rename this to np as you’ll often see. So we’ll go ahead and run that cell. Nothing exciting happened because we’re not printing or calculating anything, we’re just gaining access to that library.

So let’s say I want to create a NumPy based on my list_1. So I’m gonna call this numpy_1. This is just going to actually be equal to np.array, and then we can pass in whatever value we want. So we could pass in a list of values here, or we could actually pass in a pre-made list. In our case, maybe we want to pass in list_1 and then perhaps we’ll want to print out numpy_1. So let’s go ahead and save that, and if we were to run the cell, we would again get one, two, three printed out just like before.

Now as I mentioned earlier we can also create arrays of zeros or ones. This is going to be np- not array this time- but np.zeroes. Now zeroes actually takes in an argument, and it takes in how many elements we want in this array. So say we want five zeroes here, and then we want to perhaps print out a zeroes array. And if we were to run this cell an array of five zeroes here. Again, we could do the same thing for one, and we’ll print this out, and then, of course, the array of ones here.

But what if we want to specify a range of values so we want maybe a start point and an endpoint? Np dot, and we’re gonna use this arange operator, operation rather, and this can take in several different parameters. So if we want to just specify a range from zero to some value, maybe let’s say zero to five, we can enter that. Okay, and as you can see we get five elements printed out. So if we want to specify a start point as well, let’s say we want the elements five to 10 and we’re going to set np.arange from five up until, and if we want to include 10, we would have to put up until 11. So now we go to run the cell, we get five, six, seven, eight, nine and 10 printed out.

Now let’s say we want to step by twos, so we want to maybe start at zero, go until 20 and we want to skip every other element. Well we can do that simply by specifying a third parameter, being the step zero to 20, and then we’ll want to include the step of two. So let’s go ahead and run that cell. And we get exactly what we would want here.

So linspace is a conjunction for linear space, and with a linspace we basically take a lower bound, an upper bound, and the number of elements we want it to populate between that upper bound and lower bound. This is going to be np.linspace like this, okay? And we’ll need to provide some parameters. So the first will be the lower bound. Let’s actually just go from zero. The second will be the upper bound. Let’s go to 10. So let’s see if we just pass in the value of five and then go to print. So we get zero, 2.5, five, 7.5, and 10. So what this has done is it’s basically created an array of five elements, evenly spaced between zero and 10, and again, 10 inclusive, zero inclusive.

Interested in continuing? Check out the full Bite-Sized NumPy course, which is part of our Bite-Sized Coding Academy.

]]>
Probability for Data Science Tutorial https://gamedevacademy.org/probability-data-science-tutorial/ Fri, 29 Mar 2019 04:00:12 +0000 https://pythonmachinelearning.pro/?p=2436 Read more]]>

You can access the full course here: Probability Foundations for Data Science

Part 1

In this lesson, we’re going to see an introduction to the Probability Theory.

Probability definitions

We define probability as the likelihood of some event happening.

For coin flipping, there is an equal probability of having heads or tails (1/2 each), and we represent it by the following expression:

Coin flip probability formula

Probability is usually represented by “p” and the event is denoted with a capital letter between parentheses, but there’s not really a standard notation as seen above.

The event, in turn, is some sort of action that has a probabilistic outcome. In the case of a coin, we do not know what the outcome is until we’ve flipped it.

Dice toss is another classic example where, for a 6-sided dice, we have a 1/6 chance of the dice landing on any particular side.

The probability of the dice landing on an even number, however, is equal to 3 (as there are 3 even numbers in the range 1-6) divided by the total number of sides (6):

Probability formula for rolling dice

In general, here’s how we compute the probability of an event E happening:

Generic probability math formula
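
Written out (reconstructed from the definition above, since the formula in the course is shown as an image):

p(E) = \frac{\text{number of outcomes in which } E \text{ happens}}{\text{total number of possible outcomes}}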

All probabilities go from a chance of zero to one, and a good way of understanding this is as shown below:

Line showing how to evaluate chance by probability decimal

All probabilities will never be less than zero or greater than one.

There’s also the notion of the complement of an event, which basically consists of outcomes not in the event. It can be written in various different ways:

Event complement notation

Let’s move to an example to better understand the concept of complement:

Suppose we want to compute the probability that a dice roll is not one. That is the same as the sum of the probabilities of it being numbers two, three, four, five and six (all other numbers but number one):

Dice event with X over dice with 1

Another way to compute the complement is to take 1 minus the probability of the actual event. That way, given any probability, we can immediately compute the probability of the event not happening (that is, its complement) by subtracting it from 1.

Dice event probability and its complement
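
In symbols (again reconstructed from the text, as the original formula is an image):

p(\text{not } E) = 1 - p(E), \qquad \text{e.g. } p(\text{dice} \neq 1) = 1 - p(\text{dice} = 1) = 1 - \tfrac{1}{6} = \tfrac{5}{6}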

Let’s compute the probability of the outcome of a roll of a dice being strictly greater than 4. Well, in that case, we have two possibilities (sides 5 and 6 of the dice). It results in 2 outcomes out of the 6 total possible outcomes, which can be reduced to a chance of 1/3:

Probability outcome for if dice is greater than 4

Computing the complement of this event, we get 2/3, which follows directly from the definition of the complement (namely 1 – 1/3):

Probability conversion for if the dice is less than or equal to 4

It is important to notice that the probability of a given event added to the probability of the complement of this same event will always add up to 1.

Now that we’ve seen the basic definitions of probability, let’s move on to the next lesson.

Part 2

We are now going to use Pandas to do some probability computations.

Setup

To get started we’ll be using the following files:

Source code files

There are a bunch of CSV files here, where the one called “flights.csv” is the main dataset we are going to work with. It has a little over half a million U.S. domestic flights from the year 2017, containing all kinds of information about the flights, such as origin city, origin state, destination city, destination state, flight airline, distance of the flight, departure and arrival times, and so on.

File “ReadMe.csv” explains in more detail the different columns of “flights.csv”:

Flight information in CSV file

There are also additional files (“L_AIRLINE_ID.csv”, “L_AIRPORT.csv”, and “L_WEEKDAYS.csv”) containing airline, airports, and weekdays codes.

“Terms.csv” has flight-specific terminology, with several terms and their definitions for your aid. In case you are not familiar with any of the terms used in the other files, you can read this file. See part of its contents down below:

Terms in CSV file

Now that we took a look at the files on our source code folder, we launch Spyder:

Anaconda Navigator with Spyder selected

Save your Spyder running instance in the same folder you unzipped before, where the CSV files are:

Creation of probability.py file

Starting the code

We start our Python code by importing Pandas, and reading from our flights spreadsheet:

import pandas as pd

flights = pd.read_csv('flights.csv', index_col=False).dropna()

By chaining the dropna call, Pandas reads our flights file and drops any row containing at least one missing value.

We first compute the probability that, given no other information, a randomly picked flight started in California. To calculate it, going back to the definition of probability, we divide the number of flights starting in California by the total number of flights.

For the first part of our equation, we need to find the number of flights originating in California:

num_flights_in_CA = (flights['ORIGIN_STATE_NM'] == 'California').sum()

Getting the total number of flights is pretty straightforward: we just need the length of our “flights” variable, which gives us its number of rows:

total_flights = len(flights)

And then we print the result, which is the number of flights from California divided by the total amount of flights:

print('p(flight started in California) = {}'.format(num_flights_in_CA/total_flights))

We see that the probability for a flight to start in California is about 13%:

p(flight started in California) = 0.13300369068719986

California was just an example, though. Let’s get a full probability distribution for all states in our flights’ file.

Full probability distribution

For every state, we want to compute the probability of a flight starting in that particular state. For that, we group the flights by origin state:

flight_states = flights.groupby('ORIGIN_STATE_NM')

We use the Pandas function “size” to get the total number of flights for each individual state:

num_flights_per_state = flight_states.size()

All that's left is to divide each count by the total number of flights from 2017, which we do with an apply call, as follows:

flight_state_prob = num_flights_per_state.apply(lambda num_flights: num_flights / total_flights)

The lambda function is applied to each state's flight count in turn.

Printing “flight_state_prob”, we have a list with all states and their calculated probabilities. See some of the probabilities:

Nebraska         0.004045
Nevada           0.029836
New Hampshire    0.001100
New Jersey       0.021017
New Mexico       0.003916
New York         0.042568

Finally, to find out what the maximum probability is and its corresponding state, we run:

print(flight_state_prob.max())
print(flight_state_prob.idxmax())

It turns out that the most likely origin state for a randomly picked 2017 domestic U.S. flight is, in fact, California (with its 13% probability).
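
As a side note, pandas can compute this same distribution in a single step with value_counts; a shorter, equivalent sketch of the computation above:

# normalize=True divides each state's flight count by the total number of rows
flight_state_prob = flights['ORIGIN_STATE_NM'].value_counts(normalize=True)
print(flight_state_prob.idxmax(), flight_state_prob.max())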

So this introduced us to how to perform some basic probability operations using Pandas.

You can find Pandas documentation here: http://pandas.pydata.org/

 

Transcript 1

Hello world and thanks for joining me. My name is Mohit Deshpande. In this course, we’ll be learning all about probability theory and building a naive Bayes classifier that will be able to predict if our flight will land late or not. We’re gonna be learning all about probability in this course.

The first thing that I wanna do is introduce you to the concept of probability, if you’re not already familiar with it and just to get all of the rotation out of the way. Then we’re gonna move on to conditional probability and that’s kinda the backbone of a lot of machine learning and data science algorithms that’s working. Having a good idea of conditional probability will also help you out in a ton of other fields as well.

Next we’re gonna move on to Bayes’ theorem which is gonna follow from a conditional probability. And Bayes’ theorem, again, is a very fundamental statistical theorem that’s used in all kinds of different applications. One application in particular is using it in a naive Bayes classifier. We’re gonna build a naive Bayes classifier that we’re gonna train it on a data set of flights and see if we can predict whether a flight will land late or not, given some set of features, such as how long the flight is in the air, the distance between the two airports, their departure time, the airline, and so on and so on.

We can experiment with the groupings of features to see if we can get a really accurate classifier. We’ve been making courses since 2012 and we’re super excited to have you onboard. Online courses are a great way to learn new skills and I take a lot of online courses myself.

Zenva courses consist mainly of video lessons that we can watch and rewatch at your own pace as many tines as you want. We also have downloadable source code and project files and they contain everything that we build in the lessons. It’s highly recommended that you code along with me. In my experience, that is the best way to learn something new, is to get your hands dirty.

Lastly we’ve seen that students who get the most out of online courses are those that make a weekly plan and stick with it, depending, of course, on your own availability and learning style. Zenva, over the past six years, has taught all kinds of different topics on programming and game development to over 300,000 students across over a hundred courses. The skills that they learn in these courses are completely transferrable to other domains.

In fact, some of the students that have used, have taken these courses, have used the skills to advance their own careers, to start a company or to publish their own content from the skills that they’ve learned. Thanks again for joining. I look forward to seeing all the cool stuff you’ll be building. Now without further ado, let’s get started.

Transcript 2

In this video, I want to introduce you guys to a little bit about probability theory and how to compute it and so that we can get sort of using it in many of our applications that we’re gonna be working on, so what we’re gonna talk about, let’s go through, gonna introduce probability just to make sure that everyone’s on the same page regarding things like notation and how to actually compute it, and then we’re gonna quickly move on to conditional probability, and thus on to Bayes theorem.

So Bayes theorem depends on having a knowledge of conditional probability and we’re gonna have lots of examples with all of this information as well, and then finally, we’re gonna culminate in knowing, we’re gonna be learning about the naive Bayes classifier and we’re gonna use it, apply it to a set of flights to see if we can predict if our flight is going to arrive late based on any number of given factors, so things like the distance between the two airports, what our departure time is, maybe the airline that we are flying, given all this information.

We’re gonna try to see if we can build a naive Bayes classifier that can predict if our flight is going to arrive late so it’s a really cool application of all the probability that we’re gonna be learning but we have to actually get started learning some of this probability, so I just wanna start off just introducing some concepts in probability and some notation, just so that everyone is on the same page. So probability is a likelihood of some event happening.

So I have two examples here, I have one involving a fair coin, one involving a six-sided dice. So if you think about a fair coin, a fair coin has two sides, and so there’s an equal chance or equal probability of the coin landing on heads or landing on tails if you flip it, and so here’s just some examples, some notation that we might use, so we have the probability that the coin lands on heads is gonna be equal to 0.5 or 1/2, and here’s some notation that you might encounter in other places if you see it, so sometimes probability is denoted as lowercase p and the event is gonna be in parentheses, sometimes, it’ll, especially with coins, sometimes just shortened to h or t.

Sometimes we'll use a capital P, sometimes we'll write the full word Prob for probability, but this is just notation you might see in many places; there is no standardized way of writing it. And speaking of events, an event is just some kind of action that has a probabilistic outcome, so flipping a coin is an event, because we don't know what the outcome is yet until we flip the coin. And the chance for the coin is 1/2 on each side.

Again, a dice toss is another example of an event. There is a one in six chance of it landing on any one particular number, so if you roll a fair dice and it lands on three, the probability that it landed on that three is one out of six. We can do a bit of a more advanced problem, so we can ask: what's the likelihood that this dice lands on an even number? Well, how many even numbers are there on a six-sided dice? Right, there's two, four, and six. And how many different sides are there?

There are six. So three divided by six is 1/2, which is 0.5. So in general, here's how we compute the probability of some event e: it's quite simply the number of ways that e can happen divided by the total number of possible outcomes. So in the case of a coin toss, for the probability of heads, there's only one way that it can happen, it lands on heads, and the number of possible outcomes is two, because it could have landed on heads or tails. So this is just some precursor information to get everyone on the same page.
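
To make that counting definition concrete, here's a minimal Python sketch; the probability helper and the outcome lists are just illustrative choices, not part of the course code:

# Classical probability: number of ways e can happen / number of possible outcomes.
def probability(favorable_outcomes, all_outcomes):
    return len(favorable_outcomes) / len(all_outcomes)

coin = ['H', 'T']
dice = [1, 2, 3, 4, 5, 6]

print(probability(['H'], coin))                             # 0.5 -> P(heads)
print(probability([s for s in dice if s % 2 == 0], dice))   # 0.5 -> P(even roll)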

Speaking of probability values, all probabilities go from zero to one, including zero and one, so if you have something with a probability of zero, then we are saying that this event is impossible. If you have something with a probability of one, we are saying that this event is certain, and then somewhere in the middle, even chance is 50%. So it’s important to know that probabilities only range from zero to one, it doesn’t make sense for anything to have a probability greater than one or less than zero.

Okay, so one other concept that I want to just discuss is this notion of a complement of an event. So the complement of an event, and here are all the different ways that you can write it. Again, there's no standard notation for this, so you might end up seeing all of these. The complement of an event is all of the outcomes that are not in that particular event, so let's do an example to get a better picture of what's going on.

So suppose I wanna compute the probability that a roll of the dice is not one. So what are the different outcomes where the dice would not be one? Well, it turns out there are five, right? The dice can be two, three, four, five, or six. So there are five outcomes out of six, and I end up with five sixths. That's what it means for an event to have a complement. And we can compute it in another way: the complement is equal to one minus the probability of the actual event, so here's an alternate way of computing this. I can say that the probability that the dice is not one is equal to one minus the probability of the dice actually being one.

You can see that mathematically they equate to the same thing, but this is just an alternate way to compute a complement. So really, given any kind of probability, we can immediately compute the probability of the event not happening just by taking one minus that number. All right, so let's do just one more example to get this in your head.

So suppose I want to compute the probability that a roll of a dice is strictly greater than four. Well, strictly greater than four doesn't include four, so the only two possible outcomes are five and six. That's two, divided by how many outcomes there are, which is six, so two divided by six, and I can reduce that to 1/3. Now I can take the complement of that event, and if you use the formula, you can immediately compute it as being 2/3, because one minus 1/3 is 2/3.

So here’s just some notation of the probability that a dice is strictly greater than four, is equal to if I say what are the different outcomes of the dice not being strictly greater than four, well that means that the dice roll will have to be less than four, including four, and turns out, again, these are equivalent things here, so what is the likelihood that the dice lands on a number that’s less than four inclusive?

Well that’s gonna be one, two, three, four, four out of six possible outcomes, so that’s gonna be 2/3, so I can, there’s another way to compute this, if you have something like a dice or if you have a coin or some other kind of classic probability thing, classic probability object, classic probabilistic object if you wanna think of it that way, you can try to take the complement of an event and tie that into the actual geometry of the object itself, like what I’ve done here.

Alternatively, if I had something else, I would have to use the one-minus trick to compute this. Notice that all the numbers add up, right? The probability that the dice is not strictly greater than four is gonna be one minus the probability that the dice is strictly greater than four, so it's just gonna be one minus 1/3, and so that would be 2/3, and again, all of this adds up.
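
Here's a small sketch of that one-minus trick with the same dice example; the variable names are just illustrative:

dice = [1, 2, 3, 4, 5, 6]

# Count directly: two outcomes (5 and 6) are strictly greater than four.
p_greater_than_4 = len([s for s in dice if s > 4]) / len(dice)   # 2/6 = 1/3

# Complement rule: P(not greater than 4) = 1 - P(greater than 4).
p_not_greater_than_4 = 1 - p_greater_than_4                      # 2/3

# Counting the outcomes 1 through 4 directly gives the same answer (up to rounding).
print(p_not_greater_than_4, len([s for s in dice if s <= 4]) / len(dice))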

So this is really just to keep everyone on the same page when it comes to probability, in terms of notation and how we actually compute it, and hopefully it quickly gives you an idea of what probability is and how we're gonna be using it in the future.

Transcript 3

So, what we’re gonna get started doing some probably computations on our dataset. So, the first thing you’ll need to do is go download the source code and then you’ll want to unzip it.

Make sure that you unzip it, and then put it somewhere. And if you go inside, I have a ton of CSV files, and the main dataset that we're gonna be working with is flights.csv. It has a little over half a million US domestic flights from the year 2017, and that's what we're gonna be working with. It has all kinds of information about the flights: the origin city, origin state, destination city, destination state, the flight id, the airline, the distance of the flight, the expected arrival/departure time, the actual arrival/departure time, the time in the air, and all other kinds of information.

In fact, you can read all about the information in this ReadMe.csv. It shows you what all of the different columns are, as well as what they actually mean. And just for your curiosity, I have all of the supporting information too, like airport codes, the airline codes, codes about weekdays, and there's a CSV sheet of terminology as well, in case you're unfamiliar with flight terminology. I certainly am not too familiar with it, so I like to read through this as well.

So, please use all of these CSV files to your advantage so you get a better understanding of the dataset. So let's get started. We're gonna need to make sure we have the right environment and then launch an instance of Spyder, and we actually have one running. I'm just gonna save this as probability.py inside of the same folder that houses the CSV files.

All right, so now we can get started and we're gonna do some basic probability computations. So, I'm gonna import pandas first, we're gonna be needing that. Now I'll just load in the data using pandas: flights equals pd.read_csv on flights.csv, with index_col equals False, and then immediately we're just gonna drop values that we're not going to need. We're gonna drop any null values, or not-a-number values. There are some, but we just don't wanna deal with them in our probability computation.
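
As a rough sketch, the setup described so far might look like the following; flights.csv comes from the course download, and the options are the ones mentioned in the video:

import pandas as pd

# Load the 2017 flights dataset and drop rows with missing values up front.
flights = pd.read_csv('flights.csv', index_col=False)
flights = flights.dropna()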

All right, I’m gonna write in comments the probabilities that we’re gonna be computing so that we get a better idea of how we can actually compute them. All right, so let’s start by computing the probability that with no other information, you just pick a random flight from the Air 2017, what is the probability that the flight started in California? I’ll just use the full code. What is the probability that the flight started in California?

Well, if you go back to our definition of probability, we're gonna have to compute the number of flights whose origin state is California, take the sum of all that, and then divide by the total number of flights overall. So we can do that. So, num flights in CA equals: we can do flights, and the column we're looking for is Origin State Name. So I can get all the flights whose origin state name is California, then I'm gonna take the sum of all those. So this will give me all the flights that started in California.

Now I need to get the total flights, and that's also pretty easy to do. We can just do something like the length of flights, so let's do len of flights. This just counts up the number of rows, and that's the total number of flights. And that's all there is to it. So this is the probability, and we can just print it out: we're gonna say print, with dot format, and you just divide these two numbers. So I'm gonna divide this by total flights here.
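
Putting those pieces together, a sketch of the California computation might look like this; the column name 'OriginStateName' is an assumption, so check ReadMe.csv for the exact spelling in your copy of the dataset:

# Flights whose origin state is California, divided by the total number of flights.
num_flights_in_ca = (flights['OriginStateName'] == 'California').sum()
total_flights = len(flights)
print('P(origin is California) = {}'.format(num_flights_in_ca / total_flights))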

All right, so let's actually run this and see the results. After we run this, we see that the probability that any given flight started in the state of California is about 13 percent. Now, I just happened to pick California off the top of my head, but let's get a full probability distribution for every state. So in other words, for every state I wanna compute the probability that the flight started in that particular state.

So I want to know the probability that a flight started in New York, in California, in Wyoming, in Texas, and so on and so on. So I wanna compute the probability that a flight started in X, for all states X. And then we can maybe do something like take a max operation and figure out, just looking at all the flights that happened in the past year, which state was most probable as the one you were leaving from. So we can do that, but what we're gonna have to do is run an aggregate operation: I need to do a group by operation and then get the size of each group. So I can do that as well, but what

I’m gonna leave as a bit of a challenge for you guys is to break down the line of code that would run a group by operation on this origin state name. So you’ll need to use the group by function in pandas to do that. Just go ahead and do that and we’ll be right back with the answer. Okay. So we’ll need to use what is called something like flight states equals flights dot group by then origin state name. So now we’ve grouped our flights by the state.

Now what we need to do is count up the number of flights in each state. So this is kind of like doing the sum operation here. We can do that; we'll say num flights per state. And there's actually a convenient function in pandas that you might be aware of: we just call dot size, and that will give us the number of flights in each group. Because since I'm grouping the flights by the state, I just need to basically run a count operation on each of those groups, and that's what size does here, instead of having to use something like an apply function.

All right, so now I have the counts, but I have the raw counts. What I need to do is divide by the total number of flights, and I can do that just by running a simple apply operation. So, I could actually just print this, or I should probably save it, because we might want to do a max operation on it. So I'll just say flight state probability is going to be the num flights per state, and what I'm gonna do is apply a function to this. This is, again, a lambda function; in other words, the code that I'm gonna write here is gonna be applied to each group.

So, num flights: I just want to take the number of flights in each state, num flights, and divide that by the total flights number. That's just a single number, and I'm just taking each one of these groups and dividing it by total flights. That's all this is doing. And then we can print this out to see the distribution: for every state we'll get a probability of what the likelihood is that, if you pick any random flight from the year 2017, the origin was in this state. So let's run this. Okay. So now we get this information right here.

So we can see that if you just pick any random state, say Massachusetts, the probability of any one flight whose origin was Massachusetts is about two percent. And again, we see a number here of about 13 percent for California, for example. But let's get what the max is. So I can just say print the maximum, I'll just say print max, and that will give us the maximum value. But I want to, oops, I want to know which state that actually is, so I can print that out as well. I'm gonna comment this out. So then we run this, and we see that it turns out that California is actually the most likely state of origin.
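
For reference, here is a sketch of the group-by version described above; again, the column name is an assumption taken from the video:

# Group flights by origin state and count the flights in each group.
flight_states = flights.groupby('OriginStateName')
num_flights_per_state = flight_states.size()

# Divide each count by the total to turn counts into probabilities.
flight_state_probability = num_flights_per_state.apply(lambda n: n / total_flights)

print(flight_state_probability)           # the full distribution over states
print(flight_state_probability.max())     # the largest probability
print(flight_state_probability.idxmax())  # the state it belongs to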

So if you pick a random flight in the past year, then odds are it’s most likely that the state was in California. Although that probability, again, it’s still quite small. It’s only about 14 percent. So this just kind of introduces us to performing some basic probability, computing some basic probability, using pandas and the different operations there. So just to recap where you can find the documentation for pandas. You go to pandas.pydata.org and click documentation.

There’s a ton of documentation if you need to refresh your memory on how to work with pandas here. So that’s all we’re gonna do in this video. So we just learned about how to perform basic probability operations using pandas.

Interested in continuing? Check out the full Probability Foundations for Data Science course, which is part of our Data Science Mini-Degree.

]]>
Hypothesis Testing for Data Science Guide https://gamedevacademy.org/hypothesis-testing-data-science-guide/ Fri, 22 Mar 2019 04:00:06 +0000 https://pythonmachinelearning.pro/?p=2439 Read more]]>

You can access the full course here: Hypothesis Testing for Data Science

Part 1

To start this course, we’re going to cover the following topics:

  • Random Variables
  • Normal Distribution
  • Central Limit Theorem

Random Variables

A random variable is a variable whose value is unknown. Namely, the outcome of a statistical experiment.

Consider the example of a single coin toss X. We do not know what X is going to be, though we do know all the possible values it can take (heads or tails), which are called the domain of the random variable. We also know that each of these possible values has a 50% probability of happening, that is p(X = H) = p(X = T) = 1/2.

Similarly, if X now is a single dice toss, we have six different sides (going from 1 to 6) each equally likely:

p(X = 1) = p(X = 2) … = p(X = 6) = 1/6.

Note: X refers to a random variable, while x (lowercase) is usually used for a very specific value.
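
If it helps to see these two random variables as actual statistical experiments, here is a small simulation sketch using Python's standard library; the sample size is arbitrary, and the observed frequencies only approximate 1/2 and 1/6:

import random
from collections import Counter

coin_tosses = [random.choice(['H', 'T']) for _ in range(10000)]
dice_tosses = [random.randint(1, 6) for _ in range(10000)]

print(Counter(coin_tosses)['H'] / len(coin_tosses))  # close to 0.5
print(Counter(dice_tosses)[3] / len(dice_tosses))    # close to 1/6 (about 0.167)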

We can divide random variables into two categories:

  1. Discrete Random Variable: can only take on a countable number of values (you can list out the results it can take). Examples of this category are coin tosses, dice rolls, and the number of defective light bulbs in a box of 100.
  2. Continuous Random Variable: may take on an infinite number of values within a range. Examples of this second category are the heights of humans, lengths of flower petals, and the time to check out an online cart on a website.

In other words, discrete random variables are basically used for properties that are integers or in a situation where you can list out and enumerate the possibilities involved. Continuous variables usually describe properties that are real numbers (such as heights and lengths of objects in general).

Probability Distribution

It’s the representation of random variable values alongside their associated probabilities. We call it probability mass function (pmf) for discrete random variables and probability density function (pdf) for continuous random variables.

The graphics below are called discrete uniform distributions and they are examples of mass functions, as they associate a probability to each possible discrete outcome. On the other hand, a normal distribution is a form of density function that we’ll see later on.

Graphs with distributions of coin and dice toss

We see the domain of each function listed on the x-axis (i.e. all the possible outcomes), and the y-axis shows the probability for each one of the outcomes.

Rules of Probability Distributions

  • All probabilities must be between 0 and 1, inclusive.
  • The sum/integral of all probabilities of a random variable must equal 1.

The examples in the image above already show us the application of these two rules. All probabilities listed for the two graphics are between 0 and 1 (1/2 and 1/6), and they all sum up to 1 (1/2 + 1/2 = 2/2 = 1, and likewise for the second example: 6 × 1/6 = 1)!

A probability value of zero would be considered as “impossible to happen” and if it is one then it is “certain to happen”.
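
As a quick sanity check, here is a small sketch that verifies both rules for the fair-dice pmf; the dictionary is just an illustrative stand-in for a probability mass function:

# A fair six-sided dice: each side has probability 1/6.
pmf = {side: 1/6 for side in range(1, 7)}

# Rule 1: every probability is between 0 and 1, inclusive.
assert all(0 <= p <= 1 for p in pmf.values())

# Rule 2: the probabilities sum to 1 (allowing for floating-point rounding).
assert abs(sum(pmf.values()) - 1) < 1e-9
print(sum(pmf.values()))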

In the next lesson, we’re going to study the normal distribution.

Part 2

In this lesson, we’re going to talk about the normal distribution.

It is also called the Gaussian distribution, and it’s the most important continuous distribution in all statistics. Many real-world random variables follow the normal distribution: IQs, heights of people, measurement errors, etc.

Normal distributions are determined by the mean µ (the location of the peak, which is also the most likely value) and by the standard deviation σ, which controls the spread of the data and therefore the height of the peak, as seen below (σ² stands for variance):

Standard deviation graph and formula

We won’t be getting into details of the formula above though, as computer libraries already do all the work for us.  Remember that, from a notation point of view, all capital letters stand for random variables and lowercase letters are actual, specific values.

Let’s take a look at some properties of normal distributions:

  • Mean, median and mode are equal (and they are all at the peak)
  • Symmetric across the mean (both sides look the same)
  • Follows the definition of a probability distribution
    • Its largest value is equal to or less than 1, and the tails are always above zero (asymptotes at the x-axis)
    • The area under the curve is equal to one (integral)

The Empirical Rule (68-95-99.7)

This rule says that 68% of the data in a normal distribution is within +-1 standard deviation of the mean, 95% is within +-2 standard deviations, and within +-3σ we have almost everything included (99.7%):

Graph showing where the normal distribution is

Suppose that we know the normal distribution for adult male heights, with µ = 70 inches and σ = 4 inches. Applying the empirical rule, we have:

Adult male heights graph with empirical formula applied

That means that 68% of adult males are between 66 and 74 inches tall, 95% are between 62 and 78 inches tall, and almost all adult males (99.7%) are between 58 and 82 inches tall.
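
We can check those numbers with a short sketch using scipy.stats.norm; the mean and standard deviation are the ones assumed above (70 and 4 inches):

from scipy.stats import norm

mu, sigma = 70, 4
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print('P(within +-{} sigma) = {:.3f}'.format(k, p))
# Prints roughly 0.683, 0.954, and 0.997, i.e. the 68-95-99.7 rule.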

Computing Probabilities

In addition to the probability density function we mentioned in the previous lesson, we also have the cumulative density function (cdf). It is the probability that a random sample drawn from X is less than x: cdf(x) = p(X < x).

An interesting thing here is that we can find this probability by calculating the area under the curve (i.e. the integral). However, if we want the probability for a precise value x, then we cannot compute an area, as there is no curve segment (just a point!) to integrate over. There isn't enough density to answer that! What this means is that p(X = x) = 0 for any x, because there's no area "under the curve" for a single point!

Note that a probability density function is not a probability, it is a probability density. We have to integrate it in order to have an actual probability.

Density functions for graph and data

To the right-hand side of the image above we see that the complement can be applied for probability computations with normal distributions, such that the green area can also be computed by taking the difference between 1 and the value of the red area.
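
Here's a minimal sketch of the cdf and its complement for the same height example (µ = 70, σ = 4), assuming SciPy is available:

from scipy.stats import norm

mu, sigma = 70, 4
print(norm.cdf(62, mu, sigma))       # p(height < 62), about 0.023
print(1 - norm.cdf(82, mu, sigma))   # p(height > 82), about 0.0013 (the complement)
print(norm.cdf(mu, mu, sigma))       # 0.5: half of the area lies below the mean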

We have to be careful not to assume that everything follows the normal distribution; in fact, we need to present some justification to assume that. There are a lot of different kinds of distributions, such as the log-normal distribution. The example below is clearly not a normal distribution as it is not symmetric (normal distributions are symmetric); it is a log-normal distribution, which is followed by some real-world phenomena such as the duration of a chess game.

Image explaining not all random variables follow normal distribution

Transcript 1

Hello world, and thanks for joining me. My name is Mohit Deshpande, and in this course, we’re gonna learn all about hypothesis testing. We’re gonna be building our own framework for doing this hypothesis testing. That way, you’ll be able to use it on your own data in your own samples, and you’ll be able to validate your own hypotheses.

So the big concepts that we’re gonna be learning about in this course, we’re gonna learn a little bit about probability distribution. We’ll talk about random variables and what a probability distribution actually is. We’ll talk about some very important distributions, like the Gaussian distribution and the normal distribution, that kind of serve as the backbone for z-tests and eventually t-tests.

And then we’re gonna get on to the actual hypothesis testing section of this, which is gonna include z-tests and t-tests and they’re really ways that we can have a claim and then back it up with statistical evidence. And so we’re gonna learn about how we can do that as well as the different conditions where we might wanna use one or the other, and all of our hypothesis testing examples are gonna be chock full of different examples so that you get practice with running hypothesis tests, and in our frameworks, we’re also gonna use code, and so you will get used to using code to validate hypotheses as well.

So we’re gonna take the math that we learned in this, which is not gonna be that much, and then we’re gonna apply it to, and we’re gonna implement that in code as well, and then we’re gonna use all this to build our framework and then answer some real-world questions and validate real-world hypotheses. We’ve been making courses since 2012, and we’re super excited to have you on board.

Online courses are a great way to learn new skills, and I take a lot of online courses myself. ZENVA courses consist mainly of video lessons that you can watch and rewatch as many times as you want. We also have downloadable source code and project files that contain everything that we build in the lessons. It's highly recommended that you code along; in my experience, the best way to learn something is to get your hands dirty. And lastly, we see that the students who get the most out of these online courses are the same students that make a weekly plan and stick with it, depending, of course, on your own availability and learning style.

So ZENVA, over the past six years, has taught all kinds of different topics on programming and game development to over 300,000 students across a hundred courses. These skills that they learn in these courses, by the way, are completely transferable to other domains. In fact, some of the students have used the skills that they've learned to advance their careers, to make a startup, or to publish their own content from the skills that they've learned in these courses.

Thanks again for joining, and I look forward to seeing all the cool stuff you’ll be building. And without further ado, let’s get started.

Transcript 2

Hello everybody. We are going to talk about hypothesis testing. But before we quite get into that, we have to know a little bit of background information in order to do hypothesis testing.

Specifically, we have to know a little bit about random variables and probability distributions, as well as a very important distribution called the normal distribution, and a very important theorem called the central limit theorem. And all these things are gonna tie together when we get into doing hypothesis testing; we're going to use all of these quite extensively. So the first thing we need to talk about is random variables.

So, a random variable is just a variable whose value is unknown. Another way you can think about this is as a variable that is the outcome of some kind of statistical experiment. So I have two examples here; say we let X equal a single coin toss. We don't know what the value of X is, but we know all of the possible values that it can take on. We just don't know what the actual value is, because it is a random variable.

But we can say that, well the probability that X is gonna be heads is a half, the probability that X is gonna be tails is also a half. We don’t know what the actual value is, but we know about all the values it can take on. In other words, we call this the domain of a random variable. For these variables here, it is the different values that it can take. So, think of a dice toss that I have here. We have possible values here that X can be one, two, three, four, five, or six, and each of these are equally likely. And just a point on notation is that this capital X is the random variable, if you see lowercase x that usually means a very specific value of the random variable.

Speaking of random variables, we can broadly separate them into two different categories. We have discrete random variables and continuous random variables. So discrete random variables, as the name implies, can only take on a countable number of values. So, picture things like doing a coin toss or a dice roll. They’re very discrete number values. Using this interesting example, that’s used a lot in statistics textbooks, it’s a seminal problem that you’ll see in statistics textbooks.

If you’ve taken a course on statistics, you’ll probably have some question like this, the number of defective light bulbs in a box of 100. So the different outcomes here are: Light bulb one is defective or not, light bulb two is defective or not, light bulb three is defective, and so on and so on. This is an example of a discrete random variable. So, we have discrete random variables and we also have continuous random variables, and these random variables can take on an infinite number of values, within a given range. So, things like the heights of humans is continuous, things like the lengths of flower petals, or the time to checkout an online cart if you’re buying something from an online retailer, the time to checkout is also an example of a continuous random variable.

Right, so think of things that are real numbers for example. Usually continuous random variables describe properties that are real numbers. So heights of humans for example, are real numbers, they will be measured in feet and inches or centimeters. They can take on an infinite number of values within a particular range, or they don’t even have to be bounded, they can just go from negative infinity to positive infinity, it depends.

And discrete random variables, then, usually describe properties of things that are integers, or things that you can actually list out and enumerate. So that's just kind of the way you can think of it: if you ever have a question of whether a variable is discrete or continuous, think about what its possible values could be. Can it take on an infinite number of values? If so, then it's usually gonna be a continuous random variable.

Okay, so now that we know what random variables are, let's talk a little bit about what a probability distribution actually is. So it's really just a representation of the random variable's values and their associated probabilities. You're gonna encounter two different kinds of functions: there's the probability mass function (PMF) for discrete random variables, and the probability density function for continuous random variables. So let me use an example. Here's a probability distribution for a single coin toss. On the x-axis we have all of the different possibilities, in other words the domain of the random variable. So heads and tails are the only two possible values here.

And on the Y-axis are their associated probabilities. So heads has a probability of 0.5 or half, tails has a probability of 0.5 or a half. In another example, we have the single toss of a six-sided dice. Again, we have put numbers one, two, three, four, five, six and their associated probabilities. Each of them have a probability of 1/6. So these are examples of, actually these two are examples of something called a uniform distribution.

Now, for a uniform distribution, each outcome is equally likely. So heads and tails are both equally likely, and for a six-sided dice, each of the outcomes of the dice toss is equally likely. So we call these uniform distributions, and specifically the discrete uniform distribution. And these two are examples of probability mass functions, because they associate a probability with each possible discrete outcome. When we talk about the normal distribution soon, that is going to be an example of a continuous distribution, so we can't talk about a PMF, a probability mass function; we have to talk about the PDF, or probability density function.

So, this is really just what a probability distribution is, and it's quite easily representable in a nice pictorial format. I think that tends to work quite well for showing what actually is going on with these probabilities. Now, these probabilities are great, but they have some rules, so let's talk a little bit about some of the rules of these probability distributions. So, all of the probabilities in a probability distribution have to be between zero and one.

Probabilities in general have to be between zero and one, including both zero and one. A probability of zero means impossible, and a probability of one means certain; anything that goes outside of that range doesn't really make sense in terms of probabilities. The other thing is that the sum, or the integral, over the domain of the random variable, in other words over all the probabilities of a random variable, has to equal one. So if we're talking about discrete random variables that's a sum, and if we're talking about continuous random variables that's the integral. But don't worry, we're not gonna use any calculus.

So what I mean by that is, if we look at the possible outcomes of heads and tails and sum them up, we should get one. Intuitively you can think of this as: we're guaranteed to observe something, something is gonna happen. If we do a coin toss, we're either gonna get heads or tails. That's essentially what the second rule of probability distributions is trying to say: if we look at our random variable, and it's a coin toss or a dice toss or something, we're guaranteed that something is gonna happen. That's why everything has to sum to one.

And you can see for the dice toss, 1/6 plus 1/6 plus 1/6 plus 1/6 plus 1/6 plus 1/6 is 6/6, in other words one. So these are indeed valid probability distributions, because all the probabilities are between zero and one, and they sum up to one. If we had a continuous distribution, we would use the integral. So, that is where I'm gonna stop here for random variables.

So this is just to introduce you to the concept of what is a random variable and what are probability distributions to begin with. And then, now we’re gonna look at some very important distribution theorems in statistics that even allow you to do hypothesis testing.

Okay, so just to give you a quick recap, we talked about what a random variable is: a variable whose value is the outcome of some kind of statistical experiment. In other words, we don't really know what the value itself is, but we know about the different values that it can take. And we talked about the different kinds of random variables, discrete and continuous, and a little bit about what probability distributions are: they're just representations of the domain of a random variable and the associated probabilities.

And we talked a little bit about some of the rules: probabilities have to be between zero and one, and all the probabilities have to sum up to one. So that is all for random variables. And now we're gonna get into probably the most important distribution in statistics and many other fields, called the normal distribution.

Transcript 3

In this video we are going to talk about the most important distribution in all of statistics and probably in many many other fields called the Normal Distribution.

So like I said, it’s the most important continuous distribution that you’ll ever encounter. It’s used in so many different fields: biology, sociology, finance, engineering, medicine, it’s just, so ubiquitous throughout so many fields, I think a good understanding of it is very transferrable. Sometimes you also hear it called the Gaussian Distribution. They’re just two words for the same distribution. And as it turns out, another reason why it’s so important is that it turns out many real world random variables actually follow this Normal Distribution.

So if you look at things like the IQs of people, heights of people, measurement errors by different kinds of instruments. These can all be modeled really nicely with a Normal Distribution. And we know a lot about the Normal Distribution both statistically and mathematically. So here’s a picture of what it looks like at the bottom there.

Sometimes you’ll also hear it called a bell curve ’cause it kind of looks like the curve of a bell. And it’s parametrized by two things, that’s the mean, which is that lowercase u, in other words, the average or expected value. What that denotes is where the peak is, all normal distributions have a kind of peak, so the mean just tells you where that peak is. And then we have the standard deviation, which is the lowercase sigma. Sometimes you also see it written as sigma squared, which is called the variance.

The standard deviation tells us the spread of the data away from the mean. Basically it just means how peak-y is it? Is it kind of flat, or is it very peak-y? That’s really what the standard deviation is telling you. And here is the probability density function, it looks kind of like, kind of complicated there. But if you were to run that through some sort of graphing program, given a mean and a standard deviation, it would produce this graph.

Fortunately there are libraries that compute this for us, so we don't have to look into this too much. So, another notation point I should mention is that capital letters, like capital X, usually denote random variables, and lowercase letters are actual, specific values. That's just a notation point. So let's talk a little bit about some of the properties of the normal distribution. The mean, median, and mode are all equal, and they're all at the peak; so the peak is the mean, as we've said, and the mean gives its location.

Another really nice property is that it's perfectly symmetric across the mean. And that's also going to be useful for hypothesis testing, because if we know a particular value to the right of the mean, we can reflect it across the mean to get the corresponding value on the other side of the curve.

And by the way, this is a true probability distribution, so the largest value is going to be less than one, and the tails are always above zero. If you've heard this word before, asymptote, that's what they are: they're asymptotes at the x-axis. They get infinitely close, but never quite touch zero. And if you were to take the integral of this, it would actually equal one; it's called the Gaussian integral, in case you're interested. Another neat property of the normal distribution is called the Empirical Rule, and this is mostly used as a rule of thumb. The neat thing about it is that it works for any mean and any standard deviation.

It will always be true that about 68% of the data are gonna be within plus minus one standard deviation of the mean. About 95% of the data are gonna be between plus minus two, and 99.7 between plus minus three. And later, when we get to some code, we’re gonna verify this so that you don’t just think I’m making these numbers up. We’ll verify this with a scientific library. And, so again, this is just a rule that, a nice rule of thumb to know, if you wanna do some kind of back of the hand calculations, for example.

So let me put actual numbers to this, all right? So suppose that I happen to know the distribution for adult male heights, and that it has a mean of 70 inches and a standard deviation of four inches. Well then I can be sure that, just asking one random person, 68% of people are gonna be between 66 inches and 74 inches in height. And by the time I hit plus minus three standard deviations, between 58 inches and 82 inches, 99.7% of people are gonna be within that range.

And again, this is also gonna be useful for hypothesis testing, because if we encounter someone who's, let's say, 90 to 92 inches, very very tall, we know that that's a data point that we didn't expect; that's in the 0.3% of data, approximately. So this'll be useful when we get to hypothesis testing, because that's an unusual value, and that might be an indicator as to whether this mean is correct or not. Maybe the value that we think is the mean is not actually the mean; in fact, maybe it should be a little higher, for example. So again, this will all become clear when we do hypothesis testing, but this is just a good rule to know.

Alright, so how do we actually compute probabilities for this? With discrete random variables, you just look at the different outcomes and they tell you what the probabilities are. This is not the case for continuous random variables; it's a bit more complicated. But again, we're gonna have libraries that can compute this for us. So, in addition to the probability density function, we have the cumulative density function, called the CDF, and what cdf(x) represents, where that's a lowercase x, is the probability that my random variable takes a value that is less than x.

In other words, if I just pick a random sample out of my probability distribution, the cumulative density function here will tell me the likelihood that I observe that value or less than that value. And, you do some mathematics, it’s actually equal to the area under this curve. In other words, it’s equal to the integral. Again, we’re not going to be doing any calculus, so don’t worry.

So, if I want to know, again with this example of heights, suppose I were to just ask some random person, hey, what's your height? Say I want to figure out the probability that they're going to be less than 62 inches tall. Well, how I do that is I use the cumulative density function, the CDF, and plug in cdf(62), and that'll tell me the probability that, if I asked a random person their height, they're going to be less than 62 inches tall. That's really all the cumulative density function tells us. The interesting point is that I can't ask: what is the probability that I encounter someone that's exactly 62 inches?

I can’t ask that question because if we go by our cumulative density function, there’s no area under a curve, because it’s not a curve, it’s just a point, there’s no density to it, right? So how we compute this, we actually integrate this density function. But, we can’t do that, because we don’t have a point there, intuitively, you can think of this as, we have, if we compute using the definition of probability, it’s what are the outcomes where this is true, well it’s one, divided by what’s all possible outcomes it could take, it can take on an infinite amount of outcomes! So, hence, it’s equal to zero for any particular x.

One important thing to note is that the probability density function is not a probability, it's a probability density, so you have to integrate it in order to get an actual probability. Okay, that's a lot of words. So the other picture that I have here on the right is to show you that complementation, or the complement, still holds for continuous density functions. So, suppose that I want to know: what's the likelihood that I encounter someone that is taller than 82 inches? Well, the CDF only tells me from negative infinity up to 82 inches. How do I compute values greater than that? Well, I can take the complement, because the probability that X is greater than some value b is going to be equal to one minus the probability that X is less than b.

By the way, we don’t have to be too worried about less than or equal to’s. So I could’ve also said, probably, that x is greater than or equal to, it doesn’t really matter because, because of this second bullet point here, that the probability of capital x equals lowercase x, is equal to zero for any x, there’s no area under that curve, so, we can kind of forego the less than or equal to’s. So if I want to compute the probability that I encounter someone that is taller than 82 inches that’s equal to one minus the probability that I encounter someone that’s less than 82 inches, they’re just compliments of each other. Okay, so that’s how we’d compute probabilities, and don’t worry, we don’t have to do this integral ourselves, there are library functions that can do this for us.

So one last point that I want to impart on you is that not all variables follow the Normal Distribution. Many scientific papers, or in many things you’ll read, that people assume the Normal Distribution, but you need some kind of justification to assume that, and one thing that we’ll talk about called the Central Limit Theorem, kind of gives us a justification to assume normal distributions under a very specific set of conditions.

But you cannot just assume that, oh this variable must follow the Normal Distribution ’cause it’s used everywhere! It’s not something that we can assume. In fact, there’s lots of distributions that don’t follow the Normal Distribution. For example, here I have a picture of what’s called a Log-normal Distribution. As you can see, it’s not a normal distribution, because, first of all, easy way to look at it is it’s not symmetric. Normal distributions have to be symmetric, it’s not. But, it turns out that real world phenomenons still follow this.

For example, the duration of a chess game actually follows a log-normal distribution. Things like blood pressure follow the log-normal distribution. Things like comment length, as in how long a comment is that people leave under some kind of online post, also follow the log-normal distribution. So, a lot of random variables follow other distributions, and that's okay, we have a lot of distributions. But you can't just assume that they follow the normal distribution without presenting some kind of justification. So, that's just something to keep in mind.

Alright, so this was just a precursor to the normal distribution. We talked a little bit about what it is and some of its nice properties, we talked about the Empirical Rule, as well as how we can compute probabilities from a continuous probability distribution, which isn't as easy as with a discrete one. So, now that we have a little bit of information about the normal distribution, let's actually see it in practice.

Interested in continuing? Check out the full Hypothesis Testing for Data Science course, which is part of our Data Science Mini-Degree.

]]>