Face Recognition with Eigenfaces – Computer Vision Tutorial https://gamedevacademy.org/face-recognition-with-eigenfaces/ Sun, 02 Oct 2022 02:49:02 +0000

Face recognition – or the ability of computers to recognize faces and facial features – is an increasingly pressing concern for our future.

In this tutorial, we’re going to explore face recognition in-depth and learn how with techniques like eigenfaces, we can create our own software programs capable of identifying human faces. The applications for facial recognition are endless, and whether good or bad, it’s a skill that can help take your Python skills to the next level.

Let’s jump in and explore the world of computer vision and faces!

Intro & Project Files

Face recognition is ubiquitous in science fiction: the protagonist looks at a camera, and the camera scans his or her face to recognize the person. More formally, we can formulate face recognition as a classification task, where the inputs are images and the outputs are people's names. We're going to discuss a popular technique for face recognition called eigenfaces. At the heart of eigenfaces is an unsupervised dimensionality reduction technique called principal component analysis (PCA), and we will see how to apply this general technique to our specific task of face recognition.

Download the full code here.


Face Recognition

Before discussing principal component analysis, we should first define our problem. Face recognition is the challenge of classifying whose face is in an input image. This is different from face detection, where the challenge is determining whether there is a face in the input image at all. With face recognition, we need an existing database of faces. Given a new image of a face, we need to report the person's name.

A naïve way of accomplishing this is to take the new image, flatten it into a vector, and compute the Euclidean distance between it and all of the other flattened images in our database; the closest match is our predicted person.
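To make the naïve approach concrete, here is a minimal numpy sketch (the array and label names are illustrative, not part of the tutorial's code):

import numpy as np

def naive_recognize(new_image, database, labels):
    # new_image: 2D grayscale face of shape (h, w)
    # database:  array of shape (num_faces, h * w), one flattened face per row
    # labels:    list of names, one per row of database
    query = new_image.reshape(-1).astype(np.float64)       # flatten into a vector
    distances = np.linalg.norm(database - query, axis=1)   # Euclidean distance to every stored face
    return labels[np.argmin(distances)]                    # the closest face wins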

There are several downsides to this approach. First of all, if we have a large database of faces, then doing this comparison for each face will take a while! Imagine that we’re building a face recognition system for real-time use! The larger our dataset, the slower our algorithm. But more faces will also produce better results! We want a system that is both fast and accurate. For this, we’ll use a neural network! We can train our network on our dataset and use it for our face recognition task.

There's an issue with directly using a neural network: images can be large! If we had a single $m\times n$ image, we would have to flatten it out into a single $m \cdot n\times 1$ vector to feed into our neural network as input. For large image sizes, this might hurt speed! This is related to the second problem with using images as-is in our naïve approach: they are high-dimensional! (An $m\times n$ image is really an $m \cdot n\times 1$ vector.) A new input might have a ton of noise, and comparing each and every pixel using matrix subtraction and Euclidean distance might give us a high error and misclassifications!

These issues are why we don’t use the naïve method. Instead, we’d like to take our high-dimensional images and boil them down to a smaller dimensionality while retaining the essence or important parts of the image.

Dimensionality Reduction

The previous section motivates our reason for using a dimensionality reduction technique. Dimensionality reduction is a type of unsupervised learning where we want to take higher-dimensional data, like images, and represent them in a lower-dimensional space. Let’s use the following image as an example.

These plots show the same data, except the bottom chart zero-centers it. Notice that our data do not have any labels associated with them because this is unsupervised learning! In our simple case, dimensionality reduction will reduce these data from a 2D plane to a 1D line. If we had 3D data, we could reduce it down to a 2D plane or even a 1D line.

All dimensionality reduction techniques aim to find some hyperplane, a higher-dimensional generalization of a line, to project the points onto. We can imagine a projection as shining a flashlight perpendicular to the hyperplane we're projecting onto and plotting where the shadows fall on that hyperplane. For example, in our above data, if we wanted to project our points onto the x-axis, then we pretend each point is a ball and our flashlight would point directly down or up (perpendicular to the x-axis), and the shadows of the points would fall on the x-axis. This is a projection. We won't worry about the exact math behind this since scikit-learn can apply this projection for us.

In our simple 2D case, we want to find a line to project our points onto. After we project the points, then we have data in 1D instead of 2D! Similarly, if we had 3D data, we want to find a plane to project the points down onto to reduce the dimensionality of our data from 3D to 2D. The different types of dimensionality reduction are all about figuring out which of these hyperplanes to select: there are an infinite number of them!

Principal Component Analysis

One technique of dimensionality reduction is called principal component analysis (PCA). The idea behind PCA is that we want to select the hyperplane such that when all the points are projected onto it, they are maximally spread out. In other words, we want the axis of maximal variance! Let’s consider our example plot above. A potential axis is the x-axis or y-axis, but, in both cases, that’s not the best axis. However, if we pick a line that cuts through our data diagonally, that is the axis where the data would be most spread!

The longer blue axis is the correct axis! If we were to project our points onto this axis, they would be maximally spread! But how do we figure out this axis? We can borrow a term from linear algebra called eigenvectors! This is where eigenfaces gets its name! Essentially, we compute the covariance matrix of our data and consider that covariance matrix’s largest eigenvectors. Those are our principal axes and the axes that we project our data onto to reduce dimensions. Using this approach, we can take high-dimensional data and reduce it down to a lower dimension by selecting the largest eigenvectors of the covariance matrix and projecting onto those eigenvectors.
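To make that concrete, here is a rough numpy sketch of the covariance-and-eigenvectors idea (the function and variable names are illustrative). For real images the covariance matrix would be enormous, which is one reason we'll let scikit-learn's PCA do this work for us more efficiently later:

import numpy as np

def pca_project(X, k):
    # X has shape (num_samples, num_features); returns the k-dimensional projection
    X_centered = X - X.mean(axis=0)                    # zero-center the data
    cov = np.cov(X_centered, rowvar=False)             # covariance matrix of the features
    eigenvalues, eigenvectors = np.linalg.eigh(cov)    # eigh works on symmetric matrices
    top_k = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]  # k largest eigenvectors
    return X_centered @ top_k                          # coordinates along the principal axes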

Since we’re computing the axes of maximum spread, we’re retaining the most important aspects of our data. It’s easier for our classifier to separate faces when our data are spread out as opposed to bunched together.

(There are other dimensionality techniques, such as Linear Discriminant Analysis, that use supervised learning and are also used in face recognition, but PCA works really well!)

How does this relate to our challenge of face recognition? We can conceptualize our $m\times n$ images as points in $m \cdot n$-dimensional space. Then, we can use PCA to reduce our space from $m \cdot n$ into something much smaller. This will help speed up our computations and be robust to noise and variation.

Aside on Face Detection

So far, we've assumed that the input image contains only a face, but, in practice, we shouldn't require the camera images to have a perfectly centered face. This is why we run an out-of-the-box face detection algorithm, such as a cascade classifier trained on faces, to figure out what portion of the input image has a face in it. When we have that bounding box, we can easily slice out that portion of the input image and use eigenfaces on that slice. (Usually, we smooth that slice and perform an affine transform to de-warp the face if it appears at an angle.) For our purposes, we'll assume that we have images of faces already.
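If you do need to find the face yourself, a minimal sketch using OpenCV's bundled pre-trained Haar cascade might look like the following (photo.jpg is just a placeholder file name):

import cv2

# load the frontal-face Haar cascade that ships with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("photo.jpg")                        # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    face_slice = gray[y:y + h, x:x + w]                # crop the detected face
    # resize face_slice and feed it into the eigenfaces pipeline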

Eigenfaces Code

Now that we’ve discussed PCA and eigenfaces, let’s code a face recognition algorithm using scikit-learn! First, we’ll need a dataset. For our purposes, we’ll use an out-of-the-box dataset by the University of Massachusetts called Labeled Faces in the Wild (LFW).

Feel free to substitute your own dataset! If you want to create your own face dataset, you'll need several pictures of each person's face (at different angles and in different lighting), along with the ground-truth labels. The wider the variety of faces you use, the better the recognizer will do. The easiest way to create a dataset for face recognition is to create a folder for each person and put their face images in there. Make sure the images are all the same size, and resize them so they aren't huge! Remember that PCA will reduce the image's dimensionality when we project onto that space anyway, so using large, high-definition images won't help and will only slow down the algorithm. A good size is ~512×512 for each image. The images should all be the same size so you can store them in one numpy array with dimensions (num_examples, height, width). (We're assuming grayscale images.) Then use the folder names as the class labels. Using this approach, you can use your own images.
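As a rough sketch (not part of the tutorial's code), loading such a folder layout into numpy arrays could look something like this, assuming Pillow is installed and a layout of dataset/<person_name>/<image files>:

import os
import numpy as np
from PIL import Image

def load_face_folders(root_dir, size=(64, 64)):
    # expects root_dir/<person_name>/<image files>; returns flattened grayscale data
    images, labels, target_names = [], [], []
    for label, person in enumerate(sorted(os.listdir(root_dir))):
        person_dir = os.path.join(root_dir, person)
        if not os.path.isdir(person_dir):
            continue
        target_names.append(person)
        for filename in os.listdir(person_dir):
            img = Image.open(os.path.join(person_dir, filename)).convert("L")  # grayscale
            images.append(np.asarray(img.resize(size), dtype=np.float64))
            labels.append(label)
    X = np.stack(images)                  # shape: (num_examples, height, width)
    return X.reshape(len(X), -1), np.array(labels), target_names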

However, we’ll be using the LFW dataset. Luckily, scikit-learn can automatically load our dataset for us in the correct format. We can call a function to load our data. If the data aren’t available on disk, scikit-learn will automatically download them for us from the University of Massachusetts’ website.

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier


# Load data
lfw_dataset = fetch_lfw_people(min_faces_per_person=100)

_, h, w = lfw_dataset.images.shape
X = lfw_dataset.data
y = lfw_dataset.target
target_names = lfw_dataset.target_names

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

The argument to our function just prunes all people without at least 100 faces, thus reducing the number of classes. Then we can extract our dataset and other auxiliary information. Finally, we split our dataset into training and testing sets.

Now we can simply use scikit-learn's PCA class to perform the dimensionality reduction for us! We have to select the number of components, i.e., the output dimensionality (the number of eigenvectors to project onto) that we want to reduce down to. Feel free to tweak this parameter to try to get the best result! We'll use 100 components. Additionally, we'll whiten our data, which is easy to do with a simple boolean flag! (Whitening just rescales our resulting data to have unit variance, which has been shown to produce better results.)

# Compute a PCA 
n_components = 100
pca = PCA(n_components=n_components, whiten=True).fit(X_train)

# apply PCA transformation
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

We can apply the transform to bring our images down to a 100-dimensional space. Notice we're not fitting PCA on the entire dataset, only the training data. This prevents information from the test set from leaking into our model and gives us a more honest estimate of how well it generalizes to unseen data.
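If you're unsure how many components to keep, one option is to check how much of the original variance the chosen components capture, for example:

# How much of the original variance did our 100 components keep?
print("variance retained:", pca.explained_variance_ratio_.sum())

# Alternatively, ask scikit-learn to keep enough components for ~95% of the variance:
# pca = PCA(n_components=0.95, whiten=True).fit(X_train)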

Now that we have a reduced-dimensionality vector, we can train our neural network!

# train a neural network
print("Fitting the classifier to the training set")
clf = MLPClassifier(hidden_layer_sizes=(1024,), batch_size=256, verbose=True, early_stopping=True).fit(X_train_pca, y_train)

To see how our network is training, we can set the verbose  flag. Additionally, we use early stopping.

Let's discuss early stopping as a brief aside. Essentially, the optimizer holds out a small validation set from the training data and monitors our accuracy on it after each epoch. If it notices that our validation accuracy hasn't improved for a certain number of epochs, then we stop training. This is a regularization technique that helps prevent our model from overfitting!

Consider the above chart. We notice overfitting when our validation set accuracy starts to decline. At that point, we immediately stop training to prevent overfitting.
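For reference, scikit-learn's MLPClassifier exposes the validation split and the patience as parameters. The call below is the same as the one above with those knobs spelled out using what are, to the best of my knowledge, the documented defaults (double-check against your scikit-learn version):

# Same classifier as above, with the early-stopping parameters written out explicitly
clf = MLPClassifier(hidden_layer_sizes=(1024,), batch_size=256, verbose=True,
                    early_stopping=True,
                    validation_fraction=0.1,   # hold out 10% of the training data for validation
                    n_iter_no_change=10        # stop after 10 epochs with no improvement
                    ).fit(X_train_pca, y_train)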

Finally, we can make a prediction and use a function to print out an entire report of quality for each class.

y_pred = clf.predict(X_test_pca)
print(classification_report(y_test, y_pred, target_names=target_names))

Here’s an example of a classification report.

                   precision    recall  f1-score   support

     Colin Powell       0.86      0.89      0.87        66
  Donald Rumsfeld       0.85      0.61      0.71        38
    George W Bush       0.88      0.94      0.91       177
Gerhard Schroeder       0.67      0.69      0.68        26
       Tony Blair       0.86      0.71      0.78        35

      avg / total       0.85      0.85      0.85       342

Notice there is no single accuracy metric; accuracy alone isn't the most informative measure to begin with. Instead, we see precision, recall, f1-score, and support. The support is simply the number of times a ground-truth label occurred in our test set, e.g., in our test set, there were actually 35 images of Tony Blair. The F1-score is computed from the precision and recall scores. Precision and recall are more specific measures than a single accuracy score, and higher values for both are better.
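For reference, if we write precision as $P$ and recall as $R$, the F1-score is their harmonic mean:

$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$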

After training our classifier, we can give it a few images to classify.

# Visualization
def plot_gallery(images, titles, h, w, rows=3, cols=4):
    plt.figure()
    for i in range(rows * cols):
        plt.subplot(rows, cols, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i])
        plt.xticks(())
        plt.yticks(())

def titles(y_pred, y_test, target_names):
    for i in range(y_pred.shape[0]):
        pred_name = target_names[y_pred[i]].split(' ')[-1]
        true_name = target_names[y_test[i]].split(' ')[-1]
        yield 'predicted: {0}\ntrue: {1}'.format(pred_name, true_name)

prediction_titles = list(titles(y_pred, y_test, target_names))
plot_gallery(X_test, prediction_titles, h, w)
plt.show()  # display the gallery when running as a script

(plot_gallery  and titles  functions modified from the scikit-learn documentation)

We can see our network’s predictions and the ground truth value for each image.

Another interesting thing to visualize is the eigenfaces themselves. Remember that PCA produces eigenvectors. We can reshape those eigenvectors into images and visualize them as eigenfaces.
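Here is a short sketch of that visualization, reusing the plot_gallery function from above (pca.components_ holds the eigenvectors as rows):

# Each row of pca.components_ is one eigenvector; reshape it back into an image
eigenfaces = pca.components_.reshape((n_components, h, w))

eigenface_titles = ["eigenface {0}".format(i) for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()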

These represent the “generic” faces of our dataset. Intuitively, these are vectors that represent directions in “face space” and are what our neural network uses to help with classification. Now that we’ve discussed the eigenfaces approach, you can build applications that use this face recognition algorithm!

We discussed a popular approach to face recognition called eigenfaces. The essence of eigenfaces is an unsupervised dimensionality reduction algorithm called Principal Component Analysis (PCA) that we use to reduce the dimensionality of images into something smaller. Now that we have a smaller representation of our faces, we apply a classifier that takes the reduced-dimension input and produces a class label. For our classifier, we used a neural network with a single hidden layer.

Face recognition is a fascinating example of merging computer vision and machine learning and many researchers are still working on this challenging problem today!

Nowadays, deep convolutional neural networks are used for face recognition, and experience with them is increasingly valuable for a development career. Try one out on this dataset!

An Introduction to Image Recognition https://gamedevacademy.org/image-recognition-guide/ Sat, 31 Oct 2020 15:00:35 +0000

You can access the full course here: Convolutional Neural Networks for Image Classification

Intro to Image Recognition

Let's get started by learning a bit about the topic itself. Image recognition is, at its heart, image classification, so we will use these terms interchangeably throughout this course. We see images or real-world items and we classify them into one (or more) of many, many possible categories. The categories used are entirely up to us to decide. For example, we could divide all animals into mammals, birds, fish, reptiles, amphibians, or arthropods. Alternatively, we could divide animals into carnivores, herbivores, or omnivores. Perhaps we could also divide animals by how they move, such as swimming, flying, burrowing, walking, or slithering. There are potentially endless sets of categories that we could use.

Among categories, we divide things based on a set of characteristics. When categorizing animals, we might choose characteristics such as whether they have fur, hair, feathers, or scales. Maybe we look at the shape of their bodies or go more specific by looking at their teeth or how their feet are shaped. Once again, there are potentially endless characteristics we could choose to look for.

Analogies aside, the main point is that in order for classification to work, we have to determine a set of categories into which we can class the things we see, and the set of characteristics we use to make those classifications. This allows us to place everything that we see into one of the categories, or perhaps say that it belongs to none of the categories. The more categories we have, the more specific we have to be. It's easier to say something is either an animal or not an animal, but it's harder to say what group of animals an animal may belong to. However complicated, this classification allows us to not only recognize things that we have seen before, but also to categorize new things that we have never seen. Good image recognition models (or any machine learning models, for that matter) will perform well even on data they have never seen before.

How do we Perform Image Recognition?

We do a lot of this image classification without even thinking about it. For starters, we choose what to ignore and what to pay attention to. This actually presents an interesting part of the challenge: picking out what’s important in an image. We see everything but only pay attention to some of that so we tend to ignore the rest or at least not process enough information about it to make it stand out. Knowing what to ignore and what to pay attention to depends on our current goal. For example, if we were walking home from work, we would need to pay attention to cars or people around us, traffic lights, street signs, etc. but wouldn’t necessarily have to pay attention to the clouds in the sky or the buildings or wildlife on either side of us. On the other hand, if we were looking for a specific store, we would have to switch our focus to the buildings around us and perhaps pay less attention to the people around us.

The same thing occurs when asked to find something in an image. We decide what features or characteristics make up what we are looking for and we search for those, ignoring everything else. This is easy enough if we know what to look for but it is next to impossible if we don’t understand what the thing we’re searching for looks like. 

This brings to mind the question: how do we know what the thing we're searching for looks like? There are two main mechanisms: either we see an example of what to look for and can determine what features are important from that (or are told what to look for verbally), or we already have an abstract understanding of what the thing we're looking for should look like. For example, if you've ever played "Where's Waldo?", you are shown what Waldo looks like so you know to look out for the glasses, the red and white striped shirt and hat, and the cane. To the uninitiated, "Where's Waldo?" is a search game where you are looking for a particular character hidden in a very busy image. I'd definitely recommend checking it out. However, if we were given an image of a farm and told to count the number of pigs, most of us would know what a pig is and wouldn't have to be shown. That's because we've memorized the key characteristics of a pig: smooth pink skin, 4 legs with hooves, curly tail, flat snout, etc. We don't need to be taught because we already know.

This logic applies to almost everything in our lives. We learn fairly young how to classify things we haven’t seen before into categories that we know based on features that are similar to things within those categories. If we come across something that doesn’t fit into any category, we can create a new category. For example, there are literally thousands of models of cars; more come out every year. However, we don’t look at every model and memorize exactly what it looks like so that we can say with certainty that it is a car when we see it. We know that the new cars look similar enough to the old cars that we can say that the new models and the old models are all types of car. 

By now, we should understand that image recognition is really image classification; we fit everything that we see into categories based on characteristics, or features, that they possess. We’re intelligent enough to deduce roughly which category something belongs to, even if we’ve never seen it before. If something is so new and strange that we’ve never seen anything like it and it doesn’t fit into any category, we can create a new category and assign membership within that. The next question that comes to mind is: how do we separate objects that we see into distinct entities rather than seeing one big blur?

The somewhat annoying answer is that it depends on what we’re looking for. If we look at an image of a farm, do we pick out each individual animal, building, plant, person, and vehicle and say we are looking at each individual component or do we look at them all collectively and decide we see a farm? Okay, let’s get specific then. Let’s say we aren’t interested in what we see as a big picture but rather what individual components we can pick out. How do we separate them all? 

The key here is in contrast. Generally, we look for contrasting colours and shapes; if two items side by side are very different colours or one is angular and the other is smooth, there’s a good chance that they are different objects. Although this is not always the case, it stands as a good starting point for distinguishing between objects. 

Coming back to the farm analogy, we might pick out a tree based on a combination of browns and greens: brown for the trunk and branches and green for the leaves. Of course this is just a generality, because not all trees are green and brown, and trees come in many different shapes and colours, but most of us are intelligent enough to recognize a tree as a tree even if it looks different. We could find a pig due to the contrast between its pink body and the brown mud it's playing in. We could recognize a tractor based on its square body and round wheels. This is why colour camouflage works so well; if a tree trunk is brown and a moth with wings the same shade of brown as the tree sits on the trunk, it's difficult to see the moth because there is no colour contrast.

Another amazing thing that we can do is determine what object we’re looking at by seeing only part of that object. This is really high level deductive reasoning and is hard to program into computers. This is one of the reasons it’s so difficult to build a generalized artificial intelligence but more on that later. As long as we can see enough of something to pick out the main distinguishing features, we can tell what the entire object should be. For example, if we see only one eye, one ear, and a part of a nose and mouth, we know that we’re looking at a face even though we know most faces should have two eyes, two ears, and a full mouth and nose. 

Although we don’t necessarily need to think about all of this when building an image recognition machine learning model, it certainly helps give us some insight into the underlying challenges that we might face. If nothing else, it serves as a preamble into how machines look at images. The main problem is that we take these abilities for granted and perform them without even thinking but it becomes very difficult to translate that logic and those abilities into machine code so that a program can classify images as well as we can. This is just the simple stuff; we haven’t got into the recognition of abstract ideas such as recognizing emotions or actions but that’s a much more challenging domain and far beyond the scope of this course.

How do Machines Interpret Images?

The previous topic was meant to get you thinking about how we look at images and contrast that against how machines look at images. We’ll see that there are similarities and differences and by the end, we will hopefully have an idea of how to go about solving image recognition using machine code. 

Let's start by examining the first thought: we categorize everything we see based on features (usually subconsciously), and we do this based on characteristics and categories that we choose. The number of characteristics to look out for is limited only by what we can see, and the categories are potentially infinite. This is different for a program, as programs are purely logical. As of now, they can only really do what they have been programmed to do, which means we have to build into the logic of the program what to look for and which categories to choose between.

This is a very important notion to understand: as of now, machines can only do what they are programmed to do. If we build a model that finds faces in images, that is all it can do. It won’t look for cars or trees or anything else; it will categorize everything it sees into a face or not a face and will do so based on the features that we teach it to recognize. This means that the number of categories to choose between is finite, as is the set of features we tell it to look for. We can tell a machine learning model to classify an image into multiple categories if we want (although most choose just one) and for each category in the set of categories, we say that every input either has that feature or doesn’t have that feature. Machine learning helps us with this task by determining membership based on values that it has learned rather than being explicitly programmed but we’ll get into the details later.

Often the inputs and outputs will look something like this:

Input: [ 1 1 0 0 0 1 0 0 1 0 ] 

Output: [ 0 0 1 0 0 ]

In the above example, we have 10 features. A 1 means that the object has that feature and a 0 means that it does not, so this input has features 1, 2, 6, and 9 (whatever those may be). We have 5 categories to choose between. A 1 in a position means that the object is a member of that category and a 0 means that it is not, so our object belongs to category 3 based on its features. This form of input and output is called one-hot encoding and is often seen in classification models. Realistically, we don't usually see exact 1s and 0s (especially in the outputs). Instead, we see numbers close to 1 and close to 0, and these represent certainties or percent chances that our outputs belong to those categories. For example, if the above output came from a machine learning model, it may look something more like this:

[ 0.01 0.02 0.95 0.01 0.01]

This means that there is a 1% chance the object belongs to the 1st, 4th, and 5th categories, a 2% chance it belongs to the 2nd category, and a 95% chance that it belongs to the 3rd category. It can be nicely demonstrated in this image:

Visual of how machine learning identifies objects
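As a tiny illustration of turning such an output back into a single category (the category names here are made up purely for the example):

import numpy as np

categories = ["cat", "dog", "tree", "car", "person"]      # hypothetical category names
output = np.array([0.01, 0.02, 0.95, 0.01, 0.01])         # model's certainty per category

best = np.argmax(output)                                  # index of the highest certainty
print(categories[best], output[best])                     # -> tree 0.95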

How do Computers Actually Look at Images?

This provides a nice transition into how computers actually look at images. To a computer, it doesn't matter whether it is looking at a real-world object through a camera in real time or at an image it downloaded from the internet; it breaks them both down the same way. Essentially, an image is just a matrix of bytes that represent pixel values. When it comes down to it, all data that machines read, whether it's text, images, videos, audio, etc., is broken down into a list of bytes and is then interpreted based on the type of data it represents.

For images, each byte is a pixel value, but there can be up to 4 pieces of information encoded for each pixel. Grey-scale images are the easiest to work with because each pixel value just represents a certain amount of "whiteness". Because they are bytes, values range between 0 and 255, with 0 being the least white (pure black) and 255 being the most white (pure white). Everything in between is some shade of grey. With colour images, there are separate red, green, and blue values encoded for each pixel (3 values per pixel instead of 1, or 4 if there is also a transparency channel). Each of those values is between 0 and 255, with 0 being the least and 255 being the most. If a model sees a bunch of pixels with very low values clumped together, it will conclude that there is a dark patch in the image, and vice versa.

Below is a very simple example. An image of a 1 might look like this:

Hand drawn 1 digit

And have this as the pixel values:

[[255, 255, 255, 255, 255],
 [255, 255,   0, 255, 255],
 [255, 255,   0, 255, 255],
 [255, 255,   0, 255, 255],
 [255, 255, 255, 255, 255]]

This is definitely scaled way down but you can see a clear line of black pixels in the middle of the image data (0) with the rest of the pixels being white (255).

Images have 2 dimensions to them: height and width. These are represented by rows and columns of pixels, respectively. In this way, we can map each pixel value to a position in the image matrix (2D array so rows and columns). Machines don’t really care about the dimensionality of the image; most image recognition models flatten an image matrix into one long array of pixels anyway so they don’t care about the position of individual pixel values. Rather, they care about the position of pixel values relative to other pixel values. They learn to associate positions of adjacent, similar pixel values with certain outputs or membership in certain categories. In the above example, a program wouldn’t care that the 0s are in the middle of the image; it would flatten the matrix out into one long array and say that, because there are 0s in certain positions and 255s everywhere else, we are likely feeding it an image of a 1. The same can be said with coloured images. If a model sees pixels representing greens and browns in similar positions, it might think it’s looking at a tree (if it had been trained to look for that, of course). 
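As a quick illustration, here is the 5×5 digit from earlier being flattened into the long array a model would actually consume (a small numpy sketch, not tied to any particular model):

import numpy as np

digit = np.array([[255, 255, 255, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255,   0, 255, 255],
                  [255, 255, 255, 255, 255]])

flat = digit.reshape(-1)     # one long array of 25 pixel values
print(flat.shape)            # (25,)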

This is also how image recognition models address the problem of distinguishing between objects in an image; they can recognize the boundaries of an object in an image when they see drastically different values in adjacent pixels. A machine learning model essentially looks for patterns of pixel values that it has seen before and associates them with the same outputs. It does this during training; we feed images and the respective labels into the model and over time, it learns to associate pixel patterns with certain outputs. If a model sees many images with pixel values that denote a straight black line with white around it and is told the correct answer is a 1, it will learn to map that pattern of pixels to a 1.

This is great when dealing with nicely formatted data. If we feed a model a lot of data that looks similar then it will learn very quickly. The problem then comes when an image looks slightly different from the rest but has the same output. Consider again the image of a 1. It could be drawn at the top or bottom, left or right, or center of the image. It could have a left or right slant to it. It could look like this: 1 or this l. This is a big problem for a poorly-trained model because it will only be able to recognize nicely-formatted inputs that are all of the same basic structure but there is a lot of randomness in the world. We need to be able to take that into account so our models can perform practically well. This is why we must expose a model to as many different kinds of inputs as possible so that it learns to recognize general patterns rather than specific ones. There are tools that can help us with this and we will introduce them in the next topic.

Hopefully by now you understand how image recognition models identify images and some of the challenges we face when trying to teach these models. Models can only look for features that we teach them to and choose between categories that we program into them. To machines, images are just arrays of pixel values and the job of a model is to recognize patterns that it sees across many instances of similar images and associate them with specific outputs. We need to teach machines to look at images more abstractly rather than looking at the specifics to produce good results across a wide domain. Next up we will learn some ways that machines help to overcome this challenge to better recognize images. In the meantime, though, consider browsing our article on just what sort of job opportunities await you should you pursue these exciting Python topics!

 

Transcript

What is up, guys? Welcome to the first tutorial in our image recognition course. This is also the very first topic, and is just going to provide a general intro into image recognition. Now we’re going to cover two topics specifically here. One will be, “What is image recognition?” and the other will be, “What tools can help us to solve image recognition?”

The first part, which will be this video, will be all about introducing the problem of image recognition, talk about how we solve the problem of image recognition in our day-to-day lives, and then we’ll go onto explore this from a machine’s point of view. After that, we’ll talk about the tools specifically that machines use to help with image recognition. Specifically, we’ll be looking at convolutional neural networks, but a bit more on that later.

Let’s get started with, “What is image recognition?” Image recognition is seeing an object or an image of that object and knowing exactly what it is. At the very least, even if we don’t know exactly what it is, we should have a general sense for what it is based on similar items that we’ve seen. Essentially, we class everything that we see into certain categories based on a set of attributes. That’s why image recognition is often called image classification, because it’s essentially grouping everything that we see into some sort of a category.

Now the attributes that we use to classify images is entirely up to us. For example, if we’re looking at different animals, we might use a different set of attributes versus if we’re looking at buildings or let’s say cars, for example. If we’re looking at vehicles, we might be taking a look at the shape of the vehicle, the number of windows, the number of wheels, et cetera. If we’re looking at animals, we might take into consideration the fur or the skin type, the number of legs, the general head structure, and stuff like that. It’s entirely up to us which attributes we choose to classify items. And this could be real-world items as well, not necessarily just images.

Now, this allows us to categorize something that we haven’t even seen before. In fact, this is very powerful. We can take a look at something that we’ve literally never seen in our lives, and accurately place it in some sort of a category. We can often see this with animals. I highly doubt that everyone has seen every single type of animal there is to see out there. No doubt there are some animals that you’ve never seen before in your lives. But, you should, by looking at it, be able to place it into some sort of category. You should know that it’s an animal. You should have a general sense for whether it’s a carnivore, omnivore, herbivore, and so on and so forth.

Now, another example of this is models of cars. Now, every single year, there are brand-new models of cars coming out, some which we’ve never seen before. Some look so different from what we’ve seen before, but we recognize that they are all cars. We can take a look again at the wheels of the car, the hood, the windshield, the number of seats, et cetera, and just get a general sense that we are looking at some sort of a vehicle, even if it’s not like a sedan, or a truck, or something like that.

Now, how does this work for us? Well, a lot of the time, image recognition actually happens subconsciously. In fact, we rarely think about how we know what something is just by looking at it. We just kinda take a look at it, and we know instantly kind of what it is. And a big part of this is the fact that we don’t necessarily acknowledge everything that is around us. If we do need to notice something, then we can usually pick it out and define and describe it.

Take, for example, if you’re walking down the street, especially if you’re walking a route that you’ve walked many times. It’s highly likely that you don’t pay attention to everything around you. Maybe there’s stores on either side of you, and you might not even really think about what the stores look like, or what’s in those stores. However, when you go to cross the street, you become acutely aware of the other people around you, of the cars around you, because those are things that you need to notice. In fact, even if it’s a street that we’ve never seen before, with cars and people that we’ve never seen before, we should have a general sense for what to do. The light turns green, we go, if there’s a car driving in front of us, probably shouldn’t walk into it, and so on and so forth.

Now, this kind of process of knowing what something is is typically based on previous experiences. If we’d never come into contact with cars, or people, or streets, we probably wouldn’t know what to do. However, we’ve definitely interacted with streets and cars and people, so we know the general procedure. So, go on a green light, stop on a red light, so on and so forth, and that’s because that’s stuff that we’ve seen in the past. Even if we haven’t seen that exact version of it, we kind of know what it is because we’ve seen something similar before.

Now, sometimes this is done through pure memorization. Maybe we look at a specific object, or a specific image, over and over again, and we know to associate that with an answer. This is just kind of rote memorization. However, the more powerful ability is being able to deduce what an item is based on some similar characteristics when we’ve never seen that item before. And that’s really the challenge.

It’s easy enough to program in exactly what the answer is given some kind of input into a machine. You could just use like a map or a dictionary for something like that. However, the challenge is in feeding it similar images, and then having it look at other images that it’s never seen before, and be able to accurately predict what that image is. Now, this kind of a problem is actually two-fold. The problem is first deducing that there are multiple objects in your field of vision, and the second is then recognizing each individual object.

So, step number one, how are we going to actually recognize that there are different objects around us? Typically, we do this based on borders that are defined primarily by differences in color. This makes sense. If we’ve seen something that camouflages into something else, probably the colors are very similar, so it’s just hard to tell them apart, it’s hard to place a border on one specific item. However, if you see, say, a skyscraper outlined against the sky, there’s usually a difference in color. It’s very easy to see the skyscraper, maybe, let’s say, brown, or black, or gray, and then the sky is blue. So there’s that sharp contrast in color, therefore we can say, ‘Okay, there’s obviously something in front of the sky.’

Now, again, another example is it’s easy to see a green leaf on a brown tree, but let’s say we see a black cat against a black wall. We might not even be able to tell it’s there at all, unless it opens its eyes, or maybe even moves. Now, we don’t necessarily need to look at every single part of an image to know what some part of it is. Take, for example, if you have an image of a landscape, okay, so there’s maybe some trees in the background, there’s a house, there’s a farm, or something like that, and someone asks you to point out the house. Well, you don’t even need to look at the entire image, it’s just as soon as you see the bit with the house, you know that there’s a house there, and then you can point it out.

This is even more powerful when we don’t even get to see the entire image of an object, but we still know what it is. Take, for example, an image of a face. Let’s say we’re only seeing a part of a face. Specifically, we only see, let’s say, one eye and one ear. But we still know that we’re looking at a person’s face based on the color, the shape, the spacing of the eye and the ear, and just the general knowledge that a face, or at least a part of a face, looks kind of like that. Our brain fills in the rest of the gap, and says, ‘Well, we’ve seen faces, a part of a face is contained within this image, therefore we know that we’re looking at a face.’

That’s, again, a lot more difficult to program into a machine because it may have only seen images of full faces before, and so it gets a part of a face, and it doesn’t know what to do. No longer are we looking at two eyes, two ears, the mouth, et cetera. We’re only looking at a little bit of that.

Now, before we talk about how machines process this, I’m just going to kind of summarize this section, we’ll end it, and then we’ll cover the machine part in a separate video, because I do wanna keep things a bit shorter, there’s a lot to process here. So some of the key takeaways are the fact that a lot of this kinda image recognition classification happens subconsciously. We just look at an image of something, and we know immediately what it is, or kind of what to look out for in that image. Obviously this gets a bit more complicated when there’s a lot going on in an image.

Also, image recognition, the problem of it is kinda two-fold. The first is recognizing where one object ends and another begins, so kinda separating out the object in an image, and then the second part is actually recognizing the individual pieces of an image, putting them together, and recognizing the whole thing. Also, know that it’s very difficult for us to program in the ability to recognize a whole part of something based on just seeing a single part of it, but it’s something that we are naturally very good at.

Okay, so, think about that stuff, stay tuned for the next section, which will kind of talk about how machines process images, and that’ll give us insight into how we’ll go about implementing the model. Okay, so thanks for watching, we’ll see you guys in the next one.

What’s up guys? Welcome to the second tutorial in our image recognition course. Here we’re going to continue on with how image recognition works, but we’re going to explore it from a machine standpoint now. We just finished talking about how humans perform image recognition or classification, so we’ll compare and contrast this process in machines.

For starters, contrary to popular belief, machines do not have infinite knowledge of what everything they see is. So, let's say we're building some kind of program that takes images or scans its surroundings. Well, it's going to take in all that information, and it may store it and analyze it, but it doesn't necessarily know what everything it sees is. It might not necessarily be able to pick out every object.

Machines only have knowledge of the categories that we have programmed into them and taught them to recognize. And, actually, this goes beyond just image recognition, machines, as of right now at least, can only do what they’re programmed to do. So this means, if we’re teaching a machine learning image recognition model, to recognize one of 10 categories, it’s never going to recognize anything else, outside of those 10 categories.

Now, a simple example of this, is creating some kind of a facial recognition model, and its only job is to recognize images of faces and say, “Yes, this image contains a face,” or, “no, it doesn’t.” So basically, it classifies everything it sees into a face or not a face. Now, this means that even the most sophisticated image recognition models, the best face recognition models will not recognize everything in that image. It’s never going to take a look at an image of a face, or it may be not a face, and say, “Oh, that’s actually an airplane,” or, “that’s a car,” or, “that’s a boat or a tree.”

It’s just going to say, “No, that’s not a face,” okay? Because that’s all it’s been taught to do. It’s classifying everything into one of those two possible categories, okay? So even if something doesn’t belong to one of those categories, it will try its best to fit it into one of the categories that it’s been trained to do. So, essentially, it’s really being trained to only look for certain objects and anything else, just, it tries to shoehorn into one of those categories, okay? So that’s a very important takeaway, is that if we want a model to recognize something, we have to program it to recognize that, okay? Otherwise, it may classify something into some other category or just ignore it completely.

Now, to a machine, we have to remember that an image, just like any other data, is simply an array of bytes. So it's really just an array of data. It doesn't look at an incoming image and say, "Oh, that's a two," or "that's an airplane," or, "that's a face." It's just an array of values. Even images – which are technically matrices, since they have columns and rows of pixels – are actually flattened out when a model processes them.

Generally speaking, we flatten it all into one long array of bytes. So, I say bytes because typically the values are between zero and 255, okay? So that’s a byte range, but, technically, if we’re looking at color images, each of the pixels actually contains additional information about red, green, and blue color values. Lucky for us, we’re only really going to be working with black and white images, so this problem isn’t quite as much of a problem. But realistically, if we’re building an image recognition model that’s to be used out in the world, it does need to recognize color, so the problem becomes four times as difficult.

Now, if an image is just black or white, typically, the value is simply a darkness value. I guess this actually should be a whiteness value because 255, which is the highest value, is white, and zero is black. And, that means anything in between is some shade of gray, so the closer to zero, the lower the value, the closer it is to black. And, the higher the value, closer to 255, the more white the pixel is.

Now, this is the same for red, green, and blue color values, as well. If we get a 255 in a red value, that means it’s going to be as red as it can be. If we get 255 in a blue value, that means it’s gonna be as blue as it can be. But, of course, there are combinations. So, for example, if we get 255 red, 255 blue, and zero green, we’re probably gonna have purple because it’s a lot of red, a lot of blue, and that makes purple, okay? So this is kind of how we’re going to get these various color values encoded into our images.

Now, machines don’t really care about seeing an image as a whole, it’s a lot of data to process as a whole anyway, so actually, what ends up happening is these image recognition models often make these images more abstract and smaller, but we’ll get more into that later. To process an image, they simply look at the values of each of the bytes and then look for patterns in them, okay?

So if we feed an image of a two into a model, it’s not going to say, “Oh, well, okay, I can see a two.” It’s just gonna see all of the pixel value patterns and say, “Oh, I’ve seen those before “and I’ve associated with it, associated those with a two. “So we’ll probably do the same this time,” okay? So they’re essentially just looking for patterns of similar pixel values and associating them with similar patterns they’ve seen before. In this way, image recognition models look for groups of similar byte values across images so that they can place an image in a specific category.

Again, coming back to the concept of recognizing a two, because we’ll actually be dealing with digit recognition, so zero through nine, we essentially will teach the model to say, “‘Kay, we’ve seen this similar pattern in twos. “We’ve seen this pattern in ones,” et cetera. So when it sees a similar patterns, it says, “Okay, well, we’ve seen those patterns “and associated it with a specific category before, “so we’ll do the same.”

Now, I should say actually, on this topic of categorization, it’s very, very rarely going to be the case that the model is 100% certain an image belongs to any category, okay? That’s why these outputs are very often expressed as percentages. So it might be, let’s say, 98% certain an image is a one, but it also might be, you know, 1% certain it’s a seven, maybe .5% certain it’s something else, and so on, and so forth. So it’s very, very rarely 100% it will, you know, we can get very close to 100% certainty, but we usually just pick the higher percent and go with that.

Now, an example of a color image would be, let’s say, a high green and high brown values in adjacent bytes, may suggest an image contains a tree, okay? Now, if many images all have similar groupings of green and brown values, the model may think they all contain trees. So it will learn to associate a bunch of green and a bunch of brown together with a tree, okay? So this is maybe an image recognition model that recognizes trees or some kind of, just everyday objects.

Now, the unfortunate thing is that can be potentially misleading. There are plenty of green and brown things that are not necessarily trees, for example, what if someone is wearing a camouflage tee shirt, or camouflage pants? Well, that’s definitely not a tree, that’s a person, but that’s kind of the point of wearing camouflage is to fool things or people into thinking that they are something else, in this case, a tree, okay? So really, the key takeaway here is that machines will learn to associate patterns of pixels, rather than an individual pixel value, with certain categories that we have taught it to recognize, okay?

Now, we can see a nice example of that in this picture here. So, there’s a lot going on in this image, even though it may look fairly boring to us. There’s the decoration on the wall. There’s the lamp, the chair, the TV, the couple of different tables. There’s a vase full of flowers. There’s a picture on the wall and there’s obviously the girl in front. And, the girl seems to be the focus of this particular image.

Now, we are kind of focusing around the girl’s head, but there’s also, a bit of the background in there, there’s also, you got to think about her hair, contrasted with her skin. There’s also a bit of the image, that kind of picture on the wall, and so on, and so forth. So there may be a little bit of confusion. It’s not 100% girl and it’s not 100% anything else. And, that’s why, if you look at the end result, the machine learning model, this is 94% certain that it contains a girl, okay? It’s, for a reason, 2% certain it’s the bouquet or the clock, even though those aren’t directly in the little square that we’re looking at, and there’s a 1% chance it’s a sofa.

Now, I know these don’t add up to 100%, it’s actually 101%. But, you’ve got to take into account some kind of rounding up. Also, this definitely demonstrates how a bigger image is broken down into many, many smaller images and ultimately is categorized into one of these categories. So, in this case, we’re maybe trying to categorize everything in this image into one of four possible categories, either it’s a sofa, clock, bouquet, or a girl. And, in this case, what we’re looking at, it’s quite certain it’s a girl, and only a lesser bit certain it belongs to the other categories, okay?

So again, remember that image classification is really image categorization. Machines can only categorize things into a certain subset of categories that we have programmed it to recognize, and it recognizes images based on patterns in pixel values, rather than focusing on any individual pixel, ‘kay? So when we come back, we’ll talk about some of the tools that will help us with image recognition, so stay tuned for that.

Otherwise, thanks for watching! See you guys in the next one!

Interested in continuing?  Check out the full Convolutional Neural Networks for Image Classification course, which is part of our Machine Learning Mini-Degree.

How to Classify Images using Machine Learning https://gamedevacademy.org/image-classification-tutorial/ Fri, 01 Mar 2019 05:00:56 +0000

You can access the full course here: Build Sarah – An Image Classification AI

Transcript 1

Hello everybody, and thanks for joining me, my name is Mohit Deshpande, and in this course we’ll be building an image classification app.

Given a set of images, we’re going to train an AI to learn what these images are, and then we can actually assign them labels. So, you see some of what our data set is gonna kinda look like, you have things like trucks, cats, airplane, deer, horse, and whatnot. And so, when, what we will be building is an AI that can actually classify these images and assign them labels so that we know what’s in the image. And so, we can build an AI to do that. So, kind of the big topic here is all about image classification.

So first, I want to introduce you to what image classification is, in case you’re not familiar with it. I will also do like a quick intro to machine learning as well. And, kinda the first approach that we’re going to take is through this thing called the nearest neighbor classifier, and so we’ll kind of build the intuition behind how that works, and then write the code for that from scratch.

Then, we'll move on to something a bit more generic than that, and a bit better, and it's called a k nearest neighbors classifier. So, instead of just the nearest neighbor, you look at the top k closest neighbors, is kind of the intuition behind that. And I'm going to go into much more depth with that. And, for this we're actually going to use a pre-built classifier, whose code is already written, so it can get kind of complicated with that. Then, we're going to talk about hyperparameter tuning, because the question is then, you know, how do we choose the value of k, what is k, and so we're going to be discussing how we pick these values and the approaches that we can take to get the best possible hyperparameters.

And finally, I also want to discuss the CIFAR-10 dataset, and what’s really cool about CIFAR-10 is that it’s a very popular, widely-used, real dataset that people doing research in image classification use to, when they’re reporting their results. And so, it’s going to be really cool, because you’ll be using that same dataset that the top researchers have used before. So, we’ll also be looking at that CIFAR-10 dataset. So, we’ve been making video courses since 2012, and we’re super excited to have you onboard. Online courses are a great way to learn new skills, and I take a lot of online courses myself. Zenva courses consist mainly of video lessons that you can watch at your own pace and as many times as you want.

All the source code that we make is downloadable, and one of the things that I want to mention is the best way to learn this material is to code along with me. So, we highly recommend that you code along so that you can better learn the material, because there’s a big difference between watching someone code and coding yourself. And finally, we’ve seen the students who get the most out of these online courses are also the same students who make, kind of, a weekly planner or a weekly schedule and stick with it, depending on your own availability and your learning style. Remember that these video lectures, you can watch and rewatch as many times as you want. So, that really gives you more flexibility.

At Zenva we’ve taught programming and game development to over 200,000 students, over 50 plus courses, since 2012. And these students have used the skills that they’ve learned in these courses to advance their careers, start up a company, or publish their own apps and games. Thanks for joining, and I look forward to seeing the cool stuff you’ll be building. Now, without further ado, let’s get started.

Transcript 2

Hello everybody, my name is Mohit Deshpande and in this video I wanna give you guys an overview of machine learning.

And we’ll talk a little bit about where it came from, and towards the end I just wanna list a few different subfields within machine learning that have a lot of ongoing research currently going into them. So that’s what I’m gonna be talking about in this video.

So let’s get started. So before we had machine learning or actually just artificial intelligence in general, AI, computers were very unintelligent machines. Because even though they were really good at computing large numbers or performing large computations and things of that nature, even though they could do those really fast, they had to be told exactly what to do.

And so like I said, that’s something worth writing down: before AI, computers had to be told exactly what to do. You had to account for every possible input or change in your machine state or something like that, you had to account for every single possibility. And that became tedious very fast because there were cases where it becomes incredibly time consuming to have to hard code in your program all of these possible configurations or possible inputs. And that also adds to the length of your program.

And so way back then, before AI, it’s something that you just had to do, or you had to have some sort of fail safe condition or something like that. But then people started asking the question: instead of telling computers exactly what to do each time, can we teach them to learn on their own? And as it turns out there was a lot of stuff going on in science fiction particularly; authors and writers in science fiction were starting to depict robots as sentient beings. They looked like mechanical men, is I guess what the term was, but that eventually turned into robots.

And they had all these futuristic stuff with robots like they could greet you and shake your hand and they just had this repository of knowledge that they could draw from and they were sentient, they knew that they were, they knew their own existence and everything and they learned. And that’s probably the most important aspect of the thing that AI researchers were taking from science fiction is that robots could learn. And so then they started getting into, how can we model knowledge and how can we get some kind of representation with which to learn. And that starts getting into this period of time when we were doing stuff called classic AI, classic AI. And that was actually more centered around intelligent search instead of actual learning.

So what I mean by that is let’s suppose that we were playing a game, something simple that we all know, so tic-tac-toe or something. Where let’s say that I am the blue circles. So and suppose I play a move here and then it’s the computers turn and so then the computer has one, two, three, four, five, six, seven, eight, the computer has eight possible places where it can put an X.

So what classic AI was trying to do is it will try every one of these possible combinations and then it’ll try to predict. So if the X was put here for example, then after that X was played it’ll try to predict what my move is. Then maybe I’ll play something like this, and then from there the AI could make one, two, three, four, five, six different moves. And so it tries each one of them and eventually you get this giant search space basically, where you’re looking at every single possible way that the game could be played out from the human just playing a single O here. I mean there are so many possible combinations.

And as it turns out there are different techniques with which you can actually get this working reasonably well. You can write an AI to play tic-tac-toe with you such that it will choose the best move to try to prevent you from winning. That’s actually called a minimax strategy. But anyway, you can build this and it’s actually not that hard to do and it runs reasonably fast. And so this is something that you can build, but this is for something like tic-tac-toe, which is a really simple game.
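To give a sense of what that search looks like in code, here is a minimal minimax sketch for tic-tac-toe. The board representation and scoring convention are simplifying assumptions for illustration, not code from the course.

# A rough minimax sketch for tic-tac-toe: score every possible continuation.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    # board is a list of 9 cells containing 'X', 'O', or None
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Best achievable score from X's point of view: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w == 'X':
        return 1
    if w == 'O':
        return -1
    if all(cell is not None for cell in board):
        return 0  # draw
    scores = []
    for i in range(9):
        if board[i] is None:
            board[i] = player                     # try this move
            scores.append(minimax(board, 'O' if player == 'X' else 'X'))
            board[i] = None                       # undo it
    # X picks the move that maximizes the score, O the one that minimizes it.
    return max(scores) if player == 'X' else min(scores)

With X to move, the AI would then simply pick whichever empty cell leads to the highest minimax score.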

Imagine if we had something like chess. That’s the wrong color there. I mean, imagine if we had something like chess where it’s not just eight possible moves, it’s so, so many moves. Tons and tons of moves on this chess board. And as it turns out, I think sometime in the mid-1990s, one of IBM’s machines, Deep Blue I think is what it was called, actually ended up beating the world chess champion or something similar to that.

And so trying to do this classic AI stuff with search breaks down when it comes to large games like chess, or even larger games like Go, an ancient Chinese game that’s often played and has even more possible moves and configurations than chess. At some point it just becomes intractable: the number of possible ways a game could be played out is so big that it would either, one, use up all the RAM on your computer and crash, or two, computing all of this stuff out would take much, much longer than you could actually play a game in. And so search is not a good thing to really do, but back then it was the only viable option at that time.

There was some dabbling going on in actual learning, but a lot of the stuff with classic AI was using search, different kinds of searching algorithms, and so you could have it play tic-tac-toe or chess or something. But relatively recently, I should say, there’s been this move away from search and towards actual learning.

So we move towards actual learning. So instead of looking at all possible configurations, we start training an AI, we start teaching an AI by giving it lots of example data that it can draw from, and so when it gets new input data it knows, because it’s seen previous data, what to do with this new problem. So if you have a particular problem when you’re training an AI, you give it lots of examples of the problem and then it can start learning ways that it can approach that problem. I am speaking in the abstract sense here because I wanna make this as general as possible; there are a lot of different subfields, and I don’t wanna get too specific because then it won’t apply to some of them.

But right, so when we’re trying to solve a problem we train an AI, and then the AI has seen examples of how to solve the problem, so given some new input it can reason through how to solve that problem. So that’s a broad level overview of machine learning. But there are actually a few subfields within this.

Machine learning itself is a fairly big field. So there’s research going on into, I’m sure you’ve heard of, neural networks; I think they’ve been in the news at some point. Neural networks try to take the more biological route and they try to model what’s going on in our brains. Albeit it’s a very overly simplistic model, it’s still a model, and it turns out that it works really well. It turns out we can also break down neural networks into things like language with recurrent neural networks or vision with convolutional neural networks.

But we could branch this off even further. There are people researching deep learning specifically, and that’s kind of related to neural networks, but with deep learning the issue is how deep can we make these neural networks, how many layers can we go, and what kind of challenges do we encounter as we make these layers really deep? And so they’re trying to find solutions for that. There’s also stuff going on with reinforcement learning, which is pretty popular. And reinforcement learning is actually very popular to use for teaching AI to play games; I think, if you look around, there’s an AI that can actually play through the original Super Mario Bros. or something like that.

And reinforcement learning helps let you build that kind of model. I think they’ve also built reinforcement learning models that can play Asteroids and a ton of the old Atari games fairly well, too. So right, these are just some of the subfields. I can’t possibly list all of them because it’s a really big field, but we’re just gonna stop right here and do a quick recap.

So with machine learning: before AI, computers weren’t very intelligent, we had to tell them exactly what to do, and this became impossible in some cases because you can’t think of all possible configurations or inputs that you can get. And so this is when we start getting into classic AI. But even with classic AI we were technically just doing searching, we weren’t actually learning anything. And now we’ve moved from search more to learning, where we’re actually learning knowledge representations and using those.

And I just mentioned a couple subfields of machine learning here with neural networks, deep learning and reinforcement learning to show you that this is a very popular field at this point and it’s a very, very rapidly expanding field. And you can definitely expect many more cool advances to come in the future.

Transcript 3

Hello, everybody, my name is Mohit Deshpande and in this video, I want to introduce you guys to one particular subfield of machine learning and that is supervised classification and so, classification is a very popular thing to do with machine learning.

So, let me actually define this. So, classification is the problem of trying to fit new data… I should make this a bit more specific, I should say, fit or label new data based on previously seen data. This seems kind of like a weird description at this point, but with classification, the task is this: we’ve seen a lot of data and it’s labeled, and given some new data, we want to give it a label based on some of the previously labeled data that we’ve seen. I should mention that… I’ll put it over here, actually. I should mention that with classification, we have discrete classes or labels for each data point or input, and so, let me illustrate this by an example. So, suppose I have a… That was a really bad line.

Suppose I have like a scatter plot, over here or something. Let me just add in some stuff here. So, I’m just adding in a ton of red x’s and then, we’ll add like, blue circles, over here. So, this data is labeled so, these will actually correspond to actual points. So, this for the X direction and this for the Y direction. These would correspond to actual points. I haven’t actually like, plotted all the points, but trust me, they correspond to actual points and you see, I’ve labeled them. I’ve labeled them, but they’re only two classes and there is the red X and the blue circle. If I wanted to, I could add, like some other class, like a green triangle. We’ll add a couple green triangles or something, up here.

So, there’s three classes. We’ll say there’s three classes and so, I have all these points and they’re labeled and so, the problem with classification is now that I have these points, if I received some new point, what label would I assign to it? Would I assign to it a red X, a blue circle or a green triangle? So, what we’re trying to do with classification is to find a way and to build a model so that given this new input, we can actually assign it one of these labels.

So, let’s just do a human intuitive, example kind of thing. So, suppose my point, I’m gonna put in, let’s see, purple. If my point was in here, or something. Suppose this is my new point, here. So, with this being my new point, I would ask the classifier what label should I assign to this? Should it be a blue circle, a red X or a green triangle?

And so, as a human, if you were thinking about this, if I gave you this point and I asked you what you would assign it, you would say, “Well, I would assign it as a blue circle.” And I would ask you, “Well, wait a minute. Why would you assign it as a blue circle?” and you’d say something probably along the lines of “Well, if I look at what’s around it, there are lots of blue circles around here.” And it turns out, I guess, this region of the plane here tends to have more blue circles than red X’s, so I can try to carve out this portion over here that seems to have a lot of blue circles. So, this is probably the label I would assign this point, and it turns out that if you were to give this to a classifier, it would probably give this a blue circle.

So, I say, “All right. Now, what about a point over here?” And so, you would say, “Well, I would give that a red X.” When I ask you again, “Why would you give it a red X?”, you give the same answer. You say, “Well, in this portion of the plane, around that new input point, there are a lot of red X’s, and so I would think that it would most likely be given a red X.” And so, that’s right, and now I can do the same thing where I say I have a point up here or something, and you’d say, “Well, in this part of the plane over here, you’re more likely to encounter a green triangle than you are any of these.”

And so, I would probably give this point a… Probably say that, that new point should be a green triangle and so, this is kind of like, the thought process that is going on with these classifiers and so, what you use to make your decision, was this kind of… I kind of drew it in, here.

This kind of imaginary boundary sort of thing between our data is called the decision boundary. The decision boundary, right here, helps us make decisions when it comes to supervised classification, because we can take any sort of input data, find some way to put it on a plane like this, and then just find what the decision boundary is and plot it. And so, with a lot of classification algorithms, what they try to do is find this boundary; that is what they’re all concerned about, because once you have this boundary, then if you get a new point, it’s fairly easy to classify.

You can say, “Well, I want this part of the boundary to be blue, this part of the boundary to be red, and this part of the boundary to be green,” so if you get points that are inside one of these regions, you just give them the label of what’s around there. And so, this is what supervised classification algorithms try to find: some kind of boundary. It might not be the case that you have such nice, two-dimensional data like this, but there are ways that you can fit it onto a plane.
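To make that “look at what’s around it” intuition concrete, here is a tiny from-scratch sketch that labels a new 2-D point with the label of its single closest labeled point. The points, the labels, and the use of a nearest-neighbor rule in place of an explicit decision boundary are all illustrative assumptions, not the course’s code.

# Toy nearest-neighbor intuition: a new point takes the label of the closest known point.
import numpy as np

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2], [5.1, 4.9]])
labels = np.array(['red_x', 'red_x', 'blue_circle', 'blue_circle'])

def nearest_neighbor_label(new_point):
    distances = np.linalg.norm(points - new_point, axis=1)  # Euclidean distance to every labeled point
    return labels[np.argmin(distances)]                     # label of the closest one

print(nearest_neighbor_label(np.array([4.8, 5.0])))  # prints 'blue_circle'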

I’m not gonna get into that too much, but here’s a question. What if my point was, like, right over here? Then it’s not so obvious as to whether it is a blue circle or a red X, and so, you know, there’s some inherent confidence value or some measure that says, “I think that this is a blue circle with this confidence or with this probability,” and so even the points that we were classifying here had that.

Even though it seemed kind of obvious that around them there are blue circles, there is some inherent uncertainty about this, and it turns out that, well, for each of these points, there is a chance that it could have been a red X or it could have been a green triangle, but that chance was very, very low and we only assigned it the label that has the maximum chance. So, it’s not necessarily the case that this must be a blue circle; instead, we say that this was, with high probability, a blue circle, and so you can’t be 100% certain.

If you look at this point, over here, it becomes clear that this could be a red X or this could be a blue circle. It just kind of depends on what this boundary specifically looks like, but given new inputs I want to be able to, like give them one of these labels, here.

So, this is where I’m going to stop right here, and I’ll do a quick recap. Supervised classification is a subfield of machine learning, and the problem that we’re trying to solve is this: we have labels on our input data, and now that we’ve seen our data, given some new input, we want to give it a label based on the labels that we already have. That is kind of the problem of supervised classification.

We want to fit or label some new input based on what we have already seen before and so, I kind of gave this example of, like, if we had red X’s, green triangles and blue circles, given the new point, how would you figure out if it is one of these categories and we use these things called decision boundaries to try to get that and figure it out. So, that is supervised classification.

There are tons and tons of algorithms that can do this. Some of them work better than others. It all depends on what kind of data you’re looking at, but the point is that there are lots of different algorithms for this, and so you can take a look around and see if there’s one that you want to know more about. But anyway, this is the problem of supervised classification.

Transcript 4

Hello, everybody. My name is Mohit Deshpande, and in this video, I want to define this problem called image classification, and I want to talk to you about some of the challenges that we can encounter with image classification, as well as, you know, get some definitions kind of out of the way and sort of more concretely discuss image classification.

So first of all, I should define what image classification is, and what we’re trying to do with image classification is assign labels to an input image. So this kind of fits the scheme of supervised classification in general: given some new input, we want to assign some labels to it. There are some challenges specific to images that we have to talk about, but before we really get into this, I want to remind you that images just consist of pixels, so remember again that the computer just sees this grid of pixels, and what we’re trying to do is give this grid labels like “bird”, for example.

Suppose I have an image of a bird over here or something like that. I have some picture of a bird, and what I want to do is give this to my classifier, and my classifier will tell me that the label that can most closely be tied to this image is “bird”. And so that’s the goal of image classification: we’re trying to add some higher level meaning to this image. In fact, what we’re trying to do is determine what is inside of an image, and that’s what these labels are.

These labels tell us what is inside of the image. Not just random labels, but for image classification we want to know, we’re particularly interested as to what is inside of this image, but this isn’t an easy problem by any means. And so there’s some challenges that are specific to, there’s some challenges, I misspelled that. I forgot about the “n”, there should be an “n” in there.
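Before getting into those challenges, here is a quick illustration of the grid-of-pixels idea: loading an image with OpenCV just gives you an array of pixel values. The file name below is a hypothetical placeholder, not a file from the course.

# The computer's view of an image: a NumPy array of pixel values.
import cv2

image = cv2.imread('bird.jpg')   # hypothetical file name
print(image.shape)               # e.g. (height, width, 3) for a color image
print(image[0, 0])               # the pixel value at row 0, column 0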

Challenges specific to image classification so I just want to talk about a couple of them. We won’t get to all of them, but one particular challenge here is scaling and that is if I have a picture of a bird, if I have a picture of a small bird as opposed to when I feed my classifier the same picture, but it’s now maybe doubled in size, then my classifier should be robust to this. I should be able to take an image, and there shouldn’t be any dependence on size. If I give you a picture of a small bird, I can give you a picture of a large bird and it should be able to figure out either which bird that is or that this is a bird, right? If I give this an image of some object or something.

So suppose my class, I should probably define some of these class labels. So suppose my class labels, I don’t know, suppose my class labels are something like “bird”, “cat”, or “dog”. These are just like some example class labels, for example. So if I give it a picture of one of these things, and depending on if it’s a big dog or a small dog, it should be able to identify this as a dog. If I give it a picture of a small cat or a large cat, it should still be able to identify this as a cat. And so there’s challenges with scaling.

There’s this other challenge called occlusion. Occlusion is basically when part of the image is hidden behind something else, so that would be like if I had a picture of a bird and maybe a branch or something is in the way and it’s covering up this portion here. We want our classifier to be robust to things like occlusion; this is a pretty big challenge, because no matter what part of the object you see, we have to make our classifier robust to it.

So occlusion is like a part of an image and it’s hidden behind something else like for example, like this tree branch that’s blocking half of my bird or something. I still want to classify this as a bird so that’s kind of the challenge of occlusion. I guess we can do one more. Another good one is illumination. I can’t spell today, I guess. Illumination is what I mean, and illumination is lighting. Illumination is basically lighting so depending on my lighting conditions of whenever the input image was taken, I still want to be robust to that kind of thing.

I don’t want my image to be classified poorly because my cat is standing in sunlight or something like that, or my cat is in darkness, or it’s a cloudy day when I photograph my bird; I don’t want that. I want my classifier to also be robust to illumination, and there are so many more challenges with image classification that make it kind of difficult, and so there’s still research going into finding ways to be more robust to some of these challenges.
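One common way to build some of that robustness is data augmentation: training on resized and brightness-shifted copies of each image so the classifier sees scaling and illumination variation. Here is a minimal sketch with OpenCV and NumPy; the specific scale factors and brightness shifts are arbitrary assumptions, and this is one possible approach rather than the course’s method.

# Generate scaled and brightness-shifted copies of a training image.
import cv2
import numpy as np

def augment(image):
    augmented = []
    # Scaling: resize the image down and up.
    for factor in (0.5, 1.0, 2.0):
        augmented.append(cv2.resize(image, None, fx=factor, fy=factor))
    # Illumination: brighten and darken the image, clamping to valid pixel values.
    for shift in (-40, 40):
        augmented.append(np.clip(image.astype(np.int16) + shift, 0, 255).astype(np.uint8))
    return augmented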

And so to build a really good classifier, we need to take a data-driven approach, and what I mean by that is we basically give our AI tons of labeled examples. So for example, if we were doing this thing that differentiates between these three classes, we would give our AI tons of images of birds and tell our AI that this is a bird. We give our AI tons of pictures of cats and say, “This is a cat”. We give our AI tons of pictures of dogs and we say, “This is a dog”.

Alright, so with the data-driven approach, we want to give our AI labeled example images, and these labeled images are also commonly called ground truths. They’re called ground truth because when we go to evaluate the classifier, we actually compare what the classifier thinks an image is to the actual truth of what the label on that image is. We compare the prediction to the ground truth and see how well our classifier is performing.
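For example, evaluating against ground truth can be as simple as comparing the predicted labels to the known labels. The labels below are made up for illustration, not results from the course.

# Compare predictions to ground truth to measure accuracy (made-up labels).
import numpy as np

ground_truth = np.array(['bird', 'cat', 'dog', 'cat'])
predictions  = np.array(['bird', 'cat', 'cat', 'cat'])

accuracy = np.mean(predictions == ground_truth)
print(accuracy)  # 0.75 -> the classifier got 3 out of 4 right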

So yeah, we want this to be data driven, so we take this approach by giving our AI lots of labeled example images, and then it can learn some features off of that. If you want to take this approach, however, you can’t just give it two images of a bird or two of each and be done with it, right?

The more good training data that you have, the more high quality training data that you give your AI, the more examples that you give your AI, the better it will be to discriminate between bird, cat, dog. To make that distinction between these classes, you want to give lots of high quality examples to your AI.

And I’m going to talk about this a bit more, but this data set isn’t necessarily something you have to collect yourself; there are tons of image classification data sets online. I mean, ImageNet has a few million images across tons of different classes. There are much smaller data sets, of course. There’s the CIFAR-10 data set, which has 10 different classes and, I think, maybe 60,000 images, but the point is that lots of good quality training data is always preferable to some super complicated classification algorithm.
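If you want to poke at CIFAR-10 yourself, it can be downloaded programmatically; one convenient route is through Keras. This is a sketch that assumes TensorFlow/Keras is installed, and it is not the loading code used in the course.

# Load CIFAR-10: 60,000 32x32 color images across 10 classes.
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3): 50,000 training images, 32x32 pixels, 3 color channels
print(y_train.shape)  # (50000, 1): one integer class label (0-9) per training image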

So that kind of illustrates that with image classification we want this to be data driven. There’s no way to hard code this for every bird or for every cat or for dog. Hard coding would not be a good approach so we’re taking the more data driven approach by giving our classifier lots of examples with labels on them so it can learn what a bird looks like and what a cat looks like, and so on.

So that’s where I’m going to stop right here, and I’m just going to do a recap real quick. With image classification, we want to give labels to an input image based on some set of labels that we already have. So suppose I have three labels like “bird”, “cat” and “dog” or something; given a new input image, I want to say whether it’s a bird, a cat, or a dog, that is, I want to assign that label. And computers only see the image as pixels, so we have to find some way to build a classifier given just these pixel values, and there are lots of challenges that come with that.

Like I mentioned scaling, that’s if you have a big bird or a small bird, you want to be able to still say that it’s a bird. There’s occlusion. If I have a tree branch in the way, or something like that, I still want to classify this as a bird. There’s illumination, if I have like a dog, it’s standing in direct sunlight as opposed to a dog in a darker room or something. I still want to classify that as a dog. And kind of, that also gets into other challenges like what’s going on in the background. You want a very sterile background when you’re getting training data. You don’t want a lot of background clutter because that could mess up your classifier. It might learn the wrong thing to associate with your label that you’re trying to give.

But anyway, moving on, a good approach to doing this is the data-driven approach, and that is we give our AI lots of labeled example images. We give it lots of images of birds and tell it that this is what a bird looks like. We give it lots of images of cats and we say, “This is what a cat looks like”, and so forth for a dog and for any other classes that you might have. We give it these example images and it will learn some representation of what a bird is and what a cat is and what a dog is, and given that, it can generalize, and when you have a new input image, it will do its function, and that is to label it as one of these labels, or give it one of these labels, I should say.

So I’m going to stop right here and what we’re going to do in the next video, I want to talk probably the simplest kind of image classifier that’s called the nearest neighbors classifier so I’m going to talk about that in the next video.

Interested in continuing? Check out the full Build Sarah – An Image Classification AI course.  You can also check out our Machine Learning Mini-Degree and Python Computer Vision Mini-Degree for more Python development skills.

]]>
A Comprehensive Guide to Optical Flow https://gamedevacademy.org/optical-flow-tutorial/ Fri, 22 Feb 2019 05:00:18 +0000 https://pythonmachinelearning.pro/?p=2266 Read more]]>

You can access the full course here: Video and Optical Flow – Create a Smart Speed Camera

Part 1

In this lesson, you will learn the basics of videos, and how function notation can be applied to find pixel intensities of videos.

Videos are a sequence of images (called frames), which allows image processing to handle videos.

The rate at which the images change is called the frames per second, and is known as the FPS of the video. A common example of this is 60 FPS, which means the video shows 60 frames every second.

For images, a function I can be applied to the image so that I(x, y) = p, where x and y are coordinates, and p is the pixel intensity.

For videos, there needs to be additional information to find the pixel intensity, as there are numerous frames to choose from. The parameter t needs to be added, with t being when in the video the desired frame is located. Then, adding t to the image function, I(x, y, t) = p for videos.
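To see the I(x, y, t) idea in practice, here is a small OpenCV sketch that jumps to a particular frame of a video and reads one pixel’s intensity. The video file name, frame index, and coordinates are hypothetical placeholders.

# Read the pixel intensity p = I(x, y, t) from frame t of a video.
import cv2

cap = cv2.VideoCapture('video.mp4')     # hypothetical video file
cap.set(cv2.CAP_PROP_POS_FRAMES, 30)    # jump to frame t = 30
ok, frame = cap.read()                  # frame is a NumPy array of pixels
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
x, y = 100, 50
print(gray[y, x])                       # NumPy indexes rows (y) first, then columns (x)
cap.release()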

Part 2

In this lesson, you will learn about the basics of optical flow and the mathematics of it.

For images, the only information that you can access is the spatial positioning of pixels in relation to other pixels. The benefit of videos over static images is that they add temporal information, so that you know not only where a pixel is spatially, but also when it exists. This is what allows optical flow to function.

Optical flow is a computer vision technique that is used to track the apparent motion of objects in a video. Optical flow can be used to track objects, stabilize and compress videos, and allow AI to generate descriptions of videos.

Optical flow functions by tracking a pixel through consecutive frames. This allows for the path of that pixel’s movement to be generated, as shown in this image.

Optical flow shown going through pixels

Optical flow makes two assumptions that drastically simplify this process. These are:

  1. Pixel intensities don’t rapidly change between consecutive frames.
  2. Groups of pixels move together.

These two assumptions apply when a video is smooth, as pixels should change gradually and not teleport around the image. These assumptions apply for almost all real world videos, but can be broken if someone specifically edits a video to make that occur.

When analyzing videos, the pixel you are tracking will be displaced from its original position by some value u in the x direction and some value v in the y direction, as shown in the image below. The goal of optical flow is to find the u and v values, as they allow you to create a displacement vector and track the pixel’s path.

Mathematical representation of pixel being displaced

Part 3

In this lesson, you will continue to learn about optical flow, with this lesson providing more information about the mathematics behind this technique.

To recap from the previous lesson, the point of optical flow is to find the displacement vector of a pixel, with u representing the change in the x direction and v representing the change in the y direction. In the image, t represents the time at which the frame occurs, with Δt representing how much time has passed since the previous frame.

Using the function notation from the earlier lessons, the pixel intensity for the frame at time t can be stated as I(x,y,t). The pixel intensity for the second frame can be stated as I(x+u, y+v, t+Δt).

Image showing how to determine pixel intensity

These two functions should be equal, due to the first assumption of optical flow, meaning that I(x,y,t) = I(x+u, y+v, t+Δt).

After using some calculus, this equality turns into the equation $I_x u + I_y v + I_t = 0$. $I_x$ represents how much the frame changes horizontally, $I_y$ represents how much the frame changes vertically, and $I_t$ represents how much the frame changes over time (the intensity difference between the two frames).
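For the curious, the skipped calculus is just a first-order Taylor expansion (this is a sketch of the standard derivation, not part of the original lesson): expanding the right-hand side gives $I(x+u, y+v, t+\Delta t) \approx I(x, y, t) + I_x u + I_y v + I_t$, where $I_x$ and $I_y$ are the spatial derivatives of the frame and $I_t$ is the intensity change over the elapsed time $\Delta t$. Setting this equal to $I(x, y, t)$, as the first assumption allows, and cancelling the common term leaves $I_x u + I_y v + I_t = 0$.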

$I_x$, $I_y$, and $I_t$ can all be computed, so the equation comes down to solving for u and v. There are numerous methods to solve this, but as they require calculus and linear algebra, this course will not be covering all of them. Certain techniques which will allow you to find u and v will be addressed in later lessons.
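As a preview of one such technique, OpenCV ships an implementation of the Lucas-Kanade method via cv2.calcOpticalFlowPyrLK. The sketch below is illustrative only; the video file name and parameter values are assumptions, not the course’s final code.

# Track good feature points from one frame to the next with Lucas-Kanade.
import cv2

cap = cv2.VideoCapture('video.mp4')            # hypothetical video file
ok, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)

# Pick some good corner points to track in the first frame.
p0 = cv2.goodFeaturesToTrack(old_gray, maxCorners=50, qualityLevel=0.3, minDistance=7)

ok, frame = cap.read()
frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Estimate where those points moved to in the next frame.
p1, status, err = cv2.calcOpticalFlowPyrLK(old_gray, frame_gray, p0, None,
                                           winSize=(15, 15), maxLevel=2)

# The displacement (u, v) for each successfully tracked point.
displacements = (p1 - p0)[status.flatten() == 1]
print(displacements)
cap.release()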

 

Transcript 1

Hello everybody, my name is Mohit Deshpande, and in this course we’ll be building an app that can track objects through video and actually determine what their speed is based on certain properties of our camera.

So we’ll be able to build this. And I’ve shown this visualization here. You can see that I have a pencil here that I’m waving, and the points are being tracked on that pencil, and we can get the speed readings from those points. And so that’s what we’re gonna be building in this course. And you can try this out with other objects as well.

So kind of the big topic that we’re gonna be discussing is called optical flow. And optical flow allows us to take points in video and track them through each frame. So we’re gonna discuss how that kind of works. And we’re also gonna talk a little bit about camera intrinsics, because we have to know a little bit more about our own cameras before we can get accurate speed measurements. And lastly, I’m gonna show you how we can visualize these optical flow patterns.

What I mean by that, and you saw it in the previous slide, how I can kind of draw on top of my frames and we can kind of draw a path. So we’ve been making video courses since 2012, and we’re super excited to have you on board. Online courses are a great way to learn new skills, and I take a lot of these courses myself.

So Zenva courses consist mainly of video lessons, and you can watch and re-watch them as many times as you want, at your own pace. All the source code we write is downloadable, by the way. And when I’m coding this stuff in the videos, I really, really recommend that you code along with me, because coding along helps you learn the material better than just watching code. The last thing that I wanna mention is that students who get the most out of these online courses are the same students that make a weekly plan or schedule and stick with it based on your own availability and whatever your learning style is. And remember that these videos, you can watch or re-watch them as many times as you want, so that kind of gives you a lot of flexibility.

So at ZENVA, we’ve taught programming and game development to over 200,000 students, over 50 plus courses since 2012. And a lot of these students have used the skills that they’ve learned in these courses to advance their own careers. Some have started companies and published their own games and apps. So thanks for joining, and I look forward to seeing the cool stuff you’ll be building. But now, without further ado, let’s get started.

Transcript 2

Hello, everybody. My name is Mohit Deshpande. And in this video, I want to first discuss what videos are, and we’re going to have kind of a change in notation before we get into optical flow because it just becomes easier to work with and understand optical flow if we have this change of notation kind of thing.

So before we can talk about it though, we have to formally define videos, and that’s because optical flow actually operates on videos. So we need to have a good understanding of what is a video before we can really get into optical flow. So we know that we can represent images as matrices for grayscale, but what about video?

So what is a video? To kind of answer that question, think about some of the videos that you’ve seen at the movies, or on your TV, or your computer, tablet, phone, YouTube, and whatnot. It seems that when you watch a video on YouTube or something, you can pause it and you get a still image, right? And so what does that kind of imply about videos? Well, you can take a scrubber or hit the rewind or forward button and kind of go through each of the still frames. So that reveals something enlightening about video, and that is that video is just a sequence of images.

Sequence of images. Because, you know, you can pause a video and kind of scrub back and forth using some sort of scrubber or a fast-forward, rewind sort of thing. And so you’ll notice that videos are just a sequence of images and they’re played fast enough that they don’t look like still images. They look like one fluid motion into what we call a video. And this also becomes apparent if you ever had like some kind of sticky notes, like a flip book, or something.

You can draw each image and when you flip through them fast enough, it looks like one continuous animation, for example, like that. So that kind of gives you the background for videos. They’re just a sequence of images. And in fact, if you’ve heard of something like FPS, right? That’s actually called frames per second. And that’s just like the speed. That tells us how many frames are being shown in one second.

In the context of videos, these still images, we actually call them frames. So we say that video is actually comprised of different frames. Rather than images, we just say frames. And so when you see something like FPS, that means frames per second. That’s how many frames are shown per second. And you know, you might have heard values for this like 60 FPS is a very common number to see next to frames per second, and that’s saying that you see 60 frames in one second. That’s 60 frames, 60 still images in one second.

So this is moving pretty fast. So the frames are moving pretty fast. You can also have lower FPS, like 30 FPS or 15 FPS. And as you decrease the FPS, it becomes clearer that you’re looking at still images played in succession rather than one continuous motion. But anyway, so that’s just kind of an intuitive understanding of videos.

But now, if we want to formally discuss videos, then we have to change the notation a bit. Particularly, we have to go from matrix notation to function notation, because if we try to stick with matrix notation when we’re discussing videos, then we have to talk about these things called tensors, and that just gets way beyond what we need. It kind of gets out of control at that point. So let’s just start with images first and then convert from images to video. We’ll start with this notation. So with images, we know that they’re matrices, and we can represent grayscale images as matrices.

And so what we can do is, analogously, define a function, I’m going to call it capital I, that takes an X and a Y coordinate and returns some pixel value or intensity. And it turns out when it comes to videos, we’re only concerned with pixel intensities, so we can drop any colors and just consider grayscale. But it turns out that this is an analogous notation to matrices. So this is, you know, kind of the function notation here for images. And it’s kind of like if you were trying to find a book or something in a library, whether that’s a physical library or like some online library.

You know, if we wanted to find a particular book, we need some information about it like its book number, the genre, the author’s name, or whatnot, et cetera. But when we have that information, we can find the book that we’re looking for. Each book in our library is uniquely identified by some group of attributes or values. Usually, that’s something like title, author, and book number, or something like that. So you know, if I tell you three values, it points to a unique book. It’s never the case that if I tell you three values, you have two possible books you can choose from. It’s always that if I tell you some set of values then you get the unique book.

When we’re using this notation, think of our library as the image and that X, Y as the information that we need. And so this function I is kind of analogous to the act of finding a book that we want or finding a particular pixel intensity that we want. So this is saying that, you know, we’re basically getting the pixel value at X comma Y. And you can think of these, with the old matrix notation, as kind of being the indices.

If we have an image, then we can just like have an image here. And then at a particular location, X comma Y in our image if we think of this as a coordinate plane, there’s some pixel intensity P that’s there. So that is the kind of notation for images. So like, you know, given an X and a Y, X and Y are unique, and so we can get the pixel value. But for videos, this is a bit different ’cause we can’t just use X and Y because we have that additional component of, well, in which frame are we doing this look up?

So, this works well for a single image, but when you have a sequence of images like a video, then we can’t just use X, Y. We actually need one other parameter, and I’m going to call that T, and then this will get us a particular pixel value. And so this T represents when in the video, I should say. So this kind of represents a “when” in our video: in which frame do we look at the X and Y to get a particular pixel intensity.

So this is kind of like a spatial position here and this is a temporal position here, ’cause this has to do with the spatial positioning of the pixels in a particular frame, and this has to do with when that frame is. And so that’s why for videos we need three values here instead of just two for images, because just using X and Y isn’t sufficient for finding a particular pixel in a video: we don’t have a single image, we have a sequence of images, so we need another parameter to tell us where in the duration of the video we are finding this pixel. But anyway.

So for video, we need three parameters. So this is kind of the notation that I’m going to be using for the rest of this course because for optical flow, it just becomes easier to deal with this I function rather than dealing with matrices, like I said, ’cause then we have to get into tensors and it just gets kind of out of control at that point. But anyway. So going forward, we’ll be using this notation to look at optical flow. So I guess I’m going to stop right here and do a quick recap.

So in this video, we discussed, well, videos, and we defined them as being just a sequence of frames. These frames are just, you can think of them as being, still images. And they’re played so fast that they appear to be one continuous motion. And so, you know, frames per second is a common measure of this, and frames per second tells you how many of these frames you see in one second. A common value for this, like I said, is 60: you’re seeing 60 frames in one second, so that’s pretty fast.

So I just kind of gave you an intuitive understanding of videos, then we moved onto this change of notation from matrix notation to function notation here. So I just described this I function as basically like a lookup table. So we go to the pixel at coordinate X comma Y in our image starting, you know, at the standard image coordinate system. And then P is the pixel intensity that we get at coordinate X, Y. And then for videos, I said that X, Y just isn’t sufficient, so we need another parameter, T, and then tells us when or what frame we’re looking at, basically. So when in the duration of the video do we look. So kind of moving forward, we’re going to be using this notation.

So that’s where I’m going to stop with this video, and then actually in the next video, in the next sequence of videos, I’m going to kind of give you an intuitive understanding of optical flow, and so we’ll kind of get through that. So we’re going to start optical flow in the next video.

Transcript 3

Hello everybody, my name is Mohit Deshpande, and in this video I’m just gonna start the sequence of videos that are gonna kind of give you an intuitive and complete understanding of optical flow. We’re also gonna get a little bit into the mathematics, particularly the actual equation for optical flow, and I’ll talk about some of the stuff there. I won’t be going too far into the mathematics; I want to start off by giving you more of an intuitive understanding first, and then we can kind of take that intuition and solidify it into more concrete math terms.

So first of all, we have to kind of talk a little bit about what this is and the motivation behind this. So images are great. Videos are even cooler, because we have more information in a video than in an image. Because in an image we just have the spatial positioning of the pixels. That is where they are in the image relative to each other. So that’s all we get in an image but in a video, we get the same spatial information but we also add an additional temporal component. Meaning we not only have the location of a pixel spatially, but we also have when does this pixel exist? Maybe it’s only at one particular frame. Maybe it exists for a sequence of like 50 or 100 frames or something like that.

You get this additional information about the time duration of pixels, and when you have this additional information it really opens up a lot of doors as to what we can look into.

And so optical flow is one of these doors. It is, I’ll just write this down, a computer vision technique that is used to track the apparent motion of objects in a video. So using this technique of optical flow, we can actually take a pixel, or you can make this more generic, like an object, and we can track it through the video. And so, you know, later we can draw kind of like a path. I’m gonna show you how to draw this in just a second, but optical flow is actually really interesting because, first of all, it’s not just used for things like object tracking through videos.

It actually has a ton of different applications that we’re gonna be discussing in a later video, like video compression and video stabilization, and just recently there’s been some research that is using optical flow patterns to help give descriptions of snippets of video. You can give this AI a snippet of video, and it will generate a description of that video in language, and that’s really cool; it turns out that optical flow features are actually pretty useful for this sort of thing. But we’re gonna discuss this stuff at a very, very top level in a later video, so let’s first derive the intuition behind optical flow. So remember that if you want to track an object through a video, on a computer level we only have access to these raw pixels.

So suppose I have a frame here, here’s one frame, and then here is another frame. We only have access to the raw pixels, and I kind of drew these, these probably should be the same size, but here is, oops, here is a frame t, maybe here’s a frame t plus one. And then I’ll draw another one real short. Maybe here’s a frame t plus two and whatnot. What we’re trying to do with flow is to take a point here and a point here, and then track it through these frames. It’s gonna go here and then it kind of goes down here, and this is what we’re trying to do with flow.

And so to do this, we consider two consecutive frames, and we kind of build the path and just kind of, it becomes like connect the dots. You have, where the dots are the position of this particular pixel at each of the frame, so this is kind of connect the dots. Initially, at first glance this might seem impossible because you have to consider so many things like the size of the frame, how do we know which pixels are which and whatnot, but it turns out that there are two assumptions that optical flow makes that really help simplify this.

So, assumptions, and we’re gonna be using these assumptions in later videos. There are two assumptions that optical flow makes. One is that pixel intensities don’t rapidly change between consecutive or successive frames. What I mean by this is that pixel values don’t just immediately change between two successive frames. In our case at least, that would be like this pixel value being green here and in the next frame it becomes blue or something like that. So the assumption that flow makes is that this doesn’t happen. And this has real world implications.

So the frames are taken such that there is such little time between them but unless you were actually video editing each frame, you wouldn’t really encounter something like this. Now this is not to say that maybe pixels don’t change after a longer period of time. That’s fine. But this is just saying that they don’t just flip between two consecutive frames and the time between each frames is like really small. If your pixels are flipping in between frames that’s kind of weird. Maybe there’s like some video editing stuff that you can do to make that happen, but naturally this doesn’t really happen. Anyway that’s the first assumption.

The second assumption is that groups of pixels move together. What I mean by this is that pixels don’t really jump around between frames; this is just like saying that pixels don’t teleport. So if I have a group of pixels here at the top of the image, they’re not just gonna jump to the bottom of the image in the next frame. When pixels just teleport, that kind of hinders good flow tracking. Ideally you don’t want your pixels to teleport between frames; again, you want the motion to be smooth, and optical flow works really well when the motion is smooth, not when it’s jumpy or teleporting.

So like I said, these assumptions have real world implications: in the real world, if you’re taking video, stuff just doesn’t teleport everywhere. That would be really bad. These assumptions are perfectly valid to make based on the real world implications of this. Now, if you were to take a video and do some video editing stuff, there are ways you could break these assumptions intentionally, but we are not really going to be considering that. I’ve kind of drawn a picture here, but let me draw just two frames.

And so actually I’ll probably draw a third one right here. Yeah, okay, so of course I have my pixel here, and here’s one particular frame, and I’m gonna color that green; then here is the same pixel in the next frame here, and so let me actually label these using frames.

So this will be something like t and this will be something like t plus delta t, and what I mean by delta t is just that some short amount of time has elapsed. So if I were to look at both of these pixels in the same context, then I’m gonna get something like this, where they’re a little bit apart here. And so the problem of flow, the thing that we’re trying to solve here in this red here, is that I want to find the displacement. The pixel moves in the x direction by some amount u and down in the y direction by some amount v.

So the challenge of flow is to find this u and v, because if we have those then we can track the path. Once we have this displacement, we get this thing called the displacement vector, and we know how much this pixel has moved. I’m gonna stop this video right here.

In the next video we’re gonna talk a little bit more about the solution to this but yeah this is the problem that optical flow is trying to solve. How we find these values, this u and v? So I’m gonna stop right here, do a quick recap, and then we’re gonna kind of continue flow in the next video. So I’ll just do a quick recap here.

We discussed optical flow, and it’s the computer vision technique used to track the motion of objects through videos. Given the frames of a video, I want to build this path throughout my video, tracking a particular pixel. This can be really challenging, but there are two assumptions that optical flow makes that are kind of rooted in the real world.

The two assumptions are that pixel intensities don’t rapidly flip between frames and that pixels don’t teleport, and these are valid assumptions to make. I kind of showed a sequence of frames here, but specifically, when we get two frames that are some unit of time apart, some delta t, then the problem of flow is to find this u and v: how much has this pixel moved in the x direction and how much has the pixel moved in the y direction?

So that’s the problem of flow and then in the next video, I’m gonna kind of take this intuition and make it a bit more concrete using mathematics, so we will get to that in the next video.

Transcript 4

Hello everybody, my name’s Mohit Deshpande, and in this video, we’re gonna be delving a little bit more into optical flow and add a little bit of mathematics to this.

So if you recall from the previous video, the point of flow is to find this u and v, and we have two assumptions. Actually, let me take this image and expand it out a little bit so you can see it a bit better. So I have two consecutive frames here, and then in the resulting frame, so we’ll have, let’s say we’re considering this pixel here in frame at time t, and then this same pixel is over here at the frame at some time t plus delta t, and so delta t is the elapsed time. So that’s what delta t just means, delta just means difference, and so this t plus delta t just means that from this frame a little bit of time has elapsed and now my pixel is in a different location. So let me start here, and then you know, sort of like over here-ish.

And so the point of flow, like I said, in some elapsed time, this pixel has moved to the right by some amount that I’m gonna call u, and then has moved down by some amount that I’m gonna call v. And so that’s what we’re trying to find with optical flow, and actually if I do this, I can complete the triangle. So this is what we’re trying to find, we’re trying to find this u and this v with optical flow. So how do we find this u and v? Well, we use mathematics. So I provided an intuitive picture here, but let’s actually kind of formalize this picture a little bit.

So if you remember, the first assumption is saying that pixel intensities don’t rapidly change between consecutive frames. So it’s reasonable to say, and I colored it in such a fashion, that these two pixels at different frames have the same pixel intensity; and just to remind you, for intensity we’re just gonna drop any sort of color and only consider the pixel’s intensity. So it’s reasonable to say that at these two instances in time, at some t and at t plus this delta t, the pixels have the same value. So actually, if we use our notation, our function notation, this is saying that the I function applied to this is equal to the I function applied to this.

So let’s actually, you know let’s write that out a bit more formally. So what is the pixel intensity in this frame? Well this pixel intensity is I(x,y,t) because this pixel, let’s say that it’s a coordinate x comma y. So now the question is, what is the pixel coordinate here in terms of x and y, and u and v? Well, this pixel, the x coordinate is the same x coordinate here plus this small change u. So this is gonna be x plus u, because here, the x coordinate is like right here, and the x coordinate here is new here, and so this difference between them I said here’s x, and then here is x plus u, because I’m moving u units to the right.

So this is x plus u, which is why I call it x plus u, and then similarly if this were y, then this is then y plus v. So in this coordinate is x plus u, y plus v. And so now I can write this frame as being I, and then the x coordinate of this pixel is x plus u, because I’m at the same, here’s the coordinate for the first frame and here’s it for the second frame, and so I’m moving right u units. So that’s x plus u, and then comma, what’s the y coordinate? It’s y, which is the same coordinate here, plus v, because remember here’s the initial coordinate in frame t, and then in t plus delta t, I’m moving down by v, and so this is y plus v. And so now what’s the time? Well, I just told you what it is, it’s t plus delta t. T plus delta t.

And so now I’ve written this frame, this next frame, in terms of this current frame. And just to make this clear, this is movement in the x direction, along the x axis, which is what u is, and then v is movement in the y direction, along the y axis. So these are just the displacement, and this is movement in time, along the time axis, to the next frame. That’s what these three values represent. And so I can represent these two pixels here, but what is I? I is just a measure of pixel intensity.

And what is I here? This is just a measure in pixel intensity as well. And so if you remember from the first assumption that pixel intensities don’t rapidly change between two successive frames, these are actually equal. And so this is the optical flow equation here.

This is a really important equation, though it's not quite in a form that we can use yet. It's saying that the pixel intensity at some time t is equal to the pixel intensity after some time has elapsed between one frame and the next, and we can write that in terms of this u and this v.

So just take a second and look at this equation and make sure that it logically follows from our first assumption that these two things should be equal: the x coordinate of the pixel in the second frame is the x coordinate in the first frame plus u, the y coordinate is y plus v, and the time is t plus delta t. Actually, let me draw these markers as well since I drew them for the x axis. And so hopefully this makes sense.

If you have any questions, feel free to post a comment. We want to find the values of u and v, but they're inside our function, so how do we separate them from the function? It turns out that we use calculus to do this. So I'm just gonna put dot dot dot calculus, and I'm just gonna write down the final equation here, and that is $I_x u + I_y v + I_t = 0$. I'm not really gonna talk about the calculus behind this conversion at all, but these three terms I will talk about.

So we end up with a single equation here. This I sub x represents how much the frame changes with respect to the x direction, horizontally. I sub y is how much the frame changes with respect to the y direction, vertically. And then I sub t is the image difference between the two frames, so how much the frames change along the time dimension. And it turns out that we know all three of these; I'm gonna put a check mark next to each. We can compute I sub t because it's just an image difference, and the other two we can actually compute using convolution.
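To make that concrete, here is a minimal sketch (not from the course files) of how these three known quantities could be approximated with OpenCV; the frame filenames are placeholders assumed for illustration.

import cv2
import numpy as np

# Two consecutive grayscale frames (placeholder filenames, not course assets).
prev = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
curr = cv2.imread("frame_t_plus_dt.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# I_x and I_y: how the frame changes horizontally and vertically,
# computed by convolving with derivative (Sobel) kernels.
I_x = cv2.Sobel(prev, cv2.CV_32F, 1, 0, ksize=3)
I_y = cv2.Sobel(prev, cv2.CV_32F, 0, 1, ksize=3)

# I_t: how the frame changes along the time axis, i.e. the frame difference.
I_t = curr - prev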

So we have this one equation, and it turns out that those three terms are things we can compute. But we don't know u and v; those are the two things we're trying to find. So we have one equation but two unknowns. How do we solve one equation with two variables? This is also related to something called the aperture problem, in case you're curious about it. But don't worry, it turns out that there is a way to solve this equation.

OpenCV has ways that we can solve this equation and approximate u and v. One particular method that works well is the Lucas-Kanade method, and there are some others along with it; there are actually quite a few that you can use to find u and v. Actually using that method again requires calculus and linear algebra, so I won't talk about that, but trust me when I say that there are ways to solve this equation for u and v. So don't worry about that.
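As a rough sketch of what that looks like in practice, the snippet below tracks a handful of corner points between two frames with OpenCV's Lucas-Kanade implementation (cv2.calcOpticalFlowPyrLK); the filenames and parameter values are assumptions for illustration, not course code.

import cv2
import numpy as np

prev = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t_plus_dt.png", cv2.IMREAD_GRAYSCALE)

# Pick some good corner points in the first frame to track.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.3, minDistance=7)

# Lucas-Kanade estimates where each of those points moved to in the next frame.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(15, 15), maxLevel=2)

# The displacement (u, v) of every successfully tracked point.
flow = (p1 - p0)[status.flatten() == 1]
print(flow[:5])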

Okay, I'm gonna stop right here actually, and in the next video there are a couple of smaller things with optical flow that I kinda want to wrap up. So I'm just gonna do a quick recap. With optical flow, we have the difference between two frames, and we're trying to find this u and this v, which is how much this pixel has moved in the x direction and the y direction. We can write down the pixel intensity here, and using the first assumption that pixel intensities don't change quickly, we can say that these two things are equal. And so now we've written the second frame in terms of the first frame, and we get one equation.

And remember how we get this x plus u: if I define u as how much this pixel has moved in the x direction, and here is the pixel in some frame t, and here is the pixel after some time has elapsed, and I say that the difference between the x coordinates is u, then the new position must be x plus u. Similarly, it must be at y plus v if I define v to be how much this pixel has moved in the y direction. And for time, it's just t plus delta t, where delta t is the time that has elapsed between these two frames.

And so using the first assumption we can set these two equal to each other, and then dot dot dot calculus, and we end up with this single equation. It turns out that three of the terms are things we know and can compute easily, but u and v are things we don't quite know. At least we've gotten them out of the function, and in this form we can use calculus and linear algebra to approximate them with several different techniques.

So that's a quick overview of optical flow. A lot of the techniques that you'll see in optical flow try to find these u and v values, and we're gonna be looking at one in particular. But this is where I'm gonna stop, and in the next video I want to kind of wrap up some things with optical flow. So I'm gonna go ahead and do that wrap-up in the next video.

Interested in continuing? Check out the full Video and Optical Flow – Create a Smart Speed Camera course, which is part of our Python Computer Vision Mini-Degree.

A Comprehensive Guide to Face Detection and Recognition https://gamedevacademy.org/face-detection-recognition-tutorial/ Fri, 15 Feb 2019 05:00:04 +0000 https://pythonmachinelearning.pro/?p=2264 Read more]]>

You can access the full courses here: Build Lorenzo – A Face Swapping AI and Build Jamie – A Facial Recognition AI

Part 1

In this lesson, we’re going to see an overview of what face detection is.

In a nutshell, it answers the question of whether or not there is a face in a given image. Note that face detection is not face recognition, though it is its first step, as face recognition goes further and tries to determine to whom the detected face belongs.

Man with square around face representing face detection

The way we'll do it with OpenCV is by, first, training a machine learning model with tons of examples, which include positives (images that do contain faces) and negatives (images without faces).

Secondly, we need to analyze the features gathered during the previous step where the positives will contain face features and the negatives will contain other kinds of features.

As there are too many features, the third step is to smartly define a smaller set of features for the machine learning model to perform well. The model should be able to say whether an image has a face and where it is located in the image. The technique we're going to use for that is called AdaBoost, which helps us reduce the total number of features. However, even with AdaBoost, that reduction may still not be enough.

The fourth step we're going to follow is the Cascade of Classifiers, which will drastically speed up our face detection. With that in place, we can hand an image to OpenCV and it will return the faces and their respective locations, as we wanted.
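In code, that final step could look something like the minimal sketch below, which loads one of the Haar cascades bundled with OpenCV and asks it for the face locations; the input image filename is a placeholder.

import cv2

# Load the pre-trained frontal face cascade that ships with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("people.jpg")                      # placeholder image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, w, h) rectangle per detected face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("people_with_faces.jpg", img)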

This is our quick overview. We’re going to get into more details of all this in the next lessons.

Part 2

Before going deeper into face detection, let us see the concept of features first.

Features are quantifiable properties shared by examples/pictures used primarily in machine learning. In other words, they represent important properties of our data. For instance, we have a set of features such as color, sound, size, etc., that we can use to train an AI for bird detection. Given a new image of a bird, extracting these features will help our algorithm tell us the category of the bird.

For face detection, the computer will deal specifically with Haar features. A computer interprets the pixel values themselves, so here we have features that are close to the pixel level. We can see some examples in the image below:

Image example of how pixels appear during feature detection

We have the edge patterns, which are half white and half black (both for horizontal and vertical directions), lines, and a four-rectangle pattern. The computer detects these types of patterns present in the images. But how do we extract the features?

We overlay each Haar feature against our image (you can think of it as a sliding window) while computing the sum of the pixels in the white region minus the sum of the pixels in the black region. That gives us a single value f, which is our feature for that particular part of the image. Repeating the process all over our image, we end up with a ton of these features:

Features from image after processing

It may result in about 150,000 features at least (even for a small image). It is too time-consuming to get through all that information, which is why we have the Adaptive Boosting (AdaBoost) step in face detection. It aims to select the best features to represent the face we're trying to detect. We're going to go into more detail about how it works in the next lesson.
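As a quick toy illustration (not the course code) of the feature value f described above, here is what one window position looks like for a horizontal edge feature: the sum of the pixels under the white half minus the sum under the black half.

import numpy as np

# A single 24x24 window of an image (random values standing in for real pixels).
window = np.random.randint(0, 256, size=(24, 24)).astype(np.float32)

# Horizontal edge feature: top half is the "white" region, bottom half the "black" region.
white = window[:12, :]
black = window[12:, :]

f = white.sum() - black.sum()   # one feature value for this window position
print(f)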

Part 3

This is an introductory lesson about face recognition and its related topics.

While face detection is concerned with whether there is a face in a given image or not, face recognition tries to answer to whom that face belongs. In fact, face detection is the first step in facial recognition.

For face recognition, we do not use pre-built models as we did for face detection.

In order to build an AI, we need lots of labeled training examples, each containing an image (with the actual face) and label (the name of the person whose face is in the image) tuple. We need to go through two distinct phases with our samples: training and testing.

For the training phase, we first run face detection for all our training examples, where each will give us back one face. Then the second step is to capture the features of the faces in the images, and the third step is to store these features for future comparisons.

Now, for testing, we will have new images. The first step is also face detection, and the images here can have several faces in them. Secondly, we're going to capture the features of all faces, just like in the previous phase (note that we have to make sure we use the same algorithm as in training). The third step is to compare the captured features to the training features. Then, the fourth step is to make the prediction by associating each newly detected face with its closest match from the stored images, under the same label. There's also the concept of a Confidence Value, which basically measures how confident our AI is in each recognition.
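As a hedged sketch of those two phases (assuming the opencv-contrib-python package, which provides the cv2.face module), the code below trains an eigenface-based recognizer on a few labeled faces and predicts a label and confidence for a new one; the random arrays stand in for real, same-sized face crops produced by face detection.

import cv2
import numpy as np

# Training phase: detected face images (all the same size) and their labels.
train_faces = [np.random.randint(0, 256, (100, 100), dtype=np.uint8) for _ in range(4)]
train_labels = np.array([0, 0, 1, 1], dtype=np.int32)   # 0 = subject01, 1 = subject02

recognizer = cv2.face.EigenFaceRecognizer_create()
recognizer.train(train_faces, train_labels)             # capture and store the features

# Testing phase: capture features of a new face the same way and compare.
test_face = np.random.randint(0, 256, (100, 100), dtype=np.uint8)
label, confidence = recognizer.predict(test_face)
print(label, confidence)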

Regarding the choice of features and how to capture them, it is important to note that it is all about using good features. You do not need a complicated or cutting-edge machine learning algorithm of any kind if you have good features, so you should not try to compensate for bad features with your algorithm.

Part 4

In this lesson, we start to work with our face recognition data set.

First, we need to set up the environment we’re going to use:

Anaconda Navigator with Spyder highlighted

Remember to download the course files (NOT INCLUDED IN FREE WEBCLASS) and unzip the contents (included is some code for the data set loading to be quicker, and our entire data set called YaleFaces):

Folder with yalefaces folder in it

The “haarcascade_frontalface” file contains a pre-trained classifier that we need only load using the CascadeClassifier class in OpenCV. The other file contains our training and testing data. Folder “yalefaces” has a lot of files varying from “subject01” to “subject15” as seen below:

Blank files of Yale Faces dataset

Each of the subjects has several suffixes, for instance, “.centerlight” or “.glasses”, “.noglasses”, “.sleepy”, and so on. These are attributes of the same person. We want to be able to detect a person under varied conditions, that is considering light variations, or whether the person is wearing glasses or not, if their eyes are closed, etc. By taking multiple pictures of the same person with different facial expressions and poses, we tend to get better results with facial recognition.

If we open the files with an image viewer, we’re actually able to take a look at the pictures:

Subject01.leftlight from Yale Faces dataset

In case of trouble opening the files, you can add an extension such as “.png”  to the end of each file’s name just to be able to visualize them.

As we can imagine, what happens is that all features of the same subject map to the same label (the respective person’s name).

We can even add ourselves to the data set, provided we take selfies with the needed facial/environmental variations (11 of them, in this case), get rid of the “.png” or similar extension, and rename our files according to the pattern of the ones in the folder. Thus, we'd be “subject16” and our smiling selfie would be titled “subject16.happy”, for example.

We’re going to use part of these images for training, and the remaining will be used in the test phase. We’ll feed our program with all but the last picture of each subject for training, where the last picture is the one we reserve for testing.

Remember that the more pictures we have in distinct conditions the better for our face recognition outcomes. Also, our recognizer will always return us the label of the closest match it finds.
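A rough sketch of that split, assuming the folder layout described above (the "yalefaces" path and the subjectNN.attribute naming come from the dataset description; everything else is illustrative):

import os

dataset_dir = "yalefaces"                     # assumed path to the unzipped folder
subjects = {}
for name in sorted(os.listdir(dataset_dir)):
    if not name.startswith("subject"):
        continue                              # skip any readme or extra files
    label = name.split(".")[0]                # e.g. "subject01"
    subjects.setdefault(label, []).append(os.path.join(dataset_dir, name))

train, test = [], []
for label, paths in subjects.items():
    train += [(label, p) for p in paths[:-1]]  # all but the last picture for training
    test.append((label, paths[-1]))            # reserve the last picture for testing

print(len(train), "training images,", len(test), "testing images")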

Part 5

This lesson demystifies the concept of dimensionality reduction.

The goal of dimensionality reduction is simply to receive higher dimension data and represent it well in a lower dimension. We’re going to take a Cartesian coordinate plane as an example, and we are going to reduce it from 2D to 1D.

Cartesian coordinate plane with various datapoints

In the image above we see coordinates “x” and “y” for a group of points in a way that each point is uniquely represented in a 2D plane. But now we need to find a good axis to identify all these points just as uniquely in 1D. For that, we’ll be using projections.

Consider the “x” axis as a wall and the points as objects dispersed in a room. If we shoot various rays of light from flashlights into this room in a way that the light comes in perpendicularly to “x”, then we’re going to have the shadows of our objects projected on the wall like demonstrated below:

Datapoints being shown as perpendicular to the x axis

The shadows are basically each point’s coordinate in “x”. Now, similarly, if we shoot light from the side and consider the “y” axis as our wall, we have:

Data points being shown as perpendicular to the y axis

We notice that several points happened to map to the same shadow, and we do not want that for our reduction. Each point must be reduced to a unique corresponding piece of information in the lower dimension. Thus, we can conclude that the “x” axis is a pretty good axis for our reduction, given that the “y” axis does not satisfy our mapping requirement. In the end, the class of algorithms for dimensionality reduction is mostly concerned with picking a good axis.
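Here is a toy numeric version of that argument (the points are made up for illustration): projecting onto the “x” axis keeps every point distinct, while projecting onto the “y” axis collapses some of them.

import numpy as np

points = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 5.0], [4.0, 5.0]])

proj_x = points[:, 0]   # the "shadows" on the x axis
proj_y = points[:, 1]   # the "shadows" on the y axis

print(len(np.unique(proj_x)))   # 4 distinct values: x is a good axis here
print(len(np.unique(proj_y)))   # 2 distinct values: y loses information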

How do we do it for images?

We can think of an image in terms of its pixels:

Square showing dimensions

However, do we really need a hundred dimensions to tell us if two images are the same? No, we can highlight some points and reduce the whole thing to a lower dimension. In fact, actual images will be much larger than a hundred dimensions, but the point here is that we can operate more efficiently over lower dimensions obtaining the same result.

In the next lesson, we’ll see two algorithms to help us decide how to pick a good axis for our reductions.

 

Transcript 1

Hello everybody, my name is Mohit Deshpande. In this video, I want to introduce you guys to this notion of face detection. And I just want to give you an overview of the two different topics that we’ll be covering in the next few videos.

Probably a good place to start would be to discuss what face detection is, and also what it isn't. Face detection answers the question: is there a face in this image? This is the question that, after we're done with face detection, we hope to be able to answer given any image. What I should say, and I'll put this in red, is that it is not recognition. Face detection and face recognition are two different things, and it turns out face detection is the first step of face recognition.

But face detection is trying to answer the question, is there a face in this image? And then when we go to face recognition, then we can answer the further question, whose face does this belong to? I just want to make that point very clear, is that face detection and face recognition are two different things. So if we were running a face detection on this image, we might get something like, hey, look here’s a face. And then regarding whose face it is we can’t really answer that with just face detection. This is where recognition would have to come in to play.

But we’ll primarily be discussing face detection. The process that OpenCV uses, the actual process was actually released in a research paper in about 2001. It was called Rapid Object Detection using a boosted cascade of simple features by Paul Viola and Michael Jones back in 2001. And this title might sound a bit scary, but don’t worry about it all. I’m going to give you an intuitive understanding of face detection rather than the rigorous one that they go through in the paper.

The way that I want to discuss face detection is to first, in this video in particular, just give you an overview of what it all entails. Don’t worry if some of the stuff I mention in the overview doesn’t quite make sense, ’cause we’re going to get into it and really define some more things in the next few videos. I just want to give you an overview of what we’re going to be looking at in the next few videos. We’re gonna go through the entire face detection stack.

What OpenCV does is it actually uses machine learning for face detection. And that makes sense, because you can't really go around taking pictures of everybody's face and comparing against them; instead, we can learn these features. The first step is part of that machine learning, and that's the largest step. With machine learning, we need lots and lots of examples. We need tons and tons of machine learning examples.

We need positives and negatives. Positives are images with faces. And negatives then are images without faces. We need to collect this dataset of images that have faces and images that don’t have faces and manually label them as being, well, here are the images with faces and here are the images without faces. And luckily OpenCV actually has all this data already collected for us, so we don’t actually have to deal with any of that.

The second thing after that is, given a training example, we need to extract its features. What I mean by feature, and we're gonna go into it a bit later, is kind of the essence of what a face is. We need to extract that from the positives and the negatives. The positives are going to have the essence of a face, so these features. The negatives will also have features, but they won't necessarily correspond to faces. Actually, to make this a bit clearer, I'm going to label this as Overview.

We want to extract these features from our image, and we need to look at all the portions of our image. When we do this, we'll find that we get tons and tons of features as a result: thousands, tens of thousands, hundreds of thousands of features. And these features are just numerical ways of describing the essence of a face, for example.

We need to extract these features from all the portions of the image, and we end up with so many features. If we’re trying to apply it using that many features, face detection would take forever. So we need some way to reduce the number of features that we have and reduce them smartly, because we want to reduce the number of features, but we still want our machine learning model to be able to perform well. So given a new image, the model should be able to say, well, this image does have a face in it and here is the location of the face. We don’t want to diminish our accuracy. We want our model to stay really accurate and precise, but at the same time, we want to reduce the number of features.

There's this technique called AdaBoost that we'll be using to help reduce the number of total features that we have. It will help reduce the total number of features from hundreds of thousands to maybe just a few thousand, which is two orders of magnitude smaller. But it turns out that even with AdaBoost, that still might not be enough; we still have thousands of features that we have to check. What they propose in their paper, and what we're going to be discussing, is this cascade, this Cascade of Classifiers. That's gonna help drastically speed up our face detection.

This is gonna help drastically speed up our face detection, because then we're not applying all those thousands of features to each part of the image. We build this cascade, kind of like a waterfall; that's where the name Cascade of Classifiers comes from. You can think of it as a waterfall. After we've done all of this, then we'll have a great machine learning model, and then we can just send an image through to OpenCV and say, is there a face? And OpenCV returns all the faces and the coordinates of the faces.

And then we can use that for really cool things: we can pass it on to a face recognition algorithm that can determine whose faces these are. Or we can pass it to something a little less complicated, like some sort of face swapping algorithm that can extract one face, extract another face, and swap them. There are so many other applications of face detection that we can use.

This is the foundation of face detection that I wanted to discuss. If the stuff that I talked about in the overview doesn’t make much sense, that’s okay, ’cause we’re gonna get into this really in the next few videos. This is just an overview. I’m going to discuss more about the intuition behind this instead of the raw mathematics, so that you have a better understanding. This is kind of an overview of face detection. Just to do a quick recap, face detection answers the question, is there a face in this image? And I made the point that it’s not the same as recognition which is going to be answering the question of whose face is this? And then I kind of gave an overview, I mentioned the first thing is machine learning examples.

We have to have a lot of the image examples with faces and without faces. OpenCV actually handles this for us. Now the next thing with this is features. What are features and how do we get them from our images or training examples? Third is Adaboost, which is gonna be an algorithm that we can use to reduce the number of features from hundreds of thousands to maybe just a few thousand. And then Cascade of Classifiers which is what they propose in their paper, ’cause even with a few thousand features, you can imagine, it’s gonna be still kind of slow.

And so Cascade of Classifiers helps speed this up. And that’s the fundamentals of face detection and then after that, there’s some parameters that we’re gonna be discussing as well. So that’s it for this video and in the next video we’re gonna be discussing this topic of features. We’ll be discussing that in the next video.

Transcript 2

Hello, everybody, my name is Mohit Deshpande and in this video, I wanna talk about features and feature extraction in the context of face detection, but before we actually get into that, I wanna start by defining what are features and I kinda wanna go through an analogy so that you kinda better understand intuitively what they are and then we’ll transition on to how face detection is specific use of these features.

So, what is a feature? Well, first of all, features are quantifiable. There's some way that we describe them, either numerically or through some sort of fixed list or something like that, but they're quantifiable properties that are shared by our examples, and in our case they're used primarily for machine learning. That's not to say that features are only used for machine learning, but in the context of face detection, that's what they're used for.

They're used for machine learning, but they have many other uses as well. Anyway, specific to machine learning, we care about these features because they represent important properties about our data that we can use to make decisions like categorizing or grouping.

So, let me start with an example, and let’s suppose that we wanted to teach an AI to classify different types of birds or something like that. That’s my really bad, really bad image of a bird. So, pretend that that’s a bird, I’ll even label it, bird. I’m not really good at drawing. Suppose this is a bird, so what kind of features might we have with birds? So, there are a couple that we can come up with. There are maybe, like, color of the bird or maybe like sound that it makes, there’s, you know, size, maybe how big the bird is, and et cetera.

There are many more from these that we can use and as it turns out, using a combination of these features, if we saw some new bird and we wanted to label it, we could compare the new bird’s features here with all the features that we’ve seen before in our training examples of birds and we can kinda, reasonably guess what category or group our bird should, our new bird should belong to, and so this is kind of the intuition behind it. So, let me actually just label this, AI for birds.

So, suppose we were trying to do something like this. Here are some examples of features that we might use. The thing is that the AI would be given lots of these features from birds and it'd be told, well, if you have this combination of features, then this bird is a falcon, or if you have this combination of features, then this bird is a sparrow, or if you have this combination of features, then this bird is a duck, or something like that. It's given lots and lots of these examples, so that when it encounters a new bird, it extracts these same features from the bird.

Maybe it needs a human to actually tell it: here's the color, the sound that it makes, the size, and so on. Maybe it needs a human to give it those features, but once the AI is given these features, it can categorize this new bird as being similar to one of the categories that it already knows about. And so that's kind of the intuition behind what these features are. They represent, like I said, important properties about our data, and we use them so that we can make better decisions.

And so, you might be saying, well, we’re dealing with all these birds, how does this work in face detection? And so, it’s a bit different, these features here are a bit different than for faces and that’s because you have to remember that for a computer, when it sees an image, it just sees the raw pixel values. We use those pixel values to maybe give us some more understanding about the image, but initially, we just get raw pixel values. And so, our features then for this, they are a bit closer to the pixel level and so, in particular for face detection, what works well are these things called Haar features.

So, H-A-A-R, and let me actually draw some, and they might seem kind of low-level. What I mean by that is, for example, here are edge features, and they basically look like this here, except this portion's colored in. So this portion up here is white and this portion down here is black, and this kind of looks like an edge, right? Here's one portion that's white and here's one portion that's black, so this seems to detect horizontal edges.

Now, as it turns out, there’s also an analogous one that detects vertical edges, so something like this. This is also a Haar feature and so we have something that detects edges, and we also have other Haar features that detect lines, and these detect both horizontal and vertical lines. So, here’s an example of a vertical line that’s being detected, and so, this kind of makes sense, right? So, here’s the white portions and if the black portion were aligned, if I were to stack these kind of on top of each other, they would make a line.

And just like with the edges, there is an analogous one for horizontal lines and then, there’s this other unique one called a four-rectangle and it kinda looks a little unusual. It’s actually set up to be like this and this is kinda where we have some alternating thing going on here where this is, these two squares are black and these are actually all the Haar features that are used for a thing like face detection.

You might look at these initially and say, well, how could these possibly be used for something like face detection? It turns out that these work really well because they detect low-level features that are shared among faces. They detect edges and lines and other different features of faces. It's been shown that using these features works well, but how do we actually extract them?

So, if you remember back to convolution, this works somewhat similarly to convolution, but it's not quite the same. What happens is, like with convolution, we overlay one of the Haar features on top of our image here, but instead of doing the convolution operation, what you actually do is take the sum of the pixels in the white region minus the sum of the pixels in the black region. If you do that, you get a single value, and that's one of our features. You basically take this and move it around our image, like a sliding window sort of thing, and we get a ton of features.

In fact, we get a lot of features, a ton of features. We can get somewhere around 150,000 features, even for a relatively small image. It might not turn out to be exactly that number, but the important thing is that you get a ton of features, and that's a lot of numbers to work with.

So, imagine trying to get a new face and then also getting all these, trying to apply the features that we’ve seen before across all of our training examples and applying all of these to the new input image. That’s just way too time-consuming, there’s no way that you’d set this up for doing face detection, you have to come back, you know, a day or so afterwards for you to get an actual result back, but this is just way too many features and it’s gonna be way too time-consuming.

And so, this is kind of a problem that we’re encountering with this, there’s just too many features. So, what they suggest in the paper and what we’re gonna be going into more intuitively is they suggest an algorithm called AdaBoost, which is short for Adaptive Boosting, where they use this algorithm to try to select the best features that represent the face that we’re looking for. And so, using AdaBoost, we can kind of reduce this number of, you know, hundreds of thousands down to like, maybe just a few thousand of the best features. And so, that’s what we are going to be discussing more in the next video, but I just wanted to stop here and we’ll do a recap.

So, what I discuss in this video, we discussed what features are. Now, in particular, I mentioned that there’s some quantifiable properties that are shared by different examples and in our case, we’re using them for machine learning specifically. And so, I made this analogy of, if we’re building an AI to classify different kinds of birds. Some of the example features would be like, color of the bird, sound the bird makes, the size of the bird, and so on.

So, these are the kind of features and then if I give my AI lots of examples with these features and say, well, a bird with this particular arrangement of features, these values, then we can classify this as a blue jay, or et cetera. And so, we can do something similar with face detection, except we have to use low-level pixel features and that’s what these edges, lines, and this four-rectangle feature kind of is. These are like the lower-level Haar-like features. It’s been shown that using these features works really well for things like face detection.

And so, how do we actually extract these features? Well, it's kinda similar to convolution where we treat it as a sliding window and slide it over the image, but it's not quite convolution, because we take the sum of all the pixels in the white area, subtract from it the sum of all the pixels in the black area, and that gives us the single value for our feature. When we slide this over our entire image and do all that good stuff, we can end up with hundreds of thousands of features, and that's way too many.

And so, in the next video, we’re gonna talk about, intuitively, we’re gonna discuss an algorithm called AdaBoost and AdaBoost will let us reduce this list from hundreds of thousands of features to maybe just a few thousand features, but those few thousand features are actually gonna be the best of the hundred, in this case, 150,000 features. It’s gonna be the best of those, so we’re gonna discuss AdaBoost in the next video.

Transcript 3

Hello everybody. My name is Mohit Deshpande. In this video I want to introduce you guys to face recognition and some of the topics I will be discussing over these next few videos. Face recognition is the step above face detection.

After detecting a face in an image, the next natural question to ask is: to whom does this face belong? This is the question that face recognition tries to answer. With face recognition, we're trying to answer the question of whose face is in this image. This is different than face detection, because with face detection we're just concerned with whether there is a face in this image or not. Face recognition is more specific, and you'll see that face detection is the first step in face recognition, but face recognition asks the more specific question of whose face is in this image.

And so to answer this question, we can't just hard code all of the values that we want in there. In fact, we have to use machine learning and artificial intelligence, and we have to construct an AI that, given lots of labeled examples of people's faces, can take a new face and say this face belongs to so and so, for example. So that's what we're gonna be doing. And because face recognition is so specific, we can't do something like what we did in face detection and use a pre-built model, because the model changes depending on whose faces we want to include in our data set.

I’m gonna be talking about the data set that we’re gonna be using in a future video. So just keep that in mind. So anyway, if we want to build this AI, we need lots of labeled training examples.

What are these examples? With these training examples, we need an image and then a label. The image is the actual face, and the label is the person whose face is depicted in that image. That's really all we need; they're just of this form, image and label. There are two phases that we need for this: the training phase and the testing phase, and we're gonna have to go through both. In the training phase, we have lots of examples like these. What we need to do for the training phase is to first run face detection, which will identify the region of the image that is the face.

We've already discussed face detection, so I won't talk too much about that. We just use exactly the same stuff that we have been using with face detection: run the cascade classifier on the input image and we get the face back. We do that for all of the training examples; training has to go across all the examples. So for each example, running face detection is the first step. The second step, and probably the most important step, is to capture features of that face. That's probably the most important step; after all, it's the whole reason why you would actually detect the face.

We have to capture features about the face, because remember, with face recognition we're asking a more specific question than with face detection. We can't just use those same Haar features that we were talking about, because those are meant to detect general faces; you can't really use them to identify a specific person's face. We have to capture features that can help us uniquely identify different faces.

That’s where this step two comes along. Once we’ve captured features for that particular face, we have to store these features with the label. Once we’ve captured these features, then we can store them and use the labels so that when we get in new training examples and new images, we can compare them to the stuff we already have seen and just take the label of the face that’s closest to whatever the new input image is.

Speaking of getting new examples, let's talk about testing. Testing is for new images (I'm just going to abbreviate image as img). For new images, what we have to do is, first of all, run face detection again. We want to detect the face, and if the image has multiple faces, that's perfectly fine; we will just detect all of them. In the second step, what we have to do is capture features of all the faces. Remember, in the training examples there's generally just going to be one face per image, which makes things simpler when we're dealing with training data because we have control over our training data.

With testing data, we don’t really know how many faces can be in a particular image and, actually, it turns out it’s not gonna matter. We want to capture features of all of the faces. Then after we have those features, using the same algorithm, then we can, actually, compare to training data or to training features I should say. The last step, down here is, the coolest step and that’s the prediction.

Once we have the features of all the faces in our testing image, we can compare those features with faces that we've previously seen in the training phase and find the closest match. Once we have that closest image, we can just take its label and say, well, this is the person that's depicted in this image, that's the person this face belongs to. That's something we can do in OpenCV fairly simply. Along with the prediction, we also get something called a Confidence Value; intuitively, that's how confident our AI is. Is it unsure about this, or is it really confident that this face belongs to this person? Anyway, this is just something that OpenCV can give us when we train our machine learning model or AI.

You just take a new input image, call a function on it, and it returns the label and the confidence. So that's not something that we have to worry about too much. But the key step that I want to focus on is this step right here: capturing the features of all the faces, capturing these facial features.

'Cause like I mentioned before, face recognition is more specific than detection, so we have to be smart about which features we capture and how we capture the features of the face. If you go into any kind of data mining or data science or anything like that, you're gonna learn very quickly that it's all about using good features. Usually, if you have good features and a relatively simple machine learning algorithm, you're generally going to do better than if you have truly bad features and try to compensate by using some really complicated, over-the-top algorithm. This capturing and analysis of the features is probably the most important step in face recognition.

And so in the next few videos we're gonna be discussing a couple of different ways that we can do that, and the majority of them are centered around this topic called dimensionality reduction. It sounds much scarier than it actually is; I'm gonna give you a more intuitive understanding of dimensionality reduction in the next video. But I'm just gonna stop right here and do a quick recap.

With face recognition, we want to answer the question whose face is in this image? That’s more specific than face detection because face detection is just is there a face in this image? To answer this question, we can’t just hard code all of the values in there. We have to use some kind of artificial intelligence and machine learning to identify the features of a particular face given lots of training examples, and then given a new image, we compare it with those previous examples that we’ve seen before and then you can make a prediction.

And so, what I mean by examples, as I show here, is that examples are just a tuple containing the image of the face and then the label. And the label, in this case, is the name of the person. So given lots of these, we can go through the training phase: apply face detection to identify the face, then capture features of that particular face, and then store those features with the label, or the person's name.

This all goes into a giant data model that has the name of the person and then the face, with multiple examples. You can have the same person with multiple faces, and we're gonna see that that's the case with the data set we'll be using. It might be a person smiling, frowning, the same person winking; you just collect different images and different facial expressions from the same person.

We’re gonna get in to the data set a bit later. So anyway, that’s the training phase. With the testing phase, when you get a new image, the first two steps are the same. You run face detection and get the features. You have to make sure you get the features using the same algorithm as when you trained. And then once you have those features, then you can compare with all of the examples that you’ve seen before and then make your prediction.

That's face recognition in a nutshell. The most important step of this is to capture these features. How do we get these features given a subsection of an image that contains a face? That's the most important portion, and that's the portion we're gonna be focusing on. A lot of the techniques that deal with that actually involve dimensionality reduction, and I'll give you an intuitive understanding of dimensionality reduction in the next video.

This is just an overview of face recognition. In the next video I'll introduce two of the algorithms that we're gonna use for capturing these features, and I'm going to discuss dimensionality reduction there.

Transcript 4

Hello everybody, my name is Mohit Deshpande and in this video I want to introduce you guys to the dataset that we’ll be using for face recognition.

In the provided code I’ve actually bundled this dataset in it as well. It’s actually a pretty popular face recognition dataset. It’s called the Yale face dataset and so here what some of the pictures look like over here and so when you open it up you might notice that on Linux it’ll render but if you are opening up, if you’re looking at the dataset on a different operating system, sometimes it doesn’t quite render the images of the faces because there’s no file extension, really. They all follow the same naming convention for the file name.

So, a subject and then their number and then dot and then some property about, some sort of description about the person in the image. It ends right there, there’s no .png or something like that so it just ends right there. So sometimes, I’ve had this issue on MacOS, sometimes these images don’t really render so you have to open them up in preview. In Windows you might have to actually open them in the photo viewer or something to look at them. But on Linux, luckily for us, it recognizes them as being images. Like I said, in this video I just want to get you guys acquainted to this dataset that we’ll be using and also show you how we can add ourselves to this dataset actually.

If I scroll all the way down, you can see I've added myself to this dataset as subject16. I'm gonna explain how we can do that as well. So, first of all, for this Yale dataset, let's pretend that I'm not in it. The official Yale face dataset has 15 subjects with exactly 11 images per subject, so that comes out to 165 different images. And I should mention that these are actually GIF images.

If you're trying to use your own images, they don't have to be in a particular file format, by the way. You can use whatever you like; I guess when they were making this dataset they used that particular format, so what kind of image file it is doesn't really matter if you're going to be putting your own stuff in there anyway. Let me just talk a little bit about the way that this dataset is set up.

So, like I said, there's 15 subjects and each subject has 11 images, and these 11 images are all different. That's something you should keep in mind if you're gonna put your own stuff in there: they should be different images, you don't just wanna use the same image 11 times, that's not really that robust. If we look at subject one, after the dot it tells you what conditions are in the image. Centerlight is, you know, center light. There's one with glasses, there's happy like a smiling expression, there's leftlight so one side of the face is a bit darker than the other. There's one with no glasses to contrast with glasses. There's a normal one which is just added in there, there's rightlight where the right side of the face is lighter than the left side, there's sad where they're frowning, there's sleepy where the eyes are closed, there's surprised, and then I think wink is the last one. Yeah, wink is the last one.

So, you can see that for each subject they asked them to do the same thing. So like, smile, left light, center light, wink, sad, you know, and all that stuff, so those 11 images. And they’re all different. So for no glasses and normal, for example, they look really similar but they’re two distinct images and so that’s something that, when you’re creating your own dataset that’s also something that you should do. And so this is quite commonly known as the training set.

And it’s called training set because these images are what we’re gonna be feeding to our AI and telling our AI, hey this is a picture of subject one. We’ll say, here is another picture of subject one. And then we just kinda keep adding pictures of subject one and then eventually we’ll go to subject two and we’ll be like, oh these are pictures of subject two. And you give it the 11 images and it can help classify correctly and recognize whose face this is.

And one thing I should mention is that the way that I described it is, we are just gonna be ignoring what’s after the dot. We just care about the 11 images classifying the subject. They’re supposed to be different images because the more different, the more variety is in your image, or is in your training set for a particular person, the better that it’s probably gonna recognize your face. Because if you take one picture and copy it 11 times and try to teach the classifier that, it’s just gonna pick up the same features 11 times.

So that’s not really that useful. Which is why there’s all these different conditions like center light, there’s one with glasses or without glasses. If you don’t actually wear glasses then you can just borrow your friend’s or do something else with that. When you’re doing your dataset they don’t have to match up with this. You can just take 11 images from different angles and different facial expressions.

And one thing I should mention, is that you might be tempted, looking at this, to see if you can do sentiment analysis and that’s a whole ‘nother field in computer vision. What I mean by that is, given a picture of a face, can you tell if it’s smiling or winking or stuff like that. You might be tempted to think that you can do that given this image set and you might be able to but I don’t think it’ll work that well. In fact, there are actually a lot of APIs that are starting to come about that actually already do sentiment analysis.

So, you can usually just piggyback off of one of those. If you were trying to do sentiment analysis what you would do, your training set will then be different. You would need, just for one subject, you would need multiple pictures of them, like, winking, or multiple pictures of them being happy, or multiple pictures of them sad, and you know, so on and so on. It’s really all dependent on the training set and so I think that’s pretty much covers all that I wanted to cover about the Yale face dataset and so in the next few videos, actually in the next video I’m gonna show you how you can put yourself in this dataset and some of the stuff that has to go with that.

But, I guess I’m gonna stop this video right here for this kinda introduction to the dataset here. Okay so, I’m just gonna do a quick recap, so, for our face recognition dataset we’ll be using the Yale face dataset and there are plenty of other datasets out there. I think AT&T has a face dataset, but it’s way bigger than this so I just wanted to, I chose this dataset particularly because it’s not a massive dataset, it’s like 165 images so you can skim through them really quickly but some of the other datasets have hundreds and hundreds of images so it just gets kinda tedious. Plus, when I actually get to the training phase that might take a bit longer for more images of course.

But anyway, this is the Yale face dataset and it’s split up so that there are 15 subjects and each subject is uniquely identified by a number and each subject takes 11 images. And these images are unique images in the sense that they’re not just the same image copied 11 times. So, here’s all these different kinds of images of different facial expressions and different lighting conditions so that we can really try to make our AI more robust in these changes. So it’s 15 subjects, 11 images per subject and we’re gonna be using this dataset for training our AI.

I guess I wanna stop right here with the Yale face dataset and then in the next video, if I scroll all the way to the bottom here, you can see I managed to put myself into this dataset and so if you wanna add your own face recognition here then you have to add your own face in here and there are some nuances that you have to kinda look at when discussing that so this is probably where I’m gonna stop right here. In the next video I’m gonna show you how you can add yourself to this face dataset.

Transcript 5

Hello everybody, my name is Mohit Deshpande and in this video I want to introduce you guys to this notion of dimensionality reduction. The name sounds much scarier than it actually is, so let me give you an intuitive understanding of what it is, and then we'll look at an example of how we would do this. The reason we're talking about this is because two of the algorithms that we're gonna be discussing, in particular eigenfaces and fisherfaces, are used for face recognition.

Under the hood they actually use two algorithms called principal component analysis and linear discriminant analysis, and both of these are a kind of dimensionality reduction. So this video is gonna introduce you to this concept of dimensionality reduction so that we can then talk about the two face recognition algorithms that use it. So what is dimensionality reduction?

So this just describes a set of algorithms whose purpose is to take data in a higher dimension and represent it well in a lower dimension. I've been using this word dimension a lot, so here's an example that I've drawn of a scatterplot, just a plain old scatterplot, nothing fancy about it. This scatterplot is in 2D, and in the example that we're gonna be using, we're primarily gonna be going from two dimensions to one dimension. So 2D is a plane, and this is our x, y plane.

This is also called the Cartesian coordinate plane, but the point is it's in two dimensions, and the reason it's in two dimensions is because to represent any point in this coordinate system, you need two things to identify it uniquely.

You need an x coordinate and a y coordinate; each of these points needs an x coordinate and a y coordinate to uniquely identify it. The example we're gonna work through goes from 2D to 1D, so what is one dimension? Well, that's just a line, a number line, and we only need one thing, a single component, to uniquely identify a point on a number line. That's what I mean, and this is what we'll be using for our examples. The same idea can be applied to, say, going from three dimensions to two dimensions, but it's just easier to draw going from two dimensions to one dimension, so that's what I'm gonna use for now.

So yes, so what we’re gonna be doing is we need to find a way to take this data in two dimensions and represent it well in one dimension and so you might be asking well wait a minute, how do we do that if the data’s in two dimensions, you know how do we just cut off one portion of the dimension, you know.

So what we'll be doing is using these things called projections, and to handle this intuitively, let me explain what a projection really does. The way we commonly reduce dimensionality is by finding a lower-dimensional axis to project our points down to. What I mean by projection is this: suppose we wanted to project all these points onto the x axis, meaning we take all these points and plot them along just the x axis. Because the x axis is a line, we've then effectively done our job.

We've taken something in 2D, the scatterplot in two dimensions, and we can represent it on a one-dimensional line. To do this, we have to use projections. So what is a projection? Imagine I have a flashlight, or a series of flashlights, and this x axis is a wall, and the points on the scatterplot are objects that are kind of in the way. What I do is take this light and make it shine perpendicular to the x axis.

In other words, at a right angle. So I shoot rays of light, I have my flashlight and I keep shooting these rays of light down onto the axis. What happens is these points are in the way, so these points, these objects, are gonna cast shadows along this wall. And so where along the wall are these objects gonna cast shadows?

Well if I have a ray of light coming in here, then it’s gonna cast a shadow and it’s gonna be right here is where it will, where the shadow will be on the wall and so I can do the same from this point, I can, you know if I have a ray of light going right here then it seems that this point right here would also you know, have a shadow right here. Let’s do this for all the points so let me just go back to this point and if I draw this point and I draw this line from the shadow, it’s gonna appear roughly right here and so if I do this point, then the shadow should appear right here ’cause if I have my rays of light are gonna be casting, you know these objects are gonna cast shadows and I keep doing this and you know I will have an object right here now and then I have this right here.

Okay, so now I've actually taken my points and projected them along the x axis, which means I've done my job of dimensionality reduction. Now I have points in one dimension, along a line. Using projections, we've taken data in two dimensions and projected it down to data in one dimension, and this particular representation is actually pretty good; we happened to choose a good axis to begin with.
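
To make that concrete, here's a minimal NumPy sketch (the points and the axis are made-up illustrative values, not anything from the course) of projecting 2D points onto a single axis:

    import numpy as np

    # A few made-up 2D points; each row is [x, y].
    points = np.array([[1.0, 2.0],
                       [2.0, 1.5],
                       [3.0, 3.5],
                       [4.0, 3.0]])

    # The axis we project onto (here the x axis, but any direction works).
    axis = np.array([1.0, 0.0])
    axis = axis / np.linalg.norm(axis)   # normalize to a unit vector

    # Each point's projection is just its dot product with the unit axis:
    # one number per point, so 2D data becomes 1D data.
    coords_1d = points @ axis
    print(coords_1d)                     # [1. 2. 3. 4.] for the x axis

The "shadow" of each point is exactly that dot product, which is why a well-chosen axis keeps the projected points spread out instead of piling them on top of each other.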

The x axis was a good choice, and I wanted to show you what a good axis looks like first; now let's choose a bad axis to project onto. We did our projections onto the x axis, so now let me project along the y axis instead. How do we project along the y axis? Well, we just take our light and aim it so that it's going to the left, so now my flashlights point to the left, and we do the same thing with projections.

Well, if I take this point here, the light casts its shadow right here on the wall of my y axis. When I pick this point, I get something like this here. But when I get to this point, you can see that it actually overlaps with another point, so these two points cast the same shadow right here, and the same goes for these three points.

It turns out these three points share a shadow: if I line these objects up and cast the light, their shadows overlap, so we just get one point here. You may be thinking, hey, great, we're reducing the amount of data we have, but it turns out this actually isn't a good axis to use. It's not a good line to project onto.

The x axis was good, and the class of algorithms for dimensionality reduction is mostly concerned with picking a good axis. That's what this whole notion of dimensionality reduction comes down to: the two algorithms we're gonna be talking about in the next videos are most concerned with picking a good axis to cast our shadows onto, and that operation is what's known as a projection. Now, I've been talking a lot about scatterplots, but how does this work with images? Because the two algorithms we're gonna be discussing deal with images.

Well, it turns out you can think about the dimensionality of an image. Suppose I have a 10 by 10 image here; what is the dimensionality of this image? The dimensionality is equal to the number of pixels, so this image is actually a point in 100-dimensional space. That's really hard to think about; most experienced mathematicians have difficulty visualizing the fourth dimension, and I'm asking you to think of something in 100 dimensions.

That's just not practical, and we ran into the same principle when we were discussing face detection: do we really need a hundred dimensions to tell us whether two images are the same, or whether two faces are identical? It turns out that no, we really don't. So what we're trying to do is take something in 100-dimensional space and bring it down to something like 10-dimensional space, and then we can compare the two. We do the same thing with the input image: we convert it from 100-dimensional space down to 10-dimensional space or so, and compare.

A 10-dimensional space is still impossible to visualize, but the point is that 10 numbers identifying a point in 10 dimensions might work just fine. Maybe 10 numbers works well, and we can reduce the dimensionality of the image that way. All of this is gonna happen under the hood in OpenCV, so you don't have to try to wrap your head around a hundred-dimensional space. But the principle here is the same.

We just wanna take something that's in a higher dimension and move it to a lower dimension with a simpler representation, so that instead of dealing with a hundred numbers representing a face, we deal with something like 10 numbers. In reality, the input images are gonna be much higher than a hundred dimensions; this example is actually really small. But we want to reduce the dimensionality so that we can compare two images in a relatively low dimension, which gives us a good result and helps improve accuracy and efficiency.
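
OpenCV's eigenface recognizer performs this reduction internally, but purely as an illustration (the data here is random and the scikit-learn call is my own choice, not the course's code), flattened images can be squeezed down to a handful of numbers with PCA like this:

    import numpy as np
    from sklearn.decomposition import PCA

    # Pretend dataset: 50 grayscale 10x10 faces flattened to 100-dimensional vectors.
    rng = np.random.default_rng(0)
    faces = rng.random((50, 100))

    # Reduce each 100-dimensional face to just 10 numbers.
    pca = PCA(n_components=10)
    faces_10d = pca.fit_transform(faces)       # shape: (50, 10)

    # A new input face gets the same treatment before comparison.
    new_face = rng.random((1, 100))
    new_face_10d = pca.transform(new_face)     # shape: (1, 10)

    # Comparing faces now means comparing 10 numbers instead of 100 pixels.
    distances = np.linalg.norm(faces_10d - new_face_10d, axis=1)
    print("closest training face:", distances.argmin())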

Actually, just a quick side note: as it turns out, you can use the same dimensionality reduction techniques to take an image and plot it on something like a scatterplot, and if two images end up close to each other on that plot, that means the two images are actually similar. So you can do all sorts of cool things with dimensionality reduction, but it turns out it's also super useful for face recognition, and the two algorithms we're gonna be talking about in the next few videos basically describe ways to choose a good axis. So anyway, with that,

I'm gonna do a quick recap. With dimensionality reduction, what we're trying to do is take something that's in a higher dimensionality and represent it simply in a lower dimensionality, and I showed this with the scatterplot example.

I want to take this two-dimensional data in x and y and represent it on a line, and the way we do that is with projections. With projections, imagine that the axis you want to project onto is a wall and that you have a light, like a flashlight, casting rays perpendicular to that wall, with the points acting like random objects in the way. When you cast the light, those objects make shadows along the wall, and by plotting where the shadows land you've successfully taken data from two dimensions and put it in one dimension.

That's also kind of why they're called projections, because you can think of it as a light projecting a shadow. So that is dimensionality reduction, and in the next two videos we're gonna discuss two particular algorithms we can use to answer the question of how to pick a good axis. I'll get to the first one, eigenfaces and principal component analysis, in the next video.

Interested in continuing? Check out the full Build Lorenzo – A Face Swapping AI course and Build Jamie – A Facial Recognition AI course, which are both part of our Python Computer Vision Mini-Degree.

]]>
Recognizing Images with Contour Detection using OpenCV https://gamedevacademy.org/image-recognition-opencv-tutorial/ Fri, 08 Feb 2019 05:15:43 +0000 https://pythonmachinelearning.pro/?p=2262 Read more]]>

You can access the full course here: Advanced Image Processing – Build a Blackjack Counter

Transcript 1

Hello everybody. My name is Mohit Deshpande. And in this video, I want to kinda introduce you guys to the concept of image segmentation.

So to motivate this discussion, here is an image of a wallet on a table. And so as humans, we can easily differentiate between the wallet and the table, and we kinda know where one begins and one ends. And I can actually draw this outline, pick a nice bright color here. So here is the wallet and here is the table.

And so as humans, like I said, we can easily differentiate between the two, but this is a bit more difficult for a computer. Because remember that computers only have access to the raw pixel data of this image. So if we wanna perform any image processing operations on the wallet, we’re at kind of a loss, because we first need to separate the wallet from the table. We don’t wanna apply the same operations to both the wallet and the table. We just want to apply it on the wallet. So this particular challenge of separating the different parts of an image is called image segmentation.

So image segmentation. Let me see if I can get that m up there actually. There we go. So it’s a problem of image segmentation. That’s like, given an image, we want to separate the image into different parts. We wanna kinda carve out the different parts of the image based on some of the criteria.

And just as a terminology thing, if we're looking for one object in particular, we call it the object of interest, which is pretty self-explanatory. So if I just wanted to get this wallet, I can carve out this portion of the image, like what I did here. However, like I mentioned, this is actually a bit more challenging for a computer, 'cause we have to tell the computer exactly how we want to make this determination. And this is part of the reason why this is so challenging.

And to kind of further explain why this is challenging, let’s consider a different image here. So now I have a lot of different objects that are on, on this table. And there’s still lots of research done in this field of computer vision, particularly it’s called SDS: Simultaneous Detection and Segmentation. And so there’s kinda two parts here. There is the detection part. So the detection part would say, this is a card. This is a pencil, pencil. This is also a pencil. And this is a pen. So that’s kind of what the detection part of SDS is, but we’re not gonna be getting too much into that. But the interesting part that we’re concerned about is the segmentation part. And that is separating this, separating the different parts of the image.

So here is a foreground object, it’s a card. And here’s another object here, this is a pencil. And then there’s some questions that come into play like, these are both pencils, so should I be drawing these boxes individually around them? Or should I draw one big box and carve out one big region for both of them, because they’re both labeled pencils? So these are some of the kind of questions that come into play. And here is just a pen here. So this is kind of the problem of simultaneous detection and segmentation. We’re not gonna be worrying about the detection part, but we’re particularly interested in the segmentation part.

And so in the next video, we’re going to look at, probably the simplest form of image segmentation called thresholding. So I’ll get to that in the next video.
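
As a quick preview of that thresholding step, here's a minimal sketch of how it might look with OpenCV (the filename and threshold value are placeholders, not from the course files):

    import cv2

    # Load a test image and convert it to grayscale.
    img = cv2.imread('cards.png')                  # placeholder filename
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Pixels brighter than 127 become white (255); everything else becomes black (0).
    # The first return value is the threshold used; we only need the image.
    _, thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

    cv2.imshow('thresholded', thresh)
    cv2.waitKey(0)
    cv2.destroyAllWindows()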

Transcript 2

Hello everybody. My name is Mohit Deshpande, and in this video I want to introduce you guys to this concept called contour detection.

So on the bottom left you see one of the test images that we’ve been using. It’s just a picture of a playing card, some pencils and a pen, and on the right is actually the same image, but thresholded. So you can kind of see the difference between the two, and the reason that we need contour detection and contours is because thresholding just isn’t enough. If we look at the thresholded image, that gives us a good indication of where our object of interest is, visually, assuming we can see this, but that’s still not quite enough because you notice there’s still some other errors that are kind of over here.

And so we want to go one step further, and that’s to actually figure out what this contour is. So let me put it in red. And the goal is to actually draw this boundary here, this contour. I’m doing kind of a bad job at it, but in reality, it’s gonna be much, much smoother than what I’ve drawn it as, but the goal is to detect this contour here. Because then when we have this contour, then we know the dimensions of our object of interest, the total perimeter around it, and we can do all sorts of image processing tasks to it and completely ignore these. So I’ve been saying the word contour a lot, but let me actually define it.

So a contour, a contour is just a closed curve along a boundary of color or intensity. And intensity is just the same meaning as color, except in a gray scale context. Intensity is just another image metric, but in our case we can use either. So the apparent change of intensity would be like going from black to white, that’s a change in intensity, because black has a value of zero and white has a value of 1 or 255.

And so, like I mentioned, a contour is a closed curve along a boundary of color or intensity, and intuitively that's what the computer looks for when it's doing contour detection, because a boundary of color is a very good indicator of a contour, a closed curve of some shape. In this case it happens to be a shape like a rectangle, but it doesn't have to be a shape; it just has to be a closed curve, like what I've drawn here. And so we can draw a contour around this playing card, and what we get, more concretely, is a list of points, the points or coordinates that define the curve.

For example, the curve is this list of coordinates, and so the contour is more concretely the list of points or coordinates that define the curve and define the boundary. And the details of contour detection are actually fairly complicated, so I don’t really want to get too much into the details, but there are a few parameters that are very useful to know and understand.

And so in the next few videos, we are going to be looking at a couple of the parameters for contour detection, and then another topic about how we can fit the contour to any shape, so we’ll be discussing first the parameters in the next videos.
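
To give a feel for the basic call before those parameters are covered, here's a minimal contour-detection sketch (the filename and threshold are placeholders, and the two-value return from findContours assumes OpenCV 4):

    import cv2

    # Threshold first, as in the previous lesson, then look for contours.
    img = cv2.imread('cards.png')                  # placeholder filename
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

    # Each contour comes back as a list of boundary points (coordinates).
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Keep the biggest contour by area and draw it in red on the original image.
    biggest = max(contours, key=cv2.contourArea)
    cv2.drawContours(img, [biggest], -1, (0, 0, 255), 2)

    cv2.imshow('contour', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()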

Transcript 3

Hello everybody, my name is Mohit Deshpande, and in this video, I want to show you guys what we will be building.

What I have here is… I have some images here, and in particular, I have something like this. Oh, one thing I should mention is that some of these images are actually not mine. They're taken from Github, from arnab.org, so I've borrowed some images, and I should give credit where credit is due and mention that these aren't really mine. But I just want to mention that these are the kind of images that we'll be working with. In particular, there's already lots of code out there on Github; people have actually already built things like this. This is also by the same person, arnab.org on Github, and what he's done is built a kind of training set of all of the playing cards in a deck of cards.

So, we’re actually going to be using that to compare with our new input here. We can compare these images to kind of our like our training set here, and we’ll be able to detect which cards have been selected. And we’ll actually be able to see which cards have been selected. And so, let me actually run this and we can see the demo.

So, when I run this, it’s going to be really cool because… Just a second, yeah there we go. You can see that, what’s really cool about this if you look at the title, it’s actually representative of what card we’re looking at. This is a 9 of spades and we’re looking at 9, S for spades, or like 7, H for hearts, or 8, S for spades, 3, H for hearts. So, let’s actually see if these are correctly being detected.

So, if I open up Test… See, we have 3 of hearts, yeah, that's this one. We have our, whoops… there's so many images here. What else? We have the 7 of hearts, which is this one right here, and then we have our 9 of spades, and then we have our 8 of spades. Let me find that 8 of spades. There we go, our 8 of spades. And so you can see, we found all of our images. We were able to detect the playing cards.

And in fact one extra thing that we did was, notice how these are all nice and square, they’re flat? That’s not the case with these images, right? These are tilted, these are kind of at an angle, and so we’ll also learn how we can de-warp these images, so that they look nice and flat like this. And so you know, from here, this is where, from here you can kind of go on. So now that we’ve detected the cards, this, you know you can take the next steps, and once we have the card detection going, this is where you can go forward with this. Use this kind of like a baseline for any other projects that you might use that involve playing cards for example.

So, this is a really good detection scheme. What we’re going to be talking about in the next few videos is we were going, or… I’m going to give you some code first of all. I should mention that you don’t have to code all of this from pure, from like absolute scratch. You’re not going to have to do that. I’ve actually have some code that I’m going to provide you guys. and then there’s going to be, we’re just going to kind of fill in portions of the code, in particular, the training of our model. We’re kind of building like a really, really small machine learning model that we’re just going to be using to compare. I don’t want you guys to worry about that. We didn’t cover any of that.

So, I’ve just coded that up for you. But, what we’ll have to do is we’ll have to build the… Well to write the code that will actually detect this card and then do this de-warping thing. So, that we can actually compare it to what’s in our model. And the result you’ll see is going to be all of… You know we can see all of these images here. and we get a, let me pull this up here. And we get a list of tuples, where the first thing in the tuple is what number it is, and the second thing is what suit it is. This is something that we’re going to be building, and I’ve already provided some code for you, and I just want to show you what it’s going to look like. You can use this as a kind of the base or foundation or framework for any other projects that you’re going to be using regarding cards.

Anyways, this is kind of the app that we’re going to be building. In the next video, I just want to kind of… Like I said I’ve written a ton of code for you guys that we can just use. In the next video, I’m going to explain what code is there, so that you kind of have a better understanding. And then you know off we’re going to go. We’re going to go ahead and build our app here. On the next video like I said, I just want to kind of introduce you to the code that’s currently there.

Transcript 4

Hello everybody, my name is Mohit Deshpande, and in this video I want to go over some of the code that I’m going to be providing you guys so that you don’t have to worry about getting all of the small machine learning models set up.

You don’t have to worry about that. I’ve already provided everything for you. But I kind of want to go through this step by step, kind of, so that you are at least familiar with the code here. And so, the first thing you might notice is that this is actually not that many lines of code to get something this cool up and running, and that’s kind of the standard that you can expect with OpenCV, is that it doesn’t take a lot of code to get something really cool up and running.

So, as you look through this code, you can see well, hey, this code is what we’re going to have to write, probably, because it’s currently not really doing anything. So, you know, we’re going to have to write this code here and, actually, let me return an empty list here, so that we know that this will actually still function.

So just to go over some of this code here, first things first, we have to build some model, some means of looking at all of our images, our training data, and our labels, so that we can correctly know that, well, this card looks like the king of spades, and that's really what it is. So I provided you this training image and this comma-separated value list (I should say it's actually separated by tabs, so a tab-separated values list) that corresponds to the labels. All of that is there, and this function just extracts all that data and builds the model that I'm calling our machine learning model.

But anyway, that's what this function does, whoops, let me scroll back here. There are also some utility functions here. In particular, there's this reorder function. Basically, what this reorder function does is described by a comment here that says "Reorder points in a rectangle in clockwise order to be consistent with OpenCV."

And that's really what it does: when we do any kind of contour detection, we have to make sure we're consistent with OpenCV. In particular, OpenCV likes it when our rectangles start at the top left and the points are in clockwise order. We need this function because when we detect a contour for our cards, we want to be robust to whether they're tilted or have any sort of perspective tilt to them.

You noticed that when I ran the full code, the images we saw weren't tilted or anything; they were actually perfectly square. We're going to need this function so that when we detect the contour around our card, we can order the points so that they fit OpenCV's scheme, and then we can apply what's called an affine transform to square up the corners of our images. So that's basically what this function does: it's a utility function that reorders our points so that we're consistent with OpenCV and can get that nice square image.
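
The provided reorder function isn't reproduced here, but as a rough sketch of the idea (a hypothetical helper of my own, not necessarily the exact code you'll receive), ordering a rectangle's four corners clockwise from the top left often looks like this:

    import numpy as np

    def reorder_rect_points(pts):
        # Order 4 corner points as top-left, top-right, bottom-right, bottom-left.
        # Common trick: the top-left corner has the smallest x + y sum, the
        # bottom-right the largest, and y - x separates the top-right (smallest)
        # from the bottom-left (largest).
        pts = np.asarray(pts, dtype=np.float32).reshape(4, 2)
        ordered = np.zeros((4, 2), dtype=np.float32)
        s = pts.sum(axis=1)
        d = np.diff(pts, axis=1).ravel()     # y - x for each point
        ordered[0] = pts[np.argmin(s)]       # top-left
        ordered[2] = pts[np.argmax(s)]       # bottom-right
        ordered[1] = pts[np.argmin(d)]       # top-right
        ordered[3] = pts[np.argmax(d)]       # bottom-left
        return ordered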

There's this preprocessing function, which is actually really common in a lot of computer vision applications; there's some sort of cleaning that's basically done here. And what we're doing for this kind of cleaning is, first of all, converting to grayscale, then applying a little bit of blur, and you might be saying, "Well, why do I want my image to be blurry? That seems kind of counterintuitive."

And the reason behind this is to get rid of all the small noise. I talked about this when we were doing contour detection: you want to apply a little bit of blur so that you don't pick up really small contours that are just random image noise from the camera you were using at the time. Images can be kind of noisy, so this blur helps smooth out the noise. And then you see we have this adaptive threshold thing.

Hey, wait a minute, what are we doing here with this adaptive threshold thing? I didn't want to get too much into this because it can be kind of complicated, but adaptive threshold is a different algorithm used for thresholding, and it helps make our thresholding a bit more robust to things like lighting. Because if you notice, in the image with the random objects on my desk, one part of the image is kind of lighter than the other and the other part's darker.

With just plain old thresholding, the issue is that it can sometimes be hard to find a single global threshold, so what adaptive threshold basically does is find a local threshold and do the threshold operations more locally, instead of using one giant global threshold. How it does that is a bit more complicated than regular thresholding, so I passed on the details. But anyway, this is just part of the image cleaning and preprocessing steps so that the images come out really nice.
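
A preprocessing pipeline along those lines might look like the following sketch (the blur kernel, block size, and constant are typical guesses, not necessarily the values used in the provided code):

    import cv2

    def preprocess(img):
        # Grayscale, then a slight blur to smooth out camera noise.
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)
        # Adaptive threshold picks a threshold per neighborhood instead of
        # one global value, which copes better with uneven lighting.
        thresh = cv2.adaptiveThreshold(
            blurred, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # local Gaussian-weighted mean
            cv2.THRESH_BINARY,
            11,                              # neighborhood (block) size, must be odd
            2)                               # constant subtracted from the local mean
        return thresh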

The next function is our comparison, and this is how we're going to look at two images A and B and ask, "How similar are they?" What we do here is, again, apply a blur to smooth out any noise, take the absolute value of the difference of the two images (just a matrix difference, taking the absolute value of each element), and then apply another thresholding operation on top of that. All of this helps reduce noise, because there are a lot of opportunities for image noise to mess up our results, and we're trying to minimize that as much as possible.

And then this np.sum just returns a single value that represents all of the error; it's the sum of all our errors. The closest-card function, given our machine learning model and any input image, tries to find the card that's closest to the given card. What it first does is preprocess the image, then it looks through all the images in our model and compares each one to the input image, and because that comparison returns a single value, we can sort on that value.

And so what we want to do is sort so that the first value represents the label corresponding to the image that's closest to the input. This comparison value is a metric of image similarity: the smaller it is, the closer the two images are, so we find the closest possible match and then return the number and the suit of that closest match. That's what we're trying to do, and we can actually return the image itself as well.
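
In sketch form (function names and the threshold value are my own guesses, not necessarily what's in the provided file), that comparison metric and the closest-card lookup boil down to something like:

    import cv2
    import numpy as np

    def image_diff(a, b):
        # Single error value: the smaller it is, the more similar a and b are.
        # Both images are assumed grayscale and the same size.
        a = cv2.GaussianBlur(a, (5, 5), 0)
        b = cv2.GaussianBlur(b, (5, 5), 0)
        diff = cv2.absdiff(a, b)                                   # |a - b| per pixel
        _, diff = cv2.threshold(diff, 40, 255, cv2.THRESH_BINARY)  # ignore tiny noise
        return np.sum(diff)

    def closest_card(model, card_img):
        # model is assumed to be a list of (label, reference_image) pairs.
        return min(model, key=lambda item: image_diff(item[1], card_img))[0]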

Probably one of the most important functions here, and one that we're going to be writing, is called extract cards. It takes the raw image and the number of cards in that image (this is the part we have to fill in ourselves), and what we want it to return is a list of images that are de-warped, so there's no bending to the image or perspective tilting or anything like that. We want to return those nice, clean images so that we can see them.

And you might be saying, "Well, where's this used?" Let's find out. If we search for extract cards, you see that it's used in our model and in actually executing the code, so it's used throughout our app and it's really important; we have to be careful when we're coding it. You can see it's used here so that we can see the images of all the cards in the input, and then here is where we get all the cards in our image. Basically, with this line we get all the cards in our image, and then we return the card that is closest to each particular card.

So we basically get all the cards in our image, pass each one through our model, and ask, "Hey, which card is this? What is the number and the suit of this card?" And that is what our model will tell us. So that is basically the code that I've given you, and in the next few videos we're actually going to be coding this function. Surprisingly, it's not going to take that many lines of code to get this up and running, but it's a very important function, and we have to make sure we're coding it appropriately.
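
For a feel of what that de-warping involves, here's a rough sketch of pulling one card out of a frame with a perspective warp (the contour-approximation tolerance and the 200x300 output size are guesses of mine, and the helpers passed in are the preprocessing and point-reordering ideas discussed above):

    import cv2
    import numpy as np

    def extract_card(img, preprocess, reorder_rect_points):
        # Find the largest contour, assume it is the card, and flatten it out.
        thresh = preprocess(img)
        contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        card = max(contours, key=cv2.contourArea)

        # Approximate the contour with corners; this sketch assumes exactly 4 come back.
        peri = cv2.arcLength(card, True)
        corners = cv2.approxPolyDP(card, 0.02 * peri, True)
        src = reorder_rect_points(corners)

        # Map the tilted card onto a flat, upright 200x300 rectangle.
        dst = np.array([[0, 0], [199, 0], [199, 299], [0, 299]], dtype=np.float32)
        M = cv2.getPerspectiveTransform(src, dst)
        return cv2.warpPerspective(img, M, (200, 300))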

So, in this video, I just kind of wanted to show an overview of the stuff that I’ll be giving you, the code, so just to go from top to bottom here. This just is a utility function that helps reorder our points when we get our contour. This preprocessing stuff just cleans up our image a bit. The image comparison is actually used to find the closest card. You want to basically find the card that has the smallest error to our given card in our training data, or our machine learning model.

And this training data, this train function, just returns our machine learning model based on these two key pieces of data that I’m going to be giving, the images and then the correct ground truth label, so to speak. And then, you know, this just helps you to visualize what the resulting cards are going to look like.

So, anyway, that is kind of this provided code here and then in the next few videos, we’re going to be filling out this function here.

Interested in continuing? Check out the full Advanced Image Processing – Build a Blackjack Counter course, which is part of our Python Computer Vision Mini-Degree.

]]>
An Overview of Structural Similarity https://gamedevacademy.org/structural-similarity-tutorial/ Fri, 01 Feb 2019 05:00:34 +0000 https://pythonmachinelearning.pro/?p=2260 Read more]]>

You can access the full course here: Create a Raspberry Pi Smart Security Camera

In this lesson we will discuss a different approach to image similarity called structural similarity (SSIM). Mean squared error is a really good measure of error difference, but the issue with mean squared error is that it looks at each pixel individually and independently. This is different from how humans perceive images, because we don't compare two images by looking at every one of their pixels.

We look at the images in a more holistic sense. This is what structural similarity is trying to capture.

So instead of treating pixels independently, structural similarity looks at groups of pixels to better determine whether two images are different or not.

MSE and SSE look at pixels individually. SSIM looks at groups of pixels, and this is better because small changes in noise and variation don't affect groups of pixels as much as they affect individual pixels.

As it turns out, structural similarity is actually a product of three other kinds of measures.

Structural similarity formula: SSIM(A, B) = L(A, B) × C(A, B) × S(A, B)

The L stands for luminance, C is for contrast, and S is for structure. 

Structural similarity formula with luminance, contrast, and structure labeled

When you look at groups of pixels, you get a better representation of the image, and subsequently a better representation of the difference between images, than when you look at just one individual pixel at a time.

The good thing about structural similarity, and where it differs greatly from MSE or SSE, is that it has both a lower and an upper bound.

Visual representation of structural similarity bounds
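
We won't need to implement SSIM ourselves; a minimal sketch of comparing two images with scikit-image (filenames are placeholders, and the import path assumes a recent version where the function lives in skimage.metrics) looks like this:

    import cv2
    from skimage.metrics import structural_similarity as ssim

    # Two grayscale images to compare (placeholder filenames).
    img_a = cv2.imread('frame_a.png', cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread('frame_b.png', cv2.IMREAD_GRAYSCALE)

    # SSIM returns a value in [-1, 1]; 1 means the images match perfectly.
    score = ssim(img_a, img_b)

    # Mean squared error for contrast: 0 means identical, with no upper bound.
    mse = ((img_a.astype(float) - img_b.astype(float)) ** 2).mean()

    print(f"SSIM: {score:.3f}, MSE: {mse:.1f}")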

 

Transcript

Hello everybody. My name is Mohit Deshpande, and in this video I want to discuss a different approach to image similarity called structural similarity.

And so a mean squared error is a really good measure of error difference, but the issue with mean squared error is that it looks at each pixel individually and independently. And this is different to how humans perceive images because we don’t look at two images and look at all the pixels of the images and compare them. We look at the images in a more holistic sense. And so this is what structural similarity is trying to capture.

Well, humans look at images more holistically, and structural similarity tries to capture this using statistics. That's the issue with mean squared error: we're just looking at all of the pixels individually. As humans, we don't do that; we can't see all of the small variations and noise in an image, and overall it doesn't really matter. If we show you two images and one of them has all of its pixels shifted by just one value, you're probably not gonna notice the difference at all, and you'll say, well, these two images are exactly the same.

But mean squared error will see it, and it will pick that up. So instead of treating pixels independently, structural similarity actually looks at groups of pixels to try to better determine if two images are different or not. And so, let me actually write this down: MSE, and also sum of squared error if you choose to use that, look at pixels individually. Look at pixels individually. But with structural similarity, that's a terrible S. Third time's a charm. There we go, yeesh.

With structural similarity, we look at groups of pixels. So groups of pixels. And this is a bit better than looking at pixels individually because then all those small changes in noise and variation don’t tend to affect groups of pixels than they do with individual pixels.

And as it turns out, structural similarity is actually a product of three other kinds of measures. So let me actually write out structural similarity. The structural similarity between A and B is actually a product of some other things: it is L of AB times C of AB times S of AB, and I'm gonna explain what these are in just a second. So L stands for luminance. So, luminance. C is contrast, and S is structure.

And I’m not actually gonna expand this out because there’s an actual formula dealing with individual pixels and actually with groups of pixels using statistics like the mean and variance. And I actually don’t wanna expand that out because that kind of gets more into statistics and histograms and stuff that I don’t really want to get into that much. It requires a pretty decent amount of knowledge about statistics.

But A and B, it turns out that when you look at groups of pixels, then you get a better representation of the image and a better representation subsequently of the difference between images when you look at groups instead of just an individual pixel. And so we look at things like, we look at the histogram basically. We look at the averages of pixels. We look at the spread of a group of pixels, and all of these measurements use statistics to actually determine similarity. We get kind of a better idea of the difference between the images when we look at groups instead of just individual pixels. In the next video, I’m actually gonna show you why this a bit better. And this is slightly more representative of how humans perceive images.

And there are more topics to discuss about that that we won't be able to get into, but structural similarity is a really good measure of image difference. And the good thing about structural similarity, and where it's much different from MSE, mean squared error, or sum squared error, is that it actually has a lower and upper bound. So what I mean by that is, let me just use a different color here, is that structural similarity is always between minus one and one, where positive one is a perfect match, and minus one we're not gonna talk too much about, but minus one means that it is imperfect.

Or I should say it is perfectly imperfect, and that just has to deal with the way that the stats works out, but we usually don’t consider this realm here. We’re just more concerned about this realm. So the closer that the result of structural similarity, and it returns a number just like mean squared error and sum squared error, the closer it is to one, the closer two images are to being related, or I should say, the closer the two images are to being exactly the same. ‘Cause if I have a structural similarity index, I get the value back, it’s one, then I know that these two images are exactly the same. If it’s a little less than one, then I know that well, they’re a little bit different.

And so this kind of works in a similar fashion as mean squared error, but if you notice, mean squared error actually doesn’t have an upper bound. Besides barring any sort of bounds that you put on the pixel values themselves, mean squared error just has a lower bound of zero, and then it could be any large number after that. But with structural similarity, it’s always between minus one and one, and that’s really nice.

And so I think this is where I'm going to stop this video, and we're actually going to see an example of structural similarity and how it compares to mean squared error in the next video. And don't worry, we won't have to implement this at all, not even using NumPy, because as it turns out there's a library called scikit-image; it's part of the scikit family (related to, but separate from, SciPy), and it gives us a function called SSIM. You just pass it two images, and it computes a value, so it looks almost exactly like this. It has other parameters, but we don't have to implement it ourselves or know what all of those are; we can just pass the images into the function and run it.

But anyway, I’ll stop right here, and I’ll just kind of do a quick recap. So with structural similarity, we try to take a different approach to considering image similarity, and that is we look at groups of pixels and perform statistics on those groups, rather than looking at each pixel individually like mean squared error or sum squared error does. And looking at groups is a bit better because it’s more representative to how humans view images. And so any of the small tiny variations in the image that mean squared error will pick up, structural similarity actually will account for that.

And so, structural similarity is also generally a better approach to image similarity, although at some cost. Another important thing that I want to mention is that structural similarity also has a defined range. It goes between minus one and one, that’s it. Mean squared error, it starts at zero ’cause it’s an error metric, but beyond that point, we don’t know what the upper bound necessarily is, barring any sort of limits that we place on the pixel values.

But anyway, we’re going to look at how we can, we’re gonna compare structural similarity and mean squared error in the next video and look at how we can call structural similarity using that scikit image library. So we’ll get to all that good stuff in the next video.

Interested in continuing? Check out the full Create a Raspberry Pi Smart Security Camera course, which is part of our Python Computer Vision Mini-Degree.

]]>
How to Process Video Frames using OpenCV and Python https://gamedevacademy.org/video-processing-opencv-tutorial/ Fri, 25 Jan 2019 05:00:35 +0000 https://pythonmachinelearning.pro/?p=2258 Read more]]>

You can access the full course here: Create a Raspberry Pi Smart Security Camera

Transcript

Hello everybody, my name is Mohit Deshpande and in this video, we’re going to start building our app.

Actually, before we really get into our app, we first have to discuss something really important, and that is: how do we actually look at video in terms of images? One way to think about it is that video is really just a sequence of still images. You can see that if you take any video, pause it, and step it forward just a little bit: you're going frame-by-frame.

That's what they're called in video: frames. Each one is just a particular still image. You can go through all of the frames, and when you play them really fast it appears as video, because our eyes can't detect those changes that fast. We don't see them as still images when we play them fast enough; we see one coherent video. As it turns out, this is exactly how OpenCV likes to think of videos, as still images, so any of the image processing stuff that we've already talked about, we can apply to each frame of the video.

There are a couple of different ways that we can set up video, and we're gonna get into the Python code now. The first thing I should do is import some of my core things here, and that is cv2, and then I'm gonna need numpy as np at some point. When you're starting any sort of CV project, these two imports are really good as the first two lines; you just start by importing cv2 and numpy, 'cause odds are, if you're doing anything with CV, you're gonna need these two.

Anyway, so now the question is, how do we get video? For example, how do we open a file? There are lots of video files that we can use, .mov or mp4, or even better, for the purposes of our security camera, we don't want to just open up a video file. We actually wanna open up a camera, a live camera. As it turns out, in OpenCV it's really, really easy to transition between a video file and the camera itself.

We’re gonna be dealing mostly with image files because some of you may not have a webcam, for example. So we’re primarily going to be dealing with image files, but I’m gonna show you how we could extend this to your webcam. Actually, if we were to run this code on the Raspberry Pi, you would use the camera that I showed you how to install.

First things first, we actually have to tell OpenCV whether we wanna use a file or the actual live camera stream. There’s something in OpenCV that we can use. I’m just gonna call this cap for capture. We’re gonna say cv2.VideoCapture and video.mp4.

Actually, what I have is in the same directory. I call this app.py, and it's in my Developer/security folder, the same Developer/security folder where I have the video, and we're gonna run this against our security camera footage. We're gonna be testing against it so that we can see if it works or not. This is security camera footage, and if I double-click on it, it shows me breaking into my roommate's room.

This is the video feed that we're gonna be using, and I'm gonna provide you this video so you can test it. It's fairly short, because a really long video file takes up a lot of space on your hard disk and we really don't need that much. I tried to balance it out so that there are at least a few seconds of just stillness, so we have something to compare against when we do the image comparison with video frames. Anyway, we're gonna get to that much later.

I just wanted to introduce you to the test video, kind of like our data set, that we'll be using. You can see why I paused it: this is one image, and this is how OpenCV is going to interpret the video. What OpenCV is gonna do is look at each single frame of our video, and we can iterate through those until we reach the end of the video. So let me quit out of this.

So you'll be provided with the video, and you can feel free to use your own or to use a live stream from your webcam, but I'm just gonna provide this video so that everyone's consistent. I've got some other stuff here, but I'll explain that as we move along. So anyway, this is gonna be a video capture, which is why I called it cap, and now what we want to do is basically recreate a video player: load this video and just play it back frame by frame. So how exactly do we do this?

Well, first thing is we need some kind of loop, some kind of structure, to make sure that our video, we’re getting valid frames from our video. So to actually display these videos what we want to do is, if we think about it, we just want to always be pulling frames from this video until we reach the end. What we’re gonna do, like I mentioned, is to just build a small app, kind of, that just replays this video file, and that’s a pretty good start. So to do this, I’m going to first start off an infinite loop. And you may be saying, whoa wait a minute is there some way to check to make sure that, how do we know if we reach the end of the video? And I say, hold on a second, I’m getting to that.

The reason that we put it in a while True loop is so that we keep pulling frames from the video, and if we're at the end of the video, which won't give us a frame, that frame will actually be None in Python. Once we fetch a frame that's None, we know that we've reached the end of the video and that we need to stop, thus break out of this loop. So that's basically what we're gonna do. Actually, before I get to that, one important thing that you have to remember to do is call release() on this VideoCapture. That's sort of a cleanup thing, kinda like a cv2.destroyAllWindows sorta thing.

It just releases the resources that are allocated to this video or the webcam. In this while True loop, what I want to do is pull a frame from this capture. So I can do that really easily using cap.read(). And it’s kinda like imRead except with videos and print. And it actually returns two things. So first thing, it actually returns a tuple or a list. First thing is just a return value that we don’t really care about. But the second this is the actual frame.

So now that we have the frame, this is where we can do some sort of checking like, something like, if frame is None then we want to do something like break out of this loop. So when we enter this case, then we know that we’re done with the video and we just end. With webcam stuff you don’t necessarily need this because as long as the webcam is plugged in and running then we really shouldn’t ever encounter this case.

Except for maybe there’s some weird CV thing that might, some error with OpenCV that could potentially happen that causes this to return None. This still might be a good idea to leave in just in case. Just in case. Anyway, now that we have a frame we can treat this just like an image, any of our image processing techniques that we’ve learned about before, we can apply to this frame. It’s just a frame, it’s a single image. Anything that we know, we can just apply to this frame and that’s super awesome about OpenCV.

The one thing that we have to keep in mind, though, is that video is a sequence of frames, so you don't want to do any image processing on a frame that's going to take a long time. And we're gonna see, as we get towards the end of the app, that performance becomes really important.

The speed of your code becomes pretty important when you’re dealing with video. Because if your code is slow on these image processing tasks then you’re gonna notice it because the frame rate in the video is just gonna drop way down. So we have to make sure that we’re not doing too intensive computer vision operations. It can’t be too intensive when we’re dealing with frames. We just want to get in the frame, do what we can with it, and then move on to the next one, quickly.

Now that I have this frame, though, I can just show it like an image. I'm gonna do cv2.imshow(), pass in a name for the window and the frame, and that shows the frame. There's one other thing that we have to do because, again, we're dealing with video: we do something like cv2.waitKey(1), and then there's this & 0xFF. If this is equal to the key q, then that's another reason for us to break, and this q just lets us quit out of the while loop with a single key press.

And it turns out that we need this sort of thing because we have to make sure that when we’re fetching things from a file or from the webcam, we have to actually give it just a small amount of time for us to actually fetch the frame and do something with it. It turns out, if you get rid of this line then it’s gonna be the case that your app’s not gonna work because it’s gonna crash along this line saying, hey you haven’t provided a frame. That’s because we have to give our camera a second to actually pull the frames here.

Now that I have this going, let's actually run it. Oh, one thing I forgot to do is call cv2.destroyAllWindows() so that we get rid of the window that pops up at the end to show our video. This is all we need to run our video, so let's run it and see the video playing. And I guess I also forgot to put this if here; that's important too. Okay, and this is all we need to play our video.

So let’s go ahead and run this and we should see our video play. And yeah, awesome! We can see our video playing and it should close out of this in just a second. Excellent, okay! Now we have our video playing and this is actually just where I want to stop, right here because we’ve actually covered quite a bit in this video. So lemme just do a quick recap. And actually before we stop, I just want to mention one thing.

And that's how we're playing video right now, but if I wanted to use a webcam, how would I do that? It's actually really simple. I'm going to copy this line here and paste it. This is what's really awesome about OpenCV: to go from a video to a webcam we use this exact same line of code, except instead of the filename we put zero. And that's it. So we can replace line four with line five, and this will actually get video from your webcam or the camera on the Raspberry Pi instead of a particular video file.

But like I mentioned, just to keep everything consistent, we’re gonna be using video files, but I’m gonna leave this in here and I’ve commented it out so that you can easily flip between the two if you so want. This is where I’m actually gonna stop the video because we’ve actually covered a lot.

So, just to do a quick recap. We've covered how to load a video from a file and how to stream it from a webcam. I then mentioned that videos are just a bunch of still images, frames, played so fast that as humans we see them as one continuous video. So now that we're loading up this video, how do we actually pull frames from it? That's what cap.read() does: it pulls the next frame (or the first frame). The first value it returns is just a return value, but the second value in the tuple is what we care about, and that's the frame.

So then we can just show this frame, and we can do anything we want to it that we learned about in image processing, 'cause this frame is just an image. To illustrate this, we use cv2.imshow, which is what we use for images, and now we're using it for frames. Of course, frames are images, which is why this works out. And one thing that we have to have here is cv2.waitKey(), because we have to give our camera a second to actually take the frame and give it to us so that we can work with it, or in this case, take a frame from the video and give it to us. This also adds in some functionality so that we can quit out of our app any time we want.

And then one special case that we have to think about is, what if we’re at the end of the video, in particular if we’re loading video and not streaming from the webcam. What if we’re at the end of the video? So what happens is, if we’re at the end of the video this should return None because we’re trying to get the next frame after the last frame, but there is no frame after the last frame, which is why we return None. So all this makes sense.

So there are two last things that you have to do any time you're working with video capture. You have to make sure you call cap.release(), which will release your app's control of the webcam, or close up any resources that deal with the video, and then we have our classic cv2.destroyAllWindows().

Okay, so this is what we covered in this video. We started building our app, and if you run it, it will just load the video and play it. This is a really good start, but in the next video we're going to build on this concept a bit more and actually get into thinking about how we can build a security camera and some of the different talking points with that. So we're gonna get into that in the next video.
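
Putting the steps from this transcript together, a minimal sketch of the playback loop looks roughly like this (the window name is my own placeholder; swap the filename for cv2.VideoCapture(0) to read from a webcam instead):

    import cv2

    cap = cv2.VideoCapture('video.mp4')   # or cv2.VideoCapture(0) for a webcam

    while True:
        _, frame = cap.read()             # pull the next frame
        if frame is None:                 # no frame left means the video ended
            break

        # Any image processing we know can be applied to `frame` right here.
        cv2.imshow('security feed', frame)

        # Give OpenCV a moment to fetch/display, and let 'q' quit early.
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()                         # free the file/camera resources
    cv2.destroyAllWindows()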

Interested in continuing? Check out the full Create a Raspberry Pi Smart Security Camera course, which is part of our Python Computer Vision Mini-Degree.

]]>
Python Mini-Degree https://gamedevacademy.org/python-mini-degree/ Sat, 01 Dec 2018 00:30:20 +0000 https://pythonmachinelearning.pro/?p=2216 Read more]]> Go from Zero to Python Expert – Learn Computer Vision, Machine Learning, Deep Learning, TensorFlow, Game Development and Internet of Things (IoT) App Development.

We live in a world that is continuously advancing as a result of technological innovation. To succeed in this ever-changing world, you’ll need to learn and gain expertise in the technologies that power the next generation of consumer and enterprise applications.

The Python Mini-Degree is an on-demand curriculum composed of 13 professional-grade courses, suitable for both beginners and more advanced developers alike. The goal of this innovative and robust program is to teach you how to code in Python and to build applications that incorporate artificial intelligence (AI), machine learning, computer vision, image processing and data visualization.

With the talent and skills in the powerful technologies above, you’ll be prepared to take on the world as a well-informed and highly prepared developer.

Access this Mini-Degree on Zenva Academy

]]>
Free Ebook – Machine Learning For Human Beings https://gamedevacademy.org/free-ebook-machine-learning-for-human-beings/ Thu, 04 Jan 2018 04:03:14 +0000 https://pythonmachinelearning.pro/?p=2125 Read more]]>

We are excited to announce the launch of our free ebook Machine Learning for Human Beings, authored by Mohit Deshpande, a researcher in the field of computer vision and machine learning, in collaboration with Pablo Farias Navarro, founder of Zenva.

In over 100 pages you will learn the basics of Machine Learning – text classification, clustering and even face recognition and learn to implement these algorithms using Python! This ebook covers both theoretical and practical aspects of Machine Learning, so that you have a strong foundation and understand what happens under the hood. Some of the topics covered in the book are:

  • Overview of Machine Learning – Supervised vs Unsupervised Learning
  • Text Classification with Naive Bayes
  • Data Clustering with K-Means
  • Clustering with Gaussian Mixture Models
  • Face Recognition with Eigenfaces
  • Dimensionality Reduction
  • Classification with Support Vector Machines
  • Reinforcement Learning using the OpenAI library

This book is provided at no cost in PDF format.

Download the ebook here

]]>