Radial Basis Function Networks – Regression for ML

Machine learning is an expansive field – one often made better by techniques common to data science like regression.

In this tutorial, we’re going to explore the topic of neural networks and discover how we can make use of regression techniques when working with them. We’ll also dive into how to use Python to work with these concepts.

Let’s dive in and start learning!

Intro to Neural Networks and RBF Nets

Neural Networks are very powerful models for classification tasks. But what about regression? Suppose we had a set of data points and wanted to project the trend they represent into the future to make predictions. Regression has many applications in finance, physics, biology, and many other fields.

Radial Basis Function Networks (RBF nets) are used for exactly this scenario: regression or function approximation. We have some data that represents an underlying trend or function and want to model it. RBF nets can learn to approximate the underlying trend using many Gaussians/bell curves.

Prerequisites

You can download the full code here.

Before we begin, please familiarize yourself with neural networks, backpropagation, and k-means clustering. Alternatively, if this is your first foray into the world of machine learning, we recommend first learning the fundamentals of Python. You can check out our tutorials, or try a robust curriculum like the Python Mini-Degree to do so.

For educators, we can also recommend Zenva Schools. Not only does the platform have tons of online courses for Python, but it is also suitable for K12 environments with classroom management tools, course plans, and more.


RBF Nets

An RBF net is similar to a 2-layer network. We have an input that is fully connected to a hidden layer. Then, we take the output of the hidden layer and perform a weighted sum to get our output.

But what is that inside the hidden layer neurons? That is a Gaussian RBF! This differentiates an RBF net from a regular neural network: we’re using an RBF as our “activation” function (more specifically, a Gaussian RBF).

Gaussian Distribution

The first question you may have is “what is a Gaussian?” It’s the most famous and important of all statistical distributions. A picture is worth a thousand words so here’s an example of a Gaussian centered at 0 with a standard deviation of 1.

This is the Gaussian or normal distribution! It is also sometimes called a bell curve. The function that describes the normal distribution is the following:

\[ \mathcal{N}(x; \mu, \sigma^2) = \displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\displaystyle\frac{(x-\mu)^2}{2\sigma^2}}  \]

That looks like a really messy equation! And it is, so we’ll use $\mathcal{N}(x; \mu, \sigma^2)$ to represent it. If we look at it, we notice there is one input and there are two parameters. First, let’s discuss the parameters and how they change the Gaussian. Then we can discuss what the input means.

The two parameters are called the mean $\mu$ and standard deviation $\sigma$. In some cases, the standard deviation is replaced with the variance $\sigma^2$, which is just the square of the standard deviation. The mean of the Gaussian simply shifts the center of the Gaussian, i.e. the “bump” or top of the bell. In the image above, $\mu=0$, so the largest value is at $x=0$.

The standard deviation is a measure of the spread of the Gaussian. It affects the “wideness” of the bell. Using a larger standard deviation means that the data are more spread out, rather than closer to the mean.

Technically, the above function is called the probability density function (pdf) and it tells us the probability of observing an input $x$, given that specific normal distribution. But we’re only interested in the bell-curve properties of the Gaussian, not the fact that it represents a probability distribution.
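To make the effect of these parameters concrete, here is a minimal sketch (using NumPy and Matplotlib, which we also rely on later) that evaluates the pdf above for a few illustrative values of $\mu$ and $\sigma$; the particular values are arbitrary.

import numpy as np
import matplotlib.pyplot as plt

def gaussian_pdf(x, mu, sigma):
    """Normal distribution pdf N(x; mu, sigma^2)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-5, 5, 200)
for mu, sigma in [(0, 1), (0, 2), (2, 1)]:
    # shifting mu moves the bump; a larger sigma widens and flattens it
    plt.plot(x, gaussian_pdf(x, mu, sigma), label='mu={}, sigma={}'.format(mu, sigma))
plt.legend()
plt.show()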

Gaussians in RBF nets

Why do we care about Gaussians? Because a weighted sum (linear combination) of enough Gaussians can approximate almost any reasonably smooth function!

Source: https://terpconnect.umd.edu/~toh/spectrum/CurveFittingB.html

In the figure above, the Gaussians have different colors and are weighted differently. When we take the sum, we get a continuous function! To do this, we need to know where to place the Gaussian centers $c_j$ and their standard deviations $\sigma_j$.

We can use k-means clustering on our input data to figure out where to place the Gaussians. The reasoning behind this is that we want our Gaussians to “span” the largest clusters of data since they have that bell-curve shape.

The next step is figuring out what the standard deviations should be. There are two approaches we can take: set each standard deviation to be that of the points assigned to its cluster $c_j$, or use a single standard deviation for all clusters, $\sigma_j = \sigma\;\forall j$, where $\sigma=\frac{d_\text{max}}{\sqrt{2k}}$, $d_\text{max}$ is the maximum distance between any two cluster centers, and $k$ is the number of cluster centers.

But wait, how many Gaussians do we use? Well that’s a hyperparameter called the number of bases or kernels $k$.
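To see this idea in action before we build the full network, here is a small sketch that sums a few hand-picked, weighted Gaussians and also computes the shared standard deviation $\sigma = d_\text{max}/\sqrt{2k}$ from a set of centers; the centers and weights below are made up purely for illustration.

import numpy as np

def gaussian(x, c, s):
    return np.exp(-(x - c)**2 / (2 * s**2))

x = np.linspace(0, 1, 200)
centers = np.array([0.2, 0.5, 0.8])   # hypothetical cluster centers c_j
weights = np.array([1.0, -0.5, 0.7])  # hypothetical weights w_j

# shared standard deviation: sigma = d_max / sqrt(2k)
d_max = max(abs(c1 - c2) for c1 in centers for c2 in centers)
sigma = d_max / np.sqrt(2 * len(centers))

# a weighted sum of Gaussians gives one smooth, continuous curve
approx = sum(w * gaussian(x, c, sigma) for w, c in zip(weights, centers))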

Backpropagation for RBF nets

K-means clustering is used to determine the centers $c_j$ for each of the radial basis functions $\varphi_j$. Given an input $x$, an RBF network produces a weighted sum output.

\begin{equation*}
F(x)=\displaystyle\sum_{j=1}^k w_j\varphi_j(x, c_j) + b
\end{equation*}

where $w_j$ are the weights, $b$ is the bias, $k$ is the number of bases/clusters/centers, and $\varphi_j(\cdot)$ is the Gaussian RBF:

\begin{equation*}
\varphi_j(x, c_j) = \exp\left(\displaystyle\frac{-||x-c_j||^2}{2\sigma_j^2} \right)
\end{equation*}

There are other kinds of RBFs, but we’ll stick with our Gaussian RBF. (Notice that we don’t have the constant up front, so our Gaussian is not normalized, but that’s ok since we’re not using it as a probability distribution!)

Using these definitions, we can derive the update rules for $w_j$ and $b$ for gradient descent. The cost function we minimize is the quadratic (sum of squared errors) cost.

\begin{equation*}
C = \displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2
\end{equation*}

We can derive the update rule for $w_j$ by computing the partial derivative of the cost function with respect to all of the $w_j$.

\begin{align*}
\displaystyle\frac{\partial C}{\partial w_j} &= \displaystyle\frac{\partial C}{\partial F} \displaystyle\frac{\partial F}{\partial w_j}\\
&=\displaystyle\frac{\partial }{\partial F}[\displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2]~\cdot~\displaystyle\frac{\partial }{\partial w_j}[\displaystyle\sum_{j=1}^k w_j\varphi_j(x,c_j) + b]\\
&=-(y^{(i)}-F(x^{(i)}))~\cdot~\varphi_j(x,c_j)\\
w_j &\gets w_j + \eta~(y^{(i)}-F(x^{(i)}))~\varphi_j(x,c_j)
\end{align*}

(Here we take the gradient of a single example’s error, as we do in online/stochastic gradient descent, and the constant factor of 2 is absorbed into the learning rate $\eta$.)

Similarly, we can derive the update rules for $b$ by computing the partial derivative of the cost function with respect to $b$.

\begin{align*}
\displaystyle\frac{\partial C}{\partial b} &= \displaystyle\frac{\partial C}{\partial F} \displaystyle\frac{\partial F}{\partial b}\\
&=\displaystyle\frac{\partial }{\partial F}[\displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2]~\cdot~\displaystyle\frac{\partial }{\partial b}[\displaystyle\sum_{j=1}^k w_j\varphi_j(x,c_j) + b]\\
&=-(y^{(i)}-F(x^{(i)}))\cdot 1\\
b &\gets b + \eta~(y^{(i)}-F(x^{(i)}))
\end{align*}

Now we have our backpropagation rules!

RBF Net Code

Now that we have a better understanding of how we can use neural networks for function approximation, let’s write some code!

First, we have to define our “training” data and RBF. We’re going to code up our Gaussian RBF.

import numpy as np
import matplotlib.pyplot as plt

def rbf(x, c, s):
    return np.exp(-1 / (2 * s**2) * (x-c)**2)

Now we’ll need to use the k-means clustering algorithm to determine the cluster centers. I’ve already coded up a function for you that gives us the cluster centers and the standard deviations of the clusters.

def kmeans(X, k):
    """Performs k-means clustering for 1D input
    
    Arguments:
        X {ndarray} -- A Mx1 array of inputs
        k {int} -- Number of clusters
    
    Returns:
        ndarray -- A kx1 array of final cluster centers
    """

    # randomly select initial clusters from input data
    clusters = np.random.choice(np.squeeze(X), size=k)
    prevClusters = clusters.copy()
    stds = np.zeros(k)
    converged = False

    while not converged:
        """
        compute distances for each cluster center to each point 
        where (distances[i, j] represents the distance between the ith point and jth cluster)
        """
        distances = np.squeeze(np.abs(X[:, np.newaxis] - clusters[np.newaxis, :]))

        # find the cluster that's closest to each point
        closestCluster = np.argmin(distances, axis=1)

        # update clusters by taking the mean of all of the points assigned to that cluster
        for i in range(k):
            pointsForCluster = X[closestCluster == i]
            if len(pointsForCluster) > 0:
                clusters[i] = np.mean(pointsForCluster, axis=0)

        # converge if clusters haven't moved
        converged = np.linalg.norm(clusters - prevClusters) < 1e-6
        prevClusters = clusters.copy()

    distances = np.squeeze(np.abs(X[:, np.newaxis] - clusters[np.newaxis, :]))
    closestCluster = np.argmin(distances, axis=1)

    clustersWithNoPoints = []
    for i in range(k):
        pointsForCluster = X[closestCluster == i]
        if len(pointsForCluster) < 2:
            # keep track of clusters with no points or 1 point
            clustersWithNoPoints.append(i)
            continue
        else:
            stds[i] = np.std(X[closestCluster == i])

    # if there are clusters with 0 or 1 points, take the mean std of the other clusters
    if len(clustersWithNoPoints) > 0:
        pointsToAverage = []
        for i in range(k):
            if i not in clustersWithNoPoints:
                pointsToAverage.append(X[closestCluster == i])
        pointsToAverage = np.concatenate(pointsToAverage).ravel()
        stds[clustersWithNoPoints] = np.mean(np.std(pointsToAverage))

    return clusters, stds

This code just implements the k-means clustering algorithm and computes the standard deviations. If a cluster ends up with zero or one points assigned to it, we simply use the average standard deviation of the other clusters. (We can’t compute a standard deviation from no data points, and the standard deviation of a single data point is 0.)

We’re not going to spend too much time on k-means clustering. Visit the link at the top for more information.

Now we can get to the real heart of the RBF net by creating a class.

class RBFNet(object):
    """Implementation of a Radial Basis Function Network"""
    def __init__(self, k=2, lr=0.01, epochs=100, rbf=rbf, inferStds=True):
        self.k = k
        self.lr = lr
        self.epochs = epochs
        self.rbf = rbf
        self.inferStds = inferStds

        self.w = np.random.randn(k)
        self.b = np.random.randn(1)

We have options for the number of bases, learning rate, number of epochs, which RBF to use, and whether we want to use the standard deviations from k-means. We also initialize the weights and bias. Remember that an RBF net is a modified 2-layer network, so there’s only one weight vector and a single bias at the output node, since we’re approximating a 1D function (specifically, one output). If we had a function with multiple outputs (a function with a vector-valued output), we’d use multiple output neurons, our weights would be a matrix, and our bias would be a vector.

Then, we have to write our fit function to compute our weights and biases. In the first few lines, we either use the standard deviations from the modified k-means algorithm, or we force all bases to use the same standard deviation computed from the formula. The rest is similar to backpropagation where we propagate our input going forward and update our weights going backward.

def fit(self, X, y):
    if self.inferStds:
        # compute stds from data
        self.centers, self.stds = kmeans(X, self.k)
    else:
        # use a fixed std 
        self.centers, _ = kmeans(X, self.k)
        dMax = max([np.abs(c1 - c2) for c1 in self.centers for c2 in self.centers])
        self.stds = np.repeat(dMax / np.sqrt(2*self.k), self.k)

    # training
    for epoch in range(self.epochs):
        for i in range(X.shape[0]):
            # forward pass
            a = np.array([self.rbf(X[i], c, s) for c, s, in zip(self.centers, self.stds)])
            F = a.T.dot(self.w) + self.b

            loss = (y[i] - F).flatten() ** 2
            print('Loss: {0:.2f}'.format(loss[0]))

            # backward pass
            error = -(y[i] - F).flatten()

            # online update
            self.w = self.w - self.lr * a * error
            self.b = self.b - self.lr * error

For verbosity, we’re printing the loss at each step. Notice we’re also performing an online update, meaning we update our weights and biases after each input. Alternatively, we could have done a batch update, where we update our parameters after seeing all of the training data, or a minibatch update, where we update our parameters after seeing a subset of the training data.
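As a rough sketch of what a minibatch variant of the inner training loop could look like (this is not part of the class above, and batch_size is a hypothetical parameter), we would accumulate the per-example gradients and apply them once per batch:

# hypothetical minibatch version of the inner loop in fit()
batch_size = 10
for start in range(0, X.shape[0], batch_size):
    idx = range(start, min(start + batch_size, X.shape[0]))
    dw = np.zeros_like(self.w)
    db = 0.0
    for i in idx:
        a = np.array([self.rbf(X[i], c, s) for c, s in zip(self.centers, self.stds)])
        F = a.T.dot(self.w) + self.b
        error = -(y[i] - F).flatten()
        dw = dw + a * error
        db = db + error
    # average the accumulated gradients, then take a single step
    self.w = self.w - self.lr * dw / len(idx)
    self.b = self.b - self.lr * db / len(idx)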

Making a prediction is as simple as propagating our input forward.

def predict(self, X):
    y_pred = []
    for i in range(X.shape[0]):
        a = np.array([self.rbf(X[i], c, s) for c, s, in zip(self.centers, self.stds)])
        F = a.T.dot(self.w) + self.b
        y_pred.append(F)
    return np.array(y_pred)

Notice that we’re allowing for matrix inputs, where each row is an example.

Finally, we can write code to use our new class. For our training data, we’ll be generating 100 samples from the sine function. Then, we’ll add some uniform noise to our data.

# sample inputs and add noise
NUM_SAMPLES = 100
X = np.random.uniform(0., 1., NUM_SAMPLES)
X = np.sort(X, axis=0)
noise = np.random.uniform(-0.1, 0.1, NUM_SAMPLES)
y = np.sin(2 * np.pi * X)  + noise

rbfnet = RBFNet(lr=1e-2, k=2)
rbfnet.fit(X, y)

y_pred = rbfnet.predict(X)

plt.plot(X, y, '-o', label='true')
plt.plot(X, y_pred, '-o', label='RBF-Net')
plt.legend()

plt.tight_layout()
plt.show()

We can plot our approximated function against our real function to see how well our RBF net performed.

From our results, our RBF net performed pretty well! If we wanted to evaluate our RBF net more rigorously, we could sample more points from the same function, pass them through our RBF net, and use the summed Euclidean distance between the predictions and the true values as a metric.
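For example, a minimal sketch of that kind of evaluation might look like the following; the test-set size and the exact metric are just one reasonable choice.

# sample fresh test points from the same noisy sine function
X_test = np.sort(np.random.uniform(0., 1., 50))
y_test = np.sin(2 * np.pi * X_test) + np.random.uniform(-0.1, 0.1, 50)

y_test_pred = rbfnet.predict(X_test)

# Euclidean distance between the predictions and the true values
test_error = np.linalg.norm(y_test - y_test_pred.flatten())
print('Test error: {0:.2f}'.format(test_error))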

We can try messing around with some key parameters, like the number of bases. What if we increase the number of bases to 4?

Our results aren’t too great! This is because our original function is shaped the way that it is, i.e., two bumps. If we had a more complicated function, then we could use a larger number of bases. But if we use too many bases, we’ll start overfitting!

Another parameter we can change is the standard deviation. How about we use a single standard deviation for all of our bases instead of each one getting its own?

Our plot is much smoother! This is because the Gaussians that make up our reconstruction all have the same standard deviation.

There are other parameters we can change like the learning rate; we could use a more advanced optimization algorithm; we could try layering Gaussians; etc.

To summarize, RBF nets are a special type of neural network used for regression. They are similar to 2-layer networks, but we replace the activation function with a radial basis function, specifically a Gaussian radial basis function. We take each input vector and feed it into each basis. Then, we do a simple weighted sum to get our approximated function value at the end. We train these using backpropagation like any neural network! Finally, we implemented RBF nets in a class and used it to approximate a simple function.

RBF nets are a great example of neural models being used for regression!

Want to learn more about how Python can help your career? Check out this article! You can also expand your Python skills with the Python Mini-Degree or, for teachers in need of classroom resources, the Zenva Schools platform.


Perceptrons: The First Neural Networks for Machine Learning

Machine learning is not only a fascinating topic but one with a variety of approaches. Neural networks, one such approach, have come to the forefront largely due to their accessibility for those taking their first step into the machine learning world.

In this tutorial, we’re going to use Python and NumPy to explore the fundamentals of creating neural networks and combining them – using analogies for how our brains work to better understand them.

If you’re ready to start diving deep into the world of machine learning, let’s get started!

Intro & Project Files

Neural Networks have become incredibly popular over the past few years, and new architectures, neuron types, activation functions, and training techniques pop up all the time in research. But without a fundamental understanding of neural networks, it can be quite difficult to keep up with the flurry of new work in this area.

To understand the modern approaches, we have to understand the tiniest, most fundamental building block of these so-called deep neural networks: the neuron. In particular, we’ll see how to combine several of them into a layer and create a neural network called the perceptron. We’ll write Python code (using NumPy) to build a perceptron network from scratch and implement the learning algorithm.

For the completed code, download the ZIP file here.


Biological Neurons

Perceptrons and artificial neurons actually date back to 1958. Frank Rosenblatt was a psychologist trying to solidify a mathematical model for biological neurons. To better understand the motivation behind the perceptron, we need a superficial understanding of the structure of biological neurons in our brains.

(Credit: https://commons.wikimedia.org/wiki/File:Neuron_-_annotated.svg)

Let’s consider a biological neuron. The point of this cell is to take in some input (in the form of electrical signals in our brains), do some processing, and produce some output (also an electrical signal). One very important thing to note is that the inputs and outputs are binary (0 or 1)! An individual neuron accepts inputs, usually from other neurons, through its dendrites. Although the image above doesn’t depict it, the dendrites connect with other neurons through a gap called the synapse that assigns a weight to a particular input. Then, all of these inputs are considered together when they are processed in the cell body, or soma.

Neurons exhibit an all-or-nothing behavior. In other words, if the combination of inputs exceeds a certain threshold, then an output signal is produced, i.e., the neuron “fires.” If the combination falls short of the threshold, then the neuron doesn’t produce any output, i.e., the neuron “doesn’t fire.” In the case where the neuron does fire, the output travels along the axon to the axon terminals. These axon terminals are connected to the dendrites of other neurons through the synapse.

Let’s take a moment to recap biological neurons. They take some binary inputs through the dendrites, but not all inputs are treated the same since they are weighted. We combine these weighted signals and, if they surpass a threshold, the neuron fires. This single output travels along the axon to other neurons. Now that we have this summary in mind, we can develop mathematical equations to roughly represent a biological neuron.

Artificial Neurons

Now that we have some understanding of biological neurons, the mathematical model should follow from the operations of a neuron.

In this model, we have n binary inputs (usually given as a vector) and exactly the same number of weights $W_1, …, W_n$. We multiply these together and sum them up. We denote this as z and call it the pre-activation.

\[ z = \displaystyle\sum_{i=1}^{n} W_i x_i = W^T x \]

(We can re-write this as an inner product for succinctness.) There is another term, called the bias, which is just an additive constant.

\[ z = \displaystyle\sum_{i=1}^{n} W_i x_i + b = W^T x + b \]

For mathematical convenience, we can actually incorporate it into our weight vector as $W_0$ and set $x_0 = +1$ for all of our inputs. (This concept of incorporating the bias into the weight vector will become clearer when we write code.)

\[ z = \displaystyle\sum_{i=0}^{n} W_i x_i = W^T x \]

After taking the weighted sum, we apply an activation function, $\sigma$, to this and produce an activation a. The activation function for perceptrons is sometimes called a step function because, if we were to plot it, it would look like a single stair step.

\[
\sigma(q)=
\begin{cases}
1 & q\geq 0 \\
0 & q < 0
\end{cases}
\]

In other words, if the input is greater than or equal to 0, then we produce an output of 1. Otherwise, we produce an output of 0. This is the mathematical model for a single neuron, the most fundamental unit of a neural network.

\[ a = \sigma (W^T x) \]

Let’s compare this model to the biological neuron. The inputs are analogous to the dendrites, and the weights model the synapse. We combine the weighted inputs by summing and send that weighted sum to the activation function. This acts as our all-or-nothing response function where 0 means the neuron didn’t produce an output. Note that our inputs and outputs are binary as well, which is in accordance with the biological model.

Capabilities and Limitations of Perceptrons

Since the output of a perceptron is binary, we can use it for binary classification, i.e., an input belongs to only one of two classes. The classic examples used to explain what perceptrons can model are logic gates!

Let’s consider the logic gates in the figure above. A white circle means an output of 1 and a black circle means an output of 0, and the axes indicate inputs. For example, when we input 1 and 1 to an AND gate, the output is 1, the white circle. We can create perceptrons that act like gates: they take 2 binary inputs and produce a single binary output!

However, perceptrons are limited to solving problems that are linearly separable. If two classes are linearly separable, this means that we can draw a single line to separate the two classes. We can do this easily for the AND and OR gates, but there is no single line that can separate the classes for the XOR gate! This means that we can’t use our single-layer perceptron to model an XOR gate.

An intuitive way to understand why perceptrons can only model linearly separable problems is to look at the weighted sum equation (with the bias).

\[ \displaystyle\sum_{i=1}^{n} W_i x_i + b \]

This looks very similar to the equation of a line! (Or, more generally, a hyperplane.) Hence, we’re creating a line and saying that everything on one side of the line belongs to one class and everything on the other side belongs to the other class. This line is called the decision boundary, and, when we use a single-layer perceptron, we can only produce one decision boundary.

In light of this new information, it doesn’t seem like perceptrons are useful! But, in practice, many problems are actually linearly separable. Hope is not lost for non-linearly separable problems, however! It can be shown that organizing multiple perceptrons into layers and using an intermediate layer, or hidden layer, can solve the XOR problem! This is the foundation of modern neural networks!
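To make that claim concrete, here is a small hand-wired sketch (the weights are picked by hand, not learned): XOR(x1, x2) can be computed as AND(OR(x1, x2), NAND(x1, x2)), where each gate is a single perceptron.

import numpy as np

def step(z):
    return 1 if z >= 0 else 0

# each gate is one perceptron with weights [bias, w1, w2]
def gate(w, x1, x2):
    return step(np.dot(w, [1, x1, x2]))

def xor(x1, x2):
    or_out = gate([-1, 2, 2], x1, x2)     # OR gate
    nand_out = gate([3, -2, -2], x1, x2)  # NAND gate
    return gate([-3, 2, 2], or_out, nand_out)  # AND of the two hidden outputs

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, '->', xor(x1, x2))  # prints 0, 1, 1, 0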

Single-Layer Perceptron Code

Now that we have a good understanding of how perceptrons work, let’s take one more step and solidify the math into code. We’ll use object-oriented principles and create a class. In order to construct our perceptron, we need to know how many inputs there are to create our weight vector. The reason we add one to the input size is to include the bias in the weight vector.

import numpy as np

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size):
        self.W = np.zeros(input_size+1)

We’ll also need to implement our activation function. We can simply return 1 if the input is greater than or equal to 0 and 0 otherwise.

def activation_fn(self, x):
    return 1 if x >= 0 else 0

Finally, we need a function to run an input through the perceptron and return an output. Conventionally, this is called the prediction. We add the bias into the input vector. Then we can simply compute the inner product and apply the activation function.

def predict(self, x):
    x = np.insert(x, 0, 1)
    z = self.W.T.dot(x)
    a = self.activation_fn(z)
    return a

All of these are functions of the Perceptron class that we’ll use for perceptron learning.

Perceptron Learning Algorithm

We’ve defined a perceptron, but how do perceptrons learn? Rosenblatt, the creator of the perceptron, also had some thoughts on how to train neurons based on his intuition about biological neurons. Rosenblatt intuited a simple learning algorithm. His idea was to run each example input through the perceptron and, if the perceptron fires when it shouldn’t have, inhibit it. If the perceptron doesn’t fire when it should have, excite it.

How do we inhibit or excite? We change the weight vector (and bias)! The weight vector is a parameter to the perceptron: we need to keep changing it until we can correctly classify each of our inputs. With this intuition in mind, we need to write an update rule for our weight vector so that we can appropriately change it:

\[ w \leftarrow w + \Delta w \]

We have to determine a good $\Delta w$ that does what we want. First, we can define the error as the difference between the desired output d and the predicted output y.

\[ e = d - y \]

Notice that when d and y are the same (both are 0 or both are 1), we get 0! When they are different, (0 and 1 or 1 and 0), we can get either 1 or -1. This directly corresponds to exciting and inhibiting our perceptron! We multiply this with the input to tell our perceptron to change our weight vector in proportion to our input.

\[ w \leftarrow w + \eta\cdot e\cdot x \]

There is a hyperparameter $\eta$ that is called the learning rate. It is just a scaling factor that determines how large the weight vector updates should be. This is a hyperparameter because it is not learned by the perceptron (notice there’s no update rule for $\eta$!), but we select this parameter.

(For perceptrons, the Perceptron Convergence Theorem says that a perceptron will converge, given that the classes are linearly separable, regardless of the learning rate. But for other learning algorithms, this is a critical parameter!)

Let’s take another look at this update rule. When the error is 0, i.e., the output is what we expect, we don’t change the weight vector at all. When the error is nonzero, we update the weight vector accordingly.

Perceptron Learning Algorithm Code

With the update rule in mind, we can create a function to keep applying this update rule until our perceptron can correctly classify all of our inputs. We need to keep iterating through our training data until this happens; one epoch is when our perceptron has seen all of the training data once. Usually, we run our learning algorithm for multiple epochs.

Before we code the learning algorithm, we need to make some changes to our init function to add the learning rate and number of epochs as inputs.

def __init__(self, input_size, lr=1, epochs=10):
    self.W = np.zeros(input_size+1)
    # add one for bias
    self.epochs = epochs
    self.lr = lr

Now we can create a function that, given inputs and desired outputs, runs our perceptron learning algorithm. We keep updating the weights for a number of epochs, iterating through the entire training set each time. We insert the bias into the input when performing the weight update. Then we compute our prediction and our error, and perform our update rule.

def fit(self, X, d):
    for _ in range(self.epochs):
        for i in range(d.shape[0]):
            y = self.predict(X[i])
            e = d[i] - y
            self.W = self.W + self.lr * e * np.insert(X[i], 0, 1)

The entire code for our perceptron is shown below.

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=1, epochs=100):
        self.W = np.zeros(input_size+1)
        # add one for bias
        self.epochs = epochs
        self.lr = lr
    
    def activation_fn(self, x):
        #return (x >= 0).astype(np.float32)
        return 1 if x >= 0 else 0

    def predict(self, x):
        z = self.W.T.dot(x)
        a = self.activation_fn(z)
        return a

    def fit(self, X, d):
        for _ in range(self.epochs):
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)
                y = self.predict(x)
                e = d[i] - y
                self.W = self.W + self.lr * e * x

Now that we have our perceptron coded, we can try to give it some training data and see if it works! One easy set of data to give is the AND gate. Here’s a set of inputs and outputs.

if __name__ == '__main__':
    X = np.array([
        [0, 0],
        [0, 1],
        [1, 0],
        [1, 1]
    ])
    d = np.array([0, 0, 0, 1])

    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)

In just a few lines, we can start using our perceptron! At the end, we print the weight vector. Using the AND gate data, we should get a weight vector of [-3, 2, 1]. This means that the bias is -3 and the weights are 2 and 1 for $x_1$ and $x_2$, respectively.

To verify this weight vector is correct, we can try going through a few examples. If both inputs are 0, then the pre-activation will be -3 + 0*2 + 0*1 = -3. When applying our activation function, we get 0, which is exactly 0 AND 0! We can try this for the other input combinations as well. Note that this is not the only correct weight vector: if there exists a single weight vector that can separate the classes, there exist infinitely many such weight vectors. Which weight vector we get depends on how we initialize it and on the order in which we present the training data.
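If you want to check all four inputs programmatically, a short sketch like the one below works; note that the final version of predict() expects the bias term to already be inserted into the input.

# verify the learned AND gate on all four inputs
for x in X:
    x_with_bias = np.insert(x, 0, 1)  # predict() expects the bias already inserted
    print(x, '->', perceptron.predict(x_with_bias))
# expected output: 0, 0, 0, 1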

To summarize, perceptrons are the simplest kind of neural network: they take in an input, weight each input, take the sum of weighted inputs, and apply an activation function. Since they were modeled from biological neurons by Frank Rosenblatt, they take and produce only binary values. In other words, we can perform binary classification using perceptrons. One limitation of perceptrons is that they can only solve linearly separable problems. In the real world, however, many problems are actually linearly separable. For example, we can use a perceptron to mimic an AND or OR gate. However, since XOR is not linearly separable, we can’t use single-layer perceptrons to create an XOR gate. The perceptron learning algorithm fits the intuition by Rosenblatt: inhibit if a neuron fires when it shouldn’t have, and excite if a neuron does not fire when it should have. We can take that simple principle and create an update rule for our weights to give our perceptron the ability of learning.

Perceptrons are the foundation of neural networks so having a good understanding of them now will be beneficial when learning about deep neural networks! This will also help as you pursue exciting developer opportunities.

Want to learn more Python in general? Check out our free course on Kivy!

An Introduction to Image Recognition

You can access the full course here: Convolutional Neural Networks for Image Classification

Intro to Image Recognition

Let’s get started by learning a bit about the topic itself. Image recognition is, at its heart, image classification, so we will use these terms interchangeably throughout this course. We see images or real-world items and we classify them into one (or more) of many, many possible categories. The categories used are entirely up to us to decide. For example, we could divide all animals into mammals, birds, fish, reptiles, amphibians, or arthropods. Alternatively, we could divide animals into carnivores, herbivores, or omnivores. Perhaps we could also divide animals by how they move, such as swimming, flying, burrowing, walking, or slithering. There are potentially endless sets of categories that we could use.

Among categories, we divide things based on a set of characteristics. When categorizing animals, we might choose characteristics such as whether they have fur, hair, feathers, or scales. Maybe we look at the shape of their bodies or go more specific by looking at their teeth or how their feet are shaped. Once again, there are potentially endless characteristics we could look for, and we choose which ones to use.

Analogies aside, the main point is that in order for classification to work, we have to determine a set of categories into which we can class the things we see and the set of characteristics we use to make those classifications. This allows us to then place everything that we see into one of the categories or perhaps say that it belongs to none of the categories. The more categories we have, the more specific we have to be. It’s easier to say something is either an animal or not an animal but it’s harder to say what group of animals an animal may belong to. However complicated, this classification allows us to not only recognize things that we have seen before, but also to place new things that we have never seen. Good image recognition models (and good machine learning models in general) will perform well even on data they have never seen before.

How do we Perform Image Recognition?

We do a lot of this image classification without even thinking about it. For starters, we choose what to ignore and what to pay attention to. This actually presents an interesting part of the challenge: picking out what’s important in an image. We see everything but only pay attention to some of that so we tend to ignore the rest or at least not process enough information about it to make it stand out. Knowing what to ignore and what to pay attention to depends on our current goal. For example, if we were walking home from work, we would need to pay attention to cars or people around us, traffic lights, street signs, etc. but wouldn’t necessarily have to pay attention to the clouds in the sky or the buildings or wildlife on either side of us. On the other hand, if we were looking for a specific store, we would have to switch our focus to the buildings around us and perhaps pay less attention to the people around us.

The same thing occurs when asked to find something in an image. We decide what features or characteristics make up what we are looking for and we search for those, ignoring everything else. This is easy enough if we know what to look for but it is next to impossible if we don’t understand what the thing we’re searching for looks like. 

This brings to mind the question: how do we know what the thing we’re searching for looks like? There are two main mechanisms: either we see an example of what to look for and can determine what features are important from that (or are told what to look for verbally), or we already have an abstract understanding of what the thing we’re looking for should look like. For example, if you’ve ever played “Where’s Waldo?”, you are shown what Waldo looks like so you know to look out for the glasses, red and white striped shirt and hat, and the cane. To the uninitiated, “Where’s Waldo?” is a search game where you are looking for a particular character hidden in a very busy image. I’d definitely recommend checking it out. However, if we were given an image of a farm and told to count the number of pigs, most of us would know what a pig is and wouldn’t have to be shown. That’s because we’ve memorized the key characteristics of a pig: smooth pink skin, 4 legs with hooves, curly tail, flat snout, etc. We don’t need to be taught because we already know.

This logic applies to almost everything in our lives. We learn fairly young how to classify things we haven’t seen before into categories that we know based on features that are similar to things within those categories. If we come across something that doesn’t fit into any category, we can create a new category. For example, there are literally thousands of models of cars; more come out every year. However, we don’t look at every model and memorize exactly what it looks like so that we can say with certainty that it is a car when we see it. We know that the new cars look similar enough to the old cars that we can say that the new models and the old models are all types of car. 

By now, we should understand that image recognition is really image classification; we fit everything that we see into categories based on characteristics, or features, that they possess. We’re intelligent enough to deduce roughly which category something belongs to, even if we’ve never seen it before. If something is so new and strange that we’ve never seen anything like it and it doesn’t fit into any category, we can create a new category and assign membership within that. The next question that comes to mind is: how do we separate objects that we see into distinct entities rather than seeing one big blur?

The somewhat annoying answer is that it depends on what we’re looking for. If we look at an image of a farm, do we pick out each individual animal, building, plant, person, and vehicle and say we are looking at each individual component or do we look at them all collectively and decide we see a farm? Okay, let’s get specific then. Let’s say we aren’t interested in what we see as a big picture but rather what individual components we can pick out. How do we separate them all? 

The key here is in contrast. Generally, we look for contrasting colours and shapes; if two items side by side are very different colours or one is angular and the other is smooth, there’s a good chance that they are different objects. Although this is not always the case, it stands as a good starting point for distinguishing between objects. 

Coming back to the farm analogy, we might pick out a tree based on a combination of browns and greens: brown for the trunk and branches and green for the leaves. Of course this is just a generality because not all trees are green and brown and trees come in many different shapes and colours but most of us are intelligent enough to be able to recognize a tree as a tree even if it looks different. We could find a pig due to the contrast between its pink body and the brown mud it’s playing in. We could recognize a tractor based on its square body and round wheels. This is why colour-camouflage works so well; if a tree trunk is brown and a moth with wings the same shade of brown as tree sits on the tree trunk, it’s difficult to see the moth because there is no colour contrast.

Another amazing thing that we can do is determine what object we’re looking at by seeing only part of that object. This is really high level deductive reasoning and is hard to program into computers. This is one of the reasons it’s so difficult to build a generalized artificial intelligence but more on that later. As long as we can see enough of something to pick out the main distinguishing features, we can tell what the entire object should be. For example, if we see only one eye, one ear, and a part of a nose and mouth, we know that we’re looking at a face even though we know most faces should have two eyes, two ears, and a full mouth and nose. 

Although we don’t necessarily need to think about all of this when building an image recognition machine learning model, it certainly helps give us some insight into the underlying challenges that we might face. If nothing else, it serves as a preamble into how machines look at images. The main problem is that we take these abilities for granted and perform them without even thinking but it becomes very difficult to translate that logic and those abilities into machine code so that a program can classify images as well as we can. This is just the simple stuff; we haven’t got into the recognition of abstract ideas such as recognizing emotions or actions but that’s a much more challenging domain and far beyond the scope of this course.

How do Machines Interpret Images?

The previous topic was meant to get you thinking about how we look at images and contrast that against how machines look at images. We’ll see that there are similarities and differences and by the end, we will hopefully have an idea of how to go about solving image recognition using machine code. 

Let’s start by examining the first thought: we categorize everything we see based on features (usually subconsciously) and we do this based on characteristics and categories that we choose. The number of characteristics to look out for is limited only by what we can see and the categories are potentially infinite. This is different for a program as programs are purely logical. As of now, they can only really do what they have been programmed to do, which means we have to build into the logic of the program what to look for and which categories to choose between.

This is a very important notion to understand: as of now, machines can only do what they are programmed to do. If we build a model that finds faces in images, that is all it can do. It won’t look for cars or trees or anything else; it will categorize everything it sees into a face or not a face and will do so based on the features that we teach it to recognize. This means that the number of categories to choose between is finite, as is the set of features we tell it to look for. We can tell a machine learning model to classify an image into multiple categories if we want (although most choose just one) and for each category in the set of categories, we say that every input either has that feature or doesn’t have that feature. Machine learning helps us with this task by determining membership based on values that it has learned rather than being explicitly programmed but we’ll get into the details later.

Often the inputs and outputs will look something like this:

Input: [ 1 1 0 0 0 1 0 0 1 0 ] 

Output: [ 0 0 1 0 0 ]

In the above example, we have 10 features. A 1 means that the object has that feature and a 0 means that it does not, so this input has features 1, 2, 6, and 9 (whatever those may be). We have 5 categories to choose between. A 1 in a given position means that the object is a member of that category and a 0 means that it is not, so our object belongs to category 3 based on its features. This form of input and output is called one-hot encoding and is often seen in classification models. Realistically, we don’t usually see exactly 1s and 0s (especially in the outputs). We should see numbers close to 1 and close to 0 and these represent certainties or percent chances that our outputs belong to those categories. For example, if the above output came from a machine learning model, it may look something more like this:

[ 0.01 0.02 0.95 0.01 0.01]

This means that there is a 1% chance the object belongs to the 1st, 4th, and 5th categories, a 2% chance it belongs to the 2nd category, and a 95% chance that it belongs to the 3rd category. It can be nicely demonstrated in this image:

Visual of how machine learning identifies objects
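In code, turning an output vector like that into a single predicted category is typically just an argmax; the short sketch below uses the example numbers from above.

import numpy as np

scores = np.array([0.01, 0.02, 0.95, 0.01, 0.01])
predicted_category = np.argmax(scores)             # index 2, i.e. the 3rd category
one_hot = np.eye(len(scores))[predicted_category]  # array([0., 0., 1., 0., 0.])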

How do Machines Interpret Images?

This provides a nice transition into how computers actually look at images. To a computer, it doesn’t matter whether it is looking at a real-world object through a camera in real time or whether it is looking at an image it downloaded from the internet; it breaks them both down the same way. Essentially, an image is just a matrix of bytes that represent pixel values. When it comes down to it, all data that machines read, whether it’s text, images, videos, audio, etc., is broken down into a list of bytes and is then interpreted based on the type of data it represents.

For images, each byte is a pixel value but there are up to 4 pieces of information encoded for each pixel. Grey-scale images are the easiest to work with because each pixel value just represents a certain amount of “whiteness”. Because they are bytes, values range between 0 and 255 with 0 being the least white (pure black) and 255 being the most white (pure white). Everything in between is some shade of grey. With colour images, there are separate red, green, and blue values encoded for each pixel (so up to 4 values per pixel in total). Each of those values is between 0 and 255 with 0 being the least and 255 being the most. If a model sees a bunch of pixels with very low values clumped together, it will conclude that there is a dark patch in the image and vice versa.

Below is a very simple example. An image of a 1 might look like this:

Hand drawn 1 digit

And have this as the pixel values:

[[255, 255, 255, 255, 255],

 [255, 255, 0, 255, 255],

 [255, 255, 0, 255, 255],

 [255, 255, 0, 255, 255],

 [255, 255, 255, 255, 255]]

This is definitely scaled way down but you can see a clear line of black pixels in the middle of the image data (0) with the rest of the pixels being white (255).

Images have 2 dimensions to them: height and width. These are represented by rows and columns of pixels, respectively. In this way, we can map each pixel value to a position in the image matrix (2D array so rows and columns). Machines don’t really care about the dimensionality of the image; most image recognition models flatten an image matrix into one long array of pixels anyway so they don’t care about the position of individual pixel values. Rather, they care about the position of pixel values relative to other pixel values. They learn to associate positions of adjacent, similar pixel values with certain outputs or membership in certain categories. In the above example, a program wouldn’t care that the 0s are in the middle of the image; it would flatten the matrix out into one long array and say that, because there are 0s in certain positions and 255s everywhere else, we are likely feeding it an image of a 1. The same can be said with coloured images. If a model sees pixels representing greens and browns in similar positions, it might think it’s looking at a tree (if it had been trained to look for that, of course). 
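As a small sketch of what that looks like in practice, here is the 5x5 pixel matrix from above as a NumPy array, flattened into the kind of one-dimensional array a simple model would consume.

import numpy as np

image = np.array([
    [255, 255, 255, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255, 255, 255],
])

flat = image.flatten()        # shape (25,): one long array of pixel values
print(flat.shape, flat[5:10]) # the 0 from the second row now sits at index 7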

This is also how image recognition models address the problem of distinguishing between objects in an image; they can recognize the boundaries of an object in an image when they see drastically different values in adjacent pixels. A machine learning model essentially looks for patterns of pixel values that it has seen before and associates them with the same outputs. It does this during training; we feed images and the respective labels into the model and over time, it learns to associate pixel patterns with certain outputs. If a model sees many images with pixel values that denote a straight black line with white around it and is told the correct answer is a 1, it will learn to map that pattern of pixels to a 1.

This is great when dealing with nicely formatted data. If we feed a model a lot of data that looks similar then it will learn very quickly. The problem then comes when an image looks slightly different from the rest but has the same output. Consider again the image of a 1. It could be drawn at the top or bottom, left or right, or center of the image. It could have a left or right slant to it. It could look like this: 1 or this l. This is a big problem for a poorly-trained model because it will only be able to recognize nicely-formatted inputs that are all of the same basic structure but there is a lot of randomness in the world. We need to be able to take that into account so our models can perform practically well. This is why we must expose a model to as many different kinds of inputs as possible so that it learns to recognize general patterns rather than specific ones. There are tools that can help us with this and we will introduce them in the next topic.

Hopefully by now you understand how image recognition models identify images and some of the challenges we face when trying to teach these models. Models can only look for features that we teach them to and choose between categories that we program into them. To machines, images are just arrays of pixel values and the job of a model is to recognize patterns that it sees across many instances of similar images and associate them with specific outputs. We need to teach machines to look at images more abstractly rather than looking at the specifics to produce good results across a wide domain. Next up we will learn some ways that machines help to overcome this challenge to better recognize images. In the meantime, though, consider browsing our article on just what sort of job opportunities await you should you pursue these exciting Python topics!

 

Transcript

What is up, guys? Welcome to the first tutorial in our image recognition course. This is also the very first topic, and is just going to provide a general intro into image recognition. Now we’re going to cover two topics specifically here. One will be, “What is image recognition?” and the other will be, “What tools can help us to solve image recognition?”

The first part, which will be this video, will be all about introducing the problem of image recognition, talk about how we solve the problem of image recognition in our day-to-day lives, and then we’ll go onto explore this from a machine’s point of view. After that, we’ll talk about the tools specifically that machines use to help with image recognition. Specifically, we’ll be looking at convolutional neural networks, but a bit more on that later.

Let’s get started with, “What is image recognition?” Image recognition is seeing an object or an image of that object and knowing exactly what it is. At the very least, even if we don’t know exactly what it is, we should have a general sense for what it is based on similar items that we’ve seen. Essentially, we class everything that we see into certain categories based on a set of attributes. That’s why image recognition is often called image classification, because it’s essentially grouping everything that we see into some sort of a category.

Now the attributes that we use to classify images is entirely up to us. For example, if we’re looking at different animals, we might use a different set of attributes versus if we’re looking at buildings or let’s say cars, for example. If we’re looking at vehicles, we might be taking a look at the shape of the vehicle, the number of windows, the number of wheels, et cetera. If we’re looking at animals, we might take into consideration the fur or the skin type, the number of legs, the general head structure, and stuff like that. It’s entirely up to us which attributes we choose to classify items. And this could be real-world items as well, not necessarily just images.

Now, this allows us to categorize something that we haven’t even seen before. In fact, this is very powerful. We can take a look at something that we’ve literally never seen in our lives, and accurately place it in some sort of a category. We can often see this with animals. I highly doubt that everyone has seen every single type of animal there is to see out there. No doubt there are some animals that you’ve never seen before in your lives. But, you should, by looking at it, be able to place it into some sort of category. You should know that it’s an animal. You should have a general sense for whether it’s a carnivore, omnivore, herbivore, and so on and so forth.

Now, another example of this is models of cars. Now, every single year, there are brand-new models of cars coming out, some which we’ve never seen before. Some look so different from what we’ve seen before, but we recognize that they are all cars. We can take a look again at the wheels of the car, the hood, the windshield, the number of seats, et cetera, and just get a general sense that we are looking at some sort of a vehicle, even if it’s not like a sedan, or a truck, or something like that.

Now, how does this work for us? Well, a lot of the time, image recognition actually happens subconsciously. In fact, we rarely think about how we know what something is just by looking at it. We just kinda take a look at it, and we know instantly kind of what it is. And a big part of this is the fact that we don’t necessarily acknowledge everything that is around us. If we do need to notice something, then we can usually pick it out and define and describe it.

Take, for example, if you’re walking down the street, especially if you’re walking a route that you’ve walked many times. It’s highly likely that you don’t pay attention to everything around you. Maybe there’s stores on either side of you, and you might not even really think about what the stores look like, or what’s in those stores. However, when you go to cross the street, you become acutely aware of the other people around you, of the cars around you, because those are things that you need to notice. In fact, even if it’s a street that we’ve never seen before, with cars and people that we’ve never seen before, we should have a general sense for what to do. The light turns green, we go, if there’s a car driving in front of us, probably shouldn’t walk into it, and so on and so forth.

Now, this kind of process of knowing what something is is typically based on previous experiences. If we’d never come into contact with cars, or people, or streets, we probably wouldn’t know what to do. However, we’ve definitely interacted with streets and cars and people, so we know the general procedure. So, go on a green light, stop on a red light, so on and so forth, and that’s because that’s stuff that we’ve seen in the past. Even if we haven’t seen that exact version of it, we kind of know what it is because we’ve seen something similar before.

Now, sometimes this is done through pure memorization. Maybe we look at a specific object, or a specific image, over and over again, and we know to associate that with an answer. This is just kind of rote memorization. However, the more powerful ability is being able to deduce what an item is based on some similar characteristics when we’ve never seen that item before. And that’s really the challenge.

It’s easy enough to program in exactly what the answer is given some kind of input into a machine. You could just use like a map or a dictionary for something like that. However, the challenge is in feeding it similar images, and then having it look at other images that it’s never seen before, and be able to accurately predict what that image is. Now, this kind of a problem is actually two-fold. The problem is first deducing that there are multiple objects in your field of vision, and the second is then recognizing each individual object.

So, step number one, how are we going to actually recognize that there are different objects around us? Typically, we do this based on borders that are defined primarily by differences in color. This makes sense. If we’ve seen something that camouflages into something else, probably the colors are very similar, so it’s just hard to tell them apart, it’s hard to place a border on one specific item. However, if you see, say, a skyscraper outlined against the sky, there’s usually a difference in color. It’s very easy to see the skyscraper, maybe, let’s say, brown, or black, or gray, and then the sky is blue. So there’s that sharp contrast in color, therefore we can say, ‘Okay, there’s obviously something in front of the sky.’

Now, again, another example is it’s easy to see a green leaf on a brown tree, but let’s say we see a black cat against a black wall. We might not even be able to tell it’s there at all, unless it opens its eyes, or maybe even moves. Now, we don’t necessarily need to look at every single part of an image to know what some part of it is. Take, for example, if you have an image of a landscape, okay, so there’s maybe some trees in the background, there’s a house, there’s a farm, or something like that, and someone asks you to point out the house. Well, you don’t even need to look at the entire image, it’s just as soon as you see the bit with the house, you know that there’s a house there, and then you can point it out.

This is even more powerful when we don’t even get to see the entire image of an object, but we still know what it is. Take, for example, an image of a face. Let’s say we’re only seeing a part of a face. Specifically, we only see, let’s say, one eye and one ear. But we still know that we’re looking at a person’s face based on the color, the shape, the spacing of the eye and the ear, and just the general knowledge that a face, or at least a part of a face, looks kind of like that. Our brain fills in the rest of the gap, and says, ‘Well, we’ve seen faces, a part of a face is contained within this image, therefore we know that we’re looking at a face.’

That’s, again, a lot more difficult to program into a machine because it may have only seen images of full faces before, and so it gets a part of a face, and it doesn’t know what to do. No longer are we looking at two eyes, two ears, the mouth, et cetera. We’re only looking at a little bit of that.

Now, before we talk about how machines process this, I’m just going to kind of summarize this section, we’ll end it, and then we’ll cover the machine part in a separate video, because I do wanna keep things a bit shorter, there’s a lot to process here. So some of the key takeaways are the fact that a lot of this kinda image recognition classification happens subconsciously. We just look at an image of something, and we know immediately what it is, or kind of what to look out for in that image. Obviously this gets a bit more complicated when there’s a lot going on in an image.

Also, image recognition, the problem of it is kinda two-fold. The first is recognizing where one object ends and another begins, so kinda separating out the object in an image, and then the second part is actually recognizing the individual pieces of an image, putting them together, and recognizing the whole thing. Also, know that it’s very difficult for us to program in the ability to recognize the whole of something based on just seeing a single part of it, but it’s something that we are naturally very good at.

Okay, so, think about that stuff, stay tuned for the next section, which will kind of talk about how machines process images, and that’ll give us insight into how we’ll go about implementing the model. Okay, so thanks for watching, we’ll see you guys in the next one.

What’s up guys? Welcome to the second tutorial in our image recognition course. Here we’re going to continue on with how image recognition works, but we’re going to explore it from a machine standpoint now. We just finished talking about how humans perform image recognition or classification, so we’ll compare and contrast this process in machines.

For starters, contrary to popular belief, machines do not have infinite knowledge of what everything they see is. So, let’s say we’re building some kind of program that takes images or scans its surroundings. Well, it’s going to take in all that information, and it may store it and analyze it, but it doesn’t necessarily know what everything it sees is. It might not necessarily be able to pick out every object.

Machines only have knowledge of the categories that we have programmed into them and taught them to recognize. And, actually, this goes beyond just image recognition: machines, as of right now at least, can only do what they’re programmed to do. So this means that if we’re teaching a machine learning image recognition model to recognize one of 10 categories, it’s never going to recognize anything outside of those 10 categories.

Now, a simple example of this, is creating some kind of a facial recognition model, and its only job is to recognize images of faces and say, “Yes, this image contains a face,” or, “no, it doesn’t.” So basically, it classifies everything it sees into a face or not a face. Now, this means that even the most sophisticated image recognition models, the best face recognition models will not recognize everything in that image. It’s never going to take a look at an image of a face, or it may be not a face, and say, “Oh, that’s actually an airplane,” or, “that’s a car,” or, “that’s a boat or a tree.”

It’s just going to say, “No, that’s not a face,” okay? Because that’s all it’s been taught to do. It’s classifying everything into one of those two possible categories, okay? So even if something doesn’t belong to one of those categories, it will try its best to fit it into one of the categories that it’s been trained to do. So, essentially, it’s really being trained to only look for certain objects and anything else, just, it tries to shoehorn into one of those categories, okay? So that’s a very important takeaway, is that if we want a model to recognize something, we have to program it to recognize that, okay? Otherwise, it may classify something into some other category or just ignore it completely.

Now, to a machine, we have to remember that an image, just like any other data, is simply an array of bytes. So it’s really just an array of data. It doesn’t look at an incoming image and say, “Oh, that’s a two,” or “that’s an airplane,” or, “that’s a face.” It’s just an array of values. Even images, which are technically matrices with rows and columns of pixels, are usually flattened out when a model processes them.

Generally speaking, we flatten it all into one long array of bytes. So, I say bytes because typically the values are between zero and 255, okay? So that’s a byte range, but, technically, if we’re looking at color images, each of the pixels actually contains additional information about red, green, and blue color values. Lucky for us, we’re only really going to be working with black and white images, so this isn’t as much of an issue. But realistically, if we’re building an image recognition model that’s to be used out in the world, it does need to recognize color, and then each pixel carries three color values instead of one, so there’s quite a bit more data to deal with.

Now, if an image is just black or white, typically, the value is simply a darkness value. I guess this actually should be a whiteness value because 255, which is the highest value, is white, and zero is black. And, that means anything in between is some shade of gray, so the closer to zero, the lower the value, the closer it is to black. And, the higher the value, closer to 255, the more white the pixel is.

Now, this is the same for red, green, and blue color values, as well. If we get a 255 in a red value, that means it’s going to be as red as it can be. If we get 255 in a blue value, that means it’s gonna be as blue as it can be. But, of course, there are combinations. So, for example, if we get 255 red, 255 blue, and zero green, we’re probably gonna have purple because it’s a lot of red, a lot of blue, and that makes purple, okay? So this is kind of how we’re going to get these various color values encoded into our images.
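
To make that concrete, here’s a minimal sketch (using NumPy and a made-up 2×2 image, not anything from the course itself) of what those raw values look like and how they might be flattened:

import numpy as np

# a tiny 2x2 grayscale image: each value is a whiteness level from 0 (black) to 255 (white)
gray_image = np.array([
    [0, 128],
    [200, 255],
], dtype=np.uint8)
print(gray_image.flatten())  # [  0 128 200 255] -- one long array of pixel values

# a tiny 2x2 color image: each pixel now carries three values (red, green, blue)
color_image = np.array([
    [[255, 0, 0], [0, 255, 0]],    # a red pixel and a green pixel
    [[0, 0, 255], [255, 0, 255]],  # a blue pixel and a red-plus-blue (purple-ish) pixel
], dtype=np.uint8)
print(color_image.shape)      # (2, 2, 3): rows, columns, color channels
print(color_image.flatten())  # 12 values instead of 4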

Now, machines don’t really care about seeing an image as a whole, it’s a lot of data to process as a whole anyway, so actually, what ends up happening is these image recognition models often make these images more abstract and smaller, but we’ll get more into that later. To process an image, they simply look at the values of each of the bytes and then look for patterns in them, okay?

So if we feed an image of a two into a model, it’s not going to say, “Oh, well, okay, I can see a two.” It’s just gonna see all of the pixel value patterns and say, “Oh, I’ve seen those before and I’ve associated those with a two, so we’ll probably do the same this time,” okay? So they’re essentially just looking for patterns of similar pixel values and associating them with similar patterns they’ve seen before. In this way, image recognition models look for groups of similar byte values across images so that they can place an image in a specific category.

Again, coming back to the concept of recognizing a two, because we’ll actually be dealing with digit recognition, so zero through nine, we essentially will teach the model to say, “Okay, we’ve seen this similar pattern in twos. We’ve seen this pattern in ones,” et cetera. So when it sees similar patterns, it says, “Okay, well, we’ve seen those patterns and associated them with a specific category before, so we’ll do the same.”

Now, I should say actually, on this topic of categorization, it’s very, very rarely going to be the case that the model is 100% certain an image belongs to any category, okay? That’s why these outputs are very often expressed as percentages. So it might be, let’s say, 98% certain an image is a one, but it also might be, you know, 1% certain it’s a seven, maybe .5% certain it’s something else, and so on, and so forth. It’s very, very rarely 100%; we can get very close to 100% certainty, but we usually just pick the highest percentage and go with that.
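
As a rough illustration of what that output looks like (the category names and numbers below are hypothetical, not the output of any real model):

import numpy as np

digits = ['1', '7', '3', '8']                         # hypothetical categories
probabilities = np.array([0.98, 0.01, 0.005, 0.005])  # hypothetical model output

best = np.argmax(probabilities)
print(digits[best], probabilities[best])  # '1' 0.98 -- pick the highest percentage and go with that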

Now, an example of a color image would be, let’s say, high green and high brown values in adjacent bytes, which may suggest an image contains a tree, okay? Now, if many images all have similar groupings of green and brown values, the model may think they all contain trees. So it will learn to associate a bunch of green and a bunch of brown together with a tree, okay? So this is maybe an image recognition model that recognizes trees or some kind of, just everyday objects.

Now, the unfortunate thing is that can be potentially misleading. There are plenty of green and brown things that are not necessarily trees, for example, what if someone is wearing a camouflage tee shirt, or camouflage pants? Well, that’s definitely not a tree, that’s a person, but that’s kind of the point of wearing camouflage, to fool things or people into thinking that they are something else, in this case, a tree, okay? So really, the key takeaway here is that machines will learn to associate patterns of pixels, rather than individual pixel values, with certain categories that we have taught them to recognize, okay?

Now, we can see a nice example of that in this picture here. So, there’s a lot going on in this image, even though it may look fairly boring to us. There’s the decoration on the wall. There’s the lamp, the chair, the TV, the couple of different tables. There’s a vase full of flowers. There’s a picture on the wall and there’s obviously the girl in front. And, the girl seems to be the focus of this particular image.

Now, we are kind of focusing around the girl’s head, but there’s also a bit of the background in there, and you’ve got to think about her hair contrasted with her skin. There’s also a bit of that picture on the wall, and so on, and so forth. So there may be a little bit of confusion. It’s not 100% girl and it’s not 100% anything else. And, that’s why, if you look at the end result, the machine learning model is 94% certain that it contains a girl, okay? It’s, for whatever reason, 2% certain it’s the bouquet or the clock, even though those aren’t directly in the little square that we’re looking at, and there’s a 1% chance it’s a sofa.

Now, I know these don’t add up to 100%, it’s actually 101%. But, you’ve got to take into account some kind of rounding. Also, this definitely demonstrates how a bigger image is broken down into many, many smaller images and ultimately is categorized into one of these categories. So, in this case, we’re maybe trying to categorize everything in this image into one of four possible categories, either it’s a sofa, clock, bouquet, or a girl. And, in this case, what we’re looking at, it’s quite certain it’s a girl, and only a bit less certain it belongs to the other categories, okay?

So again, remember that image classification is really image categorization. Machines can only categorize things into a certain subset of categories that we have programmed them to recognize, and they recognize images based on patterns in pixel values, rather than focusing on any individual pixel, ‘kay? So when we come back, we’ll talk about some of the tools that will help us with image recognition, so stay tuned for that.

Otherwise, thanks for watching! See you guys in the next one!

Interested in continuing?  Check out the full Convolutional Neural Networks for Image Classification course, which is part of our Machine Learning Mini-Degree.

]]>
Classification with Support Vector Machines https://gamedevacademy.org/classification-with-support-vector-machines/ Sun, 06 Sep 2020 02:21:57 +0000 https://pythonmachinelearning.pro/?p=1952 Read more]]> One of the most widely-used and robust classifiers is the support vector machine. Not only can it efficiently classify linear decision boundaries, but it can also classify non-linear boundaries and solve linearly inseparable problems. We’ll be discussing the inner workings of this classification jack-of-all-trades. We first have to review the perceptron so we can talk about support vector machines. Then we’ll derive the support vector machine problem for both linearly separable and inseparable problems. We’ll discuss the kernel trick, and, finally, we’ll see how varying parameters affects the decision boundary on the most popular classification dataset: the iris dataset.

Download the full code here.

Perceptron Review

Before continuing on to discuss support vector machines, let’s take a moment to recap the perceptron.

The perceptron takes a weighted sum of its inputs and applies an activation function. To train a perceptron, we adjust the weights of the weighted sum. The activation function can be any number of things, such as the sigmoid, hyperbolic tangent (tanh), or rectified linear unit (ReLU). After applying the activation function, we get an activation out, and that activation is compared to the actual output to measure how well our perceptron is doing. If it didn’t correctly classify our data, then we adjust the weights. We keep iterating over our training data until the perceptron can correctly classify each of our examples (or we hit the maximum number of epochs).
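
If you’d like a quick refresher in code, here’s a minimal sketch of that training loop (a plain NumPy perceptron with a step activation; the learning rate and epoch count are arbitrary choices for illustration, not anything prescribed by this article):

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    # X: (n_samples, n_features), y: labels of 0 or 1
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            activation = 1 if np.dot(w, xi) + b > 0 else 0  # step activation
            update = lr * (target - activation)             # only adjust on mistakes
            w += update * xi
            b += update
            errors += int(update != 0)
        if errors == 0:  # every training example classified correctly
            break
    return w, b

# Example: the AND gate is linearly separable, so this converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)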

We trained our perceptron to solve logic gates but came to an important realization: the perceptron can only solve linear problems! In other words, the perceptron’s weights create a line (or hyperplane)! This is the reason we can’t use a single perceptron to solve the XOR problem.

Let’s discuss just linear problems for now. One of the most useful properties of the perceptron is the perceptron convergence theorem: for a linearly separable problem, the perceptron is guaranteed to find an answer in a finite amount of time.

However, there is one big catch: it finds the first line that correctly classifies all examples, not the best line. For any problem, if there is a single line that can correctly classify all training examples, there are an infinite number of lines that can separate the classes! These separating lines are also called decision boundaries because they determine the class based on which side of the boundary an example falls on.

Let’s see an example to make this more concrete. Suppose we had the given data for a binary classification problem. If we used a perceptron, we might get a decision boundary that looks like this.

This isn’t the best decision boundary! The line is really close to all of our green examples and far from our magenta examples. If we get new examples, then we might have an example that’s really close to the decision boundary, but on the magenta side. If I didn’t draw that line, we would certainly think that the new point would be a green point. But, since it is on the other side of the decision boundary, even though it is closer to the green examples, our perceptron would classify it as a magenta point. This is not good!

If this decision boundary is bad, then where, among the infinite number of decision boundaries, is the best one? Our intuition tells us that the best decision boundary should probably be oriented in the exact middle of the two classes of data.

The dashed line is the decision boundary. This seems like a better fit! Now, if we have a new example that’s really close to this decision boundary, we still can classify it correctly! But how do we find this best decision boundary?

Support Vector Machines

The goal of support vector machines (SVMs) is to find the optimal line (or hyperplane) that maximally separates the two classes! (SVMs are used for binary classification, but can be extended to support multi-class classification). Mathematically, we can write the equation of that decision boundary as a line.

    \[ g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = 0 \]

Note that we set this equal to zero because it is an equation. Depending on the value of g for a particular point \mathbf{x}, we can classify \mathbf{x} into one of the two classes. We’re using vector notation to be as general as possible, but this works for a simple 2D (one input) case as well.

If we do some geometry, we can figure out that the distance from any point to the decision boundary is the following

    \[ r = \displaystyle\frac{g(\mathbf{x})}{||\mathbf{w}||} \]

Our goal is to maximize r for the points closest to the optimal decision boundary. These points are so important that they have a special name: support vectors!

We can actually simplify this goal a little bit by considering only the support vectors. Notice that the numerator just tells us which class (we’re assuming the two classes are 1 and -1), but the denominator doesn’t change. We can take the absolute value of each side to get rid of the numerator.

    \[ |r| = \displaystyle\frac{1}{||\mathbf{w}||} \]

where \mathbf{w} is the optimal decision boundary (later we’ll show that the bias is easy to solve for if we know \mathbf{w}). We can simplify even further! Maximizing \frac{1}{||\mathbf{w}||} is equivalent to minimizing ||\mathbf{w}||. This is a bit awkward to work with mathematically, so we minimize the square instead: \min \frac{1}{2}\mathbf{w}^T\mathbf{w}. (The constant out front is there so it can nicely cancel out later!)

However, we need more constraints, else we could just make ||\mathbf{w}||=0! That wouldn’t solve anything! The other constraints come from our need to correctly classify the examples!

    \[ y_i (\mathbf{w}^T\mathbf{x}_i + b) \geq 1~~~~\forall i \]

where y_i is the ground truth and we iterate over our training set. To see why this is correct, let’s split it into the two classes 1 and -1:

    \begin{align*} \mathbf{w}^T\mathbf{x}_i + b &\geq 1~~~~~~\forall y_i = 1\\ \mathbf{w}^T\mathbf{x}_i + b &\leq -1 ~~~~\forall y_i = -1 \end{align*}

We can compress the two into the single equation above. After we’ve considered all of this, we can formally state our optimization problem! (In the constraints, the 1 was moved over to the other side of the inequality.)

    \begin{align*} & \text{minimize} & & \frac{1}{2}\mathbf{w}^T\mathbf{w} \\ & \text{subject to} & & y_i (\mathbf{w}^T\mathbf{x}_i + b) -1 \geq 0~~~~\forall i \end{align*}

This is called the primal problem. This is a run-of-the-mill optimization problem, so we can use the technique of Lagrange Multipliers to solve this problem.

    \[ \mathcal{L} = \displaystyle\frac{1}{2}\mathbf{w}^T\mathbf{w} -   \displaystyle\sum_{i=1}^N \alpha_i [y_i (\mathbf{w}^T\mathbf{x}_i + b) - 1 ] \]

where the \alpha_i‘s are the Lagrange multipliers. To solve this, we have to compute the partial derivatives with respect to our weights and bias, set them to zero, and solve! I’ll skip over the derivation and just give the solutions.

    \begin{align*} \mathbf{w} &= \displaystyle\sum_{i=1}^N \alpha_i y_i\mathbf{x}_i\\ \displaystyle\sum_{i=1}^N \alpha_i y_i &= 0 \end{align*}

The first equation is \frac{\partial\mathcal{L}}{\partial\mathbf{w}} and the second equation is \frac{\partial\mathcal{L}}{\partial b}. These solutions tell us some useful things about the weights and Lagrange multipliers. In particular, they give some constraints on the Lagrange multipliers. These \alpha_i‘s also tell us something very important about our SVM: they indicate the support vectors! If a particular point \mathbf{x}_i is a support vector, then its corresponding Lagrange multiplier \alpha_i will be greater than 0! If it is not a support vector, then it will be equal to 0!

However, we still don’t have enough information to solve our problem. As it turns out, there is a corresponding problem called the dual problem that we can solve instead.

    \begin{align*} & \text{maximize} & & \displaystyle\sum_{i=1}^N \alpha_i - \displaystyle\frac{1}{2} \displaystyle\sum_{i=1}^N \displaystyle\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\ & \text{subject to} & & \displaystyle\sum_{i=1}^N \alpha_i y_i = 0~~~~\forall i\\ &&& \alpha_i \geq 0~~~~\forall i \end{align*}

This is something that we can solve! Notice that it’s only in terms of the Lagrange multipliers! Everything else is known! We usually use a quadratic programming solver to do this for us because it is infeasible to solve by hand for large numbers of points. But we could solve it by setting each \frac{\partial\mathcal{L}}{\partial\alpha_i} = 0 and solving.

After we’ve solved for the \alpha_i‘s, we can find the optimal line using the following equations.

    \begin{align*} \mathbf{w} &= \displaystyle\sum_{i=1}^N \alpha_i y_i\mathbf{x}_i\\ b &= \underset{i\in N}{\text{average}} \frac{1}{y_i} - \mathbf{w}^T\mathbf{x}_i \end{align*}

The first is from the primal problem, and the second is just solving for the bias from the decision boundary equation.

SVMs for Logic Gates

Let’s take a break from the math and apply support vector machines to a simple logic gate, like what we did for perceptrons. In particular, let’s train an SVM to solve the logic AND gate.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y = np.array([0, 0, 0, 1])

clf = svm.SVC(kernel='linear', C=1e6)
clf.fit(X, y)

We’re building a linear decision boundary. Ignore the other parameter C; we’ll discuss that later. Now we can use some plotting code (source) to show the decision boundary and support vectors.

plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

# plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

# plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
# plot support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100, linewidth=1, facecolors='none')
plt.show()

Before we plot this, let’s try to predict what our decision boundary and surface will look like. Here’s the picture of the logic gates again.

Where will the decision boundary be? Which points will be the support vectors? The decision boundary will be a diagonal line between the two classes. The support vectors will be (1,1), (0,1), and (1,0) since they are closest to that boundary.
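
As an optional sanity check (not part of the original plotting code), the fitted scikit-learn model exposes the quantities we derived above, so you can confirm which points ended up as support vectors and that the weights match the dual solution:

import numpy as np  # already imported above; repeated here for completeness

print(clf.support_vectors_)  # expected: [[0, 1], [1, 0], [1, 1]]
print(clf.dual_coef_)        # alpha_i * y_i for each support vector
print(clf.coef_)             # w (available because the kernel is linear)
print(clf.intercept_)        # b

# for a linear kernel, w can be recovered from the dual solution: w = sum_i alpha_i * y_i * x_i
print(np.allclose(clf.dual_coef_ @ clf.support_vectors_, clf.coef_))  # True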

This matches our intuition! So SVMs can certainly solve linearly separable problems, but what about non-linearly separable problems?

SVMs for Linearly Inseparable Problems

Suppose we had the following linearly inseparable data.

There is no line that can correctly classify each point! Can we still use our SVM? We can, but with a modification. We have to add slack variables \xi_i. These measure how many misclassifications there are. We also want to minimize the sum of all of the slack variables. Intuitively, this corresponds to minimizing the number of incorrect classifications. We can reformulate our primal problem.

    \begin{align*} & \text{minimize} & & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\displaystyle\sum_{i=0}^N \xi_i \\ & \text{subject to} & & y_i (\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \geq 0~~~~\forall i\\ &&& \xi_i \geq 0~~~~\forall i \end{align*}

where we introduce a new hyperparameter C that measures the tradeoff between the two objectives: largest margin of separation and smallest number of incorrect classifications. And, from there, go to our corresponding dual problem.

    \begin{align*} & \text{maximize} & & \displaystyle\sum_{i=1}^N \alpha_i - \displaystyle\frac{1}{2} \displaystyle\sum_{i=1}^N \displaystyle\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\ & \text{subject to} & & \displaystyle\sum_{i=1}^N \alpha_i y_i = 0~~~~\forall i\\ &&& 0 \leq \alpha_i \leq C~~~~\forall i \end{align*}

This looks almost the same as before! The change is that our \alpha_i‘s are also bounded above by C. After solving for our \alpha_i‘s, we can solve for our weights and bias exactly the same as in our linearly separable case!

The Kernel Trick

One last topic to discuss is the kernel trick. Instead of having a linear decision boundary, we can have a nonlinear decision boundary. The idea behind the kernel trick is to apply a nonlinear kernel to our inputs \mathbf{x}_i to transform them into a higher-dimensional space where we can find a linear decision boundary.

Consider the above figure. The left is our 2D dataset that can’t be separated using a line. However, if we use some kernel function \varphi(\mathbf{x}_i) to project all of our points into a 3D space, then we can find a plane that separates our examples. The intuition behind this is that higher dimensional spaces have extra degrees of freedom that we can use to find a linear plane! There are many different choices of kernel functions: radial basis functions, polynomial functions, and others.
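
For a sense of what a kernel function actually computes, here’s a small sketch (scikit-learn evaluates these internally when you pass kernel='rbf' or kernel='poly'; the gamma, degree, and constant values below are arbitrary):

import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    # Gaussian/RBF kernel: similarity decays with the squared distance between points
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def poly_kernel(x1, x2, degree=2, c=1.0):
    # polynomial kernel: corresponds to an implicit map to all monomials up to 'degree'
    return (np.dot(x1, x2) + c) ** degree

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])
print(rbf_kernel(a, b))   # closer to 1 when points are similar, closer to 0 when far apart
print(poly_kernel(a, b))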

SVM for The Iris Dataset

One of the most famous datasets in all of machine learning is the iris dataset. It has 150 data points across 3 different types of flowers. The features that were collected were sepal length/width and petal length/width. Our goal is to use an SVM to correctly classify an input into the correct flower and to draw the decision boundary.

Since the iris dataset has 4 features, let’s consider only the first two features so we can plot our decision regions on a 2D plane. First, let’s load the iris dataset, create our training and testing data, and fit our SVM. We’ll change some parameters later, but let’s use a linear SVM.

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

C = 1.0 
clf = svm.SVC(kernel='linear', C=C)
clf.fit(X, y)

Now we can use some auxiliary functions (source) to plot our decision regions.

ax = plt.gca()
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

# the functions above need a grid over the first two features to plot against
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
ax.set_xticks(())
ax.set_yticks(())

plt.show()

Additionally, we’re going to print the classification report to see how well our SVM performed.

from sklearn.metrics import classification_report
print(classification_report(y, clf.predict(X), target_names=iris.target_names))

Now let’s run our code to see a plot and classification metrics!

Additionally, we can try using an RBF kernel and changing our C value. Recall that C controls the tradeoff between large margin of separation and a lower incorrect classification rate.

C = 1.0 
clf = svm.SVC(kernel='rbf', C=C)

Try varying different parameters to get the best classification score – and feel free to add all this to your own coding portfolio as well!
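
One simple way to experiment is a small parameter sweep like the sketch below (the particular values are arbitrary, and note that this scores on the training data only; for a fair comparison you’d want a held-out test set or cross-validation):

# continuing from the code above, where X and y hold the first two iris features and the labels
for kernel_name in ['linear', 'rbf', 'poly']:
    for C in [0.1, 1.0, 10.0, 100.0]:
        clf = svm.SVC(kernel=kernel_name, C=C)
        clf.fit(X, y)
        print(kernel_name, C, clf.score(X, y))  # accuracy on the training data only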

To summarize, Support Vector Machines are very powerful classification models that aim to find a maximal margin of separation between classes. We saw how to formulate SVMs using the primal/dual problems and Lagrange multipliers. We also saw how to account for incorrect classifications and incorporate that into the primal/dual problems. Finally, we trained an SVM on the iris dataset.

Support Vector Machines are one of the most flexible non-neural models for classification; they’re able to model linear and nonlinear decision boundaries for linearly separable and inseparable problems.

]]>
An Introduction to Machine Learning https://gamedevacademy.org/what-is-machine-learning/ Fri, 20 Dec 2019 16:00:56 +0000 https://pythonmachinelearning.pro/?p=2651 Read more]]>

You can access the full course here: Machine Learning for Beginners with TensorFlow

Intro to Machine Learning

Now that we know what the course is all about, let’s learn a bit about the main topic: machine learning. What is machine learning? Machine learning is the study of statistics and algorithms aimed at performing a task without being explicitly programmed to. Theoretically, a machine is said to have learned if it produces “better” results over time without modifications to its programming. Practically, this means writing an algorithm, feeding it some data, and letting it interpret the data to find some pattern to solve a problem. 

Behind the scenes, machine learning models consist of layers of connected nodes (often called neurons). We will cover model structure in greater detail later, but it is important to know now that each node has one or more values (weights) assigned to it that, when combined with the node’s inputs and passed through a function of our choice, produce some output. Through training, a model can change these values to produce more accurate outputs rather than having us, as the coders, explicitly change the algorithm or values manually. In this way, the model is “learning” as it is improving its results over time using the same algorithm. It should also be noted that some models do not undergo training and are only used to find patterns in data. We will explain the differences in the section on common machine learning models.
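
As a toy sketch of what a single node computes (for intuition only; this isn’t the structure of any particular model built later in the course, and the sigmoid is just one possible choice of function):

import numpy as np

def node_output(inputs, weights, bias):
    # weighted sum of the inputs, passed through a sigmoid "squashing" function
    z = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical values: training adjusts the weights and bias, not the code itself
print(node_output(np.array([0.5, 0.2]), np.array([0.8, -0.4]), 0.1))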

So what makes machine learning so special? Why not just hard code the algorithms ourselves? The main reason we use machine learning is to help find patterns in data that we wouldn’t otherwise be able to see. If we cannot find patterns in data, we cannot hard code algorithms to look for them as we wouldn’t know what to tell the algorithm to do. It is this pattern recognition that allows machine learning models to solve recognition, classification, and prediction problems such as speech recognition, image classification, and stock market prediction. Machine learning also helps to customize user experience and tailor solutions to the user based on their previous habits such as with responsive game AIs, health apps, and text suggestion.

Transcript

What is up, guys? And welcome to the first tutorial in our machine learning course! This’ll also be the very first topic and, as you can see, it is on an Intro to Machine Learning. This is a good way to get some conceptual background info about what machine learning is, kind of how it works, and also go over some practical examples before we start writing any actual code.

So, what topics will we be covering here? Well, we’re gonna divide ourselves into three subtopics. We’ll start with “What is Machine Learning?”, then “What can we do with Machine Learning?”, and we’ll finish up with “What types of Machine Learning there are out there”. Now, I’m going to devote a separate tutorial to each of these three topics just because I’m trying to keep things a bit shorter. There’s a lot to digest within each of these so, we don’t want anything running too, too long.

Okay, so, for starters, we’ll cover, What Is Machine Learning? I think we can safely do that in this video, too. Now, there is a technical definition here; Machine Learning is the study of statistics and algorithms that’s aimed at performing a task without being explicitly programmed to. That’s a lot to take in. It’s quite wordy and is very technical, uses a lot of jargon. So, let’s try to break that down a bit. I’ve got a, hopefully a bit more of an easy definition to understand here and that is that Machine Learning is finding patterns in data that help to solve a problem without us necessarily writing the algorithm to find the patterns and solve the problem. Now, realistically, these are both kind of describing the same thing. There are a couple of common themes in there.

So, the first is that we’re aimed at performing a task or solving some kind of a problem. Well, that’s kind of obvious. I mean, that’s what software is supposed to help us do: perform some task or solve a problem easier than we would otherwise be able to. But the second thing that’s really important is that it’s not programmed explicitly to necessarily solve that problem or to improve over time, okay? So, the learning aspect of it is actually not something that we, ourselves, necessarily program in. It kind of figures out what it should and shouldn’t be doing based on the data that it sees.

So, usually Machine Learning does involve performing a task better over time, otherwise there’s not really an aspect of learning involved. Performance in most of the models, not all of them however, is linked to this concept of training. Now, we’ll go into training in greater detail later but for now this is essentially feeding data into a model to increase performance over time without modifying the algorithm itself.

Now, that’s a very important aspect of it because, otherwise, it’s not really Machine Learning. If we’re actually making changes to the algorithm to increase performance then it’s not learning itself. That’s us manually changing things. There’s no Machine Learning there. If a machine produces better results after training with the same algorithm, it’s said to have learned. So, if the model starts out at the very beginning with maybe 20% accuracy, that means it’s only getting 20% of the answers correct, and then, later on, that bumps up to, I don’t know, maybe 80 or 90%, then we have a good model and that model is said to have learned a lot.

The Key Takeaway here is that Machine Learning models perform better over time without any changes made to the algorithm itself. It’s really just putting a bunch of data into the model and then having the model kind of perform better over time without any changes to the actual algorithm.

Okay! So, that’s all we’re actually gonna do here. Hopefully, just take a second to digest that. Hopefully it wasn’t too much and when we come back we’ll talk about some kinda things that we can do with Machine Learning, where we might see some practical applications of Machine Learning in real life, hopefully that will help to clear up any questions that you might have, okay? So, stay tuned for that. Thanks for watching! See you guys in the next one.

Interested in continuing? Check out the full Machine Learning for Beginners with TensorFlow course, which is part of our Machine Learning Mini-Degree.

]]>
How to Build a Spam Detector https://gamedevacademy.org/spam-text-classification-tutorial/ Fri, 15 Mar 2019 04:00:28 +0000 https://pythonmachinelearning.pro/?p=2252 Read more]]>

 

You can access the full course here: Build a Spam Detector AI with Text Classification

Transcript Part 1

Hello everybody, my name is Mohit Deshpande, and in this course we’ll be building an AI that will be able to determine if an input email is spam or not.

And so you can see in some of the scores that we’re gonna be achieving as we build this AI, 99, 98% accuracy at determining if an email is spam or not. And just to kinda mess with it, I tried to, you know, I tried to give an AI an email that I just kind of fabricated up. Here’s the text of the email and you can see it correctly determined that this email is spam and we’re gonna be building this AI. So we’re gonna be learning a lot of things.

In particular, we have to look at text classification and how, you know, we’re gonna discuss some of the challenges of text classification and we’re gonna discuss one group of algorithms called Naive Bayes that are gonna be helpful for our problem. And then we’re also going to discuss this term frequency and inverse document frequency because those provide us some improvements that we can use to help bring up, the accuracy of our AI and we’re gonna be looking at one dataset in particular, called the Enron dataset that I’m going to, kind of go over a little bit in the course.

It’s a publicly available dataset. You can download it, I’ll provided it in the source code, you can download it. And we’re gonna be using that to tell our AI, what is spam email and what is not a spam email.

So we’ve been making video courses since 2012, we’re super excited to have you on board. Online course is fantastic way to learn a new skill and I take a lot of them myself. ZENVA courses consist mainly of video lessons that you can watch, re-watch, at your own pace as many times as you want. We also have downloadable source code and project files that contain everything that we build in a lesson. It’s highly recommended that you code along with me, in my experience it’s the best way to learn something.

And finally, we’ve seen that students who make the most out of these online courses are the same students that create a plan and stick with it, depending on your own availability and learning style, of course. And remember that these videos you can watch and re watch as many times as you want, so that really gives you more flexibility, so you can adapt to how you learn. At ZENVA we’ve taught programming and game development to over 200,000 students, over 50 courses since 2012. (now 350,000+)

Some of these students have used the the skills that they’ve learned in these courses to advance their own careers, start a company or publish their own apps and games. Thanks again for joining and I look forward to seeing all the cool stuff that you’ll be building. Now without further ado, let’s get started.

Transcript 2

Hello everybody. In in this video, I just wanna introduce you guys to the problem of text classification. I just wanna give this brief overview of what it is and a little bit about kind of a specific thing and we’re gonna go through an example and then we’ll then move on to how we can actually perform this kind of classification.

This is the problem of text classification, and actually it’s sometimes also called document classification. The challenge with document classification is: given an input document with some text in it, we want to be able to put it into one, or in some cases more, different bins. For example, if I were given some kind of document, maybe I wanna know whether it’s an invoice. Here’s a possible bin it could go into. Or maybe it’s a receipt. So here’s another possible bin it could go into, and then so on and so on.

Just looking at the text of a document, I want to be able to tell what kind of document it is and as you probably guess this requires supervised learning but it’s a bit more challenging than what we’ve seen before because instead of just having those X’s and O’s on that nice two dimensional plane, we’re dealing with text data and computers are really great at handling numerical data but with text data they’re not as good.

Humans, on the other hand, we can look at text data and it’s fine. Looking at large amounts of numerical data for humans can be kind of tedious and error prone. But for computers, they love that stuff. So, we’re gonna discuss a bit later how we can take textual data and convert it into some kind of numerical representation so that we can work with it using learning algorithms. We’ll be able to work with it once it’s numerical data, ’cause learning algorithms operate on numerical data. And so we have to find some way to convert the words into a numeric representation so that we can work with them a little bit better.
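
One common way to do that conversion is to count how often each word appears in each document, as in the sketch below (shown with scikit-learn’s CountVectorizer on two made-up emails; the course later discusses term frequency and inverse document frequency as a refinement of raw counts):

from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "free money claim your free prize now",    # made-up spam-like text
    "meeting notes attached for the project",  # made-up ham-like text
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(emails)

print(vectorizer.get_feature_names_out())  # the learned vocabulary (get_feature_names in older scikit-learn)
print(counts.toarray())                    # each email becomes a row of word counts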

Now in particular, a good example that’s used for text classification is, suppose I have an email and I basically wanna categorize it as spam or not spam, also called ham. But if I’m given an email, I want to determine if it’s spam or not spam and a lot of email services already have a built in way to do this. For example, like Gmail or Outlook, Microsoft’s stuff, all these companies already have ways that given a new input email, detect whether it’s spam or ham. This is a good example to use because it’s simple enough that there are only two categories. This is really just a binary problem, whether it’s spam or not.

And so, it’s a bit easier to understand than maybe this top example… you have a document or wanna put it into different classes because it might be the case that it belongs in more than one of these classes. I could get a receipt that’s also an invoice or I can mix and merge these things but with something like email classification, like spam filtering, it’s either spam or not spam. There’s no combination.

Just to recap, we want to look at the text of the email, the words in this email, because some words are more indicative of an email being spam than others. And so we want to look at the text of this email, including other information, and we want to build and construct an AI that, given a new email message, can determine if it’s spam or not and then route it to either your inbox or your spam folder. Like I mentioned before, supervised learning is a great way to accomplish this task. I’m gonna give our AI lots of labeled examples of messages that are spam and messages that are not spam, and then it should be able to learn what kind of words or what kind of other characteristics spam messages have and then apply them to a new input document.

Like I mentioned, in order to do this, we have to have a numerical representation but let’s assume we have this right now. Assume it’s already there. We’ll deal with that a bit later. I also wanna mention that text classification isn’t just used for spam filtering.

Actually, there’s a ton of different applications. Spam filtering is just one of them. But there’s also a sentiment analysis. So, sentiment analysis, which is given the text of a document, or actually, it’s very popularly used for social media, given a tweet or a Facebook post, or something like that, I want to be able to determine the sentiment of the creator. I wanna be able to tell is the sender angry or upset or other things like that. And there’s certain words and phrasings that are used in that sense and we can try to predict the sentiment of the sender, for example. This is especially popular in social media so you can mine lots of social media data. There’s been a lot of work there and then try to run sentiment analysis on that.

Another thing it’s also used for is book classification. Book classification, and what I mean by that is, given a book, maybe its title or something like that, we wanna be able to tell what genre it is, for example. Given some information about the book, like the title, the author or something like that, we wanna be able to tell what genre it is and this is super useful because then you don’t have to have people making these decisions manually, which might get tedious, or maybe they have more important things to do. This is something that we can delegate to a machine learning algorithm that learns from previous examples. Give it a book and it can tell what genre it is, for example, or even categorize it further.

Another thing that I wanna mention, another popular use of this is called readability. And with readability, it is more like: given a passage of text, I wanna determine things like what level of reading comprehension you need to understand this passage or, more accurately, given some of the words in this passage, what the expected reading level is, for example. So, you might notice that if you have an elementary school or a primary school reading level, the words might be just one or two syllables and the sentence structure is fairly simple but as you go up to more scientific writing or graduate school writing for example, then you notice that there are longer words.

There are more complicated words. Maybe the sentence structure is different. The structure varies, it’s more complicated. With readability assessment, we can look at text, we can look at words and the sentence structure to assess a passage and determine what level of reading comprehension would be assigned to that passage of text. So that’s where I’m gonna stop here.

So just to recap, text classification or document classification is this problem of, given some text input, I want to assign some label to it. I can assign labels like invoice, receipt, medical record and so on. Or the example that’s nice to look at is email filtering and that is, given an email, I just want it to say whether it’s spam or not spam and I can use the words inside of the email to help me make this determination. I mentioned that there are plenty of different applications of text classification. I mentioned sentiment analysis in social media, book classification to tag books with genres, and readability to determine what level we can assign to a passage, like primary or elementary school up to scientific, college, grad school sort of thing. And that is text classification.

Transcript 3

Hello everybody, my name is Mohit Deshpande and in this video, I just want to introduce this algorithm that we can use called Naive Bayes, and we’re going to kind of be discussing it over the next few videos.

In this particular video I’m gonna write down and explain the actual equation that we use and we’ll look at how we can do a concrete example and see how Naive Bayes works there. And then I have some other, just kinda wrap-up stuff to discuss about Naive Bayes. And so, like we were mentioning, with textual data we have to have some kind of numeric representation for it, and we’re going to be using spam detection as our example. Because it’s a nice example, the equation turns out to be fairly simple and easy to understand. It’s pretty intuitive.

So we’re gonna stick with this notion of spam filtering or spam detection. So spam filtering, or detection. And so with Naive Bayes and with spam filtering, it’s kind of logical to assume that spam messages tend to have a different word distribution than messages that are not spam. What I mean by this is, like for example, one example we’ll look at is the word “free”, or something, right.

So the word “free” is probably going to be more common in spam messages that are advertising like “free money” or “free something or the other”. For example, the word “free” is more associated with spam messages than it is with ham or not spam messages. And so we can put this in terms of probabilities, and we can use that to better assess a given input text. We can determine whether it’s spam or not based on the words that are in that email. So I wanna write down the equation.

So Naive Bayes centers around this result from probability, used in many fields, called Bayes’ theorem. And so let me write it down. I’m gonna write it down first and then I’m going to discuss it intuitively. It might seem a little scary at first, but each part intuitively makes sense, as you’ll see. So, the probability that a message is spam, knowing that we have some word, W, in it, is equal to the probability of finding that same word, W, in a spam message, times the probability that we actually have a spam message, all divided by the same thing up here, the probability of finding that word in a spam message times the probability that the message is spam to begin with, plus the probability of finding that word in something that is not spam times the probability that a message is not spam.
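
Written out as an equation (this is just the verbal description above put into symbols, with S for spam, W for the word, and ¬ for “not”):

    \[ P(S \mid W) = \displaystyle\frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid \neg S)\,P(\neg S)} \]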

And this symbol, by the way, just means “not”. Times the probability that a message is not spam. So this seems kind of big and tedious, but we’ll go through it in parts. And so, I should mention that these Ps, by the way, mean “probability”, and you can intuitively think of probability as like chances or odds: if I roll a six-sided die, then what is the probability that it lands on a five? Well, it’s just one out of six because I have one outcome that I want, which is landing with the five side up, and there are six possible outcomes.

And so, intuitively, that’s also how you can think of the probabilities here: the outcome we want, divided by all possible outcomes. But that’s just what the Ps mean, probability. You might see a capital P or Pr; I’m just gonna write it like this.

And this bar means “given”. And what I mean by given is that this is a conditional probability because it depends on what word we know is here. So, anyway, I just wanna go through this intuitively. It seems a little scary at first, but let’s get through this. So this first term here, this is what we want to find. We want to know: what is the probability that this message that we received is spam, given that it has some word, W, in it? Maybe I should label these. So this is spam. This is ham, or not spam. And W then is just a word, any word. So, let me get back here.

So this is the probability that the input message is spam given that it has some word in it. And this is equal to the probability that we find that word in a spam message, times this probability, S, which means: what’s the probability that this message is even spam to begin with?

So we can have general statistics as to, you know, what percent of email that you get is spam versus ham. And so this is the probability that any input email that you get to begin with is spam or not. And this is if it’s not spam. So what’s probably gonna happen, and we’ll also see what happens in practice, is that this is actually pretty high. You probably get more spam than you do ham messages. So, expect this to be kind of close to one.

And so anyway, this is the probability that we find a word in that spam message. And this is like what I mentioned with the word “free”. We expect the word “free” to be found, it’s more likely to find that word, in a spam message than a ham message. And so what this top quantity kind of looks at is: what are the odds of finding this word, W, in a spam message? Then we divide by this sum, which is all possible outcomes. One possible outcome is that this word is in a spam message.

But then you also wanna know: what are the odds of finding this word in a message that’s not spam? So you look at a word and see whether it’s more likely to fall into a spam message or a not spam message, and we can make a determination based on which of those two outcomes has a higher probability. If we find it’s more likely to be a spam message, then we’re gonna assign it to be a spam message. And so, that’s how we can make a determination.

And I, we’re going to go through an example, a more concrete example of this. But intuitively this is like saying that the probability of a spam message with some given word is the likelihood, first of all that we have a spam message to begin with, times the probability of that word being in a spam message, divided by the probability that we actually encounter this word in the first place. And that’s what this is trying to say.

So this is the word given that it’s in spam, times the probability of spam, plus the word given that it’s in not spam. So there are really only two outcomes, right? Either the word is in a spam message or it’s not in a spam message. And so this is what this is trying to account for. And we’ll try to find the probability that this is in a spam message. So what if this is high? What does that mean? That means the probability that this input email is spam, given this word is in it, is high, so this email is probably spam. And so I’m gonna stop right here because I don’t want to complicate this further. We’re actually going to go through an example of this in the next video. But I’m gonna stop right here and just kinda re-explain this one more time.

So Naive Bayes centers around this principle of Bayes’ theorem, and intuitively what this says is: what we wanna find is the probability that any input email is spam given that it has some word in it. This is a logical assumption to make because we look at the content of an email to determine whether it’s spam or ham. So we wanna find the probability that this is a spam message given that it has a particular word, W, in it. And so that’s equal to the probability that this word appears in spam messages times the overall probability that we even get spam messages to begin with.

And then this bottom is kind of, like I said, separating this into two parts, where you know, this is W in spam, this is W in ham or not spam, messages.

So these are kinda the two outcomes that we can get for W; this denominator is also commonly called our evidence. So we can find all of these numbers based on our data. We can look at each word in all the emails in our training data and see, you know, what’s the likelihood that it is in a spam message? Oh hey, I found this word in all these spam messages this many times, and then I can, you know, figure out these probabilities. So all these things I can calculate from my data set. And then given a new input image, or a new input text I should say, I can determine and compute this probability. And if it’s super-high then I know that this message is spam.

So I’m gonna stop right here and in the next video I’m actually gonna go through a more concrete example so that we can understand this a bit better.

Transcript 4

Hello everybody. My name is Mohit Deshpande and in this video, I want to go through a more concrete example of applying this Naive Bayes approach to seeing if there is Spam or not.

So I wanna substitute concrete values for these so we can kinda see, and I’m trying to pick values where it’s gonna be obvious that this is a good approach and that it works. And so let’s kinda use an example. So I’m gonna give you some of these numbers. So first, I’ll give you the probability that we have a message and it’s Spam. And this is something that you can, again, compute from your data set, but I’m just gonna use an overall statistic.

There have been studies to try to find this value, and it’s been shown that about 86% of messages that you get are Spam. And so logically, if 86% of messages you get are Spam, then that means the other 14% must be messages that are not Spam. So I can write that the probability of not Spam is then 0.14, and this follows logically because a message is either Spam or not Spam. Now suppose that the word we’re looking at is the word, free. And so now we’re still missing some values. So we have like here, and here, and here.

So we’ve got three values down, we still need three more. And really we’ll just need two more, because these two are the same here. Actually these two are the same here as well. And so let’s suppose that we’re looking at the word, free, because, just like I mentioned, Spam messages tend to have the word, free, in them, and regular emails that you get, Ham emails, generally don’t have the word, but they could. So let’s make this really obvious. Let’s suppose that the probability that we’d find the word, free, in a Spam message is something really high, like 0.96.

And so what this is saying is that the odds that we find the word, free, in a Spam message are 96%. It means that if we know the message is Spam, then the word, free, appears in it 96% of the time. What we wanna find is whether this new message that we received is Spam, given that it has the word, free, in it.

So based on just looking at this number, you’re probably already thinking that, well, this is probably gonna be a Spam message because, in our data set for example, this is such a high probability. But let’s suppose that the probability of finding the word, free, in a message that’s not Spam is something like 0.02. So in very few cases, 2% of the time, we find the word, free, in something that is not Spam.

And it's important to note that these two things are not related. You can't do the same thing like you did here. You can't say, well hey, if the probability of free in a Spam message is 0.96, then why isn't this one 0.04? Because 0.04 is actually the probability of not finding the word free in a Spam message. These two likelihoods are not complements of each other, so these two things are not related at all. So now we actually have enough values to plug this in and find our answer.

So just kinda off the bat, you can probably tell, and I'll try to make this obvious, that this message is gonna be Spam, because the probability that you find the word, free, in a Spam message is 96%, so finding the word free is a pretty good indicator that your message is Spam. And so let's actually go through the computation and find this probability, and we expect it to be really high. So let's plug in values.

So first I wanna find the probability that the message is Spam given that it has the word, free, in it. Well, that's equal to the probability of finding the word, free, in Spam messages, times the overall probability that any given input is Spam to begin with, divided by that same quantity plus the probability of finding the word, free, in a message that is not Spam, times the probability that the message is not Spam.

And so I can plug in values here. So the probability of finding the word, free, in a Spam message is 0.96, times the probability that I actually have a Spam message, 0.86, divided by the same quantity here, because this is one possible outcome and the bottom must cover all the outcomes, plus the probability that I find the word, free, in a Ham message, which was only 0.02, times the probability of actually getting a message that is not Spam, 0.14.

So when I compute all of this out, I get 0.9966, and now that we have this value, we know that this input email is almost certainly a Spam message, and we did that just by knowing these four values. And these values are actually things that we learn from our data set, given labeled examples of Spam and Ham messages.
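If it helps to see this as code, here is a minimal sketch of that single-word computation in Python, using the example numbers from above. In practice these probabilities would be estimated by counting words in a labeled training set; the variable names here are just for illustration.

# Single-word Naive Bayes computation with the example numbers above.
p_spam = 0.86                # P(Spam): prior probability that any message is Spam
p_ham = 0.14                 # P(Ham) = 1 - P(Spam)
p_free_given_spam = 0.96     # P("free" | Spam): likelihood of "free" in Spam
p_free_given_ham = 0.02      # P("free" | Ham): likelihood of "free" in Ham

# Bayes' theorem: P(Spam | "free") = P("free" | Spam) * P(Spam) / evidence,
# where the evidence sums over both possible outcomes for the message.
evidence = p_free_given_spam * p_spam + p_free_given_ham * p_ham
p_spam_given_free = (p_free_given_spam * p_spam) / evidence
p_ham_given_free = (p_free_given_ham * p_ham) / evidence

print(round(p_spam_given_free, 4))   # 0.9966 -> classify as Spam
print(round(p_ham_given_free, 4))    # 0.0034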

So we can compute this probability. We know the probability of receiving a Spam message based on our data set, and likewise we also know this one. And then we can find this probability for our word, or in this case these words. And there's actually a problem with this approach that I'm gonna address in the next video: we don't just look at a single word, we look at all the words in our email, and there's an assumption we make about them. I'm gonna talk about why it's called Naive Bayes in the next video because of this assumption.

But yeah, so we can see that this is actually a pretty good way to determine whether an email is Spam or not: look at the likelihood that I find this word, or any word, in a Spam message, and the likelihood that I find it in a message that isn't Spam, and then compute that probability. And so suppose that instead of free, I had a word like, I don't know, report or research or something like that.

Then what's probably gonna happen is that the probability that I find the word, research, for example, in a Spam message is gonna be pretty low, or at least lower than the probability of finding the word, research, in a Ham message, a message that is not Spam, from looking at my university account, for example. So based on that, the probability that this message is Spam, given that it has the word, research, in it, would be pretty low, and if it's low then I can conclude that this message is not Spam.

And these two things, the probability that it's Spam given that it has the word, free, in it and the probability that it's not Spam given that it has the word, free, in it, are indeed related, because if the probability that it's a Spam message given that it has the word, free, is 0.9966, then the probability that it's not Spam, or that it's Ham, is gonna be 1 minus that, 0.0034, which is a very low probability. And so I wanna pick whichever one of these gets me the highest probability; in this case, it's far more likely that this input email is Spam instead of Ham, and so I can route it to my Spam folder.

And I should mention that the algorithms and techniques that companies use are proprietary, and they're probably far more complicated than this, but this approach is actually not that bad. And it's certainly easier to understand than some of the more advanced techniques, so that's what we'll be looking at. So I'm gonna stop right here and just do a quick recap. We actually ran through a computation using the Naive Bayes technique to determine if a message was Spam, given that it has the word, free, in it.

And I gave you some numbers here: the probability of getting any message that's Spam is pretty high, and consequently the probability of getting a message that is Ham is pretty low. Then we look at our data set and say, well, based on my experience, how many times have I seen the word, free, in Spam messages versus Ham messages? And based on that I can compute this probability, and we found that it's almost 100% likely that this message is Spam. And so that's how we can do a computation of Naive Bayes.

All these things are things that we can learn, or that algorithms can automatically compute for us. The problem with this is that we're only looking at a single word, but we wanna be looking at emails, or sequences of words. So we wanna be able to consider a sequence of words, and I'm gonna address how we can do that in the next video.

Interested in continuing? Check out the full Build a Spam Detector AI with Text Classification course.  You can also check out our full Machine Learning Mini-Degree for further Python development skills.

]]>
How to Classify Images using Machine Learning https://gamedevacademy.org/image-classification-tutorial/ Fri, 01 Mar 2019 05:00:56 +0000 https://pythonmachinelearning.pro/?p=2268 Read more]]>

You can access the full course here: Build Sarah – An Image Classification AI

Transcript 1

Hello everybody, and thanks for joining me, my name is Mohit Deshpande, and in this course we’ll be building an image classification app.

Given a set of images, we’re going to train an AI to learn what these images are, and then we can actually assign them labels. So, you see some of what our data set is gonna kinda look like, you have things like trucks, cats, airplane, deer, horse, and whatnot. And so, when, what we will be building is an AI that can actually classify these images and assign them labels so that we know what’s in the image. And so, we can build an AI to do that. So, kind of the big topic here is all about image classification.

So first, I want to introduce you to what image classification is, in case you’re not familiar with it. I will also do like a quick intro to machine learning as well. And, kinda the first approach that we’re going to take is through this thing called the nearest neighbor classifier, and so we’ll kind of build the intuition behind how that works, and then write the code for that from scratch.

Then, we'll move on to something a bit more generic than that, and a bit better, and it's called a k nearest neighbors classifier. So, instead of just the nearest neighbor, you look at the top k closest neighbors; that's kind of the intuition behind it. And I'm going to go into much more depth with that. And for this we're actually going to use a pre-built classifier, whose code is already written, since writing it from scratch can get kind of complicated. Then, we're going to talk about hyperparameter tuning, because the question then is, you know, how do we choose the value of k, what is k, and so we're going to be discussing how we pick these values and the approaches that we can take to get the best possible hyperparameters.

And finally, I also want to discuss the CIFAR-10 dataset, and what's really cool about CIFAR-10 is that it's a very popular, widely-used, real dataset that people doing research in image classification use when they're reporting their results. And so, it's going to be really cool, because you'll be using that same dataset that the top researchers have used before. So, we'll also be looking at that CIFAR-10 dataset. So, we've been making video courses since 2012, and we're super excited to have you onboard. Online courses are a great way to learn new skills, and I take a lot of online courses myself. Zenva courses consist mainly of video lessons that you can watch at your own pace and as many times as you want.

All the source code that we make is downloadable, and one of the things that I want to mention is the best way to learn this material is to code along with me. So, we highly recommend that you code along so that you can better learn the material, because there’s a big difference between watching someone code and coding yourself. And finally, we’ve seen the students who get the most out of these online courses are also the same students who make, kind of, a weekly planner or a weekly schedule and stick with it, depending on your own availability and your learning style. Remember that these video lectures, you can watch and rewatch as many times as you want. So, that really gives you more flexibility.

At Zenva we’ve taught programming and game development to over 200,000 students, over 50 plus courses, since 2012. And these students have used the skills that they’ve learned in these courses to advance their careers, start up a company, or publish their own apps and games. Thanks for joining, and I look forward to seeing the cool stuff you’ll be building. Now, without further ado, let’s get started.

Transcript 2

Hello everybody, my name is Mohit Deshpande and in this video I wanna give you guys an overview of machine learning.

And we’ll talk a little bit about where it came from and towards the end I just wanna list a few different subfields within machine learning that there’s a lot of ongoing research currently going into that. So that’s what I’m gonna be talking about in this video.

So let’s get started. So before we had machine learning or actually just artificial intelligence in general, AI, computers were very unintelligent machines. Because even though they were really good at computing large numbers or performing large computations and things of that nature, even though they could do those really fast, they had to be told exactly what to do.

And so like I said, that's something worth writing down: before AI, computers had to be told exactly what to do. You had to account for every possible input or change in your machine state or something like that, you had to account for every single possibility. And that became tedious very fast, because there were cases where it becomes incredibly time consuming to hard code all of these possible configurations or possible inputs into your program. And that also adds to the length of your program.

And so way back then, before AI, that's something that you just had to do, or you had to have some sort of fail safe condition or something like that. But then people started asking the question, instead of telling computers exactly what to do each time, can we teach them to learn on their own? And as it turns out, there was a lot of stuff going on in science fiction particularly; authors and writers in science fiction were starting to depict robots, and they had robots being sentient beings, and they looked like mechanical men, I guess, is what the term was, but eventually that turned into robots.

And they had all these futuristic stuff with robots like they could greet you and shake your hand and they just had this repository of knowledge that they could draw from and they were sentient, they knew that they were, they knew their own existence and everything and they learned. And that’s probably the most important aspect of the thing that AI researchers were taking from science fiction is that robots could learn. And so then they started getting into, how can we model knowledge and how can we get some kind of representation with which to learn. And that starts getting into this period of time when we were doing stuff called classic AI, classic AI. And that was actually more centered around intelligent search instead of actual learning.

So what I mean by that is, let's suppose that we were playing a game, something simple that we all know, so tic-tac-toe or something, where let's say that I am the blue circles. So suppose I play a move here, and then it's the computer's turn, and so then the computer has one, two, three, four, five, six, seven, eight, the computer has eight possible places where it can put an X.

So what classic AI was trying to do is it will try every one of these possible combinations and then it'll try to predict. So if the X was put here, for example, then after that X was played it'll try to predict what my move is. Then maybe I'll play something like this, and then from there the AI could make one, two, three, four, five, six different moves. And so it tries each one of them, and eventually you get this giant search space, basically, where you're looking at every single possible way that the game could be played out from the human just playing a single O here. I mean, there are so many possible combinations.

And as it turns out, there are different techniques that can actually get this working reasonably well. You can write an AI to play tic-tac-toe with you such that it will choose the best move to try to prevent you from winning. That's actually called a minimax strategy. But anyway, you can build this and it's actually not that hard to do, and it runs reasonably fast. And so this is something that you can build, but this is for something like tic-tac-toe, which is a really simple game.
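To make that search idea concrete, here is a minimal sketch of minimax for tic-tac-toe in Python. The board encoding (a list of nine cells holding 'X', 'O', or None) and the helper function are assumptions made purely for this illustration, not code from the course.

# Minimax search over every possible tic-tac-toe game, scored from X's point of view.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, best_move): +1 if X can force a win, -1 if O can, 0 for a draw."""
    w = winner(board)
    if w == 'X':
        return 1, None
    if w == 'O':
        return -1, None
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:
        return 0, None                       # board is full: draw
    best_move = None
    best_score = -2 if player == 'X' else 2  # X maximizes the score, O minimizes it
    for m in moves:
        board[m] = player
        score, _ = minimax(board, 'O' if player == 'X' else 'X')
        board[m] = None
        if (player == 'X' and score > best_score) or (player == 'O' and score < best_score):
            best_score, best_move = score, m
    return best_score, best_move

# From an empty board, minimax explores every possible game before choosing a move for X,
# so this takes a moment to run even for a game as small as tic-tac-toe.
print(minimax([None] * 9, 'X'))              # -> (0, 0): with best play the game is a draw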

Imagine if we had something like chess. That's the wrong color there. I mean, imagine if we had something like chess, where it's not just eight possible moves, it's so, so many moves. Tons and tons of moves on this chess board. And as it turns out, back in 1997, one of IBM's machines, Deep Blue, actually ended up beating the reigning world chess champion, Garry Kasparov.

And so trying to do this classic AI stuff with search, when it comes to large games like chess, or even larger games, like an ancient Chinese game called Go that has even more possible moves and configurations than chess, at some point it just becomes too much. The number of possible ways a game could be played out is so big that it would either, one, use up all the RAM on your computer and crash, or two, computing all of this stuff out would take much, much longer than you could actually play a game. And so search is not a good thing to really do, but back then it was the only viable option at that time.

There was some dabbling going on in actual learning, but a lot of the stuff with classic AI was using search, different kinds of searching algorithms and so you could have it play tic-tac-toe or chess or something. But recent, relatively recently I should say, there’s been this move from instead of search we move towards actual learning.

So we move towards actual learning. So instead of looking at all possible configurations, we start training an AI, we start teaching an AI by giving it lots of example data that it can draw from, and so when it gets new input data it can act intelligently, because it's seen previous data, and it knows what to do with this new problem. So if you have a particular problem, when you're training an AI you give it lots of examples of the problem, and then it can start learning ways that it can approach that problem. And I am speaking in the abstract sense because I wanna make this as general as possible; there are a lot of different subfields, and I don't wanna get too specific, because then it won't apply to some of them.

But right, so when we’re trying to solve a problem we train an AI and then it’s, the AI has seen examples of how to solve the problem and so then it knows from new input it can reason through how to solve that problem with some new input. So that’s a broad level overview of machine learning. But there are actually a few subfields within this.

This is, machine learning itself is a fairly big field. So there's research going on into, I'm sure you've heard of, neural networks, I think they've been in the news at some point. But neural networks try to take the more biological route and they try to model what's going on in our brains. Albeit it's a very overly simplistic model, it's still a model and it turns out that it works really well. It turns out we can also break down neural networks into things like language with recurrent neural networks or vision with convolutional neural networks.

But we could branch this off even further. There are people researching deep learning. That's kind of related to neural networks, but with deep learning the issue is how deep can we make these neural networks, how many layers can we go, and what kind of challenges do we encounter as we make these layers really deep? And so they're trying to find solutions for that. There's also stuff going on with reinforcement learning, which is pretty popular. And reinforcement learning is very popular for teaching AI to play games, actually. I think, if you look around, there's an AI that can actually play through the original Super Mario Bros. game.

And reinforcement learning helps let you build that kind of model. I think they can also play, like they’ve built reinforcement learning models that can play Asteroid and a ton of the old Atari games, fairly well, too. So right, these are just some of the subfields. I can’t possibly list all of them because it’s a really big field, but we’re just gonna stop right here and do a quick recap.

So with machine learning, before AI, computers weren't very intelligent; we had to tell them exactly what to do, and this became impossible in some cases because you can't think of all possible configurations or inputs that you can get. And so this is when we start getting into classic AI. But even with classic AI we were technically just doing searching, we weren't actually learning anything. And now we've moved from search more towards learning, where we actually learn knowledge representations and use them.

And I just mentioned a couple subfields of machine learning here with neural networks, deep learning and reinforcement learning to show you that this is a very popular field at this point and it’s a very, very rapidly expanding field. And you can definitely expect many more cool advances to come in the future.

Transcript 3

Hello, everybody, my name is Mohit Deshpande and in this video, I want to introduce you guys to one particular subfield of machine learning and that is supervised classification and so, classification is a very popular thing to do with machine learning.

So, let me actually define this. So, classification is the problem of trying to fit new data… I should make this a bit more specific, I should say, fit or label new data based on previously seen data. This seems kind of like a weird description at this point, but with classification, the task is this: we've seen a lot of data and it's labeled, and given some new data, we want to give it a label based on some of the previously labeled data that we've seen. I should also mention, and I'll put it over here actually, that with classification we have discrete classes or labels for each data point or input. So, let me illustrate this with an example. So, suppose I have a… That was a really bad line.

Suppose I have like a scatter plot, over here or something. Let me just add in some stuff here. So, I’m just adding in a ton of red x’s and then, we’ll add like, blue circles, over here. So, this data is labeled so, these will actually correspond to actual points. So, this for the X direction and this for the Y direction. These would correspond to actual points. I haven’t actually like, plotted all the points, but trust me, they correspond to actual points and you see, I’ve labeled them. I’ve labeled them, but they’re only two classes and there is the red X and the blue circle. If I wanted to, I could add, like some other class, like a green triangle. We’ll add a couple green triangles or something, up here.

So, there's three classes. We'll say there's three classes, and so I have all these points and they're labeled, and the problem of classification is now that I have these points, if I received some new point, what label would I assign to it? Would I assign to it a red X, a blue circle or a green triangle? So, what we're trying to do with classification is to find a way and to build a model so that given this new input, we can actually assign it one of these labels.

So, let’s just do a human intuitive, example kind of thing. So, suppose my point, I’m gonna put in, let’s see, purple. If my point was in here, or something. Suppose this is my new point, here. So, with this being my new point, I would ask the classifier what label should I assign to this? Should it be a blue circle, a red X or a green triangle?

And so, as a human, if you were thinking about this, if I gave you this point and I asked you what you would assign it, you would say, "Well, I would assign it as a blue circle." and I would ask you, "Well, wait a minute. Why would you assign it as a blue circle?" and you'd say something probably along the lines of "Well, if I look at what's around it, there's lots of blue circles around here." and it turns out, I guess, this region of the plane here tends to have more blue circles than red X's, so I can try to carve out this portion over here, which seems to have a lot of blue circles. So, this is probably what I would assign this point, and it turns out that if you were to give this to a classifier, it would probably give this a blue circle.

So, I say, “All right. “Now, what about a point, over here?” And so, you would say, “Well, I would give that a red X.” When I ask you again, “Why would you give it a red X?” and the reason for that, is you give the same answer. You say, “Well, in this portion of the plane, over here “of this given data, it’s closer around that question point, “around that new input, there’s a lot of red X’s “and so, I would think that it would be most likely “to be given with a red X.” and so, that’s right and now, I can do the same thing, where I say, I have a point up here, or something and you’d say, “Well, this part of the plane, here is more… “like this part over here, you’re more likely to encounter “a green triangle than you are any of these.”

And so, I would probably give this point a… Probably say that, that new point should be a green triangle and so, this is kind of like, the thought process that is going on with these classifiers and so, what you use to make your decision, was this kind of… I kind of drew it in, here.

This kind of imaginary boundary sort of thing between our data, and so this is called the decision boundary. The decision boundary, right here, and it helps us make decisions when it comes to supervised classification, because we can take any sort of input data, find some way to put it on a plane like this, and then just find what the decision boundary is, and then we can plot this. And so, with a lot of classification algorithms, what they try to do is find this boundary; that's what they're all concerned about, because once you have this boundary, then if you get a new point, it's fairly easy to classify.

You can say, "Well, I want this part of the boundary to be blue. This part of the boundary is red. This part of the boundary is green." So, if you get points that are inside one of these boundaries, you just give it a label of what's around there, and so this is what supervised classification algorithms try to find: some kind of boundary. It might not be the case that you have such nice, two-dimensional data like this, but there are ways that you can fit it onto a plane.

I'm not gonna get into it too much but, here's a question. So, what if my point was, like, right over here? Then it's not so obvious as to whether it is a blue circle or a red X, and so, you know, there's some inherent confidence value, or some measure, that says, "I think that this is a blue circle with this confidence or with this probability." And this applies even to the points that we were classifying before.

Even though it seemed kind of obvious that around them there are blue circles, there is some inherent uncertainty about this, and it turns out that, well, for each of these points, there is a chance that it could have been a red X or it could have been a green triangle, but that chance was very, very low and we only assigned it the label that has the maximum chance. So, it's not necessarily the case that this must be a blue circle; instead, we say that this was, with high probability, a blue circle, and so you can't be 100% certain.

If you look at this point, over here, it becomes clear that this could be a red X or this could be a blue circle. It just kind of depends on what this boundary specifically looks like, but given new inputs I want to be able to, like give them one of these labels, here.
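Here is a minimal sketch of that intuition in Python, using scikit-learn's k-nearest-neighbors classifier, which literally labels a point by looking at what's around it; the three little clusters of 2D points are invented purely for illustration.

# Three labeled classes of 2D points: 0 = red X, 1 = blue circle, 2 = green triangle.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [2, 1], [1, 2],      # red X's
           [7, 1], [8, 2], [7, 2],      # blue circles
           [4, 8], [5, 9], [4, 9]]      # green triangles
y_train = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Label a new point by looking at its 3 closest labeled neighbors.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

new_point = [[7.5, 1.5]]                 # lands among the blue circles
print(clf.predict(new_point))            # -> [1], i.e. blue circle
print(clf.predict_proba(new_point))      # per-class confidence, e.g. [[0. 1. 0.]]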

So, this is where I'm going to stop, right here, and I'll do a quick recap. So, supervised classification is a subfield of machine learning, and the problem that we're trying to solve is: we have labels on our input data, and now that we've seen that data, given some new input, we want to give it a label based on the labels that we already have. That is the problem of supervised classification.

We want to fit or label some new input based on what we have already seen before and so, I kind of gave this example of, like, if we had red X’s, green triangles and blue circles, given the new point, how would you figure out if it is one of these categories and we use these things called decision boundaries to try to get that and figure it out. So, that is supervised classification.

There are tons and tons of algorithms that can do this. Some of them work better than others. It all depends on what kind of data you're looking at, but the point is that there are lots of different algorithms for this, and so you can take a look around and see if there's one that you want to know more about. But anyway, this is the problem of supervised classification.

Transcript 4

Hello, everybody. My name is Mohit Deshpande, and in this video, I want to give you kind of a, I want to define this problem called image classification, and I want to talk to you about some of the challenges that we can encounter with image classification as well as, you know, some of, get some definitions kind of out of the way and sort of more concretely discuss image classification.

So first of all, I should define what image classification is, and so what we're trying to do with image classification is assign labels to an input image. So this fits the scheme of supervised classification in general: given some new input, we want to assign some labels to it. There are some challenges specific to images that we have to talk about, but before we really get into this, I want to remind you that images just consist of pixels, and so the computer just sees this grid of pixels, and what we're trying to do is give this grid labels like "bird", for example.

Suppose I have an image of a bird or something over here or something like that. I have some picture of a bird and so what I want to do is give this to my classifier and my classifier will tell me that this, the label that works well with this, the label that closely can be tied to this image is “bird”. And so that’s the goal of image classification and we’re trying to add some higher level meaning to this image. In fact, what we’re trying to do is we’re trying to determine what is inside of an image and that’s what these labels are.

These labels tell us what is inside of the image. Not just random labels, but for image classification we want to know, we’re particularly interested as to what is inside of this image, but this isn’t an easy problem by any means. And so there’s some challenges that are specific to, there’s some challenges, I misspelled that. I forgot about the “n”, there should be an “n” in there.

Challenges specific to image classification so I just want to talk about a couple of them. We won’t get to all of them, but one particular challenge here is scaling and that is if I have a picture of a bird, if I have a picture of a small bird as opposed to when I feed my classifier the same picture, but it’s now maybe doubled in size, then my classifier should be robust to this. I should be able to take an image, and there shouldn’t be any dependence on size. If I give you a picture of a small bird, I can give you a picture of a large bird and it should be able to figure out either which bird that is or that this is a bird, right? If I give this an image of some object or something.

So suppose my class, I should probably define some of these class labels. So suppose my class labels, I don’t know, suppose my class labels are something like “bird”, “cat”, or “dog”. These are just like some example class labels, for example. So if I give it a picture of one of these things, and depending on if it’s a big dog or a small dog, it should be able to identify this as a dog. If I give it a picture of a small cat or a large cat, it should still be able to identify this as a cat. And so there’s challenges with scaling.

There’s this other challenge called occlusion. And occlusion is basically when part of the image is hidden so part of image is hidden or behind another, behind something so that would be like if I had a picture of a bird and maybe like a branch or something is in the way and it’s covering up this portion here. We want our classifier to be robust to things like occlusion this is a pretty big challenge with occlusion because depending on what part you see, we have to make our classifier robust to this.

So occlusion is like a part of an image and it’s hidden behind something else like for example, like this tree branch that’s blocking half of my bird or something. I still want to classify this as a bird so that’s kind of the challenge of occlusion. I guess we can do one more. Another good one is illumination. I can’t spell today, I guess. Illumination is what I mean, and illumination is lighting. Illumination is basically lighting so depending on my lighting conditions of whenever the input image was taken, I still want to be robust to that kind of thing.

I don’t want my image to be classified poorly because my cat is standing in sunlight or something like that. Or my cat is in darkness or if my bird is, it’s a cloudy day or something like that, I don’t want that. I want my classifier to also be robust to illumination and there’s so many more things, so many more challenges with image classification and it makes it kind of difficult and so there’s work going around, there’s still research going into finding ways to be more robust to some of these challenges.

And sort of build a really good classifier, we need to take a data driven approach, so data driven, data driven approach and what I mean by that is we basically give our AI tons of labeled examples so for example, if we were doing this thing that differentiates between these three classes, we would give our AI tons of images of birds and tell them that, tell our AI that this is a bird. We give our AI tons of pictures of cats and say, “This is a cat”. We give our AI tons of pictures of dogs and we say, “This is a dog”.

Alright, so with data driven, we want to give our AI labeled example images, and these labeled images are also commonly called ground truths. A labeled example is commonly called ground truth because when we go to evaluate the classifier, we actually compare what the classifier thinks an image is to the actual truth, the actual label on the image, which we call ground truth. So we compare the prediction to ground truth and say how well our classifier is performing.
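As a tiny sketch of that comparison, the arrays below are made-up labels just to show how predictions get scored against ground truth:

import numpy as np

ground_truth = np.array(["bird", "cat", "dog", "cat", "bird"])   # the actual labels
predictions  = np.array(["bird", "cat", "dog", "dog", "bird"])   # what the classifier said

accuracy = np.mean(predictions == ground_truth)   # fraction of predictions that match
print(f"Accuracy: {accuracy:.0%}")                # 4 of 5 correct -> 80%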

So yeah, we want this to be data driven so we take this approach by giving our AI lots of labeled example images and then it can learn some features off of that, but if you want to take this approach, however, you’ll need, you can’t just give it two images of a bird or two of each and be done with it, right?

The more good training data that you have, the more high quality training data that you give your AI, the more examples that you give your AI, the better it will be to discriminate between bird, cat, dog. To make that distinction between these classes, you want to give lots of high quality examples to your AI.

And I'm going to talk a little bit more about this, but this data set is actually something you don't have to collect yourself. There's tons of image classification data sets online. I mean, ImageNet has a few million images across tons of different classes. There's much smaller data sets, of course. There's the CIFAR-10 data set that has 10 different classes and, I think, about 60,000 images in total, but the point is lots of good quality training data is always preferable to some super complicated classification algorithm.
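If you want to poke at CIFAR-10 yourself, one convenient way to download it is through the Keras helper shown below; this assumes TensorFlow/Keras is installed and is not necessarily how the course itself loads the data.

from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print(x_train.shape)                    # (50000, 32, 32, 3): 50,000 training images, 32x32 RGB
print(x_test.shape)                     # (10000, 32, 32, 3): 10,000 test images
print(sorted(set(y_train.flatten())))   # the 10 class labels, 0 through 9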

So that kind of illustrates that with image classification we want this to be data driven. There’s no way to hard code this for every bird or for every cat or for dog. Hard coding would not be a good approach so we’re taking the more data driven approach by giving our classifier lots of examples with labels on them so it can learn what a bird looks like and what a cat looks like, and so on.

So that's where I'm going to stop right here, and I'm just going to do a recap real quick. So with image classification, we want to give labels to an input image based on some set of labels that we already have. And so suppose I have three labels like "bird", "cat" and "dog" or something, and so given a new input image, I want to say whether it's a bird, a cat, or a dog; I want to assign that label. And computers only see the image as pixels, so we have to find some way to build a classifier out of just these pixel values, and there are lots of challenges that come with that.

Like I mentioned scaling, that’s if you have a big bird or a small bird, you want to be able to still say that it’s a bird. There’s occlusion. If I have a tree branch in the way, or something like that, I still want to classify this as a bird. There’s illumination, if I have like a dog, it’s standing in direct sunlight as opposed to a dog in a darker room or something. I still want to classify that as a dog. And kind of, that also gets into other challenges like what’s going on in the background. You want a very sterile background when you’re getting training data. You don’t want a lot of background clutter because that could mess up your classifier. It might learn the wrong thing to associate with your label that you’re trying to give.

But anyway, moving on, so a good approach to doing this is the data-driven approach, and that is we give our AI lots of labeled example images. We give it lots of images of birds and tell it that this is what a bird looks like. We give it lots of images of cats and we say, "This is what a cat looks like", and so forth for a dog and for any other classes that you might have. But we give these example images and it will learn some representation of what a bird is and what a cat is and what a dog is, and given that, it can generalize, and when you have a new input image, it will do its function, and that is to give it one of these labels.

So I’m going to stop right here and what we’re going to do in the next video, I want to talk probably the simplest kind of image classifier that’s called the nearest neighbors classifier so I’m going to talk about that in the next video.

Interested in continuing? Check out the full Build Sarah – An Image Classification AI course.  You can also check out our Machine Learning Mini-Degree and Python Computer Vision Mini-Degree for more Python development skills.

]]>
A Comprehensive Guide to Optical Flow https://gamedevacademy.org/optical-flow-tutorial/ Fri, 22 Feb 2019 05:00:18 +0000 https://pythonmachinelearning.pro/?p=2266 Read more]]>

You can access the full course here: Video and Optical Flow – Create a Smart Speed Camera

Part 1

In this lesson, you will learn the basics of videos, and how function notation can be applied to find pixel intensities of videos.

Videos are a sequence of images (called frames), which allows image processing to handle videos.

The rate at which the frames change is measured in frames per second, known as the FPS of the video. A common example of this is 60 FPS, which means the video shows 60 frames every second.

For images, a function I can be applied to the image so that I(x, y) = p, where x and y are coordinates, and p is the pixel intensity.

For videos, there needs to be additional information to find the pixel intensity, as there are numerous frames to choose from. The parameter t needs to be added, with t being when in the video the desired frame is located. Then, adding t to the image function, I(x, y, t) = p for videos.
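As a minimal sketch of this I(x, y, t) lookup with OpenCV, the snippet below reads a clip into a list of grayscale frames and indexes a pixel by its coordinates and frame number; "video.mp4" is just a placeholder filename.

import cv2

cap = cv2.VideoCapture("video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)          # frames per second of the clip
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Keep only intensity, matching the grayscale convention used here.
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
cap.release()

def I(x, y, t):
    """Pixel intensity p at column x, row y, in frame t."""
    return frames[t][y, x]

print(len(frames), "frames at", fps, "FPS")
print(I(10, 20, 0))                      # intensity at (10, 20) in the first frame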

Part 2

In this lesson, you will learn about the basics of optical flow and the mathematics of it.

For images, the only information that you can access is the spatial positioning in relation to other pixels. The benefit of videos over static images is that it adds temporal information, so that you can know not only the location spatially, but also when it exists. This is what allows optical flow to function.

Optical flow is a type of computer vision technique that is used to track the apparent motion of objects in a video. Optical flow can be used to track objects, stabilize and compress videos, and allow AI to generate descriptions of videos.

Optical flow functions by tracking a pixel through consecutive frames. This allows for the path of that pixel’s movement to be generated, as shown in this image.

Optical flow shown going through pixels

Optical flow makes two assumptions that drastically simplify this process. These are:

  1. Pixel intensities don’t rapidly change between consecutive frames.
  2. Groups of pixels move together.

These two assumptions apply when a video is smooth, as pixels should change gradually and not teleport around the image. These assumptions apply for almost all real world videos, but can be broken if someone specifically edits a video to make that occur.

When analyzing videos, the pixel you are tracking will be displaced from its original position by some value u in the x direction and some value v in the y direction, as shown in the image below. The goal of optical flow is to find the u and v values, as they allow you to create a displacement vector and track the pixel’s path.

Mathematical representation of pixel being displaced
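As a rough sketch of how those u and v values can be recovered in practice, the snippet below uses OpenCV's Lucas-Kanade optical flow on two consecutive grayscale frames; the frame filenames are placeholders, and this is just one of the methods the later lessons refer to.

import cv2

prev_gray = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)           # frame at time t
next_gray = cv2.imread("frame_t_plus_dt.png", cv2.IMREAD_GRAYSCALE)   # frame at t + delta t

# Pick strong corner-like points to track in the first frame.
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50, qualityLevel=0.3, minDistance=7)

# Lucas-Kanade estimates where each point moved to in the next frame.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)

good_new = p1[status.flatten() == 1]
good_old = p0[status.flatten() == 1]
for new, old in zip(good_new, good_old):
    u, v = (new - old).ravel()           # displacement in x and y for this tracked point
    print(f"u = {u:.2f}, v = {v:.2f}")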

Part 3

In this lesson, you will continue to learn about optical flow, with this lesson providing more information about the mathematics behind this technique.

To recap from the previous lesson, the point of optical flow is to find the displacement vector of a pixel, with u representing the change in the x direction and v representing the change in the y direction. In the image, t represents the time at which the frame occurs, with Δt representing how much time has passed since the previous frame.

Using the function notation from the earlier lessons, the pixel intensity for the frame at time t can be stated as I(x,y,t). The pixel intensity for the second frame can be stated as I(x+u, y+v, t+Δt).

Image showing how to determine pixel intensity

These two functions should be equal, due to the first assumption of optical flow, meaning that I(x,y,t) = I(x+u, y+v, t+Δt).

After using some calculus, this equality turns into the equation Ixu + Iyv + It = 0. Ix represents how much the pixel intensity changes horizontally, Iy represents how much it changes vertically, and It represents how much it changes over time between the frames.

Ix, Iy, and It can all be computed, so the equation comes down to solving for u and v. There are numerous methods to solve this, but as they require calculus and linear algebra, this course will not be covering all of them. Certain techniques which will allow you to find u and v will be addressed in later lessons.
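One piece that is straightforward to show now is computing the three derivatives themselves. The sketch below approximates Ix, Iy, and It with simple finite differences on two consecutive grayscale frames; the random arrays stand in for real frames purely for illustration.

import numpy as np

frame_t  = np.random.rand(240, 320)      # stand-in for I(x, y, t)
frame_t1 = np.random.rand(240, 320)      # stand-in for I(x, y, t + delta t)

Ix = np.gradient(frame_t, axis=1)        # horizontal change in intensity
Iy = np.gradient(frame_t, axis=0)        # vertical change in intensity
It = frame_t1 - frame_t                  # change in intensity over time

# Each pixel gives one equation Ix*u + Iy*v + It = 0 in the two unknowns u and v;
# methods like Lucas-Kanade solve it over small windows of pixels.
print(Ix.shape, Iy.shape, It.shape)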

 

Transcript 1

Hello everybody, my name is Mohit Deshpande, and in this course we’ll be building an app that can track objects through video and actually determine what their speed is based on certain properties of our camera.

So we’ll be able to build this. And I’ve shown this visualization here. You can see that I have a pencil here that I’m waving, and the points are being tracked on that pencil, and we can get the speed readings from those points. And so that’s what we’re gonna be building in this course. And you can try this out with other objects as well.

So kind of the big topic that we’re gonna be discussing is called optical flow. And optical flow allows us to take points in video and track them through each frame. So we’re gonna discuss how that kind of works. And we’re also gonna talk a little bit about camera intrinsics, because we have to know a little bit more about our own cameras before we can get accurate speed measurements. And lastly, I’m gonna show you how we can visualize these optical flow patterns.

What I mean by that, and you saw it in the previous slide, how I can kind of draw on top of my frames and we can kind of draw a path. So we’ve been making video courses since 2012, and we’re super excited to have you on board. Online courses are a great way to learn new skills, and I take a lot of these courses myself.

So Zenva courses consist mainly of video lessons, and you can watch and re-watch them as many times as you want, and at your own pace. Everything we do in source code is downloadable, by the way. And when I'm coding this stuff in the videos, I really, really recommend that you code along with me, because coding along helps you learn the material better than just watching code. The last thing that I wanna mention is that students who get the most out of these online courses are the same students that make a weekly plan or schedule and stick with it, based on your own availability and whatever your learning style is. And remember that these videos, you can watch or re-watch them as many times as you want, so that gives you a lot of flexibility.

So at ZENVA, we’ve taught programming and game development to over 200,000 students, over 50 plus courses since 2012. And a lot of these students have used the skills that they’ve learned in these courses to advance their own careers. Some have started companies and published their own games and apps. So thanks for joining, and I look forward to seeing the cool stuff you’ll be building. But now, without further ado, let’s get started.

Transcript 2

Hello, everybody. My name is Mohit Deshpande. And in this video, I want to first discuss what videos are, and we’re going to have kind of a change in notation before we get into optical flow because it just becomes easier to work with and understand optical flow if we have this change of notation kind of thing.

So before we can talk about it though, we have to formally define videos, and that’s because optical flow actually operates on videos. So we need to have a good understanding of what is a video before we can really get into optical flow. So we know that we can represent images as matrices for grayscale, but what about video?

So what is a video? To kind of answer that question, think about some of the videos that you've seen at the movies, or on your TV, or your computer, tablet, phone, YouTube, and whatnot. So it seems that when you watch a video on YouTube or something, you can pause it and you get a still image, right? And so what does that kind of imply about videos? Well, you can take a scrubber or hit the rewind or forward button and kind of go through each of the still frames. So that produces something enlightening about video, and that is that video is just a sequence of images.

Sequence of images. Because, you know, you can pause a video and kind of scrub back and forth using some sort of scrubber or a fast-forward, rewind sort of thing. And so you’ll notice that videos are just a sequence of images and they’re played fast enough that they don’t look like still images. They look like one fluid motion into what we call a video. And this also becomes apparent if you ever had like some kind of sticky notes, like a flip book, or something.

You can draw each image and when you flip through them fast enough, it looks like one continuous animation, for example, like that. So that kind of gives you the background for videos. They’re just a sequence of images. And in fact, if you’ve heard of something like FPS, right? That’s actually called frames per second. And that’s just like the speed. That tells us how many frames are being shown in one second.

In the context of videos, these still images, we actually call them frames. So we say that video is actually comprised of different frames. Rather than images, we just say frames. And so when you see something like FPS, that means frames per second. That’s how many frames are shown per second. And you know, you might have heard values for this like 60 FPS is a very common number to see next to frames per second, and that’s saying that you see 60 frames in one second. That’s 60 frames, 60 still images in one second.

So this is moving pretty fast. So the frames are moving pretty fast. You can also have lower FPS, like 30 FPS or 15 FPS. And as you decrease the FPS, it becomes clearer that you’re looking at still images played in succession rather than one continuous motion. But anyway, so that’s just kind of an intuitive understanding of videos.

But now, if we want to formally discuss videos, then we have to change the notation a bit. Particularly, we have to go from matrix notation to function notation, because if we try to stick with matrix notation when we're discussing videos, then we'd have to talk about these things called tensors, and that just gets way beyond the scope here; it kind of gets out of control at that point. So let's just start with images first and then convert from images to video. So with images, we know that they're matrices, and we can represent grayscale images as matrices.

And so what we can do is, analogously, we can define a function, I’m going to call it capital I, that takes an X and a Y coordinate and returns some pixel value or intensity. And it turns out when it comes with videos, we’re only concerned with pixel intensities. So we can drop any colors and just consider a grayscale. But it turns out that this is like an analogous notation to matrices. So this is, you know, kind of the function notation, function notation, here for images. And it’s kind of like if you were trying to find a book or something in a library, whether that’s a physical library or like some online library.

You know, if we wanted to find a particular book, we need some information about it, like its book number, the genre, the author's name, and so on. But when we have that information, we can find the book that we're looking for. Each book in our library is uniquely identified by some group of attributes or values. Usually, that's something like title, author, and book number, or something like that. So, you know, given a set of values, if I tell you three values, it points to a unique book. It's never the case that if I tell you three values, you have two possible books to choose from. It's always: if I tell you some set of values, then you get the unique book.

When we're using this notation, think of our library as the image and that X, Y as the information that we need. And so this function I is kind of analogous to the act of finding a book that we want, or finding a particular pixel intensity that we want. So this is saying that we're basically getting the pixel value at X comma Y. And you can think of these, with the old matrix notation, as kind of being the indices.

If we have an image, then we can just like have an image here. And then at a particular location, X comma Y in our image if we think of this as a coordinate plane, there’s some pixel intensity P that’s there. So that is the kind of notation for images. So like, you know, given an X and a Y, X and Y are unique, and so we can get the pixel value. But for videos, this is a bit different ’cause we can’t just use X and Y because we have that additional component of, well, in which frame are we doing this look up?

So, this works well for a single image, but when you have a sequence of images like videos, then we can’t just use X, Y. We need actually one other parameter and I’m going to call that T, and then this will get us a particular pixel value. And so this T represents where, when in the image, or when in the video, I should say. So this kind of represents a when in our video, which frame do we find, you know, look at the X and Y to get a particular pixel intensity.

So this is kind of like a spatial position here and this is a temporal position here, 'cause this has to do with the spatial positioning of the pixels in a particular frame, and this has to do with when that frame is. And so that's why for videos we need three values here instead of just two for images, because just using X and Y isn't sufficient for finding a particular pixel in a video: we don't have a single image, we have a sequence of images, so we need another parameter to tell us where in the duration of the video we're finding this pixel. But anyway.

So for video, we need three parameters. So this is kind of the notation that I’m going to be using for the rest of this course because for optical flow, it just becomes easier to deal with this I function rather than dealing with matrices, like I said, ’cause then we have to get into tensors and it just gets kind of out of control at that point. But anyway. So going forward, we’ll be using this notation to look at optical flow. So I guess I’m going to stop right here and do a quick recap.

So in this video, we discussed, well, videos, and we defined them as being just a sequence of frames. These frames are just like, you can think of them as being still images. And they’re played so fast, we see them appear so fast that they appear to be like one continuous motion. They appear to have motion. And so, you know, like, frames per second is a common measure of this and frames per second tells you how many of these frames do you see in one second. So common values for this were, like I said, were like 60. You’re seeing 60 frames in one second, so that’s pretty fast.

So I just kind of gave you an intuitive understanding of videos, then we moved onto this change of notation from matrix notation to function notation here. So I just described this I function as basically like a lookup table. So we go to the pixel at coordinate X comma Y in our image starting, you know, at the standard image coordinate system. And then P is the pixel intensity that we get at coordinate X, Y. And then for videos, I said that X, Y just isn’t sufficient, so we need another parameter, T, and then tells us when or what frame we’re looking at, basically. So when in the duration of the video do we look. So kind of moving forward, we’re going to be using this notation.

So that’s where I’m going to stop with this video, and then actually in the next video, in the next sequence of videos, I’m going to kind of give you an intuitive understanding of optical flow, and so we’ll kind of get through that. So we’re going to start optical flow in the next video.

Transcript 3

Hello everybody my name is Mohit Deshpande, and in this, just gonna start the sequence of videos that are just gonna kind of give you an intuitive and complete understanding of optical flow and we’re also gonna get a little bit into the mathematics of particularly the actual equation for optical flow and I’ll talk about some of the stuff there. I won’t be going too far into the mathematics, but I just wanna kind of give you, I want to start off by giving you more of an intuitive understanding first and then we can kind of take that intuition and solidify it into more concrete math terms.

So first of all, we have to kind of talk a little bit about what this is and the motivation behind this. So images are great. Videos are even cooler, because we have more information in a video than in an image. Because in an image we just have the spatial positioning of the pixels. That is where they are in the image relative to each other. So that’s all we get in an image but in a video, we get the same spatial information but we also add an additional temporal component. Meaning we not only have the location of a pixel spatially, but we also have when does this pixel exist? Maybe it’s only at one particular frame. Maybe it exists for a sequence of like 50 or 100 frames or something like that.

You get this additional information about the time duration of pixels, and when you have this additional information it really opens up a lot of doors as to what we can look into.

And so optical flow is one of these doors. It is a computer vision, I'll just write this down, it is a computer vision technique that is used to track the apparent motion of objects in a video. So using this technique of optical flow, we can actually take a pixel value, or you can make this more generic, like an object, and we can track it through the video. And so later we can draw kind of like a path; I'm gonna show you how to draw this in just a second. And this is really interesting because optical flow is not just used for things like object tracking through videos.

It actually has a ton of different applications that we're gonna be discussing in a later video, like video compression and video stabilization, and just recently there's been some research that uses optical flow patterns to help generate descriptions of snippets of video. You can give this AI a snippet of video, and it will generate a description of the video, and it turns out that optical flow features are actually pretty useful for this sort of thing. But we're gonna discuss this stuff at a very, very top level in a later video, so let's first derive the intuition behind optical flow. So remember that if you want to track an object through video, remember that on a computer level we only have access to these raw pixels.

So suppose I have like a frame here, here’s one frame, and then here is another frame. We only have access to the raw pixels and I kind of drew these, these probably should be the same size, but we only have access to, here is, oops, here is a frame t, maybe here’s a frame t plus one. And then I’ll draw another one short. Maybe here’s a frame t plus two and whatnot. What we’re trying to do with flow is to take a point here and a point here, and then track it through through this frame. It’s gonna go here and then it kind of goes down here, and this is what we’re trying to do with flow.

To do this, we consider two consecutive frames at a time and build up the path, so it becomes a bit like connect-the-dots, where the dots are the positions of this particular pixel in each frame. At first glance this might seem impossible, because you have to consider so many things, like the size of the frame and how we know which pixels are which, but it turns out there are two assumptions that optical flow makes that really help simplify this.

So, assumptions, and we're going to be using these assumptions in later videos. There are two assumptions that optical flow makes. The first is that pixel intensities don't rapidly change between consecutive or successive frames. What I mean is that a pixel's value doesn't just immediately change between two successive frames; that would be like this pixel being green here and then becoming blue in the next frame. The assumption that flow makes is that this doesn't happen, and this has real-world implications.

The frames are captured with so little time between them that, unless you were actually editing each frame of the video, you wouldn't really encounter something like this. This is not to say that pixels can't change after a longer period of time; that's fine. It's just saying that they don't flip between two consecutive frames, because the time between frames is really small. If your pixels are flipping between frames, that's odd; maybe some video editing trick could make that happen, but naturally it doesn't really occur. Anyway, that's the first assumption.

The second assumption is that groups of pixels move together. What I mean by this is that pixels don't jump around between frames; it's like saying that pixels don't teleport. If I have a group of pixels at the top of the image, they're not going to jump to the bottom of the image in the next frame. Teleporting pixels hinder good flow tracking, so ideally you don't want pixels teleporting between frames. In other words, you want the motion to be smooth; optical flow works really well when the motion is smooth, not when it's jumpy.

Like I said, these assumptions have real-world grounding: in the real world, if you're taking video, things just don't teleport everywhere. These assumptions are perfectly valid to make. There are ways, if you were to take a video and do some editing on it, that you could break these assumptions intentionally, but we're not going to consider that. I've drawn a picture here, but let me draw just two frames.

Actually, I'll draw a third one right here as well. So I have my pixel here in one particular frame, and I'm going to color it green; then here is the same pixel in the next frame. Let me label these with their frames.

This will be frame t and this will be frame t plus delta t, where delta t just means that some short amount of time has elapsed. If I look at both of these pixels in the same context, I get something like this: they're a little bit apart. So the problem of flow, the thing we're trying to solve, drawn here in red, is that there is some displacement: the pixel moves in the x direction by some amount u and down in the y direction by some amount v.

So the challenge of flow is to find this u and v, because if we have them, we can track the path. Once we have this displacement, called the displacement vector, we know how much this pixel has moved. I'm going to stop this video right here.

In the next video we're going to talk a little bit more about the solution to this, but this is the problem that optical flow is trying to solve: how do we find these values, this u and v? So I'm going to stop right here, do a quick recap, and then we'll continue with flow in the next video.

We discussed optical flow: it's a computer vision technique to track the motion of objects through videos. Given the frames of a video, I want to build a path through the video that tracks a particular pixel. This can be really challenging, but there are two assumptions that optical flow makes that are rooted in the real world.

Those are that pixel intensities don't rapidly flip between frames and that pixels don't teleport, and these are valid assumptions to make. I showed a sequence of frames here, but specifically, given two frames that are separated by some small time delta t, the problem of flow is to find this u and v: how much has this pixel moved in the x direction and how much has it moved in the y direction?
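
To make this concrete in code, here is a minimal sketch (not code from the course) of how OpenCV's dense optical flow produces exactly these u and v values for every pixel between two consecutive frames. The file names are placeholders, and the Farneback parameters are just typical defaults.

import cv2
import numpy as np

# Two consecutive frames as grayscale intensities; "frame1.png"/"frame2.png" are placeholder names.
prev = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Farneback dense optical flow returns an array of shape (H, W, 2),
# where flow[y, x, 0] is u (movement in x) and flow[y, x, 1] is v (movement in y).
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

u, v = flow[..., 0], flow[..., 1]
print("mean displacement:", np.mean(u), np.mean(v))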

That's the problem of flow, and in the next video I'm going to take this intuition and make it a bit more concrete using mathematics.

Transcript 4

Hello everybody, my name's Mohit Deshpande, and in this video we're going to be delving a little bit more into optical flow and adding a little bit of mathematics to it.

If you recall from the previous video, the point of flow is to find this u and v, and we have two assumptions. Let me take this image and expand it a little so you can see it better. I have two consecutive frames here: let's say we're considering this pixel in the frame at time t, and the same pixel is over here in the frame at some time t plus delta t, where delta t is the elapsed time. Delta just means difference, so t plus delta t just means that a little bit of time has elapsed since this frame and now my pixel is in a different location. So it starts here and ends up somewhere over here.

The point of flow, like I said, is that in some elapsed time this pixel has moved to the right by some amount that I'm going to call u, and down by some amount that I'm going to call v. That's what we're trying to find with optical flow, and if I draw it, I can complete the triangle. So we're trying to find this u and this v. How do we find them? Well, we use mathematics. I've given an intuitive picture here, but let's formalize it a little.

Remember that the first assumption says that pixel intensities don't rapidly change between consecutive frames. So it's reasonable to say, and I've colored them that way, that these two pixels in different frames have the same pixel intensity. Just to remind you, by intensity I mean that we drop any color information and only consider the pixel's brightness value. So it's reasonable to say that at these two instants in time, t and t plus delta t, the pixels have the same value. In our function notation, this is saying that the I function applied to the first pixel is equal to the I function applied to the second.

Let's write that out a bit more formally. What is the pixel intensity in the first frame? It's I(x,y,t), if we say the pixel sits at coordinate (x, y). Now the question is: what is the pixel's coordinate in the second frame, in terms of x and y, and u and v? Well, its x coordinate is the original x coordinate plus the small change u, so it's x plus u, because I'm moving u units to the right.

Similarly, if the original y coordinate is y, the new one is y plus v, so the pixel in the second frame is at coordinate (x plus u, y plus v). I can therefore write the intensity in this next frame as I of x plus u (the original x coordinate shifted right by u units), comma y plus v (the original y coordinate shifted down by v units), comma the time, which is just t plus delta t.

So now I've written this next frame in terms of the current frame. Just to make this clear: u is the movement along the x axis, v is the movement along the y axis, and together they are the displacement, while delta t is the movement along the time axis, that is, moving to the next frame. That's what these three values represent. And what is I? I is just a measure of pixel intensity.

And the I here is just a measure of pixel intensity as well. If you remember the first assumption, that pixel intensities don't rapidly change between two successive frames, then these two are actually equal. And that is the optical flow equation.

This is a really important equation, though it's not quite in a form that we can use yet. It says that the pixel intensity at some time t is equal to the pixel intensity one frame later, after some time delta t has elapsed, and we can write that in terms of this u and this v.
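
Written out in the notation the video is using, that equation (often called the brightness constancy constraint) is:

\[ I(x, y, t) = I(x + u, \; y + v, \; t + \Delta t) \]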

Take a second to look at this equation and make sure it logically follows from our first assumption that these two intensities should be equal: the x coordinate of the pixel in the second frame is the x coordinate in the first frame plus u, the y coordinate is y plus v, and the time is t plus delta t. Let me draw these markers for the y axis as well, since I drew them for the x axis. Hopefully this makes sense.

If you have any questions, please post a comment. We want to find the values of u and v, but they're buried inside our function, so how do we separate them from it? It turns out we use calculus to do this. I'm just going to put "dot dot dot calculus" and write down the final equation, and that is I sub x times u, plus I sub y times v, plus I sub t, equals zero. I'm not going to walk through the calculus that gets us from one form to the other, but I will talk about what these terms mean.
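
For reference, the "dot dot dot calculus" being skipped here is typically a first-order Taylor expansion of the right-hand side of the brightness constancy equation, where $I_x$, $I_y$, and $I_t$ denote the partial derivatives of the intensity:

\[ I(x + u, \; y + v, \; t + \Delta t) \approx I(x, y, t) + I_x u + I_y v + I_t \Delta t \]

Setting this equal to $I(x, y, t)$ (the first assumption) and measuring the displacement per frame step (so $\Delta t = 1$) leaves the optical flow constraint:

\[ I_x u + I_y v + I_t = 0 \]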

So we end up with a single equation. I sub x represents how much the frame changes in the x direction, horizontally; I sub y is how much the frame changes in the y direction, vertically; and I sub t is the image difference between the two frames, that is, how much the frames change along the time dimension. It turns out we can compute all three of these: I sub t is just an image difference, and the other two we can compute using convolution.
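
As a rough illustration (again, not code from the course), these three quantities can be approximated with a couple of OpenCV calls; the Sobel filter is one common choice of convolution kernel for the spatial derivatives, and the file names are placeholders.

import cv2
import numpy as np

# Two consecutive grayscale frames as float arrays.
prev_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
curr_gray = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Spatial derivatives I_x and I_y: how the intensity changes horizontally and vertically.
I_x = cv2.Sobel(prev_gray, cv2.CV_32F, 1, 0, ksize=3)
I_y = cv2.Sobel(prev_gray, cv2.CV_32F, 0, 1, ksize=3)

# Temporal derivative I_t: just the difference between the two frames.
I_t = curr_gray - prev_gray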

So we have this equation, and it turns out these three quantities are things we can compute. We still have u and v, which we don't know; those are the two things we're trying to find. But we have one equation and two unknowns, so how do we solve one equation with two variables? (This is related to something called the aperture problem, in case you're curious.) Don't worry, it turns out there are ways to solve this equation.

OpenCV has methods we can use to solve this equation, or rather to approximate u and v. One particularly good method is the Lucas-Kanade method, and there are several others as well; there are actually quite a few you can use to find u and v. Actually applying such a method again requires calculus and linear algebra, so I won't cover the details, but trust me when I say that there are ways to solve this equation for u and v, so don't worry about that.
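
For completeness, here is a minimal hedged sketch of what calling OpenCV's Lucas-Kanade implementation looks like in Python; the corner-detection and window parameters are illustrative defaults rather than values from the course, and the file names are placeholders.

import cv2
import numpy as np

prev_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
curr_gray = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Pick a set of good points to track (corners are the easiest to follow).
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)

# Lucas-Kanade: estimate where each point moved to in the next frame.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None,
                                           winSize=(15, 15), maxLevel=2)

# The displacement (u, v) for each successfully tracked point.
displacement = (p1 - p0)[status.flatten() == 1]
print(displacement[:5])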

Okay, I'm going to stop right here; in the next video there are a couple of smaller things about optical flow that I want to wrap up. So here's a quick recap. With optical flow, given two frames, we're trying to find this u and this v, which are how much a pixel has moved in the x direction and the y direction. We can write down the pixel intensity in each frame, and using the first assumption, that pixel intensities don't change quickly, we can set the two equal. Now we've written the second frame in terms of the first frame, and we get one equation.

And remember how we get this x plus u: if I define u as how much the pixel has moved in the x direction, and here is the pixel in some frame t, and here it is after some time has elapsed, and the difference in x is u, then the new position must be x plus u. Similarly, the new y position must be y plus v, if I define v as how much the pixel has moved in the y direction. And for time, it's t plus delta t, where delta t is the time that has elapsed between these two frames.

Using the first assumption we set these two equal to each other, then "dot dot dot calculus", and we end up with this single equation. It turns out there are three quantities in it that we know and can compute easily, but u and v are things we don't know yet. At least we've gotten them out of the function, and once the equation is in this form, we can use calculus and linear algebra to approximate them with several different techniques.

That's a quick overview of optical flow. A lot of the techniques you'll see in optical flow are trying to find these u and v values, and we're going to look at one in particular. This is where I'm going to stop, and in the next video I'll wrap up a few remaining things about optical flow.

Interested in continuing? Check out the full Video and Optical Flow – Create a Smart Speed Camera course, which is part of our Python Computer Vision Mini-Degree.

]]>
How to Process Video Frames using OpenCV and Python https://gamedevacademy.org/video-processing-opencv-tutorial/ Fri, 25 Jan 2019 05:00:35 +0000 https://pythonmachinelearning.pro/?p=2258 Read more]]>

You can access the full course here: Create a Raspberry Pi Smart Security Camera

Transcript

Hello everybody, my name is Mohit Deshpande and in this video, we’re going to start building our app.

Actually, before we really get into our app, we first have to discuss something really important, and that is: how do we actually look at video in terms of images? One way to think about it is that video is really just a sequence of still images. You can see this if you take any video, pause it, and step forward just a little bit at a time, going frame by frame.

That's what they're called in video: frames. A frame is just a particular still image. You can go through all of the frames, and when you play them really fast it appears as video, because our eyes can't detect changes that quickly; we don't see them as still images when they're played fast enough, we see one coherent video. As it turns out, this is exactly how OpenCV likes to think of video: as still images. So any of the image processing techniques that we've already talked about can be applied to each frame of this video.

There are a couple of different ways that we can set up video, and we're going to get into the Python code now. The first thing I should do is import my core modules, and those are cv2 and numpy as np, which I'm going to need at some point. When you're starting any sort of CV project, these two imports are a really good first two lines, because odds are, if you're doing anything with CV, you're going to need both of them.

So now the question is, how do we get video? For example, how do we open a file? There are lots of video file formats we can use, like .mov or .mp4. Even better, for the purposes of our security camera, we don't just want to open a video file; we actually want to open a live camera. As it turns out, OpenCV makes it really easy to switch between a video file and the camera itself.

We're going to be dealing mostly with video files, because some of you may not have a webcam, for example. So we're primarily going to use video files, but I'm going to show you how to extend this to your webcam. And if we were to run this code on the Raspberry Pi, you would use the camera that I showed you how to install.

First things first, we have to tell OpenCV whether we want to use a file or a live camera stream. There's an object in OpenCV that we can use for this; I'm going to call it cap, for capture. We're going to say cv2.VideoCapture and pass in video.mp4.
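
In code, that single line looks roughly like this (a sketch assuming the sample clip sits next to the script):

import cv2

# Open the video file; "video.mp4" is the sample clip described in the transcript.
cap = cv2.VideoCapture("video.mp4")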

What I have is in the same directory: I call the script app.py, and it lives in my Developer/security folder, which is the same folder where I have the video. We're going to run this against our security camera footage, basically testing against it, so that we can see whether it works or not. If I double-click the video, it shows me breaking into my roommate's room.

This is the video feed that we're going to use, and I'm going to provide you with this video so you can test with it. It's fairly short, because a really long video file takes up a lot of space on your hard disk and we don't need that much. I tried to balance it so that there are at least a few seconds of stillness, which we can compare against later when we do the image comparison between video frames. Anyway, we'll get to that much later.

I just wanted to introduce the test video, the data set, that we'll be using. You can see why I paused it: this is one image, and this is how OpenCV is going to interpret the video. What OpenCV does is look at each single frame of our video, and we can iterate through those frames until we reach the end of the video. So let me quit out of this.

You'll be provided with the video; feel free to use your own, or a live stream from your webcam, but I'm going to provide this one so that everyone is consistent. I've got some other stuff here, but I'll explain that as we move along. So anyway, this is the video capture, which I called cap, and what we want to do now is basically recreate a video player: I'm going to load this video and play it back frame by frame. So how exactly do we do this?

Well, first we need some kind of loop, some kind of structure, to make sure we keep getting valid frames from our video. To display the video, we want to keep pulling frames from it until we reach the end. Like I mentioned, we're just going to build a small app that replays this video file, which is a pretty good start. To do this, I'm going to start with an infinite loop. You may be saying, "Whoa, wait a minute, how do we know if we've reached the end of the video?" Hold on a second, I'm getting to that.

The reason we put it in a while True loop is so that we keep pulling frames from the video, and when we reach the end of the video and there is no frame to pull, the frame we get back will actually be None in Python. Once we fetch a frame that's None, we know we've reached the end of the video and need to stop, so we break out of the loop. That's basically what we're going to do. Before I get to that, one important thing you have to remember is to call release() on this VideoCapture. That's a cleanup step, kind of like cv2.destroyAllWindows().

It just releases the resources that are allocated to the video or the webcam. In this while True loop, what I want to do is pull a frame from the capture. I can do that really easily using cap.read(); it's kind of like imread, except for videos. It actually returns two things, a tuple: the first is a return value that we don't really care about here, but the second is the actual frame.

Now that we have the frame, this is where we can do some checking: if the frame is None, then we want to break out of the loop. When we hit this case, we know we're done with the video and we just end. With a webcam you don't necessarily need this, because as long as the webcam is plugged in and running we shouldn't ever encounter this case.

Except that some error with OpenCV could potentially cause this to return None, so it's still a good idea to leave the check in, just in case. Anyway, now that we have a frame, we can treat it just like an image: any of the image processing techniques that we've learned about before can be applied to this frame. It's just a single image, and anything we know how to do to an image, we can do to this frame; that's what's super convenient about OpenCV.

The one thing we have to keep in mind, though, is that video is a sequence of frames, so you don't want to do any image processing on a frame that takes a long time. As we get towards the end of the app, we're going to see that performance becomes really important.

The speed of your code matters a lot when you're dealing with video, because if your code is slow on these image processing tasks, you're going to notice it: the frame rate of the video will drop way down. So we have to make sure we're not running overly intensive computer vision operations on each frame. We just want to grab the frame, do what we can with it, and quickly move on to the next one.

Now that I have this frame, I can show it just like an image: I'm going to call cv2.imshow(), passing in a name for the window and the frame, and it will display the frame. There's one other thing we have to do, because again we're dealing with video: we call cv2.waitKey(1), and then there's this & 0xFF. If the result equals q, that's another reason for us to break; this just lets us quit out of the while loop with a single key press.

It turns out we need this because, when we're fetching frames from a file or from the webcam, we have to give it a small amount of time to actually fetch the frame and do something with it. If you remove this line, your app isn't going to work: it will crash on this line complaining that you haven't provided a frame. That's because we have to give our camera a moment to actually pull the frames.

Now that I have all of this, let's actually run it. One thing I forgot to add is cv2.destroyAllWindows(), so that at the end we get rid of the window that pops up to show our video. I also forgot to put this if here; that's important too. Okay, and this is all we need to play our video.
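
Putting the pieces described above together, a minimal sketch of the player would look like this; the window title and file name are placeholders chosen for this sketch.

import cv2
import numpy as np  # not used yet, but handy for the image processing to come

# Open the sample clip; see below for how to swap in a webcam instead.
cap = cv2.VideoCapture("video.mp4")

while True:
    # cap.read() returns (return_value, frame); frame is None once the video ends.
    _, frame = cap.read()
    if frame is None:
        break

    # A frame is just an image, so any image processing step could go here.
    cv2.imshow("frame", frame)

    # Give OpenCV a moment to fetch and draw the frame, and quit early on 'q'.
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Clean up the capture resources and close the window.
cap.release()
cv2.destroyAllWindows()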

So let's go ahead and run this, and we should see our video play. And yeah, awesome, we can see our video playing, and it closes out after a second. Excellent. Now we have our video playing, and this is actually where I want to stop, because we've covered quite a bit in this video. Let me do a quick recap. Actually, before we stop, I just want to mention one thing.

That's how we're playing a video file right now, but if I wanted to use a webcam, how would I do that? It's actually really simple. I'm going to copy this line here and paste it. What's really nice about OpenCV is that to go from a video to a webcam, we use the exact same line of code, except instead of the file name we put zero. And that's it. So we can replace line four with line five, and this will get video from your webcam, or the camera on the Raspberry Pi, instead of a particular video file.

But like I mentioned, just to keep everything consistent, we're going to be using video files. I'm going to leave this line in here, commented out, so that you can easily flip between the two if you want. This is where I'm actually going to stop the video, because we've covered a lot.
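
In other words, the only change for a live camera is the argument to VideoCapture; a sketch of the two options side by side (device index 0 is usually the first attached camera):

cap = cv2.VideoCapture("video.mp4")  # read from the sample video file
# cap = cv2.VideoCapture(0)          # or read from the first webcam / Pi camera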

So, just to do a quick recap: we've covered how to load a video from a file and how to stream it from a webcam. I mentioned that videos are just a bunch of still images, frames, played so fast that we see them as one continuous video. Now that we're loading the video, how do we pull frames from it? That's what cap.read() does: each call pulls the next frame. The first value it returns is a return code that we don't use here, but the second value in the tuple is what we care about, and that's the frame.

Then we can show this frame, and we can do anything to it that we learned about in image processing, because this frame is just an image. To illustrate this, we use cv2.imshow(), which is what we use for images, and now we're using it for frames; of course, frames are images, which is why this works. One thing we have to include is cv2.waitKey(), because we have to give our camera a moment to actually grab the frame and hand it to us, or in this case, grab a frame from the video and hand it to us. It also adds the ability to quit out of our app any time we want.

One special case we have to think about is what happens at the end of the video, in particular when we're loading a video rather than streaming from the webcam. If we're at the end of the video, cap.read() should return None, because we're trying to get the frame after the last frame, and there is no frame after the last frame. So all of this makes sense.

There are two last things that you have to do any time you're working with video capture. You have to make sure that you call cap.release(): this releases your app's control of the webcam, or closes up any resources associated with the video. And then we have our classic cv2.destroyAllWindows().

Okay, so that's what we covered in this video. We started building our app, and if you run it, it will just load the video and play it. This is a really good start, but in the next video we're going to build on this concept a bit more and start thinking about how we can build a security camera and some of the design points around that. So we'll get into that in the next video.

Interested in continuing? Check out the full Create a Raspberry Pi Smart Security Camera course, which is part of our Python Computer Vision Mini-Degree.

]]>
How to Use Machine Learning to Show Predictions in Augmented Reality – Part 3 https://gamedevacademy.org/how-to-use-machine-learning-to-show-predictions-in-augmented-reality-part-3/ Wed, 23 Jan 2019 05:00:12 +0000 https://pythonmachinelearning.pro/?p=2335 Read more]]> Part 2 Recap

This tutorial is part of a multi-part series. In Part 1, we loaded our data from a .csv file and used Linear Regression in order to predict the number of patients that the hospital is expected to receive in future years. In Part 2 we improved the UI and created a bar chart.

Introduction

Welcome to the third and last part of this tutorial series. In this part, we will use Easy AR in order to spawn our patient numbers histogram on our Image Target in Augmented Reality.

Tutorial Source Code

All of the Part 3 source code can be downloaded here.

Easy AR

Easy AR SDK is an Augmented Reality engine that can be used in Unity. It basically allows us to set an image as a target: Easy AR detects this image target in the live camera feed of a device like a smartphone or a tablet, and when the image is detected, the magic happens and our object spawns on top of it.

First, we need to create a free account on the Easy AR website using just an email and a password; this is the link to the website.

After we sign in to the Easy AR website, we create an SDK license key: in the SDK Authorization section, click on "Add SDK License Key" and choose the "EasyAR SDK Basic: Free and no watermark" option. You are required to fill in the Application Details, but don't worry too much about these values because you can change them later.

In the SDK Authorization section we can now see our project. Let's copy our SDK License Key, because we will need it later.

license key

Let's download the Easy AR SDK Basic for Unity from this page.

download package

Now let's open our project in Unity. After the project is loaded, we need to open the downloaded Easy AR package just by clicking on it.

open package

Unity should start decompressing the package, as in the screenshot below.

decompress easy ar

After the decompression, Unity will ask to import the files; just click on Import and wait a little bit.

import easy ar

Now open the scene with the graph; in my case it is tutorial_graph.

open graph scene

Let's add the EasyAR_Startup prefab to the scene by dragging it from the Unity file browser to the Hierarchy section, like in the screenshot below. You can find this prefab in the Assets/EasyAR/Prefabs folder.

add easy ar

Select EasyAR_Startup in the Hierarchy section; in the Inspector you should see a "Key" textbox. Paste your Easy AR key here; we obtained that key earlier when creating our project on the Easy AR website.

easy license

Let's remove our Main Camera, because we don't need it anymore; from now on, the main camera will be our device camera.

remove camera

Now if you hit the Play button, you should see the feed from your computer's webcam.

The image target

We need to create the folder structure for the image target and materials. First, inside the EasyAR folder, create a "Materials" folder.

new folders

Also inside the EasyAR folder, create a "Textures" folder.

new textures folders

Create a “StreamingAssets” folder inside the Assets folder

new streaming assets folders

Choose an image target. I suggest you choose a square image; a QR code works best, but feel free to use your company logo or something similar. Remember that the image should have distinct edges and strong contrast, otherwise it will be really hard for the camera to detect it. For example, light grey text on a white background will be almost impossible to recognize.

I used this image of a patient and a medic to stay on theme with our project.

target image

Copy your target image into Assets/StreamingAssets and Assets/EasyAR/Textures.

copy image

Now create a new material inside EasyAR/Materials called Logo_Material

create new material

Drag the logo image onto the Albedo input of the material.

drag image to material

This should be the resulting material: instead of being just gray, the material preview sphere will now display your texture image.

material result

Now we need to drag the ImageTarget prefab from the EasyAR/Primitives folder into the Hierarchy section; this will enable us to use the Image Target.

image target prefab

Select the ImageTarget object and, in the Inspector, fill in the ImageTarget input fields with the path to your image (in my case "target.jpeg") and the name imagetarget.

image target prefab 2

We need to specify a size for our image target; I will use 5 for both x and y, but feel free to use any size.

image target prefab size

This is really important: we need to change the image target Storage to Assets.

image target prefab storage

Now let's drag the ImageTracker from EasyAR_Startup into the Loader input of the image target. This tells Easy AR which image tracker to use to detect the image in the device camera feed.

image target prefab loader

Inside the Material section of the image target, drag our Logo_Material onto the Element 0 input.

image target prefab material

Now, if everything is done correctly, the image target should display the image we chose before.

image target prefab material result

Let's test our work: create a cube inside the ImageTarget object. We will use it later as an object reference for our patient-numbers graph; this cube will help us anchor the graph on the image target. Let's call it "graph_anchor".

anchor object

Hit Play, and the cube should appear over the image target when the device camera detects the image. Don't worry if you have to wait a minute or so for Unity to switch on your device camera; it's slow while you are developing the project, but it will be faster once the project is built into an app.

Now we need to make that cube invisible, so that only our future histogram is visible on the image target. Let's create a new material called "Anchor_Material" inside the Assets/EasyAR/Materials folder. In order to make it fully transparent (invisible), we need to set the Rendering Mode to Fade.

rendering fade

The alpha of the Albedo section must be set to 0, as you can see in the screenshot below

alpha material

Now drag the Anchor_Material onto the graph_anchor cube; this should be the result.

cube material result

Time to save the tutorial_graph scene!

Augmented Reality Histogram

In the last tutorial we generated our data visualization; now it's time to edit that code so the histogram spawns over our image target. This is a bit tricky, because augmented reality doesn't always detect the target perfectly, so don't get frustrated if something goes wrong on the first try.

Let's write a test inside the "GenDataViz.cs" script that prints a console message when the image target is detected:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.
using TMPro; // required to use text mesh pro elements

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    //Add a label for the prediction year
    public TextMeshProUGUI textPredictionYear;

    //Check if image target is detected
    public GameObject Target;
    private bool detected = false;

    //The update function is executed every frame
    void Update()
    {
        if (Target.activeSelf == true && detected == false)
        {
            Debug.Log("Image Target Detected");
            detected = true;
        }
    }

   //continues with the rest of the code...
}


A little explanation of what the new code does:

  1. "public GameObject Target": this object will be our ImageTarget prefab
  2. "private bool detected = false" will help us identify the first detection
  3. The Update function is a standard Unity function like Start(), and it is executed every frame. It's useful when you have to check something continuously, but beware, because it can be expensive on the CPU side. You can find more details about the Update function in the official Unity docs
  4. Inside the Update function, there is an if statement that checks whether the Target is active (that is, whether the image is detected) and whether this is the first detection. We check for the first detection in order to avoid repeating expensive, useless operations
  5. Inside the if we log a simple console message and set detected to true after the first detection

Select the GraphContainer in the Hierarchy section and drag the ImageTarget into the Target field, as the screenshot below shows.

drage image target

Now, if you hit Play and the camera detects the image target, the console should print the message "Image Target Detected".

console detect message

Our patient-numbers histogram should be created when the image target becomes active, so let's edit the "GenDataViz.cs" script:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.
using TMPro; // required to use text mesh pro elements

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    //Add a label for the prediction year
    public TextMeshProUGUI textPredictionYear;

    //Check if image target is detected
    public GameObject Target;
    private bool detected = false;

    //The update function is executed every frame
    void Update()
    {
        if (Target.activeSelf == true && detected == false)
        {
            Debug.Log("Image Target Detected");
            detected = true;
            // we moved this function from Start to Update
            CreateGraph();
        }
    }


    // Use this for initialization
    void Start()
    {
    }
//continues with the rest of the code...
}

As you can see:

  1. I moved the CreateGraph() call from Start() to Update(); this allows us to check whether the image is detected and then generate the graph

Now we need to position our 3D histogram on the Image Target; we will use our graph_anchor cube to retrieve the relative size, position, and rotation. Let's edit the "GenDataViz.cs" script again; this is the full code of the script:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.
using TMPro; // required to use text mesh pro elements

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 1000;
    int scaleSizeFactor = 100;
    float binDistance = 0.1f;
    float offset = 0;

    //Add a label for the prediction year
    public TextMeshProUGUI textPredictionYear;

    //Check if image target is detected
    public GameObject Target;
    private bool detected = false;
    // The anchor object of your graph
    public GameObject GraphAnchor;

    //The update function is executed every frame
    void Update()
    {
        if (Target.activeSelf == true && detected == false)
        {
            Debug.Log("Image Target Detected");
            detected = true;
            // we moved this function from Start to Update
            CreateGraph();
        }
    }


    // Use this for initialization
    void Start()
    {
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }

    // Here we allow the use to increase and decrease the size of the data visualization
    public void DecreaseSize()
    {
        scaleSize += scaleSizeFactor;
        CreateGraph();
    }

    public void IncreaseSize()
    {
        scaleSize -= scaleSizeFactor;
        CreateGraph();
    }

    //Reset the size of the graph
    public void ResetSize()
    {
        scaleSize = 1000;
        CreateGraph();
    }



    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(GraphAnchor.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            //Reduced the number of arguments of the function
            createBin((float)LinearRegression.quantityValues[i] / scaleSize, GraphAnchor);
            offset += binDistance;
        }
        Debug.Log("creating the graph: " + LinearRegression.PredictionOutput);

        // Let's add the prediction as the last bar, only if the user made a prediction
        if (LinearRegression.PredictionOutput != 0)
        {
            //Reduced the number of arguments of the function
            createBin((float)LinearRegression.PredictionOutput / scaleSize, GraphAnchor);
            offset += binDistance;
            textPredictionYear.text = "Prediction of " + LinearRegression.PredictionYear;
        }
        else
        {
            textPredictionYear.text = " ";

        }
    }

    //Reduced the number of arguments of the function
    void createBin(float Scale_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);

        //We use the localScale of the parent object in order to have a relative size
        Vector3 scale = new Vector3(GraphAnchor.transform.localScale.x / LinearRegression.quantityValues.Count, Scale_y, GraphAnchor.transform.localScale.x / 8);
        cube.transform.localScale = scale;

        //We use the position and rotation of the parent object in order to align our graph
        cube.transform.localPosition = new Vector3(offset - GraphAnchor.transform.localScale.x, (Scale_y / 2) - (GraphAnchor.transform.localScale.y / 2), 0);
        cube.transform.rotation = GraphAnchor.transform.rotation;

        // Let's add some colours
        cube.GetComponent<MeshRenderer>().material.color = Random.ColorHSV(0f, 1f, 1f, 1f, 0.5f, 1f);

    }

}

Many things changed in order to adapt our histogram to the Augmented Reality target; let's see what changed:

  1. I changed the initial value of scaleSize in order to have a smaller graph; feel free to change this value
  2. "public GameObject graphContainer" is removed and "public GameObject GraphAnchor" now takes its place
  3. "binDistance" is now a float variable, which allows us to be more precise
  4. The new "public GameObject GraphAnchor" is, as you might expect, the reference to our graph_anchor cube
  5. The "createBin()" function now needs fewer arguments, because the scale, rotation, and position are retrieved from the parent object
  6. Inside "CreateGraph()", the "createBin()" function uses GraphAnchor instead of GraphContainer, so the histogram is generated inside our graph_anchor cube
  7. I updated the "createBin((float)LinearRegression.PredictionOutput / scaleSize, GraphAnchor)" calls to use a float instead of an (int) cast, so the histogram scales more accurately
  8. In the scale variable, the expression "GraphAnchor.transform.localScale.x / LinearRegression.quantityValues.Count" bounds the graph's X dimension inside the parent object

Now select the GraphContainer object and drag the graph_anchor cube into the Graph Anchor input field.

drag graph anchor

Open the menu scene. Remember that the dataset is passed from the menu scene to the graph scene, so if you open the graph scene directly, unexpected results can happen because the data will be missing. Insert a year for the prediction and click on "Prediction", then on "Data Visualization". Now show your image target to the camera, and the graph with the patient numbers and the prediction should appear!

Fantastic job: you made it!

graph result

Conclusion

In summary, we created an application that can load a dataset of the number of patients the hospital staff took care of in previous years, and then predict how many patients they can expect to receive in the future. During a meeting, this kind of application can be used in combination with image targets printed on paper documents, enhancing the interactivity of a report and making the numbers easier to understand and visualize for funding and planning purposes.

A potential improvement to this project would be to connect the application to a live stream of data and then visualize, in Augmented Reality and in real time, how the data changes over time. Please feel free to have a go at that or any other customization you prefer, and of course share your progress, as well as any other thoughts, in the comments below! Hope you enjoyed all the tutorials in this series, and we'll see you next time; in the meantime, take care 🙂

]]>