Radial Basis Function Networks – Regression for ML

Machine learning is an expansive field – one often made better by techniques common to data science like regression.

In this tutorial, we’re going to explore the topic of neural networks and discover how we can make use of regression techniques when working with them. We’ll also dive into how to use Python to work with these concepts.

Let’s dive in and start learning!

Intro to Neural Networks and RBF Nets

Neural Networks are very powerful models for classification tasks. But what about regression? Suppose we had a set of data points and wanted to project their trend into the future to make predictions. Regression has many applications in finance, physics, biology, and many other fields.

Radial Basis Function Networks (RBF nets) are used for exactly this scenario: regression or function approximation. We have some data that represents an underlying trend or function and want to model it. RBF nets can learn to approximate the underlying trend using many Gaussians/bell curves.

Prerequisites

You can download the full code here.

Before we begin, please familiarize yourself with neural networks, backpropagation, and k-means clustering. Alternatively, if this is your first foray into the world of machine learning, we recommend first learning the fundamentals of Python. You can check out our tutorials, or try a robust curriculum like the Python Mini-Degree to do so.

For educators, we can also recommend Zenva Schools. Not only does the platform have tons of online courses for Python, but it is also suitable for K12 environments, with classroom management tools, course plans, and more.


RBF Nets

An RBF net is similar to a 2-layer network. We have an input that is fully connected to a hidden layer. Then, we take the output of the hidden layer and perform a weighted sum to get our output.

But what is that inside the hidden layer neurons? That is a Gaussian RBF! This differentiates an RBF net from a regular neural network: we’re using an RBF as our “activation” function (more specifically, a Gaussian RBF).

Gaussian Distribution

The first question you may have is “what is a Gaussian?” It’s the most famous and important of all statistical distributions. A picture is worth a thousand words, so here’s an example of a Gaussian centered at 0 with a standard deviation of 1.

This is the Gaussian or normal distribution! It is also sometimes called a bell curve. The function that describes the normal distribution is the following:

\[ \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

That looks like a really messy equation! And it is, so we’ll use $\mathcal{N}(x; \mu, \sigma^2)$ to represent it. If we look at it, we notice there is one input and two parameters. First, let’s discuss the parameters and how they change the Gaussian. Then we can discuss what the input means.

The two parameters are called the mean $\mu$ and standard deviation $\sigma$. In some cases, the standard deviation is replaced with the variance $\sigma^2$, which is just the square of the standard deviation. The mean of the Gaussian simply shifts the center of the Gaussian, i.e. the “bump” or top of the bell. In the image above, $\mu=0$, so the largest value is at $x=0$.

The standard deviation is a measure of the spread of the Gaussian. It affects the “wideness” of the bell. Using a larger standard deviation means that the data are more spread out, rather than closer to the mean.

Technically, the above function is called the probability density function (pdf) and it tells us the probability of observing an input $x$, given that specific normal distribution. But we’re only interested in the bell-curve properties of the Gaussian, not the fact that it represents a probability distribution.
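
As a quick illustration (a sketch, not part of the tutorial’s code download; the helper name gaussian_pdf is our own choice), here is how that pdf can be evaluated with NumPy:

import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # N(x; mu, sigma^2): the normal pdf from the formula above
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(gaussian_pdf(0.0))  # height of the standard normal's peak, about 0.3989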

Gaussians in RBF nets

Why do we care about Gaussians? We can use a linear combination of Gaussians to approximate any function!

Source: https://terpconnect.umd.edu/~toh/spectrum/CurveFittingB.html

In the figure above, the Gaussians have different colors and are weighted differently. When we take the sum, we get a continuous function! To do this, we need to know where to place the Gaussian centers $c_j$ and their standard deviations $\sigma_j$.
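
As a rough sketch of that idea (the centers, widths, and weights below are made-up values, not taken from the figure), summing a few weighted Gaussians produces one smooth curve:

import numpy as np

def rbf(x, c, s):
    # unnormalized Gaussian bump centered at c with width s
    return np.exp(-(x - c)**2 / (2 * s**2))

x = np.linspace(0, 10, 200)
centers = [2.0, 5.0, 8.0]
widths  = [0.8, 1.2, 0.6]
weights = [1.0, -0.5, 0.7]

# the weighted sum of the Gaussians is a single continuous function of x
f = sum(w * rbf(x, c, s) for c, s, w in zip(centers, widths, weights))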

We can use k-means clustering on our input data to figure out where to place the Gaussians. The reasoning behind this is that we want our Gaussians to “span” the largest clusters of data since they have that bell-curve shape.

The next step is figuring out what the standard deviations should be. There are two approaches we can take: set each standard deviation to be that of the points assigned to its cluster $c_j$, or use a single standard deviation for all clusters, $\sigma_j = \sigma~\forall j$, where $\sigma=\frac{d_\text{max}}{\sqrt{2k}}$, $d_\text{max}$ is the maximum distance between any two cluster centers, and $k$ is the number of cluster centers.

But wait, how many Gaussians do we use? Well that’s a hyperparameter called the number of bases or kernels $k$.

Backpropagation for RBF nets

K-means clustering is used to determine the centers $c_j$ for each of the radial basis functions $\varphi_j$. Given an input $x$, an RBF network produces a weighted sum output.

\begin{equation*}
F(x)=\displaystyle\sum_{j=1}^k w_j\varphi_j(x, c_j) + b
\end{equation*}

where $w_j$ are the weights, $b$ is the bias, $k$ is the number of bases/clusters/centers, and $\varphi_j(\cdot)$ is the Gaussian RBF:

\begin{equation*}
\varphi_j(x, c_j) = \exp\left(\displaystyle\frac{-||x-c_j||^2}{2\sigma_j^2} \right)
\end{equation*}

There are other kinds of RBFs, but we’ll stick with our Gaussian RBF. (Notice that we don’t have the constant up front, so our Gaussian is not normalized, but that’s ok since we’re not using it as a probability distribution!)

Using these definitions, we can derive the update rules for $w_j$ and $b$ for gradient descent. We use the quadratic cost function to minimize.

\begin{equation*}
C = \displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2
\end{equation*}

We can derive the update rule for $w_j$ by computing the partial derivative of the cost function with respect to all of the $w_j$.

\begin{align*}
\displaystyle\frac{\partial C}{\partial w_j} &= \displaystyle\frac{\partial C}{\partial F} \displaystyle\frac{\partial F}{\partial w_j}\\
&=\displaystyle\frac{\partial }{\partial F}\left[\displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2\right]~\cdot~\displaystyle\frac{\partial }{\partial w_j}\left[\displaystyle\sum_{j=1}^k w_j\varphi_j(x,c_j) + b\right]\\
&=-(y^{(i)}-F(x^{(i)}))~\cdot~\varphi_j(x,c_j)\\
w_j &\gets w_j + \eta~(y^{(i)}-F(x^{(i)}))~\varphi_j(x,c_j)
\end{align*}

(The constant factor of 2 from differentiating the square is absorbed into the learning rate $\eta$, and the sum over $i$ disappears because we update online, one training example at a time.)

Similarly, we can derive the update rules for $b$ by computing the partial derivative of the cost function with respect to $b$.

\begin{align*}
\displaystyle\frac{\partial C}{\partial b} &= \displaystyle\frac{\partial C}{\partial F} \displaystyle\frac{\partial F}{\partial b}\\
&=\displaystyle\frac{\partial }{\partial F}\left[\displaystyle\sum_{i=1}^N (y^{(i)}-F(x^{(i)}))^2\right]~\cdot~\displaystyle\frac{\partial }{\partial b}\left[\displaystyle\sum_{j=1}^k w_j\varphi_j(x,c_j) + b\right]\\
&=-(y^{(i)}-F(x^{(i)}))\cdot 1\\
b &\gets b + \eta~(y^{(i)}-F(x^{(i)}))
\end{align*}

Now we have our backpropagation rules!

RBF Net Code

Now that we have a better understanding of how we can use neural networks for function approximation, let’s write some code!

First, we have to define our “training” data and RBF. We’re going to code up our Gaussian RBF.

import numpy as np

def rbf(x, c, s):
    # Gaussian RBF: exp(-(x - c)^2 / (2 s^2)); no normalizing constant needed
    return np.exp(-1 / (2 * s**2) * (x-c)**2)

Now we’ll need to use the k-means clustering algorithm to determine the cluster centers. I’ve already coded up a function for you that gives us the cluster centers and the standard deviations of the clusters.

def kmeans(X, k):
    """Performs k-means clustering for 1D input
    
    Arguments:
        X {ndarray} -- A Mx1 array of inputs
        k {int} -- Number of clusters
    
    Returns:
        (ndarray, ndarray) -- k final cluster centers and the k cluster standard deviations
    """

    # randomly select initial clusters from the input data
    # (without replacement, so we don't start with duplicate centers)
    clusters = np.random.choice(np.squeeze(X), size=k, replace=False)
    prevClusters = clusters.copy()
    stds = np.zeros(k)
    converged = False

    while not converged:
        """
        compute distances for each cluster center to each point 
        where (distances[i, j] represents the distance between the ith point and jth cluster)
        """
        distances = np.squeeze(np.abs(X[:, np.newaxis] - clusters[np.newaxis, :]))

        # find the cluster that's closest to each point
        closestCluster = np.argmin(distances, axis=1)

        # update clusters by taking the mean of all of the points assigned to that cluster
        for i in range(k):
            pointsForCluster = X[closestCluster == i]
            if len(pointsForCluster) > 0:
                clusters[i] = np.mean(pointsForCluster, axis=0)

        # converge if clusters haven't moved
        converged = np.linalg.norm(clusters - prevClusters) < 1e-6
        prevClusters = clusters.copy()

    distances = np.squeeze(np.abs(X[:, np.newaxis] - clusters[np.newaxis, :]))
    closestCluster = np.argmin(distances, axis=1)

    clustersWithNoPoints = []
    for i in range(k):
        pointsForCluster = X[closestCluster == i]
        if len(pointsForCluster) < 2:
            # keep track of clusters with no points or 1 point
            clustersWithNoPoints.append(i)
            continue
        else:
            stds[i] = np.std(X[closestCluster == i])

    # if there are clusters with 0 or 1 points, take the mean std of the other clusters
    if len(clustersWithNoPoints) > 0:
        pointsToAverage = []
        for i in range(k):
            if i not in clustersWithNoPoints:
                pointsToAverage.append(X[closestCluster == i])
        pointsToAverage = np.concatenate(pointsToAverage).ravel()
        stds[clustersWithNoPoints] = np.mean(np.std(pointsToAverage))

    return clusters, stds

This code just implements the k-means clustering algorithm and computes the standard deviations. If there is a cluster with no points or only one point assigned to it, we simply average the standard deviations of the other clusters. (We can’t compute a standard deviation from no data points, and the standard deviation of a single data point is 0.)

We’re not going to spend too much time on k-means clustering. Visit the link at the top for more information.

Now we can get to the real heart of the RBF net by creating a class.

class RBFNet(object):
    """Implementation of a Radial Basis Function Network"""
    def __init__(self, k=2, lr=0.01, epochs=100, rbf=rbf, inferStds=True):
        self.k = k
        self.lr = lr
        self.epochs = epochs
        self.rbf = rbf
        self.inferStds = inferStds

        self.w = np.random.randn(k)
        self.b = np.random.randn(1)

We have options for the number of bases, learning rate, number of epochs, which RBF to use, and whether we want to use the standard deviations from k-means. We also initialize the weights and bias. Remember that an RBF net is a modified 2-layer network, so there’s only one weight vector and a single bias at the output node, since we’re approximating a 1D function (specifically, one output). If we had a function with multiple outputs (a function with a vector-valued output), we’d use multiple output neurons, and our weights would be a matrix and our bias a vector.

Then, we have to write our fit function to compute our weights and biases. In the first few lines, we either use the standard deviations from the modified k-means algorithm, or we force all bases to use the same standard deviation computed from the formula. The rest is similar to backpropagation where we propagate our input going forward and update our weights going backward.

def fit(self, X, y):
    if self.inferStds:
        # compute stds from data
        self.centers, self.stds = kmeans(X, self.k)
    else:
        # use a fixed std 
        self.centers, _ = kmeans(X, self.k)
        dMax = max([np.abs(c1 - c2) for c1 in self.centers for c2 in self.centers])
        self.stds = np.repeat(dMax / np.sqrt(2*self.k), self.k)

    # training
    for epoch in range(self.epochs):
        for i in range(X.shape[0]):
            # forward pass
            a = np.array([self.rbf(X[i], c, s) for c, s in zip(self.centers, self.stds)])
            F = a.T.dot(self.w) + self.b

            loss = (y[i] - F).flatten() ** 2
            print('Loss: {0:.2f}'.format(loss[0]))

            # backward pass
            error = -(y[i] - F).flatten()

            # online update
            self.w = self.w - self.lr * a * error
            self.b = self.b - self.lr * error

For verbosity, we’re printing the loss at each step. Notice we’re also performing an online update, meaning we update our weights and biases after each input. Alternatively, we could have done a batch update, where we update our parameters after seeing all training data, or a minibatch update, where we update our parameters after seeing a subset of the training data.
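
As a hedged sketch of the batch alternative (not from the code download; the gradient names dw and db are ours), the inner loop could accumulate gradients over all examples and apply one update per epoch:

# batch variant of the training loop inside fit: accumulate gradients
# over the whole training set, then update the parameters once per epoch
for epoch in range(self.epochs):
    dw = np.zeros_like(self.w)
    db = np.zeros(1)
    for i in range(X.shape[0]):
        a = np.array([self.rbf(X[i], c, s) for c, s in zip(self.centers, self.stds)])
        F = a.T.dot(self.w) + self.b
        error = -(y[i] - F).flatten()
        dw += a * error
        db += error
    # average the accumulated gradients before stepping
    self.w = self.w - self.lr * dw / X.shape[0]
    self.b = self.b - self.lr * db / X.shape[0]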

Making a prediction is as simple as propagating our input forward.

def predict(self, X):
    y_pred = []
    for i in range(X.shape[0]):
        a = np.array([self.rbf(X[i], c, s) for c, s in zip(self.centers, self.stds)])
        F = a.T.dot(self.w) + self.b
        y_pred.append(F)
    return np.array(y_pred)

Notice that we’re allowing for matrix inputs, where each row is an example.

Finally, we can write code to use our new class. For our training data, we’ll be generating 100 samples from the sine function. Then, we’ll add some uniform noise to our data.

import numpy as np
import matplotlib.pyplot as plt

# sample inputs and add noise
NUM_SAMPLES = 100
X = np.random.uniform(0., 1., NUM_SAMPLES)
X = np.sort(X, axis=0)
noise = np.random.uniform(-0.1, 0.1, NUM_SAMPLES)
y = np.sin(2 * np.pi * X)  + noise

rbfnet = RBFNet(lr=1e-2, k=2)
rbfnet.fit(X, y)

y_pred = rbfnet.predict(X)

plt.plot(X, y, '-o', label='true')
plt.plot(X, y_pred, '-o', label='RBF-Net')
plt.legend()

plt.tight_layout()
plt.show()

We can plot our approximated function against our real function to see how well our RBF net performed.

From our results, our RBF net performed pretty well! If we wanted to evaluate our RBF net more rigorously, we could sample more points from the same function, pass them through our RBF net, and use the summed Euclidean distance as a metric.
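
For instance, here is a quick sketch of that kind of check (using summed squared error as the concrete metric; the variable names are illustrative):

# evaluate on freshly sampled points from the same underlying function
X_test = np.sort(np.random.uniform(0., 1., 50))
y_true = np.sin(2 * np.pi * X_test)
y_hat = rbfnet.predict(X_test).flatten()
print('summed squared error: {0:.4f}'.format(np.sum((y_true - y_hat)**2)))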

We can try messing around with some key parameters, like the number of bases. What if we increase the number of bases to 4?

Our results aren’t too great! This is because our original function is shaped the way that it is, i.e., two bumps. If we had a more complicated function, then we could use a larger number of bases. But if we use too many bases, we’ll start overfitting!

Another parameter we can change is the standard deviation. How about we use a single standard deviation for all of our bases instead of each one getting its own?

Our plot is much smoother! This is because the Gaussians that make up our reconstruction all have the same standard deviation.

There are other parameters we can change like the learning rate; we could use a more advanced optimization algorithm; we could try layering Gaussians; etc.

To summarize, RBF nets are a special type of neural network used for regression. They are similar to 2-layer networks, but we replace the activation function with a radial basis function, specifically a Gaussian radial basis function. We take each input vector and feed it into each basis. Then, we do a simple weighted sum to get our approximated function value at the end. We train these using backpropagation like any neural network! Finally, we implemented RBF nets in a class and used it to approximate a simple function.

RBF nets are a great example of neural models being used for regression!

Want to learn more about how Python can help your career? Check out this article! You can also expand your Python skills with the Python Mini-Degree or, for teachers in need of classroom resources, the Zenva Schools platform.


A Guide to Using Enum in Python for Finite Data States

This post is going to explore the Python enum module, a powerful tool for handling sets of data that don’t change.

Not only is this integral for generalized software development, but it is also something you’ll commonly find featured in games – making it an important skill and foundation to learn regardless of your personal coding goals!

Let’s dive in, and explore this fascinating topic of using enum in Python!

Prerequisites

The reader is expected to have a working knowledge of the Python programming language. Familiarity with how to write classes will also be helpful since Python enums are implemented as Python classes.

If you’re in need of beginner-friendly Python material, we recommend checking out our other Python tutorials or checking out the Python Mini-Degree – a full-feature curriculum of courses focusing on everything you might want to get you started.

For educators looking to bring Python into the classroom, we also recommend checking out Zenva Schools. This K12 platform offers online courses on Python (and other popular topics) along with tools for managing the classroom, reporting on student progress, and beyond.

BUILD YOUR OWN GAMES

Get 250+ coding courses for

$1

AVAILABLE FOR A LIMITED TIME ONLY

What is a Python Enum?

Let’s discuss what a Python enum is. Short for enumeration, this language feature defines a set of names that are bound to constant values such as numbers, strings, etc. Python enums are useful for representing data that takes a finite set of states, such as days of the week, months of the year, etc.

They were added to Python 3.4 via PEP 435, and a backport (the enum34 package on PyPI) makes them available as far back as Python 2.4. As such, you can expect them to be a staple as you explore Python programming.


Simple Example

We define a simple enum class (i.e. a class derived from Enum) containing the months of the year. Each month (i.e. enum member) is assigned a unique numeric constant.

from enum import Enum, unique, Flag

class Months(Enum):
    JANUARY = 1
    FEBRUARY = 2
    MARCH = 3
    APRIL = 4
    MAY = 5
    JUNE = 6
    JULY = 7
    AUGUST = 8
    SEPTEMBER = 9
    OCTOBER = 10
    NOVEMBER = 11
    DECEMBER = 12

Printing out the members of the Python enum

There are several ways we can do this.

# by value
print(Months(7))
Months.JULY

# by name (subscript access)
print(Months['JULY'])
Months.JULY

# by attribute access
print(Months.JULY)
Months.JULY

# the member's name
print(Months.JULY.name)
JULY

# the member's value
print(Months.JULY.value)
7

Ensuring uniqueness

Duplicate member names are never allowed in an enum and always raise an error. Adding the @unique decorator to the class additionally ensures that duplicate values (aliases) don’t exist in the Python enum.

from enum import Enum, unique, Flag

>>> @unique
... class Months(Enum) :
...     JANUARY=1
...     JANUARY=1
...     FEBRUARY=2
...     MARCH =  3
...     APRIL=4
...     MAY=5
...     JUNE=6
...     JULY=7
...     AUGUST=8
...     SEPTEMBER=9
...     OCTOBER=10
...     NOVEMBER=11
...     DECEMBER=12
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 4, in Months
  File "/home/raghavan/anaconda3/lib/python3.8/enum.py", line 112, in __setitem__
    raise TypeError('Attempted to reuse key: %r' % key)
TypeError: Attempted to reuse key: 'JANUARY'

A TypeError occurs when the Python interpreter tries to create this Python enum. Note that reusing a name, as above, fails even without @unique; what @unique adds is rejecting two different names bound to the same value.
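
To see the value-collision case that @unique actually guards against, here is a small illustrative sketch (the class name Status is made up for this example):

from enum import Enum, unique

@unique
class Status(Enum):
    ACTIVE = 1
    ENABLED = 1   # an alias: a different name bound to the same value

# raises ValueError: duplicate values found in <enum 'Status'>: ENABLED -> ACTIVE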

Alternative way to create an Enum

The Python Enum class is callable, and the following functional API can be invoked to create one.

>>> quarter1 = Enum('Q1', [('January', 1), ('February', 2), ('March', 3)])
>>> print(quarter1(3))
Q1.March

# can also be written as
>>> quarter1 = Enum('Q1', 'January February March')
>>> print(quarter1)
<enum 'Q1'>

# item access
>>> quarter1['January']
<Q1.January: 1>

>>> quarter1.March
<Q1.March: 3>

>>> quarter1.March.value
3


Iterating over the elements

For loop

A simple for loop can be used to print out the members.

# simple for loop
for month in Months:
    print(month)

Months.JANUARY
Months.FEBRUARY
Months.MARCH
Months.APRIL
Months.MAY
Months.JUNE
Months.JULY
Months.AUGUST
Months.SEPTEMBER
Months.OCTOBER
Months.NOVEMBER
Months.DECEMBER

The __members__ attribute

__members__ is a read-only class level attribute providing a mapping from names to members. It can be iterated over to produce a similar output as above.

# iteration over elements
for name, member in Months.__members__.items():
    print (name, member)

JANUARY Months.JANUARY
FEBRUARY Months.FEBRUARY
MARCH Months.MARCH
APRIL Months.APRIL
MAY Months.MAY
JUNE Months.JUNE
JULY Months.JULY
AUGUST Months.AUGUST
SEPTEMBER Months.SEPTEMBER
OCTOBER Months.OCTOBER
NOVEMBER Months.NOVEMBER
DECEMBER Months.DECEMBER

Hashing

Python enums can be used as dictionary keys as follows.

>>> months = {}
>>> months[Months.JULY] = 'Many Birthdays'
>>> months[Months.JANUARY] = 'First Month of the Year'

>>> print(months[Months(7)])
Many Birthdays

Auto values

The values corresponding to the names can be populated automatically using auto() as demonstrated below.

# member values using auto
from enum import auto
class Quarter(Enum):
    Q1 = auto()
    Q2 = auto()
    Q3 = auto()
    Q4 = auto()


for qtr in Quarter:
    print (qtr.value)

# Output
1
2
3
4

The values, by default, are numeric. However, they can be converted into, say, string values by overriding _generate_next_value_ in the class.

# member values using auto
from enum import auto
class Quarter(Enum):
    def _generate_next_value_(name, start, count, last_values):
         return "[" + name + "]"
    Q1 = auto()
    Q2 = auto()
    Q3 = auto()
    Q4 = auto()

# test the values
for qtr in Quarter:
    print (qtr.value)

# Output
[Q1]
[Q2]
[Q3]
[Q4]

Woman's hands typing on a laptop

Derived Enumerations

Flag

Flag is quite similar to Enum except it has support for bitwise operations (|, &, ^, ~). These operators can be used to combine multiple Python enum elements into a mask.

The class Planets is derived from Flag and contains the 8 planets currently recognized by the International Astronomical Union. The values of the individual members are required to be powers of two, while combinations of them need not follow that restriction.

class Planets(Flag):
    MERCURY = 1
    VENUS   = 2
    EARTH   = 4
    MARS    = 8
    JUPITER = 16
    SATURN  = 32
    URANUS  = 64
    NEPTUNE = 128

my_planet = {
              Planets.MERCURY: "Closest to the Sun",
              Planets.EARTH:   "People !!!!!",
              Planets.MARS:    "Martians !!!!"
            }

# bitwise OR mask => Mercury / Earth.
mercury_earth = Planets.MERCURY|Planets.EARTH

# we check which planets in the given dictionary are
# MERCURY or EARTH by doing a bitwise AND with the mask above.

for planet_name in my_planet.keys():
    found = bool(planet_name & mercury_earth)
    print ("Found:" + str(planet_name) + ":" + str(found))

# Output

Found:Planets.MERCURY:True
Found:Planets.EARTH:True
Found:Planets.MARS:False
  1. The Planets class derives from Flag. It contains the names and values of the 8 planets.
  2. The my_planet dictionary contains the enum names of Mercury, Earth and Mars as keys.
  3. Create a bit mask by OR-ing Planets.MERCURY and Planets.EARTH.
  4. Iterate through the planet names in the dictionary and do a bit wise AND between the planet name and the mask created in Step 3. Convert the resultant value into a boolean value.
  5. The bitwise & operation will return True where the element key matches Planets.MERCURY or Planets.EARTH and False for any other planet. The output reflects this.

IntEnum

IntEnum is a derived Python enum that also derives from int. Consequently it supports int operations. It provides ordered comparisons that are not provided by the Python Enum class.

>>> from enum import Enum, auto
>>> class Quarter(Enum):
...     Q1 = auto()
...     Q2 = auto()
...     Q3 = auto()
...     Q4 = auto()


# numerical comparisons other than '==' and '!='
>>> Quarter.Q1 < Quarter.Q2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'Quarter' and 'Quarter'


# IntEnum supports comparisons, sorting.

>>> from enum import IntEnum, Enum, auto
>>> class Quarter (IntEnum):
...     Q1 = auto() 
...     Q2 = auto() 
...     Q3 = auto() 
...     Q4 = auto()
...
>>> Quarter.Q1
<Quarter.Q1: 1>
>>> Quarter.Q2
<Quarter.Q2: 2>

>>> Quarter.Q1 < Quarter.Q2
True

>>> sorted([Quarter.Q2, Quarter.Q1, Quarter.Q4, Quarter.Q3])
[<Quarter.Q1: 1>, <Quarter.Q2: 2>, <Quarter.Q3: 3>, <Quarter.Q4: 4>]

# integer operations
>>> Quarter.Q1 + 10
11
  1. Inherit Quarter from Enum.
  2. Try a “<” numerical comparison between Quarter.Q1 and Quarter.Q2. It fails with a TypeError.
  3. Redefine Quarter by inheriting from IntEnum.
  4. The same comparison now succeeds.
  5. We are also able to sort the members of the IntEnum.
  6. Finally, we can perform integer operations on the IntEnum member.

IntFlag

This variation of enum is also a subclass of int, so its members are ints. Members can be combined using bitwise operators, and the result is still an IntFlag member. Using any other (e.g., arithmetic) operators on the members produces plain ints, removing them from IntFlag membership.

from enum import IntFlag, auto
class DaysOfWeek(IntFlag) :
    Monday = auto()
    Tuesday = auto()
    Wednesday = auto()
    Thursday = auto()
    Friday = auto()
    Saturday = auto()
    Sunday = auto()

week_end = DaysOfWeek.Saturday | DaysOfWeek.Sunday

is_mon_weekend = DaysOfWeek.Monday & week_end

print (bool(is_mon_weekend))
False

is_sun_weekend = DaysOfWeek.Sunday & week_end
print (bool(is_sun_weekend))
True
  1. We inherit DaysOfWeek from IntFlag. It contains the days of the week as members.
  2. We create a weekend mask (week_end) by bitwise-ORing Saturday and Sunday.
  3. We bitwise-AND this mask with Monday which correctly gives us a False value.
  4. We bitwise-AND this mask with Sunday which correctly gives us a True value.
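
To illustrate that last point about membership (a small sketch, not part of the original example):

# bitwise operators keep IntFlag membership; arithmetic drops to a plain int
combined = DaysOfWeek.Monday | DaysOfWeek.Tuesday
print(isinstance(combined, DaysOfWeek))               # True -- still an IntFlag member
print(isinstance(DaysOfWeek.Monday + 1, DaysOfWeek))  # False -- now just an int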


Custom Ordering

Both IntEnum and IntFlag allow us to impose a custom ordering on members where the natural ordering of the underlying data does not help.

IntEnum

"GREEN" < "BLUE"
False

class Colors(IntEnum) :
    GREEN = 1
    BLUE  = 2

Colors.GREEN < Colors.BLUE
True
  1. As plain strings, GREEN can never compare as alphabetically less than BLUE – string ordering is fixed.
  2. However, defining them as members of an IntEnum makes this possible.

We’ve covered the basics, but now it’s time for you to explore Python enums yourself – experiment with the variations above and dig into further Python programming topics.

Conclusion

Python enums are extremely useful for managing data that takes a finite set of states. They are also useful for imposing a custom ordering on data. As discussed, there are three main variations of the base Enum: IntEnum, IntFlag, and Flag. Flag and IntFlag can combine states using bitwise operators, while IntEnum and IntFlag are useful for numeric comparisons and sorting.

Where you go with this newfound knowledge is up to you, though. Perhaps you’re interested in making a calendar program suited for a business suite of software. Or, maybe you’re interested in making a game and using enumerations in Python for different status effects. The uses here are pretty limitless, but we hope you’ve come away with some new information at your disposal and are able to use the Python enum to your benefit and success!


How to do Cluster Analysis with Python – Data Science

You can access the full course here: Data Insights with Cluster Analysis

Want more Python topics for a school environment? Try the Zenva Schools platform, which offers online courses, classroom management tools, reporting, pre-made course plans, and more for teachers.

Part 1

In this video we are going to discuss Cluster Analysis. We will discuss the following topics:

  1. Intro to Cluster Analysis – what it is, what its different applications are, and the kinds of algorithms we can expect.
  2. K-means clustering
  3. Density-based Spatial Clustering of Applications with Noise (DBSCAN)
  4. Hierarchical Agglomerative Clustering (HAC)

k-means, DBSCAN and HAC are 3 very popular clustering algorithms which all take very different approaches to creating clusters.

Before diving in, you can also explore why you might want to learn these topics or just what career opportunities these skills will present you with!

Cluster Analysis

Imagine we have some data. In cluster analysis, we want to separate different groups based on the data, in an unsupervised manner – with no a priori information.

Flower petal length and width graph

Looking at a plot of the above data, we can say that it fits into 2 different groups – a cluster of points in the bottom left and a larger, elongated cluster on the top right. When we give this data to a clustering algorithm, it will create a split. Algorithms like k-means need to be told how many clusters we want. In some cases, we don’t need to specify the number of clusters. DBSCAN for instance is smart enough to figure out how many clusters there are in the data.

The data above is from the IRIS data set. This was created by the famous statistician R.A. Fisher, who collected this data set of 3 different species of flowers and plotted their measured properties such as petal width, petal length, sepal width, and sepal length. Since we are doing clustering, we have removed the class labels from the data set, as labels are what the clustering algorithm is trying to give us – a notion of which data points belong together.
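
If you want to reproduce that setup, here is one hedged sketch using scikit-learn (assuming it is installed); we keep only the petal measurements and deliberately ignore the labels:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:4]   # petal length and petal width; class labels unused
print(X[:5])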

Clustering

Clustering means grouping data into clusters so that the data in each cluster has similar attributes or properties. For example, the data in the small cluster in the above plot has small petal length and small petal width.

There are several applications of clustering analysis used across a variety of fields:

  • Market analysis and segmentation
  • Medical imaging – X-rays, MRIs, fMRIs
  • Recommender systems – such as those used on Amazon.com
  • Geospatial data – latitude and longitude coordinates, etc.
  • Anomaly detection

People have used clustering algorithms to detect brain anomalies. We see below various brain images. C1 – C7 are various clusters. This is an example of clustering in the medical domain.

Brain images while brain is clustering info

Another example is a spatial analysis of user-generated ratings of venues on yelp.com

Spatial analysis with restaurant data clustered

Cluster Analysis

A set of points X1…Xn is taken in and analyzed, and out comes a set of mappings from each point to a cluster (X1 -> C1, X2 -> C2, etc.).

Visual representation of points going through cluster analysis

There are several such algorithms that will detect clusters. Some of these algorithms have additional parameters e.g. Number of clusters. These parameters vary for each algorithm.

The input however is a set of data points X1…Xn in any dimensionality i.e. 2D, 3D, 100D etc. For our purposes, we will stick with 2D points. It is hard to visualize data of higher dimensions though there are dimensionality reduction techniques that reduce say 100 dimensions to 2 so that they can be plotted.

The output is a cluster assignment where each point either belongs to a cluster or could be an outlier (noise).

Cluster analysis is a kind of unsupervised machine learning technique, as in general, we do not have any labels. There may be some techniques that use class labels to do clustering but this is generally not the case.

Summary

We discussed what cluster analysis is, various clustering algorithms, and the inputs and outputs of these algorithms. We also discussed various applications of clustering – not necessarily limited to the data science field.

Part 2

In this video, we will look at probably the most popular clustering algorithm i.e. k-means clustering.

Graph with blue and red points

This is a very popular, simple and easy-to-implement algorithm.

In this algorithm, we separate the data into k disjoint clusters. These clusters are defined such that they minimize the within-cluster sum-of-squares. We’ll discuss this more when we look at k-means convergence. Disjoint here means that 1 point cannot belong to more than 1 cluster.

There is only 1 parameter for this algorithm i.e. k (the number of clusters). We need to have an idea of k before we run the algorithm. Sometimes it is obvious how many clusters we should have, but sometimes it is not that clear. We will discuss later how to make this choice.

This algorithm is a baseline algorithm in clustering.

The cluster center/centroid is a point that represents the cluster. The figure above has a red and a blue cluster. X is the centroid – the average of the x and y coordinates. In the blue cluster the average of the x and y coordinates is somewhere in the middle represented by the X in the middle of the square.

K-means Clustering Algorithm

1. Randomly initialize the cluster centers. For example, in the above diagram, we pick 2 random points to initialize the clusters.
2. Assign each point to its nearest cluster using a distance metric such as Euclidean distance.
3. Update the cluster centroids using the mean of the points assigned to it.
4. Go back to 2 until convergence (the cluster centroids stop moving or they move small imperceptible amounts).
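
As a minimal sketch of running this algorithm in practice (assuming scikit-learn is installed; the library implements the same loop described in the steps above):

import numpy as np
from sklearn.cluster import KMeans

# two obvious blobs of 2D points as stand-in data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.cluster_centers_)   # the two final centroids
print(kmeans.labels_[:10])       # cluster assignment for the first ten points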

Let’s go through an example (fake data) and discuss some interesting things about this algorithm.

Fake data graph

Visually, we can see that there are 2 clusters present.

Let’s randomly assign the cluster centers.

Fake data graph with random cluster centers

Let’s now assign each point to the closest cluster.

Fake data graph with points assigned to cluster

The points are now colored blue or red depending on which centroid they are closer to.

Next we need to update our cluster centroids. For the blue centroid, we take the average of all the x-coordinates – this will be the new x-coordinate for the centroid. Similarly we look at all the y-coordinates for the blue points, take their average and this becomes the new y-coordinate for the centroid. Likewise for the red points.

When I do this, the centroids shift over.

Fake data graph with centroids shifted

Once again I need to figure out which centroid each point is close to, which gives me the following.

Fake data graph with points assigned to new centroids

Once again, I update my cluster centroids as before.

Fake data graph with centroids shifted again

If we try to do another shift, the centroids won’t move again. This is evident from the last 2 figures where the same points are assigned to the cluster centroids.

At this point, we say that k-means has converged.

Convergence

Convergence means that the cluster centroids either don’t move at all or move a very small amount. We use a threshold value: if the centroids do not move at least that much, k-means has converged.
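
Here is a hedged sketch of that threshold test (the function name and tolerance are illustrative choices):

import numpy as np

def has_converged(old_centroids, new_centroids, tol=1e-6):
    # stop when no centroid has moved more than tol since the last iteration
    return np.linalg.norm(new_centroids - old_centroids) < tol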

Math equation for determining k-means

Mathematically, k-means is guaranteed to converge in a finite number of iterations (one iteration being assigning points to clusters and shifting the centroids). It may take a long time, but it will eventually converge. Note that this says nothing about the best or optimal clustering – only that it will converge.

K-means is sensitive to where you initialize the centroids. There are a few techniques to do this:

  • Assign each cluster center to a random data point.
  • Choose k points to be farthest away from each other within the bounds of the data.
  • Repeat k-means over and over again and pick the average of the clusters.
  • Another, more advanced approach called k-means++ spreads the initial centers out by picking each new center with probability proportional to its squared distance from the centers already chosen. We won’t be getting into it though.

Choosing k (How many clusters to use)

One way is to plot the data points and try different values to see what works the best. Another technique is called the elbow method.

Elbow method

Data graph with elbow point circled

Steps:

  • Choose some values of k and run the clustering algorithm
  • For each cluster, compute the within-cluster sum-of-squares between the centroid and each data point.
  • Sum up for all clusters, plot on a graph
  • Repeat for different values of k, keep plotting on the graph.
  • Then pick the elbow of the graph.

This is a popular method supported by several libraries.
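
As a rough sketch of the elbow method with scikit-learn (reusing the X array from the earlier k-means sketch; inertia_ is the library’s name for the within-cluster sum-of-squares):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, '-o')
plt.xlabel('k (number of clusters)')
plt.ylabel('within-cluster sum-of-squares')
plt.show()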

Advantages Of k-means

  • It is a widely known and used algorithm.
  • It’s also fairly simple to understand and easy to implement.
  • It is also guaranteed to converge.

Disadvantages of k-means

  • It is algorithmically slow i.e. can take a long time to converge.
  • It may converge to a local minimum rather than the global optimum (the best possible clustering).
  • It’s also not very robust against varying cluster shapes e.g. It may not perform very well for elongated cluster shapes. This is because we use the same parameters for each cluster.

This was a quick overview of k-means clustering. Let’s now look at how it performs on different kinds of data sets.


Transcript 1

Hello, world, and thanks for joining me. My name is Mohit Deshpande. In this course, we’ll be learning about clustering analysis. In particular, we’re gonna use it in the context of data science, and we’re gonna analyze some data and see if we can segment out different kinds of customers so that we can provide them with all kinds of neat promotional discounts or special offers.

This is a very common thing that’s done for a lot of companies. We have a bunch of data. We wanna figure out unique groups of customers so that we can offer them special things.

In this course, we’ll be learning a little bit about three different kinds of clustering algorithms. First, I wanna introduce you to the concept of clustering, or what is clustering, and what are some other applications of it. Then we’ll get onto three different clustering algorithms, and what’s really neat is that they all approach clustering in a very different fashion.

So first, we’re gonna learn about a very popular kind of clustering called K-means clustering. Then we’ll look into a density-based clustering algorithm called DBSCAN. Then finally we’ll discuss a hierarchical clustering algorithm called hierarchical agglomerative clustering. And then we’ll see different kinds of data, where they work really well, and where they don’t work quite so well. So we’ll get a good idea of, at least some kind of notion, of which kind of clustering algorithms tend to work well on which kind of data.

Then, finally, we’re gonna tie everything together by looking at a real world data set and see if we can segment out different customers. We’ve been making courses since 2012, and we’re super-excited to have you on board. Online courses are a great way to learn new skills, and I take a lot of online courses myself.

Several courses consist mainly of video lessons that you can watch at your own pace and as many times as you want. So feel free to watch and re-watch and pause the video as many times as you want. We also have downloadable source code and project files and you’re getting everything that we build during the lesson.

It’s highly recommended that you code along with me. In my experience, that’s the best way to learn something is to get your feet wet. Lastly, we’ve seen the students who get the most out of these online courses are those who make a weekly plan and stick with it, depending, of course, on your own availability and learning style. Zenva, over the past six years, has taught all kinds of different topics on programming and game development, over 300,000 students, across 100 courses. The skills that they learned in these courses are completely transferrable. In fact, some of the students have used the skills that they’ve learned to advance their careers, to make a start-up, or to publish their own content from the skills that they’ve learned.

Thanks for joining, and I look forward to seeing the cool stuff you’ll be building. Now, without further ado, let’s get started.

Transcript 2

In this video, we are going to learn a little bit about cluster analysis. And this is a topic that we’re gonna be discussing over the duration of this course.

So just to give you an overview of the different things we’re gonna be covering, I’m gonna give you an introduction to cluster analysis, basically what is it and what are the different applications of it, as well as what kind of algorithms can we expect. And in fact, we’re gonna be covering three very popular algorithms, k-means clustering, DBSCAN, which stands for density-based spatial clustering of applications with noise, but usually we just call it DBSCAN, and then hierarchical agglomerative clustering, HAC.

These are three very popular clustering algorithms. And the interesting thing is they all take very different approaches to creating clusters. And we’re gonna get into all those in the subsequent videos. But first let’s talk a little bit about cluster analysis. And that’s what we’re gonna be focusing on primarily in this video, just to acquaint you with some of the terminology as well as some applications of cluster analysis for example. So clustering analysis, so imagine we have some data.

The whole point of clustering analysis is in an unsupervised way with no prior information, we want to be able to separate different groups based on the data that we have. And now sometimes these groups are predefined. You have the set of data like in this case, and you say, well, this seems, we plot this data, and you say, well, it seems to fit into two little groups. Here there’s a little clustering of points on the bottom left, and there’s a larger, kind of elongated cluster on the top right and so we might say, well, we can give a predefined number of clusters.

We want two clusters and we can give that to the clustering algorithms and then they’ll group these guys together. It’ll make a split and it actually, in some cases, we don’t need to specify the number of clusters. In fact, some algorithms, which is DBSCAN, are actually smart enough to be able to figure out how many clusters are based entirely on the data. But algorithms like k-means will actually need to be specified how many clusters that we have.

And so, for example, this data set is actually taken, it’s a very famous data set called the Iris Dataset, collected by Ronald Fisher, which is, and here is a quick historical side note, he’s probably the most important statistician of the 20th century. A lot of statistical techniques that we have that are used in all kinds of companies were originally some of his work, but he collected this dataset of flowers. He has 50 samples each of three different species of flowers and he plots their measured properties like petal width, petal length, sepal width, and sepal length, and they’re all plotted out.

In this case, what I’ve actually done is removed the class labels, because usually when we’re doing clustering analysis, we don’t have the correct labels. In fact, that’s what the clustering is trying to give us. It’s trying to give us some notion that these things belong together and these other things belong together. So this is just a kind of data that you might expect with some clustering.

So clustering is taking our data and then putting it into groups such that the groups have some kind of similar properties or similar attributes here. So if we go back a slide here, so we have one cluster at the bottom left for example. That might be considered a cluster where the flowers in that cluster have a small petal length and a smaller petal width, for example. That’s an example of grouping, as I’m talking about.

And there’s so many different applications of clustering analysis, not just used for something like data science. But also things like medical imaging for things like x-rays or MRIs or FMRIs. They use clustering analysis. Recommender systems like those used on websites like Amazon.com and whatnot, they recommend you can use clustering analysis to help build recommendation systems. Geospatial data, of course, our data is in latitude and longitude coordinates and we can do some kind of clustering with that as well. And we can also use it for things like anomaly detection in our data. The top bullet point here is that we can use it for market analysis and segmentation, a very popular technique for that as well.

And so this kind of gives you a little bit of the applications of clustering analysis and certainly the algorithms that we’re gonna be learning are used in the field and can be used for your data as well. So for example, people have used clustering algorithms to actually do things like, to check brain anomalies, so here are just images I’ve taken of the brain and the different C-1 to C-7 are different clusters of the brain. They use a different clustering metric. And the result is, you can kind of segment out different kinds of brain anomalies. So it’s an application of clustering in the medical domain.

And another domain, what these guys have done is look at reviews on Yelp.com, and they’ve performed a spatial analysis of that. So there are a lot of different applications of clustering in many different fields.

So just a quick overview of clustering, but we can get a little bit more formal with this, with the algorithms. So you can think of clustering analysis on the bottom here, as just taking in a set of points, X1 to Xn. And then we churn it through a machine that is the cluster analysis, and then out comes an assignment that maps each point to a particular cluster. So on the bottom here, X1 is mapped to Cluster 1, X2 is mapped to Cluster 2, and there are a lot of different algorithms that exist to detect clusters.

In fact, like I’ve mentioned, we’re gonna be covering three of them, of the most popular ones. Now the parameters, some of these algorithms have additional parameters to them. So like I mentioned before, one of those parameters might be the number of clusters that you have that might have to be something that’s pre-defined that you would give to an algorithm. But there are sometimes other parameters as well, and they vary for each algorithm, so there’s no uniformness among the parameters. However, like you see in the chart here, the input to clustering algorithms is usually a set of data points, X1 to Xn, and they can be of any dimensionality, they can be in 2D, they can be 2D points, they can be 3D points, they can be 100 dimensional points.

But for our purposes, we’re just gonna stick with 2D points. It is certainly possible to do clustering on higher dimensionality data, but one of the issues is actually visualizing that higher dimensional data. Because we can’t really plot a hundred dimensional graph terribly easily, but there exist dimensionality production techniques that can press 100 dimensions down into two so we can plot it, but that’s a little beyond the scope of the course here. So we’re mostly gonna be sticking with 2D.

And then the output, like I said, is a cluster assignment. So each data point belongs to a cluster, or actually in some cases, some algorithms have some notion of outliers or noise built in, so just DBSCAN for example. So a point may not necessarily belong to an exact cluster, it might also belong to, it might be a noise point or some people like to think of it as being a noise cluster, although that’s not quite correct. But that is the output, and this is kind of a generalization of clustering algorithms. They take in this set of input points and then they produce a mapping that maps each point to a cluster or perhaps to a noise point.

And you can think of clustering analysis as being a kind of unsupervised machine learning, per se. And it’s unsupervised because generally, with clustering analysis, we don’t have any labels for our data. We just get a set of points and we’re set to cluster them. And so this unsupervised, and there are techniques that might use class labels to help do some kind of clustering or provide some other auxiliary information, but in many, many cases, clustering analysis is done unsupervised. We don’t have correct labels.

Alright, so that is just a quick overview of clustering analysis, so basically, what is clustering analysis, what kind of inputs we give to a clustering algorithm, what kinds of outputs do we expect from a clustering algorithm, and also discuss a little bit about different applications of clustering algorithms used in a wide variety of fields. It doesn’t even have to be something like data science or computer science, it can be used in the medical field, in business, finance, and so learning this information, we can take this and apply it to many other fields. So that is just a quick introduction to clustering analysis.

Transcript 3

In this video we are going to look at probably the most popular clustering algorithm called K-means clustering. So K-means clustering, the point of it is to take our cluster data and then separate it into K disjoint clusters, and these clusters are defined so that they minimize what’s called the within-cluster sum-of-squares, but we’ll get to that a bit later. You’ll see what that means when we discuss a little bit more about K-means convergence.

So the point of K-means clustering is to separate the data into K disjoint clusters. What I mean by disjoint clusters means that one point cannot belong to more than one cluster, and the only parameter that we have to set for this algorithm is the number of clusters. So you’ll see an example of an algorithm where you need to have some notion of how many clusters you want your data to have before you run the algorithm. Later, we’re gonna discuss how you can select this value of K, because sometimes it’s quite obvious how many clusters you have, but many times it’s not quite so clear how many clusters you should have, and so we’ll discuss a couple techniques a bit later on how to choose this parameter K.

So K-means clustering is very popular, very well-known. It actually, the algorithm itself is quite simple. It’s just a couple lines of algorithm code and the code itself is writing, if you had to write it from scratch also, it’s not something that would take you a long time. It’s very popular. I mean, it’s taught in computer science curriculums very commonly. So knowing this algorithm is kind of the first step to getting more acquainted with clustering. It’s kind of the baseline algorithm, and we’ll move on to more complicated algorithms a bit later, and then one point of terminology here.

You’ll often hear something called a cluster center or a centroid, and really that’s just a point that represents a cluster, so in the figure that I have here on the bottom right, we have two clusters. We have a red cluster and a blue cluster, and the X is the centroid. In other words, the centroid is really like the average of the X coordinates and the average of the Y coordinates and then that’s the point. So you can see in the blue, the average, if I average the X coordinates is somewhere gonna be in the middle, and if I average the Y coordinates, that’s gonna be somewhere in the middle and we end up with some centroid that’s in the middle of that square, and so that’s what I mean by centroid. So that’s all we really need to know about K-means.

Now we can get to the algorithm, we can actually get to the algorithm. I’m going to go through an example with you so you can kinda see it progress step-by-step with just some fake data, and then we’ll discuss some more interesting things about it such as convergence, properties, you know, talk a little bit about what convergence is, and then we’ll also discuss a bit later how do you select the value of K and what not. So but we’re gonna get to that.

So the clustering algorithm, the first step to the clustering is you randomly initialize the cluster centers. So suppose I have some data, that I have two clusters of four like in the previous chart. I have two clusters, so what I do is I just pick two random points to initialize the clusters. Now, the algorithm doesn’t, there are many ways to do this, and we’ll discuss a couple different ways a bit later, and so after you pick, we randomly initialize our cluster centers, then for each point we assign it to its nearest cluster using some kind of like metric, like the distance formula for example, Euclidean distance, and then we update the cluster centroids by taking the mean of the points assigned to it.

So after each point is assigned to a centroid, we look at the centroids and see how many points are assigned to this one centroid, and then we just average all the points, and that’s gonna shift the centroid. We basically keep doing this.

After the centroids have shifted, we then reset all of our points and figure out where the near, who, assign them to the nearest cluster again, and we just keep doing this until the cluster centroids stop moving or they move a very, very small imperceptible amount. Now, let’s actually do an example of this with some data so you have a better understanding of what I mean by each of these steps. So here is our data and we wanna cluster it and we want it to, we’re setting K equals two. There’s two clusters, okay?

If we look at this data we can kinda see that, well, yeah, there are two clusters here. So here is just like randomly initialized the two clusters. There’s red and blue, and just placed ’em somewhere on the plane. So now what I want to do is for each point, I’m going to assign it to the closest cluster. So after I do that I get something like this. So now I have colored all the points red or blue depending on if it’s closer to the red centroid or closer to the blue centroid.

So after I’ve done this, now I need to update my cluster centroid. So how I do that is I look at all the X, for example, for the blue centroid, I’ll look at all the X coordinates and take the average and that’s my new X coordinate for the centroid. I look at all the Y coordinates of all the blue points and I take the average of those and that becomes the new Y coordinate for my blue centroid. So after I do this, the centroids kinda shift over and then I reset the colors of all the points, and now I just do the same thing again.

So now for each point I have to figure out which centroid is it close to. So after I do that I get this assignment, and now I, again, take the average of the X coordinates and the average of the Y coordinates for each cluster and that becomes the location of my new centroid, and I shift it over again, and so now it’s gonna be here, and now let’s do another, now I do this one more time, and you’ll see that now, where if I try to do another shift, that nothing will, the cluster centroids won’t actually move again because in the previous step we see that the same points are assigned to the same cluster centroids.

So in other words, my centroids aren’t gonna move, and at this point we say that K-means has converged. So I hope that this gives you a bit of an overview of a K-means here as in how do you actually run the algorithm in step. So like I mentioned convergence, just ignore that big equation. It looks kinda scary but this is just to illustrate a point here. So convergence means that the cluster centroids either don’t move entirely or they move a very, very small very small amount, and you usually just set a threshold value. So if they don’t move at least this much, then we say they’ve converged.

The reason that I put this equation up here is just to illustrate the point that mathematically K-means is guaranteed to converge in a finite number of iterations, and by iterations I mean assigning the points to a cluster and then shifting and then assigning the points to a cluster and shifting.

In a finite number of iterations, it is mathematically guaranteed to converge. Now, “finite” still means it might take a long time, but it will eventually converge. Note that this convergence theorem doesn’t say anything about converging to the best or optimal clustering; it just says that K-means will converge. This is still a useful property to know about, because it isn’t necessarily true of other algorithms.
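To make these steps concrete, here is a minimal NumPy sketch of the algorithm as we’ve described it. The function name and tolerance value are my own choices, and it assumes every cluster keeps at least one assigned point (real implementations handle empty clusters explicitly).

import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6):
    """Minimal K-means sketch. X is an (n_points, n_dims) array."""
    # Step 1: randomly initialize centroids by picking k random data points
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 4: stop once the centroids barely move (convergence)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return centroids, labels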

One other point I want to discuss is that K-means is actually quite sensitive to where you initialize the centroids, and there are a couple of techniques for doing this. One technique people like to use is to pick a random data point and say that is my cluster center. Another is to choose K points that are farthest away from each other within the bounds of the data, between the smallest and largest X coordinates and the smallest and largest Y coordinates. And another approach is to run K-means many times over and over again.

You just run the algorithm over and over again using one of these initializations and keep the best clustering you find. There’s another, more advanced approach called K-means++ that does all kinds of analysis of variance, but we’re not going to get into that. In fact, we don’t have to implement it either, because it’s usually built into many libraries, as shown below.
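For reference, scikit-learn’s KMeans class uses K-means++ initialization by default and can rerun the algorithm with several initializations, keeping the best result. Here’s a quick sketch, assuming X is your data array; the parameter values are just illustrative.

from sklearn.cluster import KMeans

# init='k-means++' is the default; n_init reruns K-means with different
# initializations and keeps the run with the lowest within-cluster error
km = KMeans(n_clusters=2, init='k-means++', n_init=10).fit(X)
print(km.cluster_centers_)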

Okay, so the only thing that remains to talk about is how you choose a value of K. One simple thing to do is just plot your data, try a couple of K values, and see which clustering works best. Another technique that’s actually quite popular is called the elbow method. You choose some values of K, run the clustering algorithm for each, and compute something called the within-cluster sum of squares: for each cluster, you take the centroid and each data point assigned to it, sum the squared differences between them, and then sum that quantity over all the clusters. You plot this value on a graph for different values of K and pick the “elbow” of the curve. It’s not a clear-cut rule for exactly which value of K you should use; it just gives you an idea of the right range of K to choose from.
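Here’s a rough sketch of the elbow method using scikit-learn, where inertia_ is scikit-learn’s name for the within-cluster sum of squares (again, X is assumed to be your data array):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(ks, wcss, 'o-')
plt.xlabel('K')
plt.ylabel('Within-cluster sum of squares')
plt.show()  # look for the "elbow" where the curve starts to flatten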

This method is very popular and, again, there are lots of libraries that can generate the elbow plot for you, though as you can see, you can do it yourself as well. It’s just a technique you might want to use to figure out what value of K to choose. Okay, so that pretty much covers K-means clustering.

So I want to talk a little bit about some of the advantages and disadvantages of K-means, and then later we’re going to actually see it run on different kinds of data and see how well it performs there. First, it’s widely known and used, and very popular, like I said. It’s also a fairly simple algorithm: it’s really only about four steps, and in terms of lines of code it’s not that long either, so it’s easy to implement. One big perk is that it’s guaranteed to converge, which is pretty nice.

As for the disadvantages: K-means is algorithmically pretty slow, if you know anything about big-O notation, and it can take a long time to converge. Another issue is that it might not converge to the best solution: it’s guaranteed to converge, but not necessarily to the optimal clustering. It’s also not very robust against clusters with varied shapes, elongated clusters for example, where it might not perform so well.

We’re actually going to see this when we look at some fake data; we’ll see how well it performs against elongated data, for example. The reason for this weakness is that we use the same parameters for every cluster. So anyway, that’s a quick overview of K-means clustering, and now that we have this example, we can go ahead and analyze how it performs on different kinds of data sets.

Interested in continuing? Check out the full Data Insights with Cluster Analysis course, which is part of our Data Science Mini-Degree.

For teachers, you can also try Zenva Schools, which offers online learning suitable for K12 school environments across a wide array of topics, including Python, web development, Unity, and more.

]]>
Free Course – Learn Object-Oriented Programming with Python https://gamedevacademy.org/python-oop-tutorial/ Fri, 02 Dec 2022 01:00:50 +0000 https://gamedevacademy.org/?p=19262 Read more]]>

Master object-oriented programming techniques for games using Python to store and manipulate program data. You can explore more about object-oriented programming techniques using Python in the full course below!

PYTHON PROJECTS – OBJECT-ORIENTED GAME

About

In this course taught by instructor Nimish Narang, you’ll learn object-oriented programming principles for creating classes and objects. You’ll explore some of the paradigm’s key techniques, including how to use variables to represent attributes and properties of an object. You’ll also practice by implementing a GameObject with properties and methods, and learn how to create (or instantiate) a GameObject that uses your class design. Regardless of what kind of coding projects interest you, these game and OOP foundations will prepare you to pursue any number of interactive projects with responsible habits.

BUILD GAMES

FINAL DAYS: Unlock 250+ coding courses, guided learning paths, help from expert mentors, and more.

]]>
Clustering with Gaussian Mixture Models – Data Science & ML https://gamedevacademy.org/clustering-with-gaussian-mixture-models/ Tue, 29 Nov 2022 12:26:18 +0000 https://pythonmachinelearning.pro/?p=1851 Read more]]> Gaussian Mixture Models are an essential part of data analysis – but do you know how they work?

In this article, we’ll seek to demystify how to analyze “clusters” of data and discuss Gaussian Mixture Models which can help us more efficiently and accurately sort out clusters of data points.

Let’s dive in!

What is Clustering?

Clustering is an essential part of any data analysis. Using an algorithm such as K-Means leads to hard assignments, meaning that each point is definitively assigned a cluster center. This leads to some interesting problems: what if the true clusters actually overlap? What about data that is more spread out; how do we assign clusters then?

Gaussian Mixture Models save the day! We will review the Gaussian or normal distribution method and the problem of clustering. Then we will discuss the overall approach of Gaussian Mixture Models. Training them requires using a very famous algorithm called the Expectation-Maximization Algorithm that we will discuss.

Download the full code here.

If you are not familiar with the K-Means algorithm or clustering, read about it here.

BUILD GAMES

FINAL DAYS: Unlock 250+ coding courses, guided learning paths, help from expert mentors, and more.

Gaussian Distribution

The first question you may have is “what is a Gaussian?”. It’s the most famous and important of all statistical distributions. A picture is worth a thousand words so here’s an example of a Gaussian centered at 0 with a standard deviation of 1.

This is the Gaussian or normal distribution! It is also called a bell curve sometimes. The function that describes the normal distribution is the following

\[ \mathcal{N}(x; \mu, \sigma^2) = \displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\displaystyle\frac{(x-\mu)^2}{2\sigma^2}}  \]

That looks like a really messy equation! And it is, so we’ll use $\mathcal{N}(x; \mu, \sigma^2)$ to represent that equation. If we look at it, we notice there are one input and two parameters. First, let’s discuss the parameters and how they change the Gaussian. Then we can discuss what the input means.

The two parameters are called the mean $\mu$ and standard deviation $\sigma$. In some cases, the standard deviation is replaced with the variance $\sigma^2$, which is just the square of the standard deviation. The mean of the Gaussian simply shifts the center of the Gaussian, i.e., the “bump” or top of the bell. In the image above, $\mu=0$, so the largest value is at $x=0$.

The standard deviation is a measure of the spread of the Gaussian. It affects the “wideness” of the bell. Using a larger standard deviation means that the data are more spread out, rather than closer to the mean.

What about the input? More specifically, the above function is called the probability density function (pdf), and it tells us how likely we are to observe an input $x$ under that specific normal distribution. Looking at the graph above, the density at an input value of 0 is about 0.4, the peak of the curve. As we move away from the center in either direction, the density decreases. This is one key property of the normal distribution: the density is highest at the mean, and it approaches zero as we move away from the mean.

(Since this is a probability distribution, the sum of all of the values under the bell curve, i.e., the integral, is equal to 1; we also have no negative values.)
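As a quick sanity check of that formula, we can evaluate the pdf in code, either with SciPy or by writing out the equation directly; both give the same numbers.

import numpy as np
from scipy.stats import norm

# Standard normal: mu = 0, sigma = 1
print(norm.pdf(0))  # ~0.3989, the peak of the bell at the mean
print(norm.pdf(2))  # ~0.0540, much smaller away from the mean

# The same value from the formula itself
mu, sigma, x = 0.0, 1.0, 0.0
print(1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2)))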

Why are we using the Gaussian distribution? The Expectation-Maximization algorithm is actually broader than just the normal distribution, but what makes Gaussians so special? It turns out that many dataset distributions are actually Gaussian! We find these Gaussians in nature, mathematics, physics, biology, and just about every other field. They are ubiquitous! There is a famous theorem in statistics called the Central Limit Theorem that states that as we collect and aggregate more and more samples from a dataset, the result tends to resemble a Gaussian, even if the original dataset distribution is not Gaussian! This makes Gaussians very powerful and versatile!

Multivariate Gaussians

We’ve only discussed Gaussians in 1D, i.e., with a single input. But they can easily be extended to any number of dimensions. For Gaussian Mixture Models, in particular, we’ll use 2D Gaussians, meaning that our input is now a vector instead of a scalar. This also changes our parameters: the mean is now a vector as well! The mean represents the center of our data so it must have the same dimensionality as the input.

The variance changes less intuitively into a covariance matrix $\Sigma$. The covariance matrix, in addition to telling us the variance of each dimension, also tells us the relationship between the inputs, i.e., if we change x, how does y tend to change?

We won’t discuss the details of the multivariate Gaussian or the equation that generates it, but knowing what it looks like is essential to Gaussian Mixture Models since we’ll be using these.

The above chart shows two different ways to represent the 2D Gaussian. The upper plot is a surface plot that shows our 2D Gaussian in 3D. The X and Y axes are the two inputs, and the Z axis represents the probability density. The lower plot is a contour plot. I’ve plotted these on top of each other to show how the contour plot is just a flattened surface plot, where color is used to indicate the height. The lighter the color, the larger the probability. The Gaussian contours resemble ellipses, so our Gaussian Mixture Model will look like it’s fitting ellipses around our data. Since the surface plot can get a little difficult to visualize on top of data, we’ll be sticking to the contour plots.

Gaussian Mixture Models

Now that we understand Gaussians, let’s move on to Gaussian Mixture Models (GMMs)! To motivate our discussion, let’s see some example data that we want to cluster.

We could certainly cluster these data using an algorithm like K-Means to get the following results.

In this case, K-Means works out pretty well. But let’s consider another case where we have overlap in our data. (The two Gaussians are colored differently)

In this case, it’s pretty clear from the elliptical shape of the 2D Gaussians that these data were generated from Gaussians. In fact, we know that these data follow the normal distribution, so using K-Means doesn’t seem to take advantage of that fact. Even though I didn’t tell you our data were normally distributed, remember that the Central Limit Theorem says that enough random samples from any distribution will start to look like the normal distribution.

Additionally, K-Means doesn’t take into account the covariance of our data. For example, the blue points seem to have a relationship between X and Y: larger X values tend to produce larger Y values. If we had two points that were equidistant from the center of the cluster, but one followed the trend and the other didn’t, K-Means would regard them as equal, since it uses Euclidean distance. But it certainly seems more likely that the point that follows the trend should match the Gaussian more closely than the point that doesn’t.

Since we know these data are Gaussian, why not try to fit Gaussians to them instead of a single cluster center? The idea behind Gaussian Mixture Models is to find the parameters of the Gaussians that best explain our data.

This is what we call generative modeling. We are assuming that these data are Gaussian and we want to find parameters that maximize the likelihood of observing these data. In other words, we regard each point as being generated by a mixture of Gaussians and can compute that probability.

\begin{align}
p(x) = \displaystyle\sum_{j=1}^{k} \phi_j\mathcal{N}(x; \mu_j, \Sigma_j)\\
\displaystyle\sum_{j=1}^{k} \phi_j = 1
\end{align}

The first equation tells us that a particular data point $x$ is modeled by a linear combination of the $k$ Gaussians. We weight each Gaussian with $\phi_j$, which represents the strength of that Gaussian. The second equation is a constraint on the weights: they all have to sum up to 1. We have three different sets of parameters that we need to update: the weights for each Gaussian $\phi_j$, the means of the Gaussians $\mu_j$, and the covariances of each Gaussian $\Sigma_j$.

If we try to directly solve for these, it turns out that we can actually find closed forms! But there is one huge catch: we have to know which Gaussian each point was generated from! In other words, if we knew exactly which combination of Gaussians a particular point was drawn from, then we could easily figure out the means and covariances! But this one critical flaw prevents us from solving GMMs using this direct technique. Instead, we have to come up with a better approach to estimate the weights, means, and covariances.

Expectation-Maximization

How do we learn the parameters? There’s a very famous algorithm called the Expectation-Maximization Algorithm, or EM algorithm for short (published in 1977, with over 50,000 paper citations!), that we’ll use for updating these parameters. As the name suggests, there are two steps in this algorithm: expectation and maximization. To explain these steps, I’m going to cover how the algorithm works at a high level.

The first part is the expectation step. In this step, we have to compute the probability that each data point was generated by each of the $k$ Gaussians. In contrast to the K-Means hard assignments, these are called soft assignments since we’re using probabilities. Note that we’re not assigning each point to a Gaussian, we’re simply determining the probability of a particular Gaussian generating a particular point. We compute this probability for a given Gaussian by computing $\phi_j\mathcal{N}(x; \mu_j, \Sigma_j)$ and normalizing by dividing by $\sum_{q=1}^{k} \phi_q\mathcal{N}(x; \mu_q, \Sigma_q)$. We’re directly applying the Gaussian equation, but multiplying it by its weight $\phi_j$. Then, to make it a probability, we normalize. In K-Means, the expectation step is analogous to assigning each point to a cluster.

The second part is the maximization step. In this step, we need to update our weights, means, and covariances. Recall in K-Means, we simply took the mean of the set of points assigned to a cluster to be the new mean. We’re going to do something similar here, except apply our expectations that we computed in the previous step. To update a weight $\phi_j$, we simply sum up the probability that each point was generated by Gaussian $j$ and divide by the total number of points. For a mean $\mu_j$, we compute the mean of all points weighted by the probability of that point being generated by Gaussian $j$. For a covariance $\Sigma_j$, we compute the covariance of all points weighted by the probability of that point being generated by Gaussian $j$. We do each of these for each Gaussian $j$. Now we’ve updated the weights, means, and covariances! In K-Means, the maximization step is analogous to moving the cluster centers.

Mathematically, at the expectation step, we’re effectively computing a matrix where the rows are the data points and the columns are the Gaussians. The element at row $i$, column $j$ is the probability that $x^{(i)}$ was generated by Gaussian $j$.

\[ W^{(i)}_j =  \displaystyle\frac{\phi_j\mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\displaystyle\sum_{q=1}^{k}\phi_q\mathcal{N}(x^{(i)}; \mu_q, \Sigma_q)} \]

The denominator just sums over all values to make each entry in $W$ a probability. Now, we can apply the update rules.

\begin{align*}
\phi_j &= \displaystyle\frac{1}{N}\sum_{i=1}^N W^{(i)}_j\\
\mu_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j x^{(i)}}{\sum_{i=1}^N W^{(i)}_j}\\
\Sigma_j &= \displaystyle\frac{\sum_{i=1}^N W^{(i)}_j (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^N W^{(i)}_j}
\end{align*}

The first equation is just the sum of the probabilities for a particular Gaussian $j$ divided by the number of points. In the second equation, we’re just computing the mean, except we weight each point by the probability that it belongs to that cluster. Similarly, in the last equation, we’re just computing the covariance, except we weight by those same probabilities.
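Here is a minimal NumPy sketch of one full EM iteration following those equations. It leans on SciPy’s multivariate_normal for the Gaussian pdf, and the function and variable names are my own; in practice you would use a library implementation like scikit-learn’s, shown below.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phis, mus, sigmas):
    """One E-step + M-step for a GMM. X is (N, d); phis has length k."""
    N, k = X.shape[0], len(phis)

    # E-step: W[i, j] = probability that point i was generated by Gaussian j
    W = np.zeros((N, k))
    for j in range(k):
        W[:, j] = phis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=sigmas[j])
    W /= W.sum(axis=1, keepdims=True)  # normalize each row into probabilities

    # M-step: re-estimate the weights, means, and covariances
    Nj = W.sum(axis=0)            # "effective" number of points per Gaussian
    phis = Nj / N
    mus = (W.T @ X) / Nj[:, None]
    sigmas = []
    for j in range(k):
        diff = X - mus[j]
        sigmas.append((W[:, j, None] * diff).T @ diff / Nj[j])
    return phis, mus, np.array(sigmas)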

Applying GMMs

Let’s apply what we learned about GMMs to our dataset. We’ll be using scikit-learn to run a GMM for us. In the ZIP file, I’ve saved some data in a numpy array. We’re going to extract it, create a GMM, run the EM algorithm, and plot the results!

First, we need to load the data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

X_train = np.load('data.npy')

Additionally, we can generate a plot of our data using the following code.

plt.plot(X_train[:, 0], X_train[:, 1], 'bx')  # the loaded data is in X_train
plt.axis('equal')
plt.show()

Remember that clustering is unsupervised, so our input is only a 2D point without any labels. We should get the same plot of the 2 Gaussians overlapping.

Using the GaussianMixture class of scikit-learn, we can easily create a GMM and run the EM algorithm in a few lines of code!

gmm = GaussianMixture(n_components=2)
gmm.fit(X_train)

After our model has converged, the weights, means, and covariances should be solved! We can print them out.

print(gmm.means_)
print('\n')
print(gmm.covariances_)

For comparison, I generated the original data according to the following Gaussians.

\begin{align*}
\mu_1 &= \begin{bmatrix}3 \\ 3\end{bmatrix}\\
\Sigma_1 &=
\begin{bmatrix}
1 & 0.6 \\
0.6 & 1
\end{bmatrix}\\
\mu_2 &= \begin{bmatrix}1.5 \\ 1.5\end{bmatrix}\\
\Sigma_2 &=
\begin{bmatrix}
1 & -0.7 \\
-0.7 & 1
\end{bmatrix}
\end{align*}

Our means should be pretty close to the actual means! (Our covariances might be a bit off due to the weights.) Now we can write some code to plot our mixture of Gaussians.

# Evaluate the GMM over a 50x50 grid spanning the data
X, Y = np.meshgrid(np.linspace(-1, 6), np.linspace(-1, 6))  # 50 samples per axis by default
XX = np.array([X.ravel(), Y.ravel()]).T
Z = gmm.score_samples(XX)  # weighted log probabilities at each grid point
Z = Z.reshape((50, 50))
plt.contour(X, Y, Z)
plt.scatter(X_train[:, 0], X_train[:, 1])

plt.show()

This code simply creates a grid of all X and Y coordinates between -1 and 6 (for both) and evaluates our GMM. Then we can plot our GMM as contours over our original data.

The plot looks just as we expected! Recall that with the normal distribution, we expect to see most of the data points around the mean and less as we move away. In the plot, the first few ellipses have most of the data, with only a few data points towards the outer ellipses. The darker the contour, the lower the score.

(The score isn’t quite a probability: it’s actually a weighted log probability. Remember that each point is generated by a weighted sum of Gaussians, and, in practice, we apply a logarithm for numerical stability, thus preventing underflow.)

To summarize, Gaussian Mixture Models are a clustering technique that allows us to fit multivariate Gaussian distributions to our data. GMMs work well when our data actually are Gaussian or we suspect them to be. We also discussed the famous Expectation-Maximization algorithm, at a high level, to see how we can iteratively solve for the parameters of the Gaussians. Finally, we wrote code to train a GMM and plot the resulting Gaussian ellipses.

Gaussian Mixture Models are an essential part of data analysis and anomaly detection – which are important to learn when pursuing exciting developer careers! You can also find out more resources for exploring Python here.

]]>
Face Recognition with Eigenfaces – Computer Vision Tutorial https://gamedevacademy.org/face-recognition-with-eigenfaces/ Sun, 02 Oct 2022 02:49:02 +0000 https://pythonmachinelearning.pro/?p=1878 Read more]]> Face recognition – or the ability of computers to recognize faces and facial features – is an imminent concern to our future.

In this tutorial, we’re going to explore face recognition in-depth and learn how with techniques like eigenfaces, we can create our own software programs capable of identifying human faces. The applications for facial recognition are endless, and whether good or bad, it’s a skill that can help take your Python skills to the next level.

Let’s jump in and explore the world of computer vision and faces!

Intro & Project Files

Face recognition is ubiquitous in science fiction: the protagonist looks at a camera, and the camera scans his or her face to recognize the person. More formally, we can formulate face recognition as a classification task, where the inputs are images and the outputs are people’s names. We’re going to discuss a popular technique for face recognition called eigenfaces. And at the heart of eigenfaces is an unsupervised dimensionality reduction technique called principal component analysis (PCA), and we will see how we can apply this general technique to our specific task of face recognition.

Download the full code here.

BUILD GAMES

FINAL DAYS: Unlock 250+ coding courses, guided learning paths, help from expert mentors, and more.

Face Recognition

Before discussing principal component analysis, we should first define our problem. Face recognition is the challenge of classifying whose face is in an input image. This is different than face detection where the challenge is determining if there is a face in the input image. With face recognition, we need an existing database of faces. Given a new image of a face, we need to report the person’s name.

A naïve way of accomplishing this is to take the new image, flatten it into a vector, and compute the Euclidean distance between it and all of the other flattened images in our database.
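A minimal sketch of that naïve approach might look like this, where database and labels are hypothetical names for a stack of flattened face images and the corresponding people:

import numpy as np

def naive_recognize(new_face, database, labels):
    """new_face: flattened image vector; database: (n_faces, n_pixels) array."""
    # Euclidean distance from the new face to every stored face
    dists = np.linalg.norm(database - new_face, axis=1)
    return labels[dists.argmin()]  # name of the closest match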

There are several downsides to this approach. First of all, if we have a large database of faces, then doing this comparison for each face will take a while! Imagine that we’re building a face recognition system for real-time use! The larger our dataset, the slower our algorithm. But more faces will also produce better results! We want a system that is both fast and accurate. For this, we’ll use a neural network! We can train our network on our dataset and use it for our face recognition task.

There’s an issue with directly using a neural network: images can be large! If we had a single $m\times n$ image, we would have to flatten it out into a single $m \cdot n\times 1$ vector to feed into our neural network as input. For large image sizes, this might hurt speed! This is related to the second problem with using images as-is in our naïve approach: they are high-dimensional! (An $m\times n$ image is really an $m \cdot n\times 1$ vector.) A new input might have a ton of noise, and comparing each and every pixel using matrix subtraction and Euclidean distance might give us a high error and misclassifications!

These issues are why we don’t use the naïve method. Instead, we’d like to take our high-dimensional images and boil them down to a smaller dimensionality while retaining the essence or important parts of the image.

Dimensionality Reduction

The previous section motivates our reason for using a dimensionality reduction technique. Dimensionality reduction is a type of unsupervised learning where we want to take higher-dimensional data, like images, and represent them in a lower-dimensional space. Let’s use the following image as an example.

These plots show the same data, except the bottom chart zero-centers it. Notice that our data do not have any labels associated with them because this is unsupervised learning! In our simple case, dimensionality reduction will reduce these data from a 2D plane to a 1D line. If we had 3D data, we could reduce it down to a 2D plane or even a 1D line.

All dimensionality reduction techniques aim to find some hyperplane, a higher-dimensional generalization of a line, to project the points onto. We can imagine a projection as taking a flashlight perpendicular to the hyperplane we’re projecting onto and plotting where the shadows fall on that hyperplane. For example, in our above data, if we wanted to project our points onto the x-axis, we would pretend each point is a ball and our flashlight would point directly down or up (perpendicular to the x-axis), and the shadows of the points would fall on the x-axis. This is a projection. We won’t worry about the exact math behind this since scikit-learn can apply the projection for us.

In our simple 2D case, we want to find a line to project our points onto. After we project the points, then we have data in 1D instead of 2D! Similarly, if we had 3D data, we want to find a plane to project the points down onto to reduce the dimensionality of our data from 3D to 2D. The different types of dimensionality reduction are all about figuring out which of these hyperplanes to select: there are an infinite number of them!

Principal Component Analysis

One technique of dimensionality reduction is called principal component analysis (PCA). The idea behind PCA is that we want to select the hyperplane such that when all the points are projected onto it, they are maximally spread out. In other words, we want the axis of maximal variance! Let’s consider our example plot above. A potential axis is the x-axis or y-axis, but, in both cases, that’s not the best axis. However, if we pick a line that cuts through our data diagonally, that is the axis where the data would be most spread!

The longer blue axis is the correct axis! If we were to project our points onto this axis, they would be maximally spread! But how do we figure out this axis? We can borrow a term from linear algebra called eigenvectors! This is where eigenfaces gets its name! Essentially, we compute the covariance matrix of our data and consider that covariance matrix’s largest eigenvectors. Those are our principal axes and the axes that we project our data onto to reduce dimensions. Using this approach, we can take high-dimensional data and reduce it down to a lower dimension by selecting the largest eigenvectors of the covariance matrix and projecting onto those eigenvectors.

Since we’re computing the axes of maximum spread, we’re retaining the most important aspects of our data. It’s easier for our classifier to separate faces when our data are spread out as opposed to bunched together.

(There are other dimensionality techniques, such as Linear Discriminant Analysis, that use supervised learning and are also used in face recognition, but PCA works really well!)

How does this relate to our challenge of face recognition? We can conceptualize our $m\times n$ images as points in $m \cdot n$-dimensional space. Then, we can use PCA to reduce our space from $m \cdot n$ dimensions into something much smaller. This will help speed up our computations and be robust to noise and variation.

Aside on Face Detection

So far, we’ve assumed that the input image is only that of a face, but, in practice, we shouldn’t require the camera images to have a perfectly centered face. This is why we run an out-of-the-box face detection algorithm, such as a cascade classifier trained on faces, to figure out what portion of the input image has a face in it. When we have that bounding box, we can easily slice out that portion of the input image and use eigenfaces on that slice. (Usually, we smooth that slice and perform an affine transform to de-warp the face if it appears at an angle.) For our purposes, we’ll assume that we have images of faces already.

Eigenfaces Code

Now that we’ve discussed PCA and eigenfaces, let’s code a face recognition algorithm using scikit-learn! First, we’ll need a dataset. For our purposes, we’ll use an out-of-the-box dataset by the University of Massachusetts called Labeled Faces in the Wild (LFW).

Feel free to substitute your own dataset! If you want to create your own face dataset, you’ll need several pictures of each person’s face (at different angles and lighting), along with the ground-truth labels. The wider the variety of faces you use, the better the recognizer will do. The easiest way to create a dataset for face recognition is to create a folder for each person and put the face images in there. Make sure the images are all the same size, and resize them so they aren’t too large! Remember that PCA will reduce the image’s dimensionality when we project onto that space anyway, so using large, high-definition images won’t help and will only slow down our algorithm. A reasonable size is around 512×512 or smaller for each image. Since the images are all the same size, you can store them in one numpy array with dimensions (num_examples, height, width). (We’re assuming grayscale images.) Then use the folder names to disambiguate classes. Using this approach, you can use your own images, as sketched below.
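Here is a rough sketch of that loading step using Pillow. The directory layout follows the description above, and the function name load_faces and the target size are my own choices; it also assumes each person’s folder contains only image files.

import os
import numpy as np
from PIL import Image

def load_faces(root_dir, size=(512, 512)):
    """Assumes root_dir/<person_name>/*.jpg; returns images, labels, names."""
    images, labels = [], []
    names = sorted(os.listdir(root_dir))          # one folder per person
    for label, name in enumerate(names):
        person_dir = os.path.join(root_dir, name)
        for fname in os.listdir(person_dir):
            img = Image.open(os.path.join(person_dir, fname)).convert('L')  # grayscale
            images.append(np.asarray(img.resize(size)))
            labels.append(label)
    return np.array(images), np.array(labels), names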

However, we’ll be using the LFW dataset. Luckily, scikit-learn can automatically load our dataset for us in the correct format. We can call a function to load our data. If the data aren’t available on disk, scikit-learn will automatically download them for us from the University of Massachusetts’ website.

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier


# Load data
lfw_dataset = fetch_lfw_people(min_faces_per_person=100)

_, h, w = lfw_dataset.images.shape
X = lfw_dataset.data
y = lfw_dataset.target
target_names = lfw_dataset.target_names

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

The argument to our function just prunes all people without at least 100 faces, thus reducing the number of classes. Then we can extract our dataset and other auxiliary information. Finally, we split our dataset into training and testing sets.

Now we can simply use scikit-learn’s PCA class to perform the dimensionality reduction for us! We have to select the number of components, i.e., the output dimensionality (the number of eigenvectors to project onto), that we want to reduce down to, and feel free to tweak this parameter to try to get the best result! We’ll use 100 components. Additionally, we’ll whiten our data, which is easy to do with a simple boolean flag! (Whitening just makes our resulting data have a unit variance, which has been shown to produce better results)

# Compute a PCA 
n_components = 100
pca = PCA(n_components=n_components, whiten=True).fit(X_train)

# apply PCA transformation
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

We can apply the transform to bring our images down to a 100-dimensional space. Notice we’re not performing PCA on the entire dataset, only the training data. This is so we can better generalize to unseen data.

Now that we have a reduced-dimensionality vector, we can train our neural network!

# train a neural network
print("Fitting the classifier to the training set")
clf = MLPClassifier(hidden_layer_sizes=(1024,), batch_size=256, verbose=True, early_stopping=True).fit(X_train_pca, y_train)

To see how our network is training, we can set the verbose flag. Additionally, we use early stopping.

Let’s discuss early stopping as a brief aside. Essentially, our optimizer will monitor the average accuracy for the validation set for each epoch. If it notices that our validation accuracy hasn’t increased significantly for a certain number of epochs, then we stop training. This is a regularization technique that prevents our model from overfitting!

Consider the above chart. We notice overfitting when our validation set accuracy starts to decline. At that point, we immediately stop training to prevent overfitting.

Finally, we can make a prediction and use a function to print out an entire report of quality for each class.

y_pred = clf.predict(X_test_pca)
print(classification_report(y_test, y_pred, target_names=target_names))

Here’s an example of a classification report.

                   precision    recall  f1-score   support

     Colin Powell       0.86      0.89      0.87        66
  Donald Rumsfeld       0.85      0.61      0.71        38
    George W Bush       0.88      0.94      0.91       177
Gerhard Schroeder       0.67      0.69      0.68        26
       Tony Blair       0.86      0.71      0.78        35

      avg / total       0.85      0.85      0.85       342

Notice there is no accuracy metric. Accuracy isn’t the most informative kind of metric to begin with. Instead, we see precision, recall, f1-score, and support. The support is simply the number of times this ground-truth label occurred in our test set, e.g., in our test set, there were actually 35 images of Tony Blair. The F1-score is computed from the precision and recall scores. Precision and recall are more specific measures than a single accuracy score, and a higher value for both is better.

After training our classifier, we can give it a few images to classify.

# Visualization
def plot_gallery(images, titles, h, w, rows=3, cols=4):
    plt.figure()
    for i in range(rows * cols):
        plt.subplot(rows, cols, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i])
        plt.xticks(())
        plt.yticks(())

def titles(y_pred, y_test, target_names):
    for i in range(y_pred.shape[0]):
        pred_name = target_names[y_pred[i]].split(' ')[-1]
        true_name = target_names[y_test[i]].split(' ')[-1]
        yield 'predicted: {0}\ntrue: {1}'.format(pred_name, true_name)

prediction_titles = list(titles(y_pred, y_test, target_names))
plot_gallery(X_test, prediction_titles, h, w)

(plot_gallery  and titles  functions modified from the scikit-learn documentation)

We can see our network’s predictions and the ground truth value for each image.

Another interesting thing to visualize is the eigenfaces themselves. Remember that PCA produces eigenvectors. We can reshape those eigenvectors into images and visualize the eigenfaces.
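A quick way to do that, reusing the pca object and the plot_gallery helper from above (the titles are just labels I made up), is to reshape the principal components back into images:

# Each row of pca.components_ is an eigenvector; reshape it into an h x w image
eigenfaces = pca.components_.reshape((n_components, h, w))
eigenface_titles = ['eigenface {0}'.format(i) for i in range(12)]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()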

These represent the “generic” faces of our dataset. Intuitively, these are vectors that represent directions in “face space” and are what our neural network uses to help with classification. Now that we’ve discussed the eigenfaces approach, you can build applications that use this face recognition algorithm!

We discussed a popular approach to face recognition called eigenfaces. The essence of eigenfaces is an unsupervised dimensionality reduction algorithm called Principal Components Analysis (PCA) that we use to reduce the dimensionality of images into something smaller. Now that we have a smaller representation of our faces, we apply a classifier that takes the reduced-dimension input and produces a class label. For our classifier, we used a single-layer neural network.

Face recognition is a fascinating example of merging computer vision and machine learning and many researchers are still working on this challenging problem today!

Nowadays, deep convolutional neural networks are used for face recognition, and they have vast implications for the developer career world. Try one out on this dataset!

]]>
Perceptrons: The First Neural Networks for Machine Learning https://gamedevacademy.org/perceptrons-the-first-neural-networks/ Wed, 28 Sep 2022 16:31:42 +0000 https://pythonmachinelearning.pro/?p=1620 Read more]]> Machine learning is not only a fascinating topic but one with a variety of approaches. Neural networks, one such approach, have come to the forefront largely due to their accessibility for those taking their first step into the machine learning world.

In this tutorial, we’re going to use Python and NumPy to explore the fundamentals of creating neural networks and combining them – using analogies for how our brains work to better understand them.

If you’re ready to start diving deep into the world of machine learning, let’s get started!

Intro & Project Files

Neural Networks have become incredibly popular over the past few years, and new architectures, neuron types, activation functions, and training techniques pop up all the time in research. But without a fundamental understanding of neural networks, it can be quite difficult to keep up with the flurry of new work in this area.

To understand the modern approaches, we have to understand the tiniest, most fundamental building block of these so-called deep neural networks: the neuron. In particular, we’ll see how to combine several of them into a layer and create a neural network called the perceptron. We’ll write Python code (using NumPy) to build a perceptron network from scratch and implement the learning algorithm.

For the completed code, download the ZIP file here.

BUILD GAMES

FINAL DAYS: Unlock 250+ coding courses, guided learning paths, help from expert mentors, and more.

Biological Neurons

Perceptrons and artificial neurons actually date back to 1958. Frank Rosenblatt was a psychologist trying to solidify a mathematical model for biological neurons. To better understand the motivation behind the perceptron, we need a superficial understanding of the structure of biological neurons in our brains.

(Credit: https://commons.wikimedia.org/wiki/File:Neuron_-_annotated.svg)

Let’s consider a biological neuron. The point of this cell is to take in some input (in the form of electrical signals in our brains), do some processing, and produce some output (also an electrical signal). One very important thing to note is that the inputs and outputs are binary (0 or 1)! An individual neuron accepts inputs, usually from other neurons, through its dendrites. Although the image above doesn’t depict it, the dendrites connect with other neurons through a gap called the synapse that assigns a weight to a particular input. Then, all of these inputs are considered together when they are processed in the cell body, or soma.

Neurons exhibit an all-or-nothing behavior. In other words, if the combination of inputs exceeds a certain threshold, then an output signal is produced, i.e., the neuron “fires.” If the combination falls short of the threshold, then the neuron doesn’t produce any output, i.e., the neuron “doesn’t fire.” In the case where the neuron does fire, the output travels along the axon to the axon terminals. These axon terminals are connected to the dendrites of other neurons through the synapse.

Let’s take a moment to recap biological neurons. They take some binary inputs through the dendrites, but not all inputs are treated the same since they are weighted. We combine these weighted signals and, if they surpass a threshold, the neuron fires. This single output travels along the axon to other neurons. Now that we have this summary in mind, we can develop mathematical equations to roughly represent a biological neuron.

Artificial Neurons

Now that we have some understanding of biological neurons, the mathematical model should follow from the operations of a neuron.

In this model, we have n binary inputs (usually given as a vector) and exactly the same number of weights $W_1, \dots, W_n$. We multiply these together and sum them up. We denote this as z and call it the pre-activation.

\[ z = \displaystyle\sum_{i=1}^{n} W_i x_i = W^T x \]

(We can re-write this as an inner product for succinctness.) There is another term, called the bias, that is just a constant factor.

\[ z = \displaystyle\sum_{i=1}^{n} W_i x_i + b = W^T x + b \]

For mathematical convenience, we can actually incorporate it into our weight vector as $W_0$ and set $x_0 = +1$ for all of our inputs. (This concept of incorporating the bias into the weight vector will become clearer when we write code.)

\[ z = \displaystyle\sum_{i=0}^{n} W_i x_i = W^T x \]

After taking the weighted sum, we apply an activation function, $\sigma$, to this and produce an activation a. The activation function for perceptrons is sometimes called a step function because, if we were to plot it, it would look like a stair step.

\[
\sigma(q)=
\begin{cases}
1 & q\geq 0 \\
0 & q < 0
\end{cases}
\]

In other words, if the input is greater than or equal to 0, then we produce an output of 1. Otherwise, we produce an output of 0. This is the mathematical model for a single neuron, the most fundamental unit of a neural network.

\[ a = \sigma (W^T x) \]

Let’s compare this model to the biological neuron. The inputs are analogous to the dendrites, and the weights model the synapse. We combine the weighted inputs by summing and send that weighted sum to the activation function. This acts as our all-or-nothing response function where 0 means the neuron didn’t produce an output. Also note that our inputs and outputs are also binary, which is in accordance with the biological model.

Capabilities and Limitations of Perceptrons

Since the output of a perceptron is binary, we can use it for binary classification, i.e., an input belongs to only one of two classes. The classic examples used to explain what perceptrons can model are logic gates!

Let’s consider the logic gates in the figure above. A white circle means an output of 1 and a black circle means an output of 0, and the axes indicate inputs. For example, when we input 1 and 1 to an AND gate, the output is 1, the white circle. We can create perceptrons that act like gates: they take 2 binary inputs and produce a single binary output!

However, perceptrons are limited to solving problems that are linearly separable. If two classes are linearly separable, this means that we can draw a single line to separate the two classes. We can do this easily for the AND and OR gates, but there is no single line that can separate the classes for the XOR gate! This means that we can’t use our single-layer perceptron to model an XOR gate.

An intuitive way to understand why perceptrons can only model linearly separable problems is to look the weighted sum equation (with the bias).

\[ \displaystyle\sum_{i=0}^N W_i x_i + b \]

This looks very similar to the equation of a line! (Or, more generally, a hyperplane.) Hence, we’re creating a line and saying that everything on one side of the line belongs to one class and everything on the other side belongs to the other class. This line is called the decision boundary, and, when we use a single-layer perceptron, we can only produce one decision boundary.

In light of this new information, it doesn’t seem like perceptrons are useful! But, in practice, many problems are actually linearly separable. Hope is not lost for non-linearly separable problems, however! It can be shown that organizing multiple perceptrons into layers and using an intermediate layer, or hidden layer, can solve the XOR problem! This is the foundation of modern neural networks!
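To make that claim concrete, here’s a small sketch of XOR built from perceptron-style neurons with hand-picked (not learned) weights, using the identity XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)). The weight values are my own choices; many others would work too.

import numpy as np

def step(z):
    return 1 if z >= 0 else 0

# Hand-picked weight vectors in the form [bias, w1, w2]
OR_w   = np.array([-1, 2, 2])   # fires if either input is 1
NAND_w = np.array([3, -2, -2])  # fires unless both inputs are 1
AND_w  = np.array([-3, 2, 1])   # fires only if both inputs are 1

def xor(x1, x2):
    x = np.array([1, x1, x2])            # prepend the bias input
    h1 = step(OR_w.dot(x))               # hidden layer, neuron 1
    h2 = step(NAND_w.dot(x))             # hidden layer, neuron 2
    return step(AND_w.dot([1, h1, h2]))  # output layer combines them

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, '->', xor(a, b))  # prints 0, 1, 1, 0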

Single-Layer Perceptron Code

Now that we have a good understanding of how perceptrons works, let’s take one more step and solidify the math into code. We’ll use object-oriented principles and create a class. In order to construct our perceptron, we need to know how many inputs there are to create our weight vector. The reason we add one to the input size is to include the bias in the weight vector.

import numpy as np

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size):
        self.W = np.zeros(input_size+1)

We’ll also need to implement our activation function. We can simply return 1 if the input is greater than or equal to 0 and 0 otherwise.

def activation_fn(self, x):
    return 1 if x >= 0 else 0

Finally, we need a function to run an input through the perceptron and return an output. Conventionally, this is called the prediction. We add the bias into the input vector. Then we can simply compute the inner product and apply the activation function.

def predict(self, x):
    x = np.insert(x, 0, 1)
    z = self.W.T.dot(x)
    a = self.activation_fn(z)
    return a

All of these are functions of the Perceptron class that we’ll use for perceptron learning.

Perceptron Learning Algorithm

We’ve defined a perceptron, but how do perceptrons learn? Rosenblatt, the creator of the perceptron, also had some thoughts on how to train neurons based on his intuition about biological neurons. Rosenblatt intuited a simple learning algorithm. His idea was to run each example input through the perceptron and, if the perceptron fires when it shouldn’t have, inhibit it. If the perceptron doesn’t fire when it should have, excite it.

How do we inhibit or excite? We change the weight vector (and bias)! The weight vector is a parameter to the perceptron: we need to keep changing it until we can correctly classify each of our inputs. With this intuition in mind, we need to write an update rule for our weight vector so that we can appropriately change it:

\[ w \leftarrow w + \Delta w \]

We have to determine a good $\Delta w$ that does what we want. First, we can define the error as the difference between the desired output d and the predicted output y.

\[ e = d - y \]

Notice that when d and y are the same (both are 0 or both are 1), we get 0! When they are different, (0 and 1 or 1 and 0), we can get either 1 or -1. This directly corresponds to exciting and inhibiting our perceptron! We multiply this with the input to tell our perceptron to change our weight vector in proportion to our input.

\[ w \leftarrow w + \eta\cdot e\cdot x \]

There is a hyperparameter $\eta$ that is called the learning rate. It is just a scaling factor that determines how large the weight vector updates should be. This is a hyperparameter because it is not learned by the perceptron (notice there’s no update rule for $\eta$!), but we select this parameter.

(For perceptrons, the Perceptron Convergence Theorem says that a perceptron will converge, given that the classes are linearly separable, regardless of the learning rate. But for other learning algorithms, this is a critical parameter!)

Let’s take another look at this update rule. When the error is 0, i.e., the output is what we expect, we don’t change the weight vector at all. When the error is nonzero, we update the weight vector accordingly.

Perceptron Learning Algorithm Code

With the update rule in mind, we can create a function to keep applying this update rule until our perceptron can correctly classify all of our inputs. We need to keep iterating through our training data until this happens; one epoch is when our perceptron has seen all of the training data once. Usually, we run our learning algorithm for multiple epochs.

Before we code the learning algorithm, we need to make some changes to our init function to add the learning rate and number of epochs as inputs.

def __init__(self, input_size, lr=1, epochs=10):
    self.W = np.zeros(input_size+1)
    # add one for bias
    self.epochs = epochs
    self.lr = lr

Now we can create a function, given inputs and desired outputs, run our perceptron learning algorithm. We keep updating the weights for a number of epochs, and iterate through the entire training set. We insert the bias into the input when performing the weight update. Then we can create our prediction, compute our error, and perform our update rule.

def fit(self, X, d):
    for _ in range(self.epochs):
        for i in range(d.shape[0]):
            y = self.predict(X[i])
            e = d[i] - y
            self.W = self.W + self.lr * e * np.insert(X[i], 0, 1)

The entire code for our perceptron is shown below. Note that in this final version, the bias is inserted into the input inside fit, before calling predict, so predict itself no longer performs the insertion.

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=1, epochs=100):
        self.W = np.zeros(input_size+1)
        # add one for bias
        self.epochs = epochs
        self.lr = lr
    
    def activation_fn(self, x):
        #return (x >= 0).astype(np.float32)
        return 1 if x >= 0 else 0

    def predict(self, x):
        z = self.W.T.dot(x)
        a = self.activation_fn(z)
        return a

    def fit(self, X, d):
        for _ in range(self.epochs):
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)
                y = self.predict(x)
                e = d[i] - y
                self.W = self.W + self.lr * e * x

Now that we have our perceptron coded, we can try to give it some training data and see if it works! One easy set of data to give is the AND gate. Here’s a set of inputs and outputs.

if __name__ == '__main__':
    X = np.array([
        [0, 0],
        [0, 1],
        [1, 0],
        [1, 1]
    ])
    d = np.array([0, 0, 0, 1])

    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)

In just a few lines, we can start using our perceptron! At the end, we print the weight vector. Using the AND gate data, we should get a weight vector of [-3, 2, 1]. This means that the bias is -3 and the weights are 2 and 1 for $x_1$ and $x_2$, respectively.

To verify this weight vector is correct, we can try going through a few examples. If both inputs are 0, then the pre-activation will be -3 + 0*2 + 0*1 = -3. When applying our activation function, we get 0, which is exactly 0 AND 0! We can try this for other gates as well. Note that this is not the only correct weight vector: technically, if there exists a single weight vector that can separate the classes, there exist infinitely many such weight vectors. Which weight vector we get depends on how we initialize the weight vector.
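We can also run this check in code. Remember that the final version of predict expects the bias already inserted into the input:

for x in X:
    x_with_bias = np.insert(x, 0, 1)  # prepend the bias input
    print(x, '->', perceptron.predict(x_with_bias))
# [0 0] -> 0, [0 1] -> 0, [1 0] -> 0, [1 1] -> 1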

To summarize, perceptrons are the simplest kind of neural network: they take in an input, weight each input, take the sum of weighted inputs, and apply an activation function. Since they were modeled from biological neurons by Frank Rosenblatt, they take and produce only binary values. In other words, we can perform binary classification using perceptrons. One limitation of perceptrons is that they can only solve linearly separable problems. In the real world, however, many problems are actually linearly separable. For example, we can use a perceptron to mimic an AND or OR gate. However, since XOR is not linearly separable, we can’t use single-layer perceptrons to create an XOR gate. The perceptron learning algorithm fits the intuition by Rosenblatt: inhibit if a neuron fires when it shouldn’t have, and excite if a neuron does not fire when it should have. We can take that simple principle and create an update rule for our weights to give our perceptron the ability of learning.

Perceptrons are the foundation of neural networks so having a good understanding of them now will be beneficial when learning about deep neural networks! This will also help as you pursue exciting developer opportunities.

Want to learn more Python in general? Check out our free course on Kivy!

]]>
Text Classification with Naive Bayes – Python Tutorial https://gamedevacademy.org/text-classification-tutorial-with-naive-bayes/ Sat, 24 Sep 2022 13:36:53 +0000 https://pythonmachinelearning.pro/?p=1766 Read more]]> If asked how your email sorted spam emails from legitimate emails, what would you say? Do you think you could build a spam filter yourself?

Hidden throughout our lives is one of the most prevalent uses for machine learning: text classification. Text classification allows us to do everything from identifying spam emails to being able to convert hand-written words into their plain text, digital equivalent.

In this tutorial, we’re going to discover how we can use the Naive Bayes technique to build our own text classification and build a spam filter using Python.

If you’re ready to expand how you see the digital world and text, let’s jump in!

Intro to Text Classification

The challenge of text classification is to attach labels to bodies of text, e.g., tax document, medical form, etc. based on the text itself. For example, think of your spam folder in your email. How does your email provider know that a particular message is spam or “ham” (not spam)? We’ll take a look at one natural language processing technique for text classification called Naive Bayes.

Download the full code here.

BUILD GAMES

FINAL DAYS: Unlock 250+ coding courses, guided learning paths, help from expert mentors, and more.

Bayes Theorem

(credit: https://en.wikipedia.org/wiki/Bayes%27_theorem#/media/File:Bayes%27_Theorem_MMB_01.jpg)

Before we derive the algorithm, we need to discuss the fundamental rule that Naive Bayes uses: Bayes Theorem:

\[ p(A|B) = \displaystyle\frac{p(B|A)p(A)}{p(B)} \]

where $A$ and $B$ are events and $p(\cdot)$ is a probability.

Let’s take a second to break this down. On the left, we have the probability of an event $A$ happening given that event $B$ happens. We say this is equal to the probability of event $B$ happening given event $A$ times the probability that event $A$ happens overall. All of that is divided by the probability that event $B$ happens overall. An example of this might help shed some light on why this is an ingenious theorem.

The classic example used to illustrate Bayes Theorem involves medical testing. Let’s suppose that we were getting tested for the flu. When we get a medical test, there are really 4 cases to consider when we get the results back:

  • True Positive: The test says we have the flu and we actually have the flu
  • True Negative: The test says we don’t have the flu and we actually don’t have the flu
  • False Positive: The test says we have the flu and we actually don’t have the flu
  • False Negative: The test says we don’t have the flu and we actually do have the flu

Suppose we also know some information about the flu and our testing methodology: we know our test can correctly detect that a person has the flu 99.5% of the time (i.e., $p(\text{+}|\text{Flu})=0.995$) and correctly detect that a person does not have the flu 99.5% of the time (i.e., $p(\text{-}|\text{No Flu})=0.995$). These correspond to the true positive rate and true negative rate. We also know that this specific type of flu is rare and only affects 1% of people. Given this information, we can compute the probability that a person who tests positive actually has this specific type of flu. Specifically, we want to compute the probability that the person has the flu, given that the person tested positive for it, i.e., event $A=\text{Flu}$ and $B=\text{+}$.

Let’s just substitute the problem specifics into Bayes Theorem.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{p(\text{+}|\text{Flu})p(\text{Flu})}{p(\text{+})} \]

Now let’s try to figure out specific values for the quantities on the right-hand side. The first quantity is $p(\text{+}|\text{Flu})$. This is the probability that someone tests positive given that they have the flu. In other words, this is the true positive rate: the probability that our test can correctly detect that a person has the flu! This number is 99.5%, or 0.995. The next quantity in the numerator is $p(\text{Flu})$. This is called the prior probability. In other words, it is the probability that any random person has the flu. We know from our problem that this number is 1%, or 0.01. Let’s substitute those values into the numerator.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{p(\text{+})} \]

Now we have to deal with the denominator: $p(\text{+})$. This is the probability that our test returns positive overall. We can’t quite use the information given in the problem as directly as before however. But first, why do we even need $p(\text{+})$? Recall that probabilities have to be between 0 and 1. Based on the above equation, if we left out the denominator, then we wouldn’t have a valid probability!

So when can our test return positive? There are two cases: either the test returns positive and the person actually has the flu (a true positive), or the test returns positive and the person does not have the flu (a false positive). We can’t simply add these two conditional probabilities, though. By the law of total probability, we have to weight each one by the probability of its condition: the probability that a person has the flu overall, and the probability that a person does not. Let’s expand the denominator.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{p(\text{+}|\text{Flu})p(\text{Flu})+p(\text{+}|\text{No Flu})p(\text{No Flu})} \]

Now let’s reason about these values. $p(\text{+}|\text{Flu})p(\text{Flu})$ is something we’ve seen before: it’s the numerator! Now let’s look at the next quantity: $p(\text{+}|\text{No Flu})p(\text{No Flu})$. We can compute the first term by taking the complement of the true negative rate: $p(\text{+}|\text{No Flu})=1-p(\text{-}|\text{No Flu})=0.005$. And $p(\text{No Flu})=1-p(\text{Flu})=0.99$, since they are complementary events. So now we can plug in all of our values and get a result.

\[ p(\text{Flu}|\text{+}) = \displaystyle\frac{0.995 \cdot 0.01}{0.995 \cdot 0.01+0.005 \cdot 0.99}= 0.6678\]

This result is a little surprising! Despite our test’s high accuracy, knowing that someone tested positive means there’s only about a 67% chance that they actually have the flu, because the flu itself is so rare. Hopefully, this example illustrated how to use Bayes Theorem.
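
As a quick sanity check, here is a minimal Python sketch of this computation (the variable names are our own, chosen for readability):

# Sanity-check the flu example using Bayes Theorem.
p_pos_given_flu = 0.995     # true positive rate, p(+|Flu)
p_neg_given_no_flu = 0.995  # true negative rate, p(-|No Flu)
p_flu = 0.01                # prior probability of having the flu

# Denominator p(+) via the law of total probability.
p_pos = p_pos_given_flu * p_flu + (1 - p_neg_given_no_flu) * (1 - p_flu)

p_flu_given_pos = (p_pos_given_flu * p_flu) / p_pos
print(round(p_flu_given_pos, 4))  # 0.6678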

Deriving Naive Bayes

Now let’s convert the Bayes Theorem notation into something slightly more machine learning-oriented.

\[ p(H|E) = \displaystyle\frac{p(E|H)p(H)}{p(E)} \]

where $H$ is the hypothesis and $E$ is the evidence. This might make more sense in the context of text classification: the probability that our hypothesis is correct given the evidence to support it equals the probability of observing that evidence given our hypothesis, times the prior probability of the hypothesis, divided by the probability of observing that evidence overall.

Let’s break this down again like we did for the original Bayes Theorem, except in the context of the text classification problem we’re trying to solve: spam detection. Our hypothesis $H$ is something like “this text is spam” and the evidence $E$ is the text of the email. So, to restate, we’re trying to find the probability that our email is spam given the text in the email. The numerator is then the probability that we find these words in a spam email, times the probability that any email is spam. The denominator is a bit trickier: it’s the probability that we observe those words overall.

There’s something a bit off with this formulation, though: the evidence really consists of multiple pieces of evidence, namely the words $w_1,\dots,w_n$. No problem! We can represent it that way and Bayes Theorem still holds. We can also change the hypothesis $H$ to a class, $\text{Spam}$.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1,\dots,w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

Excellent! We can use the chain rule of conditional probability to expand the numerator.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1|w_2,\dots,w_n,\text{Spam})p(w_2|w_3,\dots,w_n,\text{Spam})\dots p(w_{n-1}|w_n,\text{Spam})p(w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

Not only does this look messy, it’s also quite messy to compute! Think about the first term: $p(w_1|w_2,\dots,w_n,\text{Spam})$. This is the probability of finding the first word, given all of the other words and given that the email is spam. Estimating such a quantity reliably is really difficult if we have a lot of words!

Naive Bayes Assumption

To help us with that equation, we make an assumption called the Naive Bayes assumption to simplify the math, and eventually the code: we assume that each word is independent of all the other words. In reality, this is not true! The words that come before and after a given word certainly influence it. However, making this assumption greatly simplifies the math and, in practice, still works well! This assumption is why the technique is called Naive Bayes. After making that assumption, we can break down the numerator into the following.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(w_1|\text{Spam})p(w_2|\text{Spam})\dots p(w_n|\text{Spam})p(\text{Spam})}{p(w_1,\dots,w_n)} \]

This looks better! Now we can interpret a term like $p(w_1|\text{Spam})$ as the probability of finding word $w_1$ in a spam email. We can condense the numerator using the product symbol ($\prod$) as notational shorthand.

\[ p(\text{Spam}|w_1,\dots,w_n) = \displaystyle\frac{p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam})}{p(w_1,\dots,w_n)} \]

This is the Naive Bayes formulation! It returns the probability that an email message is spam given the words in that email. For text classification, however, we need an actual label, not a probability, so we simply say that an email is spam if $p(\text{Spam}|w_1,\dots,w_n)$ is greater than 50%, and not spam otherwise. In other words, we choose “spam” or “ham” based on whichever of the two classes has the higher probability! Actually, we don’t need probabilities at all: we can drop the denominator entirely, since its only purpose is to scale the numerator and it is identical for both classes.

\[ p(\text{Spam}|w_1,\dots,w_n) \propto p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam}) \]

(where $\propto$ means “proportional to”). That’s one less quantity we have to compute! Since the result is no longer a true probability, we simply pick whichever class has the higher score.
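
To make the decision rule concrete, here is a toy scoring sketch; every probability below is made up purely for illustration:

# Toy version of the proportional scoring rule for a 3-word email.
# All numbers here are hypothetical, not learned from data.
p_spam, p_ham = 0.3, 0.7
p_words_given_spam = [0.05, 0.02, 0.20]  # p(w_i|Spam) for each word
p_words_given_ham = [0.01, 0.04, 0.05]   # p(w_i|Ham) for each word

spam_score = p_spam
for p in p_words_given_spam:
    spam_score *= p

ham_score = p_ham
for p in p_words_given_ham:
    ham_score *= p

# 6e-05 vs 1.4e-05: the spam score is higher, so we predict "spam".
print(spam_score, ham_score)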

Numerical Stability

There’s one extra thing we’re going to do to help us with numerical stability. If we look at the numerator, we see we’re multiplying many probabilities together. If we do that, we could end up with really small numbers, and our computer might round down to zero! To prevent this, we’re going to look at the log probability by taking the log of each side. Using some properties of logarithms, we can manipulate our Naive Bayes formulation.

\begin{align*}
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log \left( p(\text{Spam})\displaystyle\prod_{i=1}^np(w_i|\text{Spam}) \right)\\
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam}) + \log \displaystyle\prod_{i=1}^np(w_i|\text{Spam})\\
\log p(\text{Spam}|w_1,\dots,w_n) &\propto \log p(\text{Spam}) + \displaystyle\sum_{i=1}^n \log p(w_i|\text{Spam})
\end{align*}

Now we’re dealing with additions of log probabilities instead of multiplying many probabilities together! Since log has really nice properties (monotonicity being the key one), we can still take the highest score to be our prediction, i.e., we don’t have to “undo” the log!
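
To see why this matters, here is a small, purely illustrative sketch comparing the raw product of many tiny probabilities with the equivalent sum of log probabilities:

import math

probs = [1e-5] * 100  # 100 word probabilities of 0.00001 each

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value (1e-500) underflows a 64-bit float

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -1151.29 -- easily representable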

Dataset

We’ll be using the Enron email dataset for our training data. This is real email data from the Enron Corporation after the company collapsed. Before starting, download all of the numbered folders, i.e., enron1, enron2, etc., here and put all of them in a top-level folder simply called enron. Put this enron folder in the same directory as your source code so we can find the dataset!

A WORD OF WARNING!: Since this dataset is a collection of real emails, it contains real spam messages. Your anti-virus may prune some of these emails because they are spam. Let your anti-virus prune as many as it wants; this will not affect our code as long as some spam and ham messages are still there!

Naive Bayes Code

Here is the dataset-loading code:

import os
import re
import string
import math

DATA_DIR = 'enron'
target_names = ['ham', 'spam']

def get_data(DATA_DIR):
	subfolders = ['enron%d' % i for i in range(1,7)]

	data = []
	target = []
	for subfolder in subfolders:
		# spam
		spam_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'spam'))
		for spam_file in spam_files:
			with open(os.path.join(DATA_DIR, subfolder, 'spam', spam_file), encoding="latin-1") as f:
				data.append(f.read())
				target.append(1)

		# ham
		ham_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'ham'))
		for ham_file in ham_files:
			with open(os.path.join(DATA_DIR, subfolder, 'ham', ham_file), encoding="latin-1") as f:
				data.append(f.read())
				target.append(0)

	return data, target

This will produce two lists: the data list, where each element is the text of an email, and the target list, which is simply binary (1 meaning spam and 0 meaning ham). Now let’s create a class and add some helper functions for string manipulation.

class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""
    def clean(self, s):
        translator = str.maketrans("", "", string.punctuation)
        return s.translate(translator)

    def tokenize(self, text):
        text = self.clean(text).lower()
        return re.split(r"\W+", text)

    def get_word_counts(self, words):
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts

We have a function to clean up our string by removing punctuation, one to tokenize our string into words, and another to count up how many of each word appears in a list of words.
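
As a quick illustration of these helpers (the sample sentence is our own):

sd = SpamDetector()
words = sd.tokenize("Win a FREE prize!! Click now, it's free.")
print(words)
# ['win', 'a', 'free', 'prize', 'click', 'now', 'its', 'free']
print(sd.get_word_counts(words))
# {'win': 1.0, 'a': 1.0, 'free': 2.0, 'prize': 1.0, 'click': 1.0, 'now': 1.0, 'its': 1.0}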

Before we write the actual algorithm, let’s first understand it. For training, we need three things: the (log) class priors, i.e., the probability that any given message is spam or ham; a vocabulary of words; and word frequencies for spam and ham separately, i.e., the number of times each word appears in spam and in ham messages. Given a list of input documents, the algorithm is as follows.

  1. Compute log class priors by counting how many messages are spam/ham, dividing by the total number of messages, and taking the log.
  2. For each (document, label) pair, tokenize the document into words.
  3. For each word, add it to the vocabulary for that class if it isn’t already there, and update its count. Also add the word to the global vocabulary.
def fit(self, X, Y):
    self.num_messages = {}
    self.log_class_priors = {}
    self.word_counts = {}
    self.vocab = set()

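    # Count the messages in each class and store the log class priors.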
    n = len(X)
    self.num_messages['spam'] = sum(1 for label in Y if label == 1)
    self.num_messages['ham'] = sum(1 for label in Y if label == 0)
    self.log_class_priors['spam'] = math.log(self.num_messages['spam'] / n)
    self.log_class_priors['ham'] = math.log(self.num_messages['ham'] / n)
    self.word_counts['spam'] = {}
    self.word_counts['ham'] = {}

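    # Tokenize each message and accumulate word counts per class,
    # adding every new word to the global vocabulary.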
    for x, y in zip(X, Y):
        c = 'spam' if y == 1 else 'ham'
        counts = self.get_word_counts(self.tokenize(x))
        for word, count in counts.items():
            if word not in self.vocab:
                self.vocab.add(word)
            if word not in self.word_counts[c]:
                self.word_counts[c][word] = 0.0

            self.word_counts[c][word] += count

First, we compute the log class priors by counting how many spam and ham messages are in our dataset, dividing each count by the total number of messages, and taking the log of the result.

Then we can iterate through our dataset. For each input, we get the word counts and iterate through each (word, frequency) pair. If the word isn’t in our global vocabulary, we add it. If it isn’t in the vocabulary for that particular class label, we also add it along with the frequency.

For example, suppose we had a “spam” message. We count up how many times each unique word appears in that spam message and add that count to the “spam” vocabulary. Suppose the word “free” appears 4 times. Then we add the word “free” to our global vocabulary and add it to the “spam” vocabulary with a count of 4.

We’re keeping track of the frequency of each word as it appears in either a spam or ham message. For example, we expect the word “free” to appear in both messages, but we expect it to be more frequent in the “spam” vocabulary than the “ham” vocabulary.
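
For instance, after fitting on a tiny corpus, the internal state might look something like this (all numbers here are hypothetical):

import math

# Hypothetical state after fit() on a 4-message corpus (1 spam, 3 ham).
log_class_priors = {'spam': math.log(1 / 4), 'ham': math.log(3 / 4)}
word_counts = {
    'spam': {'free': 4.0, 'winner': 2.0},
    'ham': {'meeting': 3.0, 'free': 1.0},
}
vocab = {'free', 'winner', 'meeting'}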

Now that we’ve extracted all of the data we need from the training data, we can write another function to actually output the class label for new data. To classify a document, we apply Naive Bayes directly: we iterate over each of its words, compute $\log p(w_i|\text{Spam})$ for each and sum them up, and likewise compute and sum $\log p(w_i|\text{Ham})$. Then we add the corresponding log class priors and compare the two scores for that document. Whichever is larger is the predicted label!

To compute $\log p(w_i|\text{Spam})$, the numerator is the number of times we’ve seen $w_i$ in spam messages, and the denominator is the total count of all words across every spam message.

One additional note: remember that the log of 0 is undefined! What if we encounter a word that is in the “spam” vocabulary but not the “ham” vocabulary? Then $p(w_i|\text{Ham})$ would be 0! One way around this is Laplace smoothing: we add 1 to the numerator, and to keep things balanced we add the size of the vocabulary to the denominator.
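
Concretely, the smoothed estimate we’ll use in the code below is

\[ p(w_i|\text{Spam}) = \displaystyle\frac{\text{count}(w_i,\text{Spam}) + 1}{\displaystyle\sum_{w \in V}\text{count}(w,\text{Spam}) + |V|} \]

where $V$ is the vocabulary and $|V|$ is its size.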

def predict(self, X):
    result = []
    # Total number of words seen in each class during training; this is
    # the denominator described above for the smoothed word probabilities.
    spam_total_words = sum(self.word_counts['spam'].values())
    ham_total_words = sum(self.word_counts['ham'].values())
    for x in X:
        counts = self.get_word_counts(self.tokenize(x))
        spam_score = 0
        ham_score = 0
        for word, _ in counts.items():
            # Skip words we never saw during training.
            if word not in self.vocab: continue

            # Laplace smoothing: add 1 to the count and |vocab| to the denominator.
            log_w_given_spam = math.log( (self.word_counts['spam'].get(word, 0.0) + 1) / (spam_total_words + len(self.vocab)) )
            log_w_given_ham = math.log( (self.word_counts['ham'].get(word, 0.0) + 1) / (ham_total_words + len(self.vocab)) )

            spam_score += log_w_given_spam
            ham_score += log_w_given_ham

        spam_score += self.log_class_priors['spam']
        ham_score += self.log_class_priors['ham']

        # Predict whichever class has the higher log-space score.
        if spam_score > ham_score:
            result.append(1)
        else:
            result.append(0)
    return result

In our case, the input can be a list of document texts; we return a list of predictions. Finally, we can use the class like this.

if __name__ == '__main__':
    X, y = get_data(DATA_DIR)
    MNB = SpamDetector()
    MNB.fit(X[100:], y[100:])

    pred = MNB.predict(X[:100])
    true = y[:100]

    accuracy = sum(1 for i in range(len(pred)) if pred[i] == true[i]) / float(len(pred))
    print("{0:.4f}".format(accuracy))

We reserve the first 100 documents as the testing set, “train” our Naive Bayes classifier on the rest, and then compute the accuracy on the held-out documents.
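
One caveat: get_data reads each folder’s spam files before its ham files, so the first 100 documents are not a representative mix; they will mostly, if not entirely, be spam from enron1. For a fairer accuracy estimate, you might shuffle the data before splitting. Here is a minimal sketch (the seed value is arbitrary):

import random

random.seed(42)  # arbitrary seed, just for reproducibility
pairs = list(zip(X, y))
random.shuffle(pairs)
X, y = [p[0] for p in pairs], [p[1] for p in pairs]
# ...then split into test (X[:100]) and train (X[100:]) as above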

To recap, we reviewed Bayes Theorem and demonstrated how to use it with an example. Then we re-worked it using hypotheses and evidence instead of just events $A$ and $B$ to make it more specific to our task of spam detection. From there, we derived Naive Bayes by making the Naive Bayes Assumption that each word appears independently of all other words. Then we formulated a prediction equation/rule. Using the Enron dataset, we created a binary Naive Bayes classifier for detecting spam emails.

Naive Bayes is a simple text classification algorithm that uses basic probability laws and works quite well in practice! It is also a widely used technique in industry, so having it down pat can pay real dividends for your career as well!

]]>
Free Course – Create a Balloon Popper Mini-Project with Python Turtle https://gamedevacademy.org/python-balloon-popper-tutorial/ Wed, 08 Jun 2022 04:31:56 +0000 https://gamedevacademy.org/?p=16392 Read more]]>

Learn to create a balloon popper project with Python Turtle. Not only will you learn common programming concepts such as branching paths, functions, debugging, loops, and more, but you’ll also gain the skills needed to build bigger and more complex projects in a variety of genres. You can explore the full course below!

Python Turtle Mini-Projects

About

With instructions from developer Daniel Buckley, this balloon popper mini-course tutorial will help you get started with programming fundamentals and Python Turtle by learning to apply them in a practical setting. You’ll not only learn how to define, design, implement, and evaluate a project, but also master how to detect keyboard input, save and manipulate data with variables, create branching paths with conditions, make functions and loops to reuse code, and debug issues in algorithm logic. Regardless of the project you wish to build, these foundations will help you master a variety of techniques for creating Python projects and utilizing Python Turtle features to create functional games and applications.

]]>
Free Course – Create a Mini Game with Python https://gamedevacademy.org/python-digitrek-game-tutorial/ Wed, 01 Jun 2022 10:49:25 +0000 https://gamedevacademy.org/?p=16388 Read more]]>

Master the fundamentals of coding and project management for games by learning Python Turtle! You’ll cover important topics related to algorithms, and even create a game for your portfolio! You can also explore more about Python Turtle in the full course below!

Python Turtle Mini-Projects

About

In this presentation brought to you by Pablo Farias, you’ll explore how, using Python Turtle and the concept of computer instructions, you can create a practical game project featuring a turtle that can be moved with key presses. Through these skills, you’ll come to understand computer science principles for programming, learn how projects go from idea to digital solution, and build strong foundations for creating games and more! Perfectly suited for beginners, this course will make sure you’re prepared for the technological future of tomorrow!

]]>