An Introduction to Machine Learning – https://gamedevacademy.org/what-is-machine-learning/ (Fri, 20 Dec 2019)

You can access the full course here: Machine Learning for Beginners with TensorFlow

Intro to Machine Learning

Now that we know what the course is all about, let’s learn a bit about the main topic: machine learning. What is machine learning? Machine learning is the study of statistics and algorithms aimed at performing a task without being explicitly programmed to. Theoretically, a machine is said to have learned if it produces “better” results over time without modifications to its programming. Practically, this means writing an algorithm, feeding it some data, and letting it interpret the data to find some pattern to solve a problem. 

Behind the scenes, machine learning models consist of layers of connected nodes (often called neurons). We will cover model structure in greater detail later, but it is important to know now that each node has one or more values assigned to it that, when combined with a function of our choice, produce some output. Through training, a model can change these values to produce more accurate outputs, rather than having us, as the coders, explicitly change the algorithm or values manually. In this way, the model is “learning” as it improves its results over time using the same algorithm. It should also be noted that some models do not undergo training and are only used to find patterns in data. We will explain the differences in the section on common machine learning models.
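As a rough illustration of what “a node with values that produce some output” means (a minimal sketch, not tied to any particular framework; the sigmoid activation and the numbers are purely illustrative), a single node combines its inputs with its stored weights and bias and passes the result through a function of our choice:

import numpy as np

# A minimal sketch of one node: a weighted sum of the inputs plus a bias,
# passed through an activation function of our choice (here, a sigmoid).
def node_output(inputs, weights, bias):
    # The "values assigned to the node" are its weights and bias; training
    # nudges these numbers so the node's output becomes more accurate over time.
    weighted_sum = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-weighted_sum))

print(node_output(np.array([0.5, -1.0]), np.array([0.8, 0.2]), 0.1))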

So what makes machine learning so special? Why not just hard code the algorithms ourselves? The main reason we use machine learning is to help find patterns in data that we wouldn’t otherwise be able to see. If we cannot find patterns in data, we cannot hard code algorithms to look for them as we wouldn’t know what to tell the algorithm to do. It is this pattern recognition that allows machine learning models to solve recognition, classification, and prediction problems such as speech recognition, image classification, and stock market prediction. Machine learning also helps to customize user experience and tailor solutions to the user based on their previous habits such as with responsive game AIs, health apps, and text suggestion.

Transcript

What is up, guys? And welcome to the first tutorial in our machine learning course! This’ll also be the very first topic and, as you can see, it is on an Intro to Machine Learning. This is a good way to get some conceptual background info about what machine learning is, kind of how it works, and also go over some practical examples before we start writing any actual code.

So, what topics will we be covering here? Well, we’re gonna divide ourselves into three subtopics. We’ll start with “What is Machine Learning?”, then “What can we do with Machine Learning?”, and we’ll finish up with “What types of Machine Learning there are out there”. Now, I’m going to devote a separate tutorial to each of these three topics just because I’m trying to keep things a bit shorter. There’s a lot to digest within each of these so, we don’t want anything running too, too long.

Okay, so, for starters, we’ll cover, What Is Machine Learning? I think we can safely do that in this video, too. Now, there is a technical definition here; Machine Learning is the study of statistics and algorithms that’s aimed at performing a task without being explicitly programmed to. That’s a lot to take in. It’s quite wordy and is very technical, uses a lot of jargon. So, let’s try to break that down a bit. I’ve got a, hopefully a bit more of an easy definition to understand here and that is that Machine Learning is finding patterns in data that help to solve a problem without us necessarily writing the algorithm to find the patterns and solve the problem. Now, realistically, these are both kind of describing the same thing. There are a couple of common themes in there.

So, the first is that we’re aimed at performing a task or solving some kind of a problem. Well, that’s kind of obvious. I mean, that’s what software is supposed to help us do: perform some task or solve a problem more easily than we otherwise could. But the second thing that’s really important is that it’s not programmed explicitly to necessarily solve that problem or to improve over time, okay? So, the learning aspect of it is actually not something that we, ourselves, necessarily program in. It kind of figures out what it should and shouldn’t be doing based on the data that it sees.

So, usually Machine Learning does involve performing a task better over time, otherwise there’s not really an aspect of learning involved. Performance in most of the models, though not all of them, is linked to this concept of training. Now, we’ll go into training in greater detail later, but for now this is essentially feeding data into a model to increase its performance over time without modifying the algorithm itself.

Now, that’s a very important aspect of it because, otherwise, it’s not really Machine Learning. If we’re actually making changes to the algorithm to increase performance, then it’s not learning itself. That’s us manually changing things. There’s no Machine Learning there. If a machine produces better results after training with the same algorithm, it’s said to have learned. So, if the model starts out at the very beginning with maybe 20% accuracy, that means it’s only getting 20% of the answers correct, and then, later on, that bumps up to, I don’t know, maybe 80 or 90%, then we have a good model and that model is said to have learned a lot.

The key takeaway here is that Machine Learning models perform better over time without any changes made to the algorithm itself. It’s really just putting a bunch of data into the model and then having the model perform better over time without any changes to the actual algorithm.

Okay! So, that’s all we’re actually gonna do here. Just take a second to digest that. Hopefully it wasn’t too much, and when we come back we’ll talk about some of the things that we can do with Machine Learning and where we might see some practical applications of Machine Learning in real life. Hopefully that will help to clear up any questions that you might have, okay? So, stay tuned for that. Thanks for watching! See you guys in the next one.

Interested in continuing? Check out the full Machine Learning for Beginners with TensorFlow course, which is part of our Machine Learning Mini-Degree.

Machine Learning Mini-Degree – https://gamedevacademy.org/machine-learning-mini-degree/ (Sat, 08 Dec 2018)

Master Machine Learning with Python and Tensorflow. Craft Advanced Artificial Neural Networks and Build Your Cutting-Edge AI Portfolio.

The Machine Learning Mini-Degree is an on-demand learning curriculum composed of 6 professional-grade courses geared towards teaching you how to solve real-world problems and build innovative projects using Machine Learning and Python.

Learn and understand the fundamentals necessary to build the next generation of intelligent applications and software. The concepts and theory you’ll learn can be applied across technologies and frameworks.

No prior experience with AI or Machine Learning is necessary to join. However, basic to intermediate Python skills are assumed in all of the courses.

Access this Mini-Degree on Zenva Academy

The Complete Programming and Full-Stack Bundle – 20 Course Smart Curriculum – https://gamedevacademy.org/the-complete-programming-and-full-stack-bundle-20-course-smart-curriculum/ (Sat, 24 Nov 2018)

Go from beginner to full-stack developer!

The Complete Programming and Full-Stack Bundle is the world’s most effective way to go from beginner to professional coder. Whether your goal is to advance your career, start your own business or expand your existing skill-set, our 20-course Smart Curriculum has something in store for you.

This bundle is suitable both for absolute beginners and more advanced developers. Projects cover a wide range of different topics including:

  • Crafting interactive websites and web applications with Bootstrap, Angular, React, Node and Express
  • Coding in Python and building smart Machine Learning and AI applications
  • Building games with Unity, Phaser and JavaScript
  • Creating stunning Virtual Reality games and applications
  • Game artwork creation with Blender and Gimp

Access The Complete Programming and Full-Stack Bundle on Zenva Academy

Free Ebook – Machine Learning For Human Beings – https://gamedevacademy.org/free-ebook-machine-learning-for-human-beings/ (Thu, 04 Jan 2018)

We are excited to announce the launch of our free ebook Machine Learning for Human Beings, authored by Mohit Deshpande, a researcher in the field of computer vision and machine learning, in collaboration with Pablo Farias Navarro, founder of Zenva.

In over 100 pages you will learn the basics of Machine Learning – text classification, clustering and even face recognition – and learn to implement these algorithms using Python! This ebook covers both theoretical and practical aspects of Machine Learning, so that you have a strong foundation and understand what happens under the hood. Some of the topics covered in the book are:

  • Overview of Machine Learning – Supervised vs Unsupervised Learning
  • Text Classification with Naive Bayes
  • Data Clustering with K-Means
  • Clustering with Gaussian Mixture Models
  • Face Recognition with Eigenfaces
  • Dimensionality Reduction
  • Classification with Support Vector Machines
  • Reinforcement Learning using the OpenAI library

This book is provided at no cost in PDF format.

Download the ebook here

An Overview of Reinforcement Learning: Teaching Machines to Play Games – https://gamedevacademy.org/an-overview-of-reinforcement-learning-teaching-machines-to-play-games/ (Sat, 02 Dec 2017)

Think back to the time you first learned a skill: driving a car, playing an instrument, cooking a recipe. Let’s consider the example of playing chess. Initially, it might have seemed difficult, but, as you played more and more, it became easier to understand the game. After playing many games of chess, you are much better than when you first started and “get” the game. You know which moves you should avoid and which moves you should play. In other words, you’ve developed experience by playing the game.

This is the same idea behind reinforcement learning. First, we’ll discuss what we need to define a game through a Markov Decision Process. Then we’ll discuss how we solve these using the Value Iteration Algorithm. We’ll discuss the Q-Learning algorithm for teaching a machine to play a game. Finally, we’ll implement Q-Learning in an OpenAI gym environment to teach a machine to play CartPole!

Download the full code here.


Games

Before discussing reinforcement learning, we have to discuss the setup of a game. Before delving into the mathematics and definitions, let’s first try to use our reasoning. Suppose we have a robot (purple star) in an environment like the following picture.

The dark gray squares are walls. The green square yields the highest reward, e.g., the location of the buried treasure. The red square yields the lowest reward, e.g., a fire pit. We want to give commands to our robot to make it go to the green square and avoid the red square in the most efficient way possible. We have a finite number of actions: move left, up, right, and down. But, since we’re in the real world, the robot may not always go up. Robots don’t always act perfectly in the real world! Instead, our robot has a very high probability of doing the intended action, e.g., a 90% chance of going up when we tell it to go up. However, there are non-zero probabilities that the robot will go in a different direction, e.g., a 5% chance of going left or right when we tell it to go up. We have to keep these in mind when deciding which action to take.
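To make those probabilities concrete, here is a tiny sketch (the 90/5/5 split and the direction names are just the illustrative numbers from above) of sampling the direction the robot actually moves when we command an action:

import numpy as np

# Hypothetical noise model: 90% chance of the intended move, 5% chance each
# of slipping to one of the two perpendicular directions.
def sample_actual_move(intended):
    slips = {'up': ['left', 'right'], 'down': ['left', 'right'],
             'left': ['up', 'down'], 'right': ['up', 'down']}
    return np.random.choice([intended] + slips[intended], p=[0.9, 0.05, 0.05])

print(sample_actual_move('up'))  # usually 'up', occasionally 'left' or 'right'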

These actions have to be taken within the confines of our environment, the most important characteristic of our game. Given our finite environment, there are only so many possibilities where our robot can be: the number of light gray squares. When we take an action, we go from one square to another, changing our board configuration. If we take the right actions, we can get to the green square and get a large reward!

Using the framework we described, we can define our game. Specifically, we use a Markov Decision Process (MDP) to define our game. An MDP is defined by the following.

  1. Set of states S
  2. Set of actions A that we can take at a state s
  3. Transition function T(s, a, s') where s\in S, a\in A, and s'\in S that is nondeterministic because, in real-world situations, the robot may not always follow the given action.
  4. Reward function R(s, a, s') that tells us what reward we get by taking action a in state s and ending up in state s'

Using these four things, we can define an MDP. The goal is to learn an optimal policy \pi^*(s) that tells us the optimal action to take at every state to maximize our reward.

We also introduce the concept of discounting: we prefer to have rewards sooner rather than later. We discount rewards at each time step so that we place an emphasis on having rewards sooner. This not only has a semantic meaning, but it is also mathematically useful: it helps our algorithms actually converge! Mathematically, we denote the discounting factor as \gamma. This will manifest itself in terms of exponential decay, but we’ll discuss that soon.
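As a small worked example of discounting (the rewards below are made up), the discounted return of a reward sequence is just each reward scaled by an exponentially decaying factor of \gamma:

# Discounted return for a hypothetical reward sequence: a reward received
# t steps in the future is scaled by gamma**t, so earlier rewards count more.
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 10.0]
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1 + 0.9 + 0.81 + 7.29 = 10.0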

We can define these in terms of our robot in the environment. The set of states is all possible positions of our robot. The set of actions is moving left, up, right, and down. We can model the transition function as a probabilistic function that returns the probability of going into state s' after taking action a in state s. And the reward function can simply be a positive value if we go into the green square and a negative value for going into the red square.

Value Iteration

Knowing the transition function and reward function, the goal is to figure out the optimal policy \pi^*(s). Value Iteration is a technique we can use to solve for this. The idea is that we assign a value V(s) to each state that represents the expected reward of starting in s and acting optimally. We have to compute an expected value because the transitions aren’t deterministic!

We can compute this using one of the Bellman Equations.

    \[ V^*(s) =\displaystyle\max_a\sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \]

Let’s take a second to break this down. The sum is over all states s'\in S and computes the expected reward of taking action a in state s. The \max in the front tells us to consider all actions a\in A and pick the expected reward that is the largest. Notice that we’re considering the discounting factor \gamma in the expectation. Also note that this definition is recursive: we have to know V^*(s') to compute V^*(s).

States near the high-reward square will have high V^*(s) values: the expected reward is high because we’re close to the reward and it is only lightly discounted. On the other hand, states far away from the high-reward square will have smaller expected reward because the reward is discounted over more steps. Think of it this way: the farther away you are, the more steps (and the more chances for the robot to slip) lie between you and the reward, as opposed to being right next to a high-reward state.

To compute these optimal expected rewards, we can use the Value Iteration algorithm. We initialize V_0(s)=0 for all states. Then we compute V_1(s) given all of the values of V_0(s). Then, we compute V_2(s) using V_1(s), and so on, using this equation.

    \[ V_{k+1}(s) =\displaystyle\max_a\sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_k(s')] \]

We use this iterative Bellman equation to compute the next iteration of all of the values for all of the states given the previous values. Eventually, the values won’t change, and we say that the algorithm has converged.

Algorithmically, we can write the following.

  1. Start with V_0(s)=0 for all states
  2. Compute V_{k+1}(s) =\displaystyle\max_a\sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_k(s')]
  3. Go to Step 2 and repeat until convergence
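To make those steps concrete, here is a minimal, self-contained sketch of value iteration on a tiny made-up MDP (the states, transition probabilities, and rewards below are hypothetical stand-ins, not the gridworld pictured above):

# T[(s, a)] is a list of (next_state, probability); R(s, a, s') is the reward.
states = ['A', 'B', 'goal']
actions = ['left', 'right']
T = {
    ('A', 'left'):     [('A', 0.9), ('B', 0.1)],
    ('A', 'right'):    [('B', 0.9), ('A', 0.1)],
    ('B', 'left'):     [('A', 0.9), ('B', 0.1)],
    ('B', 'right'):    [('goal', 0.9), ('B', 0.1)],
    ('goal', 'left'):  [('goal', 1.0)],
    ('goal', 'right'): [('goal', 1.0)],
}
R = lambda s, a, s2: 1.0 if (s2 == 'goal' and s != 'goal') else 0.0
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(100):  # repeat the Bellman update until the values stop changing
    V = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions)
         for s in states}
print(V)  # states closer to 'goal' end up with higher values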

The amazing thing is that value iteration will certainly converge to the optimal values (I’ve omitted the exact proof). Here are some images that show value iteration for our robotic environment.

But remember our goal: find the optimal policy \pi^*(s). How can we use these values to compute the optimal policy?

Suppose that we’ve run value iteration for a long enough amount of time, and our values have converged. In other words, we know V^*(s) for all states. To determine the policy, i.e., how to act at a state, we do a one-step lookahead to see which action will produce the maximum expected reward. This technique is called policy extraction.

    \[ \pi^*(s) = \displaystyle\operatornamewithlimits{argmax}\limits_{a}\sum_{s'}T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \]
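A one-step-lookahead helper for policy extraction might look like the following sketch (it assumes the same T, R, actions, and gamma structures as the value-iteration sketch above):

# Policy extraction: for each state, pick the action whose one-step lookahead
# expected value is largest under the converged values V.
def extract_policy(V, T, R, actions, gamma):
    policy = {}
    for s in V:
        policy[s] = max(actions,
                        key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                          for s2, p in T[(s, a)]))
    return policy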

Using value iteration and policy extraction, we can solve MDPs! However, we stated that we knew the transition and reward functions. In many real-world scenarios, we don’t know these explicitly! In other words, we don’t know which states are good/bad or what the actions actually do. Therefore, we can’t use value iteration and policy extraction anymore! We actually have to try to take actions in the environment and observe their results to learn. This is the heart of reinforcement learning: learn from experience!

Q-Learning

If we knew the transition and reward functions, we could easily use value iteration and policy extraction to solve our problem. However, in reinforcement learning we don’t know these!

Q-Learning is a simple modification of value iteration that allows us to train with the policy in mind. Instead of using and storing just expected reward, we consider actions as well in a pair called a q-value Q(s, a)! Then we can iterate over these q-values and computing the optimal policy is simply selecting the action with the largest q-value.

    \[ \pi^*(s) =\displaystyle\operatornamewithlimits{argmax}\limits_{a} Q(s, a) \]

The q-value update looks very similar to value iteration.

    \[ Q_{k+1}(s, a) =\displaystyle\sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q_k(s', a')] \]

We consider the expected reward, but we only look at the action that produces the largest q-value (\max_{a'} Q_k(s', a')). We can iterate over the q-values: the idea is to learn from each experience!

Here is the Q-Learning algorithm:

  1. Perform an action a in a state s to end up in s' with reward r, i.e., consider the tuple (s, a, s', r).
  2. Compute the intermediate q-value \hat{Q}(s, a) = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')
  3. Incorporate that new evidence into the existing value using a running average Q_{k+1}(s, a) = (1-\alpha)Q_k(s,a) + \alpha\hat{Q}(s, a) (where \alpha is the learning rate). This can be re-written in a gradient-descent update-rule fashion as Q_{k+1}(s, a) = Q_k(s,a) + \alpha(\hat{Q}(s, a) - Q_k(s,a))
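As a sketch, the update in steps 2 and 3 can be written as one small function over a tabular Q (the dictionary layout here is a hypothetical choice, not the CartPole code we write later):

# One Q-Learning update for an experience tuple (s, a, s', r).
# Q maps (state, action) -> value; actions is the set of available actions.
def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # intermediate q-value
    Q[(s, a)] += alpha * (target - Q[(s, a)])                    # running average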

Q-Learning, like value iteration, also converges to the optimal values, even if we don’t act optimally! The downside of Q-Learning is that we have to make sure we explore enough and decay the learning rate.

Why do we want our agent to explore? If we find one sequence of actions that leads to a high reward, we want to repeat those actions. But there’s a chance that there exists another sequence of actions that could produce an even higher expected reward. In practice, we pick a random action instead of the optimal one a fraction of the time; this fraction is denoted by \epsilon. This encourages our agent to try some random actions, i.e., to explore the state space more! This is called the \epsilon-Greedy Approach.

Initially, our agent doesn’t know what to do or what actions lead to good rewards, so we start with a relatively high value of \epsilon and decay it over time (because good actions become clearer as we play, and we should stop taking random actions). We may decay it to zero, which means we always take the optimal action after convergence. Or we can decay it to a very small value, encouraging the agent to try random actions every once in a while. The other parameter we decay is the learning rate, which helps the q-values converge to the optimal values.

Here is an example of running Q-Learning for our robotic environment.

Recall that the optimal policy is to take the action with the largest q-value at any state. If we did this for the above q-values, we would tend to go towards the higher reward and avoid the lower one.

Notice that there are some q-values that are still zero. In other words, we have arrived at a particular state where we haven’t taken a specific action. For example, consider the bottom-right square. We haven’t taken the down or right action in that state so we don’t know what value should be assigned to it.

Q-Learning Agent for CartPole

Now that we have an understanding of Q-Learning, let’s code!

Before starting, we’ll need to install OpenAI Gym (pip3 install gym) and ffmpeg (brew install ffmpeg). The OpenAI Gym provides us with a ton of different reinforcement learning scenarios with visuals, transition functions, and reward functions already programmed.

Now we’ll implement Q-Learning for the simplest game in the OpenAI Gym: CartPole! The objective of the game is simply to balance a stick on a cart.

This is a simple game that we can understand well. You can take the Q-Learning code we implement and generalize it to any of the games in the OpenAI Gym.

The state of this game is characterized by 4 quantities: position of the cart, velocity of the cart, angle of the pole, and angular velocity of the pole (x, \dot{x}, \theta, \dot{\theta}), respectively. But all of these quantities are continuous variables! We have to discretize the values by putting them into buckets, e.g., values between -0.5 and -0.4 will be in one bucket, etc. As for the actions, there are only two: move left or move right!

Let’s create a class for our Q-Learning agent.

import gym
import numpy as np
import math

class CartPoleQAgent():
    def __init__(self, buckets=(1, 1, 6, 12), num_episodes=1000, min_lr=0.1, min_explore=0.1, discount=1.0, decay=25):
        self.buckets = buckets
        self.num_episodes = num_episodes
        self.min_lr = min_lr
        self.min_explore = min_explore
        self.discount = discount
        self.decay = decay

        self.env = gym.make('CartPole-v0')

        self.upper_bounds = [self.env.observation_space.high[0], 0.5, self.env.observation_space.high[2], math.radians(50) / 1.]
        self.lower_bounds = [self.env.observation_space.low[0], -0.5, self.env.observation_space.low[2], -math.radians(50) / 1.]

        self.Q_table = np.zeros(self.buckets + (self.env.action_space.n,))

In the constructor, we use buckets to discretize our state space. In reinforcement learning, we train for a number of episodes, kind of like the number of epochs for supervised/unsupervised learning. We also have the minimum learning rate, exploration rate, and discount factor. Finally we have the decay factor that will be used for the learning and exploration rate decay.

We also bound the position and angle to be the same as the low and high of the environment. We manually bound velocity (\pm0.5 m/s) and angular velocity (\pm50 deg / s). Finally, we create the table of q-values.

In practice, we store all of the q-values in a giant lookup table. The rows are the state space and the columns are the actions. However, we characterized our state space as being a 4-tuple. So the resulting table will have 5 dimensions: the first four correspond to the state and the last one is the action index: (x, \dot{x}, \theta, \dot{\theta}, a). Given this tuple, we can access a scalar value in the table.
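For example, here is a small sketch of that indexing (the bucket indices below are made up):

import numpy as np

buckets = (1, 1, 6, 12)
n_actions = 2
Q_table = np.zeros(buckets + (n_actions,))   # shape (1, 1, 6, 12, 2)

state = (0, 0, 3, 7)   # hypothetical bucket indices for (x, x_dot, theta, theta_dot)
action = 1             # 0 = push left, 1 = push right
q_value = Q_table[state][action]             # same as Q_table[state + (action,)]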

Speaking of the q-values, let’s write the code that will update a value in the table. Recall that the update formula is the following (substituting in \hat{Q}(s,a)).

    \[Q_{k+1}(s, a) \gets Q_k(s,a) + \alpha[R(s, a, s') + \gamma \max_{a'} Q_k(s', a') -Q_k(s,a)] \]

We can translate this directly into Python code.

def update_q(self, state, action, reward, new_state):
    self.Q_table[state][action] += self.lr * (reward + self.discount * np.max(self.Q_table[new_state]) - self.Q_table[state][action])

Additionally, we’ll write update rules for the exploration and learning rates.

def get_explore_rate(self, t):
    return max(self.min_explore, min(1., 1. - math.log10((t + 1) / self.decay)))

def get_lr(self, t):
    return max(self.min_lr, min(1., 1. - math.log10((t + 1) / self.decay)))

This is just a form of decay. Essentially, we decay our exploration and learning rates until it is the minimum exploration and learning rate that we specified in the constructor.

We also need a function to choose an action using the \epsilon-Greedy approach. \epsilon percent of the time, we take a random action, but the rest of the time we take the optimal action. Recall that the optimal action is the following.

    \[ \pi^*(s) =\displaystyle\operatornamewithlimits{argmax}\limits_{a} Q(s, a) \]

We can translate that into code.

def choose_action(self, state):
    if (np.random.random() < self.explore_rate):
        return self.env.action_space.sample() 
    else:
        return np.argmax(self.Q_table[state])

Now we can write some code to discretize our search space and train our agent! The first function simply does some math to figure out which buckets our observation should be in.

def discretize_state(self, obs):
    discretized = list()
    for i in range(len(obs)):
        scaling = (obs[i] + abs(self.lower_bounds[i])) / (self.upper_bounds[i] - self.lower_bounds[i])
        new_obs = int(round((self.buckets[i] - 1) * scaling))
        new_obs = min(self.buckets[i] - 1, max(0, new_obs))
        discretized.append(new_obs)
    return tuple(discretized)

def train(self):
    for e in range(self.num_episodes):
        current_state = self.discretize_state(self.env.reset())

        self.lr = self.get_lr(e)
        self.explore_rate = self.get_explore_rate(e)
        done = False

        while not done:
            action = self.choose_action(current_state)
            obs, reward, done, _ = self.env.step(action)
            new_state = self.discretize_state(obs)
            self.update_q(current_state, action, reward, new_state)
            current_state = new_state

    print('Finished training!')

The training function runs for a number of episodes. We reset the CartPole environment and discretize the state (x, \dot{x}, \theta, \dot{\theta}). Then we fetch the exploration and learning rates at that episode. The inner loop iterates until the episode is finished, which is determined by the CartPole environment. We can figure out the exact conditions if we look inside the code of CartPole. The episode is done when the cart position or pole angle moves outside a threshold (or after a maximum number of steps).

Until then, we pick an action and perform that action. We get back a new state, reward, and other parameters. We use that discretized state and reward to update the q-value and update the current state.

After we finish training, we can render the CartPole environment in a window and see the cart balance the pole.

def run(self):
    self.env = gym.wrappers.Monitor(self.env, directory='cartpole', force=True)

    while True:
        current_state = self.discretize_state(self.env.reset())
        done = False

        while not done:
            self.env.render()
            action = self.choose_action(current_state)
            obs, reward, done, _ = self.env.step(action)
            new_state = self.discretize_state(obs)
            current_state = new_state

if __name__ == "__main__":
    agent = CartPoleQAgent()
    agent.train()
    agent.run()

Since we’re not training, I decided not to update the q-values, but you may choose to do so! By configuring a monitor, OpenAI Gym will create short clips like the following.

To summarize, we discussed the setup of a game using Markov Decision Processes (MDPs) and value iteration as an algorithm to solve them when the transition and reward functions are known. Then we moved on to reinforcement learning and Q-Learning. Finally, we implemented Q-Learning to teach a cart how to balance a pole. This is just the first step into Deep Q-Networks where the q-value table can be replaced with a neural network.

Reinforcement Learning, including Deep Q-Networks, is currently in use everywhere: from teaching computers to play games like Pacman, Chess, and Go to driving cars autonomously!

A Guide to Improving Deep Learning’s Performance – https://gamedevacademy.org/a-guide-to-improving-deep-learnings-performance/ (Tue, 03 Oct 2017)

Although deep learning has great potential to produce fantastic results, we can’t simply leave everything to the learning algorithm! In other words, we can’t treat the model as some black-box, closed entity that can read our minds and perform the best! We have to be involved in the training and design process to make sure our model is learning efficiently. We’re going to look at several different techniques we can apply to every deep learning model that will help improve our model’s accuracy.

Overfitting and Underfitting

To discuss overfitting and underfitting, let’s consider the challenge of curve-fitting: given a set of points, find the curve of best fit. We might think that the curve that goes through all of the points is the best curve, but, actually, that’s not quite right! If we gave new data to that curve, it wouldn’t do well! This problem is called overfitting. We’re discussing it because it is very common in deep architectures. Overfitting happens when our model tries so hard to correctly classify each and every example that it ends up modeling all of the tiny noise and intricacies of each input. Then, when we give it new data it hasn’t seen before, the model doesn’t know what to do! We say it generalizes poorly! We want our model to generalize to new, never-before-seen data. If it overfits, then we’ll get poor accuracy on new data.

Overfitting is also related to the size of the model and is, therefore, a huge problem in deep learning where we have millions of parameters! The more parameters we have, the better we can fit the data. Specifically for curve-fitting, we can perfectly fit a curve to N points using a polynomial of degree N-1.
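As a quick illustration of that last point (the points below are made up), numpy will happily fit a degree N-1 polynomial exactly through N points, which is exactly the kind of overfit curve described above:

import numpy as np

# Five made-up points: a degree-4 polynomial passes through all of them exactly,
# even though the underlying trend may be much simpler (and the points noisy).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.2, 1.9, 3.2, 3.8])
coeffs = np.polyfit(x, y, 4)        # N = 5 points, degree N - 1 = 4
print(np.polyval(coeffs, x) - y)    # residuals are (numerically) zero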

There are several ways to detect overfitting. We can plot the error/loss of the training data and the same for the validation set. If we see that our training loss is very small, but our validation set loss is still high, this is an indication of overfitting. Our model is doing really well on the training set but is failing to generalize. Another, similar, indication is the training set and testing set accuracy. If our model has a very high training set accuracy and very low test set accuracy, this is an indication of overfitting for the same reason: our model isn’t generalizing! To combat overfitting, we use regularization. We’ll discuss a few techniques later.
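In Keras (which this article’s snippets use), the training and validation losses for this check come straight out of the history object returned by fit. Here is a minimal, self-contained sketch with a made-up dataset and model, just to show where the numbers come from:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Made-up data and a small model, purely for illustration.
x_train = np.random.rand(200, 10)
y_train = np.random.rand(200, 1)

model = Sequential([Dense(64, activation='relu', input_shape=(10,)), Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Hold out 20% of the data for validation and compare the two loss curves.
history = model.fit(x_train, y_train, epochs=20, validation_split=0.2, verbose=0)
print(history.history['loss'][-1], history.history['val_loss'][-1])
# A training loss far below the validation loss is the overfitting signal described above.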

The opposite case to overfitting is underfitting. This is less of a problem in deep learning but does help with model selection. In the case of underfitting, our model generalizes so much that it actually misses the underlying relationship in the data. In the figure above, the line is linear when the data are clearly non-linear. This over-generalization also leads to poor accuracy!

To detect underfitting, we can look at the training and validation loss. If both are high, then we know that our model is underfitting. To prevent underfitting, we can simply use a larger model! If we have a 2-layer network, we can try increasing the number of hidden neurons or try adding more hidden layers. Both of these will help increase the number of parameters of our model and prevent underfitting.

This problem of overfitting and underfitting is called the bias-variance dilemma or tradeoff in learning theory. We’re referring to the bias and variance of the model, not the actual bias or variance of the parameters! The bias and variance represent how much we pay attention to the data itself. If we have a high-variance model, this means we’re paying too much attention to the intricacies of the data. A high-bias model means we’re ignoring the data entirely. The goal is to build a model with low bias and low variance.

We can explain overfitting and underfitting in terms of bias and variance. If we have an overfit model, then we say this is a high-variance model. This is because the overfit model is learning the tiny intricacies and noise of the data. An underfit model is a high-bias model because we’re completely missing and ignoring the underlying structure of the data.

To summarize, we can look at the training and testing loss to determine if our system is underfitting or overfitting. If both losses are high, then our model is underfitting, and we need to increase the number of parameters. If our training loss is low and our testing loss is high, then our model is overfitting and not generalizing well. In this case, we can use regularization to help prevent overfitting. Let’s discuss a few techniques!

Dropout

Dropout is a technique that was designed to counteract overfitting. In fact, the paper that introduced it is titled Dropout: A Simple Way to Prevent Neural Networks from Overfitting by Srivastava et al. In the paper, they present a radical new way to prevent overfitting and improve generalization.

(Dropout: A Simple Way to Prevent Neural Networks from Overfitting by Srivastava et al.)

The idea behind dropout is to randomly zero out the outputs of some of the neurons in a given layer. When we do this, we effectively “kill” those neurons. Then, we continue with the regular training process for each example in a mini-batch. The figure above is taken from the paper and pictorially explains this nullification.

Mathematically, each layer that has dropout enabled, which won’t be all layers, will have dropout neurons that each have some equal probability of being dropped out. For example, in the above 2-layer network, each neuron in the second hidden layer has a probability of being dropped out. For each example in each mini-batch, we drop out some neurons, pass the example forward, and backpropagate with the neurons still dropped out! This is the critical point: the neurons that were dropped out in the forward pass are also dropped out in the backward pass. Remember that neural networks learn from backpropagation. If we didn’t also drop out the same neurons during the backward pass, then dropout wouldn’t do anything! During testing, however, we enable all of the neurons.

This was a bit of a radical technique when it was introduced. Many researchers asked, “how could something like this help the network learn?” There is an intuition behind why this works: by thinning the network, preventing it from using all of the neurons, we force it to learn alternative or redundant representations because the network can never know which neurons will be unavailable for it to use.

As great as this technique sounds, let’s not be dropout-happy and apply it to each and every layer! There are a few instances where we don’t use dropout. For example, we never apply dropout in the output layer. This is because the output neurons produce a probability distribution over all of the classes. It doesn’t make sense to drop out any neurons in this layer because then we’re randomly ignoring some classes!

Dropout has implementations in almost all major libraries, such as Tensorflow, Keras, and Torch. For example, the following line of code in Keras will add Dropout to the previous layer with a dropout probability of 0.5.

model.add(Dropout(0.5))

In most cases, a dropout probability of 0.5 is used. This regularization technique of dropout is very widely used in almost all deep learning architectures!

There is one minor point to discuss when implementing dropout: output scaling. During training, we have to scale the outputs of the neurons. This is because we need to ensure that the training and testing phases’ activations are identically scaled, because, during testing, all neurons receive all inputs while, during training, a fraction of neurons see the inputs. However, since testing-time performance is very important, we often implement dropout as inverted dropout where we scale during training instead of testing.
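Here is a minimal numpy sketch of inverted dropout (an illustration, not any library’s internal implementation): at training time we zero out a random fraction of the activations and scale the survivors by 1/keep_prob, so no scaling is needed at test time:

import numpy as np

def inverted_dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations  # at test time, all neurons are used, unscaled
    # Randomly keep each activation with probability keep_prob and scale the
    # survivors so the expected value of the layer's output is unchanged.
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

print(inverted_dropout(np.ones((2, 4)), keep_prob=0.5))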

Weight Regularization

There are two other types of regularization that are a function of the parameters or weights themselves. We call these \ell_1 and \ell_2 regularization. These correspond intuitively to weight sparsity and weight decay. Both of these regularizers are attached to the cost function with a parameter \lambda that controls how much emphasis we should put on the regularization.

    \[ C = E + \lambda R(W) \]

Here, C is a cost function where E computes the error using something like squared error, cross entropy, etc. However, we introduce another function R(W) that is the regularizer and the \lambda parameter. Since our new cost function is a sum of two quantities with a preference parameter \lambda, our optimizer is going to perform a balancing act: minimize the error while also obeying the regularizer. In other words, the regularizer expresses a preference to do something. In the case of regularization, we’re going to prefer the optimizer to do something with respect to the weights. Let’s discuss the types of regularization!

We’ll first discuss \ell_1 regularization, shown below. This is also called weight sparsity because of what it tries to do: make as many weight values 0 as possible. The idea is to add a quantity to the cost function that penalizes weights with nonzero values. Notice that R(W) = 0 when all of the weights are zero. However, the network wouldn’t learn anything in this case, so the error term E prevents us from ever reaching the point where all weights are zero. In other words, the \ell_1 regularizer prefers many zero weights. This is called weight sparsity. An interesting product of \ell_1 regularization is that it acts as a feature selector. In other words, due to sparsity and the preference to keep zero weights, the weights that are nonzero are very important because they would have been set to zero by the optimizer otherwise!

    \[ R(W) = \displaystyle\sum_k |w_k| \]

Now let’s move on to \ell_2 regularization, shown below. This is sometimes called weight decay. Although they are technically different concepts, the former being a quantity added to the cost function and the latter being a factor applied to the update rule, they have the same purpose: penalize large weights! This regularizer penalizes large weight values by adding the squared size of all of the weights to the cost function. This forces our optimization algorithm to try to minimize the sum of the squared size of all of the weights, i.e., bring them closer to zero. In other words, we want to avoid the scenario where a few weights have really large values and the rest of the weights are zero. We want the scenario where the weight values are fairly spread out among the weights.

    \[ R(W) = \displaystyle\sum_k w_k^2 \]

One more point regarding weight decay and \ell_2 regularization. Although they are conceptually different, mathematically, they are equivalent. By introducing \ell_2 regularization, we’re decaying the weights linearly by a constant factor during gradient descent, which is exactly weight decay.
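To see why, take one gradient descent step with learning rate \eta on C = E + \lambda R(W) with R(W) = \sum_k w_k^2: the regularizer contributes a term that shrinks each weight by a constant factor, which is exactly weight decay.

    \[ w_k \gets w_k - \eta\left(\frac{\partial E}{\partial w_k} + 2\lambda w_k\right) = (1 - 2\eta\lambda)\,w_k - \eta\frac{\partial E}{\partial w_k} \]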

Like dropout, all major libraries support regularization. In Keras, we can add regularization to weights like this.

model.add(Dense(1024, kernel_regularizer=regularizers.l2(0.01)))

In this case, we’re using \ell_2 Regularization with a strength of 0.01. In practice, the most common kind of regularization used is \ell_2 Regularization and Dropout. Using these techniques, we can prevent our model from overfitting and help it generalize better!

We’ve covered some important aspects of applied deep learning. We discussed this problem of underfitting and overfitting and the bias-variance dilemma/tradeoff. Overfitting is more common in deep learning because we often deal with models with millions of parameters. We discussed a technique called Dropout that can help us prevent overfitting. The idea is to drop out neurons for each dropout layer. This forces our network to learn redundancies when training. During testing, we give all neurons all inputs again. We also looked at two other types of regularization that produce interesting weight characteristics: \ell_1 and \ell_2 regularization. The former produces sparse weights and acts as a feature selector: we have many zero weights, and the nonzero weights are the most important features. With the latter flavor of regularization, we prefer diffuse weights to a few weights with large values. This is similar to the technique of weight decay and reduces to the same thing mathematically.

The concept of regularization is widely used in deep learning and critical to preventing model overfitting!
