Explore Free Data Visualization Tutorials – GameDev Academy

A Bite-Sized Guide to Data Visualization

Lindsay Schardon — Fri, 23 Aug 2019 15:00:05 +0000

You can access the full course here: Bite-Sized Python Data Visualization

Part 1

In this video, we are going to look at two of the more common plots: column plots and bar plots. There is only a small difference between the two, and matplotlib lets us use an almost identical API for both.

Let’s start coding. Make sure the correct environment is selected, then launch Spyder (our development tool).

Let’s create a new Python script called column.py in the same folder as the pickle files (the matplotlib folder).

Let’s import the relevant packages and load the data.

# import matplotlib.pyplot under the alias plt
import matplotlib.pyplot as plt

# python object serialization library
import pickle

# load data using a with block (f is closed automatically after the block)
# 'rb' means "read binary data"
with open('fruit-sales.pickle', 'rb') as f:
    data = pickle.load(f)

print(data)

Let’s run this code to see the data.

We see a list of tuples, each holding the name of a fruit and the quantity sold. To make this easier to work with, let’s split the list into two separate sequences: one with the names and one with the numerical values.

# split the list of (name, quantity) tuples into two tuples
fruit, num_sold = zip(*data)
print(fruit)
print(num_sold)

Let’s run the code.

We can see the data in two separate tuples now.
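To make the unpacking step concrete, here is a minimal sketch that runs without the pickle file. The sample values are made up for illustration (the real ones live in fruit-sales.pickle); only the shape of the data, a list of (name, quantity) tuples, is taken from the tutorial.

```python
# Hypothetical stand-in for the unpickled data: a list of (name, quantity) tuples.
sample_data = [("apples", 10), ("oranges", 7), ("pears", 8)]

# The * operator passes each tuple to zip as a separate argument, so zip groups
# the first elements together and the second elements together.
fruit, num_sold = zip(*sample_data)

print(fruit)     # ('apples', 'oranges', 'pears')
print(num_sold)  # (10, 7, 8)

# Zipping the two tuples back together rebuilds the original pairs,
# which shows that the order of the items is preserved.
rebuilt = list(zip(fruit, num_sold))
print(rebuilt == sample_data)  # True
```

Note that zip produces tuples rather than lists, which is why the printed output uses parentheses.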

Let’s continue adding code to make the column plot. We need to tell matplotlib where to position the bars, i.e. give it x-coordinates, as in the diagram below.

The bars sit at the integer positions 0, 1, 2, 3 and so on, one per fruit. Let’s create a sequence containing these values.

# positions 0, 1, ..., len(fruit) - 1 (one per fruit)
bar_coords = range(len(fruit))

# tell matplotlib to plot this:
# the first argument gives the x-positions of the bars,
# the second argument specifies the height of the bars
plt.bar(bar_coords, num_sold)

# show plot
plt.show()

Let’s run this code.

As expected, we see the bars centered at 0, 1, 2, 3 and 4.
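The centering can also be checked in code rather than by eye. The sketch below is an illustration under a few assumptions: the fruit names and quantities are invented placeholders for the pickled data, and the Agg backend is selected only so that the script runs without a display (remove that line and call plt.show() to get a window).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical values standing in for the contents of fruit-sales.pickle.
fruit = ("apples", "oranges", "bananas", "pears", "grapefruits")
num_sold = (10, 7, 7, 8, 3)

bar_coords = range(len(fruit))
bars = plt.bar(bar_coords, num_sold)  # returns a container holding one rectangle per bar

# Each bar is a rectangle; its center is the left edge plus half the width.
centers = [round(rect.get_x() + rect.get_width() / 2, 6) for rect in bars]
heights = [float(rect.get_height()) for rect in bars]

print(centers)
print(heights)
```

The centers match the values in bar_coords because plt.bar aligns bars on their centers by default, and the heights match num_sold.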

The matplotlib documentation has a list of all functions we can use to produce various kinds of plots.

As we can see, there are APIs for all kinds of plots – histograms, spectrograms, 3D plots, contour plots etc.

Let’s look at the API documentation for bar plots.

We can see the parameters above. Scrolling down, there are other parameters we can change, such as color and linewidth.

At the bottom of the page are some examples of how to use the bar plot API.

Our column chart works, but it does not look very nice yet. We will add a few features, such as a plot title and axis labels, in the next video.

Part 2

In this video, we are going to make our plot look nicer by adding a title and axis labels. New code segments will be marked as # NEW.

Let’s start with the code we wrote for the plot so far.

There are 2 ways we can view these plots. One was in a separate window as we showed in the previous video.

The other way is to display it inside the IPython console. This is convenient when you don’t want to launch a separate window for the plot.

To do this, we need to change some preferences in Spyder. Open the Preferences dialog as follows:

In the Preferences window, go to IPython console => Graphics.

If the Backend option is set to Automatic, the plot will open in a separate window with buttons that you can use to manipulate the plot. If you select Inline, the plot will show in the console but there won’t be any buttons to manipulate the plot with. After making a change to this setting, you’ll need to quit Spyder and restart it.

Let’s start making our chart look a little better.

# import matplotlib.pyplot under the alias plt
import matplotlib.pyplot as plt

# python object serialization library
import pickle

# load data using a with block (f is closed automatically after the block)
# 'rb' means "read binary data"
with open('fruit-sales.pickle', 'rb') as f:
    data = pickle.load(f)

# split the list of (name, quantity) tuples into two tuples
fruit, num_sold = zip(*data)

# positions 0, 1, ..., len(fruit) - 1 (one per fruit)
bar_coords = range(len(fruit))

# tell matplotlib to plot this:
# the second argument specifies the height of the bars
plt.bar(bar_coords, num_sold)

# NEW
# add plot title
plt.title('Number of fruit sold (2017)')

# show plot
plt.show()

Running the code, we get:

We see the plot title we just added.

Let’s set an axis label for the y-axis.

# NEW
# label the y-axis
plt.ylabel('Number of fruit (millions)')

Running the code after the above addition, we see:

We see the label along the y-axis. The pyplot documentation shows how we can change the rotation of this label, for example by 180 degrees.
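As a hedged illustration of rotating the label: ylabel forwards extra keyword arguments to the underlying text object, so the standard rotation text property can be passed directly. The angle of 0 degrees and the labelpad value below are arbitrary choices for this sketch, not values from the course.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# rotation is a text property; 0 draws the label horizontally instead of
# the default vertical orientation.
# labelpad adds some space between the label and the axis (value chosen by eye).
label = plt.ylabel("Number of fruit (millions)", rotation=0, labelpad=70)

print(label.get_rotation())
print(plt.gca().get_ylabel())
```

Any other angle in degrees can be passed the same way, for example rotation=180.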

We can set up the x-axis so that, instead of 0, 1, 2, 3, 4, it shows the actual fruit names. We use the matplotlib xticks function. Because zip split the names and quantities in the same order, we don’t have to worry about the labels lining up with the right bars.

# NEW
# replace bar_coords with fruit names
plt.xticks(bar_coords, fruit)

When we run this code, we get:

We see the fruit names on the x-axis. Apples sold the most. Grapefruits did not sell as much. So column charts allow us to look at the data and infer information from it. The x-axis has categories, and bar charts are very good at depicting categorical data. The terms bar chart and column chart are often used interchangeably.

Summary

So we saw how to add annotations such as a title and a y-label to any matplotlib chart we work with. Like ylabel, there is an xlabel function that lets you label the x-axis. We chose not to use it since we already have fruit names on the x-axis.
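Putting the annotations together, the sketch below builds the finished chart in one go and adds the optional xlabel call for comparison. The fruit names and quantities are invented placeholders for the pickled data, the "Fruit" axis label is our own example text, and the Agg backend is used only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical values; the real ones come from fruit-sales.pickle.
fruit = ("apples", "oranges", "bananas", "pears", "grapefruits")
num_sold = (10, 7, 7, 8, 3)
bar_coords = range(len(fruit))

plt.bar(bar_coords, num_sold)
plt.title("Number of fruit sold (2017)")
plt.ylabel("Number of fruit (millions)")
plt.xticks(bar_coords, fruit)  # replace the positions 0..4 with the fruit names
plt.xlabel("Fruit")            # optional: the tutorial leaves this one out

# Read the annotations back from the current axes to confirm they were applied.
ax = plt.gca()
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(ax.get_title(), "|", ax.get_xlabel(), "|", ax.get_ylabel())
print(tick_labels)
```

In an interactive session, replace the read-back lines with plt.show() to display the chart.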

Transcript 1

Hello, world, my name is Mohit Deshpande. In this course, we’ll be learning all about plotting. So plotting is a fundamental aspect of doing any kind of data science, or really just science in general. It’s the ability to take your data and present it in a nice, clean way that’s easy for people to understand.

So, you see here we have different kinds of plots, and these are all using different plotting libraries that we’re going to be learning about. So the three big plotting libraries that are out there that we’re gonna be discussing is Matplotlib, Seaborn, and Bokeh, and all of them provide different advantages as to how you want to display your data, and they provide very nice APIs for you. They just consume your data and then present a very nice-looking plot that’s completely customizable. So this is what we’re gonna be learning, we’re gonna be learning the APIs of these libraries, as well as how to create beautiful plots with all of these libraries.

We’ve been making courses since 2012, and we’re super-excited to have you onboard. Online courses are a fantastic way to learn new skills, and I take a lot of online courses myself. Zenva courses consist mainly of video lessons that you can watch at your own pace, as many times as you want, you can always go back and rewatch videos as many times as you want. We also have downloadable source code, and project files, and data, and they contain everything that we’ll be building in the lessons.

It’s highly, highly recommended that you code along with me. In my experience, the best way to learn something is to get your feet wet, get your hands dirty with the code, so coding along with me will really help you get a good understanding of the code and what’s going on. And lastly, we’ve noticed that the students who get the most out of these online courses are the same students who make some kind of weekly planner or schedule and stick with it, depending, of course, on your own availability and learning style.

So over the past eight years or so, Zenva has taught all different kinds of topics on programming and game development to over 300,000 students, and this is across about 100 courses. And the skills that they’ve learned in these courses are completely transferable to other domains as well. In fact, some of these students have used the skills that they’ve learned in these courses to advance their own careers, to start a company, or publish their own content from the skills that they’ve learned. Thanks for joining, and I look forward to seeing all the cool stuff that you’ll be building.

Now without further ado, let’s get started.

Transcript 2

So let’s get started with some imports. And so the first thing we need to import is of course Matplotlib and all of the plotting functionality is in a submodule called pyplot. We want to load our data with open, and then inside here we want the file name. And our data’s stored in a file called fruit-sales.pickle. And this would just open it as a text file, but for efficiency, I’ve saved the data in a binary format, so we have to tell Python to read from a binary format, so that’s what this RB stands for.

So we’re now reading binary data and then as F links this, it creates a variable F that represents this file. Okay and then I use a colon here to set up an indentation block. So inside of this block, now I can do anything with F, this file, that I want, and then after I get out of this indentation block it will automatically close the file for me. So then I’ll just load my data. So let’s see our data, let’s see our data somewhere on this guy.
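The file handling described here can be sketched as a round trip that first writes a small pickle file and then reads it back. The sample data and the file name sample-sales.pickle are made up for this illustration; the course supplies its own fruit-sales.pickle.

```python
import os
import pickle
import tempfile

# Hypothetical sample in the same shape as the tutorial's data.
sample_data = [("apples", 10), ("oranges", 7), ("pears", 8)]

# Write to a temporary directory; the file name is invented for this sketch.
path = os.path.join(tempfile.mkdtemp(), "sample-sales.pickle")

# 'wb' = write binary: pickle produces bytes, not text.
with open(path, "wb") as f:
    pickle.dump(sample_data, f)

# 'rb' = read binary, matching how the file was written.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(f.closed)               # True: leaving the with block closed the file
print(loaded == sample_data)  # True: the data survives the round trip
```

One caution from the Python documentation applies to any pickle file: only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.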

As you can see we have, it’s actually a list of tuples. Each tuple has a name of a fruit, and then the quantity that sold. So it’s in this format, but what we can do to make our lives easier to work with, is we can split this list of tuples out into two separate lists. So essentially, we’ll have one list that has all of the fruit, and another list that has all of the numerical values.

So using a Python function called zip, so I can say fruit and num_sold equals zip, and then there’s a special operator that we have to use here, there’s a special syntax. It’s star, or asterisk, and then data, and what this does is this is going to split our list of tuples into two separate lists. So fruit will have all of our fruit here, and then num sold will have all of our numerical data. So this is what we want the result to kinda look like.

We need to tell Matplotlib where to put these bars, so we need to create a list where we say zero, one, two, three, and then we can give that to Matplotlib, and Matplotlib will know where to position it. Bar coords, range, length of fruit. So what this will do is this will create essentially a list that starts at zero and goes up to however many fruit we have.

plt.bar, and the first input to this guy is going to be these coordinates, so I have to tell Matplotlib hey here are the coordinates. And then the second argument is how tall do I want the bars? And that’s just the num_sold here. Now this is going to set up our bar plot, and then to show it we have to call plt.show. So I’m gonna call this guy and now that I have this, I can actually run this and we should be able to see our plot.

So here is our plot. So you see we have these bars here, and they’re positioned at zero, one, two, three and four because we told Matplotlib to do that.

Transcript 3

In this video, we are going to make our plot look a bit better and then add things like labels on the axes, and the title. So we can have a look at a function on plot called title that I can add, and you can look at the documentation for there’s a bunch of other arguments that you can go with this. But we will just say, let’s give it a simple title. So number of fruit sold 2017. So here in the figure that shows up, we have a title at the very top here. There are other configuration options as to where do you wanna put the title.

The other thing that we want to do is set a title or an axis label for the y-axis, y-label, and that will label the y-axis for us. So I can say number of fruit, and this will be in the millions. And we see that on the left, you see that there is the Number of fruit (millions) label on the y-axis.

The one last thing that we need to do is set up the x-axis so that instead of saying 0, 1, 2, 3, 4, we use the actual fruit: xticks. We are essentially going to take the numbers and replace them with ticks. So we need to tell Matplotlib which of the numbers we wanna replace and what we wanna replace them with. So we wanna replace these bar coordinates here with the fruit here. So if I run this guy again, you see that now, we have replaced the 0, 1, 2, 3, 4 with the actual names of the fruits, and we have a completed chart here.

So you can see that apples sold the most and grapefruits apparently didn’t quite sell as much, and oranges and bananas were very close, pears were a little bit more, but not quite as much as apples. So using this bar chart, we can look at different kinds of information and gain some insights out of this. So these kinds of column charts are great for categorical data, so here in the x-axis, we have categories, and then we have a numerical value assigned to each category. So these kinds of bar column charts are great for this kind of categorical data.

And then I’ll just make one last subtle point is that these are technically called column charts, but you’ll hear people call them bar charts all the time. They’re really just interchangeable depending on how you wanna show the bars.

Interested in continuing? Check out the full Bite-Sized Python Data Visualization course, which is part of our Bite-Sized Coding Academy.

Getting Started with Data Visualization in Python

Zenva — Fri, 08 Mar 2019 05:00:00 +0000

You can access our newest course on Data Visualization here: The Complete Python Data Visualization Course

Transcript 1

Hello, everybody. My name is Mohit Deshpande and thanks for joining me. And in this course we’re gonna be learning a lot about data visualization.

And so what we’re gonna be looking at is, we’ll be able to build something like this, in fact we’ll be building this exact visualization. And you can see that we can do all sorts of neat things with this kind of visualization. And it’s really useful for if you have a lot of data and you want to try to gain some insights about that data.

It’s sometimes helpful to visualize it, but the question might be, what kind of visualization, or what kind of plots, should I be using? And so hopefully we’re gonna be answering that as we kinda progress through this course. And so before we get started with the actual plotting of the stuff, we have to know a little bit about statistics. And so I’m just gonna have a very brief and just a very scratch-the-surface sort of thing, of statistics, and then we can get right into doing plotting.

And so we’re gonna start with some of the more basic plots, like you’ve probably heard or seen the bar charts, you’ve seen lots of line plots and scatter plots, and so on. And we’re gonna look at some of the more advanced plots, like there are quiver plots, which are used for vector fields. There are 3D line and surface plots, like the one that I just showed. And then we’ll also talk how we can kind of arrange these multiple plots into just one figure.

So we’ve been making courses since 2012 and we’re super excited to have you on board. Online courses are a great way to learn new skills and I take a lot of them myself. ZENVA courses consist mainly of video lessons that you can watch at your own pace and as many times as you want. We also have downloadable source code and project files that contain everything that we build in the lessons. And it’s highly recommended that you code along with me. That’s, in my experience, the best way to learn a new skill is to actually code along.

And lastly, we’ve seen that students who get the most out of these online courses are the same students who kind of make a weekly plan and stick with it depending on their own availability and learning style. And remember that you can watch and rewatch these video lessons as many times as you want. So this really gives you more flexibility. And at Zenva, we’ve taught programming and game development to over 200,000 students over 50 courses. That’s since 2012. And some of the students have used the skills that they’ve learned in these courses to advance their own careers, start a company, or publish their own games and apps. So thanks for joining, and I look forward to seeing all the cool stuff you’ll be building. Now, without further ado, let’s get started.

Transcript 2

Hello everybody my name is Mohit Deshpande. In this video, the kind of chart that I want to cover is called a bar chart.

So what we’ll be doing is, I’ll show you how we can use Matplotlib, which is the library that we’re going to be using, and we can plot out a bar chart. And the bar chart is just going to have, we’ll do it first with one series of data and then we’ll do like two series of data. And so I’ll show you how we can get that. And then a ton of different stuff that we can do with bar charts. First of all, you’ll notice that I have a ton of imports here and just ignore them for a second, cause we’re going to get to them.

At some point we’re going to use all these, but I’ll just ignore that for a second. Probably one of the more important things is plt.show, which will show the actual graph. And plt, that’s this matplotlib.pyplot.

All the plotting we’re going to do is going to be using functions on this plt. So, that’s kind of how Matplotlib works. To actually discuss a bar chart, so what kind of data is a bar chart good to show? Well, we can use bar charts to show categorical data. What I mean by that is, suppose that we wanted to know how many people have A’s and B’s and C’s and D’s and F’s in some classroom or something. So what we can do is, we can make a bar chart so that on the X axis of the bar chart will be what grade they have and then on the Y axis will be how many received that grade. And then we can actually split it up.

Because maybe for the particular class, we could split it up by people who had one Professor, or people who had another Professor. Or we could split it up in many different ways, but that’s just kind of an example of what we could use bar charts for: when you have discrete data that you want to plot, like counts or something like that. First thing that we’ll do is, we need a number of bins, and what I mean by bins is how many ticks on the X axis do we have. We have five, right? A, B, C, D, and F.

Then we can actually create our data, and for this I’m actually going to use NumPy. NumPy has some pretty great functions, so numpy.random has some great functions. So I can go randint. I pass in the minimum value, the maximum value, so let’s go zero to 100. And then I pass in how many of those points that I want to generate.

In this case I want to generate one per bin. And so what’s going to happen is, for each of A, B, C, D, and F we’re going to have a number, and that’s going to be bar one. I’m going to create bar two a little bit later, but after I have this, I actually need these indices, and the indices are what Matplotlib will use to kind of organize. So the first index is going to get the first value in bar one, the second index will get the second value in bar one, and so on. This is good because it works when we have bar charts that have multiple bars, or multiple series I should say. So a quick way to do that is np.arange, not “arrange” but “a range”. And then you say number of bins, and then that will get you all the indices.

And then here comes the magic, you actually plot it. So plt.bar and then pass in the indices. And then we pass in the data, which is just bar one. And there’s some other options that we can use, but I’m going to leave those blank for now. And so this is like the bare minimum that we need and I can just kind of plot this and we can see what happens.

Okay so here is my bar chart. And so it just generates some random numbers from zero to 100 and then here’s what the bar chart actually looks like. These bars are really thick though, so I’m actually going to thin them out. So I’m going to exit out of this. Then we put another parameter here called bar width, that we’ll set to like 0.25. And we can actually make these bars a bit skinnier, and we’ll need to do that when we go to have multiple bars in just a second here.

I can rerun this and then now my bars are skinnier. So this works out well, oh wait a minute, the values that are showing here aren’t actually A, B, C, D, or F. So we can change that, very simple. We say plt.xticks and then we give the indices that we want, and then we just give it a tuple of the values that we want, and since we have five we can just say A, B, C, D, and F. And so now that should label them instead of giving them numerical values at the bottom. And there we go. So now you see that instead of these numerical values at the bottom, we actually have A, B, C, D, and F. Okay, great.
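Since the transcript describes the code without showing it, here is a minimal reconstruction of the single-series chart. The variable names (num_bins, bar1, indices, bar_width) are our guesses based on what is said aloud, the fixed random seed is an addition for reproducibility, and the Agg backend is used only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # fixed seed so repeated runs give the same bars (not in the video)

num_bins = 5                                # one bin per grade
grades = ("A", "B", "C", "D", "F")
bar1 = np.random.randint(0, 100, num_bins)  # random counts from 0 up to 99 (upper bound is exclusive)
indices = np.arange(num_bins)               # the positions 0, 1, 2, 3, 4
bar_width = 0.25

bars = plt.bar(indices, bar1, bar_width)    # third positional argument is the bar width
plt.xticks(indices, grades)                 # show A..F instead of 0..4

widths = [rect.get_width() for rect in bars]
tick_labels = [tick.get_text() for tick in plt.gca().get_xticklabels()]
print(bar1)
print(widths)
print(tick_labels)
```

Without the third argument, plt.bar falls back to its default width of 0.8, which is why the first version of the chart had much thicker bars.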

But how well does this scale if we have multiple series? And what I mean by multiple series is that when we ran this, there were just basically bars in one color. So what if we wanted to have another bar because maybe we want to split it up by instructor or something like that. Maybe you take the same course with a different instructor and maybe one instructor’s students tend to do better than another instructor’s students, or something like that. And so we can do that also very easily, in fact, I’m just going to copy this here. And I’ll make bar two, and it’s going to be just the same as bar one except now I have to put it in the bar chart. And that’s going to be a bit more complicated.

Basically what we can do here is say plt.bar and then instead of indices I have to do indices plus bar width. And what that does is it shifts the bars, cause if we just have indices what’s going to happen is bar two is going to be overlaid on top of bar one and we don’t want that. And we know what the width of bar one is, it’s bar width. So now what this will do is put the two right next to each other in the bar chart. Then I can say bar two and then I also have to specify the bar width.

And then if you want to have three bars for example, then the next one would be plt.bar, indices plus two times bar width. And that would shift your third bar over so it’s in the right position and it’s not being overlaid by anything.

But anyway that’s basically how that works, and we can run this and see what our result is. Okay so now we have you know like two bars like that. There’s some cleanup stuff that we can do. For example, we can change the colors cause these bars are both the same color and that’s not really desirable. So we can change the colors very easily. In plt.bar there’s a parameter called color, and then we can pass it a single letter for the color. So like b is blue, and then for this one we’ll do color = g for green.

And I also want to label these bars. And so when I create a legend, I can actually give them a label. And so I can say like, this represents the scores of Professor one. So I can say label = Professor one or something. And then here I can also say label = Professor two. And then I can do a legend, so plt.legend, and that’ll automatically generate a legend just based off of these labels. So when I run this, I get this result. You’ll notice all the bars are colored now so that they’re nice, and I have Professor One and Professor Two.

I’m going to do one quick thing before we move on and that is, I’m going to move these labels so that they’re actually kind of in between these two. And that’s also pretty simple. And I’m also going to label my axes as well because it’s always important to label your axes and give them titles. So I can do the axis labels real quick. So plt.xlabel labels the X axis, so this will be like final grade, and then the ylabel function just labels the Y axis. So this will be score, or more accurately I should say score frequency, which is how many students received that score.

And now if I want to shift over my x ticks here, how much do I want to shift them over by? Well, I’m in the same predicament as when I did my second bar. I can just shift them over by my bar width, so I’m going to say copy, paste, and now what’ll happen is they will be positioned so that they’re one bar width over. And so they’re in between the two. See, so now I have my axes that are labeled, frequency and final grade, I have my Professors legend, and I have my x ticks nice and in between the two.
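The two-series chart can be sketched as follows. The variable names and the fixed seed are our assumptions, as the transcript does not show the code. One deliberate difference from the video: the ticks are shifted by half a bar width rather than a full one. To the best of our knowledge, Matplotlib 2.0 and later align bars on their centers by default, so the midpoint of each pair sits at indices + bar_width / 2; a full bar width is the right shift only when bars are edge-aligned, as in older releases or with align='edge'.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # fixed seed for reproducibility (not in the video)

num_bins = 5
grades = ("A", "B", "C", "D", "F")
bar1 = np.random.randint(0, 100, num_bins)
bar2 = np.random.randint(0, 100, num_bins)
indices = np.arange(num_bins)
bar_width = 0.25

# The second series is shifted right by one bar width so it sits beside the first.
plt.bar(indices, bar1, bar_width, color="b", label="Professor 1")
plt.bar(indices + bar_width, bar2, bar_width, color="g", label="Professor 2")

plt.xlabel("Final grade")
plt.ylabel("Score frequency")

# With center-aligned bars, the middle of each pair is half a bar width
# to the right of the first bar's center.
tick_positions = indices + bar_width / 2
plt.xticks(tick_positions, grades)

legend = plt.legend()  # built automatically from the label= arguments
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
print(plt.gca().get_xticks())
```

For a third series, the same pattern continues with indices + 2 * bar_width, and the tick positions move to indices + bar_width so they stay under the middle bar.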

So I’m going to stop right here. We can also add a title, but I’ll get to that later. So in this video we discussed quickly how we can make a bar chart. And I discussed some of the stuff that we can do specifically in Matplotlib and that is, we call plt.bar, pass in the indices and the data, and so then it’s going to take each position in the indices and plot the bar corresponding to that coordinate. And then we also did the bar width and then we can give it a color and a label, and the label’s for the legend that’s showing up on the top right. We can also change the position of the label if we wanted to.

But anyway, if we wanted to have more than one bar then we’ll have to make sure that we shift the bars over by some bar width. So plus one times bar width, plus two times bar width, and so on. We can label our X and Y axes using these functions. And with xticks, we can position the X labels, so that we don’t have numerical values, we have categories, like A, B, C, D, or F. And so that is how we can plot bar charts in Matplotlib.

Transcript 3

Hello everybody, in this video I want to explain how we can plot a histogram in matplotlib, and I’ll also mention what a histogram is in matplotlib.

So what I’m gonna be doing from now on is I’m actually gonna be taking this section and commenting it out. So that way if you wanna know how to make a certain kind of plot you can always just uncomment the section by removing this line here and this line here, and then you’ll show the plots. Anyway, that’s sort of bar chart, and in this video what we’ll be doing is plotting histogram. So you might be asking first of all, what is a histogram?

A histogram is very similar to a bar chart, but there are some technicalities. For example, with a bar chart the bars usually don’t touch, but with a histogram the bars do touch. Histograms are particularly important in the field of probability and statistics. There’s a thing called distributions, and distributions have a certain shape. And to know what the shape of a distribution is, you have to look at the histogram. And by taking random values from a distribution, you should be able to generate a histogram that looks similar to what the true shape of the distribution is.

I’ve been saying this word a lot, but we can actually go ahead and plot the histogram. One cool thing that we’re gonna be doing is, I’m going to explain a little bit about what this thing called a normal distribution is. I want to show you basically what it looks like first, because if you know what it looks like, then when we draw the bar chart, it might be a little bit clearer what the final result should look like, I should say.

Before we actually get to plotting the histogram, I wanna take a second and just show what the normal distribution looks like. Actually, the normal distribution itself is very very widely used, and there’s a lot of stuff that discusses normal distribution. So what I wanted to do is to take a few seconds, and just at least visually draw it out so that you have some idea of what it looks like. So what we’re talking about is the normal distribution.

There are of course other kinds of distributions, but probably the most famous one is the normal distribution. So when I plot this space with the histogram, it’s gonna be like, that was a really bad line. It’ll basically be kind of like this here, and so here’s my line. The way that it looks like is we have some mean. Suppose here is my mean. And it turns out that the peak of the normal distribution is at that mean, so it kind of looks like this, curling down here, and then going back to the top here, and then down like that. And so this is what kind of like the normal, and then these actually go out to infinity.

These actually, whoops, that wasn’t good. But these technically do trail outward to infinity in both directions, and so this is basically what the normal distribution looks like, it’s kind of like this curve like this.

And so here is what the mean of the normal distribution is, and the variance like what we talked about actually determines how far this is out. Here’s one with a fairly small sigma, or variance. I can have a much larger, if I have a larger variance then what happens, it will end up with something that’s gonna be more like this. And so you can see that the values are like way more, this peak here is smaller because all my values are being spread out more, as opposed to this. And I guess technically if it was to look like this then it should devolve very quickly here. And then this will be a bit more, flatter basically, is what I’m trying to say.

This is the normal distribution, so when we sample from a normal distribution, then we will, then we can expect to see something like this. I can generate a normal distribution, I just need a mu and a sigma. I can say here’s my mu, here’s my sigma, and so we’ll just do something like, well here I can actually just, mu equals zero and sigma equals one, it’s a unit normal. What we can do is generate all of these values. So we can say something like vals equals mu plus sigma, times np.random.randn. This will basically just generate 1000 points that fit this normal distribution whose mean is zero and whose sigma is one.

And we can plot this again, let’s do plt.hist, and hist is histogram. I can just pass in my values, and then another parameter I can pass in is the number of bins, basically. I can plot that, and so this is really all we need to plot our histogram.

So I can run this, it only takes a second, and you see that hey, it actually does, compare what this looks like here, to what this looks like. You see that they’re not exactly the same, but it fits the general scheme, right? And that’s basically what happens when we do this sampling is that we are taking, we’re just kind of picking data from this smooth curve here and we’re just like picking discrete data points. So when I pick these discrete data points, I get a histogram that looks like this. And the more data points that I pick and plot, the smoother this will look like, and the closer that this distribution actually gets the true normal distribution. Anyway that is a histogram.

Another thing, again, for clarity what we should do is plot the, we should do the xlabel. The xlabel on the x would just be like bins or values. Histograms usually have bins. For the y label, that is the frequency, or the frequency or you can think of it as like a probability, but we’ll say like frequency. And we can also give it a title, by the way, we can say title is Normal Distribution sampled. And what I mean by sampling is that’s basically what we’re doing here is we’re just picking 1000 values that fit the normal distribution, and then we can just plot them.

And the more points that we pick from that distribution, the closer it’ll look. Then one extra component that I wanna mention is we can actually have a grid. So I can do plt.grid(True), and that will actually show a two-dimensional grid, so it’ll become clear when I run it and I show you.
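The histogram steps described so far can be sketched as below. The bin count of 50 is an assumed value, since the transcript does not say which number was used; the fixed seed is an addition for reproducibility, and the Agg backend is used only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # fixed seed for reproducibility (not in the video)

mu, sigma = 0, 1    # unit normal, as in the video
num_samples = 1000
num_bins = 50       # assumed value; the video does not state the bin count

# randn draws from the standard normal; scaling by sigma and shifting by mu
# gives samples with mean mu and standard deviation sigma.
vals = mu + sigma * np.random.randn(num_samples)

# hist returns the count in each bin, the bin edges, and the drawn rectangles.
counts, bin_edges, patches = plt.hist(vals, num_bins)
plt.xlabel("Bins")
plt.ylabel("Frequency")
plt.title("Normal Distribution (sampled)")
plt.grid(True)

print(int(counts.sum()))  # every sample lands in exactly one bin
print(len(bin_edges))     # one more edge than there are bins
```

Raising num_samples makes the outline of the bars follow the bell curve more closely, which is the effect described in the video.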

So now you can see that we actually get this nice grid here. We get this kind of normal distribution, I got my labels on and everything looks good, and you can see here’s my frequency: the most frequent points are around zero. That makes sense, right, because zero is the mean and I set the standard deviation to be small. Sigma is what’s called the standard deviation, which is the square root of the variance. That’s beside the point, but the same principles apply. So the larger my sigma value, the more spread out this is gonna be. What I can do is increase my sigma value and we can see how that changes. So I’ll change it from one to 10.

When I run it you can see that it’s not as pointy. It’s still a little pointy, but it’s a bit flatter, if you notice. Let me change it again; I’m gonna change it to 100 now. And you can see now it’s way more spread out. The shape looks the same, but look at the extremes on the x-axis: we’re spreading out our normal distribution. If I were to zoom in at zero, you’d see that it’s actually quite flat locally. The point is that Matplotlib automatically zooms out for you. So I’m just gonna change that back to one here, and we get our distribution. Notice the scale of this, right? Anyway, that’s where I’m gonna stop.
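The effect of sigma can also be checked numerically. The sketch below is an illustration rather than the video’s code: it reuses the vals = mu + sigma * (unit normal sample) recipe, with random.gauss standing in for np.random.randn, and the function name spread_for_sigma and the fixed seed are assumptions.

```python
import random
import statistics


def spread_for_sigma(sigma, n=1000, mu=0.0, seed=0):
    """Return (sample std, min, max) for n draws of mu + sigma * unit normal."""
    rng = random.Random(seed)
    vals = [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]
    return statistics.pstdev(vals), min(vals), max(vals)


# Same seed for every sigma, so only the scale of the values changes
for sigma in (1, 10, 100):
    sd, lo, hi = spread_for_sigma(sigma)
    print(f"sigma={sigma:>3}: sample std={sd:8.2f}, range=({lo:8.2f}, {hi:8.2f})")
```

Because the same unit-normal draws are just multiplied by sigma, the shape is identical and only the axis range grows, which is why the plot looks similar once Matplotlib rescales the axes.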

So in this video we discussed histograms. They’re a great way to look at distributions of data: if you have a ton of data and you wanna know whether it fits a particular distribution, you can throw it into a histogram, pull up figures of histograms of different distributions, and see how closely your data fits a particular one. So anyway, that is the histogram in matplotlib.

Interested in continuing? Check out The Complete Python Data Visualization Course, our latest data visualization course, which is part of our Data Science Mini-Degree.

How to Use Machine Learning to Show Predictions in Augmented Reality – Part 3

Elisa Romondia — Wed, 23 Jan 2019 05:00:12 +0000

Part 2 Recap

This tutorial is part of a multi-part series. In Part 1, we loaded our data from a .csv file and used Linear Regression in order to predict the number of patients that the hospital is expected to receive in future years. In Part 2 we improved the UI and created a bar chart.

Introduction

Welcome to the third and last part of this tutorial series. In this part, we will use Easy AR in order to spawn our patient numbers histogram on our Image Target in Augmented Reality.

Tutorial Source Code

All of the Part 3 source code can be downloaded here.

Easy AR

Easy AR SDK is an Augmented Reality engine that can be used in Unity. It allows us to set an image as a target: Easy AR detects this image target in the live camera feed of a device such as a smartphone or tablet, and when the image is detected the magic happens and our object spawns.

First, we need to create a free account on the Easy AR website, using just an email and a password. This is the link to the website.

After signing in to the Easy AR website, create an SDK license key: in the SDK Authorization section, click on “Add SDK License Key” and choose the “EasyAR SDK Basic: Free and no watermark” option. You are required to fill in the Application Details, but don’t worry too much about this because you can change those values later.

Our project now appears in the SDK Authorization section. Let’s copy our SDK License Key because we will need it later.

Let’s download the Easy AR SDK Basic for Unity from this page.

Now let’s open our project in Unity. After the project has loaded, open the downloaded Easy AR package by clicking on it.

Unity should show the decompression of the package, like in the screenshot.

After the decompression, Unity will ask to import the files. Just click on Import and wait a little bit.

Now open the scene with the graph, which in my case is tutorial_graph.

Let’s add the EasyAR_Startup prefab to the scene by dragging it from the Unity file browser to the Hierarchy section, like in the screenshot below. You can find this prefab in the Assets/EasyAR/Prefabs folder.

Select EasyAR_Startup from the Hierarchy section. In the Inspector you should see a “key” textbox; paste your Easy AR key here. We obtained that key earlier, when creating our project on the Easy AR website.

Let’s remove our Main Camera, because we don’t need it anymore: the device camera now takes its place.

Now if you hit the play button you should see the feed from your computer’s webcam.

The image target

We need to create the folder structure for our image target and materials. First, inside the EasyAR folder, create a “Materials” folder.

Also inside the EasyAR folder, create a “Textures” folder.

Create a “StreamingAssets” folder inside the Assets folder.

Choose an image target. I suggest a square image; the best option is a QR code, but feel free to use the logo of your company or something similar. Remember that the image should have clear edges and strong contrast, otherwise it will be really hard for the camera to detect it. For example, light grey text on a white background will be almost impossible to recognize.

I used this image of a patient and a medic to stay in theme with our project.

Copy your target image into Assets/StreamingAssets and Assets/EasyAR/Textures.

Now create a new material called Logo_Material inside EasyAR/Materials.

Drag the logo image onto the Albedo input of the material.

This should be the resulting material: instead of being plain gray, the material preview sphere now displays your texture image.

Now we need to drag the ImageTarget prefab from the EasyAR/Primitives folder into the Hierarchy section. This will enable us to use the Image Target.

Select the ImageTarget object and, in the Inspector, fill in the ImageTarget input fields: the Path with the file name of your image (in my case “target.jpeg”) and the Name (in my case “imagetarget”).

We need to specify a size for our image target. I will use 5 for both x and y, but feel free to use any size.

This is really important: we need to change the image target’s Storage setting to Assets.

Now let’s drag the ImageTracker from EasyAR_Startup onto the Loader input of the image target. This tells Easy AR which image tracker to use in order to detect the image in the device camera feed.

In the Materials section of the image target, we drag our Logo_Material onto the Element 0 input.

Now, if everything was done correctly, the image target should display the image we chose before.

Let’s test our work. Create a cube inside the ImageTarget object and call it “graph_anchor”. We will use it later as a reference object to anchor our patients’ number graph on the image target.

Hit play and the cube should appear over the image target if the device camera detects the image. Don’t worry if you have to wait a minute or so for Unity to switch on your device camera; it’s a slow process during development, but it will be faster once the project is built as an app.

Now we need to make that cube invisible so that only our future histogram will be visible on the image target. Let’s create a new material called “Anchor_Material” inside the Assets/EasyAR/Materials folder. In order to make it fully transparent (invisible), we need to set the Rendering Mode to Fade.

The alpha of the Albedo color must be set to 0, as you can see in the screenshot below.

Now drag the Anchor_Material onto the graph_anchor cube. This should be the result.

Time to save the tutorial_graph scene!

Augmented Reality Histogram

In the last tutorial we generated our data visualization; now it’s time to edit that code in order to make the histogram spawn over our image target. This is really tricky because augmented reality doesn’t always detect our target perfectly, so don’t get frustrated if something goes wrong on the first try.

Let’s write a test function inside the “GenDataViz.cs” script in order to print a console message when the image target is detected

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.
using TMPro; // required to use text mesh pro elements

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    //Add a label for the prediction year
    public TextMeshProUGUI textPredictionYear;

    //Check if image target is detected
    public GameObject Target;
    private bool detected = false;

    //The update function is executed every frame
    void Update()
    {
        if (Target.activeSelf == true && detected == false)
        {
            Debug.Log("Image Target Detected");
            detected = true;
        }
    }

   //continues with the rest of the code...
}

A little explanation of what the new code does:

“public GameObject Target” is the object that will hold our ImageTarget prefab.
“private bool detected = false” helps us identify the first detection.
The Update function is a standard Unity function like Start(). It is executed every frame, which makes it useful when you have to check something continuously, but beware because it can be expensive on the CPU side. You can find more details about the Update function in the official Unity docs.
Inside the Update function, there is an if that checks whether the Target is active (that is, whether the image is detected) and whether this is the first detection. We check for the first detection in order to avoid repeating expensive, useless operations every frame.
Inside the if, we print a simple console message and set detected to true after the first detection.
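The first-detection check is a small “run once” latch. As a language-neutral illustration, here is a Python sketch of the same logic outside Unity; the DetectionLatch class and the simulated list of frames are hypothetical stand-ins for the MonoBehaviour and for Target.activeSelf, not part of the tutorial’s project.

```python
class DetectionLatch:
    """Runs a callback only on the first frame in which the target is active."""

    def __init__(self, on_first_detection):
        self.detected = False
        self.on_first_detection = on_first_detection

    def update(self, target_active):
        # Mirrors: if (Target.activeSelf == true && detected == false)
        if target_active and not self.detected:
            self.detected = True
            self.on_first_detection()


calls = []
latch = DetectionLatch(lambda: calls.append("Image Target Detected"))

# Simulated frames: the target is found on the third frame, lost, then found again
for active in [False, False, True, True, False, True]:
    latch.update(active)

print(calls)
```

Note that the flag is never reset, so with this logic nothing happens when the target is lost and detected again; that matches the script as written in this tutorial.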

Select the GraphContainer from the Hierarchy section and drag the ImageTarget inside the Target section, as the screenshot shows below

Now if you hit play and the camera detects the image target, the console should print this message “Image Target Detected”

Our patients’ number histogram should be created when the image target becomes active, so let’s edit the “GenDataViz.cs” script

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.
using TMPro; // required to use text mesh pro elements

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    //Add a label for the prediction year
    public TextMeshProUGUI textPredictionYear;

    //Check if image target is detected
    public GameObject Target;
    private bool detected = false;

    //The update function is executed every frame
    void Update()
    {
        if (Target.activeSelf == true && detected == false)
        {
            Debug.Log("Image Target Detected");
            detected = true;
            // we moved this function from Start to Update
            CreateGraph();
        }
    }


    // Use this for initialization
    void Start()
    {
    }
//continues with the rest of the code...
}

As you can see:

I moved the CreateGraph() function from Start() to Update(), this will allow us to check if the image is detected and then generate the graph

Now we need to position our 3D histogram on the Image Target. We will use our graph_anchor cube in order to retrieve the relative size, position and rotation. We edit the “GenDataViz.cs” script again; this is the full code of the script:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.
using TMPro; // required to use text mesh pro elements

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 1000;
    int scaleSizeFactor = 100;
    float binDistance = 0.1f;
    float offset = 0;

    //Add a label for the prediction year
    public TextMeshProUGUI textPredictionYear;

    //Check if image target is detected
    public GameObject Target;
    private bool detected = false;
    // The anchor object of your graph
    public GameObject GraphAnchor;

    //The update function is executed every frame
    void Update()
    {
        if (Target.activeSelf == true && detected == false)
        {
            Debug.Log("Image Target Detected");
            detected = true;
            // we moved this function from Start to Update
            CreateGraph();
        }
    }


    // Use this for initialization
    void Start()
    {
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }

    // Here we allow the user to increase and decrease the size of the data visualization
    public void DecreaseSize()
    {
        scaleSize += scaleSizeFactor;
        CreateGraph();
    }

    public void IncreaseSize()
    {
        scaleSize -= scaleSizeFactor;
        CreateGraph();
    }

    //Reset the size of the graph
    public void ResetSize()
    {
        scaleSize = 1000;
        CreateGraph();
    }



    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(GraphAnchor.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            //Reduced the number of arguments of the function
            createBin((float)LinearRegression.quantityValues[i] / scaleSize, GraphAnchor);
            offset += binDistance;
        }
        Debug.Log("creating the graph: " + LinearRegression.PredictionOutput);

        // Let's add the prediction as the last bar, only if the user made a prediction
        if (LinearRegression.PredictionOutput != 0)
        {
            //Reduced the number of arguments of the function
            createBin((float)LinearRegression.PredictionOutput / scaleSize, GraphAnchor);
            offset += binDistance;
            textPredictionYear.text = "Prediction of " + LinearRegression.PredictionYear;
        }
        else
        {
            textPredictionYear.text = " ";

        }
    }

    //Reduced the number of arguments of the function
    void createBin(float Scale_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);

        //We use the localScale of the parent object in order to have a relative size
        Vector3 scale = new Vector3(GraphAnchor.transform.localScale.x / LinearRegression.quantityValues.Count, Scale_y, GraphAnchor.transform.localScale.x / 8);
        cube.transform.localScale = scale;

        //We use the position and rotation of the parent object in order to align our graph
        cube.transform.localPosition = new Vector3(offset - GraphAnchor.transform.localScale.x, (Scale_y / 2) - (GraphAnchor.transform.localScale.y / 2), 0);
        cube.transform.rotation = GraphAnchor.transform.rotation;

        // Let's add some colours
        cube.GetComponent<Renderer>().material.color = Random.ColorHSV(0f, 1f, 1f, 1f, 0.5f, 1f);

    }

}

Many things changed in order to adapt our histogram to the Augmented Reality target. Let’s see what changed:

I changed the initial values of scaleSize and scaleSizeFactor in order to have a smaller graph; feel free to change these values.
“public GameObject graphContainer” is removed and “public GameObject GraphAnchor” takes its place.
“binDistance” is now a float variable, which allows us to be more accurate.
“public GameObject GraphAnchor” is new; as you can expect, this will be the reference to our graph_anchor cube.
The “createBin()” function now needs fewer arguments because the scale, rotation and position are retrieved from the parent object.
Inside “CreateGraph()”, “createBin()” now receives GraphAnchor instead of graphContainer, so the histogram is generated inside our graph_anchor cube.
I updated the lines “createBin((float)LinearRegression.PredictionOutput / scaleSize, GraphAnchor)”, which now use a float instead of an int in order to have a histogram that scales more accurately.
In the scale variable, the expression “GraphAnchor.transform.localScale.x / LinearRegression.quantityValues.Count” helps us bound the X dimension of the graph inside the parent object.
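The arithmetic inside createBin() can be checked outside Unity. The following Python sketch reproduces only the scale and local-position formulas from the script; the function name bar_layout, the sample quantities and the anchor scale of (1, 1, 1) are assumptions made for illustration.

```python
def bar_layout(quantities, scale_size=1000, bin_distance=0.1,
               anchor_scale=(1.0, 1.0, 1.0)):
    """Return the local scale and position of each bar, as createBin() computes them."""
    ax, ay, _ = anchor_scale
    bars = []
    offset = 0.0
    for q in quantities:
        height = q / scale_size
        # Width is the anchor width divided by the number of data points
        scale = (ax / len(quantities), height, ax / 8)
        # Half the height lifts the bar so that its bottom sits on the anchor's bottom face
        position = (offset - ax, height / 2 - ay / 2, 0.0)
        bars.append({"scale": scale, "position": position})
        offset += bin_distance
    return bars


# Hypothetical patient counts, not the tutorial's dataset
for bar in bar_layout([500, 1000]):
    print(bar)
```

Each bar’s bottom edge (position y minus half its height) comes out at the same value, which is what keeps the bars on a common baseline relative to the anchor.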

Now select the GraphContainer object and drag the graph_anchor cube onto the Graph Anchor input field.

Open the menu scene. Remember that the dataset is passed from the menu scene to the graph scene, so if you open the graph scene directly, unexpected results can happen because the data will be missing. Insert a year for the prediction and click on “Prediction”, then on “Data Visualization”. Now show your image target to the camera, and the graph with the patients’ numbers and the prediction should appear!

Fantastic job: you made it!

Conclusion

In summary, we created an application that can load a dataset derived from the numbers of patients which the hospital staff took care of during previous years, then make a prediction of how many patients they could expect to receive in future. During a meeting, this kind of application can be used in combination with image targets on paper documents, thus enhancing the interactivity of a report by making things easier to understand and visualize in people’s minds for funding and planning purposes.

A potential improvement to this project would be to connect the application to a live stream of data, then visualize in Augmented Reality and in real time how the data changes over time. Please feel free to have a go at that or another customization according to your preference, and of course share your progress as well as any other thoughts in the comments below! Hope you enjoyed all the tutorials in this series, and we’ll see you next time. In the meantime, take care.

How to Use Machine Learning to Show Predictions in Augmented Reality – Part 2

Elisa Romondia — Wed, 16 Jan 2019 05:00:06 +0000

Part 1 Recap

This tutorial is part of a multi-part series: in Part 1, we loaded our data from a .csv file and used Linear Regression in order to predict the number of patients that the hospital is expected to receive in future years.

Introduction

In this second part, we will focus our work on the UI and the Data Visualization. We will build a 3D histogram and the related UI in order to manage the data visualization process.

Tutorial Source Code

All of the Part 2 source code can be downloaded here.

Improve our UI

In order to have a cleaner user experience, we will use two scenes:

A scene with the menu, where the user can predict the number of patients expected in the coming years by simply using an input field, and can read the data from the .csv
A scene where we will show the data visualization, where the user can look at a 3D histogram of the data and the prediction. Furthermore, we will allow the user to customize the data visualization by increasing and decreasing the size of the histogram.

First, we create a new button called “GoToGraphScene” inside the canvas. This button will allow the user to navigate between the scenes; we will reorganize the UI later.

Next, we create a new script called NavigateScenes.cs in the Scripts folder, which loads the new scene when the user asks to change scene.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.SceneManagement;

public class NavigateScenes : MonoBehaviour {

    public void LoadScene()
    {
        Scene scene = SceneManager.GetActiveScene();
        SceneManager.LoadScene("tutorial_graph");

    }
}

Note that I included “using UnityEngine.SceneManagement”.

We need to execute the script when the “GoToGraphScene” button is clicked by the user. In order to do that:

Add the script to the “GoToGraphScene” button
Click on the plus sign in the On Click section
Drag the button itself onto the empty field
In the On Click section, select NavigateScenes and then the LoadScene function from the dropdown menu.

We create a new scene called “tutorial_graph” in the Assets folder.

Save the changes in the “tutorial” scene and now let’s repeat the same process for the navigation button in the “tutorial_graph” scene:

Create a canvas
Create a panel inside the canvas
Create a button called “GoToMenu” inside the canvas, in order to go back to the menu
Anchor the button on the top and centre of the canvas
Assign the script called “NavigateScenes.cs” to the button
Execute the script when the button is clicked
Save changes

We can optimize this process: instead of creating a new script for each scene, we can add a few lines of code to the “NavigateScenes.cs” script and load a different scene depending on which scene is currently active. You can find more details about this topic in the SceneManager docs https://docs.unity3d.com/ScriptReference/SceneManagement.SceneManager.html. We use an if because we have only two scenes; with more scenes, a switch statement or a more complex architecture would be better. For now, if we are in “tutorial” we go to “tutorial_graph”, and vice versa, so a simple if is enough.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.SceneManagement;

public class NavigateScenes : MonoBehaviour {

    public void LoadScene()
    {
        Scene scene = SceneManager.GetActiveScene();
        if (scene.name != "tutorial")
        {
            SceneManager.LoadScene("tutorial");
        }
        else
        {
            SceneManager.LoadScene("tutorial_graph");
        }
    }
}

This is the look of the new scene. It feels empty, but we will add awesome things later in the tutorial.

Open the Build Settings from the File menu. We need to add both scenes to the build window, which we can do by simply dragging them into this window. After that, our Build Settings window should look like this:

Now, clicking on the button in either scene should switch to the other one, so the user can navigate from the menu to the data visualization scene and back.

Let’s go back to the first “tutorial” scene to improve the UI.

It would be great to be able to fold the dataset text away with a button, so we create a new button called “ShowData”. Remember to create this element inside the canvas.

Inside that button, we create a Scroll View from the UI menu. The Scroll View object allows the user to scroll through a long list of data. I disabled the horizontal scroll and anchored the Scroll View at the bottom of the button, as you can see in the screenshot.

Set the Value of the Scrollbar script on the Scrollbar Vertical object to 1, so the scrollbar starts from the top. It is important to set this value; otherwise, the text will start in the middle of the scroll view.

Now we drag the DatasetText object inside the Content object of the scroll view. Remember to anchor this element (I used the top left anchor), so that its position stays relatively the same even with different screen sizes.

A long list can be tedious to watch, so it is better to make the user choose to expand it or not. In order to add the show or hide functionality of our “ShowData” button, we create a script called “ShowHide.cs” in the Scripts folder.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class ShowHide : MonoBehaviour
{

    public GameObject showHideObj;

    public void ShowHideTask()
    {
        if (showHideObj.activeSelf)
        {
            showHideObj.SetActive(false);

        }
        else
        {
            showHideObj.SetActive(true);
        }
    }

}

As you can see in the code, using an If we check if the object is active or not, so we hide or show our Scroll View. Feel free to reuse this code to add more functionalities like this in other parts of the UI.

Add that script to the “ShowData” button and assign the ScrollView as the target object.

Next, we add the show/hide functionality on the click event, just dragging the button itself in the On Click() section, selecting the ShowHideTask() from the “ShowHide.cs” script.

Disable the Scroll View element.

Now, clicking on the button should hide or show the dataset list.

Let’s centre the UI elements like in this screenshot, all elements are anchored at the top center, this will make our application easier to use on smaller screens, such as small smartphone devices or low-resolution tablets.

Create the Data Visualization

We need to modify the code of LinearRegression.cs to make our lists static, which will allow us to pass values between scripts and scenes. For more details about the static modifier, you can look at the docs (https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/static).

Let’s add the static modifier to the list variables, and clear the lists in Start() in order to avoid loading the same data multiple times.

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI;
using TMPro;

public class LinearRegression : MonoBehaviour
{
    public TextMeshProUGUI textPredictionResult;
    public TextMeshProUGUI textDatasetObject;
    public InputField inputObject;

    // Use public and static to share the lists data
    public static List<double> yearValues = new List<double>();
    public static List<double> quantityValues = new List<double>();

    void Start()
    {
        // Clear the lists
        yearValues.Clear();
        quantityValues.Clear();

        TextAsset csvdata = Resources.Load<TextAsset>("dataset");

        string[] data = csvdata.text.Split(new char[] { '\n' });
        textDatasetObject.text = "Year Quantity";


        for (int i = 1; i < data.Length - 1; i++)
        {
            string[] row = data[i].Split(new char[] { ',' });

            if (row[1] != "")
            {
                DataInterface item = new DataInterface();

                int.TryParse(row[0], out item.year);
                int.TryParse(row[1], out item.quantity);

                yearValues.Add(item.year);
                quantityValues.Add(item.quantity);
                textDatasetObject.text += "\n" + item.year + " " + item.quantity;


            }
        }
    }
    
    public void PredictionTask()
    {
      
        double intercept, slope;
        LinearRegressionCalc(yearValues.ToArray(), quantityValues.ToArray(), out intercept, out slope);


        var predictedValue = (slope * int.Parse(inputObject.text)) + intercept;
        textPredictionResult.text = "Result: " + predictedValue;
        Debug.Log("Prediction for " + inputObject.text + " : " + predictedValue);

    }


    public static void LinearRegressionCalc(
        double[] xValues,
        double[] yValues,
        out double yIntercept,
        out double slope)
    {
        if (xValues.Length != yValues.Length)
        {
            throw new Exception("Input values should be with the same length.");
        }

        double xSum = 0;
        double ySum = 0;
        double xSumSquared = 0;
        double ySumSquared = 0;
        double codeviatesSum = 0;

        for (var i = 0; i < xValues.Length; i++)
        {
            var x = xValues[i];
            var y = yValues[i];
            codeviatesSum += x * y;
            xSum += x;
            ySum += y;
            xSumSquared += x * x;
            ySumSquared += y * y;
        }

        var count = xValues.Length;
        var xSS = xSumSquared - ((xSum * xSum) / count);
        var ySS = ySumSquared - ((ySum * ySum) / count);

        var numeratorR = (count * codeviatesSum) - (xSum * ySum);
        var denomR = (count * xSumSquared - (xSum * xSum)) * (count * ySumSquared - (ySum * ySum));
        var coS = codeviatesSum - ((xSum * ySum) / count);

        var xMean = xSum / count;
        var yMean = ySum / count;

        yIntercept = yMean - ((coS / xSS) * xMean);
        slope = coS / xSS;
    }

}
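To see what LinearRegressionCalc computes, here is a small worked sketch of the same least-squares arithmetic in Python. The function name linear_regression and the four data points are made up for illustration and are not the tutorial’s dataset.

```python
def linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    if len(xs) != len(ys):
        raise ValueError("Input values should be with the same length.")
    n = len(xs)
    x_sum = sum(xs)
    y_sum = sum(ys)
    x_sq_sum = sum(x * x for x in xs)
    xy_sum = sum(x * y for x, y in zip(xs, ys))

    # Sum of squares of x and sum of co-deviates, as in the C# code (xSS, coS)
    x_ss = x_sq_sum - (x_sum * x_sum) / n
    co_s = xy_sum - (x_sum * y_sum) / n

    slope = co_s / x_ss
    intercept = y_sum / n - slope * (x_sum / n)
    return intercept, slope


# Hypothetical data: patients grow by 20 per year
years = [2015, 2016, 2017, 2018]
patients = [100, 120, 140, 160]

intercept, slope = linear_regression(years, patients)
prediction = slope * 2020 + intercept
print(slope, intercept, prediction)
```

As in the C# version, the slope is undefined when every x value is the same (x_ss is zero), so a real application may want to check for that before dividing.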

Let’s open the “tutorial_graph” scene and create an empty GameObject called “GraphContainer”. This object will contain all the bars of the graph.

Now we will create the data visualization dynamically. In order to do that, we create a new script called “GenDataViz.cs”. This is the code:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.

public class GenDataViz : MonoBehaviour {
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    // Use this for initialization
    void Start () {
        CreateGraph();
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }


    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(graphContainer.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            createBin(10, (float)LinearRegression.quantityValues[i] / scaleSize, 10, offset, ((float)LinearRegression.quantityValues[i] / scaleSize) / 2, graphContainer);
            offset += binDistance;
        }

    }


    void createBin(float Scale_x, float Scale_y, float Scale_z, float Padding_x, float Padding_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);
        cube.transform.position = new Vector3(Padding_x, Padding_y - 100, 400);

        Vector3 scale = new Vector3(Scale_x, Scale_y, Scale_z);

        cube.transform.localScale = scale;
    }

    // Update is called once per frame
    void Update () {
		
	}
}

The code logic works in this order:

Clear previous graphs if they are present.
Load the data from the script “LinearRegression.cs” and, using a for loop, pass the data to the “createBin” function.
Set a parent container.
The “createBin” function creates a primitive cube inside the parent container, so if we want to clear the graph we simply remove the children of the parent container. The cube has a fixed width, and its height is taken from the number of patients in a particular year. We use scaleSize in order to scale the height of the visualization, and offset gives us a distance between the bars.
There is a scaleSize value that we will use later to allow the user to customize the graph.
There is an offset value that allows us to place each bin next to the previous one.

This part of the code:

((float)LinearRegression.quantityValues[i] / scaleSize) / 2

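ensures that each bar of the histogram starts from the same baseline: a cube is positioned by its centre, so raising each one by half of its height puts every bottom face at the same level. Otherwise, the bars would be centred vertically.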
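A quick numeric check of the half-height idea, as a Python sketch: the helper bar_edges and the three heights are hypothetical, and the constant shifts from the script (the -100 on y and 400 on z) are left out because they move every bar by the same amount.

```python
def bar_edges(height, center_y):
    """Return the (bottom, top) y coordinates of a bar positioned by its centre."""
    return center_y - height / 2, center_y + height / 2


heights = [2.0, 5.0, 3.5]

# All centres at y = 0: bottoms differ, the bars straddle the axis
centered = [bar_edges(h, 0.0) for h in heights]

# Centre raised by half the height: every bottom lands on y = 0
aligned = [bar_edges(h, h / 2) for h in heights]

print("centered:", centered)
print("aligned: ", aligned)
```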

This line:

cube.transform.position = new Vector3(Padding_x, Padding_y - 100, 400);

contains fixed values in order to position the histogram in front of the camera. Feel free to change these values (100 is the y offset, 400 is z) in order to position the histogram where you like. In the third part of the tutorial, we will use a target image and the histogram will spawn on the target, so those coordinates will be removed.

We need to assign the “GenDataViz.cs” script to the GraphContainer object, and then drag the GraphContainer object onto the graphContainer field of the script.

Now everything should be fine. Let’s start the game from the “tutorial” scene and navigate to the “tutorial_graph” scene using the go-to button; the 3D histogram should appear near the centre of the screen.

The histogram looks really boring and it’s hard to distinguish one bar from another, so let’s add some colors. In this snippet of code we add a random color to each bar.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    // Use this for initialization
    void Start()
    {
        CreateGraph();
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }


    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(graphContainer.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            createBin(10, (float)LinearRegression.quantityValues[i] / scaleSize, 10, offset, ((float)LinearRegression.quantityValues[i] / scaleSize) / 2, graphContainer);
            offset += binDistance;
        }

    }


    void createBin(float Scale_x, float Scale_y, float Scale_z, float Padding_x, float Padding_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);

        cube.transform.position = new Vector3(Padding_x, Padding_y - 100, 400);

        // Let's add some colours
        cube.GetComponent<Renderer>().material.color = Random.ColorHSV(0f, 1f, 1f, 1f, 0.5f, 1f);


        Vector3 scale = new Vector3(Scale_x, Scale_y, Scale_z);

        cube.transform.localScale = scale;
    }

}

This is the result, less boring and easier to look at.

Now, let’s add more customization to the graph by allowing the user to modify its size using two buttons. We create two buttons called “IncreaseSize” and “DecreaseSize” inside the panel.

Remember to anchor them; I personally use the top and centre anchor.

We need to modify our “GenDataViz.cs” script to add this functionality, so we add two functions that decrease or increase the scale size of the graph. Because each value is divided by scaleSize, increasing scaleSize makes the graph smaller, and decreasing scaleSize makes it bigger. Feel free to change scaleSizeFactor as you wish.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    // Use this for initialization
    void Start()
    {
        CreateGraph();
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }

    // Here we allow the user to increase and decrease the size of the data visualization
    public void DecreaseSize()
    {
        scaleSize += scaleSizeFactor;
        CreateGraph();
    }

    public void IncreaseSize()
    {
        scaleSize -= scaleSizeFactor;
        CreateGraph();
    }



    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(graphContainer.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            createBin(10, (float)LinearRegression.quantityValues[i] / scaleSize, 10, offset, ((float)LinearRegression.quantityValues[i] / scaleSize) / 2, graphContainer);
            offset += binDistance;
        }

    }


    void createBin(float Scale_x, float Scale_y, float Scale_z, float Padding_x, float Padding_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);

        cube.transform.position = new Vector3(Padding_x, Padding_y - 100, 400);

        // Let's add some colours
        cube.GetComponent<Renderer>().material.color = Random.ColorHSV(0f, 1f, 1f, 1f, 0.5f, 1f);


        Vector3 scale = new Vector3(Scale_x, Scale_y, Scale_z);

        cube.transform.localScale = scale;
    }

}

Assign the increase size function:

Assign the decrease size function:

Now, if everything works correctly, the increase button should make the graph bigger and the other should decrease the size.

Et voilà! Whenever the user changes the scale size, the graph is recreated.
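One thing worth noting about the two functions: with the default values (scaleSize = 100, step of 10), pressing the increase button ten times brings scaleSize to 0, and the bar heights are then computed with a division by zero, which for floats yields infinity rather than an error. The sketch below shows one possible way to keep the divisor in a safe range. It is plain C# with no Unity dependency, and the class name ScaleSizeModel and the bounds MinSize and MaxSize are assumptions made for illustration, not part of the tutorial's scripts.

```csharp
using System;

// Hypothetical helper, not part of the tutorial's scripts: it keeps the
// divisor used for bar heights inside a strictly positive range.
public class ScaleSizeModel
{
    public const int DefaultSize = 100; // same starting value as scaleSize
    public const int Step = 10;         // same step as scaleSizeFactor
    public const int MinSize = 10;      // assumed lower bound (never 0)
    public const int MaxSize = 1000;    // assumed upper bound

    public int Size { get; private set; }

    public ScaleSizeModel()
    {
        Size = DefaultSize;
    }

    // A smaller divisor means taller bars, so "increase" subtracts.
    public void Increase()
    {
        Size = Math.Max(MinSize, Size - Step);
    }

    // A larger divisor means shorter bars, so "decrease" adds.
    public void Decrease()
    {
        Size = Math.Min(MaxSize, Size + Step);
    }

    // Height of a bar for a given data value at the current scale.
    public float BarHeight(double value)
    {
        return (float)value / Size;
    }
}
```

The same idea could be applied directly inside IncreaseSize and DecreaseSize by clamping scaleSize before calling CreateGraph.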

Let’s add a reset function so that, if the user gets lost while increasing or decreasing the size of the graph, the size can be reset easily with a button. We create a new button called “ResetSize” inside the canvas and anchor it to the top center.

Inside “GenDataViz.cs”, we add the “ResetSize” function, which simply resets the scaleSize value and recreates the graph by calling the “CreateGraph” function.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    // Use this for initialization
    void Start()
    {
        CreateGraph();
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }

    // Here we allow the user to increase and decrease the size of the data visualization
    public void DecreaseSize()
    {
        scaleSize += scaleSizeFactor;
        CreateGraph();
    }

    public void IncreaseSize()
    {
        scaleSize -= scaleSizeFactor;
        CreateGraph();
    }

    //Reset the size of the graph
    public void ResetSize()
    {
        scaleSize = 100;
        CreateGraph();
    }



    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(graphContainer.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            createBin(10, (float)LinearRegression.quantityValues[i] / scaleSize, 10, offset, ((float)LinearRegression.quantityValues[i] / scaleSize) / 2, graphContainer);
            offset += binDistance;
        }

    }


    void createBin(float Scale_x, float Scale_y, float Scale_z, float Padding_x, float Padding_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);

        cube.transform.position = new Vector3(Padding_x, Padding_y - 100, 400);

        // Let's add some colours
        cube.GetComponent<Renderer>().material.color = Random.ColorHSV(0f, 1f, 1f, 1f, 0.5f, 1f);


        Vector3 scale = new Vector3(Scale_x, Scale_y, Scale_z);

        cube.transform.localScale = scale;
    }

}

Now, something is still missing: the bar for the prediction of how many patients the hospital will need to take care of in a future year. We will add the prediction as a new bar at the end of the histogram.

First, let’s modify the “LinearRegression.cs” script so that we can call the “PredictionTask” function from the “graph_tutorial” scene and give the user the ability to generate different predictions from the new scene.

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI;
using TMPro;

public class LinearRegression : MonoBehaviour
{
    public TextMeshProUGUI textPredictionResult;
    public TextMeshProUGUI textDatasetObject;
    public InputField inputObject;
    // Share the prediction with other scripts
    public static int PredictionOutput;

    // Use public and static to share the lists data
    public static List<double> yearValues = new List<double>();
    public static List<double> quantityValues = new List<double>();

    void Start()
    {
        // Clear the lists
        yearValues.Clear();
        quantityValues.Clear();

        TextAsset csvdata = Resources.Load<TextAsset>("dataset");

        string[] data = csvdata.text.Split(new char[] { '\n' });
        textDatasetObject.text = "Year Quantity";


        for (int i = 1; i < data.Length - 1; i++)
        {
            string[] row = data[i].Split(new char[] { ',' });

            if (row[1] != "")
            {
                DataInterface item = new DataInterface();

                int.TryParse(row[0], out item.year);
                int.TryParse(row[1], out item.quantity);

                yearValues.Add(item.year);
                quantityValues.Add(item.quantity);
                textDatasetObject.text += "\n" + item.year + " " + item.quantity;


            }
        }
    }
    
    public void PredictionTask()
    {
      
        double intercept, slope;
        LinearRegressionCalc(yearValues.ToArray(), quantityValues.ToArray(), out intercept, out slope);

        // We use the static variable in order to share the result of the prediction
        // we convert to an int because we are talking about a number of patients
        PredictionOutput = (int)((slope * int.Parse(inputObject.text)) + intercept);
        textPredictionResult.text = "Result: " + PredictionOutput;
        Debug.Log("Prediction for " + inputObject.text + " : " + PredictionOutput);

    }


    public static void LinearRegressionCalc(
        double[] xValues,
        double[] yValues,
        out double yIntercept,
        out double slope)
    {
        if (xValues.Length != yValues.Length)
        {
            throw new Exception("Input values should be with the same length.");
        }

        double xSum = 0;
        double ySum = 0;
        double xSumSquared = 0;
        double ySumSquared = 0;
        double codeviatesSum = 0;

        for (var i = 0; i < xValues.Length; i++)
        {
            var x = xValues[i];
            var y = yValues[i];
            codeviatesSum += x * y;
            xSum += x;
            ySum += y;
            xSumSquared += x * x;
            ySumSquared += y * y;
        }

        var count = xValues.Length;
        var xSS = xSumSquared - ((xSum * xSum) / count);
        var ySS = ySumSquared - ((ySum * ySum) / count);

        var numeratorR = (count * codeviatesSum) - (xSum * ySum);
        var denomR = (count * xSumSquared - (xSum * xSum)) * (count * ySumSquared - (ySum * ySum));
        var coS = codeviatesSum - ((xSum * ySum) / count);

        var xMean = xSum / count;
        var yMean = ySum / count;

        yIntercept = yMean - ((coS / xSS) * xMean);
        slope = coS / xSS;
    }

}

As you can see, we added a new static variable, “PredictionOutput”, and modified the “PredictionTask” function so that it takes a value from an input field and stores the result in the “PredictionOutput” variable.

We convert the prediction result to an int because we are predicting a number of patients, so a fractional value would not be appropriate: we can’t have one patient and a half!
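It may help to see exactly what the (int) cast does to a fractional prediction, since a cast drops the fractional part instead of rounding. The sketch below compares the cast with rounding to the nearest whole number. The class name PatientCount and the sample values in the comments are made up for illustration, and the rounding variant is an alternative offered here as an assumption, not the tutorial's choice.

```csharp
using System;

public static class PatientCount
{
    // Same conversion the tutorial uses: the cast drops the fractional part.
    public static int Truncate(double prediction)
    {
        return (int)prediction;
    }

    // Alternative (an assumption, not the tutorial's choice): round to the
    // nearest whole patient. Math.Round uses banker's rounding by default,
    // so AwayFromZero is requested explicitly for the .5 cases.
    public static int RoundNearest(double prediction)
    {
        return (int)Math.Round(prediction, MidpointRounding.AwayFromZero);
    }
}

// PatientCount.Truncate(412.9)      -> 412
// PatientCount.RoundNearest(412.9)  -> 413
// PatientCount.RoundNearest(412.5)  -> 413
```

Either choice gives a whole number of patients; the difference only matters when the fractional part is 0.5 or more.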

Now, let’s modify the “GenDataViz.cs” script:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    // Use this for initialization
    void Start()
    {
        CreateGraph();
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }

    // Here we allow the user to increase and decrease the size of the data visualization
    public void DecreaseSize()
    {
        scaleSize += scaleSizeFactor;
        CreateGraph();
    }

    public void IncreaseSize()
    {
        scaleSize -= scaleSizeFactor;
        CreateGraph();
    }

    //Reset the size of the graph
    public void ResetSize()
    {
        scaleSize = 100;
        CreateGraph();
    }



    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(graphContainer.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            createBin(10, (float)LinearRegression.quantityValues[i] / scaleSize, 10, offset, ((float)LinearRegression.quantityValues[i] / scaleSize) / 2, graphContainer);
            offset += binDistance;
        }
        // Let's add the prediction as the last bar, only if the user made a prediction
        if (LinearRegression.PredictionOutput != 0)
        {
            createBin(10, (float)LinearRegression.PredictionOutput / scaleSize, 10, offset, ((float)LinearRegression.PredictionOutput / scaleSize) / 2, graphContainer);
            offset += binDistance;
        }

    }


    void createBin(float Scale_x, float Scale_y, float Scale_z, float Padding_x, float Padding_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);

        cube.transform.position = new Vector3(Padding_x, Padding_y - 100, 400);

        // Let's add some colours
        cube.GetComponent<Renderer>().material.color = Random.ColorHSV(0f, 1f, 1f, 1f, 0.5f, 1f);


        Vector3 scale = new Vector3(Scale_x, Scale_y, Scale_z);

        cube.transform.localScale = scale;
    }

}

We added the prediction as the last bar of the histogram, but only if the user made a prediction in the previous scene, otherwise the graph will visualize the dataset without the prediction bar.

Let’s add a TextMeshPro UI element called “PredictionLabel” inside the “graph_tutorial” scene to show the year of the prediction.

Now, let’s modify our “LinearRegression.cs” script in order to share the year of the prediction.

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI;
using TMPro;

public class LinearRegression : MonoBehaviour
{
    public TextMeshProUGUI textPredictionResult;
    public TextMeshProUGUI textDatasetObject;
    public InputField inputObject;
    // Share the prediction with other scripts
    public static int PredictionOutput;

    //Share the prediction year with other scripts
    public static int PredictionYear;

    // Use public and static to share the lists data
    public static List<double> yearValues = new List<double>();
    public static List<double> quantityValues = new List<double>();

    void Start()
    {
        // Clear the lists
        yearValues.Clear();
        quantityValues.Clear();

        TextAsset csvdata = Resources.Load<TextAsset>("dataset");

        string[] data = csvdata.text.Split(new char[] { '\n' });
        textDatasetObject.text = "Year Quantity";


        for (int i = 1; i < data.Length - 1; i++)
        {
            string[] row = data[i].Split(new char[] { ',' });

            if (row[1] != "")
            {
                DataInterface item = new DataInterface();

                int.TryParse(row[0], out item.year);
                int.TryParse(row[1], out item.quantity);

                yearValues.Add(item.year);
                quantityValues.Add(item.quantity);
                textDatasetObject.text += "\n" + item.year + " " + item.quantity;


            }
        }
    }
    
    public void PredictionTask()
    {
      
        double intercept, slope;
        LinearRegressionCalc(yearValues.ToArray(), quantityValues.ToArray(), out intercept, out slope);

        // We use the static variable in order to share the result of the prediction
        // we convert to an int because we are talking about a number of patients
        PredictionOutput = (int)((slope * int.Parse(inputObject.text)) + intercept);

        // assign the prediction year
        PredictionYear = int.Parse(inputObject.text);

        textPredictionResult.text = "Result: " + PredictionOutput;
        Debug.Log("Prediction for " + inputObject.text + " : " + PredictionOutput);

    }


    public static void LinearRegressionCalc(
        double[] xValues,
        double[] yValues,
        out double yIntercept,
        out double slope)
    {
        if (xValues.Length != yValues.Length)
        {
            throw new Exception("Input values should be with the same length.");
        }

        double xSum = 0;
        double ySum = 0;
        double xSumSquared = 0;
        double ySumSquared = 0;
        double codeviatesSum = 0;

        for (var i = 0; i < xValues.Length; i++)
        {
            var x = xValues[i];
            var y = yValues[i];
            codeviatesSum += x * y;
            xSum += x;
            ySum += y;
            xSumSquared += x * x;
            ySumSquared += y * y;
        }

        var count = xValues.Length;
        var xSS = xSumSquared - ((xSum * xSum) / count);
        var ySS = ySumSquared - ((ySum * ySum) / count);

        var numeratorR = (count * codeviatesSum) - (xSum * ySum);
        var denomR = (count * xSumSquared - (xSum * xSum)) * (count * ySumSquared - (ySum * ySum));
        var coS = codeviatesSum - ((xSum * ySum) / count);

        var xMean = xSum / count;
        var yMean = ySum / count;

        yIntercept = yMean - ((coS / xSS) * xMean);
        slope = coS / xSS;
    }

}

Now, we read the prediction year inside the “GenDataViz.cs” script and show it in the TextMeshPro element.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI; // Required when Using UI elements.
using TMPro; // required to use text mesh pro elements

public class GenDataViz : MonoBehaviour
{
    int scaleSize = 100;
    int scaleSizeFactor = 10;
    public GameObject graphContainer;
    int binDistance = 13;
    float offset = 0;

    //Add a label for the prediction year
    public TextMeshProUGUI textPredictionYear;


    // Use this for initialization
    void Start()
    {
        CreateGraph();
    }


    public void ClearChilds(Transform parent)
    {
        offset = 0;
        foreach (Transform child in parent)
        {
            Destroy(child.gameObject);
        }
    }

    // Here we allow the user to increase and decrease the size of the data visualization
    public void DecreaseSize()
    {
        scaleSize += scaleSizeFactor;
        CreateGraph();
    }

    public void IncreaseSize()
    {
        scaleSize -= scaleSizeFactor;
        CreateGraph();
    }

    //Reset the size of the graph
    public void ResetSize()
    {
        scaleSize = 100;
        CreateGraph();
    }



    public void CreateGraph()
    {
        Debug.Log("creating the graph");
        ClearChilds(graphContainer.transform);
        for (var i = 0; i < LinearRegression.quantityValues.Count; i++)
        {
            createBin(10, (float)LinearRegression.quantityValues[i] / scaleSize, 10, offset, ((float)LinearRegression.quantityValues[i] / scaleSize) / 2, graphContainer);
            offset += binDistance;
        }
        Debug.Log("creating the graph: " + LinearRegression.PredictionOutput);

        // Let's add the prediction as the last bar, only if the user made a prediction
        if (LinearRegression.PredictionOutput != 0)
        {
            createBin(10, (float)LinearRegression.PredictionOutput / scaleSize, 10, offset, ((float)LinearRegression.PredictionOutput / scaleSize) / 2, graphContainer);
            offset += binDistance;
            textPredictionYear.text = "Prediction of " + LinearRegression.PredictionYear;
        }
        else
        {
            textPredictionYear.text = " ";

        }
    }


    void createBin(float Scale_x, float Scale_y, float Scale_z, float Padding_x, float Padding_y, GameObject _parent)
    {
        GameObject cube = GameObject.CreatePrimitive(PrimitiveType.Cube);
        cube.transform.SetParent(_parent.transform, true);

        cube.transform.position = new Vector3(Padding_x, Padding_y - 100, 400);

        // Let's add some colours
        cube.GetComponent<Renderer>().material.color = Random.ColorHSV(0f, 1f, 1f, 1f, 0.5f, 1f);


        Vector3 scale = new Vector3(Scale_x, Scale_y, Scale_z);

        cube.transform.localScale = scale;
    }

}

We assign the prediction label to the script.

Finally, our data visualization should look like this:

That’s all for this second part of the tutorial, I hope you have enjoyed it!

In the next and last part, we’ll write the code to visualize our 3D data visualization in Augmented Reality using EasyAR. In the meantime, have fun with our project, trying to customize the 3D data visualization and the UI as you prefer!

See you soon!

How to Use Machine Learning to Show Predictions in Augmented Reality – Part 1

Elisa Romondia — Wed, 12 Dec 2018 05:23:57 +0000

Introduction

In this tutorial series, you’ll learn how to develop a project that integrates Machine Learning algorithms for predictions and Augmented Reality for data visualization.

For this project we will use Unity 3D and EasyAR. We’ll build the Machine Learning algorithms from the ground up in order to give you a better understanding of how Machine Learning works.

This tutorial will be split into three parts:

How to load a dataset and write Machine Learning algorithms in Unity 3D
Visualize data using Unity 3D
Wrapping up – Connecting Machine Learning algorithm output with Augmented Reality data visualizations using EasyAR

Part 1 Requirements

Basic Unity 3D skills
Basic Math and Statistical skills
Unity 3D
The dataset.csv file (available on GitHub)

Tutorial Source Code

All of the Part 1 source code can be downloaded here.

Unity

Unity is a cross-platform game engine developed by Unity Technologies that gives users the ability to create games in both 2D and 3D. The engine offers a primary scripting API in C#, both for the Unity editor (in the form of plugins) and for the games themselves, as well as drag-and-drop functionality.

For this tutorial, we will use Unity 3D to create a 3D data visualization in AR. An introduction to this platform is available for free here: Unity 101 – Game Development and C# Foundations on Zenva Academy.

Unity is the preferred development tool for the majority of XR creators. The platform supports most AR SDKs, such as ARCore, ARKit and EasyAR, allowing us to develop AR experiences for almost all devices. The majority of AR experiences around the world were created with Unity.

If you want to have a deeper knowledge of Unity 3D you can also check out the Unity Game Development Mini-Degree on Zenva Academy.

If you don’t have Unity 3D installed yet, you can download Unity Personal for free from the Unity website.

Machine Learning

Machine Learning (ML) is a subset of artificial intelligence that applies statistical techniques and algorithms to enable machines to learn from data and output predictions.

The possible applications of ML are almost endless. Depending on the situation that we want to deal with (for example, predicting if there is an anomaly in a data set, predicting sales growth of a good / service or recognizing and categorizing images), we can use different algorithms and processing techniques. If you want to learn more, take a look at the Machine Learning Mini-Degree on Zenva Academy.

In this tutorial, we’ll use ML to predict the number of future patients admitted to an imaginary hospital. For this purpose, we’ll use Linear Regression.

Linear Regression

Linear regression is a statistical method that allows us to study the relationship between continuous variables. One is the dependent variable, while one or more others are the explanatory variables. It is called linear regression because we assume that there is a linear relationship between the input variables (x) and the output variable (y). When there is a single input variable, we talk about simple linear regression; when there is more than one, we talk about multiple linear regression.
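Written as equations, the two cases look as follows. The symbols are notational choices made here for illustration: a is the intercept, b (or b_1 … b_k) the slope coefficients, and ŷ the predicted value. In this tutorial, x will be the year and y the number of patients.

```latex
\[
\text{simple: } \hat{y} = a + b\,x
\qquad\qquad
\text{multiple: } \hat{y} = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
\]
```

Since the dataset used here has a single input column, only the simple form is needed.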

Set up your project

First of all, let’s open Unity and create a new project, selecting 3D.

Let’s populate our scene by creating a new panel to build a UI, where we will add our texts, input field and button.

I changed the panel background to transparent, but feel free to choose the color that you prefer.

I will use TextMeshPro, a free asset for better-looking text in the UI. You can find this asset on the Unity Asset Store, here.

After importing the asset, you should see the new item in the UI menu; add a TextMeshPro Text object to the panel. I customized the text and the size using the inspector. In my case, I chose black as the color and 18 as the size, but you can customize your text as you prefer. To change the text, simply edit the content of the input box and its properties in the related inspector section.

I anchored the text at the top left so that, even with different screen sizes, the text will always be at the top left. This is important in order to make our UI responsive, because mobile devices come in a wide range of screen sizes. To do that, we can use the anchor presets section inside the inspector: just click on the anchor icon. You can see an example in the screenshot below.

Then I renamed the text mesh to “DatasetText”. We’ll use this text later to visualize the dataset content.

Let’s look at our dataset.csv file: it has two columns, year and quantity.

Each row contains both the year and the number of patients admitted that year.

As we can see in the first row, we have 2000 as the year and 200 as the number of patients admitted in that year. As I wrote in the introduction, in this project we want to predict how many patients the hospital will receive in future years.

We’ll use this text later to visualize the Linear Regression prediction output.

Create a new folder called “Resources”, then download dataset.csv from this link into that folder.

After that, let’s create an empty game object to which we’ll assign our script. I renamed the object “ScriptObject”.

Then, we create another folder for the scripts, called “Scripts”.

We can create a new script by right-clicking inside the folder and clicking on C# Script.

First of all, let’s create a data interface for our dataset, creating a script called DataInterface.cs in the scripts folder.

An interface in C# typically contains only the declarations of methods, properties, and events, not their implementation. A class implements the interface by providing an implementation for all of its members, which makes a program easier to maintain.

If you want to know more about interfaces in C#, I suggest this link.
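For comparison, here is a minimal sketch of what a real C# interface and an implementing class look like. The names IDataRow and PatientRow are hypothetical and are not used anywhere in the project; the DataInterface script shown next takes a simpler route and declares a plain class with two public fields.

```csharp
using System;

// Hypothetical interface: it only declares what a dataset row exposes.
public interface IDataRow
{
    int Year { get; }
    int Quantity { get; }
    string Describe();
}

// A class implements the interface by providing every declared member.
public class PatientRow : IDataRow
{
    public int Year { get; private set; }
    public int Quantity { get; private set; }

    public PatientRow(int year, int quantity)
    {
        Year = year;
        Quantity = quantity;
    }

    // Formats the row the same way the tutorial prints dataset lines.
    public string Describe()
    {
        return Year + " " + Quantity;
    }
}
```

Code that only needs to read a row can then depend on IDataRow, so the concrete class can change without affecting its callers.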

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class DataInterface
{
    public int year;
    public int quantity;
   
}

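As you can see, we have two variables, one for each column of the dataset. Note that, despite its name, DataInterface is declared here as a plain class with two public fields rather than as a C# interface.

Now, let’s create a new script, called LinearRegression.cs: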

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI;
using TMPro;

public class LinearRegression : MonoBehaviour
{
    public TextMeshProUGUI textDatasetObject;
    List<double> yearValues = new List<double>();
    List<double> quantityValues = new List<double>();

    void Start()
    {
        TextAsset csvdata = Resources.Load<TextAsset>("dataset");

        string[] data = csvdata.text.Split(new char[] { '\n' });
        textDatasetObject.text = "Year Quantity";

        for (int i = 1; i < data.Length - 1; i++)
        {
            string[] row = data[i].Split(new char[] { ',' });

            if (row[1] != "")
            {
                DataInterface item = new DataInterface();

                int.TryParse(row[0], out item.year);
                int.TryParse(row[1], out item.quantity);

                yearValues.Add(item.year);
                quantityValues.Add(item.quantity);
                textDatasetObject.text += "\n" + item.year + " " + item.quantity;


            }
        }
    }
}

First of all, you can see that I wrote:

using System in order to enable more features.
using UnityEngine.UI for the UI elements
using TMPro for the TextMeshPro elements

Next, I declared the TextMeshProUGUI field; later we’ll link it to a text object of the UI.

In the two lines below it, I created two lists: one will contain all the years and the other all the quantities.

I loaded the dataset in the Start function from the Resources folder as a TextAsset, and I extracted the year and quantity from each row of dataset.csv.

I assigned a text header to the DatasetText object in order to avoid unexpected results, such as repeated lines.

I used Split with \n as the separator to break the file content into rows. I wrote a for loop to cycle through the rows, starting from index 1 (which skips the first line of the file), and used Split again on the comma, followed by TryParse, to extract the year and the quantity from every single row.

I added these values to the respective lists and populated the dataset text object with the content of each row, again using \n to output every line separately.

The Start function is a reserved function in Unity: it is executed once when the object with this script assigned becomes active, just before the first frame update. With this in mind, if an object is in the scene from the beginning, the script will be executed as soon as the scene starts. That’s why we created an empty object with this script assigned, so the dataset will be loaded from the beginning.
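The parsing steps described above can also be tried outside Unity. The following sketch is plain C# and works on a string instead of a TextAsset. The class name CsvSketch is an assumption made for illustration, and it differs from the tutorial's loop in two small ways that are also assumptions: it trims each line (which removes a trailing \r from Windows line endings) and it only stores a row when both TryParse calls succeed.

```csharp
using System;
using System.Collections.Generic;

public static class CsvSketch
{
    // Parses "year,quantity" lines, skipping the first (header) line and any
    // line that is blank or does not contain two numeric columns.
    public static void Parse(string csvText, List<double> years, List<double> quantities)
    {
        string[] lines = csvText.Split(new char[] { '\n' });

        for (int i = 1; i < lines.Length; i++) // i = 1 skips the header
        {
            string[] row = lines[i].Trim().Split(new char[] { ',' });
            if (row.Length < 2)
            {
                continue; // blank or malformed line
            }

            int year, quantity;
            if (int.TryParse(row[0], out year) && int.TryParse(row[1], out quantity))
            {
                years.Add(year);
                quantities.Add(quantity);
            }
        }
    }
}
```

For example, passing the made-up text "year,quantity\n2000,200\n2001,abc\n" fills the lists with the single row 2000 and 200, because the second data row does not contain a valid quantity.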

Let’s visualize our dataset inside the UI panel. First, we assign the Linear Regression script to the ScriptObject, dragging the script inside the ScriptObject’s inspector.

Then, we drag the DatasetText Object inside the script text slot.

Now, we can see the dataset contents in the UI by clicking Play in the Unity editor.

Let’s add another part of the UI. From the UI menu section, we’ll create:

A TextMeshPro text, as the label for the input field; I renamed this object “EnterYearLabel”
An InputField, in order to enter the year of the prediction
A Button, in order to start our prediction, changing the button text to “Predict”
A TextMeshPro text, in order to visualize the result of our prediction; I renamed this object “ResultPredictionText”

I anchored all these items in the top right corner of the screen.

Let’s create the function that will calculate the Linear Regression inside LinearRegression.cs:

public static void LinearRegressionCalc(
        double[] xValues,
        double[] yValues,
        out double yIntercept,
        out double slope)
    {
        if (xValues.Length != yValues.Length)
        {
            throw new Exception("Input values should be with the same length.");
        }

        double xSum = 0;
        double ySum = 0;
        double xSumSquared = 0;
        double ySumSquared = 0;
        double codeviatesSum = 0;

        for (var i = 0; i < xValues.Length; i++)
        {
            var x = xValues[i];
            var y = yValues[i];
            codeviatesSum += x * y;
            xSum += x;
            ySum += y;
            xSumSquared += x * x;
            ySumSquared += y * y;
        }

        var count = xValues.Length;
        var xSS = xSumSquared - ((xSum * xSum) / count);
        var ySS = ySumSquared - ((ySum * ySum) / count);

        var numeratorR = (count * codeviatesSum) - (xSum * ySum);
        var denomR = (count * xSumSquared - (xSum * xSum)) * (count * ySumSquared - (ySum * ySum));
        var coS = codeviatesSum - ((xSum * ySum) / count);

        var xMean = xSum / count;
        var yMean = ySum / count;

        yIntercept = yMean - ((coS / xSS) * xMean);
        slope = coS / xSS;
    }

This function is simply a conversion of the Linear Regression’s mathematical equation to C# code. As you can see in the code, I passed inside this function the values relative to the years and the values relative to the quantities. I passed the years list as xValues and the quantity list as yValues in order to predict the future quantities.

This function outputs the intercept and the slope of the Linear Regression. We’ll use these values to build our prediction in the next function.
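For readers who want to check the arithmetic outside Unity, the same sums can be sketched in a few lines of Python. This is only a cross-check under the assumption that the inputs are plain lists of numbers; the function name linear_regression_calc and the sample data are made up for illustration and are not part of the Unity project.

```python
def linear_regression_calc(x_values, y_values):
    """Return (y_intercept, slope) of the least-squares line through the points."""
    if len(x_values) != len(y_values):
        raise ValueError("Input values should be with the same length.")

    count = len(x_values)
    x_sum = sum(x_values)
    y_sum = sum(y_values)
    x_sum_squared = sum(x * x for x in x_values)
    codeviates_sum = sum(x * y for x, y in zip(x_values, y_values))

    # Same intermediate quantities as xSS and coS in the C# version
    x_ss = x_sum_squared - (x_sum * x_sum) / count
    co_s = codeviates_sum - (x_sum * y_sum) / count

    slope = co_s / x_ss
    y_intercept = y_sum / count - slope * (x_sum / count)
    return y_intercept, slope

# Hypothetical data lying exactly on y = 2x + 1
intercept, slope = linear_regression_calc([1, 2, 3, 4], [3, 5, 7, 9])
```

Because the sample points sit exactly on a line, the recovered slope and intercept should match that line, which makes this a convenient sanity test before trusting the numbers on real data.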

If you want to know more about the Linear Regression or you just want to refresh your knowledge about it, this tutorial is a great place to start.

Now, let’s go deep into the project, wiring up the code that we wrote before!

Let’s add the references to the new UI elements:

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI;
using TMPro;

public class LinearRegression : MonoBehaviour
{
    public TextMeshProUGUI textPredictionResult;
    public TextMeshProUGUI textDatasetObject;
    public InputField inputObject;
    List<double> yearValues = new List<double>();
    List<double> quantityValues = new List<double>();

    void Start()
    {
        TextAsset csvdata = Resources.Load<TextAsset>("dataset");

        string[] data = csvdata.text.Split(new char[] { '\n' });
        textDatasetObject.text = "Year Quantity";


        for (int i = 1; i < data.Length - 1; i++)
        {
            string[] row = data[i].Split(new char[] { ',' });

            if (row[1] != "")
            {
                DataInterface item = new DataInterface();

                int.TryParse(row[0], out item.year);
                int.TryParse(row[1], out item.quantity);

                yearValues.Add(item.year);
                quantityValues.Add(item.quantity);
                textDatasetObject.text += "\n" + item.year + " " + item.quantity;


            }
        }
    }

As you can see, in the first lines I included the references to the input field and the new prediction result text:

public TextMeshProUGUI textPredictionResult, will be the slot in the script where we assign the text that will output the prediction
public TextMeshProUGUI textDatasetObject, will be the slot in the script where we assign the text that will output the dataset
public InputField inputObject, will be the slot in the script where we assign the input field where the user will insert the year of the prediction

Now, let’s create the functions that will output the prediction in our UI.

    public void PredictionTask()
    {
      
        double intercept, slope;
        LinearRegressionCalc(yearValues.ToArray(), quantityValues.ToArray(), out intercept, out slope);


        var predictedValue = (slope * int.Parse(inputObject.text)) + intercept;
        textPredictionResult.text = "Result: " + predictedValue;
        Debug.Log("Prediction for " + inputObject.text + " : " + predictedValue);

    }

In the PredictionTask function, I first convert the two lists to arrays and pass them to LinearRegressionCalc as the x and y values. Once I have the slope and the intercept, I calculate the prediction: I take the input field’s value as the year, multiply it by the slope, and add the intercept. Finally, I output the prediction result in our UI prediction text object.

Now, we connect the scripts to the UI. Let’s drag the LinearRegression.cs to the button inspector.

After, we drag the input field, the prediction text and the dataset text into the script component inside the button inspector.

We drag the button inside the onClick section of the button itself, setting the onClick event to the PredictionTask function.

As you can see in the screenshot below, we set the PredictionTask from the LinearRegression script.

Finally, click Play in the Unity Game window. Entering a year into the input field and clicking the Predict button now shows the result of our prediction. Our hospital gets an estimate of how many patients it will have to take care of in the coming years and can act accordingly:

Hire new doctors in advance
Plan new areas of the hospital in order to be able to take care of a larger number of patients

All those things will take time, so it’s better to act as early as possible, and prediction allows us exactly to do that. Something else to note is that in real-world applications, ideally multiple variables would be used to increase the accuracy of the prediction.

That’s all for this first part of the tutorial, I hope you have enjoyed it!

In the next part, we’ll create the 3D Data Visualizations. In the meantime, have fun with our model, trying to include other datasets and customize the UI!

See you soon!

Dimensionality Reduction

Mohit Deshpande — Sat, 04 Nov 2017 01:41:13 +0000

Dimensionality Reduction is a powerful technique that is widely used in data analytics and data science to help visualize data, select good features, and to train models efficiently. We use dimensionality reduction to take higher-dimensional data and represent it in a lower dimension. We’ll discuss some of the most popular types of dimensionality reduction, such as principal components analysis, linear discriminant analysis, and t-distributed stochastic neighbor embedding. We’ll use these techniques to project the MNIST handwritten digits dataset of images into 2D and compare the resulting visualizations.

Download the full code here.

Handwritten Digits: The MNIST Dataset

Before discussing the motivation behind dimensionality reduction, let’s take a look at the MNIST dataset. We’ll be using it as a running example.

The MNIST handwritten digits dataset consists of grayscale images of a single handwritten digit (0-9) of size $28 \times 28$ pixels. The provided training set has 60,000 images, and the testing set has 10,000 images.

We can think of each digit as a point in a higher-dimensional space. If we take an image from this dataset and rasterize it into a vector, then it becomes a point in 784-dimensional space. That’s impossible to visualize in that higher space!
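As a minimal sketch of what “rasterize it into a vector” means, the snippet below flattens one image-shaped array into a single long vector. The image here is a random stand-in, not real MNIST data, and the variable names are chosen only for illustration.

```python
import numpy as np

# Stand-in for one 28x28 digit image (random pixels, not real data)
image = np.random.default_rng(0).random((28, 28))

# Rasterize: lay the rows end to end to get one long vector
vector = image.reshape(-1)
print(vector.shape)  # (784,)
```

Each of the 784 entries is one pixel, so every image becomes a single point with 784 coordinates.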

These kinds of higher dimensions are quite common in data science. Each dimension represents a feature. For example, suppose we wanted to build a naive dog breed classifier. Our features may be something like height, weight, length, fur color, and so on. Each one of these becomes a dimension in the vector that represents a single dog. That vector is then a point in a higher-dimensional space, just like our MNIST dataset!

Dimensionality Reduction

Dimensionality reduction is a type of learning where we want to take higher-dimensional data, like images, and represent them in a lower-dimensional space. Let’s use the following data as an example.

(These plots show the same data, except the bottom chart zero-centers it.)

With these data, we can use a dimensionality reduction to reduce them from a 2D plane to a 1D line. If we had 3D data, we could reduce them down to a 2D plane, and then to a 1D line.

Most dimensionality reduction techniques aim to find some hyperplane, which is just a higher-dimensional version of a line, to project the points onto. We can imagine a projection as taking a flashlight perpendicular to the hyperplane we’re projecting onto and plotting where the shadows fall on that hyperplane. For example, in our above data, if we wanted to project our points onto the x-axis, then we pretend each point is a ball and our flashlight would point directly down or up (perpendicular to the x-axis) and the shadows of the points would fall on the x-axis. This is what we call a projection. We won’t worry about the exact math behind this since scikit-learn can apply this projection for us.

In our simple 2D case, we want to find a line to project our points onto. After we project the points, then we have data in 1D instead of 2D! Similarly, if we had 3D data, we would want to find a plane to project the points down onto to reduce the dimensionality of our data from 3D to 2D. The different types of dimensionality reduction are all about figuring out which of these hyperplanes to select: there are an infinite number of them!
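The flashlight picture can be sketched in a few lines of numpy. This is an illustration only: the helper name project_onto_line and the sample points are assumptions, and the line is taken to pass through the origin.

```python
import numpy as np

def project_onto_line(points, direction):
    """Project 2D points onto the line through the origin along `direction`.

    Returns the 1D coordinate of each point along that line.
    """
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)   # normalize so coordinates are true lengths
    return np.asarray(points, dtype=float) @ unit

points = np.array([[1.0, 2.0], [3.0, 4.0], [-2.0, 0.5]])  # made-up points
on_x_axis = project_onto_line(points, [1, 0])    # keeps only the x-coordinates
on_diagonal = project_onto_line(points, [1, 1])  # coordinate along the 45-degree line
```

Projecting onto the x-axis simply drops the y-coordinate, while projecting onto the diagonal mixes both coordinates. The different techniques below differ in how they pick the direction.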

Principal Component Analysis

One technique of dimensionality reduction is called principal component analysis (PCA). The idea behind PCA is that we want to select the hyperplane such that, when all the points are projected onto it, they are maximally spread out. In other words, we want the axis of maximal variance! Let’s consider our example plot above. A potential axis is the x-axis or y-axis, but, in both cases, that’s not the best axis. However, if we pick a line that has the same diagonal orientation as our data, that is the axis where the data would be most spread!

The longer blue axis is the correct axis! (The shorter blue axis is for visualization only and is perpendicular to the longer one.) If we were to project our points onto this axis, they would be maximally spread! But how do we figure out this axis? We can use a linear algebra concept called eigenvectors! Essentially, we compute the covariance matrix of our data and consider the eigenvectors of that covariance matrix with the largest eigenvalues. Those are our principal axes and the axes that we project our data onto to reduce dimensions.

Using this approach, we can take high-dimensional data and reduce it down to a lower dimension by selecting the eigenvectors of the covariance matrix with the largest eigenvalues and projecting onto those eigenvectors.
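The recipe above (covariance matrix, eigenvectors, projection) can be sketched directly in numpy. This is a bare-bones illustration on synthetic data, not how scikit-learn implements PCA internally; the function name pca_project is an assumption.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal axes (a bare-bones PCA)."""
    X_centered = X - X.mean(axis=0)            # zero-center each feature
    cov = np.cov(X_centered, rowvar=False)     # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh handles symmetric matrices, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    axes = eigvecs[:, order[:n_components]]    # principal axes as columns
    return X_centered @ axes

# Synthetic 2D data stretched along the diagonal
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])
X_1d = pca_project(X, 1)
```

On this data the projected points are more spread out than the data along either the x-axis or the y-axis, which is exactly the “axis of maximal variance” property.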

Linear Discriminant Analysis

Another type of dimensionality reduction technique is called linear discriminant analysis (LDA). Similar to PCA, we want to find the best hyperplane and project our data onto it. However, there is one big distinction: LDA is supervised! With PCA, we were using eigenvectors from our data to figure out the axis of maximum variance. However, with LDA, we want the axis of maximum class separation! In other words, we want the axis that separates the classes with the maximum margin of separation.

The following figure shows the difference between PCA and LDA.

(Source: https://algorithmsdatascience.quora.com/PCA-4-LDA-Linear-Discriminant-Analysis)

With LDA, we choose the axis so that Class 1 and Class 2 are maximally separated, i.e., the distance between their means is maximal. We must have class labels for LDA because we need to compute the mean of each class to figure out the optimal plane.

It’s important to note that LDA does make some assumptions about our data. In particular, it assumes that the data for our classes are normally distributed (Gaussian distribution). We can still use LDA on data that isn’t normally distributed, but we may not find an optimal hyperplane. Another assumption is that the covariances of each class are the same. In reality, this also might not be the case, but LDA will still work fairly well. We should keep these assumptions in mind when using LDA on any set of data.
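For two classes, the idea can be sketched with Fisher’s discriminant, which picks the axis that pushes the class means apart while also accounting for how spread out each class is. Note that this is a simplified two-class illustration on made-up Gaussian data, not scikit-learn’s multi-class implementation; the function name fisher_direction is an assumption.

```python
import numpy as np

def fisher_direction(X, y):
    """Two-class Fisher discriminant: unit axis that best separates class 0 from class 1."""
    X0, X1 = X[y == 0], X[y == 1]
    mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: how spread out each class is around its own mean
    scatter_within = (X0 - mean0).T @ (X0 - mean0) + (X1 - mean1).T @ (X1 - mean1)
    w = np.linalg.solve(scatter_within, mean1 - mean0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Two made-up Gaussian classes with the same covariance, offset along x
cov = [[1.0, 0.0], [0.0, 9.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 300),
               rng.multivariate_normal([3, 0], cov, 300)])
y = np.repeat([0, 1], 300)

w = fisher_direction(X, y)
projected = X @ w   # 1D coordinates along the separating axis
```

PCA on the same data would favor the y-axis, where the variance is largest, even though the classes only differ along x. Using the labels is what lets this approach pick the x direction instead.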

t-Distributed Stochastic Neighbor Embedding (t-SNE)

A more recent dimensionality reduction technique that’s been widely adopted is t-Distributed Stochastic Neighbor Embedding (t-SNE) by Laurens Van Der Maaten (2008). t-SNE fundamentally differs from PCA and LDA because it is probabilistic! Both PCA and LDA are deterministic, but t-SNE is stochastic, or probabilistic.

At a high level, t-SNE aims to minimize the divergence between two distributions: the pairwise similarity of the points in the higher-dimensional space and the pairwise similarity of the points in the lower-dimensional space.

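To measure similarity, t-SNE uses two distributions: a Gaussian in the higher-dimensional space and a Student’s t-distribution with one degree of freedom (the Cauchy distribution) in the lower-dimensional space. The t-distribution looks very similar to a Gaussian, but it has heavier tails. To compute the similarity between two points $x_i$ and $x_j$ in the higher-dimensional space, we use probabilities:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

where $N$ is the number of data points and $p_{i|i} = 0$. In other words, this equation is telling us the likelihood that $x_i$ would choose $x_j$ as its neighbor. Notice the Gaussian is centered around $x_i$. Intuitively, the farther $x_j$ is from $x_i$, the smaller the probability becomes.

Similarly, we can compute the same quantity for the points $y_i$ and $y_j$ in the lower-dimensional space, this time using the t-distribution:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

Now how do we measure the divergence between two distributions? We simply use the Kullback-Leibler divergence (KLD).

$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

This is our cost function! Now we can use a technique like gradient descent to train our model.

There’s just one last thing to figure out: $\sigma_i$. We can’t use the same $\sigma$ for all points! Denser regions should have a smaller $\sigma_i$, and sparser regions should have a larger $\sigma_i$. We solidify this intuition into a mathematical term called perplexity. Think of it as a measure of the effective number of neighbors, similar to the $k$ of k-nearest neighbors.

t-SNE performs a binary search for the $\sigma_i$ that produces a distribution with the perplexity specified by the user: perplexity is a hyperparameter. Values between 5 and 50 tend to work the best.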
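To make the perplexity search concrete, here is a minimal numpy sketch for a single point. It assumes the usual definition of perplexity as 2 raised to the Shannon entropy (in bits) of the neighbor distribution; the function names, the search bounds, and the random data are all illustrative choices, not part of any library.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """Neighbor probabilities for one point, given squared distances to every other point."""
    # Shifting by the smallest distance avoids underflow and leaves the ratios unchanged
    shifted = sq_dists_i - sq_dists_i.min()
    weights = np.exp(-shifted / (2.0 * sigma ** 2))
    return weights / weights.sum()

def perplexity(probs):
    """2 ** (Shannon entropy in bits): the effective number of neighbors."""
    nonzero = probs[probs > 0]
    return 2.0 ** (-np.sum(nonzero * np.log2(nonzero)))

def find_sigma(sq_dists_i, target_perplexity, n_steps=50):
    """Binary search for the sigma whose distribution hits the target perplexity."""
    low, high = 1e-10, 1e4
    for _ in range(n_steps):
        sigma = (low + high) / 2.0
        if perplexity(conditional_probs(sq_dists_i, sigma)) > target_perplexity:
            high = sigma   # too many effective neighbors: narrow the Gaussian
        else:
            low = sigma    # too few effective neighbors: widen the Gaussian
    return sigma

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 5))                        # made-up data
sq_dists = np.sum((points[1:] - points[0]) ** 2, axis=1)  # squared distances from point 0 to the rest
sigma_0 = find_sigma(sq_dists, target_perplexity=30.0)
```

The search works because perplexity grows steadily as sigma grows: a wider Gaussian spreads probability over more neighbors. A point in a dense region reaches the target perplexity with a small sigma, while a point in a sparse region needs a larger one.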

In practice, t-SNE is very resource-intensive so we usually use another dimensionality reduction technique, like PCA, to reduce the input space into a smaller dimensionality (like maybe 50 dimensions), and then use t-SNE.
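A rough sketch of that two-step pipeline with scikit-learn is shown below. The subset size and the intermediate dimensionality of 30 are arbitrary choices made to keep the example quick (the digits data used here only has 64 features to begin with); they are not recommendations.

```python
from sklearn import datasets, decomposition, manifold

digits = datasets.load_digits()
X_small = digits.data[:300]   # a small subset keeps this sketch quick

# Step 1: PCA down to an intermediate dimensionality (30 is an arbitrary choice here)
X_reduced = decomposition.PCA(n_components=30).fit_transform(X_small)

# Step 2: t-SNE from the intermediate space down to 2D
X_embedded = manifold.TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)
print(X_embedded.shape)  # (300, 2)
```

The PCA step discards the directions with the least variance, so t-SNE has fewer dimensions to compute pairwise distances over.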

t-SNE, as we’ll see, produces the most cleanly separated clusters of the three techniques on our digits, largely because the KLD cost function rewards keeping neighboring points together.

Dimensionality Reduction Visualizations

Now that we’ve discussed a few popular dimensionality reduction techniques, let’s apply them to our MNIST dataset and project our digits onto a 2D plane.

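First, we need to import numpy, matplotlib, and scikit-learn and get the digits data. Scikit-learn ships with a small handwritten digits dataset, loaded by datasets.load_digits(), so we don’t have to download or uncompress anything ourselves! Note that this is a lower-resolution relative of MNIST (1,797 images of size 8 × 8, so 64 features per image rather than 784), but the techniques apply in exactly the same way. Additionally, I’ve provided a function that will produce a nice visualization of our data.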

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import manifold, datasets, decomposition, discriminant_analysis

digits = datasets.load_digits()
X = digits.data
y = digits.target
n_samples, n_features = X.shape

def embedding_plot(X, title):
    x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure()
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(X[:,0], X[:,1], lw=0, s=40, c=y/10.)

    shown_images = np.array([[1., 1.]])
    for i in range(X.shape[0]):
        if np.min(np.sum((X[i] - shown_images) ** 2, axis=1)) < 1e-2: continue
        shown_images = np.r_[shown_images, [X[i]]]
        ax.add_artist(offsetbox.AnnotationBbox(offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r), X[i]))

    plt.xticks([]), plt.yticks([])
    plt.title(title)

Using any of the dimensionality reduction techniques that we’ve discussed in scikit-learn is trivial! We can get PCA working in just a few lines of code!

X_pca = decomposition.PCA(n_components=2).fit_transform(X)
embedding_plot(X_pca, "PCA")
plt.show()

Below is the resulting plot.

After taking a closer look at this plot, we notice something spectacular: similar digits are grouped together! If we think about it, this result makes sense. If the digits looked similar, when we rasterized them into a vector, the points must have been relatively close to each other. So when we project them down to the lower-dimensional space, we also expect them to be somewhat close together as well. However, PCA doesn’t know anything about the class labels: it’s going off of just the axis of maximal variance. Maybe LDA will work better since we can separate the classes in the lower-dimensional space better.

Now let’s use LDA to visualize the same data. Just like PCA, using LDA in scikit-learn is very easy! Notice that we have to also give the class labels since LDA is supervised!

X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
embedding_plot(X_lda, "LDA")

plt.show()

Below is the resulting plot from LDA.

This looks slightly better! We notice that the clusters are a bit farther apart. Consider the 0’s: they’re almost entirely separated from the rest! We also have clusters of 4’s at the top left, 2’s and 3’s at the right, and 6’s in the center. This is doing a better job at separating the digits in the lower-dimensional space. We can attribute this improvement to having information about the classes: remember that LDA is supervised! By knowing the correct labels, we can choose hyperplanes that better separate the classes. Let’s see how t-SNE compares to LDA.

(We may get a warning about collinear points. The reason for the warning is because we have to take a matrix inverse. Don’t worry too much about it since we’re just using this for visualization.)

Finally, let’s use t-SNE to visualize the MNIST data. We’re initializing the embedding to use PCA (in accordance with Laurens Van Der Maaten’s recommendations). Unlike LDA, t-SNE is completely unsupervised.

X_tsne = manifold.TSNE(n_components=2, init='pca').fit_transform(X)
embedding_plot(X_tsne, "t-SNE")

plt.show()

Below is the plot from t-SNE.

This looks the best out of all three! There are distinct clusters of points! Notice that almost all of the digits are separated into their own clusters. This is a direct result of minimizing the divergence of the two distributions: points that are close to each other in the high-dimensional space will be close together in the lower-dimensional space!

To summarize, we discussed the problem of dimensionality reduction, which is to reduce high-dimensional data into a lower dimensionality. In particular, we discussed Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA), and t-Distributed Stochastic Neighbor Embedding (t-SNE). PCA tries to find the hyperplane of maximum variance: when the points are projected onto the optimal hyperplane, they have maximum variance. LDA finds the axis of largest class separation: when the points are projected onto the optimal hyperplane, the means of the classes are maximally apart. Remember that LDA is supervised and requires class labels! Finally, we discussed an advanced technique called t-SNE, which actually performs optimization across two distributions to produce a lower-dimensional embedding so that the points in the lower dimensionality are pairwise representative of how they appear in the higher dimensionality.

Dimensionality reduction is an invaluable tool in the toolbox of a data scientist and has applications in feature selection, face recognition, and data visualization.

Building Blocks – Data Science and Linear Regression

Mohit Deshpande — Tue, 12 Sep 2017 06:30:37 +0000

“Data science” or “Big data analyst” is a phrase that has been tossed around since the advent of Big Data. But what is it, really? Well imagine working for a retail company. One of the questions you may be asked to answer is “how many chips should we stock up for this month?” It seems like a simple question that makes sense to ask. But there are a lot of steps involved in answering that single question. We need to collect the data, clean/prep the data, figure out which features we need to answer the question, determine the appropriate machine learning algorithm to use, get the results, and, finally, write a report that outlines the results of the analysis. And that’s still a fairly top-level understanding of what might be needed to complete this task! Some portion of that might take a long time, e.g., we may spend a few weeks collecting and cleaning/prepping data!

We’re going to go through an example question similar to what a data scientist may be expected to answer. We’ll also cover the topic of linear regression, a fundamental machine learning algorithm. Download the dataset and code in the ZIP file here.

Data science hinges on questions whose answers lie in the data itself. We’re going to go through an example of answering one of these questions using data from Facebook. The University of California Irvine (UCI) has a ton of free-to-use machine learning datasets here, and we’re going to use their Facebook metrics dataset to answer some questions.

Let’s start off with a question whose answer you can probably intuit: “When a post is shared more often, do more people tend to like that post as well?” You can probably use your intuition to figure out that the more a post is shared, the more likely it is to be liked or commented on. But intuition doesn’t always pan out! The key word in “data science” is “data”! We need evidence to support our intuition or our hypothesis.

Data Extraction and Exploration

The first thing we should do is spend some time to explore our dataset. This includes writing code to read in our data and to create a visualization. Since the data are in a CSV file, we can use the built-in csv module to extract them. If we’re dealing with large datasets, it’s quicker to pre-allocate our numpy arrays whenever possible. This is why we count the number of rows in our csv file, which should be exactly 500. (We subtract one because of the header row.) Then we can pre-allocate our X and Y arrays. We use Python’s csv module to open the file, skip the header row, and load it into the numpy arrays.

A few minor points to note: the enumerate function will produce a tuple with the index and value at that index if given a list or generator. This is particularly useful for loading into pre-allocated arrays. Another note is that some of the data might not exist. In this case, we have to decide what we should do. There are many valid options: forget/remove the nonexistent data, replace it with a default value, interpolate (for time-series data), etc. In our case, we’re just substituting a reasonable value (zero).
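Both points can be seen on a tiny in-memory example. The CSV text below is made up to mimic the semicolon-delimited layout, and the helper to_int is a hypothetical name used only for this sketch.

```python
import csv
import io
import numpy as np

# A tiny made-up CSV in the same semicolon-delimited style; the second row has no 'like' value
raw = "share;like\n10;100\n5;\n0;7\n"

def to_int(text, default=0):
    """Parse an integer, substituting `default` when the field is empty."""
    return int(text) if text else default

rows = list(csv.DictReader(io.StringIO(raw), delimiter=';'))
likes = np.zeros(len(rows))          # pre-allocate one slot per data row
for i, row in enumerate(rows):       # enumerate yields (index, value) pairs
    likes[i] = to_int(row['like'])
```

The empty field comes back from the reader as an empty string, so the default of zero is stored in its slot. Whether zero is a sensible substitute depends on the column; for counts like shares and likes it is at least a plausible value.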

import numpy as np
import matplotlib.pyplot as plt
import csv

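def load_dataset():
    # Count the data rows (every line except the header) so we can pre-allocate
    with open('dataset_Facebook.csv') as f:
        num_rows = sum(1 for line in f) - 1
    X = np.zeros((num_rows, 1))
    y = np.zeros((num_rows, 1))
    with open('dataset_Facebook.csv') as f:
        # DictReader consumes the header row itself and uses it as the field names,
        # so calling next() here would silently drop the first data row
        reader = csv.DictReader(f, delimiter=';')
        for i, row in enumerate(reader):
            # Substitute zero when a value is missing
            X[i] = int(row['share']) if len(row['share']) > 0 else 0
            y[i] = int(row['like'])  if len(row['like']) > 0 else 0
    return X, y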

Now that our dataset is loaded, we can use matplotlib to create a scatter plot. Remember to label the axes!

def visualize_dataset(X, y):
    plt.xlabel('Number of shares')
    plt.ylabel('Number of likes')
    plt.scatter(X, y)
    plt.show()

if __name__ == '__main__':
    X, y = load_dataset()
    visualize_dataset(X, y)

Here is the resulting plot:

From this plot alone, we notice a few things. There seems to be an outlier with around 800 shares and a little over 5000 likes. This might be an interesting post to investigate. A second point to notice is that most of our data are between 0 – 200 shares and 0 – 2000 likes, quite densely actually. Remember that 500 data points are being shown! Another interesting point we notice is the right-upwards trend of our data. This seems to fit our intuition: the more shares a post receives, the more likes it tends to have!

Normally, we would spend more time exploring our data, but our question can be answered using these data. Now that we see our intuition fits our data, we need to provide quantitative numbers that show this relationship between the number of shares and number of likes. The best way to do this, in our case, is using linear regression.

Linear Regression

Linear regression is a machine learning algorithm used to find linear relationships between two sets of data. The crux of linear regression is that it only works well when the data are roughly linear, which, judging from the scatter plot, ours are. There are metrics that we’ll use to see exactly how linear our data are.

With linear regression, we’re trying to estimate the parameters of a line:

$$y = wx + b$$

Following suit with neural networks, we use $w$ instead of $m$. Another reason is that we may have multivariate data where the input is a vector. Then we generalize to a dot product. Based on this equation, we have two parameters: $w$ and $b$.

We want to compute the line-of-best-fit, but what does “best” mean? Well we need to define a metric that we use to measure “best” because we need a way to determine if one set of parameter values is better than another. To do this, we define a cost function. Intuitively, this is just a measure of how good our parameters are, given our data. When our cost function is minimal, that means that our parameters are optimal!

$$C(w, b) = \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)^2$$

(where $N$ is the number of training examples and $y_i$ is the true value of the $i$-th example)

Let’s take a closer look at this cost function intuitively. It is measuring the sum of the squared error of all of our training examples. The reason we square is because we don’t really care if the error is positive or negative: an error is an error! (More precisely, the reason we use a square and not an absolute value is because it is differentiable: the derivative of a squared function is nicer than the derivative of an absolute value function.)
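As a small illustration of how the cost separates good parameters from bad ones, the sketch below evaluates a sum-of-squared-errors cost for two candidate lines. The data points and parameter values are made up, and the function name cost is just a label for this example.

```python
import numpy as np

def cost(w, b, x, y):
    """Sum of squared errors between the true values y and the line w * x + b."""
    predictions = w * x + b
    return np.sum((y - predictions) ** 2)

# Made-up points that lie close to y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

good = cost(2.0, 1.0, x, y)   # parameters near the underlying line
bad = cost(0.5, 0.0, x, y)    # a poorly chosen line
```

The line close to the data gets a far smaller cost than the poorly chosen one, and squaring means that errors above and below the line both count against it.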

Now that we know a bit about the cost function, how do we use it? The optimal parameters are found when the cost is at a minimum. So we need to perform some optimization! To find the optimal value of $w$, we take the partial derivative of the cost $C$ with respect to $w$, set it equal to zero, and solve for $w$! We do the same thing for $b$.

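We’ll skip the partial derivatives and derivation for now, but we can use some statistics to rearrange the solutions into a closed-form:

$$w = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad b = \bar{y} - w\bar{x}$$

(where the bars over the variables represent average/mean)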

Now we can figure out the slope and y-intercept from the data itself!
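For readers who want to see where the closed form comes from, here is a sketch of the skipped derivation. It assumes the cost is the plain sum of squared errors over the $N$ training examples; a constant factor in front of the sum would not change the result.

```latex
\begin{aligned}
C(w, b) &= \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)^2 \\[4pt]
\frac{\partial C}{\partial b} &= -2 \sum_{i=1}^{N} \bigl(y_i - w x_i - b\bigr) = 0
  \;\Longrightarrow\; b = \bar{y} - w \bar{x} \\[4pt]
\frac{\partial C}{\partial w} &= -2 \sum_{i=1}^{N} x_i \bigl(y_i - w x_i - b\bigr) = 0 \\[4pt]
\text{substituting } b:\quad
0 &= \sum_{i=1}^{N} x_i \bigl((y_i - \bar{y}) - w (x_i - \bar{x})\bigr) \\[4pt]
w &= \frac{\sum_{i=1}^{N} x_i (y_i - \bar{y})}{\sum_{i=1}^{N} x_i (x_i - \bar{x})}
   = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}
\end{aligned}
```

The last step uses the fact that deviations from a mean sum to zero, so $\sum_i \bar{x}(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}(x_i - \bar{x}) = 0$ can be subtracted from the numerator and denominator without changing them.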

Linear Regression Code

Now that we have the math figured out, we can implement this in code. Let’s create a new class with parameters w and b.

class LinearRegression(object):
    """Implements linear regression"""
    def __init__(self):
        self.w = 0
        self.b = 0

Now we can write a fit function that computes both of these using the closed-forms we discussed.

def fit(self, X, y):
    mean_x = X.mean()
    mean_y = y.mean()
    errors_x = X - mean_x
    errors_y = y - mean_y
    errors_product_xy = np.sum(np.multiply(errors_x, errors_y))
    squared_errors_x = np.sum(errors_x ** 2)

    self.w = errors_product_xy / squared_errors_x
    self.b = mean_y - self.w * mean_x

We are making use of numpy’s vectorized operations to speed up our computation. We compute the means of our inputs and outputs using a quick numpy function. We can compute the errors in a vectorized fashion as well. In numpy, when we perform addition, subtraction, multiplication, or division of an array by a scalar, numpy will apply that operation to all values in the array. For example, when we subtract the scalar mean_x from the vector X, we’re actually taking each element of X and subtracting mean_x from it. Same goes for the vector y.

When we compute the errors, we have to tell numpy to do an element-wise multiplication of the two vectors, not a vector-vector multiplication. (Mathematically, this is called the Hadamard product). This takes the first element of errors_x and multiplies it by the first element of errors_y and so on to produce a new vector. Then we take the sum of all of the elements in that vector. This produces the numerator of the expression to compute the slope.

To compute the denominator, we can simply take the square of the errors and take the sum. Numpy can also apply exponents to each element of an array. Finally, we can compute the slope and y-intercept!
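The vectorized steps described above can be traced on a handful of numbers. The arrays below are made up so that the arithmetic is easy to follow by hand.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])   # made-up inputs
y = np.array([2.0, 4.0, 6.0, 8.0])   # made-up outputs

errors_x = X - X.mean()               # the scalar mean is subtracted from every element
errors_y = y - y.mean()

elementwise = np.multiply(errors_x, errors_y)   # Hadamard product, same as errors_x * errors_y
numerator = np.sum(elementwise)
denominator = np.sum(errors_x ** 2)             # ** squares each element

slope = numerator / denominator
```

Here errors_x is [-1.5, -0.5, 0.5, 1.5] and errors_y is [-3, -1, 1, 3], so the element-wise product is [4.5, 0.5, 0.5, 4.5]. Its sum is 10, the sum of squared errors_x is 5, and the slope comes out as 2, matching the fact that every y is twice its x.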

Before we visualize our answer, let’s make another convenience function that takes in inputs and applies our equation to return predictions.

def predict(self, X):
    return self.w * X + self.b

Now that we have the slope and y-intercept, we can visualize this along with the scatter plot of our data to see if our solution is correct.

def visualize_solution(X, y, lin_reg):
    plt.xlabel('Number of shares')
    plt.ylabel('Number of likes')
    plt.scatter(X, y)

    x = np.arange(0, 800)
    y = lin_reg.predict(x)
    plt.plot(x, y, 'r--')

    plt.show()

if __name__ == '__main__':
    X, y = load_dataset()
    lin_reg = LinearRegression()
    lin_reg.fit(X, y)
    
    visualize_solution(X, y, lin_reg)

Here, we simply create some x values and apply our line equation to them to produce the red, dashed line. The result is shown below:

We can see that our red, prediction line-of-best-fit fits most of our data! However, there’s another statistical measure that we can compute to give us more information about our line. In particular, it will tell us how linear our data are and the correlation.

Correlation Coefficient

The Pearson correlation coefficient is a number that represents how linear our data are and the relationship between the input and output, namely between the number of shares and the number of likes.

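We won’t cover the exact derivation of it, but we can represent it quite simply in statistical terms:

$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

(where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y and $\mathrm{cov}(X, Y)$ is the covariance of X and Y)

We can compute the covariance using the following equation:

$$\mathrm{cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$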

The correlation coefficient is a value from -1 to 1 that represents two things: the linearity of our data and the correlation between X and Y. A value close to -1 or 1 means that our data is very linear. A value close to 0 means that our data is not linear at all. If the value is positive, then high values of X tend to produce high values of Y and vice-versa. If the value is negative, high values of X tend to produce low values of Y and vice-versa. This is a very powerful metric and essential to answering our question: “do posts that are shared more often receive more likes?”
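To get a feel for these values, the sketch below computes the coefficient for a few small made-up datasets. The function name pearson_r is an assumption for this example; it divides by N in both the covariance and the standard deviations, which is also numpy’s default for std.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))   # population covariance (divide by N)
    return cov / (x.std() * y.std())                 # np.std also divides by N by default

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up data
y_noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)     # perfectly linear, increasing
r_down = pearson_r(x, -3 * x)      # perfectly linear, decreasing
r_noisy = pearson_r(x, y_noisy)    # upward trend with some scatter
```

The perfectly linear cases give 1 and -1, while the scattered upward trend lands at 0.8: still clearly positive, but no longer a perfect line.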

To compute the value, let’s add another property.

class LinearRegression(object):
    """Implements linear regression"""
    def __init__(self):
        self.w = 0
        self.b = 0
        self.rho = 0

We can compute this value in the fit function when we have the inputs and outputs.

def fit(self, X, y):
    mean_x = X.mean()
    mean_y = y.mean()
    errors_x = X - mean_x
    errors_y = y - mean_y
    errors_product_xy = np.sum(np.multiply(errors_x, errors_y))
    squared_errors_x = np.sum(errors_x ** 2)

    self.w = errors_product_xy / squared_errors_x
    self.b = mean_y - self.w * mean_x

    N = len(X)
    std_x = X.std()
    std_y = y.std()
    cov = errors_product_xy / N
    self.rho = cov / (std_x * std_y)

Now we can add some code in our visualization to show this value in the legend of the plot.

def visualize_solution(X, y, lin_reg):
    plt.xlabel('Number of shares')
    plt.ylabel('Number of likes')
    plt.scatter(X, y)

    x = np.arange(0, 800)
    y = lin_reg.predict(x)
    plt.plot(x, y, 'r--', label='r = %.2f' % lin_reg.rho)
    plt.legend()

    plt.show()

if __name__ == '__main__':
    X, y = load_dataset()
    lin_reg = LinearRegression()
    lin_reg.fit(X, y)
    
    visualize_solution(X, y, lin_reg)

Notice the label parameter to the plot function. Finally, we can show our plot:

We notice that our data has a correlation coefficient of +0.9, which is close to 1 and positive. This means our data is really linear and the more shares a post gets, the more likes it tends to get too! We’ve answered our question!

There are many more columns to this data set, and I encourage you to explore all of them using scatter plots, histograms, etc. Try to find hidden correlations! (But remember that correlation does not imply causation!)

To summarize, we learned how to answer a data science question using linear regression, which gives us the line-of-best-fit for our data. We use a cost function to mathematically represent what the “best” line is, and we can use optimization to directly solve for the slope and intercept of the line. However, the line alone might not be enough, so we can additionally compute the correlation coefficient to tell us the linearity of our data and the correlation between the inputs and outputs.

Linear regression is a fundamental machine learning algorithm and essential to data science!