It’s hard to think of a hotter topic than Deep Learning, and that’s what we’re going to talk about in depth and hands-on for the next few hours. I’m going to show you how neural networks work: artificial neural networks, perceptrons, multi-layer perceptrons, and then we’re going to get into some more advanced topics like convolutional neural networks and recurrent neural networks. None of that probably means anything to you right now, but the bottom line is, if you’ve been curious about how Deep Learning and artificial neural networks work, you’re going to understand that by the end of these next few hours, so think of it as Deep Learning for people in a hurry. I’m going to give you just enough depth to be dangerous, and there will be several hands-on activities and exercises, so you can get some confidence in actually applying these techniques and really understanding how they work and what they’re for. I think you’ll find that they’re a lot easier to use than you might have thought, so let’s dive in and see what it’s all about.
So first things first, let’s do some housekeeping. Some of you may be taking this section on Deep Learning and neural networks out of context from the larger Data Science and Machine Learning course that it’s a part of. Maybe you just skipped ahead to this section, or maybe you’re getting it on YouTube or something like that. If so, this is how to get the course materials that you need for this section of the course: just head on over to sundog-education.com/deep-learning, just like that. Pay attention to the dashes and capitalization and all that stuff, it all matters, and that should bring you to this page right here.
From here you can find this link for the course materials. This contains all of the scripts that are used in this section of the course. It’s a zip archive, so if you need an unzip utility, which you might on Mac OS, you need to install that first to decompress the file that you get. There’s also a link here to the course slides if you want to keep those around for future reference as well. There’s also a Facebook group for the larger course on Data Science and Machine Learning. You’re welcome to join us there if you want to; it’s completely optional, just a place for students to hang out and learn from each other. And finally, if you are just diving into this right now, you will need a Python 3 environment to work within. The one I use is called Enthought Canopy, and I’ve got a download link to it here. It is free software, but just make sure that you have a Python 3.5 or newer environment that supports Jupyter notebooks, or IPython notebooks as they might also be called. If you want to use Anaconda or something like that, it’s totally fine, but if you don’t have any existing scientific Python 3 system set up and installed on your computer, you can go ahead and download it from there and install it. Just follow the instructions, there’s nothing special about it at all. So that’s all you need to get started.
Let’s talk about some of the mathematical prerequisites that you need to understand Deep Learning. It’s probably going to be the most challenging part of the course actually, just some of the mathematical jargon that we need to familiarize ourselves with, but once we have these basic concepts down we can talk about them a little more easily. I think you’ll find that artificial intelligence itself is actually a very intuitive field, and once you get these basic concepts down it’s very easy to talk about and very easy to comprehend.
The first thing we want to talk about is gradient descent. This is basically a machine learning optimization technique for trying to find the most optimal set of parameters for a given problem. So what we’re plotting here is some sort of cost function, some measurement of the error of your learning system, and this applies to machine learning in general, right? You’re going to have some sort of function that defines how close your model’s results are to the results you want. So, always in the context of supervised learning, we will be feeding our algorithm, our model if you will, a group of parameters, some sort of ways that we have tuned the model, and we need to identify the values of those parameters that produce the optimal results. The idea with gradient descent is that you just pick some point at random, and each one of these dots represents some set of parameters to your model. Maybe it’s the various parameters for some model we’ve talked about before, or maybe it’s the exact weights within your neural network. Whatever it is, we’re going to try some set of parameters to start with, and we will then measure whatever the error is that that produces on our system, and then what we do is we move on down the curve here, right?
So we’ll try a different set of parameters here, again, just moving in a given direction with different parameter values, and we then measure the error that we get from that. In this case we actually achieved less error by trying this new set of parameters, so we say, “OK, I think we’re heading in the right direction here, let’s change them even more in the same way,” and we just keep on doing this at different steps until finally we hit the bottom of the curve here and our error starts to increase after that point. At that point we’ll know that we actually hit the bottom of this gradient, so you understand the nature of the term here, “gradient descent.” Basically we’re picking some point at random with a given set of parameters that we measure the error for, and we keep on pushing those parameters in a given direction until the error reaches a minimum and starts to come back up to some other value, OK? And that’s how gradient descent works in a nutshell. I’m not going to get into all the hardcore mathematics of it all. The concept is what’s important here, because gradient descent is how we actually train our neural networks to find an optimal solution.
Now you can see there are some areas of improvement here for this idea. First of all, you can think of this as sort of a ball rolling downhill, so one optimization that we’ll talk about later is using the concept of momentum. You can actually have that ball gain speed as it goes down the hill here, if you will, and slow down as it reaches the bottom and kind of bottoms out there. That’s a way to make it converge more quickly, and it can make actually training your neural networks even faster.
Another thing we’re talking about is the concept of local minima. So what if I randomly picked a point that ended up over here on this curve?
I might end up settling into this minimum here, which isn’t actually the point of least error. The point with the least error in this graph is over here. That’s a general problem in gradient descent: how do you make sure that you don’t get stuck in what’s called a local minimum? Because if you just look at this part of the graph, that looks like the optimal solution, and if I just happen to start over here, that’s where I’m going to get stuck. Now there are various ways of dealing with this problem. Obviously you could start from different locations to try to prevent that sort of thing, but in practical terms it turns out that local minima aren’t really that big of a deal when it comes to training neural networks. This just doesn’t really happen that often, you don’t end up with shapes like this in practice, so we can get away with not worrying about that as much. That’s a very good thing, because for a long time people believed that AI would be limited by this local minima effect, and in practice it’s really not that big of a deal.
Another concept we need to familiarize ourselves with is something called “autodiff,” and we don’t really need to go into the hardcore mathematics of how autodiff works, you just need to know what it is and why it’s important. So when you’re doing gradient descent, somehow you need to know what the gradient is, right? We need to measure the slope that we’re taking along our cost function, our measurement of error, which might be mean squared error for all we know, and to do that mathematically you need to get into calculus, right? If you’re trying to find the slope of a curve and you’re dealing with multiple parameters, then we’re talking about partial derivatives, right? The first partial derivatives, to figure out the slope that we’re heading in.
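If it helps to see the idea in code, here is a minimal sketch of gradient descent that measures the slope the brute-force way, by nudging the parameter a tiny bit in each direction. It is only an illustration: the `cost` curve, the starting points, the learning rate, and the step count are all made-up assumptions, not anything from the course materials. The curve was picked to have two valleys, so one starting point settles into a local minimum while the other finds the lower one.

```python
def cost(x):
    # Hypothetical error curve with two valleys (an assumption for illustration).
    return x**4 - 3 * x**2 + x


def slope(f, x, h=1e-6):
    # Brute-force slope estimate: evaluate the cost slightly to either side.
    return (f(x + h) - f(x - h)) / (2 * h)


def gradient_descent(f, start, learning_rate=0.01, steps=2000):
    # Repeatedly move the parameter a small step in the downhill direction.
    x = start
    for _ in range(steps):
        x -= learning_rate * slope(f, x)
    return x


left = gradient_descent(cost, start=-2.0)   # starts on the left side of the curve
right = gradient_descent(cost, start=2.0)   # starts on the right side of the curve

print("from the left: ", round(left, 2), "cost", round(cost(left), 2))
print("from the right:", round(right, 2), "cost", round(cost(right), 2))
```

With these made-up numbers, the run from the left should settle near x = -1.30 and the run from the right near x = 1.13, and the left valley has the lower cost, so the right-hand run is the one stuck in a local minimum. Note that a single parameter needs two extra cost evaluations per step here; with many parameters that count grows with every parameter you add.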
Now it turns out that this is very mathematically intensive and inefficient for computers to do, so just doing the brute-force approach to gradient descent gets very expensive very quickly. Autodiff is a technique for speeding that up. Specifically, we use something called reverse-mode autodiff, and what you need to know is that it can compute all the partial derivatives you need by traversing your graph a number of times equal to the number of outputs you have, plus one. This works out really well in neural networks, because in a neural network you tend to have artificial neurons that have very many inputs, but probably only one output, or very few outputs in comparison to the inputs. So this turns out to be a pretty good little calculus trick. It’s complicated, you can look up how it works, it is pretty hardcore stuff, but it works and that’s what’s important. What’s also important is that it’s what the TensorFlow library uses under the hood to implement its gradient descent. So again, you’re never going to have to actually implement gradient descent from scratch or implement autodiff from scratch. These are all baked into the libraries that we’re using, libraries such as TensorFlow for doing Deep Learning, but they are terms that we throw around a lot, so it’s important that you at least know what they are and why they’re important.
So just to back up a little bit, gradient descent is the technique we’re using to find a minimum of the error that we’re trying to optimize for, given a certain set of parameters, and autodiff is a way of accelerating that process, so we don’t have to do quite as much math or quite as much computation to actually measure the gradient in gradient descent.
One other thing we need to talk about is softmax. Again, the mathematics aren’t so complicated here, but what’s really important is understanding what it is and what it’s for.
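Before getting to softmax, the reverse-mode autodiff idea can be roughly sketched in plain Python. This is a toy illustration only, not how TensorFlow actually implements it; the `Value` class and the `backward` function are hypothetical names made up for this sketch, and the inputs, weights, and target are arbitrary. It models one artificial neuron with three inputs and one output: one forward pass computes the error, and one backward pass through the graph then fills in the partial derivative for every weight at once.

```python
class Value:
    """A number that remembers which values it was computed from."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0          # d(output)/d(this value), filled in by backward()
        self.parents = parents   # pairs of (parent Value, local derivative)

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))


def backward(output):
    """Traverse the graph once, from the output back toward the inputs."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk so every node is listed after the nodes it depends on.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule


# Made-up numbers for a single neuron: three inputs, one output.
inputs = [Value(1.0), Value(2.0), Value(3.0)]
weights = [Value(0.5), Value(-1.0), Value(2.0)]
bias = Value(0.1)
target = 4.0

# Forward pass: weighted sum of the inputs, then the squared error.
output = bias
for w, x in zip(weights, inputs):
    output = output + w * x
diff = output + Value(-target)
loss = diff * diff

# Backward pass: one traversal yields the slope for every weight and the bias.
backward(loss)
print([w.grad for w in weights], bias.grad)
```

The point of the sketch is the count of traversals: with one output (the error), it takes one pass forward and one pass back, no matter how many weights feed into the neuron.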
So basically, when you have the end result of a neural network, you end up with a bunch of what we call weights, really just raw output values, that come out of the neural network at the end. So how do we make use of that? How do we make practical use of the output of our neural networks? Well, that’s where softmax comes in. Basically it converts each of the final weights that come out of your neural network into a probability. So if you’re trying to classify something in your neural network, like, for example, decide if an image is a picture of a face or a picture of a dog or a picture of a stop sign, you might use softmax at the end to convert those final outputs of the neurons into probabilities for each class, OK? And then you can just pick the class that has the highest probability. So it’s just a way of normalizing things, if you will, into a comparable range, in such a manner that if you choose the highest value of the softmax function from the various outputs, you end up with the best choice of classification at the end of the day. It’s just a way of converting the final output of your neural network to an actual answer for a classification problem. So again, you might have the example of a neural network that’s trying to drive your car for you, and it needs to identify pictures of stop signs or yield signs or traffic lights. You might use softmax at the end of some neural network that will take your image and classify it as one of those sign types, right?
So again, just to recap: gradient descent is an algorithm for minimizing error over multiple steps. Basically we start at some random set of parameters, measure the error, move those parameters in a given direction, see if that results in more error or less error, and just try to move in the direction of minimizing error until we find the actual bottom of the curve, where we have a set of parameters that minimizes the error of whatever it is you’re trying to do.
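To make the softmax step from a moment ago concrete, here is a small sketch in plain Python. The class labels come from the example above, but the raw output values in `scores` are invented for illustration; a real network would produce its own. Subtracting the largest score before exponentiating is a common numerical-stability habit and does not change the result.

```python
import math


def softmax(scores):
    # Shift by the largest score so the exponentials cannot overflow.
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    # Dividing by the total makes the values positive and sum to 1.
    return [e / total for e in exps]


labels = ["face", "dog", "stop sign"]
scores = [2.0, 1.0, 0.1]   # hypothetical raw outputs of the final layer

probs = softmax(scores)
best = labels[probs.index(max(probs))]   # pick the class with the highest probability

for label, p in zip(labels, probs):
    print(label, round(p, 3))
print("prediction:", best)
```

With these made-up scores the probabilities come out to roughly 0.659, 0.242, and 0.099, so the picked class is "face".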
Autodiff is just a calculus trick for making gradient descent faster. It makes it easier to find the gradients in gradient descent just by using some calculus trickery. And softmax is just something we apply on top of our neural network at the very end to convert the final output of our neural network to an actual choice of classification, given several classification types to choose from. OK? So those are the basic mathematical terms, or algorithmic terms, that you need to understand to talk about artificial neural networks. So with that under our belt, let’s talk about artificial neural networks next.