*35C3 preroll music* Herald Angel: Welcome to our introduction

to deep learning with Teubi. Deep learning, also often called machine learning,

is a hype word which we hear in the media all the time. It’s nearly as bad as blockchain. It’s a solution for everything. Today we’ll get a sneak peek into the internals

of this mystical black box they are talking about. And Teubi will show us why people who know what machine learning really is about have to facepalm so often,

when they read the news. So please welcome Teubi with a big round of applause! *Applause* Teubi: Alright! Good morning and welcome to Introduction to

Deep Learning. The title will already tell you what this

talk is about. I want to give you an introduction to

how deep learning works, what happens inside this black box. But, first of all, who am I? I’m Teubi. It’s a German nickname, it has nothing to

do with toys or bees. You might have heard my voice before,

because I host the Nussschale podcast. There I explain scientific topics in under

10 minutes. I’ll have to use a little more time today,

and you’ll also have fancy animations which hopefully will help. In my day job I’m a research scientist at

an institute for computer vision. I analyze microscopy images of bone marrow

blood cells and try to find ways to teach the computer to understand what it sees. Namely, to differentiate between certain cells or, first of all, find cells in an image,

which is a task that is more complex than it might sound. Let me start with

the introduction to deep learning. We all know how to code. We code in a very simple way. We have some input for a computer algorithm. Then we have an algorithm which says:

Do this, do that. If this, then that. And in that way we generate some output. This is not how machine learning works. Machine learning assumes you have some

input, and you also have some output. And what you also have is some statistical

model. This statistical model is flexible. It has certain parameters, which it can

learn from the distribution of inputs and outputs you give it for training. So you basically teach the statistical model to generate the desired output

from the given input. Let me give you a really simple example of how this might work. Let’s say we have two animals. Well, we have two kinds of animals:

unicorns and rabbits. And now we want to find an algorithm that

tells us whether this animal we have right now as an input is a rabbit or a unicorn. We can write a simple algorithm to do

that, but we can also do it with machine learning. The first thing we need is some input. I choose two features that are able to tell

me whether this animal is a rabbit or a unicorn. Namely, speed and size. We call these features,

and they describe something about what we want to classify. And the class is in this case our animal. First thing I need is some training data, some input. The input here are just pairs

of speed and size. What I also need is information about the desired output. The

desired output, of course, being the class. So either unicorn or rabbit, here

denoted by yellow and red X’s. So let’s try to find a statistical model which we can use to separate this feature

space into two halves: One for the rabbits, one for the unicorns. Looking at this, we can actually find a really

simple statistical model, and our statistical model in this case is

just a straight line. And the learning process is then to find where

in this feature space the line should be. Ideally, for example, here. Right in the middle between the two classes

rabbit and unicorn. Of course this is an overly simplified example. Real-world applications have feature distributions which look much more like this. So, we

have a gradient, we don’t have a perfect separation between those two classes, and

those two classes are definitely not separable by a line. If we look again at

some training samples — training samples are the data points we use for the machine

learning process, so, to try to find the parameters of our statistical model — if

we look at the line again, then this will not be able to separate this training set. Well, we will have a line that has some errors, some unicorns which will be

classified as rabbits, some rabbits which will be classified as unicorns. This is what we call underfitting. Our model is just not able to express what

we want it to learn. There is the opposite case. The opposite case being: we just learn all

the training samples by heart. This is if we have a very complex model and just a few

training samples to teach the model what it should learn. In this case we have a

perfect separation of unicorns and rabbits, at least for the few data points

we have. If we draw another example from the real world, some other data points,

they will most likely be wrong. And this is what we call overfitting. The perfect

scenario in this case would be something like this: a classifier which is really

close to the distribution we have in the real world and machine learning is tasked

with finding this perfect model and its parameters. Let me show you a different

kind of model, something you probably all have heard about: Neural networks. Neural

networks are inspired by the brain. Or more precisely, by the neurons in our

brain. Neurons are tiny objects, tiny cells in our brain that take some input

and generate some output. Sounds familiar, right? We have inputs usually in the form

of electrical signals. And if they are strong enough, this neuron will also send

out an electrical signal. And this is something we can model in a computer-

engineering way. So, what we do is: We take a neuron. The neuron is just a simple

mapping from input to output. Input here, just three input nodes. We denote

them by i1, i2 and i3 and output denoted by o. And now you will actually see some

mathematical equations. There are not many of these in this foundation talk, don’t

worry, and it’s really simple. There’s one more thing we need first, though, if we

want to map input to output in the way a neuron does. Namely, the weights. The weights are just some arbitrary numbers

for now. Let’s call them w1, w2 and w3. So, we take those weights and we multiply them with the input. Input1 times weight1,

input2 times weight2, and so on. And this sum will just be our output. Well,

not quite. We make it a little bit more complicated. We also use something called

an activation function. The activation function is just a mapping from one scalar

value to another scalar value. In this case from what we got as an output,

the sum, to something that more closely fits what we need. This could for example be

something binary, where we have all the negative numbers being mapped to zero and

all the positive numbers being mapped to one. And then this zero and one can encode

something. For example: rabbit or unicorn. So, let me give you an example of how we

can make the previous example with the rabbits and unicorns work with such a

simple neuron. We just use speed, size, and the arbitrarily chosen number 10 as

our inputs and the weights 1, 1, and -1. If we look at the equations, then we get

for our negative numbers — so, speed plus size being less than 10 — a 0, and a 1 for

all positive numbers — speed plus size greater than 10. This way we again have a separating line between unicorns and rabbits. But again we
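As a standalone sketch, the toy neuron just described can be written out in a few lines. The feature values, and the assumption that the 1 output encodes "unicorn", are made up for illustration:

```python
# Toy neuron from the example: inputs (speed, size, 10), weights (1, 1, -1),
# followed by a binary step activation. Which class the 1 encodes is an
# assumption made for this sketch.
def classify(speed, size):
    weighted_sum = 1 * speed + 1 * size + (-1) * 10
    output = 1 if weighted_sum > 0 else 0  # step activation
    return "unicorn" if output == 1 else "rabbit"

print(classify(3, 2))   # speed + size < 10 -> rabbit
print(classify(30, 5))  # speed + size > 10 -> unicorn
```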

have this really simplistic model. We want to become more and more complicated

in order to express more complex tasks. So what do we do? We take more neurons. We take our three input values and put them into one neuron, and into a second neuron,

and into a third neuron. And we take the output of those three neurons as input for

another neuron. We also call this a multilayer perceptron. Perceptron just being a different name for

a neuron here. And the whole thing is also called a neural

network. So now the question: How do we train this? How do we

learn what this network should encode? Well, we want a mapping from input to

output, and what we can change are the weights. First, what we do is we take a

training sample, some input. Put it through the network, get an output. But

this might not be the desired output which we know. So, in the binary case there are

four possible cases: computed output and expected output, each taking one of the two values 0 and 1. The best case would be: we want a 0, get a

0, want a 1 and get a 1. But there is also the opposite case. In these two cases we can learn something

about our model. Namely, in which direction to change the

weights. It’s a little bit simplified, but in principle you just raise the weights if

you need a higher number as output and you lower the weights if you need a lower

number as output. To tell you how much, we have two terms. First term being the error, so in this case just the difference between

desired and computed output – also often called a loss function, especially in deep

learning and more complex applications. You also have a second term

we call the learning rate, and the learning rate is what tells us how quickly

we should change the weights, how quickly we should adapt the weights. Okay, this is how we learn a model. This is almost everything you need to know. There are mathematical equations that tell

you how much to change based on the error and the

learning rate. And this is the entire learning process. Let's get back to the
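The update rule just described can be sketched as follows. The single-neuron setup and the data values are made up for illustration, and real frameworks use proper gradients instead of this simplified rule:

```python
# Simplified perceptron-style update: move each weight in the direction of
# the error, scaled by the learning rate and by the corresponding input.
def train_step(weights, inputs, desired, learning_rate=0.1):
    output = 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0
    error = desired - output  # difference between desired and computed output
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]

weights = [0.0, 0.0, 0.0]
features, label = [2.0, 3.0, 1.0], 1  # one made-up training sample
for _ in range(5):
    weights = train_step(weights, features, label)
print(weights)  # weights were raised until the sample is classified as 1
```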

terminology. We have the input layer. We have the output layer, which somehow

encodes our output either in one value or in several values if we have multiple classes. We also have the hidden layers, which are actually what

makes our model deep. What we can change, what we can learn, are the weights,

the parameters of this model. But what we also need to keep in mind, is the number

of layers, the number of neurons per layer, the learning rate, and the

activation function. These are called hyperparameters,

and they determine how complex our model is, how well it is suited to solve the task at

hand. I quite often spoke about solving tasks, so

the question is: What can we actually do with neural networks? Mostly classification tasks, for example: Tell me, is this animal a rabbit or unicorn? Is this text message spam or legitimate? Is this patient healthy or ill? Is this image a picture of a cat or a dog? We already saw for the animal that we

need something called features, which somehow encodes information about

what we want to classify, something we can use as input

for the neural network. Some kind of number that is meaningful. So, for the animal it could be speed, size,

or something like color. Color, of course, being more complex again,

because we have, for example, RGB, so 3 values. And, text message being a more complex case

again, because we somehow need to encode the sender, and whether the sender is

legitimate. Same for the recipient, or the

number of hyperlinks, or where the hyperlinks refer to, or whether there

are certain words present in the text. It gets more and more complicated. Even more

so for a patient. How do we encode medical history in a proper way for the network to

learn? I mean, temperature is simple. It's a scalar value, we just have a number. But how do we encode whether certain symptoms are present? And the image, which is

actually what I work with every day, is again quite complex. We have values, we

have numbers, but only pixel values, which are difficult to

use as input for a neural network. Why? I’ll show you. I’ll actually show you with

this picture, it’s a very famous picture, and everybody uses it in computer vision. They will tell you, it’s because there is a multitude of different characteristics

in this image: shapes, edges, whatever you desire. The truth is, it’s a crop from the

centrefold of Playboy, and in earlier years computer vision engineers were a mostly male audience. Anyway, let's take five by five pixels. Let's assume this is

a really small five-by-five-pixel image. If we take those 25 pixels and use

them as input for a neural network you already see that we have many connections

– many weights – which means a very complex model. Complex model, of course,

prone to overfitting. But there are more problems. First being, we have

disconnected a pixel from its neighbors. We can't encode

information about the neighborhood anymore, and that really sucks. If we just

take the whole picture, and move it to the left or to the right by just one pixel,

the network will see something completely different, even though to us it is exactly

the same. But, we can solve that with some very clever engineering, something we call

a convolutional layer. It is again a hidden layer in a neural network, but it

does something special. It actually is a very simple neuron again, just four input

values – one output value. But the four input values look at two by two pixels,

and encode one output value. And then the same network is shifted to the right, and

encodes another pixel, and another pixel, and the next row of pixels. And in this
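A minimal sketch of such a convolutional layer, assuming a two-by-two window and hand-picked (not learned) weights:

```python
# One 2x2 convolution "neuron" slid across a tiny grayscale image:
# four weights, one output value per position, producing a new 2D image.
def conv2x2(image, weights):
    height, width = len(image), len(image[0])
    out = []
    for y in range(height - 1):
        row = []
        for x in range(width - 1):
            row.append(weights[0] * image[y][x] + weights[1] * image[y][x + 1]
                       + weights[2] * image[y + 1][x] + weights[3] * image[y + 1][x + 1])
        out.append(row)
    return out  # slightly smaller than the input image

image = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]
print(conv2x2(image, [1, 0, 0, -1]))  # [[0, -1], [0, 0]]
```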

way it creates another 2D image. We have preserved information about the

neighborhood, and we just have a very low number of weights, not the huge number of

parameters we saw earlier. We can use this once, or twice, or several hundred times. And this is actually where we go deep. Deep means: We have several layers, and

having layers that don’t need thousands or millions of connections, but only a few. This is what allows us to go really deep. And in this fashion we can encode an

entire image in just a few meaningful values. What these values look like, and

what they encode, this is learned through the learning process. And we can then, for

example, use these few values as input for a classification network. The fully connected network we saw earlier. Or we can do something more clever. We can

do the inverse operation and create an image again, for example, the same image, which

is then called an autoencoder. Autoencoders are tremendously useful, even

though they don’t appear that way. For example, imagine you want to check whether

something has a defect, or not, a picture of a fabric, or of something. You just

train the network with normal pictures. And then, if you have a defect picture,

the network is not able to produce this defect. And so the difference of the

reproduced picture, and the real picture will show you where errors are. If it
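The defect-detection idea can be sketched as follows. The "autoencoder" here is a stand-in that always reproduces the normal pattern it was trained on; a real one would be a trained network, and the pixel values are made up:

```python
# Reconstruction-error defect map: the difference between the input and
# what the (stand-in) autoencoder reproduces highlights the defect.
NORMAL_PATTERN = [5, 5, 5, 5, 5, 5]

def reconstruct(image):
    # stand-in: a trained autoencoder outputs its best "normal" guess
    return NORMAL_PATTERN

def defect_map(image):
    return [abs(a - b) for a, b in zip(image, reconstruct(image))]

print(defect_map([5, 5, 9, 5, 5, 5]))  # [0, 0, 4, 0, 0, 0] -> defect at index 2
```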

works properly, I have to admit. But we can go even further. Let's say we

want to encode something else entirely. Well, let's encode the image, the

information in the image, but in another representation. For example, let’s say we

have three classes again. The background class in grey, a class called hat or

headwear in blue, and person in green. We can also use this for other applications

than just for pictures of humans. For example, we have a picture of a street and

want to encode: Where is the car, where’s the pedestrian? Tremendously useful. Or we

have an MRI scan of a brain: Where in the brain is the tumor? Can we somehow learn

this? Yes, we can do this with methods like these, if they are trained properly. More about that later. Well, we expect something like this to come out, but the

truth looks rather like this – especially if it’s not properly trained. We have not

the real shape we want to get but something distorted. So here is again

where we need to do learning. First we take a picture, put it through the

network, get our output representation. And we have the information about how we

want it to look. We again compute some kind of loss value. This time for example

being the overlap between the shape we get out of the model and the shape we want to

have. And we use this error, this loss function, to update the weights of our

network. Again – even though it’s more complicated here, even though we have more

layers, and even though the layers look slightly different – it is the same

process all over again as with the binary case. And we need lots of training data. This is something that you'll hear often in connection with deep learning: You need
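The overlap-style loss mentioned here can be sketched on flat binary masks; real segmentation losses work the same way on 2D masks, and the mask values below are made up:

```python
# Intersection-over-union loss between a predicted binary mask and the
# desired one: 0 means perfect overlap, 1 means no overlap at all.
# Assumes at least one of the two masks has a positive pixel.
def iou_loss(predicted, target):
    intersection = sum(p * t for p, t in zip(predicted, target))
    union = sum(max(p, t) for p, t in zip(predicted, target))
    return 1.0 - intersection / union

predicted = [1, 1, 0, 0]
target    = [1, 0, 0, 0]
print(iou_loss(predicted, target))  # 0.5
```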

lots of training data to make this work. Images are complex things and in order to

meaningfully extract knowledge from them, the network needs to see a multitude of

different images. Well now I already showed you some things we use in network

architecture, some support networks: The fully convolutional encoder, which takes

an image and produces a few meaningful values out of this image; its counterpart

the fully convolutional decoder – fully convolutional meaning by the way that we

only have these convolutional layers with a few parameters that somehow encode

spatial information and keep it for the next layers. The decoder takes a few

meaningful numbers and reproduces an image – either the same image or another

representation of the information encoded in the image. We also already saw the

fully connected network. Fully connected meaning every neuron is connected to every

neuron in the next layer. This of course can be dangerous because this is where we

actually get most of our parameters. If we have a fully connected network, this is

where the most parameters will be present because connecting every node to every

node … this is just a high number of connections. We can also do other things. For example something called a pooling layer. A pooling layer being basically the

same as one of those convolutional layers, just that we don’t have parameters we need

to learn. This works without parameters because this neuron just chooses whichever

value is the highest and takes that value as output. This is really great for
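A sketch of two-by-two max pooling, which has no weights to learn; the image values are made up for illustration:

```python
# 2x2 max pooling: each output value is simply the largest of the four
# input values it looks at, halving the image size in each dimension.
def max_pool2x2(image):
    out = []
    for y in range(0, len(image) - 1, 2):
        row = []
        for x in range(0, len(image[0]) - 1, 2):
            row.append(max(image[y][x], image[y][x + 1],
                           image[y + 1][x], image[y + 1][x + 1]))
        out.append(row)
    return out

print(max_pool2x2([[1, 3, 2, 0],
                   [4, 2, 1, 1],
                   [0, 0, 5, 6],
                   [1, 2, 7, 8]]))  # [[4, 2], [2, 8]]
```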

reducing the size of your image and also getting rid of information that might not

be that important. We can also do some clever techniques like adding a dropout

layer. A dropout layer just being a normal layer in a neural network where we remove

some connections: In one training step these connections, in the next training

step some other connections. This way we teach the other connections to become more

resilient against errors. I would like to start with something I call the “Model

Show” now, and show you some models and how we train those models. And I will

start with a fully convolutional decoder we saw earlier: This thing that takes a

number and creates a picture. I would like to take this model, put in some number and

get out a picture – a picture of a horse for example. If I put in a different

number I also want to get a picture of a horse, but of a different horse. So what I

want to get is a mapping from some numbers, some features that encode

something about the horse picture, and get a horse picture out of it. You might see

already why this is problematic. It is problematic because we don’t have a

mapping from feature to horse or from horse to features. So we don’t have a

truth value we can use to learn how to generate this mapping. Well computer

vision engineers – or deep learning professionals – they’re smart and have

clever ideas. Let’s just assume we have such a network and let’s call it a

generator. Let’s take some numbers put, them into the generator and get some

horses. Well it doesn’t work yet. We still have to train it. So they’re probably not

only horses but also some very special unicorns among the horses; which might be

nice for other applications, but I wanted pictures of horses right now. So I can’t

train with this data directly. But what I can do is I can create a second network. This network is called a discriminator and I can give it the input generated from the

generator as well as the real data I have: the real horse pictures. And then I can

teach the discriminator to distinguish between those. Tell me it is a real horse

or it’s not a real horse. And there I know what is the truth because I either take

real horse pictures or fake horse pictures from the generator. So I have a truth

value for this discriminator. But in doing this I also have a truth value for the

generator. Because I want the generator to work against the discriminator. So I can

also use the information how well the discriminator does to train the generator

to become better at fooling it. This is called a generative adversarial network. And it can be used to generate pictures of an arbitrary distribution. Let's do this
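The generator-versus-discriminator game can be sketched with scalar stand-ins. Everything here (the "real" data value, the nudge-based update replacing gradient descent) is an assumption made to keep the sketch runnable; real GANs use neural networks and backpropagation:

```python
import random

# GAN training structure, toy version: the generator maps noise to a number,
# the discriminator scores how "real" a number looks, and the generator is
# nudged in whichever direction fools the discriminator more.
REAL_MEAN = 5.0  # stand-in for the real data distribution

def generator(z, w):
    return w * z

def discriminator(x):
    return -abs(x - REAL_MEAN)  # higher score = looks more real

def train_generator(steps=200, lr=0.1):
    w = 0.0
    rng = random.Random(42)
    for _ in range(steps):
        z = rng.uniform(0.5, 1.5)  # random noise input
        if discriminator(generator(z, w + lr)) > discriminator(generator(z, w)):
            w += lr  # crude stand-in for a gradient step
        else:
            w -= lr
    return w

w = train_generator()
print(abs(generator(1.0, w) - REAL_MEAN))  # generator output ends up near the real data
```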

with numbers and I will actually show you the training process. Before I start the

video, I’ll tell you what I did. I took some handwritten digits. There is a

database of handwritten digits called MNIST, so the numbers 0 to 9. And I

took those and used them as training data. I trained a generator in the way I showed

you on the previous slide, and then I just took some random numbers. I put those

random numbers into the network and just stored the image of what came out of the

network. And here in the video you’ll see how the network improved with ongoing

training. You will see that we start basically with just noisy images … and

then after some – what we call epochs, so training iterations – the network is

able to almost perfectly generate handwritten digits just from noise. Which

I find truly fascinating. Of course this is an example where it works. It highly

depends on your data set and how you train the model whether it is a success or not. But if it works, you can use it to generate fonts. You can generate

characters, 3D objects, pictures of animals, whatever you want as long as you

have training data. Let’s go more crazy. Let’s take two of those and let’s say we

have pictures of horses and pictures of zebras. I want to convert those pictures

of horses into pictures of zebras, and I want to convert pictures of zebras into

pictures of horses. So I want to have the same picture just with the other animal. But I don’t have training data of the same situation just once with a horse and once

with a zebra. Doesn’t matter. We can train a network that does that for us. Again we

just have a network – we call it the generator – and we have two of those: One

that converts horses to zebras and one that converts zebras to horses. And then

we also have two discriminators that tell us: real horse – fake horse – real zebra

– fake zebra. And then we again need to perform some training. So we need to

somehow encode: Did what we wanted to do work? And a very simple way to do this is

we take a picture of a horse put it through the generator that generates a

zebra. Take this fake picture of a zebra, put it through the generator that

generates a picture of a horse. And if this is the same picture as we put in,

then our model worked. And if it didn’t, we can use that information to update the

weights. I just took a random picture, from a free library in the Internet, of a

horse and generated a zebra and it worked remarkably well. I actually didn’t even do

training. It also doesn’t need to be a picture. You can also convert text to

images: You describe something in words and generate images. You can age your face

or age a cell; or make a patient healthy or sick – or the image of a patient, not

the patient themselves, unfortunately. You can do style transfer, like take a picture by

Van Gogh and apply it to your own picture. Stuff like that. Something else that we

can do with neural networks. Let’s assume we have a classification network, we have

a picture of a toothbrush and the network tells us: Well, this is a toothbrush. Great! But how resilient is this network? Does it really work in every scenario? There's a second network we can apply: We call it an adversarial network. And that

network is trained to do one thing: Look at the network, look at the picture, and

then find the one weak spot in the picture: Just change one pixel slightly so

that the network will tell me this toothbrush is an octopus. Works remarkably
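A sketch of the one-pixel idea, with a trivial threshold "classifier" standing in for a real network; the labels, pixel values, and brute-force search are all made up for illustration:

```python
# One-pixel attack, toy version: try changing each pixel to an extreme
# value until the stand-in classifier flips its decision.
def predict(pixels):
    return "octopus" if sum(pixels) > 10 else "toothbrush"

def one_pixel_attack(pixels, target_class):
    for i in range(len(pixels)):
        for value in (0, 255):
            changed = list(pixels)
            changed[i] = value
            if predict(changed) == target_class:
                return changed
    return None  # no single-pixel change fools this classifier

image = [1, 2, 1, 3]                      # classified as "toothbrush"
adversarial = one_pixel_attack(image, "octopus")
print(adversarial, predict(adversarial))  # one changed pixel flips the class
```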

well. Also works with just changing the picture slightly, so changing all the

pixels, but just slight minute changes that we don’t perceive, but the network –

the classification network – is completely thrown off. Well sounds bad. Is bad if you

don’t consider it. But you can also for example use this for training your network

and make your network resilient. So there's always an upside and a downside. Something else entirely: now I'd like to show you something about text, a word-language model. I want to generate sentences for my podcast. I have a network

that gives me a word, and then if I want to somehow get the next word in the

sentence, I also need to consider this word. So another network architecture –

quite interestingly – just takes the hidden states of the network and uses them

as the input for the same network so that in the next iteration we still know what

we did in the previous step. I tried to train a network that generates podcast

episodes for my podcasts. Didn’t work. What I learned is I don’t have enough

training data. I really need to produce more podcast episodes in order to train a

model to do my job for me. And this is very important, a very crucial point:

Training data. We need shitloads of training data. And actually the more

complicated our model and our training process becomes, the more training data we

need. I started with the supervised case – the really simple case where we have a picture and a label that

corresponds to that picture; or a representation of that picture showing

entirely what I wanted to learn. But we also saw a more complex task, where I had

two sets of pictures – horses and zebras – that are from two different domains – but domains

with no direct mapping. What can also happen – and actually happens quite a lot

– is weakly annotated data, so data that is not precisely annotated; where we can’t

rely on the information we get. Or even more complicated: Something called

reinforcement learning where we perform a sequence of actions and then in the end

are told “yeah that was great”. Which is often not enough information to really

perform proper training. But of course there are also methods for that. As well

as there are methods for the unsupervised case where we don’t have annotations,

labeled data – no ground truth at all – just the picture itself. Well I talked

about pictures. I told you that we can learn features and create images from

them. And we can use them for classification. And for this there exist

many databases. There are public data sets we can use. Often they refer to, for example, Flickr. They're just hyperlinks, which is also why I didn't show you many

pictures right here, because I am honestly not sure about the copyright in those

cases. But there are also challenge datasets where you can just sign up, get

some for example medical data sets, and then compete against other researchers. And of course there are those companies that just have lots of data. And those

companies also have the means, the capacity to perform intense computations. And those are also often the companies you hear from in terms of innovation for deep

learning. Well this was mostly to tell you that you can process images quite well

with deep learning if you have enough training data, if you have a proper

training process and also a little if you know what you’re doing. But you can also

process text, you can process audio and time series like prices on a stock

exchange – stuff like that. You can process almost everything if you make it

encodable to your network. Sounds like a dream come true. But – as I already told

you – you need data, a lot of it. I told you about those companies that have lots

of data sets and the publicly available data sets which you can actually use to

get started with your own experiments. But that also makes it a little dangerous

because deep learning still is a black box to us. I told you what happens inside the

black box on a level that teaches you how we learn and how the network is

structured, but not really what the network learned. It is for us computer

vision engineers really nice that we can visualize the first layers of a neural

network and see what is actually encoded in those first layers; what information

the network looks at. But you can’t really mathematically prove what happens in a

network. Which is one major downside. And so if you want to use it, the numbers may

be really great but be sure to properly evaluate them. In summary I call that

“easy to learn”. Every one – every single one of you – can just start with deep

learning right away. You don’t need to do much work. You don’t need to do much

learning. The model learns for you. But they’re hard to master in a way that makes

them useful for production use cases for example. So if you want to use deep

learning for something – if you really want to seriously use it – make sure that

it really does what you want it to and doesn't learn something else – which also

happens. Pretty sure you saw some talks about deep learning fails – which is not

what this talk is about. They’re quite funny to look at. Just make sure that they

don’t happen to you! If you do that though, you’ll achieve great things with

deep learning, I'm sure. And that was Introduction to Deep Learning. Thank you! *Applause* Herald Angel: So now it's question and

answer time. So if you have a question, please line up at the mikes. We have in

total eight, so it shouldn’t be far from you. They are here in the corridors and on

these sides. Please line up! For everybody: A question consists of one

sentence with a question mark at the end – not three minutes of rambling. And also

if you go to the microphone, speak into the microphone, so you really get close to

it. Okay. Where do we have … Number 7! We start with mic number 7:

Question: Hello. My question is: How did you compute the example for the fonts, the

numbers? I didn’t really understand it, you just said it was made from white

noise. Teubi: I’ll give you a really brief recap

of what I did. I showed you that we have a model that maps image to some meaningful

values, that an image can be encoded in just a few values. What happens here is

exactly the other way round. We have some values, just some arbitrary values we

actually know nothing about. We can generate pictures out of those. So I

trained this model to just take some random values and show the pictures

generated from the model. The training process was this “min max game”, as its

called. We have two networks that try to compete against each other. One network

trying to distinguish, whether a picture it sees is real or one of those fake

pictures, and the network that actually generates those pictures and in training

the network that is able to distinguish between those, we can also get information

for the training of the network that generates the pictures. So the videos you

saw were just animations of what happens during this training process. At first if

we input noise we get noise. But as the network is able to better and better

recreate those images from the dataset we used as input, in this case pictures of

handwritten digits, the output also became more and more like those numbers, these

handwritten digits. Hope that helped. Herald Angel: Now we go to the

Internet. Can we get sound for the signal Angel,

please? Teubi: Sounded so great,

“now we go to the Internet.” Herald Angel: Yeah, that sounds like

“yeeaah”. Signal Angel: And now we’re finally ready

to go to the interwebs. “Schorsch” is asking: Do you have any recommendations

for a beginner regarding the framework or the software? Teubi: I, of course, am very biased to recommend what I use every day. But I also

think that it is a great start. Basically, use python and use pytorch. Many people

will disagree with me and tell you “tensorflow is better.” It might be, in my

opinion, not for getting started, and there are also some nice tutorials on the

pytorch website. What you can also do is look at websites like OpenAI, where they

have a gym to get you started with some training exercises, where you already have

datasets. Yeah, basically my recommendation is get used to Python and

start with a pytorch tutorial, see where to go from there. Often there are also some

github repositories linked with many examples for already established network

architectures like the cycle GAN or the GAN itself or basically everything else. There will be a repo you can use to get started. Herald Angel: OK, we stay with the internet. There’s some more questions, I

heard. Signal Angel: Yes. Rubin8 is asking: Have

you ever come across an example of a neural network that deals with audio

instead of images? Teubi: Me personally, no. At least not

directly. I’ve heard about examples, like where you can change the voice to sound

like another person, but there is not much I can reliably tell about that. My

expertise really is in image processing, I’m sorry. Herald Angel: And I think we have time for one more question. We have one at number

8. Microphone number 8. Question: Is the current face recognition technology in, for example, the iPhone X also a deep learning algorithm, or is

it something more simple? Do you have any idea about that? Teubi: As far as I know, yes. That’s all I can reliably tell you about that, but it

is not only based on images but also uses other information. I think distance

information encoded with some infrared signals. I don’t really know exactly how

it works, but at least iPhones already have a neural network

processing engine built in, so a chip dedicated to just doing those

computations. You saw that many of those things can be parallelized, and this is

what those hardware architectures make use of. So I’m pretty confident in saying, yes, they

also do it there. How exactly, no clue. Herald Angel: OK. I myself have a last completely unrelated question: Did you

create the design of the slides yourself? Teubi: I had some help. We have a really

great Congress design and I used that as an inspiration

to create those slides, yes. Herald Angel: OK, yeah, because those are

really amazing. I love them. Teubi: Thank you! Herald Angel: OK, thank you very much Teubi. *35C3 outro music*
