So today I'm going to be going over Introduction to Neural Networks. With AI, it's really just math. But it's not just math, it's a little more than that: it's also software engineering. Although maybe the software engineering part looks less like textbooks and Python code and a little more like Stack Overflow. But realistically it's both, right? It's a combination of applied math and applied software engineering, and that's the core of AI. You can start on one side and move to the other, and vice versa; you don't have to know one before the other to do well at AI, but those are the core fundamentals. So before we actually dive into what neural networks are and what AI is, let's do a little bit of term disambiguation. I'm sure you've heard AI, machine learning, deep learning, generative AI. What does all this mean? Are they just synonyms for the same thing? They're actually not. If we start with AI: artificial intelligence is just the ability of a machine to imitate, quote unquote, intelligent human behavior. And what we define as intelligent human behavior is basically just human behavior in general. So the chatbots of yore, the old ones you used to run into, and those awful phone trees where you have to yell "manager" into the phone five times to get anywhere: those are actually AI too. That's all it is, imitating human behavior. If we go one step deeper, we can talk about machine learning. Machine learning is the application of AI that allows a system to automatically learn and improve. That includes classical statistical methods like k-means clustering or linear regression.
These are really old statistical methods that date back to the early 1900s and earlier, and they're machine learning too. It's just saying: here's a bunch of data, learn something about this data, and go from there. It's not Terminator, it's nothing crazy; that's all machine learning is. Finally, we can talk about deep learning. Deep learning is getting closer to what people usually picture when they think about AI and machine learning: the application of machine learning that uses complex algorithms and deep neural networks to train a model to do a task. These are things like Google Assistant, self-driving cars, vision classifiers, and so on. So today we're going to be talking mostly about machine learning, and within machine learning a lot of it will be centered on neural networks, which are part of deep learning. A rough outline of what's coming: What are machines learning to do? We'll look at the different kinds of tasks we're trying to teach machines. When did machine learning begin? A brief history. How do machines learn? We'll break down the basic structure of artificial neurons and build up to neural networks. And finally, we'll look at some common pitfalls: now that we have these technologies, new problems arise, and how do we deal with them? So if I bring up this graph, probably a little hard to read on the TV, we can split machine learning into three large groups from the get-go. The first is called supervised learning.
Supervised learning is basically what most people consider the classical machine learning model. If you go look up a tutorial on how to build a neural network, it'll be for a supervised learning task. This is classifying things: looking at a picture, is this a cat, is this a dog? I work in the science domain, so for me it's usually more like: what kind of galaxy are we looking at in this picture? These are supervised tasks where there's a label for your data, a known truth. You give the model both the data and the truth, and it learns the relationship between the two. After that, we have unsupervised learning, which is one of the more interesting ones in my opinion. Here you give your model just the data. You're not giving it labels, you're not saying this sample is a cat and that one is a dog. You're saying: here's a bunch of data, find something interesting in it, where you as a human get to define, a little bit, what counts as interesting. A lot of what we do here is clustering algorithms, where you say: I don't know anything about this data, but take these data points and group the similar ones together. There are also autoencoders, which are a really interesting type of model in my opinion. You feed an autoencoder a bunch of normal data and it learns what normal looks like, so it can reconstruct normal data very well. But as soon as you start feeding it something irregular, it starts performing poorly, and you can say: hey, something anomalous is going on here. You can use that to discover brand new things that you never would have been able to label in the first place.
Finally, after unsupervised learning, we have reinforcement learning. This is, again, one of the more headline-adjacent types of machine learning you'll see: things like playing games, robotics, and scheduling. A lot of this comes from settings where you have an agent acting out tasks with some kind of reward, so a lot of the more complex tasks you see tend to be reinforcement learning. So those are the kinds of machine learning we have. Let's talk briefly about history. When did all this start? Back in 1943, machine learning actually began when McCulloch and Pitts devised a mathematical model of an artificial neuron. The idea was: let's look at the neurons in a brain, see how they work, and map that to a mathematical function. In 1956, things grew from that first artificial neuron, and AI was established as a field at Dartmouth College, at the first real conference on artificial intelligence. The next big step of note is Fukushima's neocognitron, which was the first model to do pattern recognition on shapes. From there, in the late 1980s, came Yann LeCun's convolutional neural networks for reading handwritten digits, which became the first widely deployed CNNs, reading the numbers on checks. And then in 2012 came the start of the most recent AI boom: the ImageNet competition, where Alex Krizhevsky figured out that you could actually use consumer-grade GPUs to train neural networks very efficiently. That kicked off the AI boom we're in now. You'll notice there are some gaps in this timeline. Those gaps are what we call AI winters.
The first AI winter happened because, even though we had these artificial neurons and AI was established as a field, it turned out that these early networks were not generalizable enough, as we say. That means if you train a model on a hundred pictures of cats and dogs, it can classify those hundred pictures very well, but as soon as it sees a hundred-and-first picture it didn't see during training, it starts to fail. It's not learning a general rule for what a cat or a dog is; it's memorizing that picture one is a cat, picture two is a dog, and so on. We had a bit of an uptick in the 80s with the neocognitron and CNNs, and we figured out how to get past that generalizability issue, but then we hit another AI winter in the late 80s and early 90s, because it turned out that neural networks are very compute- and data-intensive, so they were not efficient to train. If I can get close-enough results from a simpler method that uses an order of magnitude less compute, why would I spend ten times the cost on a neural network? That's why it was such a big deal in 2012 when we found out that consumer-grade GPUs, which are already on the shelf and relatively accessible price-wise, can actually be used to train neural networks very efficiently. So let's rewind a little on this timeline and talk about the artificial neuron. This is the basis of all neural networks, even today. The artificial neuron is based on a real neuron, surprisingly.
It was first proposed by McCulloch and Pitts in 1943, and effectively we have a mathematical function where different parts of the actual neuron map onto mathematical operations. I'm not a neuroscientist, so I might mess some of this up, but the dendrites of the neuron, which are basically its inputs, map to the inputs and what we call the weights. In most cases the weights are what a network learns; they represent the learned and stored information. The inputs are multiplied by the weights and then passed to the soma, which acts as a summing function, taking all of the individual contributions and summing them into one value. That sum then gets passed into what's called an activation function. The activation function is kind of the secret sauce of a neuron: given the collection of this information and all these learned weights, do I pass the message along or not? In this case we use what's called a Heaviside step function. I won't get too far into the math weeds, but it just outputs a one or a zero based on the input, the knowledge gathered here. So that's the basic idea of an artificial neuron. This first neuron was purely binary, it outputs a one or a zero and that's it, so all its tasks have to be binary. Yeah, you have a question? Let me go back to the slide; you can keep talking. The question was about this representation of an artificial neuron: the idea of being above or below a certain threshold is clear, but how is this different from a regression? That is a good question, and one that I'm probably not prepared to fully answer, actually.
In this case, this is the very simple first go at it. Gavin, do you have any insights you want to add? I'll even give you the microphone. What was the question again? What makes this different from a linear regression? Okay, yeah. So the question is: we have our inputs, we have our weights, how is this different from a linear regression? That would actually be a logistic regression. Linear regression predicts a value based on a trend line; logistic regression lets you pick A or B based on a threshold around that trend line. This is, in concept, a logistic regression. With one layer of neurons, it can only decide A or B, and as he's going to get into, you only get one straight line. So if you were to look at all the people in Texas and you have to draw one line, you either separate the west coast from Texas or you separate the east coast from Texas. You can't do both, and that doesn't even get into north versus south. To do more, you need what are called hidden layers, and each hidden layer essentially lets you add a curve, more turns in your equation. So this is just a linear classifier, and it's a great question for that reason. As we make more complex neural networks, we take that same idea of a logistic regression and build it up. This was our understanding of a neural network back in the 40s, and we've significantly improved it, but this same building block is what recurrent neural networks, convolutional neural networks, and transformers like modern-day GPTs are built from. So I'll hand this back over. Yeah, good question.
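Before moving on, the neuron we've been describing, inputs times weights, summed, then passed through a threshold, can be sketched in a few lines of Python. This is illustrative code, not from the slides; the `neuron` function and the AND example are my own:

```python
def neuron(inputs, weights, threshold):
    """McCulloch-Pitts-style artificial neuron: a weighted sum of
    the inputs passed through a Heaviside step activation."""
    total = sum(x * w for x, w in zip(inputs, weights))
    # Heaviside step: fire (1) only if the summed signal clears the threshold
    return 1 if total >= threshold else 0

# A neuron computing logical AND: both inputs must be active to fire.
print(neuron([1, 1], [1, 1], threshold=2))  # -> 1
print(neuron([1, 0], [1, 1], threshold=2))  # -> 0
```

Because the decision is a single threshold on a weighted sum, this one unit can only carve the input space with a straight line, which is exactly the linear-classifier limitation from the question above.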
We start out with this artificial neuron. Like we said, it's a very limited thing: it can only do linear transformations. But from here we build the first neural network, called the perceptron. Fun fact: the perceptron itself was actually a real machine that was hand-wired, real electronics. The Mark 1 Perceptron was built by Frank Rosenblatt in 1958. Like we talked about, the artificial neuron model only outputs a true or false, which makes the perceptron a binary classifier. As Gavin was saying, if you look over to the right, you can only draw a straight line to divide your two classes; you can't get more complex than that. So there are limits here, things have to be linearly separable, but it's still useful. So say we have this perceptron: how do we actually train it to look at cats and dogs and distinguish between the two? Given a training dataset, we first initialize our weights to something: all zeros, all ones, random numbers. It does matter, but for our purposes today it doesn't. Then we take one sample from the training dataset, input it, and calculate the output of our perceptron. From there we can calculate brand-new weights. We ask: what is the predicted output minus the desired output, that is, how wrong were we? We multiply that by a value called the learning rate, which sets how big a change we want to make at each step of training, and by the input data, and from that we calculate a brand-new weight, update our weights, and go again. The point is that we're trying to learn the general idea of the dataset; we're not asking what picture A alone does.
We go across our entire dataset, whether it's a hundred samples or a thousand, and we repeat this process for a number of iterations that we set, or until the error falls within a threshold we define as acceptable. That's how the perceptron is trained. But the perceptron obviously has some issues, like we talked about: it can only handle linearly separable data, it's binary true-or-false only, it has a lot of limitations. So a better network was devised, called the multilayer perceptron, in 1965 by Alexey Ivakhnenko. The multilayer perceptron has a lot of improvements over the Mark 1 perceptron. First, we have multiple interconnected layers of neurons instead of just one layer, with the outputs of each layer feeding the inputs of the next. That's a big deal for building complex models: more terms in the model means we can represent more complex cuts of the data, and this is what lets us do that. The other secret sauce of MLPs is non-linear activation functions. We use things like sigmoid or tanh, which add non-linearity to the model, and that gives us the ability to be more than just a linear classifier. But now we have a problem: with layers of neurons, how do we actually determine the values of the weights when we're updating them? The old method I showed on the last slide doesn't work anymore, so we have to invent a few new methods. First of all, how do we quantify how wrong a model is? Our old method looked at the predicted output minus the desired output. In general, this is called a loss function: a loss function quantifies the difference between the expected value and the actual output, and shows you how wrong you are.
How wrong did I get it? Am I really close, am I really far? This is just for one sample, though. For efficiency reasons we end up working with what's called a cost function, which is the error across a set of multiple samples. If your datasets are hundreds of thousands or millions of points, you don't want to calculate the error for every single sample individually; you want to calculate it across a few thousand samples at once, because we're still compute-limited at some point. So we take a cost function and calculate the error every batch. From here, given the ability to quantify how wrong a model is, we can say that the objective of model training is to optimize the weights of the model to find the model with minimal loss. If we look at this fancy 3D plot, all that means is that we want to find the values of W1 and W2 where the loss is lowest; we want to find the deepest valley. The problem is that we have to explore what we call the loss landscape, and we can't just brute-force it. We can't try every possible combination when a model has millions of parameters, like models do nowadays. We have to do it with a little more smarts, and that's where an optimizer comes in, the classic one being gradient descent. Gradient descent just says: at each training step, look at the cost function and descend down its slope by a certain magnitude, an amount controlled by the learning rate. I know we're getting into the math side of things, but really,
the thing to pay attention to here is this little GIF in the corner. If we imagine our loss landscape as this terrain of hills and valleys, we have a little ball rolling downhill, and different algorithms do this in different ways, some more efficient than others. The goal of all of them is to get to the lowest point possible, the lowest loss you can reach. So now we have a way to quantify loss and a way to explore the loss landscape; the last piece is figuring out what the updated weights actually are, and for that we need backpropagation. We have layers of neurons now, and sadly I'm not going to get into backpropagation very deeply in this talk, because it's a long topic on its own. I highly recommend the 3Blue1Brown video on it; I stole a lot of the GIFs on this slide from that video, and he does a fantastic job with it. But in short, backpropagation uses a clever application of the chain rule to calculate the gradient with respect to each parameter efficiently, avoiding redundant calculations. We figure out how much to nudge each parameter in relation to the others, and we do that over and over in our training loop. So now that we've figured all this out, we're at what we call deep neural networks. The MLP from 1965 was a shallow neural network, with one layer between the input and the output. All we have to do is add more layers between the inputs and outputs, and we're at modern deep neural networks, the very effective tools used today. We've closed the circle on that. But we're not done yet with neural networks.
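To tie the loss function, gradient descent, and weight updates together, here's a toy sketch in Python. The one-weight model and the data are made up for illustration; real training operates on millions of weights with backpropagation, but the update rule is the same idea:

```python
# Toy gradient descent: fit one weight w so that y ≈ w * x,
# minimising the squared-error loss L = (w*x - y)^2.
def train(samples, lr=0.05, steps=200):
    w = 0.0                             # initialise the weight
    for _ in range(steps):
        for x, y in samples:
            pred = w * x                # forward pass
            grad = 2 * (pred - y) * x   # dL/dw via the chain rule
            w -= lr * grad              # descend the slope by lr
    return w

# Data generated from y = 3x, so the learned weight should land near 3.
w = train([(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)])
print(round(w, 3))  # -> 3.0
```

The learning rate plays the same role as in the perceptron rule earlier: too large and the ball overshoots the valley, too small and training crawls.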
Now we come to convolutional neural networks, which I affectionately dub "the rest of the owl," because we're already most of the way through the presentation. Convolutional neural networks are really useful tools. As I mentioned, the first widely deployed one was used to read the numbers on checks, and in 2012 the ImageNet image recognition competition was won by a CNN trained on GPUs. They're very, very useful as image processing tools, where position and the relation to neighboring features are important. If you're trying to figure out not just what the global features are, but how neighboring features relate to each other, CNNs are a really useful tool. When we build a CNN, you'll notice that it still contains fully connected layers, a deep neural network, as part of it, and that part does the classification. Just like with cats versus dogs, it's the thing looking at all of the input features and saying: okay, based on these features, what is this? But there are a few new pieces here. First, notice that at the very end we have a probability distribution. This is really common for classification networks: the output says, for example, class one is a star, class two is a galaxy, class three is a nebula, and I think there's a 20% chance it's a star, a 70% chance it's a galaxy, and a 10% chance it's a nebula. In that case we'd say the network is predicting that whatever it's looking at is a galaxy.
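That final probability distribution is typically produced by a softmax layer at the end of the network. As a sketch (standalone code, not the network from the slides, and the star/galaxy/nebula scores are invented):

```python
import math

def softmax(logits):
    """Turn raw network outputs (logits) into a probability
    distribution: positive values that sum to 1."""
    shift = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for (star, galaxy, nebula): galaxy has the largest
# raw score, so it gets the highest probability and is the predicted class.
probs = softmax([1.0, 2.5, 0.3])
print([round(p, 2) for p in probs])
```

Whichever class holds the largest probability is taken as the prediction, which is exactly the "70% galaxy, so it's a galaxy" reading above.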
The other part is where the convolutional layers come into play, and that's called feature extraction. It's what it sounds like: we look at an image and extract certain features that the rest of the network uses to classify things. Here's how a convolution works: if the green is our image, the yellow is what we call a filter, and the red is what we call a feature map, then we slide the filter over the image, multiply each value from the image by a learned weight in the filter, and that produces the feature map. We slide this little filter over everything, get a feature map out, and we can actually convolve over that feature map again, multiple times in a row, because it's effectively just a smaller image. Convolution assumes a certain symmetry in the data, that features are translational, and the filter contains the weight parameters to be optimized. We'll also have stacks of filters: we do this multiple times with multiple filters, so each one learns a different feature of the input data. Another thing, now that we're in modern territory: we look at more activation functions than just sigmoid and tanh. We have ReLU, leaky ReLU, all sorts of options that I won't get into much, but nowadays there are a lot of effective activation functions to use in different scenarios where they're useful. And finally we have what are called pooling layers. Pooling layers are for when we need to reduce the size of a feature map: for a given kernel size, say a 4x4 grid, we just take the average or the maximum value within that grid and shrink the feature map accordingly. This is for when we need to reduce the data down to a manageable size.
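The sliding-filter and pooling operations just described can be sketched in plain Python. These are bare-bones illustrative helpers (hypothetical names, no padding or stride options), just to show the sliding-window idea:

```python
def convolve2d(image, kernel):
    """Slide a small filter over an image; each output value is the sum
    of the overlapped image values times the filter weights."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Shrink a feature map by keeping only the max in each size-by-size tile."""
    return [[max(fmap[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# A vertical-edge filter applied to an image that is dark on the left and
# bright on the right: the feature map lights up along the edge.
image = [[0, 0, 1, 1]] * 4
edge_filter = [[-1, 1], [-1, 1]]
fmap = convolve2d(image, edge_filter)
print(fmap)             # -> [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(max_pool2d(fmap)) # -> [[2]]
```

The learned weights live in the filter; training adjusts them so each filter responds to whatever feature (edges, textures, and so on) helps the classifier downstream.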
So what does this look like in practice? Here's a little GIF where the input is on the left and the kernel slides over it, producing feature maps that, to a human eye, still look similar to what's happening in the image. You can see the edges of the buildings, sometimes the windows, and so on: the important features for distinguishing things. If we pass this through an activation function, certain things start to pop out. The white values are the positive activations here, so you can see the windows and the outlines of the buildings shining through, things our neural network can use to make decisions. And if we then apply a pooling layer, that really starts to shrink the image down. It gets a little hard to see; you can still kind of make out that it's a city skyline if you squint, but realistically this is where it stops being human-recognizable and you just have to trust that your network is learning these features. This next one is one of my favorite GIFs of the talk, where you can see all the layers at once: the input on the left, three convolutional layers looking at different features, everything collapsing down into a dense layer, and then a prediction of the output. I like seeing everything at once; I think it's a great visual and a cool way to wrap this part up. Alright, final thing, I promise I only have a few more slides.
Now that we have these brand new techniques, dense neural networks and convolutional neural networks, new problems pop up. The first, related to the number of parameters in a model, is underfitting and overfitting of our data. Remember the first model we looked at, the linear-only classifier: that network was too small. It could only draw a straight line to try to distinguish between two populations of data, which isn't really enough, so the network needs to be larger. A properly fit network looks like the middle plot here, where you can draw a fairly expressive boundary between the two populations. It's not going to be perfect, it won't get every single sample, but it has learned the general rule that distinguishes class A from class B. And then we have a brand-new problem: overfitting. Overfitting is when your network is too large, or you've trained it for too long, and it learns your training data exactly. You can see that it separates the Xs from the dots perfectly, with a very elaborate squiggly line to do it. The problem is, if I add another point that falls slightly outside those bounds, and a lot of the time it will, the model gets it wrong, because it has effectively decided the boundary must be exactly this shape. Neural networks fall into this very easily. When a model overfits, we call this memorizing: instead of learning the general rule, it's just memorizing the inputs, like memorizing that question one is answer B and question two is answer D, without ever learning what you actually wanted it to learn.
That's bad; we want our model to learn the general rule, not the specifics of everything it saw. The way we mitigate this is what's called a train/validation/test split of our data. If we have 100,000 samples, instead of using all 100,000 to train the model, we take 10,000 or 20,000 of them and hold them off to the side. Then periodically during training we stop and ask: how are we doing on the validation set we're not training on? If the validation error and the training error start to diverge a lot, that means the model is starting to overfit, and we need to stop training, and maybe roll back a little, to mitigate the memorizing that's happening. Finally, another really important part of neural networks is your data. Your model is only as good as your data, and if you have a garbage dataset, it's garbage in, garbage out. A big example of this is a dataset called Labeled Faces in the Wild. This is a very, very biased dataset; it doesn't have good representation of all the classes you'd want to learn. The pictures, which are just labeled heads, were 77% male and 83% white, and they were mostly drawn from news articles, so 530 of the images were of George W. Bush alone. That's twice as many as all the images of Black women in the entire dataset combined. It's not a representative dataset of anything.
IBM, Microsoft, and a couple of other companies used this dataset to build facial recognition for cameras, and it turns out it does great on light-skinned males, with an error rate of 0.3%, but as soon as you show it a dark-skinned female, you get massive error rates of around 34.7%. The dataset was so skewed that the models simply couldn't learn properly for anyone but light-skinned males. So, wrapping things up: neural networks are really useful tools. They're a powerful new technique for feature extraction and for operating on large sets of data. They increase the speed at which we can analyze things; they avoid the compounded biases you'd get from having ten people do something in a chain of work; they help us understand multidimensional data, since you probably can't compute a 10-dimensional matrix in your head, whereas that's exactly the realm neural networks like to live in; and you don't need to worry about approximations in the same way, because the details are all included in the model. We have a joke in physics: when we're trying to calculate the air resistance of a cow, we just assume it's spherical, because that's a lot easier and close enough. Here you can keep it a real cow, because the neural network is the one doing the work. But while there are pros, there are still cons. Like I mentioned, your model is only as good as your data, so you have to really watch out for biased data, and if you don't, your models are going to be pretty bad.
They often don't work on out-of-distribution data. If you train something on dogs versus cats, it'll work great on dogs and cats, but as soon as you throw a bird at it, it has no idea what to do. So we really have to think carefully about the data we're using and how we apply these methods. It's such an early time for AI, and it's getting wide adoption, so people assume AI is magic — and AI is not magic. If you assume it's magic, you'll assume it has far more capabilities than it does. Deploying AI in a smart way is very important, so that people don't treat it as a perfect, end-all solution that never gets things wrong. It learns biases that even we're not aware of; it will pick up on the subtlest biases in your dataset, ones you never thought could exist. So you have to be really careful about when and where you use AI.

So that is it — that is Introduction to Neural Networks. Thank you for your time, thank you for listening, and I'll take any questions if you have any.

You touched on overfitting. As you chain together these logistic regressions in the neural network, how does it look to actually qualify something as overfit, or underfit, or the correct amount?

Right. Usually that train/validation/test split is the best way we have to keep track of this during training, and we look for that divergence in performance: the model keeps doing better and better on the training dataset, but it starts doing worse on the validation dataset.
We know it's overfitting because it's only performing well on the training data, as opposed to the validation or test data. So we keep the validation set separate: if we have 100,000 images, we take 20,000 of them and put them to the side as a validation set. Those are never used to train the model — only the other 80,000 are. Those 20,000 never go through the training loop; backpropagation is never computed on them, and the weights are never updated with them. They're always held to the side.

You can also do what's called k-fold cross-validation: if you want to do this over multiple runs, you split your dataset into, say, 10 subsections and say, okay, I'm withholding subsection one this time, retrain; now I'm withholding subsection two, retrain; and you get an average performance that way, if you're really worried about it. But typically most people just do the train/validation/test split, and that works pretty well most of the time. Cool — thank you. Any other questions?

Sure, I'll do my best. You have to have some sort of loss function in there to determine how wrong you are, but how you calculate that loss is pretty open to interpretation. For supervised data it's easy, like I mentioned: the data is labeled, so the model knows exactly how wrong it is. For unsupervised data, a big use case is anomaly detection, and for anomaly detectors we use model architectures called autoencoders. What an autoencoder is doing, in principle, is trying to reconstruct its input as closely as possible.
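That reconstruction idea can be sketched with a tiny linear autoencoder in NumPy. This is a toy example under assumed dimensions (8 inputs compressed to 2); real anomaly detectors use deeper, nonlinear networks, but the loss is the same input-versus-output comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8 dimensions that secretly live on a 2-D structure.
z = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 8))
x = z @ mix  # the inputs the autoencoder must reconstruct

# Linear "hourglass": 8 -> 2 (encoder bottleneck) -> 8 (decoder).
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def loss(x, W_enc, W_dec):
    x_hat = (x @ W_enc) @ W_dec          # reconstruct from the bottleneck
    return np.mean((x_hat - x) ** 2)     # reconstruction error: input vs. output

first = loss(x, W_enc, W_dec)
lr = 0.01
for _ in range(500):
    h = x @ W_enc                        # encode: compress to 2 features
    x_hat = h @ W_dec                    # decode: try to rebuild the input
    grad_out = 2 * (x_hat - x) / x.size  # d(loss)/d(x_hat)
    g_dec = h.T @ grad_out
    g_enc = x.T @ (grad_out @ W_dec.T)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

final = loss(x, W_enc, W_dec)
print(final < first)  # reconstruction improves with no labels at all
```

The key point from the talk is visible in the loss function: even though this is unsupervised, the input itself acts as the training target.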
If you map out the model architecture, it looks like an hourglass: we reduce dimensionality — the number of features — and compress the input down really, really small, and then, based on the few features that are left over, we try to reconstruct the input as exactly as possible. Really, it's learning a compression for that data. And because the autoencoder is trying to reproduce its input, the loss function, even though this is an unsupervised method, is just input versus output: you take input minus output, and even though you have no labels, the input itself is the target you're aiming for. From there you can calculate your loss. Hopefully that helps a little bit.

Yes — exactly. That's one of my favorite fun applications of AI: oh wow, that's a really smart way to do it. So yeah, it's cool.

Sure, yeah, for sure. When it comes to actual practical applications of AI — whether in your personal life or projects you've seen other people do — what are some good fields to grapple with to grow your abilities?
Right. I'm biased because I'm in science — I work at Fermilab — so I like the scientific applications. Within science there are problems along the whole range of complexity, from using a simple dense neural network all the way up to transformers, ChatGPT-style models, and diffusion models; we have applications for all of them. So if you go find science-related problems and datasets, you can find good examples along that whole stretch of complexity. That's what I'd recommend.

Specifically, I'm interested in finance and economics.

Okay, finance and economics. One of the problems I had in my own data science class was that at the end of it, I passed the final — I got a B, I have no idea how — but if you gave me a prepared dataset, I still had no idea how to apply these models. You can separate things into supervised, unsupervised, and reinforcement learning, but which model do I use? Why logistic regression versus a neural net versus a decision tree or a random forest?

So I'd say: find something you're interested in. I was interested in network anomaly detection. Then identify which type of problem that is — that's an unsupervised learning problem. Okay, so what are my options for unsupervised learning? You have k-means clustering, you have autoencoders, and there are other unsupervised algorithms. How do you apply the same dataset to those different algorithms, and which of them actually helps you? It helps to get a dataset — you can download datasets off of Kaggle.
I highly recommend getting a Kaggle account, downloading datasets from there, and trying to apply different algorithms, or even just exploring the data. The other thing I like about Kaggle is that people upload their notebooks, even when they're just exploring data. One of my favorites is a notebook on the credit card fraud dataset. When you think of most models, you think of accuracy — well, I can make a credit card fraud detector that's 99.9% accurate, because it turns out fraud happens far less than 1% of the time. If I just let everything through, I'm 99.9% accurate. Great. That doesn't help.

So then you've got to explore the data and realize accuracy isn't the measurement you should be using — what about recall, precision, or F1 scores? That helps you understand the problem better. Using realistic datasets helps you work toward a solution, so that when somebody comes up and says, hey, I want to build a model that predicts the stock market — okay, unsupervised isn't going to do it, and typical classification isn't going to do it; you need something more like a regression model, like a linear regression, to make a prediction. How do you break that down? That's where you start getting into windowing and all these other topics. So: explore things you're interested in, find datasets related to them, create your own little problem sets, and play with the questions of which model to use, what data, what techniques. Then see where you start running into problems — that's where you'll hit the real questions and start solving real problems.

I'll add to that. There's a great saying I use in a lot of my presentations — hi, I'm Rachel, I'm with these guys — it's: all models are wrong.
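The fraud-accuracy point above can be made concrete in a few lines. These are made-up numbers, not the actual Kaggle dataset, and the metrics are computed by hand to show what they measure:

```python
# Toy illustration of the class-imbalance trap: a "model" that flags nothing
# as fraud still scores very high accuracy.

y_true = [1] * 5 + [0] * 995   # 5 fraudulent transactions out of 1000
y_pred = [0] * 1000            # let everything through

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # fraud caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # fraud missed

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy)  # 0.995 -- looks great
print(recall)    # 0.0   -- catches zero fraud
print(f1)        # 0.0
```

Recall and F1 expose immediately what accuracy hides: the model never catches a single fraudulent transaction.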
Some models are useful. And what Agus Sudjianto, who used to be the chief model risk officer at Wells Fargo, would say is: what matters is how much wrong you can accept. As you deal with folks on the business end or the risk-management end, people are used to talking about precision, recall, F1, and AUC — but how much wrong can you accept, and when are these metrics actually not useful to you? Finding datasets that break your models, or where the models aren't predictive, will actually teach you a lot — about the failure modes of your models, and about model training and validation. I think that's a lot of how you learn.

I have one more thing. One of the models that first made a lot of this click for me — and actually the first workshop I built for the AI Village — was building a spam filter. How do you turn words into numbers so you can apply statistics? And what are some of the challenges you run into with that? So try to build a spam filter. Build something simple, and you'll learn a lot of these techniques — things like tokenizing. Tokenizing is how large language models work, and we were doing this kind of machine learning for many, many years before LLMs became a thing. A lot of them still use these same principles; they just build on top of each other.

Yeah — like Gavin talked about credit card fraud: we had fraud models that were stable and very predictive, and in 2020 things went straight off the rails. What happened in 2020? COVID. There were features that were highly predictive that flipped. People who had never made an online purchase in their life started making online purchases — and usually that's a great indicator of fraud. It was a highly predictive feature, and it flipped.
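The "words into numbers" step mentioned above — tokenizing and counting — can be sketched as follows. This is a toy bag-of-words example with a made-up vocabulary, not how any particular spam filter or LLM tokenizer actually works:

```python
# Minimal bag-of-words sketch: turn a message into a vector of word counts
# so statistical methods can be applied to text.

def tokenize(text):
    """Split a message into lowercase word tokens."""
    return text.lower().split()

def vectorize(text, vocab):
    """Count how often each vocabulary word appears in the message."""
    counts = {w: 0 for w in vocab}
    for word in tokenize(text):
        if word in counts:
            counts[word] += 1
    return [counts[w] for w in vocab]

vocab = ["free", "winner", "meeting", "tomorrow"]
print(vectorize("FREE free winner", vocab))          # [2, 1, 0, 0]
print(vectorize("meeting tomorrow at noon", vocab))  # [0, 0, 1, 1]
```

Once messages are vectors like these, a classifier — naive Bayes is the classic spam-filter choice — can learn which counts correlate with spam.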
And so that's part of thinking about what's in the dataset and how it's predicting — what features you're looking at and where it breaks. Rachel's a lawyer; Anita works in academia as well. So we all have very different perspectives, but they converge on this idea of machine learning, statistics, and a math background. Obviously I've taken math, statistics, and machine learning courses, but independently building a project and applying it is an entirely different experience. Build ship and break ship.

If it helps: I don't have a heavy math background. I've taken Calc 1 and Calc 2, skipped Calc 3, and have no background in linear algebra, but I could show you how to build a neural network in 11 months. Because once you understand what these models are trying to do, it helps to abstract it into a problem to be solved.

I don't have that math background either. I'm a computer science major by background, and I've been at Fermilab for 10 years now. I'm very similar to Gavin: I took Calc 1 and Calc 2, did poorly in both, skipped Calc 3, never took linear algebra. But you learn a lot of this just by doing it. The easiest way is to open up a Python notebook, just do it, break it, and ask: why did you break, and how can I fix it?

I work a lot in real-time AI. Real-time AI for science means we have to do machine learning and neural network inference in milliseconds, microseconds, sometimes nanoseconds. That means taking machine learning and applying it to FPGAs and application-specific integrated circuits — baking neural networks into silicon.
I have a demo — I didn't bring it out this time — of Pokémon card recognition running on a little one-and-a-half-watt FPGA, doing image classification at around 15 FPS from a webcam: hey, which Pokémon am I looking at? But it's really hard to do that with a very small model that has to run in real time. You have to ask: how do I train a model small enough to fit on my board? I know I can only look at a 32-by-32-pixel input, so how do I make sure I'm getting enough information out of those 32 pixels? It's an embedded model on an FPGA.

Yep — GPUs are too slow. So it's a very interesting topic, and you have to figure out ways around the standard way of doing things a lot of the time in my field. Okay, cool, your thing works on a GPU — great, I can't do that, so let's roll something from scratch and figure out what to do.

Thank you. We're happy to take questions throughout the day — we'll be here all day. Anita has another talk next, an introduction to AI security, so definitely stick around for that if you have time. We should put the schedule up here. Thank you.