In this post I’m going to assume you know the basics of linear algebra. If you are unfamiliar with matrix-vector multiplication, this video is all you need. (Also check out the other videos in that Linear Algebra review playlist for bonus points.)

Stage setting:

Okay, so last post we were modelling a datapoint $\mathbf{y}_i$, from a vector of $\mathbf{y}$’s, as being generated by a corresponding datapoint $\mathbf{x}_i$, from a vector of $\mathbf{x}$’s, by the following relationship:
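$$\mathbf{y}_i \approx w\mathbf{x}_i + b$$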

That is to say, we take our x value, multiply it by some number $w$, then add another number $b$ to it, and magically end up with an approximation of our y value. This works because the parameters $w$ & $b$ have not been chosen at random; they’ve been fit using data (normally the full $\mathbf{x}$ and $\mathbf{y}$ vectors). The most popular way to fit these parameters is called least squares, in which we pick our parameters to minimise a quantity called the residual sum of squares. Last post I told you that the solution to the least squares approach was:
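$$\mathbf{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$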

But how do we get to this solution?

First we need to appreciate that when we apply $\mathbf{y}_i \approx w\mathbf{x}_i + b$ to all of our datapoints, looping over $i$ from $1$ to $n$, this can be written concisely using matrix notation as:
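$$\mathbf{y} \approx \mathbf{X}\mathbf{\beta}$$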

Here the bias term has been absorbed into a parameter vector $\mathbf{\beta}$, and our original $\mathbf{x}$ vector of datapoints has been augmented with a column of ones. If this doesn’t make sense to you, make sure you have watched the videos in the link above. When written out in full the equivalence of notation should be clear:
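$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} \beta_1 + \beta_2 x_1 \\ \beta_1 + \beta_2 x_2 \\ \vdots \\ \beta_1 + \beta_2 x_n \end{bmatrix}$$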

Here, $b$ is $\mathbf{\beta}_1$ and $w$ is $\mathbf{\beta}_2$. So you can kind of think about matrix notation as just getting rid of pesky `for i in range():` loops (better known as $\sum$ signs). Though this is not entirely equivalent…

Second, what is the residual sum of squares? This is just the sum of the squared differences between our estimates of $\mathbf{y}_i$ and the actual values of $\mathbf{y}_i$. See the graph below for a visual explanation (not done just yet)! Here I’ve come slightly unstuck with my previous notation… I should have binned the $\approx$ sign earlier and instead written $\mathbf{\hat{y}} = \mathbf{X}\mathbf{\beta}$, to show that $\mathbf{X}\mathbf{\beta}$ just gives us an estimate of the vector $\mathbf{y}$. Precisely:
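$$\mathbf{y} = \mathbf{X}\mathbf{\beta} + \epsilon$$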

Where $\epsilon$ is a vector of errors. Therefore, the residual sum of squares is just $\epsilon^2$ (or, being careful about the fact that $\epsilon$ is a vector, $\epsilon^T\epsilon$)!

Residual sum of squares:

Given a particular $\mathbf{\beta}$ vector, we can calculate the sum of the squared differences between the estimates that particular parameter vector gives us and the actual values ($\epsilon^2$). This can be thought of as calculating a measure of the total lack of fit given by that specific $\mathbf{\beta}$:
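$$RSS(\mathbf{\beta}) = \epsilon^T\epsilon = (\mathbf{y}-\mathbf{X}\mathbf{\beta})^T(\mathbf{y}-\mathbf{X}\mathbf{\beta})$$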

At this point, it is probably worth refreshing what the neat matrix notation above actually means. Apparently, it is a critical life skill to switch in and out of matrix notation on a whim…
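Written out with explicit sums (the for loops the matrix notation is hiding), the RSS is:

$$RSS(\mathbf{\beta}) = (\mathbf{y}-\mathbf{X}\mathbf{\beta})^T(\mathbf{y}-\mathbf{X}\mathbf{\beta}) = \sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{P}X_{ij}\beta_j\right)^2$$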

Note there are $N$ datapoints, so $N$ rows in both our $\mathbf{y}$ vector and our $\mathbf{X}$ matrix, and $P$ parameters, so $P$ entries in our $\mathbf{\beta}$ vector and $P$ columns in our $\mathbf{X}$ matrix.

Okay, so the overall aim is then to find the parameter vector $\mathbf{\beta}$ that gives us the minimum value of the RSS. This is done by taking the derivative of the RSS with respect to $\mathbf{\beta}$, and then solving for $\mathbf{\beta}$ when the derivative is 0. This solution, where $\frac{\partial RSS}{\partial \mathbf{\beta}} = 0$, will give us the $\mathbf{\beta}$ that corresponds to either a maximum or a minimum of $RSS(\mathbf{\beta})$. Our function is a bowl though! As the RSS is a sum of squares it can never be negative, and being quadratic in $\mathbf{\beta}$ it forms a bowl, so the stationary point is a minimum (we could also look at second derivatives, but that is for later!). To start, we need to expand out $RSS(\mathbf{\beta}) = (\mathbf{y}-\mathbf{X}\mathbf{\beta})^2$, and as $(\mathbf{y}-\mathbf{X}\mathbf{\beta})$ is a vector, to square it we need to take the dot product: $(\mathbf{y}-\mathbf{X}\mathbf{\beta})^T(\mathbf{y}-\mathbf{X}\mathbf{\beta})$.
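Multiplying out the brackets gives:

$$RSS(\mathbf{\beta}) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{\beta} - \mathbf{\beta}^T\mathbf{X}^T\mathbf{y} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X}\mathbf{\beta}$$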

Notice that the two inner terms are scalars, as each is the dot product of two vectors, and a scalar transposed is just the original scalar. Therefore, we can transpose one of them and combine the two into a single term:
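$$RSS(\mathbf{\beta}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\mathbf{\beta} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X}\mathbf{\beta}$$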

Taking the derivative

Now we are in a position to take the derivative! Note the first term disappears as it does not depend on $\beta$.
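$$\frac{\partial RSS}{\partial \mathbf{\beta}} = \frac{\partial}{\partial \mathbf{\beta}}\left(- 2\mathbf{y}^T\mathbf{X}\mathbf{\beta} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X}\mathbf{\beta}\right)$$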

You’d be forgiven for feeling like we must be a little stuck here, as I said you only needed to know the basics of linear algebra. But let’s break down what’s going on. Notice that the two terms are just scalars: one is the sum of the products between our estimates $\mathbf{\hat{y}}$ and $\mathbf{y}$ itself; the other is the sum of the squares of our estimates $\mathbf{\hat{y}}$. We want to know how these scalar quantities change when we nudge around our parameter column vector, $\mathbf{\beta}$.

If the dimensions of the terms are written out, it’s also easy to see that $\mathbf{y}^T\mathbf{X}$ is just going to be a $1\times P$ vector, which I’ll call $c^T$, as $\mathbf{y}^T\mathbf{X}$ is $(1\times N)(N\times P)$. Similarly, $\mathbf{X}^T\mathbf{X}$ is a $P\times P$ matrix, which I’ll call $A$, as $\mathbf{X}^T\mathbf{X}$ is $(P\times N)(N\times P)$. We can therefore simplify our expression:
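$$\frac{\partial RSS}{\partial \mathbf{\beta}} = \frac{\partial}{\partial \mathbf{\beta}}\left(- 2c^T\mathbf{\beta} + \mathbf{\beta}^T A\mathbf{\beta}\right)$$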

Here, we are actually in a position to cheat and look up the answers in a table of matrix derivative identities (check out the 1st and 3rd rows below the solid line, replacing $x$ with $\beta$ and $a$ with $c$). Or you could read this next post to get there step by step! (This page was getting way too long, so I split it off.) Either way, cheating or not, we get the following identities:
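$$\frac{\partial \, c^T\mathbf{\beta}}{\partial \mathbf{\beta}} = c, \qquad \frac{\partial \, \mathbf{\beta}^T A \mathbf{\beta}}{\partial \mathbf{\beta}} = (A + A^T)\mathbf{\beta} = 2A\mathbf{\beta} \quad \text{(since } A = \mathbf{X}^T\mathbf{X} \text{ is symmetric)}$$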

We can substitute these identities into our expression for the derivative:
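$$\frac{\partial RSS}{\partial \mathbf{\beta}} = -2c + 2A\mathbf{\beta} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{\beta}$$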

Finishing up

Now we just need to set the derivative to 0 and solve for $\mathbf{\beta}$:
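$$-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{\beta} = 0 \quad\implies\quad \mathbf{X}^T\mathbf{X}\mathbf{\beta} = \mathbf{X}^T\mathbf{y} \quad\implies\quad \mathbf{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

That’s exactly the solution I quoted at the start. If you’d rather check the algebra numerically than take my word for it, here’s a quick NumPy sketch (the fake data, the seed and the “true” values of $w$ and $b$ are made up purely for illustration):

```python
import numpy as np

# Fake data from y = w*x + b + noise, with w = 2 and b = 1 chosen arbitrarily.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=100)

# Augment x with a column of ones, so the bias b is absorbed into beta.
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares solution: beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve is used rather than an explicit inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # roughly [1, 2], i.e. [b, w]
```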