<h1>Support Vector Machine from First Principles: Part One</h1>
<p><em>A Fish without a Bicycle: Technology and Art · 2021-04-14</em></p>
<p>We will derive the intuition behind <strong>Support Vector Machines</strong> from first principles. This will involve deriving some basic vector algebra proofs, including exploring some intuitions behind hyperplanes. Then we’ll build on our understanding with the concepts behind quadratic optimisation.</p>
<p>We’ll finally bring everything together by adding on the idea of projecting hyperplanes into higher dimensional (possibly infinite dimensional) spaces, and look at the motivation behind the kernel trick. At that point, the basic intuition behind SVMs should be rock-solid, and the stage should be set for extending to concepts of soft margins and misclassification.</p>
<p>In this specific post, we will build up to deriving the optimisation problem that we’d like to eventually solve.</p>
<h2 id="the-mean-and-difference-a-simple-observation">The Mean and Difference: A Simple Observation</h2>
<p>Let’s take a set of tuples. Each tuple contains two numbers, \(\mathbf{X=\{(1,3), (0,4), (-5, 9)\}}\) . If we’re asked to find the mean of each of these pairs of numbers, the answer is <strong>2</strong> in all cases. Note that the position of the mean does not change, as long as each pair of numbers moves in opposite directions at the same rate, i.e., <strong>(0,4)</strong> is the result of both ends of <strong>(1,3)</strong> shrinking and growing by 1; the same idea applies to the other tuples.</p>
<p><img src="/assets/images/constant-mean.png" alt="Constant Mean" /></p>
<p>Now take a different set of tuples, \(\mathbf{X=\{(1,3), (2,4), (7, 9)\}}\). If we’re asked to find the difference of each of these pairs of numbers, the answer is <strong>2</strong> in all cases. Note that the position of the mean keeps changing, as long as each pair of numbers moves in the same direction at the same rate, i.e., <strong>(2,4)</strong> is the result of both ends of <strong>(1,3)</strong> moving by 1; the same idea applies to the other tuples.
Thus, the mean in any of the above cases can always be written as \(b\), and each tuple of numbers can be written as \((b-k, b+k)\), where <strong>k</strong> is a constant. Then the difference is always \((b+k)-(b-k)=\mathbf{2k}\) in all cases.</p>
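This \((b-k, b+k)\) bookkeeping is easy to sanity-check in code. Here is a small illustrative sketch (Python is an assumption here; the post itself contains no code):

```python
# Check the (b - k, b + k) parameterisation on the tuples from the text.
# For the constant-mean set, b stays fixed at 2 while k varies.
pairs_constant_mean = [(1, 3), (0, 4), (-5, 9)]
for lo, hi in pairs_constant_mean:
    b, k = (lo + hi) / 2, (hi - lo) / 2
    assert b == 2                      # the mean never moves
    assert (b - k, b + k) == (lo, hi)  # every pair fits the parameterisation
    assert (b + k) - (b - k) == 2 * k  # the difference is always 2k

# For the constant-difference set, the gap 2k stays fixed while b drifts.
pairs_constant_diff = [(1, 3), (2, 4), (7, 9)]
assert [hi - lo for lo, hi in pairs_constant_diff] == [2, 2, 2]
```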
<p><img src="/assets/images/constant-difference.png" alt="Constant Difference" /></p>
<p>This formulation will come in handy when we are expressing hyperplane intercepts later on, and exploring the possibilities of different hyperplane solutions for SVMs.</p>
<h2 id="equation-of-an-affine-hyperplane">Equation of an Affine Hyperplane</h2>
<p>The equation of a line in two dimensions passing through the origin, can always be written as:</p>
\[ax+by=0\]
<p>The equation of a line parallel to the above, but not passing through the origin, can be written as:</p>
\[ax+by=c\]
<p>\(ax+by=0\) is a linear subspace of \({\mathbb{R}}^2\) (see <a href="/2021/04/02/matrix-subspaces-intuitions.html">Subspace Intuitions</a>). It is also, by definition, a hyperplane in \({\mathbb{R}}^2\).</p>
<p>\(ax+by=c\) is an affine subspace of \({\mathbb{R}}^2\). The simplistic definition of an affine subspace is a vector subspace translated so that it does not necessarily pass through the origin. There is a lot more subtlety to affine geometry, but for the moment, the intuitive high-school general equation of a line in two dimensions will suffice.
Extending this to a higher dimension \(N\), the equation of an affine hyperplane in \({\mathbb{R}}^N\) is:</p>
\[\mathbf{w_1x_1+w_2x_2+...+w_Nx_N=b}\]
<p>The important thing to remember is that <strong>the dimensionality of a hyperplane is always one less than the dimensionality of the ambient space it inhabits</strong>: the single linear equation above, in \(N\) coordinates, carves out an \((N-1)\)-dimensional subset of \({\mathbb{R}}^N\).</p>
<p><strong>Why is this the general form of the equation?</strong> <br />
We can recover this general form with some simple matrix algebra. Let us assume that the hyperplane passing through the origin is represented by its normal \(N\). Then, since every point \(x\) on the hyperplane is perpendicular to \(N\), we can write:</p>
\[N^Tx=0\]
<p>Let us now assume that this hyperplane has been displaced by an arbitrary vector \(u\); thus every point \(x\) has been displaced by a vector \(u\). To re-express the perpendicularity relationship, we must invert this displacement to bring the displaced hyperplane back to the origin, that is:</p>
\[N^T(x-u)=0 \\
\Rightarrow N^Tx=N^Tu \\
\Rightarrow \mathbf{N^Tx=c} \\\]
<p>The situation is shown below. A point \(x\) in the affine hyperplane is no longer perpendicular to the normal vector \(\vec{N}\). Only by translating it back to the original hyperplane (the linear subspace) does the perpendicularity relationship hold.</p>
<p><img src="/assets/images/affine-hyperplane.png" alt="Affine Hyperplane" /></p>
<p>Here \(c=N^Tu\) is a constant, and the components of \(N\) are the weights \(w_1\), \(w_2\), etc.</p>
<p>An interesting result arises when a hyperplane is displaced along its own normal. Let us assume that \(u=tN\), where \(t\) is some arbitrary scalar. Then, substituting this into the relationship we derived, we get:</p>
\[N^Tx=tN^TN \\
\Rightarrow N^Tx=t{\|N\|}^2 \\\]
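We can sanity-check this identity numerically. Below is a minimal sketch (numpy is an assumed dependency; the normal and displacement values are arbitrary illustrations):

```python
import numpy as np

# Check N^T x = t ||N||^2 for a hyperplane displaced along its own normal.
N = np.array([3.0, 4.0])           # normal of the hyperplane ax + by = 0
t = 2.0                            # displacement factor along N
u = t * N                          # displacement vector

# Pick a point on the original hyperplane (perpendicular to N)...
x0 = np.array([4.0, -3.0])
assert np.isclose(N @ x0, 0.0)

# ...displace it by u; it now lies on the affine hyperplane N^T x = t ||N||^2.
x = x0 + u
assert np.isclose(N @ x, t * np.linalg.norm(N) ** 2)   # 2 * 25 = 50
```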
<h2 id="perpendicular-distance-between-two-parallel-affine-hyperplanes">Perpendicular Distance between two Parallel Affine Hyperplanes</h2>
<p>Next, we derive the perpendicular distance between two affine hyperplanes. Given two hyperplanes of the form:</p>
<p>\(N^Tx=c_1 ....(H_1)\) <br />
\(N^Tx=c_2 ....(H_2)\)</p>
<p>we’d like to know the perpendicular distance between them. Note that they have the same normal because they are parallel, merely displaced from each other (and in this case, not passing through the origin, assuming \(c_1, c_2 \neq 0\)).</p>
<p>Assume a point \(P_1\) on \(H_1\), and a corresponding point \(P_2\) on \(H_2\). Further, assume that the vector connecting these two points is perpendicular to both hyperplanes, i.e., it is some scalar multiple \(t\) of the normal \(N\). Then:</p>
\[N^TP_1=c_1 \\
N^TP_2=c_2 \\
P_2=P_1+tN \\
\Rightarrow P_2-P_1=tN\]
<p>The situation is shown below:</p>
<p><img src="/assets/images/distance-between-two-hyperplanes.png" alt="Distance between two Affine Hyperplanes" /></p>
<p>Subtracting:</p>
\[N^T(P_2-P_1)=c_2-c_1 \\
\Rightarrow N^T.tN=c_2-c_1 \\
\Rightarrow tN^TN=c_2-c_1 \\
\Rightarrow t{\|N\|}^2=c_2-c_1 \\
\Rightarrow \mathbf{t=\frac{c_2-c_1}{\|N\|^2}}\]
<p>This recovers the scaling factor \(t\); we still need to multiply it with the magnitude of \(\vec{N}\) to give us the actual perpendicular distance between \(H_1\) and \(H_2\). Thus, the distance is:</p>
\[d_{perp}(H_1,H_2)=t\|N\| \\
\Rightarrow d_{perp}(H_1,H_2)=\frac{c_2-c_1}{\|N\|^2}\|N\| \\
\Rightarrow \mathbf{d_{perp}(H_1,H_2)=\frac{c_2-c_1}{\|N\|}}\]
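Before moving on, here is a quick numeric check of this distance formula (numpy assumed; the hyperplanes are chosen arbitrarily for illustration):

```python
import numpy as np

# Check d_perp = (c2 - c1) / ||N|| for parallel hyperplanes N^T x = c1, c2.
N = np.array([3.0, 4.0])
c1, c2 = 5.0, 15.0

# A point P1 on H1, and P2 = P1 + tN on H2 with t = (c2 - c1)/||N||^2.
P1 = c1 * N / (N @ N)               # the point of H1 closest to the origin
t = (c2 - c1) / np.linalg.norm(N) ** 2
P2 = P1 + t * N
assert np.isclose(N @ P1, c1)
assert np.isclose(N @ P2, c2)

# The perpendicular distance matches the closed form: 10 / 5 = 2.
d = np.linalg.norm(P2 - P1)
assert np.isclose(d, (c2 - c1) / np.linalg.norm(N))
```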
<p>Let us perform a simple substitution where <strong>b</strong> is the mean of \(c_1\) and \(c_2\), i.e.,</p>
\[b=\frac{c_1+c_2}{2}\]
<p>and <strong>k</strong> is the distance from <strong>b</strong> to \(c_1\) and \(c_2\). Thus, we may write:</p>
\[c_1=b-k \\
c_2=b+k\]
<p>Consequently, the perpendicular distance between two affine hyperplanes can be rewritten as:</p>
\[d_{perp}(H_1,H_2)=\frac{b+k-(b-k)}{\|N\|} \\
\Rightarrow \mathbf{d_{perp}(H_1,H_2)=\frac{2k}{\|N\|}}\]
<p>Here’s the next question: what is the equation of the affine hyperplane halfway between \(H_1\) and \(H_2\)? It is very tempting to assume that it is \(N^Tx=b\), but let us validate this intuition. The figure below shows the situation.</p>
<p><img src="/assets/images/halfway-distance-between-two-hyperplanes.png" alt="Halfway Distance between two Affine Hyperplanes" /></p>
<p>The scaling factor for this halfway hyperplane is obviously \(t/2=\frac{c_2-c_1}{2{\|N\|}^2}\).
We use the same procedure we did when calculating the distance between \(H_1\) and \(H_2\), except this time we seek the intercept factor, and know the scaling factor already. Thus, if we write, for \(H_1\) and the halfway hyperplane \(H_h\):</p>
\[N^TP_1=c_1 \\
N^TP_H=\beta \\
P_H-P_1=\frac{t}{2}N\]
<p>Subtracting, we get:</p>
\[N^T(P_H-P_1)=\beta -c_1 \\
\Rightarrow N^TN\frac{t}{2}=\beta -c_1 \\
\Rightarrow {\|N\|}^2\frac{c_2-c_1}{2{\|N\|}^2}=\beta -c_1 \\
\Rightarrow \frac{c_2-c_1}{2}=\beta -c_1 \\
\Rightarrow \mathbf{\beta=\frac{c_1+c_2}{2}=b}\]
<p>This indeed corresponds with our intuition that an affine hyperplane midway between two parallel affine hyperplanes will have its intercept as the mean of those on either side of it.</p>
<h2 id="framing-the-svm-optimisation-problem">Framing the SVM Optimisation Problem</h2>
<p>We now have all the background we need to state the general problem Support Vector Machines are attempting to solve.</p>
<h3 id="separating-hyperplane">Separating Hyperplane</h3>
<p>The primary purpose of SVMs is classification of training data. To put it very simply, we wish to find an affine hyperplane which can separate our data into two classes, such that all points in one class lie above the hyperplane, while all points in the other class lie below it.</p>
<p>The diagram below illustrates the concept.</p>
<p><img src="/assets/images/svm-separating-hyperplane.png" alt="SVM Hyperplane Problem" /></p>
<p>Let us state the first, and most important, assumption which accompanies this investigation: the data in the two classes should be linearly separable. This implies that it should be possible to find some hyperplane in the first place which separates the two classes of data neatly, one above and one below.</p>
<p><strong>Note</strong>: We will relax this assumption later, but for the moment, let us proceed with the simple case.</p>
<p>The second condition we impose on our solution will become clearer from the discussion below.
If you look at the picture above, you’ll see that there is a lot of flexibility in terms of what this hyperplane can look like in terms of its parameters.
In fact, in the example above, and very generally, there are an infinite number of hyperplanes which can partition the data into two classes perfectly, i.e., an infinite number of combinations of weights in the equation of the affine hyperplane.
Here is an illustration of some example possibilities.</p>
<p><img src="/assets/images/svm-separating-hyperplane-possibilities.png" alt="SVM Hyperplane Possibilities" /></p>
<p>More data may constrain the space of solutions some more, but it will still be infinite in the most general case, assuming that the data is linearly separable. What combination should we choose?</p>
<p>This is where we’d like to impose some mathematical constraints on the solution to drive us toward a satisfactory solution.
The most important one we have already stated, which is that all data points belonging to one class should fall on one side of the hyperplane.
The second one is the one which gives Support Vector Machines their name. We’d like to maximise the Support Margin. What is a support margin? Let’s look at the diagram with a separating hyperplane once again.</p>
<h3 id="supporting-hyperplanes">Supporting Hyperplanes</h3>
<p>If we take two points, one from each class, such that they are the closest to each other (there can be more than one of each type, but this argument extends to that as well), and draw two parallel hyperplanes through them (making sure that the points still stay linearly separable), we will have drawn something like the dotted lines in the figure below.</p>
<p><img src="/assets/images/svm-supporting-hyperplanes.png" alt="SVM Support Hyperplanes" /></p>
<p>These hyperplanes that we’ve drawn are not the actual separating hyperplane, but they ‘bracket’ the actual hyperplane which will be used to classify our data. Thus, they are called the <strong>supporting hyperplanes</strong> of the SVM. The perpendicular distance between these supporting hyperplanes is the support margin of the Support Vector Machine. The actual separating hyperplane lies midway between these supporting hyperplanes.</p>
<p>Now, at face value, it might seem that we haven’t really improved our problem definition by a lot. After all, it is definitely possible to draw an infinite number of sets of supporting hyperplanes (and consequently, an infinite number of separating hyperplanes). The diagram below shows two possibilities: \(H_1\), \(H_{1-}\), and \(H_{1+}\) form one separating hyperplane-supporting hyperplane set, and \(H_2\), \(H_{2-}\), and \(H_{2+}\) form another.</p>
<p><img src="/assets/images/svm-options-separating-hyperplanes-supporting-hyperplanes.png" alt="SVM Support Hyperplane Possibilities" /></p>
<p>This is where we state the optimisation which will narrow down our solution space. We wish to find the set of supporting hyperplanes which maximises the support margin, subject to the constraints that all the data still stay linearly separable.</p>
<p>This immediately has an important implication: namely, that no data points may exist inside the margin of the SVM. This immediately puts more constraints on our solution because now the data points of class 1 in our example, need to not only fall above the separating hyperplane, they also need to be above or on the supporting hyperplane \(H_+\); the same argument holds for the other class.</p>
<p>Let us quantify all of these conditions mathematically.
We seek a separating hyperplane of the form \(N^Tx=b\).
We seek supporting hyperplanes of the form \(N^Tx=b+k\) and \(N^Tx=b-k\).</p>
<h3 id="1-linearly-separable-data">1. Linearly Separable Data</h3>
<p>For a set of data points \(x_i, i\in[1,M]\) (using \(M\) for the number of points, to avoid a clash with the normal \(N\)), if we assume that the data is divided into two classes (-1,+1), we can write the constraint equations as:</p>
\[\mathbf{
N^Tx_i \geq b+k, \forall x_i|y_i=+1 \\
N^Tx_i \leq b-k, \forall x_i|y_i=-1
}\]
<h3 id="2-margin-maximisation">2. Margin Maximisation</h3>
<p>We have already derived the perpendicular distance between two affine hyperplanes of the form \(N^Tx=b+k\) and \(N^Tx=b-k\), which is \(\frac{2k}{\|N\|}\). We seek to obtain the following:</p>
\[\mathbf{m_{max}=\max \frac{2k}{\|N\|}}\]
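Where this objective leads can be previewed numerically. Below is a hedged sketch (scikit-learn is an assumed dependency, and the toy data is invented; the post itself stops at the objective). Under the conventional normalisation \(k=1\), maximising \(2k/\|N\|\) amounts to minimising \(\|w\|\), and the achieved margin is \(2/\|w\|\):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: three points per class.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [4.0, 4.0], [5.0, 5.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)    # width between the supporting hyperplanes
print("margin width:", margin)

# Every point respects the supporting-hyperplane constraints (to tolerance).
assert np.all(y * (X @ w + b) >= 1 - 1e-6)
```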
<p>This is an optimisation problem, which we will analyse in succeeding articles.</p>

<h1>Dot Product: Algebraic and Geometric Equivalence</h1>
<p><em>avishek · 2021-04-11</em></p>
<p>The <strong>dot product of two vectors</strong> is geometrically simple: the product of the magnitudes of these vectors multiplied by the cosine of the angle between them. What is not immediately obvious is the algebraic interpretation of the dot product.</p>
<p>Specifically, this definition:</p>
\[\mathbf{A^TB=\sum_{i=1}^N A_iB_i}\]
<p><strong>Why should the sum of the products of the components of two vectors result in the same conclusion?</strong></p>
<p>This article shows two different ways of proving this, one long, and the other one super short (and one I feel is a little more intuitive and less mechanical). In addition, we will conclude with the importance of the dot product in various Machine Learning techniques.</p>
<h2 id="proof-through-the-rule-of-cosines">Proof through the Rule of Cosines</h2>
<p>We wish to find the dot product of two vectors, \(\vec{A}\) and \(\vec{B}\). \(\vec{A}\) has magnitude \(a=\|A\|\), and \(\vec{B}\) has magnitude \(b=\|B\|\). In the diagram, \(\vec{C}\) is the difference of \(\vec{A}\) and \(\vec{B}\), i.e., \(\vec{A}-\vec{B}\), and has a magnitude \(c=\|C\|\). \(\theta\) is the angle between \(\vec{A}\) and \(\vec{B}\).</p>
<p>The situation is represented below.</p>
<p><img src="/assets/images/dot-product-proof-law-of-cosines.jpg" alt="Dot Product Proof through Rule of Cosines" /></p>
<p>In addition, I’ve drawn the perpendicular \(\vec{P}\) which has magnitude \(p\). \(\vec{P}\) divides \(\vec{A}\) into two parts: \(t\vec{A}\) and \((1-t)\vec{A}\).</p>
<p>Let us list down some basic trigonometric identities evident from the diagram above.</p>
\[{at\over b}=cos\theta \\
\Rightarrow at=b.cos\theta\]
<p>We also have:</p>
\[p=b.sin\theta\]
<p>By <strong>Pythagoras’ Theorem</strong>:</p>
\[c^2=p^2+{(1-t)}^2a^2 \\
=b^2sin^2\theta +a^2-2a^2t+a^2t^2 \\
=b^2sin^2\theta +b^2cos^2\theta +a^2-2a^2t \\
=b^2(sin^2\theta +cos^2\theta) +a^2-2a^2t \\
=b^2+a^2-2a(at) \\
\mathbf{c^2=a^2+b^2-2ab.cos\theta} \\\]
<p>This is the <strong>Rule of Cosines</strong>. Note that for \(\theta=90^{\circ}\), this identity reduces to Pythagoras’ Theorem.</p>
<p>Now, from vector algebra, we see that:</p>
\[C=A-B \\
\|C\|=\|A-B\| \\
{\|C\|}^2={\|A-B\|}^2\]
<p>Taking the dot product of a vector with itself is essentially its magnitude squared, so we can write, while multiplying everything out:</p>
\[C^TC={(A-B)}^T(A-B) \\
=A^TA+B^TB-A^TB-B^TA \\
=A^TA+B^TB-2A^TB\]
<p>Equating the above result with the identity we obtained while proving the Rule of Cosines, we get:</p>
\[A^TA+B^TB-2A^TB=a^2+b^2-2ab.cos\theta\]
<p>Since \(A^TA=a^2={\|A\|}^2\) and \(B^TB=b^2={\|B\|}^2\), the above reduces to:</p>
\[-2A^TB=-2{\|A\|}{\|B\|}.cos\theta \\
\Rightarrow \mathbf{A^TB={\|A\|}{\|B\|}.cos\theta}\]
<p>The above is the original definition of the dot product, thus we have proved that the geometric and algebraic interpretations of the dot product lead to the same result.</p>
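A quick numeric spot-check of this equivalence (numpy assumed; the vectors are constructed with a known angle so that the geometric side can be computed independently of the algebraic one):

```python
import numpy as np

# Build vectors with a known angle between them, then compare the algebraic
# dot product with the geometric formula ab*cos(theta).
a, b, theta = 3.0, 2.0, np.pi / 5           # arbitrary magnitudes and angle
A = np.array([a, 0.0])                       # A along the x-axis
B = np.array([b * np.cos(theta), b * np.sin(theta)])

algebraic = A @ B                            # sum of component-wise products
geometric = a * b * np.cos(theta)
assert np.isclose(algebraic, geometric)
```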
<h2 id="proof-through-the-choice-of-basis">Proof through the Choice of Basis</h2>
<p>So, the above proof was somewhat circuitous, going through the Rule of Cosines. I’d like to sketch out a shorter, hopefully slightly more intuitive proof that does not take as many steps.</p>
<p>I’ve redrawn the same diagram as above for reference, and emphasised the vector nature of the objects we are dealing with. All other labelling remains the same.</p>
<p><img src="/assets/images/dot-product-proof-selection-of-basis.jpg" alt="Dot Product Proof through Choice of Basis" /></p>
<p>We start with the same identities, namely:</p>
\[at=b.cos\theta \\
p=b.sin\theta\]
<p>In fact, for this proof, we will not need the second identity at all, though we will use \(p\) in our work.</p>
<p>Here are the two new things we make explicit. Since the vectors \(\vec{P}\) and \(\vec{A}\) are at right angles to each other, we will define unit vectors (without loss of generality) \(\hat{i}\) in the direction of \(\vec{A}\), and \(\hat{j}\) in the direction of \(\vec{P}\). That is:</p>
\[\vec{P}=0\hat{i}+ \|P\| \hat{j} \\
\vec{A}=\|A\| \hat{i}+0\hat{j}\]
<p>Thus, we can write \(\vec{B}\) as:</p>
\[\vec{B}=t\vec{A}+\vec{P}\]
<p>If we take the component-wise product of \(\vec{A}\) and \(\vec{B}\), which is the same as multiplying \(A^T\) with \(B\), we get:</p>
\[A^TB=t{\|A\|}^2+0.\|P\| \\
A^TB=t{\|A\|}^2 =a^2t=a.at \\
\mathbf{A^TB=ab.cos\theta}\]
<p>which is the identity we are seeking to prove.</p>
<h2 id="applications-of-the-dot-product">Applications of the Dot Product</h2>
<ul>
<li>The dot product is used commonly as a similarity metric between data points, since it is at its maximum possible value when two vectors are fully aligned. For example, it is used for creating the <strong>covariance matrix</strong> of a multivariate Gaussian distribution. It is also used as part of different statistical tests for <strong>correlation</strong>.</li>
<li>The dot product is an important <strong>tool for several proofs</strong> where orthogonality of vectors needs to be specified mathematically. Many conditions for results begin with assuming that the dot product of two vectors is zero.</li>
<li>The dot product usually starts out as a <strong>kernel</strong> in Machine Learning techniques like <strong>Support Vector Machines</strong> and <strong>Gaussian Processes</strong>. This kernel is then set to functions more appropriate for measuring similarity.</li>
<li>The algebraic interpretation of the dot product is the one most used for computation of the dot product in algorithms.</li>
</ul>

<h1>Linear Regression: Assumptions and Results using the Maximum Likelihood Estimator</h1>
<p><em>avishek · 2021-04-05</em></p>
<p>Let’s look at <strong>Linear Regression</strong>. The “linear” term refers to the fact that the output variable is a <strong>linear combination</strong> of the input variables.</p>
<p>Thus, this is a linear equation:</p>
\[y=ax_1+bx_2+cx_3\]
<p>but these next ones are not:</p>
\[y=ax_1+bx_2x_3+cx_3 \\
y=ax_1+bx_2+cx_3{x_1}^2\]
<p>In the general form, we are looking for a relation like:</p>
\[\mathbf{
y=w_1x_1+w_2x_2+...+w_Nx_N \\
y=\sum_{i=1}^Nw_ix_i
}\]
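In code, this general form is just the dot product of a weight vector with an input vector; a one-line sketch (numpy and the specific numbers are assumptions for illustration):

```python
import numpy as np

# y = w_1*x_1 + w_2*x_2 + w_3*x_3 as a dot product of assumed values.
w = np.array([0.5, -1.0, 2.0])      # weights w_1, w_2, w_3
x = np.array([4.0, 3.0, 1.0])       # inputs x_1, x_2, x_3
y = w @ x                            # 0.5*4 - 1.0*3 + 2.0*1 = 1.0
assert np.isclose(y, 1.0)
```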
<p>Linear regression is a useful (and uncomplicated) tool for building prediction models, either on its own, or in its more sophisticated incarnations (like <strong>General Linear Models</strong> or <strong>Piecewise Linear Regression</strong>). However, it is instructive to consider the applicability of the linear model in its simplest form to a set of data, because there are very specific guarantees that the data should provide if we are to represent it using a linear model.</p>
<p>We will state these assumptions, as well as derive them from the base assumptions; some pictures should also clarify the intuition behind them. Parts of this will require a basic understanding of probability distributions and partial derivatives to follow along, but not much beyond that.</p>
<p>Let’s develop some intuition first through some examples. Check the diagram below. It should be obvious that this dataset can be modelled using linear regression. What this demonstrates specifically though, is the <strong>linear relationship</strong> between the input and the output, if we discount all other factors.</p>
<p><img src="/assets/images/linear-regression-linearity.jpg" alt="Linearity of Linear Regression" /></p>
<p>In contrast, the following picture demonstrates a clearly nonlinear relationship between the input and the output data.</p>
<p><img src="/assets/images/linear-regression-nonlinearity.jpg" alt="Nonlinearity in Linear Regression" /></p>
<p>Let’s go back to the simple perfectly fit data set above. Obviously, no data set is going to be perfect like the contrived example above, so you are more likely to see data like this:</p>
<p><img src="/assets/images/linear-regression-residuals.jpg" alt="Residuals of Linear Regression" /></p>
<p>Thus, the prediction will never perfectly match the data (if it did, we would have another problem, called overfitting, which we will visit sometime), but perfect prediction is not really the aim here, because observations can easily be affected by noise and other random effects. However, the effect of the noise needs to be quantified in some fashion, even if we cannot make accurate pointwise predictions about the noise/error for a particular observation.</p>
<p>As it turns out, this leads us to the second important assumption about modelling data using Linear Regression, namely:</p>
<p><strong>The noise/error values are normally distributed around the prediction.</strong></p>
<p>Put another way, the error values should be equally randomly distributed around the prediction value. In terms of probability, this implies that the noise values should follow a Gaussian probability distribution. This also implies that we can assume that the prediction for a particular input is the average value of all the data points for that input (assuming multiple readings are taken for the same input). The prediction takes up the role of the mean in the resulting (hopefully) Gaussian distribution.</p>
<p>Let’s take the third example. Here, we definitely see a linear relationship between the input and output. The noise is also randomly distributed around the predicted value. But something else is going on here.</p>
<p><img src="/assets/images/variable-variance.jpg" alt="Showing Variance dependency on Input Variable" /></p>
<p>The graph tells us that even though the noise is normally distributed around the predicted value, the spread of these noise values is not constant. This leads us to the next important assumption of linear regression, namely that:</p>
<p><strong>The noise values should be distributed with constant variance.</strong></p>
<p>The above assumption could be folded into the linearity assumption, but I feel it is important enough to be stated on its own.</p>
<p>All of these assumptions can be summarised in the diagram below:</p>
<p><img src="/assets/images/linear-regression-conditions.jpg" alt="Linear Regression Assumptions" /></p>
<p>This shows that at each output value predicted by the model, the data is normally distributed with constant variance and the mean as the predicted value.</p>
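These assumptions can be probed on simulated data. The sketch below (numpy assumed, with an invented ground truth \(y=2x+1\) plus Gaussian noise of constant spread) fits a line and checks that the residuals are centred on zero:

```python
import numpy as np

# Simulate y = 2x + 1 with homoscedastic Gaussian noise (sigma = 0.5).
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 500)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.size)

# Ordinary least squares fit of a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# OLS residuals sum to zero by construction; their spread should be close
# to the noise's constant standard deviation.
assert abs(residuals.mean()) < 1e-8
assert abs(residuals.std() - 0.5) < 0.1
```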
<p>We would like to get a closed-form expression for the values of the mean and variance for a particular set of observations for a single value of input variable. That is, we want to estimate the parameters of the Gaussian distribution.
In doing so, we want to ground our intuition of the mean being the average of <em>N</em> values from the Gaussian distribution assumption.</p>
<p>How do we approach this? To start from basic principles, we need to start with a probability approach.
Let’s look at a couple of variations of the Gaussian distribution, one plot with varying means, the other with varying variance.</p>
<p><img src="/assets/images/gaussians-varying-means.png" alt="Linear Regression Assumptions" />
<img src="/assets/images/gaussians-varying-variances.png" alt="Linear Regression Assumptions" /></p>
<p>This yields different Gaussians depending on how we tune the parameters \(\mu\) and \(\sigma\). Our aim is to find a combination of \((\mu, \sigma)\) which best explains the distribution of the observations for a particular input value.</p>
<p>Let us introduce the Gaussian probability density function.
\(P(x)=\frac{1}{\sqrt{2\pi\sigma^2}}.e^{-\frac{ {(x-\mu)}^2}{2\sigma^2}}\)</p>
<p>Let us assume that for a given input <strong>I</strong>, we have a set of observations \((x_1, x_2, x_3,...,x_N)\).</p>
<p>Thus if we randomly pick a combination of \((\mu, \sigma)\), we can ask the question:</p>
<p><strong>What is the probability of observation \(x_i\) occurring, given a parameter set \((\mu, \sigma)\)?</strong>
For \(x_1\), \(x_2\), etc., this is obviously given by:</p>
\[P(x_1)=\frac{1}{\sqrt{2\pi\sigma^2}}.e^{-\frac{ {(x_1-\mu)}^2}{2\sigma^2}} \\
P(x_2)=\frac{1}{\sqrt{2\pi\sigma^2}}.e^{-\frac{ {(x_2-\mu)}^2}{2\sigma^2}} \\
.\\
.\\
.\\
P(x_N)=\frac{1}{\sqrt{2\pi\sigma^2}}.e^{-\frac{ {(x_N-\mu)}^2}{2\sigma^2}}\]
<p>Knowing this, we can say that the joint probability of all the observations \((x_1, x_2, x_3,...,x_N)\) occurring for a given parameter set \((\mu, \sigma)\) is:</p>
\[P(X)=P(x_1)P(x_2)...P(x_n)\]
<p>This is our starting point for deriving expressions for the optimal set \((\mu, \sigma)\). We want to <strong>maximise this probability, or likelihood</strong> \(P(X)\). That will give us the Gaussian which best explains the distribution of the observations around the predicted value. This is the idea behind the <strong>Maximum Likelihood Estimation</strong> approach.</p>
<h2 id="gaussian-mean-and-variance-using-maximum-likelihood-estimation">Gaussian Mean and Variance using Maximum Likelihood Estimation</h2>
<p>Let us rewrite the Gaussian distribution function, and the function that we are attempting to maximise the value of.</p>
\[P(x)=\frac{1}{\sqrt{2\pi\sigma^2}}.e^{-\frac{ {(x-\mu)}^2}{2\sigma^2}} \\
P(X)=P(x_1)P(x_2)...P(x_N) \\
P(X)=\prod_{i=1}^{N}P(x_i)\]
<p>Maximising the log of a function is the same as maximising the function itself; also working with logarithms will convert the problem of exponents and multiplications into addition and subtraction, which is much easier to work with.</p>
<p>With this in mind, we take the log on both sides (base \(e\)) to get:</p>
<p>\(log_e P(X)=\sum_{i=1}^{N}log_e P(x_i) \\
log_e P(X)=\sum_{i=1}^{N}log_e \frac{1}{\sqrt{2\pi\sigma^2}}.e^{-\frac{ {(x_i-\mu)}^2}{2\sigma^2}} \\
log_e P(X)=\sum_{i=1}^{N}log_e \frac{1}{\sqrt{2\pi\sigma^2}} + \sum_{i=1}^{N}log_e e^{-\frac{ {(x_i-\mu)}^2}{2\sigma^2}} \\
log_e P(X)=-\frac{1}{2}\sum_{i=1}^{N}log_e 2\pi\sigma^2 + \sum_{i=1}^{N}log_e e^{-\frac{ {(x_i-\mu)}^2}{2\sigma^2}} \\
log_e P(X)=-\frac{1}{2}\sum_{i=1}^{N}log_e 2\pi -\frac{1}{2}\sum_{i=1}^{N}log_e \sigma^2 + \sum_{i=1}^{N}log_e e^{-\frac{ {(x_i-\mu)}^2}{2\sigma^2}} \\\)
Dropping the first term on the right side, since it is a constant, we get:</p>
\[log_e P(X)\propto -\frac{1}{2}\sum_{i=1}^{N}\frac{ {(x_i-\mu)}^2}{\sigma^2} -\sum_{i=1}^{N}log_e \sigma \\
L(X)=-\frac{1}{2}\sum_{i=1}^{N}\frac{ {(x_i-\mu)}^2}{\sigma^2} -N.log_e \sigma \\\]
<p><strong>Thus, our problem of finding the best values for \(\mu\) and \(\sigma\) boils down to maximising the above expression \(L(X)\)</strong>.</p>
<p>Since this is an equation in two variables, let’s take the partial differential with respect to each variable, while treating the other as a constant.</p>
<h3 id="derivation-of-the-mean-mu">Derivation of the mean \(\mu\)</h3>
\[\frac{\partial {L(X)}}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu)\]
<p>Setting this partial derivative to 0, we get:</p>
\[\frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu)=0 \\
\sum_{i=1}^{N}(x_i-\mu)=0 \\
\sum_{i=1}^{N}x_i- N\mu=0 \\
\mu=\frac{1}{N}\sum_{i=1}^{N}x_i\]
<p>The above is the definition of the arithmetical mean, essentially the average value of all the observations.</p>
<h3 id="derivation-of-the-variance-sigma">Derivation of the variance \(\sigma\)</h3>
\[\frac{\partial {L(X)}}{\partial\sigma}=\frac{1}{\sigma^3}\sum_{i=1}^{N}{(x_i-\mu)}^2 -\frac{N}{\sigma} \\
\frac{\partial {L(X)}}{\partial\sigma}=\frac{1}{\sigma}\left(\frac{1}{\sigma^2}\sum_{i=1}^{N}{(x_i-\mu)}^2 -N\right) \\\]
<p>Setting this partial derivative to 0, we get:</p>
\[\frac{1}{\sigma^2}\sum_{i=1}^{N}{(x_i-\mu)}^2 -N=0 \\
N\sigma^2=\sum_{i=1}^{N}{(x_i-\mu)}^2 \\
\sigma^2=\frac{1}{N}\sum_{i=1}^{N}{(x_i-\mu)}^2\]
<p>The above is the definition of variance of a Gaussian distribution.</p>
<p>Summarising the results below, we can say:</p>
\[\mathbf{
\mu=\frac{1}{N}\sum_{i=1}^{N}x_i \\
\sigma^2=\frac{1}{N}\sum_{i=1}^{N}{(x_i-\mu)}^2
}\]
<p>Note that we arrived at the definition of the average of a set of values with only the assumption of a Gaussian probability distribution. This means that <strong>taking the average of a set of values implies that those values are distributed normally</strong>.</p>
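The closed forms can be cross-checked by maximising the log-likelihood numerically. The sketch below (scipy assumed, data simulated for illustration) should recover the sample mean and the \(1/N\) variance:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated Gaussian observations around an assumed "prediction" of 3.0.
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=2.0, size=1000)
n = samples.size

def neg_log_likelihood(params):
    mu, log_sigma = params                 # optimise log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    # Negative of L(X) = -(1/2) sum((x_i - mu)^2 / sigma^2) - N log(sigma)
    return np.sum((samples - mu) ** 2) / (2 * sigma ** 2) + n * log_sigma

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, var_hat = res.x[0], np.exp(res.x[1]) ** 2

# The numeric optimum matches the closed-form MLE derived above.
assert np.isclose(mu_hat, samples.mean(), atol=1e-2)
assert np.isclose(var_hat, ((samples - samples.mean()) ** 2).mean(), atol=1e-2)
```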
<p>It is important to note that even though the principle of the Maximum Likelihood Estimation technique itself is very general, <strong>not every probability distribution will allow us to derive a closed-form expression for the “mean” and “variance”</strong>. In those scenarios, you’ll want to use other optimisation techniques like Gradient Descent.</p>
<p>The obvious questions which arise after this discussion are:</p>
<ul>
<li>How do we check for the normality of the data, short of visualising it (which might not be tenable for large data sets)?</li>
<li>Do we abandon Linear Regression if the data is not normal?</li>
</ul>
<p>To answer the first point, there are several metrics that we can use to gauge the normality of data. Some approaches are using <strong>Quantile-Quantile Plots</strong> and the <strong>Jarque-Bera Test</strong>.</p>
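As a concrete illustration, the Jarque-Bera test is available in scipy (an assumed dependency); a minimal sketch on simulated data:

```python
import numpy as np
from scipy.stats import jarque_bera

# Compare the Jarque-Bera statistic on normal vs deliberately skewed samples.
rng = np.random.default_rng(1)
normal_data = rng.normal(size=5000)
skewed_data = rng.exponential(size=5000)     # strongly non-normal

stat_n, p_n = jarque_bera(normal_data)
stat_s, p_s = jarque_bera(skewed_data)

# The skewed sample yields a far larger statistic and a tiny p-value,
# flagging the departure from normality.
assert stat_s > stat_n
assert p_s < 0.01
```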
<p>The answer to the second question is no: we do not need to immediately abandon Linear Regression if the data is not normal. This is because there are several techniques that we can use:</p>
<ul>
<li>We can try to <strong>transform the data into something more Gaussian</strong>. These are essentially nonlinear functions applied on the data to make them normally distributed. The Box-Cox Transformation is an example of a class of such mappings.</li>
<li>We can <strong>relax the Gaussian distribution assumption, and use the underlying distribution</strong> that we <em>think</em> best represents the data, while still maintaining linear predictors. This leads us to <strong>Generalised Linear Models</strong>.</li>
</ul>
<p>avishek</p>
<h1>Matrix Rank and Some Results (2021-04-04)</h1>
<p>I’d like to introduce some basic results about the rank of a matrix. Simply put, the rank of a matrix is the number of independent vectors in a matrix. Note that I didn’t say whether these are column vectors or row vectors; that’s because of the following section, which will narrow down the specific cases (we will also prove that these numbers are equal for any matrix).</p>
<p>A matrix is <strong>full rank</strong> if 1) all of its column vectors are linearly independent, and 2) all of its row vectors are linearly independent.
A matrix is <strong>full column rank</strong> if all of its column vectors are linearly independent.
A matrix is <strong>full row rank</strong> if all of its row vectors are linearly independent.</p>
<p>For an \(M\times N\) matrix, if the rank of a matrix is less than the smaller of M, N, i.e., \(min(M,N)\), then we call it <strong>degenerate</strong>, <strong>rank-deficient</strong>, <strong>singular</strong>, etc. This has implications for whether a matrix is invertible or not, namely that <strong>a degenerate matrix is not invertible.</strong> See <a href="/2021/04/03/matrix-intuitions.html">Assorted Intuitions about Matrices</a> for a quick intuition.</p>
<p>Note here, that I didn’t specify whether the rank implied column rank or row rank. As we shall see in a moment, we will prove that the column rank of a matrix always equals its row rank.</p>
<h2 id="proof-of-equality-of-column-rank-and-row-rank-of-a-matrix">Proof of Equality of Column Rank and Row Rank of a Matrix</h2>
<p>Before getting into the proof, let’s state an obvious fact (or maybe not so obvious, but at least it should follow from our definition of matrix multiplication).</p>
<p>The fact is that multiplying a matrix A, which has some column rank <strong>c</strong> and row rank <strong>r</strong> (just to be super general about ranks), by another matrix cannot <em>increase</em> its column or row rank. Can you see why? It is because we understand that matrix multiplication is essentially both 1) a linear combination of column vectors, and 2) a linear combination of row vectors. <strong>A linear combination of a set of vectors cannot create a vector which is linearly independent of that set.</strong></p>
<p>That’s like trying to combine the \((1,0,0)\) vector and the \((0,1,0)\) to create a \((0,0,1)\); you just cannot do it.</p>
<p>With that out of the way, let’s consider a matrix, any matrix with column rank <strong>c</strong> and row rank <strong>r</strong>. We want to determine a relation between these two ranks.</p>
<p>We should be able to express this matrix as a linear combination of its <strong>c</strong> column vectors. It would look like this:</p>
\[A=
\begin{bmatrix}
| && | && | && ... && | \\
bc_1 && bc_2 && bc_3 && ... && bc_c \\
| && | && | && ... && | \\
\end{bmatrix}
\begin{bmatrix}
--- r_1 --- \\
--- r_2 --- \\
--- r_3 --- \\
... \\
--- r_c --- \\
\end{bmatrix}\\\]
<p>The only assumption I’ve made in the identity above is that \(bc_1\), \(bc_2\), etc. are linearly independent; <strong>there are no assumptions about any of the rows \(r_1\), \(r_2\), etc.</strong></p>
<p>However, let us look at this same identity, but from the point of view of a linear combination of the row vectors \(r_1\), \(r_2\), …, \(r_c\). How many row vectors are there? <strong>c</strong> row vectors, of course, since by the rules of matrix multiplication, if the left matrix has <strong>c</strong> columns, the right matrix needs <strong>c</strong> rows.</p>
<p>This implies that the <strong>row rank of the right matrix is at most c</strong>. It can be less than <strong>c</strong>, since we have not made any assumptions about its row rank, but we now have an upper bound on the row rank of this matrix. By extension, the matrix <strong>A</strong>’s row rank cannot exceed <strong>c</strong> either. That is:</p>
\[r\leq c\]
<p>Now, we apply the same argument, but this time, we take the <strong>r</strong> linearly independent row vectors, from which we can get:</p>
\[c\leq r\]
<p>The only scenario which satisfies both of these above conditions is when \(\mathbf{r=c}\).</p>
<p><strong>The column rank of a matrix always equals its row rank.</strong>
It is important to note that this rule holds for every matrix. Let’s quickly talk of the implications of this for general matrix multiplication. From here on out, we will not distinguish between row rank and column rank, because the values are the same. We will simply refer to it as a matrix’s rank.</p>
<p>Let’s assume <strong>A</strong> is a matrix of rank \(R_A\) and matrix <strong>B</strong> has a rank of \(R_B\). When we multiply them, it results in a matrix C with rank \(R_C\). How is \(R_C\) related to \(R_A\) and \(R_B\)?</p>
<p>Well, simply based on the argument in the proof we just looked at, where we were multiplying two matrices, we can write:</p>
\[R_C\leq R_A \\
R_C\leq R_B\]
<p>This simply implies that
\(\mathbf{R_C\leq min(R_A, R_B)}\)</p>
<p>That is, <strong>the rank of the matrix product is at most the smaller of the ranks of the two multiplying matrices.</strong> Note that it is an upper bound, not an equality: multiplying can destroy rank, but never create it.</p>
<p>A related (and very useful) result, which requires a short additional argument about null spaces, is that the <strong>ranks of \(A^TA\) and \(AA^T\) are always equal to the rank of \(A\)</strong>.</p>
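These rank results are easy to check numerically; the sketch below (assuming <code>numpy</code>; the matrices are arbitrary illustrative choices) uses <code>np.linalg.matrix_rank</code>:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],   # 2x the first row -> rank-deficient
              [0., 1., 1.]])

# Column rank equals row rank: the rank of A equals the rank of A^T
rank_A = np.linalg.matrix_rank(A)
assert rank_A == np.linalg.matrix_rank(A.T) == 2

# The rank of a product is bounded by the smaller of the two ranks
B = np.random.default_rng(1).random((3, 4))
rank_B = np.linalg.matrix_rank(B)
assert np.linalg.matrix_rank(A @ B) <= min(rank_A, rank_B)

# A^T A and A A^T have the same rank as A
assert np.linalg.matrix_rank(A.T @ A) == rank_A
assert np.linalg.matrix_rank(A @ A.T) == rank_A
```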
<h2 id="notes">Notes</h2>
<ul>
<li>The rank can be obtained from the row echelon form (or reduced row echelon form) of a matrix.</li>
</ul>
<h1>Assorted Intuitions about Matrices (2021-04-03)</h1>
<p>Some of these points about matrices are worth noting down, as aids to intuition. I might expand on some of these points into their own posts.</p>
<ul>
<li>A matrix is a collection of <strong>column vectors</strong>.</li>
<li>A matrix is a collection of <strong>row vectors</strong>.</li>
<li>A matrix is a <strong>linear transformation</strong>, with its column vectors being the <strong>new basis</strong>.</li>
<li>A matrix \(A\) cannot be inverted (i.e., it does not have a unique inverse) <strong>if any of its column vectors are linearly dependent on the others</strong>.
<ul>
<li>This is because there will then always be a non-zero vector which the matrix collapses to the zero vector; and <strong>there is no way to reverse that operation to recover the original vector</strong>.</li>
<li>Mathematically, this means if there exists a nonzero \(x\), such that \(Ax=0\), \(A\) is not invertible.</li>
</ul>
</li>
<li><strong>The dot product of two vectors is a linear transformation of the right vector into the number line</strong>, with the individual scalar components of the left vector being the basis vectors on this one-dimensional number line.</li>
</ul>
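A minimal numerical illustration of the invertibility point above (a sketch, assuming <code>numpy</code>; the matrix is an arbitrary singular example):

```python
import numpy as np

# Second column is 2x the first: the columns are linearly dependent
A = np.array([[1., 2.],
              [3., 6.]])

assert np.isclose(np.linalg.det(A), 0.0)  # degenerate: zero determinant

# A maps a non-zero vector to the zero vector, so the transformation
# cannot be undone
x = np.array([2., -1.])
assert np.allclose(A @ x, 0.0)

# numpy refuses to invert the singular matrix
try:
    np.linalg.inv(A)
    inverted = True
except np.linalg.LinAlgError:
    inverted = False
assert not inverted
```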
<p><img src="/assets/images/even-one-linear-dependence-causes-non-invertible-matrix.jpg" alt="A Single Linearly Dependent Vector results in a non-invertible matrix" /></p>
<p><strong>Note that the above diagram is not mathematically correct.</strong> I drew 5 basis vectors in 2D space, and you cannot have more than 2 linearly independent basis vectors in two dimensions. This diagram is simply for illustration purposes.</p>
<ul>
<li>
<p>The <strong>determinant of a matrix</strong> is essentially the <strong>volume spanned by the basis vectors formed by its columns</strong>. A degenerate matrix has a determinant of zero because the measurement of this “hypervolume” on one axis becomes zero.</p>
</li>
<li>
<p><strong>The left null space of a matrix represents the set of normal vectors for the hyperplane defined by the column space of this matrix.</strong> This is because by definition, all vectors in the left null space are orthogonal to the column space. Thus, <strong>any vector in the left null space also represents the actual geometric equation of the hyperplane defined by the column space.</strong></p>
</li>
</ul>
<h1>Matrix Outer Product: Columns-into-Rows and the LU Factorisation (2021-04-02)</h1>
<p>We will discuss the Column-into-Rows computation technique for matrix outer products. This will lead us to one of the important factorisations (the LU Decomposition) that is used computationally when solving systems of equations, or computing matrix inverses.</p>
<h2 id="the-building-block">The Building Block</h2>
<p>Let’s start with the building block which we’ll extend to the computation of the matrix outer product. That component is the multiplication of a column vector (\(M\times 1\), left side) with a row vector (\(1\times N\), right side).
Without doing any computation, we can immediately say that the resulting matrix is \(M\times N\). Taking a concrete example here:</p>
\[\begin{bmatrix}
a_{11} \\
a_{21}
\end{bmatrix}
\begin{bmatrix}
b_{11} && b_{12} \\
\end{bmatrix}=
\begin{bmatrix}
a_{11}b_{11} && a_{11}b_{12} \\
a_{21}b_{11} && a_{21}b_{12} \\
\end{bmatrix}\]
<p>Note that this specific calculation itself can be done using the column picture/row picture/value-wise computation.</p>
<h2 id="extension-to-all-matrices">Extension to all matrices</h2>
<p>We will extend this computation to the outer product of two general matrices, A (\(M\times N\)) and B (\(N\times P\)). Here, \(a_1\), \(a_2\), …, \(a_N\) are the \(N\) columns of \(A\), and \(b_1\), \(b_2\), …, \(b_N\) are the \(N\) rows of \(B\).</p>
\[\begin{bmatrix}
| && | && | && |\\
a_1 && a_2 && ... && a_N\\
| && | && | && |\\
\end{bmatrix}
\begin{bmatrix}
--- && b_1 && --- \\
--- && b_2 && --- \\
... \\
--- && b_N && --- \\
\end{bmatrix}\\
=\mathbf{a_1b_1+a_2b_2+...+a_Nb_N}\]
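The identity above is easy to verify numerically; the sketch below (assuming <code>numpy</code>; the shapes are arbitrary) compares the sum of column-into-row products against the ordinary matrix product:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.random((4, 3))   # M x N
B = rng.random((3, 5))   # N x P

# Sum of column-into-row products: each np.outer(a_j, b_j) is an M x P matrix
total = sum(np.outer(A[:, j], B[j, :]) for j in range(A.shape[1]))

assert total.shape == (4, 5)
assert np.allclose(total, A @ B)
```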
<p>Each product in the result \(a_1b_1\), \(a_2b_2\), etc. is an \(M\times P\) matrix, and the sum of them is obviously \(M\times P\) as well.
Alright, so this is a way of computing the outer product of matrices. That’s great, but let’s look at why this is useful. That application is the LU Decomposition technique, and the high-level intuition behind it is sketched below.</p>
<ul>
<li>A single matrix will be expressed as the sum of the product of two matrices \({\ell}_1 u_1\), \({\ell}_2 u_2\), …, \({\ell}_N u_N\). It is important to note that we are dealing with an \(N\times N\) matrix, i.e., a square matrix.</li>
<li>We will go in the reverse direction and express this sum as a product of two matrices, \(\mathbf{L}\) and \(\mathbf{U}\).</li>
<li>Thus, we will have expressed \(A\) as \(\mathbf{A=LU}\), which is essentially a factorisation of A.</li>
<li>We will also consider the special nature of \(\mathbf{L}\) and \(\mathbf{U}\) as part of this discussion.</li>
</ul>
<h2 id="lu-factorisation-procedure">LU Factorisation: Procedure</h2>
<p>The final matrix \(L\) is a <strong>lower diagonal matrix</strong> (more commonly called a lower triangular matrix), meaning that all the elements above the diagonal are zero. Here is an example of a lower diagonal matrix:</p>
\[\begin{bmatrix}
1 && \mathbf{0} && \mathbf{0} && \mathbf{0}\\
5 && 6 && \mathbf{0} && \mathbf{0}\\
7 && 8 && 3 && \mathbf{0}\\
8 && 12 && 8 && 44\\
\end{bmatrix}\\\]
<p>As you will see, the other factor in the \(\mathbf{LU}\) combination, \(U\), will be an <strong>upper diagonal matrix</strong> (an upper triangular matrix), i.e., all its elements below its diagonal are zero.</p>
<p>The decomposition technique is the same as high school students follow when solving systems of equations. We will do the same, but in a more structured manner.</p>
<p><strong>Note on Terminology</strong>: This method for calculating the <strong>LU</strong> is formally known as the <strong>Gaussian Elimination</strong> method.</p>
<p>The steps are as follows:</p>
<ol>
<li>Subtract the first row from all the rows to force all the elements in the first column to become zero. Choose an appropriate multiplier for each row which allows you to do this.</li>
<li>This amounts to subtracting a matrix whose:
<ul>
<li>First row is 1 times the first row of A</li>
<li>Second row is x times the first row of A</li>
<li>…and so on</li>
</ul>
</li>
<li>Express the subtracting matrix as a product of two vectors.</li>
<li>Repeat this until A has all zeroes.</li>
<li>Aggregate the sum of matrix products (that represents the original matrix) as a single product of two matrices, \(L\) and \(U\).</li>
</ol>
<p>Let’s look at the general case.</p>
<h3 id="step-1">Step 1</h3>
<p>Force the first column of the matrix to become all zeroes, by subtracting suitable scaled versions of the first row from every row. Express the thing you’ve subtracted as product of a column vector and a row vector, like we discussed above, since that is our building block.</p>
\[\begin{bmatrix}
a_{11} && a_{12} && ... && a_{1N}\\
a_{21} && a_{22} && ... && a_{2N}\\
a_{31} && a_{32} && ... && a_{3N}\\
... && ... && .. && ...\\
a_{N1} && a_{N2} && ... && a_{NN}\\
\end{bmatrix}\\
=
\begin{bmatrix}
1\\
\ell_{12}\\
.\\
.\\
.\\
\ell_{1N}\\
\end{bmatrix}
\begin{bmatrix}
a_{11} && a_{12} && ... && a_{1N}\\
\end{bmatrix}
+
\begin{bmatrix}
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && a'_{22} && ... && a'_{2N}\\
\mathbf{0} && a'_{32} && ... && a'_{3N}\\
... && ... && .. && ...\\
\mathbf{0} && a'_{N2} && ... && a'_{NN}\\
\end{bmatrix}\\\]
<h3 id="step-2">Step 2</h3>
<p>Call the first term on the RHS \(\ell_1u_1\), that is:</p>
\[\ell_1u_1+
\begin{bmatrix}
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && a'_{22} && ... && a'_{2N}\\
\mathbf{0} && a'_{32} && ... && a'_{3N}\\
... && ... && .. && ...\\
\mathbf{0} && a'_{N2} && ... && a'_{NN}\\
\end{bmatrix}\\\]
<h3 id="step-3">Step 3</h3>
<p>Now subtract the second row from all the rows below it (with an appropriate multiplier) to make all the numbers in the second column, zero, that is:</p>
\[\begin{bmatrix}
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && a'_{22} && ... && a'_{2N}\\
\mathbf{0} && a'_{32} && ... && a'_{3N}\\
... && ... && .. && ...\\
\mathbf{0} && a'_{N2} && ... && a'_{NN}\\
\end{bmatrix}\\
=
\ell_1u_1+
\begin{bmatrix}
0\\
1\\
\ell_{23}\\
.\\
.\\
\ell_{2N}\\
\end{bmatrix}
\begin{bmatrix}
0 && a'_{22} && a'_{23} && ... && a'_{2N}\\
\end{bmatrix}
+
\begin{bmatrix}
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && \mathbf{0} && ... && a''_{3N}\\
... && ... && .. && ...\\
\mathbf{0} && \mathbf{0} && ... && a''_{NN}\\
\end{bmatrix}\\\]
<h3 id="step-4">Step 4</h3>
<p>Call the second term on the RHS \(\ell_2u_2\), so that:</p>
\[\begin{bmatrix}
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && a'_{22} && ... && a'_{2N}\\
\mathbf{0} && a'_{32} && ... && a'_{3N}\\
... && ... && .. && ...\\
\mathbf{0} && a'_{N2} && ... && a'_{NN}\\
\end{bmatrix}\\
=
\ell_1u_1+
\ell_2u_2+
\begin{bmatrix}
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\mathbf{0} && \mathbf{0} && ... && a''_{3N}\\
... && ... && .. && ...\\
\mathbf{0} && \mathbf{0} && ... && a''_{NN}\\
\end{bmatrix}\\\]
<h3 id="step-5">Step 5</h3>
<p>I hope you can see the pattern: we are gradually reducing all the elements of A to zero, while extracting all the \(\ell u\) factors as a sum.
Doing this will ultimately give us:</p>
\[\begin{bmatrix}
a_{11} && a_{12} && ... && a_{1N}\\
a_{21} && a_{22} && ... && a_{2N}\\
a_{31} && a_{32} && ... && a_{3N}\\
... && ... && .. && ...\\
a_{N1} && a_{N2} && ... && a_{NN}\\
\end{bmatrix}\\
=
\mathbf{
\ell_1u_1+
\ell_2u_2+...+
\ell_Nu_N}\]
<p>where all \(\ell\)’s are column vectors and all \(u\)’s are row vectors.</p>
<p>If you remember the general pattern of outer product using the columns-into-rows approach, you can rewrite this entire sum as a product of two vectors. <strong>That is, for \(\ell_1 u_1\), \(\ell_1\) becomes the first column of \(L\) and \(u_1\) becomes the first row of \(U\), and so on.</strong></p>
\[\begin{bmatrix}
a_{11} && a_{12} && ... && a_{1N}\\
a_{21} && a_{22} && ... && a_{2N}\\
a_{31} && a_{32} && ... && a_{3N}\\
... && ... && .. && ...\\
a_{N1} && a_{N2} && ... && a_{NN}\\
\end{bmatrix}\\
=
\begin{bmatrix}
1 && \mathbf{0} && \mathbf{0} && ... && \mathbf{0}\\
\ell_{12} && 1 && \mathbf{0} && ... && \mathbf{0}\\
\ell_{13} && \ell_{23} && 1 && ... && \mathbf{0}\\
\ell_{14} && \ell_{24} && \ell_{34} && ... && \mathbf{0}\\
... && ... && .. && ... && ...\\
\ell_{1N} && \ell_{2N} && \ell_{3N} && .. && 1\\
\end{bmatrix}
\begin{bmatrix}
x_{11} && x_{12} && x_{13} && ... && x_{1N}\\
\mathbf{0} && x_{22} && x_{23} && ... && x_{2N}\\
\mathbf{0} && \mathbf{0} && x_{33} && ... && x_{3N}\\
\mathbf{0} && \mathbf{0} && \mathbf{0} && ... && x_{4N}\\
... && ... && .. && ... && ...\\
\mathbf{0} && \mathbf{0} && \mathbf{0} && .. && x_{NN}\\
\end{bmatrix}\\\]
<p>which is the form that we wanted, namely:</p>
\[A=LU\]
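The procedure above can be sketched directly in code. The function below is a minimal, no-pivoting implementation of the column-into-row elimination (it assumes all pivots are non-zero, so it is illustrative rather than production-grade; real libraries use the pivoted factorisation \(PA=LU\)):

```python
import numpy as np

def lu_no_pivot(A):
    """LU factorisation via the column-into-row updates sketched above.

    A sketch only: assumes every pivot encountered is non-zero."""
    A = A.astype(float)          # work on a copy; the caller's matrix is untouched
    n = A.shape[0]
    L = np.zeros((n, n))
    U = np.zeros((n, n))
    for k in range(n):
        ell = A[:, k] / A[k, k]  # column vector ell_k, with 1 on the diagonal
        u = A[k, :].copy()       # row vector u_k
        L[:, k] = ell
        U[k, :] = u
        A -= np.outer(ell, u)    # subtract ell_k u_k, zeroing row k and column k
    return L, U

A = np.array([[2., 1., 1.],
              [4., 3., 3.],
              [8., 7., 9.]])
L, U = lu_no_pivot(A)

assert np.allclose(L @ U, A)        # A = LU
assert np.allclose(L, np.tril(L))   # L is lower triangular
assert np.allclose(U, np.triu(U))   # U is upper triangular
```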
<h2 id="implications-for-machine-learning">Implications for Machine Learning</h2>
<p>The LU factorisation will mostly be seen in lower level matrix computational techniques. Below are some examples.</p>
<ul>
<li>For a system of equations given by \(Ax=b\), the LU decomposition technique can be used to solve systems of linear equations repeatedly for different values of \(b\) without doing the entire process of Gaussian Elimination every time for a different value of \(b\).</li>
<li>Matlab uses LU decomposition to calculate inverse matrices.</li>
<li>The LU decomposition technique can make calculating determinants easier. We will speak of determinants at a later point.</li>
</ul>
<h1>Intuitions about the Orthogonality of Matrix Subspaces (2021-04-02)</h1>
<p>This is the easiest way I’ve been able to explain to myself the orthogonality of matrix spaces. The argument will essentially be based on the geometry of planes, which extends naturally to hyperplanes.</p>
<p>Some quick definitions first:</p>
<ul>
<li><strong>Column Space of \(A\)</strong>: The space spanned by a set of linearly independent column vectors.</li>
<li><strong>Null Space of \(A\)</strong>: The space spanned by the set of vectors which satisfy the condition \(Ax=0\). For a non-empty null space, this implies the following equivalent statements:
<ul>
<li>There exists some non-zero combination of the column vectors of \(A\) which results in the zero vector.</li>
<li>There exists at least one vector which gets transformed by matrix \(A\) into the zero vector.</li>
</ul>
</li>
<li><strong>Row Space of \(A\)</strong>: The space spanned by a set of linearly independent row vectors.</li>
<li><strong>Left Null Space of \(A\)</strong>: The space spanned by the set of vectors which satisfy the condition \(A^Tx=0\). For a non-empty left null space, this implies the following equivalent statements:
<ul>
<li>There exists some non-zero combination of the row vectors of \(A\) (i.e., column vectors of \(A^T\)) which results in the zero vector.</li>
<li>There exists at least one vector which gets transformed by matrix \(A^T\) into the zero vector.</li>
</ul>
</li>
</ul>
<p>The important point is that any argument we make around the column space and null space of \(A\) applies exactly to the row space and left null space of \(A^T\), and vice versa.</p>
<p>For purposes of this discussion, I’ll pick a matrix which already has linearly independent column and row vectors.</p>
\[A=
\begin{bmatrix}
a_{11} && a_{12} && ... && a_{1N} \\
a_{21} && a_{22} && ... && a_{2N} \\
a_{31} && a_{32} && ... && a_{3N} \\
\vdots && \vdots && \vdots && \vdots \\
a_{M1} && a_{M2} && ... && a_{MN} \\
\end{bmatrix}\]
<p>Let’s consider the non-zero null space of \(A\) and pick a vector from that space. Let that vector be \(x_O=(x_{O1}, x_{O2}, x_{O3}, ..., x_{ON})\).</p>
\[Ax_O=
\begin{bmatrix}
a_{11} && a_{12} && ... && a_{1N} \\
a_{21} && a_{22} && ... && a_{2N} \\
a_{31} && a_{32} && ... && a_{3N} \\
\vdots && \vdots && \vdots && \vdots \\
a_{M1} && a_{M2} && ... && a_{MN} \\
\end{bmatrix}
\begin{bmatrix}
x_{O1} \\
x_{O2} \\
x_{O3} \\
\vdots \\
x_{ON} \\
\end{bmatrix}
=
\begin{bmatrix}
a_{11}x_{O1} + a_{12}x_{O2} + a_{13}x_{O3} + ... + a_{1N}x_{ON} \\
a_{21}x_{O1} + a_{22}x_{O2} + a_{23}x_{O3} + ... + a_{2N}x_{ON} \\
a_{31}x_{O1} + a_{32}x_{O2} + a_{33}x_{O3} + ... + a_{3N}x_{ON} \\
\vdots \\
a_{M1}x_{O1} + a_{M2}x_{O2} + a_{M3}x_{O3} + ... + a_{MN}x_{ON} \\
\end{bmatrix}
= 0\]
<p>Let’s take the equation of the first row:</p>
\[a_{11}x_{O1} + a_{12}x_{O2} + a_{13}x_{O3} + ... + a_{1N}x_{ON}=0\]
<p>This represents a hyperplane:
\(\mathbf{a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + ... + a_{1N}x_N=0}\) with the normal vector \(\mathbf{\hat{n}=(a_{11}, a_{12}, a_{13}, ..., a_{1N})}\). <strong>Note that \(\hat{n}\) is also one of the row vectors which spans A’s row space.</strong></p>
<p>By the basic definition of hyperplanes and normal vectors (for a quick refresher, see <a href="/2021/03/29/vectors-normals-hyperplanes.html">Vectors, Normals, and Hyperplanes</a>), we can say that:</p>
<ul>
<li>The vector \(x_O\) is orthogonal to the normal vector \(\hat{n}\), i.e., \(\mathbf{x_O\perp \hat{n}}\). Equivalently, \(\mathbf{x_O\cdot \hat{n}=0}\) (the dot product is zero). This condition is true for every vector \(x_O\) in \(A\)’s null space.</li>
<li>Thus A’s null space is orthogonal to the first row vector of \(A\).</li>
</ul>
<p>This argument can be extended to all row vectors in \(A\), proving that \(A\)’s null space is orthogonal to every row vector in \(A\). By the property of linearity, this implies that \(A\)’s null space is orthogonal to \(A\)’s row space, i.e., \(\mathbf{N(A)\perp R(A)}\).</p>
<p>Now, apply the same argument for \(A^T\), i.e., the null space of \(A^T\) is orthogonal to \(A^T\)’s row space, i.e., \(\mathbf{N(A^T)\perp R(A^T)}\). But, we already know that:</p>
<p>\(\mathbf{R(A^T)=C(A)}\): The row space of \(A^T\) is the column space of \(A\).
\(\mathbf{N(A^T)=LN(A)}\): The null space of \(A^T\) is the left null space of \(A\).</p>
<p>Thus, the left null space of \(A\) is orthogonal to the column space of \(A\).</p>
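These orthogonality relations can be checked numerically. The sketch below (assuming <code>scipy</code> is available; the matrix is an arbitrary rank-deficient example) uses <code>scipy.linalg.null_space</code>, which returns an orthonormal basis for the null space:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1., 2., 3., 4.],
              [2., 4., 6., 8.],   # dependent row: guarantees non-trivial spaces
              [1., 0., 1., 0.]])

N = null_space(A)      # columns span the null space of A
LN = null_space(A.T)   # columns span the left null space of A

# Every row of A is orthogonal to every null-space vector: N(A) is
# orthogonal to the row space of A
assert np.allclose(A @ N, 0.0)

# Every column of A is orthogonal to every left-null-space vector: LN(A) is
# orthogonal to the column space of A
assert np.allclose(A.T @ LN, 0.0)
```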
<p>To summarise:</p>
<ul>
<li><strong>The left null space of \(A\) is orthogonal to the column space of \(A\).</strong></li>
<li><strong>The null space of \(A\) is orthogonal to the row space of \(A\).</strong></li>
</ul>
<h1>Matrix Outer Product: Value-wise Computation and the Transposition Rule (2021-04-01)</h1>
<p>We will discuss the value-wise computation technique for matrix outer products. This will lead us to a simple sketch of the proof of reversal of order for transposed outer products.</p>
<h2 id="value-wise-computation">Value-wise Computation</h2>
<p>This is probably the method most widely used in high school algebra. You are essentially viewing this as a value-by-value computation. There is a very simple mnemonic for remembering it, namely:</p>
<p><strong>The element \(Y_{ik}\) in the \(i\)th row and the \(k\)th column is the dot product of the \(i\)th row vector (of the Left Hand Matrix) and the \(k\)th column vector (of the Right Hand Matrix).</strong></p>
<p><img src="/assets/images/value-by-value-outer-product.jpg" alt="Value-by-Value Multiplication" /></p>
<p>This is also something which makes it obvious that the number of columns of the left matrix should be equal to the number of rows of the right matrix, because the dot product involves pairwise multiplication, and that cannot happen if the number of components in the row vector and the column vector are unequal.</p>
<p>Mathematically, each element is computed as below:</p>
\[Y_{ik}=A_{i1}B_{1k}+A_{i2}B_{2k}+ ... +A_{ij}B_{jk}+...+A_{iN}B_{Nk}\]
<p>or, more compactly:</p>
\[Y_{ik}=\displaystyle \sum_{j=1}^{N} A_{ij}B_{jk}\]
<p>To reiterate, it is important for the number of columns of \(A\) and the number of rows of \(B\) to be equal for matrix multiplication to be a valid operation. In this case, this number is \(N\). This is important for proving why the order itself needs reversing, as you will see.</p>
<h3 id="proof-of-order-reversal-in-transpose-of-an-operation">Proof of Order Reversal in Transpose of an Operation</h3>
<p>Assume we have two matrices, \(A\) (\(M\times N\)) and \(B\) (\(N\times P\)), which we multiply to get \(Y\), i.e.,</p>
\[Y=AB\]
<p>We’d like to know what \(Y^T\) looks like, in terms of \(A^T\) and \(B^T\). I’ll attempt to elaborate the thought process while writing the identities.</p>
<p>From the definition of transpose, we know that \({Y_{ik}}^T=Y_{ki}\).
Also, by definition of the value-wise computation of two matrices A and B, we have:</p>
\[Y_{ik}=\displaystyle \sum_{j=1}^{N} A_{ij}B_{jk}\]
<p>Now, I’d like to express \({Y_{ik}}^T\) in terms of \(A^T\) and \(B^T\).
Now, in the transpose versions of A and B, where have the \(i\)th row of A and the \(k\)th column of B moved to?</p>
<p>\(i\)th row of A has become the \(i\)th column of \(A^T\)
\(k\)th column of B has become the \(k\)th row of \(B^T\)</p>
<p>Mathematically, this is expressed as:
\({A^T}_{ij}=A_{ji} \\
{B^T}_{jk}=B_{kj}\)</p>
<p>In order to express \({Y^T}_{ik}=Y_{ki}\) in terms of \(A^T\) and \(B^T\), we expand \(Y_{ki}\) by the value-wise rule, and substitute the transpose identities above:</p>
\[{Y^T}_{ik}=Y_{ki}=\displaystyle \sum_{j=1}^{N} A_{kj}B_{ji} \\
=\displaystyle \sum_{j=1}^{N} B_{ji}A_{kj} \\
=\displaystyle \sum_{j=1}^{N} {B^T}_{ij}{A^T}_{jk}\]
<p>This is exactly the value-wise computation of the product of the two matrices \(B^T\) and \(A^T\), evaluated at row \(i\), column \(k\).
Thus:</p>
\[(AB)^T=B^TA^T\]
<p>Note the important reversal of \(A_{kj}B_{ji}\) to \(B_{ji}A_{kj}\). The result is the same, since both \(A_{kj}\) and \(B_{ji}\) are simple scalars. As matrices, however, only the product \(B^TA^T\) is valid: matrix multiplication requires the columns of the left matrix to equal the rows of the right matrix, and \(B^T\) is \(P\times N\) while \(A^T\) is \(N\times M\), so the inner dimension \(N\) matches (which is why we iterate over \(j\) from 1 to \(N\)).</p>
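A quick numerical check of the order-reversal rule (a sketch, assuming <code>numpy</code>; the shapes are arbitrary, chosen so that \(A^TB^T\) would not even be conformable):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((2, 3))   # M x N
B = rng.random((3, 4))   # N x P

# (AB)^T equals B^T A^T. Note that A^T (3x2) times B^T (4x3) is not even
# a valid product here, so the order reversal is forced by the dimensions.
assert np.allclose((A @ B).T, B.T @ A.T)
assert (A @ B).T.shape == (4, 2)
```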
<h2 id="conclusion">Conclusion</h2>
<p>This rule of swapping the order of the outer product will also apply when we are calculating inverses.</p>
<h1>Matrix Outer Product: Linear Combinations of Vectors (2021-03-30)</h1>
<p>Matrix multiplication (outer product) is a fundamental operation in almost any Machine Learning proof, statement, or computation. Much insight may be gleaned by looking at different ways of looking at matrix multiplication. In this post, we will look at one (and possibly the most important) interpretation: namely, the <strong>linear combination of vectors</strong>.</p>
<p>In fact, the geometric interpretation of this operation allows us to infer many properties that might be obscured if we were treating matrix multiplication as simple sums of products of numbers.</p>
<p><strong>Quick Aside</strong>: There are some other ways of viewing matrix multiplication, which we will address in one of the future articles (element-wise, columns-into-rows).</p>
<p>To begin with, I’ll state the one fact that holds true no matter how you perform matrix multiplication. Even if you forget everything you read in this article, remember this thing:</p>
<p><strong>Matrix multiplication is a linear combination of a set of vectors.</strong></p>
<p>Let’s begin with a simple two-dimensional vector, like so:
\(A_1=\begin{bmatrix}
2 \\
3 \\
\end{bmatrix}\)</p>
<p>Let’s introduce a \(1\times 1\) matrix \(x_1=\begin{bmatrix}
2 \\
\end{bmatrix}\)</p>
<p>We multiply them together, like so:
\(Y=A_1x_1
= \begin{bmatrix}
2 \\
3 \\
\end{bmatrix}.\begin{bmatrix}
2 \\
\end{bmatrix}
=\begin{bmatrix}
4 \\
6 \\
\end{bmatrix}\)</p>
<p>This is nothing special, simply a scaling of the \(\begin{bmatrix}2 && 3\end{bmatrix}^T\) matrix.</p>
<h2 id="1-linear-combination-of-column-vectors">1. Linear Combination of Column Vectors</h2>
<p>Let’s take the next step. We will add one more column vector to A, and add a number to \(x_1\).</p>
\[A_2=\begin{bmatrix}
2 && 5\\
3 && 10\\
\end{bmatrix}
v_2=\begin{bmatrix}
2 \\
3 \\
\end{bmatrix}\]
<p>In this approach, we are only correlating columns on both sides. Consider the following diagram to see how this combination works.</p>
<p><img src="/assets/images/matrix-multiplication-column-vector.png" alt="Column Vector Matrix Multiplication" /></p>
<p>What you are really doing is this: you are considering the <strong>weighted sum of all the column vectors</strong> of \(A_2\) (Remember, in this picture, \(A_2\) is just a bunch of column vectors).</p>
<p>What are these weights? Each value in the column in \(v_2\) is a weight. Each weight scales a column vector, and these weighted vectors are added together to form a single column.</p>
<p>I hope it becomes obvious that this implies that the number of column vectors in \(A_2\) (the number of columns in \(A_2\)) must equal the number of values in \(v_2\)’s column (the number of rows in \(v_2\)), because there has to exist a one-to-one correspondence between them in order for this operation to be possible.</p>
<p>This results in a <strong>linear combination of the column vectors</strong> in \(A_2\). A linear combination of two vectors is of the form \(\alpha x + \beta y\), where \(\alpha\) and \(\beta\) are simple scalars (numbers). What you are really doing is either squashing or stretching some vectors by some factor (this can be a negative number), and then adding them together. That’s what a linear combination essentially means.</p>
<p>To put it more concretely in this example, your computation is as follows:</p>
\[A_2v_2=\left[2.\begin{bmatrix}
2 \\
3 \\
\end{bmatrix} + 3.\begin{bmatrix}
5 \\
10 \\
\end{bmatrix}\right]=
\begin{bmatrix}
19 \\
36 \\
\end{bmatrix}\]
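We can sanity-check this column-picture computation with the same numbers (a sketch, assuming <code>numpy</code>):

```python
import numpy as np

A2 = np.array([[2., 5.],
               [3., 10.]])
v2 = np.array([2., 3.])

# The product is the weighted sum of A2's columns, weights taken from v2
by_columns = 2.0 * A2[:, 0] + 3.0 * A2[:, 1]

assert np.allclose(A2 @ v2, by_columns)
assert np.allclose(by_columns, [19., 36.])
```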
<h3 id="11-the-geometric-interpretation">1.1 The Geometric Interpretation</h3>
<p>It is worth pausing to ground our understanding using the geometric interpretation. If you have followed the verbal explanation so far, the geometric interpretation should be pretty straightforward to comprehend.</p>
<p><img src="/assets/images/vector-linear-combination.png" alt="Linear Combinations of Vectors" /></p>
<p>The vector \(\begin{bmatrix}2 && 3\end{bmatrix}^T\) is stretched by a factor of 2, to become \(\begin{bmatrix}4 && 6\end{bmatrix}^T\).
Similarly, the vector \(\begin{bmatrix}5 && 10\end{bmatrix}^T\) is stretched by a factor of 3, to become \(\begin{bmatrix}15 && 30\end{bmatrix}^T\).
The sum of these vectors is \(\begin{bmatrix}19 && 36\end{bmatrix}^T\), as indicated by the red arrow in the diagram above.
<strong>This linear combination, as usual, extends to higher-dimensional vectors.</strong></p>
<h3 id="12-general-case">1.2 General Case</h3>
<p>Let’s extend to the more general case, where we add another column to \(v_2\), so that we now have:</p>
\[A_3=\begin{bmatrix}
2 && 5\\
3 && 10\\
\end{bmatrix}
v_3=\begin{bmatrix}
2 && -2\\
3 && -4\\
\end{bmatrix}\]
<p>How do we extend what we already know to this new case? Very simple: each column in \(v_3\) results in a corresponding column in the final result, and each output column is computed in exactly the same way. You are still linearly combining all the column vectors in \(A_3\), but the set of weights you’re using depends upon which column of \(v_3\) you are using for the computation. Thus, this new computation is as follows:</p>
\[A_3v_3=\left[\left(2.\begin{bmatrix}
2 \\
3 \\
\end{bmatrix} + 3.\begin{bmatrix}
5 \\
10 \\
\end{bmatrix}\right) \left(-2.\begin{bmatrix}
2 \\
3 \\
\end{bmatrix} + (-4).\begin{bmatrix}
5 \\
10 \\
\end{bmatrix}\right)\right]
\\
=\begin{bmatrix}
19 && -24\\
36 && -46\\
\end{bmatrix}\]
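<p>As a quick sanity check, the same NumPy sketch (again my addition) extends to the two-column case: each column of \(v_3\) supplies one set of weights and produces one output column.</p>

```python
import numpy as np

A3 = np.array([[2, 5],
               [3, 10]])
v3 = np.array([[2, -2],
               [3, -4]])

# Each column of v3 supplies one set of weights for the columns of A3,
# producing one column of the output.
cols = [v3[0, j] * A3[:, 0] + v3[1, j] * A3[:, 1] for j in range(2)]
result = np.column_stack(cols)

print(result)
# [[ 19 -24]
#  [ 36 -46]]
print(np.array_equal(result, A3 @ v3))  # True
```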
<p><strong>Quick Aside</strong>: The column vectors which are linearly combined come from the left side of the expression. The weights come from the right side. This is important to know, since the row vector interpretation (which we will study next) inverts the order.</p>
<h2 id="2-linear-combination-of-row-vectors">2. Linear Combination of Row Vectors</h2>
<p>The concept of linear combination of vectors works equally well if you consider the rows of a matrix as vectors. The same concept applies: each row in the output is the sum of all the weighted row vectors.</p>
<p>There is one important distinction, however, which is worth noting. <strong>In the column vector approach, the column vectors are on the left of the expression, which is to say, the expression is of the form \(Av\).</strong></p>
<p>In the row vector approach, if you want to consider the rows as vectors, these vectors will come from the right-hand side of the expression. Thus, if we want to perform the same operation using the same vectors, but treat them using the row vector approach, the expression has to assume the form \(v^TA^T\).</p>
<p>The algorithmic picture for multiplication using the row vector approach looks like this:</p>
<p><img src="/assets/images/matrix-multiplication-row-vector.png.png" alt="Row Vector Matrix Multiplication" /></p>
<p>It is important to note that the central idea here (regardless of whether we are considering column vectors or row vectors) is that we are computing linear combinations of vectors.
The geometric interpretation for this example stays the same.
You should also convince yourself by doing this calculation by hand.</p>
<p><strong>It is worth computing the original \(Av\) computation using the row vector approach, just to consider how different the rows are, and what the weights are. You should still get the same answer, however.</strong></p>
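<p>Here is a short NumPy sketch of the row vector approach (my addition), using the same numbers as before; the weights now multiply the rows of \(A^T\):</p>

```python
import numpy as np

A = np.array([[2, 5],
              [3, 10]])
v = np.array([2, 3])

# Row vector view of v^T A^T: the weights 2 and 3 (from v^T)
# scale the rows [2, 3] and [5, 10] of A^T, which are then summed.
by_rows = v[0] * A.T[0, :] + v[1] * A.T[1, :]

print(by_rows)   # [19 36], the same numbers, now as a row vector
print(v @ A.T)   # same result via ordinary matrix multiplication
```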
<h2 id="3-the-transponse-of-matrix-multiplication">3. The Transpose of Matrix Multiplication</h2>
<p>One thing you will have noticed is the way I set up the column vector and row vector examples.</p>
<p>Column Vector example: The computation was \(Av\) and the result was \(Y_1=\begin{bmatrix}
19 \\
36 \\
\end{bmatrix}\)
Row Vector example: The computation was \(v^TA^T\) and the result was \(Y_2=\begin{bmatrix}
19 && 36 \\
\end{bmatrix}\)</p>
<p>Obviously, \(Y_1^T=Y_2\). Substituting the original expressions in the above, we get:</p>
\[(Av)^T=v^TA^T\]
<p>This is just an example, but it is part of a more general rule about transposes, which is that:</p>
<p><strong>The transpose of a set of operations is the same set of operations on the transposed elements, but applied in reverse order</strong>.
We will sketch out a simple proof for this when we look at another method of matrix multiplication in one of the next articles.</p>
<p>We will also see a similar rule for inverses when discussing inverse matrices.</p>
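<p>The reverse-order rule can also be spot-checked numerically. The sketch below (my own example, using randomly generated integer matrices) verifies \((AB)^T=B^TA^T\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-5, 5, size=(3, 4))
B = rng.integers(-5, 5, size=(4, 2))

# The transpose of a product is the product of the transposes,
# applied in reverse order.
lhs = (A @ B).T
rhs = B.T @ A.T
print(np.array_equal(lhs, rhs))  # True
```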
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>The fact that columns (and rows) of the product of matrices can be treated as linear combinations of vectors is the important idea behind the Gaussian (and the Gauss-Jordan) Elimination method for solving systems of equations, which is closely related to how students solve simultaneous equations in high school algebra.</li>
<li>This insight also has important implications for the vector subspace that the resulting product matrix spans, as well as its rank, which we will talk about in future articles.</li>
</ul>

<h1 id="vectors-normals-and-hyperplanes">Vectors, Normals, and Hyperplanes</h1>
<p><em>avishek, 2021-03-29</em></p>
<p>Linear Algebra deals with matrices. But that is missing the point, because the more fundamental component of a matrix is what will allow us to build our intuition on this subject. This component is the vector, and in this post, I will introduce vectors, along with common notations of expression.</p>
<p>We will talk about the normal vector, and its relation to a line, a plane, and ultimately, a hyperplane. We will introduce the idea of the dot product (though I’ll only be delving into it in a later article).</p>
<h2 id="vectors">Vectors</h2>
<p>It is very instructive to look at a single <strong>vector</strong>. Remember that our worldview is that a matrix is just a bunch of vectors. We will return to this point later.</p>
<p>The convention for defining vectors that you will find in every textbook/paper is as a column of numbers (basically a <strong>column vector</strong>). We’d write a vector \(v_1\) as:</p>
\[v_1=\begin{bmatrix}
x_1 \\
x_2 \\
\end{bmatrix}\]
<p>Vectors implicitly originate from the origin (in this case, \((0,0)\)). Now take the matrix:</p>
\[v_2=\begin{bmatrix}
x_{11} && x_{12} \\
x_{21} && x_{22} \\
\end{bmatrix}\]
<p>You can look at \(v_2\) in two ways:</p>
<p>A set of two column vectors, i.e.,
\(\begin{bmatrix}
x_{11} \\
x_{21} \\
\end{bmatrix}\),
\(\begin{bmatrix}
x_{12} \\
x_{22} \\
\end{bmatrix}\)</p>
<p>or, a set of two row vectors, i.e.,
\(\begin{bmatrix}
x_{11} && x_{12} \\
\end{bmatrix}\),
\(\begin{bmatrix}
x_{21} && x_{22} \\
\end{bmatrix}\)</p>
<p>The <strong>column vector</strong> picture is usually more prevalent. Usually, if a column vector needs to be turned into a row vector (for purposes of multiplication if, say, you are trying to create a symmetric matrix), you’d still consider it as a column vector, and use the <strong>transpose</strong> operator, written as \(v^T\).</p>
<h2 id="lines-planes-and-hyperplanes">Lines, Planes, and Hyperplanes</h2>
<p>Let’s talk a little bit about lines, surfaces, and their higher dimensional variants (hyperplanes) and their normal vectors.</p>
<h3 id="1-lines">1. Lines</h3>
<p>Here’s a line \(6x+4y=0\). I’ve also drawn its normal vector, which is \((6,4)\). Now, note the direct correspondence between the coefficients of the line equation and the normal vector. In fact, this is a very general rule, and we will see the reason for this right now.
<img src="/assets/images/line-and-normal-vector.png" alt="Line and its Normal Vector" /></p>
<p><strong>Quick Aside</strong>: Why is \((6,4)\) the normal vector? This is because any point on the line \((6x+4y=0)\) forms a vector with the origin, which is perpendicular to the normal vector (which obviously translates to the entire line being perpendicular to the \((6,4)\) vector). This is shown below.</p>
<p><img src="/assets/images/line-and-normal-vector-relationships.png" alt="Line and its Normal Vector Relationships" /></p>
<p>We haven’t talked about the dot product yet (that comes later), but allow me to note a painfully obvious fact: any point on the line \(6x+4y=0\) satisfies that equation. Indeed, you can see this clearly if we take \((2,-3)\) as an example, and write:</p>
\[6.(2)+4.(-3)=0\]
<p>Let us interpret this simple calculation in another way. This is like taking two vectors \((6,4)\) and \((2,-3)\) and multiplying their individual components, and summing up the results. \((2,-3)\) is obviously the vector we chose, and \((6,4)\) is another vector, which in this case, is…our normal vector.</p>
<p>This operation of multiplying individual components of vectors, and summing them up, is the <strong>dot product</strong> operation. It is not obvious that taking the dot product of perpendicular vectors will always result in 0, but we will prove it later, and it is true in the general case.</p>
<p>Indeed, an alternate definition of a line is the <strong>set of all vectors which are perpendicular to the normal vector</strong> (in this case, \((6,4)\)). This is why a line (and as we will see, a plane, and hyperplanes) can be characterised by a single vector.</p>
<p><strong>Quick Aside</strong>: There are alternate ways to express a line (or plane or hyperplane) using vectors which involve the <strong>column space</strong> interpretation, but I defer that discussion to the Vector Subspaces topic.</p>
<p>The dot product operation is denoted as \(A\cdot B\), and I’ll defer more discussion on the dot product to its specific topic.</p>
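<p>The "set of all vectors perpendicular to the normal vector" definition is easy to verify numerically. A small NumPy sketch (the sample points are my own choices):</p>

```python
import numpy as np

normal = np.array([6, 4])  # normal vector of the line 6x + 4y = 0

# Every point on the line, viewed as a vector from the origin,
# has zero dot product with the normal vector.
points_on_line = [np.array([2, -3]), np.array([-2, 3]), np.array([4, -6])]
print([int(np.dot(normal, p)) for p in points_on_line])  # [0, 0, 0]

# A point off the line does not.
print(int(np.dot(normal, np.array([1, 1]))))  # 10
```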
<h3 id="2-planes">2. Planes</h3>
<p>Let’s move up a dimension to 3D. Here we consider a plane of the form:</p>
\[ax+by+cz=0\]
<p>To make things a little more concrete, consider the plane:</p>
\[-x-y+2z=0\]
<p>and a point on this plane \((5,5,5)\).</p>
<p><img src="/assets/images/plane-and-normal-vector.png" alt="Plane and its Normal Vector" /></p>
<p>Do verify for yourself that this point lies on the plane. Also, as you can see, the coefficients of the plane equation form the normal vector of the plane. The same concept that applied to lines, also applies here.</p>
<p>In other words, <strong>satisfying the condition that a point (or vector) lies on a plane (hyperplane) is the same thing as satisfying that the vector is perpendicular to the normal vector of that plane (hyperplane)</strong>. It is literally the same computation, thanks to how <strong>dot product</strong> is defined.</p>
<p>The image below illustrates this idea.</p>
<p><img src="/assets/images/normal-vectors-perpendicular-plane-vectors.jpg" alt="Vectors in a Plane and its Normal Vector" /></p>
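<p>The same check works in 3D; here is a short sketch (my addition) using the plane and point from above:</p>

```python
import numpy as np

normal = np.array([-1, -1, 2])  # coefficients of the plane -x - y + 2z = 0
point = np.array([5, 5, 5])

# "Lies on the plane" and "is perpendicular to the normal"
# are the same dot-product computation.
print(int(np.dot(normal, point)))  # 0, so the point lies on the plane
```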
<h3 id="3-hyperplanes">3. Hyperplanes</h3>
<p>We can extend the same concept to higher dimensions, except we cannot really sketch it out (at least in a way that would make intuitive sense to us). Incidentally, it’s just easier to refer to everything as a hyperplane when talking in the abstract, since a line is a one-dimensional hyperplane, a plane is a two-dimensional hyperplane, and so on.</p>
<p><strong>Note</strong>: I’d like to make the definition of a hyperplane explicit with respect to its dimensionality. <strong>A hyperplane in a vector space with dimensionality \({\mathbb{R}}^N\) is always of dimensionality \(N-1\)</strong>, i.e., one dimension lesser than the ambient space it inhabits.</p>
<p><strong>This guarantees a unique normal vector for that hyperplane</strong>, because by the Rank-Nullity Theorem, the null space (where the normal vector resides) will have only one basis vector. As a counterexample, a 2D plane in \({\mathbb{R}}^4\) would have two linearly independent normal vectors, and thus does not qualify as a hyperplane in 4D space (but does qualify as one in 3D space).</p>
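<p>One way to make this dimensionality claim concrete is to count independent normal directions numerically. The sketch below is my own construction (using singular values to estimate rank, not anything from the original post); it finds the dimension of the space of normals for a plane embedded in \({\mathbb{R}}^3\) versus \({\mathbb{R}}^4\):</p>

```python
import numpy as np

def null_space_dim(A, tol=1e-10):
    """Dimension of the null space of A, estimated via singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    rank = int(np.sum(s > tol))
    return A.shape[1] - rank

# Two independent vectors spanning a plane, stacked as columns.
# Vectors perpendicular to both (the normal directions) form the
# null space of the transpose.
plane_3d = np.array([[1, 0],
                     [0, 1],
                     [1, 1]])
print(null_space_dim(plane_3d.T))  # 1: a unique normal direction in R^3

# The same plane embedded in R^4 picks up a second independent normal,
# so it no longer qualifies as a hyperplane there.
plane_4d = np.array([[1, 0],
                     [0, 1],
                     [1, 1],
                     [0, 0]])
print(null_space_dim(plane_4d.T))  # 2: two independent normals in R^4
```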
<p>What this implies is that any hyperplane defined as:</p>
\[w_1x_1+w_2x_2+w_3x_3+...+w_nx_n=0\]
<p>has its normal vector as: \(v=\begin{bmatrix}
w_1 \\
w_2 \\
\vdots \\
w_n \\
\end{bmatrix}\)</p>
<p>Incidentally, the above form is the most common form of expressing a hyperplane, i.e., by referencing its normal vector.</p>
<h2 id="relevance-to-machine-learning">Relevance to Machine Learning</h2>
<p>In addition to vectors (and by extension, matrices) being used to frame almost every Machine Learning/Statistics problem, these are some examples of how they are used:</p>
<ul>
<li>Many Machine Learning problems related to prediction boil down to <strong>determining a hyperplane that best captures the trend of the data</strong>, subject to certain assumptions (eg: Linear Models/Generalised Linear Models).</li>
<li>Many Machine Learning problems related to classification, boil down to <strong>finding the optimal dividing hyperplane between two different classes of data</strong> (eg: Support Vector Machines).</li>
<li>Relationships between vectors give us important information about the space that they define (more on this in Vector Subspaces). This in turn can help us infer certain important properties of a matrix (invertibility, eigenvectors, etc.), which can directly tell us whether certain Machine Learning processes can be applied or not.</li>
</ul>