<p><em>A Fish without a Bicycle: Technology and Art</em> (Jekyll feed, generated 2021-06-17)</p>

<p><strong>The Gram-Schmidt Orthogonalisation</strong> (2021-05-27, by avishek)</p>

<p>We discuss an important factorisation of a matrix, which allows us to convert a linearly independent but non-orthogonal basis to a <strong>linearly independent orthonormal basis</strong>. This uses a procedure which iteratively extracts vectors which are orthonormal to the previously-extracted vectors, to ultimately define the orthonormal basis. This is called the <strong>Gram-Schmidt Orthogonalisation</strong>, and we will also show a proof for this.</p>
<h2 id="projection-of-vectors-onto-vectors">Projection of Vectors onto Vectors</h2>
<p>This section derives the <strong>decomposition of a vector into two orthogonal components</strong>. These orthogonal components aren’t necessarily the standard basis vectors (\(\text{[1 0]}\) and \(\text{[0 1]}\) in \(\mathbb{R}^2\), for example); but they are guaranteed to be orthogonal to each other.</p>
<p>Assume we have the vector \(\vec{x}\) that we wish to decompose into two orthogonal components. Let us choose an arbitrary vector \(\vec{u}\) as one of the components; we will derive its orthogonal counterpart as part of this derivation.</p>
<p><img src="/assets/images/vector-projection.png" alt="Vector Projection" /></p>
<p>Since the projection will be collinear with \(\vec{u}\), let us assume the projection is \(t\vec{u}\), where \(t\) is a scalar.
The only constraint we wish to express is that the vector \(\vec{u}\) and the plumb line from the tip of the vector \(\vec{x}\) to \(\vec{u}\) are perpendicular, i.e., their dot product is zero. We can see from the above diagram that the plumb line is \(\vec{x}-t\vec{u}\). We can then write:</p>
\[u^T.(x-ut)=0 \\
\Rightarrow u^Tx=u^Tut \\
\Rightarrow t={(u^Tu)}^{-1}u^Tx\]
<p>We know that \(u^Tu\) is the dot product of \(\vec{u}\) with itself, and thus a scalar, so you could write it as:</p>
\[t=\frac{u^Tx}{u^Tu}\]
<p>and indeed, we’d be justified in doing that, but let’s not make that simplification, because there is a more general case coming up, where this will not be a scalar. Thus, the component of \(\vec{x}\) in the direction \(\vec{u}\) is \(ut={(u^Tu)}^{-1}u^Txu\) and the orthogonal component will be \(x-ut=x-{(u^Tu)}^{-1}u^Txu\).</p>
<p>There is one simplifying assumption we can make: if \(\vec{u}\) is a unit vector, then \(u^Tu=1\) (and in the more general case coming up, where \(u\) has orthonormal columns, \(u^Tu=I\)), which simplifies the expressions to:</p>
\[\mathbf{x_{u\parallel}={(u^Tu)}^{-1}u^Txu} \\
x_{u\parallel}=u^Txu \text{ (if u is a unit vector)}\\
\mathbf{x_{u\perp}=x-{(u^Tu)}^{-1}u^Txu} \\
x_{u\perp}=x-u^Txu \text{ (if u is a unit vector)}\]
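<p>These identities are easy to verify numerically. Here is a small sketch in NumPy (the particular \(\vec{x}\) and \(\vec{u}\) are arbitrary choices for illustration):</p>

```python
import numpy as np

x = np.array([3.0, 4.0])   # vector to decompose (arbitrary choice)
u = np.array([1.0, 1.0])   # direction to project onto (arbitrary choice)

t = (u @ x) / (u @ u)      # t = (u^T u)^{-1} u^T x, a scalar here
x_par = t * u              # component of x along u
x_perp = x - x_par         # orthogonal component

# The two components are orthogonal and sum back to x
assert abs(u @ x_perp) < 1e-12
assert np.allclose(x_par + x_perp, x)
```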
<h2 id="projection-of-vectors-onto-vector-subspaces">Projection of Vectors onto Vector Subspaces</h2>
<p>The same logic applies when we are projecting vectors onto vector subspaces. We use the same constraint, i.e.:</p>
\[u^T.(x-ut)=0\]
<p>There are a few differences in the meaning of the symbols worth noting. \(u\) is no longer a single column vector; <strong>it is a set of column vectors which define a vector subspace</strong>. Let us assume the vector subspace is embedded in \(\mathbb{R}^n\), and that we have \(m\) linearly independent vectors in \(u\) (\(m\leq n\)). \(u\) now becomes an \(n\times m\) matrix.</p>
<p>The projection is no longer obtained by scaling a single vector; it is now expressible as a linear combination of these \(m\) vectors. <strong>This set of weightings is \(t\), which now becomes an \(m\times 1\) matrix.</strong> This change of \(t\) from a scalar to an \(m\times 1\) matrix is also the reason we didn’t simplify the \(u^Tu\) expression in the previous section; in the general case, \(t\) is not a scalar.</p>
<p>\(\vec{x}\) is still an \(n\times 1\) matrix; this hasn’t changed.</p>
<p>Thus, the results of projecting a vector onto a vector subspace take exactly the same form.</p>
\[\mathbf{x_{u\parallel}={(u^Tu)}^{-1}u^Txu} \\
x_{u\parallel}=u^Txu \text{ (if the columns of u are orthonormal)}\\
\mathbf{x_{u\perp}=x-{(u^Tu)}^{-1}u^Txu} \\
x_{u\perp}=x-u^Txu \text{ (if the columns of u are orthonormal)}\]
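<p>The same computation can be sketched for the subspace case. In this illustrative NumPy snippet (the matrix \(u\) and vector \(x\) are arbitrary choices), \(t\) is genuinely an \(m\times 1\) vector of weights, not a scalar:</p>

```python
import numpy as np

# u is now an n x m matrix whose columns span the subspace (n = 3, m = 2 here)
u = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
x = np.array([1.0, 2.0, 3.0])

# t = (u^T u)^{-1} u^T x is now an m x 1 vector of weights
t = np.linalg.solve(u.T @ u, u.T @ x)
x_par = u @ t        # projection of x onto the subspace
x_perp = x - x_par   # orthogonal component

# The residual is orthogonal to every column of u
assert np.allclose(u.T @ x_perp, 0.0)
```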
<h2 id="gram-schmidt-orthogonalisation">Gram-Schmidt Orthogonalisation</h2>
<p>We are now in a position to describe the intuition behind <strong>Gram-Schmidt Orthogonalisation</strong>. Let us state the key idea first.</p>
<p><strong>For a set of \(m\) linearly independent vectors in \(\mathbb{R}^n\) which span some subspace \(V_m\), there exists a set of \(m\) orthonormal basis vectors which span the same subspace \(V_m\).</strong></p>
<p>The procedure goes as follows:</p>
<p>Assume \(m\) <strong>linearly independent</strong> (but not orthogonal) vectors in \(\mathbb{R}^n\). They span some subspace \(V_m\) of dimensionality \(m\). Let these vectors be \(x_1\), \(x_2\), \(x_3\), …, \(x_m\).</p>
<ul>
<li>We have to start somewhere, so let’s take \(u_1=\frac{x_1}{\|x_1\|}\) (normalised to be a unit vector). <strong>\(u_1\) is our first orthogonal basis vector.</strong></li>
<li>We now project \(x_2\) onto \(u_1\), finding \({x_2}_{u_1\parallel}\) and \({x_2}_{u_1\perp}\) as we have described in the previous sections. We won’t really use \({x_2}_{u_1\parallel}\) except to calculate its orthogonal component \({x_2}_{u_1\perp}\).</li>
<li>
<p>Designate \(u_2={x_2}_{u_1\perp}\). Because of the way we have constructed \(u_2\), \(u_2\) is orthogonal to \(u_1\). <strong>We now have two orthogonal basis vectors, \(u_1\), \(u_2\).</strong> Normalise them to unit vectors as needed. Computationally, \(u_2\) looks like this:</p>
\[u_2=x_2-{u_1}^Tx_{2}u_1\]
</li>
<li>Now let us project \(x_3\) onto \(u_1\) and \(u_2\) to get \(({x_3}_{u_1\parallel}, {x_3}_{u_2\parallel})\). Calculate \({x_3}_{u_1,u_2\perp}=x_3-{x_3}_{u_1\parallel}-{x_3}_{u_2\parallel}\).</li>
<li>
<p>Designate \(u_3={x_3}_{u_1,u_2\perp}\). We now have three orthogonal basis vectors, \(u_1\), \(u_2\), \(u_3\). Normalise them to unit vectors as needed. Computationally, \(u_3\) looks like this:</p>
\[u_3=x_3-{u_1}^Tx_{3}u_1-{u_2}^Tx_{3}u_2\]
</li>
<li><strong>Repeat the above procedure for all the remaining vectors up to \(x_m\).</strong> At the end, we will have \(m\) orthogonal basis vectors \((u_1, u_2, ..., u_m)\) which will span the same vector subspace \(V_m\).</li>
</ul>
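<p>The steps above are compact enough to sketch in code. This is an illustrative NumPy implementation of the classical procedure (the input vectors are arbitrary choices), normalising each \(u_k\) to a unit vector as it is produced:</p>

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalise the rows of X via the classical Gram-Schmidt procedure."""
    basis = []
    for x in X:
        # Subtract the projections onto all previously extracted basis vectors:
        # u_{k+1} = x_{k+1} - sum_i (u_i^T x_{k+1}) u_i
        u = x - sum((b @ x) * b for b in basis)
        basis.append(u / np.linalg.norm(u))  # normalise to a unit vector
    return np.array(basis)

# Three linearly independent (but non-orthogonal) vectors in R^3
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
U = gram_schmidt(X)

# U U^T = I: the extracted vectors form an orthonormal basis
assert np.allclose(U @ U.T, np.eye(3))
```

<p>Note that this classical formulation can lose orthogonality in floating point for ill-conditioned inputs; the modified Gram-Schmidt variant (subtracting each projection from a running residual) behaves better numerically.</p>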
<p><img src="/assets/images/gram-schmidt-orthogonalisation.png" alt="Gram-Schmidt Orthogonalisation" /></p>
<p>You will notice that at every stage of this procedure, the next orthogonal basis vector to be computed is given by the following general identity:</p>
\[u_{k+1}=x_{k+1}-\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i\]
<p>It is very easy to see that at every step, <strong>the latest basis vector is orthogonal to every other previously-generated basis vector</strong>. To see this, take the dot product on both sides with an arbitrary \(u_j\), such that \(j\leq k\).</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-\sum_{i=1}^{k}{u_j}^T\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}u_i \\
={u_j}^Tx_{k+1}-\sum_{i=1}^{k}\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}{u_j}^Tu_i\]
<p>Because of the way we have constructed the previous orthogonal basis vectors, we have \({u_j}^Tu_i=0\) for all \(j\neq i\), and \({u_j}^Tu_i=1\) for \(j=i\) (assuming unit basis vectors). Thus, the above identity becomes:</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-{u_j}^Tx_{k+1}=0\]
<h2 id="proof-of-gram-schmidt-orthogonalisation">Proof of Gram-Schmidt Orthogonalisation</h2>
<p>A very valid question is: <strong>why does the basis from the Gram-Schmidt procedure span the same vector subspace as the one spanned by the original non-orthogonal basis?</strong></p>
<p>The proof should make this clear; most of it follows almost directly from the procedure itself; we only need to fill in a few gaps, and formalise the presentation.</p>
<p>Given a set of \(m\) <strong>linearly independent vectors</strong> \((x_1, x_2, x_3, ..., x_m)\) in \(\mathbb{R}^n\) spanning an \(m\)-dimensional vector subspace \(V\subseteq\mathbb{R}^n\), there exists an <strong>orthogonal basis</strong> \((u_1, u_2, u_3, ..., u_m)\) which spans the same vector subspace \(V\).</p>
<p>We prove this by induction.</p>
<h3 id="1-proof-for-n1">1. Proof for \(k=1\)</h3>
<p><strong>Let us validate the hypothesis for \(k=1\).</strong> For \(x_1\), if we take \(u_1=\frac{x_1}{\|x_1\|}\), we can see that \(u_1\) spans the same vector subspace as \(x_1\), since it’s merely a scaled version of \(x_1\).</p>
<h3 id="2-proof-for-nk1">2. Proof for \(k+1\)</h3>
<p>Let us now assume that the above statement holds for some \(k<m\), i.e., there are \(k\) orthogonal basis vectors \((u_1, u_2, u_3, ..., u_k)\) which span the same \(k\)-dimensional vector subspace as the set \((x_1, x_2, x_3, ..., x_k)\).</p>
<p>Now, consider the construction of the \((k+1)\)th orthogonal basis vector \(u_{k+1}\) like so:</p>
\[u_{k+1}=x_{k+1}-\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i\]
<p>It is very easy to see that at every step, <strong>the latest basis vector is orthogonal to every other previously-generated basis vector</strong>. To see this, take the dot product on both sides with an arbitrary \(u_j\), such that \(j\leq k\).</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-\sum_{i=1}^{k}{u_j}^T\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}u_i \\
={u_j}^Tx_{k+1}-\sum_{i=1}^{k}\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}{u_j}^Tu_i\]
<p>Because of the way we have constructed the previous orthogonal basis vectors, we have \({u_j}^Tu_i=0\) for all \(j\neq i\), and \({u_j}^Tu_i=1\) for \(j=i\) (assuming unit basis vectors). Thus, the above identity becomes:</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-{u_j}^Tx_{k+1}=0\]
<p>Thus, the newly constructed basis vector is orthogonal to every basis vector in \((u_1, u_2, u_3, ..., u_k)\). Moreover, since \(u_{k+1}\) is a linear combination of \(x_{k+1}\) and \((u_1, u_2, ..., u_k)\), and the latter span the same subspace as \((x_1, x_2, ..., x_k)\), the set \((u_1, ..., u_{k+1})\) spans the same subspace as \((x_1, ..., x_{k+1})\). This completes the induction step of the proof.</p>
<h3 id="3-proof-that-u_k1neq-0">3. Proof that \(u_{k+1}\neq 0\)</h3>
<p>We also prove that <strong>the newly-constructed basis vector is not a zero vector</strong>. For that, let us assume that \(u_{k+1}=0\). Then, we get:</p>
\[x_{k+1}-\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i=0 \\
x_{k+1}=\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i\]
<p>This implies that \(x_{k+1}\) is expressible as a linear combination of the set of vectors \((u_1, u_2, u_3, ..., u_k)\). But we have also assumed that this set spans the same vector subspace as \((x_1, x_2, x_3, ..., x_k)\).</p>
<p>This implies that \(x_{k+1}\) is expressible as a linear combination of the set \((x_1, x_2, x_3, ..., x_k)\), which is a <strong>contradiction</strong>, since the vectors in the full set \((x_1, x_2, x_3, ..., x_m)\) are linearly independent. Thus, \(u_{k+1}\) cannot be zero.</p>
\[\blacksquare\]

<p><strong>Real Analysis Proofs</strong> (2021-05-18, by avishek)</p>

<p>Since I’m currently self-studying <strong>Real Analysis</strong>, I’ll be listing down proofs I either initially had trouble understanding, or enjoyed proving, here. These are very mathematical posts, and are for personal documentation, mostly.</p>
<h2 id="recursive-definitions">Recursive Definitions</h2>
<p>Source: <strong>Analysis 1</strong> by <em>Terence Tao</em></p>
<h2 id="definitions">Definitions</h2>
<ul>
<li>A natural number is any element in the set \(\mathbb{N}:=\{0,1,2,3,...\}\).</li>
</ul>
<h2 id="peano-axioms-used">Peano Axioms Used</h2>
<ol>
<li>\(0\) is a natural number.</li>
<li>If \(n\) is a natural number, \(\mathbf{n++}\) is also a natural number.</li>
<li>\(0\) is not the successor to any natural number, i.e., \(n++ \neq 0, \forall n\in\mathbb{N}\).</li>
<li>Different natural numbers must have different successors. If \(m\neq n\), then \(m++ \neq n++\). Conversely, if \(m++ \neq n++\), then \(m=n\).</li>
</ol>
<h2 id="proposition">Proposition</h2>
<p>Suppose there exists a function \(f_n:\mathbb{N}\rightarrow\mathbb{N}\) for each natural number \(n\). Let \(c\in\mathbb{N}\). Then we can assign a unique natural number \(a_n\) to each natural number \(n\), such that \(a_0=c\), and \(a_{n++}=f_n(a_n)\ \forall n\in\mathbb{N}\).</p>
<h3 id="proof-by-induction">Proof by Induction</h3>
<p><strong>For zero</strong></p>
<p>Let \(0\) be assigned \(a_0=c\).
Then, \(a_{0++}=f_0(a_0)\). Since \(0\) is never the successor of any natural number by Axiom (3), \(a_0\) will not recur as \(a_{0++}\).</p>
<p><strong>For \(n\)</strong></p>
<p>From Axiom (4), we can infer that:</p>
\[n++\neq n,n-1,n-2,...,1,0 \\
\Rightarrow a_{n++}\neq a_n,a_{n-1},a_{n-2},...,a_1,a_0\]
<p>Thus, \(a_{n++}\) is unique in the set \(\{a_0,a_1,a_2,...,a_n,a_{n++}\}\).</p>
<p><strong>For \(n++\)</strong></p>
<p>By extension, for \((n++)++\), we can write:</p>
\[(n++)++\neq n++,n,n-1,n-2,...,1,0 \\
\Rightarrow a_{(n++)++}\neq a_{n++},a_n,a_{n-1},a_{n-2},...,a_1,a_0\]
<p>Thus, \(a_{(n++)++}\) is unique in the set \(\{a_0,a_1,a_2,...,a_n,a_{n++},a_{(n++)++}\}\). Thus, we can assign a unique natural number \(a_{(n++)++}\) such that \(a_{(n++)++}=f_{n++}(a_{n++})\).</p>
\[\blacksquare\]
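<p>The construction the proposition guarantees is exactly the familiar recursive definition of a sequence. A minimal sketch, with a hypothetical choice of \(c\) and \(f_n\):</p>

```python
def recursive_sequence(c, f, n_max):
    """Build a_0 = c, a_{n++} = f_n(a_n) for n = 0 .. n_max."""
    a = [c]
    for n in range(n_max):
        a.append(f(n, a[n]))  # a_{n++} = f_n(a_n)
    return a

# Hypothetical example: c = 1, f_n(a) = a + n + 1 yields 1, 2, 4, 7, 11
seq = recursive_sequence(1, lambda n, a: a + n + 1, 4)
assert seq == [1, 2, 4, 7, 11]
assert len(set(seq)) == len(seq)  # each a_n is assigned uniquely here
```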
<h2 id="proof-of-existence-of-real-cube-roots">Proof of Existence of Real Cube Roots</h2>
<p>Let \(r\in\mathbb{R}\), \(r\geq 0\). (For negative \(r\), we can take the negative of the cube root of \(|r|\).)
For the case of \(r=0\), the cube root is \(0\).</p>
<p>Consider the set \(\mathbb{S}=\{x\in\mathbb{R}:x^3\leq r\}\).</p>
<p>This set is non-empty because \(0\in\mathbb{S}\). It is also bounded above by \(r+1\), because \((r+1)^3=r^3+3r^2+3r+1>r\).</p>
<p>Therefore, by the <strong>Completeness Axiom</strong>, \(\mathbb{S}\) has a least upper bound. Denote this least upper bound by \(x\).</p>
<p>By the <strong>Trichotomy property</strong>, these are the possible cases:</p>
<ul>
<li><strong>Case 1</strong>: \(x^3<r\)</li>
<li><strong>Case 2</strong>: \(x^3>r\)</li>
<li><strong>Case 3</strong>: \(x^3=r\).</li>
</ul>
<p><strong>Case 1</strong>: Assume that \(\mathbf{x^3<r}\).</p>
<p>Then, by our definition of \(\mathbb{S}\), <strong>\(x\in\mathbb{S}\) and is its least upper bound</strong>, i.e., <strong>there are no elements in \(\mathbb{S}\) which are greater than \(x\)</strong>.</p>
<p>If the cube of the least upper bound \(x\) is less than \(r\), then it is enough to show that there exists an \(x+\delta\), with \(\delta>0\), whose cube is also less than \(r\).</p>
<p>Assume that \(0<\delta<1\). A \(\delta>1\) could also work, but allowing it would only restrict the choice of upper bounds we have to play about with.</p>
<p>Then, we’d like to find a \(0<\delta<1\) such that \((x+\delta)^3<r\). This gives us:</p>
\[(x+\delta)^3<r \\
x^3+\delta^3+3x^2\delta+3x\delta^2<r \\
(x^3-r)+\delta^3+3x^2\delta+3x\delta^2<0\]
<p>Since \(0<\delta<1\), we have \(\delta^3<\delta\), and (as \(x>0\) when \(r>0\)) \(3x\delta^2<3x\delta\); thus we can say:</p>
\[(x^3-r)+\delta^3+3x^2\delta+3x\delta^2<(x^3-r)+\delta+3x^2\delta+3x\delta\]
<p>Then, it is enough to prove that:</p>
\[(x^3-r)+\delta+3x^2\delta+3x\delta<0\]
<p>With some algebraic manipulation, we get:</p>
\[(x^3-r)+\delta+3x^2\delta+3x\delta<0 \\
\Rightarrow \delta(1+3x^2+3x)<r-x^3 \\
\Rightarrow \delta<\frac{r-x^3}{1+3x^2+3x}\]
<p>If we assume \(\delta=\frac{1}{k}:k\in\mathbb{N}\), then we can say:</p>
\[k>\frac{1+3x^2+3x}{r-x^3}: k\in\mathbb{N}\]
<p>Since the <strong>Archimedean property</strong> states that the natural numbers have no upper bound, such a \(k\) must exist.
This means we have proven that there is a \(k\in\mathbb{N}\) for which there exists a number \((x+\frac{1}{k})\), larger than \(x\), such that \((x+\frac{1}{k})^3<r\). <strong>This implies that \((x+\frac{1}{k})\) exists in \(\mathbb{S}\)</strong>. However, this contradicts our assumption that no element greater than \(x\) exists in \(\mathbb{S}\).</p>
<p><strong>Thus, the statement \(x^3<r\) is false.</strong></p>
<p><strong>Case 2</strong>: Assume that \(\mathbf{x^3>r}\).</p>
<p>If the cube of the least upper bound \(x\) is greater than \(r\), then it is enough to show that there exists an \(x-\delta\), with \(\delta>0\), whose cube is also greater than \(r\).</p>
<p>Assume that \(0<\delta<1\). A \(\delta>1\) could also work, but allowing it would only restrict the choice of upper bounds we have to play about with.</p>
<p>Then, we’d like to find a \(0<\delta<1\) such that \((x-\delta)^3>r\). This gives us:</p>
\[(x-\delta)^3>r \\
x^3-\delta^3-3x^2\delta+3x\delta^2>r \\
(x^3-r)-\delta^3-3x^2\delta+3x\delta^2>0\]
<p>Again note that since \(\delta^3<\delta\), and \(3x\delta^2\) is positive, we can write:</p>
\[(x^3-r)-\delta^3-3x^2\delta+3x\delta^2>(x^3-r)-\delta-3x^2\delta\]
<p>Thus it is enough to prove that:</p>
\[(x^3-r)-\delta-3x^2\delta>0\]
<p>Some algebraic manipulation gives us:</p>
\[(1+3x^2)\delta<x^3-r \\
\Rightarrow \delta<\frac{x^3-r}{1+3x^2}\]
<p>If we assume \(\delta=\frac{1}{k}:k\in\mathbb{N}\), then we can say:</p>
\[k>\frac{1+3x^2}{x^3-r}: k\in\mathbb{N}\]
<p>Since the <strong>Archimedean property</strong> states that the natural numbers have no upper bound, such a \(k\) must exist.
This means we have proven that there is a \(k\in\mathbb{N}\) for which there exists a number \((x-\frac{1}{k})\), smaller than \(x\), such that \((x-\frac{1}{k})^3>r\). <strong>Thus, \((x-\frac{1}{k})\) is an upper bound for \(\mathbb{S}\)</strong>; however, this contradicts our assumption that \(x\) is the least upper bound.</p>
<p><strong>Thus, the statement \(x^3>r\) is false.</strong></p>
<p>Thus, the only possibility is that <strong>Case 3</strong> is true, i.e., \(x^3=r\), thus implying the existence of real cube roots of real numbers.</p>
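<p>As a final sanity check, the least-upper-bound argument has a direct numerical analogue: bisecting on membership of \(\mathbb{S}=\{x:x^3\leq r\}\) converges to the cube root. A minimal sketch (the bracket \([0, r+1]\) uses the bound established above):</p>

```python
def cube_root(r, iterations=60):
    """Approximate the least upper bound of {x : x^3 <= r} for r >= 0 by bisection."""
    lo, hi = 0.0, r + 1.0  # r + 1 is an upper bound, since (r+1)^3 > r
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid ** 3 <= r:
            lo = mid  # mid is in the set; the supremum lies at or above it
        else:
            hi = mid  # mid is an upper bound for the set
    return lo

assert abs(cube_root(27.0) - 3.0) < 1e-9
assert cube_root(0.0) == 0.0
```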
\[\blacksquare\]

<p><strong>Support Vector Machines from First Principles: Linear SVMs</strong> (2021-05-10, by avishek)</p>

<p>We have looked at <strong>Lagrangian Multipliers</strong> and how they help build constraints into the function that we wish to optimise. Their relevance to <strong>Support Vector Machines</strong> lies in how the constraints on the classifier margin (i.e., the supporting hyperplanes) are incorporated into the search for the <strong>optimal hyperplane</strong>.</p>
<p>We introduced the first part of the problem in <a href="/2021/04/14/support-vector-machines-derivations.html">Support Vector Machines from First Principles: Part One</a>. We then took a detour through <strong>Vector Calculus</strong> and <strong>Constrained Quadratic Optimisation</strong> to build our mathematical understanding for the succeeding analysis.</p>
<p>We will now derive the analytical form of the Support Vector Machine variables in this post. This article will only discuss <strong>Linear Support Vector Machines</strong>, which apply to a <strong>linearly separable data set</strong>. <strong>Non-Linear Support Vector Machines</strong> will be discussed in an upcoming article.</p>
<p>The necessary background material for understanding this article is covered in the following articles:</p>
<ul>
<li><a href="/2021/04/14/support-vector-machines-derivations.html">Support Vector Machines from First Principles: Part One</a></li>
<li>Vector Calculus Background
<ul>
<li><a href="/2021/04/20/vector-calculus-simple-manifolds.html">Vector Calculus: Graphs, Level Sets, and Constraint Manifolds</a></li>
<li><a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers</a></li>
<li><a href="/2021/04/29/inverse-function-theorem-implicit-function-theorem.html">Vector Calculus: Implicit Function Theorem and Inverse Function Theorem</a> (<strong>Note</strong>: This covers more theoretical background)</li>
</ul>
</li>
<li>Quadratic Form and Motivating Problem
<ul>
<li><a href="/2021/04/19/quadratic-form-optimisation-pca-motivation-part-one.html">Quadratic Optimisation: PCA as Motivation</a></li>
<li><a href="/2021/04/28/quadratic-optimisation-pca-lagrange-multipliers.html">Conclusion to Quadratic Optimisation: PCA as Motivation</a></li>
</ul>
</li>
<li>Quadratic Optimisation
<ul>
<li><a href="/2021/05/08/quadratic-optimisation-theory.html">Quadratic Optimisation: Mathematical Background</a></li>
<li><a href="/2021/05/10/quadratic-form-optimisation-kkt.html">Quadratic Optimisation: Karush-Kuhn-Tucker Conditions</a></li>
</ul>
</li>
</ul>
<p>Before we proceed with the calculations, I’ll restate the original problem again.</p>
<h2 id="support-vector-machine-problem-statement">Support Vector Machine Problem Statement</h2>
<p>For a set of data \(x_i, i\in[1,N]\), if we assume that data is divided into two classes (-1,+1), we can write the constraint equations as:</p>
\[\mathbf{m_{max}=max \frac{2k}{\|N\|}}\]
<p>subject to the following constraints:</p>
\[\mathbf{
N^Tx_i\geq b+k, \forall x_i|y_i=+1 \\
N^Tx_i\leq b-k, \forall x_i|y_i=-1
}\]
<p><img src="/assets/images/svm-supporting-hyperplanes.png" alt="SVM Support Hyperplanes" /></p>
<p>We are also given a set of training examples \(x_i, i=1,2,...,n\) which are already labelled either <strong>+1</strong> or <strong>-1</strong>. <strong>The important assumption here is that these training data points are linearly separable</strong>, i.e., there exists a hyperplane which divides the two categories, such that no point is misclassified. Our task is to find this hyperplane with the maximum possible margin, which will be defined by its <strong>supporting hyperplanes</strong>.</p>
<h2 id="restatement-of-the-support-vector-machine-problem-statement">Restatement of the Support Vector Machine Problem Statement</h2>
<p>Remembering the standard form of a <strong>Quadratic Programming</strong> problem, we want the objective function to be a minimisation problem, as well as a quadratic problem.</p>
<p>Furthermore, we’d like to set the constant \(k=1\), and rewrite \(N\) as \(w\). Thus, the objective function may be rewritten as:</p>
\[\mathbf{min f(x)=\frac{w^Tw}{2}}\]
<p>since maximising \(\frac{2}{\|w\|}\) is equivalent to minimising \(\|w\|\), and squaring (and halving) \(\|w\|\) does not change the minimiser.</p>
<p>We have two constraints; we’d like to rewrite them in the form \(g(x)\leq 0\). Thus, we get:</p>
\[-(w^Tx_i-b)+1\leq 0, \forall x_i|y_i=+1\\
w^Tx_i-b+1\leq 0, \forall x_i|y_i=-1\]
<p>You will notice that they differ only in the sign of \((w^Tx_i-b)\), which is dependent on the reverse sign of \(y_i\). We can collapse these two inequalities into a single one by using \(y_i\) as a determinant of the sign.</p>
\[g_i(x)=-y_i(w^Tx_i-b)+1\leq 0, \forall x_i|y_i\in\{-1,+1\}\]
<p>The <strong>Lagrangian</strong> then is:</p>
\[\mathbf{
L(w,\lambda,b)=f(x)+\lambda_i g_i(x)} \hspace{15mm}\text{(Standard Lagrangian Form)}\\
L(w,\lambda,b)=\frac{w^Tw}{2}+\sum_{i=1}^n\lambda_i [-y_i(w^Tx_i-b)+1] \\
\mathbf{
L(w,\lambda,b)=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i [y_i(w^Tx_i-b)-1]
}\]
<p>for all \(x_i\) such that \(\lambda_i\geq 0\), \(g_i(x)\leq 0\), and \(y_i\in\{-1,+1\}\).</p>
<p>We have already assumed the <strong>Primal and Dual Feasibility Conditions</strong> above. The <strong>Dual Optimisation Problem</strong> is then:</p>
\[\text{max}_\lambda\hspace{4mm}\text{min}_{w,b} \hspace{4mm} L(w,\lambda,b)\]
\[\begin{equation}
\text{max}_\lambda\hspace{4mm}\text{min}_{w,b} \hspace{4mm} \frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i [y_i(w^Tx_i-b)-1] \label{eq:lagrangian}
\end{equation}\]
<p>Note that the only constraints that will be activated will be the ones which are for points lying on the supporting hyperplanes.</p>
<h2 id="the-support-vector-machine-solution">The Support Vector Machine Solution</h2>
<p>We have three variables in the Lagrangian Dual: \((w,b,\lambda)\). We will now solve for each of them in turn.</p>
<h3 id="1-solving-for-wast">1. Solving for \(w^\ast\)</h3>
<p>Let’s see what the KKT Stationarity Condition gives us.</p>
\[\frac{\partial L}{\partial w}=w-\sum_{i=1}^n \lambda_ix_iy_i\]
<p>Setting this partial differential to zero, we get:</p>
\[\begin{equation}
\mathbf{
w^\ast=\sum_{i=1}^n \lambda_ix_iy_i \label{eq:weight}
}
\end{equation}\]
<p>where \(w^\ast\) denotes the optimal solution for \(w\).</p>
<h3 id="2-solving-for-bast">2. Solving for \(b^\ast\)</h3>
<p>Differentiating with respect to \(b\), and setting it to zero, we get:</p>
\[\frac{\partial L}{\partial b}=0 \\
\Rightarrow \begin{equation}
\sum_{i=1}^n \lambda_iy_i=0 \label{eq:b-constraint}
\end{equation}\]
<p>This doesn’t give us an expression for \(b\), but it does give us a specific condition that must be satisfied by the Lagrange multipliers at the optimal solution.</p>
<p>Let us make the following observations:</p>
<ul>
<li>We already know \(w^\ast\). Thus, we know the <strong>separating hyperplane through the origin</strong>, though we do not know \(b\). In two dimensions, this would be the equivalent of the y-intercept.</li>
<li>For the points labelled \(+1\), the <strong>minimum value</strong> you get by plugging \(x_i\) into \(\mathbf{w^\ast x}\) is definitely a point on the (as yet undetermined) <strong>positive supporting hyperplane \(H^+\)</strong>. You can have multiple points which achieve this minimum value; all of those points lie on \(H^+\), which is obviously parallel to \(f(x)=w^\ast x\).</li>
<li>For the points labelled \(-1\), the <strong>maximum value</strong> you get by plugging \(x_i\) into \(\mathbf{w^\ast x}\) is definitely a point on the (as yet undetermined) <strong>negative supporting hyperplane \(H^-\)</strong>. You can have multiple points which achieve this maximum value; all of those points lie on \(H^-\), which is obviously parallel to \(f(x)=w^\ast x\).</li>
</ul>
<p>Therefore, we may find \(b^+\) and \(b^-\) by finding:</p>
<ul>
<li>\(H^+\) is the hyperplane with “slope” \(w^\ast\) and passing through the point \(x^+\) which gives the minimum value (positive or negative) for \(f(x)=w^\ast x\). There may be multiple points like \(x^+\); pick any one. \(H^+\) will have y-intercept \(b^+\).</li>
<li>\(H^-\) is the hyperplane with “slope” \(w^\ast\) and passing through the point \(x^-\) which gives the maximum value (positive or negative) for \(f(x)=w^\ast x\). There may be multiple points like \(x^-\); pick any one. \(H^-\) will have y-intercept \(b^-\).</li>
</ul>
<p><strong>\(H^+\) and \(H^-\) are the supporting hyperplanes.</strong> The situation is shown below.</p>
<p><img src="/assets/images/svm-solving-y-intercept.png" alt="Solving for Primal and Dual SVM Variables" /></p>
<p>We already saw in <a href="/2021/04/14/support-vector-machines-derivations.html">Support Vector Machines from First Principles: Part One</a> that the separating hyperplane \(H_0\) lies midway between \(H^+\) and \(H^-\), implying that \(b^\ast\) is the mean of \(b^+\) and \(b^-\). Thus, we get:</p>
\[\begin{equation}
\mathbf{
b^\ast=\frac{b^++b^-}{2} \label{eq:b}
}
\end{equation}\]
<h3 id="3-solving-for-lambdaast">3. Solving for \(\lambda^\ast\)</h3>
<p>Let us simplify \(\eqref{eq:lagrangian}\) in light of these new identities. We write:</p>
\[L(\lambda,w^\ast,b^\ast)=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i [y_i(w^Tx_i-b)-1] \\
=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i y_i w^Tx_i+ \sum_{i=1}^n\lambda_i y_ib + \sum_{i=1}^n\lambda_i\]
<p>The term \(\sum_{i=1}^n\lambda_i y_ib\) vanishes because of \(\eqref{eq:b-constraint}\), so we get:</p>
\[L(\lambda,w^\ast,b^\ast)=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i y_i w^Tx_i + \sum_{i=1}^n\lambda_i\]
<p>Applying the identity \(\eqref{eq:weight}\) to this result, we get:</p>
\[L(\lambda,w^\ast,b^\ast)=\frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j - \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j + \sum_{i=1}^n \lambda_i \\
\mathbf{
L(\lambda,w^\ast,b^\ast)=\sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j
}\]
<p>Thus, \(\lambda_i\) can be solved for by maximising \(L(\lambda,w^\ast,b^\ast)\) over \(\lambda\), that is:</p>
\[\lambda^\ast=\text{argsup}_\lambda L(\lambda,w^\ast,b^\ast) \\
\mathbf{
\lambda^\ast=\text{argsup}_\lambda \left[\sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j\right]
}\]
<h3 id="solving-for-bast-a-shortcut">Solving for \(b^\ast\): A Shortcut</h3>
<p>I noted that to find \(b^+\) and \(b^-\), we needed to find respectively, the minimum and maximum values from each category applied to the candidate separating hyperplane \(f(x)=w^\ast x\). As it turns out, we do not need to look through all the points.</p>
<p>Recall that the support vectors are the ones which define the constraints in the form of supporting hyperplanes. Also, recall from our discussion on the Lagrangian Dual that the constraints are only activated for \(g(x)=0\), i.e., the Lagrange multipliers for those points are the only nonzero multipliers; all other constraints have their Lagrange multipliers as zero.</p>
<p>This means that if we have already computed the <strong>Lagrange multipliers</strong>, we only need to search through the <strong>points which have nonzero Lagrange multipliers</strong> to find \(b^+\) and \(b^-\). We do not need to find the maximum and minimum values, and the number of points we need to look at, is vastly reduced, presumably because most of the data points will be inside the halfspaces proper, and not exactly on the supporting hyperplanes \(H^+\) and \(H^-\).</p>
<h3 id="summary">Summary</h3>
<p>Note that at the end of our calculation, we will have arrived at (\(\lambda^\ast\), \(w^\ast\), \(b^\ast\)) as the optimal solution for the Lagrangian. Recall that by our <strong>assumptions of Quadratic Optimisation</strong>, this <strong>Lagrangian is a concave-convex function</strong>, and thus the primal and the dual optimum solutions coincide (<strong>no duality gap</strong>). In effect, this is the same solution that we’d have gotten if we’d solved the original optimisation problem.</p>
<p>Once the training has completed, categorising a new point from a test set, is done simply by finding:</p>
\[y_t=sgn[w^\ast x_t-b^\ast]\]
<p>Summarising, the expressions for the <strong>optimal Primal and Dual variables</strong> are:</p>
\[\mathbf{
w^\ast=\sum_{i=1}^n \lambda_ix_iy_i \\
b^\ast=\frac{b^++b^-}{2} \\
\lambda^\ast=\text{argsup}_\lambda \left[\sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j\right]
}\]
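<p>These expressions can be exercised end-to-end on a toy problem. The sketch below uses a hypothetical two-point, one-dimensional dataset and maximises the dual by naive projected gradient ascent; a production SVM solver would use a proper QP routine (e.g., SMO) instead:</p>

```python
import numpy as np

# Hypothetical toy data: two 1-D points, linearly separable
X = np.array([[0.0], [2.0]])
y = np.array([-1.0, 1.0])

# Dual objective: sum(l) - (1/2) sum_ij l_i l_j y_i y_j (x_i . x_j)
K = (X @ X.T) * np.outer(y, y)

# Naive projected gradient ascent on the dual
l = np.zeros(len(y))
for _ in range(5000):
    l += 0.01 * (1.0 - K @ l)   # gradient step on the dual objective
    l -= y * (l @ y) / (y @ y)  # project back onto sum_i l_i y_i = 0
    l = np.maximum(l, 0.0)      # enforce dual feasibility l_i >= 0

w = (l * y) @ X                 # w* = sum_i l_i x_i y_i
sv = l > 1e-6                   # support vectors have nonzero multipliers
b_plus  = (X[(y > 0) & sv] @ w).min() - 1.0  # w.x - b = +1 on H+
b_minus = (X[(y < 0) & sv] @ w).max() + 1.0  # w.x - b = -1 on H-
b = (b_plus + b_minus) / 2.0    # b* is the midpoint

# Classify new points: y_t = sgn(w* x_t - b*)
assert np.sign(w @ np.array([3.0]) - b) == 1.0
assert np.sign(w @ np.array([-1.0]) - b) == -1.0
```

<p>For this particular dataset the solution can be checked by hand: \(\lambda_1=\lambda_2=\frac{1}{2}\), \(w^\ast=1\), \(b^\ast=1\).</p>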
<h2 id="relationship-with-the-perceptron">Relationship with the Perceptron</h2>
<p>The <strong>Perceptron</strong> is a much simpler version of a Support Vector Machine. I’ll cover the Perceptron in its own article, but simply put: the perceptron also attempts to create a linear discriminant hyperplane between two classes of data, with the purpose of classifying new data points into either one of these categories.</p>
<p>The form of the solution for the perceptron is also a hyperplane of the form \(f(x)=wx-b\). The perceptron may be trained sequentially, or batchwise, but regardless of the training sequence, the <strong>final adjustment that is applied to \(w\) in the hyperplane solution is proportional to \(\sum_{i=1}^n \eta x_iy_i\)</strong>. This is very similar to the identity \(w^\ast=\sum_{i=1}^n \lambda_ix_iy_i\) which we derived in \(\eqref{eq:weight}\).</p>
<p>However, since the <strong>Perceptron</strong> does not attempt to maximise the margin between the two categories, the <strong>separating hyperplane may perform well on the training set</strong>, but might end up arbitrarily close to the support vector in either category, thus <strong>increasing the risk of misclassification of new test points in that category, which lie close to the support vector</strong>.</p>

<p><strong>Quadratic Optimisation: Lagrangian Dual, and the Karush-Kuhn-Tucker Conditions</strong> (2021-05-10, by avishek)</p>

<p>This article concludes the (very abbreviated) theoretical background required to understand <strong>Quadratic Optimisation</strong>. Here, we extend the <strong>Lagrangian Multipliers</strong> approach, which in its current form, admits only equality constraints. We will extend it to allow constraints which can be expressed as inequalities.</p>
<p>Much of this discussion applies to the general class of <strong>Convex Optimisation</strong>; however, I will be constraining the form of the problem slightly to simplify the discussion. We have already developed most of the basic mathematical results needed to fully appreciate the implications of the <strong>Karush-Kuhn-Tucker Theorem</strong> (see <a href="/2021/05/08/quadratic-optimisation-theory.html">Quadratic Optimisation Concepts</a>).</p>
<p><strong>Convex Optimisation</strong> solves problems framed using the following standard form:</p>
<p>Minimise (with respect to \(x\)), \(\mathbf{f(x)}\)</p>
<p>subject to:</p>
<p>\(\mathbf{g_i(x)\leq 0, i=1,...,n}\) <br />
\(\mathbf{h_i(x)=0, i=1,...,m}\)</p>
<p>where:</p>
<ul>
<li>\(\mathbf{f(x)}\) is a <strong>convex</strong> function</li>
<li>\(\mathbf{g_i(x)}\) are <strong>convex</strong> functions</li>
<li>\(\mathbf{h_i(x)}\) are <strong>affine</strong> functions.</li>
</ul>
<p>For <strong>Quadratic Optimisation</strong>, the extra constraint imposed is that the \(g_i(x)\) are also affine functions. Therefore, all of our constraints are essentially linear.</p>
<p>For this discussion, I’ll omit the equality constraints \(h_i(x)\) for clarity; any <strong>equality constraints can always be converted into inequality constraints</strong>, and become part of \(g_i(x)\).</p>
<p>Thus, this is the reframing of the <strong>Quadratic Optimisation</strong> problem for the purposes of this discussion.</p>
<p>Minimise (with respect to \(x\)), \(\mathbf{f(x)}\)</p>
<p>subject to: \(\mathbf{g_i(x)\leq 0, i=1,...,n}\)</p>
<p>where:</p>
<ul>
<li>\(\mathbf{f(x)}\) is a <strong>convex function</strong></li>
<li>\(\mathbf{g_i(x)}\) are <strong>affine functions</strong></li>
</ul>
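<p>Before proceeding, it can help to see this standard form solved mechanically. Below is a quick sketch using SciPy’s SLSQP solver on a toy problem of my own choosing (a convex quadratic objective with one affine inequality constraint) — not an example from any particular textbook. Note that SciPy’s <code>ineq</code> convention is \(fun(x)\geq 0\), the opposite of our \(g_i(x)\leq 0\) convention, so we pass \(-g\).</p>

```python
import numpy as np
from scipy.optimize import minimize

# Toy quadratic programme in the standard form above:
#   minimise f(x) = x1^2 + x2^2          (convex quadratic)
#   subject to g(x) = x1 + x2 + 1 <= 0   (affine)
f = lambda x: x[0]**2 + x[1]**2
g = lambda x: x[0] + x[1] + 1

# SciPy's 'ineq' convention is fun(x) >= 0, so we negate g.
result = minimize(f, x0=np.array([1.0, 1.0]), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": lambda x: -g(x)}])

print(result.x)    # close to [-0.5, -0.5]
print(result.fun)  # close to 0.5
```

By symmetry, the minimiser sits on the boundary \(x_1+x_2=-1\) at \((-0.5,-0.5)\).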
<h2 id="karush-kuhn-tucker-stationarity-condition">Karush-Kuhn-Tucker Stationarity Condition</h2>
<p>We have already seen in <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers, Manifolds, and the Implicit Function Theorem</a> that the gradient vector of a function can be expressed as a <strong>linear combination of the gradient vectors</strong> of the constraint manifolds.</p>
\[\mathbf{
Df=\lambda_1 Dh_1(U,V)+\lambda_2 Dh_2(U,V)+\lambda_3 Dh_3(U,V)+...+\lambda_n Dh_n(U,V)
}\]
<p>We can rewrite this as:</p>
\[\mathbf{
Df(x)=\sum_{i=1}^n\lambda_i.Dg_i(x)
}\]
<p>where \(x=(U,V)\). We will not consider the pivotal and non-pivotal variables separately in this discussion.</p>
<p>In this original formulation, we expressed the gradient vector as a linear combination of the gradient vector(s) of the constraint manifold(s).
We can bring everything over to one side and flip the signs of the Lagrangian Multipliers to get the following:</p>
\[Df(x)+\sum_{i=1}^n\lambda_i.Dg_i(x)=0\]
<p>Since the derivatives in this case represent the gradient vectors, we can rewrite the above as:</p>
\[\mathbf{
\begin{equation}
\nabla f(x)+\sum_{i=1}^n\lambda_i.\nabla g_i(x)=0
\label{eq:kkt-1}
\end{equation}
}\]
<p>This expresses the fact that the <strong>gradient vector of the tangent space must be opposite (and obviously parallel) to the direction of the gradient vector of the objective function</strong>. All it really amounts to is a <strong>change in the sign</strong> of the multipliers \(\lambda_i\); we do this so that the <strong>Lagrange multiplier terms act as penalties</strong> when the constraints \(g_i(x)\) are violated. We will see this in action when we explore the properties of the Lagrangian in the next few sections.</p>
<p>The identity \(\eqref{eq:kkt-1}\) is the <strong>Stationarity Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</p>
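The Stationarity Condition is easy to verify numerically. Here is a small sketch on a hypothetical one-dimensional toy problem of my choosing — minimise \(f(x)={(x-1)}^2\) subject to \(g(x)=x\leq 0\) — whose constrained optimum sits on the boundary at \(x^*=0\):

```python
# Stationarity check for: minimise f(x) = (x - 1)^2 subject to g(x) = x <= 0.
# The constrained optimum is x* = 0 (hypothetical toy problem for illustration).
def df(x): return 2 * (x - 1)   # gradient of f
def dg(x): return 1.0           # gradient of g

x_star = 0.0
# Solve df(x*) + lambda * dg(x*) = 0 for lambda:
lam = -df(x_star) / dg(x_star)
print(lam)                            # 2.0 -- non-negative, as dual feasibility requires
print(df(x_star) + lam * dg(x_star))  # 0.0 -- stationarity holds
```

The multiplier comes out non-negative, as it must: the gradients of \(f\) and \(g\) point in opposite directions at the optimum.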
<h2 id="active-and-inactive-constraints">Active and Inactive Constraints</h2>
<p>In <strong>Quadratic Optimisation</strong>, \(g_i(x)|i=1,2,...,n\) represent the constraint functions. An important concept to build intuition about is the difference between dealing with pure equality constraints and inequality constraints.</p>
<p>The diagram below shows an example where all the constraints are equality constraints.</p>
<p><img src="/assets/images/optimisation-equality-constraints.png" alt="Equality Constraints" /></p>
<p>There are two points to note.</p>
<ul>
<li>All equality constraints are expressed in the form \(g_i(x)=0\) and they all must be satisfied simultaneously.</li>
<li><strong>All equality constraints, being affine, must be tangent to the objective function surface</strong>, since only then can the gradient vector of the solution be expressed as the Lagrangian combination of these tangent spaces.</li>
</ul>
<p>The situation changes when inequality constraints are involved. Here is another rough diagram to demonstrate. The y-coordinate represents the image of the objective function \(f(x)\). The x-coordinate represents the image of the constraint function \(g(x)\), i.e., the different values \(g(x)\) can take for different values of \(x\).</p>
<p>The equality condition in this case maps to the y-axis, since that corresponds to \(g(x)=0\). However, we’re dealing with inequality constraints here, namely, \(g(x) \leq 0\); thus the viable space of solutions for \(f(x)\) are all to the left of the y-axis.</p>
<p>As you can see, since \(g(x)\leq 0\), the solution is not required to touch the level set of the constraint manifold corresponding to zero. Such solutions might not be the optimal solutions (we will see why in a moment), but they are viable solutions nevertheless.</p>
<p>We now draw two example solution spaces with two different shapes.</p>
<p><img src="/assets/images/optimisation-active-constraint.png" alt="Active Constraint in Optimisation" /></p>
<p>In the first figure, the global minimum of \(f(x)\) violates the constraint, since it lies in the \(g(x)>0\) region. Thus, we cannot pick that; we must pick the minimum \(f(x)\) that does not violate the constraint \(g(x)\leq 0\). This point in the diagram lies on the y-axis, i.e., on \(g(x)=0\). The constraint \(g(x)\leq 0\) in this scenario is considered an <strong>active constraint</strong>.</p>
<p><img src="/assets/images/optimisation-inactive-constraint.png" alt="Inactive Constraint in Optimisation" /></p>
<p>Contrast this with the diagram above. Here, the shape of the solution space is different. The minimum \(f(x)\) lies within the \(g(x)\leq 0\) zone. This means that even if we minimise \(f(x)\) without regard to the constraint \(g(x)\leq 0\), we’ll still get the minimum solution which still satisfies the constraint. In this scenario, we call \(g(x)\leq 0\) an <strong>inactive constraint</strong>. This implies that in this scenario, we do not even need to consider the constraint \(g_i(x)\) as part of the objective function. As you will see, after we define the Lagrangian, this can be done by setting the corresponding Lagrangian multiplier to zero.</p>
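The active/inactive distinction can be seen directly with a solver. Below is a sketch on two hypothetical toy problems of my choosing, both sharing the constraint \(g(x)=x\leq 0\): one whose unconstrained minimum is infeasible (constraint becomes active) and one whose unconstrained minimum is already feasible (constraint stays inactive):

```python
from scipy.optimize import minimize

# Shared constraint g(x) = x <= 0; SciPy's 'ineq' convention wants fun(x) >= 0.
cons = [{"type": "ineq", "fun": lambda x: -x[0]}]

# Active: the unconstrained minimum of (x-1)^2 sits at x=1, in the g(x)>0
# region, so the solver is pushed onto the boundary g(x)=0.
active = minimize(lambda x: (x[0] - 1)**2, x0=[-2.0], method="SLSQP",
                  constraints=cons)
print(active.x)    # close to [0.0] -- constraint is active

# Inactive: the unconstrained minimum of (x+1)^2 sits at x=-1, already
# feasible, so the constraint never binds.
inactive = minimize(lambda x: (x[0] + 1)**2, x0=[-2.0], method="SLSQP",
                    constraints=cons)
print(inactive.x)  # close to [-1.0] -- constraint is inactive
```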
<h2 id="the-lagrangian">The Lagrangian</h2>
<p>We now have the machinery to explore the <strong>Lagrangian Dual</strong> in some detail. We will first consider the <strong>Lagrangian</strong> of a function. The Lagrangian form simply restates the Lagrange Multiplier form as a function \(L(x,\lambda)\), like so:</p>
\[L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\text{ such that }\lambda_i\geq 0 \text{ and } g_i(x)\leq 0\]
<p>Let us note these conditions from the above identity:</p>
\[\begin{equation}
\mathbf{
g_i(x)\leq 0 \label{eq:kkt-4}
}
\end{equation}\]
\[\begin{equation}
\mathbf{
\lambda_i\geq 0 \label{eq:kkt-3}
}
\end{equation}\]
<ul>
<li><strong>Primal Feasibility Condition</strong>: The inequality \(\eqref{eq:kkt-4}\) is the <strong>Primal Feasibility Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</li>
<li><strong>Dual Feasibility Condition</strong>: The inequality \(\eqref{eq:kkt-3}\) is the <strong>Dual Feasibility Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</li>
</ul>
<p>We have simply moved all the terms of the Lagrangian formulation onto one side and denoted it by \(L(x,\lambda)\), like we talked about when concluding the <strong>Stationarity Condition</strong>.</p>
<p>Note that differentiating with respect to \(x\) and setting the result to zero gets us back to the usual <strong>Vector Calculus</strong>-motivated definition, i.e.:</p>
\[D_xL=
\mathbf{
\nabla f+{[\nabla G]}^T\lambda
}\]
<p>where \(G\) represents \(n\) constraint functions, \(\lambda\) represents the \(n\) Lagrange multipliers, and \(f\) is the objective function.</p>
<h2 id="the-primal-optimisation-problem">The Primal Optimisation Problem</h2>
<p>We will now explore the properties of the Lagrangian, both analytically, as well as geometrically.</p>
<p>Remembering the definition of the supremum of a function, we find the supremum of the Lagrangian with respect to \(\lambda\) (that is, to find the supremum in each case, we vary the value of \(\lambda\)) to be the following:</p>
\[\text{sup}_\lambda L(x,\lambda)=\begin{cases}
f(x) & \text{if } g_i(x)\leq 0 \text{ }\forall i \\
\infty & \text{otherwise}
\end{cases}\]
<p>Remember that \(\mathbf{\lambda \geq 0}\).</p>
<p>Thus, for the first case, if \(g_i(x) \leq 0\) for all \(i\), the best we can do is set every \(\lambda_i=0\), since any other non-negative value would only decrease \(L\).</p>
<p>In the second case, if \(g_i(x)>0\) for some \(i\), the supremum of the function can be made as large as we like by increasing the value of the corresponding \(\lambda_i\); the supremum is therefore \(\infty\).</p>
<p>We can see that the function \(\text{sup}_\lambda L(x,\lambda)\) incorporates the constraints \(g_i(x)\) directly: there is a penalty of \(\infty\) for any constraint which is violated. Therefore, the original problem of minimising \(f(x)\) can be equivalently stated as:</p>
\[\text{Minimise (w.r.t. x) } \text{sup}_\lambda L(x,\lambda) \\
\text{where } L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\]
<p>Equivalently, we say:</p>
\[\text{Find }\mathbf{\text{inf}_x \text{ sup}_\lambda L(x,\lambda)} \\
\text{where } L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\]
<p>This is referred to in the mathematical optimisation field as the <strong>primal optimisation problem</strong>.</p>
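The penalty behaviour of \(\text{sup}_\lambda L(x,\lambda)\) can be demonstrated numerically. The sketch below uses a hypothetical toy problem of my choosing — \(f(x)=x^2\), \(g(x)=x\leq 0\) — and a finite grid of \(\lambda\) values as a stand-in for \(\lambda\in[0,\infty)\):

```python
import numpy as np

# L(x, lambda) = f(x) + lambda * g(x) for f(x) = x^2, g(x) = x (toy problem).
f = lambda x: x**2
g = lambda x: x
lambdas = np.linspace(0, 1e3, 10001)  # finite stand-in for lambda in [0, inf)

def sup_over_lambda(x):
    return np.max(f(x) + lambdas * g(x))

print(sup_over_lambda(-0.5))  # 0.25 = f(-0.5): feasible, supremum attained at lambda = 0
print(sup_over_lambda(0.5))   # 500.25: infeasible, bounded only by the finite grid
```

For the feasible point, the supremum simply recovers \(f(x)\); for the infeasible one, it grows without bound as the \(\lambda\) grid widens — the \(\infty\) penalty in action.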
<h2 id="karush-kuhn-tucker-complementary-slackness-condition">Karush-Kuhn-Tucker Complementary Slackness Condition</h2>
<p>We previously discussed the two possible scenarios when optimising with constraints: either a constraint is active, or it is inactive.</p>
<ul>
<li><strong>Constraint is active</strong>: This implies that the optimal point \(x^*\) lies on the constraint manifold. Thus, \(\mathbf{g_i(x^*)=0}\). Correspondingly, \(\mathbf{\lambda_i g_i(x^*)=0}\).</li>
<li><strong>Constraint is inactive</strong>: This implies that <strong>the optimal point \(x^*\) does not lie on the constraint manifold, but somewhere inside the feasible region</strong>. Thus, \(\mathbf{g_i(x^*)<0}\). However, this also means that we can optimise \(f(x)\) without regard to the constraint \(g_i(x)\). The best way to get rid of this constraint, then, is to set the corresponding Lagrange multiplier \(\mathbf{\lambda_i=0}\). Correspondingly, \(\mathbf{\lambda_i g_i(x^*)=0}\) again (albeit for different reasons than in the active constraint case).</li>
</ul>
<p>Thus, we may conclude that all \(\lambda_i g_i(x)\) terms in the Lagrangian must be zero, regardless of whether the corresponding constraint is active or inactive.</p>
<p>Mathematically, this implies:</p>
\[\begin{equation}
\mathbf{
\sum_{i=1}^n\lambda_i.g_i(x)=0 \label{eq:kkt-2}
}
\end{equation}\]
<p>The identity \(\eqref{eq:kkt-2}\) is termed the <strong>Complementary Slackness Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</p>
<h2 id="the-karush-kuhn-tucker-conditions">The Karush-Kuhn-Tucker Conditions</h2>
<p>We are now in a position to summarise all the <strong>Karush-Kuhn-Tucker Conditions</strong>. The theorem states that for the optimisation problem given by:</p>
\[\mathbf{\text{Minimise}_x \hspace{3mm} f(x)}\]
<p>if the following conditions are met for some \(x^*\):</p>
<h3 id="1-primal-feasibility-condition">1. Primal Feasibility Condition</h3>
<p>\(\mathbf{g_i(x^*)\leq 0}\)</p>
<h3 id="2-dual-feasibility-condition">2. Dual Feasibility Condition</h3>
<p>\(\mathbf{\lambda_i\geq 0}\)</p>
<h3 id="3-stationarity-condition">3. Stationarity Condition</h3>
<p>\(\mathbf{\nabla f(x^*)+\sum_{i=1}^n\lambda_i.\nabla g_i(x^*)=0}\)</p>
<h3 id="4-complementary-slackness-condition">4. Complementary Slackness Condition</h3>
<p>\(\mathbf{\sum_{i=1}^n\lambda_i.g_i(x^*)=0}\)</p>
<p>then \(x^*\) is a <strong>local optimum</strong>.</p>
<h2 id="the-dual-optimisation-problem">The Dual Optimisation Problem</h2>
<p>We already know from the <a href="/2021/05/08/quadratic-optimisation-theory.html">Max-Min Inequality</a> that:</p>
\[\mathbf{\text{sup}_y \text{ inf}_x f(x,y)\leq \text{inf}_x \text{ sup}_y f(x,y)}\]
<p>Since this is a general statement about any \(f(x,y)\), we can apply this inequality to the Primal Optimisation Problem, i.e.:</p>
\[\text{sup}_\lambda \text{ inf}_x L(x,\lambda) \leq \text{inf}_x \text{ sup}_\lambda L(x,\lambda)\]
<p>The right side is the <strong>Primal Optimisation Problem</strong>, and the left side is known as the <strong>Dual Optimisation Problem</strong>, and in this case, the <strong>Lagrangian Dual</strong>.</p>
<p>To understand the fuss about the <strong>Lagrangian Dual</strong>, we will begin with the more restrictive case where equality holds for the <strong>Max-Min Inequality</strong>, and later discuss the more general case and its implications. For this first part, we will assume that:</p>
\[\text{sup}_\lambda \text{ inf}_x L(x,\lambda) = \text{inf}_x \text{ sup}_\lambda L(x,\lambda)\]
<p>Let’s look at a motivating example. This is the graph of the Lagrangian for the following problem:</p>
\[\text{Minimise}_x f(x)=x^2 \\
\text{subject to: } x \leq 0\]
<p>The Lagrangian in this case is given by:</p>
\[L(x,\lambda)=x^2+\lambda x\]
<p>This is the corresponding graph of \(L(x,\lambda)\).</p>
<p><img src="/assets/images/lagrangian-shape.png" alt="Shape of Lagrangian for a Convex Objective Function" /></p>
<p>Let us summarise a few properties of this graph.</p>
<ul>
<li><strong>The function is convex in \(x\)</strong>: Assume \(\lambda=C\) is a constant, then the function has the form \(\mathbf{x^2+Cx}\) which is a family of parabolas. <strong>A parabola is a convex function</strong>, thus the result follows.</li>
<li><strong>The function is concave in \(\lambda\)</strong>: Assume that \(x=C\) and \(x^2=K\) are constants, then the function has the form \(\mathbf{C\lambda+K}\), which is the general form of <strong>affine functions</strong>. <strong>Affine functions are both convex and concave</strong>, but we will be drawing more conclusions based on their concave nature, so we will simply say that <strong>the Lagrangian is concave in \(\lambda\)</strong>. Thus, <strong>the Lagrangian is also a family of concave functions</strong>.</li>
<li>As a direct consequence of the Lagrangian being a family of concave functions, we can say that <strong>the pointwise infimum of the Lagrangian is a concave function</strong>. We established this result in <a href="/2021/05/08/quadratic-optimisation-theory.html">Quadratic Optimisation Concepts</a>. This result is irrespective of the shape of the Lagrangian in the direction of \(x\).</li>
</ul>
<p>This is important because it allows us to frame the Lagrangian of a Quadratic Optimisation as a concave-convex function. This triggers a whole list of simplifications, some of which I list below (we’ll discuss most of them in succeeding sections).</p>
<ul>
<li>Guarantee of a <strong>saddle point</strong></li>
<li><strong>Zero duality gap</strong> by default</li>
<li>No extra conditions for <strong>Strong Duality</strong></li>
</ul>
<p><img src="/assets/images/lagrangian-saddle.png" alt="Shape of Lagrangian for a Convex Objective Function" /></p>
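For this motivating example, we can also check numerically that the primal and dual problems agree — i.e., that the duality gap is zero. This is a sketch using finite grids as stand-ins for \(x\in\mathbb{R}\) and \(\lambda\in[0,\infty)\):

```python
import numpy as np

# Numerical check for the example above: minimise x^2 subject to x <= 0,
# with Lagrangian L(x, lambda) = x^2 + lambda*x.
xs = np.linspace(-10, 10, 20001)   # grid stand-in for x in R
xsq = xs**2
lams = np.linspace(0, 10, 1001)    # grid stand-in for lambda >= 0

# Dual: d* = sup_lambda inf_x L(x, lambda).
q = np.array([np.min(xsq + lam * xs) for lam in lams])
d_star = np.max(q)

# Primal: p* = min of f(x) = x^2 over the feasible region x <= 0.
p_star = np.min(xsq[xs <= 0])

print(d_star, p_star)  # both are (numerically) 0: the duality gap is zero
```

Note that the dual function \(q(\lambda)=\text{inf}_x L(x,\lambda)=-\lambda^2/4\) is indeed concave, as claimed above.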
<h2 id="geometric-intuition-of-the-lagrange-dual-problem">Geometric Intuition of the Lagrange Dual Problem</h2>
<p>Let us look at the <strong>geometric interpretation</strong> of the Lagrangian Dual. For this discussion, we will assume that the <strong>constraints are active</strong>. The Lagrangian itself is:</p>
\[L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\text{ such that }\lambda_i\geq 0 \text{ and } g_i(x)\leq 0\]
<p>For the purposes of the discussion, let’s assume one constraint, so that the Lagrangian is now:</p>
\[L(x,\lambda)=f(x)+\lambda.g(x)\text{ such that }\lambda\geq 0 \text{ and } g(x)\leq 0\]
<p>Let us map \(f(x)\) (y-coordinate) and \(g(x)\) (x-coordinate), treating them as variables themselves. Then we see that the Lagrangian is of the form:</p>
\[C=\lambda.g(x)+f(x) \\
\Rightarrow f(x)=-\lambda.g(x)+C\]
<p><strong>This is the equation of a straight line</strong>, with <strong>slope \(-\lambda\)</strong> and <strong>y-intercept \(C\)</strong>. Note that \(C\) in this case represents the value of the Lagrangian objective function.</p>
<p>Let’s walk through the Lagrangian maximisation-minimisation procedure step-by-step. The procedure is:</p>
\[\text{sup}_\lambda \text{ inf}_x L(x,\lambda)\]
<p>There are two important points to note here:</p>
<ul>
<li>We have restricted \(\lambda\geq 0\). Therefore the <strong>slope of the Lagrangian line is never positive</strong>.</li>
<li><strong>Moving this line to the left decreases its y-intercept</strong>, in this case, \(C\).</li>
</ul>
<h3 id="1-infimum-with-respect-to-x">1. Infimum with respect to \(x\)</h3>
<p>The first step is \(\text{ inf}_x L(x,\lambda)\), which translates to:</p>
\[\text{ inf}_x \lambda.g(x)+f(x)\]
<ul>
<li>For a given value of \(\lambda\), find the lowest possible \(C\), such that all the constraints are still respected.</li>
</ul>
<p><strong>Geometrically</strong>, this means taking the line \(f(x)=-\lambda g(x)+C\), and moving it as far to the left as possible while it still has at least one point in \(G\).</p>
<p><strong>Algebraically</strong>, this gives us:</p>
\[0=\lambda.\frac{dg(x)}{dx}+\frac{df(x)}{dx} \\
\Rightarrow \frac{df(x)}{dx}=-\lambda.\frac{dg(x)}{dx} \\
\Rightarrow \nabla f(x)=-\lambda.\nabla g(x)\]
<p>This gives us the condition for such a minimisation to be possible, which, as you must have guessed, simply restates the <strong>Karush-Kuhn-Tucker Stationarity Condition</strong>.</p>
<p>The situation looks like below.</p>
<p><img src="/assets/images/infimum-supporting-hyperplane-convex-set.png" alt="Infimum Supporting Hyperplanes for a Convex Set" /></p>
<p>The important thing to note is that as a result of taking the infimum, all the Lagrangians are now <strong>supporting hyperplanes</strong> of \(G\).</p>
<p>Also, because \(\lambda\geq 0\) and also due to how the infimum works, none of the supporting hyperplanes touch \(G\) in the first quadrant (positive); they have all moved as far left as possible, and are effectively tangent to \(G\) at \(g(x)\leq 0\).</p>
<p>As you see below, <strong>this operation holds true even for nonconvex sets</strong>.</p>
<p><img src="/assets/images/infimum-supporting-hyperplanes-nonconvex-set.png" alt="Infimum Supporting Hyperplanes for a Nonconvex Set" /></p>
<p>The infimum operation tells us what the supporting hyperplane for the convex set looks like for a given value of \(\lambda\). Obviously, this also implies that the Lagrangian is tangent to \(G\). This is expressed by the fact that the gradient vector of \(f(x)\) is parallel and opposite to the gradient vector of the constraint \(g(x)\).</p>
<p>Take special note of the Lagrangian line for \(\lambda_1\) in the nonconvex set scenario; we shall have occasion to revisit it very soon.</p>
<h3 id="1-supremum-with-respect-to-lambda">2. Supremum with respect to \(\lambda\)</h3>
<p>The above infimum (minimisation) operation has given us the Lagrangian in terms of \(\lambda\) only. This family of Lagrangians is represented by \(\text{ inf}_x \lambda.g(x)+f(x)\).</p>
<p><strong>Geometrically, you can assume that you have an infinite set of Lagrangians, one for every value of \(\lambda\), each of them a supporting hyperplane for the \([g(x), f(x)]\) set.</strong></p>
<p>Now, to actually find the optimum point, we’d like to select the <strong>supporting hyperplane that has the maximum corresponding cost \(C\)</strong>, or y-intercept. Algebraically, this implies finding \(\text{sup}_\lambda \text{ inf}_x \text{ }\lambda.g(x)+f(x)\).</p>
<p>Note that the Lagrangian is concave in \(\lambda\), thus the minimisation has also given us a concave problem to solve. In this case, we will be maximising this concave problem (which corresponds to minimising a convex problem).</p>
<p><img src="/assets/images/supremum-lagrangian-dual-convex-set.png" alt="Supremum Supporting Hyperplanes for a Convex Set" /></p>
<p>In the diagram above, I’ve marked the winning supporting hyperplane with a thicker stroke. For this hyperplane, with its value of \(\lambda^*\), the y-intercept (the Lagrangian cost) is maximised. This critical point is marked \(d^*\).</p>
<h2 id="strong-duality">Strong Duality</h2>
<p>The interesting (and useful) thing to note is that if you were to solve the <strong>Primal Optimisation Problem</strong> instead of the <strong>Lagrangian Dual Problem</strong>, or even the original optimisation problem in the <strong>standard Quadratic Programming form</strong>, you will get the same result as \(d^*\).</p>
<p>This is the result of the function being concave in \(\lambda\) and convex in \(x\), <strong>implying the existence of a saddle point</strong>. This is also the situation where the equality clause of the <strong>Max-Min Inequality</strong> holds.</p>
<h2 id="weak-duality-and-the-duality-gap">Weak Duality and the Duality Gap</h2>
<p>I’d purposefully omitted the result of finding the supremum for the nonconvex case in the previous section. This is because the nonconvex scenario is what shows us the real difference between the <strong>Primal Optimisation Problem</strong> and its <strong>Lagrangian Dual</strong>.</p>
<p>The winning supporting hyperplane for the <strong>nonconvex set</strong> is shown below.</p>
<p><img src="/assets/images/duality_gap-nonconvex-set.png" alt="Supremum Supporting Hyperplanes for a Non-Convex Set" /></p>
<p>The solution for the <strong>Lagrangian Dual Problem</strong> is marked \(d^*\), and the solution for the <strong>Primal Optimisation Problem</strong> is marked \(p^*\). As you can clearly see, \(d^*\) and \(p^*\) do not coincide.</p>
<p>The dual solution, in this case, is not the actual solution, but <strong>it provides a lower bound on \(p^*\)</strong>, i.e., if we can compute \(d^*\), we can use it to decide whether the solution found by an optimisation algorithm is “good enough”. It is also a validation that we are not searching in an infeasible area of the solution space.</p>
<p><strong>This is the situation where the inequality condition of the Max-Min Inequality holds.</strong></p>
<p>The difference between \(p^*\) and \(d^*\) is called the <strong>Duality Gap</strong>. Obviously, the duality gap is zero when the conditions of <strong>Strong Duality</strong> are satisfied. When these conditions for Strong Duality are not satisfied, we say that <strong>Weak Duality</strong> holds.</p>
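A nonzero duality gap can be exhibited numerically. The sketch below uses a hypothetical nonconvex toy problem of my own construction, with finite grids standing in for \(x\in\mathbb{R}\) and \(\lambda\geq 0\):

```python
import numpy as np

# A nonconvex objective with an affine constraint (hypothetical example):
#   minimise f(x) = x^4 - 8x^2 - x   subject to   g(x) = x <= 0
# The unconstrained global minimum lies near x = +2, in the infeasible region.
xs = np.linspace(-4, 4, 80001)
f = xs**4 - 8 * xs**2 - xs

# Primal optimum over the feasible region x <= 0 (attained near x = -2):
p_star = np.min(f[xs <= 0])

# Dual: q(lambda) = inf_x [f(x) + lambda*x], maximised over lambda >= 0:
lams = np.linspace(0, 5, 501)
q = np.array([np.min(f + lam * xs) for lam in lams])
d_star = np.max(q)

print(p_star, d_star)    # roughly -14.0 and -16.0: d* < p*, a nonzero duality gap
print(d_star <= p_star)  # True: weak duality always holds
```

The dual still bounds the primal from below, but the two no longer coincide: the gap here is roughly 2.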
<h2 id="conditions-for-strong-duality">Conditions for Strong Duality</h2>
<p>There are many different conditions which, if satisfied by themselves, guarantee Strong Duality. In particular, textbooks cite <strong>Slater’s Constraint Qualification</strong> very frequently, and the <strong>Linear Independence Constraint Qualification</strong> also finds mention.</p>
<p><strong>The above-mentioned constraint qualifications assume that the constraints are nonlinear.</strong></p>
<p>However, for our current purposes, if we assume that the <strong>inequality constraints are affine functions</strong>, we do not need to satisfy any other condition: <strong>the duality gap will be zero by default</strong> under these conditions; the optimum dual solution will always equal the optimal primal solution, i.e., \(p^*=d^*\).</p>
<p>This also <strong>guarantees the existence of a saddle point</strong> in the solution of the Lagrangian. A saddle point of a function \(f(x,y)\) is defined as a point \((x^*,y^*)\) which satisfies the following condition:</p>
\[f(x^*,\bigcirc)\leq f(x^*,y^*)\leq f(\bigcirc, y^*)\]
<p>where \(\bigcirc\) represents “any \(x\)” or “any \(y\)” depending upon its placement. Applying this to our objective function, we can write:</p>
\[f(x^*,\bigcirc)\leq f(x^*,\lambda^*)\leq f(\bigcirc, \lambda^*)\]
<p>The implication is that starting from the saddle point, the function slopes down in the direction of \(\lambda\), and slopes up in the direction of \(x\). The figure below shows the general shape of the Lagrangian with a convex objective function and affine (inequality and equality) constraints.</p>
<p><img src="/assets/images/lagrangian-saddle.png" alt="Shape of Lagrangian for a Convex Objective Function" /></p>
<p>The reason this leads to <strong>Strong Duality</strong> is this: minimising \(f(x,\lambda)\) with respect to \(x\) first, then maximising with respect to \(\lambda\), takes us to the same point \((x^*,\lambda^*)\) that would be reached if we first maximise \(f(x,\lambda)\) with respect to \(\lambda\), and then minimise with respect to \(x\).</p>
<p>Mathematically, this implies that:</p>
\[\mathbf{\text{sup}_\lambda \text{ inf}_x f(x,\lambda)= \text{inf}_x \text{ sup}_\lambda f(x,\lambda)}\]
<p>thus implying that the <strong>Duality Gap</strong> is zero.</p>
<h2 id="notes">Notes</h2>
<ul>
<li>The proofs of the <strong>Karush-Kuhn-Tucker Conditions</strong> use <strong>Farkas’ Lemma</strong>.</li>
<li>The <strong>Saddle Point Theorem</strong> is not proven here.</li>
</ul>

<h1>Quadratic Optimisation: Mathematical Background (2021-05-08)</h1>

<p>This article continues the original discussion on <strong>Quadratic Optimisation</strong>, where we considered <strong>Principal Components Analysis</strong> as a motivation. Originally, this article was going to begin delving into the <strong>Lagrangian Dual</strong> and the <strong>Karush-Kuhn-Tucker Theorem</strong>, but the requisite mathematical machinery to understand some of the concepts necessitated breaking the preliminary setup into its own separate article (which you’re now reading).</p>
<h2 id="affine-sets">Affine Sets</h2>
<p>Take any two vectors \(\vec{v_1}\) and \(\vec{v_2}\). All the vectors (or points, if you so prefer) along the line joining the tips of \(\vec{v_1}\) and \(\vec{v_2}\) obviously lie on a straight line. Thus, we can represent any vector along this line segment as:</p>
\[\vec{v}=\vec{v_2}+\theta(\vec{v_1}-\vec{v_2}) \\
=\theta \vec{v_1}+(1-\theta)\vec{v_2}\]
<p>We say that all these vectors (including \(\vec{v_1}\) and \(\vec{v_2}\)) form an <strong>affine set</strong>. More generally, a vector is a member of an affine set if it satisfies the following definition.</p>
\[\vec{v}=\theta_{1} \vec{v_1}+\theta_{2} \vec{v_2}+...+\theta_{n} \vec{v_n} \\
\theta_1+\theta_2+...+\theta_n=1\]
<p><img src="/assets/images/affine-set.png" alt="Affine Set" /></p>
<p>In words, a vector is an <strong>affine combination</strong> of \(n\) vectors if the <strong>coefficients of the linear combinations of those vectors sum to one</strong>.</p>
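This is easy to verify numerically: any combination whose coefficients sum to one stays on the line through the two vectors. A quick sketch, with two arbitrary vectors of my choosing:

```python
import numpy as np

# Coefficients summing to 1 keep the combination on the line through v1 and v2.
v1 = np.array([1.0, 2.0])
v2 = np.array([4.0, 3.0])

def cross2d(a, b):
    # z-component of the 2D cross product; zero iff a and b are parallel
    return a[0] * b[1] - a[1] * b[0]

for theta in (-0.5, 0.0, 0.3, 1.0, 1.7):  # theta outside [0,1] is still affine
    v = theta * v1 + (1 - theta) * v2
    # v - v2 = theta * (v1 - v2), so it is parallel to the line's direction:
    print(cross2d(v - v2, v1 - v2))  # 0.0 (up to floating-point error)
```

Note that \(\theta\) outside \([0,1]\) also qualifies — the affine set is the whole line, not just the segment between the two tips.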
<h2 id="convex-and-non-convex-sets">Convex and Non-Convex Sets</h2>
<p>A set is said to be a <strong>convex set</strong> if for any two points belonging to the set, all their affine combinations also belong to the set. In simpler terms, it means that a straight line between two points belonging to a convex set lies completely inside the set.</p>
<p>Mathematically, the condition for convexity is the following:</p>
\[\theta p_1+(1-\theta)p_2 \in C, \text{ if } p_1,p_2 \in C\]
<p>The set shown below is a convex set.</p>
<p><img src="/assets/images/convex-set.png" alt="Convex Set" /></p>
<p>Any set that does not adhere to the above definition, is, by definition, a <strong>nonconvex set</strong>.</p>
<p>The set below is <strong>nonconvex</strong>. The red segments of the lines joining the points within the set lie outside the set, and thus violate the definition of convexity.</p>
<p><img src="/assets/images/nonconvex-set.png" alt="Nonconvex Set" /></p>
<h2 id="convex-and-concave-functions">Convex and Concave Functions</h2>
<p>The layman’s explanation of a convex function is that it is a bowl-shaped function. However, let us state this mathematically: we say a function is convex <strong>if the graph of that function lies on or below the straight line connecting any two points on that graph</strong>.</p>
<p><img src="/assets/images/convex-function.png" alt="Convex Function" /></p>
<p>If \((x_1, f(x_1))\) and \((x_2, f(x_2))\) are two points on a function \(f(x)\), then \(f(x)\) is <strong>convex</strong> iff:</p>
\[\mathbf{f(\theta x_1+(1-\theta) x_2)\leq \theta f(x_1)+(1-\theta)f(x_2)}\]
<p>Consider a point \(P\) on the line connecting \([x_1, f(x_1)]\) and \([x_2, f(x_2)]\), its coordinate on that line is \([\theta x_1+(1-\theta) x_2, \theta f(x_1)+(1-\theta) f(x_2)]\). The corresponding point on the graph is \([\theta x_1+(1-\theta) x_2, f([\theta x_1+(1-\theta) x_2)]\).</p>
<p><img src="/assets/images/concave-function.png" alt="Concave Function" />
The same condition, but inverted, can be applied to define a concave function. A function \(f(x)\) is <strong>concave</strong> iff:</p>
\[\mathbf{f(\theta x_1+(1-\theta) x_2)\geq \theta f(x_1)+(1-\theta)f(x_2)}\]
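Both definitions can be sanity-checked numerically on random samples. A minimal sketch, using \(x^2\) as a stand-in convex function and \(-x^2+3\) as a stand-in concave one (both of my choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
f_convex = lambda x: x**2          # convex
f_concave = lambda x: -x**2 + 3    # concave

for _ in range(1000):
    x1, x2 = rng.uniform(-10, 10, size=2)
    theta = rng.uniform(0, 1)
    mix = theta * x1 + (1 - theta) * x2
    # the chord lies on or above a convex function...
    assert f_convex(mix) <= theta * f_convex(x1) + (1 - theta) * f_convex(x2) + 1e-9
    # ...and on or below a concave one
    assert f_concave(mix) >= theta * f_concave(x1) + (1 - theta) * f_concave(x2) - 1e-9

print("both inequalities held on 1000 random samples")
```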
<h2 id="affine-functions">Affine Functions</h2>
<p>A function \(f(x)\) is an <strong>affine function</strong> iff:</p>
\[\mathbf{f(\theta x_1+(1-\theta) x_2)=\theta f(x_1)+(1-\theta) f(x_2)}\]
<p>Let’s take a simple function \(f(x)=Ax+C\) where \(x\) is a vector. \(A\) is a transformation matrix, and \(C\) is a constant vector. Then, for two vectors \(\vec{v_1}\) and \(\vec{v_2}\), we have:</p>
\[f(\theta \vec{v_1}+(1-\theta) \vec{v_2})=A.[\theta \vec{v_1}+(1-\theta) \vec{v_2}]+C \\
=A\theta \vec{v_1}+A(1-\theta) \vec{v_2}+(\theta+1-\theta)C \\
=A\theta \vec{v_1}+A(1-\theta) \vec{v_2}+\theta C+(1-\theta)C \\
=[\theta A\vec{v_1}+\theta C]+[(1-\theta) A\vec{v_2}+(1-\theta)C]\\
=\theta[A\vec{v_1}+C]+(1-\theta)[A\vec{v_2}+C]\\
=\theta f(\vec{v_1})+(1-\theta) f(\vec{v_2})\]
<p>Thus all mappings of the form \(\mathbf{f(x)=Ax+C}\) are <strong>affine functions</strong>.</p>
<p>We may draw another interesting conclusion: <strong>affine functions are both convex and concave</strong>. This is because affine functions satisfy the equality conditions for both convexity and concavity: <strong>an affine set on an affine function lies fully on the function itself</strong>.</p>
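The derivation above can be checked numerically for a random affine map. This sketch draws an arbitrary \(A\) and \(C\) and confirms the combination identity, including for \(\theta\) outside \([0,1]\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))   # arbitrary transformation matrix
C = rng.normal(size=3)        # arbitrary constant vector
f = lambda x: A @ x + C       # affine map f(x) = Ax + C

v1, v2 = rng.normal(size=3), rng.normal(size=3)
for theta in (-1.0, 0.25, 0.5, 2.0):
    lhs = f(theta * v1 + (1 - theta) * v2)
    rhs = theta * f(v1) + (1 - theta) * f(v2)
    print(np.allclose(lhs, rhs))  # True each time
```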
<h2 id="supporting-hyperplanes">Supporting Hyperplanes</h2>
<p>A <strong>supporting hyperplane</strong> for a set \(C\) is a hyperplane which has the following properties:</p>
<ul>
<li>The <strong>supporting hyperplane</strong> is guaranteed to contain at least one point which is also on the boundary of the set \(C\).</li>
<li>The <strong>supporting hyperplane</strong> divides \(\mathbb{R}^n\) into two <strong>half-spaces</strong> such that set \(C\) is completely contained by one of these half-spaces.</li>
</ul>
<p>The definition of a convex set can also be explained by supporting hyperplanes. If there exists at least one supporting hyperplane for each point on the boundary of a set \(C\), \(C\) is convex.</p>
<p>The diagram below shows an example of a supporting hyperplane for a convex set.
<img src="/assets/images/valid-supporting-hyperplane.png" alt="Supporting Hyperplane for a Convex Set" /></p>
<p>The diagram below shows an example of an invalid supporting hyperplane (the dotted hyperplane). The dotted supporting hyperplane cannot exist because set \(C\) lies in both the half-spaces defined by this hyperplane.</p>
<p><img src="/assets/images/invalid-supporting-hyperplane.png" alt="Invalid Supporting Hyperplane for a Non-Convex Set" /></p>
<h2 id="some-inequality-proofs">Some Inequality Proofs</h2>
<h3 id="result-1">Result 1</h3>
<p>If \(a\geq b\), and \(c\geq d\), then:</p>
\[min(a,c)\geq min(b,d)\]
<p>The proof goes like this: we can define the following inequalities in terms of the \(min\) function:</p>
\[\begin{eqnarray}
a \geq min(a,c) \label{eq:1} \\
c \geq min(a,c) \label{eq:2} \\
b \geq min(b,d) \label{eq:3} \\
d \geq min(b,d) \label{eq:4} \\
\end{eqnarray}\]
<p>Then, the identities \(a \geq b\) and \(\eqref{eq:3}\) imply:</p>
\[a \geq b \geq min(b,d)\]
<p>Similarly, the identities \(c \geq d\) and \(\eqref{eq:4}\) imply that:</p>
\[c \geq d \geq min(b,d)\]
<p>Therefore, whichever of \(a\) and \(c\) the function \(min(a,c)\) selects, the result will always be greater than or equal to \(min(b,d)\). Thus we write:</p>
\[\begin{equation} \mathbf{min(a,c) \geq min(b,d)} \label{ineq:1}\end{equation}\]
<h3 id="result-2">Result 2</h3>
<p>Here we prove that:</p>
\[min(a+b, c+d) \geq min(a,c)+min(b,d)\]
<p>Here we take a similar approach, noting that:</p>
\[a \geq min(a,c) \\
c \geq min(a,c) \\
b \geq min(b,d) \\
d \geq min(b,d) \\\]
<p>Therefore, if we compute \(a+b\) and \(c+d\), we can write:</p>
\[a+b \geq min(a,c)+min(b,d) \\
c+d \geq min(a,c)+min(b,d)\]
<p>Therefore, whichever of \(a+b\) and \(c+d\) the function \(min(a+b,c+d)\) selects, the result will always be greater than or equal to \(min(a,c)+min(b,d)\). Thus we write:</p>
\[\begin{equation}\mathbf{min(a+b, c+d) \geq min(a,c)+min(b,d)} \label{ineq:2} \end{equation}\]
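Both inequalities are easy to spot-check numerically; the sketch below (plain Python, randomised inputs) exercises Results 1 and 2:

```python
import random

random.seed(1)
for _ in range(1000):
    b, d = random.gauss(0, 1), random.gauss(0, 1)
    a = b + abs(random.gauss(0, 1))   # ensures a >= b
    c = d + abs(random.gauss(0, 1))   # ensures c >= d
    assert min(a, c) >= min(b, d)                      # Result 1
    assert min(a + b, c + d) >= min(a, c) + min(b, d)  # Result 2
```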
<h2 id="infimum-and-supremum">Infimum and Supremum</h2>
<p>The <strong>infimum</strong> of a function \(f(x)\) is defined as:</p>
\[\mathbf{inf_x(f(x))=M \; | \; M\leq f(x) \; \forall x}\]
<p>The infimum is the <strong>greatest</strong> such lower bound \(M\). It is defined for all functions even if the minimum does not exist for a function, and is equal to the minimum if it does exist.</p>
<p>The supremum of a function \(f(x)\) is defined as:</p>
\[\mathbf{sup_x(f(x))=M \; | \; M\geq f(x) \; \forall x}\]
<p>The <strong>supremum</strong> is the <strong>least</strong> such upper bound \(M\). It is defined for all functions even if the maximum does not exist for a function, and is equal to the maximum if it does exist.</p>
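A standard example of an infimum that is not a minimum is \(f(x)=e^{-x}\): the function keeps decreasing but never reaches zero. A small sketch:

```python
import math

# f(x) = exp(-x) has no minimum over the reals: it decreases forever,
# but its infimum is 0 -- the greatest lower bound, never attained.
values = [math.exp(-x) for x in range(0, 50)]
assert all(v > 0 for v in values)   # 0 is never attained...
assert min(values) < 1e-20          # ...but the values get arbitrarily close to it
```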
<h2 id="pointwise-infimum-and-pointwise-supremum">Pointwise Infimum and Pointwise Supremum</h2>
<p>The <strong>pointwise infimum</strong> of two functions \(f_1(x)\) and \(f_2(x)\) is defined as:</p>
\[pinf(f_1, f_2)(x)=min\{f_1(x), f_2(x)\}\]
<p>The <strong>pointwise supremum</strong> of two functions \(f_1(x)\) and \(f_2(x)\) is defined as:</p>
\[psup(f_1, f_2)(x)=max\{f_1(x), f_2(x)\}\]
<p>We’ll prove an interesting result that will prove useful when exploring the shape of the <strong>Lagrangian of the objective function</strong>, namely that <strong>the pointwise infimum of any set of concave functions is a concave function</strong>.</p>
<p><img src="/assets/images/concave-infimum.png" alt="Concave Pointwise Infimum" /></p>
<p>Let there be a chord \(C_1\) connecting \((x_1, f_1(x_1))\) and \((x_2, f_1(x_2))\) for a concave function \(f_1(x)\).
Let there be a chord \(C_2\) connecting \((x_1, f_2(x_1))\) and \((x_2, f_2(x_2))\) for a concave function \(f_2(x)\).</p>
<p>Let us fix two arbitrary x-coordinates \(x_1\) and \(x_2\). Then, by the definition of a <strong>concave function</strong> (see above), we can write for \(f_1\) and \(f_2\):</p>
\[f_1(\alpha x_1+\beta x_2)\geq \alpha f_1(x_1)+\beta f_1(x_2) \\
f_2(\alpha x_1+\beta x_2)\geq \alpha f_2(x_1)+\beta f_2(x_2)\]
<p>where \(\alpha+\beta=1\). Let us define the <strong>pointwise infimum</strong> function as:</p>
\[\mathbf{pinf(x)=min\{f_1(x), f_2(x)\}}\]
<p>Then:</p>
\[pinf(\alpha x_1+\beta x_2)=min\{ f_1(\alpha x_1+\beta x_2), f_2(\alpha x_1+\beta x_2)\} \\
\geq min\{ \alpha f_1(x_1)+\beta f_1(x_2), \alpha f_2(x_1)+\beta f_2(x_2)\} \hspace{4mm}\text{ (from }\eqref{ineq:1})\\
\geq \alpha.min\{f_1(x_1),f_2(x_1)\} + \beta.min\{f_1(x_2),f_2(x_2)\} \hspace{4mm}(\text{ from } \eqref{ineq:2})\\
= \mathbf{\alpha.pinf(x_1) + \beta.pinf(x_2)}\]
<p>Thus, we can summarise:</p>
\[\begin{equation}
\mathbf{pinf(\alpha x_1+\beta x_2) \geq \alpha.pinf(x_1) + \beta.pinf(x_2)}
\end{equation}\]
<p>which is the form of a <strong>concave function</strong>, and thus we can conclude that \(pinf(x)\) is a concave function if all of its component functions are concave.</p>
<p>Since this is a general result for any two coordinates \(x_1 \neq x_2\), we can conclude that <strong>the pointwise infimum of two concave functions is also a concave function</strong>. This can be extended to an arbitrary set of concave functions.</p>
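A numerical sketch of this result, using two illustrative concave functions:

```python
import numpy as np

rng = np.random.default_rng(2)
f1 = lambda x: -(x - 1.0) ** 2      # concave (downward parabola)
f2 = lambda x: -abs(x) + 2.0        # concave (inverted V)
pinf = lambda x: min(f1(x), f2(x))  # the pointwise infimum

# Check the concavity inequality at many random points and weights.
for _ in range(1000):
    x1, x2 = rng.uniform(-5, 5, size=2)
    alpha = rng.uniform()
    beta = 1 - alpha
    assert pinf(alpha * x1 + beta * x2) >= alpha * pinf(x1) + beta * pinf(x2) - 1e-12
```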
<p>Using very similar arguments, we can also prove that <strong>the pointwise supremum of an arbitrary set of convex functions is also a convex function</strong>.</p>
<p>The other straightforward conclusion is that <strong>the pointwise infimum of any set of affine functions is always concave, because affine functions are concave</strong> (they are also convex, but we cannot draw any general conclusions about the pointwise infimum of convex functions).</p>
<p><strong>Note</strong>: The <strong>pointwise infimum</strong> and <strong>pointwise supremum</strong> have different definitions from the <strong>infimum</strong> and <strong>supremum</strong>, respectively.</p>
<h2 id="the-max-min-inequality">The Max-Min Inequality</h2>
<p>The <strong>Max-Min Inequality</strong> is a very general statement about the implications of ordering of maximisation/minimisation procedures along different axes of a function.</p>
<p>Fix a particular point \((x_0,y_0)\).</p>
\[\text{ inf}_xf(x,y_0)\leq f(x_0,y_0)\leq \text{ sup}_yf(x_0,y)\]
<p>This holds for any \((x_0,y_0)\); thus, we can simplify the notation and omit the middle term to write:</p>
\[\text{ inf}_xf(x,y)\leq \text{ sup}_yf(x,y) \\
g(y)\leq h(x) \text{ }\forall x,y\in\mathbf{R}\]
<p>where \(g(y)=\text{ inf}_xf(x,y)\) and \(h(x)=\text{ sup}_yf(x,y)\). Note that at this point, \(g\) and \(h\) can be simple scalars or functions in their own right; it depends upon the original function \(f(x,y)\).</p>
<p>In the general case, the infimum defines a function whose image contains values which are all less than the smallest value in the image of the supremum function. We express this last statement as:</p>
\[\text{sup}_y g(y)\leq \text{inf}_x h(x) \\
\Rightarrow \mathbf{\text{sup}_y \text{ inf}_x f(x,y)\leq \text{inf}_x \text{ sup}_y f(x,y)}\]
<p>This is the statement of the <strong>Max-Min Inequality</strong>.</p>
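The inequality is easy to observe on a grid. The NumPy sketch below uses an arbitrary test function \(f(x,y)=\sin(x+y)\):

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = np.linspace(-3, 3, 61)
F = np.sin(x[:, None] + y[None, :])   # F[i, j] = f(x_i, y_j)

sup_inf = F.min(axis=0).max()   # sup_y inf_x f(x, y)
inf_sup = F.max(axis=1).min()   # inf_x sup_y f(x, y)
assert sup_inf <= inf_sup       # the Max-Min Inequality
```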
<h2 id="the-minimax-theorem">The Minimax Theorem</h2>
<p>The <strong>Minimax Theorem</strong> (first proved by <strong>John von Neumann</strong>) specifies conditions under which the <strong>Max-Min Inequality</strong> holds with equality. This will prove useful in our discussion around solutions to the Lagrangian.
Specifically, the theorem states that</p>
\[\mathbf{\text{sup}_y \text{ inf}_x f(x,y) = \text{inf}_x \text{ sup}_y f(x,y)}\]
<p>if:</p>
<ul>
<li>\(f(x,y)\) is convex in \(x\) (keeping \(y\) constant)</li>
<li>\(f(x,y)\) is concave in \(y\) (keeping \(x\) constant)</li>
</ul>
<p>The diagram below shows the graph of such a function.</p>
<p><img src="/assets/images/quadratic-surface-no-cross-term-saddle.png" alt="Concave-Convex Function" /></p>
<p>The above conditions also imply the existence of a <strong>saddle point</strong> in the solution space, which, as we will discuss, will also be the <strong>optimal solution</strong>.</p>avishekThis article continues the original discussion on Quadratic Optimisation, where we considered Principal Components Analysis as a motivation. Originally, this article was going to begin delving into the Lagrangian Dual and the Karush-Kuhn-Tucker Theorem, but the requisite mathematical machinery to understand some of the concepts necessitated breaking the preliminary setup into its own separate article (which you’re now reading).Intuitions about the Implicit Function Theorem2021-04-29T00:00:00+05:302021-04-29T00:00:00+05:30/2021/04/29/inverse-function-theorem-implicit-function-theorem<p>We discussed the <strong>Implicit Function Theorem</strong> at the end of the article on <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Lagrange Multipliers</a>, with some hand-waving to justify the linear behaviour on manifolds in arbitrary \(\mathbb{R}^N\).</p>
<p>This article delves a little deeper to develop some more intuition on the Implicit Function Theorem, but starts with its more specialised relative, the <strong>Inverse Function Theorem</strong>. This is because it is easier to start with reasoning about the Inverse Function Theorem.</p>
<h2 id="inverse-functions">Inverse Functions</h2>
<h3 id="monotonicity-in-one-dimension">Monotonicity in One Dimension</h3>
<p>Let’s start with a simple motivating example. We have the function \(f(x)=2x: x \in \mathbb{R}\). This gives a value, say \(y\), given an \(x\). We desire to find a function \(f^{-1}\) which is the inverse of \(f\), i.e., given a \(y\), we wish to recover \(x\). Mathematically, we can say:</p>
\[f^{-1}(f(x))=x\]
<p>In this case, the inverse is pretty easy to determine, it is \(f^{-1}(x)=\frac{x}{2}\). The function \(f\) is thus a mapping from \(\mathbb{R} \rightarrow \mathbb{R}\).
Let us ask the question while we are still dealing with very simple functions: <strong>under what conditions does a function not have an inverse?</strong></p>
<p>Let’s think of this intuitively with an example. Does the function \(f(x)=5\) have an inverse? This function forces all values of \(x\in \mathbb{R}\) to a value of 5. Even hypothetically, if \(f^{-1}\) existed and we tried to find \(f^{-1}(5)\), there would not be a unique solution for \(x\). Algebraically, we could have written:</p>
\[f(x)=[0].x+[5]\]
<p>where \([0]\) is a \(1\times 1\) matrix with a zero in it, and in this case plays the role of the function matrix. The \([5]\) is the bias constant, and can be ignored for this discussion.</p>
<p>Obviously, \(f(x)\) collapses every \(x\) into the zero vector, and is thus not invertible. Correspondingly, the function does not have an inverse. Some intuition is developed about invertibility in <a href="/2021/04/03/matrix-intuitions.html">Assorted Intuitions about Matrics</a>.</p>
<p>This highlights an important point: it is not necessary for <em>all</em> \(x\) to map to the same output for invertibility to fail. If even two distinct inputs produce the same output (in the linear case, if even a single non-zero vector \(x\) folds into zero), the function cannot be invertible. For a function on \(\mathbb{R}\) to be invertible, it must continuously either keep increasing or keep decreasing: it cannot increase for a while, then decrease again, because that automatically implies that the output can be the same for two (or more) different inputs (implying that you cannot recover the input uniquely from a given output).</p>
<p>A function which always either only increases, or only decreases, is called a <strong>monotonic function</strong>.</p>
<p><strong>Monotonic functions</strong> have the property that their derivative is either always positive or always negative throughout the domain. This property is evident when you take the derivative of the function \(g(x)=2x\), which is \(\frac{dg(x)}{dx}=2\).</p>
<p>This will come in handy when we move to higher dimensions.</p>
<p>Let’s look at another well-known function, the sine curve.</p>
<p><img src="/assets/images/sine-wave.png" alt="Sine Curve" /></p>
<p>The sine function \(f(x)=sin(x)\) is <strong>not invertible</strong> over the domain \((-\infty, \infty)\). This is because values of \(x\) separated by \(2\pi\) radians output the same value.</p>
<p>For the function \(f(x)=sin(x)\) to be invertible, <strong>we restrict its domain to \([-\frac{\pi}{2},\frac{\pi}{2}]\)</strong>. You can easily see that in the range \([-\frac{\pi}{2},\frac{\pi}{2}]\), the sine function is <strong>monotonic</strong> (in this case, increasing).</p>
<p>This also leads us to an important practice: that of explicitly defining the region of the domain of the function where it is monotonic. In most cases, excluding the problematic areas of the domain allows us to apply stricter conditions to a local area of a function, which would not be possible if the function were considered at a global scale.</p>
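The restriction of the domain is exactly what \(\arcsin\) encodes; a short NumPy sketch:

```python
import numpy as np

# Inside the restricted domain [-pi/2, pi/2], arcsin inverts sin exactly.
x = np.linspace(-np.pi / 2, np.pi / 2, 101)
assert np.allclose(np.arcsin(np.sin(x)), x)

# Outside it, sin is no longer one-to-one, so inversion fails:
x_bad = 3 * np.pi / 4
assert not np.isclose(np.arcsin(np.sin(x_bad)), x_bad)
```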
<h3 id="function-inverses-in-higher-dimensions">Function Inverses in Higher Dimensions</h3>
<p>What if we wish to extend this to the two-dimensional case? We now have a function \(F:\mathbb{R}^2 \rightarrow \mathbb{R}^2\). I said “a function”, but it is actually a vector of two functions. An elementary function returns a single scalar value, and to get two values (remember, \(\mathbb{R}^2\)) for our output vector, we need two functions. Let us write this as:</p>
\[F(X)=\begin{bmatrix}
f_1(x_1, x_2) \\ f_2(x_1, x_2)
\end{bmatrix}
\\
f_1(x_1, x_2)=x_1+x_2 \\
f_2(x_1, x_2)=x_1-x_2 \\
\Rightarrow F(X)=
\begin{bmatrix}
1 && 1 \\
1 && -1
\end{bmatrix}X\]
<p>where \(X=(x_1,x_2)\). I have simply rewritten the functions in matrix form above.
<strong>What is the inverse of this function?</strong> We can simply compute the inverse of this matrix to get the answer. I won’t show the steps here (I did this using augmented matrix Gaussian Elimination), but you can verify yourself that the inverse \(F^{-1}\) is:</p>
\[F^{-1}(X)=\begin{bmatrix}
\frac{1}{2} && \frac{1}{2} \\
\frac{1}{2} && -\frac{1}{2} \\
\end{bmatrix}X\]
<p>This can be extended to all higher dimensions, obviously.</p>
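You can verify the inverse numerically (a NumPy sketch):

```python
import numpy as np

F = np.array([[1.0, 1.0],
              [1.0, -1.0]])
F_inv = np.array([[0.5, 0.5],
                  [0.5, -0.5]])

# Composing the two maps, in either order, gives the identity.
assert np.allclose(F_inv @ F, np.eye(2))
assert np.allclose(F @ F_inv, np.eye(2))
assert np.allclose(np.linalg.inv(F), F_inv)
```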
<p>Let us repeat the same question as in the one-dimensional case: <strong>when is the function \(F\) not invertible?</strong> We need to make our definition a little more sophisticated in the case of multivariable functions; the new requirement is that the matrix of partial derivatives be invertible. Stated this way, this implies that the gradient of the function (the Jacobian) \(\nabla F\) be invertible over the entire region of interest.</p>
<p>Briefly, we’re looking at \(n\) equations with \(n\) unknowns, with all linearly independent column vectors. <strong>Linear independence is a necessary condition for invertibility.</strong></p>
<p>We are now ready to state the <strong>Inverse Function Theorem</strong> (well, the important part).</p>
<h2 id="inverse-function-theorem">Inverse Function Theorem</h2>
<p>The <strong>Inverse Function Theorem</strong> states that:</p>
<p>In a neighbourhood of a point \(x_0\) in the domain of a function \(F\) which is known to be <strong>continuously differentiable</strong>, if the <strong>derivative of the function \(DF(x_0)\)</strong> is <strong>invertible</strong>, then there exists an <strong>inverse function</strong> \(F^{-1}\) in that same neighbourhood such that \(F^{-1}(F(x_0))=x_0\).</p>
<p>The theorem also gives us information about what the <strong>derivative of the inverse function</strong> is, but we’ll not delve into that aspect for the moment. Any textbook on <strong>Vector Calculus</strong> should have the relevant results.</p>
<p>This is a very informal definition of the <strong>Inverse Function Theorem</strong>, but it conveys the most important part, namely: <strong>if the derivative of a function is invertible</strong> in some neighbourhood of \(x_0\), <strong>there exists an inverse of the function</strong> itself in that neighbourhood.</p>
<p>The reason we stress a lot on the word <strong>neighbourhood</strong> is that a lot of functions are not necessarily continuously differentiable, especially for nonlinear functions. Linear functions look the same as their derivatives at every point, which is why we didn’t need to worry about taking the derivative of \(f(x)=2x\) in our initial example.</p>
<p>The <strong>Inverse Function Theorem</strong> obviously applies to linear functions, but its real value lies in applying to <strong>nonlinear functions</strong>, where the neighbourhood is taken to be infinitesimal, which then leads us to the definition of the <strong>manifold</strong>, which we have talked about in <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers, Manifolds, and the Implicit Function Theorem</a>.</p>
<h2 id="implicit-function-theorem">Implicit Function Theorem</h2>
<p>What can we say about systems of functions which have \(n\) unknowns, but less than \(n\) equations? The <strong>Implicit Function Theorem</strong> gives us an answer to this; think of it as a more general version of the <strong>Inverse Function Theorem</strong>.</p>
<p>Much of the mechanics implied by this theorem is covered in <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers, Manifolds, and the Implicit Function Theorem</a>. However, here we take a big-picture view.</p>
<p>Suppose we have \(m+n\) unknowns and \(n\) equations.
Thus, we will have \(n\) pivotal variables, corresponding to \(n\) linearly independent column vectors of this system of linear equations.
This means that \(n\) pivotal variables can be expressed in terms of \(m\) free variables. Let us call the \(m\) free variables \(U=(u_1, u_2,..., u_m)\), and the \(n\) pivotal variables \(V=(v_1, v_2, ..., v_n)\).</p>
<p>Let us consider the original function \(F_{old}\).</p>
\[F_{old}(U,V)=\begin{bmatrix}
f_1(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n) \\
f_2(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n) \\
f_3(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n) \\
\vdots \\
f_n(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n)
\end{bmatrix}\]
<p>The new function \(F_{new}\) is what we obtain once we have expressed \(V\) in terms of only \(U\). It looks like this:</p>
\[F_{new}(U)=\begin{bmatrix}
u_1 \\
u_2 \\
u_3 \\
\vdots \\
u_m \\
\phi_1(u_1, u_2, u_3, ..., u_m) \\
\phi_2(u_1, u_2, u_3, ..., u_m) \\
\phi_3(u_1, u_2, u_3, ..., u_m) \\
\vdots \\
\phi_n(u_1, u_2, u_3, ..., u_m)
\end{bmatrix}\]
<p>Note that the original formulation had a function \(F_{old}\) which transformed the full set \((U,V)\) into a new vector. The new formulation now has \(m\) free variables which stay unchanged after the transform, and \(n\) pivotal variables \(V\) which are mapped from \(U\) with a new set of functions \(\Phi=(\phi_1,\phi_2,...,\phi_n)\).</p>
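In the linear case this re-expression is concrete: if \(F_{old}(U,V)=A_UU+A_VV\) with \(A_V\) invertible, then \(\Phi(U)=-A_V^{-1}A_UU\). A NumPy sketch, with arbitrary illustrative coefficient matrices \(A_U\) and \(A_V\) (these names are assumptions, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 2, 3                      # m free variables, n pivotal variables
A_U = rng.normal(size=(n, m))    # coefficients of the free variables U (illustrative)
A_V = rng.normal(size=(n, n))    # coefficients of the pivotal variables V (illustrative)

def phi(U):
    # Solve F_old(U, V) = A_U @ U + A_V @ V = 0 for the pivotal variables V.
    return -np.linalg.solve(A_V, A_U @ U)

U = rng.normal(size=m)
assert np.allclose(A_U @ U + A_V @ phi(U), 0.0)   # F_old(U, phi(U)) = 0
```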
<p>Now, instead of asking: <strong>“Is there an inverse of the function \(F_{old}\)?”</strong>, we ask: <strong>“Is there an inverse of the function \(F_{new}\)?”</strong></p>
<p>The situation is illustrated below.</p>
<p><img src="/assets/images/implicit-function-theorem.png" alt="Implicit Function Theorem Intuition" /></p>
<p>The <strong>Implicit Function Theorem</strong> states that if a mapping \(F_{old}(U,F_{new}(U))\) exists for a point \(c=(U_0, F_{new}(U_0))\) such that:</p>
<ul>
<li>
\[\mathbf{F_{old}(c)=0}\]
</li>
<li>\(F_{old}(c)\) is <strong>first order differentiable</strong> (\(C^1\) differentiable)</li>
<li>The derivative of \(F_{old}\) is invertible, implying \(L\) is also invertible, where \(L\) is defined as below:</li>
</ul>
\[L=\begin{bmatrix}
(D_1F_{old}, D_2F_{old}, D_3F_{old}, ..., D_nF_{old}) && (D_{n+1}F_{old}, D_{n+2}F_{old}, D_{n+3}F_{old}, ..., D_{n+m}F_{old}) \\
0 && I_{m \times m}
\end{bmatrix}\]
<p>then, the following holds true:</p>
<ul>
<li>There exists an inverse mapping \(F_{new}^{-1}\) for \(F_{new}\) such that \(F_{old}(F_{new}^{-1}(V), V)=0\) in the neighbourhood of \(c\)</li>
<li>There is a <strong>neighbourhood of \(c\)</strong> where this linear relationship holds for \(F_{old}(c)=0\).</li>
</ul>
<p>The above is the same statement as the one made by the <strong>Inverse Function Theorem</strong>, except that the system of linear equations in that scenario was completely determined. In the case of the <strong>Implicit Function Theorem</strong>, the system is <strong>underdetermined</strong>.</p>
<h3 id="note-on-the-derivative-matrix">Note on the Derivative Matrix</h3>
<p>Let us look at the matrix \(L\) defined above. Here, we have padded the derivatives with the zero matrix and an identity matrix to make the whole matrix \(L\) square.</p>
<p>For simple linear surfaces, simply finding the inverse of the system of linear equations is enough, since, as I noted, the gradient vector is the same as the surface normal globally; but that is not true for “lumpy” functions globally. It is true in a neighbourhood of \(x_0\). But what is the <strong>size of this neighbourhood</strong> such that the derivative approximates the actual function reasonably well?</p>
<p>Put another way, what is the size of the neighbourhood, <strong>where the first derivative does not change too fast</strong> for it to be useful in approximating the actual function? This requires the derivative satisfying the <strong>Lipschitz Condition</strong>, which is a way of putting a <strong>strong guarantee on continuous differentiability</strong>.</p>
<p>We will not go into the details of how this condition is satisfied, but only state that calculating a metric associated with this condition requires us to compute \(L^{-1}\).</p>
<p>We know that \((D_1F_{old}, D_2F_{old}, D_3F_{old}, ..., D_nF_{old})\) is \(n \times n\) and is invertible, because we know that there are \(n\) linearly independent columns in \(F_{old}\).</p>
<p>The matrix \(L\) has the block form:</p>
\[L=
\begin{bmatrix}
A && C \\
0 && B
\end{bmatrix}\]
<p>where \(A\) and \(B\) are invertible, but \(C\) need not be. To see why this results in \(L\) being invertible, see <a href="/2021/04/29/quick-summary-of-common-matrix-product-methods.html">Intuitions around Matrix Multiplications</a>.</p>avishekWe discussed the Implicit Function Theorem at the end of the article on Lagrange Multipliers, with some hand-waving to justify the linear behaviour on manifolds in arbitrary \(\mathbb{R}^N\).Common Ways of Looking at Matrix Multiplications2021-04-29T00:00:00+05:302021-04-29T00:00:00+05:30/2021/04/29/quick-summary-of-common-matrix-product-methods<p>We consider the more frequently utilised viewpoints of <strong>matrix multiplication</strong>, and relate it to one or more applications where using a certain viewpoint is more useful. These are the viewpoints we will consider.</p>
<ul>
<li>Linear Combination of Columns</li>
<li>Linear Combination of Rows</li>
<li>Linear Transformation</li>
<li>Sum of Columns into Rows</li>
<li>Dot Product of Rows and Columns</li>
<li>Block Matrix Multiplication</li>
</ul>
<h2 id="linear-combination-of-columns">Linear Combination of Columns</h2>
<p>This is the most common, and probably one of the most useful, ways of looking at matrix multiplication. This is because the concept of <strong>linear combinations of columns</strong> is a fundamental way of determining linear independence (or linear dependence), which then informs us about many things, including:</p>
<ul>
<li>Dimensionality of the <strong>column space</strong> and <strong>row space</strong></li>
<li>Dimensionality of the <strong>null space</strong> and <strong>left null space</strong></li>
<li><strong>Uniqueness</strong> of solutions</li>
<li><strong>Invertibility</strong> of matrix</li>
</ul>
<p>This is obviously the most commonly used interpretation when defining and working with <strong>vector subspaces</strong>, as well.</p>
<p><img src="/assets/images/linear-combination-matrix-multiplication.jpg" alt="Linear Combination of Columns" /></p>
<h2 id="linear-combination-of-rows">Linear Combination of Rows</h2>
<p>There’s not much more to say about the linear combinations of rows. However, <strong>any deduction about the row rank of a matrix from looking at its row vectors automatically applies to the column rank as well</strong>, so it is useful in situations where you find looking at rows easier than columns.</p>
<h2 id="sum-of-columns-into-rows">Sum of Columns into Rows</h2>
<p>The product of a column of the left matrix and a row of the right matrix gives a matrix of the same dimensions as the final result. <strong>Thus, each product results in one “layer” of the final result.</strong> Subsequent “layers” are added on through summation. The product thus looks like so:</p>
<p>Thus, for \(A\in\mathbb{R}^{m\times n}\) and \(B\in\mathbb{R}^{n\times p}\), we can write out the multiplication operation as below:</p>
\[\mathbf{AB=C_{A1}R_{B1}+C_{A2}R_{B2}+C_{A3}R_{B3}+...+C_{An}R_{Bn}}\]
<p>This is a common form of treating a matrix when performing <strong>LU Decomposition</strong>. See <a href="/2021/04/02/vectors-matrices-outer-product-column-into-row-lu.html">Matrix Outer Product: Columns-into-Rows and the LU Factorisation</a> for an extended explanation of the <strong>LU Factorisation</strong>.</p>
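A quick NumPy verification of the columns-into-rows identity:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 4))   # m x n
B = rng.normal(size=(4, 2))   # n x p

# AB as a sum of "layers": column i of A (outer product) row i of B.
outer_sum = sum(np.outer(A[:, i], B[i, :]) for i in range(A.shape[1]))
assert np.allclose(outer_sum, A @ B)
```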
<h2 id="linear-transformation">Linear Transformation</h2>
<p>This is a very common way of perceiving matrix multiplication in <strong>computer graphics</strong>, as well as when considering <strong>change of basis</strong>. <strong>Lack of matrix invertibility can also be explained through whether a vector exists which can be transformed into the zero vector by said matrix.</strong></p>
<h2 id="dot-product-of-rows-and-columns">Dot Product of Rows and Columns</h2>
<p>This is the common form of treating matrices in proofs which use the transpose invariance property of symmetric matrices, i.e., \(A^T=A\). It is also the form most commonly taught in high school, and not really the best way to start understanding matrix multiplication.</p>
<h2 id="block-matrix-multiplication">Block Matrix Multiplication</h2>
<p><img src="/assets/images/block-matrix-multiplication.jpg" alt="Block Matrix Multiplication" />
The block matrix multiplication is not really a separate method of multiplication per se. It is more of a method for bringing a higher level of abstraction in a matrix, while still permitting the “blocks” to be treated as singular matrix entries.</p>
<p>One application of this is when proofs involve properties of a larger matrix composed of submatrices, which have interesting properties of their own, which we wish to exploit.</p>
<p>An interesting example is part of the statement of the <strong>Implicit Function Theorem</strong>. In one dimension, the validity of this theorem holds when the function being described is <strong>monotonic</strong> in a defined interval (always increasing or always decreasing in that interval). In higher dimensions, this requirement of monotonicity is stated more formally as saying that <strong>the derivative of the function is invertible within a defined interval</strong>. We discussed this theorem in the article on <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Lagrange Multipliers</a>.</p>
<p>The motivation for this example is the mathematical description of that monotonicity requirement. More on this is discussed in <a href="/2021/04/29/inverse-function-theorem-implicit-function-theorem.html">Intuitions about the Implicit Function Theorem</a>.</p>
<p>We can prove that a matrix which looks like this:</p>
\[X=
\begin{bmatrix}
A && C \\
0 && B
\end{bmatrix}\]
<p>where \(A\) and \(B\) are <strong>invertible submatrices</strong> and \(0\) is the <strong>zero matrix</strong>, the matrix \(X\) is also <strong>invertible</strong>. Let us be precise about the dimensions of these matrices.</p>
\[X=(n+m)\times (n+m) \\
A=n \times n \\
0=m \times n \\
C=n \times m \\
B=m \times m\]
<p>Do verify for yourself that these submatrices align. To prove this, let us assume there exists a matrix \(X^{-1}\), which is the inverse of \(X\). Therefore, \(XX^{-1}=I\). Furthermore, let us assume the form of \(X^{-1}\) to be:</p>
\[X^{-1}=\begin{bmatrix}
P && Q \\
R && S
\end{bmatrix}\]
<p>Again, we make precise the dimensions of the submatrices of \(X^{-1}\).</p>
\[P=n \times n \\
Q=n \times m \\
R=m \times n \\
S=m \times m\]
<p>If we multiply \(XX^{-1}\), we get:</p>
\[XX^{-1}=
\begin{bmatrix}
AP+CR && AQ+CS \\
BR && BS
\end{bmatrix}=
\begin{bmatrix}
I_{n \times n} && 0_{n \times m} \\
0_{m \times n} && I_{m \times m}
\end{bmatrix}\]
<p>Let’s do a quick sanity check. Checking back to the dimensions of the matrices, we can immediately see that:</p>
<ul>
<li>\(AP\) and \(CR\) give an \(n \times n\) matrix.</li>
<li>\(BS\) gives an \(m \times m\) matrix.</li>
<li>\(AQ\) and \(CS\) give an \(n \times m\) matrix.</li>
<li>\(BR\) gives an \(m \times n\) matrix.</li>
</ul>
<p>The cool thing is that you can write out the element-wise equalities, and solve for \(P\), \(Q\), \(R\), \(S\), as if they were simple variables, as long as you adhere to the matrix operation rules of <strong>ordering</strong>, <strong>transpose</strong>, <strong>inverse</strong>, etc.</p>
<p>Thus, we can write:</p>
\[AP+CR=I \\
AQ+CS=0 \\
BR=0 \\
BS=I\]
<p>From the last two identities, we can immediately say that:</p>
\[R=0 \\
S=B^{-1}\]
<p>Solving for the remaining two variables \(P\) and \(Q\), we get:</p>
\[P=A^{-1} \\
Q=-A^{-1}CB^{-1}\]
<p>Thus the inverse of \(X\) is:</p>
\[X^{-1}=
\begin{bmatrix}
A^{-1} && -A^{-1}CB^{-1} \\
0 && B^{-1}
\end{bmatrix}\]
<p>The important point to note here is that <strong>the solution does not need \(C\) to be an invertible matrix</strong>; it may be rank-deficient, and \(X\) still remains an invertible matrix.</p>
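The solved blocks \(P=A^{-1}\), \(Q=-A^{-1}CB^{-1}\), \(R=0\), \(S=B^{-1}\) can be verified numerically, including with a rank-deficient \(C\) (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 3, 2
A = rng.normal(size=(n, n))           # invertible with probability 1
B = rng.normal(size=(m, m))           # invertible with probability 1
C = np.zeros((n, m)); C[0, 0] = 1.0   # deliberately rank-deficient

X = np.block([[A, C],
              [np.zeros((m, n)), B]])
A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)
X_inv = np.block([[A_inv, -A_inv @ C @ B_inv],
                  [np.zeros((m, n)), B_inv]])
assert np.allclose(X @ X_inv, np.eye(n + m))   # X_inv really inverts X
```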
<h3 id="recursive-calculation">Recursive Calculation</h3>
<p>The <strong>block matrix calculation</strong> can be extended to be recursive. We can simply break down any submatrix into its block matrices and perform the same operation, until (if you so wish) you reach the individual element level.</p>
<p><img src="/assets/images/recursive-block-matrix-multiplication.png" alt="Recursive Block Matrix Multiplication" /></p>avishekWe consider the more frequently utilised viewpoints of matrix multiplication, and relate it to one or more applications where using a certain viewpoint is more useful. These are the viewpoints we will consider.Quadratic Optimisation using Principal Component Analysis as Motivation: Part Two2021-04-28T00:00:00+05:302021-04-28T00:00:00+05:30/2021/04/28/quadratic-optimisation-pca-lagrange-multipliers<p>We pick up from where we left off in <a href="{ %post_url 2021-04-19-quadratic-form-optimisation-pca-motivation-part-one%">Quadratic Optimisation using Principal Component Analysis as Motivation: Part One</a>. We treated <strong>Principal Component Analysis</strong> as an optimisation, and took a detour to build our geometric intuition behind <strong>Lagrange Multipliers</strong>, wading through its proof to some level.</p>
<p>We now have all the machinery we need to tackle the PCA problem directly. As we will see, the Lagrangian approach to optimisation maps to this problem very naturally. It will probably turn out to be slightly anti-climactic, because the concept of eigenvalues will fall out of this application quite naturally.</p>
<h2 id="eigenvalues-and-eigenvalues">Eigenvalues and Eigenvectors</h2>
<p>We glossed over <strong>Eigenvalues</strong> and <strong>Eigenvectors</strong> when we looked at PCA earlier. Any basic linear algebra text should be able to provide you the geometric intuition of what an eigenvector of a matrix \(A\) is, functionally.</p>
<p>If we consider the matrix \(A\) as a \(n \times n\) matrix which represents a mapping \(\mathbb{R}^n \rightarrow \mathbb{R}^n\) which transforms a vector \(\vec{v} \in \mathbb{R}^n\), then \(\vec{v}\) is an <strong>eigenvector</strong> of \(A\) if the following condition is satisfied:</p>
\[\mathbf{A\vec{v}=\lambda\vec{v}, \lambda \in \mathbb{R}}\]
<p>That is, regardless of what effect \(A\) has on other vectors (which are not collinear with \(\vec{v}\)), transforming the vector \(\vec{v}\) results in only a <strong>scaling</strong> of \(\vec{v}\) by a real number \(\lambda\). \(\lambda\) is called the corresponding eigenvalue of \(A\).</p>
<p>There can be multiple eigenvalue/eigenvector pairs for a matrix. For the purposes of this article, it suffices to state that <strong>Principal Components Analysis</strong> is one of the methods of determining these components of a matrix.</p>
<p>Well, let’s rephrase that. Since <strong>PCA works on the covariance matrix</strong>, the eigenvectors and eigenvalues are those of the covariance matrix, not the original matrix. However, that does not affect what we are aiming for, which is finding the <strong>principal axes of maximum variance</strong>.</p>
<h2 id="lagrange-multipliers-are-eigenvalues">Lagrange Multipliers are Eigenvalues</h2>
<p>Let us pick up the optimisation problem where we left off:</p>
<p><strong>Maximise \(X^T\Sigma X\) <br />
Subject to: \(X^TX=1\)</strong></p>
<p>We spoke of <strong>quadratic forms</strong> as well; this is clearly a quadratic form of the matrix \(\Sigma\). Armed with our knowledge of vector calculus, let us state the above problem in terms of geometry.</p>
<p><strong>Find the critical point \(X\) on \(f(X)=X^T\Sigma X:\mathbb{R}^n\rightarrow \mathbb{R}\) such that:</strong></p>
<ul>
<li><strong>\(X\) lies on the unit sphere</strong>, i.e., the manifold equation is \(\mathbf{g(X)=X^TX=1}\).</li>
</ul>
<p>This is what the solution space might look like in the two-dimensional case.</p>
<p><img src="/assets/images/candidate-axes-pca-with-constraints.png" alt="pca-axes-constraints" /></p>
<p>Let \(g(X)\) be the constraint manifold equation. We now wish to compute the derivatives of the cost function \(f(X)\) and the manifold equation \(g(X)\).</p>
<p>For the cost function \(f(X)\), we can write:</p>
\[D_Xf(X)=D[X^T\Sigma X] \\
= X^T\Sigma+{(\Sigma X)}^T \\
= X^T\Sigma+X^T\Sigma^T \\
=2X^T\Sigma\]
<p>The last step uses the symmetry of \(\Sigma\), i.e., \(\Sigma^T=\Sigma\). Taking the derivative of the manifold equation \(g(X)\), we get:</p>
\[D_Xg(X)=2X^T\]
<p>The Lagrange approach tells us that there exists a <strong>Lagrange Multiplier</strong> \(\lambda_1\) for which the following holds true:</p>
\[D_Xf(X)=\lambda_1 D_Xg(X) \\
\Rightarrow 2X^T\Sigma=\lambda_1 2X^T \\
\Rightarrow X^T\Sigma=\lambda_1 X^T\]
<p>Taking the transpose on both sides, and again using the symmetry of \(\Sigma\), we get:</p>
\[{(X^T\Sigma)}^T={(\lambda_1 X^T)}^T \\
\Rightarrow \mathbf{\Sigma X=\lambda_1 X}\]
<p>That’s right, <strong>eigenvalues are nothing but Lagrange Multipliers when optimising for Principal Components Analysis</strong>!</p>
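<p>To make this concrete, we can check numerically that the largest eigenvalue of a symmetric matrix is indeed the maximum of the quadratic form \(X^T\Sigma X\) over the unit circle (a NumPy sketch; \(\Sigma\) here is an arbitrary symmetric example, not a covariance matrix computed from data):</p>

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])  # an arbitrary symmetric (covariance-like) matrix

# eigh sorts eigenvalues in ascending order; the top eigenvector is the last column
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
top = eigenvectors[:, -1]

# Sample many unit vectors X and evaluate the quadratic form X^T Sigma X
thetas = np.linspace(0, 2 * np.pi, 10000)
X = np.stack([np.cos(thetas), np.sin(thetas)])    # shape (2, 10000), all unit norm
values = np.einsum('ik,ij,jk->k', X, Sigma, X)    # X^T Sigma X for each column

# The maximum of the quadratic form matches the largest eigenvalue,
# which is exactly the Lagrange multiplier of the constrained optimisation
assert abs(values.max() - eigenvalues[-1]) < 1e-3
assert abs(top @ Sigma @ top - eigenvalues[-1]) < 1e-9
```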
<p>We can go further, assuming there is more than one eigenvalue / eigenvector pair for matrix \(\Sigma\). Let us assume that \(\Sigma\) has two eigenvalues / eigenvectors. Let \(X_1\) be the first one, which we have already found. \(X_2\) is the second eigenvector. The necessary conditions for this eigenvector to exist are:</p>
<ul>
<li>\({X_2}^T{X_2}=1\), i.e., \(X_2\) exists on the constraint manifold of a unit circle.</li>
<li>\(X_2.X_1=0\), i.e., \(X_2\) is orthogonal to \(X_1\)</li>
</ul>
<p>The same argument as the first part of the proof holds, that is:</p>
\[D_Xf(X)=\mu X_1+\lambda_2 X_2 \\
\Sigma X_2=\mu X_1+\lambda_2 X_2\]
<p>We have introduced \(\mu\) as the multiplier for \(X_1\) because we do not know what its value will be; we need to determine it.
Taking the dot product with \(X_1\) on both sides, we can write:</p>
\[(\Sigma X_2).X_1=\mu X_1.X_1+\lambda_2 X_2.X_1\]
<p>Dot products are commutative, and we know that \(X_2.X_1=0\). Moreover, since \(\Sigma\) is symmetric, \((\Sigma X_2).X_1=X_2.(\Sigma X_1)=\lambda_1 X_2.X_1=0\), so the left hand side vanishes, and we can write:</p>
\[(\Sigma X_2).X_1=\mu X_1.X_1+\lambda_2 X_2.X_1 \\
\Rightarrow \mu X_1.X_1=0 \\
\Rightarrow \mu {\|X_1\|}^2=0 \\
\Rightarrow \mu=0\]
<p>Substituting \(\mu=0\) back into the original Lagrange Multipliers equation, we get:</p>
\[\mathbf{
\Sigma X_2=\lambda_2 X_2
}\]
<p>You can repeat this proof for every eigenvector.</p>
<h2 id="spectral-theorem-of-matrices">Spectral Theorem of Matrices</h2>
<p>You may not realise it, but we have also proved an important theorem of Linear Algebra, namely, the <strong>Spectral Theorem of Matrices</strong>, in this case, for symmetric matrices. It states a few things, two of which we have already proved.</p>
<ul>
<li>
\[\mathbf{A\vec{v}=\lambda\vec{v}}: \lambda \in \mathbb{R}, \vec{v} \in \mathbb{R}^n, A^T=A, A \in \mathbb{R}^{n \times n}\]
</li>
<li>For every <strong>symmetric matrix \(A\)</strong>, <strong>there exists a decomposition \(UDU^T\)</strong>, where the <strong>columns of \(U\) are the eigenvectors</strong> of \(A\), and \(D\) is a diagonal matrix whose <strong>diagonal entries are the corresponding eigenvalues</strong>.</li>
<li>For <strong>non-symmetric but diagonalisable matrices</strong> (those with \(n\) linearly independent eigenvectors), the <strong>factorisation \(A=UDU^{-1}\) still exists</strong>, though \(U\) is no longer guaranteed to be orthonormal.</li>
</ul>
<p>To see why the second statement is true, write out \(U\) and \(D\) as:</p>
\[U=\begin{bmatrix}
\vert && \vert && \vert && ... && \vert \\
v_1 && v_2 && v_3 && ... && v_n \\
\vert && \vert && \vert && ... && \vert
\end{bmatrix} \\
\\
D=\begin{bmatrix}
\lambda_1 && 0 && 0 && ... && 0 \\
0 && \lambda_2 && 0 && ... && 0 \\
0 && 0 && \lambda_3 && ... && 0 \\
\vdots && \vdots && \vdots && \ddots && \vdots \\
0 && 0 && 0 && ... && \lambda_n \\
\end{bmatrix}\]
<p>Then, if we multiply them, we get:</p>
\[UD=\begin{bmatrix}
\vert && \vert && \vert && ... && \vert \\
\lambda_1 v_1 && \lambda_2 v_2 && \lambda_3 v_3 && ... && \lambda_n v_n \\
\vert && \vert && \vert && ... && \vert
\end{bmatrix} \\
=\begin{bmatrix}
\vert && \vert && \vert && ... && \vert \\
A v_1 && A v_2 && A v_3 && ... && A v_n \\
\vert && \vert && \vert && ... && \vert
\end{bmatrix} \\ = AU\]
<p>Thus, we have the identity: \(AU=UD\). Multiplying both sides on the right by \(U^{-1}\), and remembering that for orthonormal matrices, \(X^{-1}=X^T\), we get:</p>
\[AUU^{-1}=UDU^{-1} \\
\Rightarrow AI=UDU^T \\
\Rightarrow \mathbf{A=UDU^T}\]
<p>This proves that <strong>for every symmetric matrix \(A\), there exists a decomposition \(UDU^T\), where the columns of \(U\) are the eigenvectors of \(A\), and \(D\) is a diagonal matrix, whose diagonal entries are the corresponding eigenvalues of \(A\)</strong>.</p>
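<p>A short numerical illustration of this decomposition (a NumPy sketch with an arbitrary symmetric matrix):</p>

```python
import numpy as np

# An arbitrary symmetric matrix
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigenvalues, U = np.linalg.eigh(A)  # columns of U are the eigenvectors of A
D = np.diag(eigenvalues)            # diagonal matrix of the corresponding eigenvalues

# For a symmetric A: A = U D U^T, and U is orthonormal (U^T = U^{-1})
assert np.allclose(U @ D @ U.T, A)
assert np.allclose(U.T @ U, np.eye(3))
```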
<p>We have established a deep connection between <strong>Lagrange Multipliers</strong> and <strong>eigenvalues</strong> of a matrix. However, <strong>Quadratic Programming</strong> covers some more material, which will be relevant when completing the derivation of the <strong>Support Vector Machine</strong> equations. This will be covered in an upcoming post.</p>
<h2 id="supplementary-material">Supplementary Material</h2>
<h3 id="1-proof-that-x-1xt-for-orthonormal-matrices">1. Proof that \(X^{-1}=X^T\) for orthonormal matrices</h3>
<p><strong>Quick Note</strong>: When applied to matrices, <strong>orthogonal</strong> and <strong>orthonormal</strong> mean the same thing: an orthogonal matrix has orthonormal columns.</p>
<p>Suppose we have an orthonormal matrix \(U\) like so:</p>
\[U=\begin{bmatrix}
\vert && \vert && \vert && ... && \vert \\
x_1 && x_2 && x_3 && ... && x_n \\
\vert && \vert && \vert && ... && \vert
\end{bmatrix} \\
U^T=\begin{bmatrix}
--- && x_1 && --- \\
--- && x_2 && --- \\
--- && x_3 && --- \\
\hspace{2cm} && \vdots && \hspace{2cm} \\
--- && x_n && --- \\
\end{bmatrix}\]
<p>Then, multiplying the two, the following identity holds:</p>
\[{(U^TU)}_{ij}=0, \quad i\neq j \\
{(U^TU)}_{ij}=1, \quad i=j \text{ (orthonormality)}\]
\[U^TU=I=U^{-1}U \\
\Rightarrow U^TU=U^{-1}U \\
\Rightarrow \mathbf{U^T=U^{-1}}\]avishekWe pick up from where we left off in Quadratic Optimisation using Principal Component Analysis as Motivation: Part One. We treated Principal Component Analysis as an optimisation, and took a detour to build our geometric intuition behind Lagrange Multipliers, wading through its proof to some level.Vector Calculus: Lagrange Multipliers, Manifolds, and the Implicit Function Theorem2021-04-24T00:00:00+05:302021-04-24T00:00:00+05:30/2021/04/24/vector-calculus-lagrange-multipliers<p>In this article, we finally put all our understanding of <strong>Vector Calculus</strong> to use by showing why and how <strong>Lagrange Multipliers</strong> work. We will be focusing on several important ideas, but the most important one is around the <strong>linearisation of spaces at a local level</strong>, which might not be smooth globally. The <strong>Implicit Function Theorem</strong> will provide a strong statement around the conditions necessary to satisfy this.</p>
<p>We will then look at <strong>critical points</strong>, and how constraining them to a manifold naturally leads to the condition that the <strong>normal vector of the curve to be optimised, must also be normal to the tangent space of the manifold</strong>.</p>
<p>We will then restate this in terms of <strong>Lagrange multipliers</strong>.</p>
<p><strong>Note</strong>: This article pulls a lot of understanding together, so be sure to have understood the material in <a href="/2021/04/20/vector-calculus-simple-manifolds.html">Vector Calculus: Graphs, Level Sets, and Linear Manifolds</a>, before delving into this article, or you might get thoroughly confused.</p>
<h2 id="definition-of-a-manifold">Definition of a Manifold</h2>
<p>Let us make precise the definition of a manifold now.</p>
<p>We’ve looked at a system of \(k\) linear equations of \(n\) variables, where \(n-k\) variables were expressed in terms of \(k\) independent variables. If we remove the requirement of these equations being linear, then the solution space, i.e., the set of points \((x_1, x_2,..., x_n)\) which satisfy this system, constitutes a <strong>k-manifold</strong> in \(\mathbb{R}^n\).</p>
<ul>
<li>For a single equation, a manifold is simply the graph of that function.</li>
<li>For multiple equations, a manifold is essentially the set of points which satisfy all of those equations. For example:
<ul>
<li>If we had two equations of intersecting lines in \(\mathbb{R}^2\), then the manifold would simply be the point of intersection.</li>
<li>If we had two equations of intersecting planes in \(\mathbb{R}^3\), the manifold would be the line of intersection of those two planes.</li>
<li>All vector subspaces are manifolds.</li>
</ul>
</li>
</ul>
<p><img src="/assets/images/manifold-examples.png" alt="Examples of Manifolds" /></p>
<p>The structure of this article follows this sequence.</p>
<ul>
<li><strong>Constrained Critical Points</strong> in Two Dimensions</li>
<li><strong>Constrained Critical Points</strong> in the <strong>General Case</strong></li>
<li><strong>Lagrangian</strong> Formulation and <strong>Proof</strong></li>
<li>Extension to <strong>Nonlinear Constraints</strong>: Implicit Function Theorem</li>
</ul>
<p>We start with two important preliminaries.</p>
<ul>
<li>Function Composition</li>
<li>Chain Rule and Functions as Objects</li>
</ul>
<h2 id="preliminary-function-composition-and-functions-as-objects">Preliminary: Function Composition and Functions as Objects</h2>
<p>Let us assume that:</p>
\[f(x,y)=xy \\
y=g(x)=x^2\]
<p>Then the notation \(\mathbf{F=f\circ g}\) represents function composition: \(F\) is a function which has the same output as \(f(x,g(x))\) (in this example). Also note that \(F\) is a function of only \(x\). In texts, \(f(x,y)\) is written as \(f(x,g(x))\) and is equivalent to the above form.</p>
<p>Do not let the fact that there is a function in the parameter of \(f\) confuse you; treat it as you would any other variable. If you know the actual expression \(f(x,g)\), you can differentiate with respect to \(g\) if needed. After all, \(g(x)\) is essentially \(y\).</p>
<p>Writing it as \(f(x,g(x))\) is notational shorthand for expressing that \(y\) is not really a free variable, it is expressed in terms of \(x\). Moreover \(D_xF(x)\) is the same thing as writing \(Df(x,g(x))\).</p>
<p>We can borrow our intuition of <strong>function pipelines in programming</strong> to make sense of this: an input \(x\) enters \(g(x)\), comes out as some output, which is then fed to the \(y\) parameter of \(f(x,y)\). The \(x\) parameter is already available, so it gets applied for free (actually, while programming, you can’t say things like “gets applied for free”, you actually have to do the necessary plumbing to allow \(x\) to reach \(f(x,y)\)).</p>
<p><strong>Note that the composite function \(F(x)\) takes in only one input, \(x\).</strong> This is because the first function that is applied is \(g(x)\). You do not need to specify a \(y\) – in fact, you should not – because the value of \(y\) is constrained to be (in this instance) \(x^2\).</p>
<p>You can do <strong>symbolic manipulation</strong> to directly substitute \(g(x)\) into \(F(x)\) to get \(F(x)=x.x^2=x^3\). Either way, that’s how function composition works. We introduce this because it will be used to build in the constraints for our optimisation problem, and we will see how <strong>taking derivatives of composed functions translates to the dot product of linear transformations</strong>.</p>
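<p>A minimal sketch of this composition in Python, using the running example \(f(x,y)=xy\) and \(g(x)=x^2\):</p>

```python
def f(x, y):
    return x * y          # f(x, y) = xy

def g(x):
    return x ** 2         # y = g(x) = x^2

def F(x):
    # F = f o g: only x is free; y is constrained to be g(x)
    return f(x, g(x))

# F(x) = x * x^2 = x^3, as the symbolic substitution predicts
assert F(2) == 8
assert all(F(x) == x ** 3 for x in range(-5, 6))
```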
<h2 id="preliminary-chain-rule">Preliminary: Chain Rule</h2>
<p>Let us continue with the above example. Assume that:</p>
\[f(x,y)=xy \\
y=g(x)=x^2\]
<p>Then the notation \(F=f\circ g\) represents function composition, where \(F(x)=f(x,g(x))\), as we have already stated.</p>
<p>Now, if we wanted to find \(D_xf(x,g(x))\), it is trivial to see that substituting \(x^2\) for \(y\) in \(f(x,y)\) gives us \(F(x)=x^3\), therefore:</p>
\[D_xf(x,g(x))=3x^2\]
<p>However, let’s look at the <strong>Chain Rule</strong> of differentiation for the above \(f(x,y)\), because in our proofs, the actual form of \(f(x,y)\) and \(g(x)\) will not be available, and thus we will have to use the Chain Rule to express any results. We have only <strong>one free variable</strong> for this composite function, i.e., \(x\), so we may write:</p>
\[D_xf(x,g)=\frac{df(x, g)}{d{[x\hspace{3mm} g]}^T}.\frac{d{[x\hspace{3mm} g]}^T}{dx}\]
<p>Let’s denote \(\Phi (x)=\begin {bmatrix}x \\ g\end{bmatrix}\). Since \({[x \hspace{3mm} g]}^T\) is a vector, we are partially differentiating \(f(x,y)\) with respect to each of its components. In our example above, we can write for the <strong>first term</strong>, and substitute:</p>
\[\frac{df(x,g)}{d[x \hspace{3mm} g]}=\left[\frac{\partial f(x,g)}{\partial x} \hspace{3mm} \frac{\partial f(x,g)}{\partial g} \right] \\
\Rightarrow D_xf(x,y)=\left[\frac{\partial f(x,g)}{\partial x} \hspace{3mm} \frac{\partial f(x,g)}{\partial g} \right].\frac{d\Phi (x)}{dx}\]
<p>For the <strong>second term</strong>, we may write:</p>
\[\frac{d{[x \hspace{3mm} g]}^T}{dx}=\begin{bmatrix}
1 \\
\frac{dg}{dx}
\end{bmatrix} \\
\Rightarrow
D_xf(x,g)=\left[\frac{\partial f(x,g)}{\partial x} \hspace{3mm} \frac{\partial f(x,g)}{\partial g} \right].\begin{bmatrix}
1 \\
\frac{dg}{dx}
\end{bmatrix} \\
= \frac{\partial f(x,g)}{\partial x} + \frac{\partial f(x,g)}{\partial g}\frac{dg}{dx}\]
<p>Only now do we need to unroll \(g(x)\) to look at the specific form this derivative takes. We can write the following:</p>
\[\frac{dg}{dx}=2x \\
\frac{\partial f(x,g)}{\partial x}=g \\
\frac{\partial f(x,g)}{\partial g}=x\]
<p>Substituting these values back into the expression for \(D_xF(x)\), we get:</p>
\[D_xf(x,g)=g+x.2x=g(x)+2x^2 \\
= x^2+2x^2 \\
= 3x^2\]
<p>Yes, that was a complicated way of computing the same result we got earlier, but I want you to see the mechanics involved in applying the <strong>Chain Rule</strong> in the case of partial derivatives.</p>
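<p>We can also verify the chain-rule result \(D_xF(x)=3x^2\) with a central-difference approximation (a plain Python sketch of the same running example):</p>

```python
def f(x, y):
    return x * y          # f(x, y) = xy

def g(x):
    return x ** 2         # y = g(x) = x^2

def F(x):
    return f(x, g(x))     # the composite F = f o g

def numerical_derivative(func, x, h=1e-6):
    # Central-difference approximation of the derivative
    return (func(x + h) - func(x - h)) / (2 * h)

# Chain rule gave D_x F = df/dx + (df/dg)(dg/dx) = g(x) + x * 2x = 3x^2
for x in [0.5, 1.0, 2.0, -3.0]:
    assert abs(numerical_derivative(F, x) - 3 * x ** 2) < 1e-4
```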
<h2 id="constrained-critical-points-in-two-dimensions">Constrained Critical Points in Two Dimensions</h2>
<p>The previous example can be reused to illustrate the concept of constrained critical points in two dimensions. Let us revisit the intermediate expression we used, namely:</p>
\[D_xf(x,g)=\left[\frac{\partial f(x,g)}{\partial x} \hspace{3mm} \frac{\partial f(x,g)}{\partial g} \right].\frac{d\Phi (x)}{dx}\]
<p>If you notice carefully, <strong>the expression \(\left[\frac{\partial f(x,g)}{\partial x} \hspace{3mm} \frac{\partial f(x,g)}{\partial g} \right]\) exactly represents the gradient operator \(\nabla f(x,g(x))\)</strong>.</p>
<p>Furthermore, look at \(\frac {d\Phi (x)}{dx}=\begin{bmatrix}1 \\ \frac{dg}{dx}\end{bmatrix}\). We can recognise this as the <strong>direction vector of the tangent line with slope \(\mathbf{g'(x)}\)</strong>. This is because any vector on the tangent can be represented as \(\begin{bmatrix}1 \\ \frac{dg}{dx}\end{bmatrix}.t\).</p>
<p>As a consequence, <strong>this is the tangent space of \(g(x)\)</strong>. If we represent \(\frac {d\Phi (x)}{dx}\) as \(T_x\) (the tangent space), we can rewrite the identity as:</p>
<p><strong>\(D_xf(x,g)=\nabla f(x,g(x)).T_x\)</strong></p>
<p><strong>Important Note</strong>: Note that \(T_x\) is <strong>not</strong> the normal vector to the tangent, but the actual vector along the tangent.</p>
<p>The picture below shows the situation. The function \(f(x,y)=xy\) is the function to be optimised. However, setting \(y=x^2\), and substituting it into \(f(x,y)\) so that it becomes \(f(x,g(x))\) immediately constrains the y-coordinate to always be such that for any \(x\), the point is always forced to move along the curve \(y=x^2\), regardless of which level set of \(f(x,g(x))\) is chosen.</p>
<p><img src="/assets/images/constrained-critical-points-2d.png" alt="Constrained Critical Points" /></p>
<p><strong>Note</strong>: The above example isn’t the best example, because attempting to find a constrained critical point in this situation yields only the degenerate point \((0,0)\) (an inflection of \(x^3\), neither a maximum nor a minimum), but the identities we derive here still hold. We solve a more feasible problem next.</p>
<p>If we have a point \(P\) which satisfies \(g(x)\), i.e., has the coordinates \((x_0, g(x_0))\), then the following holds:</p>
<p><strong>\(D_xf(P)=\nabla f(P).T_x\)</strong></p>
<p><strong>The above expression represents the dot product of the gradient vector and the tangent vector at a point P which exists on the curve of the function defined by \(f(x,y)=xy\) and satisfies the constraint \(g(x)=x^2\).</strong></p>
<p>Let us look at a problem with a proper solution.</p>
\[f(x,y)=xy \\
y=g(x)=4-x\]
<p>Let’s use the result we derived above. We have:</p>
\[\nabla f=\begin{bmatrix}y && x\end{bmatrix} \\
\frac{d\Phi}{dx}=\begin{bmatrix}1 \\ -1\end{bmatrix}\]
<p>Multiplying the two, we get:</p>
\[D_xf=\nabla f.T_x=\begin{bmatrix}y && x\end{bmatrix}.\begin{bmatrix}1 \\ -1\end{bmatrix}=y-x=(4-x)-x=4-2x\]
<p>Setting the above to zero, we get:</p>
\[4-2x=0 \\
\Rightarrow x=2 \\
\Rightarrow y=2\]
<p>which is the solution we seek. Substituting \(x=2\) back into \(f(x,y)=xy\) gives us the correct level set, i.e., \(xy=4\). The solution is shown below.</p>
<p><img src="/assets/images/parabola-straight-line-constrained-critical-point.png" alt="Point Manifold" /></p>
<p>The constrained critical point is \((2,2)\). <strong>Note that this is different from finding an intersection between two curves.</strong> There are an infinite number of \(xy=C\) equations which can intersect with \(y=4-x\). For example, \(xy=2\) intersects with the constraint line in two places, but that does not maximise the value of \(xy\).</p>
<p>You will have also noticed that the curve containing the constrained critical point is tangent to the constraint curve (straight line, in this case). This is not a coincidence, as we will see when we get to the generalised, higher-dimensional case.</p>
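<p>The worked example (maximise \(xy\) subject to \(y=4-x\)) can also be checked numerically, by sweeping along the constraint line (a NumPy sketch):</p>

```python
import numpy as np

x = np.linspace(0.0, 4.0, 100001)   # sweep x along the constraint line
y = 4.0 - x                          # the constraint y = 4 - x
f = x * y                            # objective f(x, y) = xy on the manifold

i = np.argmax(f)
# The constrained maximum is at (2, 2), on the level set xy = 4
assert abs(x[i] - 2.0) < 1e-3
assert abs(y[i] - 2.0) < 1e-3
assert abs(f[i] - 4.0) < 1e-6
```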
<p>The above simple two-dimensional case will serve as our starting point. We now generalise in two conceptual directions.</p>
<h2 id="generalisation-to-multiple-constraints">Generalisation to Multiple Constraints</h2>
<p>Recall what we spoke about systems of linear equations in <a href="/2021/04/20/vector-calculus-simple-manifolds.html">Vector Calculus: Graphs, Level Sets, and Linear Manifolds</a>. Specifically, <strong>if there are \(n\) variables, and \(n-k\) equations, we can parametrically specify \(n-k\) variables in terms of the other \(k\) free variables</strong>.</p>
<p>We can generalise the above dot product derivation for this general case. Before we do this, in order to avoid the ugly-looking \(n-k\) expression, we restate the above as:</p>
<p><strong>If there are \(N\) variables, and \(n\) equations, we can parametrically specify \(n\) variables in terms of the other \(m\) free variables.</strong> Also, obviously, \(m+n=N\). Let the independent variables be \(U=(u_1, u_2, u_3,...,u_m)\), and the dependent variables be \(V=(v_1, v_2, v_3,...,v_n)\).</p>
<p>Now we write:</p>
\[G(U)=\begin{bmatrix}
u_1 \\
u_2 \\
u_3 \\
\vdots \\
u_m \\
\phi_1(u_1, u_2, u_3, ..., u_m) \\
\phi_2(u_1, u_2, u_3, ..., u_m) \\
\phi_3(u_1, u_2, u_3, ..., u_m) \\
\vdots \\
\phi_n(u_1, u_2, u_3, ..., u_m)
\end{bmatrix}
=\begin{bmatrix}
u_1 \\
u_2 \\
u_3 \\
\vdots \\
u_m \\
\phi_1 \\
\phi_2 \\
\phi_3 \\
\vdots \\
\phi_n
\end{bmatrix} \\
\\
\Rightarrow G(U)=(U,V) \\
f(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n)\]
<p>You should be able to recognise both of the above as the <strong>higher-dimensional analogs of the simple two-dimensional case</strong> we looked at earlier.</p>
\[F(u_1, u_2, u_3, ..., u_m)=f \circ G \\
D_UF=\frac{\partial f}{\partial (u_1, u_2, u_3, ..., u_m)} \\
=\frac{\partial f}{\partial (u_1, u_2, u_3, ..., u_m, \phi_1, \phi_2, \phi_3, ..., \phi_n)}.\frac{\partial G}{\partial (u_1, u_2, u_3, ..., u_m)}\]
<p><strong>The first term immediately reduces to \(D_{(U,V)}f\)</strong>, where \(U=(u_1, u_2, u_3, ..., u_m)\). Let’s look at the second term, because that is going to take the derivative of \(G\), which is no longer a simple function, but a <strong>matrix of functions</strong>.</p>
\[\frac{\partial G}{\partial (u_1, u_2, u_3, ..., u_m)}=
\begin{bmatrix}
\frac{\partial u_1}{\partial u_1} && \frac{\partial u_1}{\partial u_2} && \frac{\partial u_1}{\partial u_3} && ... && \frac{\partial u_1}{\partial u_m} \\
\frac{\partial u_2}{\partial u_1} && \frac{\partial u_2}{\partial u_2} && \frac{\partial u_2}{\partial u_3} && ... && \frac{\partial u_2}{\partial u_m} \\
\frac{\partial u_3}{\partial u_1} && \frac{\partial u_3}{\partial u_2} && \frac{\partial u_3}{\partial u_3} && ... && \frac{\partial u_3}{\partial u_m} \\
\vdots && \vdots && \vdots && \vdots && \vdots \\
\frac{\partial u_m}{\partial u_1} && \frac{\partial u_m}{\partial u_2} && \frac{\partial u_m}{\partial u_3} && ... && \frac{\partial u_m}{\partial u_m} \\
\\
\frac{\partial \phi_1}{\partial u_1} && \frac{\partial \phi_1}{\partial u_2} && \frac{\partial \phi_1}{\partial u_3} && ... && \frac{\partial \phi_1}{\partial u_m} \\
\frac{\partial \phi_2}{\partial u_1} && \frac{\partial \phi_2}{\partial u_2} && \frac{\partial \phi_2}{\partial u_3} && ... && \frac{\partial \phi_2}{\partial u_m} \\
\frac{\partial \phi_3}{\partial u_1} && \frac{\partial \phi_3}{\partial u_2} && \frac{\partial \phi_3}{\partial u_3} && ... && \frac{\partial \phi_3}{\partial u_m} \\
\vdots && \vdots && \vdots && \vdots && \vdots \\
\frac{\partial \phi_n}{\partial u_1} && \frac{\partial \phi_n}{\partial u_2} && \frac{\partial \phi_n}{\partial u_3} && ... && \frac{\partial \phi_n}{\partial u_m}
\end{bmatrix} \\\]
<p>Yes, that is a lot of partial derivatives. But, as you can guess, most of this will be dramatically simplified. Note the first \(m\) rows: <strong>all but one column in each of those \(m\) rows will become zero</strong>. This simplifies to:</p>
\[\frac{\partial G}{\partial (u_1, u_2, u_3, ..., u_m)}=
\begin{bmatrix}
1 && 0 && 0 && ... && 0 \\
0 && 1 && 0 && ... && 0 \\
0 && 0 && 1 && ... && 0 \\
\vdots && \vdots && \vdots && \vdots && \vdots \\
0 && 0 && 0 && ... && 1 \\
\\
\frac{\partial \phi_1}{\partial u_1} && \frac{\partial \phi_1}{\partial u_2} && \frac{\partial \phi_1}{\partial u_3} && ... && \frac{\partial \phi_1}{\partial u_m} \\
\frac{\partial \phi_2}{\partial u_1} && \frac{\partial \phi_2}{\partial u_2} && \frac{\partial \phi_2}{\partial u_3} && ... && \frac{\partial \phi_2}{\partial u_m} \\
\frac{\partial \phi_3}{\partial u_1} && \frac{\partial \phi_3}{\partial u_2} && \frac{\partial \phi_3}{\partial u_3} && ... && \frac{\partial \phi_3}{\partial u_m} \\
\vdots && \vdots && \vdots && \vdots && \vdots \\
\frac{\partial \phi_n}{\partial u_1} && \frac{\partial \phi_n}{\partial u_2} && \frac{\partial \phi_n}{\partial u_3} && ... && \frac{\partial \phi_n}{\partial u_m}
\end{bmatrix} \\\]
<p>To simplify notation further, we can collapse a lot of the above:</p>
\[\frac{\partial G}{\partial U}=
\begin{bmatrix}
I_{m \times m} \\
{\phi'(U)}_{n \times m}
\end{bmatrix}\]
<p>where \(I_{m \times m}\) stands for an \(m \times m\) identity matrix. Plugging this back into the equation for \(D_UF\), we get:</p>
\[\mathbf{
D_UF=D_{(U,V)}f_{1 \times (m+n)}.\begin{bmatrix}
I_{m \times m} \\
{\phi'(U)}_{n \times m}
\end{bmatrix} \\
= \nabla f.T_X
}\]
<p>where \(\mathbf{
T_X=\begin{bmatrix}
I_{m \times m} \\
{\phi'(U)}_{n \times m}
\end{bmatrix}
}\)</p>
<p>This is still in the same form as the simple case that we described above. The above resolves to a \(1 \times m\) matrix. It will be instructive to study the columns of this matrix \(T_X\).</p>
<p>The left expression is simply the <strong>gradient vector</strong>, i.e., the vector normal to the level surface of \(f(U,V)\), which is a \(1 \times (m+n)\) row vector.
What can we say about the columns of \(T_X\)? Let’s look at the first column. It is:</p>
\[T_{X1}=\begin{bmatrix}
1 \\
0 \\
\vdots \\
0_m \\
{\phi'}_1 \\
{\phi'}_2 \\
\vdots \\
{\phi'}_n \\
\end{bmatrix}\]
<p>This represents the <strong>parametric form</strong> of a vector in the <strong>tangent space</strong> of the manifold. Just like the \(y=x^2\) case, where we had the parametric tangent vector \(\begin{bmatrix}
1 \\
2x
\end{bmatrix}\), this one tells us how much the vector will change for a unit change along the \(u_1\) basis vector. Remember, we had \(n\) constraint equations, so all tangent vectors can be expressed as a combination of \(m\) linearly independent vectors, and \(u_1\) is one of them.</p>
<p>So, <strong>each entry in the \(1 \times m\) output represents the dot product between the gradient vector and one of the \(m\) tangent vectors</strong>.</p>
<h2 id="optimising-the-objective-function">Optimising the Objective Function</h2>
<p>Let’s take a step back and look at what we have done from a big-picture perspective. We have a function \(f\) of \(m+n\) variables that we’d like to optimise, subject to \(n\) constraints, expressed as equations. <strong>We took those constraints, and solved the linear system of equations to end up with \(n\) variables being expressed as a linear combination of \(m\) linearly independent vectors.</strong></p>
<p>These \(m\) vectors are all that are needed to completely determine the tangent space of the <strong>constraint manifold</strong>. They are <strong>vectors in the tangent space</strong> because they are vectors expressed as linear functions with the weights being the slopes of the constraint equations.</p>
<p>Taking the composite function \(f \circ G\) allows us to change the problem from a <strong>constrained problem to an unconstrained optimisation problem</strong>, because the <strong>constraints are already expressed between the relationships of the \(U\) set and \(V\) sets of variables</strong>.</p>
<p>In calculus, to find the critical point, we need to take the derivative and set it to zero. This may be a <strong>maximum</strong> or a <strong>minimum</strong>, and that usually depends upon what the <strong>second derivative</strong> looks like, but we will postpone discussion for later.</p>
<p>The output of \(\mathbf{D_UF=D_{(U,V)}f.T_X}\) is a \(1 \times m\) vector, which we’d like to set to zero. This also implies an important result: <strong>at the critical point, the gradient vector is perpendicular to every tangent vector</strong>. This can also be restated as: <strong>the tangent space (the space spanned by the \(m\) tangent vectors) belongs to the kernel of \(\nabla f\)</strong>. Note that I did not say that it <strong>is</strong> the kernel of \(\nabla f\), merely that it <strong>belongs</strong> to that kernel.</p>
<p>Now, usually when we attempt to find the optimum point on a function (in this case, say \(f(U,V)\)), we would want to take its derivative and set it to zero. However, <strong>in the presence of other constraints</strong>, the point that we seek is not necessarily the global maximum/minimum, since that point (or those points) are <strong>not necessarily guaranteed to satisfy the constraints simultaneously</strong>. <strong>We still want a maximum/minimum, but we also want it to live on the constraint manifold.</strong> What we can say is that the <strong>directional derivative</strong> of the function \(f(U,V)\) in the direction of any vector in its tangent space will go to zero.</p>
<p>Restated another way, <strong>the gradient normal vector of the function \(F(U,V)\) is orthogonal to every vector in the tangent space of \(F(U,V)\)</strong>. Since orthogonality implies a dot product of zero, given the constraints we have, we can write the following condition as necessary for finding a <strong>critical point</strong>:</p>
\[\mathbf{
D_{(U,V)}f=\nabla f.T_X=0
}\]
<p>where:</p>
<ul>
<li>\(\mathbf{\nabla f}\) is \(\mathbf{1 \times (m+n)}\)</li>
<li>\(\mathbf{T_X}\) is \(\mathbf{(m+n)\times m}\)</li>
<li>\(\mathbf{D_{(U,V)}f}\) is \(\mathbf{1 \times m}\).</li>
</ul>
<p>The figure below shows a simplified situation.</p>
<p><img src="/assets/images/orthogonal-gradient-vector-tangent-space.png" alt="Gradient Normal Vector orthogonal to Tangent Space" /></p>
<h2 id="proof-of-lagrange-multipliers">Proof of Lagrange Multipliers</h2>
<p>We are about three-quarters of the way done. <strong>We have proved that the tangent space belongs to the kernel (null space) of the gradient vector.</strong> But we haven’t gotten to proving the assertion about <strong>Lagrange Multipliers</strong> yet. What we really need to prove is that the <strong>gradient vector can be expressed as linear combinations of the vectors in tangent space</strong>, which will lead us directly to the conclusion we are hoping to prove.</p>
<p>We need to express another identity, using the <strong>level sets of the original constraint functions</strong> themselves. If you remember, \(G\) has been derived through row reduction techniques from the original \(n\) constraint functions. Let’s call them \(h_i\), and define them as below:</p>
\[h_1(u_1,u_2,u_3,...,u_m,v_1,v_2,v_3,...,v_n)=c_1 \\
h_2(u_1,u_2,u_3,...,u_m,v_1,v_2,v_3,...,v_n)=c_2 \\
h_3(u_1,u_2,u_3,...,u_m,v_1,v_2,v_3,...,v_n)=c_3 \\
\vdots \\
h_n(u_1,u_2,u_3,...,u_m,v_1,v_2,v_3,...,v_n)=c_n\]
<p>More generally, we write:
\(h_i(U,V)=c_i\)</p>
<p>Taking the derivative, and remembering that \(U=(u_1, u_2, u_3,...,u_m)\), \(V=(v_1, v_2, v_3,...,v_n)\) are shorthands for the reams of variables that I’d like to not write:</p>
\[\frac{\partial h_i(U,V)}{\partial U}=0 \\
\Rightarrow \frac{\partial h_i(U,V)}{\partial (U,V)}.\frac{\partial G(U)}{\partial U}=0 \\
\Rightarrow \mathbf{D_{(U,V)}h_i.T_X=0}\]
<p>If we define:</p>
\[H=\begin{bmatrix}
h_1 \\
h_2 \\
\vdots \\
h_n \\
\end{bmatrix}\]
<p>We can write:</p>
\[\mathbf{D_{(U,V)}H.T_X=0}\]
<p>Do check that the indexes match: \(DH\) is \(n \times (m+n)\), and \(T_X\) is \((m+n) \times m\), so yes, they are compatible.</p>
<p>We now have these two identities:</p>
\[\mathbf{
D_{(U,V)}f.T_X=\nabla f.T_X=0 \\
D_{(U,V)}H.T_X=0
}\]
<p>Let’s get rid of some extraneous notation to get:</p>
\[\mathbf{
Df.T_X=0 \\
DH.T_X=0
}\]
<p>This implies that:</p>
\[C(T_X) \subset N(Df) = {R(Df)}^{\perp}\\
C(T_X) \subset N(DH) = {R(DH)}^{\perp}\]
<p>Let us make some observations on the ranks of these matrices:</p>
<ul>
<li>\(T_X\) is \((m+n) \times m\), but has rank \(m\). \(Df\) is \(1\times (m+n)\), so it can have at most rank 1.</li>
<li>\(T_X\) is \((m+n) \times m\), but has rank \(m\). \(DH\) is \(n \times (m+n)\), so its maximum column/row rank is \(n\). Then, by the Rank-Nullity Theorem, its null space has dimension \(m\).</li>
</ul>
<p>\(C(T_X)\) and \({R(DH)}^{\perp}\) have the same dimension \(m\), and \(C(T_X) \subset {R(DH)}^{\perp}\); thus they are equal. This implies that:</p>
\[{R(DH)}^{\perp} \subset {R(Df)}^{\perp}\]
<p>By the <strong>Subset Rule</strong>, we can say:</p>
\[{({R(DH)}^{\perp})}^{\perp} \supset {({R(Df)}^{\perp})}^{\perp} \\
\Rightarrow R(DH) \supset R(Df)\]
<p>Check the indexes again:</p>
<ul>
<li>\(DH\) is \(n \times (m+n)\), so \(n\) row vectors of length \((m+n)\) each.</li>
<li>\(Df\) is \(1 \times (m+n)\), so 1 row vector of length \((m+n)\).</li>
</ul>
<p>This implies that the row span of \(Df\) is contained within the row span of \(DH\). To put it another way:</p>
<p><strong>The row vector of \(Df\) can be expressed as a linear combination of the row vectors of \(DH\).</strong></p>
<p>Thus, we can write:</p>
\[\mathbf{
Df=\lambda_1 Dh_1(U,V)+\lambda_2 Dh_2(U,V)+\lambda_3 Dh_3(U,V)+...+\lambda_n Dh_n(U,V)
}
\\
\square\]
<p>The weights of this linear combination are called <strong>Lagrange Multipliers</strong>.
We can simplify this notationally to:</p>
\[\mathbf{
\nabla f={[\nabla H]}^T\lambda
}\]
<p>where:</p>
<ul>
<li>\(\nabla f\) is \((m+n) \times 1\) (1 function, partial derivatives in \(m+n\) variables)</li>
<li>\(\nabla H\) is \(n \times (m+n)\) (\(n\) equations, partial derivatives in \(m+n\) variables)</li>
<li>\(\lambda\) is \(n\times 1\) (\(n\) Lagrange multipliers)</li>
</ul>
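To make the multiplier condition concrete, here is a small numerical sanity check (the particular \(f\) and single constraint \(h\) below are my own choices, not from the derivation above), verifying \(\nabla f = \lambda \nabla h\) at a known constrained minimum:

```python
import numpy as np

# Minimise f(x, y) = x^2 + y^2 subject to h(x, y) = x + y - 2 = 0.
# The constrained minimum is at (1, 1); the gradients must be parallel there.
x, y = 1.0, 1.0
grad_f = np.array([2 * x, 2 * y])    # gradient of f at the optimum
grad_h = np.array([1.0, 1.0])        # gradient of the constraint h

# Solve grad_f = lambda * grad_h for the single multiplier lambda.
lam, *_ = np.linalg.lstsq(grad_h.reshape(-1, 1), grad_f, rcond=None)

print(lam[0])                                  # the Lagrange multiplier, here 2
print(np.allclose(grad_f, lam[0] * grad_h))    # True: the gradients are parallel
```

With \(n\) constraints, the least-squares solve generalises to recovering the whole vector \(\lambda\) from \(\nabla f={[\nabla H]}^T\lambda\).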
<h2 id="generalisation-to-nonlinear-functions">Generalisation to Nonlinear Functions</h2>
<p><strong>There is an important assumption I’ve left unsaid.</strong> In every example we’ve seen, I’ve always said that the <strong>constraints represent a system of linear equations</strong>. This is true when the constraint equations happen to be straight lines (or hyperplanes), but is certainly <strong>not</strong> the case in general. Some examples of nonlinear constraints are:</p>
<ul>
<li>
\[x^2+x^3+y=3\]
</li>
<li>
\[xy+z=4\]
</li>
<li>
\[xy^4+z=4\]
</li>
</ul>
<p>In all but a few “easy” cases, it is simply not possible to solve for some variables in terms of the others, i.e., to express the dependent variables as functions of the independent variables. Even if that were possible, the assumption of a linear relationship would not hold.</p>
<p>That is not the only problem. Constraint equations define the solution space such that even if the constraints are individually tractable to analyse, <strong>the manifold formed by their intersection cannot be described by any easily-discovered equation</strong>, linear or non-linear.</p>
<p>For example, take a look at this beauty.</p>
<p><img src="/assets/images/manifold-from-intersecting-cylinders.png" alt="3D Manifold from Intersecting Cylinders" /></p>
<p><strong>The red line shows the manifold, which satisfies the equations of both these cylinders.</strong> This intersection is not easily expressible; also it is guaranteed to be nonlinear in nature. And this is just two cylinders. It is not uncommon to have more constraint equations, all similarly nonlinear, and possibly <strong>higher-dimensional</strong>. We cannot even visualise such surfaces, let alone the intersections between them.</p>
<p><strong>How are we to resolve this quandary?</strong></p>
<h2 id="implicit-function-theorem">Implicit Function Theorem</h2>
<p>Functions come in many shapes and sizes. They aren’t always necessarily linear. However, that does not mean that analysis of these nonlinear functions is intractable. Calculus makes sense of the mostly nonlinear world around us by telling us that we can treat any (sufficiently smooth) curve or surface, in any dimension, as linear if we only zoom into it close enough.</p>
<p>It basically asks us to pretend that a complicated curve (and the corresponding function) is a <strong>linear function</strong>. This approximation is grossly wrong at larger scales, but gets better and better the more we zoom in. This is essentially the concept behind tangents to curves. The slope of a tangent comes from considering two points: one point on the curve proper, and another point on the curve in the neighbourhood of the first, as close as possible to the first point, but not the same as it.</p>
<p>In calculus, this is termed the limit. That \(\frac{dy}{dx}\) that we bandy about so much essentially expresses a <strong>linear relationship</strong> between \(x\) and \(y\) at that point. This <strong>piecewise linearity at infinitesimal scales</strong> is what enables us to frame problems in a way that is solvable, instead of being overwhelmed by the nonlinearity of the function.</p>
<p>In practice, we speak of <strong>locality</strong>: the neighbourhood of a point, as being a small non-zero-sized area around the point, smaller than you can possibly imagine. Then, you carry out your nice linear calculations in this neighbourhood, assured of the fact that you have zoomed in enough that the function looks linear.</p>
<p>Let us return to our original quandary: <strong>how can we even begin to find a critical point on a constraint manifold if we do not even know how to express some variables in terms of others in the constraint equations</strong>? Remember, this parameterisation is what allows us to encode the constraints of the manifold into the function of the curve that we desire to find a critical point on.</p>
<p>The first question we ask is: <strong>does such a relationship even exist</strong>? The second, since this is calculus, is <strong>whether this relationship exists locally</strong>, i.e., when we zoom in. Even if we only know that it exists locally, that can still help us discover useful properties of the curve. The third question we can ask is whether we can know any aspects of this relationship.</p>
<p>The <strong>Implicit Function Theorem</strong> has an answer to these questions.</p>
<p>We will not delve much into the <strong>Implicit Function Theorem</strong>, merely state its results. That itself should validate the assumption around the linear relationship that we have been using all this time.</p>
<p>The <strong>Implicit Function Theorem</strong> states that if a mapping \(F(x)\) exists for a point \(c\) such that:</p>
<ul>
<li>
\[\mathbf{F(c)=0}\]
</li>
<li>\(F\) is <strong>first order differentiable</strong> (\(C^1\) differentiable) in a neighbourhood of \(c\)</li>
<li>The derivative of \(F\) at \(c\), i.e., \(DF(c)\), is <strong>onto</strong> (surjective), i.e., every value in its codomain is the output of \(DF(c)\) for some input.</li>
</ul>
<p>then, the following holds true:</p>
<ul>
<li><strong>There exists a system of linear equations \(DF(c)=0\)</strong>, in which the \(n\) pivotal variables of the level set constraint equations can be expressed as functions of the \(m\) independent (non-pivotal) variables.</li>
<li>There is a <strong>neighbourhood of \(c\)</strong> where this linear relationship holds for \(F(c)=0\).</li>
</ul>
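As a sketch of what the theorem buys us (the specific map \(F(x,y)=x^2+y^2-1\) and the point \(c=(0,1)\) are my choices, not from the text), we can check numerically that the guaranteed local linear relationship really does hold better and better as we zoom in:

```python
import numpy as np

# F(x, y) = x^2 + y^2 - 1 at c = (0, 1): F(c) = 0, and DF(c) = [0, 2] is onto.
# The theorem guarantees y = g(x) locally; here, explicitly, g(x) = sqrt(1 - x^2).
# Implicit differentiation gives the local linear map: dy/dx = -F_x/F_y = -x/y.
slope = -0.0 / 1.0   # slope at c = (0, 1)

# The linearisation y ~ 1 + slope * x matches g(x) better as the neighbourhood shrinks.
for dx in [0.1, 0.01, 0.001]:
    exact = np.sqrt(1 - dx**2)
    linear = 1.0 + slope * dx
    print(dx, abs(exact - linear))   # the error shrinks like dx^2
```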
<p><strong>We could not have made the assumption of the existence of this mapping and its inverse for the general case of nonlinear constraints without the Implicit Function Theorem.</strong></p>
<h1 id="vector-calculus-graphs-level-sets-and-constraint-manifolds">Vector Calculus: Graphs, Level Sets, and Constraint Manifolds (2021-04-20)</h1>
<p>In this article, we take a detour to understand the mathematical intuition behind <strong>Constrained Optimisation</strong>, and more specifically the method of <strong>Lagrangian multipliers</strong>. We have been discussing <strong>Linear Algebra</strong>, specifically matrices, for quite a bit now. <strong>Optimisation theory</strong>, and <strong>Quadratic Optimisation</strong> as well, relies heavily on <strong>Vector Calculus</strong> for many of its results and proofs.</p>
<p>Most of the rules for single-variable calculus translate over to vector calculus, but now we are dealing with <strong>vector-valued functions</strong> and <strong>partial differentials</strong>.</p>
<p>This article will introduce the building blocks that we will need to reach our destination of understanding Lagrangians, with slightly more rigour than a couple of contour plots. <strong>Please note that this is in no way an exhaustive introduction to Vector Calculus, only the concepts necessary to progress towards the stated goal are introduced.</strong> If you’re interested in studying the topic further, there is a wealth of material to be found.</p>
<p>We will motivate most of the theory by illustrating the two-dimensional case in pictures, but understand that in actuality, we will often deal with <strong>higher-dimensional vector spaces</strong>.</p>
<p>Before we delve into the material proper, let’s look at the big picture approach that you will go through as part of this.</p>
<ul>
<li>Orthogonal Complements</li>
<li>Graphs and Level Sets</li>
<li>Gradients and Jacobians (We will not cover the basic material here too much, there may be other standalone posts about them, though)</li>
<li>Tangent Spaces</li>
<li>Parameterisation in an underdetermined system of Linear Equations</li>
</ul>
<h2 id="required-review-material">Required Review Material</h2>
<ul>
<li><a href="/2021/04/03/matrix-intuitions.html">Matrix Intuitions</a></li>
<li><a href="/2021/04/04/proof-of-column-rank-row-rank-equality.html">Matrix Rank and Some Results</a></li>
<li><a href="/2021/04/02/matrix-subspaces-intuitions.html">Intuitions about the Orthogonality of Matrix Subspaces</a></li>
</ul>
<h2 id="linear-algebra-quick-recall-and-some-identities">Linear Algebra: Quick Recall and Some Identities</h2>
<p>The reason we revisit this topic is because I want to introduce some new notation, and talk about a couple of properties you may or may not be aware of.</p>
<h2 id="orthogonal-complements">Orthogonal Complements</h2>
<p>We have already met <strong>Orthogonal Complements</strong> in <a href="/2021/04/02/matrix-subspaces-intuitions.html">Intuitions about Matrix Subspaces</a>, when we were talking about <strong>column spaces</strong>, <strong>row spaces</strong>, <strong>null spaces</strong>, and <strong>left null spaces</strong>. Recalling quickly, the <strong>column space</strong> and the <strong>left null space</strong> are mutually orthogonal complements, and the <strong>row space</strong> and the <strong>null space</strong> are mutually orthogonal complements.</p>
<p>The other fact worth recalling is that the <strong>column rank of a matrix is equal to its row rank</strong>. Since the <strong>rank of a matrix determines the dimensionality of a vector subspace</strong> (1 if it is a line, 2 if it’s a plane, and so on), it follows that the <strong>column space and row space of a matrix have the same dimensionality</strong>. Note that this dimensionality is not dependent on dimensionality of the vector space that the row/column space is embedded in, for example, a 1D subspace (a line) can exist in a three-dimensional vector space.</p>
<p>One (obvious) fact is that the dimensionality of the ambient vector space \(V\) is equal to or larger than the dimensionality of the row space/column space/null space/left null space of a matrix \(A\). That is:</p>
\[dim(V)\geq dim(S) \quad \forall \; S\in \{C(A), R(A), LN(A), N(A)\}\]
<p>We will make more precise statements about these relationships in the next section on <strong>Rank-Nullity Theorem</strong>.</p>
<h2 id="rank-nullity-theorem">Rank Nullity Theorem</h2>
<p>The <strong>Rank Nullity</strong> Theorem states that the sum of the dimensionality of the column space (<strong>rank</strong>) and that of its orthogonal complement, the <strong>left null space</strong> (<strong>nullity</strong>), is equal to the <strong>dimension of the vector space they are embedded in</strong>.
By the same token, the sum of the dimensionality of the <strong>row space</strong> (<strong>rank</strong>) and that of its orthogonal complement, the <strong>null space</strong> (<strong>nullity</strong>), is equal to the <strong>dimension of the vector space they are embedded in</strong>.</p>
<p>Mathematically, this implies:</p>
\[\mathbf{dim(C(A))+dim(LN(A))=dim(V) \\
dim(R(A))+dim(N(A))=dim(V)}\]
<p>where \(V\) is the embedding space. This always ends up as the number of basis vectors required to uniquely identify a vector in \(V\). To take a motivating example:</p>
<ul>
<li>A vector \(U=(1,2,3)\) requires three basis vectors (\((1,0,0), (0,1,0), (0,0,1)\)) to specify it completely in \({\mathbb{R}}^3\). Note that this choice of basis vectors is not unique; I simply picked the <strong>standard basis vectors</strong> to illustrate the point. Thus the dimension of the ambient space in this case is 3.</li>
<li>The dimensionality of the subspace spanned by \(V=(1,2,3)\) is 1, since it is a single vector. The column space of this “matrix” \(\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}\) is basically a straight line extending infinitely in both directions of this vector.</li>
<li><strong>To fully cover the ambient vector space \(V={\mathbb{R}}^3\)</strong>, you need a vector space which is a plane, which is a two-dimensional subspace. This is mechanically deducible from the <strong>Rank-Nullity Theorem</strong> (3-1=2), but you can also intuit that <strong>the entire 3D space \(V\) can be covered by taking a plane and translating it infinitely forwards and backwards along the vector \(U\)</strong>.</li>
<li>The plane we would like to pick is the orthogonal complement of the line spanned by \(U=(1,2,3)\), i.e., the plane \(P: x+2y+3z=0\). Can you see why it’s an orthogonal complement? Taking the dot product of any vector lying in the plane \(P\) with \(U\) gives us zero.</li>
<li>It is also clear that to represent this plane in the form of a vector subspace (i.e., matrix form), we need <strong>two linearly independent 3D column vectors</strong>. This will automatically imply that the rank of this matrix is 2, which validates the conclusion that we drew earlier from the <strong>Rank-Nullity Theorem</strong>.</li>
</ul>
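The claims in this example are easy to check numerically. The sketch below (numpy's SVD is my choice of tooling; the article does not prescribe one) recovers the orthogonal-complement plane as the null space of the \(1 \times 3\) matrix built from \(U=(1,2,3)\), and confirms the Rank-Nullity count along the way:

```python
import numpy as np

# Treat U = (1, 2, 3) as a 1x3 matrix; its row space is the line along U,
# and its null space is the orthogonal complement: the plane x + 2y + 3z = 0.
u = np.array([[1.0, 2.0, 3.0]])

_, s, Vt = np.linalg.svd(u)          # full SVD: Vt is 3 x 3
rank = int(np.sum(s > 1e-10))        # rank 1: a line
plane_basis = Vt[rank:]              # remaining right-singular vectors span N(u)

print(rank, plane_basis.shape[0])            # 1 2
print(np.allclose(u @ plane_basis.T, 0))     # True: the plane is orthogonal to U
print(rank + plane_basis.shape[0] == 3)      # True: Rank-Nullity in R^3
```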
<h3 id="subset-containership-of-orthogonal-complements">Subset containership of Orthogonal Complements</h3>
<p>This rule is pretty simple, it states the following:</p>
<p><strong>If \(U\subset V\), then \(U^{\perp}\supset V^{\perp}\)</strong></p>
<p>We illustrate this with a simple motivating example, but without proof.
If \(\vec{U}=(1,0,0)\), then its orthogonal complement \(U^{\perp}\) is the plane \(x=0\), as shown below:
<img src="/assets/images/vector-u-and-its-orthogonal-complement-plane-u-perp.png" alt="Vector U and Plane U-Perp" /></p>
<p>If \(V\) is the plane \(z=0\), then its orthogonal complement is \(\vec{V^{\perp}}=(0,0,1)\), as shown below:
<img src="/assets/images/plane-v-and-its-orthogonal-complement-vector-v-perp.png" alt="Plane V and Vector V-Perp" /></p>
<p>Now, the relation \(\mathbf{U\subset V}\) clearly holds, since the vector \(\vec{U}=(1,0,0)\) exists in the plane \(V=z=0\), as shown below:</p>
<p><img src="/assets/images/plane-v-containing-vector-u.png" alt="Plane V Contains Vector U" /></p>
<p>Similarly, the relation \(\mathbf{U^{\perp}\supset V^{\perp}}\) clearly holds, since \(U^{\perp}=x=0\) contains the vector \(\vec{V^{\perp}}=(0,0,1)\), as shown below:</p>
<p><img src="/assets/images/plane-u-perp-containing-vector-v-perp.png" alt="Plane U-Perp Contains Vector V-Perp" /></p>
<p>This validates the <strong>Subset Rule</strong>.</p>
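The same example can be verified mechanically; the following is only an illustrative sketch (the explicit basis matrices for \(V\) and \(U^{\perp}\) are my choices):

```python
import numpy as np

# U = (1,0,0); V is the plane z = 0, spanned by (1,0,0) and (0,1,0).
U = np.array([1.0, 0.0, 0.0])
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# U is a subset of V: U is expressible as a combination of V's basis vectors.
coeffs, *_ = np.linalg.lstsq(V.T, U, rcond=None)
in_V = np.allclose(V.T @ coeffs, U)

# V-perp is spanned by (0,0,1); U-perp is the plane x = 0.
V_perp = np.array([0.0, 0.0, 1.0])
U_perp_basis = np.array([[0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])

# V-perp is a subset of U-perp: (0,0,1) is expressible in U-perp's basis.
c, *_ = np.linalg.lstsq(U_perp_basis.T, V_perp, rcond=None)
in_U_perp = np.allclose(U_perp_basis.T @ c, V_perp)

print(in_V, in_U_perp)   # True True
```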
<h2 id="graphs-and-level-sets">Graphs and Level Sets</h2>
<p><strong>Optimisation problems boil down to maximising (or minimising) a function of (usually) multiple variables.</strong> The form of the function has a bearing on how easy or hard the solution might be. This section introduces some terminology that further discussion will refer to, and it is best to not admit any ambiguity in some of these terms.</p>
<h3 id="graph-of-a-function">Graph of a Function</h3>
<p>Consider the function \(f(x)=x^2\). By itself, it represents a single curve; it is a function of a single variable.
This is the picture of what it looks like:</p>
<p><img src="/assets/images/quadratic-x2-single-variable.png" alt="Basic Parabola" /></p>
<p>However, consider what the function \(g(x)=(x,f(x))\) looks like. This notation might seem unfamiliar – after all, isn’t a function supposed to output a single value? – but we have already dealt with functions that return multiple values; we just bundled them up in matrices. So it is the case in this scenario: \(g(x)\) takes in a value in \(\mathbb{R}\) and returns a matrix (usually a column vector, for the sake of consistency) of the form:</p>
<p>\(\begin{bmatrix}
x \\
f(x)
\end{bmatrix}\). This function \(g(x)=(x,f(x))\) is called the <strong>graph</strong> of the function \(f(x)\). Later on, when we introduce vector-valued functions, \(x\) and \(f(x)\) will themselves be column vectors in their own right.</p>
<h3 id="level-set-of-a-function">Level Set of a Function</h3>
<p>Consider the function \(f(x_1,x_2)=x_1^2-x_2\). This is a function of two independent variables \(x_1\) and \(x_2\). Note that this function is very similar to \(y=x^2\) (which can be rewritten as \(x^2-y=0\)), except there is no constraint on the output value. If we wish to observe the shape of this function, we will need to look at it in 3D: <strong>two dimensions for the independent variables, and the third for the output of the function \(f(x_1,x_2)\)</strong>.</p>
<p>Together, this will form a two dimensional surface in 3D space; the picture below shows what this surface would look like.</p>
<p><img src="/assets/images/parabola-bare-surface.png" alt="Parabola Bare Surface" /></p>
<p>We get a surface, rather than a single curve, because there is no constraint on the output value. You can think of this surface as the family of all possible parabolas of the form \(x^2-y=C\), where we have left \(C\) unspecified. Each value of \(C\) gives us one parabola in this family. Let’s look at this in 2D first in the image below.</p>
<p><img src="/assets/images/quadratic-x-x2-graph.png" alt="Parabola Level Sets in 2D" /></p>
<p>Each member of this family of parabolas is obtained by fixing \(C\) to a specific value. The set of points obtained by this fixing is called a <strong>Level Set</strong> of the function \(f\). We denote it here by \(G(C)\), where \(C\) is the constant which selects a particular parabola from this family. Thus, \(G(2)\) results in the equation \(x_1^2-x_2=2\).</p>
<p>The above diagram does not give us the full picture. Remember, the actual value of \(C\) is not pictured here, because it exists in the third dimension. We need to go back to the parabola surface we pictured earlier. We will do the same thing, plot some sample members of this family, but this time we will also position them along the Z-axis based on what value \(f(x)\) assumes (which is of course \(C\), since we are explicitly fixing it for each level set/parabola).</p>
<p><img src="/assets/images/quadratic-parabola-level-sets-1.png" alt="Parabola Level Sets in 3D" />
<img src="/assets/images/quadratic-parabola-level-sets-2.png" alt="Parabola Level Sets in 3D End-On" /></p>
<p>Both are the same graph, just rotated a bit to give you a better idea of where these level sets lie in space in relation to the surface of the family of parabolas. As you can see, each level set is fixed at a particular \(z\) value because we are fixing \(C\). Each level set always lies on this surface. <strong>In essence, each level set is a horizontal cross-section of the full surface, with the position of the cut specified by fixing \(C\).</strong></p>
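A quick numerical illustration of the cross-section idea (the value \(C=2\) and the sampling range are my choices): every point of a level set evaluates to the same constant under \(f\), which is exactly why the level set sits at a fixed height on the surface.

```python
import numpy as np

# f(x1, x2) = x1^2 - x2; points on the level set G(C) satisfy x2 = x1^2 - C.
def f(x1, x2):
    return x1**2 - x2

C = 2.0
x1 = np.linspace(-3.0, 3.0, 50)
x2 = x1**2 - C                      # parameterise the level set G(2)

values = f(x1, x2)
print(np.allclose(values, C))       # True: f is constant on the level set
```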
<p>A circle is usually a better candidate for demonstrating this visually, so we will repeat the same illustration with the function \(f(x,y)=x^2+y^2\). This defines the family of circles centred at \((0,0)\), with radius \(\sqrt{C}\) for each \(C \geq 0\).</p>
<p>As usual, picking a particular radius fixes a level set on the surface of this family of circles. The image below shows the situation:</p>
<p><img src="/assets/images/circle-level-sets.png" alt="Circle Level Sets in 3D" />
<img src="/assets/images/circle-level-sets-end-on-view.png" alt="Circle Level Sets in 3D End-On" /></p>
<p>Let’s put <strong>Level Sets</strong> into action. Let’s take a familiar exercise, and apply our understanding of level sets to it, that will allow us to interpret the solution with our new-found knowledge.</p>
<p><strong>Exercise: Find the tangent line to the curve \(y=x^2+4\) at \(x=3\).</strong></p>
<p>This is the level set of a function \(g(x,y)=y-x^2\), with \(G(4)\), if we assume \(G\) to be the level set function.</p>
<p>Let’s work with just \(g(x,y)\). Differentiating partially:</p>
\[Dg(x,y) = \left[ \frac{\partial g(x,y)}{\partial x} \hspace{0.5cm} \frac{\partial g(x,y)}{\partial y} \right] = [-2x \hspace{1cm} 1]\]
<p>This immediately gives us the vector normal to the curve at a given \((x,y)\). The normal vector at \(x=3\) is
\(\begin{bmatrix}
-6 \\ 1
\end{bmatrix}\)</p>
<p><strong>However this is not the tangent line.</strong> If you consider the actual line (actually, one of the lines, you can get infinitely many lines by translation, we discuss this below) this normal vector represents, that is: \(\mathbf{y-6x=0}\) which passes through the origin, and is definitely not tangent to the curve, as shown in the picture below. It has the correct slope, but it is displaced.</p>
<p><img src="/assets/images/tangent-lines-level-sets.png" alt="Level Sets of Tangents" /></p>
<p>The reason is that \(\mathbf{t(x,y)=y-6x}\) represents a <strong>family of tangent lines</strong>; each level set, fixed by a value of \(c\), represents a particular member of that family. For our curve \(y-x^2=4\), the tangent line and the curve coincide <strong>only at \((3,13)\)</strong>, which is the point where the tangent line should touch. This means that we can find the level set parameter we are seeking by substituting \((3,13)\) into \(t(x,y)=y-6x\). This gives us:</p>
\[t(3,13)=13-18=-5\]
<p>Plugging in -5 as the level set value for the function gives us \(\mathbf{y-6x=-5}\), which is the correct tangent line. This line is shown in bold in the plot above.</p>
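We can verify the tangency numerically (a small check of mine, not part of the original exercise): substituting \(y=6x-5\) into \(y=x^2+4\) must yield a double root at \(x=3\), which is exactly the algebraic signature of a line touching, rather than crossing, the curve.

```python
import numpy as np

# Substituting y = 6x - 5 into y = x^2 + 4 gives x^2 - 6x + 9 = 0, i.e. (x - 3)^2 = 0.
roots = np.roots([1.0, -6.0, 9.0])
print(np.allclose(roots, 3.0))   # True: a double root at x = 3, so the line is tangent
```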
<p>We could have solved it taking a slightly different, possibly more general, approach. We know that the tangent exists at \((3,13)\). We know that \(y\) can be expressed in terms of \(x\) as \(y=x^2+4\). Thus, any point in the neighbourhood of \((3,13)\) will still lie on the tangent line, and must satisfy the following:</p>
\[y-13=\frac{\partial y}{\partial x}(x-3)\]
<p>Well, \(y\) does not depend upon anything other than \(x\), so taking the partial is the same as your normal derivative, which is \(\frac{dy}{dx}=6\). Substituting this back into the above identity, we get:</p>
\[y-13=6(x-3) \\
\Rightarrow y-13=6x-18 \\
\Rightarrow \mathbf{y-6x=-5}\]
<p>This confirms the previous calculation.</p>
<p>This second way of finding the tangent line will prove more useful when we extend the discussion to tangent spaces in higher dimensions. Keep this at the back of your mind.</p>
<p>There is one more point that I want to make, which will be important as it will be restated in the proof for points constrained to a manifold. The function \(g(x)=x^2\) (i.e., \(y=x^2\)) represents \(y\) in terms of \(x\). The differential of this function is obviously \(Dg(x)\).</p>
<p>\(Dg(x)\) thus is the mapping which relates the \(y\) coordinate to the \(x\) coordinate for all points on the tangent line. Thus, <strong>the graph of \(Dg(x)\) is the tangent line (more generally the tangent space) for this particular curve</strong>.</p>
<h2 id="gradients-and-jacobians">Gradients and Jacobians</h2>
<p>I will touch on Jacobians lightly because a significant part of what is to come in this article will differentiate functions with respect to vectors, very often. The Jacobian is the application of partial derivatives to multiple vector-valued functions, i.e., functions which accept vectors as inputs.
To make this more concrete, and to show the most general case, we take \(m\) functions, each taking as input an \(n\)-dimensional vector; the Jacobian will then be an \(m \times n\) matrix.</p>
<p>Taking the Jacobian of \(F=(f_1, f_2, ..., f_m)\) means partially differentiating each \(f_i\) with respect to each \(x_j\), and repeating this for all functions in the matrix.</p>
\[J_XF=\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} && \frac{\partial f_1}{\partial x_2} && \cdots && \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} && \frac{\partial f_2}{\partial x_2} && \cdots && \frac{\partial f_2}{\partial x_n} \\
\frac{\partial f_3}{\partial x_1} && \frac{\partial f_3}{\partial x_2} && \cdots && \frac{\partial f_3}{\partial x_n} \\
\vdots \\
\frac{\partial f_m}{\partial x_1} && \frac{\partial f_m}{\partial x_2} && \cdots && \frac{\partial f_m}{\partial x_n} \\
\end{bmatrix}\]
<p>where the subscript \(X\) indicates we are differentiating with respect to the vector \(X=(x_1,x_2,...,x_n)\).</p>
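A numerical sketch may help make this definition concrete (the function \(F\) below is an arbitrary example of mine, with \(m=n=2\)): a forward-difference approximation of the Jacobian should agree with the analytic partial derivatives.

```python
import numpy as np

# F : R^2 -> R^2, F(x) = (x1^2 * x2, x1 + sin(x2)).
def F(x):
    return np.array([x[0]**2 * x[1], x[0] + np.sin(x[1])])

def jacobian(F, x, h=1e-6):
    """Forward-difference Jacobian: J[i, j] ~ dF_i / dx_j."""
    fx = F(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xh = x.copy()
        xh[j] += h
        J[:, j] = (F(xh) - fx) / h
    return J

x = np.array([1.0, 2.0])
J_numeric = jacobian(F, x)
J_exact = np.array([[2 * x[0] * x[1], x[0]**2],
                    [1.0,             np.cos(x[1])]])
print(np.allclose(J_numeric, J_exact, atol=1e-4))   # True
```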
<h2 id="tangent-spaces">Tangent Spaces</h2>
<p>The exercise in the section on <strong>Level Sets</strong> leads directly from the one-dimensional case to a discussion on <strong>Tangent Spaces</strong> in higher dimensions.
Remember the equation to find the tangent space in one dimension, where we represented one variable in terms of the other? I reproduce it below for reference:</p>
\[y-y_0=\frac{dy}{dx}(x-x_0)\]
<p>Here, we have simply replaced the concrete values with \((x_0,y_0)\). This is usually referred to as the <strong>parametric form of a linear function</strong>. It tells you how much \(y\) changes when you change \(x\) by a certain value; specifically, it tells you how much \(y\) changes when you change \(x\) by 1. This value is also, unsurprisingly, the slope of the linear function, represented here as \(\frac{dy}{dx}\). Any vector along this linear function can be represented as \(t\begin{bmatrix}1 \\ \frac{dy}{dx}\end{bmatrix}\).</p>
<p>Now, consider any equation of three variables. As an example:</p>
\[x+2y+3z=0\]
<p>Here, we can represent \(z\) as a function of \((x,y)\), like so:</p>
\[z=-\frac{x+2y}{3}\]
<p>We have essentially moved up one dimension, where <strong>one variable is now expressed in terms of two variables</strong>, instead of one. If we wish to find the tangent space for this, we can use the same concept, except that this time partial differentiation makes sense, since there are two independent variables, and we take the partial derivative with respect to each of them separately. So, we can write:</p>
\[z-z_0=\begin{bmatrix}
\frac{\partial z}{\partial x} && \frac{\partial z}{\partial y}
\end{bmatrix}
\begin{bmatrix}
x-x_0 \\
y-y_0
\end{bmatrix}\]
<p>Thus, similar to the previous case, if we denote \(g(x,y)=\begin{bmatrix}
\frac{\partial z}{\partial x} && \frac{\partial z}{\partial y}
\end{bmatrix}\), \(g(x,y)\) denotes a mapping from the independent variables \(x\) and \(y\) to \(z\). Stated more generally, <strong>\(g(x,y)\) is a mapping from \(\mathbb{R}^2\) to \(\mathbb{R}\)</strong>.
As before, the graph of \(g(x,y)\) is the tangent space. Remember, a graph is simply the set of all inputs (in this case, \(x\) and \(y\)) and their corresponding outputs (in this case \(z\)); thus <strong>the graph of \(g(x,y)\) is simply all the \((x,y,z)\) tuples which lie on the tangent space</strong>.</p>
<p>We can extend this easily enough to n-dimensional space, where one variable can be expressed in terms of \(n-1\) independent variables. That is, if \(x_n=g(x_1, x_2,\ldots, x_{n-1})\), then:</p>
\[x_n=\begin{bmatrix}
\frac{\partial g}{\partial x_1} && \frac{\partial g}{\partial x_2} && ... && \frac{\partial g}{\partial x_{n-1}}
\end{bmatrix}
\begin{bmatrix}
x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-1}
\end{bmatrix}\]
<p>This is a single linear equation. We will have more to say about linearity in nonlinear curves in another article. Now we extend this to multiple equations. Before we do that though, let’s consider a simple motivating example to make the intuition about independent and dependent variables in a linear system of equations, more solid.</p>
<p>Consider two equations:</p>
\[x+y+z=0 \\
x+2y+3z=0\]
<p>What is the solution to this set of equations? Let’s use the <strong>row reduction technique</strong> to find the pivots. We first write out the system of linear equations as a matrix.</p>
\[\begin{bmatrix}
1 && 1 && 1 \\
1 && 2 && 3
\end{bmatrix}\]
<p>Let’s now convert to <strong>reduced row echelon form</strong>, first subtracting \(R_1\) from \(R_2\)</p>
\[\begin{bmatrix}
1 && 1 && 1 \\
0 && 1 && 2
\end{bmatrix}\]
<p>Now, we subtract \(R_2\) from \(R_1\) to get:</p>
\[\begin{bmatrix}
1 && 0 && -1 \\
0 && 1 && 2
\end{bmatrix}\]
<p>We have arrived at the row reduced echelon form. We see that we have two pivots, \(x\) and \(y\); the rank of the matrix is 2.</p>
\[x-z=0 \\
y+2z=0 \\
\Rightarrow x=z, \; y=-2z\]
<p>Thus, we have one free variable (\(z\)), which can be used to express \(x\) and \(y\). Note the difference between the rank of the matrix (2) and the number of variables (3): that gap is exactly the number of free variables. <strong>As a general rule, if we have a system of \(n-k\) equations in \(n\) variables, there are \(k\) free variables (parameters), and the remaining \(n-k\) variables can be expressed in terms of the \(k\) parameters.</strong></p>
<p>You will have also noticed that the final result is the parametric form of a line in 3D space. This is not surprising, because the solution represents the intersection of two planes.</p>
<p><img src="/assets/images/intersecting-planes-1d-manifold.png" alt="Intersecting Planes forming a 1D Manifold" /></p>
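This is easy to confirm numerically (a small check of mine): every point of the parametric line \(t(1,-2,1)\) satisfies both plane equations simultaneously.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],      # x +  y +  z = 0
              [1.0, 2.0, 3.0]])     # x + 2y + 3z = 0

direction = np.array([1.0, -2.0, 1.0])   # (x, y, z) = t * (1, -2, 1)

ok = all(np.allclose(A @ (t * direction), 0) for t in np.linspace(-5, 5, 11))
print(ok)   # True: every sampled point of the line lies on both planes
```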
<p>We can say that this line (\(x=z,y=-2z\)) is a <strong>manifold</strong>. The definition of a manifold involves several conditions that need to be satisfied, and they are related to ideas about <strong>differentiability</strong> and <strong>local linearity</strong>. We will visualise some of those ideas in the next article.
<strong>The important connection of this material to Machine Learning is that an optimisation problem defines a set of constraints.</strong> These constraints limit the region where an optimal solution can exist, and they are what we represent by \(n-k\) equations. Note that, in the usual case of optimisation, we will have more variables than equations, i.e., \(n > n-k\), or equivalently \(k > 0\). In this context, we may call the manifold a <strong>constraint manifold</strong>.</p>
<p>Thus, if we have a system of \(\mathbf{n-k}\) equations, assuming this linear system has the maximum possible rank, i.e., <strong>all the row vectors are linearly independent</strong>, we will have \(n-k\) pivots. This implies that there exist \(\mathbf{k}\) <strong>free variables (parameters) which can be used to express the remaining \(n-k\) dependent variables</strong>.</p>
<p>The matrix of these functions can be represented as:</p>
\[F(X)=\begin{bmatrix}
f_1(x_1,x_2,...,x_n) \\
f_2(x_1,x_2,...,x_n) \\
f_3(x_1,x_2,...,x_n) \\
\vdots \\
f_{n-k}(x_1,x_2,...,x_n) \\
\end{bmatrix}\]
<p>Then, we may extend our single-function expression of the slope, and write as below:</p>
\[\begin{bmatrix}
x_{k+1} \\ x_{k+2} \\ x_{k+3} \\ \vdots \\ x_n
\end{bmatrix}
=\begin{bmatrix}
\frac{\partial g_1}{\partial x_1} && \frac{\partial g_1}{\partial x_2} && ... && \frac{\partial g_1}{\partial x_k} \\
\frac{\partial g_2}{\partial x_1} && \frac{\partial g_2}{\partial x_2} && ... && \frac{\partial g_2}{\partial x_k} \\
\vdots \\
\frac{\partial g_{n-k}}{\partial x_1} && \frac{\partial g_{n-k}}{\partial x_2} && ... && \frac{\partial g_{n-k}}{\partial x_k} \\
\end{bmatrix}
\begin{bmatrix}
x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_k
\end{bmatrix}\]
<p>The above says that \(n-k\) variables can be expressed as a linear transformation of \(k\) variables. Each entry in the output vector is expressed as a combination of multiple input variables, and is <strong>related to each input variable by the “slope” obtained through the partial derivative</strong>.</p>
<p>This is a generalisation of what we saw earlier: a single output variable expressed through a single input variable, related to it by a single function. In that example, that function was the slope of the graph, obtained by differentiating the equation of the curve; thus the graph of that function was the tangent space.</p>
<p>So it is with this complicated-looking expression: it represents the tangent space of a manifold which is defined by \(n-k\) equations.</p>
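We can recover this “slope” matrix mechanically for the earlier two-plane example (a sketch of mine; the pivot/free column split follows the row reduction above):

```python
import numpy as np

# The earlier system: n = 3 variables, n - k = 2 equations, so k = 1 free variable.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])

# Pivot columns correspond to (x, y); the free column corresponds to z.
A_pivot, A_free = A[:, :2], A[:, 2:]

# A_pivot @ [x, y]^T + A_free @ [z] = 0  =>  [x, y]^T = M @ [z].
M = -np.linalg.solve(A_pivot, A_free)

print(M.ravel())   # [ 1. -2.]  i.e. x = z, y = -2z, matching the row reduction
```

The matrix \(M\) here plays exactly the role of the \((n-k) \times k\) matrix of partial derivatives in the expression above.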
<p>One question that will inevitably arise is: <strong>how does this set of linear equations ultimately motivate optimisation on complex manifolds?</strong> The <strong>Implicit Function Theorem</strong> answers this question, and is the jumping-off point for the next post on Vector Calculus.</p>