<h1>Kernel Functions: Functional Analysis and Linear Algebra Preliminaries</h1>
<p><em>A Fish without a Bicycle · Technology and Art · 2021-07-17</em></p>
<p>This article lays the groundwork for an important construction called <strong>Reproducing Kernel Hilbert Spaces</strong>, which allows a certain class of functions (called <strong>Kernel Functions</strong>) to be a valid representation of an <strong>inner product</strong> in (potentially) higher-dimensional space. This construction will allow us to perform the necessary higher-dimensional computations without explicitly projecting every point in our data set into higher dimensions, in the case of <strong>Non-Linear Support Vector Machines</strong>, which will be discussed in the upcoming article.</p>
<p>This construction, it is to be noted, is not unique to Support Vector Machines, and applies to the general class of techniques in Machine Learning, called <strong>Kernel Methods</strong>. An important part of the construction relies on defining the <strong>inner product of functions</strong>, as well as notions of <strong>Positive Semi-Definiteness</strong>: these are the concepts we will discuss in this article.</p>
<p>A lot of this material stems from <strong>Functional Analysis</strong>, and we will attempt to introduce the relevant material here as painlessly as possible for the engineer.</p>
<p>Most of the ideas here can be intuitively related to familiar notions of \(\mathbb{R}^n\) spaces, and we’ll use motivating examples to connect the mathematical machinery to the engineer’s intuition.</p>
<h2 id="motivation-for-kernel-functions">Motivation for Kernel Functions</h2>
<p>We begin with the motivation for introducing this material. Consider a set of ten data points, all one-dimensional. We write them as:</p>
\[x_1, x_2, x_3, \ldots, x_{10}\in\mathbb{R}\]
<p>Furthermore, let us assume that some of them belong to the class <strong>Green</strong>, and the rest to the class <strong>Red</strong>.
Let us assign values and classes to them, to be more concrete:</p>
<table>
<thead>
<tr>
<th>\(x_1\)</th>
<th>\(x_2\)</th>
<th>\(x_3\)</th>
<th>\(x_4\)</th>
<th>\(x_5\)</th>
<th>\(x_6\)</th>
<th>\(x_7\)</th>
<th>\(x_8\)</th>
<th>\(x_9\)</th>
<th>\(x_{10}\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>-5</td>
<td>-4</td>
<td>-3</td>
<td>-2</td>
<td>-1</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>Red</td>
<td>Red</td>
<td>Red</td>
<td>Green</td>
<td>Green</td>
<td>Green</td>
<td>Green</td>
<td>Green</td>
<td>Red</td>
<td>Red</td>
</tr>
</tbody>
</table>
<p>We represent them on the number line, like so</p>
<p><img src="/assets/images/linearly-non-separable-data-set-1d.png" alt="Linearly Non-separable data set in 1D" /></p>
<p>Our aim is to find a <strong>linear partitioning line</strong> which separates all the <strong>Green</strong> points from all the <strong>Red</strong> points. Note that this linear partitioning “line” goes by different names in different dimensions, the general term being the <strong>separating hyperplane</strong>:</p>
<ul>
<li>A point in \(\mathbb{R}\)</li>
<li>A line in \(\mathbb{R}^2\)</li>
<li>A plane in \(\mathbb{R}^3\)</li>
</ul>
<p>In this case, we would like to find a single point that creates a <strong>Red</strong> and a <strong>Green</strong> partition.</p>
<p>You can quickly see that there is no way you can choose a single point that can do this. This is obviously because the <strong>Green</strong> data set is “surrounded” by the <strong>Red</strong> data set on either side.</p>
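<p>This impossibility is easy to confirm by brute force. The following sketch (plain Python; the data and labels are taken from the table above) tries every candidate threshold and verifies that none of them puts all <strong>Green</strong> points on one side and all <strong>Red</strong> points on the other:</p>

```python
points = [(-5, "Red"), (-4, "Red"), (-3, "Red"), (-2, "Green"), (-1, "Green"),
          (0, "Green"), (1, "Green"), (2, "Green"), (3, "Red"), (4, "Red")]

def separable_by_point(pts):
    # The data is separable by a single point only if, for some threshold t,
    # each side of t contains at most one class.
    xs = sorted(x for x, _ in pts)
    candidates = [x + 0.5 for x in xs] + [xs[0] - 1]
    for t in candidates:
        left = {c for x, c in pts if x < t}
        right = {c for x, c in pts if x > t}
        if len(left) <= 1 and len(right) <= 1:
            return True
    return False

assert not separable_by_point(points)   # no single point separates this data set
```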
<p>There is a way out of this quandary: we can <strong>lift our original data set into a higher dimension</strong>. As an illustration, let us pick the function \(f(x)=x^2\) to lift our data set into \(\mathbb{R}^2\).</p>
<p>Our data set now becomes:</p>
<table>
<thead>
<tr>
<th>\(x_1\)</th>
<th>\(x_2\)</th>
<th>\(x_3\)</th>
<th>\(x_4\)</th>
<th>\(x_5\)</th>
<th>\(x_6\)</th>
<th>\(x_7\)</th>
<th>\(x_8\)</th>
<th>\(x_9\)</th>
<th>\(x_{10}\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>-5</td>
<td>-4</td>
<td>-3</td>
<td>-2</td>
<td>-1</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>25</td>
<td>16</td>
<td>9</td>
<td>4</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>4</td>
<td>9</td>
<td>16</td>
</tr>
<tr>
<td>Red</td>
<td>Red</td>
<td>Red</td>
<td>Green</td>
<td>Green</td>
<td>Green</td>
<td>Green</td>
<td>Green</td>
<td>Red</td>
<td>Red</td>
</tr>
</tbody>
</table>
<p>Now, can we separate the resulting two-dimensional points using a linear function? Yes, we can. I’ve picked an arbitrary straight line \(x-2y+12=0\) to illustrate this separation, but there is an infinite number of straight lines that will do the job. This situation is shown below:</p>
<p><img src="/assets/images/linearly-separable-dataset-in-2d.png" alt="Linearly Separable data set in 2D" /></p>
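<p>We can check this separation numerically: after lifting, every <strong>Green</strong> point should land on the positive side of \(x-2y+12=0\), and every <strong>Red</strong> point on the negative side. A quick sketch:</p>

```python
points = [(-5, "Red"), (-4, "Red"), (-3, "Red"), (-2, "Green"), (-1, "Green"),
          (0, "Green"), (1, "Green"), (2, "Green"), (3, "Red"), (4, "Red")]

def side(x):
    y = x * x              # lift to R^2 via f(x) = x^2
    return x - 2 * y + 12  # the sign tells us which side of the line (x, y) falls on

assert all(side(x) > 0 for x, c in points if c == "Green")
assert all(side(x) < 0 for x, c in points if c == "Red")
```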
<p>You could have lifted this same data set to \(\mathbb{R}^3\), \(\mathbb{R}^4\), etc. as well, but in this particular case, lifting it to \(\mathbb{R}^2\) makes it nicely linearly separable, so we don’t need to go higher.</p>
<p>The same concept applies to higher dimensional data sets. A linearly non-separable data set in \(\mathbb{R}^2\) can be made linearly separable (using a plane) if it is lifted to \(\mathbb{R}^3\).</p>
<p>This is a common technique used in Machine Learning algorithms as a way to make classification problems easier. Generally speaking, <strong>a linearly non-separable data set can be made linearly separable by projecting it onto a higher dimension</strong>. This projection into higher dimensions is not an ML algorithm by itself, but can be an important step as part of data preparation.</p>
<p>The projection by itself is not the problem: the computational cost is. As we will see later on, Machine Learning algorithms tend to perform the inner product operation very frequently. This inner product needs to be performed on thousands (if not millions) of data points, each of which needs to be lifted to a higher dimension before the inner product operation can be carried out.</p>
<p>We ask the following question: <strong>when is a function on a pair of original vectors also the inner product of those vectors projected into a higher dimensional space?</strong> If we can answer this question, then we can circumvent the process of projecting a pair of vectors into higher-dimensional space, and then computing their inner products; we can simply apply a single function which gives us the inner products in higher dimensional space.</p>
<p><img src="/assets/images/kernel-function-shortcut-diagram.jpg" alt="Kernel Function Shortcut" /></p>
<p>Such a function which satisfies our requirements is called a <strong>Kernel Function</strong>, and is the main motivation for developing the mathematical preliminaries in this article.</p>
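<p>As a preview, here is one concrete kernel function: the polynomial kernel \(\kappa(x,y)={\langle x,y\rangle}^2\) on \(\mathbb{R}^2\). It equals the inner product of the two vectors lifted into \(\mathbb{R}^3\) by an explicit feature map \(\varphi\), so the kernel lets us skip the lift entirely. A minimal sketch (the feature map shown is the standard one for this kernel):</p>

```python
import math

def phi(v):
    # Explicit lift of a 2D vector into 3D feature space; for this map,
    # <phi(u), phi(v)> works out to (u . v)^2.
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kappa(u, v):
    # Kernel function: square of the ordinary 2D inner product.
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)
lifted_inner_product = sum(a * b for a, b in zip(phi(u), phi(v)))
assert abs(kappa(u, v) - lifted_inner_product) < 1e-9  # same number, no explicit lift
```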
<h2 id="mathematical-preliminaries">Mathematical Preliminaries</h2>
<h2 id="functions-as-infinite-dimensional-vectors">Functions as Infinite-Dimensional Vectors</h2>
<p>We can treat <strong>functions as vectors</strong>. Indeed, this is one of the unifying ideas behind <strong>Functional Analysis</strong>. A proper treatment of this concept can be found in any good Functional Analysis text, but I will introduce the relevant concepts here.</p>
<p>We are used to dealing with <strong>finite-dimensional vectors</strong>, mostly in \(\mathbb{R}^n\). A function \(f:\mathbb{R}^n\rightarrow \mathbb{R}\), in the most general sense, can be treated as an <strong>infinite-dimensional vector</strong>. Even if we restrict the domain of the function, there is still an <strong>infinite number of values</strong> that \(f\) takes within this restricted domain, since there are always infinitely many real numbers in the chosen interval.</p>
<p>We can proceed to build our intuition by looking at the following arbitrary function. For convenience of discussion, we restrict the discussion to the range \(x\in[a,b]\).</p>
<p><img src="/assets/images/univariate-function-between-a-b.png" alt="Univariate Restricted Function" /></p>
<p>Let us decide to approximate this function \(f(x)\) by taking five samples at equal intervals \(\Delta x\), as below:</p>
<p><img src="/assets/images/univariate-function-between-a-b-sparse-samples.png" alt="Univariate Restricted Function Sparse Samples" /></p>
<p>We may represent this approximation of \(f(x)\) by a vector of these five samples, i.e.,</p>
\[\tilde{f}(x)=\begin{bmatrix}
f(x_1) \\
f(x_2) \\
f(x_3) \\
f(x_4) \\
f(x_5)
\end{bmatrix}\]
<p>The above is a vector in \(\mathbb{R}^5\). Increasing the number of samples (which in turn decreases \(\Delta x\)), results in a higher dimensional vector, as shown below:</p>
<p><img src="/assets/images/univariate-function-between-a-b-dense-samples.png" alt="Univariate Restricted Function Dense Samples" /></p>
<p>Now, we can approximate \(f(x)\) with a 9-dimensional vector, like so:</p>
\[\tilde{f}(x)=\begin{bmatrix}
f_1 \\
f_2 \\
f_3 \\
f_4 \\
f_5 \\
f_6 \\
f_7 \\
f_8 \\
f_9
\end{bmatrix}\]
<p>Clearly, the <strong>higher the dimensionality of our approximating vector, the better the approximation to the original function</strong>. Ultimately, as \(n\rightarrow \infty\), and \(\Delta x\rightarrow 0\), we recover the true function, and the “approximating vector” is now infinite-dimensional. However, the infinite dimensionality does not prevent us from performing usual vector algebra operations on this function.</p>
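<p>The sampling idea above is easy to sketch in code: we approximate \(\sin x\) on \([0,\pi]\) by vectors of increasing dimension.</p>

```python
import math

def sample(f, a, b, n):
    # Approximate f on [a, b] by an n-dimensional vector of equally spaced samples.
    dx = (b - a) / (n - 1)
    return [f(a + i * dx) for i in range(n)]

f5 = sample(math.sin, 0.0, math.pi, 5)   # coarse approximation in R^5
f9 = sample(math.sin, 0.0, math.pi, 9)   # finer approximation in R^9
assert len(f5) == 5 and len(f9) == 9
# The finer grid contains the coarser one, so every coarse sample reappears.
assert all(any(abs(v - w) < 1e-9 for w in f9) for v in f5)
```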
<p>Indeed, we can show that <strong>functions respect all the axioms of a vector space</strong>, that is:</p>
<ul>
<li>The operations of <strong>Vector Addition</strong> and <strong>Scalar Multiplication</strong> are valid for functions.</li>
<li><strong>Commutativity of Addition</strong>: \(a+b=b+a\)</li>
<li><strong>Associativity of Addition</strong>: \(a+(b+c)=(a+b)+c\)</li>
<li><strong>Existence of Additive Identity</strong>: \(a+0=a\)</li>
<li><strong>Existence of Additive Inverse</strong>: \(a+(-a)=0\)</li>
<li><strong>Existence of Multiplicative Identity</strong>: \(a\cdot 1=a\)</li>
<li><strong>Distributivity of Scalar Multiplication with respect to Vector Addition</strong>: \(\alpha\cdot(a+b)=\alpha\cdot a + \alpha\cdot b\)</li>
<li><strong>Distributivity of Scalar Multiplication with respect to Field Addition</strong>: \((\alpha + \beta)\cdot a=\alpha\cdot a + \beta\cdot a\)</li>
</ul>
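<p>A few of these axioms can be spot-checked in code by treating functions as values, with pointwise addition and scalar multiplication; a minimal sketch:</p>

```python
def add(f, g):
    # Vector addition for functions is defined pointwise.
    return lambda x: f(x) + g(x)

def scale(c, f):
    # Scalar multiplication for functions is also pointwise.
    return lambda x: c * f(x)

f = lambda x: x * x
g = lambda x: 2 * x + 1

for x in [-2.0, 0.0, 3.5]:
    assert add(f, g)(x) == add(g, f)(x)   # commutativity of addition
    # distributivity of scalar multiplication over vector addition
    assert scale(2.0, add(f, g))(x) == add(scale(2.0, f), scale(2.0, g))(x)
```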
<h2 id="hilbert-spaces">Hilbert Spaces</h2>
<p>There is one important property of vector spaces that we’ve taken for granted in our discussions on Linear Algebra so far: the fact that the inner product is a defined operation in our \(\mathbb{R}^n\) Euclidean space.</p>
<p>The <strong>Dot Product</strong> that I’ve covered in previous posts with reference to Linear Algebra is essentially a <strong>specialisation of the general concept of the Inner Product applied to finite-dimensional Euclidean spaces</strong>. We will begin using the more general term <strong>Inner Product</strong> in further discussions.</p>
<p>Another important point to note: we will be switching up notation. <strong>Inner products will henceforth be designated as \(\langle\bullet,\bullet\rangle\).</strong> Thus, the inner product of two vectors \(x\) and \(y\) will be written as \(\langle x,y\rangle\).</p>
<p><strong>The inner product is not defined on a vector space by default</strong>: the property must be explicitly stated as valid on a vector space. <strong>A vector space equipped with an inner product operation is formally known as a Hilbert space.</strong> Thus, the vector spaces we have been dealing with in Linear Algebra so far, have necessarily been Hilbert spaces.</p>
<p>There are a few important properties any candidate for an inner product must satisfy. All these properties make intuitive sense, since we have been using them implicitly, without stating them, while doing Matrix Algebra.</p>
<ul>
<li><strong>Positive Definite</strong>: \(\langle x,x\rangle>0\) if \(x\neq 0\)</li>
<li><strong>Symmetric</strong>: \(\langle x,y\rangle=\langle y,x\rangle\)</li>
<li><strong>Linear</strong>:
<ul>
<li>
\[\langle \alpha x,y\rangle=\alpha\langle x,y\rangle, \alpha\in\mathbb{R}\]
</li>
<li>
\[\langle x+y,z\rangle=\langle x,z\rangle+\langle y,z\rangle\]
</li>
</ul>
</li>
</ul>
<h2 id="norm-induced-by-inner-product">Norm induced by Inner Product</h2>
<p>Another interesting property we have taken for granted is the existence of the <strong>norm of a vector \(\|\bullet\|\)</strong>. In plain Linear Algebra, the <strong>norm is essentially the magnitude of a vector</strong>. What is interesting is that the norm need not be separately specified as a property of a vector space; it comes into existence automatically once an inner product is defined on the space. To see why this is the case, note that:</p>
\[\langle x,x\rangle=\|x\|^2 \\
\Rightarrow \|x\|=\sqrt{\langle x,x\rangle}\]
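<p>In code, this means a norm comes for free with any inner product; a tiny sketch in \(\mathbb{R}^2\):</p>

```python
import math

def inner(u, v):
    # The ordinary inner product on R^n.
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    # The norm induced by the inner product: ||u|| = sqrt(<u, u>).
    return math.sqrt(inner(u, u))

assert norm([3.0, 4.0]) == 5.0   # the familiar 3-4-5 magnitude
```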
<h2 id="inner-product-of-functions">Inner Product of Functions</h2>
<p>Since the <strong>vector space of functions is also a Hilbert space</strong>, we should be able to take the <strong>inner product of two functions</strong>. Here’s some intuition about what the inner product of functions actually means, and how it comes into being.</p>
<p>We will begin by assuming some approximation of two functions \(f\) and \(g\) using finite-dimensional vectors. For concreteness’ sake, assume we represent them using 5-dimensional vectors.
We have represented the approximating vectors \(\tilde{f}\) and \(\tilde{g}\), like so:</p>
\[\tilde{f}(x)=\begin{bmatrix}
f_1 \\
f_2 \\
f_3 \\
f_4 \\
f_5
\end{bmatrix}
\tilde{g}(x)=\begin{bmatrix}
g_1 \\
g_2 \\
g_3 \\
g_4 \\
g_5
\end{bmatrix}\]
<p>As usual, we have restricted the domain of discussion to \([a,b]\). The two sampled functions are shown below:</p>
<p><img src="/assets/images/two-functions-sampled.png" alt="Two Functions Sampled" /></p>
<p>The vector obtained by multiplying the components in the corresponding dimensions would then be:</p>
\[\tilde{f}(x)\tilde{g}(x)=\begin{bmatrix}
f_1.g_1 \\
f_2.g_2 \\
f_3.g_3 \\
f_4.g_4 \\
f_5.g_5
\end{bmatrix}\]
<p><strong>Note that the above vector is not the actual inner product</strong>; for that, we will still need to sum up the samples, as we describe next.
Let us assume that the true product \(f(x)g(x)\) has the graph below. It is shown with the samples of \(\tilde{f}(x)\tilde{g}(x)\) overlaid onto it.</p>
<p><img src="/assets/images/true-inner-product-with-overlaid-samples.png" alt="True Inner Product with Overlaid Samples" /></p>
<p>Now let us consider how we might calculate the inner product. Naively, we can simply sum up the values of \(\tilde{f}(x)\tilde{g}(x)\). However, this would not necessarily be a good approximation, since we are leaving out the parts of the function that we are not sampling: everything between any consecutive pair of samples is not accounted for at all. How, then, are we to approximate these missing values?</p>
<p>In the absence of further data, the best we can do is assume that <strong>those missing values are the same as the value of the sample immediately preceding them</strong>. Essentially, to compute the approximation for the missing parts of the function, we need to compute the area of the approximating rectangle, the height of which is the value of the immediately preceding sample. The approximation is as shown below.</p>
<p><img src="/assets/images/true-inner-product-with-approximating-rectangles.png" alt="True Inner Product with Approximating Rectangles" /></p>
<p>Thus, the approximate inner product \({\langle f,g\rangle}_{approx}\) is:</p>
\[{\langle f,g\rangle}_{approx}=\sum_{i=1}^N f_i\cdot g_i\cdot\Delta x\]
<p>Of course, this is only an approximation, and the more samples we take, the better our approximation will be. In the limit where \(\Delta x\rightarrow 0\) and \(N\rightarrow\infty\), the sum becomes an integral with limits \(a\) and \(b\), as below:</p>
\[{\langle f,g\rangle}=\int_a^b f(x)g(x)dx\]
<p>Thus, the <strong>inner product of two functions is the area under the product of the two functions</strong>.</p>
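<p>The Riemann-sum construction above translates directly into code. The sketch below approximates \(\langle f,g\rangle\) for \(f(x)=g(x)=x\) on \([0,1]\), whose exact value is \(\int_0^1 x^2 dx=1/3\), and shows the approximation improving as \(\Delta x\) shrinks:</p>

```python
def inner_product(f, g, a, b, n):
    # Left-endpoint Riemann sum approximation of <f, g> = integral of f(x) g(x)
    # over [a, b], using n samples of width dx.
    dx = (b - a) / n
    return sum(f(a + i * dx) * g(a + i * dx) * dx for i in range(n))

identity = lambda x: x
exact = 1.0 / 3.0                       # <x, x> on [0, 1]
coarse = inner_product(identity, identity, 0.0, 1.0, 10)
fine = inner_product(identity, identity, 0.0, 1.0, 100000)
assert abs(fine - exact) < abs(coarse - exact)   # more samples, better approximation
assert abs(fine - exact) < 1e-4
```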
<h2 id="functions-as-basis-vectors">Functions as Basis Vectors</h2>
<p>If functions can be treated as vectors, we should be able to express - and create - <strong>functions as linear combinations of other functions</strong>. The implication then is also that <strong>functions can also serve as basis vectors</strong>.
We can essentially bring to bear all the machinery of Linear Algebra, since its results apply to all sorts of vectors. Things like orthogonality, projections, etc. also apply to functions now. The unifying idea is that vectors can be matrices, or functions, or other things.</p>
<p>Thus, a set of functions can span a vector space of functions. This idea will be an important part in the construction of RKHS’s.</p>
<h2 id="postive-semi-definite-kernels-and-the-gram-matrix">Positive Semi-Definite Kernels and the Gram Matrix</h2>
<p><strong>Positive semi-definite matrices</strong> are <strong>square symmetric</strong> matrices which have the following property:</p>
\[v^TSv\geq 0\]
<p>where \(S\) is an \(n\times n\) matrix, and \(v\) is an \(n\times 1\) vector. You can convince yourself that the final result of \(v^TSv\) is a single scalar.</p>
<p>A positive semi-definite matrix \(S\) can always be expressed as the product of a matrix and its transpose:</p>
\[S=L^TL\]
<p>See <a href="/2021/07/08/cholesky-ldl-factorisation.html">Cholesky and \(LDL^T\) Factorisations</a> for further details on how this decomposition works.
To see why a Cholesky-decomposable matrix satisfies the positive semi-definiteness property, rewrite \(v^TSv\) so that:</p>
\[v^TSv=v^TL^TLv \\
=(v^TL^T)(Lv) \\
={(Lv)}^T(Lv) \\
={\|Lv\|}^2 \geq 0\]
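<p>We can spot-check this numerically: build \(S=L^TL\) from an arbitrary matrix \(L\), and confirm that \(v^TSv={\|Lv\|}^2\geq 0\) for random vectors \(v\). A minimal sketch:</p>

```python
import random

random.seed(0)
n = 4
# An arbitrary square matrix L; S = L transpose times L is then positive semi-definite.
L = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
S = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

for _ in range(100):
    v = [random.uniform(-1, 1) for _ in range(n)]
    quad = sum(v[i] * S[i][j] * v[j] for i in range(n) for j in range(n))
    Lv = [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
    # v'Sv equals the squared norm of Lv, hence it is non-negative.
    assert abs(quad - sum(x * x for x in Lv)) < 1e-9
    assert quad >= -1e-12
```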
<h2 id="inner-product-and-the-gram-matrix">Inner Product and the Gram Matrix</h2>
<p>With this intuition, we turn to a very common operation in many Machine Learning algorithms: the <strong>Inner Product</strong>. As we discussed in the first section of this article, inner product calculations usually need to be combined with projecting the original input vectors into a higher-dimensional space first. We will revisit the SVM equations in an upcoming post to see the use of the Gram Matrix, which in the context of kernel functions is simply the matrix of all possible inner products of all data points.</p>
<p>Assume we have \(n\) data vectors \(x_1\), \(x_2\), \(x_3\), …, \(x_n\).
The matrix that will be used to characterise the positive semi-definiteness of kernels is:</p>
\[K=\begin{bmatrix}
\kappa(x_1, x_1) && \kappa(x_2, x_1) && ... && \kappa(x_n, x_1) \\
\kappa(x_1, x_2) && \kappa(x_2, x_2) && ... && \kappa(x_n, x_2) \\
\kappa(x_1, x_3) && \kappa(x_2, x_3) && ... && \kappa(x_n, x_3) \\
\vdots && \vdots && \ddots && \vdots \\
\kappa(x_1, x_n) && \kappa(x_2, x_n) && ... && \kappa(x_n, x_n) \\
\end{bmatrix}\]
<p>where \(\kappa(x,y)\) is the kernel function. We will say that the kernel function is positive semi-definite if:</p>
\[v^TKv\geq 0\]
<p>where \(v\) is any \(n\times 1\) vector.
Let’s expand out the final result because that is a form we will see in both the construction of the <strong>Reproducing Kernel Hilbert Spaces</strong>, as well as the solutions for <strong>Support Vector Machines</strong>.</p>
<p>Let \(v=\begin{bmatrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_n \\
\end{bmatrix}\)</p>
<p>Then, expanding everything out, we get:</p>
\[v^TKv=
\begin{bmatrix}
\alpha_1 && \alpha_2 && \ldots && \alpha_n
\end{bmatrix}
\cdot
\begin{bmatrix}
\kappa(x_1, x_1) && \kappa(x_2, x_1) && ... && \kappa(x_n, x_1) \\
\kappa(x_1, x_2) && \kappa(x_2, x_2) && ... && \kappa(x_n, x_2) \\
\kappa(x_1, x_3) && \kappa(x_2, x_3) && ... && \kappa(x_n, x_3) \\
\vdots && \vdots && \ddots && \vdots \\
\kappa(x_1, x_n) && \kappa(x_2, x_n) && ... && \kappa(x_n, x_n) \\
\end{bmatrix}
\cdot
\begin{bmatrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_n \\
\end{bmatrix}
\\=
{\begin{bmatrix}
\alpha_1\kappa(x_1, x_1) + \alpha_2\kappa(x_1, x_2) + \alpha_3\kappa(x_1, x_3) + ... + \alpha_n\kappa(x_1, x_n) \\
\alpha_1\kappa(x_2, x_1) + \alpha_2\kappa(x_2, x_2) + \alpha_3\kappa(x_2, x_3) + ... + \alpha_n\kappa(x_2, x_n) \\
\alpha_1\kappa(x_3, x_1) + \alpha_2\kappa(x_3, x_2) + \alpha_3\kappa(x_3, x_3) + ... + \alpha_n\kappa(x_3, x_n) \\
\vdots \\
\alpha_1\kappa(x_n, x_1) + \alpha_2\kappa(x_n, x_2) + \alpha_3\kappa(x_n, x_3) + ... + \alpha_n\kappa(x_n, x_n) \\
\end{bmatrix}}^T
\cdot
\begin{bmatrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_n \\
\end{bmatrix}
\\=
{\begin{bmatrix}
\sum_{i=1}^n\alpha_i\kappa(x_1, x_i) \\
\sum_{i=1}^n\alpha_i\kappa(x_2, x_i) \\
\sum_{i=1}^n\alpha_i\kappa(x_3, x_i) \\
\vdots \\
\sum_{i=1}^n\alpha_i\kappa(x_n, x_i) \\
\end{bmatrix}}^T
\cdot
\begin{bmatrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_n \\
\end{bmatrix} \\
=
\alpha_1\sum_{i=1}^n\alpha_i\kappa(x_1, x_i)
+\alpha_2\sum_{i=1}^n\alpha_i\kappa(x_2, x_i)
+\alpha_3\sum_{i=1}^n\alpha_i\kappa(x_3, x_i)
+\ldots
+\alpha_n\sum_{i=1}^n\alpha_i\kappa(x_n, x_i) \\
=
\sum_{j=1}^n\sum_{i=1}^n\alpha_i\alpha_j\kappa(x_j, x_i)\]
<p>Note that in a couple of the intermediate lines, the first factor is written in transpose form to make it more readable.
Thus, from the above expansion, we get:</p>
\[v^TKv=\sum_{j=1}^n\sum_{i=1}^n\alpha_i\alpha_j\kappa(x_j, x_i)\]
<p>For a positive semi-definite kernel \(K\), we must have this expression non-negative, that is:</p>
\[\sum_{j=1}^n\sum_{i=1}^n\alpha_i\alpha_j\kappa(x_j, x_i) \geq 0\]
<h1>Real Analysis: Patterns for Proving Irrationality of Square Roots</h1>
<p><em>2021-07-09</em></p>
<p>Continuing on my journey through <strong>Real Analysis</strong>, we will focus here on common <strong>proof patterns</strong> which apply to <strong>irrational square roots</strong>.
These patterns apply to the following sort of proof exercises:</p>
<ul>
<li>Prove that \(\sqrt 2\) is irrational.</li>
<li>Prove that \(\sqrt 3\) is irrational.</li>
<li>Prove that \(\sqrt 7\) is irrational.</li>
<li>Prove that \(\sqrt{12}\) is irrational.</li>
<li>…etc.</li>
</ul>
<p>The proofs are all based on <strong>Proof by Contradiction</strong>. <strong>Thus, the starting point is always to assume that the square root of the number in question, say \(\sqrt 7\), is indeed a rational number</strong>, implying that it can be expressed as \(\frac{p}{q}\), where \(p,q\in\mathbb{N}\).</p>
<p>Let’s take a specific example which will demonstrate the first proof pattern:</p>
<h2 id="prove-that-sqrt-2-is-irrational">Prove that \(\sqrt 2\) is irrational</h2>
<h3 id="proof">Proof</h3>
<p>We assume that \(\sqrt 2\) is rational. Therefore, we can express it as a ratio of two integers \(\frac{p}{q}:p,q\in\mathbb{N}\), which have no common factors between them. Thus, we may write:</p>
\[\frac{p^2}{q^2}=2 \\
\Rightarrow p^2=2q^2\]
<p>The easiest templatised way to get started in all situations is to make a quick <strong>truth table</strong>, to narrow down the <strong>feasibility of \(p, q\) being odd and/or even</strong>.</p>
<table>
<thead>
<tr>
<th>p</th>
<th>q</th>
<th>Feasible?</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>Even</td>
<td>Even</td>
<td>False</td>
<td>By definition</td>
</tr>
<tr>
<td>Even</td>
<td>Odd</td>
<td>True</td>
<td> </td>
</tr>
<tr>
<td>Odd</td>
<td>Even</td>
<td>False</td>
<td>\(2q^2\) is even, so \(p^2\) must be even, thus \(p\) must be even</td>
</tr>
<tr>
<td>Odd</td>
<td>Odd</td>
<td>False</td>
<td>\(2q^2\) is even, so \(p^2\) cannot be odd, thus \(p\) cannot be odd</td>
</tr>
</tbody>
</table>
<p>Thus, the only valid option is \(p\) even and \(q\) odd.</p>
<p>At this point we’ll show two different ways of arriving at a contradiction. The first one is the ‘classic’ proof.</p>
<h4 id="proof-pattern-1"><u>Proof Pattern 1</u></h4>
<p>Since \(p\) is even, we set \(p=2k, k\in\mathbb{N}\), so that:
\({(2k)}^2=2q^2 \\
\Rightarrow 4k^2=2q^2 \\
\Rightarrow 2k^2=q^2\)</p>
<p>This implies that \(q^2\) is even, therefore \(q\) has to be even.
This leads us to a contradiction, since \(p\) and \(q\) cannot both be even.
It also contradicts the truth table, which shows that \(q\) must be odd.</p>
<p>Therefore, \(\sqrt 2\) is not a rational number.
\(\blacksquare\)</p>
<p><strong>NOTE:</strong> In the above proof, <strong>we do not exploit the fact that \(q\) is odd</strong>. We only show by some algebraic manipulation that \(q\) is even, which contradicts two of the facts deduced from the truth table.</p>
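<p>As an empirical companion (a sanity check, not a proof), a brute-force search over small integers confirms both the conclusion and the key parity fact used above:</p>

```python
# No natural numbers p, q up to 1000 satisfy p^2 = 2 q^2 ...
assert not any(p * p == 2 * q * q
               for p in range(1, 1001) for q in range(1, 1001))
# ... and whenever p^2 is even, p itself is even (used in Proof Pattern 1).
assert all(p % 2 == 0 for p in range(1, 1001) if (p * p) % 2 == 0)
```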
<h4 id="proof-pattern-2"><u>Proof Pattern 2</u></h4>
<p>Since we have determined \(p\) even and \(q\) odd, we set \(p=2m, m\in\mathbb{N}\), and \(q=2n+1, n\in\mathbb{N}\). Then, substituting these expressions, we get:</p>
\[p^2=2q^2 \\
\Rightarrow {(2m)}^2=2{(2n+1)}^2 \\
\Rightarrow 4m^2=2(4n^2+4n+1) \\
\Rightarrow \underbrace{2m^2}_{even}=\underbrace{\underbrace{4n^2+4n}_{even}+1}_{odd}\]
<p>This leads us to a contradiction, because the left hand side is even, but the right hand side is odd.
Therefore, \(\sqrt 2\) is not a rational number.
\(\blacksquare\)</p>
<p><strong>NOTE</strong>: In the above proof, <strong>we do exploit the fact that \(q\) is odd</strong>, because we restate it as an odd number \(2n+1\) to arrive at the contradiction.</p>
<p>In some proofs you will not need to use all the facts in the truth table; in others, you will.</p>
<p>Let us apply this template to another similar problem to show another pattern.</p>
<h2 id="prove-that-sqrt-3-is-irrational">Prove that \(\sqrt 3\) is irrational</h2>
<h3 id="proof-1">Proof</h3>
<p>We assume that \(\sqrt 3\) is rational. Therefore, we can express it as a ratio of two integers \(\frac{p}{q}:p,q\in\mathbb{N}\), which have no common factors between them. Thus:</p>
\[\frac{p^2}{q^2}=3 \\
\Rightarrow p^2=3q^2\]
<p>As usual, we make a quick truth table, to narrow down the feasibility of \(p, q\) being odd and/or even.</p>
<table>
<thead>
<tr>
<th>p</th>
<th>q</th>
<th>Feasible?</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>Even</td>
<td>Even</td>
<td>False</td>
<td>By definition</td>
</tr>
<tr>
<td>Even</td>
<td>Odd</td>
<td>False</td>
<td>\(3q^2\) is odd, so \(p^2\) cannot be even, thus \(p\) cannot be even</td>
</tr>
<tr>
<td>Odd</td>
<td>Even</td>
<td>False</td>
<td>\(3q^2\) is even, so \(p^2\) cannot be odd, thus \(p\) cannot be odd</td>
</tr>
<tr>
<td>Odd</td>
<td>Odd</td>
<td>True</td>
<td> </td>
</tr>
</tbody>
</table>
<p>Thus, the only valid option is \(p\) odd and \(q\) odd.</p>
<h4 id="proof-pattern-2-1"><u>Proof Pattern 2</u></h4>
<p>You cannot arrive at a contradiction using <strong>Proof Pattern 1</strong> like we did above, at this point. Instead, we express both \(p\) and \(q\) as odd numbers. We set \(p=2m+1\) and \(q=2n+1\), so that:
\({(2m+1)}^2=3{(2n+1)}^2 \\
\Rightarrow 4m^2+4m+1=3(4n^2+4n+1) \\
\Rightarrow 4m^2+4m+1=12n^2+12n+3 \\
\Rightarrow 4m^2+4m=12n^2+12n+2 \\
\Rightarrow \underbrace{2(m^2+m)}_{even}=\underbrace{\underbrace{6(n^2+n)}_{even}+1}_{odd} \\\)</p>
<p>Thus, we arrive at a contradiction: the left side is even, but the right side is odd.</p>
<p>Therefore, \(\sqrt 3\) is not a rational number.
\(\blacksquare\)</p>
<h4 id="proof-pattern-3"><u>Proof Pattern 3</u></h4>
<p>We don’t express \(p\) and \(q\) as odd numbers in this pattern. We do some simple algebraic manipulation on the original form, like so:</p>
\[\frac{p^2}{q^2}=3 \\
\Rightarrow p^2=3q^2 \\
\Rightarrow p^2-q^2=2q^2 \\
\Rightarrow \underbrace{(p+q)(p-q)}_{both\ even\ or\ both\ odd}=\underbrace{2q^2}_{even}\]
<p>Since the right hand side \(2q^2\) is even, the left side has to be even as well. Now, we note that \(p+q\) and \(p-q\) can be either both even, or both odd, but never one even and one odd. Why? This is easily verified (writing \(p=2m\) or \(p=2m+1\), and \(q=2n\) or \(q=2n+1\), as appropriate):</p>
<ul>
<li><strong>\(p\) even, \(q\) even</strong>: \(p+q=2(m+n), p-q=2(m-n)\)</li>
<li><strong>\(p\) even, \(q\) odd</strong>: \(p+q=2(m+n)+1, p-q=2(m-n)-1\)</li>
<li><strong>\(p\) odd, \(q\) even</strong>: \(p+q=2(m+n)+1, p-q=2(m-n)+1\)</li>
<li><strong>\(p\) odd, \(q\) odd</strong>: \(p+q=2(m+n+1), p-q=2(m-n)\)</li>
</ul>
<p>Thus, the only way the right hand side condition can hold is if both \(p+q\) and \(p-q\) are even. Write \(p+q=2x\) and \(p-q=2y\), so that:</p>
\[(p+q)(p-q)=2q^2 \\
\Rightarrow 4xy=2q^2 \\
\Rightarrow q^2=2xy\]
<p>This implies that \(q\) is even. But this contradicts our fact from the truth table that \(q\) is odd.</p>
<p>Therefore, \(\sqrt 3\) is not a rational number.
\(\blacksquare\)</p>
<p><strong>NOTE</strong>: This proof pattern does not recur often; I’ve only been able to successfully apply it to the proof of irrationality of \(\sqrt 3\).</p>
<p>The final pattern we’d like to write down, reduces a problem to a simpler one.</p>
<h2 id="prove-that-sqrt-12-is-irrational">Prove that \(\sqrt{12}\) is irrational</h2>
<h3 id="proof-2">Proof</h3>
<p>We assume that \(\sqrt{12}\) is rational. Therefore, we can express it as a ratio of two integers \(\frac{p}{q}:p,q\in\mathbb{N}\), which have no common factors between them. Thus:</p>
\[\frac{p^2}{q^2}=12 \\
\Rightarrow p^2=12q^2\]
<p>As usual, we make a quick truth table, to narrow down the feasibility of \(p, q\) being odd and/or even.</p>
<table>
<thead>
<tr>
<th>p</th>
<th>q</th>
<th>Feasible?</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>Even</td>
<td>Even</td>
<td>False</td>
<td>By definition</td>
</tr>
<tr>
<td>Even</td>
<td>Odd</td>
<td>True</td>
<td> </td>
</tr>
<tr>
<td>Odd</td>
<td>Even</td>
<td>False</td>
<td>\(12q^2\) is even, so \(p^2\) cannot be odd, thus \(p\) cannot be odd</td>
</tr>
<tr>
<td>Odd</td>
<td>Odd</td>
<td>False</td>
<td>\(12q^2\) is even, so \(p^2\) cannot be odd, thus \(p\) cannot be odd</td>
</tr>
</tbody>
</table>
<p>Thus, the only valid option is \(p\) even and \(q\) odd.</p>
<h4 id="proof-pattern-4"><u>Proof Pattern 4</u></h4>
<p>We only express \(p\) as an even number, initially. We set \(p=2m\), so that</p>
\[{(2m)}^2=12q^2 \\
4m^2=12q^2 \\
m^2=3q^2\]
<p><strong>NOTE</strong>: The above problem now reduces to proving that \(\sqrt 3\) is irrational. You’ll see that expanding \(q\) initially as an odd number will not get you anywhere.</p>
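<p>A brute-force sanity check (again, a check, not a proof) for all the radicands treated above:</p>

```python
# If sqrt(n) were rational for n in {2, 3, 7, 12}, some pair (p, q) would satisfy
# p^2 = n q^2; an exhaustive search over small p, q finds none.
for n in (2, 3, 7, 12):
    assert not any(p * p == n * q * q
                   for p in range(1, 501) for q in range(1, 501))
```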
<p>All proofs of irrationality of square roots of appropriate numbers can (probably) be derived using the patterns shown above, unless I chance upon a very special case.</p>
<h1>The Cholesky and \(LDL^T\) Factorisations</h1>
<p><em>2021-07-08</em></p>
<p>This article discusses a set of two useful (and closely related) factorisations for <strong>positive-definite matrices</strong>: the <strong>Cholesky</strong> and the <strong>\(LDL^T\)</strong> factorisations. Both of them find various uses: the Cholesky factorisation particularly is used when <strong>solving large systems of linear equations</strong>.</p>
<p><strong>NOTE</strong>: By definition, <strong>a positive-definite matrix is symmetric</strong>.</p>
<h2 id="factorisation-forms">Factorisation Forms</h2>
<ul>
<li>
<p>The <strong>Cholesky factorisation</strong> decomposes a positive definite matrix into the following form:</p>
\[\mathbf{A=LL^T}\]
<p>where \(A\) is <strong>positive-definite</strong>, and \(L\) is a <strong>lower triangular matrix</strong>.</p>
</li>
<li>
<p>The <strong>\(LDL^T\) factorisation</strong> as its name suggests, decomposes a <strong>positive definite matrix</strong> into the following form:</p>
\[\mathbf{A=LDL^T}\]
<p>where \(A\) is <strong>positive-definite</strong>, \(D\) is a <strong>diagonal matrix</strong>, and \(L\) is a <strong>lower triangular matrix</strong> which has <strong>1 in all its diagonal elements</strong>.</p>
</li>
</ul>
<h2 id="cholesky-factorisation">Cholesky Factorisation</h2>
<p>We will derive expressions for the <strong>Cholesky</strong> method by working backwards from the desired form of the factors. We will look at the \(3\times 3\) case to reinforce the pattern.</p>
\[A= \begin{bmatrix}
A_{11} && A_{12} && A_{13} \\
A_{21} && A_{22} && A_{23} \\
A_{31} && A_{32} && A_{33} \\
\end{bmatrix}
\\
L=
\begin{bmatrix}
L_{11} && 0 && 0 \\
L_{21} && L_{22} && 0\\
L_{31} && L_{32} && L_{33}\\
\end{bmatrix}\]
<p>Since we want \(A=LL^T\), we can write out \(LL^T\) as:</p>
\[LL^T=
\begin{bmatrix}
L_{11} && 0 && 0 \\
L_{21} && L_{22} && 0\\
L_{31} && L_{32} && L_{33}\\
\end{bmatrix}
\cdot
\begin{bmatrix}
L_{11} && L_{21} && L_{31} \\
0 && L_{22} && L_{32}\\
0 && 0 && L_{33}\\
\end{bmatrix}
=
\begin{bmatrix}
{L_{11}}^2 && L_{11}L_{21} && L_{11}L_{31} \\
L_{21}L_{11} && {L_{21}}^2 + {L_{22}}^2 && L_{21}L_{31} + L_{22}L_{32}\\
L_{31}L_{11} && L_{31}L_{21} + L_{32}L_{22} && {L_{31}}^2 + {L_{32}}^2 + {L_{33}}^2\\
\end{bmatrix}\]
<p><strong>The product of a matrix and its transpose is always symmetric</strong>, so we can ignore the upper right triangular portion of the above result, when computing the elements. Thus, we have the following equality:</p>
\[A= \begin{bmatrix}
A_{11} && A_{12} && A_{13} \\
A_{21} && A_{22} && A_{23} \\
A_{31} && A_{32} && A_{33} \\
\end{bmatrix}
=
\begin{bmatrix}
{L_{11}}^2 && - && - \\
L_{21}L_{11} && {L_{21}}^2 + {L_{22}}^2 && -\\
L_{31}L_{11} && L_{31}L_{21} + L_{32}L_{22} && {L_{31}}^2 + {L_{32}}^2 + {L_{33}}^2\\
\end{bmatrix}\]
<p>The element \(L_{11}\) is the easiest to compute; equating the terms gives us:</p>
\[L_{11}=\sqrt{A_{11}}\]
<p>Now let us consider the diagonal elements; the pattern suggests they follow the form:</p>
\[A_{ii}=\sum_{k=1}^i{L_{ik}}^2 \\
={L_{ii}}^2 + \sum_{k=1}^{i-1}{L_{ik}}^2 \\
\Rightarrow \mathbf{L_{ii}=\sqrt{A_{ii} - \sum_{k=1}^{i-1}{L_{ik}}^2}} \\\]
<p>Now let us consider the non-diagonal elements in the lower triangular section of the matrix. The pattern suggests the following form. You can convince yourself by computing the results for \(A_{21}\), \(A_{31}\), and \(A_{32}\).</p>
\[A_{ij}=\sum_{k=1}^j L_{ik}L_{jk} \\
= L_{ij}L_{jj} + \sum_{k=1}^{j-1} L_{ik}L_{jk} \\
\Rightarrow \mathbf{L_{ij}=\frac{1}{L_{jj}}\cdot \left( A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk} \right)}\]
<p>Let us consider how the two equations above help us compute the factorisation. For this illustration, we pick the <strong>Cholesky-Crout</strong> algorithm, which finds the elements of \(L\) <strong>column by column</strong>. The same concept works if you compute <strong>row by row</strong> (the <strong>Cholesky–Banachiewicz algorithm</strong>).</p>
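<p>The two equations above translate directly into code. The following is a minimal pure-Python sketch of the Cholesky-Crout scheme (the function name and the example matrix are illustrative, not from the text):</p>

```python
import math

def cholesky_crout(A):
    """Cholesky factorisation A = L L^T, computed column by column.

    A is a symmetric positive-definite matrix (list of lists);
    returns the lower-triangular factor L.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal element: L_jj = sqrt(A_jj - sum of squares of row j so far)
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        # Elements below the diagonal: L_ij = (A_ij - sum_k L_ik L_jk) / L_jj
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

# A standard positive-definite test matrix
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky_crout(A)
# Here L works out to [[2, 0, 0], [6, 1, 0], [-8, 5, 3]]
```

<p>Note that each element in a given column depends only on entries from columns to its left, which is exactly why the column-by-column order works.</p>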
<h2 id="example-computation-cholesky-crout">Example Computation (Cholesky-Crout)</h2>
<p>Note, at each stage, <strong>we’ve bolded the terms which are already known from one of the previous steps</strong>. Otherwise, we would not be able to compute the result of the current step. All the terms on the right hand side in each step should be known quantities.</p>
<h3 id="first-column">First Column</h3>
<p>The first column is easy to compute.</p>
\[L_{11}=\sqrt{\mathbf{A_{11}}} \\
L_{21}=\frac{\mathbf{A_{21}}}{\mathbf{L_{11}}} \\
L_{31}=\frac{\mathbf{A_{31}}}{\mathbf{L_{11}}} \\
...\]
<h3 id="second-column">Second Column</h3>
<p>The first element in the lower triangular section of the second column is the diagonal element (remember we’re only considering the lower triangular section, since the upper triangular section will be the mirror image, because \(LL^T\) is symmetric). For \(L_{22}\), we apply the <strong>diagonal element formula</strong>; for \(L_{32}\), the <strong>off-diagonal formula</strong>, like so:</p>
\[L_{22}=\sqrt{\mathbf{A_{22}} - {\mathbf{L_{21}}}^2} \\
L_{32}=\frac{1}{\mathbf{L_{22}}}\cdot\left(\mathbf{A_{32}} - \mathbf{L_{31}}\mathbf{L_{21}}\right) \\\]
<h3 id="third-column">Third Column</h3>
<p>The first element in the lower triangular section of the third column is the last diagonal element \(L_{33}\). We can again apply the diagonal element formula, like so:</p>
\[L_{33}=\sqrt{\mathbf{A_{33}} - ({\mathbf{L_{31}}}^2 + {\mathbf{L_{32}}}^2)} \\\]
<h2 id="ldlt-factorisation">\(LDL^T\) Factorisation</h2>
<p>Let’s look at the <strong>\(LDL^T\)</strong> factorisation. This is very similar to the <strong>Cholesky</strong> factorisation, except for the fact that <strong>it avoids the need to compute square roots for every term computation</strong>. The form that the \(LDL^T\) factorisation takes is:</p>
\[A=LDL^T\]
<p>where \(A\) is a <strong>positive-definite matrix</strong>, \(L\) is lower triangular with all its diagonal elements set to 1, and \(D\) is a diagonal matrix.</p>
<p>Let us take the \(3\times 3\) matrix as an example again, and we will follow the same approach as we did with the Cholesky factorisation.</p>
\[\begin{bmatrix}
A_{11} && A_{12} && A_{13} \\
A_{21} && A_{22} && A_{23} \\
A_{31} && A_{32} && A_{33} \\
\end{bmatrix}
=
\begin{bmatrix}
1 && 0 && 0 \\
L_{21} && 1 && 0\\
L_{31} && L_{32} && 1\\
\end{bmatrix}
\cdot
\begin{bmatrix}
D_{11} && 0 && 0 \\
0 && D_{22} && 0\\
0 && 0 && D_{33}\\
\end{bmatrix}
\cdot
\begin{bmatrix}
1 && L_{21} && L_{31} \\
0 && 1 && L_{32}\\
0 && 0 && 1\\
\end{bmatrix}\]
<p>Multiplying out the right hand side gives us the following:</p>
\[\begin{bmatrix}
D_{11} && 0 && 0 \\
L_{21}D_{11} && D_{22} && 0\\
L_{31}D_{11} && L_{32}D_{22} && D_{33}\\
\end{bmatrix}
\cdot
\begin{bmatrix}
1 && L_{21} && L_{31} \\
0 && 1 && L_{32}\\
0 && 0 && 1\\
\end{bmatrix} \\
=
\begin{bmatrix}
D_{11} && L_{21}D_{11} && L_{31}D_{11} \\
L_{21}D_{11} && {L_{21}}^2D_{11} + D_{22} && L_{31}L_{21}D_{11} + L_{32}D_{22}\\
L_{31}D_{11} && L_{31}L_{21}D_{11} + L_{32}D_{22} && {L_{31}}^2D_{11} + {L_{32}}^2D_{22} + D_{33} \\
\end{bmatrix}\]
<p>This suggests the following pattern for the diagonal elements.</p>
\[A_{ii}=D_{ii} + \sum_{k=1}^{i-1}{L_{ik}}^2D_{kk} \\
\Rightarrow \mathbf{D_{ii} = A_{ii} - \sum_{k=1}^{i-1}{L_{ik}}^2D_{kk}}\]
<p>For the off-diagonal elements, the following pattern is suggested.</p>
\[A_{ij}=L_{ij}D_{jj} + \sum_{k=1}^{j-1}L_{ik}L_{jk}D_{kk} \\
\Rightarrow \mathbf{L_{ij} = \frac{1}{D_{jj}}\cdot \left( A_{ij} - \sum_{k=1}^{j-1}L_{ik}L_{jk}D_{kk}\right)}\]
<p>The example computation for the \(LDL^T\) is not shown here, but it proceeds in exactly the same way as the Cholesky example computation above.</p>
<p><strong>The important thing to note is that these equations work because every element we are computing, depends only on other elements in the matrix which are above and to the left of that particular element.</strong></p>
<p>Since we begin from the top left and proceed column-wise, we know all the factors needed to compute any element. Further, the <strong>symmetry of the matrix</strong> allows us to do the same thing <strong>row-wise</strong> instead.</p>
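<p>The \(LDL^T\) recurrences above can be sketched similarly (a minimal pure-Python sketch; note that no square roots appear):</p>

```python
def ldlt(A):
    """LDL^T factorisation of a symmetric positive-definite matrix A.

    Returns (L, D): L is unit lower triangular (list of lists),
    D holds the diagonal entries of the diagonal matrix.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        # D_jj = A_jj - sum over k of L_jk^2 * D_kk
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        # L_ij = (A_ij - sum over k of L_ik * L_jk * D_kk) / D_jj
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L, D = ldlt(A)
# Here L = [[1, 0, 0], [3, 1, 0], [-4, 5, 1]] and D = [4, 1, 9]
```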
<h2 id="implications-forward-and-backward-substitution">Implications: Forward and Backward Substitution</h2>
<p>Assume we have a system of linear equations, contrived to be arranged like so:</p>
\[x_1=c_1 \\
2x_1+3x_2=c_2 \\
4x_1-5x_2+6x_3=c_3 \\
2x_1+7x_2-8x_3+3x_4=c_4 \\\]
<p>How would you find \(x_1\), \(x_2\), \(x_3\), \(x_4\)?
In this case, it is very easy, because <strong>you can always start at the top</strong>, knowing what \(x_1\) is, substitute it into the second equation, get \(x_2\), plug \(x_1\) and \(x_2\) into the third equation, and so on. No tiresome Gaussian Elimination is required, because the equations are set up to allow for the solutions to be arrived at very quickly. This is called solution by <strong>Forward Substitution</strong>.</p>
<p>In the same vein, consider the following system of equations:</p>
\[2x_1+7x_2-8x_3+3x_4=c_4 \\
\hspace{1.2cm}4x_2-5x_3+6x_4=c_3 \\
\hspace{2.4cm}2x_3+3x_4=c_2 \\
\hspace{3.7cm}x_4=c_1 \\\]
<p>How would you solve the above system? Very easy, in the same way as <strong>Forward Substitution</strong>, except in this case, you’d be working backwards from the bottom. This is <strong>Backward Substitution</strong>, and if you have a system of equations arranged in either of the above configurations, the solution is usually very direct.</p>
<p>If these equations were converted into matrix form, you see immediately that the <strong>forward substitution form is a lower triangular matrix</strong>, like so:</p>
\[\begin{bmatrix}
1 && 0 && 0 && 0\\
2 && 3 && 0 && 0\\
4 && -5 && 6 && 0\\
2 && 7 && -8 && 3\\
\end{bmatrix}
\cdot X =
\begin{bmatrix}
c_1 \\
c_2 \\
c_3 \\
c_4 \\
\end{bmatrix} \\\]
<p>Similarly, the <strong>backward substitution</strong> form is an <strong>upper triangular matrix</strong>, like so:</p>
\[\begin{bmatrix}
2 && 7 && -8 && 3\\
0 && 4 && -5 && 6\\
0 && 0 && 2 && 3\\
0 && 0 && 0 && 1\\
\end{bmatrix}
\cdot X =
\begin{bmatrix}
c_4 \\
c_3 \\
c_2 \\
c_1 \\
\end{bmatrix}\]
<p>This is why the <strong>Cholesky</strong> and <strong>\(LDL^T\)</strong> factorisations are so useful; once the original system is recast into one of these forms, solution of the system of linear equations proceeds very directly.</p>
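<p>Both substitution schemes are a few lines each. The following is a minimal pure-Python sketch, applied to the contrived lower triangular system above (the right-hand side values are chosen for illustration so that the solution is all ones):</p>

```python
def forward_substitution(L, b):
    """Solve L x = b, where L is lower triangular, from the top down."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
    return x

def backward_substitution(U, b):
    """Solve U x = b, where U is upper triangular, from the bottom up."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# The lower triangular system from the text
L = [[1.0, 0.0, 0.0, 0.0],
     [2.0, 3.0, 0.0, 0.0],
     [4.0, -5.0, 6.0, 0.0],
     [2.0, 7.0, -8.0, 3.0]]
x = forward_substitution(L, [1.0, 5.0, 5.0, 4.0])
# x works out to [1, 1, 1, 1]
```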
<h2 id="applications">Applications</h2>
<h3 id="1-solutions-to-linear-equations">1. Solutions to Linear Equations</h3>
<p><strong>Cholesky factorisation</strong> is used in solving large systems of linear equations, because we can exploit the <strong>lower triangular</strong> nature of \(L\) and the <strong>upper triangular</strong> nature of \(L^T\).</p>
\[AX=B
\Rightarrow LL^TX=B\]
<p>If we now set \(Y=L^TX\), then we can write:
\(LY=B\)</p>
<p>This can be solved very simply using <strong>forward substitution</strong>, because \(L\) is lower triangular. Once we have computed \(Y\), we solve the following system which we used for substitution, i.e.:</p>
\[L^TX=Y\]
<p>This is also a very easy computation, since \(L^T\) is upper triangular, and thus \(X\) can be solved using <strong>backward substitution</strong>.</p>
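<p>Putting the pieces together, solving \(AX=B\) via the Cholesky factors is one factorisation followed by two triangular solves. A minimal, self-contained pure-Python sketch (the function name and the test matrix are illustrative):</p>

```python
import math

def solve_spd(A, b):
    """Solve A x = b for symmetric positive-definite A via Cholesky.

    A minimal sketch: factor A = L L^T, then forward-substitute L y = b
    and backward-substitute L^T x = y.
    """
    n = len(b)
    # Cholesky factor L (lower triangular)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    # Backward substitution: L^T x = y  (note L^T[i][j] == L[j][i])
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
x = solve_spd(A, [0.0, 6.0, 39.0])
# x works out to [1, 1, 1]
```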
<p>It is also important to note that one of the aims of these factorisation algorithms is also to ensure that the <strong>decomposed factors are as sparse as possible</strong>. To this end, there is usually a step before the actual factorisation where the initial matrix is reordered (swapping columns and/or rows) based on certain metrics, to ensure that the factors end up being as sparse as possible. Such algorithms are called <strong>Minimum Degree Algorithms</strong>.</p>
<h3 id="2-interior-point-method-algorithms-in-linear-programming">2. Interior Point Method Algorithms in Linear Programming</h3>
<p>In <strong>Linear Programming</strong> solvers (like <strong>GLPK</strong>), <strong>Cholesky factorisation</strong> is used as part of the <strong>Interior Point Method</strong> solver in each step. Again, the primary use is to <strong>solve a system of linear equations</strong>, but this is an example of where it fits in a larger real-world context.</p>avishekThis article discusses a set of two useful (and closely related) factorisations for positive-definite matrices: the Cholesky and the \(LDL^T\) factorisations. Both of them find various uses: the Cholesky factorisation particularly is used when solving large systems of linear equations.The Gram-Schmidt Orthogonalisation2021-05-27T00:00:00+05:302021-05-27T00:00:00+05:30/2021/05/27/gram-scmidt-orthogonalisation<p>We discuss an important factorisation of a matrix, which allows us to convert a linearly independent but non-orthogonal basis to a <strong>linearly independent orthonormal basis</strong>. This uses a procedure which iteratively extracts vectors which are orthonormal to the previously-extracted vectors, to ultimately define the orthonormal basis. This is called the <strong>Gram-Schmidt Orthogonalisation</strong>, and we will also show a proof for this.</p>
<h2 id="projection-of-vectors-onto-vectors">Projection of Vectors onto Vectors</h2>
<p>This section derives the <strong>decomposition of a vector into two orthogonal components</strong>. These orthogonal components aren’t necessarily the standard basis vectors (\(\text{[1 0]}\) and \(\text{[0 1]}\) in \(\mathbb{R}^2\), for example); but they are guaranteed to be orthogonal to each other.</p>
<p>Assume we have the vector \(\vec{x}\) that we wish to decompose into two orthogonal components. Let us choose an arbitrary vector \(\vec{u}\) as one of the components; we will derive its orthogonal counterpart as part of this derivation.</p>
<p><img src="/assets/images/vector-projection.png" alt="Vector Projection" /></p>
<p>Since the projection will be collinear with \(\vec{u}\), let us assume the projection is \(t\vec{u}\), where \(t\) is a scalar.
The only constraint we wish to express is that the vector \(\vec{u}\) and the plumb line from the tip of the vector \(\vec{x}\) to \(\vec{u}\) are perpendicular, i.e., their dot product is zero. We can see from the above diagram that the plumb line is \(\vec{x}-t\vec{u}\). We can then write:</p>
\[u^T.(x-ut)=0 \\
\Rightarrow u^Tx=u^Tut \\
\Rightarrow t={(u^Tu)}^{-1}u^Tx\]
<p>We know that \(u^Tu\) is the dot product of \(\vec{u}\) with itself, and thus a scalar, so you could write it as:</p>
\[t=\frac{u^Tx}{u^Tu}\]
<p>and indeed, we’d be justified in doing that, but let’s not make that simplification, because there is a more general case coming up, where this will not be a scalar. Thus, the component of \(\vec{x}\) in the direction \(\vec{u}\) is \(ut={(u^Tu)}^{-1}u^Txu\) and the orthogonal component will be \(x-ut=x-{(u^Tu)}^{-1}u^Txu\).</p>
<p>The one important thing to note is the expression for \(t\) in the general case, i.e., when it is not a scalar. It is basically the expression for the <strong>left inverse of a general matrix</strong>.</p>
<p>There is one simplifying assumption we can make: if \(\vec{u}\) is a unit vector, then \(u^Tu=1\) (and in the matrix case, where the columns of \(u\) are orthonormal, \(u^Tu=I\)), which simplifies the expressions to:</p>
\[\mathbf{x_{u\parallel}={(u^Tu)}^{-1}u^Txu} \\
x_{u\parallel}=u^Txu \text{ (if u is a unit vector)}\\
\mathbf{x_{u\perp}=x-{(u^Tu)}^{-1}u^Txu} \\
x_{u\perp}=x-u^Txu \text{ (if u is a unit vector)}\]
<h2 id="projection-of-vectors-onto-vector-subspaces">Projection of Vectors onto Vector Subspaces</h2>
<p>The same logic applies when we are projecting vectors onto vector subspaces. We use the same constraint, i.e.:</p>
\[u^T.(x-ut)=0\]
<p>There are a few differences in the meaning of the symbols worth noting. \(u\) is no longer a single column vector; <strong>it is a set of column vectors which define a vector subspace</strong>. Let’s assume the vector subspace is embedded in \(\mathbb{R}^n\), and we have \(m\) linearly independent vectors in \(u\) (\(m\leq n\)). \(u\) now becomes an \(n\times m\) matrix.</p>
<p>The projection is no longer obtained by scaling a single vector; it is now expressible as a linear combination of these \(m\) vectors. <strong>This set of weightings is \(t\), which now becomes an \(m\times 1\) matrix.</strong> This change of \(t\) from a scalar to an \(m\times 1\) matrix is also the reason we didn’t simplify the \(u^Tu\) expression in the previous section; in the general case, \(t\) is not a scalar.</p>
<p>\(\vec{x}\) is still an \(n\times 1\) matrix; this hasn’t changed.</p>
<p>Thus, the results of projection of a vector onto a vector subspace are still the same.</p>
\[\mathbf{x_{u\parallel}={(u^Tu)}^{-1}u^Txu} \\
x_{u\parallel}=u^Txu \text{ (if u is a unit vector)}\\
\mathbf{x_{u\perp}=x-{(u^Tu)}^{-1}u^Txu} \\
x_{u\perp}=x-u^Txu \text{ (if u is a unit vector)}\]
\[t={(u^Tu)}^{-1}u^Tx\]
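<p>As a concrete check of \(t={(u^Tu)}^{-1}u^Tx\) in the subspace case, here is a minimal pure-Python sketch for \(m=2\), inverting the \(2\times 2\) Gram matrix \(u^Tu\) directly (the example, projecting onto the x-y plane in \(\mathbb{R}^3\), is illustrative):</p>

```python
def project_onto_subspace(U, x):
    """Project x (length n) onto the column space of U (n x 2).

    Implements t = (U^T U)^{-1} U^T x using a direct 2x2 inverse of the
    Gram matrix; the parallel component is U t, the perpendicular
    component is x - U t.
    """
    n = len(U)
    # Gram matrix G = U^T U (2x2), and the 2-vector U^T x
    g11 = sum(U[i][0] * U[i][0] for i in range(n))
    g12 = sum(U[i][0] * U[i][1] for i in range(n))
    g22 = sum(U[i][1] * U[i][1] for i in range(n))
    b1 = sum(U[i][0] * x[i] for i in range(n))
    b2 = sum(U[i][1] * x[i] for i in range(n))
    # Solve G t = U^T x via the closed-form 2x2 inverse
    det = g11 * g22 - g12 * g12
    t1 = (g22 * b1 - g12 * b2) / det
    t2 = (g11 * b2 - g12 * b1) / det
    parallel = [U[i][0] * t1 + U[i][1] * t2 for i in range(n)]
    perp = [x[i] - parallel[i] for i in range(n)]
    return parallel, perp

# Project x = (1, 2, 3) onto the x-y plane, spanned by the columns of U
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
par, perp = project_onto_subspace(U, [1.0, 2.0, 3.0])
# par works out to (1, 2, 0) and perp to (0, 0, 3)
```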
<h2 id="gram-schmidt-orthogonalisation">Gram-Schmidt Orthogonalisation</h2>
<p>We are now in a position to describe the intuition behind <strong>Gram-Schmidt Orthogonalisation</strong>. Let us state the key idea first.</p>
<p><strong>For a set of \(m\) linearly independent vectors in \(\mathbb{R}^n\) which span some subspace \(V_m\), there exists a set of \(m\) orthonormal basis vectors which span the same subspace \(V_m\).</strong></p>
<p>The procedure goes as follows:</p>
<p>Assume \(m\) <strong>linearly independent</strong> (but not orthogonal) vectors in \(\mathbb{R}^n\). They span some subspace \(V_m\) of dimensionality \(m\). Let these vectors be \(x_1\), \(x_2\), \(x_3\), …, \(x_m\).</p>
<ul>
<li>We have to start somewhere, so let’s assume that our first orthogonal basis vector is \(u_1=\frac{x_1}{\|x_1\|}\) (normalise to be a unit vector). <strong>\(u_1\) is our first orthogonal basis vector.</strong></li>
<li>We now project \(x_2\) onto \(u_1\), finding \({x_2}_{u_1\parallel}\) and \({x_2}_{u_1\perp}\) as we have described in the previous sections. We won’t really use \({x_2}_{u_1\parallel}\) except to calculate its orthogonal component \({x_2}_{u_1\perp}\).</li>
<li>
<p>Designate \(u_2={x_2}_{u_1\perp}\). Because of the way we have constructed \(u_2\), \(u_2\) is orthogonal to \(u_1\). <strong>We now have two orthogonal basis vectors, \(u_1\), \(u_2\).</strong> Normalise them to unit vectors as needed. Computationally, \(u_2\) looks like this:</p>
\[u_2=x_2-{u_1}^Tx_{2}u_1\]
</li>
<li>Now let us project \(x_3\) onto \(u_1\) and \(u_2\) to get \(({x_3}_{u_1\parallel}, {x_3}_{u_2\parallel})\). Calculate \({x_3}_{u_1,u_2\perp}=x_3-{x_3}_{u_1\parallel}-{x_3}_{u_2\parallel}\).</li>
<li>
<p>Designate \(u_3={x_3}_{u_1,u_2\perp}\). We now have three orthogonal basis vectors, \(u_1\), \(u_2\), \(u_3\). Normalise them to unit vectors as needed. Computationally, \(u_3\) looks like this:</p>
\[u_3=x_3-{u_1}^Tx_{3}u_1-{u_2}^Tx_{3}u_2\]
</li>
<li><strong>Repeat the above procedure for all the remaining vectors upto \(x_m\).</strong> At the end, we will have \(m\) orthogonal basis vectors \((u_1, u_2, ..., u_m)\) which will span the same vector subspace \(V_m\).</li>
</ul>
<p><img src="/assets/images/gram-schmidt-orthogonalisation.png" alt="Gram-Schmidt Orthogonalisation" /></p>
<p>You will notice that at every stage of this procedure, the next orthogonal basis vector to be computed, is given by the following general identity:</p>
\[u_{k+1}=x_{k+1}-\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i\]
<p>It is very easy to see that at every step, <strong>the latest basis vector is orthogonal to every other previously-generated basis vector</strong>. To see this, take the dot product on both sides with an arbitrary \(u_j\), such that \(j\leq k\).</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-\sum_{i=1}^{k}{u_j}^T\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}u_i \\
={u_j}^Tx_{k+1}-\sum_{i=1}^{k}\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}{u_j}^Tu_i\]
<p>Because of the way we have constructed the previous orthogonal basis vectors, we have \({u_j}^Tu_i=0\) for all \(j\neq i\), and \({u_j}^Tu_i=1\) for \(j=i\) (assuming unit basis vectors). Thus, the above identity becomes:</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-{u_j}^Tx_{k+1}=0\]
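<p>The whole procedure fits in a short routine (a minimal pure-Python sketch using the classical update \(u_{k+1}=x_{k+1}-\sum_i ({u_i}^Tx_{k+1})u_i\), with normalisation at each step):</p>

```python
import math

def gram_schmidt(vectors):
    """Turn a list of linearly independent vectors into an orthonormal basis.

    Each new basis vector is the input vector minus its projections onto
    all previously built unit basis vectors, then normalised.
    """
    basis = []
    for x in vectors:
        w = list(x)
        for u in basis:
            # Subtract the component of x along u: (u . x) u
            dot = sum(ui * xi for ui, xi in zip(u, x))
            w = [wi - dot * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(sum(wi ** 2 for wi in w))
        basis.append([wi / norm for wi in w])
    return basis

u1, u2 = gram_schmidt([[3.0, 1.0], [2.0, 2.0]])
# u1 and u2 are unit vectors, and their dot product is (numerically) zero
```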
<h2 id="proof-of-gram-schmidt-orthogonalisation">Proof of Gram-Schmidt Orthogonalisation</h2>
<p>A very valid question is: <strong>why does the basis from the Gram-Schmidt procedure span the same vector subspace as the one spanned by the original non-orthogonal basis?</strong></p>
<p>The proof should make this clear; most of it follows almost directly from the procedure itself; we only need to fill in a few gaps, and formalise the presentation.</p>
<p>Given a set of \(m\) <strong>linearly independent vectors</strong> \((x_1, x_2, x_3, ..., x_m)\) in \(\mathbb{R}^n\) spanning an \(m\)-dimensional vector subspace \(V\subseteq\mathbb{R}^n\), there exists an <strong>orthogonal basis</strong> \((u_1, u_2, u_3, ..., u_m)\) which spans the same subspace \(V\).</p>
<p>We prove this by induction.</p>
<h3 id="1-proof-for-n1">1. Proof for \(n=1\)</h3>
<p><strong>Let us validate the hypothesis for \(n=1\).</strong> For \(x_1\), if we take \(u_1=\frac{x_1}{\|x_1\|}\), we can see that \(u_1\) spans the same vector subspace as \(x_1\), since it’s merely a scaled version of \(x_1\).</p>
<h3 id="2-proof-for-nk1">2. Proof for \(n=k+1\)</h3>
<p>Let us now assume that the above statement holds for \(n=k\leq m\), i.e., there are \(k\) orthogonal basis vectors \((u_1, u_2, u_3, ..., u_k)\) which span the same \(k\)-dimensional vector subspace as the set \((x_1, x_2, x_3, ..., x_k)\).</p>
<p>Now, consider the construction of the \((k+1)\)th orthogonal basis vector \(u_{k+1}\) like so:</p>
\[u_{k+1}=x_{k+1}-\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i\]
<p>It is very easy to see that at every step, <strong>the latest basis vector is orthogonal to every other previously-generated basis vector</strong>. To see this, take the dot product on both sides with an arbitrary \(u_j\), such that \(j\leq k\).</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-\sum_{i=1}^{k}{u_j}^T\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}u_i \\
={u_j}^Tx_{k+1}-\sum_{i=1}^{k}\underbrace{ ({u_i}^Tx_{k+1}) }_{scalar}{u_j}^Tu_i\]
<p>Because of the way we have constructed the previous orthogonal basis vectors, we have \({u_j}^Tu_i=0\) for all \(j\neq i\), and \({u_j}^Tu_i=1\) for \(j=i\) (assuming unit basis vectors). Thus, the above identity becomes:</p>
\[{u_j}^Tu_{k+1}={u_j}^Tx_{k+1}-{u_j}^Tx_{k+1}=0\]
<p>Thus, the newly constructed basis vector is orthogonal to every basis vector \((u_1, u_2, u_3, ..., u_k)\). This completes the induction part of the proof.</p>
<h3 id="3-proof-that-u_k1neq-0">3. Proof that \(u_{k+1}\neq 0\)</h3>
<p>We also prove that <strong>the newly-constructed basis vector is not a zero vector</strong>. For that, let us assume that \(u_{k+1}=0\). Then, we get:</p>
\[x_{k+1}-\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i=0 \\
x_{k+1}=\sum_{i=1}^{k}{u_i}^Tx_{k+1}u_i\]
<p>This implies that \(x_{k+1}\) is expressible as a linear combination of the set of vectors \((u_1, u_2, u_3, ..., u_k)\). But we have also assumed that this set spans the same vector subspace as \((x_1, x_2, x_3, ..., x_k)\).</p>
<p>This implies that \(x_{k+1}\) is expressible as a linear combination of the set \((x_1, x_2, x_3, ..., x_k)\), which is a <strong>contradiction</strong>, since the vectors in the full set \((x_1, x_2, x_3, ..., x_m)\) are linearly independent. Thus, \(u_{k+1}\) cannot be zero.</p>
\[\blacksquare\]avishekWe discuss an important factorisation of a matrix, which allows us to convert a linearly independent but non-orthogonal basis to a linearly independent orthonormal basis. This uses a procedure which iteratively extracts vectors which are orthonormal to the previously-extracted vectors, to ultimately define the orthonormal basis. This is called the Gram-Schmidt Orthogonalisation, and we will also show a proof for this.Real Analysis Proofs #12021-05-18T00:00:00+05:302021-05-18T00:00:00+05:30/2021/05/18/peano-axiom-proofs-practice-1<p>Since I’m currently self-studying <strong>Real Analysis</strong>, I’ll be listing down proofs I either initially had trouble understanding, or enjoyed proving, here. These are very mathematical posts, and are for personal documentation, mostly.</p>
<h2 id="recursive-definitions">Recursive Definitions</h2>
<p>Source: <strong>Analysis 1</strong> by <em>Terence Tao</em></p>
<h2 id="definitions">Definitions</h2>
<ul>
<li>A natural number is any element in the set \(\mathbb{N}:=\{0,1,2,3,...\}\).</li>
</ul>
<h2 id="peano-axioms-used">Peano Axioms Used</h2>
<ol>
<li>\(0\) is a natural number.</li>
<li>If \(n\) is a natural number, \(\mathbf{n++}\) is also a natural number.</li>
<li>\(0\) is not the successor to any natural number, i.e., \(n++ \neq 0, \forall n\in\mathbb{N}\).</li>
<li>Different natural numbers must have different successors. If \(m\neq n\), then \(m++ \neq n++\). Conversely, if \(m++ \neq n++\), then \(m=n\).</li>
</ol>
<h2 id="proposition">Proposition</h2>
<p>Suppose there exists a function \(f_n:\mathbb{N}\rightarrow\mathbb{N}\). Let \(c\in\mathbb{N}\). Then we can assign a unique natural number \(a_n\) for each natural number \(n\), such that \(a_0=c\), and \(a_{n++}=f_n(a_n)\ \forall n\in\mathbb{N}\).</p>
<h3 id="proof-by-induction">Proof by Induction</h3>
<p><strong>For zero</strong></p>
<p>Let \(0\) be assigned \(a_0=c\).
Then, \(a_{0++}=f_0(a_0)\). Since \(0\) is never the successor of any natural number by Axiom (3), the assignment \(a_0\) will not recur as \(a_{0++}\).</p>
<p><strong>For \(n\)</strong></p>
<p>From Axiom (4), we can infer that:</p>
\[n++\neq n,n-1,n-2,...,1,0 \\
\Rightarrow a_{n++}\neq a_n,a_{n-1},a_{n-2},...,a_1,a_0\]
<p>Thus, \(a_{n++}\) is unique in the set \(\{a_0,a_1,a_2,...,a_n,a_{n++}\}\).</p>
<p><strong>For \(n++\)</strong></p>
<p>By extension, for \((n++)++\), we can write:</p>
\[(n++)++\neq n++,n,n-1,n-2,...,1,0 \\
\Rightarrow a_{(n++)++}\neq a_{n++},a_n,a_{n-1},a_{n-2},...,a_1,a_0\]
<p>Thus, \(a_{(n++)++}\) is unique in the set \(\{a_0,a_1,a_2,...,a_n,a_{n++},a_{(n++)++}\}\). Thus, we can assign a unique natural number \(a_{(n++)++}\) such that \(a_{(n++)++}=f_{n++}(a_{n++})\).</p>
\[\blacksquare\]
<h2 id="proof-of-existence-of-real-cube-roots">Proof of Existence of Real Cube Roots</h2>
<p>Let \(r\in\mathbb{R}\), with \(r\geq 0\).
For the case of \(r=0\), the cube root is \(0\), so assume \(r>0\) in what follows.</p>
<p>Consider the set \(\mathbb{S}=\{x:x^3\leq r, x\in \mathbb{R}, r\in \mathbb{R}\}\).</p>
<p>This set is non-empty because \(0\in\mathbb{S}\). It is also bounded above by \(r+1\), because \((r+1)^3=r^3+3r^2+3r+1>r\).</p>
<p>Therefore, by the <strong>Completeness Axiom</strong>, \(\mathbb{S}\) has a least upper bound. Denote this least upper bound by \(x\).</p>
<p>By the <strong>Trichotomy property</strong>, these are the possible cases:</p>
<ul>
<li><strong>Case 1</strong>: \(x^3<r\)</li>
<li><strong>Case 2</strong>: \(x^3>r\)</li>
<li><strong>Case 3</strong>: \(x^3=r\).</li>
</ul>
<p><strong>Case 1</strong>: Assume that: \(\mathbb{x^3<r}\)</p>
<p>Then, by our definition of \(\mathbb{S}\), <strong>\(x\in\mathbb{S}\) and is its least upper bound</strong>, i.e., <strong>there are no elements in \(\mathbb{S}\) which are greater than \(x\)</strong>.</p>
<p>If the cube of the least upper bound \(x\) is less than \(r\), then it is enough to show that there exists an \(x+\delta\) with \(\delta>0\) whose cube is also less than \(r\).</p>
<p>Assume that \(0<\delta<1\); larger values of \(\delta\) may also work, but this restriction simplifies the bounds below.</p>
<p>Then, we’d like to find a \(0<\delta<1\) such that \((x+\delta)^3<r\). This gives us:</p>
\[(x+\delta)^3<r \\
x^3+\delta^3+3x^2\delta+3x\delta^2<r \\
(x^3-r)+\delta^3+3x^2\delta+3x\delta^2<0\]
<p>Since \(0<\delta<1\), we have \(\delta^3<\delta\) and \(3x\delta^2\leq 3x\delta\), thus we can say:</p>
\[(x^3-r)+\delta^3+3x^2\delta+3x\delta^2<(x^3-r)+\delta+3x^2\delta+3x\delta\]
<p>Then, it is enough to prove that:</p>
\[(x^3-r)+\delta+3x^2\delta+3x\delta<0\]
<p>With some algebraic manipulation, we get:</p>
\[(x^3-r)+\delta+3x^2\delta+3x\delta<0 \\
\Rightarrow \delta(1+3x^2+3x)<r-x^3 \\
\Rightarrow \delta<\frac{r-x^3}{1+3x^2+3x}\]
<p>If we assume \(\delta=\frac{1}{k}:k\in\mathbb{N}\), then we can say:</p>
\[k>\frac{1+3x^2+3x}{r-x^3}: k\in\mathbb{N}\]
<p>Since the <strong>Archimedean property</strong> states that natural numbers have no upper bound, such a \(k\) must exist.
This means we have found an element \((x+\frac{1}{k})\), larger than \(x\), such that \({(x+\frac{1}{k})}^3<r\). <strong>This implies that \((x+\frac{1}{k})\) lies in \(\mathbb{S}\)</strong>; however, this contradicts our assumption that no element greater than \(x\) exists in \(\mathbb{S}\).</p>
<p><strong>Thus, the statement \(x^3<r\) is false.</strong></p>
<p><strong>Case 2</strong>: Assume that: \(\mathbb{x^3>r}\)</p>
<p>If the cube of the least upper bound \(x\) is greater than \(r\), then it is enough to show that there exists an \(x-\delta\) with \(\delta>0\) whose cube is also greater than \(r\).</p>
<p>Assume that \(0<\delta<1\); larger values of \(\delta\) may also work, but this restriction simplifies the bounds below.</p>
<p>Then, we’d like to find a \(0<\delta<1\) such that \((x-\delta)^3>r\). This gives us:</p>
\[(x-\delta)^3>r \\
x^3-\delta^3-3x^2\delta+3x\delta^2>r \\
(x^3-r)-\delta^3-3x^2\delta+3x\delta^2>0 \\\]
<p>Again note that since \(\delta^3<\delta\), and \(3x\delta^2\) is positive, we can write:</p>
\[(x^3-r)-\delta^3-3x^2\delta+3x\delta^2>(x^3-r)-\delta-3x^2\delta \\\]
<p>Thus it is enough to prove that:</p>
\[(x^3-r)-\delta-3x^2\delta>0\]
<p>Some algebraic manipulation gives us:</p>
\[(1+3x^2)\delta<x^3-r \\
\Rightarrow \delta<\frac{x^3-r}{1+3x^2}\]
<p>If we assume \(\delta=\frac{1}{k}:k\in\mathbb{N}\), then we can say:</p>
\[k>\frac{1+3x^2}{x^3-r}: k\in\mathbb{N}\]
<p>Since the <strong>Archimedean property</strong> states that natural numbers have no upper bound, such a \(k\) must exist.
This means we have found an element \((x-\frac{1}{k})\), smaller than \(x\), such that \({(x-\frac{1}{k})}^3>r\). <strong>Thus, \((x-\frac{1}{k})\) is an upper bound for \(\mathbb{S}\) smaller than \(x\)</strong>; however, this contradicts our assumption that \(x\) is the least upper bound.</p>
<p><strong>Thus, the statement \(x^3>r\) is false.</strong></p>
<p>Thus, the only possibility is that <strong>Case 3</strong> is true, i.e., \(x^3=r\), thus implying the existence of real cube roots of real numbers.</p>
\[\blacksquare\]avishekSince I’m currently self-studying Real Analysis, I’ll be listing down proofs I either initially had trouble understanding, or enjoyed proving, here. These are very mathematical posts, and are for personal documentation, mostly.Quadratic Optimisation: Lagrangian Dual, and the Karush-Kuhn-Tucker Conditions2021-05-10T00:00:00+05:302021-05-10T00:00:00+05:30/2021/05/10/quadratic-form-optimisation-kkt<p>This article concludes the (very abbreviated) theoretical background required to understand <strong>Quadratic Optimisation</strong>. Here, we extend the <strong>Lagrangian Multipliers</strong> approach, which in its current form, admits only equality constraints. We will extend it to allow constraints which can be expressed as inequalities.</p>
<p>Much of this discussion applies to the general class of <strong>Convex Optimisation</strong>; however, I will be constraining the form of the problem slightly to simplify discussion. We have already developed most of the basic mathematical results (see <a href="/2021/05/08/quadratic-optimisation-theory.html">Quadratic Optimisation Concepts</a>) in order to fully appreciate the implications of the <strong>Karush-Kuhn-Tucker Theorem</strong>.</p>
<p><strong>Convex Optimisation</strong> solves problems framed using the following standard form:</p>
<p>Minimise (with respect to \(x\)), \(\mathbf{f(x)}\)</p>
<p>subject to:</p>
<p>\(\mathbf{g_i(x)\leq 0, i=1,...,n}\) <br />
\(\mathbf{h_i(x)=0, i=1,...,m}\)</p>
<p>where:</p>
<ul>
<li>\(\mathbf{f(x)}\) is a <strong>convex</strong> function</li>
<li>\(\mathbf{g_i(x)}\) are <strong>convex</strong> functions</li>
<li>\(\mathbf{h_i(x)}\) are <strong>affine</strong> functions.</li>
</ul>
<p>For <strong>Quadratic Optimisation</strong>, the extra constraint that is imposed is that the \(g_i(x)\) are also affine functions. Therefore, all of our constraints are essentially linear.</p>
<p>For this discussion, I’ll omit the equality constraints \(h_i(x)\) for clarity; any <strong>equality constraints can always be converted into inequality constraints</strong>, and become part of \(g_i(x)\).</p>
<p>Thus, this is the reframing of the <strong>Quadratic Optimisation</strong> problem for the purposes of this discussion.</p>
<p>Minimise (with respect to \(x\)), \(\mathbf{f(x)}\)</p>
<p>subject to: \(\mathbf{g_i(x)\leq 0, i=1,...,n}\)</p>
<p>where:</p>
<ul>
<li>\(\mathbf{f(x)}\) is a <strong>convex function</strong></li>
<li>\(\mathbf{g_i(x)}\) are <strong>affine functions</strong></li>
</ul>
<h2 id="karush-kuhn-tucker-stationarity-condition">Karush-Kuhn-Tucker Stationarity Condition</h2>
<p>We have already seen in <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers, Manifolds, and the Implicit Function Theorem</a> that the gradient vector of a function can be expressed as a <strong>linear combination of the gradient vectors</strong> of the constraint manifolds.</p>
\[\mathbf{
Df=\lambda_1 Dh_1(U,V)+\lambda_2 Dh_2(U,V)+\lambda_3 Dh_3(U,V)+...+\lambda_n Dh_n(U,V)
}\]
<p>We can rewrite this as:</p>
\[\mathbf{
Df(x)=\sum_{i=1}^n\lambda_i.Dg_i(x)
}\]
<p>where \(x=(U,V)\). We will not consider the pivotal and non-pivotal variables separately in this discussion.</p>
<p>In this original formulation, we expressed the gradient vector as a linear combination of the gradient vector(s) of the constraint manifold(s).
We can bring everything over to one side and flip the signs of the Lagrangian Multipliers to get the following:</p>
\[Df(x)+\sum_{i=1}^n\lambda_i.Dg_i(x)=0\]
<p>Since the derivatives in this case represent the gradient vectors, we can rewrite the above as:</p>
\[\mathbf{
\begin{equation}
\nabla f(x)+\sum_{i=1}^n\lambda_i.\nabla g_i(x)=0
\label{eq:kkt-1}
\end{equation}
}\]
<p>This expresses the fact that the <strong>gradient vector of the tangent space must be opposite (and obviously parallel) to the direction of the gradient vector of the objective function</strong>. All it really amounts to is a <strong>change in the sign</strong> of the multipliers \(\lambda_i\); we do this so that the <strong>Lagrange multiplier terms act as penalties</strong> when the constraints \(g_i(x)\) are violated. We will see this in action when we explore the properties of the Lagrangian in the next few sections.</p>
<p>The identity \(\eqref{eq:kkt-1}\) is the <strong>Stationarity Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</p>
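<p>As a quick sanity check of the stationarity condition, consider a one-dimensional example of my own choosing: minimise \(f(x)=x^2\) subject to \(g(x)=1-x\leq 0\). The constrained optimum is \(x^*=1\), and a single multiplier \(\lambda=2\) makes the gradients cancel:</p>

```python
# f(x) = x^2, g(x) = 1 - x <= 0; the feasible region is x >= 1.
def grad_f(x): return 2.0 * x    # gradient of the objective
def grad_g(x): return -1.0       # gradient of the constraint

x_star = 1.0   # constrained minimum, lying on the boundary g(x) = 0
lam = 2.0      # multiplier balancing the two gradients

# Stationarity: grad f(x*) + lambda * grad g(x*) = 0, with lambda >= 0.
residual = grad_f(x_star) + lam * grad_g(x_star)
print(residual)  # 0.0
```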
<h2 id="active-and-inactive-constraints">Active and Inactive Constraints</h2>
<p>In <strong>Quadratic Optimisation</strong>, \(g_i(x)|i=1,2,...,n\) represent the constraint functions. An important concept to build intuition about is the difference between dealing with pure equality constraints and inequality constraints.</p>
<p>The diagram below shows an example where all the constraints are equality constraints.</p>
<p><img src="/assets/images/optimisation-equality-constraints.png" alt="Equality Constraints" /></p>
<p>There are two points to note.</p>
<ul>
<li>All equality constraints are expressed in the form \(g_i(x)=0\) and they all must be satisfied simultaneously.</li>
<li><strong>All equality constraints, being affine, must be tangent to the objective function surface</strong>, since only then can the gradient vector of the solution be expressed as the Lagrangian combination of these tangent spaces.</li>
</ul>
<p>The situation changes when inequality constraints are involved. Here is another rough diagram to demonstrate. The y-coordinate represents the image of the objective function \(f(x)\). The x-coordinate represents the image of the constraint function \(g(x)\), i.e., the different values \(g(x)\) can take for different values of \(x\).</p>
<p>The equality condition in this case maps to the y-axis, since that corresponds to \(g(x)=0\). However, we’re dealing with inequality constraints here, namely, \(g(x) \leq 0\); thus the viable space of solutions for \(f(x)\) are all to the left of the y-axis.</p>
<p>As you can see, since \(g(x)\leq 0\), the solution is not required to touch the level set of the constraint manifold corresponding to zero. Such solutions might not be the optimal solutions (we will see why in a moment), but they are viable solutions nevertheless.</p>
<p>We now draw two example solution spaces with two different shapes.</p>
<p><img src="/assets/images/optimisation-active-constraint.png" alt="Active Constraint in Optimisation" /></p>
<p>In the first figure, the global minimum of \(f(x)\) violates the constraint, since it lies in the region \(g(x)>0\). Thus, we cannot pick that point; we must pick the minimum \(f(x)\) that does not violate the constraint \(g(x)\leq 0\). This point in the diagram lies on the y-axis, i.e., on \(g(x)=0\). The constraint \(g(x)\leq 0\) in this scenario is considered an <strong>active constraint</strong>.</p>
<p><img src="/assets/images/optimisation-inactive-constraint.png" alt="Inactive Constraint in Optimisation" /></p>
<p>Contrast this with the diagram above. Here, the shape of the solution space is different. The minimum \(f(x)\) lies within the \(g(x)\leq 0\) zone. This means that even if we minimise \(f(x)\) without regard to the constraint \(g(x)\leq 0\), we’ll still get the minimum solution which still satisfies the constraint. In this scenario, we call \(g(x)\leq 0\) an <strong>inactive constraint</strong>. This implies that in this scenario, we do not even need to consider the constraint \(g_i(x)\) as part of the objective function. As you will see, after we define the Lagrangian, this can be done by setting the corresponding Lagrangian multiplier to zero.</p>
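<p>The two scenarios can be reproduced numerically. In this sketch (the two parabolas are illustrative choices of mine), the same constraint \(x\leq 0\) is active for one objective and inactive for the other:</p>

```python
from scipy.optimize import minimize

# Constraint g(x) = x <= 0; SciPy's "ineq" convention is fun(x) >= 0.
cons = [{"type": "ineq", "fun": lambda x: -x[0]}]

# Active: the unconstrained minimum of (x-2)^2 sits at x=2, where g(x) > 0,
# so the solver is forced onto the boundary g(x) = 0.
active = minimize(lambda x: (x[0] - 2.0) ** 2, x0=[-1.0], constraints=cons)

# Inactive: the unconstrained minimum of (x+2)^2 sits at x=-2, where g(x) < 0,
# so the constraint never binds.
inactive = minimize(lambda x: (x[0] + 2.0) ** 2, x0=[-1.0], constraints=cons)

print(active.x, inactive.x)  # ~0.0 (on the boundary) and ~-2.0 (interior)
```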
<h2 id="the-lagrangian">The Lagrangian</h2>
<p>We now have the machinery to explore the <strong>Lagrangian Dual</strong> in some detail. We will first consider the <strong>Lagrangian</strong> of a function. The Lagrangian form simply restates the Lagrange Multiplier form as a function \(L(x,\lambda)\), like so:</p>
\[L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\text{ such that }\lambda_i\geq 0 \text{ and } g_i(x)\leq 0\]
<p>Let us note these conditions from the above identity:</p>
\[\begin{equation}
\mathbf{
g_i(x)\leq 0 \label{eq:kkt-4}
}
\end{equation}\]
\[\begin{equation}
\mathbf{
\lambda_i\geq 0 \label{eq:kkt-3}
}
\end{equation}\]
<ul>
<li><strong>Primal Feasibility Condition</strong>: The inequality \(\eqref{eq:kkt-4}\) is the <strong>Primal Feasibility Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</li>
<li><strong>Dual Feasibility Condition</strong>: The inequality \(\eqref{eq:kkt-3}\) is the <strong>Dual Feasibility Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</li>
</ul>
<p>We have simply moved all the terms of the Lagrangian formulation onto one side and denoted it by \(L(x,\lambda)\), like we talked about when concluding the <strong>Stationarity Condition</strong>.</p>
<p>Note that differentiating with respect to \(x\) and setting it to zero, will get us back to the usual <strong>Vector Calculus</strong>-motivated definition, i.e.:</p>
\[D_xL=
\mathbf{
\nabla f-{[\nabla G]}^T\lambda
}\]
<p>where \(G\) represents \(n\) constraint functions, \(\lambda\) represents the \(n\) Lagrange multipliers, and \(f\) is the objective function.</p>
<h2 id="the-primal-optimisation-problem">The Primal Optimisation Problem</h2>
<p>We will now explore the properties of the Lagrangian, both analytically, as well as geometrically.</p>
<p>Remembering the definition of the supremum of a function, we find the supremum of the Lagrangian with respect to \(\lambda\) (that is, to find the supremum in each case, we vary the value of \(\lambda\)) to be the following:</p>
\[\text{sup}_\lambda L(x,\lambda)=\begin{cases}
f(x) & \text{if } g_i(x)\leq 0 \\
\infty & \text{if } g_i(x)>0
\end{cases}\]
<p>Remember that \(\mathbf{\lambda \geq 0}\).</p>
<p>Thus, for the first case, if \(g_i(x) \leq 0\), the best we can do is set \(\lambda=0\), since any other non-negative value will not be the supremum.</p>
<p>In the second case, if \(g(x)>0\), the supremum of the function can be made as high as we like by increasing the value of \(\lambda\). Thus, letting \(\lambda\rightarrow\infty\), the corresponding supremum becomes \(\infty\).</p>
<p>We can see that the function \(\text{sup}_\lambda L(x,\lambda)\) incorporates the constraints \(g_i(x)\) directly: there is a penalty of \(\infty\) for any constraint which is violated. Therefore, the original problem of minimising \(f(x)\) can be equivalently stated as:</p>
\[\text{Minimise (w.r.t. x) }\text{sup}_\lambda L(x,\lambda) \\
\text{where } L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\]
<p>Equivalently, we say:</p>
\[\text{Find }\mathbf{\text{inf}_x\text{ }\text{sup}_\lambda L(x,\lambda)} \\
\text{where } L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\]
<p>This is referred to in the mathematical optimisation field as the <strong>primal optimisation problem</strong>.</p>
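<p>A small numeric check of this penalty behaviour, using the toy problem \(f(x)=x^2\) with \(g(x)=x\leq 0\) (an illustrative choice, with a large finite grid of \(\lambda\) values standing in for \(\lambda\rightarrow\infty\)):</p>

```python
import numpy as np

f = lambda x: x ** 2
g = lambda x: x                        # constraint: g(x) = x <= 0
L = lambda x, lam: f(x) + lam * g(x)   # the Lagrangian

lams = np.linspace(0.0, 1e6, 1001)     # finite stand-in for lambda in [0, inf)

sup_feasible = max(L(-2.0, lam) for lam in lams)    # g(-2) < 0: supremum is f(-2)
sup_infeasible = max(L(2.0, lam) for lam in lams)   # g(2) > 0: grows with lambda

print(sup_feasible)    # 4.0, attained at lambda = 0
print(sup_infeasible)  # huge: the "infinite" penalty for violating g(x) <= 0
```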
<h2 id="karush-kuhn-tucker-complementary-slackness-condition">Karush-Kuhn-Tucker Complementary Slackness Condition</h2>
<p>We previously discussed the two possible scenarios when optimising with constraints: either a constraint is active, or it is inactive.</p>
<ul>
<li><strong>Constraint is active</strong>: This implies that the optimal point \(x^*\) lies on the constraint manifold. Thus, \(\mathbf{g_i(x^*)=0}\). Correspondingly, \(\mathbf{\lambda_i g_i(x^*)=0}\).</li>
<li><strong>Constraint is inactive</strong>: This implies that <strong>the optimal point \(x^*\) does not lie on the constraint manifold, but somewhere inside</strong>. Thus, \(\mathbf{g_i(x^*)<0}\). However, this also means that we can optimise \(f(x)\) without regard to the constraint \(g_i(x)\). The best way to get rid of this constraint, then, is to set the corresponding Lagrange multiplier \(\mathbf{\lambda_i=0}\). Correspondingly, \(\mathbf{\lambda_i g_i(x^*)=0}\) again (albeit for different reasons from the active constraint case).</li>
</ul>
<p>Thus, we may conclude that all \(\lambda_i g_i(x)\) terms in the Lagrangian must be zero, regardless of whether the corresponding constraint is active or inactive.</p>
<p>Mathematically, this implies:</p>
\[\begin{equation}
\mathbf{
\sum_{i=1}^n\lambda_i.g_i(x)=0 \label{eq:kkt-2}
}
\end{equation}\]
<p>The identity \(\eqref{eq:kkt-2}\) is termed the <strong>Complementary Slackness Condition</strong>, one of the <strong>Karush-Kuhn-Tucker Conditions</strong>.</p>
<h2 id="the-karush-kuhn-tucker-conditions">The Karush-Kuhn-Tucker Conditions</h2>
<p>We are now in a position to summarise all the <strong>Karush-Kuhn-Tucker Conditions</strong>. The theorem states that for the optimisation problem given by:</p>
\[\mathbf{\text{Minimise}_x \hspace{3mm} f(x)}\]
<p>if the following conditions are met for some \(x^*\):</p>
<h3 id="1-primal-feasibility-condition">1. Primal Feasibility Condition</h3>
<p>\(\mathbf{g_i(x^*)\leq 0}\)</p>
<h3 id="2-dual-feasibility-condition">2. Dual Feasibility Condition</h3>
<p>\(\mathbf{\lambda_i\geq 0}\)</p>
<h3 id="3-stationarity-condition">3. Stationarity Condition</h3>
<p>\(\mathbf{\nabla f(x^*)+\sum_{i=1}^n\lambda_i.\nabla g_i(x^*)=0}\)</p>
<h3 id="4-complementary-slackness-condition">4. Complementary Slackness Condition</h3>
<p>\(\mathbf{\sum_{i=1}^n\lambda_i.g_i(x^*)=0}\)</p>
<p>then \(x^*\) is a <strong>local optimum</strong>.</p>
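<p>The four conditions can be checked mechanically. Below is a sketch for a worked example of my own: minimise \(f(x)=(x_1-2)^2+(x_2-2)^2\) subject to \(g_1(x)=x_1+x_2-2\leq 0\) and \(g_2(x)=-x_1\leq 0\), whose optimum is \(x^*=(1,1)\) with multipliers \(\lambda=(2,0)\):</p>

```python
import numpy as np

x = np.array([1.0, 1.0])     # candidate optimum x*
lam = np.array([2.0, 0.0])   # candidate multipliers (g1 active, g2 inactive)

g = np.array([x[0] + x[1] - 2.0, -x[0]])                 # constraint values at x*
grad_f = np.array([2 * (x[0] - 2.0), 2 * (x[1] - 2.0)])  # gradient of f at x*
grad_g = np.array([[1.0, 1.0],                           # gradient of g1
                   [-1.0, 0.0]])                         # gradient of g2

assert np.all(g <= 0)                           # 1. primal feasibility
assert np.all(lam >= 0)                         # 2. dual feasibility
assert np.allclose(grad_f + lam @ grad_g, 0.0)  # 3. stationarity
assert abs(lam @ g) < 1e-12                     # 4. complementary slackness
print("all four KKT conditions hold at x* = (1, 1)")
```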
<h2 id="the-dual-optimisation-problem">The Dual Optimisation Problem</h2>
<p>We already know from the <a href="/2021/05/08/quadratic-optimisation-theory.html">Max-Min Inequality</a> that:</p>
\[\mathbf{\text{sup}_y \text{ inf}_x f(x,y)\leq \text{inf}_x \text{ sup}_y f(x,y)} \text{ }\forall x,y\in\mathbb{R}\]
<p>Since this is a general statement about any \(f(x,y)\), we can apply this inequality to the Primal Optimisation Problem, i.e.:</p>
\[\text{sup}_\lambda \text{ inf}_x L(x,\lambda) \leq \text{inf}_x \text{ sup}_\lambda L(x,\lambda)\]
<p>The right side is the <strong>Primal Optimisation Problem</strong>, and the left side is known as the <strong>Dual Optimisation Problem</strong>, and in this case, the <strong>Lagrangian Dual</strong>.</p>
<p>To understand the fuss about the <strong>Lagrangian Dual</strong>, we will begin with the more restrictive case where equality holds for the <strong>Max-Min Inequality</strong>, and later discuss the more general case and its implications. For this first part, we will assume that:</p>
\[\text{sup}_\lambda \text{ inf}_x L(x,\lambda) = \text{inf}_x \text{ sup}_\lambda L(x,\lambda)\]
<p>Let’s look at a motivating example. This is the graph of the Lagrangian for the following problem:</p>
\[\text{Minimise}_x f(x)=x^2 \\
\text{subject to: } x \leq 0\]
<p>The Lagrangian in this case is given by:</p>
\[L(x,\lambda)=x^2+\lambda x\]
<p>This is the corresponding graph of \(L(x,\lambda)\).</p>
<p><img src="/assets/images/lagrangian-shape.png" alt="Shape of Lagrangian for a Convex Objective Function" /></p>
<p>Let us summarise a few properties of this graph.</p>
<ul>
<li><strong>The function is convex in \(x\)</strong>: Assume \(\lambda=C\) is a constant, then the function has the form \(\mathbf{x^2+Cx}\) which is a family of parabolas. <strong>A parabola is a convex function</strong>, thus the result follows.</li>
<li><strong>The function is concave in \(\lambda\)</strong>: Assume that \(x=C\) and \(x^2=K\) are constants, then the function has the form \(\mathbf{C\lambda+K}\), which is the general form of <strong>affine functions</strong>. <strong>Affine functions are both convex and concave</strong>, but we will be drawing more conclusions based on their concave nature, so we will simply say that <strong>the Lagrangian is concave in \(\lambda\)</strong>. Thus, <strong>the Lagrangian is also a family of concave functions</strong>.</li>
<li>As a direct consequence of the Lagrangian being a family of concave functions, we can say that <strong>the pointwise infimum of the Lagrangian is a concave function</strong>. We established this result in <a href="/2021/05/08/quadratic-optimisation-theory.html">Quadratic Optimisation Concepts</a>. This result is irrespective of the shape of the Lagrangian in the direction of \(x\).</li>
</ul>
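<p>These properties, and the zero duality gap they imply, can be verified numerically for this example (the grid sizes below are arbitrary choices):</p>

```python
import numpy as np

L = lambda x, lam: x ** 2 + lam * x   # Lagrangian of: min x^2 subject to x <= 0

xs = np.linspace(-5.0, 5.0, 2001)
lams = np.linspace(0.0, 5.0, 201)

# Pointwise infimum over x: analytically -lam^2/4 (attained at x = -lam/2),
# a concave function of lam, as claimed above.
inf_x = np.array([L(xs, lam).min() for lam in lams])
print(np.allclose(inf_x, -lams ** 2 / 4, atol=1e-4))  # True

# The dual optimum sup_lam inf_x L is 0 (attained at lam = 0), matching
# the primal optimum p* = 0 of min x^2 subject to x <= 0.
d_star = inf_x.max()
print(abs(d_star) < 1e-9)  # True: zero duality gap
```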
<p>This is important because it allows us to frame the Lagrangian of a Quadratic Optimisation as a concave-convex function. This triggers a whole list of simplifications, some of which I list below (we’ll discuss most of them in succeeding sections).</p>
<ul>
<li>Guarantee of a <strong>saddle point</strong></li>
<li><strong>Zero duality gap</strong> by default</li>
<li>No extra conditions for <strong>Strong Duality</strong></li>
</ul>
<p><img src="/assets/images/lagrangian-saddle.png" alt="Shape of Lagrangian for a Convex Objective Function" /></p>
<h2 id="geometric-intuition-of-the-lagrange-dual-problem">Geometric Intuition of the Lagrange Dual Problem</h2>
<p>Let us look at the <strong>geometric interpretation</strong> of the Lagrangian Dual. For this discussion, we will assume that the <strong>constraints are active</strong>. The Lagrangian itself is:</p>
\[L(x,\lambda)=f(x)+\sum_{i=1}^n\lambda_i.g_i(x)\text{ such that }\lambda_i\geq 0 \text{ and } g_i(x)\leq 0\]
<p>For the purposes of the discussion, let’s assume one constraint, so that the Lagrangian is now:</p>
\[L(x,\lambda)=f(x)+\lambda.g(x)\text{ such that }\lambda\geq 0 \text{ and } g(x)\leq 0\]
<p>Let us map \(f(x)\) (y-coordinate) and \(g(x)\) (x-coordinate), treating them as variables themselves. Then we see that the Lagrangian is of the form:</p>
\[C=\lambda.g(x)+f(x) \\
\Rightarrow f(x)=-\lambda.g(x)+C\]
<p><strong>This is the equation of a straight line</strong>, with <strong>slope \(-\lambda\)</strong> and <strong>y-intercept \(C\)</strong>. Note that \(C\) in this case represents the Lagrangian objective function.</p>
<p>Let’s walk through the Lagrangian maximisation-minimisation procedure step-by-step. The procedure is:</p>
\[\text{sup}_\lambda \text{ inf}_x L(x,\lambda)\]
<p>There are two important points to note here:</p>
<ul>
<li>We have restricted \(\lambda\geq 0\). Therefore the <strong>slope of the Lagrangian is always negative</strong>.</li>
<li><strong>Moving this line to the left decreases its y-intercept</strong>, in this case, \(C\).</li>
</ul>
<h3 id="1-infimum-with-respect-to-x">1. Infimum with respect to \(x\)</h3>
<p>The first step is \(\text{ inf}_x L(x,\lambda)\), which translates to:</p>
\[\text{ inf}_x \lambda.g(x)+f(x)\]
<ul>
<li>For a given value of \(\lambda\), find the lowest possible \(C\), such that all the constraints are still respected.</li>
</ul>
<p><strong>Geometrically</strong>, this means taking the line \(f(x)=-\lambda g(x)+C\), and moving it as far to the left as possible while it still has at least one point in \(G\), the set of attainable \([g(x), f(x)]\) pairs.</p>
<p><strong>Algebraically</strong>, this gives us:</p>
\[0=\lambda.\frac{dg(x)}{dx}+\frac{df(x)}{dx} \\
\Rightarrow \frac{df(x)}{dx}=-\lambda.\frac{dg(x)}{dx} \\
\Rightarrow \nabla f(x)=-\lambda.\nabla g(x)\]
<p>This gives us the condition for such a minimisation to be possible, which, as you must have guessed, simply restates the <strong>Kuhn-Tucker Stationarity Condition</strong>.</p>
<p>The situation looks like below.</p>
<p><img src="/assets/images/infimum-supporting-hyperplane-convex-set.png" alt="Infimum Supporting Hyperplanes for a Convex Set" /></p>
<p>The important thing to note is that as a result of taking the infimum, all the Lagrangians are now <strong>supporting hyperplanes</strong> of \(G\).</p>
<p>Also, because \(\lambda\geq 0\) and also due to how the infimum works, none of the supporting hyperplanes touch \(G\) in the first quadrant (positive); they have all moved as far left as possible, and are effectively tangent to \(G\) at \(g(x)\leq 0\).</p>
<p>As you see below, <strong>this operation holds true even for nonconvex sets</strong>.</p>
<p><img src="/assets/images/infimum-supporting-hyperplanes-nonconvex-set.png" alt="Infimum Supporting Hyperplanes for a Convex Set" /></p>
<p>The infimum operation tells us what the supporting hyperplane for the convex set looks like for a given value of \(\lambda\). Obviously, this also implies that the Lagrangian is tangent to \(G\). This is expressed by the fact that the gradient vector of \(f(x)\) is parallel and opposite to the gradient vector of the constraint \(g(x)\).</p>
<p>Take special note of the Lagrangian line for \(\lambda_1\) in the nonconvex set scenario; we shall have occasion to revisit it very soon.</p>
<h3 id="1-supremum-with-respect-to-lambda">2. Supremum with respect to \(\lambda\)</h3>
<p>The above infimum (minimisation) operation has given us the Lagrangian in terms of \(\lambda\) only. This family of Lagrangians is represented by \(\text{ inf}_x \lambda.g(x)+f(x)\).</p>
<p><strong>Geometrically, you can assume that you have an infinite set of Lagrangians, one for every value of \(\lambda\), each of them a supporting hyperplane for the \([g(x), f(x)]\) set.</strong></p>
<p>Now, to actually find the optimum point, we’d like to select the <strong>supporting hyperplane that has the maximum corresponding cost \(C\)</strong>, or y-intercept. Algebraically, this implies finding \(\text{sup}_\lambda \text{ inf}_x \lambda.g(x)+f(x)\).</p>
<p>Note that the Lagrangian is concave in \(\lambda\), thus the minimisation has also given us a concave problem to solve. In this case, we will be maximising this concave problem (which corresponds to minimising a convex problem).</p>
<p><img src="/assets/images/supremum-lagrangian-dual-convex-set.png" alt="Supremum Supporting Hyperplanes for a Convex Set" /></p>
<p>In the diagram above, I’ve marked the winning supporting hyperplane, thicker. For this hyperplane with its value of \(\lambda^*\), the y-intercept (the Lagrangian cost) is maximised. This critical point is marked \(d^*\).</p>
<h2 id="strong-duality">Strong Duality</h2>
<p>The interesting (and useful) thing to note is that if you were to solve the <strong>Primal Optimisation Problem</strong> instead of the <strong>Lagrangian Dual Problem</strong>, or even the original optimisation problem in the <strong>standard Quadratic Programming form</strong>, you will get the same result as \(d^*\).</p>
<p>This is the result of the function being concave in \(\lambda\) and convex in \(x\), <strong>implying the existence of a saddle point</strong>. This is also the situation where the equality clause of the <strong>Max-Min Inequality</strong> holds.</p>
<h2 id="weak-duality-and-the-duality-gap">Weak Duality and the Duality Gap</h2>
<p>I’d purposefully omitted the result of finding the supremum for the nonconvex case in the previous section. This is because the nonconvex scenario is what shows us the real difference between the <strong>Primal Optimisation Problem</strong> and its <strong>Lagrangian Dual</strong>.</p>
<p>The winning supporting hyperplane for the <strong>nonconvex set</strong> is shown below.</p>
<p><img src="/assets/images/duality_gap-nonconvex-set.png" alt="Supremum Supporting Hyperplanes for a Non-Convex Set" /></p>
<p>The solution for the <strong>Lagrangian Dual Problem</strong> is marked \(d^*\), and the solution for the <strong>Primal Optimisation Problem</strong> is marked \(p^*\). As you can clearly see, \(d^*\) and \(p^*\) do not coincide.</p>
<p>The dual solution, in this case, is not the actual solution, but <strong>it provides a lower bound on \(p^*\)</strong>, i.e., if we can compute \(d^*\), we can use it to decide if the solution found by an optimisation algorithm is “good enough”. It is also a validation that we are not searching in an infeasible area of the solution space.</p>
<p><strong>This is the situation where the inequality condition of the Max-Min Inequality holds.</strong></p>
<p>The difference between \(p^*\) and \(d^*\) is called the <strong>Duality Gap</strong>. Obviously, the duality gap is zero when the conditions for <strong>Strong Duality</strong> are satisfied. When these conditions are not satisfied, we say that <strong>Weak Duality</strong> holds.</p>
<h2 id="conditions-for-strong-duality">Conditions for Strong Duality</h2>
<p>There are many different conditions which, if satisfied by themselves, guarantee Strong Duality. In particular, textbooks cite <strong>Slater’s Constraint Qualification</strong> very frequently, and the <strong>Linear Independence Constraint Qualification</strong> also finds mention.</p>
<p><strong>The above-mentioned constraint qualifications assume that the constraints are nonlinear.</strong></p>
<p>However, for our current purposes, if we assume that the <strong>inequality constraints are affine functions</strong>, we do not need to satisfy any other condition: <strong>the duality gap will be zero by default</strong> under these conditions; the optimum dual solution will always equal the optimal primal solution, i.e., \(p^*=d^*\).</p>
<p>This also <strong>guarantees the existence of a saddle point</strong> in the solution of the Lagrangian. A saddle point of a function \(f(x,y)\) is defined as a point \((x^\ast,y^\ast)\) which satisfies the following condition:</p>
\[f(x^*,\bigcirc)\leq f(x^*,y^*)\leq f(\bigcirc, y^*)\]
<p>where \(\bigcirc\) represents “any \(x\)” or “any \(y\)” depending upon its placement. Applying this to our objective function, we can write:</p>
\[f(x^*,\bigcirc)\leq f(x^*,\lambda^*)\leq f(\bigcirc, \lambda^*)\]
<p>The implication is that starting from the saddle point, the function slopes down in the direction of \(\lambda\), and slopes up in the direction of \(x\). The figure below shows the general shape of the Lagrangian with a convex objective function and affine (inequality and equality) constraints.</p>
<p><img src="/assets/images/lagrangian-saddle.png" alt="Shape of Lagrangian for a Convex Objective Function" /></p>
<p>The reason this leads to <strong>Strong Duality</strong> is this: minimising \(f(x,\lambda)\) with respect to \(x\) first, then maximising with respect to \(\lambda\), takes us to the same point \((x^\ast,\lambda^\ast)\) that would be reached if we first maximise \(f(x,\lambda)\) with respect to \(\lambda\), then minimise with respect to \(x\).</p>
<p>Mathematically, this implies that:</p>
\[\mathbf{\text{sup}_\lambda \text{ inf}_x f(x,\lambda)= \text{inf}_x \text{ sup}_\lambda f(x,\lambda)}\]
<p>thus implying that the <strong>Duality Gap</strong> is zero.</p>
<h2 id="notes">Notes</h2>
<ul>
<li><strong>Karush-Kuhn-Tucker Conditions</strong> use <strong>Farkas’ Lemma</strong> for proof.</li>
<li>The <strong>Saddle Point Theorem</strong> is not proven here.</li>
</ul>avishekThis article concludes the (very abbreviated) theoretical background required to understand Quadratic Optimisation. Here, we extend the Lagrangian Multipliers approach, which in its current form, admits only equality constraints. We will extend it to allow constraints which can be expressed as inequalities.Support Vector Machines from First Principles: Linear SVMs2021-05-10T00:00:00+05:302021-05-10T00:00:00+05:30/2021/05/10/support-vector-machines-lagrange-multipliers<p>We have looked at how <strong>Lagrangian Multipliers</strong> and how they help build constraints as part of the function that we wish to optimise. Their relevance in <strong>Support Vector Machines</strong> is how the constraints about the classifier margin (i.e., the supporting hyperplanes) is incorporated in the search for the <strong>optimal hyperplane</strong>.</p>
<p>We introduced the first part of the problem in <a href="/2021/04/14/support-vector-machines-derivations.html">Support Vector Machines from First Principles: Part One</a>. We then took a detour through <strong>Vector Calculus</strong> and <strong>Constrained Quadratic Optimisation</strong> to build our mathematical understanding for the succeeding analysis.</p>
<p>We will now derive the analytical form of the Support Vector Machine variables in this post. This article will only discuss <strong>Linear Support Vector Machines</strong>, which apply to a <strong>linearly separable data set</strong>. <strong>Non-Linear Support Vector Machines</strong> will be discussed in an upcoming article.</p>
<p>The necessary background material for understanding this article is covered in the following articles:</p>
<ul>
<li><a href="/2021/04/14/support-vector-machines-derivations.html">Support Vector Machines from First Principles: Part One</a></li>
<li>Vector Calculus Background
<ul>
<li><a href="/2021/04/20/vector-calculus-simple-manifolds.html">Vector Calculus: Graphs, Level Sets, and Constraint Manifolds</a></li>
<li><a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers</a></li>
<li><a href="/2021/04/29/inverse-function-theorem-implicit-function-theorem.html">Vector Calculus: Implicit Function Theorem and Inverse Function Theorem</a> (<strong>Note</strong>: This covers more theoretical background)</li>
</ul>
</li>
<li>Quadratic Form and Motivating Problem
<ul>
<li><a href="/2021/04/19/quadratic-form-optimisation-pca-motivation-part-one.html">Quadratic Optimisation: PCA as Motivation</a></li>
<li><a href="/2021/04/28/quadratic-optimisation-pca-lagrange-multipliers.html">Conclusion to Quadratic Optimisation: PCA as Motivation</a></li>
</ul>
</li>
<li>Quadratic Optimisation
<ul>
<li><a href="/2021/05/08/quadratic-optimisation-theory.html">Quadratic Optimisation: Mathematical Background</a></li>
<li><a href="/2021/05/10/quadratic-form-optimisation-kkt.html">Quadratic Optimisation: Karush-Kuhn-Tucker Conditions</a></li>
</ul>
</li>
</ul>
<p>Before we proceed with the calculations, I’ll restate the original problem again.</p>
<h2 id="support-vector-machine-problem-statement">Support Vector Machine Problem Statement</h2>
<p>For a set of data \(x_i, i\in[1,N]\), if we assume that data is divided into two classes (-1,+1), we can write the constraint equations as:</p>
\[\mathbf{m_{max}=max \frac{2k}{\|N\|}}\]
<p>subject to the following constraints:</p>
\[\mathbf{
N^Tx_i\geq b+k, \forall x_i|y_i=+1 \\
N^Tx_i\leq b-k, \forall x_i|y_i=-1
}\]
<p><img src="/assets/images/svm-supporting-hyperplanes.png" alt="SVM Support Hyperplanes" /></p>
<p>We are also given a set of training examples \(x_i, i=1,2,...,n\) which are already labelled either <strong>+1</strong> or <strong>-1</strong>. <strong>The important assumption here is that these training data points are linearly separable</strong>, i.e., there exists a hyperplane which divides the two categories, such that no point is misclassified. Our task is to find this hyperplane with the maximum possible margin, which will be defined by its <strong>supporting hyperplanes</strong>.</p>
<h2 id="restatement-of-the-support-vector-machine-problem-statement">Restatement of the Support Vector Machine Problem Statement</h2>
<p>Remembering the standard form of a <strong>Quadratic Programming</strong> problem, we want the objective function to be a minimisation problem, as well as a quadratic problem.</p>
<p>Furthermore, we’d like to set the constant \(k=1\), and rewrite \(N\) with \(w\). Thus, the objective function may be rewritten as:</p>
\[\mathbf{min f(x)=\frac{w^Tw}{2}}\]
<p>since squaring \(w\) does not affect the outcome of the minimisation problem.</p>
<p>We have two constraints; we’d like to rewrite them in the form \(g(x)\leq 0\). Thus, we get:</p>
\[-(w^Tx_i-b)+1\leq 0, \forall x_i|y_i=+1\\
w^Tx_i-b+1\leq 0, \forall x_i|y_i=-1\]
<p>You will notice that they differ only in the sign of \((w^Tx_i-b)\), which flips with the sign of \(y_i\). We can collapse these two inequalities into a single form by using \(y_i\) to determine the sign.</p>
\[g_i(x)=-y_i(w^Tx_i-b)+1\leq 0, \hspace{4mm} i=1,2,...,n, \hspace{4mm} \forall x_i|y_i\in\{-1,+1\}\]
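<p>A quick check, with made-up numbers, that the collapsed form reproduces both case-wise inequalities:</p>

```python
import numpy as np

w, b = np.array([1.0, 1.0]), 1.0   # illustrative weight vector and intercept

def g(x, y_label):
    # The collapsed constraint: -y_i (w.x_i - b) + 1 <= 0
    return -y_label * (w @ x - b) + 1.0

x_pos = np.array([2.0, 1.0])    # a +1 point with w.x - b = 2 >= +1
x_neg = np.array([0.0, -1.0])   # a -1 point with w.x - b = -2 <= -1

print(g(x_pos, +1))  # -1.0: constraint satisfied
print(g(x_neg, -1))  # -1.0: constraint satisfied
```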
<p>The <strong>Lagrangian</strong> then is:</p>
\[\mathbf{
L(w,\lambda,b)=f(x)+\lambda_i g_i(x)} \hspace{15mm}\text{(Standard Lagrangian Form)}\\
L(w,\lambda,b)=\frac{w^Tw}{2}+\sum_{i=1}^n\lambda_i [-y_i(w^Tx_i-b)+1] \\
\mathbf{
L(w,\lambda,b)=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i [y_i(w^Tx_i-b)-1]
}\]
<p>for all \(x_i\) such that \(\lambda_i\geq 0\), \(g_i(x)\leq 0\), and \(y_i\in\{-1,+1\}\).</p>
<p>We have already assumed the <strong>Primal and Dual Feasibility Conditions</strong> above. The <strong>Dual Optimisation Problem</strong> is then:</p>
\[\text{max}_\lambda\hspace{4mm}\text{min}_{w,b} \hspace{4mm} L(w,\lambda,b)\]
\[\begin{equation}
\text{max}_\lambda\hspace{4mm}\text{min}_{w,b} \hspace{4mm} \frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i [y_i(w^Tx_i-b)-1] \label{eq:lagrangian}
\end{equation}\]
<p>Note that the only constraints that will be activated will be the ones which are for points lying on the supporting hyperplanes.</p>
<h2 id="the-support-vector-machine-solution">The Support Vector Machine Solution</h2>
<p>We have three variables in the Lagrangian Dual: \((w,b,\lambda)\). We will now solve for each of them in turn.</p>
<h3 id="1-solving-for-wast">1. Solving for \(w^\ast\)</h3>
<p>Let’s see what the KKT Stationarity Condition gives us.</p>
\[\frac{\partial L}{\partial w}=w-\sum_{i=1}^n \lambda_ix_iy_i\]
<p>Setting this partial differential to zero, we get:</p>
\[\begin{equation}
\mathbf{
w^\ast=\sum_{i=1}^n \lambda_ix_iy_i \label{eq:weight}
}
\end{equation}\]
<p>Here, we denote by \(w^\ast\) the optimal solution for \(w\).</p>
<h3 id="2-solving-for-bast">2. Solving for \(b^\ast\)</h3>
<p>Differentiating with respect to \(b\), and setting it to zero, we get:</p>
\[\frac{\partial L}{\partial b}=0 \\
\Rightarrow \begin{equation}
\sum_{i=1}^n \lambda_iy_i=0 \label{eq:b-constraint}
\end{equation}\]
<p>This doesn’t give us an expression for \(b\), but it does give us a specific condition that the Lagrange multipliers must satisfy.</p>
<p>Let us make the following observations:</p>
<ul>
<li>We already know \(w^\ast\). Thus, we know the <strong>separating hyperplane through the origin</strong>, though we do not know \(b\). In two dimensions, this would be the equivalent of the y-intercept.</li>
<li>For the points labelled \(+1\), the <strong>minimum value</strong> you get by plugging \(x_i\) into \(\mathbf{w^\ast x}\) is definitely a point on the (as yet undetermined) <strong>positive supporting hyperplane \(H^+\)</strong>. You can have multiple points which achieve this minimum value; all of those points lie on \(H^+\), which is obviously parallel to \(f(x)=w^\ast x\).</li>
<li>For the points labelled \(-1\), the <strong>maximum value</strong> you get by plugging \(x_i\) into \(\mathbf{w^\ast x}\) is definitely a point on the (as yet undetermined) <strong>negative supporting hyperplane \(H^-\)</strong>. You can have multiple points which achieve this maximum value; all of those points lie on \(H^-\), which is obviously parallel to \(f(x)=w^\ast x\).</li>
</ul>
<p>Therefore, we may find \(b^+\) and \(b^-\) by finding:</p>
<ul>
<li>\(H^+\) is the hyperplane with “slope” \(w^\ast\) and passing through the point \(x^+\) which gives the minimum value (positive or negative) for \(f(x)=w^\ast x\). There may be multiple points like \(x^+\); pick any one. \(H^+\) will have y-intercept \(b^+\).</li>
<li>\(H^-\) is the hyperplane with “slope” \(w^\ast\) and passing through the point \(x^-\) which gives the maximum value (positive or negative) for \(f(x)=w^\ast x\). There may be multiple points like \(x^-\); pick any one. \(H^-\) will have y-intercept \(b^-\).</li>
</ul>
<p><strong>\(H^+\) and \(H^-\) are the supporting hyperplanes.</strong> The situation is shown below.</p>
<p><img src="/assets/images/svm-solving-y-intercept.png" alt="Solving for Primal and Dual SVM Variables" /></p>
<p>We already saw in <a href="/2021/04/14/support-vector-machines-derivations.html">Support Vector Machines from First Principles: Part One</a> that the separating hyperplane \(H_0\) lies midway between \(H^+\) and \(H^-\), implying that \(b^\ast\) is the mean of \(b^+\) and \(b^-\). Thus, we get:</p>
\[\begin{equation}
\mathbf{
b^\ast=\frac{b^++b^-}{2} \label{eq:b}
}
\end{equation}\]
<h3 id="3-solving-for-lambdaast">3. Solving for \(\lambda^\ast\)</h3>
<p>Let us simplify the Lagrangian \(\eqref{eq:lagrangian}\) in light of these new identities. We write:</p>
\[L(\lambda,w^\ast,b^\ast)=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i [y_i(w^Tx_i-b)-1] \\
=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i y_i w^Tx_i + \sum_{i=1}^n\lambda_i y_ib + \sum_{i=1}^n\lambda_i\]
<p>The term \(\sum_{i=1}^n\lambda_i y_ib\) vanishes because of \(\eqref{eq:b-constraint}\), so we get:</p>
\[L(\lambda,w^\ast,b^\ast)=\frac{w^Tw}{2}-\sum_{i=1}^n\lambda_i y_i w^Tx_i + \sum_{i=1}^n\lambda_i\]
<p>Applying the identity \(\eqref{eq:weight}\) to this result, we get:</p>
\[L(\lambda,w^\ast,b^\ast)=\frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j - \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j + \sum_{i=1}^n \lambda_i \\
\mathbf{
L(\lambda,w^\ast,b^\ast)=\sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j
}\]
<p>Thus, \(\lambda^\ast\) can be found by maximising \(L(\lambda,w^\ast,b^\ast)\) over \(\lambda\), subject to \(\lambda_i\geq 0\) and \(\eqref{eq:b-constraint}\), that is:</p>
\[\lambda^\ast=\text{argsup}_\lambda L(\lambda,w^\ast,b^\ast) \\
\mathbf{
\lambda^\ast=\text{argsup}_\lambda \left[\sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j\right]
}\]
<h3 id="solving-for-bast-a-shortcut">Solving for \(b^\ast\): A Shortcut</h3>
<p>I noted that to find \(b^+\) and \(b^-\), we needed to find respectively, the minimum and maximum values from each category applied to the candidate separating hyperplane \(f(x)=w^\ast x\). As it turns out, we do not need to look through all the points.</p>
<p>Recall that the support vectors are the ones which define the constraints in the form of supporting hyperplanes. Also, recall from our discussion on the Lagrangian Dual that the constraints are only activated for \(g(x)=0\), i.e., the Lagrange multipliers for those points are the only nonzero multipliers; all other constraints have their Lagrange multipliers as zero.</p>
<p>This means that if we have already computed the <strong>Lagrange multipliers</strong>, we only need to look at the <strong>points which have nonzero Lagrange multipliers</strong> to find \(b^+\) and \(b^-\): any such point lies exactly on its category’s supporting hyperplane. We no longer need to search for the maximum and minimum values, and the number of points we need to look at is vastly reduced, since most of the data points will lie strictly inside the halfspaces, and not exactly on the supporting hyperplanes \(H^+\) and \(H^-\).</p>
<h3 id="summary">Summary</h3>
<p>Note that at the end of our calculation, we will have arrived at (\(\lambda^\ast\), \(w^\ast\), \(b^\ast\)) as the optimal solution for the Lagrangian. Recall that by our <strong>assumptions of Quadratic Optimisation</strong>, this <strong>Lagrangian is a concave-convex function</strong>, and thus the primal and the dual optimum solutions coincide (<strong>no duality gap</strong>). In effect, this is the same solution that we’d have gotten if we’d solved the original optimisation problem.</p>
<p>Once the training has completed, categorising a new point \(x_t\) from a test set is done simply by computing:</p>
\[y_t=\text{sgn}[w^\ast x_t-b^\ast]\]
<p>Summarising, the expressions for the <strong>optimal Primal and Dual variables</strong> are:</p>
\[\mathbf{
w^\ast=\sum_{i=1}^n \lambda_ix_iy_i \\
b^\ast=\frac{b^++b^-}{2} \\
\lambda^\ast=\text{argsup}_\lambda \left[\sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_ix_j\right]
}\]
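These closed-form identities lend themselves to a quick numerical sanity check. Below is a minimal NumPy sketch — the two-point toy dataset and the hand-derived multipliers \(\lambda_1=\lambda_2=0.5\) are my own illustrative assumptions, not from the article — verifying the constraint \(\sum\lambda_iy_i=0\), the expressions for \(w^\ast\) and \(b^\ast\), and the sign-based classification rule:

```python
import numpy as np

# Toy 1-D dataset: x = -1 labelled -1, x = +1 labelled +1.
x = np.array([-1.0, 1.0])
y = np.array([-1.0, 1.0])

# For this symmetric problem, maximising the dual by hand gives
# lambda = (0.5, 0.5); we take that as given here.
lam = np.array([0.5, 0.5])

# Constraint from differentiating with respect to b: sum(lambda_i y_i) = 0.
assert abs(np.sum(lam * y)) < 1e-12

# w* = sum(lambda_i x_i y_i)
w = np.sum(lam * x * y)

# H+ passes through the +1 point with minimum w*x (w x - b+ = +1),
# H- through the -1 point with maximum w*x (w x - b- = -1).
b_plus = w * x[y == 1].min() - 1
b_minus = w * x[y == -1].max() + 1
b = (b_plus + b_minus) / 2

# Classify the training points: sgn(w x - b) should recover the labels.
predictions = np.sign(w * x - b)
print(w, b, predictions)
```

Here \(w^\ast=1\) and \(b^\ast=0\), and both training points are correctly recovered as their own support vectors.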
<h2 id="relationship-with-the-perceptron">Relationship with the Perceptron</h2>
<p>The <strong>Perceptron</strong> is a much simpler version of a Support Vector Machine. I’ll cover the Perceptron in its own article, but simply put: the perceptron also attempts to find a linear discriminant hyperplane between two classes of data, with the purpose of classifying new data points into one of these categories.</p>
<p>The form of the solution for the perceptron is also a hyperplane of the form \(f(x)=wx-b\). The perceptron may be trained sequentially, or batchwise, but regardless of the training sequence, the <strong>final adjustment that is applied to \(w\) in the hyperplane solution is proportional to \(\sum_{i=1}^n \eta x_iy_i\)</strong>. This is very similar to the identity \(w^\ast=\sum_{i=1}^n \lambda_ix_iy_i\) which we derived in \(\eqref{eq:weight}\).</p>
<p>However, since the <strong>Perceptron</strong> does not attempt to maximise the margin between the two categories, the <strong>separating hyperplane may perform well on the training set</strong>, but might end up arbitrarily close to the support vector in either category, thus <strong>increasing the risk of misclassification of new test points in that category which lie close to the support vector</strong>.</p>
<h1>Quadratic Optimisation: Mathematical Background</h1>
<p><em>2021-05-08</em></p>
<p>This article continues the original discussion on <strong>Quadratic Optimisation</strong>, where we considered <strong>Principal Components Analysis</strong> as a motivation. Originally, this article was going to begin delving into the <strong>Lagrangian Dual</strong> and the <strong>Karush-Kuhn-Tucker Theorem</strong>, but the requisite mathematical machinery to understand some of the concepts necessitated breaking the preliminary setup into its own separate article (which you’re now reading).</p>
<h2 id="affine-sets">Affine Sets</h2>
<p>Take any two vectors \(\vec{v_1}\) and \(\vec{v_2}\). All the vectors (or points, if you so prefer) along the line joining the tips of \(\vec{v_1}\) and \(\vec{v_2}\) obviously lie on a straight line. Thus, we can represent any vector along this line segment as:</p>
\[\vec{v}=\vec{v_2}+\theta(\vec{v_1}-\vec{v_2}) \\
=\theta \vec{v_1}+(1-\theta)\vec{v_2}\]
<p>We say that all these vectors (including \(\vec{v_1}\) and \(\vec{v_2}\)) form an <strong>affine set</strong>. More generally, a vector is a member of an affine set if it satisfies the following definition.</p>
\[\vec{v}=\theta_{1} \vec{v_1}+\theta_{2} \vec{v_2}+...+\theta_{n} \vec{v_n} \\
\theta_1+\theta_2+...+\theta_n=1\]
<p><img src="/assets/images/affine-set.png" alt="Affine Set" /></p>
<p>In words, a vector is an <strong>affine combination</strong> of \(n\) vectors if the <strong>coefficients of the linear combinations of those vectors sum to one</strong>.</p>
<h2 id="convex-and-non-convex-sets">Convex and Non-Convex Sets</h2>
<p>A set is said to be a <strong>convex set</strong> if, for any two points belonging to the set, all their convex combinations (affine combinations with \(\theta\in[0,1]\)) also belong to the set. In simpler terms, it means that the straight line segment between any two points of a convex set lies completely inside the set.</p>
<p>Mathematically, the condition for convexity is the following:</p>
\[\theta p_1+(1-\theta)p_2 \in C \text{ }\forall \theta\in[0,1], \text{ if } p_1,p_2 \in C\]
<p>The set shown below is a convex set.</p>
<p><img src="/assets/images/convex-set.png" alt="Convex Set" /></p>
<p>Any set that does not adhere to the above definition, is, by definition, a <strong>nonconvex set</strong>.</p>
<p>The set below is <strong>nonconvex</strong>. The red segments of the lines joining the points within the set lie outside the set, and thus violate the definition of convexity.</p>
<p><img src="/assets/images/nonconvex-set.png" alt="Nonconvex Set" /></p>
<h2 id="convex-and-concave-functions">Convex and Concave Functions</h2>
<p>The layman’s explanation of a convex function is that it is a bowl-shaped function. However, let us state this mathematically: we say a function is convex <strong>if the graph of that function lies below every point on any chord connecting two points on that graph</strong>.</p>
<p><img src="/assets/images/convex-function.png" alt="Convex Function" /></p>
<p>If \((x_1, f(x_1))\) and \((x_2, f(x_2))\) are two points on a function \(f(x)\), then \(f(x)\) is <strong>convex</strong> iff:</p>
\[\mathbf{f(\theta x_1+(1-\theta) x_2)\leq \theta f(x_1)+(1-\theta)f(x_2)}\]
<p>Consider a point \(P\) on the line connecting \([x_1, f(x_1)]\) and \([x_2, f(x_2)]\); its coordinate on that line is \([\theta x_1+(1-\theta) x_2, \theta f(x_1)+(1-\theta) f(x_2)]\). The corresponding point on the graph is \([\theta x_1+(1-\theta) x_2, f(\theta x_1+(1-\theta) x_2)]\).</p>
<p><img src="/assets/images/concave-function.png" alt="Concave Function" />
The same condition, but inverted, can be applied to define a concave function. A function \(f(x)\) is <strong>concave</strong> iff:</p>
\[\mathbf{f(\theta x_1+(1-\theta) x_2)\geq \theta f(x_1)+(1-\theta)f(x_2)}\]
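The two defining inequalities lend themselves to a quick numerical check. Below is a small NumPy sketch — the sample functions \(x^2\) (convex) and \(-x^2\) (concave) and the sampling ranges are illustrative choices of mine — that tests the convexity inequality on random points:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_convex_on_samples(f, n=1000):
    """Check f(theta*x1 + (1-theta)*x2) <= theta*f(x1) + (1-theta)*f(x2)
    on n random pairs of points and mixing coefficients."""
    x1, x2 = rng.uniform(-10, 10, (2, n))
    theta = rng.uniform(0, 1, n)
    lhs = f(theta * x1 + (1 - theta) * x2)
    rhs = theta * f(x1) + (1 - theta) * f(x2)
    return np.all(lhs <= rhs + 1e-9)

convex = is_convex_on_samples(lambda x: x**2)    # x^2 satisfies the convex inequality
concave = is_convex_on_samples(lambda x: -x**2)  # -x^2 violates it (it is concave)
print(convex, concave)
```

Such random sampling can only refute convexity, never prove it, but it is a useful sanity check on the definitions.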
<h2 id="affine-functions">Affine Functions</h2>
<p>A function \(f(x)\) is an <strong>affine function</strong> iff:</p>
\[\mathbf{f(\theta x_1+(1-\theta) x_2)=\theta f(x_1)+(1-\theta) f(x_2)}\]
<p>Let’s take a simple function \(f(x)=Ax+C\) where \(x\) is a vector. \(A\) is a transformation matrix, and \(C\) is a constant vector. Then, for two vectors \(\vec{v_1}\) and \(\vec{v_2}\), we have:</p>
\[f(\theta \vec{v_1}+(1-\theta) \vec{v_2})=A.[\theta \vec{v_1}+(1-\theta) \vec{v_2}]+C \\
=A\theta \vec{v_1}+A(1-\theta) \vec{v_2}+(\theta+1-\theta)C \\
=A\theta \vec{v_1}+A(1-\theta) \vec{v_2}+\theta C+(1-\theta)C \\
=[\theta A\vec{v_1}+\theta C]+[(1-\theta) A\vec{v_2}+(1-\theta)C]\\
=\theta[A\vec{v_1}+C]+(1-\theta)[A\vec{v_2}+C]\\
=\theta f(\vec{v_1})+(1-\theta) f(\vec{v_2})\]
<p>Thus all mappings of the form \(\mathbf{f(x)=Ax+C}\) are <strong>affine functions</strong>.</p>
<p>We may draw another interesting conclusion: <strong>affine functions are both convex and concave</strong>. This is because affine functions satisfy the equality conditions for both convexity and concavity: <strong>an affine set on an affine function lies fully on the function itself</strong>.</p>
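The equality chain above can be spot-checked numerically. The following small sketch (the random \(A\), \(C\), and test vectors are arbitrary choices of mine) verifies that \(f(x)=Ax+C\) satisfies the affine equality, and hence both the convex and concave conditions, to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))  # transformation matrix
C = rng.standard_normal(3)       # constant (bias) vector

def f(x):
    return A @ x + C             # an affine map f(x) = Ax + C

v1, v2 = rng.standard_normal((2, 3))
theta = 0.3

lhs = f(theta * v1 + (1 - theta) * v2)
rhs = theta * f(v1) + (1 - theta) * f(v2)
# Equality holds, so f meets both the convex (<=) and concave (>=) conditions.
print(np.allclose(lhs, rhs))
```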
<h2 id="supporting-hyperplanes">Supporting Hyperplanes</h2>
<p>A <strong>supporting hyperplane</strong> for a set \(C\) is a hyperplane which has the following properties:</p>
<ul>
<li>The <strong>supporting hyperplane</strong> is guaranteed to contain at least one point which is also on the boundary of the set \(C\).</li>
<li>The <strong>supporting hyperplane</strong> divides \(\mathbb{R}^n\) into two <strong>half-spaces</strong> such that set \(C\) is completely contained by one of these half-spaces.</li>
</ul>
<p>The definition of a convex set can also be explained by supporting hyperplanes. If there exists at least one supporting hyperplane for each point on the boundary of a set \(C\), \(C\) is convex.</p>
<p>The diagram below shows an example of a supporting hyperplane for a convex set.
<img src="/assets/images/valid-supporting-hyperplane.png" alt="Supporting Hyperplane for a Convex Set" /></p>
<p>The diagram below shows an example of an invalid supporting hyperplane (the dotted hyperplane). The dotted supporting hyperplane cannot exist because set \(C\) lies in both the half-spaces defined by this hyperplane.</p>
<p><img src="/assets/images/invalid-supporting-hyperplane.png" alt="Invalid Supporting Hyperplane for a Non-Convex Set" /></p>
<h2 id="some-inequality-proofs">Some Inequality Proofs</h2>
<h3 id="result-1">Result 1</h3>
<p>If \(a\geq b\), and \(c\geq d\), then:</p>
\[min(a,c)\geq min(b,d)\]
<p>The proof goes like this: we can state the following inequalities in terms of the \(min\) function:</p>
\[\begin{eqnarray}
a \geq min(a,c) \label{eq:1} \\
c \geq min(a,c) \label{eq:2} \\
b \geq min(b,d) \label{eq:3} \\
d \geq min(b,d) \label{eq:4} \\
\end{eqnarray}\]
<p>Then, the identities \(a \geq b\) and \(\eqref{eq:3}\) imply:</p>
\[a \geq b \geq min(b,d)\]
<p>Similarly, the identities \(c \geq d\) and \(\eqref{eq:4}\) imply that:</p>
\[c \geq d \geq min(b,d)\]
<p>Therefore, regardless of whether \(min(a,c)\) evaluates to \(a\) or to \(c\), the result will always be at least \(min(b,d)\). Thus we write:</p>
\[\begin{equation} \mathbf{min(a,c) \geq min(b,d)} \label{ineq:1}\end{equation}\]
<h3 id="result-2">Result 2</h3>
<p>Here we prove that:</p>
\[min(a+b, c+d) \geq min(a,c)+min(b,d)\]
<p>Here we take a similar approach, noting that:</p>
\[a \geq min(a,c) \\
c \geq min(a,c) \\
b \geq min(b,d) \\
d \geq min(b,d) \\\]
<p>Therefore, if we compute \(a+b\) and \(c+d\), we can write:</p>
\[a+b \geq min(a,c)+min(b,d) \\
c+d \geq min(a,c)+min(b,d)\]
<p>Therefore, regardless of whether \(min(a+b,c+d)\) evaluates to \(a+b\) or to \(c+d\), the result will always be at least \(min(a,c)+min(b,d)\). Thus we write:</p>
\[\begin{equation}\mathbf{min(a+b, c+d) \geq min(a,c)+min(b,d)} \label{ineq:2} \end{equation}\]
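Both inequalities can be exercised on random numbers. A small sketch (the sampling ranges are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    b, d = rng.uniform(-10, 10, 2)
    a = b + rng.uniform(0, 5)  # ensures a >= b
    c = d + rng.uniform(0, 5)  # ensures c >= d
    # Result 1: min(a, c) >= min(b, d)
    assert min(a, c) >= min(b, d)
    # Result 2: min(a+b, c+d) >= min(a, c) + min(b, d)
    assert min(a + b, c + d) >= min(a, c) + min(b, d)
print("both inequalities held on all samples")
```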
<h2 id="infimum-and-supremum">Infimum and Supremum</h2>
<p>The <strong>infimum</strong> of a function \(f(x)\) is its greatest lower bound:</p>
\[\mathbf{\text{inf}_x f(x)=\text{max}\{M \mid M\leq f(x)\text{ }\forall x\}}\]
<p>The infimum is defined for all functions (it may be \(-\infty\)) even if the minimum does not exist, and is equal to the minimum if it does exist.</p>
<p>The <strong>supremum</strong> of a function \(f(x)\) is its least upper bound:</p>
\[\mathbf{\text{sup}_x f(x)=\text{min}\{M \mid M\geq f(x)\text{ }\forall x\}}\]
<p>The supremum is defined for all functions (it may be \(+\infty\)) even if the maximum does not exist, and is equal to the maximum if it does exist.</p>
<h2 id="pointwise-infimum-and-pointwise-supremum">Pointwise Infimum and Pointwise Supremum</h2>
<p>The <strong>pointwise infimum</strong> of two functions \(f_1(x)\) and \(f_2(x)\) is defined as:</p>
\[pinf(f_1, f_2)(x)=min\{f_1(x), f_2(x)\}\]
<p>The <strong>pointwise supremum</strong> of two functions \(f_1(x)\) and \(f_2(x)\) is defined as:</p>
\[psup(f_1, f_2)(x)=max\{f_1(x), f_2(x)\}\]
<p>We’ll prove an interesting result that will prove useful when exploring the shape of the <strong>Lagrangian of the objective function</strong>, namely that <strong>the pointwise infimum of any set of concave functions is a concave function</strong>.</p>
<p><img src="/assets/images/concave-infimum.png" alt="Concave Pointwise Infimum" /></p>
<p>Let us fix two arbitrary x-coordinates \(x_1\) and \(x_2\). Let \(C_1\) be the chord connecting \((x_1, f_1(x_1))\) and \((x_2, f_1(x_2))\) for a concave function \(f_1(x)\), and let \(C_2\) be the chord connecting \((x_1, f_2(x_1))\) and \((x_2, f_2(x_2))\) for a concave function \(f_2(x)\).</p>
<p>Then, by the definition of a <strong>concave function</strong> (see above), we can write for \(f_1\) and \(f_2\):</p>
\[f_1(\alpha x_1+\beta x_2)\geq \alpha f_1(x_1)+\beta f_1(x_2) \\
f_2(\alpha x_1+\beta x_2)\geq \alpha f_2(x_1)+\beta f_2(x_2)\]
<p>where \(\alpha+\beta=1\). Let us define the <strong>pointwise infimum</strong> function as:</p>
\[\mathbf{pinf(x)=min\{f_1(x), f_2(x)\}}\]
<p>Then:</p>
\[pinf(\alpha x_1+\beta x_2)=min\{ f_1(\alpha x_1+\beta x_2), f_2(\alpha x_1+\beta x_2)\} \\
\geq min\{ \alpha f_1(x_1)+\beta f_1(x_2), \alpha f_2(x_1)+\beta f_2(x_2)\} \hspace{4mm}\text{ (from }\eqref{ineq:1})\\
\geq \alpha.min\{f_1(x_1),f_2(x_1)\} + \beta.min\{f_1(x_2),f_2(x_2)\} \hspace{4mm}(\text{ from } \eqref{ineq:2})\\
= \mathbf{\alpha.pinf(x_1) + \beta.pinf(x_2)}\]
<p>Thus, we can summarise:</p>
\[\begin{equation}
\mathbf{pinf(\alpha x_1+\beta x_2) \geq \alpha.pinf(x_1) + \beta.pinf(x_2)}
\end{equation}\]
<p>which is the form of a <strong>concave function</strong>, and thus we can conclude that \(pinf(x)\) is a concave function if all of its component functions are concave.</p>
<p>Since this is a general result for any two coordinates \(x_1\) and \(x_2\), we can conclude that <strong>the pointwise infimum of two concave functions is also a concave function</strong>. This extends to an arbitrary set of concave functions.</p>
<p>Using very similar arguments, we can also prove that <strong>the pointwise supremum of an arbitrary set of convex functions is also a convex function</strong>.</p>
<p>The other straightforward conclusion is that <strong>the pointwise infimum of any set of affine functions is always concave, because affine functions are concave</strong> (they are also convex, but we cannot draw any general conclusions about the pointwise infimum of convex functions).</p>
<p><strong>Note</strong>: The <strong>pointwise infimum</strong> and <strong>pointwise supremum</strong> have different definitions from the <strong>infimum</strong> and <strong>supremum</strong>, respectively.</p>
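The concavity of the pointwise infimum can be checked numerically. In the sketch below, the two concave parabolas \(f_1\) and \(f_2\) are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two concave functions and their pointwise infimum.
f1 = lambda x: -(x - 1) ** 2
f2 = lambda x: -(x + 2) ** 2 + 3
pinf = lambda x: np.minimum(f1(x), f2(x))

# Test the concavity inequality on random pairs and mixing coefficients.
x1, x2 = rng.uniform(-5, 5, (2, 1000))
alpha = rng.uniform(0, 1, 1000)
beta = 1 - alpha

lhs = pinf(alpha * x1 + beta * x2)
rhs = alpha * pinf(x1) + beta * pinf(x2)
concave_ok = np.all(lhs >= rhs - 1e-9)  # concavity inequality for pinf
print(concave_ok)
```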
<h2 id="the-max-min-inequality">The Max-Min Inequality</h2>
<p>The <strong>Max-Min Inequality</strong> is a very general statement about the implications of ordering of maximisation/minimisation procedures along different axes of a function.</p>
<p>Fix a particular point \((x_0,y_0)\).</p>
\[\text{ inf}_xf(x,y_0)\leq f(x_0,y_0)\leq \text{ sup}_yf(x_0,y)\]
<p>This holds for any \((x_0,y_0)\); thus, dropping the middle term, we can write:</p>
\[\text{ inf}_xf(x,y)\leq \text{ sup}_yf(x,y) \\
g(y)\leq h(x) \text{ }\forall x,y\in\mathbf{R}\]
<p>where \(g(y)=\text{ inf}_xf(x,y)\) and \(h(x)=\text{ sup}_yf(x,y)\). Note that \(g\) is a function of \(y\) alone (the infimum eliminates \(x\)) and \(h\) is a function of \(x\) alone; each may reduce to a simple scalar or remain a function in its own right, depending upon the original function \(f(x,y)\).</p>
<p>Since every value of \(g\) is bounded above by every value of \(h\), the largest value attained by \(g\) is still no greater than the smallest value attained by \(h\). We express this last statement as:</p>
\[\text{sup}_y g(y)\leq \text{inf}_x h(x) \\
\Rightarrow \mathbf{\text{sup}_y \text{ inf}_x f(x,y)\leq \text{inf}_x \text{ sup}_y f(x,y)}\text{ }\forall x,y\in\mathbf{R}\]
<p>This is the statement of the <strong>Max-Min Inequality</strong>.</p>
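The inequality can be observed on any finite grid of values \(F_{ij}=f(x_i,y_j)\); in the sketch below the grid is simply random numbers (an arbitrary choice of mine), so the two sides are usually far apart:

```python
import numpy as np

rng = np.random.default_rng(4)
# An arbitrary payoff grid F[i, j] = f(x_i, y_j); rows index x, columns index y.
F = rng.standard_normal((50, 60))

sup_inf = F.min(axis=0).max()  # sup over y of (inf over x)
inf_sup = F.max(axis=1).min()  # inf over x of (sup over y)
print(sup_inf <= inf_sup)      # the Max-Min Inequality always holds
```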
<h2 id="the-minimax-theorem">The Minimax Theorem</h2>
<p>The <strong>Minimax Theorem</strong> (first proved by <strong>John von Neumann</strong>) specifies conditions under which the <strong>Max-Min Inequality</strong> holds with equality. This will prove useful in our discussion around solutions to the Lagrangian.
Specifically, the theorem states that</p>
\[\mathbf{\text{sup}_y \text{ inf}_x f(x,y) = \text{inf}_x \text{ sup}_y f(x,y)}\]
<p>if:</p>
<ul>
<li>\(f(x,y)\) is convex in \(x\) (keeping \(y\) constant)</li>
<li>\(f(x,y)\) is concave in \(y\) (keeping \(x\) constant)</li>
</ul>
<p>The diagram below shows the graph of such a function.</p>
<p><img src="/assets/images/quadratic-surface-no-cross-term-saddle.png" alt="Concave-Convex Function" /></p>
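A concrete concave-convex example is \(f(x,y)=x^2-y^2\), which is convex in \(x\), concave in \(y\), and has a saddle point at the origin. The grid-based sketch below (the grid ranges are my own choice) shows the two sides of the Max-Min Inequality coinciding there:

```python
import numpy as np

xs = np.linspace(-1, 1, 21)  # grid includes 0, the saddle point
ys = np.linspace(-1, 1, 21)
# f(x, y) = x^2 - y^2: convex in x, concave in y.
F = xs[:, None] ** 2 - ys[None, :] ** 2

sup_inf = F.min(axis=0).max()  # sup_y inf_x f
inf_sup = F.max(axis=1).min()  # inf_x sup_y f
print(sup_inf, inf_sup)        # both sides are ~0, the saddle value
```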
<p>The above conditions also imply the existence of a <strong>saddle point</strong> in the solution space, which, as we will discuss, will also be the <strong>optimal solution</strong>.</p>
<h1>Common Ways of Looking at Matrix Multiplications</h1>
<p><em>2021-04-29</em></p>
<p>We consider the more frequently utilised viewpoints of <strong>matrix multiplication</strong>, and relate each to one or more applications where that viewpoint is particularly useful. These are the viewpoints we will consider.</p>
<ul>
<li>Linear Combination of Columns</li>
<li>Linear Combination of Rows</li>
<li>Linear Transformation</li>
<li>Sum of Columns into Rows</li>
<li>Dot Product of Rows and Columns</li>
<li>Block Matrix Multiplication</li>
</ul>
<h2 id="linear-combination-of-columns">Linear Combination of Columns</h2>
<p>This is the most common, and probably one of the most useful, ways of looking at matrix multiplication. This is because the concept of <strong>linear combinations of columns</strong> is a fundamental way of determining linear independence (or linear dependence), which then informs us about many things, including:</p>
<ul>
<li>Dimensionality of the <strong>column space</strong> and <strong>row space</strong></li>
<li>Dimensionality of the <strong>null space</strong> and <strong>left null space</strong></li>
<li><strong>Uniqueness</strong> of solutions</li>
<li><strong>Invertibility</strong> of matrix</li>
</ul>
<p>This is obviously the most commonly used interpretation when defining and working with <strong>vector subspaces</strong>, as well.</p>
<p><img src="/assets/images/linear-combination-matrix-multiplication.jpg" alt="Linear Combination of Columns" /></p>
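This viewpoint can be stated in a couple of lines of NumPy: \(Ax\) is the combination of the columns of \(A\) weighted by the entries of \(x\). (The matrix and vector below are arbitrary examples of mine.)

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, -1.0])

# Ax is the linear combination x_1 * (column 1) + x_2 * (column 2).
by_columns = x[0] * A[:, 0] + x[1] * A[:, 1]
print(np.allclose(A @ x, by_columns))
```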
<h2 id="linear-combination-of-rows">Linear Combination of Rows</h2>
<p>There’s not much more to say about the linear combinations of rows. However, <strong>any deduction about the row rank of a matrix from looking at its row vectors automatically applies to the column rank as well</strong>, so it is useful in situations where you find looking at rows easier than columns.</p>
<h2 id="sum-of-columns-into-rows">Sum of Columns into Rows</h2>
<p>The product of a column of the left matrix and a row of the right matrix gives a matrix of the same dimensions as the final result. <strong>Thus, each such product contributes one “layer” of the final result</strong>, and subsequent “layers” are added on through summation. The product looks like this:</p>
<p>Thus, for \(A\in\mathbb{R}^{m\times n}\) and \(B\in\mathbb{R}^{n\times p}\), we can write out the multiplication operation as below:</p>
\[\mathbf{AB=C_{A1}R_{B1}+C_{A2}R_{B2}+C_{A3}R_{B3}+...+C_{An}R_{Bn}}\]
<p>This is a common form of treating a matrix when performing <strong>LU Decomposition</strong>. See <a href="/2021/04/02/vectors-matrices-outer-product-column-into-row-lu.html">Matrix Outer Product: Columns-into-Rows and the LU Factorisation</a> for an extended explanation of the <strong>LU Factorisation</strong>.</p>
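The decomposition can be verified directly; the sketch below (random matrices of my own choice of shape) sums one rank-one outer-product layer per column/row pair:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Sum of outer products: one rank-one "layer" per column of A / row of B.
layers = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))
print(np.allclose(A @ B, layers))
```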
<h2 id="linear-transformation">Linear Transformation</h2>
<p>This is a very common way of perceiving matrix multiplication in <strong>computer graphics</strong>, as well as when considering <strong>change of basis</strong>. <strong>Lack of matrix invertibility can also be explained through whether a nonzero vector exists which can be transformed into the zero vector by said matrix.</strong></p>
<h2 id="dot-product-of-rows-and-columns">Dot Product of Rows and Columns</h2>
<p>This is the common form of treating matrices when doing proofs where the transpose invariance property of symmetric matrices is utilised, i.e., \(A^T=A\). It is also the one taught in high school the most, and not really the best way to start understanding matrix multiplication.</p>
<h2 id="block-matrix-multiplication">Block Matrix Multiplication</h2>
<p><img src="/assets/images/block-matrix-multiplication.jpg" alt="Block Matrix Multiplication" />
The block matrix multiplication is not really a separate method of multiplication per se. It is more of a method for bringing a higher level of abstraction in a matrix, while still permitting the “blocks” to be treated as singular matrix entries.</p>
<p>One application of this is when proofs involve properties of a larger matrix composed of submatrices, which have interesting properties of their own, which we wish to exploit.</p>
<p>An interesting example is part of the statement of the <strong>Implicit Function Theorem</strong>. In one dimension, the validity of this theorem holds when the function being described is <strong>monotonic</strong> in a defined interval (always increasing or always decreasing in that interval). In higher dimensions, this requirement of monotonicity is stated more formally as saying that <strong>the derivative of the function is invertible within a defined interval</strong>. We discussed this theorem in the article on <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Lagrange Multipliers</a>.</p>
<p>The motivation for this example is the mathematical description of that monotonicity requirement. More on this is discussed in <a href="/2021/04/29/inverse-function-theorem-implicit-function-theorem.html">Intuitions about the Implicit Function Theorem</a>.</p>
<p>We can prove that a matrix which looks like this:</p>
\[X=
\begin{bmatrix}
A && C \\
0 && B
\end{bmatrix}\]
<p>where \(A\) and \(B\) are <strong>invertible submatrices</strong>, is itself <strong>invertible</strong>. Let us be precise about the dimensions of these matrices.</p>
\[X=(n+m)\times (n+m) \\
A=n \times n \\
0=m \times n \\
C=n \times m \\
B=m \times m\]
<p>Do verify for yourself that these submatrices align. To prove this, let us assume there exists a matrix \(X^{-1}\), which is the inverse of \(X\). Therefore, \(XX^{-1}=I\). Furthermore, let us assume the form of \(X^{-1}\) to be:</p>
\[X^{-1}=\begin{bmatrix}
P && Q \\
R && S
\end{bmatrix}\]
<p>Again, we make precise the dimensions of the submatrices of \(X^{-1}\).</p>
\[P=n \times n \\
Q=n \times m \\
R=m \times n \\
S=m \times m\]
<p>If we multiply \(XX^{-1}\), we get:</p>
\[XX^{-1}=
\begin{bmatrix}
AP+CR && AQ+CS \\
BR && BS
\end{bmatrix}=
\begin{bmatrix}
I_{n \times n} && 0_{n \times m} \\
0_{m \times n} && I_{m \times m}
\end{bmatrix}\]
<p>Let’s do a quick sanity check. Checking back to the dimensions of the matrices, we can immediately see that:</p>
<ul>
<li>\(AP\) and \(CR\) give an \(n \times n\) matrix.</li>
<li>\(BS\) gives an \(m \times m\) matrix.</li>
<li>\(AQ\) and \(CS\) give an \(n \times m\) matrix.</li>
<li>\(BR\) gives an \(m \times n\) matrix.</li>
</ul>
<p>The cool thing is that you can write out the element-wise equalities, and solve for \(P\), \(Q\), \(R\), \(S\), as if they were simple variables, as long as you adhere to the matrix operation rules of <strong>ordering</strong>, <strong>transpose</strong>, <strong>inverse</strong>, etc.</p>
<p>Thus, we can write:</p>
\[AP+CR=I \\
AQ+CS=0 \\
BR=0 \\
BS=I\]
<p>From the last two identities, we can immediately say that:</p>
\[R=0 \\
S=B^{-1}\]
<p>Solving for the remaining two variables \(P\) and \(Q\), we get:</p>
\[P=A^{-1} \\
Q=-A^{-1}CB^{-1}\]
<p>Thus the inverse of \(X\) is:</p>
\[X^{-1}=
\begin{bmatrix}
A^{-1} && -A^{-1}CB^{-1} \\
0 && B^{-1}
\end{bmatrix}\]
<p>The important point to note here is that <strong>the solution does not need \(C\) to be an invertible matrix</strong>; it may be rank-deficient, and \(X\) still remains an invertible matrix.</p>
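The derived block inverse can be verified numerically. In the sketch below (the shapes, and the diagonal shift used to make \(A\) and \(B\) comfortably invertible, are my own assumptions), \(C\) is deliberately rank-deficient, and \(X\) is still invertible:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 3, 4
A = rng.standard_normal((n, n)) + 3 * np.eye(n)  # invertible n x n block
B = rng.standard_normal((m, m)) + 3 * np.eye(m)  # invertible m x m block
C = np.zeros((n, m))                             # rank-deficient C is fine
C[0, 0] = 1.0

X = np.block([[A, C], [np.zeros((m, n)), B]])

# The block-inverse formula derived above: P=A^-1, Q=-A^-1 C B^-1, R=0, S=B^-1.
Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
Xinv = np.block([[Ainv, -Ainv @ C @ Binv],
                 [np.zeros((m, n)), Binv]])

print(np.allclose(X @ Xinv, np.eye(n + m)))
```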
<h3 id="recursive-calculation">Recursive Calculation</h3>
<p>The <strong>block matrix calculation</strong> can be extended to be recursive. We can simply break down any submatrix into its block matrices and perform the same operation, until (if you so wish) you reach the individual element level.</p>
<p><img src="/assets/images/recursive-block-matrix-multiplication.png" alt="Recursive Block Matrix Multiplication" /></p>
<h1>Intuitions about the Implicit Function Theorem</h1>
<p><em>2021-04-29</em></p>
<p>We discussed the <strong>Implicit Function Theorem</strong> at the end of the article on <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Lagrange Multipliers</a>, with some hand-waving to justify the linear behaviour on manifolds in arbitrary \(\mathbb{R}^N\).</p>
<p>This article delves a little deeper to develop some more intuition on the Implicit Function Theorem, but starts with its more specialised relative, the <strong>Inverse Function Theorem</strong>. This is because it is easier to start with reasoning about the Inverse Function Theorem.</p>
<h2 id="inverse-functions">Inverse Functions</h2>
<h3 id="monotonicity-in-one-dimension">Monotonicity in One Dimension</h3>
<p>Let’s start with a simple motivating example. We have the function \(f(x)=2x: x \in \mathbb{R}\). This gives a value, say \(y\), given an \(x\). We desire to find a function \(f^{-1}\) which is the inverse of \(f\), i.e., given a \(y\), we wish to recover \(x\). Mathematically, we can say:</p>
\[f^{-1}(f(x))=x\]
<p>In this case, the inverse is pretty easy to determine: it is \(f^{-1}(x)=\frac{x}{2}\). The function \(f\) is thus a mapping from \(\mathbb{R} \rightarrow \mathbb{R}\).
Let us ask the question while we are still dealing with very simple functions: <strong>under what conditions does a function not have an inverse?</strong></p>
<p>Let’s think of this intuitively with an example. Does the function \(f(x)=5\) have an inverse? This function forces all values of \(x\in \mathbb{R}\) to a value of 5. Even hypothetically, if \(f^{-1}\) existed and we tried to find \(f^{-1}(5)\), there would not be one solution for \(x\). Algebraically, we could have written:</p>
\[f(x)=[0].x+[5]\]
<p>where \([0]\) is a \(1\times 1\) matrix with a zero in it, and acts as the function’s matrix. The \([5]\) is the bias constant, and can be ignored for this discussion.</p>
<p>Obviously, \(f(x)\) collapses every \(x\) into the zero vector, and is thus not invertible. Correspondingly, the function does not have an inverse. Some intuition is developed about invertibility in <a href="/2021/04/03/matrix-intuitions.html">Assorted Intuitions about Matrics</a>.</p>
<p>This implies an important point: invertibility does not fail only when all \(x\) map to the same output. If even a single pair of distinct inputs shares an output — in the linear case, if a single non-zero vector \(x\) folds into zero — then our function cannot be invertible. For a function to avoid this, it must continuously either keep increasing or keep decreasing: it cannot increase for a while, then decrease again, because that automatically implies that the output can be the same for two (or more) different inputs (implying that you cannot recover the input uniquely from a given output).</p>
<p>A function which always either only increases, or only decreases, is called a <strong>monotonic function</strong>.</p>
<p><strong>Monotonic functions</strong> have the property that their derivative is always either always positive or always negative throughout the domain. This property is evident, when you take the derivative of the function \(g(x)=2x\), which is \(\frac{dg(x)}{dx}=2\).</p>
<p>This will come in handy when we move to higher dimensions.</p>
<p>Let’s look at another well-known function, the sine curve.</p>
<p><img src="/assets/images/sine-wave.png" alt="Sine Curve" /></p>
<p>The sine function \(f(x)=sin(x)\) is <strong>not invertible</strong> on the domain \((-\infty, \infty)\). This is because values of \(x\) separated by \(2\pi\) radians output the same value (and, within a single period, \(sin(\pi-x)=sin(x)\)).</p>
<p>For the function \(f(x)=\sin(x)\) to be invertible, <strong>we restrict its domain to \([-\frac{\pi}{2},\frac{\pi}{2}]\)</strong>. You can easily see that on \([-\frac{\pi}{2},\frac{\pi}{2}]\), the sine function is <strong>monotonic</strong> (in this case, increasing).</p>
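<p>You can observe this restriction directly with <code>numpy</code>’s <code>arcsin</code>, which returns values only in \([-\frac{\pi}{2},\frac{\pi}{2}]\). The inputs below are arbitrary examples of mine:</p>

```python
import numpy as np

# arcsin inverts sin only on the restricted domain [-pi/2, pi/2].
inside = 1.0   # lies inside [-pi/2, pi/2]
outside = 3.0  # lies outside the restricted domain

print(np.arcsin(np.sin(inside)))   # recovers 1.0
print(np.arcsin(np.sin(outside)))  # returns pi - 3.0, not 3.0: the input is lost
```

<p>Outside the restricted domain, the round trip lands on the point in \([-\frac{\pi}{2},\frac{\pi}{2}]\) with the same sine value, which is exactly the failure of invertibility described above.</p>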
<p>This also leads us to an important practice: that of explicitly defining the region of the domain where the function is monotonic. In most cases, excluding the problematic areas of the domain allows us to apply stricter conditions to a local area of a function, which would not be possible if the function were considered at a global scale.</p>
<h3 id="function-inverses-in-higher-dimensions">Function Inverses in Higher Dimensions</h3>
<p>What if we wish to extend this to the two-dimensional case? We now have a function \(F:\mathbb{R}^2 \rightarrow \mathbb{R}^2\). I said “a function”, but it is actually a vector of two functions. An elementary function returns a single scalar value, and to get two values (remember, \(\mathbb{R}^2\)) for our output vector, we need two functions. Let us write this as:</p>
\[F(X)=\begin{bmatrix}
f_1(x_1, x_2) \\ f_2(x_1, x_2)
\end{bmatrix}
\\
f_1(x_1, x_2)=x_1+x_2 \\
f_2(x_1, x_2)=x_1-x_2 \\
\Rightarrow F(X)=
\begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\begin{bmatrix}
x_1 \\ x_2
\end{bmatrix}\]
<p>where \(X=(x_1,x_2)\). I have simply rewritten the functions in matrix form above.
<strong>What is the inverse of this function?</strong> We can simply compute the inverse of this matrix to get the answer. I won’t show the steps here (I did this using Gaussian Elimination on an augmented matrix), but you can verify for yourself that the inverse \(F^{-1}\) is:</p>
\[F^{-1}(X)=\begin{bmatrix}
\frac{1}{2} & \frac{1}{2} \\
\frac{1}{2} & -\frac{1}{2}
\end{bmatrix}
\begin{bmatrix}
x_1 \\ x_2
\end{bmatrix}\]
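<p>A quick numerical sanity check of this inverse (using <code>numpy</code> rather than the hand Gaussian Elimination; the test vector is an arbitrary choice of mine):</p>

```python
import numpy as np

# The matrix of F, and its inverse computed numerically.
F = np.array([[1.0,  1.0],
              [1.0, -1.0]])
F_inv = np.linalg.inv(F)
print(F_inv)  # [[0.5, 0.5], [0.5, -0.5]], matching the hand computation

# Round trip: applying F then F_inv recovers the original vector.
X = np.array([3.0, 7.0])
print(F_inv @ (F @ X))  # [3.0, 7.0]
```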
<p>This can be extended to all higher dimensions, obviously.</p>
<p>Let us repeat the same question as in the one-dimensional case: <strong>when is the function \(F\) not invertible?</strong> We need to make our definition a little more sophisticated in the case of multivariable functions; the new requirement is that the matrix of its partial derivatives be invertible. Stated this way, this implies that the gradient of the function (the Jacobian) \(\nabla F\) be invertible over the entire region of interest.</p>
<p>Briefly, we’re looking at \(n\) equations with \(n\) unknowns, with all linearly independent column vectors. <strong>Linear independence is a necessary condition for invertibility.</strong></p>
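<p>As a concrete (assumed) example of checking Jacobian invertibility for a nonlinear map, take the polar-to-Cartesian transform \(F(r, \theta) = (r\cos\theta, r\sin\theta)\), whose Jacobian determinant works out to \(r\):</p>

```python
import numpy as np

# Jacobian of F(r, theta) = (r cos(theta), r sin(theta)):
#   [[cos t, -r sin t],
#    [sin t,  r cos t]]
# Its determinant is r, so F is locally invertible wherever r != 0.
def jacobian(r, t):
    return np.array([[np.cos(t), -r * np.sin(t)],
                     [np.sin(t),  r * np.cos(t)]])

print(np.linalg.det(jacobian(2.0, 0.3)))  # 2.0 -> columns independent, invertible here
print(np.linalg.det(jacobian(0.0, 0.3)))  # 0.0 -> columns dependent, not invertible at the origin
```

<p>A non-zero determinant is exactly the statement that the columns of the Jacobian are linearly independent, connecting back to the necessary condition above.</p>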
<p>We are now ready to state the <strong>Inverse Function Theorem</strong> (well, the important part).</p>
<h2 id="inverse-function-theorem">Inverse Function Theorem</h2>
<p>The <strong>Inverse Function Theorem</strong> states that:</p>
<p>In a neighbourhood of a point \(x_0\) in the domain of a function \(F\) which is known to be <strong>continuously differentiable</strong>, if the <strong>derivative of the function \(DF(x_0)\)</strong> is <strong>invertible</strong>, then there exists an <strong>inverse function</strong> \(F^{-1}\) in that same neighbourhood such that \(F^{-1}(F(x_0))=x_0\).</p>
<p>The theorem also gives us information about the <strong>derivative of the inverse function</strong>, but we’ll not delve into that aspect for the moment. Any textbook on <strong>Vector Calculus</strong> should have the relevant results.</p>
<p>This is a very informal definition of the <strong>Inverse Function Theorem</strong>, but it conveys the most important part, namely: <strong>if the derivative of a function is invertible</strong> in some neighbourhood of \(x_0\), <strong>there exists an inverse of the function</strong> itself in that neighbourhood.</p>
<p>The reason we stress the word <strong>neighbourhood</strong> so heavily is that many functions, especially nonlinear ones, are not continuously differentiable with an invertible derivative everywhere; the guarantee is inherently local. Linear functions look the same as their derivatives at every point, which is why we didn’t need to worry about where we took the derivative of \(f(x)=2x\) in our initial example.</p>
<p>The <strong>Inverse Function Theorem</strong> obviously applies to linear functions, but its real value lies in applying to <strong>nonlinear functions</strong>, where the neighbourhood is taken to be infinitesimal, which then leads us to the definition of the <strong>manifold</strong>, which we have talked about in <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers, Manifolds, and the Implicit Function Theorem</a>.</p>
<h2 id="implicit-function-theorem">Implicit Function Theorem</h2>
<p>What can we say about systems of functions which have \(n\) unknowns, but less than \(n\) equations? The <strong>Implicit Function Theorem</strong> gives us an answer to this; think of it as a more general version of the <strong>Inverse Function Theorem</strong>.</p>
<p>Much of the mechanics implied by this theorem is covered in <a href="/2021/04/24/vector-calculus-lagrange-multipliers.html">Vector Calculus: Lagrange Multipliers, Manifolds, and the Implicit Function Theorem</a>. However, here we take a big-picture view.</p>
<p>Suppose we have \(m+n\) unknowns and \(n\) equations.
Assuming the system has full rank, we will have \(n\) pivotal variables, corresponding to \(n\) linearly independent column vectors of this system of linear equations.
This means that the \(n\) pivotal variables can be expressed in terms of the \(m\) free variables. Let us call the \(m\) free variables \(U=(u_1, u_2,..., u_m)\), and the \(n\) pivotal variables \(V=(v_1, v_2, ..., v_n)\).</p>
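<p>For a linear system \(A\begin{bmatrix}U \\ V\end{bmatrix} = 0\), this expression is explicit: splitting the columns of \(A\) into a free block \(A_u\) and an invertible pivotal block \(A_v\) gives \(V = -A_v^{-1} A_u U\). The matrix and the choice of free variables below are invented for illustration:</p>

```python
import numpy as np

# n = 3 equations in m + n = 5 unknowns: 2 free variables u, 3 pivotal variables v.
# From A_u @ u + A_v @ v = 0, we get v = -inv(A_v) @ A_u @ u.
m, n = 2, 3
A = np.array([[1.0, 2.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0, 2.0],
              [1.0, 0.0, 0.0, 0.0, 1.0]])  # shape (n, m + n)
A_u, A_v = A[:, :m], A[:, m:]              # split columns: free | pivotal

u = np.array([1.0, -2.0])                  # freely choose the free variables
v = -np.linalg.inv(A_v) @ A_u @ u          # the pivotal variables are then determined

print(A @ np.concatenate([u, v]))          # the full vector solves A x = 0
```

<p>This is the linear prototype of the functions \(\phi_i\) that appear below: the pivotal variables are a function of the free ones.</p>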
<p>Let us consider the original function \(F_{old}\).</p>
\[F_{old}(U,V)=\begin{bmatrix}
f_1(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n) \\
f_2(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n) \\
f_3(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n) \\
\vdots \\
f_n(u_1, u_2, u_3, ..., u_m, v_1, v_2, v_3, ..., v_n)
\end{bmatrix}\]
<p>The new function \(F_{new}\) is what we obtain once we have expressed \(V\) in terms of only \(U\). It looks like this:</p>
\[F_{new}(U)=\begin{bmatrix}
u_1 \\
u_2 \\
u_3 \\
\vdots \\
u_m \\
\phi_1(u_1, u_2, u_3, ..., u_m) \\
\phi_2(u_1, u_2, u_3, ..., u_m) \\
\phi_3(u_1, u_2, u_3, ..., u_m) \\
\vdots \\
\phi_n(u_1, u_2, u_3, ..., u_m)
\end{bmatrix}\]
<p>Note that the original formulation had a function \(F_{old}\) which transformed the full set \((U,V)\) into a new vector. The new formulation now has \(m\) free variables which stay unchanged after the transform, and \(n\) pivotal variables \(V\) which are mapped from \(U\) with a new set of functions \(\Phi=(\phi_1,\phi_2,...,\phi_n)\).</p>
<p>Now, instead of asking: <strong>“Is there an inverse of the function \(F_{old}\)?”</strong>, we ask: <strong>“Is there an inverse of the function \(F_{new}\)?”</strong></p>
<p>The situation is illustrated below.</p>
<p><img src="/assets/images/implicit-function-theorem.png" alt="Implicit Function Theorem Intuition" /></p>
<p>The <strong>Implicit Function Theorem</strong> states that if a mapping \(F_{old}(U,F_{new}(U))\) exists for a point \(c=(U_0, F_{new}(U_0))\) such that:</p>
<ul>
<li>
\[\mathbf{F_{old}(c)=0}\]
</li>
<li>\(F_{old}(c)\) is <strong>first order differentiable</strong> (\(C^1\) differentiable)</li>
<li>The derivative of \(F_{old}\) is invertible, implying \(L\) is also invertible, where \(L\) is defined as below:</li>
</ul>
\[L=\begin{bmatrix}
(D_1F_{old}, D_2F_{old}, ..., D_nF_{old}) & (D_{n+1}F_{old}, D_{n+2}F_{old}, ..., D_{n+m}F_{old}) \\
0 & I_{m \times m}
\end{bmatrix}\]
<p>then, the following holds true:</p>
<ul>
<li>There exists an inverse mapping \(F_{new}^{-1}\) for \(F_{new}\) such that \(F_{old}(F_{new}^{-1}(V), V)=0\) in the neighbourhood of \(c\)</li>
<li>There is a <strong>neighbourhood of \(c\)</strong> where this linear relationship holds for \(F_{old}(c)=0\).</li>
</ul>
<p>The above is the same statement as the one made by the <strong>Inverse Function Theorem</strong>, except that the system of linear equations in that scenario was completely determined. In the case of the <strong>Implicit Function Theorem</strong>, the system is <strong>underdetermined</strong>.</p>
<h3 id="note-on-the-derivative-matrix">Note on the Derivative Matrix</h3>
<p>Let us look at the matrix \(L\) defined above. Here, we have padded the derivatives with a zero matrix and an identity matrix to make the whole matrix \(L\) square.</p>
<p>For simple linear surfaces, simply finding the inverse of the system of linear equations is enough, since, as I noted, the gradient vector is the same as the surface normal globally; but for “lumpy” functions, this is only true in a neighbourhood of \(x_0\). But what is the <strong>size of this neighbourhood</strong>, such that the derivative approximates the actual function reasonably well?</p>
<p>Put another way, what is the size of the neighbourhood, <strong>where the first derivative does not change too fast</strong> for it to be useful in approximating the actual function? This requires the derivative satisfying the <strong>Lipschitz Condition</strong>, which is a way of putting a <strong>strong guarantee on continuous differentiability</strong>.</p>
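<p>To make the Lipschitz idea concrete, here is a small numerical sketch (my own illustration, not part of the theorem's machinery): for \(f(x)=\sin(x)\), the derivative \(f'(x)=\cos(x)\) is Lipschitz with constant 1, since its rate of change is bounded by \(\sup \lvert \sin \rvert = 1\). Sampling the difference quotients of \(\cos\) never exceeds that bound:</p>

```python
import numpy as np

# Estimate the Lipschitz constant of f'(x) = cos(x) by sampling
# |f'(x) - f'(y)| / |x - y| on a fine grid. The true bound is 1.
x = np.linspace(-np.pi, np.pi, 2001)
fp = np.cos(x)
ratios = np.abs(np.diff(fp)) / np.abs(np.diff(x))
print(ratios.max())  # close to, and never above, 1.0
```

<p>A derivative with a small Lipschitz constant changes slowly, which is what buys a larger neighbourhood in which the linearization is a good approximation.</p>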
<p>We will not go into the details of how this condition is satisfied; we only note that calculating a metric associated with this condition requires us to compute \(L^{-1}\).</p>
<p>We know that \((D_1F_{old}, D_2F_{old}, D_3F_{old}, ..., D_nF_{old})\) is \(n \times n\) and is invertible, because we know that there are \(n\) linearly independent columns in the derivative of \(F_{old}\).</p>
<p>The matrix \(L\) has the block form:</p>
\[L=
\begin{bmatrix}
A & C \\
0 & B
\end{bmatrix}\]
<p>where \(A\) and \(B\) are invertible, but \(C\) need not be. To see why this results in \(L\) being invertible, see <a href="/2021/04/29/quick-summary-of-common-matrix-product-methods.html">Intuitions around Matrix Multiplications</a>.</p>
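<p>This block-triangular fact is easy to verify numerically: \(\det(L)=\det(A)\det(B)\), independent of \(C\). The blocks below are arbitrary examples of mine:</p>

```python
import numpy as np

# A block upper-triangular matrix L = [[A, C], [0, B]] with A, B invertible
# is itself invertible, regardless of C, since det(L) = det(A) * det(B).
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])          # invertible (det = 1)
B = np.array([[3.0]])               # invertible 1x1 block (det = 3)
C = np.array([[7.0], [-4.0]])       # arbitrary rectangular block, not even square

L = np.block([[A, C],
              [np.zeros((1, 2)), B]])
print(np.linalg.det(L))             # det(A) * det(B) = 3.0, so L is invertible
```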