In order to motivate the search for minima and maxima, we present an
extended example which finds widespread use in science and technology:
the singular value decomposition of a matrix.
Consider a matrix $A\in \mathbb{R}^{m\times n}$. Recall that the rank
of $A$
$$
\operatorname{rank}(A)=\operatorname{dim}\Bigl( \operatorname{span}\{
A_{\bullet 1},A_{\bullet 2},\dots,A_{\bullet n}\}\Bigr)
$$
is the dimension of the range of $A$. We use the notation $A_{\bullet
k}$ for the $k$-th column of $A$. Now the "simplest" possible matrices
are column vectors, $\mathbb{R}^{m\times 1}$, and row vectors,
$\mathbb{R}^{1\times n}$. They represent the simple linear maps
$$
\underline{u}:\mathbb{R}\to \mathbb{R}^m,\: x\mapsto x\, u
$$
and
$$
\underline{v}:\mathbb{R}^n\to \mathbb{R},\: x\mapsto v\, x
$$
for $u\in \mathbb{R}^{m\times 1}\simeq \mathbb{R}^m$ and for
$v\in\mathbb{R}^{1\times n}\simeq \mathbb{R}^n$. If we interpret $v$
as a column vector as well, then we would write $v^\mathsf{T}x$ instead
of $v x$. Combining two of these we obtain a rank-one matrix
$$
u\, v^\mathsf{T}\in \mathbb{R}^{m\times n}
$$
which we think of as the map $\mathbb{R}^n\ni x\mapsto (v^\mathsf{T}x)u\in
\mathbb{R}^m$. Of course, for this matrix to actually be rank-one, it needs to be
assumed that $u\neq 0$ and $v\neq 0$. Notice that
$$
u\, v^\mathsf{T}=\| u\| \| v\| \frac{u}{\| u\|}\frac{v^\mathsf{T}}{\|
v\|}= \sigma \bar u \bar v ^\mathsf{T},
$$
where $\sigma \geq 0$ and $\| \bar u\|=1=\|\bar v\|$.
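The following minimal NumPy sketch (the vectors and their sizes are arbitrary choices made for illustration) confirms numerically that $u\, v^\mathsf{T}$ has rank one and factors as $\sigma\,\bar u\bar v^\mathsf{T}$.
```python
import numpy as np

# Arbitrary nonzero vectors (sizes m = 3, n = 4 chosen purely for illustration).
u = np.array([3.0, -1.0, 2.0])
v = np.array([1.0, 0.0, 2.0, -2.0])

M = np.outer(u, v)                        # the m x n matrix u v^T
print(np.linalg.matrix_rank(M))           # 1

# Normalized factorization u v^T = sigma * ubar vbar^T with sigma >= 0.
sigma = np.linalg.norm(u) * np.linalg.norm(v)
ubar, vbar = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(np.allclose(M, sigma * np.outer(ubar, vbar)))   # True
```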
Notice that
$$
Ax=\sum _{i=1}^m \Bigl( \sum _{j=1}^n a_{ij}x_j\Bigr)e_i=
\sum _{i=1}^m \sum _{j=1}^n a_{ij}e^m_i \bigl( e^n_j\bigr)^\mathsf{T}x,
$$
so that
$$
A=\sum _{i=1}^m \sum _{j=1}^n a_{ij}e^m_i \bigl( e^n_j\bigr)^\mathsf{T},
$$
and the superscripts are used to indicate the space in which the
natural basis is taken. Clearly if $\operatorname{rank}(A)=k$, then
$k$ rank-one terms "should suffice", that is, we should have that
$$
A=\sum _{j=1}^k \sigma_j u_j\, v_j^\mathsf{T}.
$$
The question we would like to answer is: does this actually work? Can
we find
$$
\sigma_j, u_j\in \mathbb{R}^m,v_j\in \mathbb{R}^n\text{ such that }
A=\sum _{j=1}^k \sigma_j u_j\, v_j^\mathsf{T}\text{, if }\operatorname{rank}(A)=k?
$$
As an exercise, prove that
$\operatorname{rank}(A)=\operatorname{rank}(A^\mathsf{T})$. We define
the Frobenius norm of a matrix by
$$
\| A\| _{F}=\sqrt{\operatorname{tr}(A^\mathsf{T}A)}=\Bigl[\sum _{i=1}^m
\sum _{j=1}^na_{ij}^2\Bigr]^{1/2}
$$
and consider the problem
$$
\operatorname{argmin}_{\sigma_j,u_j,v_j}\frac{1}{2}\| A-\sum _{j=1}^k
\sigma_j u_j\, v_j^\mathsf{T}\| ^2_F
$$
If a minimum is reached and is $0$, then we would be done! We start
with
$$
\operatorname{argmin}\big\{ \frac{1}{2}\| A-\sigma u\,
v^\mathsf{T}\| ^2_F \, :\, \sigma\in\mathbb{R},\, u\in \mathbb{R}^m,\, v\in
\mathbb{R}^n, \|u\|=1=\|v\|\big\}.
$$
Convince yourself that
$$
E(\sigma,u,v):=\frac{1}{2}\| A-\sigma u\, v^\mathsf{T}\|^2_F=\frac{1}{2}\| A\|^2
_F-\sigma\, u^\mathsf{T}Av+\frac{\sigma^2}{2}\underbrace{\| u\|^2\| v\|^2}_{=1}.
$$
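Both the trace formula for $\|\cdot\|_F$ and this expansion of $E$ are easy to verify numerically; here is a small NumPy sketch with an arbitrary test matrix and randomly chosen unit vectors.
```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))                      # arbitrary test matrix

# The two expressions for the Frobenius norm agree.
print(np.allclose(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, 'fro')))   # True

# Expansion of E(sigma, u, v) for unit vectors u, v and an arbitrary sigma.
u = rng.standard_normal(4); u /= np.linalg.norm(u)
v = rng.standard_normal(3); v /= np.linalg.norm(v)
sigma = 0.7

lhs = 0.5 * np.linalg.norm(A - sigma * np.outer(u, v), 'fro') ** 2
rhs = 0.5 * np.linalg.norm(A, 'fro') ** 2 - sigma * (u @ A @ v) + sigma ** 2 / 2
print(np.allclose(lhs, rhs))                                                # True
```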
If the minimum is attained at a point with $\sigma=0$, then we have that
$$
\frac{1}{2}\| A\|_F^2\leq \frac{1}{2}\| A-\sigma\, uv^\mathsf{T}\|
_F^2=\frac{1}{2}\| A\|_F^2-\sigma \, u^\mathsf{T}Av+\frac{\sigma^2}{2}\text{ for
all }\sigma\in\mathbb{R}\text{ and all unit vectors }u,v.
$$
This implies that $(u^\mathsf{T}Av)\sigma -\sigma^2/2\leq 0$ for all $\sigma$ no
matter what $u$ and $v$ are, which yields that
$\frac{1}{2}(u^\mathsf{T}Av)^2\leq 0$ for all $u,v$, since the maximum
of the expression over $\sigma$ is attained at $\sigma=u^\mathsf{T}Av$. It follows
that, in this case, $u^\mathsf{T}Av=0$ for all $u,v$ and therefore that
$A=0$. We conclude that, if $A\neq 0$, any minimizer must have $\sigma\neq 0$;
replacing $u$ by $-u$ if necessary, we may assume $\sigma>0$.
Notice that
$$
DE(\sigma,u,v)=\begin{bmatrix} -u^\mathsf{T}Av+\sigma & -\sigma
Av+\sigma^2u & -\sigma A^\mathsf{T}u+\sigma^2 v\end{bmatrix},
$$
so that for a critical point $\sigma_*,u_*,v_*$ it holds that
$$
\sigma_*=u_*^\mathsf{T}Av_*\text{ and }\sigma_* \bigl(
Av_*-\sigma_*u_*\bigr) =0=\sigma_* \bigl( A^\mathsf{T}u_*-\sigma_*v_*\bigr).
$$
We can therefore consider
$$
\tilde E(u,v)=E \bigl( \sigma(u,v),u,v\bigr)\text{ on
}\mathbb{S}^m\times \mathbb{S}^n,
$$
where $\sigma(u,v)=u^\mathsf{T}Av$ and $\mathbb{S}^l$ denotes the unit
sphere in $\mathbb{R}^l$. It follows from the compactness of $
\mathbb{S}^m\times \mathbb{S}^n$ and the continuity of $\tilde E$ that
a minimum exists, i.e. unit vectors $u_1,v_1$ can be found (replacing $u_1$ by
$-u_1$ if necessary) such that
$$
\sigma_1:=u_1^\mathsf{T}Av_1>0\text{ whenever } A\neq 0.
$$
Notice that $\tilde E(u_1,v_1)\leq \tilde E(u,v)$ for all unit vectors $u,v$ and that
$$
\tilde E(u,v)= E \bigl( \sigma(u,v),u,v\bigr)\leq E(\sigma,u,v) \text{ for all
}\sigma ,u,v,
$$
since $\sigma(u,v)$ minimizes $\frac{1}{2}\| A-\sigma\, uv^\mathsf{T}\|
_F^2$ for any fixed given $u,v$. Next observe that
$$
\operatorname{rank}\bigl( A-\sigma
_1u_1v_1^\mathsf{T}\bigr)<\operatorname{rank}(A),
$$
since $Av_1=\sigma_1u_1\neq 0$ but
$$
\bigl( A-\sigma _1u_1v_1^\mathsf{T}\bigr)v_1=Av_1-\sigma_1 u_1=0,
$$
by the stationarity conditions $DE=0$ and the fact that $\sigma_1>0$
for $A\neq 0$. Indeed, $v_1=\frac{1}{\sigma_1}A^\mathsf{T}u_1$ is orthogonal to
$\ker(A)$, so that $\ker(A)\cup\{v_1\}\subset\ker\bigl( A-\sigma_1u_1v_1^\mathsf{T}\bigr)$
and the kernel strictly grows, i.e. the rank strictly drops.
Now it only remains to replace $A$ by $A_1=A-\sigma _1u_1v_1^\mathsf{T}$
and consider
$$
\operatorname{argmin}\big\{ \frac{1}{2}\| A_1-\sigma u\,
v^\mathsf{T}\| ^2_F \, :\, \sigma\in\mathbb{R},\, u\in \mathbb{R}^m,\, v\in
\mathbb{R}^n, \|u\|=1=\|v\|\big\}.
$$
If the minimum is 0, then $A_1=0$ and we are done since $A=\sigma
_1u_1v_1^\mathsf{T}$. If not, we can find $\sigma_2>0$ as well as
$u_2,v_2$ of norm 1 which minimize the new energy and such that
$$
\operatorname{rank}(A_2)=\operatorname{rank}\bigl( A_1-\sigma
_2u_2v_2^\mathsf{T}\bigr)<\operatorname{rank}(A_1).
$$
Since the rank is strictly decreasing in each step, continuing in
this fashion, a $k\in \mathbb{N}$ is found along with
$\sigma_j,u_j,v_j$ for $j=1,\dots,k$ such that
$$
A-\sum _{j=1}^k \sigma_j u_j\, v_j^\mathsf{T}=0.
$$
The numbers $\sigma _j$ are called singular values, and the vectors
$u_j,v_j$, singular vectors. Using the vectors $u_j$ and $v_j$ as the
columns of two matrices $U\in \mathbb{R}^{m\times k}$ and $V\in
\mathbb{R}^{n\times k}$ and the values $\sigma_j$ as the diagonal entries
of an otherwise zero square matrix $\Sigma\in \mathbb{R}^{k\times k}$,
this can be written as
$$
A=U\Sigma V^\mathsf{T}.
$$
As an exercise, prove that the columns of $U$ and of $V$ are orthonormal, i.e. that
$U^\mathsf{T}U=I_k=V^\mathsf{T}V$.
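The greedy construction above can be mimicked numerically. The following sketch is only an illustration under simplifying assumptions (a generic random test matrix and a fixed number of alternating iterations), not a robust SVD routine: each step approximates the best rank-one factor by alternating the stationarity relations $Av=\sigma u$ and $A^\mathsf{T}u=\sigma v$, deflates, and finally checks both $A=U\Sigma V^\mathsf{T}$ and the orthonormality asked for in the exercise.
```python
import numpy as np

def best_rank_one(A, iters=500, seed=0):
    """Minimize E by alternating the stationarity relations Av = sigma*u, A^T u = sigma*v."""
    v = np.random.default_rng(seed).standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = A @ v;   u /= np.linalg.norm(u)
        v = A.T @ u; v /= np.linalg.norm(v)
    return u @ A @ v, u, v                  # sigma_* = u_*^T A v_*

def greedy_svd(A, tol=1e-10):
    """Deflate A by its best rank-one approximation until (numerically) nothing is left."""
    sigmas, us, vs = [], [], []
    R = A.copy()
    for _ in range(min(A.shape)):           # at most min(m, n) steps are needed
        if np.linalg.norm(R, 'fro') <= tol:
            break
        s, u, v = best_rank_one(R)
        sigmas.append(s); us.append(u); vs.append(v)
        R = R - s * np.outer(u, v)          # A_{j+1} = A_j - sigma_j u_j v_j^T
    return np.array(sigmas), np.column_stack(us), np.column_stack(vs)

A = np.random.default_rng(1).standard_normal((5, 3))
S, U, V = greedy_svd(A)
print(np.allclose(A, U @ np.diag(S) @ V.T))            # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(len(S))),            # columns of U are orthonormal
      np.allclose(V.T @ V, np.eye(len(S))))            # columns of V are orthonormal
```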
Taylor's Theorem in $\mathbb{R}^n$
Multi-indices are useful to deal with higher order partial derivatives. For $\alpha
=(\alpha_1,\dots,\alpha_n)\in (\mathbb{N}\cup\{0\})^n$ set $|\alpha|=\sum_{j=1}^n
\alpha_j$ and define
$$
\frac{\partial^{|\alpha|}f}{\partial x^\alpha}=\frac{\partial^{|\alpha|}f }{\partial x_1^{\alpha_1} \cdots \partial
x_n^{\alpha_n}}\quad\text{and}\quad y^\alpha=\prod_{j=1}^n y_j^{\alpha_j}\text{ for }y\in\mathbb{R}^n.
$$
Theorem (Taylor's theorem)
Let $U\subset \mathbb{R}^n$ be open and convex and $\bar x\in U$. Let $f\in
\operatorname{C}^{m+1} (U)$, that is, $(m+1)$ times continuously differentiable. Then
$$
f(x) = \sum_{|\alpha|=0}^m {1\over \alpha!} {\partial^{|\alpha|} f(\bar x) \over \partial x^\alpha
} (x-\bar x)^\alpha +\sum_{|\alpha| = m+1} {1\over \alpha!} {\partial^{|\alpha|} f(\xi)\over
\partial x^\alpha } (x-\bar x)^{\alpha},
$$
for every $x\in U$, where $\alpha! =\prod _{j=1}^n\alpha_j!$ and $\xi=\xi(x)$ is some
point on the segment joining $\bar x$ and $x$.
Since $U$ is convex and $\bar x, x \in U$, one has that
$$
[\bar x,x] =\big\{\bar x+t(x-\bar x): t\in [0,1]\big\}\subset U\, .
$$
Defining $g(t) = f\bigl(tx+(1-t)\bar x\bigr)$ for $t\in[0,1]$ and using the one-variable
Taylor's Theorem it follows that
$$
f(x) = g(1) = \sum_{j=0}^m \frac{g^{(j)} (0)}{j!} (1-0)^j + \frac{g^{(m+1)} (\theta)}{(m+1)!}(1-0)^{m+1}
$$
for some $\theta\in (0,1)$. Notice that
$$
g(0) = f(\bar x)\text{ and }g'(0) = \frac{d}{d t}f\bigl(\bar x+t(x-\bar x)\bigr) \big |_{t=0} =
\sum_{j=1}^n \frac{\partial f}{\partial x_j} (\bar x)(x_j-\bar x_j)
$$
and
\begin{align*}
g''(0)&=\frac{d}{dt}\Big[\sum_{j=1}^n \frac{\partial f}{\partial x_j}(tx+(1-t)\bar x)(x_j-\bar x_j) \Big]\Big|_{t=0}\\
&=\sum_{i,j=1}^n \frac{\partial^2f}{\partial x_i \partial x_j}(\bar x)(x_i-\bar x_i)(x_j -\bar x_j)\, .
\end{align*}
Notice that
$$
\frac{1}{2!}g''(0) = \sum_{|\alpha| = 2 } \frac{1}{\alpha!}\frac{\partial^2 f (\bar x)}{\partial
x^{\alpha} }(x-\bar x)^\alpha\, .
$$
Similarly, one can prove
$$
\frac{1}{j!}g^{(j)}(t) =\sum_{|\alpha| = j }\frac{1}{\alpha!}\frac{\partial^j f\bigl(t
x+(1-t)\bar x\bigr)}{\partial x^\alpha} (x-\bar x)^\alpha\, ,
$$
for $j=1,2,\dots,m+1$ and the result follows.
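The multi-index formula can be checked symbolically. The following SymPy sketch uses an arbitrarily chosen test function and expansion point and compares the second-order Taylor polynomial built from the formula with the expansion of $g(t)=f\bigl(\bar x+t(x-\bar x)\bigr)$.
```python
import sympy as sp

x, y, hx, hy, t = sp.symbols('x y hx hy t')
f = sp.exp(x) * sp.sin(y)                 # arbitrary smooth test function
xbar, ybar = 0, 0                         # expansion point (xbar, ybar) = (0, 0)

# Second-order Taylor polynomial from the multi-index formula:
#   sum_{|alpha| <= 2} (1/alpha!) * d^{|alpha|} f(xbar)/dx^alpha * h^alpha
taylor2 = sp.S(0)
for a1 in range(3):
    for a2 in range(3 - a1):
        deriv = f
        for _ in range(a1):
            deriv = sp.diff(deriv, x)
        for _ in range(a2):
            deriv = sp.diff(deriv, y)
        deriv = deriv.subs({x: xbar, y: ybar})
        taylor2 += deriv / (sp.factorial(a1) * sp.factorial(a2)) * hx**a1 * hy**a2

# Compare with the expansion of g(t) = f(xbar + t*hx, ybar + t*hy) up to order t^2.
g = f.subs({x: xbar + t * hx, y: ybar + t * hy})
series2 = sp.series(g, t, 0, 3).removeO().subs(t, 1)
print(sp.simplify(taylor2 - series2))     # 0
```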
Extremal problems
Definition (Extrema)
Let $U$ be a domain in $\mathbb{R}^n$, $f:U\to \mathbb{R}$ be continuous, and $x_0\in U$. Then we say
that
(i) $f$ attains a local maximum [minimum] at $x_0$ if there is a $\delta>0$ such that
$$
f(x_0)\geq f(x)\quad [ f(x_0)\leq\, f(x) ]\, ,\:x\in \mathbb{B}(x_0, \delta)\, .
$$
(ii) $f$ attains a global maximum [minimum] on $U$ at $x_0$ if
$$
f(x_0)\geq f(x)\quad[ f(x_0)\leq \, f(x) ]\, ,\:x\in U\, .
$$
In these cases it is also said that $x_0$ is a (local) maximizer/minimizer.
A pervasive issue in mathematics is how to find the maxima (or maximizers) and minima
(or minimizers) of $f$ in $U$, provided they exist.
Proposition
Let $f\in \operatorname{C}^1(U)$. If $x_0\in U$ is a local maximizer or a local
minimizer of $f$ in $U$, then $\nabla f(x_0) = 0$.
Assume that $x_0$ is a local minimizer for $f$ (the case of a maximizer is analogous).
Let $e_j=(0,\cdots, 0, 1, 0, \cdots, 0)$ (the $j$th position is $1$, others are zero)
and let
$$
g_j(t)=f(x_0+ t e_j)
$$
Then $t=0$ is a local minimizer for $g_j$. Thus, $g_j'(0)=0$. This implies
$$
{\partial f (x_0)\over \partial x_j}=g_j'(0)=0,\quad 1\le j\le n.
$$
Therefore, $\nabla f(x_0)=0$.
Definition (Critical points)
A point $x_0\in U$ is called a critical point of $f$ in $U$ if either $\nabla f(x_0)
=0$ or $\nabla f(x_0)$ does not exist.
Now, if $x_0$ is a critical point for $f$ in $U$, how can it be determined whether
$x_0$ is a local maximizer, minimizer, or a saddle point?
Definition (Positive definite matrix)
Let $A$ be an $n\times n$ symmetric matrix. Then
(i) $A$ is positive [negative] definite if and only if there is a constant
$\lambda>0$ such that
$$
\langle A x, x\rangle \ge \lambda |x|^2 \quad [ \langle A x, x\rangle \le -\lambda |x|^2 ],\quad x
\in \mathbb{R}^n .
$$
(ii) $A$ is positive semi-definite if and only if
$$
\langle Ax, x\rangle \ge 0,\quad x\in\mathbb{R}^n\, .
$$
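For a symmetric matrix, the constant $\lambda$ in (i) can be taken to be the smallest eigenvalue. A small NumPy sketch (the matrix is an arbitrary positive definite example) illustrates the inequality $\langle Ax,x\rangle\geq\lambda|x|^2$:
```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                       # arbitrary symmetric positive definite matrix

lam_min = np.linalg.eigvalsh(A).min()            # smallest eigenvalue, here (5 - sqrt(5))/2 > 0

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))               # random test vectors x
quadratic = np.einsum('ij,jk,ik->i', X, A, X)    # <A x, x> for every row x of X
print(np.all(quadratic >= lam_min * (X**2).sum(axis=1) - 1e-12))   # True
```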
Theorem (Second derivatives test)
Let $f\in \operatorname{C}^2(U)$ and $x_0\in U$ be a critical point of $f$. Defining
the Hessian $H_f(x_0)$ of $f$ at $x_0$ by
$$
H_f(x_0) =\begin{bmatrix} \frac{\partial ^2f}{\partial x_1^2} & \cdots & \frac{\partial
^2f}{\partial x_1\partial x_n} \\ \vdots & & \vdots \\
\frac{\partial^2f}{\partial x_n \partial x_1} & \cdots & \frac{\partial ^2f}{\partial
x_n^2}\end{bmatrix}(x_0)\, ,
$$
the following statements hold:
(i) If $H_f(x_0)$ is positive definite, then $x_0$ is a local minimizer.
(ii) If $H_f(x_0)$ is negative definite, then $x_0$ is a local maximizer.
(iii) If $H_f(x_0)$ is indefinite, i.e. $\langle H_f(x_0)\,\xi,\xi\rangle$ takes both
positive and negative values, then $x_0$ is a saddle point.
Example
Let $f(x_1,x_2) = x_1^2 + x_2^2$ for $x\in \mathbb{R} ^2$. Then
$(0,0)$ is a critical point and
$$
H_f(0,0) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}
$$
is positive definite. Therefore $(0,0)$ is a local minimizer.
Example
Let $f(x_1,x_2) = -(x_1^2 + x_2^2)$ for $x\in \mathbb{R} ^2$. Then
$(0,0)$ is a critical point and
$$
H_f(0,0) = \begin{bmatrix}-2 & 0 \\ 0 & -2 \end{bmatrix}
$$
is negative definite. Therefore $(0,0)$ is a local maximizer.
Example
Let $f(x_1,x_2) = x_1^2 - x_2^2$ for $x\in \mathbb{R} ^2$. Then
$(0,0)$ is a critical point and
$$
H_f(0,0) =\begin{bmatrix} 2 & 0 \\0 & -2 \end{bmatrix}
$$
is indefinite. Therefore $(0,0)$ is a saddle point.
Example
Let $f(x,y) = x^4 + y^4$ for $(x,y)\in \mathbb{R} ^2$. Then
$(0,0)$ is a critical point and
$$
H_f(x,y) = \begin{bmatrix}12x^2 & 0 \\ 0 & 12y^2 \end{bmatrix},\quad H_f(0,0)
= \begin{bmatrix}0 & 0 \\ 0 & 0 \end{bmatrix}\, .
$$
The second derivative test is therefore inconclusive. However, $(0,0)$ is the global
minimizer for $f$ in $\mathbb{R}^2$.
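The classification in the preceding examples can be reproduced by inspecting the eigenvalues of the Hessian: for a symmetric matrix, positive definite means all eigenvalues are positive, negative definite all negative, and indefinite eigenvalues of both signs. A minimal NumPy sketch:
```python
import numpy as np

def classify(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    ev = np.linalg.eigvalsh(H)
    if np.all(ev > tol):
        return "local minimizer"
    if np.all(ev < -tol):
        return "local maximizer"
    if ev.min() < -tol and ev.max() > tol:
        return "saddle point"
    return "test inconclusive"

print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))     # x1^2 + x2^2      -> local minimizer
print(classify(np.array([[-2.0, 0.0], [0.0, -2.0]])))   # -(x1^2 + x2^2)   -> local maximizer
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))    # x1^2 - x2^2      -> saddle point
print(classify(np.array([[0.0, 0.0], [0.0, 0.0]])))     # x^4 + y^4        -> test inconclusive
```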
Theorem
Let $U$ be convex. A function $f\in \operatorname{C}^2(U)$ is convex if and only if
$H_f(x)$ is positive semidefinite for every $x\in U$, i.e. if and only if
$$
\sum_{i,j=1}^n {\partial^2 f(x) \over \partial x_i \partial x_j} \xi_i \xi_j \ge 0,\: \xi\in\mathbb{R}^n, \quad x\in U.
$$
It is easy to see that $f$ is convex in $U$ if and only if
$$
g: [0,1]\to \mathbb{R}\, ,\: t\mapsto g(t)= f\bigl(tx + (1-t)y\bigr)
$$
is convex for all $x,y \in U$, which, in turn, is the case iff $g''\geq 0$ on
$[0, 1]$. Now observe that
$$
g''(t)=\sum_{i,j=1}^n \frac{\partial^2 f}{\partial x_i \partial x_j}\bigl(tx+(1-t) y\bigr)(x_i-y_i)(x_j-y_j)
$$
so that $f$ is convex iff
$$
\sum_{i,j=1}^n {\partial^2 f\over \partial x_i \partial x_j} (x) (x_i-y_i)(x_j-y_j)\ge 0,\: x,y\in U.
$$
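As an illustration of the criterion (a SymPy sketch reusing the function $f(x,y)=x^4+y^4$ from the example above), the Hessian is positive semidefinite at every point, so $f$ is convex; this is consistent with $(0,0)$ being a global minimizer even though the second derivative test is inconclusive there.
```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**4 + y**4                          # earlier example with H_f(0, 0) = 0

H = sp.hessian(f, (x, y))                # [[12*x**2, 0], [0, 12*y**2]]

# A 2x2 symmetric matrix is positive semidefinite iff all principal minors
# (both diagonal entries and the determinant) are nonnegative.
print(H[0, 0], H[1, 1], sp.simplify(H.det()))   # 12*x**2  12*y**2  144*x**2*y**2
```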
Theorem
If $f\in \operatorname{C}^2(U)$ is convex, every critical point of $f$ is a global
minimizer.
Let $\bar x\in U$ be a critical point of $f$. Then $\nabla f(\bar x) = 0$ and, by Taylor expansion,
$$
f(x)=f(\bar x) + \nabla f(\bar x) \cdot (x-\bar x)+\frac{1}{2!}
\sum_{i,j=1}^n\frac{\partial^2f}{\partial x_i \partial x_j} \bigl(\theta \bar x + (1-\theta)x\bigr)(x_i -
\bar x_i)(x_j - \bar x_j)\geq f(\bar x)
$$
for any $x\in U$ and some $\theta\in(0,1)$, since $\nabla f(\bar x)=0$ and the Hessian
term is nonnegative by the convexity of $f$. Thus $f(\bar x)$ is a global minimum of $f$.
Example
Find all global minimizers of $f:\mathbb{R}^2\to \mathbb{R}$ where
$$
f(x, y)=x^4+y^4-32x-2y^2,\: (x,y)\in \mathbb{R} ^2\, .
$$
Since
$$
\frac{\partial f}{\partial x}=4x^3-32\quad \text{and}\quad \frac{\partial f}{\partial y}=4y^3-4y\, ,
$$
this function is continuously differentiable and $\nabla f=0$ has the three solutions
$(2, 0),\ (2, 1)$ and $(2,-1)$, which are the three critical points of $f$. Computing
$$
H_f=\begin{bmatrix} {\partial^2 f\over \partial x^2} & {\partial^2
f\over \partial x\partial y}\\ {\partial^2 f\over \partial
y\partial x} & {\partial^2 f\over \partial y^2}\end{bmatrix}
=\begin{bmatrix} 12 x^2 & 0 \\ 0 & 12 y^2-4\end{bmatrix}\, ,
$$
it is seen that $H_f(2,0)$ is indefinite and so $(2,0)$ is a saddle point of $f$. On
the other hand $H_f(2, \pm 1)$ are positive definite and $(2,\pm 1)$ are local
minimizers. Since $f(2,\pm 1)=-48-1=-49$ and $f(x, y)\to +\infty$ as $x^2+y^2\to
+\infty$, both $ (2, \pm 1)$ must be global minimizers of $f$.
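The computation in this example can be reproduced symbolically. The following SymPy sketch solves $\nabla f=0$ over the reals and classifies each critical point via the eigenvalues of the Hessian.
```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**4 + y**4 - 32*x - 2*y**2

critical_points = sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y], dict=True)
H = sp.hessian(f, (x, y))

for p in critical_points:
    eigenvalues = list(H.subs(p).eigenvals())
    print((p[x], p[y]), f.subs(p), eigenvalues)
# (2, 0):  f = -48, eigenvalues 48 and -4 -> indefinite Hessian, saddle point
# (2, ±1): f = -49, eigenvalues 48 and 8  -> positive definite Hessian, local minimizers
```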
Lagrange Multipliers
Often extrema of functionals need to be found which are subject to additional
constraints, such as in the case where extrema are sought in the zero set of some
function only. For given $f,g:\mathbb{R}^n\to \mathbb{R}$, a typical example is
$$
\min \Big\{f(x)\, :\, x\in[g=c] \Big\} ,
$$
which is interpreted as the problem of finding extremal points of $f$ on the
level set of $g$
$$
[g=c]:=\big\{ x\in \mathbb{R}^n\, :\, g(x)=c\big\}
$$
for a given constant $c$ (which can w.l.o.g. be taken to vanish). If you think
geometrically, it is easy to see that, if $x_0$ is an extremal point of $f$ on
$[g=c]$, then $\nabla f(x_0)$ has to point in a direction orthogonal to $[g=c]$ at the
point $x_0$ (otherwise the function value could be increased/decreased by moving along
$[g=c]$ in the direction of the component of the gradient tangential to it).
Since $\nabla g(x_0)$ is the direction perpendicular to $[g=c]$ at $x_0$ it follows
that
$$\begin{cases}
\nabla f(x_0) =\lambda \nabla g(x_0)&\\
g(x_0) =c&\end{cases}
$$
for some $\lambda\in \mathbb{R}$. The parameter $\lambda$ is called a Lagrange
multiplier for the problem. This is, of course, provided the functions involved are
smooth and $x_0$ is not a critical point of $g$.
Example
Find the maximum and minimum of $f(x,y,z) = x+y$ on the unit sphere $S^2=[x^2 + y^2 + z^2=1]$,
i.e. take $g(x,y,z)=x^2+y^2+z^2$ and $c=1$.
Set up the system
$$\begin{cases}
\nabla f(x,y,z)=\lambda \nabla g(x,y,z)&\\
g(x,y,z)=c&
\end{cases}
$$
which reads
$$\begin{cases}
1=2 \lambda x&\\1=2\lambda y&\\0=2\lambda z&\\1=x^2+y^2+z^2&
\end{cases}$$
Since $1=2\lambda x$ forces $\lambda\neq 0$, this implies that $z=0$ and $x=y=\frac{1}{2\lambda}$, so that
$$
1=\frac{1}{4\lambda ^2}+\frac{1}{4\lambda^2}\, .
$$
It follows that $\lambda=\pm \frac{1}{\sqrt{ 2}}$ corresponding to the extremal points
$$
\bigl(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0\bigr), \bigl(-\frac{1}{\sqrt{2}},
-\frac{1}{\sqrt{2}}, 0\bigr)
$$
Since $f\bigl(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0\bigr)=\sqrt{2}$ and
$f\bigl(-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0\bigr)=-\sqrt{2}$, the first is the
maximizer, while the second is the minimizer.
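The same system can be solved symbolically. The following SymPy sketch sets up the Lagrange conditions for this example and recovers the two extremal points and the corresponding values of $f$.
```python
import sympy as sp

x, y, z, lam = sp.symbols('x y z lambda', real=True)
f = x + y
g = x**2 + y**2 + z**2 - 1                  # constraint g = 0, i.e. x^2 + y^2 + z^2 = 1

# Lagrange conditions: grad f = lambda * grad g together with the constraint.
equations = [sp.diff(f, v) - lam * sp.diff(g, v) for v in (x, y, z)] + [g]
solutions = sp.solve(equations, [x, y, z, lam], dict=True)

for s in solutions:
    print((s[x], s[y], s[z]), 'f =', f.subs(s))
# (sqrt(2)/2,  sqrt(2)/2,  0) with f =  sqrt(2)  (maximizer)
# (-sqrt(2)/2, -sqrt(2)/2, 0) with f = -sqrt(2)  (minimizer)
```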