Long Chen
We introduce the conjugate gradient (CG) method and its preconditioned version (PCG) for solving
$$Au = b,$$
where $A$ is a symmetric and positive definite (SPD) operator defined on a Hilbert space $\mathbb V$ with $\dim \mathbb V = N < \infty$. When $\mathbb V = \mathbb R^N$, $A$ is an SPD matrix.
Equation $Au = b$ can be derived from the following optimization problem
$$\min_{v \in \mathbb V} f(v) := \frac{1}{2}(Av, v) - (b, v).$$
As $f$ is strongly convex, the global minimizer exists, is unique, and satisfies the Euler equation $\nabla f(u) = Au - b = 0$, which is exactly equation $Au = b$.
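The Euler equation can be verified by a direct expansion (a standard computation, recorded here for completeness): for any $v \in \mathbb V$,
$$f(u + v) - f(u) = (Au - b, v) + \frac{1}{2}(Av, v),$$
so $\nabla f(u) = Au - b$, and since the quadratic term is nonnegative for SPD $A$, $u$ is the global minimizer if and only if $Au = b$.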
We shall derive CG from the $A$-orthogonal projection. We use $(\cdot, \cdot)$ for the standard inner product and $(\cdot, \cdot)_A$ for the inner product introduced by the SPD operator $A$. That is,
$$(u, v)_A := (Au, v).$$
When talking about orthogonality, we refer to the default inner product $(\cdot, \cdot)$ and emphasize the orthogonality in $(\cdot, \cdot)_A$ by "$A$-orthogonal" or "conjugate".
Given a vector $u \in \mathbb V$, the $A$-orthogonal projection of $u$ to a subspace $\mathbb S \subseteq \mathbb V$ is the vector, denoted by $P_{\mathbb S} u$, defined by the relation
$$(P_{\mathbb S} u, v)_A = (u, v)_A \quad \text{for all } v \in \mathbb S.$$
Start from an initial guess $u_0$. Let $r_0 = b - A u_0$. For $k \ge 0$, let $\mathbb V_k = \mathrm{span}\{p_0, p_1, \ldots, p_k\}$ be a subspace spanned by an $A$-orthogonal basis, i.e. $(p_i, p_j)_A = 0$ for $i \ne j$.
CG consists of three steps:
1. compute $u_{k+1} = u_0 + P_{\mathbb V_k}(u - u_0)$;
2. compute the residual $r_{k+1} = b - A u_{k+1}$ and expand the subspace to $\mathbb V_{k+1} = \mathbb V_k + \mathrm{span}\{r_{k+1}\}$;
3. apply the Gram–Schmidt process to obtain a new $A$-orthogonal direction $p_{k+1}$.
We now briefly explain each step and present recursive formulae.
Remark 1. We treat $u_0, u_1, \ldots$ as points and their differences, such as $u - u_0$, as vectors. Spaces $\mathbb V_k$ are vector spaces, and $u_{k+1} = u_0 + P_{\mathbb V_k}(u - u_0)$ means the point obtained by moving $u_0$ along the projected vector. If we mix the points with the vectors, the space we are looking for is indeed the affine space $u_0 + \mathbb V_k$.
Given the subspace $\mathbb V_k$ spanned by the $A$-orthogonal basis $\{p_0, \ldots, p_k\}$, we compute $u_{k+1}$ by
$$u_{k+1} = u_0 + P_{\mathbb V_k}(u - u_0) = u_0 + \sum_{i=0}^{k} \alpha_i p_i.$$
The projection can be computed without knowing $u$, as $(u - u_0, p_i)_A = (A(u - u_0), p_i) = (r_0, p_i)$ is known. Indeed, by definition, $(u - u_0, p_i)_A = \alpha_i (p_i, p_i)_A$ for $0 \le i \le k$, which leads to the formulae
$$\alpha_i = \frac{(u - u_0, p_i)_A}{(p_i, p_i)_A} = \frac{(r_0, p_i)}{(A p_i, p_i)}.$$
With $u_{k+1}$, we can compute a new residual vector $r_{k+1} = b - A u_{k+1}$. If $r_{k+1} = 0$, which means $u_{k+1}$ is the solution, then we stop. Otherwise we expand the subspace to a larger one $\mathbb V_{k+1} = \mathbb V_k + \mathrm{span}\{r_{k+1}\}$.
Then we use the Gram–Schmidt process to make the newly added vector $A$-orthogonal to the others. The new conjugate direction is
$$p_{k+1} = r_{k+1} + \beta_k p_k, \qquad \beta_k = -\frac{(r_{k+1}, A p_k)}{(A p_k, p_k)}.$$
This formula, which drops the terms involving $p_0, \ldots, p_{k-1}$, is justified by the following orthogonality, which will be proved later on.
Lemma 1. The residual $r_{k+1}$ is orthogonal to $\mathbb V_k$ and $A$-orthogonal to $\mathbb V_{k-1}$.
Last, we present the recursive formulae. Starting from an initial guess $u_0$ and $p_0 = r_0 = b - A u_0$, for $k \ge 0$, we use three recursion formulae to compute
$$u_{k+1} = u_k + \alpha_k p_k, \qquad \alpha_k = \frac{(r_k, p_k)}{(A p_k, p_k)},$$
$$r_{k+1} = r_k - \alpha_k A p_k,$$
$$p_{k+1} = r_{k+1} + \beta_k p_k, \qquad \beta_k = -\frac{(r_{k+1}, A p_k)}{(A p_k, p_k)}.$$
The method is called the conjugate gradient method since the search direction $p_{k+1}$ is obtained by a correction of the negative gradient (i.e. the residual) direction and is conjugate to all previous directions.
More efficient formulae for $\alpha_k$ and $\beta_k$ can be derived using the orthogonality presented in Lemma 1:
$$\alpha_k = \frac{\|r_k\|^2}{(A p_k, p_k)}, \qquad \beta_k = \frac{\|r_{k+1}\|^2}{\|r_k\|^2}.$$
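The efficient formulae follow from a short computation using Lemma 1. Since $r_k \perp \mathbb V_{k-1}$ and $p_k = r_k + \beta_{k-1} p_{k-1}$,
$$(r_k, p_k) = \|r_k\|^2 + \beta_{k-1}(r_k, p_{k-1}) = \|r_k\|^2,$$
so $\alpha_k = \|r_k\|^2/(A p_k, p_k)$. From $r_{k+1} = r_k - \alpha_k A p_k$ and $(r_{k+1}, r_k) = 0$,
$$(r_{k+1}, A p_k) = \frac{1}{\alpha_k}(r_{k+1}, r_k - r_{k+1}) = -\frac{\|r_{k+1}\|^2}{\alpha_k},$$
hence $\beta_k = -\dfrac{(r_{k+1}, A p_k)}{(A p_k, p_k)} = \dfrac{\|r_{k+1}\|^2}{\alpha_k (A p_k, p_k)} = \dfrac{\|r_{k+1}\|^2}{\|r_k\|^2}$.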
We present the algorithm of the conjugate gradient method as follows.
function u = CG(A,b,u,tol)
% CG solves A*u = b by the conjugate gradient method.
tol = tol*norm(b);
k = 1;
r = b - A*u;
p = r;
r2 = r'*r;
while sqrt(r2) >= tol && k < length(b)
    Ap = A*p;
    alpha = r2/(p'*Ap);     % alpha_k = ||r_k||^2/(A p_k, p_k)
    u = u + alpha*p;
    r = r - alpha*Ap;       % r_{k+1} = r_k - alpha_k A p_k
    r2old = r2;
    r2 = r'*r;
    beta = r2/r2old;        % beta_k = ||r_{k+1}||^2/||r_k||^2
    p = r + beta*p;         % p_{k+1} = r_{k+1} + beta_k p_k
    k = k + 1;
end
Several remarks on the realization are listed below:
The matrix A is only involved through the product A*p. There is no need to form the matrix explicitly. A subroutine to compute the matrix-vector multiplication is enough. This is an attractive feature of Krylov subspace methods.
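The matrix-free point of view can be sketched in Python/NumPy (this sketch is ours, not from the notes; names such as `cg`, `apply_A`, and `lap` are illustrative). The operator enters only through the callable `apply_A`:

```python
import numpy as np

def cg(apply_A, b, u0, tol=1e-8, maxit=None):
    """CG for SPD A, where A is available only as a matrix-vector subroutine."""
    u = u0.copy()
    r = b - apply_A(u)
    p = r.copy()
    r2 = r @ r
    stop = tol * np.linalg.norm(b)
    if maxit is None:
        maxit = len(b)
    for _ in range(maxit):
        if np.sqrt(r2) < stop:
            break
        Ap = apply_A(p)
        alpha = r2 / (p @ Ap)      # alpha_k = ||r_k||^2 / (A p_k, p_k)
        u = u + alpha * p
        r = r - alpha * Ap         # r_{k+1} = r_k - alpha_k A p_k
        r2old = r2
        r2 = r @ r
        p = r + (r2 / r2old) * p   # p_{k+1} = r_{k+1} + beta_k p_k
    return u

# Usage: apply the 1D Laplacian stencil [-1 2 -1] without ever forming the matrix.
def lap(v):
    w = 2.0 * v
    w[:-1] -= v[1:]
    w[1:] -= v[:-1]
    return w

b = np.ones(20)
u = cg(lap, b, np.zeros(20), maxit=200)
```

Here `lap` plays the role of the subroutine replacing `A*p`; any function realizing an SPD operator would do.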
We explore the orthogonality and optimality due to the $A$-projection to the subspace $\mathbb V_k$. By definition of the $A$-projection, we have the orthogonality
$$(u - u_{k+1}, v)_A = 0 \quad \text{for all } v \in \mathbb V_k.$$
Consequently,
$$\|u - u_{k+1}\|_A = \min_{v \in u_0 + \mathbb V_k} \|u - v\|_A.$$
That is, the shortest distance measured in the $A$-norm is achieved by the $A$-orthogonal projection.
We will use the $A$-orthogonality to present a proof of Lemma 1.
Lemma 1. The residual $r_{k+1}$ is orthogonal to $\mathbb V_k$ and $A$-orthogonal to $\mathbb V_{k-1}$.
Proof. To simplify notation, we denote $\mathbb V_k$ by $\mathbb V$. We write $u - u_{k+1} = (u - u_0) - P_{\mathbb V}(u - u_0)$. Therefore $(u - u_{k+1}, v)_A = 0$ for all $v \in \mathbb V$, which is equivalent to $(r_{k+1}, v) = (A(u - u_{k+1}), v) = 0$ for all $v \in \mathbb V$. The first orthogonality is thus proved.
Since at every step the residual vector is added to expand the subspace, it can be easily proved by induction that
$$\mathbb V_k = \mathrm{span}\{r_0, r_1, \ldots, r_k\}.$$
By the recursive formula for the residual, we get $A p_i = (r_i - r_{i+1})/\alpha_i \in \mathbb V_k$ for $0 \le i \le k-1$. Since we have proved $r_{k+1} \perp \mathbb V_k$, we get $(r_{k+1}, p_i)_A = (r_{k+1}, A p_i) = 0$ for $0 \le i \le k-1$, i.e. $r_{k+1}$ is $A$-orthogonal to $\mathbb V_{k-1}$.
In view of $\mathbb V_k = \mathrm{span}\{r_0, r_1, \ldots, r_k\}$, we have the optimality
$$\|u - u_{k+1}\|_A = \min_{v \in u_0 + \mathrm{span}\{r_0, \ldots, r_k\}} \|u - v\|_A.$$
As $f(v) - f(u) = \frac{1}{2}\|v - u\|_A^2$, we can obtain another optimality
$$f(u_{k+1}) = \min_{v \in u_0 + \mathbb V_k} f(v).$$
We can also show
$$f(u_{k+1}) = \min_{\alpha \in \mathbb R} f(u_k + \alpha p_k).$$
Exercise. Let $\tilde u_{k+1} = u_k + \tilde\alpha_k p_k$ be computed by the exact line search $\tilde\alpha_k = \arg\min_{\alpha \in \mathbb R} f(u_k + \alpha p_k)$. Prove that
$$\tilde u_{k+1} = u_{k+1}.$$
That is, $u_{k+1}$ can be found by the steepest descent method starting from $u_k$ and moving along the searching direction $p_k$. From $\frac{d}{d\alpha} f(u_k + \alpha p_k) = 0$ we get another formula $\alpha_k = (r_k, p_k)/(A p_k, p_k)$.
Let $B$ be an SPD matrix. We apply the Gram–Schmidt process to the subspace
$$\mathbb V_{k+1} = \mathbb V_k + \mathrm{span}\{B r_{k+1}\}$$
to get another $A$-orthogonal basis. The three steps are still the same, except in the second step we add $B r_{k+1}$ instead of $r_{k+1}$.
Starting from an initial guess $u_0$ and $p_0 = B r_0$, for $k \ge 0$, the three-term recursion formulae are
$$u_{k+1} = u_k + \alpha_k p_k, \qquad \alpha_k = \frac{(B r_k, r_k)}{(A p_k, p_k)},$$
$$r_{k+1} = r_k - \alpha_k A p_k,$$
$$p_{k+1} = B r_{k+1} + \beta_k p_k, \qquad \beta_k = \frac{(B r_{k+1}, r_{k+1})}{(B r_k, r_k)}.$$
The proof of the following lemma is almost identical to that of Lemma 1.
Lemma 2. The residual $r_{k+1}$ is $B$-orthogonal to $\mathrm{span}\{r_0, \ldots, r_k\}$, and $B r_{k+1}$ is $A$-orthogonal to $\mathbb V_{k-1}$.
We present the algorithm of the preconditioned conjugate gradient method as follows.
function u = pcg(A,b,u,B,tol)
% PCG solves A*u = b by the preconditioned conjugate gradient method
% with an SPD preconditioner B.
tol = tol*norm(b);
k = 1;
r = b - A*u;
rho = 1;
while sqrt(rho) >= tol && k < length(b)
    Br = B*r;                 % preconditioned residual
    rho = r'*Br;              % rho_k = (B r_k, r_k)
    if k == 1
        p = Br;
    else
        beta = rho/rho_old;   % beta_k = rho_k/rho_{k-1}
        p = Br + beta*p;
    end
    Ap = A*p;
    alpha = rho/(p'*Ap);      % alpha_k = rho_k/(A p_k, p_k)
    u = u + alpha*p;
    r = r - alpha*Ap;
    rho_old = rho;
    k = k + 1;
end
Similarly, we collect several remarks on the implementation:
The preconditioner B only enters through the product B*r, which is crucial. The matrix B does not have to be formed explicitly. All we need is the matrix-vector multiplication, which can be replaced by a subroutine.
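A Python/NumPy sketch of PCG with a diagonal (Jacobi) preconditioner may help fix ideas (our sketch, not from the notes; the test matrix and the name `apply_B` are illustrative). As in the MATLAB listing, the preconditioner enters only through its action on the residual:

```python
import numpy as np

def pcg(A, b, u0, apply_B, tol=1e-8, maxit=200):
    """PCG for SPD A; the SPD preconditioner enters only via apply_B(r) ~ B @ r."""
    u = u0.copy()
    r = b - A @ u
    stop = tol * np.linalg.norm(b)
    rho_old, p = None, None
    for k in range(maxit):
        if np.linalg.norm(r) < stop:
            break
        Br = apply_B(r)
        rho = r @ Br                      # rho_k = (B r_k, r_k)
        if k == 0:
            p = Br
        else:
            p = Br + (rho / rho_old) * p  # beta_k = rho_k / rho_{k-1}
        Ap = A @ p
        alpha = rho / (p @ Ap)            # alpha_k = rho_k / (A p_k, p_k)
        u = u + alpha * p
        r = r - alpha * Ap
        rho_old = rho
    return u

# Usage: a badly scaled SPD matrix with the diagonal (Jacobi) preconditioner.
rng = np.random.default_rng(0)
n = 30
d = 10.0 ** rng.uniform(-1.5, 1.5, n)                    # uneven scales
M = rng.standard_normal((n, n))
A = np.diag(d) @ (M @ M.T + n * np.eye(n)) @ np.diag(d)  # SPD by construction
b = rng.standard_normal(n)
dinv = 1.0 / np.diag(A)
u = pcg(A, b, np.zeros(n), lambda r: dinv * r)
```

The lambda realizes $B = D^{-1}$ elementwise; replacing it with a multigrid cycle or an incomplete factorization solve changes nothing else in the loop.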
Recall that we have two bases for the subspace
$$\mathbb V_k = \mathrm{span}\{p_0, p_1, \ldots, p_k\} = \mathrm{span}\{r_0, r_1, \ldots, r_k\}.$$
Another basis is given by the following lemma.
Lemma 3. $\mathbb V_k = \mathrm{span}\{r_0, A r_0, A^2 r_0, \ldots, A^k r_0\}$.
Proof. We prove it by induction. For $k = 0$, it is trivial. Suppose the statement holds for $k$. We prove it holds for $k+1$ by noting that
$$r_{k+1} = r_k - \alpha_k A p_k \in \mathrm{span}\{r_0, A r_0, \ldots, A^{k+1} r_0\},$$
since $p_k \in \mathbb V_k = \mathrm{span}\{r_0, A r_0, \ldots, A^k r_0\}$.
The space $\mathrm{span}\{r_0, A r_0, \ldots, A^k r_0\}$ is called a Krylov subspace. The CG method belongs to a large class of Krylov subspace methods. In the sequel, we use $\mathcal P_k$ to denote the space of polynomials of degree at most $k$.
Theorem 1. Let $u_k$ be the $k$th iteration in the CG method with an initial guess $u_0$. Then
$$\|u - u_k\|_A = \min_{p \in \mathcal P_k,\ p(0)=1} \|p(A)(u - u_0)\|_A \le \min_{p \in \mathcal P_k,\ p(0)=1}\ \max_{\lambda \in \sigma(A)} |p(\lambda)|\, \|u - u_0\|_A.$$
Proof. For $v \in u_0 + \mathbb V_{k-1}$, it can be expanded as
$$v = u_0 + \sum_{i=0}^{k-1} c_i A^i r_0.$$
Let $q(x) = \sum_{i=0}^{k-1} c_i x^i$ and $p(x) = 1 - x\,q(x)$, so $p \in \mathcal P_k$ with $p(0) = 1$. Then
$$u - v = (u - u_0) - q(A) r_0 = (I - q(A) A)(u - u_0) = p(A)(u - u_0).$$
The identity then follows from the optimality $\|u - u_k\|_A = \min_{v \in u_0 + \mathbb V_{k-1}} \|u - v\|_A$.
Since $p(A)$ is symmetric in the $(\cdot, \cdot)_A$-inner product, we have
$$\|p(A)(u - u_0)\|_A \le \max_{\lambda \in \sigma(A)} |p(\lambda)|\, \|u - u_0\|_A,$$
which leads to the estimate.
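The spectral bound used in this estimate can be seen by expanding in an orthonormal eigenbasis $\{\phi_i\}$ of $A$ with $A\phi_i = \lambda_i \phi_i$: writing $v = \sum_i c_i \phi_i$,
$$\|p(A) v\|_A^2 = \sum_i \lambda_i\, p(\lambda_i)^2 c_i^2 \le \max_{\lambda \in \sigma(A)} p(\lambda)^2 \sum_i \lambda_i c_i^2 = \max_{\lambda \in \sigma(A)} |p(\lambda)|^2\, \|v\|_A^2.$$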
The polynomial $p$ with the constraint $p(0) = 1$ will be called a residual polynomial. Various convergence results of the CG method can be obtained by choosing specific residual polynomials.
Corollary 1. Let $\{\phi_i\}_{i=1}^N$ be an orthonormal eigenvector basis of $A$. Let $u - u_0 = \sum_{i \in I} c_i \phi_i$ with $|I| = k$. Then the CG method with initial guess $u_0$ will find the solution in at most $k$ iterations. In particular, for a given $N \times N$ SPD matrix $A$, the CG algorithm will find the solution within $N$ iterations. Namely CG can be viewed as a direct method.
Proof. Denote by $\mathbb E = \mathrm{span}\{\phi_i,\ i \in I\}$. Let $v = \sum_{i \in I} a_i \phi_i \in \mathbb E$. Then $Av = \sum_{i \in I} \lambda_i a_i \phi_i \in \mathbb E$. Namely if $v \in \mathbb E$, so is $Av$. Note that here the eigenvectors are introduced for the analysis and are unknown in the algorithm.
The subspace $\mathbb E$ is invariant under the action of $A$, i.e. $A\mathbb E \subseteq \mathbb E$. Therefore, starting from $r_0 = A(u - u_0) \in \mathbb E$, the Krylov space $\mathbb V_j \subseteq \mathbb E$ for all $j$. If $r_{j+1} \ne 0$, the dimension increases by 1. So after at most $k$ iterations, $\mathbb V_{k-1} = \mathbb E$, and the projection of $u - u_0 \in \mathbb E$ to $\mathbb E$ is itself, i.e. $u_k = u$.
Alternatively, choose the polynomial $p(x) = \prod_{i \in I}(1 - x/\lambda_i)$. Then $p$ is a residual polynomial of order at most $k$ with $p(0) = 1$, and
$$p(A)(u - u_0) = \sum_{i \in I} p(\lambda_i) c_i \phi_i = 0.$$
The assertion then follows from the identity in Theorem 1.
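The finite-termination property is easy to observe numerically. Below is a small NumPy check (ours, not from the notes): we build an SPD matrix with only three distinct eigenvalues and run exactly three CG iterations.

```python
import numpy as np

# A is SPD with only 3 distinct eigenvalues {1, 2, 3}; by the corollary, CG
# started from u_0 = 0 finds the solution in at most 3 iterations (up to round-off).
rng = np.random.default_rng(1)
n = 60
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
lam = np.repeat([1.0, 2.0, 3.0], n // 3)
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)

u = np.zeros(n)
r = b.copy()
p = r.copy()
r2 = r @ r
for _ in range(3):                 # exactly 3 CG iterations
    Ap = A @ p
    alpha = r2 / (p @ Ap)
    u = u + alpha * p
    r = r - alpha * Ap
    r2old, r2 = r2, r @ r
    p = r + (r2 / r2old) * p

rel_res = np.linalg.norm(A @ u - b) / np.linalg.norm(b)
```

After three steps the relative residual is at round-off level, even though $n = 60$.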
Remark 2. The CG method can also be applied to a symmetric and positive semi-definite matrix $A$. Let $\{\phi_i\}_{i \in I}$ be the eigenvectors associated to the nonzero eigenvalues of $A$. Then, from Corollary 1, if $b \in \mathrm{span}\{\phi_i,\ i \in I\} = \mathrm{range}(A)$, the CG method with $u_0 = 0$ will find a solution within $|I|$ iterations.
CG was invented as a direct method but is more effective when used as an iterative method. The rate of convergence depends crucially on the distribution of the eigenvalues of $A$, and the iteration may converge to the solution within a given tolerance in far fewer than $N$ steps.
Theorem 2. Let $u_k$ be the $k$-th iteration of the CG method with initial guess $u_0$, and let $\kappa = \kappa(A)$. Then
$$\|u - u_k\|_A \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k \|u - u_0\|_A.$$
Proof. Let $\lambda_{\min}$ and $\lambda_{\max}$ be the extreme eigenvalues of $A$ and $\kappa = \lambda_{\max}/\lambda_{\min}$. Choose
$$p_k(x) = T_k\!\left(\frac{\lambda_{\max} + \lambda_{\min} - 2x}{\lambda_{\max} - \lambda_{\min}}\right) \Big/\, T_k\!\left(\frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}}\right),$$
where $T_k$ is the Chebyshev polynomial of degree $k$ given by
$$T_k(t) = \cos(k \arccos t) \ \text{ for } |t| \le 1, \qquad T_k(t) = \frac{1}{2}\left[\left(t + \sqrt{t^2 - 1}\right)^k + \left(t - \sqrt{t^2 - 1}\right)^k\right] \ \text{ for } |t| \ge 1.$$
It is easy to see that $p_k(0) = 1$, and if $x \in [\lambda_{\min}, \lambda_{\max}]$, then
$$\left|T_k\!\left(\frac{\lambda_{\max} + \lambda_{\min} - 2x}{\lambda_{\max} - \lambda_{\min}}\right)\right| \le 1.$$
Hence
$$\max_{\lambda \in \sigma(A)} |p_k(\lambda)| \le \left[T_k\!\left(\frac{\kappa + 1}{\kappa - 1}\right)\right]^{-1}.$$
We set
$$\frac{\kappa + 1}{\kappa - 1} = \frac{1}{2}\left(s + \frac{1}{s}\right), \qquad s > 1.$$
Solving this equation for $s$, we have
$$s = \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1},$$
with $t + \sqrt{t^2 - 1} = s$ for $t = (\kappa + 1)/(\kappa - 1)$. We then obtain
$$T_k\!\left(\frac{\kappa + 1}{\kappa - 1}\right) = \frac{1}{2}\left(s^k + s^{-k}\right) \ge \frac{1}{2}\left(\frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}\right)^k,$$
which, combined with Theorem 1, completes the proof.
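The bound of Theorem 2 can be checked numerically. The NumPy sketch below (ours, not from the notes) runs CG on a diagonal matrix with $\kappa = 100$ and verifies the $A$-norm error bound at every step:

```python
import numpy as np

# Check ||u - u_k||_A <= 2 ((sqrt(kappa)-1)/(sqrt(kappa)+1))^k ||u - u_0||_A
# on a diagonal SPD matrix with kappa = 100.
rng = np.random.default_rng(2)
n = 40
lam = np.linspace(1.0, 100.0, n)      # eigenvalues of A = diag(lam)
b = rng.standard_normal(n)
u_exact = b / lam
kappa = lam[-1] / lam[0]
rate = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)

def err_A(u):                         # A-norm of the error u_exact - u
    e = u_exact - u
    return np.sqrt(e @ (lam * e))

u = np.zeros(n)
r = b.copy()
p = r.copy()
r2 = r @ r
e0 = err_A(u)
ok = True
for k in range(1, 16):
    Ap = lam * p                      # A is diagonal, so A @ p = lam * p
    alpha = r2 / (p @ Ap)
    u = u + alpha * p
    r = r - alpha * Ap
    r2old, r2 = r2, r @ r
    p = r + (r2 / r2old) * p
    ok = ok and err_A(u) <= 2.0 * rate**k * e0 + 1e-12
```

In practice the actual error decays noticeably faster than the Chebyshev bound, since the bound only uses the extreme eigenvalues.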
The estimate shows that CG is in general better than the gradient descent method. Furthermore, if the condition number $\kappa(A)$ is close to one, the CG iteration converges very fast. Even if $\kappa(A)$ is large, the iteration performs well if the eigenvalues are clustered in a few small intervals; see the following estimate.
Corollary 2. Assume that $\sigma(A) = \sigma_0(A) \cup \sigma_1(A)$ with $\sigma_1(A) \subset [a, b]$, and $m$ is the number of elements in $\sigma_0(A)$. Then
$$\|u - u_k\|_A \le 2M \left(\frac{\sqrt{b/a} - 1}{\sqrt{b/a} + 1}\right)^{k - m} \|u - u_0\|_A,$$
where
$$M = \max_{\lambda \in \sigma_1(A)} \prod_{\mu \in \sigma_0(A)} \left|1 - \frac{\lambda}{\mu}\right|.$$
To perform the convergence analysis of PCG, we let $\tilde A = BA$. Then one can easily verify that PCG for $Au = b$ is equivalent to CG applied to the preconditioned equation $\tilde A u = Bb$. Since the optimality still holds, and $\tilde A$ is symmetric in both the $(\cdot, \cdot)_A$- and $(\cdot, \cdot)_{B^{-1}}$-inner products, we obtain the following convergence rate of PCG.
Theorem 3. Let $A$ and $B$ be SPD and let $u_k$ be the $k$th iteration in the PCG method with the preconditioner $B$ and an initial guess $u_0$. Then
$$\|u - u_k\|_A \le 2\left(\frac{\sqrt{\kappa(BA)} - 1}{\sqrt{\kappa(BA)} + 1}\right)^k \|u - u_0\|_A.$$
The SPD matrix $B$ is called the preconditioner. A good preconditioner should have the properties that the action of $B$ is easy to compute and that $\kappa(BA)$ is significantly smaller than $\kappa(A)$. The design of a good preconditioner requires knowledge of the problem, and finding a good preconditioner is a central topic of scientific computing.
An interesting fact is that the iterative method induced by $B$ may not be convergent while $B$ can still be a good preconditioner. For example, the Jacobi method is not convergent for all SPD systems, but $B = D^{-1}$, the inverse of the diagonal of $A$, can always be used as a preconditioner; it is often known as the diagonal preconditioner. Another popular preconditioner is obtained by an incomplete Cholesky factorization of $A$.
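The effect of the diagonal preconditioner is easy to see numerically. In the NumPy sketch below (ours, not from the notes), a well-conditioned SPD matrix is spoiled by bad row/column scaling; Jacobi preconditioning essentially undoes the scaling. We compare $\kappa(A)$ with the condition number of the symmetrized matrix $D^{-1/2} A D^{-1/2}$, which is similar to $D^{-1}A$ and hence has the same spectrum:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)            # well-conditioned SPD "core"
d = 10.0 ** rng.uniform(-2.0, 2.0, n)  # bad row/column scaling
A = np.diag(d) @ S @ np.diag(d)        # badly scaled SPD matrix

Dinv_half = np.diag(1.0 / np.sqrt(np.diag(A)))
A_prec = Dinv_half @ A @ Dinv_half     # similar to D^{-1} A

kA = np.linalg.cond(A)                 # huge, driven by the scaling
kBA = np.linalg.cond(A_prec)           # back to a modest value
```

By Theorem 3, the PCG convergence rate is governed by the much smaller $\kappa(BA)$.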