5 Regression Models
We’ve studied several types of function and seen how to spot whether a given data set might suit a particular model. To take this analysis further, we need a way of measuring how well, or how badly, a particular model fits given data.
5.1 Best-fitting Lines and Linear Regression
We start with an example of some data which appears reasonably linear.
Example 5.1. At $t$ p.m., a trail-runner’s GPS locator says that they’ve travelled $y$ miles along a trail:

    t_i :  1   2   3   5
    y_i :  4   8  10  21

We’d like a simple model for how far the runner has travelled as a function of $t$. We might use this to predict where they would be at a given time; say at 6 p.m., or at 2 p.m. if they were to attempt the trail on another day.

[Figure: scatter plot of the data $(t_i, y_i)$.]

By plotting the points, the relationship looks to be approximately linear:¹² $y \approx mt + c$. What is the best choice of line, and how should we find the coefficients $m, c$?
What might be good criteria for choosing our line? What should we mean by best? Plainly, we
want the points to be close to the line, but measured how? What use do we want to make of the
approximating line?
Here are three candidate lines plotted with the data set: of the choices, which seems best and why?

[Figure: three plots of the data, with the candidate lines $y = 4t$, $y = 2t + 4$ and $y = 5t - 4$ respectively.]
Since we want our model to predict the hiker’s location $\hat y = mt + c$ at a given time $t$, we’d like our model to minimize the vertical errors $\hat y_i - y_i$. We’ve computed these in the table; since a positive error is as bad as a negative, we make all the errors positive. It therefore seems reasonable to claim that the first line is the best choice of the three. But can we do better?

    t_i                            1   2   3   5
    y_i                            4   8  10  21
    y = 4t:      |ŷ_i − y_i|       0   0   2   1
    y = 2t + 4:  |ŷ_i − y_i|       2   0   0   7
    y = 5t − 4:  |ŷ_i − y_i|       3   2   1   0
¹² Why should we not expect the distance traveled by the hiker to be perfectly linear?
We need a sensible definition of best-fitting line for a given data set. One possibility is to minimize the
sum of the vertical errors:
$$\sum_{i=1}^{n} |\hat y_i - y_i|$$
For reasons of computational simplicity, uniqueness, statistical interpretation, and to discourage
large individual errors, we don’t do this! The standard approach is instead to minimize the sum of
the squared errors.
Definition 5.2. Let $(t_i, y_i)$ be data points with at least two distinct $t$-values. Let $\hat y = mt + c$ be a linear predictor (model) for $y$ given $t$.
The $i$th error in the model is the difference $e_i := \hat y_i - y_i = mt_i + c - y_i$.
The regression line or best-fitting least-squares line is the function $\hat y = mt + c$ which minimizes the sum $S := \sum e_i^2 = \sum (\hat y_i - y_i)^2$ of the squares of the errors.

Having at least two distinct $t$-values (some $t_i \neq t_j$) is necessary for the regression line to be unique.
Example (5.1, cont). Suppose the predictor was $\hat y = mt + c$. We expand the table:

    t_i     1            2            3             5
    y_i     4            8            10            21
    ŷ_i     m + c        2m + c       3m + c        5m + c
    e_i     m + c − 4    2m + c − 8   3m + c − 10   5m + c − 21

Our goal is to minimize the function
$$S(m, c) = \sum e_i^2 = (m + c - 4)^2 + (2m + c - 8)^2 + (3m + c - 10)^2 + (5m + c - 21)^2$$
This is easy to deal with if we invoke some calculus. If $(m, c)$ minimizes $S(m, c)$, then the first derivative test says that the (partial) derivatives of $S$ must be zero.

Keep $c$ constant and differentiate with respect to $m$:
$$\frac{\partial S}{\partial m} = 2(m + c - 4) + 4(2m + c - 8) + 6(3m + c - 10) + 10(5m + c - 21) = 2\bigl[39m + 11c - 155\bigr]$$

Keep $m$ constant and differentiate with respect to $c$:
$$\frac{\partial S}{\partial c} = 2\bigl[(m + c - 4) + (2m + c - 8) + (3m + c - 10) + (5m + c - 21)\bigr] = 2\bigl[11m + 4c - 43\bigr]$$

The regression line is found by solving a pair of simultaneous equations:
$$\begin{cases} 39m + 11c = 155 \\ 11m + 4c = 43 \end{cases} \implies m = \frac{21}{5},\ c = -\frac{4}{5} \implies \hat y = \frac{1}{5}(21t - 4)$$
By 6 p.m., we predict that the runner would have covered 24.4 miles. The sum of the squared errors for our regression line is $\sum e_i^2 = \sum |\hat y_i - y_i|^2 = 4.4$, compared to 5, 53 and 14 for our earlier options.
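If you would like to check this kind of calculus by machine, here is a minimal sketch using Python’s sympy library; the data and the expected solution come from the example above, and the variable names are ours:

    import sympy as sp

    m, c = sp.symbols('m c')
    data = [(1, 4), (2, 8), (3, 10), (5, 21)]

    # Sum of squared errors S(m, c) for the predictor y-hat = m*t + c
    S = sum((m*t + c - y)**2 for t, y in data)

    # First derivative test: set both partial derivatives to zero and solve
    solution = sp.solve([sp.diff(S, m), sp.diff(S, c)], [m, c])
    print(solution)   # expected, as in the text: {m: 21/5, c: -4/5}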
To obtain the general result for $n$ data points, we return to our computations of the partial derivatives:
$$\frac{\partial S}{\partial m} = \frac{\partial}{\partial m}\sum (mt_i + c - y_i)^2 = 2\sum t_i(mt_i + c - y_i) = 2\Bigl[\Bigl(\sum t_i^2\Bigr)m + \Bigl(\sum t_i\Bigr)c - \sum t_i y_i\Bigr]$$
$$\frac{\partial S}{\partial c} = \frac{\partial}{\partial c}\sum (mt_i + c - y_i)^2 = 2\sum (mt_i + c - y_i) = 2\Bigl[\Bigl(\sum t_i\Bigr)m + nc - \sum y_i\Bigr]$$
These sums are often written using a short-hand notation for average:
$$\overline{t} = \frac{1}{n}\sum_{i=1}^n t_i, \qquad \overline{t^2} = \frac{1}{n}\sum_{i=1}^n t_i^2, \qquad \overline{y} = \frac{1}{n}\sum_{i=1}^n y_i, \qquad \overline{ty} = \frac{1}{n}\sum_{i=1}^n t_i y_i$$
Theorem 5.3 (Linear Regression). Given $n$ data points $(t_i, y_i)$ with at least two distinct $t$-values, the best-fitting least-squares line has equation $\hat y = mt + c$, where $m, c$ satisfy
$$\begin{cases} \bigl(\sum t_i^2\bigr)m + \bigl(\sum t_i\bigr)c = \sum t_i y_i \\ \bigl(\sum t_i\bigr)m + nc = \sum y_i \end{cases} \iff \begin{cases} \overline{t^2}\,m + \overline{t}\,c = \overline{ty} \\ \overline{t}\,m + c = \overline{y} \end{cases}$$
This is a pair of simultaneous equations for the coefficients $m, c$, with solution
$$m = \frac{\overline{ty} - \overline{t}\,\overline{y}}{\overline{t^2} - \overline{t}^2}, \qquad c = \overline{y} - m\,\overline{t}$$
As the next section shows, having two distinct $t$-values guarantees a non-zero denominator $\overline{t^2} - \overline{t}^2$. The expression for $c$ shows that the regression line passes through the data’s center of mass $(\overline{t}, \overline{y})$.
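As a quick sanity check, here is a minimal sketch in plain Python of the formulas in the theorem (the function name regression_line is ours, not the text’s):

    def regression_line(ts, ys):
        """Least-squares line y-hat = m*t + c via the averages in Theorem 5.3.
        Assumes at least two distinct t-values, so the denominator is non-zero."""
        n = len(ts)
        t_bar = sum(ts) / n
        y_bar = sum(ys) / n
        t2_bar = sum(t * t for t in ts) / n
        ty_bar = sum(t * y for t, y in zip(ts, ys)) / n
        m = (ty_bar - t_bar * y_bar) / (t2_bar - t_bar ** 2)
        c = y_bar - m * t_bar
        return m, c

    # Example 5.1: expect m = 4.2, c = -0.8
    print(regression_line([1, 2, 3, 5], [4, 8, 10, 21]))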
Example 5.4. Five students’ scores on two quizzes are given. If a student scores 9/10 on the first quiz, what might we expect them to score on the second?

    Quiz 1:   8  10   6   7   4
    Quiz 2:  10   7   5   8   6

To put the question in standard form, suppose Quiz 1 is the $t$-data and Quiz 2 the $y$-data. It is helpful to rewrite the data and add lines to the table so that we may more easily compute everything.

               Data                     Sum   Average
    t_i         8   10    6    7    4    35      7
    y_i        10    7    5    8    6    36      7.2
    t_i^2      64  100   36   49   16   265     53
    t_i y_i    80   70   30   56   24   260     52

[Figure: scatter plot of (Q1, Q2) with the regression line; dashed lines mark the prediction $\hat y(9) = 8$.]

$$m = \frac{52 - 7 \times 7.2}{53 - 7^2} = \frac{1.6}{4} = 0.4, \qquad c = 7.2 - 0.4 \times 7 = 4.4 \implies \hat y(t) = \frac{2}{5}(t + 11)$$
This is the line which minimizes the sum of the squares of the vertical deviations. The prediction is that the hypothetical student scores $\hat y(9) = \frac{2}{5}\cdot 20 = 8$ on Quiz 2. Note that the predictor isn’t symmetric: if we reverse the roles of $t, y$ we don’t get the same line!
Exercises 5.1.
1. Compute the sum of the absolute errors $\sum |\hat y_i - y_i|$ for the regression line and compare it to the sum of the absolute errors for $\hat y = 4t$: what do you notice?
2. Let $\hat y = mt + c$ be a linear predictor for the given data.

       t_i : 0  1  2  3
       y_i : 1  2  2  3

   (a) Compute the sum of squared errors $S(m, c) = \sum e_i^2 = \sum |\hat y_i - y_i|^2$ as a function of $m$ and $c$.
   (b) Compute the partial derivatives $\frac{\partial S}{\partial m}$ and $\frac{\partial S}{\partial c}$.
   (c) Find $m$ and $c$ by setting both partial derivatives to zero; hence find the equation of the regression line for these data.
   (d) Compare the sum of squared errors $S$ for the regression line with the errors if we use the simple predictor $y(t) = 1 + \frac{2}{3}t$ which passes through the first and last data points.
3. Consider Example 5.4.
   (a) Compute the sum of squared errors $S = \sum e_i^2 = \sum |\hat y_i - y_i|^2$ for the regression line.
   (b) Suppose a student was expected to score exactly the same on both quizzes; the predictor would be $\hat y = t$. What would the sum of squared errors be in this case?
   (c) If a student scores 8/10 on Quiz 2, use linear regression to predict their score on Quiz 1. (Warning: the answer is NOT $\frac{5}{2}\cdot 8 - 11 = 9$. . . )
4. Ten children had their heights (inches) measured on their first and second birthdays. The data was as follows.

       1st birthday: 28 28 29 29 29 30 30 32 32 33
       2nd birthday: 30 32 31 34 35 33 36 37 35 37

   Given this data, find a regression model and use it to predict the height at 2 years of a child who measures 32 inches at age 1.
   (It is acceptable—and encouraged!—to use a spreadsheet to find the necessary ingredients. You can do this by hand if you like, but the numbers are large; it is easier with some formulæ from the next section.)
5. (a) Let $a, b$ be given. Find the value of $y$ which minimizes the sum of squares $(y - a)^2 + (y - b)^2$.
   (b) For the data set $\{(t, y)\} = \{(1, 1), (2, 1), (2, 3)\}$, find the unique least-squares linear model for predicting $y$ given $t$. (Hint: think about part (a) if you don’t want to compute)
   (c) Show that there are infinitely many lines $\hat y = mt + c$ which minimize the sum of the absolute errors $\sum_{i=1}^{3} |\hat y_i - y_i|$.
5.2 The Coefficient of Determination
In the sense that it minimizes the sum of the squared errors $S = \sum e_i^2$, the linear regression model is as good as it can be—but how good? We could use $S$ as a quantitative measure of the model’s accuracy, but it doesn’t do a good job at comparing the accuracy of models for different data sets. The standard approach to this problem relies on the concept of variance.
Definition 5.5. The variance of a data sequence $(y_1, \ldots, y_n)$ is the average of the squared deviations from their mean $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$,
$$\operatorname{Var} y := \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \overline{y}\bigr)^2$$
The standard deviation is $\sigma_y := \sqrt{\operatorname{Var} y}$.
Variance and standard-deviation are measures of how data deviates from being constant.
Example 5.6. Suppose $(y_i) = (1, 2, 5, 4)$. Then
$$\overline{y} = \frac{1}{4}(1 + 2 + 5 + 4) = 3, \qquad \operatorname{Var} y = \frac{1}{4}\bigl[(-2)^2 + (-1)^2 + 2^2 + 1^2\bigr] = \frac{5}{2}, \qquad \sigma_y = \frac{\sqrt{10}}{2}$$
The square-root means that $\sigma_y$ has the same units as $y$. Loosely speaking, a typical data value is expected to lie approximately $\sigma_y = \frac{1}{2}\sqrt{10} \approx 1.58$ from the mean $\overline{y} = 3$.
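These computations are easy to automate; a minimal sketch in plain Python (the function name variance is ours):

    from math import sqrt

    def variance(xs):
        """Average squared deviation from the mean (Definition 5.5)."""
        x_bar = sum(xs) / len(xs)
        return sum((x - x_bar) ** 2 for x in xs) / len(xs)

    ys = [1, 2, 5, 4]
    print(variance(ys), sqrt(variance(ys)))   # expect 2.5 and roughly 1.58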
To obtain a measure for how well a regression line fits given data $(t_i, y_i)$, we ask what fraction of the variance in $y$ is explained by the model.

Definition 5.7. The coefficient of determination of a model $\hat y = mt + c$ is the ratio
$$R^2 := \frac{\operatorname{Var}\hat y}{\operatorname{Var} y}$$
Examples 5.8. We start by considering two extreme examples.
1. If the data were perfectly linear, then $y_i = mt_i + c$ for all $i$. The regression line is therefore $\hat y = mt + c$ and the coefficient of determination is precisely $R^2 = \frac{\operatorname{Var} y}{\operatorname{Var} y} = 1$. All the variance in the output $y$ is explained by the model’s transfer of the variance in the input $t$.
2. By contrast, consider the data in the table where we work out all necessary details to find the regression line:

       data                   average
       t_i       0  0  2  2   $\overline{t} = 1$
       y_i       1  3  1  3   $\overline{y} = 2$
       t_i^2     0  0  4  4   $\overline{t^2} = 2$
       t_i y_i   0  0  2  6   $\overline{ty} = 2$

   $$m = \frac{\overline{ty} - \overline{t}\,\overline{y}}{\overline{t^2} - \overline{t}^2} = 0, \qquad c = \overline{y} - m\,\overline{t} = 2$$
   The regression line is the constant $\hat y \equiv 2$, whence $\hat y$ has no variance and the coefficient of determination is $R^2 = 0$. In this example, the regression model doesn’t help explain the $y$-data in any way: the $t$-values have no obvious impact on the $y$-values.
In fact, the coefficient of determination always lies somewhere between these extremes: $0 \leq R^2 \leq 1$. Exercise 6 demonstrates this and that the extreme situations are essentially those just encountered; in practice, therefore, $0 < R^2 < 1$. Before we revisit our examples from the previous section, observe that the average of the model’s outputs $\hat y_i$ is the same as that of the original data:
$$\frac{1}{n}\sum_{i=1}^n \hat y_i = \frac{1}{n}\sum_{i=1}^n (mt_i + c) = m\overline{t} + c = \overline{y}$$
This makes computing the variance of $\hat y$ a breeze!
Example 5.1. Recall that $\hat y = \frac{1}{5}(21t - 4)$. Everything necessary is in the table:

       data                          average
       t_i     1    2    3     5     $\overline{t} = 2.75$
       y_i     4    8    10    21    $\overline{y} = 10.75$
       ŷ_i     3.4  7.6  11.8  20.2  $\overline{\hat y} = 10.75$

$$\operatorname{Var} y = \frac{6.75^2 + 2.75^2 + 0.75^2 + 10.25^2}{4} = 39.6875, \qquad \operatorname{Var}\hat y = \frac{7.35^2 + 3.15^2 + 1.05^2 + 9.45^2}{4} = 38.5875$$

from which $R^2 = \frac{\operatorname{Var}\hat y}{\operatorname{Var} y} = \frac{3087}{3175} \approx 97.23\%$. The interpretation here is that the data is very close to being linear; the output $y_i$ is very closely approximated by the regression model, with approximately 97% of its variance explained by the model.
Example 5.4. This time $\hat y = \frac{2}{5}(t + 11)$.

       data                             average
       t_i     8    10   6    7    4    $\overline{t} = 7$
       y_i     10   7    5    8    6    $\overline{y} = 7.2$
       ŷ_i     7.6  8.4  6.8  7.2  6    $\overline{\hat y} = 7.2$

$$\operatorname{Var} y = \frac{2.8^2 + 0.2^2 + 2.2^2 + 0.8^2 + 1.2^2}{5} = 2.96, \qquad \operatorname{Var}\hat y = \frac{0.4^2 + 1.2^2 + 0.4^2 + 0^2 + 1.2^2}{5} = 0.64$$

from which $R^2 = \frac{\operatorname{Var}\hat y}{\operatorname{Var} y} = \frac{8}{37} \approx 21.62\%$. In this case the coefficient of determination is small, which indicates that the model does not explain much of the variation in the output.
The four examples are plotted below for easy visual comparison between the $R^2$-values.

[Figure: four plots, labelled “Perfect model $R^2 = 1$”, “Useless model $R^2 = 0$”, “Good model $R^2 = 0.97$” and “Poor model $R^2 = 0.22$”.]
Efficient computation of $R^2$. If you want to compute by hand, our current process is lengthy and awkward. To obtain a more efficient alternative we first consider an alternative expression for the variance of any collection of data:
$$\operatorname{Var} x = \frac{1}{n}\sum (x_i - \overline{x})^2 = \frac{1}{n}\sum x_i^2 - \frac{2\overline{x}}{n}\sum x_i + \overline{x}^2 = \overline{x^2} - \overline{x}^2$$
Plainly $\operatorname{Var} x \geq 0$, with equality if and only if all data values $x_i$ are equal. The alternative expression $\overline{x^2} - \overline{x}^2$ justifies the uniqueness of the regression line in Definition 5.2 and Theorem 5.3.
Now expand the variance of the predicted outputs:
$$\operatorname{Var}\hat y = \frac{1}{n}\sum (\hat y_i - \overline{y})^2 = \frac{1}{n}\sum \bigl(mt_i + c - (m\overline{t} + c)\bigr)^2 = \frac{m^2}{n}\sum (t_i - \overline{t})^2 = m^2 \operatorname{Var} t$$
Putting these together, we obtain several equivalent expressions for the coefficient of determination:
$$R^2 = \frac{\operatorname{Var}\hat y}{\operatorname{Var} y} = m^2\,\frac{\operatorname{Var} t}{\operatorname{Var} y} = m^2\,\frac{\overline{t^2} - \overline{t}^2}{\overline{y^2} - \overline{y}^2} = \frac{\bigl(\overline{ty} - \overline{t}\,\overline{y}\bigr)^2}{\bigl(\overline{t^2} - \overline{t}^2\bigr)\bigl(\overline{y^2} - \overline{y}^2\bigr)} \qquad (*)$$
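The final expression $(*)$ needs only the five averages $\overline{t}, \overline{y}, \overline{t^2}, \overline{y^2}, \overline{ty}$. A minimal sketch in plain Python (the function name r_squared is ours):

    def r_squared(ts, ys):
        """Coefficient of determination via the shortcut formula (*)."""
        n = len(ts)
        t_bar = sum(ts) / n
        y_bar = sum(ys) / n
        t2_bar = sum(t * t for t in ts) / n
        y2_bar = sum(y * y for y in ys) / n
        ty_bar = sum(t * y for t, y in zip(ts, ys)) / n
        return (ty_bar - t_bar * y_bar) ** 2 / ((t2_bar - t_bar ** 2) * (y2_bar - y_bar ** 2))

    print(r_squared([1, 2, 3, 5], [4, 8, 10, 21]))        # Example 5.1: roughly 0.9723
    print(r_squared([8, 10, 6, 7, 4], [10, 7, 5, 8, 6]))  # Example 5.4: roughly 0.2162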
Example 5.9. We do one more easy example with simple data $(t_i, y_i)$: $(1, 4), (2, 1), (3, 2), (4, 0)$.

       data                      average
       t_i       1   2   3   4   $\overline{t} = \frac{10}{4}$
       y_i       4   1   2   0   $\overline{y} = \frac{7}{4}$
       t_i^2     1   4   9   16  $\overline{t^2} = \frac{15}{2}$
       y_i^2     16  1   4   0   $\overline{y^2} = \frac{21}{4}$
       t_i y_i   4   2   6   0   $\overline{ty} = 3$

$$m = \frac{\overline{ty} - \overline{t}\,\overline{y}}{\overline{t^2} - \overline{t}^2} = \frac{3 - \frac{70}{4^2}}{\frac{15}{2} - \frac{100}{4^2}} = -\frac{11}{10} = -1.1, \qquad c = \overline{y} - m\,\overline{t} = \frac{7}{4} + \frac{11 \cdot 10}{10 \cdot 4} = \frac{9}{2} = 4.5$$

[Figure: the data with the regression line $\hat y = -1.1t + 4.5$; $R^2 = \frac{121}{175} \approx 0.69$ and $\sum e_i^2 = 2.7$.]

The regression line is $\hat y = -\frac{11}{10}t + \frac{9}{2} = -1.1t + 4.5$, and the coefficient of determination is
$$R^2 = m^2\,\frac{\overline{t^2} - \overline{t}^2}{\overline{y^2} - \overline{y}^2} = \frac{121}{100}\cdot\frac{\frac{15}{2} - \frac{100}{4^2}}{\frac{21}{4} - \frac{49}{4^2}} = \frac{121}{100}\cdot\frac{20}{35} = \frac{121}{175} \approx 69.1\%$$
The minimized square error is also easily computed:
$$\sum e_i^2 = \sum (\hat y_i - y_i)^2 = (3.4 - 4)^2 + (2.3 - 1)^2 + (1.2 - 2)^2 + (0.1 - 0)^2 = 2.7$$
Reversion to the Mean & Correlation. By $(*)$, the regression model may be re-written in terms of the standard-deviation and $R^2$:
$$\hat y(t) = mt + c = \overline{y} + m(t - \overline{t}) = \overline{y} \pm \sqrt{R^2}\,\frac{\sigma_y}{\sigma_t}(t - \overline{t}) \implies \hat y(\overline{t} + \lambda\sigma_t) = \overline{y} \pm \lambda\sqrt{R^2}\,\sigma_y$$

Definition 5.10. The correlation coefficient is the value $r := \pm\sqrt{R^2}$ (sign equal to that of $m$).

An input $\lambda$ standard-deviations above the mean ($t = \overline{t} + \lambda\sigma_t$) results in a prediction $\lambda r$ standard-deviations above the mean ($\hat y = \overline{y} + \lambda r\sigma_y$). Unless the data is perfectly linear, we have $R^2 < 1$; relative to the ‘neutral’ measure given by the standard-deviation, a prediction $\hat y(t)$ is closer to the mean than the input $t$:
$$\frac{|\hat y(t) - \overline{y}|}{\sigma_y} = |r|\,\frac{|t - \overline{t}|}{\sigma_t} < \frac{|t - \overline{t}|}{\sigma_t}$$
Example (5.9, cont). We compute the details. The correlation coefficient is $r = -\sqrt{R^2} \approx -0.832$; we say that the data is negatively correlated, since the output $y$ seems to decrease as $t$ increases. The standard deviations may be read off from the table:
$$\sigma_t = \sqrt{\operatorname{Var} t} = \sqrt{\overline{t^2} - \overline{t}^2} = \frac{\sqrt{5}}{2} \approx 1.118, \qquad \sigma_y = \sqrt{\operatorname{Var} y} = \sqrt{\overline{y^2} - \overline{y}^2} = \frac{\sqrt{35}}{4} \approx 1.479$$
The predictor may therefore be written (approximately)
$$\hat y(\overline{t} + \lambda\sigma_t) = \hat y(2.5 + 1.12\lambda) = \overline{y} + \lambda r\sigma_y = 1.75 - 1.23\lambda$$
As a sanity check,
$$\hat y(2.5 + 1.12) = \hat y(3.62) = -1.1 \times 3.62 + 4.5 \approx 0.52 = 1.75 - 1.23$$
Weaknesses of Linear Regression. There are two obvious issues:
• Outliers massively influence the regression line. Dealing with this problem is complicated and there are a variety of approaches that can be used. It is important to remember that any approach to modelling, including our regression model, requires some subjective choice.
• If the data is not very linear then the regression model will produce a weak predictor. There are several ways around this, as we’ll see in the remaining sections: higher-degree polynomial regression can be performed, and data sometimes becomes more linear after some manipulation, say by an exponential or logarithmic function.
Exercises 5.2.
1. Suppose $(z_i) = (2, 4, 10, 8)$ is double the data set in Example 5.6. Find $\overline{z}$, $\operatorname{Var} z$ and $\sigma_z$. Why are you not surprised?
2. Use a spreadsheet to find $R^2$ for the predictor in Exercise 5.1.4. How confident do you feel in your prediction?
3. Find the standard deviations and correlation coefficients for the data in Examples 5.1 and 5.4.
4. The adult heights of men and women in a given population satisfy the following:
   Men: average 69.5 in, $\sigma = 3.2$ in. Women: average 63.7 in, $\sigma = 2.5$ in.
   The height of a father and his adult daughter have correlation coefficient 0.35. If a father’s height is 72 in (mother’s height unknown), how tall do you expect their daughter to be?
5. Suppose $R^2$ is the coefficient of determination for a linear regression model $\hat y = mt + c$. Use one of the alternative expressions for $R^2$ (equation $(*)$) to find the coefficient of determination for the reversed predictor $\hat t(y)$. Are you surprised?
6. Suppose that a data set $\{(t_i, y_i)\}_{1 \leq i \leq n}$ has at least two distinct $t$- and $y$-values (some $t_i \neq t_j$, etc.), that it has regression line $\hat y = mt + c$ and coefficient of determination $R^2$.
   (a) Show that $R^2 = 0 \iff m = 0$.
   (b) (Hard) Prove that the sum of squared errors equals $S = \sum_{i=1}^n e_i^2 = n(\operatorname{Var} y - \operatorname{Var}\hat y)$.
   (c) Obtain the alternative expression $R^2 = 1 - \frac{S}{n \operatorname{Var} y}$. Hence conclude that $R^2 \leq 1$, with equality if and only if the original data set is perfectly linear.
5.3 Matrix Multiplication & Polynomial Regression
In this section we consider how to find a best-fitting least-squares polynomial for given data. To see how to do this, it helps to rephrase the linear approach using matrices.¹³
We start by observing that the system of equations in Theorem 5.3 can be written as a $2 \times 2$ matrix problem. For a data set with $n$ pairs, the coefficients $m, c$ satisfy
$$\begin{pmatrix} \sum t_i^2 & \sum t_i \\ \sum t_i & n \end{pmatrix}\begin{pmatrix} m \\ c \end{pmatrix} = \begin{pmatrix} \sum t_i y_i \\ \sum y_i \end{pmatrix}$$
This is nice because we can decompose the square matrix on the left as the product of a simple $2 \times n$ matrix and its transpose (switch the rows and columns):
$$\begin{pmatrix} \sum t_i^2 & \sum t_i \\ \sum t_i & n \end{pmatrix} = \begin{pmatrix} t_1 & t_2 & \cdots & t_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}\begin{pmatrix} t_1 & 1 \\ t_2 & 1 \\ \vdots & \vdots \\ t_n & 1 \end{pmatrix} =: P^T P$$
We can also view the right side as the product of $P^T$ and the column vector of output values $y_i$:
$$\begin{pmatrix} \sum t_i y_i \\ \sum y_i \end{pmatrix} = \begin{pmatrix} t_1 & t_2 & \cdots & t_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =: P^T \mathbf{y}$$
A little theory tells us that if at least two of the $t_i$ are distinct, then the $2 \times 2$ matrix $P^T P$ is invertible;¹⁴ there is a unique regression line whose coefficients may be found by taking the matrix inverse:
$$\begin{pmatrix} m \\ c \end{pmatrix} = (P^T P)^{-1} P^T \mathbf{y} \implies \hat y = mt + c = \begin{pmatrix} t & 1 \end{pmatrix}\begin{pmatrix} m \\ c \end{pmatrix} = \begin{pmatrix} t & 1 \end{pmatrix}(P^T P)^{-1} P^T \mathbf{y}$$
We can also easily compute the vector of predicted values $\hat y_i = \hat y(t_i)$:
$$\hat{\mathbf{y}} = \begin{pmatrix} t_1 & 1 \\ t_2 & 1 \\ \vdots & \vdots \\ t_n & 1 \end{pmatrix}\begin{pmatrix} m \\ c \end{pmatrix} = P(P^T P)^{-1} P^T \mathbf{y}$$
and the squared error $\sum e_i^2 = \sum |\hat y_i - y_i|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2$, which leads to an alternative expression for the coefficient of determination
$$R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - n\overline{y}^2}{\|\mathbf{y}\|^2 - n\overline{y}^2}$$
where $\|\mathbf{y}\|$ is the length of a vector.
¹³ Matrix computations are non-examinable. The purpose of this section is to see how the regression may easily be automated and generalized by computer and to understand a little of how a spreadsheet calculates best-fitting curves of different types.
¹⁴ For those who’ve studied linear algebra, $P$ and $P^T P$ have the same null space and thus rank, since
$$P\mathbf{x} = \mathbf{0} \implies P^T P\mathbf{x} = \mathbf{0} \quad\text{and}\quad P^T P\mathbf{x} = \mathbf{0} \implies \mathbf{x}^T P^T P\mathbf{x} = 0 \implies \|P\mathbf{x}\| = 0 \implies P\mathbf{x} = \mathbf{0}$$
For linear regression, having at least two distinct $t_i$ values means $\operatorname{rank} P = 2$, whence $P^T P$ is invertible.
Examples 5.11. 1. We revisit Example 5.9 in this language.
$$P = \begin{pmatrix} t_1 & 1 \\ t_2 & 1 \\ \vdots & \vdots \\ t_n & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix} \implies P^T P = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix} = \begin{pmatrix} 30 & 10 \\ 10 & 4 \end{pmatrix}$$
from which
$$\begin{pmatrix} m \\ c \end{pmatrix} = (P^T P)^{-1} P^T \mathbf{y} = \begin{pmatrix} 30 & 10 \\ 10 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 4 \\ 1 \\ 2 \\ 0 \end{pmatrix} = \frac{1}{30 \cdot 4 - 10^2}\begin{pmatrix} 4 & -10 \\ -10 & 30 \end{pmatrix}\begin{pmatrix} 12 \\ 7 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 48 - 70 \\ -120 + 210 \end{pmatrix} = \frac{1}{10}\begin{pmatrix} -11 \\ 45 \end{pmatrix}$$
The prediction vector given inputs $t_i$ is therefore
$$\hat{\mathbf{y}} = P\begin{pmatrix} m \\ c \end{pmatrix} = \frac{1}{10}\begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} -11 \\ 45 \end{pmatrix} = \frac{1}{10}\begin{pmatrix} 34 \\ 23 \\ 12 \\ 1 \end{pmatrix}$$
from which the coefficient of determination is, as before,
$$R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - 4\overline{y}^2}{\|\mathbf{y}\|^2 - 4\overline{y}^2} = \frac{\frac{1}{100}(34^2 + 23^2 + 12^2 + 1^2) - 4 \cdot \frac{7^2}{4^2}}{(4^2 + 1^2 + 2^2 + 0^2) - 4 \cdot \frac{7^2}{4^2}} = \frac{121}{175}$$
2. Given the data set $\{(3, 1), (3, 5), (3, 6)\}$, we have $P = \begin{pmatrix} 3 & 1 \\ 3 & 1 \\ 3 & 1 \end{pmatrix}$ and $P^T P = \begin{pmatrix} 27 & 9 \\ 9 & 3 \end{pmatrix}$, which isn’t invertible: $27 \cdot 3 - 9 \cdot 9 = 0$. The linear regression method doesn’t work!
It is easy to understand this from the picture. Since the three data points are vertically aligned, any line minimizing the sum of the squared errors must pass through the average $(3, 4)$, though it could have any slope!
This illustrates our fundamental assumption: linear regression requires at least two distinct $t$-values.

[Figure: the three vertically aligned data points at $t = 3$, with several candidate lines through $(3, 4)$.]
It is unnecessary ever to use the matrix approach for linear regression, though the method has significant advantages.
• Computers store and manipulate data in matrix format, so this method is computer-ready (see the sketch after this list).
• Suppose you repeat an experiment several times, taking measurements $y_i$ at times $t_i$. Since $P$ depends only on the $t$-data, you need only compute the matrix $(P^T P)^{-1} P^T$ once, making computation of the regression line for repeat experiments very efficient.
• The method generalizes (easily for computers!) to polynomial regression. . .
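Here is a minimal sketch of the matrix formulation using Python’s numpy library, run on the data of Example 5.9; we solve the normal equations $(P^T P)\binom{m}{c} = P^T\mathbf{y}$ rather than explicitly inverting $P^T P$, which is the usual numerical practice:

    import numpy as np

    t = np.array([1, 2, 3, 4])
    y = np.array([4, 1, 2, 0])

    # Design matrix P: a column of t-values and a column of ones
    P = np.column_stack([t, np.ones_like(t)])

    # Solve (P^T P) [m, c]^T = P^T y
    m, c = np.linalg.solve(P.T @ P, P.T @ y)
    print(m, c)                 # expect -1.1 and 4.5

    y_hat = P @ np.array([m, c])
    print(y_hat)                # expect [3.4, 2.3, 1.2, 0.1]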
Polynomial Regression
The pattern is almost identical when we use matrices; you just need to make the matrix $P$ a little larger. . . We work through the approach for a quadratic approximation.
Suppose we have a data set $\{(t_i, y_i) : 1 \leq i \leq n\}$ and that we desire a quadratic polynomial predictor $\hat y = at^2 + bt + c$ which minimizes the sum of the squared vertical errors
$$S(a, b, c) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \bigl(at_i^2 + bt_i + c - y_i\bigr)^2$$
This might look terrifying, but can be attacked exactly as before using differentiation: to minimize $S$, we need the derivatives of $S$ with respect to the coefficients $a, b, c$ to be zero.
$$\frac{\partial S}{\partial a} = 2\sum\bigl(at_i^4 + bt_i^3 + ct_i^2 - t_i^2 y_i\bigr) = 0$$
$$\frac{\partial S}{\partial b} = 2\sum\bigl(at_i^3 + bt_i^2 + ct_i - t_i y_i\bigr) = 0$$
$$\frac{\partial S}{\partial c} = 2\sum\bigl(at_i^2 + bt_i + c - y_i\bigr) = 0$$
$$\iff \begin{cases} a\sum t_i^4 + b\sum t_i^3 + c\sum t_i^2 = \sum t_i^2 y_i \\ a\sum t_i^3 + b\sum t_i^2 + c\sum t_i = \sum t_i y_i \\ a\sum t_i^2 + b\sum t_i + cn = \sum y_i \end{cases}$$
As a system of equations for $a, b, c$ this looks fairly nasty, but by rephrasing in terms of matrices, we see that it is exactly the same problem as before!
$$\begin{pmatrix} \sum t_i^4 & \sum t_i^3 & \sum t_i^2 \\ \sum t_i^3 & \sum t_i^2 & \sum t_i \\ \sum t_i^2 & \sum t_i & n \end{pmatrix}\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \sum t_i^2 y_i \\ \sum t_i y_i \\ \sum y_i \end{pmatrix}$$
corresponds to
$$P^T P\begin{pmatrix} a \\ b \\ c \end{pmatrix} = P^T \mathbf{y} \quad\text{where}\quad P = \begin{pmatrix} t_1^2 & t_1 & 1 \\ \vdots & \vdots & \vdots \\ t_n^2 & t_n & 1 \end{pmatrix} \quad\text{and}\quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
The only change is that $P$ is now an $n \times 3$ matrix so that $P^T P$ is $3 \times 3$. Analogous to the linear situation, provided at least three of the $t_i$ are distinct, the matrix $P^T P$ is invertible and there is a unique least-squares quadratic minimizer
$$\hat y = at^2 + bt + c = \begin{pmatrix} t^2 & t & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} t^2 & t & 1 \end{pmatrix}(P^T P)^{-1} P^T \mathbf{y}$$
The predictions $\hat y_i = \hat y(t_i)$ therefore form a vector $\hat{\mathbf{y}} = P\begin{pmatrix} a \\ b \\ c \end{pmatrix} = P(P^T P)^{-1} P^T \mathbf{y}$, and the coefficient of determination may be computed as before:
$$R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - n\overline{y}^2}{\|\mathbf{y}\|^2 - n\overline{y}^2}$$
The method generalizes in the obvious way: if you want a cubic minimizer, give P an extra column
of cubed t
i
-terms! This would be hard work by hand, but is standard fodder for computers: this isn’t
a linear algebra class, so don’t try to invert a 3 ×3 matrix!
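A minimal sketch, in Python with numpy, of how a computer might carry this out for any degree (the function name poly_fit is ours; np.vander builds the matrix $P$ with columns $t^{\text{degree}}, \ldots, t, 1$):

    import numpy as np

    def poly_fit(t, y, degree):
        """Least-squares polynomial of the given degree via the normal equations."""
        t = np.asarray(t, dtype=float)
        y = np.asarray(y, dtype=float)
        P = np.vander(t, degree + 1)              # columns: t^degree, ..., t, 1
        coeffs = np.linalg.solve(P.T @ P, P.T @ y)
        y_hat = P @ coeffs
        n, y_bar = len(y), y.mean()
        r2 = (y_hat @ y_hat - n * y_bar**2) / (y @ y - n * y_bar**2)
        return coeffs, r2

    # degree = 1 recovers Example 5.9: coefficients roughly [-1.1, 4.5], R^2 roughly 0.691
    print(poly_fit([1, 2, 3, 4], [4, 1, 2, 0], 1))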
Example 5.12. We are given data $\{(t_i, y_i)\} = \{(1, 2), (2, 5), (3, 7), (4, 4)\}$.
1. For the best-fitting linear model, we use the same $P$ (and thus $P^T P$) from the previous example:
$$\begin{pmatrix} m \\ c \end{pmatrix} = (P^T P)^{-1} P^T \mathbf{y} = \begin{pmatrix} 30 & 10 \\ 10 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 2 \\ 5 \\ 7 \\ 4 \end{pmatrix} = \frac{1}{10}\begin{pmatrix} 2 & -5 \\ -5 & 15 \end{pmatrix}\begin{pmatrix} 49 \\ 18 \end{pmatrix} = \begin{pmatrix} 0.8 \\ 2.5 \end{pmatrix}$$
which yields $\hat y(t) = 0.8t + 2.5$. The predicted values and coefficient of determination are then
$$\hat{\mathbf{y}} = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} 0.8 \\ 2.5 \end{pmatrix} = \begin{pmatrix} 3.3 \\ 4.1 \\ 4.9 \\ 5.7 \end{pmatrix}, \qquad R^2 = \frac{84.2 - 81}{94 - 81} \approx 0.2462$$
The linear model explains only 24.6% of the variance in the output; not very accurate.
2. For a quadratic model, all that changes is the matrix $P$:
$$P = \begin{pmatrix} 1 & 1 & 1 \\ 4 & 2 & 1 \\ 9 & 3 & 1 \\ 16 & 4 & 1 \end{pmatrix} \implies P^T P = \begin{pmatrix} 1 & 4 & 9 & 16 \\ 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 4 & 2 & 1 \\ 9 & 3 & 1 \\ 16 & 4 & 1 \end{pmatrix} = \begin{pmatrix} 354 & 100 & 30 \\ 100 & 30 & 10 \\ 30 & 10 & 4 \end{pmatrix}$$
$$\implies \begin{pmatrix} a \\ b \\ c \end{pmatrix} = (P^T P)^{-1} P^T\begin{pmatrix} 2 \\ 5 \\ 7 \\ 4 \end{pmatrix} = \begin{pmatrix} 354 & 100 & 30 \\ 100 & 30 & 10 \\ 30 & 10 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 149 \\ 49 \\ 18 \end{pmatrix} = \begin{pmatrix} -1.5 \\ 8.3 \\ -5 \end{pmatrix}$$
from which $\hat y = -1.5t^2 + 8.3t - 5$. To quantify its accuracy, compute the vector of predicted values $\hat y_i = \hat y(t_i)$ and the coefficient of determination:
$$\hat{\mathbf{y}} = P\begin{pmatrix} -1.5 \\ 8.3 \\ -5 \end{pmatrix} = \begin{pmatrix} 1.8 \\ 5.6 \\ 6.4 \\ 4.2 \end{pmatrix}, \qquad R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - 4\overline{y}^2}{\|\mathbf{y}\|^2 - 4\overline{y}^2} = \frac{93.2 - 81}{94 - 81} \approx 0.9385$$
The quadratic model is far superior to the linear, explaining 94% of the observed variance.
3. We can even find a cubic model ($P$ is a $4 \times 4$ matrix!):
$$\hat y = \frac{1}{6}\bigl(-4t^3 + 21t^2 - 17t + 12\bigr)$$
The cubic passes through all four data points, there is no error and $R^2 = 1$.

[Figure: the data with the linear, quadratic and cubic models.]

For real-world data this is possibly less useful than the quadratic model—it certainly takes longer to find! More importantly, likely experimental error in the $y$-data has a strong effect on the ‘perfect’ model—we are, in effect, modelling noise. Do you expect $y(5)$ to be closer to $-1$ or $-8$?
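If you have numpy available, the three models above can be reproduced in one line each with its built-in least-squares polynomial fitter (coefficients are returned highest power first); a minimal sketch:

    import numpy as np

    t = np.array([1, 2, 3, 4])
    y = np.array([2, 5, 7, 4])

    print(np.polyfit(t, y, 1))   # expect roughly [ 0.8,  2.5 ]
    print(np.polyfit(t, y, 2))   # expect roughly [-1.5,  8.3, -5. ]
    print(np.polyfit(t, y, 3))   # interpolates all four points: (1/6)(-4, 21, -17, 12)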
Exercises 5.3.
1. Recall Example 4.2, with the following almost linear data set.

       x : 0   2   4   6   8   10
       y : 3  23  41  59  77  93

   Find the best-fitting straight line for the data, then use a spreadsheet to find the best-fitting quadratic. Is the extra effort worth it?
2. You are given the following data consisting of measurements from an experiment recorded at times $t_i$ seconds.

       t_i : 1  2  3  4  5  6  7  8  9  10
       y_i : 7  5  3  2  3  5  6  9  8  12

   (a) Given the values $\sum t_i = 55$, $\sum t_i^2 = 385$, $\sum y_i = 60$, $\sum t_i y_i = 385$, find the best-fitting least-squares linear model for this data, and use it to predict $\hat y(13)$.
   (b) Find the best-fitting quadratic model for the data: feel free to use a spreadsheet!
   (c) The graphs below show the best-fitting least-squares linear, quadratic, cubic, quartic, and ninth-degree models and their coefficients of determination.

   [Figure: the five fitted models plotted against the data.]

       Degree   R²
       1        0.4264
       2        0.8830
       3        0.9319
       4        0.9336
       ⋮        ⋮
       9        1

   Which of these models would you choose for this data and why? What considerations would you take into account?
5.4 Exponential & Power Regression Models
If you suspect that your data would be better modelled by a non-polynomial function, there are
several things you can try.
Minimizing the sum of squared-errors might be very difficult for non-polynomial functions because
there is likely no simple tie-in with linear equations/algebra. Attempting this is likely to result in
a horrible non-linear system for your coefficients which is difficult to analyze either theoretically or
using a computer.
15
Log Plots. The most common approach when trying to fit an exponential model $\hat y = e^{mt+c}$ to data is to use a log plot: taking logarithms of both sides results in
$$\ln \hat y = mt + c$$
If we take $\hat Y := \ln \hat y$ as a new variable, the model is now a straight line! The idea is then to use linear regression to find the coefficients $m, c$ (equivalently $m$ and $\ln a$, if the model is written $\hat y = ae^{mt}$ with $a = e^c$).
Example (4.4, cont). Recall our earlier rabbit-population $P(t)$, repeated in the table below. We previously considered modelling this with an exponential function for two reasons:
1. We were told it was population data!
2. The $t$-differences are constant (2), while the $P$-ratios are approximately so ($\approx 1.41$).

       t_i      0     2     4     6     8     10
       P_i      5     7     10    14    19    28
       ln P_i   1.61  1.95  2.30  2.64  2.94  3.33

After constructing a log-plot, the relationship is much clearer:

[Figure: the data $(t, P)$, and the log-plot $(t, \ln P)$ which appears approximately linear.]

Since the relationship between $t$ and $\ln P$ appears linear, we perform a linear regression calculation to find the best-fitting least-squares line for the $(t_i, \ln P_i)$ data.
¹⁵ As an example of how horrific this is, suppose you want to minimize the sum of square-errors for data $(t_i, y_i)$ using an exponential model $\hat y(t) = ae^{kt}$. The coefficients of our model, $a, k$, should minimize
$$S(a, k) = \sum_{i=1}^n \bigl(ae^{kt_i} - y_i\bigr)^2$$
Differentiating this with respect to $a, k$ and setting equal to zero results in
$$\begin{cases} \frac{\partial S}{\partial a} = 2\sum e^{kt_i}\bigl(ae^{kt_i} - y_i\bigr) = 0 \\ \frac{\partial S}{\partial k} = 2a\sum t_i e^{kt_i}\bigl(ae^{kt_i} - y_i\bigr) = 0 \end{cases} \implies \Bigl(\sum y_i e^{kt_i}\Bigr)\Bigl(\sum t_i e^{2kt_i}\Bigr) = \Bigl(\sum e^{2kt_i}\Bigr)\Bigl(\sum t_i y_i e^{kt_i}\Bigr)$$
where we substituted for $a$ to obtain the last equation. Remember that this is an equation for $k$; if you think you can solve this easily, think again!
Everything necessary comes from extending the table.

       Data                                              Average
       t_i          0     2     4     6      8      10      5
       P_i          5     7     10    14     19     28      13.83
       ln P_i       1.61  1.95  2.30  2.64   2.94   3.33    2.46
       t_i^2        0     4     16    36     64     100     36.67
       t_i ln P_i   0     3.89  9.21  15.83  23.56  33.32   14.30

$$m = \frac{\overline{t \ln P} - \overline{t}\cdot\overline{\ln P}}{\overline{t^2} - \overline{t}^2} = \frac{14.30 - 5 \cdot 2.46}{36.67 - 5^2} \approx 0.171, \qquad c = \overline{\ln P} - m\,\overline{t} = 2.46 - 0.171 \cdot 5 \approx 1.609$$

which yields the exponential model
$$\hat P(t) = e^{0.171t + 1.609} = 4.998(1.186)^t$$

[Figure: the data with the model $\hat P(t)$, and the log-plot with its regression line.]

This is very close to the model ($5(1.188)^t$) we obtained previously by pure guesswork. The approximate doubling time $T$ for the population satisfies
$$e^{mT} = 2 \implies T = \frac{\ln 2}{m} \approx 4.06 \text{ months}$$
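A minimal sketch of this log-plot fit in Python with numpy (np.polyfit performs the linear regression on $(t, \ln P)$; the data is from the example above):

    import numpy as np

    t = np.array([0, 2, 4, 6, 8, 10])
    P = np.array([5, 7, 10, 14, 19, 28])

    # Fit a straight line to (t, ln P), then exponentiate the intercept
    m, c = np.polyfit(t, np.log(P), 1)
    print(m, c)                   # expect roughly 0.171 and 1.609
    print(np.exp(c), np.exp(m))   # expect roughly 4.998 and 1.186
    print(np.log(2) / m)          # doubling time: roughly 4.06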
When using the log plot method, interpreting errors and the goodness of fit of a model is a little more difficult. Typically one computes the coefficient of determination $R^2$ of the underlying linear model: in our example,¹⁶
$$R^2 = m^2\,\frac{\operatorname{Var} t}{\operatorname{Var} \ln P} = 99.3\%$$
It is important to appreciate that the log plot method does not treat all errors equally: taking logarithms tends to reduce error by a greater amount when the output $y$ is large. This should be clear from the picture, and more formally by the mean value theorem: if $y_1 < y_2$, then there is some $\xi \in (y_1, y_2)$ for which
$$\ln y_2 - \ln y_1 = \frac{1}{\xi}(y_2 - y_1) < \frac{1}{y_1}(y_2 - y_1)$$

[Figure: the graph of $\ln y$; the same difference in $y$ produces a smaller difference in $\ln y$ when $y$ is large.]

The log plot approach therefore places a higher emphasis on accurately matching data when the output $y$ is small. This isn’t such a bad thing since our intuitive view of error depends on the size of the data. For instance, misplacing a $100 bill is annoying, but a $100 mistake in escrow when buying a house is unlikely to concern you very much! Exponential data can more easily vary over large orders of magnitude than linear or quadratic data.

¹⁶ This needs more decimal places of accuracy for the log-values than what’s in our table!
Log-Log Plots. If you suspect a power function model $\hat y = at^m$, then taking logarithms
$$\ln \hat y = m \ln t + \ln a$$
results in a linear relationship between $\ln y$ and $\ln t$. As before, we can apply a linear regression approach to find a model; the goodness of fit is again described by the coefficient of determination of the underlying model.
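A minimal sketch of a log-log fit in Python with numpy (the function name power_fit and the illustrative data are ours; it assumes all $t$ and $y$ values are positive):

    import numpy as np

    def power_fit(t, y):
        """Fit y-hat = a * t^m by linear regression on (ln t, ln y)."""
        m, ln_a = np.polyfit(np.log(t), np.log(y), 1)
        return np.exp(ln_a), m

    # Hypothetical data lying close to y = 3 t^2
    t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([3.1, 11.8, 27.5, 47.6, 76.0])
    print(power_fit(t, y))   # expect a close to 3 and m close to 2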
Exercises 5.4. 1. You suspect a logarithmic model for a data set. Describe how you would approach
finding a model in the context of this section.
2. The table shows the average weight and length of a fish species measured at different ages.

       Age (years)   Length (cm)   Weight (g)
       1              5.2            2
       2              8.5            8
       3             11.5           21
       4             14.3           38
       5             16.8           69
       6             19.2          117
       7             21.3          148
       8             23.3          190
       9             25.0          264
       10            26.7          293
       11            28.2          318
       12            29.6          371
       13            30.8          455
       14            32.0          504
       15            33.0          518
       16            34.0          537
       17            34.9          651
       18            36.4          719
       18            37.1          726
       20            37.7          810

   [Figure: scatter plot of weight $w$ against length $\ell$.]

   (a) Do you think an exponential model is a good fit for this data? Take logarithms of the weight values and use a spreadsheet to obtain a model $\hat w(\ell) = ae^{m\ell}$, where $w, \ell$ are the weight and length respectively.
   (b) What happens if you try a log-log plot? Given what we’re measuring, why do you expect a power model to be more accurate?
3. Population data for Long Beach CA is given. Using a spreadsheet or otherwise, find linear, quadratic, exponential and logarithmic regression models for this data.
   Which of these models seems to fit the data best, and which would you trust to best predict the population in 2020?
   Look up the population of Long Beach in 2020; does it confirm your suspicions? What do you think is going on?

       Year   Years since 1900   Population
       1900     0                    2,252
       1910    10                   17,809
       1920    20                   55,593
       1930    30                  142,032
       1940    40                  164,271
       1950    50                  250,767
       1960    60                  334,168
       1970    70                  358,879
       1980    80                  361,498
       1990    90                  429,433
       2000   100                  461,522
       2010   110                  462,257
4. In the early 1600s, Johannes Kepler used observational data to derive his laws of planetary motion,
the third of which relates the orbital period T of a planet (how long it takes to go round the sun)
to its (approximate) distance r from the sun.
Planet T (years) r (millions km)
Mercury 0.24 58
Venus 0.61 110
Earth 1 150
Mars 1.88 230
Jupiter 11.9 780
Saturn 29.5 1400
Uranus 84 2900
Neptune 165 4500
The table shows the data for all the planets. Use a spreadsheet to analyze this data and find a
model relating T to r.
Kepler did not know about Uranus and Neptune and only had relative distances for the planets. Research the correct statement of Kepler’s third law and compare it with your findings.