5 Regression Models
We’ve studied several types of function and seen how to spot whether a given data set might suit a particular model. To take this analysis further, we need a way of measuring how well, or how badly, a particular model fits given data.
5.1 Best-fitting Lines and Linear Regression
We start with an example of some data which appears reasonably linear.
Example 5.1. At $t$ p.m., a trail-runner’s GPS locator says that they’ve travelled $y$ miles along a trail:

    t_i :  1   2   3   5
    y_i :  4   8  10  21

We’d like a simple model for how far the runner has travelled as a function of $t$. We might use this to predict where they would be at a given time; say at 6 p.m., or at 2 p.m. if they were to attempt the trail on another day.

[Figure: scatter plot of the data $(t_i, y_i)$.]

By plotting the points, the relationship looks to be approximately linear:¹² $y \approx mt + c$. What is the best choice of line, and how should we find the coefficients $m, c$?
What might be good criteria for choosing our line? What should we mean by best? Plainly, we
want the points to be close to the line, but measured how? What use do we want to make of the
approximating line?
Here are three candidate lines plotted with the data set: of the choices, which seems best and why?

[Figure: three plots of the data, with the candidate lines $y = 4t$, $y = 2t + 4$ and $y = 5t - 4$ respectively.]
Since we want our model to predict the hiker’s location $\hat y = mt + c$ at a given time $t$, we’d like our model to minimize the vertical errors $\hat y_i - y_i$. We’ve computed these in the table; since a positive error is as bad as a negative, we make all the errors positive. It therefore seems reasonable to claim that the first line is the best choice of the three. But can we do better?

    t_i                            1   2   3   5
    y_i                            4   8  10  21
    y = 4t:      |ŷ_i − y_i|       0   0   2   1
    y = 2t + 4:  |ŷ_i − y_i|       2   0   0   7
    y = 5t − 4:  |ŷ_i − y_i|       3   2   1   0
¹² Why should we not expect the distance traveled by the hiker to be perfectly linear?
We need a sensible definition of best-fitting line for a given data set. One possibility is to minimize the
sum of the vertical errors:
$$\sum_{i=1}^{n} |\hat y_i - y_i|$$
For reasons of computational simplicity, uniqueness, statistical interpretation, and to discourage
large individual errors, we don’t do this! The standard approach is instead to minimize the sum of
the squared errors.
Definition 5.2. Let $(t_i, y_i)$ be data points with at least two distinct $t$-values. Let $\hat y = mt + c$ be a linear predictor (model) for $y$ given $t$.
The $i$th error in the model is the difference $e_i := \hat y_i - y_i = mt_i + c - y_i$.
The regression line or best-fitting least-squares line is the function $\hat y = mt + c$ which minimizes the sum $S := \sum e_i^2 = \sum (\hat y_i - y_i)^2$ of the squares of the errors.

Having at least two distinct $t$-values (some $t_i \neq t_j$) is necessary for the regression line to be unique.
Example (5.1, cont). Suppose the predictor was $\hat y = mt + c$. We expand the table:

    t_i     1            2            3             5
    y_i     4            8            10            21
    ŷ_i     m + c        2m + c       3m + c        5m + c
    e_i     m + c − 4    2m + c − 8   3m + c − 10   5m + c − 21

Our goal is to minimize the function
$$S(m, c) = \sum e_i^2 = (m + c - 4)^2 + (2m + c - 8)^2 + (3m + c - 10)^2 + (5m + c - 21)^2$$
This is easy to deal with if we invoke some calculus. If $(m, c)$ minimizes $S(m, c)$, then the first derivative test says that the (partial) derivatives of $S$ must be zero.

Keep $c$ constant and differentiate with respect to $m$:
$$\frac{\partial S}{\partial m} = 2(m + c - 4) + 4(2m + c - 8) + 6(3m + c - 10) + 10(5m + c - 21) = 2\bigl[39m + 11c - 155\bigr]$$

Keep $m$ constant and differentiate with respect to $c$:
$$\frac{\partial S}{\partial c} = 2\bigl[(m + c - 4) + (2m + c - 8) + (3m + c - 10) + (5m + c - 21)\bigr] = 2\bigl[11m + 4c - 43\bigr]$$

The regression line is found by solving a pair of simultaneous equations:
$$\begin{cases} 39m + 11c = 155 \\ 11m + 4c = 43 \end{cases} \implies m = \frac{21}{5},\ c = -\frac{4}{5} \implies \hat y = \frac{1}{5}(21t - 4)$$
By 6 p.m., we predict that the runner would have covered 24.4 miles. The sum of the squared errors for our regression line is $\sum e_i^2 = \sum |\hat y_i - y_i|^2 = 4.4$, compared to 5, 53 and 14 for our earlier options.
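If you would like to check this kind of calculus by machine, here is a minimal sketch using Python’s sympy library; the data and the expected solution come from the example above, and the variable names are ours:

    import sympy as sp

    m, c = sp.symbols('m c')
    data = [(1, 4), (2, 8), (3, 10), (5, 21)]

    # Sum of squared errors S(m, c) for the predictor y-hat = m*t + c
    S = sum((m*t + c - y)**2 for t, y in data)

    # First derivative test: set both partial derivatives to zero and solve
    solution = sp.solve([sp.diff(S, m), sp.diff(S, c)], [m, c])
    print(solution)   # expected, as in the text: {m: 21/5, c: -4/5}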
To obtain the general result for $n$ data points, we return to our computations of the partial derivatives:
$$\frac{\partial S}{\partial m} = \frac{\partial}{\partial m}\sum (mt_i + c - y_i)^2 = 2\sum t_i(mt_i + c - y_i) = 2\Bigl[\Bigl(\sum t_i^2\Bigr)m + \Bigl(\sum t_i\Bigr)c - \sum t_i y_i\Bigr]$$
$$\frac{\partial S}{\partial c} = \frac{\partial}{\partial c}\sum (mt_i + c - y_i)^2 = 2\sum (mt_i + c - y_i) = 2\Bigl[\Bigl(\sum t_i\Bigr)m + nc - \sum y_i\Bigr]$$
These sums are often written using a short-hand notation for average:
$$\overline{t} = \frac{1}{n}\sum_{i=1}^n t_i, \qquad \overline{t^2} = \frac{1}{n}\sum_{i=1}^n t_i^2, \qquad \overline{y} = \frac{1}{n}\sum_{i=1}^n y_i, \qquad \overline{ty} = \frac{1}{n}\sum_{i=1}^n t_i y_i$$
Theorem 5.3 (Linear Regression). Given $n$ data points $(t_i, y_i)$ with at least two distinct $t$-values, the best-fitting least-squares line has equation $\hat y = mt + c$, where $m, c$ satisfy
$$\begin{cases} \bigl(\sum t_i^2\bigr)m + \bigl(\sum t_i\bigr)c = \sum t_i y_i \\ \bigl(\sum t_i\bigr)m + nc = \sum y_i \end{cases} \iff \begin{cases} \overline{t^2}\,m + \overline{t}\,c = \overline{ty} \\ \overline{t}\,m + c = \overline{y} \end{cases}$$
This is a pair of simultaneous equations for the coefficients $m, c$, with solution
$$m = \frac{\overline{ty} - \overline{t}\,\overline{y}}{\overline{t^2} - \overline{t}^2}, \qquad c = \overline{y} - m\,\overline{t}$$
As the next section shows, having two distinct $t$-values guarantees a non-zero denominator $\overline{t^2} - \overline{t}^2$. The expression for $c$ shows that the regression line passes through the data’s center of mass $(\overline{t}, \overline{y})$.
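As a quick sanity check, here is a minimal sketch in plain Python of the formulas in the theorem (the function name regression_line is ours, not the text’s):

    def regression_line(ts, ys):
        """Least-squares line y-hat = m*t + c via the averages in Theorem 5.3.
        Assumes at least two distinct t-values, so the denominator is non-zero."""
        n = len(ts)
        t_bar = sum(ts) / n
        y_bar = sum(ys) / n
        t2_bar = sum(t * t for t in ts) / n
        ty_bar = sum(t * y for t, y in zip(ts, ys)) / n
        m = (ty_bar - t_bar * y_bar) / (t2_bar - t_bar ** 2)
        c = y_bar - m * t_bar
        return m, c

    # Example 5.1: expect m = 4.2, c = -0.8
    print(regression_line([1, 2, 3, 5], [4, 8, 10, 21]))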
Example 5.4. Five students’ scores on two quizzes are given. If a student scores 9/10 on the first quiz, what might we expect them to score on the second?

    Quiz 1:   8  10   6   7   4
    Quiz 2:  10   7   5   8   6

To put the question in standard form, suppose Quiz 1 is the $t$-data and Quiz 2 the $y$-data. It is helpful to rewrite the data and add lines to the table so that we may more easily compute everything.

               Data                     Sum   Average
    t_i         8   10    6    7    4    35      7
    y_i        10    7    5    8    6    36      7.2
    t_i^2      64  100   36   49   16   265     53
    t_i y_i    80   70   30   56   24   260     52

[Figure: scatter plot of (Q1, Q2) with the regression line; dashed lines mark the prediction $\hat y(9) = 8$.]

$$m = \frac{52 - 7 \times 7.2}{53 - 7^2} = \frac{1.6}{4} = 0.4, \qquad c = 7.2 - 0.4 \times 7 = 4.4 \implies \hat y(t) = \frac{2}{5}(t + 11)$$
This is the line which minimizes the sum of the squares of the vertical deviations. The prediction is that the hypothetical student scores $\hat y(9) = \frac{2}{5}\cdot 20 = 8$ on Quiz 2. Note that the predictor isn’t symmetric: if we reverse the roles of $t, y$ we don’t get the same line!
Exercises 5.1.
1. Compute the sum of the absolute errors $\sum |\hat y_i - y_i|$ for the regression line and compare it to the sum of the absolute errors for $\hat y = 4t$: what do you notice?
2. Let $\hat y = mt + c$ be a linear predictor for the given data.

       t_i : 0  1  2  3
       y_i : 1  2  2  3

   (a) Compute the sum of squared errors $S(m, c) = \sum e_i^2 = \sum |\hat y_i - y_i|^2$ as a function of $m$ and $c$.
   (b) Compute the partial derivatives $\frac{\partial S}{\partial m}$ and $\frac{\partial S}{\partial c}$.
   (c) Find $m$ and $c$ by setting both partial derivatives to zero; hence find the equation of the regression line for these data.
   (d) Compare the sum of squared errors $S$ for the regression line with the errors if we use the simple predictor $y(t) = 1 + \frac{2}{3}t$ which passes through the first and last data points.
3. Consider Example 5.4.
   (a) Compute the sum of squared errors $S = \sum e_i^2 = \sum |\hat y_i - y_i|^2$ for the regression line.
   (b) Suppose a student was expected to score exactly the same on both quizzes; the predictor would be $\hat y = t$. What would the sum of squared errors be in this case?
   (c) If a student scores 8/10 on Quiz 2, use linear regression to predict their score on Quiz 1. (Warning: the answer is NOT $\frac{5}{2}\cdot 8 - 11 = 9$. . . )
4. Ten children had their heights (inches) measured on their first and second birthdays. The data was as follows.

       1st birthday: 28 28 29 29 29 30 30 32 32 33
       2nd birthday: 30 32 31 34 35 33 36 37 35 37

   Given this data, find a regression model and use it to predict the height at 2 years of a child who measures 32 inches at age 1.
   (It is acceptable—and encouraged!—to use a spreadsheet to find the necessary ingredients. You can do this by hand if you like, but the numbers are large; it is easier with some formulæ from the next section.)
5. (a) Let $a, b$ be given. Find the value of $y$ which minimizes the sum of squares $(y - a)^2 + (y - b)^2$.
   (b) For the data set $\{(t, y)\} = \{(1, 1), (2, 1), (2, 3)\}$, find the unique least-squares linear model for predicting $y$ given $t$. (Hint: think about part (a) if you don’t want to compute)
   (c) Show that there are infinitely many lines $\hat y = mt + c$ which minimize the sum of the absolute errors $\sum_{i=1}^{3} |\hat y_i - y_i|$.
5.2 The Coefficient of Determination
In the sense that it minimizes the sum of the squared errors $S = \sum e_i^2$, the linear regression model is as good as it can be—but how good? We could use $S$ as a quantitative measure of the model’s accuracy, but it doesn’t do a good job at comparing the accuracy of models for different data sets. The standard approach to this problem relies on the concept of variance.
Definition 5.5. The variance of a data sequence $(y_1, \ldots, y_n)$ is the average of the squared deviations from their mean $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$,
$$\operatorname{Var} y := \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \overline{y}\bigr)^2$$
The standard deviation is $\sigma_y := \sqrt{\operatorname{Var} y}$.
Variance and standard-deviation are measures of how data deviates from being constant.
Example 5.6. Suppose $(y_i) = (1, 2, 5, 4)$. Then
$$\overline{y} = \frac{1}{4}(1 + 2 + 5 + 4) = 3, \qquad \operatorname{Var} y = \frac{1}{4}\bigl[(-2)^2 + (-1)^2 + 2^2 + 1^2\bigr] = \frac{5}{2}, \qquad \sigma_y = \frac{\sqrt{10}}{2}$$
The square-root means that $\sigma_y$ has the same units as $y$. Loosely speaking, a typical data value is expected to lie approximately $\sigma_y = \frac{1}{2}\sqrt{10} \approx 1.58$ from the mean $\overline{y} = 3$.
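These computations are easy to automate; a minimal sketch in plain Python (the function name variance is ours):

    from math import sqrt

    def variance(xs):
        """Average squared deviation from the mean (Definition 5.5)."""
        x_bar = sum(xs) / len(xs)
        return sum((x - x_bar) ** 2 for x in xs) / len(xs)

    ys = [1, 2, 5, 4]
    print(variance(ys), sqrt(variance(ys)))   # expect 2.5 and roughly 1.58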
To obtain a measure for how well a regression line fits given data $(t_i, y_i)$, we ask what fraction of the variance in $y$ is explained by the model.

Definition 5.7. The coefficient of determination of a model $\hat y = mt + c$ is the ratio
$$R^2 := \frac{\operatorname{Var}\hat y}{\operatorname{Var} y}$$
Examples 5.8. We start by considering two extreme examples.
1. If the data were perfectly linear, then $y_i = mt_i + c$ for all $i$. The regression line is therefore $\hat y = mt + c$ and the coefficient of determination is precisely $R^2 = \frac{\operatorname{Var} y}{\operatorname{Var} y} = 1$. All the variance in the output $y$ is explained by the model’s transfer of the variance in the input $t$.
2. By contrast, consider the data in the table where we work out all necessary details to find the regression line:

       data                   average
       t_i       0  0  2  2   $\overline{t} = 1$
       y_i       1  3  1  3   $\overline{y} = 2$
       t_i^2     0  0  4  4   $\overline{t^2} = 2$
       t_i y_i   0  0  2  6   $\overline{ty} = 2$

   $$m = \frac{\overline{ty} - \overline{t}\,\overline{y}}{\overline{t^2} - \overline{t}^2} = 0, \qquad c = \overline{y} - m\,\overline{t} = 2$$
   The regression line is the constant $\hat y \equiv 2$, whence $\hat y$ has no variance and the coefficient of determination is $R^2 = 0$. In this example, the regression model doesn’t help explain the $y$-data in any way: the $t$-values have no obvious impact on the $y$-values.
In fact, the coefficient of determination always lies somewhere between these extremes: $0 \leq R^2 \leq 1$. Exercise 6 demonstrates this and that the extreme situations are essentially those just encountered; in practice, therefore, $0 < R^2 < 1$. Before we revisit our examples from the previous section, observe that the average of the model’s outputs $\hat y_i$ is the same as that of the original data:
$$\frac{1}{n}\sum_{i=1}^n \hat y_i = \frac{1}{n}\sum_{i=1}^n (mt_i + c) = m\overline{t} + c = \overline{y}$$
This makes computing the variance of $\hat y$ a breeze!
Example 5.1. Recall that $\hat y = \frac{1}{5}(21t - 4)$. Everything necessary is in the table:

       data                          average
       t_i     1    2    3     5     $\overline{t} = 2.75$
       y_i     4    8    10    21    $\overline{y} = 10.75$
       ŷ_i     3.4  7.6  11.8  20.2  $\overline{\hat y} = 10.75$

$$\operatorname{Var} y = \frac{6.75^2 + 2.75^2 + 0.75^2 + 10.25^2}{4} = 39.6875, \qquad \operatorname{Var}\hat y = \frac{7.35^2 + 3.15^2 + 1.05^2 + 9.45^2}{4} = 38.5875$$

from which $R^2 = \frac{\operatorname{Var}\hat y}{\operatorname{Var} y} = \frac{3087}{3175} \approx 97.23\%$. The interpretation here is that the data is very close to being linear; the output $y_i$ is very closely approximated by the regression model, with approximately 97% of its variance explained by the model.
Example 5.4. This time $\hat y = \frac{2}{5}(t + 11)$.

       data                             average
       t_i     8    10   6    7    4    $\overline{t} = 7$
       y_i     10   7    5    8    6    $\overline{y} = 7.2$
       ŷ_i     7.6  8.4  6.8  7.2  6    $\overline{\hat y} = 7.2$

$$\operatorname{Var} y = \frac{2.8^2 + 0.2^2 + 2.2^2 + 0.8^2 + 1.2^2}{5} = 2.96, \qquad \operatorname{Var}\hat y = \frac{0.4^2 + 1.2^2 + 0.4^2 + 0^2 + 1.2^2}{5} = 0.64$$

from which $R^2 = \frac{\operatorname{Var}\hat y}{\operatorname{Var} y} = \frac{8}{37} \approx 21.62\%$. In this case the coefficient of determination is small, which indicates that the model does not explain much of the variation in the output.
The four examples are plotted below for easy visual comparison between the $R^2$-values.

[Figure: four plots, labelled “Perfect model $R^2 = 1$”, “Useless model $R^2 = 0$”, “Good model $R^2 = 0.97$” and “Poor model $R^2 = 0.22$”.]
Efficient computation of $R^2$. If you want to compute by hand, our current process is lengthy and awkward. To obtain a more efficient alternative we first consider an alternative expression for the variance of any collection of data:
$$\operatorname{Var} x = \frac{1}{n}\sum (x_i - \overline{x})^2 = \frac{1}{n}\sum x_i^2 - \frac{2\overline{x}}{n}\sum x_i + \overline{x}^2 = \overline{x^2} - \overline{x}^2$$
Plainly $\operatorname{Var} x \geq 0$, with equality if and only if all data values $x_i$ are equal. The alternative expression $\overline{x^2} - \overline{x}^2$ justifies the uniqueness of the regression line in Definition 5.2 and Theorem 5.3.
Now expand the variance of the predicted outputs:
$$\operatorname{Var}\hat y = \frac{1}{n}\sum (\hat y_i - \overline{y})^2 = \frac{1}{n}\sum \bigl(mt_i + c - (m\overline{t} + c)\bigr)^2 = \frac{m^2}{n}\sum (t_i - \overline{t})^2 = m^2 \operatorname{Var} t$$
Putting these together, we obtain several equivalent expressions for the coefficient of determination:
$$R^2 = \frac{\operatorname{Var}\hat y}{\operatorname{Var} y} = m^2\,\frac{\operatorname{Var} t}{\operatorname{Var} y} = m^2\,\frac{\overline{t^2} - \overline{t}^2}{\overline{y^2} - \overline{y}^2} = \frac{\bigl(\overline{ty} - \overline{t}\,\overline{y}\bigr)^2}{\bigl(\overline{t^2} - \overline{t}^2\bigr)\bigl(\overline{y^2} - \overline{y}^2\bigr)} \qquad (*)$$
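The final expression $(*)$ needs only the five averages $\overline{t}, \overline{y}, \overline{t^2}, \overline{y^2}, \overline{ty}$. A minimal sketch in plain Python (the function name r_squared is ours):

    def r_squared(ts, ys):
        """Coefficient of determination via the shortcut formula (*)."""
        n = len(ts)
        t_bar = sum(ts) / n
        y_bar = sum(ys) / n
        t2_bar = sum(t * t for t in ts) / n
        y2_bar = sum(y * y for y in ys) / n
        ty_bar = sum(t * y for t, y in zip(ts, ys)) / n
        return (ty_bar - t_bar * y_bar) ** 2 / ((t2_bar - t_bar ** 2) * (y2_bar - y_bar ** 2))

    print(r_squared([1, 2, 3, 5], [4, 8, 10, 21]))        # Example 5.1: roughly 0.9723
    print(r_squared([8, 10, 6, 7, 4], [10, 7, 5, 8, 6]))  # Example 5.4: roughly 0.2162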
Example 5.9. We do one more easy example with simple data $(t_i, y_i)$: $(1, 4), (2, 1), (3, 2), (4, 0)$.

       data                      average
       t_i       1   2   3   4   $\overline{t} = \frac{10}{4}$
       y_i       4   1   2   0   $\overline{y} = \frac{7}{4}$
       t_i^2     1   4   9   16  $\overline{t^2} = \frac{15}{2}$
       y_i^2     16  1   4   0   $\overline{y^2} = \frac{21}{4}$
       t_i y_i   4   2   6   0   $\overline{ty} = 3$

$$m = \frac{\overline{ty} - \overline{t}\,\overline{y}}{\overline{t^2} - \overline{t}^2} = \frac{3 - \frac{70}{4^2}}{\frac{15}{2} - \frac{100}{4^2}} = -\frac{11}{10} = -1.1, \qquad c = \overline{y} - m\,\overline{t} = \frac{7}{4} + \frac{11 \cdot 10}{10 \cdot 4} = \frac{9}{2} = 4.5$$

[Figure: the data with the regression line $\hat y = -1.1t + 4.5$; $R^2 = \frac{121}{175} \approx 0.69$ and $\sum e_i^2 = 2.7$.]

The regression line is $\hat y = -\frac{11}{10}t + \frac{9}{2} = -1.1t + 4.5$, and the coefficient of determination is
$$R^2 = m^2\,\frac{\overline{t^2} - \overline{t}^2}{\overline{y^2} - \overline{y}^2} = \frac{121}{100}\cdot\frac{\frac{15}{2} - \frac{100}{4^2}}{\frac{21}{4} - \frac{49}{4^2}} = \frac{121}{100}\cdot\frac{20}{35} = \frac{121}{175} \approx 69.1\%$$
The minimized square error is also easily computed:
$$\sum e_i^2 = \sum (\hat y_i - y_i)^2 = (3.4 - 4)^2 + (2.3 - 1)^2 + (1.2 - 2)^2 + (0.1 - 0)^2 = 2.7$$
Reversion to the Mean & Correlation. By $(*)$, the regression model may be re-written in terms of the standard-deviation and $R^2$:
$$\hat y(t) = mt + c = \overline{y} + m(t - \overline{t}) = \overline{y} \pm \sqrt{R^2}\,\frac{\sigma_y}{\sigma_t}(t - \overline{t}) \implies \hat y(\overline{t} + \lambda\sigma_t) = \overline{y} \pm \lambda\sqrt{R^2}\,\sigma_y$$

Definition 5.10. The correlation coefficient is the value $r := \pm\sqrt{R^2}$ (sign equal to that of $m$).

An input $\lambda$ standard-deviations above the mean ($t = \overline{t} + \lambda\sigma_t$) results in a prediction $\lambda r$ standard-deviations above the mean ($\hat y = \overline{y} + \lambda r\sigma_y$). Unless the data is perfectly linear, we have $R^2 < 1$; relative to the ‘neutral’ measure given by the standard-deviation, a prediction $\hat y(t)$ is closer to the mean than the input $t$:
$$\frac{|\hat y(t) - \overline{y}|}{\sigma_y} = |r|\,\frac{|t - \overline{t}|}{\sigma_t} < \frac{|t - \overline{t}|}{\sigma_t}$$
Example (5.9, cont). We compute the details. The correlation coefficient is $r = -\sqrt{R^2} \approx -0.832$; we say that the data is negatively correlated, since the output $y$ seems to decrease as $t$ increases. The standard deviations may be read off from the table:
$$\sigma_t = \sqrt{\operatorname{Var} t} = \sqrt{\overline{t^2} - \overline{t}^2} = \frac{\sqrt{5}}{2} \approx 1.118, \qquad \sigma_y = \sqrt{\operatorname{Var} y} = \sqrt{\overline{y^2} - \overline{y}^2} = \frac{\sqrt{35}}{4} \approx 1.479$$
The predictor may therefore be written (approximately)
$$\hat y(\overline{t} + \lambda\sigma_t) = \hat y(2.5 + 1.12\lambda) = \overline{y} + \lambda r\sigma_y = 1.75 - 1.23\lambda$$
As a sanity check,
$$\hat y(2.5 + 1.12) = \hat y(3.62) = -1.1 \times 3.62 + 4.5 \approx 0.52 = 1.75 - 1.23$$
Weaknesses of Linear Regression. There are two obvious issues:
• Outliers massively influence the regression line. Dealing with this problem is complicated and there are a variety of approaches that can be used. It is important to remember that any approach to modelling, including our regression model, requires some subjective choice.
• If the data is not very linear then the regression model will produce a weak predictor. There are several ways around this, as we’ll see in the remaining sections: higher-degree polynomial regression can be performed, and data sometimes becomes more linear after some manipulation, say by an exponential or logarithmic function.
Exercises 5.2.
1. Suppose $(z_i) = (2, 4, 10, 8)$ is double the data set in Example 5.6. Find $\overline{z}$, $\operatorname{Var} z$ and $\sigma_z$. Why are you not surprised?
2. Use a spreadsheet to find $R^2$ for the predictor in Exercise 5.1.4. How confident do you feel in your prediction?
3. Find the standard deviations and correlation coefficients for the data in Examples 5.1 and 5.4.
4. The adult heights of men and women in a given population satisfy the following:
   Men: average 69.5 in, $\sigma = 3.2$ in. Women: average 63.7 in, $\sigma = 2.5$ in.
   The height of a father and his adult daughter have correlation coefficient 0.35. If a father’s height is 72 in (mother’s height unknown), how tall do you expect their daughter to be?
5. Suppose $R^2$ is the coefficient of determination for a linear regression model $\hat y = mt + c$. Use one of the alternative expressions for $R^2$ (equation $(*)$) to find the coefficient of determination for the reversed predictor $\hat t(y)$. Are you surprised?
6. Suppose that a data set $\{(t_i, y_i)\}_{1 \leq i \leq n}$ has at least two distinct $t$- and $y$-values (some $t_i \neq t_j$, etc.), that it has regression line $\hat y = mt + c$ and coefficient of determination $R^2$.
   (a) Show that $R^2 = 0 \iff m = 0$.
   (b) (Hard) Prove that the sum of squared errors equals $S = \sum_{i=1}^n e_i^2 = n(\operatorname{Var} y - \operatorname{Var}\hat y)$.
   (c) Obtain the alternative expression $R^2 = 1 - \frac{S}{n \operatorname{Var} y}$. Hence conclude that $R^2 \leq 1$, with equality if and only if the original data set is perfectly linear.
5.3 Matrix Multiplication & Polynomial Regression
In this section we consider how to find a best-fitting least-squares polynomial for given data. To see how to do this, it helps to rephrase the linear approach using matrices.¹³
We start by observing that the system of equations in Theorem 5.3 can be written as a $2 \times 2$ matrix problem. For a data set with $n$ pairs, the coefficients $m, c$ satisfy
$$\begin{pmatrix} \sum t_i^2 & \sum t_i \\ \sum t_i & n \end{pmatrix}\begin{pmatrix} m \\ c \end{pmatrix} = \begin{pmatrix} \sum t_i y_i \\ \sum y_i \end{pmatrix}$$
This is nice because we can decompose the square matrix on the left as the product of a simple $2 \times n$ matrix and its transpose (switch the rows and columns):
$$\begin{pmatrix} \sum t_i^2 & \sum t_i \\ \sum t_i & n \end{pmatrix} = \begin{pmatrix} t_1 & t_2 & \cdots & t_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}\begin{pmatrix} t_1 & 1 \\ t_2 & 1 \\ \vdots & \vdots \\ t_n & 1 \end{pmatrix} =: P^T P$$
We can also view the right side as the product of $P^T$ and the column vector of output values $y_i$:
$$\begin{pmatrix} \sum t_i y_i \\ \sum y_i \end{pmatrix} = \begin{pmatrix} t_1 & t_2 & \cdots & t_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =: P^T \mathbf{y}$$
A little theory tells us that if at least two of the $t_i$ are distinct, then the $2 \times 2$ matrix $P^T P$ is invertible;¹⁴ there is a unique regression line whose coefficients may be found by taking the matrix inverse:
$$\begin{pmatrix} m \\ c \end{pmatrix} = (P^T P)^{-1} P^T \mathbf{y} \implies \hat y = mt + c = \begin{pmatrix} t & 1 \end{pmatrix}\begin{pmatrix} m \\ c \end{pmatrix} = \begin{pmatrix} t & 1 \end{pmatrix}(P^T P)^{-1} P^T \mathbf{y}$$
We can also easily compute the vector of predicted values $\hat y_i = \hat y(t_i)$:
$$\hat{\mathbf{y}} = \begin{pmatrix} t_1 & 1 \\ t_2 & 1 \\ \vdots & \vdots \\ t_n & 1 \end{pmatrix}\begin{pmatrix} m \\ c \end{pmatrix} = P(P^T P)^{-1} P^T \mathbf{y}$$
and the squared error $\sum e_i^2 = \sum |\hat y_i - y_i|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2$, which leads to an alternative expression for the coefficient of determination
$$R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - n\overline{y}^2}{\|\mathbf{y}\|^2 - n\overline{y}^2}$$
where $\|\mathbf{y}\|$ is the length of a vector.
¹³ Matrix computations are non-examinable. The purpose of this section is to see how the regression may easily be automated and generalized by computer and to understand a little of how a spreadsheet calculates best-fitting curves of different types.
¹⁴ For those who’ve studied linear algebra, $P$ and $P^T P$ have the same null space and thus rank, since
$$P\mathbf{x} = \mathbf{0} \implies P^T P\mathbf{x} = \mathbf{0} \quad\text{and}\quad P^T P\mathbf{x} = \mathbf{0} \implies \mathbf{x}^T P^T P\mathbf{x} = 0 \implies \|P\mathbf{x}\| = 0 \implies P\mathbf{x} = \mathbf{0}$$
For linear regression, having at least two distinct $t_i$ values means $\operatorname{rank} P = 2$, whence $P^T P$ is invertible.
Examples 5.11. 1. We revisit Example 5.9 in this language.
$$P = \begin{pmatrix} t_1 & 1 \\ t_2 & 1 \\ \vdots & \vdots \\ t_n & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix} \implies P^T P = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix} = \begin{pmatrix} 30 & 10 \\ 10 & 4 \end{pmatrix}$$
from which
$$\begin{pmatrix} m \\ c \end{pmatrix} = (P^T P)^{-1} P^T \mathbf{y} = \begin{pmatrix} 30 & 10 \\ 10 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 4 \\ 1 \\ 2 \\ 0 \end{pmatrix} = \frac{1}{30 \cdot 4 - 10^2}\begin{pmatrix} 4 & -10 \\ -10 & 30 \end{pmatrix}\begin{pmatrix} 12 \\ 7 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 48 - 70 \\ -120 + 210 \end{pmatrix} = \frac{1}{10}\begin{pmatrix} -11 \\ 45 \end{pmatrix}$$
The prediction vector given inputs $t_i$ is therefore
$$\hat{\mathbf{y}} = P\begin{pmatrix} m \\ c \end{pmatrix} = \frac{1}{10}\begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} -11 \\ 45 \end{pmatrix} = \frac{1}{10}\begin{pmatrix} 34 \\ 23 \\ 12 \\ 1 \end{pmatrix}$$
from which the coefficient of determination is, as before,
$$R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - 4\overline{y}^2}{\|\mathbf{y}\|^2 - 4\overline{y}^2} = \frac{\frac{1}{100}(34^2 + 23^2 + 12^2 + 1^2) - 4 \cdot \frac{7^2}{4^2}}{(4^2 + 1^2 + 2^2 + 0^2) - 4 \cdot \frac{7^2}{4^2}} = \frac{121}{175}$$
2. Given the data set $\{(3, 1), (3, 5), (3, 6)\}$, we have $P = \begin{pmatrix} 3 & 1 \\ 3 & 1 \\ 3 & 1 \end{pmatrix}$ and $P^T P = \begin{pmatrix} 27 & 9 \\ 9 & 3 \end{pmatrix}$, which isn’t invertible: $27 \cdot 3 - 9 \cdot 9 = 0$. The linear regression method doesn’t work!
It is easy to understand this from the picture. Since the three data points are vertically aligned, any line minimizing the sum of the squared errors must pass through the average $(3, 4)$, though it could have any slope!
This illustrates our fundamental assumption: linear regression requires at least two distinct $t$-values.

[Figure: the three vertically aligned data points at $t = 3$, with several candidate lines through $(3, 4)$.]
It is unnecessary ever to use the matrix approach for linear regression, though the method has significant advantages.
• Computers store and manipulate data in matrix format, so this method is computer-ready (see the sketch after this list).
• Suppose you repeat an experiment several times, taking measurements $y_i$ at times $t_i$. Since $P$ depends only on the $t$-data, you need only compute the matrix $(P^T P)^{-1} P^T$ once, making computation of the regression line for repeat experiments very efficient.
• The method generalizes (easily for computers!) to polynomial regression. . .
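Here is a minimal sketch of the matrix formulation using Python’s numpy library, run on the data of Example 5.9; we solve the normal equations $(P^T P)\binom{m}{c} = P^T\mathbf{y}$ rather than explicitly inverting $P^T P$, which is the usual numerical practice:

    import numpy as np

    t = np.array([1, 2, 3, 4])
    y = np.array([4, 1, 2, 0])

    # Design matrix P: a column of t-values and a column of ones
    P = np.column_stack([t, np.ones_like(t)])

    # Solve (P^T P) [m, c]^T = P^T y
    m, c = np.linalg.solve(P.T @ P, P.T @ y)
    print(m, c)                 # expect -1.1 and 4.5

    y_hat = P @ np.array([m, c])
    print(y_hat)                # expect [3.4, 2.3, 1.2, 0.1]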
Polynomial Regression
The pattern is almost identical when we use matrices; you just need to make the matrix $P$ a little larger. . . We work through the approach for a quadratic approximation.
Suppose we have a data set $\{(t_i, y_i) : 1 \leq i \leq n\}$ and that we desire a quadratic polynomial predictor $\hat y = at^2 + bt + c$ which minimizes the sum of the squared vertical errors
$$S(a, b, c) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \bigl(at_i^2 + bt_i + c - y_i\bigr)^2$$
This might look terrifying, but can be attacked exactly as before using differentiation: to minimize $S$, we need the derivatives of $S$ with respect to the coefficients $a, b, c$ to be zero.
$$\frac{\partial S}{\partial a} = 2\sum\bigl(at_i^4 + bt_i^3 + ct_i^2 - t_i^2 y_i\bigr) = 0$$
$$\frac{\partial S}{\partial b} = 2\sum\bigl(at_i^3 + bt_i^2 + ct_i - t_i y_i\bigr) = 0$$
$$\frac{\partial S}{\partial c} = 2\sum\bigl(at_i^2 + bt_i + c - y_i\bigr) = 0$$
$$\iff \begin{cases} a\sum t_i^4 + b\sum t_i^3 + c\sum t_i^2 = \sum t_i^2 y_i \\ a\sum t_i^3 + b\sum t_i^2 + c\sum t_i = \sum t_i y_i \\ a\sum t_i^2 + b\sum t_i + cn = \sum y_i \end{cases}$$
As a system of equations for $a, b, c$ this looks fairly nasty, but by rephrasing in terms of matrices, we see that it is exactly the same problem as before!
$$\begin{pmatrix} \sum t_i^4 & \sum t_i^3 & \sum t_i^2 \\ \sum t_i^3 & \sum t_i^2 & \sum t_i \\ \sum t_i^2 & \sum t_i & n \end{pmatrix}\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \sum t_i^2 y_i \\ \sum t_i y_i \\ \sum y_i \end{pmatrix}$$
corresponds to
$$P^T P\begin{pmatrix} a \\ b \\ c \end{pmatrix} = P^T \mathbf{y} \quad\text{where}\quad P = \begin{pmatrix} t_1^2 & t_1 & 1 \\ \vdots & \vdots & \vdots \\ t_n^2 & t_n & 1 \end{pmatrix} \quad\text{and}\quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
The only change is that $P$ is now an $n \times 3$ matrix so that $P^T P$ is $3 \times 3$. Analogous to the linear situation, provided at least three of the $t_i$ are distinct, the matrix $P^T P$ is invertible and there is a unique least-squares quadratic minimizer
$$\hat y = at^2 + bt + c = \begin{pmatrix} t^2 & t & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} t^2 & t & 1 \end{pmatrix}(P^T P)^{-1} P^T \mathbf{y}$$
The predictions $\hat y_i = \hat y(t_i)$ therefore form a vector $\hat{\mathbf{y}} = P\begin{pmatrix} a \\ b \\ c \end{pmatrix} = P(P^T P)^{-1} P^T \mathbf{y}$, and the coefficient of determination may be computed as before:
$$R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - n\overline{y}^2}{\|\mathbf{y}\|^2 - n\overline{y}^2}$$
The method generalizes in the obvious way: if you want a cubic minimizer, give P an extra column
of cubed t
i
-terms! This would be hard work by hand, but is standard fodder for computers: this isn’t
a linear algebra class, so don’t try to invert a 3 ×3 matrix!
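A minimal sketch, in Python with numpy, of how a computer might carry this out for any degree (the function name poly_fit is ours; np.vander builds the matrix $P$ with columns $t^{\text{degree}}, \ldots, t, 1$):

    import numpy as np

    def poly_fit(t, y, degree):
        """Least-squares polynomial of the given degree via the normal equations."""
        t = np.asarray(t, dtype=float)
        y = np.asarray(y, dtype=float)
        P = np.vander(t, degree + 1)              # columns: t^degree, ..., t, 1
        coeffs = np.linalg.solve(P.T @ P, P.T @ y)
        y_hat = P @ coeffs
        n, y_bar = len(y), y.mean()
        r2 = (y_hat @ y_hat - n * y_bar**2) / (y @ y - n * y_bar**2)
        return coeffs, r2

    # degree = 1 recovers Example 5.9: coefficients roughly [-1.1, 4.5], R^2 roughly 0.691
    print(poly_fit([1, 2, 3, 4], [4, 1, 2, 0], 1))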
Example 5.12. We are given data $\{(t_i, y_i)\} = \{(1, 2), (2, 5), (3, 7), (4, 4)\}$.
1. For the best-fitting linear model, we use the same $P$ (and thus $P^T P$) from the previous example:
$$\begin{pmatrix} m \\ c \end{pmatrix} = (P^T P)^{-1} P^T \mathbf{y} = \begin{pmatrix} 30 & 10 \\ 10 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 2 \\ 5 \\ 7 \\ 4 \end{pmatrix} = \frac{1}{10}\begin{pmatrix} 2 & -5 \\ -5 & 15 \end{pmatrix}\begin{pmatrix} 49 \\ 18 \end{pmatrix} = \begin{pmatrix} 0.8 \\ 2.5 \end{pmatrix}$$
which yields $\hat y(t) = 0.8t + 2.5$. The predicted values and coefficient of determination are then
$$\hat{\mathbf{y}} = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} 0.8 \\ 2.5 \end{pmatrix} = \begin{pmatrix} 3.3 \\ 4.1 \\ 4.9 \\ 5.7 \end{pmatrix}, \qquad R^2 = \frac{84.2 - 81}{94 - 81} \approx 0.2462$$
The linear model explains only 24.6% of the variance in the output; not very accurate.
2. For a quadratic model, all that changes is the matrix $P$:
$$P = \begin{pmatrix} 1 & 1 & 1 \\ 4 & 2 & 1 \\ 9 & 3 & 1 \\ 16 & 4 & 1 \end{pmatrix} \implies P^T P = \begin{pmatrix} 1 & 4 & 9 & 16 \\ 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 4 & 2 & 1 \\ 9 & 3 & 1 \\ 16 & 4 & 1 \end{pmatrix} = \begin{pmatrix} 354 & 100 & 30 \\ 100 & 30 & 10 \\ 30 & 10 & 4 \end{pmatrix}$$
$$\implies \begin{pmatrix} a \\ b \\ c \end{pmatrix} = (P^T P)^{-1} P^T\begin{pmatrix} 2 \\ 5 \\ 7 \\ 4 \end{pmatrix} = \begin{pmatrix} 354 & 100 & 30 \\ 100 & 30 & 10 \\ 30 & 10 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 149 \\ 49 \\ 18 \end{pmatrix} = \begin{pmatrix} -1.5 \\ 8.3 \\ -5 \end{pmatrix}$$
from which $\hat y = -1.5t^2 + 8.3t - 5$. To quantify its accuracy, compute the vector of predicted values $\hat y_i = \hat y(t_i)$ and the coefficient of determination:
$$\hat{\mathbf{y}} = P\begin{pmatrix} -1.5 \\ 8.3 \\ -5 \end{pmatrix} = \begin{pmatrix} 1.8 \\ 5.6 \\ 6.4 \\ 4.2 \end{pmatrix}, \qquad R^2 = \frac{\|\hat{\mathbf{y}}\|^2 - 4\overline{y}^2}{\|\mathbf{y}\|^2 - 4\overline{y}^2} = \frac{93.2 - 81}{94 - 81} \approx 0.9385$$
The quadratic model is far superior to the linear, explaining 94% of the observed variance.
3. We can even find a cubic model ($P$ is a $4 \times 4$ matrix!):
$$\hat y = \frac{1}{6}\bigl(-4t^3 + 21t^2 - 17t + 12\bigr)$$
The cubic passes through all four data points, there is no error and $R^2 = 1$.

[Figure: the data with the linear, quadratic and cubic models.]

For real-world data this is possibly less useful than the quadratic model—it certainly takes longer to find! More importantly, likely experimental error in the $y$-data has a strong effect on the ‘perfect’ model—we are, in effect, modelling noise. Do you expect $y(5)$ to be closer to $-1$ or $-8$?
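If you have numpy available, the three models above can be reproduced in one line each with its built-in least-squares polynomial fitter (coefficients are returned highest power first); a minimal sketch:

    import numpy as np

    t = np.array([1, 2, 3, 4])
    y = np.array([2, 5, 7, 4])

    print(np.polyfit(t, y, 1))   # expect roughly [ 0.8,  2.5 ]
    print(np.polyfit(t, y, 2))   # expect roughly [-1.5,  8.3, -5. ]
    print(np.polyfit(t, y, 3))   # interpolates all four points: (1/6)(-4, 21, -17, 12)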
Exercises 5.3.
1. Recall Example 4.2, with the following almost linear data set.

       x : 0   2   4   6   8   10
       y : 3  23  41  59  77  93

   Find the best-fitting straight line for the data, then use a spreadsheet to find the best-fitting quadratic. Is the extra effort worth it?
2. You are given the following data consisting of measurements from an experiment recorded at times $t_i$ seconds.

       t_i : 1  2  3  4  5  6  7  8  9  10
       y_i : 7  5  3  2  3  5  6  9  8  12

   (a) Given the values $\sum t_i = 55$, $\sum t_i^2 = 385$, $\sum y_i = 60$, $\sum t_i y_i = 385$, find the best-fitting least-squares linear model for this data, and use it to predict $\hat y(13)$.
   (b) Find the best-fitting quadratic model for the data: feel free to use a spreadsheet!
   (c) The graphs below show the best-fitting least-squares linear, quadratic, cubic, quartic, and ninth-degree models and their coefficients of determination.

   [Figure: the five fitted models plotted against the data.]

       Degree   R²
       1        0.4264
       2        0.8830
       3        0.9319
       4        0.9336
       ⋮        ⋮
       9        1

   Which of these models would you choose for this data and why? What considerations would you take into account?
5.4 Exponential & Power Regression Models
If you suspect that your data would be better modelled by a non-polynomial function, there are
several things you can try.
Minimizing the sum of squared-errors might be very difficult for non-polynomial functions because
there is likely no simple tie-in with linear equations/algebra. Attempting this is likely to result in
a horrible non-linear system for your coefficients which is difficult to analyze either theoretically or
using a computer.
15
Log Plots. The most common approach when trying to fit an exponential model $\hat y = e^{mt+c}$ to data is to use a log plot: taking logarithms of both sides results in
$$\ln \hat y = mt + c$$
If we take $\hat Y := \ln \hat y$ as a new variable, the model is now a straight line! The idea is then to use linear regression to find the coefficients $m, c$ (equivalently $m$ and $\ln a$, if the model is written $\hat y = ae^{mt}$ with $a = e^c$).
Example (4.4, cont). Recall our earlier rabbit-population $P(t)$, repeated in the table below. We previously considered modelling this with an exponential function for two reasons:
1. We were told it was population data!
2. The $t$-differences are constant (2), while the $P$-ratios are approximately so ($\approx 1.41$).

       t_i      0     2     4     6     8     10
       P_i      5     7     10    14    19    28
       ln P_i   1.61  1.95  2.30  2.64  2.94  3.33

After constructing a log-plot, the relationship is much clearer:

[Figure: the data $(t, P)$, and the log-plot $(t, \ln P)$ which appears approximately linear.]

Since the relationship between $t$ and $\ln P$ appears linear, we perform a linear regression calculation to find the best-fitting least-squares line for the $(t_i, \ln P_i)$ data.
¹⁵ As an example of how horrific this is, suppose you want to minimize the sum of square-errors for data $(t_i, y_i)$ using an exponential model $\hat y(t) = ae^{kt}$. The coefficients of our model, $a, k$, should minimize
$$S(a, k) = \sum_{i=1}^n \bigl(ae^{kt_i} - y_i\bigr)^2$$
Differentiating this with respect to $a, k$ and setting equal to zero results in
$$\begin{cases} \frac{\partial S}{\partial a} = 2\sum e^{kt_i}\bigl(ae^{kt_i} - y_i\bigr) = 0 \\ \frac{\partial S}{\partial k} = 2a\sum t_i e^{kt_i}\bigl(ae^{kt_i} - y_i\bigr) = 0 \end{cases} \implies \Bigl(\sum y_i e^{kt_i}\Bigr)\Bigl(\sum t_i e^{2kt_i}\Bigr) = \Bigl(\sum e^{2kt_i}\Bigr)\Bigl(\sum t_i y_i e^{kt_i}\Bigr)$$
where we substituted for $a$ to obtain the last equation. Remember that this is an equation for $k$; if you think you can solve this easily, think again!
Everything necessary comes from extending the table.

       Data                                              Average
       t_i          0     2     4     6      8      10      5
       P_i          5     7     10    14     19     28      13.83
       ln P_i       1.61  1.95  2.30  2.64   2.94   3.33    2.46
       t_i^2        0     4     16    36     64     100     36.67
       t_i ln P_i   0     3.89  9.21  15.83  23.56  33.32   14.30

$$m = \frac{\overline{t \ln P} - \overline{t}\cdot\overline{\ln P}}{\overline{t^2} - \overline{t}^2} = \frac{14.30 - 5 \cdot 2.46}{36.67 - 5^2} \approx 0.171, \qquad c = \overline{\ln P} - m\,\overline{t} = 2.46 - 0.171 \cdot 5 \approx 1.609$$

which yields the exponential model
$$\hat P(t) = e^{0.171t + 1.609} = 4.998(1.186)^t$$

[Figure: the data with the model $\hat P(t)$, and the log-plot with its regression line.]

This is very close to the model ($5(1.188)^t$) we obtained previously by pure guesswork. The approximate doubling time $T$ for the population satisfies
$$e^{mT} = 2 \implies T = \frac{\ln 2}{m} \approx 4.06 \text{ months}$$
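A minimal sketch of this log-plot fit in Python with numpy (np.polyfit performs the linear regression on $(t, \ln P)$; the data is from the example above):

    import numpy as np

    t = np.array([0, 2, 4, 6, 8, 10])
    P = np.array([5, 7, 10, 14, 19, 28])

    # Fit a straight line to (t, ln P), then exponentiate the intercept
    m, c = np.polyfit(t, np.log(P), 1)
    print(m, c)                   # expect roughly 0.171 and 1.609
    print(np.exp(c), np.exp(m))   # expect roughly 4.998 and 1.186
    print(np.log(2) / m)          # doubling time: roughly 4.06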
When using the log plot method, interpreting errors and the goodness of fit of a model is a little more difficult. Typically one computes the coefficient of determination $R^2$ of the underlying linear model: in our example,¹⁶
$$R^2 = m^2\,\frac{\operatorname{Var} t}{\operatorname{Var} \ln P} = 99.3\%$$
It is important to appreciate that the log plot method does not treat all errors equally: taking logarithms tends to reduce error by a greater amount when the output $y$ is large. This should be clear from the picture, and more formally by the mean value theorem: if $y_1 < y_2$, then there is some $\xi \in (y_1, y_2)$ for which
$$\ln y_2 - \ln y_1 = \frac{1}{\xi}(y_2 - y_1) < \frac{1}{y_1}(y_2 - y_1)$$

[Figure: the graph of $\ln y$; the same difference in $y$ produces a smaller difference in $\ln y$ when $y$ is large.]

The log plot approach therefore places a higher emphasis on accurately matching data when the output $y$ is small. This isn’t such a bad thing since our intuitive view of error depends on the size of the data. For instance, misplacing a $100 bill is annoying, but a $100 mistake in escrow when buying a house is unlikely to concern you very much! Exponential data can more easily vary over large orders of magnitude than linear or quadratic data.

¹⁶ This needs more decimal places of accuracy for the log-values than what’s in our table!
Log-Log Plots. If you suspect a power function model $\hat y = at^m$, then taking logarithms
$$\ln \hat y = m \ln t + \ln a$$
results in a linear relationship between $\ln y$ and $\ln t$. As before, we can apply a linear regression approach to find a model; the goodness of fit is again described by the coefficient of determination of the underlying model.
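A minimal sketch of a log-log fit in Python with numpy (the function name power_fit and the illustrative data are ours; it assumes all $t$ and $y$ values are positive):

    import numpy as np

    def power_fit(t, y):
        """Fit y-hat = a * t^m by linear regression on (ln t, ln y)."""
        m, ln_a = np.polyfit(np.log(t), np.log(y), 1)
        return np.exp(ln_a), m

    # Hypothetical data lying close to y = 3 t^2
    t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([3.1, 11.8, 27.5, 47.6, 76.0])
    print(power_fit(t, y))   # expect a close to 3 and m close to 2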
Exercises 5.4. 1. You suspect a logarithmic model for a data set. Describe how you would approach
finding a model in the context of this section.
2. The table shows the average weight and length of a fish species measured at different ages.

       Age (years)   Length (cm)   Weight (g)
       1              5.2            2
       2              8.5            8
       3             11.5           21
       4             14.3           38
       5             16.8           69
       6             19.2          117
       7             21.3          148
       8             23.3          190
       9             25.0          264
       10            26.7          293
       11            28.2          318
       12            29.6          371
       13            30.8          455
       14            32.0          504
       15            33.0          518
       16            34.0          537
       17            34.9          651
       18            36.4          719
       18            37.1          726
       20            37.7          810

   [Figure: scatter plot of weight $w$ against length $\ell$.]

   (a) Do you think an exponential model is a good fit for this data? Take logarithms of the weight values and use a spreadsheet to obtain a model $\hat w(\ell) = ae^{m\ell}$, where $w, \ell$ are the weight and length respectively.
   (b) What happens if you try a log-log plot? Given what we’re measuring, why do you expect a power model to be more accurate?
3. Population data for Long Beach CA is given. Using a spreadsheet or otherwise, find linear, quadratic, exponential and logarithmic regression models for this data.
   Which of these models seems to fit the data best, and which would you trust to best predict the population in 2020?
   Look up the population of Long Beach in 2020; does it confirm your suspicions? What do you think is going on?

       Year   Years since 1900   Population
       1900     0                    2,252
       1910    10                   17,809
       1920    20                   55,593
       1930    30                  142,032
       1940    40                  164,271
       1950    50                  250,767
       1960    60                  334,168
       1970    70                  358,879
       1980    80                  361,498
       1990    90                  429,433
       2000   100                  461,522
       2010   110                  462,257
4. In the early 1600s, Johannes Kepler used observational data to derive his laws of planetary motion,
the third of which relates the orbital period T of a planet (how long it takes to go round the sun)
to its (approximate) distance r from the sun.
Planet T (years) r (millions km)
Mercury 0.24 58
Venus 0.61 110
Earth 1 150
Mars 1.88 230
Jupiter 11.9 780
Saturn 29.5 1400
Uranus 84 2900
Neptune 165 4500
The table shows the data for all the planets. Use a spreadsheet to analyze this data and find a
model relating T to r.
Kepler did not know about Uranus and Neptune and only had relative distances for the planets. Research the correct statement of Kepler’s third law and compare it with your findings.