
Example (5.9, cont). We compute the details. The correlation coefficient is $r = -\sqrt{R^2} \approx -0.832$; we say that the data is negatively correlated, since the output $y$ seems to decrease as $t$ increases. The standard deviations may be read off from the table:
\[ \sigma_t = \sqrt{\mathrm{Var}\,t} = \sqrt{\overline{t^2} - \overline{t}^{\,2}} = \frac{\sqrt{5}}{2} \approx 1.118, \qquad \sigma_y = \sqrt{\mathrm{Var}\,y} = \sqrt{\overline{y^2} - \overline{y}^{\,2}} = \frac{\sqrt{35}}{4} \approx 1.479 \]
The predictor may therefore be written (approximately)
\[ \hat{y}(\overline{t} + \lambda\sigma_t) = \hat{y}(2.5 + 1.12\lambda) = \overline{y} + \lambda r \sigma_y = 1.75 - 1.23\lambda \]
As a sanity check,
\[ \hat{y}(2.5 + 1.12) = \hat{y}(3.62) = -1.1(3.62) + 4.5 = -3.98 + 4.5 = 0.52 = 1.75 - 1.23 \]
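The computations above can be verified numerically. The data of Example 5.9 is not reproduced in this excerpt, so the points below are a hypothetical data set chosen to be consistent with the quoted statistics ($\overline{t} = 2.5$, $\overline{y} = 1.75$, $\sigma_t \approx 1.118$, $r \approx -0.832$):

```python
# Hypothetical data consistent with the statistics quoted in Example 5.9
# (not taken from the original table).
t = [1, 2, 3, 4]
y = [4, 1, 2, 0]

def mean(xs):
    return sum(xs) / len(xs)

t_bar, y_bar = mean(t), mean(y)

# Var x = (mean of squares) - (square of the mean)
var_t = mean([ti**2 for ti in t]) - t_bar**2
var_y = mean([yi**2 for yi in y]) - y_bar**2
sigma_t, sigma_y = var_t**0.5, var_y**0.5

# covariance and correlation coefficient
cov = mean([ti * yi for ti, yi in zip(t, y)]) - t_bar * y_bar
r = cov / (sigma_t * sigma_y)

# regression line y-hat = m t + c, where m = r * sigma_y / sigma_t
m = r * sigma_y / sigma_t
c = y_bar - m * t_bar

print(round(sigma_t, 3), round(sigma_y, 3), round(r, 3))  # 1.118 1.479 -0.832
print(round(m, 2), round(c, 2))                           # -1.1 4.5
```

The last line recovers the regression line $\hat{y} = -1.1t + 4.5$ used in the sanity check.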
Weaknesses of Linear Regression
There are two obvious issues:
• Outliers massively influence the regression line. Dealing with this problem is complicated and
there are a variety of approaches that can be used. It is important to remember that any ap-
proach to modelling, including our regression model, requires some subjective choice.
• If the data is not very linear then the regression model will produce a weak predictor. There are
several ways around this as we’ll see in the remaining sections: higher-degree polynomial re-
gression can be performed, and data sometimes becomes more linear after some manipulation,
say by an exponential or logarithmic function.
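The first weakness is easy to demonstrate. The sketch below fits a least-squares line via $m = \mathrm{Cov}(t,y)/\mathrm{Var}\,t$ and $c = \overline{y} - m\overline{t}$, using a hypothetical perfectly linear data set, then refits after appending a single outlier:

```python
# Outlier sensitivity of least-squares regression (hypothetical data).
def fit_line(t, y):
    """Return (slope, intercept) via m = Cov(t,y)/Var t, c = y-bar - m*t-bar."""
    n = len(t)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    var_t = sum(ti**2 for ti in t) / n - t_bar**2
    cov = sum(ti * yi for ti, yi in zip(t, y)) / n - t_bar * y_bar
    m = cov / var_t
    return m, y_bar - m * t_bar

# Perfectly linear data: y = 2t + 1
t = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
m1, c1 = fit_line(t, y)
print(m1, c1)  # 2.0 1.0

# One wildly off point drags the whole line
m2, c2 = fit_line(t + [6], y + [40])
print(round(m2, 2), round(c2, 2))  # slope nearly triples, intercept goes negative
```

A single extra point changes every coefficient, which is why any treatment of outliers requires a subjective modelling choice.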
Exercises 5.2. 1. Suppose $(z_i) = (2, 4, 10, 8)$ is double the data set in Example 5.6. Find $\overline{z}$, $\mathrm{Var}\,z$ and $\sigma_z$. Why are you not surprised?
2. Use a spreadsheet to find $R^2$ for the predictor in Exercise 5.1.4. How confident do you feel in your prediction?
3. Find the standard deviations and correlation coefficients for the data in Examples 5.1 and 5.4.
4. The adult heights of men and women in a given population satisfy the following:
Men: average 69.5 in, σ = 3.2 in. Women: average 63.7 in, σ = 2.5 in.
The height of a father and his adult daughter have correlation coefficient 0.35. If a father’s
height is 72 in (mother’s height unknown), how tall do you expect their daughter to be?
5. Suppose $R^2$ is the coefficient of determination for a linear regression model $\hat{y} = mt + c$. Use one of the alternative expressions for $R^2$ (page 59) to find the coefficient of determination for the reversed predictor $\hat{t}(y)$. Are you surprised?
6. Suppose that a data set $\{(t_i, y_i)\}_{1 \le i \le n}$ has at least two distinct $t$- and $y$-values (some $t_i \neq t_j$, etc.), and that it has regression line $\hat{y} = mt + c$ and coefficient of determination $R^2$.
(a) Show that $R^2 = 0 \iff m = 0$.
(b) (Hard) Prove that the sum of squared errors equals $S = \sum_{i=1}^{n} e_i^2 = n(\mathrm{Var}\,y - \mathrm{Var}\,\hat{y})$.
(c) Obtain the alternative expression $R^2 = 1 - \dfrac{S}{n\,\mathrm{Var}\,y}$. Hence conclude that $R^2 \le 1$, with equality if and only if the original data set is perfectly linear.