Day 2:
In class Investigation of Methods for Finding Line of Least Squares
Katherine Huffman & Brooke Norman
Objectives:
To reiterate the importance of a scatter plot.
To understand what is meant by correlation.
To understand what is meant by regression.
To learn the effects of outliers.
To learn the method for finding the line of Least-Squares by hand.
To learn the method for finding the line of Least-Squares using a calculator.
Technology:
Calculator: TI-83+
The goal of this lesson is to give the students a basic understanding
of the method used to compute the line of least squares by hand.
The students will work with a small, random data set to find the line
of least squares and then check their work using a calculator.
The students will ultimately be responsible for understanding the
concepts of regression, correlation, and outlier as well as how to
compute the line of least squares by hand and verify their work on the
calculator.
Lesson:
Begin by having the class discuss the importance of using
scatter plots. Scatter plots are a useful way to look at data and
the relationships between different attributes. They can be very
helpful in determining if there are any
trends in the data set and what these trends may be.
After discussing scatter plots, begin introducing the concept of
regression and correlation.
Regression is a way to model a relationship between two
variables. There are several different types of regression; one
of the most commonly used is linear regression. A regression line
is a straight line that describes how a
response variable y changes as an explanatory variable x changes.
A simple formula for a least squares regression line is ŷ = a + bx, where
the slope is b = r(sy/sx) and the intercept is a = ȳ - bx̄. Here r is the
correlation, sx and sy are the standard deviations of the x-values and
y-values, and x̄ and ȳ are their means. We are using the
line of least squares, which is a common type of linear regression
model.
How do you know if a linear regression model is appropriate for your
data set? Have the class discuss attributes they would look for
in their data set to determine if they would use linear regression to
model the trend of their data. This should segue into a
discussion of correlation.
Correlation refers to the extent to which the data points are
clustered around a straight line. It measures the direction and
strength of the linear relationship between two quantitative
variables. The closer the data points are to being in
a line, the closer the correlation is to 1 or -1. A correlation of 1 would
mean that there is a perfect positive linear relationship, or a direct
relationship. A correlation of -1 would mean there is a perfect
negative linear relationship, or an inverse relationship. A
correlation of 0 would indicate no linear
relationship. It is important to know that correlation is
represented by 'r' and will never be greater than 1 or less than
-1.
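To make these three cases concrete, the following minimal Python sketch computes r directly from its definition. The three small data sets are made up for illustration and do not come from the lesson; the function name `correlation` is also just a label chosen here.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of squared deviations and cross-products about the means
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data sets, chosen only to illustrate the three cases
xs = [1, 2, 3, 4, 5]
r_positive = correlation(xs, [2, 4, 6, 8, 10])  # points on a rising line
r_negative = correlation(xs, [10, 8, 6, 4, 2])  # points on a falling line
r_none = correlation(xs, [3, 1, 4, 1, 3])       # no linear trend

print(r_positive, r_negative, r_none)
```

The first two values come out as 1 and -1 because the points lie exactly on a line; the third is 0 (up to rounding) because the y-values rise and fall symmetrically.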
It is important to know that a correlation coefficient does not speak
to the issue of cause and effect. There can be unseen
factors influencing the data which could imply a relationship that is
not truly present in the data set. Also, outliers can cause the
size of
a correlation coefficient to understate or exaggerate the strength of
the relationship between two variables.
An outlier is an observation that lies outside the overall pattern of
the other observations; it does not follow the trend of the other
points in the data set. Outliers have effects on many
statistical analyses. An outlier can be unusually small or unusually
large when compared with the other data. Outliers should be examined
closely. They are sometimes the result of a mistake in the data and
should be discarded if this is the case. If this is not the case,
and the outlier is a genuine result, it is important to include that
piece of data in your study.
For the above graph, the outlier is at
approximately (16, 19). It is important to understand that if an
outlier is extreme, that is, far away from the data trend, then it will
influence the mean as well as the line of least squares. The mean
will be skewed, or shifted, in the direction of the outlier, as will
the line of least squares. This shift occurs because both the mean and
the line of least squares are computed from every data point, and the
line of least squares works with squared distances, so a point far
from the rest carries a large weight.
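The size of this shift can be seen in a short Python sketch. The data here are hypothetical (five points lying exactly on y = x, plus one extreme point) and are not the data from the graph; `least_squares` is a helper name introduced only for this illustration.

```python
def least_squares(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Hypothetical data: five points exactly on y = x, plus one extreme outlier
base = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
with_outlier = base + [(6, 20)]

mean_y_base = sum(y for _, y in base) / len(base)
mean_y_outlier = sum(y for _, y in with_outlier) / len(with_outlier)
slope_base, intercept_base = least_squares(base)
slope_outlier, intercept_outlier = least_squares(with_outlier)

print(mean_y_base, mean_y_outlier)   # mean of y without and with the outlier
print(slope_base, slope_outlier)     # slope without and with the outlier
```

In this made-up example a single outlier raises the mean of the y-values from 3 to about 5.83 and triples the slope of the line, from 1 to 3.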
The line of least squares is the line that makes the sum of squared
residuals as small as possible. It can be found without the help
of software, even though it is quite tedious.
Now, using a small data set, we will compute the line of least squares
by hand. Our given data set is (-1, -2), (1, 3), (3, 5), (5, 11).
First we must find the mean of the x-values, call this X, and the mean
of the y-values, call this Y. These means are important because
the line of least squares always passes through the point (X, Y).
X = 2
Y = 4.25
Now, we have that
ŷ = mx + b is the equation for the line of least squares,
where m is the slope determined by m = ∑[(X - x)(Y - y)]/∑(X - x)²,
b is the y-intercept, and x and ŷ
represent the values that lie on the line of least squares.
(Here m plays the role of the slope and b the intercept, unlike the
form ŷ = a + bx used earlier.)
To help us better organize the needed values, we will fill in the table
given below. These values will help to determine the slope of the
line of least squares.
| x   | y  | X - x | Y - y | (X - x)² | (Y - y)² | (X - x)(Y - y) |
| -1  | -2 | 3     | 6.25  | 9        | 39.0625  | 18.75          |
| 1   | 3  | 1     | 1.25  | 1        | 1.5625   | 1.25           |
| 3   | 5  | -1    | -0.75 | 1        | 0.5625   | 0.75           |
| 5   | 11 | -3    | -6.75 | 9        | 45.5625  | 20.25          |
| Sum |    | 0     | 0     | 20       | 86.75    | 41             |
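For anyone who wants to double-check the table entries, the columns can be reproduced with a few lines of Python. This is only a sketch for checking arithmetic, not part of the by-hand method; the variable names are chosen here for illustration.

```python
points = [(-1, -2), (1, 3), (3, 5), (5, 11)]  # data set from the lesson

n = len(points)
x_mean = sum(x for x, _ in points) / n  # X = 2
y_mean = sum(y for _, y in points) / n  # Y = 4.25

rows = []
for x, y in points:
    dx = x_mean - x  # X - x
    dy = y_mean - y  # Y - y
    rows.append((x, y, dx, dy, dx ** 2, dy ** 2, dx * dy))

for row in rows:
    print(row)

# Column totals for the last five columns of the table
sums = [sum(row[i] for row in rows) for i in range(2, 7)]
print("Sum", sums)
```

The last line should match the Sum row: 0, 0, 20, 86.75, and 41.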
ŷ = mx + b
m = ∑[(X - x)(Y - y)]/∑(X - x)²
So for our data set we have that
m = 41/20 = 2.05.
Now we know that the line must pass through the point of means,
(X, Y) = (2, 4.25), therefore we can use this known point to find b.
So we have,
4.25 = (2.05)(2) + b
4.25 = 4.1 + b
.15 = b
So we have that ŷ = 2.05x + 0.15
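As an optional check on the idea that this line makes the sum of squared residuals as small as possible, the sketch below recomputes m and b and then compares the sum of squared residuals for that line against a few nearby lines. The step size of 0.1 and the helper name `sum_squared_residuals` are arbitrary choices made for this illustration.

```python
points = [(-1, -2), (1, 3), (3, 5), (5, 11)]  # data set from the lesson

def sum_squared_residuals(m, b):
    """Sum of squared vertical distances from the points to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

n = len(points)
x_mean = sum(x for x, _ in points) / n
y_mean = sum(y for _, y in points) / n
m = (sum((x_mean - x) * (y_mean - y) for x, y in points)
     / sum((x_mean - x) ** 2 for x, _ in points))
b = y_mean - m * x_mean  # the line passes through (X, Y)

best = sum_squared_residuals(m, b)
# Nudge slope and intercept by a small, arbitrary amount in each direction
nearby = [sum_squared_residuals(m + dm, b + db)
          for dm in (-0.1, 0, 0.1) for db in (-0.1, 0, 0.1)
          if (dm, db) != (0, 0)]

print(round(m, 2), round(b, 2), round(best, 2))
print(all(best < other for other in nearby))
```

Every nudged line gives a larger sum of squared residuals than the line found by hand, which is what "least squares" means. Checking eight nearby lines is an illustration, not a proof.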
Now let's check this equation on our calculator to ensure we did our
calculations correctly.
Begin by having the students create a scatter plot in their TI-83
calculators.
Now that the students have a scatter plot of the
data set, they will use the calculator to verify the line of
Least Squares they just found by hand.
STAT -> CALC
4: LinReg(ax+b)
Enter
The students will have the following screen displayed on their
calculators:
LinReg
y=ax+b
a=2.05
b=.15
The value a is the slope and the value b is the y-intercept. Both
of these values are the same as the ones we found when we computed the
line of least squares by hand. Therefore we can conclude that our
calculations were correct.
It is important to point out that computers and calculators
are so commonly used to determine the line of least squares because
most analysts work with large sets of data, and the process for
computing the line of least squares by hand is extremely tedious and
time-consuming.