Day 2:
In-Class Investigation of Methods for Finding the Line of Least Squares

Katherine Huffman & Brooke Norman

Objectives:
To reiterate the importance of a scatter plot.
To understand what is meant by correlation.
To understand what is meant by regression.
To learn the effects of outliers.
To learn the method for finding the line of Least-Squares by hand.
To learn the method for finding the line of Least-Squares using a calculator.

Technology:
Calculator:  TI-83+

The goal of this lesson is to give the students a basic understanding of the method used to compute the line of least squares by hand.  The students will work with a small, random data set to find the line of least squares and then check their work using a calculator.  The students will ultimately be responsible for understanding the concepts of regression, correlation, and outlier as well as how to compute the line of least squares by hand and verify their work on the calculator.

Lesson:

Begin by having the class discuss the importance of using scatter plots.  Scatter plots are a useful way to look at data and the relationships between different attributes.  They can be very helpful in determining if there are any trends in the data set and what these trends may be.

After discussing scatter plots, begin introducing the concepts of regression and correlation.

Regression is a way to model a relationship between two variables.  There are several different types of regression; one of the most commonly used is linear regression.  A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.  A simple formula for a least squares regression line is ŷ = a + bx, where the slope is b = r(sy/sx) and the intercept is a = ȳ - bx̄.  Here r is the correlation, sx and sy are the standard deviations of the x-values and y-values, and x̄ and ȳ are their means.  We are using the line of least squares, which is a common type of linear regression model.
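The slope and intercept formulas can be tried out in a few lines of Python.  This is only a sketch: the function name regression_from_r is made up for illustration, the data set is the one worked by hand later in this lesson, and the standard deviations are assumed to be sample standard deviations (dividing by n - 1).

```python
from statistics import mean, stdev

def regression_from_r(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r*(sy/sx) and a = y-bar - b*x-bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    # r is the sum of products of standardized x and y values, divided by n - 1
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * (s_y / s_x)    # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

# Data set used later in this lesson
xs = [-1, 1, 3, 5]
ys = [-2, 3, 5, 11]
a, b = regression_from_r(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = 0.15 + 2.05x
```

The result matches the line found by hand and on the calculator in the rest of the lesson.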

How do you know if a linear regression model is appropriate for your data set?  Have the class discuss attributes they would look for in their data set to determine if they would use linear regression to model the trend of their data.  This should segue into a discussion of correlation.

Correlation measures the direction and strength of the linear relationship between two quantitative variables.  The closer the data points are to lying on a straight line, the closer the correlation is to 1 or -1.  A correlation of 1 means there is a perfect positive linear relationship, or a direct relationship.  A correlation of -1 means there is a perfect negative linear relationship, often loosely called an inverse relationship.  A correlation of 0 indicates no linear relationship.  It is important to know that correlation is represented by 'r' and will never be greater than 1 or less than -1.
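To let students see these values of r for themselves, a short Python sketch such as the one below could be used.  The function name pearson_r and the first three small data sets are invented for illustration; only the last data set comes from this lesson.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r from sums of products of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: points exactly on a rising line and on a falling line
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])
r_down = pearson_r(xs, [10 - 3 * x for x in xs])

# Hypothetical data on a curve (y = x squared): a clear pattern, but not a linear one
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

# Data set used later in this lesson
r_lesson = pearson_r([-1, 1, 3, 5], [-2, 3, 5, 11])

print(r_up, r_down, r_curve, round(r_lesson, 3))
```

The curved data set is a useful discussion point: r is 0 even though the points follow an obvious pattern, because r only measures linear relationships.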

It is important to know that a correlation coefficient does not speak to the issue of cause and effect.  There can be unseen factors influencing the data which could imply a relationship that is not truly present in the data set.  Also, outliers can cause the size of a correlation coefficient to understate or exaggerate the strength of the relationship between two variables.

An outlier is an observation that lies outside the overall pattern of the other observations; it does not follow the trend of the other points in the data set.  Outliers have effects on many statistical analyses.  An outlier can be unusually small or unusually large compared with the other data.  Outliers should be examined closely.  They are sometimes the result of a mistake in the data and should be discarded if this is the case.  If this is not the case, and the outlier is a genuine result, it is important to include that piece of data in your study.

For the above graph, the outlier is at approximately (16, 19).  It is important to understand that if an outlier is extreme, meaning it is far away from the trend of the data, then it will influence the mean as well as the line of least squares.  The mean will be skewed, or shifted, in the direction of the outlier, as will the line of least squares.  This shift occurs because both the mean and the line of least squares are computed from every data point, so a single extreme value pulls them toward itself.
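The pull of an outlier can be demonstrated numerically.  The sketch below does not use the data from the graph; it uses a made-up data set (five points exactly on y = x + 1, plus one invented outlier) and a helper named fit_line that is also just for illustration.

```python
def fit_line(points):
    """Return (slope, intercept, mean of y) for the least squares line through points."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    s_xx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean, y_mean

# Hypothetical data: five points exactly on y = x + 1, plus one made-up outlier
trend = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
outlier = (6, 20)

slope_without, intercept_without, mean_without = fit_line(trend)
slope_with, intercept_with, mean_with = fit_line(trend + [outlier])

print(f"without outlier: mean y = {mean_without:.2f}, "
      f"line y = {slope_without:.2f}x {intercept_without:+.2f}")
print(f"with outlier:    mean y = {mean_with:.2f}, "
      f"line y = {slope_with:.2f}x {intercept_with:+.2f}")
```

With the outlier included, both the mean of the y-values and the slope of the line move toward the outlier, which is the shift described above.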

The line of least squares is the line that makes the sum of squared residuals as small as possible.  It can be found without the help of software, even though it is quite tedious.
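One way to make "as small as possible" concrete is to compare the sum of squared residuals for several candidate lines.  The sketch below uses the data set from this lesson and the line with slope 2.05 and intercept 0.15 that the lesson arrives at; the other candidate lines and the function name are invented for comparison.

```python
def sum_squared_residuals(points, slope, intercept):
    """Add up the squared vertical distances between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

# Data set from this lesson
points = [(-1, -2), (1, 3), (3, 5), (5, 11)]

# Line found in this lesson
best = sum_squared_residuals(points, 2.05, 0.15)

# Hypothetical competing lines, given as (slope, intercept)
candidates = [(2.0, 0.15), (2.05, 0.5), (2.2, 0.0), (1.9, 0.4)]
others = [sum_squared_residuals(points, m, b) for m, b in candidates]

print(f"least squares line: {best:.4f}")
for (m, b), ssr in zip(candidates, others):
    print(f"slope {m}, intercept {b}: {ssr:.4f}")
```

Every competing line gives a larger total than the least squares line, and students can try their own slopes and intercepts to confirm this.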

Now, using a small data set, we will compute the line of least squares by hand.  Our given data set is (-1, -2), (1, 3), (3, 5), (5, 11).

First we must find the mean of the x-values, call this X, and the mean of the y-values, call this Y.  These means are important because the line of least squares always passes through the point (X, Y).

X = 2
Y = 4.25

Now, we have that ŷ = mx + b is the equation for the line of least squares, where m is the slope determined by m = ∑[(X - x)(Y - y)]/∑(X - x)^2, b is the y-intercept, and x and ŷ represent the values that lie on the line of least squares.

To help us better organize the needed values, we will fill in the table given below.  These values will help to determine the slope of the line of least squares.

  x     y    X - x   Y - y   (X - x)^2   (Y - y)^2   (X - x)(Y - y)
 -1    -2      3      6.25       9        39.0625        18.75
  1     3      1      1.25       1         1.5625         1.25
  3     5     -1     -0.75       1         0.5625         0.75
  5    11     -3     -6.75       9        45.5625        20.25
 Sum           0      0         20        86.75          41
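Students who want to check their table entries could generate the same columns with a short script.  This is a sketch only; the variable names are chosen to mirror the column headings and are not part of the lesson.

```python
# Data set from this lesson
points = [(-1, -2), (1, 3), (3, 5), (5, 11)]
n = len(points)
X = sum(x for x, _ in points) / n  # mean of the x-values
Y = sum(y for _, y in points) / n  # mean of the y-values

header = ["x", "y", "X - x", "Y - y", "(X - x)^2", "(Y - y)^2", "(X - x)(Y - y)"]
rows = []
for x, y in points:
    dx, dy = X - x, Y - y
    rows.append([x, y, dx, dy, dx ** 2, dy ** 2, dx * dy])

# Column sums for the five deviation columns (the x and y columns are not summed)
sums = [sum(row[i] for row in rows) for i in range(2, 7)]

print("  ".join(f"{name:>14}" for name in header))
for row in rows:
    print("  ".join(f"{value:>14g}" for value in row))
print("  ".join(f"{value:>14}" for value in ["Sum", ""] + [f"{s:g}" for s in sums]))
```

The two sums needed for the slope are the last one, ∑(X - x)(Y - y), and the third one, ∑(X - x)^2.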

ŷ = mx + b

m = ∑[(X - x)(Y - y)]/∑(X - x)^2

So for our data set we have that
m = 41/20 = 2.05.

Now we know that the line must pass through the point of means, (X, Y) = (2, 4.25), so we can use this known point to find b.  So we have,
4.25 = (2.05)(2) + b
4.25 = 4.1 + b
0.15 = b

So we have that ŷ = 2.05x + 0.15
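The hand computation can also be mirrored step by step in Python.  The sketch below is one possible version: the function name least_squares_line is made up, and it assumes the data points have integer coordinates so that exact fractions can be used in place of rounded decimals.

```python
from fractions import Fraction

def least_squares_line(points):
    """Return (m, b) for y-hat = m*x + b, following the by-hand method."""
    n = len(points)
    x_mean = Fraction(sum(x for x, _ in points), n)  # X in the lesson
    y_mean = Fraction(sum(y for _, y in points), n)  # Y in the lesson
    s_xy = sum((x_mean - x) * (y_mean - y) for x, y in points)
    s_xx = sum((x_mean - x) ** 2 for x, _ in points)
    m = s_xy / s_xx          # slope
    b = y_mean - m * x_mean  # the line passes through (X, Y)
    return m, b

m, b = least_squares_line([(-1, -2), (1, 3), (3, 5), (5, 11)])
print(m, b)                # 41/20 3/20
print(float(m), float(b))  # 2.05 0.15
```

Working in fractions keeps the intermediate values exact, so the slope appears as 41/20, the same quotient found from the table.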

Now let's check this equation on our calculator to ensure we did our calculations correctly.

Begin by having the students create a scatter plot in their TI-83 calculators.

Now that the students have a scatter plot of the data set, they will use the calculator to verify the line of Least Squares they just found by hand.
STAT -> CALC
4: LinReg(ax+b)
Enter

The students will have the following screen displayed on their calculators:
LinReg
y=ax+b
a=2.05
b=.15

The value a is the slope and the value b is the y-intercept.  Both of these values are the same as the ones we found when we computed the line of least squares by hand.  Therefore we can conclude that our calculations were correct.

It is important to point out that computers and calculators are so commonly used to determine the line of least squares because most analysts work with large sets of data, and the process for computing the line of least squares by hand is extremely tedious and time consuming.
