Averages
and Variations
Purpose:
The purpose of
this 5 day unit is to introduce three measures of centers, the mode, mean and
median. This unit will also introduce a measure of variation, standard
deviation. The students will also become familiar with the statistical
functions in Excel and learn to use Excel to sort and analysis data. The
students will initially work the problems out with pencil and paper when the
concepts are introduced. Then on the following days the students will work on
some of the same problems and some new ones in the computer lab using Excel.
DAY 1:
Averages- mode and median
When we want to
describe a population using only one number, we use a central value or average.
The average helps predict what the entire population is likely to be. There are
many different ways to find averages we are going to focus on three, the mean,
the mode and the median. We call these
the measures of central tendency.
We will find the
three measures of central tendency for three types of data. They are:
MODE
The mode: It is the value or property that occurs most frequently in
the data. It is the most common value. It does not have to be unique. There can
be more than one mode or none if no values are repeated.
For raw data: The
mode is the value that occurs most frequently.
For
ungrouped data: The mode is the value that occurs most
frequently, the one with the highest frequency.
For
grouped data: The modal class is the class with
highest frequency. The mode is determined by:
LCB + (fo - f-1)/ (2 fo - f-1 - f1) * h
where fo is the frequency of
the modal class, f1 is the frequency of the proceeding class, f-1
is the frequency of the preceding class, LCB is the lower class boundary and h
is the class width.
Example Problem:
Given the
following raw data for test scores, order the data using the stem and leaf
method and find the mode.
77 |
84 |
76 |
88 |
99 |
92 |
68 |
78 |
98 |
69 |
75 |
97 |
95 |
71 |
88 |
73 |
88 |
94 |
73 |
76 |
Solution:
Use
the stem and leaf method to sort the data in ascending order. The mode is more
evident using this method. The mode is 88 because it occurs three times.
Example Problem:
Income for Men and
Women- a random sample of population using grouped data.
Earnings ($) |
Frequency |
|
0- |
20,000 |
123 |
20,000- |
25,000 |
356 |
25,000- |
30,000 |
560 |
30,000- |
35,000 |
311 |
35,000- |
40,000 |
107 |
40,000- |
45,000 |
35 |
45,000- |
50,000 |
21 |
50,000- |
over |
8 |
Find an average income
for the sample population by finding the mode.
Solution: The
modal class is the class interval with the highest frequency, 25,000-30,000.
Then the mode is:
LCB + (fo
- f-1)/ (2 fo - f-1
- f1) * h
where the
frequency of the modal class is fo = 560,
the frequency of the proceeding class is f1 = 311, the frequency of
the preceding class is f-1 = 356,
the lower class boundary is LCB = 25,000 and the class width is h =
5,000. Therefore the mode is $27,252.
MEDIAN (the 50th %)
The median: The median is a central value where there
are an equal number of values above and below it. We say the median is
resistant to extreme values because it is not affected by them. The median
relies only on the total number of data values to determine the position of the
median.
For Raw Data: To
obtain the median we must order the data from smallest to largest. Then we can
construct the middle value. Since the median is the 50th% then 50 %
or half of the values fall below it and half fall above it. Then we multiply
the total number of data values, n, by one half to obtain the position of the
median, i.
·
If we
obtain a whole number, i, for the position of the
median, then we average the value of the ith and
ith+1 position value to obtain the median.
·
If we
do not obtain a whole number, i, for the position of
the median, then we round i up and use the value in
that position as the median.
Example Problem:
Use the previous raw data to find the median test score.
77 |
84 |
76 |
88 |
99 |
92 |
68 |
78 |
98 |
69 |
75 |
97 |
95 |
71 |
88 |
73 |
88 |
94 |
73 |
76 |
Solution: Since
there are 10 data values then n = 10. This give us (20) * (.5) = i = 10. Therefore we need to average the 10th
and 11th term to obtain the median. This average gives us (78+84)/2
= 81. Therefore 81 is the median for these raw test scores.
For ungrouped data (given the frequency of each value): To
obtain the total number of data values, n, the frequencies of all the values
are added. Then we can divide the total number of data values, n, by 1/2 to
obtain the position of the median. Also since the median relies on position
relative to all the data values, we need to take into account the cumulative
frequency of each value. This is the total number of data values that have
occurred up to the chosen data value. The position of the median is determined
by where i is in the cumulative frequency. The
cumulative frequency of data value must be higher than the position of the
median, i. Then the first data value with a higher
cumulative frequency is the median.
For grouped data: We proceed the same way as for the
ungrouped data by calculating the cumulative frequencies of the class intervals
and obtaining the position of the median by (1/2)n = i.
The first class interval that has a cumulative frequency higher than i is the median class. Then we use the lower class
boundary, LCB, the cumulative frequency of the previous class, a, the class
width, h, and the frequency of the median class, f, to find the median. We use
the following equation:
LCB + (i – a)/f *
h = Median
Example Problem:
Income for Men and
Women- a random sample from the population.
Earnings ($) |
Frequency |
|
0- |
20,000 |
123 |
20,000- |
25,000 |
356 |
25,000- |
30,000 |
560 |
30,000- |
35,000 |
311 |
35,000- |
40,000 |
107 |
40,000- |
45,000 |
35 |
45,000- |
50,000 |
21 |
50,000- |
over |
8 |
Find an average
income for the sample population by finding the median.
Solution:
Earnings ($) |
Frequency |
Cumm. Freq. 123 |
|
0- |
20,000 |
123 |
|
20,000- |
25,000 |
356 |
479 |
25,000- |
30,000 |
560 |
1039 |
30,000- |
35,000 |
311 |
1350 |
35,000- |
40,000 |
107 |
1457 |
40,000- |
45,000 |
35 |
1492 |
45,000- |
50,000 |
21 |
1513 |
50,000- |
over |
8 |
1521 |
|
n = |
1521 |
|
We obtain the
total number of data values by summing the frequency of each class interval.
Then n = 1521. Now we have that i = 1521(1/2) =
760.5. The first class interval with a cumulative frequency higher than 760.5
is the 25,000-30,000 class. Therefore this is the median class. Then the median
is:
LCB + (i – a)/f *
h
where LCB =
25,000, I = 760.5, a = 479, f = 560 and h = 5,000. The median is $27,513.
Homework:
Problem:
If shoe
manufacturer has a random sample of raw data on the size of shoes bought for
1000 customers, which type of average would be the most useful to them and why?
Solution:
For the size of shoe
data the mode would best help manufactures know what size to produce the most
of.
Problem:
If the sample was
for the price of the shoes 1000 customer bought, then what would be the most
useful average and why?
Solution:
For the price of
the shoes the median would be more helpful to know the middle price customers
are paying for shoes.
DAY 2:
Averages-the mean
MEAN
The mean: The mean
is an average that takes into account every entry in the data. It is referred
to as the arithmetic mean. To find the mean we add the values of all the
entries and then divide by the number of entries. Since the mean takes into
account every data value then it is affected by extreme values and is not
considered to be a “resistant” measure.
For raw data: we
simply sum all the data values and divide by the total data values summed, n.
For ungrouped
data: we need to take into account every instance a data value occurs.
Therefore we need to sum each data value as many times as its frequency. If the
frequency is 5, then we need to sum that data value 5 times. We can obtain this
by multiplying each data value by its frequency. We can then sum these values
and divide the sum by the total number of data values, n, that is the sum of
the frequencies.
For grouped data:
we again need to take into account every instance of the class interval. We
need to sum the data values as many times as their frequency. Since our data is
already grouped we need to find one value to represent the entire class. We
need to find the mid-values of the class intervals. This mid-value is the mean
of the class interval, the sum of the class boundaries divided by two. We then
take into account every occurrence of the mid-values by multiplying the
mid-values by the frequencies. Then we can proceed as for ungrouped data and
sum these values and divide the sum by the total number data values, n, that is
the sum of the frequencies.
Example Problem:
Given the
following data, find an average number of hours slept by adults and teenagers
by finding the mean, the median and the mode.
Amount of hours
slept at night for adults and teenagers.
Hours of sleep |
Adults |
Teenagers |
3 hours |
2 |
6 |
4 hours |
5 |
5 |
5 hours |
11 |
15 |
6 hours |
35 |
22 |
7 hours |
42 |
55 |
8 hours |
31 |
23 |
9 hours |
8 |
15 |
10 hours |
4 |
5 |
Total= |
138 |
146 |
Solution: For the
mean, we need to “count” each occurrence of each value. Therefore multiplying
the value times the frequency gives us x*f which symbolized the total sum of
each data value. Then the summation of x*f gives the sum of all the data
values. We can divide this sum by n to obtain the mean. For the adults n = 138.
For the teenagers n = 146. Therefore the mean for adults is 6.85 and the mean
for teenagers is 6.84
The median
position, i, is determine by n * (1/2). The median
position for the adults is 69 and for the teenagers it is 73. Therefore the
median for both groups is 7.
The mode is the
value with the highest frequency.
Mode for adults =
7 hours
Mode for teenagers
= 7 hours
Trimmed mean: We
said the arithmetic mean can be affected by extreme data values. Extreme data
values can skew the mean so that it is either too high or too low instead of a
good measure of the central tendency. We can “trim” the mean to have a measure
of center that is more “resistant”, meaning it is not influenced by extremely
high or low data. We can compute any percent trimmed mean by finding the that
percent of the total data values. We then round up and eliminate that number of
data values from both sides of the ordered data set. Since we are removing
specific extreme data value this works only with raw data.
Example: If we
want a 10% trimmed mean of 15 data points since 10% of 15 is 1.5, then we round
to 2. Then the 2 lowest and highest data values (a total of four data values)
are removed.
Problem:
Given the income
of the following group, find an average by finding the mean. Then find the 5%
trimmed mean and compare. Which one better represents the entire data set? Why?
Income |
Frequency |
28,000 |
2 |
30,000 |
6 |
32,000 |
5 |
36,000 |
10 |
42,000 |
8 |
299,000 |
1 |
Solution:
Income |
Frequency |
XF |
New XF |
28,000 |
2 |
56,000 |
|
30,000 |
6 |
180,000 |
180,000 |
32,000 |
5 |
160,000 |
160,000 |
36,000 |
10 |
360,000 |
360,000 |
42,000 |
8 |
336,000 |
294,000 |
299,000 |
1 |
299,000 |
|
n = |
32 |
1,391,000 |
994,000 |
The mean using all
the data values is the sum of all the data values, f*x, divided by the total
number of data values. The mean is 1,391,000 / 32 = 43,469.
The 5% trimmed
mean is obtained by finding 5% of 32 which is 1.6. We round up to two and
remove the two smallest data values and the two largest data values from the
set. Our new sum of the data values, f*x, is 994,000 and our new n is 28.
Therefore our new 5% trimmed mean is 994,000/28 = 35,500.
The trimmed mean
is more representative of the entire population because the majority of the
data values are around this value. The untrimmed mean is not as representative
since it is higher than every value except one.
The extreme value of the largest data values skewed the untrimmed mean.
Homework:
If you were trying
to choose a major and you were considering the average income for each major,
which average would you prefer to have and why?
The mean could be
thrown off my extremely high or low income. The median is a middle or balanced
point and would be a better indicator.
DAY 3:
Averages-
in the Computer Lab
Problem:
Given the
following test scores, use Excel to order the data and find the mode and the
mean. Click
here to obtain this data and work in Excel.
77 |
84 |
76 |
88 |
99 |
92 |
68 |
78 |
98 |
69 |
75 |
97 |
95 |
71 |
88 |
73 |
88 |
94 |
73 |
76 |
Solution:
Click
here to see the solution in Excel. Under the data menu in Excel the Sort
choice sorts the data in either descending or ascending order. Also the data
can be sorted using the A to Z buttons in the toolbox. The cells containing the
raw data need to be selected before the Sort function is used.
The
mode can be found by using the function MODE and selecting the data cells as
the target array. This can be done with or without sorting the data. The
formula in the cell is “MODE (First Cell: Last Cell).” The mode is 88. To see
the formula in the cells go to the Tools menu and select formula auditing and
select formula auditing mode. To see the results of the formulas then
inactivate this mode by selecting the formula auditing mode again from the
Tools menu.
The
mean is found in Excel either by using Excel to perform the arithmetic or by
using the statistical function AVERAGE. Using Excel for arithmetic we can sum
all the data values using the summation function. Then we can find n by using
the Count function. Finally we obtain the mean by dividing the sum by n. We use
the AVERAGE function by selecting the array. The formula in the cell is
“AVERAGE (First Cell: Last Cell).” The
mean is 82.95.
Problem:
How hot does it
get in
If we find out that
the thermometer used to measure the temperatures was always reading two degrees
higher now find the new adjusted median and mean. Compare them with the
original median and mean. What do you notice?
99 |
84 |
78 |
86 |
94 |
83 |
84 |
95 |
79 |
90 |
80 |
99 |
95 |
82 |
91 |
82 |
93 |
90 |
80 |
83 |
Solution: Click
here to see this solution in Excel. We can sort all the data values using the
A to Z function. We can then use the COUNT function to obtain n = 20 by
selecting all the data values to be counted. Then use the spreadsheet
capabilities to find i = (½)* n = 10 and select the
values in the appropriate position. Since i is an
integer then the mean of the 10th and 11th terms is the median.
Therefore the median is 85.
We
can check our answer using the median function. The formula in the cell would
be “MEDIAN (First Cell: Last Cell).” The median is 85.
NOTE: We can also use
the PERCENTILE function by selecting the array and defining the median to be
the 50%. The formula in the cell using this function would be “PERCENTILE
(First Cell: Last Cell, 1/2).”
We can use the summation function to
sum all the data values. We can then use previous results of the COUNT function
to obtain n. We then use the spreadsheet capabilities to find the quotient of
the sum by n.
The
mean can also be found in Excel using the AVERAGE function. We can use the
AVERAGE function by selecting the array. The formula in the cell is “AVERAGE
(First Cell: Last Cell).” The mean is
87.35 degrees.
The adjusted mean
and median can be obtained by adding 2 degrees to the original mean and median.
The adjusted mean is 89.35 degrees and the adjusted median is 87. The students
may add 2 degrees to each data value and use the same Excel functions again but
comparing the two values at the end will help emphasize that there is no need
to do that.
Problem:
Earning for Men
and Women:
Earnings ($) |
Frequency |
|
0- |
20,000 |
123 |
20,000- |
25,000 |
356 |
25,000- |
30,000 |
560 |
30,000- |
35,000 |
311 |
35,000- |
40,000 |
107 |
40,000- |
45,000 |
35 |
45,000- |
50,000 |
21 |
50,000- |
Over |
8 |
Using Excel, find an
average income by finding the median. Click
here to obtain this data set and work in Excel.
Solution:
Earnings ($) |
Frequency |
Cumm. Freq. 123 |
|
0- |
20,000 |
123 |
|
20,000- |
25,000 |
356 |
479 |
25,000- |
30,000 |
560 |
1039 |
30,000- |
35,000 |
311 |
1350 |
35,000- |
40,000 |
107 |
1457 |
40,000- |
45,000 |
35 |
1492 |
45,000- |
50,000 |
21 |
1513 |
50,000- |
65,000 |
8 |
1521 |
|
n = |
1521 |
|
Click here
to see this solution in Excel. For this particular problem Excel would only
be useful in manipulating the data and calculating the cumulative frequencies.
The same equation as used in day one would still need to be used but Excel can
be used to refer to the appropriate cells.
Therefore i = 1521(1/2) = 760.5. The median class is the
25,000-30,000 class. Then the median is:
LCB + (i – a)/f *
h
where LCB =
25,000, I = 760.5, a = 479, f = 560 and h = 5,000. The median is $27,513.
Problem:
Given the
following large number of data points find two averages to represent each
class, the freshman, sophomore, juniors and seniors. Then find another two
averages to represent the data as a whole. Which averages would be the most
appropriate and why? The data represents the height (in inches) of freshman
through seniors. Click
here to obtain this data from Excel. Use Excel to obtain your answers.
Freshman |
Sophomores |
Juniors |
Seniors |
|
50 |
50 |
53 |
55 |
|
47 |
49 |
47 |
48 |
|
54 |
53 |
55 |
58 |
|
57 |
60 |
58 |
60 |
|
65 |
65 |
68 |
70 |
|
68 |
70 |
67 |
69 |
|
61 |
61 |
64 |
65 |
|
45 |
47 |
45 |
48 |
|
50 |
50 |
53 |
55 |
|
65 |
67 |
64 |
66 |
|
68 |
67 |
70 |
72 |
|
47 |
50 |
49 |
50 |
|
56 |
55 |
58 |
61 |
|
74 |
76 |
73 |
75 |
|
62 |
61 |
64 |
66 |
|
56 |
59 |
57 |
59 |
|
59 |
59 |
63 |
64 |
|
70 |
71 |
68 |
71 |
|
59 |
58 |
61 |
63 |
|
55 |
58 |
56 |
58 |
|
49 |
49 |
52 |
54 |
|
61 |
63 |
61 |
62 |
|
45 |
48 |
55 |
49 |
|
50 |
53 |
51 |
53 |
|
59 |
59 |
62 |
64 |
|
51 |
53 |
50 |
52 |
|
72 |
72 |
75 |
76 |
|
54 |
56 |
54 |
57 |
|
60 |
60 |
63 |
65 |
|
48 |
50 |
47 |
49 |
|
54 |
53 |
56 |
58 |
|
50 |
53 |
52 |
53 |
|
62 |
61 |
64 |
67 |
|
71 |
73 |
70 |
72 |
|
67 |
66 |
69 |
71 |
|
53 |
56 |
54 |
56 |
|
53 |
53 |
57 |
58 |
|
68 |
69 |
66 |
69 |
|
45 |
44 |
47 |
49 |
|
47 |
50 |
48 |
50 |
|
57 |
57 |
60 |
62 |
|
69 |
71 |
69 |
70 |
|
58 |
57 |
59 |
62 |
|
64 |
67 |
65 |
67 |
|
53 |
53 |
56 |
58 |
|
72 |
74 |
71 |
73 |
|
52 |
52 |
55 |
56 |
|
72 |
74 |
72 |
75 |
|
Sums: |
2784 |
2819 |
2853 |
2938 |
Using the
functions in Excel we can obtain all three averages in different ways. Click here to see
these results obtained in Excel.
|
|
Results |
|||
Average |
Formula |
Fresh. |
Soph. |
Junior |
Senior |
Mean |
AVERAGE (First Cell: Last Cell) |
58 |
59 |
59.44 |
61.25 |
Median |
MEDIAN (First Cell: Last Cell) |
57 |
58 |
58.5 |
61.5 |
Mode |
MODE (First Cell: Last Cell) |
50 |
53 |
64 |
58 |
DAY 4:
Variation-
We have discussed
the measures of central tendencies of data, now we will consider the spread of
the data or the variation. The range gives the variation between the largest
and smallest data values. But the range does not tell us how the data values
differ from each other.
Another measure of
variation is the standard deviation. It measures how far away each data value
is from the mean. The equation to find the standard deviation sums the square
difference between each data value and the mean, Σ(xi-mean)2.
If the difference is not squared and data values are spread evenly below and
above the mean then the sum of the difference could be zero. The equation squares
this difference to eliminate this occurrence by eliminating negative
differences. Then we have the square sum of the difference between each data
point and the mean. If we divide this by n-1, we obtain the average square
difference for the data set. Since we are interested in the average difference
or spread of the data set, we need to take the square root. Therefore the
equation is:
S= √(Σ(xi-mean)2/(n-1))
For raw data: We
find the mean and take the difference between each data value and the mean.
Then we square the difference, divide by n-1 and take the square root to find
the standard deviation.
For ungroup data:
We know that the sum of the squared difference can be split into two sums. We also
know that the mean is the sum of all the data values divided by the total
number of data values, n. Using algebra we know:
Σ(xi-mean)2 = Σxi2
– (Σxi)2/n
Now we know for
ungrouped data that the sum of the data values is obtained by taking into account
each instance the data value occurs. Therefore the data value is multiplied by
its frequency. Now substituting this back into the standard deviation equation,
we obtain:
S= √(( Σ(xi2f)
– (Σxif)2/n)/(n-1))
We use the
definition of the mean again to obtain another version of this equation:
S= √((Σ(xi2f)
– (mean2*n)/(n-1))
Then we can either
obtain the difference between each data point and the mean or we can find the
square of the sum of all the data values (Σ(xi2f) to
solve for the standard deviation.
For grouped data:
We can use the same method as for the ungrouped data with one modification. We
need to find the mid-value of each class interval. The mid-value of the class
interval is the average of the lower class boundary and upper class boundary. Now
this average is our xi value that is multiplied by the class
frequency. We can now continue the same as for ungrouped data and solve for the
standard deviation.
Problem:
Find the variation
of the following data by finding the range and the standard deviation. The data
is the weight of babies. Click
here to work with this data in Excel.
Weight (lbs) |
frequency |
3.0 |
2 |
4.5 |
7 |
6.0 |
8 |
6.8 |
14 |
7.2 |
10 |
7.9 |
3 |
8.5 |
1 |
Solution: Click here to
see this solution in Excel. The range is 8.5-3.0 = 5.5 lbs. The standard
deviation is 1.255.
Homework: If you were given an average repair
cost for your type of car and the standard deviation. Which type of average
would you prefer to have any why?
Solution: The mean
would give the most amount of information since the standard deviation tells us
how far the prices are from the mean.
DAY 5:
Variation-
in the Computer Lab
Problem:
Find the mean and
standard deviation of the following two data sets. How do they compare? Click
here to obtain the data sets and work in Excel.
Click
here to obtain the salaries of baseball players in the National League.
Click
here to obtain the salaries of baseball players in the American League.
Solution: Click
here to see the solution in Excel.
Problem: Generate
a data table, either by generating random numbers or by choosing a
characteristic that can be obtained in the classroom. For example, generating a
die being rolled or collecting the amount of loose change each student in class
has or collecting the amount of money students spent on lunch today. Click
here to go to Excel and create your own data table. Find the mean, median,
mode and standard deviation of your data set.
To see an example
of randomly generated data go to Unit One in this Statistical Instructional
Unit.