Averages and Variations

Averages and Variations

Purpose:

The purpose of this 5 day unit is to introduce three measures of centers, the mode, mean and median. This unit will also introduce a measure of variation, standard deviation. The students will also become familiar with the statistical functions in Excel and learn to use Excel to sort and analysis data. The students will initially work the problems out with pencil and paper when the concepts are introduced. Then on the following days the students will work on some of the same problems and some new ones in the computer lab using Excel.

DAY 1: Averages- mode and median

When we want to describe a population using only one number, we use a central value or average. The average helps predict what the entire population is likely to be. There are many different ways to find averages we are going to focus on three, the mean, the mode and the median. We call these the measures of central tendency.

We will find the three measures of central tendency for three types of data. They are:

Raw data- data values are given in random sequence.
Ungrouped data- the frequency of single data values are given.
Grouped data- frequency data of class intervals is given.

MODE

The mode: It is the value or property that occurs most frequently in the data. It is the most common value. It does not have to be unique. There can be more than one mode or none if no values are repeated.

For raw data: The mode is the value that occurs most frequently.

For ungrouped data: The mode is the value that occurs most frequently, the one with the highest frequency.

For grouped data: The modal class is the class with highest frequency. The mode is determined by:

LCB + (f_o - f_-1)/ (2 f_o - f_-1 - f₁) * h

where f_o is the frequency of the modal class, f₁ is the frequency of the proceeding class, f_-1 is the frequency of the preceding class, LCB is the lower class boundary and h is the class width.

Example Problem:

Given the following raw data for test scores, order the data using the stem and leaf method and find the mode.

77	84	76	88	99
92	68	78	98	69
75	97	95	71	88
73	88	94	73	76

Solution:

Use the stem and leaf method to sort the data in ascending order. The mode is more evident using this method. The mode is 88 because it occurs three times.

Example Problem:

Income for Men and Women- a random sample of population using grouped data.

Earnings ($)		Frequency
0-	20,000	123
20,000-	25,000	356
25,000-	30,000	560
30,000-	35,000	311
35,000-	40,000	107
40,000-	45,000	35
45,000-	50,000	21
50,000-	over	8

Find an average income for the sample population by finding the mode.

Solution: The modal class is the class interval with the highest frequency, 25,000-30,000. Then the mode is:

LCB + (f_o - f_-1)/ (2 f_o - f_-1 - f₁) * h

where the frequency of the modal class is f_o = 560, the frequency of the proceeding class is f₁ = 311, the frequency of the preceding class is f_-1 = 356, the lower class boundary is LCB = 25,000 and the class width is h = 5,000. Therefore the mode is $27,252.

MEDIAN (the 50^th %)

The median: The median is a central value where there are an equal number of values above and below it. We say the median is resistant to extreme values because it is not affected by them. The median relies only on the total number of data values to determine the position of the median.

For Raw Data: To obtain the median we must order the data from smallest to largest. Then we can construct the middle value. Since the median is the 50^th% then 50 % or half of the values fall below it and half fall above it. Then we multiply the total number of data values, n, by one half to obtain the position of the median, i.

· If we obtain a whole number, i, for the position of the median, then we average the value of the ith and ith+1 position value to obtain the median.

· If we do not obtain a whole number, i, for the position of the median, then we round i up and use the value in that position as the median.

Example Problem: Use the previous raw data to find the median test score.

77	84	76	88	99
92	68	78	98	69
75	97	95	71	88
73	88	94	73	76

Solution: Since there are 10 data values then n = 10. This give us (20) * (.5) = i = 10. Therefore we need to average the 10^th and 11^th term to obtain the median. This average gives us (78+84)/2 = 81. Therefore 81 is the median for these raw test scores.

For ungrouped data (given the frequency of each value): To obtain the total number of data values, n, the frequencies of all the values are added. Then we can divide the total number of data values, n, by 1/2 to obtain the position of the median. Also since the median relies on position relative to all the data values, we need to take into account the cumulative frequency of each value. This is the total number of data values that have occurred up to the chosen data value. The position of the median is determined by where i is in the cumulative frequency. The cumulative frequency of data value must be higher than the position of the median, i. Then the first data value with a higher cumulative frequency is the median.

For grouped data: We proceed the same way as for the ungrouped data by calculating the cumulative frequencies of the class intervals and obtaining the position of the median by (1/2)n = i. The first class interval that has a cumulative frequency higher than i is the median class. Then we use the lower class boundary, LCB, the cumulative frequency of the previous class, a, the class width, h, and the frequency of the median class, f, to find the median. We use the following equation:

LCB + (i – a)/f * h = Median

Example Problem:

Income for Men and Women- a random sample from the population.

Earnings ($)		Frequency
0-	20,000	123
20,000-	25,000	356
25,000-	30,000	560
30,000-	35,000	311
35,000-	40,000	107
40,000-	45,000	35
45,000-	50,000	21
50,000-	over	8

Find an average income for the sample population by finding the median.

Solution:

Earnings ($)		Frequency	Cumm. Freq. 123
0-	20,000	123	Cumm. Freq. 123
20,000-	25,000	356	479
25,000-	30,000	560	1039
30,000-	35,000	311	1350
35,000-	40,000	107	1457
40,000-	45,000	35	1492
45,000-	50,000	21	1513
50,000-	over	8	1521
	n =	1521

We obtain the total number of data values by summing the frequency of each class interval. Then n = 1521. Now we have that i = 1521(1/2) = 760.5. The first class interval with a cumulative frequency higher than 760.5 is the 25,000-30,000 class. Therefore this is the median class. Then the median is:

LCB + (i – a)/f * h

where LCB = 25,000, I = 760.5, a = 479, f = 560 and h = 5,000. The median is $27,513.

Homework:

Problem:

If shoe manufacturer has a random sample of raw data on the size of shoes bought for 1000 customers, which type of average would be the most useful to them and why?

Solution:

For the size of shoe data the mode would best help manufactures know what size to produce the most of.

Problem:

If the sample was for the price of the shoes 1000 customer bought, then what would be the most useful average and why?

Solution:

For the price of the shoes the median would be more helpful to know the middle price customers are paying for shoes.

DAY 2: Averages-the mean

MEAN

The mean: The mean is an average that takes into account every entry in the data. It is referred to as the arithmetic mean. To find the mean we add the values of all the entries and then divide by the number of entries. Since the mean takes into account every data value then it is affected by extreme values and is not considered to be a “resistant” measure.

For raw data: we simply sum all the data values and divide by the total data values summed, n.

For ungrouped data: we need to take into account every instance a data value occurs. Therefore we need to sum each data value as many times as its frequency. If the frequency is 5, then we need to sum that data value 5 times. We can obtain this by multiplying each data value by its frequency. We can then sum these values and divide the sum by the total number of data values, n, that is the sum of the frequencies.

For grouped data: we again need to take into account every instance of the class interval. We need to sum the data values as many times as their frequency. Since our data is already grouped we need to find one value to represent the entire class. We need to find the mid-values of the class intervals. This mid-value is the mean of the class interval, the sum of the class boundaries divided by two. We then take into account every occurrence of the mid-values by multiplying the mid-values by the frequencies. Then we can proceed as for ungrouped data and sum these values and divide the sum by the total number data values, n, that is the sum of the frequencies.

Example Problem:

Given the following data, find an average number of hours slept by adults and teenagers by finding the mean, the median and the mode.

Amount of hours slept at night for adults and teenagers.

Hours of sleep	Adults	Teenagers
3 hours	2	6
4 hours	5	5
5 hours	11	15
6 hours	35	22
7 hours	42	55
8 hours	31	23
9 hours	8	15
10 hours	4	5
Total=	138	146

Solution: For the mean, we need to “count” each occurrence of each value. Therefore multiplying the value times the frequency gives us x*f which symbolized the total sum of each data value. Then the summation of x*f gives the sum of all the data values. We can divide this sum by n to obtain the mean. For the adults n = 138. For the teenagers n = 146. Therefore the mean for adults is 6.85 and the mean for teenagers is 6.84

The median position, i, is determine by n * (1/2). The median position for the adults is 69 and for the teenagers it is 73. Therefore the median for both groups is 7.

The mode is the value with the highest frequency.

Mode for adults = 7 hours

Mode for teenagers = 7 hours

Trimmed mean: We said the arithmetic mean can be affected by extreme data values. Extreme data values can skew the mean so that it is either too high or too low instead of a good measure of the central tendency. We can “trim” the mean to have a measure of center that is more “resistant”, meaning it is not influenced by extremely high or low data. We can compute any percent trimmed mean by finding the that percent of the total data values. We then round up and eliminate that number of data values from both sides of the ordered data set. Since we are removing specific extreme data value this works only with raw data.

Example: If we want a 10% trimmed mean of 15 data points since 10% of 15 is 1.5, then we round to 2. Then the 2 lowest and highest data values (a total of four data values) are removed.

Problem:

Given the income of the following group, find an average by finding the mean. Then find the 5% trimmed mean and compare. Which one better represents the entire data set? Why?

Income	Frequency
28,000	2
30,000	6
32,000	5
36,000	10
42,000	8
299,000	1

Solution:

Income	Frequency	XF	New XF
28,000	2	56,000
30,000	6	180,000	180,000
32,000	5	160,000	160,000
36,000	10	360,000	360,000
42,000	8	336,000	294,000
299,000	1	299,000
n =	32	1,391,000	994,000

The mean using all the data values is the sum of all the data values, f*x, divided by the total number of data values. The mean is 1,391,000 / 32 = 43,469.

The 5% trimmed mean is obtained by finding 5% of 32 which is 1.6. We round up to two and remove the two smallest data values and the two largest data values from the set. Our new sum of the data values, f*x, is 994,000 and our new n is 28. Therefore our new 5% trimmed mean is 994,000/28 = 35,500.

The trimmed mean is more representative of the entire population because the majority of the data values are around this value. The untrimmed mean is not as representative since it is higher than every value except one. The extreme value of the largest data values skewed the untrimmed mean.

Homework:

If you were trying to choose a major and you were considering the average income for each major, which average would you prefer to have and why?

The mean could be thrown off my extremely high or low income. The median is a middle or balanced point and would be a better indicator.

DAY 3: Averages-

in the Computer Lab

Problem:

Given the following test scores, use Excel to order the data and find the mode and the mean. Click here to obtain this data and work in Excel.

77	84	76	88	99
92	68	78	98	69
75	97	95	71	88
73	88	94	73	76

Solution:

Click here to see the solution in Excel. Under the data menu in Excel the Sort choice sorts the data in either descending or ascending order. Also the data can be sorted using the A to Z buttons in the toolbox. The cells containing the raw data need to be selected before the Sort function is used.

The mode can be found by using the function MODE and selecting the data cells as the target array. This can be done with or without sorting the data. The formula in the cell is “MODE (First Cell: Last Cell).” The mode is 88. To see the formula in the cells go to the Tools menu and select formula auditing and select formula auditing mode. To see the results of the formulas then inactivate this mode by selecting the formula auditing mode again from the Tools menu.

The mean is found in Excel either by using Excel to perform the arithmetic or by using the statistical function AVERAGE. Using Excel for arithmetic we can sum all the data values using the summation function. Then we can find n by using the Count function. Finally we obtain the mean by dividing the sum by n. We use the AVERAGE function by selecting the array. The formula in the cell is “AVERAGE (First Cell: Last Cell).” The mean is 82.95.

Problem:

How hot does it get in Atlanta? The following are temperatures for 20 days in the summer. Using Excel, find the median and mean. Click here to obtain this data set and work in Excel.

If we find out that the thermometer used to measure the temperatures was always reading two degrees higher now find the new adjusted median and mean. Compare them with the original median and mean. What do you notice?

99	84	78	86	94
83	84	95	79	90
80	99	95	82	91
82	93	90	80	83

Solution: Click here to see this solution in Excel. We can sort all the data values using the A to Z function. We can then use the COUNT function to obtain n = 20 by selecting all the data values to be counted. Then use the spreadsheet capabilities to find i = (½)* n = 10 and select the values in the appropriate position. Since i is an integer then the mean of the 10th and 11^th terms is the median. Therefore the median is 85.

We can check our answer using the median function. The formula in the cell would be “MEDIAN (First Cell: Last Cell).” The median is 85.

NOTE: We can also use the PERCENTILE function by selecting the array and defining the median to be the 50%. The formula in the cell using this function would be “PERCENTILE (First Cell: Last Cell, 1/2).”

We can use the summation function to sum all the data values. We can then use previous results of the COUNT function to obtain n. We then use the spreadsheet capabilities to find the quotient of the sum by n.

The mean can also be found in Excel using the AVERAGE function. We can use the AVERAGE function by selecting the array. The formula in the cell is “AVERAGE (First Cell: Last Cell).” The mean is 87.35 degrees.

The adjusted mean and median can be obtained by adding 2 degrees to the original mean and median. The adjusted mean is 89.35 degrees and the adjusted median is 87. The students may add 2 degrees to each data value and use the same Excel functions again but comparing the two values at the end will help emphasize that there is no need to do that.

Problem:

Earning for Men and Women:

Earnings ($)		Frequency
0-	20,000	123
20,000-	25,000	356
25,000-	30,000	560
30,000-	35,000	311
35,000-	40,000	107
40,000-	45,000	35
45,000-	50,000	21
50,000-	Over	8

Using Excel, find an average income by finding the median. Click here to obtain this data set and work in Excel.

Solution:

Earnings ($)		Frequency	Cumm. Freq. 123
0-	20,000	123	Cumm. Freq. 123
20,000-	25,000	356	479
25,000-	30,000	560	1039
30,000-	35,000	311	1350
35,000-	40,000	107	1457
40,000-	45,000	35	1492
45,000-	50,000	21	1513
50,000-	65,000	8	1521
	n =	1521

Click here to see this solution in Excel. For this particular problem Excel would only be useful in manipulating the data and calculating the cumulative frequencies. The same equation as used in day one would still need to be used but Excel can be used to refer to the appropriate cells.

Therefore i = 1521(1/2) = 760.5. The median class is the 25,000-30,000 class. Then the median is:

LCB + (i – a)/f * h

where LCB = 25,000, I = 760.5, a = 479, f = 560 and h = 5,000. The median is $27,513.

Problem:

Given the following large number of data points find two averages to represent each class, the freshman, sophomore, juniors and seniors. Then find another two averages to represent the data as a whole. Which averages would be the most appropriate and why? The data represents the height (in inches) of freshman through seniors. Click here to obtain this data from Excel. Use Excel to obtain your answers.

Freshman		Sophomores	Juniors	Seniors
50		50	53	55
47		49	47	48
54		53	55	58
57		60	58	60
65		65	68	70
68		70	67	69
61		61	64	65
45		47	45	48
50		50	53	55
65		67	64	66
68		67	70	72
47		50	49	50
56		55	58	61
74		76	73	75
62		61	64	66
56		59	57	59
59		59	63	64
70		71	68	71
59		58	61	63
55		58	56	58
49		49	52	54
61		63	61	62
45		48	55	49
50		53	51	53
59		59	62	64
51		53	50	52
72		72	75	76
54		56	54	57
60		60	63	65
48		50	47	49
54		53	56	58
50		53	52	53
62		61	64	67
71		73	70	72
67		66	69	71
53		56	54	56
53		53	57	58
68		69	66	69
45		44	47	49
47		50	48	50
57		57	60	62
69		71	69	70
58		57	59	62
64		67	65	67
53		53	56	58
72		74	71	73
52		52	55	56
72		74	72	75
Sums:	2784	2819	2853	2938

Using the functions in Excel we can obtain all three averages in different ways. Click here to see these results obtained in Excel.

		Results
Average	Formula	Fresh.	Soph.	Junior	Senior
Mean	AVERAGE (First Cell: Last Cell)	58	59	59.44	61.25
Median	MEDIAN (First Cell: Last Cell)	57	58	58.5	61.5
Mode	MODE (First Cell: Last Cell)	50	53	64	58

DAY 4: Variation-

We have discussed the measures of central tendencies of data, now we will consider the spread of the data or the variation. The range gives the variation between the largest and smallest data values. But the range does not tell us how the data values differ from each other.

Another measure of variation is the standard deviation. It measures how far away each data value is from the mean. The equation to find the standard deviation sums the square difference between each data value and the mean, Σ(x_i-mean)². If the difference is not squared and data values are spread evenly below and above the mean then the sum of the difference could be zero. The equation squares this difference to eliminate this occurrence by eliminating negative differences. Then we have the square sum of the difference between each data point and the mean. If we divide this by n-1, we obtain the average square difference for the data set. Since we are interested in the average difference or spread of the data set, we need to take the square root. Therefore the equation is:

S= √(Σ(x_i-mean)²/(n-1))

For raw data: We find the mean and take the difference between each data value and the mean. Then we square the difference, divide by n-1 and take the square root to find the standard deviation.

For ungroup data: We know that the sum of the squared difference can be split into two sums. We also know that the mean is the sum of all the data values divided by the total number of data values, n. Using algebra we know:

Σ(x_i-mean)² = Σx_i² – (Σx_i)²/n

Now we know for ungrouped data that the sum of the data values is obtained by taking into account each instance the data value occurs. Therefore the data value is multiplied by its frequency. Now substituting this back into the standard deviation equation, we obtain:

S= √(( Σ(x_i²f) – (Σx_if)²/n)/(n-1))

We use the definition of the mean again to obtain another version of this equation:

S= √((Σ(x_i²f) – (mean²*n)/(n-1))

Then we can either obtain the difference between each data point and the mean or we can find the square of the sum of all the data values (Σ(x_i²f) to solve for the standard deviation.

For grouped data: We can use the same method as for the ungrouped data with one modification. We need to find the mid-value of each class interval. The mid-value of the class interval is the average of the lower class boundary and upper class boundary. Now this average is our x_i value that is multiplied by the class frequency. We can now continue the same as for ungrouped data and solve for the standard deviation.

Problem:

Find the variation of the following data by finding the range and the standard deviation. The data is the weight of babies. Click here to work with this data in Excel.

Weight (lbs)	frequency
3.0	2
4.5	7
6.0	8
6.8	14
7.2	10
7.9	3
8.5	1

Solution: Click here to see this solution in Excel. The range is 8.5-3.0 = 5.5 lbs. The standard deviation is 1.255.

Homework: If you were given an average repair cost for your type of car and the standard deviation. Which type of average would you prefer to have any why?

Solution: The mean would give the most amount of information since the standard deviation tells us how far the prices are from the mean.

DAY 5: Variation-

in the Computer Lab

Problem:

Find the mean and standard deviation of the following two data sets. How do they compare? Click here to obtain the data sets and work in Excel.

Click here to obtain the salaries of baseball players in the National League.

Click here to obtain the salaries of baseball players in the American League.

Solution: Click here to see the solution in Excel.

Problem: Generate a data table, either by generating random numbers or by choosing a characteristic that can be obtained in the classroom. For example, generating a die being rolled or collecting the amount of loose change each student in class has or collecting the amount of money students spent on lunch today. Click here to go to Excel and create your own data table. Find the mean, median, mode and standard deviation of your data set.

To see an example of randomly generated data go to Unit One in this Statistical Instructional Unit.