Averages and Variations

Purpose:

The purpose of this 5 day unit is to introduce three measures of centers, the mode, mean and median. This unit will also introduce a measure of variation, standard deviation. The students will also become familiar with the statistical functions in Excel and learn to use Excel to sort and analysis data. The students will initially work the problems out with pencil and paper when the concepts are introduced. Then on the following days the students will work on some of the same problems and some new ones in the computer lab using Excel.

DAY 1: Averages- mode and median

When we want to describe a population using only one number, we use a central value or average. The average helps predict what the entire population is likely to be. There are many different ways to find averages we are going to focus on three, the mean, the mode and the median.  We call these the measures of central tendency.

We will find the three measures of central tendency for three types of data. They are:

• Raw data- data values are given in random sequence.
• Ungrouped data- the frequency of single data values are given.
• Grouped data- frequency data of class intervals is given.

MODE

The mode: It is the value or property that occurs most frequently in the data. It is the most common value. It does not have to be unique. There can be more than one mode or none if no values are repeated.

For raw data:                The mode is the value that occurs most frequently.

For ungrouped data:       The mode is the value that occurs most frequently, the one with the highest frequency.

For grouped data:          The modal class is the class with highest frequency. The mode is determined by:

LCB + (fo - f-1)/ (2 fo - f-1 - f1) * h

where fo is the frequency of the modal class, f1 is the frequency of the proceeding class, f-1 is the frequency of the preceding class, LCB is the lower class boundary and h is the class width.

Example Problem:

Given the following raw data for test scores, order the data using the stem and leaf method and find the mode.

 77 84 76 88 99 92 68 78 98 69 75 97 95 71 88 73 88 94 73 76

Solution:

Use the stem and leaf method to sort the data in ascending order. The mode is more evident using this method. The mode is 88 because it occurs three times.

Example Problem:

Income for Men and Women- a random sample of population using grouped data.

 Earnings (\$) Frequency 0- 20,000 123 20,000- 25,000 356 25,000- 30,000 560 30,000- 35,000 311 35,000- 40,000 107 40,000- 45,000 35 45,000- 50,000 21 50,000- over 8

Find an average income for the sample population by finding the mode.

Solution: The modal class is the class interval with the highest frequency, 25,000-30,000. Then the mode is:

LCB + (fo - f-1)/ (2 fo - f-1 - f1) * h

where the frequency of the modal class is fo = 560, the frequency of the proceeding class is f1 = 311, the frequency of the preceding class is f-1 = 356,  the lower class boundary is LCB = 25,000 and the class width is h = 5,000. Therefore the mode is \$27,252.

MEDIAN (the 50th %)

The median: The median is a central value where there are an equal number of values above and below it. We say the median is resistant to extreme values because it is not affected by them. The median relies only on the total number of data values to determine the position of the median.

For Raw Data:  To obtain the median we must order the data from smallest to largest. Then we can construct the middle value. Since the median is the 50th% then 50 % or half of the values fall below it and half fall above it. Then we multiply the total number of data values, n, by one half to obtain the position of the median, i.

·        If we obtain a whole number, i, for the position of the median, then we average the value of the ith and ith+1 position value to obtain the median.

·        If we do not obtain a whole number, i, for the position of the median, then we round i up and use the value in that position as the median.

Example Problem: Use the previous raw data to find the median test score.

 77 84 76 88 99 92 68 78 98 69 75 97 95 71 88 73 88 94 73 76

Solution: Since there are 10 data values then n = 10. This give us (20) * (.5) = i = 10. Therefore we need to average the 10th and 11th term to obtain the median. This average gives us (78+84)/2 = 81. Therefore 81 is the median for these raw test scores.

For ungrouped data (given the frequency of each value): To obtain the total number of data values, n, the frequencies of all the values are added. Then we can divide the total number of data values, n, by 1/2 to obtain the position of the median. Also since the median relies on position relative to all the data values, we need to take into account the cumulative frequency of each value. This is the total number of data values that have occurred up to the chosen data value. The position of the median is determined by where i is in the cumulative frequency. The cumulative frequency of data value must be higher than the position of the median, i. Then the first data value with a higher cumulative frequency is the median.

For grouped data: We proceed the same way as for the ungrouped data by calculating the cumulative frequencies of the class intervals and obtaining the position of the median by (1/2)n = i. The first class interval that has a cumulative frequency higher than i is the median class. Then we use the lower class boundary, LCB, the cumulative frequency of the previous class, a, the class width, h, and the frequency of the median class, f, to find the median. We use the following equation:

LCB + (i – a)/f * h = Median

Example Problem:

Income for Men and Women- a random sample from the population.

 Earnings (\$) Frequency 0- 20,000 123 20,000- 25,000 356 25,000- 30,000 560 30,000- 35,000 311 35,000- 40,000 107 40,000- 45,000 35 45,000- 50,000 21 50,000- over 8

Find an average income for the sample population by finding the median.

Solution:

 Earnings (\$) Frequency Cumm. Freq. 123 0- 20,000 123 20,000- 25,000 356 479 25,000- 30,000 560 1039 30,000- 35,000 311 1350 35,000- 40,000 107 1457 40,000- 45,000 35 1492 45,000- 50,000 21 1513 50,000- over 8 1521 n = 1521

We obtain the total number of data values by summing the frequency of each class interval. Then n = 1521. Now we have that i = 1521(1/2) = 760.5. The first class interval with a cumulative frequency higher than 760.5 is the 25,000-30,000 class. Therefore this is the median class. Then the median is:

LCB + (i – a)/f * h

where LCB = 25,000, I = 760.5, a = 479, f = 560 and h = 5,000. The median is \$27,513.

Homework:

Problem:

If shoe manufacturer has a random sample of raw data on the size of shoes bought for 1000 customers, which type of average would be the most useful to them and why?

Solution:

For the size of shoe data the mode would best help manufactures know what size to produce the most of.

Problem:

If the sample was for the price of the shoes 1000 customer bought, then what would be the most useful average and why?

Solution:

For the price of the shoes the median would be more helpful to know the middle price customers are paying for shoes.

DAY 2: Averages-the mean

MEAN

The mean: The mean is an average that takes into account every entry in the data. It is referred to as the arithmetic mean. To find the mean we add the values of all the entries and then divide by the number of entries. Since the mean takes into account every data value then it is affected by extreme values and is not considered to be a “resistant” measure.

For raw data: we simply sum all the data values and divide by the total data values summed, n.

For ungrouped data: we need to take into account every instance a data value occurs. Therefore we need to sum each data value as many times as its frequency. If the frequency is 5, then we need to sum that data value 5 times. We can obtain this by multiplying each data value by its frequency. We can then sum these values and divide the sum by the total number of data values, n, that is the sum of the frequencies.

For grouped data: we again need to take into account every instance of the class interval. We need to sum the data values as many times as their frequency. Since our data is already grouped we need to find one value to represent the entire class. We need to find the mid-values of the class intervals. This mid-value is the mean of the class interval, the sum of the class boundaries divided by two. We then take into account every occurrence of the mid-values by multiplying the mid-values by the frequencies. Then we can proceed as for ungrouped data and sum these values and divide the sum by the total number data values, n, that is the sum of the frequencies.

Example Problem:

Given the following data, find an average number of hours slept by adults and teenagers by finding the mean, the median and the mode.

Amount of hours slept at night for adults and teenagers.

 Hours of sleep Adults Teenagers 3 hours 2 6 4 hours 5 5 5 hours 11 15 6 hours 35 22 7 hours 42 55 8 hours 31 23 9 hours 8 15 10 hours 4 5 Total= 138 146

Solution: For the mean, we need to “count” each occurrence of each value. Therefore multiplying the value times the frequency gives us x*f which symbolized the total sum of each data value. Then the summation of x*f gives the sum of all the data values. We can divide this sum by n to obtain the mean. For the adults n = 138. For the teenagers n = 146. Therefore the mean for adults is 6.85 and the mean for teenagers is 6.84

The median position, i, is determine by n * (1/2). The median position for the adults is 69 and for the teenagers it is 73. Therefore the median for both groups is 7.

The mode is the value with the highest frequency.

Mode for adults = 7 hours

Mode for teenagers = 7 hours

Trimmed mean: We said the arithmetic mean can be affected by extreme data values. Extreme data values can skew the mean so that it is either too high or too low instead of a good measure of the central tendency. We can “trim” the mean to have a measure of center that is more “resistant”, meaning it is not influenced by extremely high or low data. We can compute any percent trimmed mean by finding the that percent of the total data values. We then round up and eliminate that number of data values from both sides of the ordered data set. Since we are removing specific extreme data value this works only with raw data.

Example: If we want a 10% trimmed mean of 15 data points since 10% of 15 is 1.5, then we round to 2. Then the 2 lowest and highest data values (a total of four data values) are removed.

Problem:

Given the income of the following group, find an average by finding the mean. Then find the 5% trimmed mean and compare. Which one better represents the entire data set? Why?

 Income Frequency 28,000 2 30,000 6 32,000 5 36,000 10 42,000 8 299,000 1

Solution:

 Income Frequency XF New XF 28,000 2 56,000 30,000 6 180,000 180,000 32,000 5 160,000 160,000 36,000 10 360,000 360,000 42,000 8 336,000 294,000 299,000 1 299,000 n = 32 1,391,000 994,000

The mean using all the data values is the sum of all the data values, f*x, divided by the total number of data values. The mean is 1,391,000 / 32 = 43,469.

The 5% trimmed mean is obtained by finding 5% of 32 which is 1.6. We round up to two and remove the two smallest data values and the two largest data values from the set. Our new sum of the data values, f*x, is 994,000 and our new n is 28. Therefore our new 5% trimmed mean is 994,000/28 = 35,500.

The trimmed mean is more representative of the entire population because the majority of the data values are around this value. The untrimmed mean is not as representative since it is higher than every value except one.  The extreme value of the largest data values skewed the untrimmed mean.

Homework:

If you were trying to choose a major and you were considering the average income for each major, which average would you prefer to have and why?

The mean could be thrown off my extremely high or low income. The median is a middle or balanced point and would be a better indicator.

DAY 3: Averages-

in the Computer Lab

Problem:

Given the following test scores, use Excel to order the data and find the mode and the mean. Click here to obtain this data and work in Excel.

 77 84 76 88 99 92 68 78 98 69 75 97 95 71 88 73 88 94 73 76

Solution:

Click here to see the solution in Excel. Under the data menu in Excel the Sort choice sorts the data in either descending or ascending order. Also the data can be sorted using the A to Z buttons in the toolbox. The cells containing the raw data need to be selected before the Sort function is used.

The mode can be found by using the function MODE and selecting the data cells as the target array. This can be done with or without sorting the data. The formula in the cell is “MODE (First Cell: Last Cell).” The mode is 88. To see the formula in the cells go to the Tools menu and select formula auditing and select formula auditing mode. To see the results of the formulas then inactivate this mode by selecting the formula auditing mode again from the Tools menu.

The mean is found in Excel either by using Excel to perform the arithmetic or by using the statistical function AVERAGE. Using Excel for arithmetic we can sum all the data values using the summation function. Then we can find n by using the Count function. Finally we obtain the mean by dividing the sum by n. We use the AVERAGE function by selecting the array. The formula in the cell is “AVERAGE (First Cell: Last Cell).”  The mean is 82.95.

Problem:

How hot does it get in Atlanta? The following are temperatures for 20 days in the summer. Using Excel, find the median and mean. Click here to obtain this data set and work in Excel.

If we find out that the thermometer used to measure the temperatures was always reading two degrees higher now find the new adjusted median and mean. Compare them with the original median and mean. What do you notice?

 99 84 78 86 94 83 84 95 79 90 80 99 95 82 91 82 93 90 80 83

Solution:  Click here to see this solution in Excel. We can sort all the data values using the A to Z function. We can then use the COUNT function to obtain n = 20 by selecting all the data values to be counted. Then use the spreadsheet capabilities to find i = (½)* n = 10 and select the values in the appropriate position. Since i is an integer then the mean of the 10th and 11th terms is the median. Therefore the median is 85.

We can check our answer using the median function. The formula in the cell would be “MEDIAN (First Cell: Last Cell).” The median is 85.

NOTE: We can also use the PERCENTILE function by selecting the array and defining the median to be the 50%. The formula in the cell using this function would be “PERCENTILE (First Cell: Last Cell, 1/2).”

We can use the summation function to sum all the data values. We can then use previous results of the COUNT function to obtain n. We then use the spreadsheet capabilities to find the quotient of the sum by n.

The mean can also be found in Excel using the AVERAGE function. We can use the AVERAGE function by selecting the array. The formula in the cell is “AVERAGE (First Cell: Last Cell).”  The mean is 87.35 degrees.

The adjusted mean and median can be obtained by adding 2 degrees to the original mean and median. The adjusted mean is 89.35 degrees and the adjusted median is 87. The students may add 2 degrees to each data value and use the same Excel functions again but comparing the two values at the end will help emphasize that there is no need to do that.

Problem:

Earning for Men and Women:

 Earnings (\$) Frequency 0- 20,000 123 20,000- 25,000 356 25,000- 30,000 560 30,000- 35,000 311 35,000- 40,000 107 40,000- 45,000 35 45,000- 50,000 21 50,000- Over 8

Using Excel, find an average income by finding the median. Click here to obtain this data set and work in Excel.

Solution:

 Earnings (\$) Frequency Cumm. Freq. 123 0- 20,000 123 20,000- 25,000 356 479 25,000- 30,000 560 1039 30,000- 35,000 311 1350 35,000- 40,000 107 1457 40,000- 45,000 35 1492 45,000- 50,000 21 1513 50,000- 65,000 8 1521 n = 1521

Click here to see this solution in Excel. For this particular problem Excel would only be useful in manipulating the data and calculating the cumulative frequencies. The same equation as used in day one would still need to be used but Excel can be used to refer to the appropriate cells.

Therefore i = 1521(1/2) = 760.5. The median class is the 25,000-30,000 class. Then the median is:

LCB + (i – a)/f * h

where LCB = 25,000, I = 760.5, a = 479, f = 560 and h = 5,000. The median is \$27,513.

Problem:

Given the following large number of data points find two averages to represent each class, the freshman, sophomore, juniors and seniors. Then find another two averages to represent the data as a whole. Which averages would be the most appropriate and why? The data represents the height (in inches) of freshman through seniors. Click here to obtain this data from Excel. Use Excel to obtain your answers.

 Freshman Sophomores Juniors Seniors 50 50 53 55 47 49 47 48 54 53 55 58 57 60 58 60 65 65 68 70 68 70 67 69 61 61 64 65 45 47 45 48 50 50 53 55 65 67 64 66 68 67 70 72 47 50 49 50 56 55 58 61 74 76 73 75 62 61 64 66 56 59 57 59 59 59 63 64 70 71 68 71 59 58 61 63 55 58 56 58 49 49 52 54 61 63 61 62 45 48 55 49 50 53 51 53 59 59 62 64 51 53 50 52 72 72 75 76 54 56 54 57 60 60 63 65 48 50 47 49 54 53 56 58 50 53 52 53 62 61 64 67 71 73 70 72 67 66 69 71 53 56 54 56 53 53 57 58 68 69 66 69 45 44 47 49 47 50 48 50 57 57 60 62 69 71 69 70 58 57 59 62 64 67 65 67 53 53 56 58 72 74 71 73 52 52 55 56 72 74 72 75 Sums: 2784 2819 2853 2938

Using the functions in Excel we can obtain all three averages in different ways. Click here to see these results obtained in Excel.

 Results Average Formula Fresh. Soph. Junior Senior Mean AVERAGE (First Cell: Last Cell) 58 59 59.44 61.25 Median MEDIAN (First Cell: Last Cell) 57 58 58.5 61.5 Mode MODE (First Cell: Last Cell) 50 53 64 58

DAY 4: Variation-

We have discussed the measures of central tendencies of data, now we will consider the spread of the data or the variation. The range gives the variation between the largest and smallest data values. But the range does not tell us how the data values differ from each other.

Another measure of variation is the standard deviation. It measures how far away each data value is from the mean. The equation to find the standard deviation sums the square difference between each data value and the mean, Σ(xi-mean)2. If the difference is not squared and data values are spread evenly below and above the mean then the sum of the difference could be zero. The equation squares this difference to eliminate this occurrence by eliminating negative differences. Then we have the square sum of the difference between each data point and the mean. If we divide this by n-1, we obtain the average square difference for the data set. Since we are interested in the average difference or spread of the data set, we need to take the square root. Therefore the equation is:

S= √(Σ(xi-mean)2/(n-1))

For raw data: We find the mean and take the difference between each data value and the mean. Then we square the difference, divide by n-1 and take the square root to find the standard deviation.

For ungroup data: We know that the sum of the squared difference can be split into two sums. We also know that the mean is the sum of all the data values divided by the total number of data values, n. Using algebra we know:

Σ(xi-mean)2 =  Σxi2 – (Σxi)2/n

Now we know for ungrouped data that the sum of the data values is obtained by taking into account each instance the data value occurs. Therefore the data value is multiplied by its frequency. Now substituting this back into the standard deviation equation, we obtain:

S= √(( Σ(xi2f) – (Σxif)2/n)/(n-1))

We use the definition of the mean again to obtain another version of this equation:

S= √((Σ(xi2f) – (mean2*n)/(n-1))

Then we can either obtain the difference between each data point and the mean or we can find the square of the sum of all the data values (Σ(xi2f) to solve for the standard deviation.

For grouped data: We can use the same method as for the ungrouped data with one modification. We need to find the mid-value of each class interval. The mid-value of the class interval is the average of the lower class boundary and upper class boundary. Now this average is our xi value that is multiplied by the class frequency. We can now continue the same as for ungrouped data and solve for the standard deviation.

Problem:

Find the variation of the following data by finding the range and the standard deviation. The data is the weight of babies. Click here to work with this data in Excel.

 Weight (lbs) frequency 3.0 2 4.5 7 6.0 8 6.8 14 7.2 10 7.9 3 8.5 1

Solution: Click here to see this solution in Excel. The range is 8.5-3.0 = 5.5 lbs. The standard deviation is 1.255.

Homework: If you were given an average repair cost for your type of car and the standard deviation. Which type of average would you prefer to have any why?

Solution: The mean would give the most amount of information since the standard deviation tells us how far the prices are from the mean.

DAY 5: Variation-

in the Computer Lab

Problem:

Find the mean and standard deviation of the following two data sets. How do they compare? Click here to obtain the data sets and work in Excel.

Problem: Generate a data table, either by generating random numbers or by choosing a characteristic that can be obtained in the classroom. For example, generating a die being rolled or collecting the amount of loose change each student in class has or collecting the amount of money students spent on lunch today. Click here to go to Excel and create your own data table. Find the mean, median, mode and standard deviation of your data set.

To see an example of randomly generated data go to Unit One in this Statistical Instructional Unit.