Assignment # 12

by Shannon Umberger

### Introduction

In this write-up, I investigated a set of data using a Microsoft Excel spreadsheet. First, I entered the raw data into the spreadsheet and graphed the data set. Next, I constructed a function that I thought best fit the data, let the spreadsheet calculate the theoretical data from this function, and graphed this new data set. Then, I used my function to make some predictions of missing data points. As I was performing this investigation, I had used residuals to measure the error between the raw data and the theoretical data, so I presented this analysis next. Finally, in my conclusion, I used a graphics calculator to help me produce a new function that had a smaller margin of error than the one I had constructed.

### Raw Data

The following data set comes from the lumber industry. It compares the age of a tree with the number of board feet of lumber that can be produced from that tree.

 Age of Tree 100s of Board Feet Produced (in Years) (Raw Data) 20 1 40 6 60 ??? 80 33 100 56 120 88 140 ??? 160 182 180 ??? 200 320

The following figure is a scatter plot of the above raw data. The data points were connected in order to enhance the shape of the curve.

### Theoretical Data

My next task was to find a function that best fit the given set of raw data. At first, judging by the shape of the graph, I thought that the function should be exponential. I quickly learned that exponential functions created more of a dramatic upward curve than I needed, so I moved on to a power function. After much guessing and testing, and while checking the residuals (explained below), I finally chose to use the function y = 0.00059x2.49.

The following chart shows the given raw data set and the theoretical data set, which was found using the above function. Notice that the theoretical data is very close to the raw data.

 Age of Tree 100s of Board Feet Produced 100s of Board Feet Produced (in Years) (Raw Data) (Theoretical Data) 20 1 1.02427530199162 40 6 5.75415279690848 80 33 32.3255616393263 100 56 56.3445625752648 120 88 88.7183210272092 160 182 181.597877598818 200 320 316.531328738118

The following figure is a graph of both the raw data set (in blue) and the theoretical data set (in pink). Note that the two curves seem to align so well that it is almost impossible to see the graph of the raw data set behind the graph of the theoretical data set.

### Predictions

Now that I have decided on a function that I think best models the data, I can use it to predict missing data points. The spreadsheet performs this operation for me (see chart below).

 Age of Tree 100s of Board Feet Produced 100s of Board Feet Produced (in Years) (Raw Data) (Theoretical Data) 20 1 1.02427530199162 40 6 5.75415279690848 60 ??? 15.7924178014801 80 33 32.3255616393263 100 56 56.3445625752648 120 88 88.7183210272092 140 ??? 130.229934006361 160 182 181.597877598818 180 ??? 243.489674632951 200 320 316.531328738118

I can predict that a 60-year-old tree produces approximately 1,600 board feet, a 140-year-old tree approximately 13,000 board feet, and a 180-year-old tree approximately 24,300 board feet.

### Measure of Error

Because the theoretical data set seems to match the raw data set so well, one can be led to believe that the function I chose could be the best function to use to model this data. Before that conclusion can be drawn, though, it is good to measure the error between the two sets of data.

The method I chose to use was to measure the residuals. First, I found the difference (called the residuals) between the theoretical data and the raw data. Since some of the residuals were positive and others were negative, it was not a good idea to average the residuals at this point. So, I squared the residuals in order to get a set of positive numbers, and then I averaged the squares.

Thes following chart shows all of the performed operations stated above.

 Age of Tree 100s of Board Feet Produced 100s of Board Feet Produced Theoretical Data Minus Raw Data Squares of Differences (in Years) (Raw Data) (Theoretical Data) (Residuals) (Squares of Residuals) 20 1 1.02427530199162 0.0242753019916193 0.000589290286784317 40 6 5.75415279690848 -0.245847203091518 0.0604408472679222 80 33 32.3255616393263 -0.674438360673733 0.454867102348272 100 56 56.3445625752648 0.344562575264845 0.118723368273142 120 88 88.7183210272092 0.7183210272092 0.51598509813088 160 182 181.597877598818 -0.402122401181515 0.161702425531987 200 320 316.531328738118 -3.46867126188198 12.0316803230059 Sum of Squares = 13.3439884548449 Average of Squares = 1.90628406497784

Note that the average of the squares of the residuals is approximately 1.9. The closer this value is to 0, the "better" the function that was chosen to model the data. It can be concluded that the function I used as a model is really good, but it is by no means to "best" function that exists.

Another way to analyze the residuals is to graph them, as shown below. Notice that most of the points seem to "bounce" between 0.7 and -0.7, which is fairly good. The last point, though, at a difference of -3.5, is an outlier of this trend. Again, it can be concluded that a better function can be found to model the data.

### Conclusion

Since I concluded that a better function can be found to model the data than the one I had constructed, and because I had trouble finding a better one myself (using the guess-and-test method with the spreadsheet), I decided to compare my function with one generated by a graphics calculator.

For the same raw data set as previously used, a graphics calculator gave me the regression function y = 0.00058637x2.49260861. Obviously, this function is very close to the one I constructed, but it uses constants that have been taken out to more decimal places than the constants in my model. Because this new function is more accurate, it makes sense that it should be "better" than the one I chose to use.

Plugging this new function into the spreadsheet produces the following data set. Notice that the new theoretical data set is very similar to the theoretical data set that my function had produced.

 Age of Tree 100s of Board Feet Produced 100s of Board Feet Produced (in Years) (Raw Data) (Theoretical Data) 20 1 1.02595972336347 40 6 5.77404641747198 80 33 32.4960242316547 100 56 56.6746651309192 120 88 89.2805420572102 160 182 182.885885307211 200 320 318.96198233571

A graph of this new theoretical data set (shown below) is almost exactly like the graph of the original theoretical data set that was produced by my function.

Even the predictions made using the new function (shown below) are not much different than those I made using my function!

 Age of Tree 100s of Board Feet Produced 100s of Board Feet Produced (in Years) (Raw Data) (Theoretical Data) 20 1 1.02595972336347 40 6 5.77404641747198 60 ??? 15.8637866078776 80 33 32.4960242316547 100 56 56.6746651309192 120 88 89.2805420572102 140 ??? 131.107930616389 160 182 182.885885307211 180 ??? 245.29201274612 200 320 318.96198233571

One might conclude, then, that the accuracy of the constants in the model function is not so important. But the true story is revealed after analyzing the residuals that were produced by the new function. This information is shown in the chart below.

 Age of Tree 100s of Board Feet Produced 100s of Board Feet Produced Theoretical Data Minus Raw Data Squares of Differences (in Years) (Raw Data) (Theoretical Data) (Residuals) (Squares of Residuals) 20 1 1.02595972336347 0.0259597233634659 0.000673907237107675 40 6 5.77404641747198 -0.225953582528025 0.0510550214572488 80 33 32.4960242316547 -0.50397576834532 0.253991575079256 100 56 56.6746651309192 0.674665130919223 0.455173038878252 120 88 89.2805420572102 1.28054205721016 1.63978796028402 160 182 182.885885307211 0.885885307210629 0.78479277753167 200 320 318.96198233571 -1.03801766428995 1.07748067137797 Sum of Squares = 4.26295495184552 Average of Squares = 0.60899356454936

Note that the average of the squares of the residuals is now approximately 0.6, as opposed to 1.9 found before. 0.6 is closer to 0 than 1.9 is, so while the function produced by the graphics calculator is not perfect, it is definitely better than my function.

A graph of the residuals (below) supports the fact that the new function is better than the function I had constructed. Though the points still seem to "bounce around" a little, none of them are extreme outliers like the one produced using my function.