Assignment # 12

by Shannon Umberger


Introduction

In this write-up, I investigated a set of data using a Microsoft Excel spreadsheet. First, I entered the raw data into the spreadsheet and graphed the data set. Next, I constructed a function that I thought best fit the data, let the spreadsheet calculate the theoretical data from this function, and graphed this new data set. Then, I used my function to make some predictions of missing data points. As I was performing this investigation, I had used residuals to measure the error between the raw data and the theoretical data, so I presented this analysis next. Finally, in my conclusion, I used a graphics calculator to help me produce a new function that had a smaller margin of error than the one I had constructed.


Raw Data

The following data set comes from the lumber industry. It compares the age of a tree with the number of board feet of lumber that can be produced from that tree.

Age of Tree

100s of Board Feet Produced

(in Years)

(Raw Data)
20 1
40 6
60 ???
80 33
100 56
120 88
140 ???
160 182
180 ???
200 320

The following figure is a scatter plot of the above raw data. The data points were connected in order to enhance the shape of the curve.


Theoretical Data

My next task was to find a function that best fit the given set of raw data. At first, judging by the shape of the graph, I thought that the function should be exponential. I quickly learned that exponential functions created more of a dramatic upward curve than I needed, so I moved on to a power function. After much guessing and testing, and while checking the residuals (explained below), I finally chose to use the function y = 0.00059x2.49.

The following chart shows the given raw data set and the theoretical data set, which was found using the above function. Notice that the theoretical data is very close to the raw data.

Age of Tree

100s of Board Feet Produced

100s of Board Feet Produced

(in Years)

(Raw Data)

(Theoretical Data)
20 1 1.02427530199162
40 6 5.75415279690848
80 33 32.3255616393263
100 56 56.3445625752648
120 88 88.7183210272092
160 182 181.597877598818
200 320 316.531328738118

The following figure is a graph of both the raw data set (in blue) and the theoretical data set (in pink). Note that the two curves seem to align so well that it is almost impossible to see the graph of the raw data set behind the graph of the theoretical data set.


Predictions

Now that I have decided on a function that I think best models the data, I can use it to predict missing data points. The spreadsheet performs this operation for me (see chart below).

Age of Tree

100s of Board Feet Produced

100s of Board Feet Produced

(in Years)

(Raw Data)

(Theoretical Data)
20 1 1.02427530199162
40 6 5.75415279690848
60 ??? 15.7924178014801
80 33 32.3255616393263
100 56 56.3445625752648
120 88 88.7183210272092
140 ??? 130.229934006361
160 182 181.597877598818
180 ??? 243.489674632951
200 320 316.531328738118

I can predict that a 60-year-old tree produces approximately 1,600 board feet, a 140-year-old tree approximately 13,000 board feet, and a 180-year-old tree approximately 24,300 board feet.


Measure of Error

Because the theoretical data set seems to match the raw data set so well, one can be led to believe that the function I chose could be the best function to use to model this data. Before that conclusion can be drawn, though, it is good to measure the error between the two sets of data.

The method I chose to use was to measure the residuals. First, I found the difference (called the residuals) between the theoretical data and the raw data. Since some of the residuals were positive and others were negative, it was not a good idea to average the residuals at this point. So, I squared the residuals in order to get a set of positive numbers, and then I averaged the squares.

Thes following chart shows all of the performed operations stated above.

Age of Tree

100s of Board Feet Produced

100s of Board Feet Produced

Theoretical Data Minus Raw Data

Squares of Differences

(in Years)

(Raw Data)

(Theoretical Data)

(Residuals)

(Squares of Residuals)
20 1 1.02427530199162 0.0242753019916193 0.000589290286784317
40 6 5.75415279690848 -0.245847203091518 0.0604408472679222
80 33 32.3255616393263 -0.674438360673733 0.454867102348272
100 56 56.3445625752648 0.344562575264845 0.118723368273142
120 88 88.7183210272092 0.7183210272092 0.51598509813088
160 182 181.597877598818 -0.402122401181515 0.161702425531987
200 320 316.531328738118 -3.46867126188198 12.0316803230059

Sum of Squares =
13.3439884548449

Average of Squares =
1.90628406497784

Note that the average of the squares of the residuals is approximately 1.9. The closer this value is to 0, the "better" the function that was chosen to model the data. It can be concluded that the function I used as a model is really good, but it is by no means to "best" function that exists.

Another way to analyze the residuals is to graph them, as shown below. Notice that most of the points seem to "bounce" between 0.7 and -0.7, which is fairly good. The last point, though, at a difference of -3.5, is an outlier of this trend. Again, it can be concluded that a better function can be found to model the data.


Conclusion

Since I concluded that a better function can be found to model the data than the one I had constructed, and because I had trouble finding a better one myself (using the guess-and-test method with the spreadsheet), I decided to compare my function with one generated by a graphics calculator.

For the same raw data set as previously used, a graphics calculator gave me the regression function y = 0.00058637x2.49260861. Obviously, this function is very close to the one I constructed, but it uses constants that have been taken out to more decimal places than the constants in my model. Because this new function is more accurate, it makes sense that it should be "better" than the one I chose to use.

Plugging this new function into the spreadsheet produces the following data set. Notice that the new theoretical data set is very similar to the theoretical data set that my function had produced.

Age of Tree

100s of Board Feet Produced

100s of Board Feet Produced

(in Years)

(Raw Data)

(Theoretical Data)
20 1 1.02595972336347
40 6 5.77404641747198
80 33 32.4960242316547
100 56 56.6746651309192
120 88 89.2805420572102
160 182 182.885885307211
200 320 318.96198233571

A graph of this new theoretical data set (shown below) is almost exactly like the graph of the original theoretical data set that was produced by my function.

Even the predictions made using the new function (shown below) are not much different than those I made using my function!

Age of Tree

100s of Board Feet Produced

100s of Board Feet Produced

(in Years)

(Raw Data)

(Theoretical Data)
20 1 1.02595972336347
40 6 5.77404641747198
60 ??? 15.8637866078776
80 33 32.4960242316547
100 56 56.6746651309192
120 88 89.2805420572102
140 ??? 131.107930616389
160 182 182.885885307211
180 ??? 245.29201274612
200 320 318.96198233571

One might conclude, then, that the accuracy of the constants in the model function is not so important. But the true story is revealed after analyzing the residuals that were produced by the new function. This information is shown in the chart below.

Age of Tree

100s of Board Feet Produced

100s of Board Feet Produced

Theoretical Data Minus Raw Data

Squares of Differences

(in Years)

(Raw Data)

(Theoretical Data)

(Residuals)

(Squares of Residuals)
20 1 1.02595972336347 0.0259597233634659 0.000673907237107675
40 6 5.77404641747198 -0.225953582528025 0.0510550214572488
80 33 32.4960242316547 -0.50397576834532 0.253991575079256
100 56 56.6746651309192 0.674665130919223 0.455173038878252
120 88 89.2805420572102 1.28054205721016 1.63978796028402
160 182 182.885885307211 0.885885307210629 0.78479277753167
200 320 318.96198233571 -1.03801766428995 1.07748067137797

Sum of Squares =
4.26295495184552

Average of Squares =
0.60899356454936

Note that the average of the squares of the residuals is now approximately 0.6, as opposed to 1.9 found before. 0.6 is closer to 0 than 1.9 is, so while the function produced by the graphics calculator is not perfect, it is definitely better than my function.

A graph of the residuals (below) supports the fact that the new function is better than the function I had constructed. Though the points still seem to "bounce around" a little, none of them are extreme outliers like the one produced using my function.


Return to Shannon's Home Page