Assignment 12: Modeling Data with Excel

By Nikhat Parveen

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this assignment we will model the tree data from the lumber industry given in problem 11 using excel.

This data is from the lumber industry, giving the approximate number of board feet of lumber per tree in a forest of a given age. What function will fit the data? Predict the harvest for ages other than those given.

How can we model the harvest of a tree based on its age that will be helpful to the lumber industry?

As the data we have is quantitative, we can plot a scatter plot  to determine if there is an  association between the age of tree ( consider as explanatory variable (x)) and the board feet of lumber (consider as response variable (y)).

Two quantitative variables x and y are said to have positive association when high values of x tend to occur with high values of y, and when low values of x tend to occur with low values of y. They are said to have a negative association when high values of one variable tend to pair with low values of the other variable, and low values of one pair with high values of the other.

As we want to explore the Excel features, we will plot the scatter plot using Excel chart tool.

Based on the data we have we can plot the data. We will be using MS Excel to analyze the tree data as it has the features to graph the data and  fit the data with  a best fit function.

After entering the data in excel I created the scatter plot as shown below. Click here for the excel file.

Let's analyze the data from the scatter plot...

• The data spread clearly shows that the growth of harvest is not linear, and in fact we can learn this from the data table too.  Certainly the trend looks like an exponential or power.

• You can also notice that the scatter plot shows a positive association between the two variables, as the high values of x (age of tree) tends to occur with the higher values of y (amount of lumber).

Now  to determine which line, exponential or power will best fit the data, we will plot both and decide which one to use based on the correlation.

Taking the advantage of the excel feature "add trend line", we will plot the trend line as the exponential as well as the power function. Note that the excel provides us with the option tool where we can select the option to give the equation of the line as well as the R squared value. (we will discuss R2 very soon)

1.Exponential fit: y = aebx, where y is the dependent variable, ie. amount of lumber and x is the independent variable which is age of tree in yrs. Excel calculates the value of a and b that best fits the data.

2. Power fit: y = axb, here x and y are the independent and dependent variables. Excel calculates the a and b values that best fit the data.

Analysis:

• Looking at both graphs above we can see that the data points  fit perfectly on the power trend line as compared to the exponential line.

• Both the graphs (exponential and power) gives the prediction equation  and the correlation R2 .

• The correlation for exponential is 0.8968 and for power it is 0.9999

Insight: The prediction equation is so called since it predicts the value of the response (dependent) variable y at any value of x. The reason for using the best fit line is that if x and y have an association, then we can predict most y values more accurately by substituting x-values into the equation y = aebx or  y = axb .

R-squared or r2: Another way to describe the strength of association refers to how well you can predict y using the prediction equation. The square of the correlation summarizes how much less prediction error there is when we use the best fit line to predict y, compared to using prediction equation to predict y. The prediction error is summarized by the sum of squared distances between the actual y- values and the predicted values. The notation r2 is used for this measure because, in fact, it equals the square of the correlation 'r'.

Specifically, r2 is interpreted as the proportional reduction in error. For age of tree and amount of lumber r2 = 0.9999 (power)  This means that the prediction error using the regression line to predict y is 99% smaller than the prediction error using y(hat) to predict y.

Since the correlation of 0.9999 indicates a strong relationship between the age of tree and the amount of lumber and is higher for the power function than the correlation for the exponential, we would therefore consider the power function as the best fit  for our tree data.

Now using the prediction equation y = 0.0006 x2.4926 provided by excel, we can predict the amount of lumber per tree at a given age. We can also find this equation solving the equation  y = axb    mathematically by taking log on both side of equation. Click here for details.

The chart below uses this equation in predicting the board feet of lumber given the age of tree.

Conclusion: MS Excel is a wonderful tool for modeling any kind of statistical data. It's special smart features of fitting the data to an equation is simply superb. By a click of a button we can graph the data, add the trend line and modify to any line we want.

Based on the chart we found, we can easily see that the board feet of lumber predicted at 10 years age tree is '0' feet and that of 50 years age tree is 1000 feet and at 200 yrs old tree is predicted to give 32,600 feet of lumber. But we should be cautious of not extrapolating the data by considering values that are far out of the range.