Brenda King

Tree Problem

The problem:

Using data from the lumber industry which gives the approximate number of board feet of lumber per tree in a forest of a given age, find a function that will fit the data.  Predict the harvest for ages other than those given.

Age of Tree    100s of Board Feet
 20               1
 40               6
 60               (not given)
 80              33
100              56
120              88
140               (not given)
160             182
180               (not given)
200             320

Introduction

In many real-world problems there exist patterns or relationships between sets of numerical data.  The relationship among the data can be influenced by many variables.  In the case of trees, the variables may include age, soil condition, weather, insects, and human intervention.   In this investigation, the relationship between Age of Trees and number of 100s of board feet will be modeled.

How can a model be fit to data?  How can the “best” model be selected?  Can the model be used to predict values other than those given? Are there limits on the use of the model?

Examining Relationships

An effective way to see a relationship in data is to display the information in a scatter plot. A scatter plot shows how two variables relate to each other and makes patterns in the data visible. In a strong relationship the data follow a specific pattern or trend; in a weak relationship the data are widely dispersed or show no pattern at all. The scatter plot for the tree data is shown in diagram 1.

Diagram 1

Scatterplots show the form and strength of the relationship between sets of data.  For the tree problem, the data moves in a positive direction from the lower left corner to upper right corner of the graph.  The form of the relationship is slightly curved.  The strength of a relationship is determined by how closely the points follow the curve.    The tree scatterplot seems to indicate a strong relationship.  A model can be constructed to represent this data.

When the points in a scatter plot are represented by a line of best fit, the line can be used to predict values other than the ones given. Linear relationships are quite common and simple to use.   Since the scatterplot displayed for the tree data appears to be curved, the function model will probably not be linear.  Many statistical measurements, such as correlation r, are based on the strength of a straight-line relationship. In order to use these measures to determine the “best” model, a transformation will be necessary to achieve linearity.
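As a rough illustration of how the correlation r measures straight-line strength, the sketch below computes r for the seven given data pairs before any transformation. The helper name pearson_r is an assumption made for this example, and the three ages with no given value are left out.

```python
import math

# Given (age, hundreds of board feet) pairs from the problem table.
# Ages 60, 140 and 180 are omitted because no value is given for them.
ages = [20, 40, 80, 100, 120, 160, 200]
board = [1, 6, 33, 56, 88, 182, 320]


def pearson_r(xs, ys):
    """Return the sample correlation coefficient r (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(ages, board)
print(f"r for the untransformed data = {r:.4f}")
```

A value of r below 1 here reflects only how far the points are from a straight line; it says nothing about how well a curve could fit them.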

Curve fitting

The process of fitting a set of data with a model can be done in many ways.

For example, given three noncollinear points, a quadratic model can be fit to the data. To find a, b, and c in f(x) = ax^2 + bx + c, write and solve a system of three linear equations in the three unknowns.

Three data points were selected from the tree problem and set up as equations as shown below.

Point         Substitution                   Equation
(20, 1)       a(20)^2 + b(20) + c = 1        400a + 20b + c = 1
(80, 33)      a(80)^2 + b(80) + c = 33       6400a + 80b + c = 33
(160, 182)    a(160)^2 + b(160) + c = 182    25600a + 160b + c = 182

Using a matrix equation, X = A^(-1)B, to solve for a, b, and c gives the following output:

These results produce the quadratic model f(x) = .009494x^2 - .416071x + 5.52381.
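The three-equation system can be checked with a short script. This is a minimal sketch that uses Gauss-Jordan elimination on exact fractions rather than an explicit matrix inverse; the two approaches give the same solution for this system, but the elimination routine is an assumption of this example and not the calculator procedure used above.

```python
from fractions import Fraction

# The three points selected in the text.
points = [(20, 1), (80, 33), (160, 182)]

# Augmented matrix [x^2, x, 1 | y], one row per point, in exact arithmetic.
rows = [[Fraction(x * x), Fraction(x), Fraction(1), Fraction(y)]
        for x, y in points]

n = 3
for col in range(n):
    # Find a row at or below the diagonal with a nonzero entry in this column.
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    # Scale the pivot row so the pivot entry becomes 1.
    pivot_value = rows[col][col]
    rows[col] = [v / pivot_value for v in rows[col]]
    # Clear this column in every other row.
    for r in range(n):
        if r != col:
            factor = rows[r][col]
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]

# The last column now holds the solution a, b, c.
a, b, c = (float(rows[i][3]) for i in range(n))
print(f"a = {a:.6f}, b = {b:.6f}, c = {c:.5f}")
```

Rounded to the digits shown in the text, the script gives the same coefficients as the model above.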

The graph of the tree data and quadratic model are shown in diagram 2.

Diagram 2

This model is not an exact fit (the last point is clearly not on the curve), but it does look like a reasonable fit. The equation produced from these three randomly selected points is not the only quadratic model for the tree problem. The model is dependent on the points selected for the calculation. If a different set of points is used, the equation will change.

The degree of the polynomial can increase if more points are used.  Given four noncollinear points, a cubic model can be fit to the data; in general, n points with distinct x-values determine a polynomial of degree at most n - 1.
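To illustrate how the number of points limits the degree, the sketch below evaluates the polynomial that passes exactly through four of the tree data points. Four points with distinct x-values determine a polynomial of degree at most three. Lagrange interpolation is used here as one convenient method; both the choice of these four points and the function name lagrange_eval are assumptions made for the example.

```python
def lagrange_eval(points, x):
    """Evaluate at x the polynomial of degree <= len(points) - 1
    that passes through every (xi, yi) in points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                # Basis factor: equals 1 at xi and 0 at every other xj.
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Four of the given data points (an arbitrary selection for illustration).
four_points = [(20, 1), (80, 33), (160, 182), (200, 320)]

for x, y in four_points:
    print(x, y, lagrange_eval(four_points, x))
```

Each printed pair of values agrees, since the interpolating polynomial reproduces the points it was built from. As with the three-point quadratic, a different choice of points would give a different polynomial.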

Another way to produce function models is to use graphing calculators or Excel spreadsheets. Diagrams 3-8 are sample models produced in this way.

Diagram 3 Linear

The linear model does not seem to be a good fit; only two points are close to the line.

Diagram 4 Exponential

The exponential model does not seem to be a good fit either; however, the points are closer to this curve than to the linear model.

For the models with degree greater than 1 (as shown in diagrams 5-8), the domain must be restricted to positive values (tree ages). These diagrams include negative values to better display the shape of the model.

The quadratic model created by the calculator produces a tighter fit to all the points than the 3 point model, because the calculator uses all of the given data rather than only three points.

3 point model: f(x) = .009494x^2 - .416071x + 5.52381

7 point model: f(x) = .016203x^2 - .475431x + 2.43061

Diagram 6 Cubic

With each increase in degree, the curve passes through, or closer to, more of the data points.

Diagram 7 Quartic

The changes shown in the coefficient of determination, R^2, confirm the improvement of each graph.

Diagram 8 Power

The power model slips slightly in the measure of fit as determined by R^2, but it still models a very strong relationship.

Selecting the best model

A correlation coefficient, r, indicates how closely the data points cluster around a linear model. The coefficient of determination, r^2, is a numerical quantity that tells how well the least-squares line does at predicting values.
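One common way to compute a coefficient of determination for any model is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared differences between the data and the model, and SS_tot is the sum of squared differences between the data and their mean. The sketch below applies that formula to the 3 point quadratic model quoted earlier, evaluated on the seven given data pairs. The function names are assumptions for this example, and a calculator may report a value computed in a different way.

```python
# Given (age, hundreds of board feet) pairs; ages with no value are omitted.
ages = [20, 40, 80, 100, 120, 160, 200]
board = [1, 6, 33, 56, 88, 182, 320]


def quadratic_3pt(x):
    """The 3 point quadratic model quoted in the text."""
    return 0.009494 * x ** 2 - 0.416071 * x + 5.52381


def r_squared(xs, ys, model):
    """Return 1 - SS_res / SS_tot for the given model (hypothetical helper)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


r2 = r_squared(ages, board, quadratic_3pt)
print(f"R^2 for the 3 point quadratic = {r2:.4f}")
```

Any other candidate model can be passed to r_squared in the same way, which gives a single number for comparing fits on the same data.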

When data is nonlinear, a transformation can be done to use correlation measures.  One of the transformations that will be described here (to flatten out nonlinear models) is for power equations y = ax^p.

Taking the logarithm of both sides of the power equation gives log y = log a + p log x.  The result is a linear relationship between log x and log y.  The power p in the power equation becomes the slope of the straight line that links log y to log x.  If taking the log of both variables produces a linear scatterplot, then a power equation is a reasonable model for the original data.

With the linear model just created, least-squares regression analysis and measures of goodness of fit, such as correlation, can be used.  Diagram 9 shows the linearized tree data and the graph of the results.

Log (AGE)        Log (Board)
2.995732274      0
3.688879454      1.791759469
4.094344562      2.785608595
4.382026635      3.496507561
4.605170186      4.025351691
4.787491743      4.477336814
4.941642423      4.911544711
5.075173815      5.204006687
5.192956851      5.511128522
5.298317367      5.768320996

With a coefficient of determination, r^2, of .999864, we see we have a good fit to the original data.  The graphing calculator does provide a coefficient of determination with the models shown in diagrams 3-8, but it is also nice to see how a nonlinear curve is “straightened” out for the calculation just done and shown in diagram 9.
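The least-squares line through the linearized data can be reproduced with a short script. This sketch takes the ten rows of the Log (AGE) and Log (Board) table as input; those values are natural logarithms (for example, ln 20 is about 2.9957). The function name least_squares_line is an assumption made for this example.

```python
# Rows copied from the Log (AGE) / Log (Board) table in the text.
log_age = [2.995732274, 3.688879454, 4.094344562, 4.382026635, 4.605170186,
           4.787491743, 4.941642423, 5.075173815, 5.192956851, 5.298317367]
log_board = [0.0, 1.791759469, 2.785608595, 3.496507561, 4.025351691,
             4.477336814, 4.911544711, 5.204006687, 5.511128522, 5.768320996]


def least_squares_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy ** 2 / (sxx * syy)


slope, intercept, r2 = least_squares_line(log_age, log_board)
print(f"slope = {slope:.4f}, intercept = {intercept:.5f}, r^2 = {r2:.6f}")
```

The slope is the power p of the power model, and the intercept is the log of the coefficient a.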

Predicting values

Using the model produced by the transformation, f(x) = 2.4952x - 7.44665, where x is the log of the age, values that were not given can be predicted by plugging in log(60), log(140), and log(180).  Each result is the log of a number of board feet, so it is converted back to the original form by using it as an exponent.  The table below has been filled in with the missing values.

Age of Tree    100s of Board Feet
 20               1
 40               6
 60              15.95
 80              33
100              56
120              88
140             132.12
160             182
180             247.35
200             320
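The filled-in values can be reproduced with a few lines of code. This is a minimal sketch that assumes natural logarithms, consistent with the linearized data table, so the conversion back uses the exponential function; the function name predict_board_feet is hypothetical.

```python
import math

# Slope and intercept of the linearized model f(x) = 2.4952x - 7.44665,
# where x is the natural log of the tree age.
SLOPE = 2.4952
INTERCEPT = -7.44665


def predict_board_feet(age):
    """Predict hundreds of board feet for a tree of the given age."""
    log_prediction = SLOPE * math.log(age) + INTERCEPT
    # Undo the log transformation to return to the original units.
    return math.exp(log_prediction)


for age in (60, 140, 180):
    print(age, round(predict_board_feet(age), 2))
```

Written without logs, the same model is the power equation y = e^(-7.44665) x^2.4952.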

There are some limits to fitting data to models.  Predictions should stay within the bounds of the original data if possible.  The model summarizes the relationship between two variables only when one of the variables helps explain or predict the other. Another problem is data outliers.  A deviation that falls outside the overall pattern of the relationship, such as an outlier, will distort the model.

Summary

Models can be fit to data using systems of equations, matrices, built-in calculator programs, Excel spreadsheets, and computer programs.  When necessary, nonlinear models can be transformed to linear models to take advantage of least-squares analysis. Goodness of fit measures, like correlation, can be used to select the “best” model for the data.  By using models, unknown values can be predicted.  Caution should be taken when trying to make predictions outside the range of the original data.