4.8 Building models

In many situations, a scientist will have measured a large number of predictor variables that could potentially be important, and interest lies in determining which of them best explain variation in the response data cloud, and also whether particular combinations of variables, working together, do a better job than other combinations in this respect. More specifically, one may wish to build a model for the response data cloud using the best possible combination of the available predictor variables. Two primary issues arise when trying to build models in this way: first, what criterion should be used to identify a “good” model, and second, what procedure should be used to select the variables on the basis of that criterion?

In addition to providing tests of hypotheses for specific regression-style problems, DISTLM provides the user with a flexible model-building tool. A suite of selection procedures and selection criteria is available, as seen in the DISTLM dialog box and described below.

Selection procedures

Selection criteria

Depending on the resemblance measure used (and, perhaps to a lesser extent, the scales of the original response variables), it is possible for AIC (or AIC$_c$ or BIC) to be negative for a given model. This is caused not by the model having a negative residual sum of squares, but rather by $SS_{\text{Residual}}$ being less than 1.0 in value: when the log is taken of a value less than 1.0, the result is a negative number. In these cases (as in all others), smaller values of AIC (or AIC$_c$ or BIC) still correspond to a better model.
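This behaviour is easy to reproduce numerically. The sketch below uses the familiar regression-style form $AIC = N \ln(SS_{\text{Residual}}/N) + 2v$ (an assumed form for illustration; DISTLM's exact constant terms are not shown in this section) to show how the criterion goes negative once the logged quantity drops below 1.0, while its ordering of models is unaffected.

```python
import math

def aic(ss_residual, n_obs, n_parm):
    # Regression-style AIC: N * ln(SS_Residual / N) + 2v.
    # (Assumed form for illustration only; the constant terms
    # may differ from the ones DISTLM uses internally.)
    return n_obs * math.log(ss_residual / n_obs) + 2 * n_parm

# The log of any value below 1.0 is negative, so once the
# logged quantity falls below 1.0 the criterion itself goes
# negative; smaller values still identify the better model.
for ss in (500.0, 50.0, 0.5):
    print(f"SS_Residual = {ss:7.1f}  ->  AIC = {aic(ss, n_obs=50, n_parm=3):8.2f}")
```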

Although there are other model selection criteria, we have included in DISTLM the ones which presently seem to have the greatest general following in the literature (e.g., Burnham & Anderson 2002). For example, Godinez-Dominguez & Freire (2003) used a multivariate analogue to AIC in order to choose among competing models in a multivariate canonical correspondence analysis (CCA). However, the properties and behaviour of these proposed criteria are still largely unknown in the context of dbRDA, especially with multiple response variables ($p > 1$) and for non-Euclidean resemblance measures; more research in this area is certainly required. In the context of univariate model selection, AIC is known to be a rather generous criterion that tends to err on the side of including too many predictor variables, that is, to “overfit” (e.g., Nishii 1984, Zhang 1992, Seber & Lee 2003). On the other hand, trials using BIC suggest it may be a bit too severe, requiring the removal of rather too many potentially useful variables. Thus, if AIC and BIC yield similar results for a given dataset, you are probably on the right track! One possibility is to draw a scatter-plot of the AIC and BIC values for the top 20 or so models obtained for a given dataset and see which models fall in the lower left-hand corner, that is, those which have relatively low values on both criteria. These are the ones that should be considered as the best current contenders for a parsimonious model. An example of this type of approach is given in the analysis of the Ekofisk macrofauna below.
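For readers who wish to make such a plot themselves, a minimal sketch follows, assuming the criterion values for the top models have already been tabulated from DISTLM output; the model labels and values below are purely hypothetical. Candidate parsimonious models are those falling toward the lower left, i.e., low on both axes.

```python
import matplotlib.pyplot as plt

# Hypothetical (model, AIC, BIC) triples, as might be tabulated
# from the top of a DISTLM "Best" selection listing.
models = [
    ("X1",       -12.1,  -8.3),
    ("X1+X3",    -15.6, -11.9),
    ("X1+X2+X3", -14.8,  -7.2),
    ("X2",        -9.4,  -6.0),
    ("X1+X4",    -13.0,  -9.8),
]

fig, ax = plt.subplots()
for label, aic, bic in models:
    ax.scatter(aic, bic, color="tab:blue")
    ax.annotate(label, (aic, bic), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("AIC")
ax.set_ylabel("BIC")
ax.set_title("Top candidate models: look to the lower left-hand corner")
plt.show()
```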

