'Which model ...?' is the wrong question.

N.T. Longford


This paper extends the editorial "Model selection and efficiency: is 'Which model ...?' the right question?" (N.T. Longford, 2005, JRSS A 168, 469-472).

The weaknesses of the standard way of addressing model uncertainty, by selecting one of the candidate models and then applying it for all the intended inferences, are discussed and an alternative, composition of the estimators, is proposed. The weaknesses are generic to all model selection methods, because the action taken after the selection, evaluating an estimator, ignores the possibility that the selection was erroneous. In a typical setting of comparing the fit of a model with the fit of its submodel, choosing the submodel may bring about bias (of a model-based estimator), but the variance is (usually) reduced. Choosing the (more general) model retains the greater variance, but may reduce the bias. We should therefore weigh the relative merits of bias reduction and variance reduction, adhering to efficiency as our original criterion. This entails an admission that a model may yield efficient estimators for some quantities, but not for others. The connection between model validity and efficiency is tenuous, unless we are in asymptotia, where establishing model validity is trivial. In practice, we are never there: no model is correct unless we control the data-generating process, and any claim that a model selection process is 'good' is out of place because we do not have a metric for the distance between the selected model and the (ideal) valid model. Moreover, the aim of finding the ideal model is misguided, because it is not compatible with efficient estimation.
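The trade-off can be made concrete in a small worked case. The setting below is an illustrative assumption, not taken from the paper: two groups with sample sizes $n_1$ and $n_2$, common known variance $\sigma^2$, means $\mu_1$ and $\mu_2 = \mu_1 + \delta$, and target $\mu_1$. The general model estimates $\mu_1$ by the group mean $\bar y_1$; the submodel (equal means) estimates it by the pooled mean $\bar y$.

```latex
\begin{align*}
\mathrm{MSE}(\bar y_1) &= \frac{\sigma^2}{n_1}, \\
\mathrm{MSE}(\bar y)   &= \frac{\sigma^2}{n_1 + n_2}
                          + \left(\frac{n_2}{n_1 + n_2}\right)^{2} \delta^2, \\
\mathrm{MSE}(\bar y) < \mathrm{MSE}(\bar y_1)
  &\iff \delta^2 < \sigma^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right).
\end{align*}
```

Under these assumptions the invalid submodel yields the more efficient estimator of $\mu_1$ whenever $\delta^2$ is smaller than the sampling variance of $\bar y_1 - \bar y_2$, even though $\delta \neq 0$.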

All attempts to minimise the probabilities of erroneous selection are misguided when the ultimate goal of the analysis is efficient estimation, because efficiency, defined as small mean squared error, is only distantly related to the probability (of model correctness). Further, an act of selection is a two-edged inferential sword: we may find a 'better' model, but we incur a penalty for searching. This penalty is often ignored and, as a problem, it is understood selectively. Its concise diagnosis is that the distribution of the estimator based on a selected model depends on the process of selection and is a mixture of the distributions of the estimators based on all the candidate models. The properties of these mixtures are difficult to explore because the mixing and mixed distributions are correlated.

Supporting any theory of model selection by asymptotic results is not helpful, because model selection is, in essence, a small-sample problem. In small samples, some outright invalid models may yield far more efficient estimators (of certain parameters or other quantities) than the valid model; restricting ourselves to unbiased estimators is a big handicap. In any case, the act of selection can destroy the unbiasedness of an estimator, even conditionally on having selected correctly.
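The effect of selection on an estimator can be checked with a small simulation. The following is a minimal sketch under assumptions that are not in the paper: two groups with known common variance, target mu1, and selection between the pooled mean (submodel) and the group mean (general model) by a two-sided z-test at the 5% level. The function name `simulate_pretest` and all parameter values are hypothetical.

```python
import math
import random


def simulate_pretest(mu1, mu2, n1, n2, sigma=1.0, reps=20000,
                     z_crit=1.96, seed=1):
    """Monte Carlo MSEs of three estimators of mu1 (illustrative setup)."""
    rng = random.Random(seed)
    se1 = sigma / math.sqrt(n1)
    se2 = sigma / math.sqrt(n2)
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)

    sse_sep = sse_pool = sse_sel = 0.0
    n_general = 0        # replications in which the general model is chosen
    err_general = 0.0    # summed error of y1 in those replications

    for _ in range(reps):
        # Draw the two group means directly from their sampling distributions.
        y1 = rng.gauss(mu1, se1)
        y2 = rng.gauss(mu2, se2)
        pooled = (n1 * y1 + n2 * y2) / (n1 + n2)

        # Selection step: keep the general model only if the test rejects.
        general_chosen = abs(y1 - y2) > z_crit * se_diff
        selected = y1 if general_chosen else pooled

        sse_sep += (y1 - mu1) ** 2
        sse_pool += (pooled - mu1) ** 2
        sse_sel += (selected - mu1) ** 2
        if general_chosen:
            n_general += 1
            err_general += y1 - mu1

    return {
        "mse_separate": sse_sep / reps,
        "mse_pooled": sse_pool / reps,
        "mse_selected": sse_sel / reps,
        "prob_general_chosen": n_general / reps,
        # Bias of the group mean given that the (valid) general model was chosen.
        "bias_separate_given_chosen": (
            err_general / n_general if n_general else float("nan")
        ),
    }


# Here mu1 != mu2, so the general model is the valid one.
res = simulate_pretest(mu1=0.0, mu2=0.3, n1=10, n2=10)
for name, value in res.items():
    print(f"{name}: {value:.4f}")
```

With these hypothetical values the pooled mean, based on an invalid submodel, should have the smaller MSE, and the group mean, unbiased on its own, should show a clearly negative bias in the replications where the valid model is selected, because selection conditions on an extreme observed difference.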

I conclude that the search for the valid (or suitable) model leads to a blind alley in finite time (and with finite samples), because quantities of interest can be estimated more efficiently by selecting models specifically for the quantity. Simple examples from practice (ANOVA, small-area estimation and clinical trials) will be discussed. Further gains, sometimes substantial, can be made by linearly combining (composing) alternative estimators. Empirical Bayes estimators can be interpreted as a successful application of this idea. Note that this is different from Bayes factors, in which estimators are also combined, but with weights that depend solely on the fits of the candidate models, and not on the target.
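A composition can be sketched for an assumed two-group case (not an example from the paper): the target is mu1, the general-model estimator is the group mean y1, the submodel estimator is the pooled mean, and the composite is (1 - b) * y1 + b * pooled. Minimising the MSE over b gives b = cov(y1, y1 - pooled) / E[(y1 - pooled)^2]. The function name `composite_two_group` is hypothetical, and the sketch plugs in the true difference `delta = mu2 - mu1`, which would have to be estimated in practice, so the resulting MSE is a best case rather than what an applied analysis would attain.

```python
def composite_two_group(n1, n2, sigma, delta):
    """Ideal weight and MSEs for composing group mean and pooled mean.

    Assumes known sigma and known delta = mu2 - mu1 (illustration only).
    """
    n = n1 + n2
    var_sep = sigma ** 2 / n1                  # variance of y1 (unbiased)
    bias_pooled = n2 / n * delta               # bias of pooled mean for mu1
    mse_sep = var_sep
    mse_pooled = sigma ** 2 / n + bias_pooled ** 2

    # y1 - pooled = (n2 / n) * (y1 - y2)
    cov_term = (n2 / n) * var_sep              # cov(y1, y1 - pooled)
    msq_diff = (n2 / n) ** 2 * (
        sigma ** 2 * (1 / n1 + 1 / n2) + delta ** 2
    )                                          # E[(y1 - pooled)^2]

    b = cov_term / msq_diff                    # MSE-minimising weight on pooled
    mse_comp = var_sep - b * cov_term          # MSE at that weight
    return b, mse_sep, mse_pooled, mse_comp


b, mse_sep, mse_pooled, mse_comp = composite_two_group(
    n1=10, n2=10, sigma=1.0, delta=0.3
)
print(f"weight on pooled mean: {b:.4f}")
print(f"MSE separate: {mse_sep:.4f}, pooled: {mse_pooled:.4f}, "
      f"composite: {mse_comp:.4f}")
```

The weight depends on the target as well as on the data: a different quantity, such as mu2 or the difference of the means, would call for a different weight, which is the contrast with weights based solely on model fit.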

Related articles:
N.T. Longford (2003). An alternative to model selection in ordinary regression. Statistics and Computing 13, 67-80.

N.T. Longford (2008). An alternative analysis of variance. SORT, Journal of the Catalan Institute of Statistics 32, 77-91.

January 2011.