Abstract submitted to the ESRA Conference, Prague, June 2007

Studying human populations. A new curriculum for social statistics

Presenter: Nicholas T. Longford, SNTL, Reading, England

Email: ntl@sntl.co.uk

The presentation, based on a monograph recently submitted for publication, describes a new curriculum for social statistics centred on the basic concepts of sampling and measurement. Statistics is defined as making decisions in the presence of uncertainty, which arises as a consequence of the limited resources at our disposal. By resources, we mean not only finance, but also time, expertise, instruments, the goodwill and availability of the respondents, and the like. If these were not limited, the entire population that we study could be enumerated -- the values of the key variable(s) established for every one of its members. With limited resources, we have to resort to short-cuts: sampling (observing only a sample from the population) and imperfect measurement (recording the values of a manifest variable instead of those of the latent variable). This motivates two central topics: sampling methods and the study of measurement processes.

Statistical models can be introduced as descriptions of infinite populations, which are usually considered as a matter of convenience. To promote integrity, an admission is made upfront that linearity of the models is likewise adopted as a matter of analytical convenience. Sampling methods and model-based estimation represent two distinct perspectives: in sampling, the values of the variables in the population are regarded as fixed, and the act of inclusion into the sample is the sole source of uncertainty; in model-based methods, the values of the outcome variable(s) are regarded as random. The Bayesian paradigm is introduced to complement these two perspectives; in Bayesian analysis, all observed quantities are regarded as fixed and all quantities with unknown values as random.
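
The sampling (design-based) perspective can be illustrated with a minimal sketch. It is not taken from the monograph; the population values, the sample size and the function names are all assumptions made for illustration. The population is held fixed, and the only randomness is which units are drawn by simple random sampling without replacement.

```python
import random
import statistics

def design_based_demo(population, n, replications, seed=0):
    """Repeatedly draw simple random samples (without replacement) from a
    FIXED population and return the mean and variance of the sample means."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(population, n))
             for _ in range(replications)]
    return statistics.fmean(means), statistics.variance(means)

def srs_variance(population, n):
    """Sampling variance of the sample mean under simple random sampling
    without replacement: (1 - n/N) * S^2 / n."""
    N = len(population)
    return (1 - n / N) * statistics.variance(population) / n

# Hypothetical fixed population: the values 1, 2, ..., 100.
population = [float(v) for v in range(1, 101)]

avg_of_means, var_of_means = design_based_demo(population, n=10,
                                               replications=20000)
print(avg_of_means, statistics.fmean(population))   # close to each other
print(var_of_means, srs_variance(population, 10))   # close to each other
```

Nothing about the population values is treated as random here; the spread of the sample means arises solely from the act of inclusion into the sample.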

Non-response or, more generally, failure to collect the data as planned, is a ubiquitous problem in all large-scale studies of human populations, and methods for dealing with it are indispensable for a statistical analyst. The curriculum follows the innovative research on the EM algorithm and multiple imputation by Don Rubin and his former students, and introduces some less standard applications of their missing-data principle. One of them is a general method for dealing with imperfect measurement: the values of the latent variable are treated as the missing information.

With limited resources in mind, the necessity for study design in all its aspects (sampling, questionnaire, measurement, coding, collation of auxiliary information, and the like) is emphasised. Experimental design is introduced as the setting in which the assignment of the values of some or all of the variables is under the designer's control, and can be randomised. Without such control, making causal inferences is very difficult and requires some unverifiable assumptions. In observational studies, to which we resort when experimental design cannot be exercised, the focus rests on the assignment process. The definition of potential outcomes is introduced, connected to sampling methods, and treated as another example of incompleteness.
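
The view of potential outcomes as incomplete data can be sketched as follows. This is an illustration under stated assumptions, not the monograph's own example: the outcome values are simulated, the treatment effect is set to a constant 5, and the function name is hypothetical. Each unit has two potential outcomes, only one of which is observed, and randomised assignment decides which one.

```python
import random
import statistics

def randomised_experiment(y0, y1, seed=0):
    """Assign half of the units to treatment at random and estimate the
    average effect from the OBSERVED outcomes only."""
    rng = random.Random(seed)
    units = list(range(len(y0)))
    treated = set(rng.sample(units, len(units) // 2))
    # For a treated unit only y1 is observed; its y0 is missing,
    # and vice versa for a control unit.
    observed_treated = [y1[i] for i in units if i in treated]
    observed_control = [y0[i] for i in units if i not in treated]
    return (statistics.fmean(observed_treated)
            - statistics.fmean(observed_control))

# Hypothetical complete data: both potential outcomes for every unit.
gen = random.Random(1)
y0 = [gen.gauss(50, 10) for _ in range(2000)]
y1 = [v + 5.0 for v in y0]          # assumed constant effect of 5

true_effect = statistics.fmean(y1) - statistics.fmean(y0)
estimate = randomised_experiment(y0, y1)
print(true_effect, estimate)        # the estimate should be near the truth
```

The two observed groups play the role of random samples from the two sets of potential outcomes, which is the connection to sampling methods.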

Advanced or specialised topics include clinical trials, multilevel analysis, generalized linear models and meta-analysis. In general, hypothesis testing is de-emphasised because it is not conducive to making good decisions.

May 2007