Cover Page

Contents

Preface

Abbreviations

CHAPTER 1: Survey Error Evaluation

1.1 SURVEY ERROR

1.2 EVALUATING THE MEAN-SQUARED ERROR

1.3 ABOUT THIS BOOK

CHAPTER 2: A General Model for Measurement Error

2.1 THE RESPONSE DISTRIBUTION

2.2 VARIANCE ESTIMATION IN THE PRESENCE OF MEASUREMENT ERROR

2.3 REPEATED MEASUREMENTS

2.4 RELIABILITY OF MULTIITEM SCALES

2.5 TRUE VALUES, BIAS, AND VALIDITY

CHAPTER 3: Response Probability Models for Two Measurements

3.1 RESPONSE PROBABILITY MODEL

3.2 ESTIMATING π, θ, AND ϕ

3.3 HUI-WALTER MODEL FOR TWO DICHOTOMOUS MEASUREMENTS

3.4 FURTHER ASPECTS OF THE HUI-WALTER MODEL

3.5 THREE OR MORE POLYTOMOUS MEASUREMENTS

CHAPTER 4: Latent Class Models for Evaluating Classification Errors

4.1 THE STANDARD LATENT CLASS MODEL

4.2 LATENT CLASS MODELING BASICS

4.3 INCORPORATING GROUPING VARIABLES

4.4 MODEL ESTIMATION AND EVALUATION

CHAPTER 5: Further Aspects of Latent Class Modeling

5.1 PARAMETER ESTIMATION

5.2 LOCAL DEPENDENCE MODELS

5.3 MODELING COMPLEX SURVEY DATA

CHAPTER 6: Latent Class Models for Special Applications

6.1 MODELS FOR ORDINAL DATA

6.2 A LATENT CLASS MODEL FOR RELIABILITY

6.3 CAPTURE-RECAPTURE MODELS

CHAPTER 7: Latent Class Models for Panel Data

7.1 MARKOV LATENT CLASS MODELS

7.2 SOME NONSTANDARD MARKOV MODELS

7.3 FURTHER ASPECTS OF MARKOV LATENT CLASS ANALYSIS

CHAPTER 8: Survey Error Evaluation: Past, Present, and Future

8.1 HISTORY OF SURVEY ERROR EVALUATION METHODOLOGY

8.2 CURRENT STATE OF THE ART

8.3 SOME IDEAS FOR FUTURE DIRECTIONS

8.4 CONCLUSIONS

Appendix A: Two-Stage Sampling Formulas

Appendix B: Loglinear Modeling Essentials

B.1 LOGLINEAR VERSUS ANOVA MODELS: SIMILARITIES AND DIFFERENCES

B.2 MODELING CELL AND OTHER CONDITIONAL PROBABILITIES

B.3 GENERALIZATION TO THREE VARIABLES

B.4 ESTIMATION OF LOGLINEAR AND LOGIT MODELS

References

Index

WILEY SERIES IN SURVEY METHODOLOGY

Established in Part by WALTER A. SHEWHART AND SAMUEL S. WILKS

Editors: Mick P. Couper, Graham Kalton, J. N. K. Rao, Norbert Schwarz, Christopher Skinner
Editor Emeritus: Robert M. Groves

A complete list of the titles in this series appears at the end of this volume.


Biemer, Paul P.
 Latent class analysis of survey error / Paul P. Biemer
  p. cm.
 Includes index.
 ISBN 978-0-470-28907-5 (cloth)

To Judy, Erika, and Lianne

Preface

Survey estimates are based on a sample and are therefore subject to sampling error. However, there are many other sources of error in surveys. These errors are referred to collectively as nonsampling errors. Nonsampling errors arise from interviewers, respondents, data processors, other survey personnel, and operations. Evaluating sampling error is easily accomplished by estimating the standard error of an estimate, a task that any commercial statistical software package can perform. Evaluating nonsampling errors is quite difficult, often requiring data not normally collected in the typical survey. This book discusses methods for evaluating the nonsampling error in survey data, focusing primarily on data that are categorical and errors that result in misclassifications. The book concentrates on a general set of models and techniques referred to collectively as latent class analysis.

Latent class analysis (LCA) encompasses a wide range of techniques and models that can be used for numerous applications, and this book covers many of those. The general methods literature often views LCA as the categorical data analog of factor analysis. That is not the approach taken in this book. Rather, this book treats LCA as a generalization of classical test theory and the traditional survey error modeling approaches. Readers who wish to apply LCA in factor analytic applications can benefit from the techniques considered here, but may wish to supplement their study of LCA with other literature cited in Chapter 4.

This book was written because there is currently no comprehensive resource for survey researchers and practitioners to learn the methods needed in the modeling and estimation of classification errors, particularly LCA techniques. Survey error evaluation methodology emerged in the 1960s primarily at the US Bureau of the Census. Over the years, a number of journal articles have been published that describe various models for estimating classification error in survey results; however, the work has been relatively sparse and scattered. This book collects the many years of experience of the author and other authors in the field of survey misclassification to provide a guide for the practitioner as well as a text for the student of survey error evaluation. It combines the theoretical, methodological, and practical aspects of estimating classification error and interpreting the results for the purposes of survey data quality improvement.

The audience for the book comprises individuals in government, universities, businesses, and other organizations who are involved in the development, implementation, or evaluation of surveys. Students of statistics or survey methodology who wish to learn how to evaluate the measurement error in survey data could also use this book as a resource for their studies. The book’s content should be accessible to anyone having an intermediate background in statistics and sampling methods. An ideal reader is anyone possessing a working understanding of elementary probability theory, expectation, bias, variance, the multinomial distribution, hypothesis testing, and model goodness of fit. Knowledge of categorical data analysis is helpful but is not required to grasp the essential concepts of the book. A primer on categorical data analysis containing the essential concepts used in the book is provided in Appendix B.

The book provides a very general statistical framework for modeling and estimating classification error in surveys. Following a general discussion of surveys and nonsampling errors in Chapter 1, Chapter 2 examines some of the early models for survey measurement error, including the Census Bureau model and the classical test theory model. In this chapter, the similarities and differences between the approaches, as well as their strengths and weaknesses, are described. This background serves to introduce the basic concepts of measurement error modeling for categorical data in Chapter 3, beginning with the very elementary model proposed by Bross. A latent class model for two measurements (the Hui–Walter model) is introduced that also serves to establish the essential principles of LCA. This chapter introduces the reader to the concept of the true value of a variable as an unobserved (latent) variable and the survey response as a single indicator of this latent variable. The early models are shown to be special cases of this general latent class model. Chapter 3 also describes the expectation–maximization (EM) algorithm and its pivotal role in latent class model parameter estimation.

The general latent class model for three or more indicators is introduced in Chapter 4. This chapter provides all the details regarding how to estimate the model parameters, how to build good models and test their fit, and other issues related to the interpretation of the model parameter estimates. Chapter 4 also introduces the LCA software package ℓEM, written by Dr. Jeroen Vermunt, which can be downloaded free from Vermunt’s website. ℓEM input statements are provided for most of the examples throughout the book, and readers are encouraged to replicate the analyses with this software. Chapter 5 contains a number of advanced topics in LCA, including how one deals with sparse data, boundary values, unidentifiability, and local maxima. Chapter 5 also provides an extensive discussion of local dependence in LCA and how to model its effects. The chapter ends with a discussion of latent class modeling with complex survey data. A number of advanced applications of LCA are included in Chapter 6.

Chapter 7 discusses models and analysis techniques that are appropriate for evaluating the measurement error in panel survey data, referred to as Markov latent class analysis (MLCA). MLCA is an important area for survey evaluation because it provides a means for estimating classification error directly from the data collected in the survey without the need for special reinterview or response replication studies. Essential in the application of these models is some evaluation or assessment of the extent to which the model assumptions hold. The chapter concludes with a discussion of these issues and the primary approaches for model validation.

Finally, Chapter 8 provides an overview of LCA and MLCA, tracing the history of the methodology and summarizing the current state of the art. The chapter considers some of the criticisms and pitfalls of LCA, when such criticism may be justified, and how to avoid the pitfalls and criticisms by an appropriate analysis of the data. Glimpsing into the future, the chapter concludes with a discussion of areas where further research is needed to advance the field.

Acknowledgments

A number of individuals and organizations deserve recognition and my appreciation for their support and encouragement throughout the preparation of this book. First, much appreciation is due to RTI International, who, through the RTI Fellows Program, supported much of my time to write this book. I am also indebted to Wayne Holden, my supervisor, for his encouragement throughout this project. This book has benefited substantially from my associations with Marcus Berzofsky, who wrote his Ph.D. thesis in this area, and Bill Kalsbeek, who codirected Berzofsky’s dissertation with me. Berzofsky also read a draft of the manuscript and offered valuable suggestions for improvement. In that regard, many thanks also to Lars Lyberg, Frauke Kreuter, Juliana Werneburg, and Adam Carle, who also read drafts of the manuscript and offered important ideas for improving the material.

I am very grateful for the many contributions of Clyde Tucker and Brian Meekins at BLS to this work. Our collaboration on applying MLCA to the Consumer Expenditure Survey provided considerable insights into the many complexities associated with real-world applications. John Bushery, Chris Wiesen, Gordon Brown, and Ken Bollen coauthored papers with me in this area, and I am sincerely grateful for the substantial body of knowledge I gained from these associations. Thanks also to the many students to whom I have taught this material. Their questions, comments, and critiques helped me clarify the concepts and helped to reveal areas of the methodology that are more difficult for the novice. Jeroen Vermunt deserves my sincere gratitude for his advice over the years. He has taught me much about LCA through our discussions, email exchanges, and other writings. The field of LCA also owes him a great debt of gratitude for writing the ℓEM software and making it available to anyone for free.

Finally, this book would not have been possible without the support, sacrifice, and encouragement of my loving wife, Judy. To her I express love and sincere appreciation.

PAUL P. BIEMER

Raleigh, NC
November 2010

Abbreviations

AIC

Akaike information criterion

ALVS

Agriculture Land Values Study

ANES

American National Election Studies

ANOVA

analysis of variance (ANACOVA—analysis of covariance)

ARL

administrative records list

BIC

Bayesian information criterion

BLS

Bureau of Labor Statistics (US)

BVR

bivariate residual

CATI

computer-assisted telephone interview(ing)

CASI

computer-assisted self-interview(ing) (ACASI—audio CASI)

CBCL

child behavior checklist

CDAS

Categorical Data Analysis System (proprietary software)

CEIS

Consumer Expenditure Interview Survey

CPS

Current Population Survey (US)

CR

Cressie–Read

DA

data acquisition

deff

design effect

df

degree(s) of freedom

DSF

delivery service file

ECM*

expectation–conditional maximization

EE

erroneously enumerated

EFU

Evaluation Followup (US Census Bureau)

EI

evaluation interview

EM

expectation–maximization (ℓEM—LCA software package for EM; proprietary to J. Vermunt)

EMP

employed

EPSEM

equal-probability selection method

ERS

Economic Research Service

FR

field representative (SFR—supervisory FR)

GLLAMM

generalized linear latent and mixed models

HT

Horvitz–Thompson

ICC

intracluster correlation

ICE

independent classification error

ICM

integrated coverage measurement

i.i.d.

independent and identically distributed

IPF

iterative proportional fitting

IRT

item response theory

KR20

Kuder–Richardson Formula 20

LC

latent class

LCA

latent class analysis (MLCA—Markov LCA)

LCMS

latent class mover–stayer (model)

LD

local dependence

LF

labor force

LISREL*

linear structural relations

LL

loglinear–logit (model; loglinear and logit combined)

LOR

log-odds ratio (LORC—LOR check)

LTA

latent transition analysis

MAR

missing at random (MCAR—missing completely at random; NMAR—not missing at random)

ML

maximum likelihood (MLE—ML estimation)

MLC

Markov latent class

MM

manifest Markov (model)

MMS

manifest mover–stayer (model)

MSA

metropolitan statistical area

MSE

mean-squared error

NASS

National Agricultural Statistics Service (US)

NDR

net difference rate

NHSDA

National Household Survey on Drug Abuse (US)

NHIS

National Health Interview Survey (US)

NLF

not (in) labor force

npar

number of (free unconstrained π or u) parameters

NR

nonresponse

NSDUH

National Survey on Drug Use and Health (US)

PES

postenumeration survey

PML

pseudo-maximum-likelihood

PPS

probabilities proportional to size

Pr

probability (in equations; Prob in tables)

PS

poststratification

PSU

primary-stage sampling unit (SSU—secondary-stage SU)

RB

relative bias

RDD

random-digit dialing

SAS*

statistical analysis system

s.e.

standard error (in equations; Std Err in tables)

SEM

structural equation model

SIPP

Survey of Income and Program Participation (US)

SPSS*

statistical package for the social sciences; statistical product and service solutions

SRS

simple random sampling

SRV

simple response variance

SSM

scale score measure(ment)

SV

sampling variance

UNE

unemployed

UWE

unequal weighting effect

* ECM, LISREL, SAS, and SPSS are proprietary names; these acronyms are commonly used in industry and elsewhere (the original unabbreviated meanings given here are rarely used)

CHAPTER 1

Survey Error Evaluation

1.1 SURVEY ERROR

1.1.1 An Overview of Surveys

This book focuses primarily on the errors in data collected in sample surveys and how to evaluate them. A natural place to start is to define the term “survey.” The American Statistical Association’s Section on Survey Research Methods has produced a series of 10 short pamphlets under the rubric What Is a Survey? (Scheuren 1999). That series defines a survey as a method of gathering information from a sample of objects (or units) that constitute a population. Typically, a survey involves a questionnaire of some type that is completed by either an informant (referred to as the respondent), an interviewer, an observer, or other agent acting on behalf of the survey organization or sponsor. The population of units can be individuals such as householders, teachers, physicians, or laborers or other entities such as businesses, schools, farms, or institutions. In some cases, the units can even be events such as hospitalizations, accidents, investigations, or incarcerations. Essentially any object of interest can form the population.

In a broad sense, surveys also include censuses because the primary distinction between them is simply the fraction of the population that is surveyed. A survey is confined to only a sample or subset of the population. Usually, only a small fraction of the population members is selected for the sample. In a census, every unit in the population is selected. Therefore, much of what we say about surveys also applies to censuses.

If the survey sample is selected randomly (i.e., by a probability mechanism giving known, nonzero probabilities of selection to each population member), valid statistical statements regarding the parameters of the population can be made. For example, suppose that a government agency wants to estimate the proportion of 2-year-old children in the country who have been vaccinated against infectious diseases (polio, diphtheria, etc.). A randomly selected sample of 1000 children in this age group is drawn and their caregivers are interviewed. From these data, it is possible to estimate the proportion of children who are vaccinated to within some specified margin of error. Sampling theory [see, e.g., Cochran (1977) or, more recently, Levy and Lemeshow (2008)] provides specific methods for estimating margins of error and testing hypotheses about the population parameters.
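For instance, under simple random sampling and ignoring nonsampling error, the margin of error follows directly from the standard error of the estimated proportion. The figures below use an illustrative estimate of 0.80, assumed purely for the sake of the calculation rather than taken from any survey, together with $n = 1000$:

$$
\widehat{\mathrm{s.e.}}(\hat{p}) \;=\; \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\;=\; \sqrt{\frac{(0.80)(0.20)}{1000}} \;\approx\; 0.0126,
\qquad
\mathrm{MOE}_{95\%} \;\approx\; 1.96 \times 0.0126 \;\approx\; 0.025,
$$

that is, a 95% margin of error of roughly ±2.5 percentage points.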

Surveys may be cross-sectional or longitudinal. Cross-sectional surveys provide a “snapshot” of the population at one point in time. The products of cross-sectional surveys are typically descriptive statistics that capture distributions of the population for characteristics of interest, including health, education, criminal justice, economics, and environmental variables. Cross-sectional surveys may occur only once or may be repeated at some regular interval (e.g., annually). As an example, the National Health Interview Survey (Centers for Disease Control and Prevention 2009) is conducted monthly and collects important data on the health characteristics of the US population.

Longitudinal or panel surveys are repeating surveys in which at least some of the same sample units are interviewed at different points in time. By taking similar measurements on the same units at different points in time, investigators can more precisely estimate changes in population parameters as well as individual characteristics. A fixed panel (or cohort) survey interviews the entire sample repeatedly, usually over some significant period of time such as 2 or more years. As an example, the Panel Study of Income Dynamics (Hill 1991) has been collecting income data on the same 4800 families (as well as families spawned from these) since 1968.

A rotating panel survey is a type of longitudinal survey where part of the sample is replaced at regular intervals while the remainder of the sample is carried forward for additional interviewing. This design retains many of the advantages of a fixed panel design for estimating change while reducing the burden and possible conditioning effects on sample units caused by repeatedly interviewing them many times. An example is the US Current Population Survey (CPS) (US Census Bureau 2006), which is a monthly household survey for measuring the month-to-month and year-to-year changes in labor force participation rates. The CPS uses a somewhat complex rotating panel design, where each month about one-eighth of the sample is replaced by new households. In this way, households are interviewed a maximum of 8 times before they are rotated out of the sample.

Finally, a split-panel survey is a type of longitudinal survey that combines the features of a repeated cross-sectional survey with a fixed panel survey design. The sample is divided into two subsamples: one that is treated as a repeated cross-sectional survey and the other that follows a rotating panel design. An example of a split-panel design is the American National Election Studies [American National Election Studies (ANES), 2008]. Figure 1.1 compares these four survey designs.

As this book will explain, methods for evaluating the error in surveys may differ depending on the type of survey. Many of the methods discussed can be applied to any survey while others are appropriate only for longitudinal surveys. The next section provides some background on the problem of survey error and its effects on survey quality.

Figure 1.1 Reinterview patterns for four survey types.

1.1.2 Survey Quality and Accuracy and Total Survey Error

The terms survey quality, survey data quality, accuracy, bias, variance, total survey error, measurement validity, and reliability are encountered quite often in the survey error literature. Unfortunately, their definitions are often unspecified or inconsistent from study to study, which has led to some confusion in the field. In this section, we provide definitions of these terms that are reasonably consistent with conventional use, beginning with perhaps the most ambiguous term: survey quality.

Because of its subjective nature, survey quality is a vague concept. To some data producers, survey quality might mean data quality: large sample size, a high response rate, error-free responses, and very little missing data. Statisticians, in particular, might rate such a survey highly on some quality scale. Data users, on the other hand, might still complain that the data were not timely or accessible, that the documentation of the data files is confusing and incomplete, or that the questionnaire omitted many relevant areas of inquiry that are essential for research in their chosen field. From the user’s perspective, the survey exhibits very poor quality.

These different points of view suggest that survey quality is a very complex, multidimensional concept. Juran and Gryna (1980) proposed a simple definition of quality that can be appropriately applied to surveys, namely, the quality of a product is its “fitness for use.” But, as Juran and Gryna explain, this definition is deceptively simple because there are really two facets of quality: (1) freedom from deficiencies and (2) responsiveness to customers’ needs. For survey work, facet 1 might be translated as error-free data, data accuracy, or high data quality, while facet 2 might be translated as providing product features that result in high user satisfaction. The latter might include data accessibility and clarity, timely data delivery, collection of relevant information, and use of coherent and conventional concepts.

When applied to statistical products, the definition “fitness for use” has another limitation in that it implies a single use or purpose. Surveys are usually designed for multiple objectives among many data users. A variable in a survey may be used in many different ways, depending on the goals of the data analyst. For some uses, timeliness may be paramount. For other uses, timeliness is desirable, but comparability (i.e., ensuring that the results can be compared unambiguously to prior data releases from the same survey) may be more critical.

In the mid-1970s, a few government statistical offices began to develop definitions for survey quality that explicitly took into account the multidimensionality of the concept [see, e.g., Lyberg et al. (1977) or, more recently, Fellegi (1996)]. This set of definitions has been referred to as a survey quality framework. As an example, the quality framework used by Statistics Canada includes these seven quality dimensions: relevance, accuracy, timeliness, accessibility, interpretability, comparability, and coherence. Formal and accepted definitions of these concepts can be found at Statistics Canada (2006). Eurostat has also adopted a similar quality framework [see, e.g., Eurostat (2003)].

Given this multidimensional conceptualization of quality, a natural question is how quality is to be maximized in a survey. One might conceptualize a one-dimensional indicator that combines these seven dimensions into an overall survey quality indicator. Then the indicator could be evaluated for various designs and the survey design maximizing this quantity could be selected. However, this approach oversimplifies the complexity of the problem since there is no appropriate way of combining the diverse dimensions of survey quality. Rather, quality reports or quality declarations providing information on each dimension have been used to summarize survey quality. A quality report might include a description of the strengths and weaknesses of a survey organized by quality dimension, with emphasis on sampling errors, nonsampling errors, key release dates for user data files, forms of dissemination, availability and contents of documentation, as well as special features of the survey approach that may be of importance to most users. A number of surveys have produced extended versions of such reports, called quality profiles. A quality profile is a document that provides a comprehensive picture of the quality of a survey, addressing each potential source of error. Quality profiles have been developed for a number of US surveys, including the Current Population Survey (CPS) (Brooks and Bailar 1978), the Survey of Income and Program Participation (Jabine et al. 1990), the US Schools and Staffing Survey (Kalton et al. 2000), the American Housing Survey (Chakrabarty and Torres 1996), and the US Residential Energy Consumption Survey (Energy Information Administration 1996). Kasprzyk and Kalton (2001) review the use of quality profiles in US statistical agencies and discuss their strengths and weaknesses for survey improvement and quality declaration purposes.

Figure 1.2 Comparisons of three cost-equivalent survey designs. (Source: Biemer 2010.)

Note that data quality or accuracy is not synonymous with survey quality. Good survey quality is the result of optimally balancing all quality dimensions to suit the specific needs of the primary data users. As an example, if producing timely data is of paramount importance, accuracy may have to be compromised to some extent. Likewise, if a high level of accuracy is needed, temporal comparability may have to be sacrificed to take advantage of the latest and much improved methodologies and technologies. On the other hand, data quality refers to the amount of error in the data. As such, it focuses on just one quality dimension—accuracy.

To illustrate this balancing process, Figure 1.2 shows three cost-equivalent survey designs, each with a different mix of five quality dimensions: accessibility, timeliness, relevance, comparability, and accuracy. The shading of the bars in the graph represents the proportion of the survey budget that is to be allocated for each quality dimension. For example, design C allocates about two-thirds of the budget to achieve data accuracy while design A allocates less to accuracy so that more resources can be devoted to the other four dimensions. Design B represents somewhat of a compromise between designs A and C. Determining the best design allocation depends on the purpose of the survey and how the data will ultimately be used. If a very high level of accuracy is required (e.g., a larger sample or a reduction of nonsampling errors), design C is preferred. However, if users are willing to sacrifice data quality for the sake of greater accessibility, relevance, and comparability, then design A may be preferred. Since each design has its own strengths and weaknesses, the best one will have a mix of quality attributes that is most appropriate for the most important purposes or the majority of data users.

Total survey error refers to the totality of error that can arise in the design, collection, processing, and analysis of survey data. The concept dates back to the early 1940s, although it has been revised and refined by many authors over the years. Deming (1944), in one of the earliest works, describes “13 factors that affect the usefulness of surveys.” These factors include sampling errors as well as nonsampling errors: the other factors that will cause an estimate to differ from the population parameter it is intended to estimate. Prior to Deming’s work, not much attention was being paid to nonsampling errors, and, in fact, textbooks on survey sampling seldom mentioned them. Indeed, classical sampling theory (Neyman 1934) assumes that survey data are error-free except for sampling error. The term total survey error originated with an edited volume of the same name (Andersen et al. 1979).

Optimal survey design is the process of minimizing the total survey error subject to cost constraints [see, e.g., Groves (1989) and Fellegi and Sunter (1974)]. Biemer and Lyberg (2003) extended this idea to include other quality dimensions (timeliness, accessibility, comparability, etc.) in addition to accuracy. They advocate an approach that treats the other quality dimensions as additional constraints to be met as total survey error is minimized (or equivalently, accuracy is maximized). For example, if the appropriate balance of the quality dimensions is as depicted by design B in Figure 1.2, then the optimal survey design is one that minimizes total survey error within that fraction of the budget allocated to achieving high data accuracy represented by the unshaded area of the bar. As an example, in the case of design B, the budget available for optimizing accuracy is approximately 50% of the total survey budget. The optimal design is one that maximizes accuracy within this budget allocation while satisfying the requirements established for the other quality dimensions shown in the figure.
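One informal way to express this constrained view of optimal design in symbols is sketched below. The notation is illustrative rather than the author’s: $d$ denotes a candidate design, $\mathrm{TSE}(d)$ its total survey error (quantified in the next subsection by the mean-squared error), $C(d)$ its cost, $B$ the available budget, and $Q_k(d) \ge q_k$ the requirement set for the $k$th of the other quality dimensions (timeliness, accessibility, comparability, and so on):

$$
\min_{d}\; \mathrm{TSE}(d)
\qquad \text{subject to} \qquad
C(d) \le B, \quad Q_k(d) \ge q_k, \quad k = 1, \ldots, K.
$$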

Mean-Squared Error (MSE)

The prior discussion can be summarized by stating that the central goal of survey design should be to minimize total survey error subject to constraints on costs while accommodating other user-specified quality dimensions. Survey methodology, as a field of study, aims to accomplish this goal. General textbooks on survey methodology include those by Groves (1989), Biemer and Lyberg (2003), Groves et al. (2009), and Dillman et al. (2008), as well as a number of edited volumes. The current book focuses on one important facet of survey methodology—the evaluation of survey error, particularly measurement error. The book focuses on the current best methods for assessing the accuracy of survey estimates subject to classification error. A key concept in the survey methods literature is the mean squared error (MSE), which is a measure of the accuracy of an estimate. The next few paragraphs describe this concept.

Let $\hat{\mu}$ denote an estimate of the population parameter $\mu$ based on sample survey data. Survey error may be defined as the difference between the estimate and the parameter that it is intended to estimate:

(1.1)   $\hat{\mu} - \mu$

There are many reasons why $\hat{\mu}$ and $\mu$ may disagree and, consequently, the survey error will not be zero. One obvious reason is that the estimator of $\mu$ is based upon a sample and, depending on the specific sample selected, $\hat{\mu}$ will deviate from $\mu$, sometimes considerably so, especially for small samples. However, even in very large samples, the difference can be considerable due to nonsampling errors, meaning errors in an estimate that arise from all sources other than sampling error. The survey responses themselves may be in error because of ambiguous question wording, respondent errors, interviewer influences, and other sources. In addition, there may be missing data due to nonresponding sample members (referred to as unit nonresponse) or when respondents do not answer certain questions (referred to as item nonresponse). Data processing can also introduce errors. All these errors can cause $\hat{\mu}$ and $\mu$ to differ even when there is no sampling error, as in a complete census.

In the survey methods literature, the preferred measure of total survey error of an estimate (i.e., the combination of sampling and nonsampling error sources) is the MSE, defined as

(1.2)   $\mathrm{MSE}(\hat{\mu}) = E(\hat{\mu} - \mu)^2$

which can be rewritten as

(1.3)   $\mathrm{MSE}(\hat{\mu}) = B^2(\hat{\mu}) + \mathrm{Var}(\hat{\mu})$

where $B(\hat{\mu}) = E(\hat{\mu}) - \mu$ is the bias of the estimator and $\mathrm{Var}(\hat{\mu}) = E[\hat{\mu} - E(\hat{\mu})]^2$ is the variance. In these expressions, expected value is broadly defined with respect to the sample design as well as the various random processes that generate nonsampling errors. Optimal survey design attempts to minimize (1.3) given the budget, schedule, and other constraints specified by the survey design. This is a very challenging task. It is facilitated, first and foremost, by some assessment of the magnitude of the MSE for at least a few key survey characteristics.
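As a concrete illustration of the decomposition in (1.3), the short simulation below is only a sketch: the true prevalence of 0.30, the 5% false-positive rate, the 15% false-negative rate, the sample size, and the number of replications are all assumed purely for illustration. It estimates a population proportion from repeated simple random samples whose responses are subject to misclassification and checks numerically that the MSE equals the squared bias plus the variance.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Illustrative (assumed) values: true prevalence and classification error rates
true_prev = 0.30        # population proportion with the characteristic
false_neg = 0.15        # P(classified "no" | truly "yes")
false_pos = 0.05        # P(classified "yes" | truly "no")
n, n_reps = 1000, 5000  # sample size and number of simulated surveys

estimates = np.empty(n_reps)
for r in range(n_reps):
    truth = rng.random(n) < true_prev                # true status of sampled units
    # Observed (misclassified) responses
    observed = np.where(truth,
                        rng.random(n) >= false_neg,  # true "yes" kept with prob 1 - false_neg
                        rng.random(n) < false_pos)   # false positives occur with prob false_pos
    estimates[r] = observed.mean()                   # survey estimate of the prevalence

bias = estimates.mean() - true_prev
variance = estimates.var()
mse = np.mean((estimates - true_prev) ** 2)

print(f"bias       = {bias:.4f}")
print(f"variance   = {variance:.5f}")
print(f"bias^2+var = {bias**2 + variance:.5f}")
print(f"MSE        = {mse:.5f}")   # matches bias^2 + variance up to simulation noise
```

With these assumed error rates, the squared bias induced by misclassification is of roughly the same order as the sampling variance, even though the individual error rates are small.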

The preferred approach to evaluating survey error is to first decompose the total error into components associated with the various sources of error in a survey. Then each error source can be evaluated separately in a “divide and conquer” fashion. Some sources may be ignored while others may be targeted in special evaluation studies. Biemer and Lyberg (2003, Chapter 2) suggest a mutually exclusive and exhaustive list of error sources applicable to most surveys. The list includes sampling error and five nonsampling error sources: specification error, measurement error, nonresponse error, frame error, and data-processing error. These error sources are briefly described next.

 

1.1.3 Nonsampling Error

The evaluation of sampling error is considered a best practice in survey research. Nonsampling errors, by contrast, are rarely fully evaluated in surveys, although there are many examples of evaluations that focus on one or perhaps two error sources. In this section, we consider the five sources of nonsampling error in more detail and then discuss some methods that have been used in their evaluation.

Specification Error

A specification error arises when the concept implied by the survey question and the concept that should have been measured in the survey differ. When this occurs, the wrong parameter is estimated by the survey and, thus, inferences based on the estimate are likely to be erroneous. Specification error is often caused by poor communication between the researcher (or subject matter expert) and the questionnaire designer. This concept is closely related to the concept of construct validity in the psychometric literature [see, e.g., Nunnally and Bernstein (1994)] and relevance in official statistics [see, e.g., Dalenius (1985)].

Specification errors are particularly common in surveys of business establishments and organizations where many terms that have precise meanings to accountants are misspecified or defined incorrectly by the questionnaire designers. Examples are terms such as revenue, asset, liability, gross service fees, and information services, which have different meanings in different contexts. Such specialized terms should be clearly defined in surveys to avoid specification error.

As an example, consider the measurement of unemployment in the US Current Population Survey (CPS). The US Bureau of Labor Statistics (BLS) considers the unemployed population as comprising two types of persons: those who are “looking for work” and those who are “on layoff.” Persons on layoff are defined as those who are separated from a job and await a recall to return to that job. Persons who are “looking for work” are the unemployed who are not on layoff and who are pursuing certain specified activities to find employment. Distinguishing between these two groups is important for labor economists. Prior to 1994, the CPS questionnaire did not consider or collect information as to whether a person classified as “laid off” expected to be recalled to work at some point in the future. Rather, respondents were simply asked “Were you on layoff from a job?” This question was later determined to be problematic because, to many people, a “layoff” could mean permanent termination from the job rather than the temporary loss of work as the BLS economists defined the term.

In 1994, the BLS redesigned the labor force status questions and, as part of that redesign, attempted to clarify the concept of layoff in the questionnaire. The revised questions now ask, “Has your employer given you a date to return to work?” and “Could you have returned to work if you had been recalled?” These questions brought the concept of “on layoff” in line with the specification being used by BLS economists. Specification errors can be quite difficult to detect without the help of subject matter experts who are intimately familiar with the survey concepts and how they will ultimately be used in data analyses, because questions may be well-worded while still completely missing essential elements of the variable to be measured.

Biemer and Lyberg (2003, p. 39) provide another example of specification error from the Agriculture Land Values Survey (ALVS) conducted by the US National Agricultural Statistics Service. The ALVS asked farm operators to provide the market value for a specific tract of land that was randomly selected within the boundaries of the farm. Unfortunately, the concepts that were essential for the valid valuation of agricultural land were not accurately stated in the survey—a problem that came to light only after economists at the Economic Research Service (ERS) were consulted regarding the true purpose of the questions. These subject matter experts pointed out that their models required a value that did not include capital improvements such as irrigation equipment, storage facilities, and dwellings. Because the survey question did not exclude capital improvements, the survey specification of agricultural land value was inconsistent with the way the ERS economists were using the data.

Measurement Error

Whereas total survey error is defined for a statistic (or estimator), measurement error is defined for an observation. Let $\mu_i$ denote the true value of some characteristic measured in a survey for a unit $i$, and let $y_i$ denote the corresponding survey measurement of $\mu_i$. Then the measurement error is

(1.4)   $y_i - \mu_i$

that is, the difference between the survey measurement and the true value of the characteristic. Measurement error has been studied extensively and is often reported in the survey methods literature [for an extensive review, see Biemer and Lyberg (2003, Chapters 4–6)]. For many surveys, measurement error can also be the most damaging source of error. It includes errors arising from respondents, interviewers, and survey questions. Respondents may either deliberately or otherwise provide incorrect information in response to questions. Interviewers can cause errors in a number of ways. They may, by their appearance or comments, influence responses; they may record responses incorrectly, or otherwise fail to comply with prescribed survey procedures; and, in some cases, they may deliberately falsify data. The questionnaire can be a major source of error if it is poorly designed. Ambiguous questions, confusing instructions, and easily misunderstood terms are examples of questionnaire problems that can lead to measurement error.

Measurement errors can also arise from the information systems that respondents may draw on to formulate their responses. For example, a farm operator or business owner may consult records that may be in error and, thus, cause an error in the reported data. It is also well known (Biemer and Lyberg 2003, Chapter 6) that the mode of data collection can affect measurement error. As an example, mode comparison studies (Biemer 1988; de Leeuw and van der Zouwen 1988; Groves 1989) have found that data collected by telephone interviewing are, in some cases, less accurate than the same information collected by face-to-face interviewing. Finally, the setting or environment within which the survey is conducted can also contribute to measurement error. When collecting data on sensitive topics such as drug use, sexual behavior, or fertility, the interviewer may find that a private setting is more conducive to obtaining accurate responses than one in which other members of the household are present. In establishment surveys, topics such as land use, financial loss and gain, environmental waste treatment, and resource allocation can also be sensitive. In these cases, assurances of confidentiality may reduce measurement errors that result from intentional misreporting. Biemer et al. (1991) provides a comprehensive review of measurement error in surveys.

Frame Error

Frame error arises in the process of constructing, maintaining, and using the sampling frame(s) for selecting the survey sample. The sampling frame is defined as a list of population members or some other mechanism used for drawing the sample. Ideally, the frame would contain every member of the population with no duplicates. Also, units that are not part of the population would not be on the frame. Likewise, information on the frame that is used in the sample selection process should be accurate and up to date. Unfortunately, sampling frames rarely satisfy these ideals, often resulting in various types of frame errors.

There are essentially three types of sampling frames: area frames, list frames, and implicit frames. Area frames are typically used for agricultural and household surveys. An area frame is constructed by first dividing an area to be sampled (say, a state) into smaller areas (such as counties, census tracts, or blocks). A random sample of these smaller areas is drawn and a counting and listing operation is implemented in the selected areas to enumerate all the ultimate sampling units. For household surveys, the counting and listing operation is intended to identify and list every dwelling unit in the sampled smaller areas. Following the listing process, dwelling units may be sampled according to any appropriate randomization scheme. The process is similar for agricultural surveys, except rather than a dwelling unit, the ultimate sampling unit may be a farm or land parcel.

The omission of eligible population units from the frame (referred to as noncoverage error) can be a problem with area samples, primarily as a result of errors made during the counting–listing phase. Enumerators in the field may miss some dwelling units that are hidden from view or are mistaken as part of other dwelling units (e.g., garages that have been converted to apartments). Boundary units may be erroneously excluded or included because of inaccurate maps or enumerator error. Boundary units can also be a source of duplication error if they are included for areal units on both sides of the boundary.

More recent research has considered the use of list frames for selecting household samples [see, e.g., O’Muircheartaigh et al. (2007), Dohrmann et al. (2007), and Iannacchione et al. (2007)]. One such list is the US Postal Service delivery sequence file (DSF). This frame contains all the delivery point addresses serviced by the US Postal Service. Because sampling proceeds directly from this list, a counting–listing operation is not needed, saving considerable cost. Noncoverage error may be an important issue in the use of the DSF, particularly in rural areas [see, e.g., Iannacchione et al. (2003)]. Methods for reducing the noncoverage errors, such as the half-open interval method [see, e.g., Groves et al. (2009)], have met with varying success (O’Muircheartaigh et al. 2007). List frames are also commonly used for sampling special populations such as teachers, physicians, and other professionals. Establishment surveys make extensive use of list frames drawn from establishment lists purchased from commercial vendors.

A sampling frame may not be a physical list, but rather an implicit list as in the case of random-digit dialing (RDD) sampling. For RDD sampling, the frame is implied by the mechanism generating the random numbers. Frame construction may begin by first identifying all telephone exchanges (e.g., in the United States and Canada, the area code plus the 3-digit prefix) that contain at least one residential number. The implied frame is then all 10-digit telephone numbers that can be formed using these exchanges, although the numbers in the sample are the only telephone numbers actually generated and eventually dialed. Intercept sampling may also use an implicit sampling frame. In intercept sampling, a systematic sample of units is selected as they are encountered during the interviewing process; examples where an explicit list of population units is not available include persons in a shopping mall or visitors to a website.
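To make the idea of an implicit frame concrete, the short sketch below generates US-style telephone numbers from a handful of eligible exchanges. The exchange list, seed, and sample size are invented for illustration, and real RDD designs add refinements (such as list-assisted screening) that are not shown here.

```python
import random

random.seed(2010)

# Hypothetical eligible exchanges: (area code, 3-digit prefix) pairs assumed to
# contain at least one residential number (invented for illustration).
exchanges = [("919", "541"), ("919", "832"), ("984", "220")]

def draw_rdd_number(exchanges):
    """Draw one number from the implicit RDD frame: a randomly chosen
    eligible exchange followed by a random 4-digit suffix."""
    area, prefix = random.choice(exchanges)
    suffix = f"{random.randrange(10000):04d}"
    return f"({area}) {prefix}-{suffix}"

# A sample of 5 numbers; only these are ever generated and dialed.
sample = [draw_rdd_number(exchanges) for _ in range(5)]
print(sample)
```

The point is that only the drawn numbers are ever materialized; the frame of all 10-digit numbers implied by the eligible exchanges is never listed.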

To ensure that samples represent the entire population, every person, farm operator, household, establishment, or other element in the population should be listed on the frame. Ineligible units should be identified and removed from the sample as they are selected. Further, to weight the responses using the appropriate probabilities of selection, the number of times that each element is listed on the frame should also be known, at least for the sampled units. To the extent that these requirements fail, frame errors occur.

Errors can occur when a frame is constructed. Population elements may be omitted or duplicated an unknown number of times. There may be elements on the frame that should not be included (e.g., in a farm survey, businesses that are not farms). Erroneous omissions often occur when the cost of creating a complete frame is too high. We may be well aware that the sampling frame for the survey is missing some units but the cost of completing the frame is quite high. If the number of missing population members is small, then it may not be worth the cost to provide a complete frame. Duplications on a frame are a common problem when the frame combines a number of lists. For the same reason, erroneous inclusions on the frame usually occur because the available information about each frame member is not adequate to determine which units are members of the population and which are not. Given these frame imperfections, the population represented by the frame does not always coincide with the population of interest in the survey. The former population is referred to as the frame population and the latter as the target population.

Nonresponse Error

Nonresponse error is a fairly general source of error encompassing both unit and item nonresponse. Unit nonresponse occurs when a sampled unit (e.g., a household, farm, school or establishment) does not respond to any part of a questionnaire, such as a household that refuses to participate in a face-to-face survey, a mail survey questionnaire that is never returned, or an eligible sample member who refuses or whose telephone is never answered. Item nonresponse error occurs when the questionnaire is only partially completed because an interview was prematurely terminated or some items that should have been answered were skipped or left blank. For example, income questions are typically subject to a high level of item nonresponse from respondent refusals.

For open-ended questions, even when a response is provided, nonresponse may occur if the response is unusable or inadequate. As an example, a common open-ended question in socioeconomic surveys is “What is your occupation?” A respondent may provide some information about his or her occupation, but perhaps not enough to allow an occupation and industry coder to assign an occupation code number during the data-processing stage.

Data-Processing Error

The final source of nonsampling error is data processing. Data-processing error includes errors in the editing, data entry, coding, weighting, and tabulation of the survey data. As an example of editing error, suppose that a data editor is instructed to call the respondent back to verify the value of some budget line item whenever the value of the item exceeds a specified limit. In some cases, the editor may fail to apply this rule correctly, thus generating errors in the data.