Cover Page

Contents

Tables

Boxes

Figures

Getting Files from the Wiley ftp and Internet Sites

LIST OF DATA SETS PROVIDED ON WEB SITE

Preface to the Fourth Edition

PART 1: Basic Concepts

CHAPTER 1: Uses of Sample Surveys

1.1 WHY SAMPLE SURVEYS ARE USED

1.2 DESIGNING SAMPLE SURVEYS

1.3 PRELIMINARY PLANNING OF A SAMPLE SURVEY

EXERCISES

BIBLIOGRAPHY

CHAPTER 2: The Population and the Sample

2.1 THE POPULATION

2.2 THE SAMPLE

2.3 SAMPLING DISTRIBUTIONS

2.4 CHARACTERISTICS OF ESTIMATES OF POPULATION PARAMETERS

2.5 CRITERIA FOR A GOOD SAMPLE DESIGN

2.6 SUMMARY

EXERCISES

BIBLIOGRAPHY

PART 2: Major Sampling Designs and Estimation Procedures

CHAPTER 3: Simple Random Sampling

3.1 WHAT IS A SIMPLE RANDOM SAMPLE?

3.2 ESTIMATION OF POPULATION CHARACTERISTICS UNDER SIMPLE RANDOM SAMPLING

3.3 SAMPLING DISTRIBUTIONS OF ESTIMATED POPULATION CHARACTERISTICS

3.4 COEFFICIENTS OF VARIATION OF ESTIMATED POPULATION PARAMETERS

3.5 RELIABILITY OF ESTIMATES

3.6 ESTIMATION OF PARAMETERS FOR SUBDOMAINS

3.7 HOW LARGE A SAMPLE DO WE NEED?

3.8 WHY SIMPLE RANDOM SAMPLING IS RARELY USED

3.9 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 4: Systematic Sampling

4.1 HOW TO TAKE A SYSTEMATIC SAMPLE

4.2 ESTIMATION OF POPULATION CHARACTERISTICS

4.3 SAMPLING DISTRIBUTION OF ESTIMATES

4.4 VARIANCE OF ESTIMATES

4.5 A MODIFICATION THAT ALWAYS YIELDS UNBIASED ESTIMATES

4.6 ESTIMATION OF VARIANCES

4.7 REPEATED SYSTEMATIC SAMPLING

4.8 HOW LARGE A SAMPLE DO WE NEED?

4.9 USING FRAMES THAT ARE NOT LISTS

4.10 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 5: Stratification and Stratified Random Sampling

5.1 WHAT IS A STRATIFIED RANDOM SAMPLE?

5.2 HOW TO TAKE A STRATIFIED RANDOM SAMPLE

5.3 WHY STRATIFIED SAMPLING?

5.4 POPULATION PARAMETERS FOR STRATA

5.5 SAMPLE STATISTICS FOR STRATA

5.6 ESTIMATION OF POPULATION PARAMETERS FROM STRATIFIED RANDOM SAMPLING

5.7 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 6: Stratified Random Sampling: Further Issues

6.1 ESTIMATION OF POPULATION PARAMETERS

6.2 SAMPLING DISTRIBUTIONS OF ESTIMATES

6.3 ESTIMATION OF STANDARD ERRORS

6.4 ESTIMATION OF CHARACTERISTICS OF SUBGROUPS

6.5 ALLOCATION OF SAMPLE TO STRATA

6.6 STRATIFICATION AFTER SAMPLING

6.7 HOW LARGE A SAMPLE IS NEEDED?

6.8 CONSTRUCTION OF STRATUM BOUNDARIES AND DESIRED NUMBER OF STRATA

6.9 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 7: Ratio Estimation

7.1 RATIO ESTIMATION UNDER SIMPLE RANDOM SAMPLING

7.2 ESTIMATION OF RATIOS FOR SUBDOMAINS UNDER SIMPLE RANDOM SAMPLING

7.3 POSTSTRATIFIED RATIO ESTIMATES UNDER SIMPLE RANDOM SAMPLING

7.4 RATIO ESTIMATION OF TOTALS UNDER SIMPLE RANDOM SAMPLING

7.5 COMPARISON OF RATIO ESTIMATE WITH SIMPLE INFLATION ESTIMATE

7.6 APPROXIMATION TO THE STANDARD ERROR OF THE RATIO ESTIMATED TOTAL

7.7 DETERMINATION OF SAMPLE SIZE

7.8 REGRESSION ESTIMATION OF TOTALS

7.9 RATIO ESTIMATION IN STRATIFIED RANDOM SAMPLING

7.10 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 8: Cluster Sampling: Introduction and Overview

8.1 WHAT IS CLUSTER SAMPLING?

8.2 WHY IS CLUSTER SAMPLING WIDELY USED?

8.3 A DISADVANTAGE OF CLUSTER SAMPLING: HIGH STANDARD ERRORS

8.4 HOW CLUSTER SAMPLING IS TREATED IN THIS BOOK

8.5 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 9: Simple One-Stage Cluster Sampling

9.1 HOW TO TAKE A SIMPLE ONE-STAGE CLUSTER SAMPLE

9.2 ESTIMATION OF POPULATION CHARACTERISTICS

9.3 SAMPLING DISTRIBUTIONS OF ESTIMATES

9.4 HOW LARGE A SAMPLE IS NEEDED?

9.5 RELIABILITY OF ESTIMATES AND COSTS INVOLVED

9.6 CHOOSING A SAMPLING DESIGN BASED ON COST AND RELIABILITY

9.7 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 10: Two-Stage Cluster Sampling: Clusters Sampled with Equal Probability

10.1 SITUATION IN WHICH ALL CLUSTERS HAVE THE SAME NUMBER Ni OF ENUMERATION UNITS

10.2 SITUATION IN WHICH NOT ALL CLUSTERS HAVE THE SAME NUMBER Ni OF ENUMERATION UNITS

10.3 SYSTEMATIC SAMPLING AS CLUSTER SAMPLING

10.4 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 11: Cluster Sampling in Which Clusters Are Sampled with Unequal Probability: Probability Proportional to Size Sampling

11.1 MOTIVATION FOR NOT SAMPLING CLUSTERS WITH EQUAL PROBABILITY

11.2 TWO GENERAL CLASSES OF ESTIMATORS VALID FOR SAMPLE DESIGNS IN WHICH UNITS ARE SELECTED WITH UNEQUAL PROBABILITY

11.3 PROBABILITY PROPORTIONAL TO SIZE SAMPLING

11.4 FURTHER COMMENT ON PPS SAMPLING

11.5 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 12: Variance Estimation in Complex Sample Surveys

12.1 LINEARIZATION

12.2 REPLICATION METHODS

12.3 SUMMARY

EXERCISES

TECHNICAL APPENDIX

BIBLIOGRAPHY

PART 3: Selected Topics in Sample Survey Methodology

CHAPTER 13: Nonresponse and Missing Data in Sample Surveys

13.1 EFFECT OF NONRESPONSE ON ACCURACY OF ESTIMATES

13.2 METHODS OF INCREASING THE RESPONSE RATE IN SAMPLE SURVEYS

13.3 MAIL SURVEYS COMBINED WITH INTERVIEWS OF NONRESPONDENTS

13.4 OTHER USES OF DOUBLE (or TWO-PHASE) SAMPLING METHODOLOGY

13.5 ITEM NONRESPONSE: METHODS OF IMPUTATION

13.6 MULTIPLE IMPUTATION

13.7 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 14: Selected Topics in Sample Design and Estimation Methodology

14.1 WORLD HEALTH ORGANIZATION EPI SURVEYS: A MODIFICATION OF PPS SAMPLING FOR USE IN DEVELOPING COUNTRIES

14.2 QUALITY ASSURANCE SAMPLING

14.3 SAMPLE SIZES FOR LONGITUDINAL STUDIES

14.4 ESTIMATION OF PREVALENCE OF DISEASES FROM SCREENING STUDIES

14.5 ESTIMATION OF RARE EVENTS: NETWORK SAMPLING

14.6 ESTIMATION OF RARE EVENTS: DUAL SAMPLES

14.7 ESTIMATION OF CHARACTERISTICS FOR LOCAL AREAS: SYNTHETIC ESTIMATION

14.8 EXTRACTION OF SENSITIVE INFORMATION: RANDOMIZED RESPONSE TECHNIQUES

14.9 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 15: Telephone Survey Sampling

15.1 INTRODUCTION

15.2 HISTORY OF TELEPHONE SAMPLING IN THE UNITED STATES

15.3 WITHIN-HOUSEHOLD SELECTION TECHNIQUES

15.4 STEPS IN THE TELEPHONE SURVEY PROCESS

15.5 DRAWING AND MANAGING A TELEPHONE SURVEY SAMPLE

15.6 POST-SURVEY DATA ENHANCEMENT PROCEDURES

15.7 IMPUTATION OF MISSING DATA

15.8 DECLINING COVERAGE AND RESPONSE RATES

15.9 ADDRESSING THE PROBLEMS WITH CELL PHONES

15.10 ADDRESS-BASED SAMPLING

EXERCISES

BIBLIOGRAPHY

CHAPTER 16: Constructing the Survey Weights

16.1 INTRODUCTION

16.2 OBJECTIVES OF WEIGHTING

16.3 CONSTRUCTING THE SAMPLE WEIGHTS

16.4 ESTIMATION AND ANALYSIS ISSUES

16.5 SUMMARY

BIBLIOGRAPHY

CHAPTER 17: Strategies for Design-Based Analysis of Sample Survey Data

17.1 STEPS REQUIRED FOR PERFORMING A DESIGN-BASED ANALYSIS

17.2 ANALYSIS ISSUES FOR “TYPICAL” SAMPLE SURVEYS

17.3 SUMMARY

TECHNICAL APPENDIX

BIBLIOGRAPHY

Appendix

Answers to Selected Exercises

Index

WILEY SERIES IN SURVEY METHODOLOGY

Established in Part by WALTER A. SHEWHART AND SAMUEL S. WILKS

Editors: Robert M. Groves, Graham Kalton, J. N. K. Rao, Norbert Schwarz, Christopher Skinner

A complete list of the titles in this series appears at the end of this volume.

To our wives, Virginia and Elaine,
and our sons, daughters, and grandchildren

Tables

Table

2.1 Number of Household Visits Made During a Specified Year
2.2 Sample Data for Number of Household Visits
2.3 Data for Number of Students Not Immunized for Measles Among Six Schools in a Community
2.4 Possible Samples and Values of x′
2.5 Sampling Distribution for Data of Table 2.4
2.6 Sampling Procedure for Population of Six Schools
2.7 Possible Samples and Values of x′
2.8 Sampling Distribution of x′
2.9 Data for the Burn Area Estimates
3.1 Possible Samples of Three Schools and Values of x′
3.2 Sampling Distribution of the x′ in Table 3.1
3.3 Values of the fpc (N = 10,000)
3.4 Sample Data for Number of Household Visits
3.5 Race and Out-of-Pocket Medical Expenses for Six Families (1987)
3.6 Data for a Subdomain Based on the Families in Table 3.5
3.7 Sampling Distribution of z/y
3.8 Cumulative Exposure to Pulmonary Stressors and Forced Vital Capacity Among Workers in a Sample Taken at a Plant Employing 1200 Workers
4.1 Systematic Sample of One in Six Physicians (from Table 2.1)
4.2 Five Possible Samples of One in Five Physicians (from Table 2.1)
4.3 Comparison of Means and Standard Errors for Simple Random Sampling and for Systematic Sampling
4.4 Six Possible Samples of One in Six Physicians (Table 2.1)
4.5 Possible Samples of 1 in k Elements (N/k is an Integer)
4.6 Cluster Samples Based on Data of Table 3.8
4.7 Data for Nurse Practitioner’s Visits (Unordered List)
4.8 Four Possible Samples for Data of Table 4.7
4.9 Data for Nurse Practitioner’s Visits (Monotonically Ordered List)
4.10 Four Possible Samples for Data of Table 4.9
4.11 Data for Nurse Practitioner’s Visits (Periodicity in List)
4.12 Four Possible Samples for Data of Table 4.11
4.13 Summary of Results for Four Types of Sampling
4.14 Distribution of Remainders and Samples for 25 Possible Random Numbers
4.15 Sampling Distribution of Estimated Means from Data of Tables 4.14 and 2.1
4.16 Systematic One-in-Five Sample Taken from Table 2.1
4.17 Days Lost from Work Because of Acute Illness in One Year Among 162 Employees in a Plant
4.18 Data for Six Systematic Samples Taken from Table 4.17
4.19 Six Samples Taken from Table 4.17
5.1 Truck Miles and Number of Accidents Involving Trucks by Type of Road Segment
5.2 Sampling Distribution of x′ for 56 Possible Samples of Three Segments
5.3 Two Strata for Data of Table 5.1
5.4 Sampling Distribution of image for 30 Possible Samples of Three Segments
5.5 Comparison of Results for Simple Random Sampling and Stratification
5.6 Strata for a Population of 14 Families
6.1 Retail Prices of 10 Capsules of a Tranquilizer in Pharmacies in Two Communities (Strata)
6.2 Possible Samples for the Stratified Random Sample
6.3 Sampling Distribution for image
6.4 General Hospitals in Illinois by Geographical Stratum, 2005
6.5 Strata for Number of Hospital Beds by County Among Counties in Illinois (Excluding Cook County) Having General Hospitals
6.6 Number of Pairs Available in Each of 18 Sex-Race-Sequence Difference Quantiles
6.7 Total Pairs Available and Total Pairs Sampled
6.8 Sample Data of Veterinarian’s Survey
6.9 Distribution of Hospital Episodes per Person per Year
6.10 Frequency Distribution of Total Amount Charged During 1996 to Medicaid for 2387 Patients Treated by a Large Medical Group
6.11 Optimal Allocation Based on Use of the rootfreq Method for Construction of Strata
6.12 Results of Three Methods of Strata Construction Combined with Optimal Allocation from Data on 2387 Patients Shown in Table 6.10
7.1 Pharmaceutical Expenses and Total Medical Expenses Among All Residents of Eight Community Areas
7.2 Possible Samples of Two Elements from the Population of Eight Elements (Table 7.1)
7.3 Samples of Seven Elements from the Population of Table 7.1
7.4 Population (2000 Census) and Current School Enrollment by Census Tract
7.5 Possible Samples of Two Schools Taken from Table 7.4
7.6 Values of x′ and x″ for Samples in Table 7.5
7.7 Sample Data from Table 5.1
8.1 Some Practical Examples of Clusters
8.2 Comparison of Costs for Two Sampling Designs
9.1 Number of Persons over 65 Years of Age and Number over 65 Years Needing Services of Visiting Nurse for Five Housing Developments
9.2 Summary Data for the Two Clusters Selected in the Sample
9.3 Sampling Distributions of Three Estimates
9.4 Means and Standard Errors of Estimates
9.5 Number of Eligible Persons and Number Receiving Substance Abuse Treatment Among Individuals Sentenced to Probation in 26 District Courts
10.1 Number of Patients Seen by Nurse Practitioners and Number Referred to a Physician for Five Community Health Centers
10.2 Summary Data for the Three Clusters Selected in the Sample
10.3 Worksheet for Calculations Involving Cluster Totals
10.4 Worksheet for Calculations Involving Listing Units
10.5 Designs Satisfying Specifications on Total
10.6 Designs Satisfying Specifications on Ratio
10.7 Amount of Money Billed to Medicare by Day and Week
10.8 Designs Satisfying Specifications on Total
10.9 Total Admissions with Life-Threatening Conditions and Total Admissions Discharged Dead from Ten Hospitals, 2007
10.10 Summary Data for a Sample of Three Hospitals Selected from the Ten Hospitals in Table 10.9
10.11 Sampling Distribution of image, and rclu
10.12 Frequency Distribution of Estimated Total image Over All Possible Samples of Two Hospitals
11.1 Number of Outpatient Surgical Procedures Performed in 1997 in Three Community Hospitals
11.2 Number of Women Over 90 Years of Age Admitted to Nursing Homes in a Community During 1997
11.3 Distribution of the Hansen-Hurwitz Estimator image Over All Possible Samples of Two Nursing Homes Drawn with Replacement
11.4 Number of Women Over 90 Years of Age Admitted to Nursing Homes in a Community During 1997
11.5 Distribution of the Horvitz-Thompson Estimator image Over All Possible Samples of Two Nursing Homes Drawn with Replacement
11.6 Results of the PPS Sample
11.7 Data File for Use in Illustrative Example
11.8 Procedure for PPS Sampling with Replacement
12.1 Medicaid Payments and Overpayments for 10 Sample Claims Submitted on Patient
13.1 Data for the Survey of Physicians
13.2 Data that Would be Obtained from the Mail Survey
13.3 Actual Values for Missing Data in Table 13.2
14.1 Distribution of image
16.1 NSCAW Weights for Two Age Domains
16.2 NSCAW Response Propensities for Age by Gender Weighting Classes
16.3 NSCAW Nonresponse WCA Factors for Age by Gender Weighting Classes
16.4 NSCAW Poststratification Adjustment Factors for State Group by Substantiation Poststrata
16.5 Final Weight Adjustment Factors for NSCAW Sample
17.1 Number of Nursing Homes in Three Regions in a State
17.2 Resulting Data from Sample of Nursing Homes
17.3 Selection of Sample Subjects from the Departments of Gironde and Dordogne
17.4 Association of Wine Consumption vs. Incident Dementia (Model-Based Analysis)
17.5 Logistic Regression Analysis of Wine Consumption and Incident Dementia Assuming Simple Random Sampling (Model-Based Analysis)
17.6 Logistic Regression Analysis of Wine Consumption and Incident Dementia Incorporating Sample Survey Parameters (Design-Based Analysis)
A.1 Random Number Table
A.2 Selected Percentiles of Standard Normal Distribution

Boxes

Box

2.1 Population Parameters
2.2 Sample Statistics
2.3 Mean and Variance of Sampling Distribution When Each Sample Has the Same Probability (1/T) of Selection
2.4 Mean and Variance of Sampling Distribution When Each Sample Does Not Have the Same Probability of Selection
3.1 Estimated Totals, Means, Proportions, and Variances Under Simple Random Sampling, and Estimated Variances and Standard Errors of These Estimates
3.2 Population Estimates, and Means and Standard Errors of Population Estimates Under Simple Random Sampling
3.3 Coefficients of Variation of Population Estimates Under Simple Random Sampling
3.4 Estimated Variances and 100(1−α)% Confidence Intervals Under Simple Random Sampling
3.5 Exact and Approximate Sample Sizes Required Under Simple Random Sampling
4.1 Estimated Totals, Means, and Variances Under Systematic Sampling, and Estimated Variances, and Standard Errors of These Estimates
4.2 Variances of Population Estimates Under Systematic Sampling
4.3 Estimation Procedures for Population Means Under Repeated Systematic Sampling
5.1 Population and Strata Parameters for Stratified Sampling
5.2 Estimates of Population Parameters and Standard Errors of These Estimates for Stratified Sampling
6.1 Means and Standard Errors of Population Estimates Under Stratified Random Sampling
6.2 Estimated Standard Errors Under Stratified Random Sampling
6.3 Estimates of Population Parameters Under Stratified Random Sampling with Proportional Allocation
7.1 Formulas for Ratio Estimation Under Simple Random Sampling
9.1 Notation Used in Simple One-Stage Cluster Sampling
9.2 Estimated Population Characteristics and Estimated Standard Errors for Simple One-Stage Cluster Sampling
9.3 Theoretical Standard Errors for Estimates Under Simple One-Stage Cluster Sampling
9.4 Exact and Approximate Sample Sizes Required Under Simple One-Stage Cluster Sampling
10.1 Notation Used in Simple Two-Stage Cluster Sampling
10.2 Estimated Population Characteristics and Estimated Standard Errors for Simple Two-Stage Cluster Sampling
10.3 Standard Errors for Population Estimates Under Simple Two-Stage Cluster Sampling
10.4 Estimates of Population Characteristics Under Simple Two-Stage Cluster Sampling, Unequal Numbers of Listing Units
10.5 Theoretical Standard Errors for Population Estimates for Simple Two-Stage Cluster Sampling, Unequal Numbers of Listing Units

Figures

Figure

2.1 Relative frequency distribution of the sampling distribution of x
2.2 Relationship among bias, variability, and MSE for data of Table 2.9
3.1 Areas under normal curve within ±1, ±1.96 and ±3 standard errors of the mean
9.1 Form for collecting the data from sample households in housing developments
9.2 Cost: simple random sample and single-stage cluster sampling
14.1 Network sample for health-care providers and skin cancer patients
16.1 The correspondence among the target, frame, and respondent populations and the sample
16.2 Distribution of NSCAW final weights

Getting Files from the Wiley ftp and Internet Sites

To download the files listed in this book and other material associated with it, use an ftp program or a Web browser.

FTP ACCESS

If you are using an ftp program, type the following at your ftp prompt or URL prompt:

ftp://ftp.wiley.com

Some programs may provide the first ftp for you, in which case you can type:

ftp.wiley.com

If log-in parameters are required, log in as anonymous (e.g., User ID: anonymous). Leave the password blank. After you have connected to the Wiley ftp site, navigate through the directory path of:

/public/sci_tech_med/populations

Also, a direct link to the related FTP site is available on the book’s Wiley.com webpage.
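For readers who prefer a script to an interactive ftp prompt, the steps above can be sketched in Python with the standard-library `ftplib` module. This is only an illustrative sketch: the host and directory are the ones quoted above, the function names `list_remote_files` and `fetch_listing` are our own, and nothing here confirms that the server is still reachable.

```python
from ftplib import FTP

HOST = "ftp.wiley.com"                           # host quoted in the text above
REMOTE_DIR = "/public/sci_tech_med/populations"  # directory path quoted above


def list_remote_files(ftp, remote_dir=REMOTE_DIR):
    """Change to remote_dir on an already-connected session and list its entries."""
    ftp.cwd(remote_dir)
    return ftp.nlst()


def fetch_listing(host=HOST, remote_dir=REMOTE_DIR):
    """Connect, log in anonymously with a blank password, and list the directory."""
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd="")
        return list_remote_files(ftp, remote_dir)


# Example (requires network access and a server that still exists):
#     for name in fetch_listing():
#         print(name)
```

Keeping the directory listing separate from the connection step means the listing logic can be exercised against any object that offers `cwd` and `nlst`, without touching the network.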

FILE ORGANIZATION

Under the populations directory are subdirectories that include MATLAB files for PC, Macintosh, and UNIX systems and Microsoft® Excel files. Important information is included in the README files.

LIST OF DATA SETS PROVIDED ON WEB SITE

Data Set

momsag.dta

workendta

wloss2.ssd

jacktwin.ssd

jacktwin2.dta

dogscats.ssd

tab7ptl.ssd

tab7ptl.dta

bhratio.dat

tab9_la.dta

tab9_lc.ssd

il10ptl.ssd

il10pt2.dta

il10pt2.ssd

hospslct.ssd

exmp12_2.ssd

exmp12_2.dta

amblnce2.ssd

Preface to the Fourth Edition

This fourth edition of Sampling of Populations: Methods and Applications comes nearly ten years after the 1999 publication of the third edition. Unlike the third edition, it did not involve a major change in organization or emphasis. From our own experiences teaching this topic in graduate one-semester courses and in more intensive short courses held over a two-to-three-day period, as well as from positive feedback we received from students or professionals taking courses using this book or else using it for self-learning or reference, we feel that the organization used in the third edition works well; therefore we have kept it basically intact.

We did feel, however, that the book would benefit greatly from a moderate updating and refreshing along the following lines:

1. In the third edition, we introduced the use of SUDAAN and Stata, in the versions available in the mid-1990s, for the statistical analysis of survey data. Although the syntax used in SUDAAN has not changed much, the syntax used for the Stata survey analysis commands has changed considerably since then, and the capabilities of both of these software packages for conducting analysis of data from complex sample surveys have increased nearly exponentially. In this fourth edition, we have updated the syntax as needed for all of the illustrative examples that we have included.
2. We have updated the illustrative examples to reflect the fact that we are now in the twenty-first century. For example, in the well-known “dogs and cats” illustrative example, we estimated that the average dog owner pays approximately $25.00 a year in veterinary medical costs per dog and that a cat owner pays approximately $10.75 per cat. As pet owners (Stan has dogs; Paul has cats), we pay well over $1,000.00 a year to maintain our animals in good health, which is why insurance companies now offer medical insurance for pets. So as to keep our book from appearing “quaint,” we took a good look at the examples and updated the dates and the prices when necessary.
3. In the third edition, we made available on the Wiley FTP site several of the data files used in illustrative examples and exercises. In this edition, we are now making available most of the data files discussed in the text.
4. During the course of our teaching from the third edition and from our more recent experiences in the design of complex sample surveys and the analysis of data from such surveys, we began to recognize that we paid too little attention to the process of preparing data for design-based statistical analysis. This process involves several stages of weighting: construction of base weights that reflect inclusion probabilities, adjustments for nonresponse and for deficiencies in the sampling frame, and weight trimming. The importance of these procedures cannot be emphasized too much, since they can result in large improvements in the validity and reliability of the resulting estimates.
5. To strengthen this area, we invited Drs. Paul Biemer and Sharon Christ to write a chapter entitled “Constructing the Survey Weights,” which covers the above material very thoroughly (Chapter 16). Dr. Biemer is a Distinguished Fellow in Statistics at RTI International and Professor of Sociology at the University of North Carolina at Chapel Hill. He has authored numerous publications in survey methodology, including the seminal book, Introduction to Survey Quality (co-authored with Lars Lyberg). Dr. Christ is a recent PhD graduate in biostatistics who has worked with Dr. Biemer on several survey-related projects.
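The weighting stages named in item 4 can be made concrete with a small sketch. It is illustrative only and is not the procedure of Chapter 16: the function names, the weighting-class nonresponse adjustment, the simple cap used for trimming, and the poststratification step (standing in for the frame-deficiency adjustment) are all our own simplifying assumptions, and the numbers are invented.

```python
def base_weights(inclusion_probs):
    """Base weight: the reciprocal of each unit's inclusion probability."""
    return [1.0 / p for p in inclusion_probs]


def nonresponse_adjust(weights, responded, classes):
    """Within each weighting class, move nonrespondents' weight onto respondents."""
    total, resp_total = {}, {}
    for w, r, c in zip(weights, responded, classes):
        total[c] = total.get(c, 0.0) + w
        if r:
            resp_total[c] = resp_total.get(c, 0.0) + w
    # Assumes every class contains at least one respondent.
    return [w * total[c] / resp_total[c] if r else 0.0
            for w, r, c in zip(weights, responded, classes)]


def trim(weights, cap):
    """Simplest form of trimming: cap each weight (no redistribution)."""
    return [min(w, cap) for w in weights]


def poststratify(weights, strata, control_totals):
    """Scale weights so each poststratum sums to its known control total."""
    sums = {}
    for w, s in zip(weights, strata):
        sums[s] = sums.get(s, 0.0) + w
    # Assumes every poststratum has positive total weight.
    return [w * control_totals[s] / sums[s] for w, s in zip(weights, strata)]


# Invented data for six sampled units.
probs = [0.1, 0.1, 0.2, 0.2, 0.5, 0.5]
responded = [True, False, True, True, True, False]
classes = ["a", "a", "a", "b", "b", "b"]
strata = ["x", "x", "x", "y", "y", "y"]
controls = {"x": 30.0, "y": 10.0}

w = base_weights(probs)
w = nonresponse_adjust(w, responded, classes)
w = trim(w, cap=12.0)
w = poststratify(w, strata, controls)
print(w)
```

The nonresponse step preserves the total weight within each class, and the final step forces the weights in each poststratum to sum to its control total; the order of the stages here is one common choice rather than a rule.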

During the years between the third and fourth editions, telephone surveys have undergone very rapid changes owing to such phenomena as declining response rates, the institution of the “Do Not Call List,” the increased penetration of cell phones in households, and the greatly increased number of “cell phone only” households. In addition, list-assisted random digit dialing replaced the Mitofsky–Waksberg method for sampling telephone households. To capture these changes, we invited Drs. Michael W. Link and Mansour Fahimi to write a chapter on telephone surveys (Chapter 15). Both Drs. Link and Fahimi are sample survey methodologists who have worked and published intensively in the area of telephone sample surveys. Both are former colleagues of Drs. Levy and Biemer at RTI International. Dr. Link is currently Vice President for Methodological Research and Chief Methodologist at Nielsen, and Dr. Fahimi is presently Vice President for Statistical Research Services at the Marketing Systems Group, which is a major provider of expertise and services for sample surveys through their GENESYS Sampling division.

In summary, we feel that the enhancements mentioned above will provide the reader with a more enjoyable and beneficial experience. We also include material from the Preface to the third edition that we feel might be helpful to readers of this edition.

This fourth edition is likely to be the final edition of Sampling of Populations: Methods and Applications (at least written by us). Therefore, we would like to express our utmost appreciation to our colleagues, students, managers, and staff at all of the institutions with which we were fortunate enough to be associated in the 30+ years since we conceived this book (University of Massachusetts for both Paul and Stan; University of Illinois at Chicago, University of North Carolina at Chapel Hill, and RTI International for Paul; and The Ohio State University for Stan). Stan is particularly appreciative of Ohio State Provost Barbara Snyder’s agreeing to allow him to spend three months away from Columbus on a special research assignment (SRA). This provided him with the time he needed to work on this book. Thanks also to Annick Alpérovitch, Carole Dufouil, and Christophe Tzourio at INSERM Unit 708 in Paris, France for providing an office for Stan and an environment conducive to working on this book during the SRA. We would like to thank our current and former editors: Alex Kugushev at Wadsworth; Beatrice Shube, Helen Ramsey, and Steve Quigley at Wiley. We would also like to thank Nidhi Kochar, Charisse Darrell-Fields, Michael Sabbatino and Tracy McHone for their assistance with various aspects of this book’s preparation. We’re especially appreciative of the expertise Melanie Cole brought to formatting, organizing, and preparing the camera-ready manuscript. Parts of the material in the fourth edition were recently used in Paul’s course at the University of North Carolina and we are grateful to the students in that course for their helpful suggestions and detection of errors. Vince Iannacchione, Paul’s colleague and co-instructor in that course, pointed out several inconsistencies and made numerous helpful comments. Amanda Lewis-Evans designed the cover art. Most of all, we are grateful to our wives, Virginia and Elaine, for putting up with us. It’s been a great ride.

Paul S. Levy
Stanley Lemeshow

Research Triangle Park, North Carolina
Columbus, Ohio
April 2008

Material from the Preface to the Third Edition

The original edition of Sampling of Populations: Methods and Applications was published in 1980 by Lifetime Learning Publications (a Division of Wadsworth, Inc.) under the title Sampling for Health Professionals. Like other Lifetime Learning Books, its primary intended audience was the working professional; in this instance, the practicing statistician. With this as the target audience, the authors felt that such a book on sampling should have the following features:

Sampling for Health Professionals was well received both by reviewers and readers, and had a steady following throughout its existence (1980-91). In addition to having a strong following among practicing statisticians (its intended primary audience), it had been adopted as a primary text or recommended as additional reading by an unexpectedly large number of instructors of sampling courses in various academic units.

Based on the success of this first version of the book, we developed a greatly revised and expanded version that was published by John Wiley & Sons in 1991 under the title Sampling of Populations: Methods and Applications. Our purpose in that revision was to improve the suitability of the book for use as a text in applied sampling courses without compromising its readability or its suitability for the continuation and self-learning markets. The resulting first Wiley edition was a considerably updated and expanded version of the original work, but in much the same style and tone.

Although less than a decade has elapsed since the appearance of the first Wiley edition, there have been major developments both in the design and administration of sample surveys and in the analysis of the resulting data. In particular, refinements in telephone sampling and interviewing methodologies have now made telephone surveys generally more feasible and less expensive than face-to-face household interviewing. While the first edition contained some material on random digit dialing (RDD), it did not cover the various refinements of RDD or the use of list-assisted sampling methods and other innovations that are now widely used.

Also, “user friendly” computer software is much more readily available not only for obtaining standard errors of survey estimates, but also for performing statistical procedures such as contingency table analysis and multiple linear or logistic regression that take into consideration complexities in the sample design. In fact, at the writing of this new edition, modules for the analysis of complex survey data have begun to appear in major general statistical software packages (e.g., Stata). Such software is now widely used and has removed the necessity for many of the complicated formulas that appeared in Chapter 11 as well as elsewhere in the first Wiley edition. Likewise, the analysis methods that appear in Chapter 16 of that edition are a reflection of methods used historically when design-based analysis methodology was not in its present state of development and software was not readily available. We felt that the discussion in that chapter did not reflect adequately the present state of the art with respect to analysis strategies for survey data and we have totally revised it to reflect more closely current practice.

Both of us feel strongly that knowledge of telephone-sampling methodology and familiarity with computer-driven methods now widely used in the analysis of complex survey data should be part of any introduction to sampling methods, and, with this in mind, we have greatly enhanced and revised the material on survey data analysis, and attempted to introduce the use of appropriate software throughout the book in our discussion of the major sample designs and estimation procedures. In addition to adding material on the topics just mentioned, we have made other revisions based on our own experience with the book in sampling courses and on suggestions from students and colleagues. Some of these are listed below.

1. Exercises that seem to be ineffective have been modified or replaced. Some new exercises have been added.
2. The material on two-stage cluster sampling has been expanded and reorganized, with designs in which clusters are sampled with equal probability presented in Chapter 10 and designs in which clusters are sampled with unequal probability given in Chapter 11. The important class of probability proportional to size (PPS) sampling designs is now presented in Chapter 11 as a particular case of designs in which clusters are sampled with unequal probability. This appears to us a more natural way to present cluster sampling designs and has worked out very well in our classes.
3. In our presentation of many of the numerical illustrative examples, we include discussion of how the analysis would appear in one of the software packages that specializes in or includes analysis of survey data.
4. Chapters that deal with specific sampling designs now have a similar structure, with the following subsections:
5. We have greatly expanded the number of articles cited in the annotated reference list to include more recent articles of importance. In particular, one of us (PSL) was the Section Editor for Design of Experiments and Sample Surveys of the recently published Encyclopedia of Biostatistics and, as such, was responsible for the material on sample survey methods. In that capacity, he solicited over 50 expository or review articles covering important topics in sampling methodology and written by experts on the particular topics. These are useful references for readers learning sampling for the first time as well as for more advanced readers, and we have cited many of them.
6. In the prior edition, we did not include in electronic form data sets used in the illustrative examples and exercises. In this edition, we make such data available on the Internet.

It is our feeling that one of the strengths of the earlier edition is its focus on the basic principles and methods of sampling. To maintain this focus, we omit or treat very briefly several very interesting topics that have seen considerable development in the last decade. We feel that they are best covered in more specialized texts on sampling. As a result, we do not cover to any extent topics such as distance sampling, adaptive sampling, and superpopulation models that are of considerable importance, but have been treated very well in other volumes. We did, however, include several topics that were not in the previous edition and that we feel are important for a general understanding of sampling methodology. Examples of such topics included in this edition are construction of stratum boundaries and desired number of strata (Chapter 6); estimation of ratios for subdomains (Chapter 7); poststratified ratio estimates (Chapter 7); the Hansen-Hurwitz estimator and the Horvitz-Thompson estimator (Chapter 11).

From our experience with the first edition of Sampling of Populations: Methods and Applications, we feel that this book will be used by practicing statisticians as well as by students taking formal courses in sampling methodology. Both of us teach in schools of public health, and have used this book as the basic text for a one-semester course in sample-survey methodology. Our classes have included a mix of students concentrating in biostatistics, epidemiology, and other areas in the biomedical and social and behavioral sciences. In our experience, this book has been very suitable for this mix of students, and we feel that at least 80% of this material could be covered without difficulty in a single-semester course.

Several instructors have indicated that, in their courses on sampling theory, this book works well as a primary text in conjunction with a more theoretical text (e.g., W.G. Cochran, Sampling Techniques, 3rd ed., New York: Wiley, 1977), with the latter text used for purposes of providing additional theoretical background. Conversely, selected readings from our book have been used to provide sampling background to students in broader courses on survey research methodology (often taught in sociology departments).

The students and colleagues who gave us helpful comments and suggestions on our earlier text and on the present volume are too numerous to mention, and we are grateful to all of them. We would like to thank, in particular, Janelle Klar and Elizabeth Donohoe-Cook for carefully reading this manuscript and making valuable editorial and substantive suggestions. In addition, we would like to thank the two anonymous individuals who reviewed an earlier draft of this manuscript for the publisher. Although we did not agree with all of their suggestions, we did take into consideration in our subsequent revision many of their thoughtful and insightful comments. Most of all, however, we wish to recognize the pioneers of sampling methodology who have written the early textbooks in this field. In particular, the books by William Cochran, Morris Hansen, William Hurwitz and William Madow, Leslie Kish, and P.V. Sukhatme are statistical classics that are still widely studied by students, academics, and practitioners. Those of us who cut our teeth on these books and have made our careers in survey sampling owe them a great debt.

Paul S. Levy
Stanley Lemeshow

Chicago, Illinois
Amherst, Massachusetts
December 1998

PART 1

Basic Concepts

CHAPTER 1

Uses of Sample Surveys

1.1 WHY SAMPLE SURVEYS ARE USED

Information on characteristics of populations is constantly needed by politicians, marketing departments of companies, public officials responsible for planning health and social services, and others. For reasons relating to timeliness and cost, this information is often obtained by use of sample surveys. Such surveys are the subject of this book.

The following discussion provides an example of a sample survey conducted to obtain information about a health characteristic in a particular population. A health department in a large state is interested in determining the proportion of the state’s children of elementary school age who have been immunized against childhood infectious diseases (e.g., polio, diphtheria, tetanus, and pertussis). For administrative reasons, this task must be completed in only one month.

At first glance, this task would seem to be most formidable, involving the careful coordination of a large staff attempting to collect information, either from parents or from school immunization records, on each and every child of elementary school age residing in that state. Clearly, the budget necessary for such an undertaking would be enormous because of the time, travel expenses, and number of children involved. Even with a sizable staff, it would be difficult to complete such an undertaking in the specified time frame.

To handle problems such as the one outlined above, this text will present a variety of methods for selecting a subset (a sample) from the original set of all measurements (the population) of interest to the researchers. It is the members of the sample who will be interviewed, studied, or measured. For example, in the problem stated above, the net effect of such methods will be that valid and reliable estimates of the proportion of children who have been immunized for these diseases could be obtained in the time frame specified and at a fraction of the cost that would have resulted if attempts were made to obtain the information concerning every child of elementary school age in the state.

More formally, a sample survey may be defined as a study involving a subset (or sample) of individuals selected from a larger population. Variables or characteristics of interest are observed or measured on each of the sampled individuals. These measurements are then aggregated over all individuals in the sample to obtain summary statistics (e.g., means, proportions, and totals) for the sample. It is from these summary statistics that extrapolations can be made concerning the entire population. The validity and reliability of these extrapolations depend on how well the sample was chosen and on how well the measurements were made. These issues constitute the subject matter of this text.
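The definition above can be illustrated with a minimal sketch in Python. Everything in it is assumed for illustration only: the population is simulated rather than real, the immunization rate of 0.8 and the sample size of 500 are arbitrary choices, and the function names are hypothetical. The sketch draws a sample without replacement, aggregates the measurements into a summary statistic (a proportion), and compares it with the value a census would give.

```python
import random


def simulate_population(size, immunized_rate, seed):
    """Build a hypothetical population of 0/1 immunization indicators."""
    rng = random.Random(seed)
    return [1 if rng.random() < immunized_rate else 0 for _ in range(size)]


def sample_proportion(population, n, seed):
    """Draw n individuals without replacement and return the sample proportion."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return sum(sample) / n


# Illustrative values only; none of these numbers come from the text.
population = simulate_population(size=100_000, immunized_rate=0.8, seed=1)

# A census measures every individual, so this value is not an extrapolation.
census_value = sum(population) / len(population)

# A sample survey measures a subset and extrapolates to the population.
estimate = sample_proportion(population, n=500, seed=2)

print(f"census proportion:       {census_value:.3f}")
print(f"sample estimate (n=500): {estimate:.3f}")
```

In this sketch the sample involves only 500 of the 100,000 simulated children, which is the source of the savings in cost and time discussed above. How close the estimate falls to the census value depends on how the sample is chosen.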

When all individuals in the population are selected for measurement, the study is called a census. The summary statistics obtained from a census are not extrapolations, since every member of the population is measured. The validity of the resulting statistics, however, depends on how well the measurements are made. The main advantages of sample surveys over censuses lie in the reduced costs and greater speed made possible by taking measurements on a subset rather than on an entire population. In addition, studies involving complex issues requiring elaborate measurement procedures are often feasible only if a sample of the population is selected for measurement since limited resources can be allocated to getting detailed measurements if the number of individuals to be measured is not too great.

In the United States, as in many other countries, governmental agencies are mandated to develop and maintain programs whereby sample surveys are used to collect data on the economic, social, and health status of the population, and these data are used for research purposes as well as for policy decisions. For example, the National Center for Health Statistics (NCHS), a center within the United States Department of Health and Human Services, is mandated by law to conduct a program of periodic and ongoing sample surveys designed to obtain information about illness, disability, and the utilization of health care services in the United States [15]. Similar agencies, centers, or bureaus exist within other departments (e.g., the Bureau of Labor Statistics within the Department of Labor, and the National Center for Education Statistics within the Department of Education) that collect data relevant to the mission of their departments through a program of sample surveys. Field work for these surveys is sometimes done by the U.S. Bureau of the Census, which also has its own program of surveys, or by commercial firms.

The surveys developed by such government agencies often have extremely complex designs and require very large and highly skilled staff (and, hence, large budgets) for their execution. Although the nature of the missions of these government agencies (provision of valid and reliable statistics on a wide variety of indicators for the United States as a whole and for various subgroups of it) would justify these large budgets, such costs are rarely justified, or even feasible, for most institutions that make use of sample surveys. The information needs of most potential users of sample surveys are far more limited in scope and much more focused around a relatively small set of particular questions. Thus, the types of surveys conducted outside of the federal government are generally simpler in design and “one-shot” rather than ongoing. These are the types of surveys on which we will focus in this text. We will, however, devote some discussion to more complex sample surveys, especially in Chapter 12, which discusses variance estimation methods that have been developed primarily to meet the needs of very complex government surveys.

Sample surveys belong to a larger class of nonexperimental studies generally given the name “observational studies” in the health or social sciences literature. Most sample surveys can be put in the class of observational studies known as “cross-sectional studies.” Other types of observational studies include cohort studies and case-control studies.

Cross-sectional studies are “snapshots” of a population at a single point in time, having as objectives either the estimation of the prevalence or the mean level of some characteristics of the population or the measurement of the relationship between two or more variables measured at the same point in time.

Cohort and case-control studies are used for analytic rather than for descriptive purposes. For example, they are used in epidemiology to test hypotheses about the association between exposure to suspected risk factors and the incidence of specific diseases.

These study designs are widely used to gain insight into relationships. In the business world, for example, a sample of delinquent accounts might be taken (i.e., the “cases”) along with a sample of accounts that are not delinquent (i.e., the “controls”), and the characteristics of each group might be compared for purposes of determining those factors that are associated with delinquency. Numerous examples of these study designs could be given in other fields.

As mentioned above, cohort and case-control studies are designed with the objective of testing some statement (or hypothesis) concerning a set of independent variables (e.g., suspected risk factors) and a dependent variable (e.g., disease incidence). Although such studies are very important, they do not make up the subject matter of this text. The type of study of concern here is often known as a descriptive survey. Its main objective is that of estimating the level of a set of variables in a defined population. For example, in the hypothetical example presented at the beginning of this chapter, the major objective is to estimate, through use of a sample, the proportion of all children of elementary school age who have been vaccinated against childhood diseases. In descriptive surveys, much attention is given to the selection of the sample since extrapolation is made from the sample to the population. Although hypotheses can be tested based on data collected from such descriptive surveys, this is generally a secondary objective in such surveys. Estimation is almost always the primary objective.

1.2 DESIGNING SAMPLE SURVEYS

In this section, we will discuss the four major components involved in designing sample surveys. These components are sample design, survey measurements, survey operations, and statistical analysis and report generation.

1.2.1 Sample Design

In a sample survey, the major statistical components are referred to as the sample design and include both the sampling plan and the estimation procedures. The sampling plan is the methodology used for selecting the sample from the population. The estimation procedures are the algorithms or formulas used for obtaining estimates of population values from the sample data and for estimating the reliability of these population estimates.
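The two parts of a sample design can be kept separate in code, as in the following minimal Python sketch. It assumes one particular design, simple random sampling without replacement, chosen here only because it is the simplest case. The function names, the frame of 10,000 units, and the sample size of 400 are hypothetical. The standard error uses the usual estimator for a proportion under that design, with a finite population correction.

```python
import math
import random


def sampling_plan(frame_size, n, seed):
    """Sampling plan: choose n distinct unit indices from a frame of
    frame_size units (simple random sampling without replacement)."""
    rng = random.Random(seed)
    return rng.sample(range(frame_size), n)


def estimate_proportion(values, frame_size):
    """Estimation procedure: return the estimated proportion and its
    estimated standard error for 0/1 values from a simple random sample."""
    n = len(values)
    p_hat = sum(values) / n
    fpc = 1 - n / frame_size  # finite population correction
    se = math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se


# Hypothetical frame: 10,000 units, 80% of which have the characteristic.
frame = [1 if i % 5 else 0 for i in range(10_000)]

selected = sampling_plan(len(frame), n=400, seed=7)
p_hat, se = estimate_proportion([frame[i] for i in selected], len(frame))

print(f"estimated proportion: {p_hat:.3f}")
print(f"estimated standard error: {se:.4f}")
```

The first function decides which units are measured. The second turns the measurements into a population estimate and a statement of that estimate's reliability. A different sample design would change both functions together.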

The choice of a particular sample design should be a collaborative effort involving input from the statistician who will design the survey, the persons involved in executing the survey, and those who will use the data from the survey. The data users should specify what variables should be measured, what estimates are required, what levels of reliability and validity are needed for the estimates, and what restrictions are placed on the survey with respect to timeliness and costs. Those individuals involved in executing the survey should furnish input about costs for personnel, time, and materials as well as input about the feasibility of alternative sampling and measurement procedures. Having received such input, the statistician can then propose a sample design that will meet the required specifications of the users at the lowest possible cost.

1.2.2 Survey Measurements

Just as sampling and estimation are the statistician’s responsibility in the design of a sample survey, the choice of measurements to be taken and the procedures for taking these measurements are the responsibility of those individuals who are experts in the subject matter of the survey and of those individuals having expertise in the measurement sciences. The former (often called “subject matter persons”) give the primary input into specifying the measurements that are needed in order to meet the objectives of the survey. Once these measurements are specified, the measurement experts—often behavioral scientists or professional survey methodologists with special training and skills in interviewing or other aspects of survey research—begin designing questionnaires or forms to be used in eliciting the data from the sample individuals. The design of a questionnaire or other survey instrument that is suitable for collecting valid and reliable data is often a very complex task. It always requires considerable care and often requires a preliminary study, especially if some of the variables to be measured have never been measured before.

Once the survey instruments have been drafted, the statistician provides input with respect to the procedures to be used to evaluate and assure the quality of the data. In addition, the statistician ensures that the data can be easily coded and processed for statistical analysis and provides input into the strategies and statistical methods that will be used in the analysis.

1.2.3 Survey Operations

pilot survey