Cover Page

Contents

Tables

Boxes

Figures

Getting Files from the Wiley ftp and Internet Sites

LIST OF DATA SETS PROVIDED ON WEB SITE

Preface to the Fourth Edition

PART 1: Basic Concepts

CHAPTER 1: Uses of Sample Surveys

1.1 WHY SAMPLE SURVEYS ARE USED

1.2 DESIGNING SAMPLE SURVEYS

1.3 PRELIMINARY PLANNING OF A SAMPLE SURVEY

EXERCISES

BIBLIOGRAPHY

CHAPTER 2: The Population and the Sample

2.1 THE POPULATION

2.2 THE SAMPLE

2.3 SAMPLING DISTRIBUTIONS

2.4 CHARACTERISTICS OF ESTIMATES OF POPULATION PARAMETERS

2.5 CRITERIA FOR A GOOD SAMPLE DESIGN

2.6 SUMMARY

EXERCISES

BIBLIOGRAPHY

PART 2: Major Sampling Designs and Estimation Procedures

CHAPTER 3: Simple Random Sampling

3.1 WHAT IS A SIMPLE RANDOM SAMPLE?

3.2 ESTIMATION OF POPULATION CHARACTERISTICS UNDER SIMPLE RANDOM SAMPLING

3.3 SAMPLING DISTRIBUTIONS OF ESTIMATED POPULATION CHARACTERISTICS

3.4 COEFFICIENTS OF VARIATION OF ESTIMATED POPULATION PARAMETERS

3.5 RELIABILITY OF ESTIMATES

3.6 ESTIMATION OF PARAMETERS FOR SUBDOMAINS

3.7 HOW LARGE A SAMPLE DO WE NEED?

3.8 WHY SIMPLE RANDOM SAMPLING IS RARELY USED

3.9 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 4: Systematic Sampling

4.1 HOW TO TAKE A SYSTEMATIC SAMPLE

4.2 ESTIMATION OF POPULATION CHARACTERISTICS

4.3 SAMPLING DISTRIBUTION OF ESTIMATES

4.4 VARIANCE OF ESTIMATES

4.5 A MODIFICATION THAT ALWAYS YIELDS UNBIASED ESTIMATES

4.6 ESTIMATION OF VARIANCES

4.7 REPEATED SYSTEMATIC SAMPLING

4.8 HOW LARGE A SAMPLE DO WE NEED?

4.9 USING FRAMES THAT ARE NOT LISTS

4.10 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 5: Stratification and Stratified Random Sampling

5.1 WHAT IS A STRATIFIED RANDOM SAMPLE?

5.2 HOW TO TAKE A STRATIFIED RANDOM SAMPLE

5.3 WHY STRATIFIED SAMPLING?

5.4 POPULATION PARAMETERS FOR STRATA

5.5 SAMPLE STATISTICS FOR STRATA

5.6 ESTIMATION OF POPULATION PARAMETERS FROM STRATIFIED RANDOM SAMPLING

5.7 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 6: Stratified Random Sampling: Further Issues

6.1 ESTIMATION OF POPULATION PARAMETERS

6.2 SAMPLING DISTRIBUTIONS OF ESTIMATES

6.3 ESTIMATION OF STANDARD ERRORS

6.4 ESTIMATION OF CHARACTERISTICS OF SUBGROUPS

6.5 ALLOCATION OF SAMPLE TO STRATA

6.6 STRATIFICATION AFTER SAMPLING

6.7 HOW LARGE A SAMPLE IS NEEDED?

6.8 CONSTRUCTION OF STRATUM BOUNDARIES AND DESIRED NUMBER OF STRATA

6.9 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 7: Ratio Estimation

7.1 RATIO ESTIMATION UNDER SIMPLE RANDOM SAMPLING

7.2 ESTIMATION OF RATIOS FOR SUBDOMAINS UNDER SIMPLE RANDOM SAMPLING

7.3 POSTSTRATIFIED RATIO ESTIMATES UNDER SIMPLE RANDOM SAMPLING

7.4 RATIO ESTIMATION OF TOTALS UNDER SIMPLE RANDOM SAMPLING

7.5 COMPARISON OF RATIO ESTIMATE WITH SIMPLE INFLATION ESTIMATE

7.6 APPROXIMATION TO THE STANDARD ERROR OF THE RATIO ESTIMATED TOTAL

7.7 DETERMINATION OF SAMPLE SIZE

7.8 REGRESSION ESTIMATION OF TOTALS

7.9 RATIO ESTIMATION IN STRATIFIED RANDOM SAMPLING

7.10 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 8: Cluster Sampling: Introduction and Overview

8.1 WHAT IS CLUSTER SAMPLING?

8.2 WHY IS CLUSTER SAMPLING WIDELY USED?

8.3 A DISADVANTAGE OF CLUSTER SAMPLING: HIGH STANDARD ERRORS

8.4 HOW CLUSTER SAMPLING IS TREATED IN THIS BOOK

8.5 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 9: Simple One-Stage Cluster Sampling

9.1 HOW TO TAKE A SIMPLE ONE-STAGE CLUSTER SAMPLE

9.2 ESTIMATION OF POPULATION CHARACTERISTICS

9.3 SAMPLING DISTRIBUTIONS OF ESTIMATES

9.4 HOW LARGE A SAMPLE IS NEEDED?

9.5 RELIABILITY OF ESTIMATES AND COSTS INVOLVED

9.6 CHOOSING A SAMPLING DESIGN BASED ON COST AND RELIABILITY

9.7 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 10: Two-Stage Cluster Sampling: Clusters Sampled with Equal Probability

10.1 SITUATION IN WHICH ALL CLUSTERS HAVE THE SAME NUMBER N_i OF ENUMERATION UNITS

10.2 SITUATION IN WHICH NOT ALL CLUSTERS HAVE THE SAME NUMBER N_i OF ENUMERATION UNITS

10.3 SYSTEMATIC SAMPLING AS CLUSTER SAMPLING

10.4 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 11: Cluster Sampling in Which Clusters Are Sampled with Unequal Probability: Probability Proportional to Size Sampling

11.1 MOTIVATION FOR NOT SAMPLING CLUSTERS WITH EQUAL PROBABILITY

11.2 TWO GENERAL CLASSES OF ESTIMATORS VALID FOR SAMPLE DESIGNS IN WHICH UNITS ARE SELECTED WITH UNEQUAL PROBABILITY

11.3 PROBABILITY PROPORTIONAL TO SIZE SAMPLING

11.4 FURTHER COMMENT ON PPS SAMPLING

11.5 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 12: Variance Estimation in Complex Sample Surveys

12.1 LINEARIZATION

12.2 REPLICATION METHODS

12.3 SUMMARY

EXERCISES

TECHNICAL APPENDIX

BIBLIOGRAPHY

PART 3: Selected Topics in Sample Survey Methodology

CHAPTER 13: Nonresponse and Missing Data in Sample Surveys

13.1 EFFECT OF NONRESPONSE ON ACCURACY OF ESTIMATES

13.2 METHODS OF INCREASING THE RESPONSE RATE IN SAMPLE SURVEYS

13.3 MAIL SURVEYS COMBINED WITH INTERVIEWS OF NONRESPONDENTS

13.4 OTHER USES OF DOUBLE (or TWO-PHASE) SAMPLING METHODOLOGY

13.5 ITEM NONRESPONSE: METHODS OF IMPUTATION

13.6 MULTIPLE IMPUTATION

13.7 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 14: Selected Topics in Sample Design and Estimation Methodology

14.1 WORLD HEALTH ORGANIZATION EPI SURVEYS: A MODIFICATION OF PPS SAMPLING FOR USE IN DEVELOPING COUNTRIES

14.2 QUALITY ASSURANCE SAMPLING

14.3 SAMPLE SIZES FOR LONGITUDINAL STUDIES

14.4 ESTIMATION OF PREVALENCE OF DISEASES FROM SCREENING STUDIES

14.5 ESTIMATION OF RARE EVENTS: NETWORK SAMPLING

14.6 ESTIMATION OF RARE EVENTS: DUAL SAMPLES

14.7 ESTIMATION OF CHARACTERISTICS FOR LOCAL AREAS: SYNTHETIC ESTIMATION

14.8 EXTRACTION OF SENSITIVE INFORMATION: RANDOMIZED RESPONSE TECHNIQUES

14.9 SUMMARY

EXERCISES

BIBLIOGRAPHY

CHAPTER 15: Telephone Survey Sampling

15.1 INTRODUCTION

15.2 HISTORY OF TELEPHONE SAMPLING IN THE UNITED STATES

15.3 WITHIN-HOUSEHOLD SELECTION TECHNIQUES

15.4 STEPS IN THE TELEPHONE SURVEY PROCESS

15.5 DRAWING AND MANAGING A TELEPHONE SURVEY SAMPLE

15.6 POST-SURVEY DATA ENHANCEMENT PROCEDURES

15.7 IMPUTATION OF MISSING DATA

15.8 DECLINING COVERAGE AND RESPONSE RATES

15.9 ADDRESSING THE PROBLEMS WITH CELL PHONES

15.10 ADDRESS-BASED SAMPLING

EXERCISES

BIBLIOGRAPHY

CHAPTER 16: Constructing the Survey Weights

16.1 INTRODUCTION

16.2 OBJECTIVES OF WEIGHTING

16.3 CONSTRUCTING THE SAMPLE WEIGHTS

16.4 ESTIMATION AND ANALYSIS ISSUES

16.5 SUMMARY

BIBLIOGRAPHY

CHAPTER 17: Strategies for Design-Based Analysis of Sample Survey Data

17.1 STEPS REQUIRED FOR PERFORMING A DESIGN-BASED ANALYSIS

17.2 ANALYSIS ISSUES FOR “TYPICAL” SAMPLE SURVEYS

17.3 SUMMARY

TECHNICAL APPENDIX

BIBLIOGRAPHY

Appendix

Answers to Selected Exercises

Index

WILEY SERIES IN SURVEY METHODOLOGY

Established in Part by WALTER A. SHEWHART AND SAMUEL S. WILKS

Editors: Robert M. Groves, Graham Kalton, J. N. K. Rao, Norbert Schwarz, Christopher Skinner

A complete list of the titles in this series appears at the end of this volume.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Levy, Paul S.

Sampling of populations : methods and applications / Paul S. Levy, Stanley Lemeshow. — 4th ed. p. cm.

Includes bibliographical references and index.

ISBN 978-0-470-04007-2

1. Population—Statistical methods. 2. Sampling (Statistics) I. Lemeshow, Stanley. II. Title.

HB849.49.L48 2008

304.601′51952—dc21

2008004934

To our wives, Virginia and Elaine,
and our sons, daughters, and grandchildren

Tables

2.1	Number of Household Visits Made During a Specified Year
2.2	Sample Data for Number of Household Visits
2.3	Data for Number of Students Not Immunized for Measles Among Six Schools in a Community
2.4	Possible Samples and Values of x′
2.5	Sampling Distribution for Data of Table 2.4
2.6	Sampling Procedure for Population of Six Schools
2.7	Possible Samples and Values of x′
2.8	Sampling Distribution of x′
2.9	Data for the Burn Area Estimates
3.1	Possible Samples of Three Schools and Values of x′
3.2	Sampling Distribution of the x′ in Table 3.1
3.3	Values of the fpc (N = 10,000)
3.4	Sample Data for Number of Household Visits
3.5	Race and Out-of-Pocket Medical Expenses for Six Families (1987)
3.6	Data for a Subdomain Based on the Families in Table 3.5
3.7	Sampling Distribution of z/y
3.8	Cumulative Exposure to Pulmonary Stressors and Forced Vital Capacity Among Workers in a Sample Taken at a Plant Employing 1200 Workers
4.1	Systematic Sample of One in Six Physicians (from Table 2.1)
4.2	Five Possible Samples of One in Five Physicians (from Table 2.1)
4.3	Comparison of Means and Standard Errors for Simple Random Sampling and for Systematic Sampling
4.4	Six Possible Samples of One in Six Physicians (Table 2.1)
4.5	Possible Samples of 1 in k Elements (N/k is an Integer)
4.6	Cluster Samples Based on Data of Table 3.8
4.7	Data for Nurse Practitioner’s Visits (Unordered List)
4.8	Four Possible Samples for Data of Table 4.7
4.9	Data for Nurse Practitioner’s Visits (Monotonically Ordered List)
4.10	Four Possible Samples for Data of Table 4.9
4.11	Data for Nurse Practitioner’s Visits (Periodicity in List)
4.12	Four Possible Samples for Data of Table 4.11
4.13	Summary of Results for Four Types of Sampling
4.14	Distribution of Remainders and Samples for 25 Possible Random Numbers
4.15	Sampling Distribution of Estimated Means from Data of Tables 4.14 and 2.1
4.16	Systematic One-in-Five Sample Taken from Table 2.1
4.17	Days Lost from Work Because of Acute Illness in One Year Among 162 Employees in a Plant
4.18	Data for Six Systematic Samples Taken from Table 4.17
4.19	Six Samples Taken from Table 4.17
5.1	Truck Miles and Number of Accidents Involving Trucks by Type of Road Segment
5.2	Sampling Distribution of x′ for 56 Possible Samples of Three Segments
5.3	Two Strata for Data of Table 5.1
5.4	Sampling Distribution of for 30 Possible Samples of Three Segments
5.5	Comparison of Results for Simple Random Sampling and Stratification
5.6	Strata for a Population of 14 Families
6.1	Retail Prices of 10 Capsules of a Tranquilizer in Pharmacies in Two Communities (Strata)
6.2	Possible Samples for the Stratified Random Sample
6.3	Sampling Distribution for
6.4	General Hospitals in Illinois by Geographical Stratum, 2005
6.5	Strata for Number of Hospital Beds by County Among Counties in Illinois (Excluding Cook County) Having General Hospitals
6.6	Number of Pairs Available in Each of 18 Sex-Race-Sequence Difference Quantiles
6.7	Total Pairs Available and Total Pairs Sampled
6.8	Sample Data of Veterinarian’s Survey
6.9	Distribution of Hospital Episodes per Person per Year
6.10	Frequency Distribution of Total Amount Charged During 1996 to Medicaid for 2387 Patients Treated by a Large Medical Group
6.11	Optimal Allocation Based on Use of the rootfreq Method for Construction of Strata
6.12	Results of Three Methods of Strata Construction Combined with Optimal Allocation from Data on 2387 Patients Shown in Table 6.10
7.1	Pharmaceutical Expenses and Total Medical Expenses Among All Residents of Eight Community Areas
7.2	Possible Samples of Two Elements from the Population of Eight Elements (Table 7.1)
7.3	Samples of Seven Elements from the Population of Table 7.1
7.4	Population (2000 Census) and Current School Enrollment by Census Tract
7.5	Possible Samples of Two Schools Taken from Table 7.4
7.6	Values of x′ and x″ for Samples in Table 7.5
7.7	Sample Data from Table 5.1
8.1	Some Practical Examples of Clusters
8.2	Comparison of Costs for Two Sampling Designs
9.1	Number of Persons over 65 Years of Age and Number over 65 Years Needing Services of Visiting Nurse for Five Housing Developments
9.2	Summary Data for the Two Clusters Selected in the Sample
9.3	Sampling Distributions of Three Estimates
9.4	Means and Standard Errors of Estimates
9.5	Number of Eligible Persons and Number Receiving Substance Abuse Treatment Among Individuals Sentenced to Probation in 26 District Courts
10.1	Number of Patients Seen by Nurse Practitioners and Number Referred to a Physician for Five Community Health Centers
10.2	Summary Data for the Three Clusters Selected in the Sample
10.3	Worksheet for Calculations Involving Cluster Totals
10.4	Worksheet for Calculations Involving Listing Units
10.5	Designs Satisfying Specifications on Total
10.6	Designs Satisfying Specifications on Ratio
10.7	Amount of Money Billed to Medicare by Day and Week
10.8	Designs Satisfying Specifications on Total
10.9	Total Admissions with Life-Threatening Conditions and Total Admissions Discharged Dead from Ten Hospitals, 2007
10.10	Summary Data for a Sample of Three Hospitals Selected from the Ten Hospitals in Table 10.9
10.11	Sampling Distribution of , and r_clu
10.12	Frequency Distribution of Estimated Total Over All Possible Samples of Two Hospitals
11.1	Number of Outpatient Surgical Procedures Performed in 1997 in Three Community Hospitals
11.2	Number of Women Over 90 Years of Age Admitted to Nursing Homes in a Community During 1997
11.3	Distribution of the Hansen-Hurwitz Estimator Over All Possible Samples of Two Nursing Homes Drawn with Replacement
11.4	Number of Women Over 90 Years of Age Admitted to Nursing Homes in a Community During 1997
11.5	Distribution of the Horvitz-Thompson Estimator Over All Possible Samples of Two Nursing Homes Drawn with Replacement
11.6	Results of the PPS Sample
11.7	Data File for Use in Illustrative Example
11.8	Procedure for PPS Sampling with Replacement
12.1	Medicaid Payments and Overpayments for 10 Sample Claims Submitted on Patient
13.1	Data for the Survey of Physicians
13.2	Data that Would be Obtained from the Mail Survey
13.3	Actual Values for Missing Data in Table 13.2
14.1	Distribution of
16.1	NSCAW Weights for Two Age Domains
16.2	NSCAW Response Propensities for Age by Gender Weighting Classes
16.3	NSCAW Nonresponse WCA Factors for Age by Gender Weighting Classes
16.4	NSCAW Poststratification Adjustment Factors for State Group by Substantiation Poststrata
16.5	Final Weight Adjustment Factors for NSCAW Sample
17.1	Number of Nursing Homes in Three Regions in a State
17.2	Resulting Data from Sample of Nursing Homes
17.3	Selection of Sample Subjects from the Departments of Gironde and Dordogne
17.4	Association of Wine Consumption vs. Incident Dementia (Model-Based Analysis)
17.5	Logistic Regression Analysis of Wine Consumption and Incident Dementia Assuming Simple Random Sampling (Model-Based Analysis)
17.6	Logistic Regression Analysis of Wine Consumption and Incident Dementia Incorporating Sample Survey Parameters (Design-Based Analysis)
A.1	Random Number Table
A.2	Selected Percentiles of Standard Normal Distribution

Boxes

Box

2.1	Population Parameters
2.2	Sample Statistics
2.3	Mean and Variance of Sampling Distribution When Each Sample Has the Same Probability (1/T) of Selection
2.4	Mean and Variance of Sampling Distribution When Each Sample Does Not Have the Same Probability of Selection
3.1	Estimated Totals, Means, Proportions, and Variances Under Simple Random Sampling, and Estimated Variances and Standard Errors of These Estimates
3.2	Population Estimates, and Means and Standard Errors of Population Estimates Under Simple Random Sampling
3.3	Coefficients of Variation of Population Estimates Under Simple Random Sampling
3.4	Estimated Variances and 100(1−α)% Confidence Intervals Under Simple Random Sampling
3.5	Exact and Approximate Sample Sizes Required Under Simple Random Sampling
4.1	Estimated Totals, Means, and Variances Under Systematic Sampling, and Estimated Variances, and Standard Errors of These Estimates
4.2	Variances of Population Estimates Under Systematic Sampling
4.3	Estimation Procedures for Population Means Under Repeated Systematic Sampling
5.1	Population and Strata Parameters for Stratified Sampling
5.2	Estimates of Population Parameters and Standard Errors of These Estimates for Stratified Sampling
6.1	Means and Standard Errors of Population Estimates Under Stratified Random Sampling
6.2	Estimated Standard Errors Under Stratified Random Sampling
6.3	Estimates of Population Parameters Under Stratified Random Sampling with Proportional Allocation
7.1	Formulas for Ratio Estimation Under Simple Random Sampling
9.1	Notation Used in Simple One-Stage Cluster Sampling
9.2	Estimated Population Characteristics and Estimated Standard Errors for Simple One-Stage Cluster Sampling
9.3	Theoretical Standard Errors for Estimates Under Simple One-Stage Cluster Sampling
9.4	Exact and Approximate Sample Sizes Required Under Simple One-Stage Cluster Sampling
10.1	Notation Used in Simple Two-Stage Cluster Sampling
10.2	Estimated Population Characteristics and Estimated Standard Errors for Simple Two-Stage Cluster Sampling
10.3	Standard Errors for Population Estimates Under Simple Two-Stage Cluster Sampling
10.4	Estimates of Population Characteristics Under Simple Two-Stage Cluster Sampling, Unequal Numbers of Listing Units
10.5	Theoretical Standard Errors for Population Estimates for Simple Two-Stage Cluster Sampling, Unequal Numbers of Listing Units

Figures

Figure

2.1	Relative frequency distribution of the sampling distribution of x′
2.2	Relationship among bias, variability, and MSE for data of Table 2.9
3.1	Areas under normal curve within ±1, ±1.96 and ±3 standard errors of the mean
9.1	Form for collecting the data from sample households in housing developments
9.2	Cost: simple random sample and single-stage cluster sampling
14.1	Network sample for health-care providers and skin cancer patients
16.1	The correspondence among the target, frame, and respondent populations and the sample
16.2	Distribution of NSCAW final weights

Getting Files from the Wiley ftp and Internet Sites

To download the files listed in this book and other material associated with it, use an ftp program or a Web browser.

FTP ACCESS

If you are using an ftp program, type the following at your ftp prompt or URL prompt:

ftp://ftp.wiley.com

Some programs may provide the first ftp for you, in which case you can type:

ftp.wiley.com

If log-in parameters are required, log in as anonymous (e.g., User ID: anonymous). Leave the password blank. After you have connected to the Wiley ftp site, navigate through the directory path of:

/public/sci_tech_med/populations

Also, a direct link to the related FTP site is available on the book’s Wiley.com webpage.

FILE ORGANIZATION

Under the populations directory are subdirectories that include MATLAB files for PC, Macintosh, and UNIX systems and Microsoft® Excel files. Important information is included in the README files.

LIST OF DATA SETS PROVIDED ON WEB SITE

Data Set

momsag.dta

workendta

wloss2.ssd

jacktwin.ssd

jacktwin2.dta

dogscats.ssd

tab7ptl.ssd

tab7ptl.dta

bhratio.dat

tab9_la.dta

tab9_lc.ssd

il10ptl.ssd

il10pt2.dta

il10pt2.ssd

hospslct.ssd

exmp12_2.ssd

exmp12_2.dta

amblnce2.ssd

Preface to the Fourth Edition

This fourth edition of Sampling of Populations: Methods and Applications comes nearly ten years after the 1999 publication of the third edition. Unlike the third edition, it did not involve a major change in organization or emphasis. From our own experiences teaching this topic in graduate one-semester courses and in more intensive short courses held over a two-to-three-day period, as well as from positive feedback we received from students or professionals taking courses using this book or else using it for self-learning or reference, we feel that the organization used in the third edition works well; therefore we have kept it basically intact.

We did feel, however, that the book would benefit greatly from a moderate updating and refreshing along the following lines:

1. In the third edition, we introduced the use of SUDAAN and Stata for the statistical analysis of survey data, using the versions that were available in the mid-1990s. Although the syntax used in SUDAAN has not changed much, the syntax used for the Stata survey analysis commands has changed considerably since then, and the capabilities of both of these software packages for conducting analysis of data from complex sample surveys have increased nearly exponentially. In this fourth edition, we have updated the syntax as needed for all of the illustrative examples that we have included.

2. We have updated the illustrative examples to reflect the fact that we are now in the twenty-first century. For example, in the well-known “dogs and cats” illustrative example, we estimated that the average dog owner pays approximately $25.00 a year in veterinary medical costs per dog and that a cat owner pays approximately $10.75 per cat. As pet owners (Stan has dogs; Paul has cats), we pay well over $1,000.00 a year to maintain our animals in good health, which is why the insurance companies now offer medical insurance for pets. To keep our book from appearing “quaint,” we took a good look at the examples and updated the dates and the prices when necessary.

3. In the third edition, we made available on the Wiley FTP site several of the data files used in illustrative examples and exercises. In this edition, we are now making available most of the data files discussed in the text.

4. During the course of our teaching from the third edition and from our more recent experiences in the design of complex sample surveys and the analysis of data from such surveys, we began to recognize that we had paid too little attention to the process of preparing data for design-based statistical analysis. This process involves several stages of weighting: construction of base weights that reflect inclusion probabilities, adjustments for nonresponse and for deficiencies in the sampling frame, and weight trimming. The importance of these procedures cannot be emphasized too much, since they can result in large improvements in the validity and reliability of the resulting estimates.

5. To strengthen this area, we invited Drs. Paul Biemer and Sharon Christ to write a chapter entitled “Constructing the Sample Weights,” which covers the above material very thoroughly (Chapter 16). Dr. Biemer is a Distinguished Fellow in Statistics at RTI International and Professor of Sociology at the University of North Carolina at Chapel Hill. He has authored numerous publications in survey methodology, including the seminal book, Introduction to Survey Quality (co-authored with Lars Lyberg). Dr. Christ is a recent PhD graduate in biostatistics who has worked with Dr. Biemer on several survey-related projects.

During the years between the third and fourth editions, telephone surveys have undergone very rapid changes owing to such phenomena as declining response rates, the institution of the “Do Not Call List,” the increased penetration of cell phones in households, and the greatly increased number of “cell phone only” households. In addition, list-assisted random digit dialing replaced the Mitofsky–Waksberg method for sampling telephone households. To capture these changes, we invited Drs. Michael W. Link and Mansour Fahimi to write a chapter on telephone surveys (Chapter 15). Both Drs. Link and Fahimi are sample survey methodologists who have worked and published intensively in the area of telephone sample surveys. Both are former colleagues of Drs. Levy and Biemer at RTI International. Dr. Link is currently Vice President for Methodological Research and Chief Methodologist at Nielsen, and Dr. Fahimi is presently Vice President for Statistical Research Services at the Marketing Systems Group, which is a major provider of expertise and services for sample surveys through its GENESYS Sampling division.

In summary, we feel that the enhancements mentioned above will provide the reader with a more enjoyable and beneficial experience. We also include material from the Preface to the third edition that we feel might be helpful to readers of this edition.

This fourth edition is likely to be the final edition of Sampling of Populations: Methods and Applications (at least written by us). Therefore, we would like to express our utmost appreciation to our colleagues, students, managers, and staff at all of the institutions with which we were fortunate enough to be associated in the 30+ years since we conceived this book (University of Massachusetts for both Paul and Stan; University of Illinois at Chicago, University of North Carolina at Chapel Hill, and RTI International for Paul; and The Ohio State University for Stan). Stan is particularly appreciative of Ohio State Provost Barbara Snyder’s agreeing to allow him to spend three months away from Columbus on a special research assignment (SRA). This provided him with the time he needed to work on this book. Thanks also to Annick Alpérovitch, Carole Dufouil, and Christophe Tzourio at INSERM Unit 708 in Paris, France for providing an office for Stan and an environment conducive to working on this book during the SRA. We would like to thank our current and former editors: Alex Kugushev at Wadsworth; Beatrice Shube, Helen Ramsey, and Steve Quigley at Wiley. We would also like to thank Nidhi Kochar, Charisse Darrell-Fields, Michael Sabbatino, and Tracy McHone for their assistance with various aspects of this book’s preparation. We’re especially appreciative of the expertise Melanie Cole brought to formatting, organizing, and preparing the camera-ready manuscript. Parts of the material in the fourth edition were recently used in Paul’s course at the University of North Carolina and we are grateful to the students in that course for their helpful suggestions and detection of errors. Vince Iannacchione, Paul’s colleague and co-instructor in that course, pointed out several inconsistencies and made numerous helpful comments. Amanda Lewis-Evans designed the cover art. Most of all, we are grateful to our wives, Virginia and Elaine, for putting up with us. It’s been a great ride.

Paul S. Levy
Stanley Lemeshow

Research Triangle Park, North Carolina
Columbus, Ohio
April 2008

Material from the Preface to the Third Edition

The original edition of Sampling of Populations: Methods and Applications was published in 1980 by Lifetime Learning Publications (a Division of Wadsworth, Inc.) under the title Sampling for Health Professionals. Like other Lifetime Learning Books, its primary intended audience was the working professional; in this instance, the practicing statistician. With this as the target audience, the authors felt that such a book on sampling should have the following features:

Presentation of the basic concepts and procedures of sample design and estimation methods in a user-friendly way with a minimum of mathematical formality and jargon.
Compilation of important formulas in boxes that can be easily located by the user.
Presentation of the various procedures for drawing a sample (e.g., systematic sampling, probability proportional to size sampling) in a step-by-step manner that the reader could easily follow (almost like a manual).
Use of heuristic demonstrations and numerous illustrative examples rather than rigorous mathematical proofs for the purpose of giving the reader a clearer understanding of the rationale for certain procedures used in sampling.

Sampling for Health Professionals was well received both by reviewers and readers, and had a steady following throughout its existence (1980-91). In addition to having a strong following among practicing statisticians (its intended primary audience), it had been adopted as a primary text or recommended as additional reading by an unexpectedly large number of instructors of sampling courses in various academic units.

Based on the success of this first version of the book, we developed a greatly revised and expanded version that was published by John Wiley & Sons in 1991 under the title Sampling of Populations: Methods and Applications. Our purpose in that revision was to improve the suitability of the book for use as a text in applied sampling courses without compromising its readability or its suitability for the continuation and self-learning markets. The resulting first Wiley edition was a considerably updated and expanded version of the original work, but in much the same style and tone.

Although less than a decade has elapsed since the appearance of the first Wiley edition, there have been major developments both in the design and administration of sample surveys and in the analysis of the resulting data. In particular, refinements in telephone sampling and interviewing methodologies have now made telephone surveys generally more feasible and less expensive than face-to-face household interviewing. While the first edition contained some material on random digit dialing (RDD), it did not cover the various refinements of RDD or the use of list-assisted sampling methods and other innovations that are now widely used.

Also, “user friendly” computer software is much more readily available not only for obtaining standard errors of survey estimates, but also for performing statistical procedures such as contingency table analysis and multiple linear or logistic regression that take into consideration complexities in the sample design. In fact, at the writing of this new edition, modules for the analysis of complex survey data have begun to appear in major general statistical software packages (e.g., Stata). Such software is now widely used and has removed the necessity for many of the complicated formulas that appeared in Chapter 11 as well as elsewhere in the first Wiley edition. Likewise, the analysis methods that appear in Chapter 16 of that edition are a reflection of methods used historically when design-based analysis methodology was not in its present state of development and software was not readily available. We felt that the discussion in that chapter did not reflect adequately the present state of the art with respect to analysis strategies for survey data and we have totally revised it to reflect more closely current practice.

Both of us feel strongly that knowledge of telephone-sampling methodology and familiarity with computer-driven methods now widely used in the analysis of complex survey data should be part of any introduction to sampling methods, and, with this in mind, we have greatly enhanced and revised the material on survey data analysis, and attempted to introduce the use of appropriate software throughout the book in our discussion of the major sample designs and estimation procedures. In addition to adding material on the topics just mentioned, we have made other revisions based on our own experience with the book in sampling courses and on suggestions from students and colleagues. Some of these are listed below.

1. Exercises that seem to be ineffective have been modified or replaced. Some new exercises have been added.

2. The material on two-stage cluster sampling has been expanded and reorganized, with designs in which clusters are sampled with equal probability presented in Chapter 10 and designs in which clusters are sampled with unequal probability given in Chapter 11. The important class of probability proportional to size (PPS) sampling designs is now presented in Chapter 11 as a particular case of designs in which clusters are sampled with unequal probability. This appears to us a more natural way to present cluster sampling designs and has worked out very well in our classes.

3. In our presentation of many of the numerical illustrative examples, we include discussion of how the analysis would appear in one of the software packages that specializes in or includes analysis of survey data.

4. Chapters that deal with specific sampling designs now have a similar structure, with the following subsections:

How to take the sample
Estimation of population parameters
More theoretical discussion of sampling distributions of these estimates (if worthwhile)
Estimation of standard errors
Sample size determination
Optimization issues (if appropriate)
Summary, exercises, bibliography

5. We have greatly expanded the number of articles cited in the annotated reference list to include more recent articles of importance. In particular, one of us (PSL) was the Section Editor for Design of Experiments and Sample Surveys of the recently published Encyclopedia of Biostatistics and, as such, was responsible for the material on sample survey methods. In that capacity, he solicited over 50 expository or review articles covering important topics in sampling methodology and written by experts on the particular topics. These are useful references for readers learning sampling for the first time as well as for more advanced readers, and we have cited many of them.

6. In the prior edition, we did not include in electronic form data sets used in the illustrative examples and exercises. In this edition, we make such data available on the Internet.

It is our feeling that one of the strengths of the earlier edition is its focus on the basic principles and methods of sampling. To maintain this focus, we omit or treat very briefly several very interesting topics that have seen considerable development in the last decade. We feel that they are best covered in more specialized texts on sampling. As a result, we do not cover to any extent topics such as distance sampling, adaptive sampling, and superpopulation models that are of considerable importance, but have been treated very well in other volumes. We did, however, include several topics that were not in the previous edition and that we feel are important for a general understanding of sampling methodology. Examples of such topics included in this edition are construction of stratum boundaries and desired number of strata (Chapter 6); estimation of ratios for subdomains (Chapter 7); poststratified ratio estimates (Chapter 7); the Hansen-Hurwitz estimator and the Horvitz-Thompson estimator (Chapter 11).

From our experience with the first edition of Sampling of Populations: Methods and Applications, we feel that this book will be used by practicing statisticians as well as by students taking formal courses in sampling methodology. Both of us teach in schools of public health, and have used this book as the basic text for a one-semester course in sample-survey methodology. Our classes have included a mix of students concentrating in biostatistics, epidemiology, and other areas in the biomedical and social and behavioral sciences. In our experience, this book has been very suitable to this mix of students, and we feel that at least 80% of this material could be covered without difficulty in a single semester course.

Several instructors have indicated that, in their courses on sampling theory, this book works well as a primary text in conjunction with a more theoretical text (e.g., W.G. Cochran, Sampling Techniques, 3rd ed., New York: Wiley, 1977), with the latter text used for purposes of providing additional theoretical background. Conversely, selected readings from our book have been used to provide sampling background to students in broader courses on survey research methodology (often taught in sociology departments).

The students and colleagues who gave us helpful comments and suggestions on our earlier text and on the present volume are too numerous to mention, and we are grateful to all of them. We would like to thank, in particular, Janelle Klar and Elizabeth Donohoe-Cook for carefully reading this manuscript and making valuable editorial and substantive suggestions. In addition, we would like to thank the two anonymous individuals who reviewed an earlier draft of this manuscript for the publisher. Although we did not agree with all of their suggestions, we did take into consideration in our subsequent revision many of their thoughtful and insightful comments. Most of all, however, we wish to recognize the pioneers of sampling methodology who have written the early textbooks in this field. In particular, the books by William Cochran, Morris Hansen, William Hurwitz and William Madow, Leslie Kish, and P.V. Sukhatme are statistical classics that are still widely studied by students, academics, and practitioners. Those of us who cut our teeth on these books and have made our careers in survey sampling owe them a great debt.

Paul S. Levy
Stanley Lemeshow

Chicago, Illinois
Amherst, Massachusetts
December 1998

PART 1

Basic Concepts

CHAPTER 1 Uses of Sample Surveys

1.1 WHY SAMPLE SURVEYS ARE USED

Information on characteristics of populations is constantly needed by politicians, marketing departments of companies, public officials responsible for planning health and social services, and others. For reasons relating to timeliness and cost, this information is often obtained by use of sample surveys. Such surveys are the subject of this book.

The following discussion provides an example of a sample survey conducted to obtain information about a health characteristic in a particular population. A health department in a large state is interested in determining the proportion of the state’s children of elementary school age who have been immunized against childhood infectious diseases (e.g., polio, diphtheria, tetanus, and pertussis). For administrative reasons, this task must be completed in only one month.

At first glance this task would seem to be most formidable, involving the careful coordination of a large staff attempting to collect information, either from parents or from school immunization records on each and every child of elementary school age residing in that state. Clearly, the budget necessary for such an undertaking would be enormous because of the time, travel expenses, and number of children involved. Even with a sizable staff, it would be difficult to complete such an undertaking in the specified time frame.

To handle problems such as the one outlined above, this text will present a variety of methods for selecting a subset (a sample) from the original set of all measurements (the population) of interest to the researchers. It is the members of the sample who will be interviewed, studied, or measured. For example, in the problem stated above, the net effect of such methods will be that valid and reliable estimates of the proportion of children who have been immunized for these diseases could be obtained in the time frame specified and at a fraction of the cost that would have resulted if attempts were made to obtain the information concerning every child of elementary school age in the state.

More formally, a sample survey may be defined as a study involving a subset (or sample) of individuals selected from a larger population. Variables or characteristics of interest are observed or measured on each of the sampled individuals. These measurements are then aggregated over all individuals in the sample to obtain summary statistics (e.g., means, proportions, and totals) for the sample. It is from these summary statistics that extrapolations can be made concerning the entire population. The validity and reliability of these extrapolations depend on how well the sample was chosen and on how well the measurements were made. These issues constitute the subject matter of this text.
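As a rough illustration of this definition, the following Python sketch builds a purely hypothetical population of immunization indicators, selects a subset at random, and aggregates the sampled values into a summary statistic (a proportion) that serves as an estimate for the whole population. The population size (100,000), the underlying immunization rate (80%), the sample size (500), and the use of simple random selection are all assumptions made for the illustration; they are not taken from the health department example.

```python
import random


def draw_sample(population, n, seed=None):
    """Select n distinct members of the population at random (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def proportion(values):
    """Proportion of 1s among a list of 0/1 indicators."""
    return sum(values) / len(values)


# Hypothetical population: 1 = immunized, 0 = not immunized.
rng = random.Random(1)
population = [1 if rng.random() < 0.8 else 0 for _ in range(100_000)]

# Measure only the sampled children, then summarize.
sample = draw_sample(population, 500, seed=2)
sample_estimate = proportion(sample)

# The census value is available here only because the population is simulated.
census_value = proportion(population)

print(f"census value:    {census_value:.3f}")
print(f"sample estimate: {sample_estimate:.3f}")
```

In this sketch only 500 of the 100,000 records are examined, yet the sample proportion will typically fall close to the census value. How close it can be expected to fall is the question of reliability taken up in later chapters.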

When all individuals in the population are selected for measurement, the study is called a census. The summary statistics obtained from a census are not extrapolations, since every member of the population is measured. The validity of the resulting statistics, however, depends on how well the measurements are made. The main advantages of sample surveys over censuses lie in the reduced costs and greater speed made possible by taking measurements on a subset rather than on an entire population. In addition, studies involving complex issues requiring elaborate measurement procedures are often feasible only if a sample of the population is selected for measurement since limited resources can be allocated to getting detailed measurements if the number of individuals to be measured is not too great.

In the United States, as in many other countries, governmental agencies are mandated to develop and maintain programs whereby sample surveys are used to collect data on the economic, social, and health status of the population, and these data are used for research purposes as well as for policy decisions. For example, the National Center for Health Statistics (NCHS), a center within the United States Department of Health and Human Services, is mandated by law to conduct a program of periodic and ongoing sample surveys designed to obtain information about illness, disability, and the utilization of health care services in the United States [15]. Similar agencies, centers, or bureaus exist within other departments (e.g., the Bureau of Labor Statistics within the Department of Labor, and the National Center for Education Statistics within the Department of Education) that collect data relevant to the mission of their departments through a program of sample surveys. Field work for these surveys is sometimes done by the U.S. Bureau of the Census, which also has its own program of surveys, or by commercial firms.

The surveys developed by such government agencies often have extremely complex designs and require very large and highly skilled staff (and, hence, large budgets) for their execution. Although the missions of these government agencies (provision of valid and reliable statistics on a wide variety of indicators for the United States as a whole and for various subgroups of it) would justify these large budgets, such costs are rarely justified, or even feasible, for most institutions that make use of sample surveys. The information needs of most potential users of sample surveys are far more limited in scope and much more focused around a relatively small set of particular questions. Thus, the types of surveys conducted outside of the federal government are generally simpler in design and “one-shot” rather than ongoing. These are the types of surveys on which we will focus in this text. We will, however, devote some discussion to more complex sample surveys, especially in Chapter 12, which discusses variance estimation methods that have been developed primarily to meet the needs of very complex government surveys.

Sample surveys belong to a larger class of nonexperimental studies generally given the name “observational studies” in the health or social sciences literature. Most sample surveys can be put in the class of observational studies known as “cross-sectional studies.” Other types of observational studies include cohort studies and case-control studies.

Cross-sectional studies are “snapshots” of a population at a single point in time, having as objectives either the estimation of the prevalence or the mean level of some characteristics of the population or the measurement of the relationship between two or more variables measured at the same point in time.

Cohort and case-control studies are used for analytic rather than for descriptive purposes. For example, they are used in epidemiology to test hypotheses about the association between exposure to suspected risk factors and the incidence of specific diseases.

These study designs are widely used to gain insight into relationships. In the business world, for example, a sample of delinquent accounts might be taken (i.e., the “cases”) along with a sample of accounts that are not delinquent (i.e., the “controls”), and the characteristics of each group might be compared for purposes of determining those factors that are associated with delinquency. Numerous examples of these study designs could be given in other fields.

As mentioned above, cohort and case-control studies are designed with the objective in mind of testing some statement (or hypothesis) concerning a set of independent variables (e.g., suspected risk factors) and a dependent variable (e.g., disease incidence). Although such studies are very important, they do not make up the subject matter of this text. The type of study of concern here is often known as a descriptive survey. Its main objective is that of estimating the level of a set of variables in a defined population. For example, in the hypothetical example presented at the beginning of this chapter, the major objective is to estimate, through use of a sample, the proportion of all children of elementary school age who have been vaccinated against childhood diseases. In descriptive surveys, much attention is given to the selection of the sample since extrapolation is made from the sample to the population. Although hypotheses can be tested based on data collected from such descriptive surveys, this is generally a secondary objective in such surveys. Estimation is almost always the primary objective.

1.2 DESIGNING SAMPLE SURVEYS

In this section, we will discuss the four major components involved in designing sample surveys. These components are sample design, survey measurements, survey operations, and statistical analysis and report generation.

1.2.1 Sample Design

In a sample survey, the major statistical components are referred to as the sample design and include both the sampling plan and the estimation procedures. The sampling plan is the methodology used for selecting the sample from the population. The estimation procedures are the algorithms or formulas used for obtaining estimates of population values from the sample data and for estimating the reliability of these population estimates.
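The two parts of a sample design can be kept separate in code as well. The following Python sketch is one possible illustration, assuming a simple random sample drawn without replacement from a list frame: `sampling_plan` decides which units are selected, and `estimate_proportion` turns the sampled measurements into a population estimate together with an estimated standard error as a measure of its reliability. The function names, the frame size, and the data values are hypothetical, and the standard error formula is the usual one for a proportion under simple random sampling with a finite population correction.

```python
import math
import random


def sampling_plan(frame_size, n, seed=None):
    """Sampling plan: choose n distinct positions from a list frame of frame_size units."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(frame_size), n))


def estimate_proportion(y, frame_size):
    """Estimation procedure: point estimate and estimated standard error of a
    proportion, assuming simple random sampling without replacement."""
    n = len(y)
    p_hat = sum(y) / n
    fpc = (frame_size - n) / frame_size  # finite population correction
    se = math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se


# Hypothetical frame of 1,000 units; select 10 of them.
selected = sampling_plan(1000, 10, seed=3)

# Hypothetical 0/1 measurements recorded for the 10 selected units.
measurements = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

p_hat, se = estimate_proportion(measurements, 1000)
print(f"selected positions: {selected}")
print(f"estimated proportion: {p_hat:.2f}, estimated standard error: {se:.3f}")
```

Keeping the two functions apart mirrors the distinction drawn above: a different sampling plan would generally call for a different estimation procedure, even when the measurements themselves are unchanged.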

The choice of a particular sample design should be a collaborative effort involving input from the statistician who will design the survey, the persons involved in executing the survey, and those who will use the data from the survey. The data users should specify what variables should be measured, what estimates are required, what levels of reliability and validity are needed for the estimates, and what restrictions are placed on the survey with respect to timeliness and costs. Those individuals involved in executing the survey should furnish input about costs for personnel, time, and materials as well as input about the feasibility of alternative sampling and measurement procedures. Having received such input, the statistician can then propose a sample design that will meet the required specifications of the users at the lowest possible cost.

1.2.2 Survey Measurements

Just as sampling and estimation are the statistician’s responsibility in the design of a sample survey, the choice of measurements to be taken and the procedures for taking these measurements are the responsibility of those individuals who are experts in the subject matter of the survey and of those individuals having expertise in the measurement sciences. The former (often called “subject matter persons”) give the primary input into specifying the measurements that are needed in order to meet the objectives of the survey. Once these measurements are specified, the measurement experts—often behavioral scientists or professional survey methodologists with special training and skills in interviewing or other aspects of survey research—begin designing questionnaires or forms to be used in eliciting the data from the sample individuals. The design of a questionnaire or other survey instrument that is suitable for collecting valid and reliable data is often a very complex task. It always requires considerable care and often requires a preliminary study, especially if some of the variables to be measured have never been measured before.

Once the survey instruments have been drafted, the statistician provides input with respect to the procedures to be used to evaluate and assure the quality of the data. In addition, the statistician ensures that the data can be easily coded and processed for statistical analysis and provides input into the strategies and statistical methods that will be used in the analysis.

1.2.3 Survey Operations

pilot survey