Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Upton, Graham J. G., author.
Title: Categorical data analysis by example / Graham J.G. Upton.
Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes index.
Identifiers: LCCN 2016031847 (print) | LCCN 2016045176 (ebook) | ISBN 9781119307860 (cloth) | ISBN 9781119307914 (pdf) | ISBN 9781119307938 (epub)

Subjects: LCSH: Multivariate analysis. | Log-linear models.
Classification: LCC QA278 .U68 2016 (print) | LCC QA278 (ebook) | DDC 519.5/35–dc23
LC record available at https://lccn.loc.gov/2016031847

CONTENTS

Preface

Acknowledgments

Chapter 1: Introduction

1.1 What are Categorical Data?
1.2 A Typical Data Set
1.3 Visualization and Cross-Tabulation
1.4 Samples, Populations, and Random Variation
1.5 Proportion, Probability, and Conditional Probability
1.6 Probability Distributions
1.7 *The Likelihood

Chapter 2: Estimation and Inference for Categorical Data

2.1 Goodness of Fit
2.2 Hypothesis Tests for a Binomial Proportion (Large Sample)
2.3 Hypothesis Tests for a Binomial Proportion (Small Sample)
2.4 Interval Estimates for a Binomial Proportion
References

Chapter 3: The 2 × 2 Contingency Table

3.1 Introduction
3.2 Fisher’s Exact Test (For Independence)
3.3 Testing Independence with Large Cell Frequencies
3.4 The 2 × 2 Table in a Medical Context
3.5 Measuring Lack of Independence (Comparing Proportions)
References

Chapter 4: The I × J Contingency Table

4.1 Notation
4.2 Independence in the I × J Contingency Table
4.3 Partitioning
4.4 Graphical Displays
4.5 Testing Independence with Ordinal Variables
References

Chapter 5: The Exponential Family

5.1 Introduction
5.2 The Exponential Family
5.3 Components of a General Linear Model
5.4 Estimation
References

Chapter 6: A Model Taxonomy

6.1 Underlying Questions
6.2 Identifying the Type of Model

Chapter 7: The 2 × J Contingency Table

7.1 A Problem with X² (And G²)
7.2 Using the Logit
7.3 Individual Data and Grouped Data
7.4 Precision, Confidence Intervals, and Prediction Intervals
7.5 Logistic Regression with a Categorical Explanatory Variable
References

Chapter 8: Logistic Regression with Several Explanatory Variables

8.1 Degrees of Freedom When There Are No Interactions
8.2 Getting a Feel for the Data
8.3 Models with Two-Variable Interactions

Chapter 9: Model Selection and Diagnostics

9.1 Introduction
9.2 Notation for Interactions and for Models
9.3 Stepwise Methods for Model Selection Using G²
9.4 AIC and Related Measures
9.5 The Problem Caused by Rare Combinations of Events
9.6 Simplicity Versus Accuracy
9.7 DFBETAS
References

Chapter 10: Multinomial Logistic Regression

10.1 A Single Continuous Explanatory Variable
10.2 Nominal Categorical Explanatory Variables
10.3 Models for an Ordinal Response Variable
References

Chapter 11: Log-Linear Models for I × J Tables

11.1 The Saturated Model
11.2 The Independence Model for an I × J Table

Chapter 12: Log-Linear Models for I × J × K Tables

12.1 Mutual Independence: A/B/C
12.2 The Model AB/C
12.3 Conditional Independence and Independence
12.4 The Model AB/AC
12.5 The Models AB/AC/BC and ABC
12.6 Simpson’s Paradox
12.7 Connection between Log-Linear Models and Logistic Regression
Reference

Chapter 13: Implications and Uses of Birch’s Result

13.1 Birch’s Result
13.2 Iterative Scaling
13.3 The Hierarchy Constraint
13.4 Inclusion of the All-Factor Interaction
13.5 Mostellerizing
References

Chapter 14: Model Selection for Log-Linear Models

14.1 Three Variables
14.2 More than Three Variables
Reference

Chapter 15: Incomplete Tables, Dummy Variables, and Outliers

15.1 Incomplete Tables
15.2 Quasi-Independence
15.3 Dummy Variables
15.4 Detection of Outliers

Chapter 16: Panel Data and Repeated Measures

16.1 The Mover-Stayer Model
16.2 The Loyalty Model
16.3 Symmetry
16.4 Quasi-Symmetry
16.5 The Loyalty-Distance Model
References

Appendix: R Code for Cobweb Function

Index

Author Index

Index of Examples

List of Illustrations

Chapter 1

Figure 1.1 Illustration of results of sports preference survey.
Figure 1.2 A normal distribution, with mean μ and variance σ².
Figure 1.3 Chi-squared distributions with 2, 4, and 8 degrees of freedom.

Chapter 2

Figure 2.1 The average width of a “95%” confidence interval for p varies with the actual value of p. The results shown are using the recommended mid-P interval for the case n = 50.

Chapter 4

Figure 4.1 Mosaic diagram, using standardized residuals (Equation 4.7) for the independence model applied to the data of Table 4.1.
Figure 4.2 Cobweb diagram for the independence model applied to the data of Table 4.1. The lines indicate the category combinations that have the largest standardized residuals (Black = positive, Gray = negative).
Figure 4.3 A scatter diagram representation of the data in Table 4.6, with a line suggesting the dependence between the two variables.

Chapter 7

Figure 7.1 The relation between probability, p, and the logit, ln [p/(1 − p)].
Figure 7.2 Scatter diagram showing the mean daily total calcium intake (in mg) for 100 individuals who died of cardiovascular disease (CVD), and for 100 who did not.
Figure 7.3 Graphs of mean daily total calcium intake (in mg) against proportion dying of CVD, and against logit(proportion dying of CVD).
Figure 7.4 The probabilities derived from the fitted simple logistic regression model superimposed on the scatter diagram of mean daily total calcium intake (in mg) against proportion dying of CVD.
Figure 7.5 The approximate 95% prediction interval for the simple logistic regression model, Equation (7.1), relating CVD to calcium intake.

Chapter 8

Figure 8.1 Scatter diagram showing the dependence of the logit(dying of CVD) on mean daily total calcium intake and gender.
Figure 8.2 The estimated probabilities of dying from CVD for males (solid line) and females (dashed line) for the purely additive model, using age and gender, and for the model including the age-gender interaction.

Chapter 9

Figure 9.1 The tree of possible models involving three explanatory variables, X, Y, and Z.
Figure 9.2 Forward selection for the referendum data. Models examined are indicated, with the entire preferred path shown in bold. Key: P, political affiliation; D, time of decision; R, read official leaflet.
Figure 9.3 Backward elimination for the referendum data. Models examined are indicated, with the entire preferred path shown in bold. Key: P, political affiliation; D, time of decision; R, read official leaflet.
Figure 9.4 The values of DFBETAS for the coefficient of T for the orange juice data. The extreme value for DFBETAS corresponds to the pair of observations taken at 55°C.

Chapter 10

Figure 10.1 The fit of the model describing the dependence of political allegiance on age.
Figure 10.2 The tree of models describing the dependence of political allegiance on gender and/or class.
Figure 10.3 The dependence of the cell probabilities on the value of x, using the cumulative logit model given by Equation (10.8) with three categories, β = 1, μ₁ = 0.2, and μ₂ = 2.
Figure 10.4 The tree of models describing the dependence of working hours on sex and marital condition using proportional odds models.

Chapter 11

Figure 11.1 A cobweb diagram for the murder data of Table 11.5.

Chapter 12

Figure 12.1 Model tree showing the connections between the nine models of possible interest in a three-way classification involving variables A, B, and C.
Figure 12.2 Graphical analogies of problems caused by ignoring a third variable. (a) Apparent dependence with conditional independence. (b) Apparent independence with strong dependence in the sub-populations.
Figure 12.3 A graphical analogue of Simpson’s paradox: positive slopes in subpopulations become a negative slope when the data are (incorrectly) treated as a single population.
Figure 12.4 The fit of log-linear models describing the dependence of political allegiance on gender and/or class.

Chapter 14

Figure 14.1 A cobweb diagram showing the relative importance of the various two-variable interactions for the data of Table 14.1.
Figure 14.2 A cobweb diagram showing the relative importance of the various two-variable interactions for the data of Table 14.4.
Figure 14.3 Extract from the tree of models applied to the data of Table 14.4. The variables are A, Date of birth; B, Region; C, Thumb style; D, Hand posture.

Chapter 15

Figure 15.1 Cobweb diagram for the Aberdeen mothers.

Preface

This book is aimed at all those who wish to discover how to analyze categorical data without getting immersed in complicated mathematics and without needing to wade through a large amount of prose. It is aimed at researchers with their own data ready to be analyzed and at students who would like an approachable alternative view of the subject. The few starred sections provide background details for interested readers, but can be omitted by readers who are more concerned with the “How” than the “Why.”

As the title suggests, each new topic is illustrated with an example. Since the examples were as new to the writer as they will be to the reader, in many cases I have suggested preliminary visualizations of the data or informal analyses prior to the formal analysis. Any model provides, at best, a convenient simplification of a mass of data into a few summary figures. For a proper analysis of any set of data, it is essential to understand the background to the data and to have available information on all the relevant variables. Examples in textbooks cannot be expected to provide detailed insights into the data analyzed: those insights should be provided by the users of the book in the context of their own sets of data.

In many cases (particularly in the later chapters), R code is given and excerpts from the resulting output are presented. R was chosen simply because it is free! The book focuses on the methods of analysis rather than on any particular programming language. Users of other packages (SAS, Stata, ...) would obtain equivalent output from their analyses; it would simply be presented in a slightly different format. The author does not claim to be an expert R programmer, so the example code can doubtless be improved. However, it should work adequately as it stands.

In the context of log-linear models for cross-tabulations, two “specialties of the house” have been included: the use of cobweb diagrams to get visual information concerning significant interactions, and a procedure for detecting outlier category combinations. The R code used for these is available and may be freely adapted.

GRAHAM J. G. UPTON

Wivenhoe, Essex
March 2016

Acknowledgments

My first thanks go to the generations of students who have sat through lectures related to this material without complaining too loudly!

I have gleaned data from a variety of sources and particular thanks are due to Mieke van Hemelrijck and Sabine Rohrmann for making the NHANES III data available. The data on the hands of blues guitarists have been taken from the Journal of Statistics Education, which has an excellent online data resource. Most European and British data were abstracted from the UK Data Archive, which is situated at the University of Essex; I am grateful for their assistance and their permission to use the data. Those interested in election data should find the website of the British Election Study helpful. The US crime data were obtained from the website provided by the FBI. On behalf of researchers everywhere, I would like to thank these entities for making their data so easy to re-analyze.

GRAHAM J. G. UPTON