Categorical Data Analysis
C22.0010             B90.2337             B90.3307
 

Jeffrey S. Simonoff
Office: MEC 8-54
Phone: (212) 998-0452
FAX: (212) 995-4003
e-mail: jsimonof@stern.nyu.edu
WWW: http://www.stern.nyu.edu/~jsimonof/classes/3307

Text: R.L. Christensen, Log-Linear Models and Logistic Regression, 2nd edition, Springer-Verlag (1997).

Prerequisite: Introductory statistics core course, including regression

"Standard" statistical analysis focuses on continuous data, and more specifically the normal (Gaussian) distribution. Despite this, many of the data problems faced in practice involve categorical data - that is, data where variables can take on only a finite number of values. For such data the normal distribution is no longer relevant, and concepts like association measures, model building, regression, and so on, must be reformulated.

This course addresses data of this sort. The guiding principle will be to highlight and exploit connections with Gaussian-based methodology, especially regression. My intention is to mix theory and methodology with practice throughout the course. We will see formulas, but we will also see data sets.

There are three versions of this course meeting simultaneously: the Stern Ph.D. course B90.3307, the MBA elective course B90.2337, and the undergraduate elective course C22.0010. While class sessions for the three versions of the course are identical, grading will be based on different sets of deliverables, reflecting a more theoretical approach for B90.3307 and a more applied approach for B90.2337 and C22.0010.

There will be no tests in this class. Grades will be based on homework and on two assignments, in each of which you will read and discuss a journal article related to categorical data (students in B90.3307 will discuss theory/methodology articles from statistics journals, while students in B90.2337/C22.0010 will discuss applications articles from functional journals).
 

Syllabus
 
(1) Introduction
    (a) The nature of categorical data
    (b) Important random variables: Gaussian, binomial, multinomial, Poisson
    (c) Goodness-of-fit
(2) Regression
    (a) Least squares regression - background
    (b) Poisson regression
    (c) Generalized linear models
(3) Two-way tables
    (a) Two-sample tests and comparisons of binomial proportions - background
    (b) 2×2 tables, tests of independence, and odds ratios
    (c) Analysis of variance - background
    (d) Log-linear models for two-way tables
    (e) I×J tables
    (f) Structural zeroes and quasi-independence models
    (g) Outliers in contingency tables
    (h) Models for symmetric tables and matched pairs data
(4) Tables with ordered categories
    (a) The bivariate normal distribution - background
    (b) Association models for two-way tables
    (c) Model selection
(5) Three-way and higher dimensional tables
    (a) Simpson's paradox
    (b) Independence and odds ratios
    (c) Log-linear models for three-way tables
    (d) Association graphs and collapsibility
    (e) Log-linear models for higher dimensional tables
    (f) Models for tables with ordered categories
(6) Logistic regression
    (a) Categorical predictors and log-linear models
    (b) Continuous predictors
(7) Sparse contingency tables
    (a) The properties of sparse tables
    (b) Exact and conditional small-sample inference
    (c) Sparse asymptotics
    (d) Smoothing sparse tables
    (e) Generalized additive models