Panel Data
Econometrics
Panel Data Sets
Professor W. Greene
Department of Economics
Office: MEC 7-90
Ph: +1-212-998-0876
E-mail: wgreene@stern.nyu.edu
Home Page: https://pages.stern.nyu.edu/wgreene
Return to course home page.
Notes: The following list points to a series of data sets. We will use
some of these in our class discussions. A number of others are provided for
students to analyze as part of their study of the topic. The Penn World
Tables are a major cross country data base online that provides a wealth of
interesting data. This can be accessed directly: The Penn World Tables
There are many other sources of data on the
web. One that is particularly rich is the archives of the Journal of
Applied Econometrics: (Click here to
visit)
Data below are provided in two formats: (1)
The 'csv format' is a plain vanilla ascii text file containing the variable
names at the top of the file followed by the variables, arranged neatly in the
file and separated by commas. This is a portable file that is readable by any
econometrics package. It will also import directly into Excel just by double
clicking its name. (2) If you are using LIMDEP or NLOGIT, the .lpj project file
can be imported directly into the program, as is. (The .csv file can as well.)
- Grunfeld
Investment Data, 10 Firms, 20 Years (1935-1954)
Variables in the file are
Firm = Firm ID, 1,...,10
Year = 1935,...,1954
I = Investment
F = Real Value of the Firm
C = Real Value of the Firm's
Capital Stock
Data are from the Ph.D. dissertation of Y. Grunfeld (Univ. of Chicago,
1958), See, e.g., Zellner, A., "An Efficient Method of Estimating
Seemingly Unrelated Regression Equations and Tests for Aggregation
Bias," Journal of the American Statistical Association, 57,
1962, pp. 348-368 for analyses of these data.
- Spanish
Dairy Farm Production, N = 247, T = 6
Variables in the file are
FARM = Farm ID
YEAR = year, 93, 94, ..., 98
Inputs
COWS,
X1 = log of, deviations from means (logs)
LAND,
X2 = same
LABOR, X3 = same
FEED,
X4 = same
Translog terms = squares and cross products: X11, X22, X33, X44, X12, X13, X14,X23,
X24, X34
YEAR93,...,YEAR98 = year dummy
variables
Output
MILK
= farm output
YIT =
log of MILK production
- Swiss Railways, N = 49, T(i) = 3
to 13
Variables in the file are
ID = Company ID from 1 to 50 (49 companies, 604
obs) ID = 21 zero count.
RAILROAD =
Company ID reduced to remove gap at ID=21, values 1 to 49.
YEAR = Year (1985 to 1997).
NI = Number of year observations for
firm, repeated in each observation.
STOPS =
Number of stations on the network.
NETWORK =
Length of railway network (m).
LABOREXP =
Labor expenses in 1000 CHF.
STAFF =
Number of employees.
ELECEXP =
Electricity expenses in 1000 CHF.
KWH = Total
consumed electricity (in 1000 kWh).
TOTCOST =
Total cost (in 1000 CHF).
NARROW_T = Dummy
for the networks with narrow track (1 m wide).
RACK = Dummy
for the networks with RACK RAIL (cremaillere) in at least some part;
(used
to maintain a slow movement of the train on high slopes).
TUNNEL = Dummy
for networks that have tunnels with an average length of more than 300 meters.
VIRAGE =
Dummy for the networks whose minimum radius of curvature is 100 meters or less.
CT = Total
costs adjusted for inflation (1000 CHF).
Q1 = Total
output in train kilometers.
Q2 = Total
passenger output in passenger kilometers.
Q3 = Total
goods output in ton kilometers.
PL = Labor
price adjusted for inflation (in CHF per person per year).
PK = Capital
price using the total number of seats as a proxy for capital stock (CHF per
seat).
PE = Price
of electricity (CHF per kWh).
LABOR =
Quantity of labor.
ELEC =
Quantity of energy.
CAPITAL =
Quantity of Capital
Capital
costs = TOTCOST- (LABOREXP + ELECEXP)
Inflation
adjustment is done with respect to 1997 prices.
Logs
of costs and prices (lnCT, lnPK, lnPL) are normalized by PE.
LNCT, LNPK, LNPL, LNQ1, LNQ2, LNQ3, LNSTOP, LNCAP,
LNNET = logs of variables
- Bank Cost
Data, 500 Banks, 5 Years:
Variables in the file are
Cit = total cost
of transformation of financial and physical resources into loans and
investments = the sum
of the five cost items described below;
Y1it =
installment loans to individuals for personal and household expenses;
Y2it = real
estate loans;
Y3it =
business loans;
Y4it = federal
funds sold and securities purchased under agreements to resell;
Y5it = other
assets;
W1it = price
of labor, average wage per employee;
W2it = price
of capital = expenses on premises and fixed assets divided by the dollar
value of
of premises and fixed assets;
W3it = price
of purchased funds = interest expense on money market deposits plus
expense of
federal funds purchased and securities
sold under agreements to repurchase plus interest
expense on demand notes issued by the U.S.
Treasury divided by the dollar value of
purchased funds;
W4it = price
of interest-bearing deposits in total transaction accounts = interest
expense on
interest-bearing categories of total
transaction accounts;
W5it = price
of interest-bearing deposits in total nontransaction accounts = interest
expense on
total deposits minus interest expense on
money market deposit accounts divided by the
dollar value of interest-bearing deposits
in total nontransaction accounts;
T = trend
variable, t = 1,2,3,4,5 for years 1996, 1997, 1998, 1999, 2000
The data in the file are for a translog cost function, linearly
homogeneous in the input prices. Specifically,
C = log(Cost/W5), W1,W2,W3,W4
= log(Wj/W5), Q1,...,Q5 = log(Ym), and the squared and cross product
terms are W11, W12,..., Q11,Q12,...,
W1Q1,...,W4Q5, T, T2, TW1,...,TW4, TQ1,...,TQ5.
- Dahlberg
and Johansson Municipal Expenditure Data, 265 Swedish Municipalities, 9
years
Variables in the file are
ID = Identification, 1,..., 265
YEAR = year, 1979,...,1987
EXPEND = Expenditures
REVENUE = Receipts, taxes and
Fees
GRANTS = Government grants and
shared tax revenues
See Greene (2003, pp. 551 and elsewhere) for analysis of these data. The
article on which the analysis is based is Dahlberg, M. and E. Johannson,
E., "An Examination of the Dynamic Behavior of Local Governments
using GMM Bootstrapping Methods," Journal of Applied Econometrics,
15, 2000, pp. 401-416. (These data were downloaded from the JAE data
archive.)
- World
Gasoline Demand Data, 18 OECD Countries, 19 years
Variables in the file are
COUNTRY = name of country (Does not
appear in the LIMDEP project file)
YEAR = year, 1960-1978
LGASPCAR = log of consumption
per car
LINCOMEP = log of per capita
income
LRPMG = log of real price of
gasoline
LCARPCAP = log of per capita
number of cars
See Baltagi (2001, p. 24) for analysis of these data. The article on which
the analysis is based is Baltagi, B. and Griffin, J., "Gasolne Demand
in the OECD: An Application of Pooling and Testing Procedures,"
European Economic Review, 22, 1983, pp. 117-137. The data were downloaded
from the website for Baltagi's text.
- Statewide
Capital Productivity Data, lower 48 states, 17 years
Variables in the file are
STATE = state name
ST_ABB = state abbreviation
YR = year, 1970,...,1986
P_CAP = public capital
HWY = highway capital
WATER = water utility capital
UTIL = utility capital
PC = private capital
GSP = gross state product
EMP = employment
UNEMP = unemployment rate
See Baltagi (2001, p. 25) for analysis of these data. The article on which
the analysis is based is Munell, A., "Why has Productivity Declined?
Productivity and Putlic Investment," New England Economic Review, 1990, pp. 3-22. The data
were downloaded from the website for Baltagi's text.
- Cornwell and
Rupert Returns to Schooling Data, 595 Individuals, 7 Years
Variables in the file are
EXP = work experience
WKS = weeks worked
OCC = occupation, 1 if blue
collar,
IND = 1 if manufacturing
industry
SOUTH = 1 if resides in south
SMSA = 1 if resides in a city
(SMSA)
MS = 1 if married
FEM = 1 if female
UNION = 1 if wage set by unioin
contract
ED = years of education
BLK = 1 if individual is black
LWAGE = log of wage
These data were analyzed in Cornwell, C. and Rupert, P., "Efficient
Estimation with Panel Data: An Empirical Comparison of Instrumental
Variable Estimators," Journal of Applied Econometrics, 3, 1988, pp.
149-155. See Baltagi, page 122 for further analysis. The data
were downloaded from the website for Baltagi's text.
- German
Manufacturing Innovation Data, 1,270 Firms, 5 years
Variables in the file are
YEAR = year, 1994-1998
FIRM = 1,...,1,270
IP = Product or innovation
occurrec, 0/1 variable
EMPL = employment
IM = Imports
IMUM = import share in industry
FDIUM = FDI share in industry
PROD = productivity measure
LOGSALES = log of industry sales
RAWMTL = dummy for firm in raw
materials industry
INVGOOD = dummy for firm in
investment goods industry
CONSGOOD = dummy for firm in
consumer goods industry
FOOD = dummy for firm in food
industry
These data were analyzed in Bertschek, I. and M. Lechner, "Convenient
Estimators for the Panel Probit Model," Journal of Econometrics, 87, 2, 1998, pp. 329-372. See,
also, Greene, Econometric Analysis, 5th ed., (2003) for various analyses,
and Greene, W. "Convenient Estimators for the Panel Probit Model:
Further Results, Empirical Economics,
2004. These data are not publicly available. Extracts from the
data set will be provided in class.
- German
Health Care Usage Data, 7,293 Individuals, Varying Numbers of Periods
Data downloaded from Journal of
Applied Econometrics Archive. This is an unbalanced panel with 7,293
individuals. They can be used for regression, count models, binary choice,
ordered choice, and bivariate binary choice. This is a large data
set. There are altogether 27,326 observations. The number of
observations ranges from 1 to 7. (Frequencies are: 1=1525, 2=2158,
3=825, 4=926, 5=1051, 6=1000, 7=987). Note, the variable NUMOBS
below tells how many observations there are for each person. This
variable is repeated in each row of the data for the person.
Variables in the file are
ID = person - identification
number
FEMALE = female = 1; male
= 0
YEAR = calendar year of the
observation
AGE = age in years
AGESQ = age squared.
HSAT = health
satisfaction, coded 0 (low) - 10 (high) Note, this variable has 40
coding errors. Variable NEWHSAT below fixes them.
HANDDUM = handicapped = 1;
otherwise = 0
HANDPER = degree of handicap in
percent (0 - 100)
HHNINC = household nominal
monthly net income in German marks / 10000
HHKIDS = children under age 16
in the household = 1; otherwise = 0
EDUC = years of schooling
MARRIED = married = 1;
otherwise = 0
HAUPTS = highest schooling
degree is Hauptschul degree = 1; otherwise = 0
REALS = highest schooling
degree is Realschul degree = 1; otherwise = 0
FACHHS = highest schooling
degree is Polytechnical degree = 1; otherwise = 0
ABITUR = highest schooling
degree is Abitur = 1; otherwise = 0
UNIV = highest schooling
degree is university degree = 1; otherwise = 0
WORKING = employed = 1;
otherwise = 0
BLUEC = blue collar employee =
1; otherwise = 0
WHITEC = white collar employee =
1; otherwise = 0
SELF = self employed = 1;
otherwise = 0
BEAMT = civil servant = 1;
otherwise = 0
DOCVIS = number of doctor
visits in last three months
HOSPVIS = number of
hospital visits in last calendar year
DOCTOR = dummy variable = 1 if DOCVIS > 0, 0 otherwise.
HOSPITAL = dummy variable = 1 if
HOSPVIS > 0, 0 otherwise.
PUBLIC = insured in public
health insurance = 1; otherwise = 0
ADDON = insured by add-on
insurance = 1; otherswise = 0
NUMOBS = number of
observations for this person. Repeated in each row of data.
NEWHSAT = recoded value of HSAT
with coding errors corrected.
- World Health
Organization Panel Data on Health Care Attainment: 191 Countries, 5
Years (Some countries fewer)
These data have been used by many researchers to study the Health Care
Survey assembled by WHO as part of the Year 2000 World Health Report. On
the course bibliography, see, for example, Greene (2004a). Note,
variables marked * were updated with more recent sources in Greene
(2004a). Missing values for some of the variables in this data set are
filled by using fitted values from a linear regression. To set the
proper sample for panel data analysis, use observations for which SMALL = 0. To obtain the
balanced panel, then use only observations with GROUPTI = 5.
The variables in the file are
COMP = composite measure of
health care attainment; LCOMP =
logCOMP
DALE = Disability adjusted life
expectancy (other measure); LDALE
= logDALE
YEAR = 1993,...,1997; TIME = 1,2,3,4,5; T93, T94, T95, T96, T97 = year
dummy variables
HEXP = per capita health
expenditure; LHEXP = logHEXP; LHEXP2 = log-squaredHEXP
HC3 = educational attainment; LHC = logHC3; LHC2 =
log-squaredHC3; LHEXPHC = logHEXP * logHC3
SMALL = indicator for states,
provinces, etc. SMALL > 0
implies internal political unit, = 0 implies country observation
COUNTRY = number assigned to
country
STRATUM = another country
indicator
GROUPTI = number of observations
when SMALL = 0. Usually 5, some
= 1, one country = 4.
OECD = dummy variable for OECD country (30 countries)
GINI = gini coefficient for
income inequality
GEFF = world bank measure of
government effectiveness*
VOICE = world bank measure of
democratization of the political process*
TROPICS = dummy variable for
tropical location
POPDEN = population density*
PUBTHE = proportion of health
expenditure paid by bublic authorities
GDPC = normalized per capita
GDP; LGDPC = logGDPC; LGDPC2 = log-squaredGDPC
- Stated Discrete Choice 400 individuals, 8 choice
situations, 4 alternatives:
Data are a simulated discrete choice experiment. 400 individuals choose a
brand among 3 alternatives, or none of the above, the 4th
choice. 8 different situations are given to each individual.
Variables in the file are
id
= person identification
BRAND
= 1,2,3,4
CHOICE =
alternative chosen (0=no, 1=yes);
FASH = dummy coded
fashion (0=no, 1=yes);
QUAL = dummy coded
quality (0=low, 1=high);
PRICE = price,
level coded, .04, .08, .12, .16, .20;
PRICESQ = price
squared;
ASC4 = dummy
variable for 4th choice (no brand);;
MALE = dummy
variable for male (0=no, 1=yes);
AGE25 = age less than 25 (0=no,
1=yes)
AGE39 = age 25 to 39
(0=no, 1=yes),
AGE40 = age greater than 40 (0=no,
1=yes)
- Stated Discrete Choice Experiment - California Public
Utilities Survey Data:
These data were provided generously by Kenneth Train for use in teaching
about stated choice modeling. We gratefully acknowledge his assistance.
This is a real stated choice experiment in which 361 customers were
surveyed about their preferences for an electricity supplier. Each
customer responded to from 8 to 12 choice situations in which four
configurations were offered. There are altogether 17,232 rows in the data
set comprising 4,308 choice tasks.
The
variables in the file are
ID =
respondent ID number
CHOICE = binary
indicator of the preferred alternative.
NTASK = number of
tasks ranging from 8 to 12, repeated on each of the 4*NTASK records
PRICE = fixed
price in cents per kilowatt hour
CNTLNGTH = contract
length
LOCAL = dummy
variable for local utility
KNOWN = dummy
variable for well known company
TOD = dummy
variable for time of day rates, 11 cents in day, 5 cents at night
SEAS = dummy
variable for seasonal rates (10 cents summer, 8 cents winter, 6 cents
spring/fall)
NUMREC = 4*NTASK
- Koop and Tobias (2004) Labor Market Experience Data.
(See Koop, G. and J. Tobias, "Learning About Heterogeneity in Returns
to Schooling," Journal of Applied Econometrics, 19, 2004, pp.
827-849.
The data file is in two parts. The
first file contains the panel of 17,919 observations on the Person ID and 4
time-varying variables. The second file contains time invariant variables for
the individual or the 2,178 households. The data were downloaded from the
Journal of Applied Econometrics archive website. The two data sets are merged
in the .csv and .lpj files noted.
Variables in the file are
Time Varying
PERSONID = Person id (ranging
from 1 to 2,178),
EDUC = Education,
LOGWAGE = Log of hourly wage,
POTEXPER = Potential experience,
TIMETRND = Time trend.
Time Invariant
ABILITY = Ability,
MOTHERED = Mother's education,
FATHERED = Father's education,
BRKNHOME = Dummy variable for
residence in a broken home,
SIBLINGS = Number of siblings.
- Data on Industry Spillovers in Production
Used in Bloom, N.,
M. Schankerman and J. Van Reenen, "Identifying Technology
Spillovers and Product Market Rivalry," Econometrica, vol. 81(4), pages 1347-1393, 07, 2013.
Also Burda, M. and M. Harding, "Panel Probit with Flexible Correlated Effects: Quantifying
Technology Spillovers in the Presence of Latent Heterogeneity," Journal of Applied Econometrics,
28, 6, 2013, pp. 956-981. The data were downloaded from the Journal of Applied Econometrics
website. Unbalanced panel, 729 firms, 1 to 21 years of data per firm.
12928 rows of data in total.
The Variables in the file
are:
Firm = 1 to 729
Year = 1981 - 2001
Pat_any = 1 if firm had any patents in that year
LGSPILLT = lagged log of stock of tec weighted R&D (Jaffe
distance)
LGSPILLS = lagged log of stock of sic weighted R&D (Jaffe
distance)
LGMALSPI = lagged log of stock of tec weighted R&D (Mahalanobis
distance)
LGMALSPT = lagged log of stock of sic weighted R&D (Mahalanobis
distance)
LGRD1 = lagged log stock of R&D expenditures (coded -1 for
missing)
LSALES1 = lagged log sales
LGRD1_DU = dummy variable indicates missing value of LGRD1
TI = Number of observations for firm i
T = year - 1981. 1 to 21
LGSTBAR = Firm mean of LGSPILLT
LGSPBAR = Firm mean of LGSPILLS
LGMSIBAR = Firm mean of LGMALSPI
LGMSTBAR = Firm mean of LGMALSPT
LSALEBAR = Firm mean of LSALES1
LGRD1BAR = Firm mean of LGRD1
- Data on Crime in North Carolina
Used in Cornwell, C. and W. Trumbull,
"Estimating the Economic Model of Crime with Panel Data," Review of Economics and Statistics,
76, 1994, pp. 360-366 and Baltagi, B., "Estimating
an Economic Model of Crime Using Panel Data from North Carolina," Journal
of Applied Econometrics, 21, 2006, pp. 543-547
Balanced panel, 90 counties and 7 years, 630 rows of data in
total..
The last eight (constructed) variables are missing for 1981. Missing
variables are denoted by a "." in the text file.
The 59 Variables in the file
are:
1. county = county
identifier
2. year = 81 to 87
3. crmrte = crimes
committed per person
4. prbarr =
'probability' of arrest
5. prbconv =
'probability' of conviction
6. prbpris =
'probability' of prison sentenc
7. avgsen = avg.
sentence, days
8. polpc = police per
capita
9. density = people per
sq. mile
10. taxpc = tax revenue
per capita
11. west = 1 if in
western N.C.
12. central = 1 if in
central N.C.
13. urban = 1 if in SMSA
14. pctmin80 = perc.
minority, 1980
15. wcon = weekly wage,
construction
16. wtuc = wkly wge,
trns, util, commun
17. wtrd = wkly wge,
whlesle, retail trade
18. wfir = wkly wge,
fin, ins, real est
19. wser = wkly wge,
service industry
20. wmfg = wkly wge,
manufacturing
21. wfed = wkly wge, fed
employees
22. wsta = wkly wge,
state employees
23. wloc = wkly wge,
local gov emps
24. mix = offense mix:
face-to-face/other
25. pctymle = percent
young male
26. d82 = 1 if year = 82
27. d83 = 1 if year = 83
28. d84 = 1 if year = 84
29. d85 = 1 if year = 85
30. d86 = 1 if year = 86
31. d87 = 1 if year = 87
32. lcrmrte =
log(crmrte)
33. lprbarr =
log(prbarr)
34. lprbconv =
log(prbconv)
35. lprbpris =
log(prbpris)
36. lavgsen =
log(avgsen)
37. lpolpc = log(polpc)
38. ldensity =
log(density)
39. ltaxpc = log(taxpc)
40. lwcon = log(wcon)
41. lwtuc = log(wtuc)
42. lwtrd = log(wtrd)
43. lwfir = log(wfir)
44. lwser = log(wser)
45. lwmfg = log(wmfg)
46. lwfed = log(wfed)
47. lwsta = log(wsta)
48. lwloc = log(wloc)
49. lmix = log(mix)
50. lpctymle =
log(pctymle)
51. lpctmin =
log(pctmin)
52. clcrmrte = lcrmrte -
lcrmrte[t-1]
53. clprbarr = lprbarr -
lprbarr[t-1]
54. clprbcon = lprbconv
- lprbconv[t-1]
55. clprbpri = lprbpri -
lprbpri[t-1]
56. clavgsen = lavgsen -
lavgsen[t-1]
57. clpolpc = lpolpc -
lpolpc[t-1]
58. cltaxpc = ltaxpc -
ltaxpc[t-1]
59. clmix = lmix -
lmix[t-1]
- Arellano-Bond: Dynamic Panel Data Modeling. UK Industry
Data
Data on 140 firms observed in 7, 8 or 9 years. Unbalanced panel. 1031
observations in total.
The
variables in the file are
IND = industry
code
YEAR = year,
1977 to 1984
EMP = firm
employment
WAGE = wage
CAP = capital
INDOUTPT = industry
output
NI = log EMP
W = log WAGE
K = log CAP
YS = log
IDOUTPT
REC = line number
1-1031
YEARM1 = lag of
YEAR.
ID = firm id
number
NL1 = lag 1 of
N
NL2 = lag 2 of
N
WL1 = lag 1 of
W
WL2 = lag 2 of
W
KL1 = lag 1 of
K
KL2 = lag 2 of
K
YSL1 = lag 1 of
YS
YSL2 = lag 2 of
YS
YR1976
to YR1984
are year dummy variables
Return to course home
page.