**Panel Data
Econometrics**

**Panel Data Sets**

Professor W. Greene

Department of Economics

Office: MEC 7-90

Ph: +1-212-998-0876

E-mail: wgreene@stern.nyu.edu

Home Page: http://people.stern.nyu.edu/wgreene

**Notes:** The following list points to a series of data sets. We will use
some of these in our class discussions. A number of others are provided for
students to analyze as part of their study of the topic. The Penn World
Tables are a major cross country data base online that provides a wealth of
interesting data. This can be accessed directly: The Penn World Tables

There are many other sources of data on the
web. One that is particularly rich is the archives of the Journal of
Applied Econometrics: (Click here to
visit)

Data below are provided in two formats: (1)
The 'csv format' is a plain vanilla ascii text file containing the variable
names at the top of the file followed by the variables, arranged neatly in the
file and separated by commas. This is a portable file that is readable by any
econometrics package. It will also import directly into Excel just by double
clicking its name. (2) If you are using LIMDEP or NLOGIT, the .lpj project file
can be imported directly into the program, as is. (The .csv file can as well.)

**Grunfeld Investment Data, 10 Firms, 20 Years (1935-1954)**

**Variables in the file are**

Firm = Firm ID, 1,...,10

Year = 1935,...,1954

*I*= Investment

*F*= Real Value of the Firm

*C*= Real Value of the Firm's Capital Stock

Data are from the Ph.D. dissertation of Y. Grunfeld (Univ. of Chicago, 1958), See, e.g., Zellner, A., "An Efficient Method of Estimating Seemingly Unrelated Regression Equations and Tests for Aggregation Bias,"*Journal of the American Statistical Association*, 57, 1962, pp. 348-368 for analyses of these data.**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)**Spanish Dairy Farm Production,**N = 247, T = 6

**Variables in the file are**

*FARM*= Farm ID

*YEAR*= year, 93, 94, ..., 98

Inputs

*COWS, X1*= log of, deviations from means (logs)

*LAND, X2*= same

*LABOR, X3*= same

*FEED, X4*= same

Translog terms = squares and cross products:*X11, X22, X33, X44, X12, X13, X14,X23, X24, X34*

YEAR93,...,YEAR98 = year dummy variables

Output

*MILK*= farm output

*YIT*= log of MILK production**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Swiss Railways,**N = 49, T(i) = 3 to 13

**Variables in the file are**

*ID*= Company ID from 1 to 50 (49 companies, 604 obs) ID = 21 zero count.

*RAILROAD* =
Company ID reduced to remove gap at ID=21, values 1 to 49.

*YEAR* = Year (1985 to 1997).

*NI* = Number of year observations for
firm, repeated in each observation.

*STOPS* =
Number of stations on the network.

*NETWORK* =
Length of railway network (m).

*LABOREXP* =
Labor expenses in 1000 CHF.

*STAFF* =
Number of employees.

*ELECEXP* =
Electricity expenses in 1000 CHF.

*KWH* = Total
consumed electricity (in 1000 kWh).

*TOTCOST* =
Total cost (in 1000 CHF).

*NARROW_T* = Dummy
for the networks with narrow track (1 m wide).

*RACK* = Dummy
for the networks with RACK RAIL (cremaillere) in at least some part;

(used
to maintain a slow movement of the train on high slopes).

*TUNNEL* = Dummy
for networks that have tunnels with an average length of more than 300 meters.

*VIRAGE* =
Dummy for the networks whose minimum radius of curvature is 100 meters or less.

*CT* = Total
costs adjusted for inflation (1000 CHF).

*Q1* = Total
output in train kilometers.

*Q2* = Total
passenger output in passenger kilometers.

*Q3* = Total
goods output in ton kilometers.

*PL* = Labor
price adjusted for inflation (in CHF per person per year).

*PK* = Capital
price using the total number of seats as a proxy for capital stock (CHF per
seat).

*PE* = Price
of electricity (CHF per kWh).

*LABOR* =
Quantity of labor.

*ELEC* =
Quantity of energy.

*CAPITAL* =
Quantity of Capital

Capital
costs = *TOTCOST*- (*LABOREXP* + *ELECEXP*)

Inflation
adjustment is done with respect to 1997 prices.

Logs
of costs and prices (ln*CT*, ln*PK*, ln*PL*) are normalized by *PE*.

*LNCT, LNPK, LNPL, LNQ1, LNQ2, LNQ3*, *LNSTOP, LNCAP,
LNNET* = logs of variables

**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Bank Cost Data, 500 Banks, 5 Years:**

**Variables in the file are**

*Cit*= total cost of transformation of financial and physical resources into loans and

investments = the sum of the five cost items described below;

*Y1it*= installment loans to individuals for personal and household expenses;

*Y2it*= real estate loans;

*Y3it*= business loans;

*Y4it*= federal funds sold and securities purchased under agreements to resell;

*Y5it*= other assets;

*W1it*= price of labor, average wage per employee;

*W2it*= price of capital = expenses on premises and fixed assets divided by the dollar value of

of premises and fixed assets;

*W3it*= price of purchased funds = interest expense on money market deposits plus expense of

federal funds purchased and securities sold under agreements to repurchase plus interest

expense on demand notes issued by the U.S. Treasury divided by the dollar value of

purchased funds;

*W4it*= price of interest-bearing deposits in total transaction accounts = interest expense on

interest-bearing categories of total transaction accounts;

*W5it*= price of interest-bearing deposits in total nontransaction accounts = interest expense on

total deposits minus interest expense on money market deposit accounts divided by the

dollar value of interest-bearing deposits in total nontransaction accounts;

*T*= trend variable, t = 1,2,3,4,5 for years 1996, 1997, 1998, 1999, 2000

The data in the file are for a translog cost function, linearly homogeneous in the input prices. Specifically,

*C*= log(*Cost/W5*),*W1,W2,W3,W4*= log(*Wj/W5*),*Q1*,...,*Q5*= log(Ym), and the squared and cross product

terms are*W11, W12,..., Q11,Q12,..., W1Q1,...,W4Q5, T, T2, TW1,...,TW4, TQ1,...,TQ5*.**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Dahlberg and Johansson Municipal Expenditure Data, 265 Swedish Municipalities, 9 years**

**Variables in the file are**

*ID*= Identification, 1,..., 265

*YEAR*= year, 1979,...,1987

*EXPEND*= Expenditures

*REVENUE*= Receipts, taxes and Fees

*GRANTS*= Government grants and shared tax revenues

See Greene (2003, pp. 551 and elsewhere) for analysis of these data. The article on which the analysis is based is Dahlberg, M. and E. Johannson, E., "An Examination of the Dynamic Behavior of Local Governments using GMM Bootstrapping Methods," Journal of Applied Econometrics, 15, 2000, pp. 401-416. (These data were downloaded from the JAE data archive.)**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**World Gasoline Demand Data, 18 OECD Countries, 19 years**

**Variables in the file are**

*COUNTRY*= name of country (Does not appear in the LIMDEP project file)

*YEAR*= year, 1960-1978

*LGASPCAR*= log of consumption per car

*LINCOMEP*= log of per capita income

*LRPMG*= log of real price of gasoline

*LCARPCAP*= log of per capita number of cars

See Baltagi (2001, p. 24) for analysis of these data. The article on which the analysis is based is Baltagi, B. and Griffin, J., "Gasolne Demand in the OECD: An Application of Pooling and Testing Procedures," European Economic Review, 22, 1983, pp. 117-137. The data were downloaded from the website for Baltagi's text.**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Statewide Capital Productivity Data, lower 48 states, 17 years**

**Variables in the file are**

*STATE*= state name

*ST_ABB*= state abbreviation

*YR*= year, 1970,...,1986

*P_CAP*= public capital

*HWY*= highway capital

*WATER*= water utility capital

*UTIL*= utility capital

*PC*= private capital

*GSP*= gross state product

*EMP*= employment

*UNEMP*= unemployment rate

See Baltagi (2001, p. 25) for analysis of these data. The article on which the analysis is based is Munell, A., "Why has Productivity Declined? Productivity and Putlic Investment,"*New England Economic Review*, 1990, pp. 3-22. The data were downloaded from the website for Baltagi's text.**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Cornwell and Rupert Returns to Schooling Data, 595 Individuals, 7 Years**

**Variables in the file are**

*EXP*= work experience

*WKS*= weeks worked

*OCC*= occupation, 1 if blue collar,

*IND*= 1 if manufacturing industry

*SOUTH*= 1 if resides in south

*SMSA*= 1 if resides in a city (SMSA)

*MS*= 1 if married

*FEM*= 1 if female

*UNION*= 1 if wage set by unioin contract

*ED*= years of education

*BLK*= 1 if individual is black

*LWAGE*= log of wage

These data were analyzed in Cornwell, C. and Rupert, P., "Efficient Estimation with Panel Data: An Empirical Comparison of Instrumental Variable Estimators," Journal of Applied Econometrics, 3, 1988, pp. 149-155. See Baltagi, page 122 for further analysis. The data were downloaded from the website for Baltagi's text.**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**German Manufacturing Innovation Data, 1,270 Firms, 5 years**

**Variables in the file are**

*YEAR*= year, 1994-1998

*FIRM*= 1,...,1,270

*IP*= Product or innovation occurrec, 0/1 variable

*EMPL*= employment

*IM*= Imports

*IMUM*= import share in industry

*FDIUM*= FDI share in industry

*PROD*= productivity measure

*LOGSALES*= log of industry sales

*RAWMTL*= dummy for firm in raw materials industry

*INVGOOD*= dummy for firm in investment goods industry

*CONSGOOD*= dummy for firm in consumer goods industry

*FOOD*= dummy for firm in food industry

These data were analyzed in Bertschek, I. and M. Lechner, "Convenient Estimators for the Panel Probit Model,"*Journal of Econometrics*, 87, 2, 1998, pp. 329-372. See, also, Greene, Econometric Analysis, 5th ed., (2003) for various analyses, and Greene, W. "Convenient Estimators for the Panel Probit Model: Further Results,*Empirical Economics*, 2004. These data are not publicly available. Extracts from the data set will be provided in class.

**German Health Care Usage Data, 7,293 Individuals, Varying Numbers of Periods**

Data downloaded from*Journal of Applied Econometrics Archive*. This is an unbalanced panel with 7,293 individuals. They can be used for regression, count models, binary choice, ordered choice, and bivariate binary choice. This is a large data set. There are altogether 27,326 observations. The number of observations ranges from 1 to 7. (Frequencies are: 1=1525, 2=2158, 3=825, 4=926, 5=1051, 6=1000, 7=987). Note, the variable NUMOBS below tells how many observations there are for each person. This variable is repeated in each row of the data for the person.

**Variables in the file are**

*ID*= person - identification number

*FEMALE*= female = 1; male = 0

*YEAR*= calendar year of the observation

*AGE*= age in years

*AGESQ*= age squared.

*HSAT*= health satisfaction, coded 0 (low) - 10 (high) Note, this variable has 40 coding errors. Variable NEWHSAT below fixes them.

*HANDDUM*= handicapped = 1; otherwise = 0

*HANDPER*= degree of handicap in percent (0 - 100)

*HHNINC*= household nominal monthly net income in German marks / 10000

*HHKIDS*= children under age 16 in the household = 1; otherwise = 0

*EDUC*= years of schooling

*MARRIED*= married = 1; otherwise = 0

*HAUPTS*= highest schooling degree is Hauptschul degree = 1; otherwise = 0

*REALS*= highest schooling degree is Realschul degree = 1; otherwise = 0

*FACHHS*= highest schooling degree is Polytechnical degree = 1; otherwise = 0

*ABITUR*= highest schooling degree is Abitur = 1; otherwise = 0

*UNIV*= highest schooling degree is university degree = 1; otherwise = 0

*WORKING*= employed = 1; otherwise = 0

*BLUEC*= blue collar employee = 1; otherwise = 0

*WHITEC*= white collar employee = 1; otherwise = 0

*SELF*= self employed = 1; otherwise = 0

*BEAMT*= civil servant = 1; otherwise = 0

*DOCVIS*= number of doctor visits in last three months

*HOSPVIS*= number of hospital visits in last calendar year

*DOCTOR*= dummy variable = 1 if*DOCVIS*> 0, 0 otherwise.

*HOSPITAL*= dummy variable = 1 if*HOSPVIS*> 0, 0 otherwise.

*PUBLIC*= insured in public health insurance = 1; otherwise = 0

*ADDON*= insured by add-on insurance = 1; otherswise = 0

*NUMOBS*= number of observations for this person. Repeated in each row of data.

*NEWHSAT*= recoded value of HSAT with coding errors corrected.**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**World Health Organization Panel Data on Health Care Attainment: 191 Countries, 5 Years (Some countries fewer)**

These data have been used by many researchers to study the Health Care Survey assembled by WHO as part of the Year 2000 World Health Report. On the course bibliography, see, for example, Greene (2004a). Note, variables marked * were updated with more recent sources in Greene (2004a). Missing values for some of the variables in this data set are filled by using fitted values from a linear regression. To set the proper sample for panel data analysis, use observations for which*SMALL*= 0. To obtain the balanced panel, then use only observations with GROUPTI = 5.

**The variables in the file are**

*COMP*= composite measure of health care attainment;*LCOMP*= log*COMP*

*DALE*= Disability adjusted life expectancy (other measure);*LDALE*= log*DALE*

*YEAR*= 1993,...,1997;*TIME*= 1,2,3,4,5;*T93, T94, T95, T96, T97*= year dummy variables

*HEXP*= per capita health expenditure;*LHEXP*= log*HEXP*;*LHEXP2*= log-squared*HEXP*

*HC3*= educational attainment;*LHC*= log*HC3*;*LHC2*= log-squared*HC3*;*LHEXPHC*= log*HEXP** log*HC3*

*SMALL*= indicator for states, provinces, etc.*SMALL*> 0 implies internal political unit, = 0 implies country observation

*COUNTRY*= number assigned to country

*STRATUM*= another country indicator

*GROUPTI*= number of observations when*SMALL*= 0. Usually 5, some = 1, one country = 4.

*OECD*= dummy variable for*OECD*country (30 countries)

*GINI*= gini coefficient for income inequality

*GEFF*= world bank measure of government effectiveness*

*VOICE*= world bank measure of democratization of the political process*

*TROPICS*= dummy variable for tropical location

*POPDEN*= population density*

*PUBTHE*= proportion of health expenditure paid by bublic authorities

*GDPC*= normalized per capita GDP;*LGDPC*= log*GDPC*;*LGDPC2*= log-squared*GDPC***Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Stated Discrete Choice 400 individuals, 8 choice situations, 4 alternatives:**

Data are a simulated discrete choice experiment. 400 individuals choose a brand among 3 alternatives, or none of the above, the 4^{th}choice. 8 different situations are given to each individual.

**Variables in the file are
**

*BRAND*
= 1,2,3,4

*CHOICE* =
alternative chosen (0=no, 1=yes);

*FASH* = dummy coded
fashion (0=no, 1=yes);

*QUAL* = dummy coded
quality (0=low, 1=high);

*PRICE* = price,
level coded, .04, .08, .12, .16, .20;

*PRICESQ* = price
squared;

*ASC4* = dummy
variable for 4^{th} choice (no brand);;

*MALE* = dummy
variable for male (0=no, 1=yes);

*AGE25* = age less than 25 (0=no,
1=yes)

*AGE39* = age 25 to 39
(0=no, 1=yes),

*AGE40* = age greater than 40 (0=no,
1=yes)

**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Stated Discrete Choice Experiment - California Public Utilities Survey Data:**

These data were provided generously by Kenneth Train for use in teaching about stated choice modeling. We gratefully acknowledge his assistance. This is a real stated choice experiment in which 361 customers were surveyed about their preferences for an electricity supplier. Each customer responded to from 8 to 12 choice situations in which four configurations were offered. There are altogether 17,232 rows in the data set comprising 4,308 choice tasks.

**The
variables in the file are**

*ID* =
respondent ID number

*CHOICE* = binary
indicator of the preferred alternative.

*NTASK* = number of
tasks ranging from 8 to 12, repeated on each of the 4*NTASK records

*PRICE* = fixed
price in cents per kilowatt hour

*CNTLNGTH* = contract
length

*LOCAL* = dummy
variable for local utility

*KNOWN* = dummy
variable for well known company

*TOD* = dummy
variable for time of day rates, 11 cents in day, 5 cents at night

*SEAS* = dummy
variable for seasonal rates (10 cents summer, 8 cents winter, 6 cents
spring/fall)

*NUMREC* = 4**NTASK*

**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Koop and Tobias (2004) Labor Market Experience Data.**

(See Koop, G. and J. Tobias, "Learning About Heterogeneity in Returns to Schooling," Journal of Applied Econometrics, 19, 2004, pp. 827-849.

The data file is in two parts. The
first file contains the panel of 17,919 observations on the Person ID and 4
time-varying variables. The second file contains time invariant variables for
the individual or the 2,178 households. The data were downloaded from the
Journal of Applied Econometrics archive website. The two data sets are merged
in the .csv and .lpj files noted.

**Variables in the file are
Time Varying**

*PERSONID* = Person id (ranging
from 1 to 2,178),

*EDUC* = Education,

*LOGWAGE* = Log of hourly wage,

*POTEXPER* = Potential experience,

*TIMETRND* = Time trend.

**Time Invariant**

*ABILITY* = Ability,

*MOTHERED* = Mother's education,

*FATHERED* = Father's education,

*BRKNHOME* = Dummy variable for
residence in a broken home,

*SIBLINGS* = Number of siblings.

**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Data on Industry Spillovers in Production**

Used in Bloom, N., M. Schankerman and J. Van Reenen, "Identifying Technology Spillovers and Product Market Rivalry,"*Econometrica*, vol. 81(4), pages 1347-1393, 07, 2013. Also Burda, M. and M. Harding, "Panel Probit with Flexible Correlated Effects: Quantifying Technology Spillovers in the Presence of Latent Heterogeneity,"*Journal of Applied Econometrics*, 28, 6, 2013, pp. 956-981. The data were downloaded from the*Journal of Applied Econometrics*website. Unbalanced panel, 729 firms, 1 to 21 years of data per firm. 12928 rows of data in total.

**The Variables in the file
are:**

*Firm* = 1 to 729

*Year* = 1981 - 2001

*Pat_any* = 1 if firm had any patents in that year

*LGSPILLT* = lagged log of stock of tec weighted R&D (Jaffe
distance)

*LGSPILLS* = lagged log of stock of sic weighted R&D (Jaffe
distance)

*LGMALSPI* = lagged log of stock of tec weighted R&D (Mahalanobis
distance)

*LGMALSPT* = lagged log of stock of sic weighted R&D (Mahalanobis
distance)

*LGRD1* = lagged log stock of R&D expenditures (coded -1 for
missing)

*LSALES1* = lagged log sales

*LGRD1*_DU = dummy variable indicates missing value of LGRD1

*TI *= Number of observations for firm i

*T* = year - 1981. 1 to 21

*LGSTBAR* = Firm mean of LGSPILLT

*LGSPBAR* = Firm mean of LGSPILLS

*LGMSIBAR* = Firm mean of LGMALSPI

*LGMSTBAR* = Firm mean of LGMALSPT

*LSALEBAR* = Firm mean of LSALES1

*LGRD1BAR* = Firm mean of LGRD1

**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Data on Crime in North Carolina**

Used in**Cornwell, C. and W. Trumbull,**"Estimating the Economic Model of Crime with Panel Data,"*Review of Economics and Statistics*, 76, 1994, pp. 360-366 and**Baltagi, B.**, "Estimating an Economic Model of Crime Using Panel Data from North Carolina,"*Journal of Applied Econometrics*, 21, 2006, pp. 543-547

Balanced panel, 90 counties and 7 years, 630 rows of data in
total..

The last eight (constructed) variables are missing for 1981. Missing
variables are denoted by a "." in the text file.

**The 59 Variables in the file
are:**

1. county = county
identifier

2. year = 81 to 87

3. crmrte = crimes
committed per person

4. prbarr =
'probability' of arrest

5. prbconv =
'probability' of conviction

6. prbpris =
'probability' of prison sentenc

7. avgsen = avg.
sentence, days

8. polpc = police per
capita

9. density = people per
sq. mile

10. taxpc = tax revenue
per capita

11. west = 1 if in
western N.C.

12. central = 1 if in
central N.C.

13. urban = 1 if in SMSA

14. pctmin80 = perc.
minority, 1980

15. wcon = weekly wage,
construction

16. wtuc = wkly wge,
trns, util, commun

17. wtrd = wkly wge,
whlesle, retail trade

18. wfir = wkly wge,
fin, ins, real est

19. wser = wkly wge,
service industry

20. wmfg = wkly wge,
manufacturing

21. wfed = wkly wge, fed
employees

22. wsta = wkly wge,
state employees

23. wloc = wkly wge,
local gov emps

24. mix = offense mix:
face-to-face/other

25. pctymle = percent
young male

26. d82 = 1 if year = 82

27. d83 = 1 if year = 83

28. d84 = 1 if year = 84

29. d85 = 1 if year = 85

30. d86 = 1 if year = 86

31. d87 = 1 if year = 87

32. lcrmrte =
log(crmrte)

33. lprbarr =
log(prbarr)

34. lprbconv =
log(prbconv)

35. lprbpris =
log(prbpris)

36. lavgsen =
log(avgsen)

37. lpolpc = log(polpc)

38. ldensity =
log(density)

39. ltaxpc = log(taxpc)

40. lwcon = log(wcon)

41. lwtuc = log(wtuc)

42. lwtrd = log(wtrd)

43. lwfir = log(wfir)

44. lwser = log(wser)

45. lwmfg = log(wmfg)

46. lwfed = log(wfed)

47. lwsta = log(wsta)

48. lwloc = log(wloc)

49. lmix = log(mix)

50. lpctymle =
log(pctymle)

51. lpctmin =
log(pctmin)

52. clcrmrte = lcrmrte -
lcrmrte[t-1]

53. clprbarr = lprbarr -
lprbarr[t-1]

54. clprbcon = lprbconv
- lprbconv[t-1]

55. clprbpri = lprbpri -
lprbpri[t-1]

56. clavgsen = lavgsen -
lavgsen[t-1]

57. clpolpc = lpolpc -
lpolpc[t-1]

58. cltaxpc = ltaxpc -
ltaxpc[t-1]

59. clmix = lmix -
lmix[t-1]

**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)

**Arellano-Bond: Dynamic Panel Data Modeling. UK Industry Data**

Data on 140 firms observed in 7, 8 or 9 years. Unbalanced panel. 1031 observations in total.

**The
variables in the file are**

*IND* = industry
code

*YEAR* = year,
1977 to 1984

*EMP* = firm
employment

*WAGE* = wage

*CAP* = capital

*INDOUTPT* = industry
output

*NI* = log EMP

*W* = log WAGE

*K* = log CAP

*YS* = log
IDOUTPT

*REC* = line number
1-1031

*YEARM1* = lag of
YEAR.

*ID* = firm id
number

*NL1* = lag 1 of
N

*NL2* = lag 2 of
N

*WL1* = lag 1 of
W

*WL2* = lag 2 of
W

*KL1* = lag 1 of
K

*KL2* = lag 2 of
K

*YSL1* = lag 1 of
YS

*YSL2* = lag 2 of
YS

*YR1976
to YR1984*
are year dummy variables

**Excel Spreadsheet**(csv)**LIMDEP Project file**(lpj)