Cover Page

Panel Data Analysis Using EViews

I Gusti Ngurah Agung

Graduate School of Management Faculty of Economics and Business, University of Indonesia

Dedication

Dedicated to my wife

Anak Agung Alit Mas,

as well as all my Generation

Preface

The main objectives of this book are to present (1) various general equations of panel data models, with some specific models; (2) various illustrative statistical results based on selected specific models, with special notes and comments; and (3) comparative studies between sets of special types of models, such as heterogeneous regression, fixed-effects and random effects models, so that readers can be informed of a model's limitation(s) compared to the others in the set.

This book presents over 250 illustrative examples of panel data analysis using EViews, compared to the books of Baltagi (2009a,b) on Econometric Analysis of Panel Data and A Companion to Econometric Analysis of Panel Data which mainly present the mathematical concepts of the models with some data analysis. Referring to the fixed- and random effects models, Baltagi presented statistical results based on various additive models and none with the numerical time independent variable. However, Baltagi quotes a simple dynamic panel data model with heterogeneous coefficients on the lagged dependent variable and the time trend presented by Wansbeek and Knaap (1999, in Baltagi, 2009a, p. 168), and a random walk model with heterogeneous trend presented by Hardi (2000, in Baltagi, 2009a).

Similarly, most of the panel data models presented in Gujarati (2003) and Wooldridge (2002) are additive models, as are those in more than 300 papers published in five international journals: the Journal of Finance (JOF) from the years 2010 and 2011, and the International Journal of Accounting (IJA), Journal of Accounting and Economics (JAE), British Accounting Review (BAR), and Advances in Accounting, incorporating Advances in International Accounting (AA) from the years 2008, 2009 and 2010.

However, it is important to note that Wooldridge (2002) presented a random effects model with trend, that is, with the numerical time independent variable; Bansal (2005) presented models with trend and Time-Related Effects (TRE), but based on time series data; and Agung (2009a) presented various models with trend and TRE. So I would say that various models, either additive or interaction models, with the numerical time independent variable or the time and time-period dummy variables, should be acceptable, valid and reliable panel data models.

I found that a very limited number of models with interaction independent variables, or heterogeneous regression models, were presented. Only Giroud and Mueller (2011) presented several Year-Industry fixed effects interaction models (or Year-Industry FEMs with interaction independent variables). Referring to dummy variable models, Siswantoro and Agung (2010) found that only 63 out of 268 papers in the four journals (IJA, JAE, BAR and AA) had dummy variable models, and only five of those models had interaction independent variables. In addition, Dharmapala et al. (2011) presented interaction models or heterogeneous regressions using the Firm and Year dummies, and Park and Jang (2011) presented an interaction period-fixed-effects model with 34 parameters, besides the year dummies. In fact, the heterogeneous regressions model, which is an interaction model, was introduced by Johnson and Neyman in 1962 (cited in Huitema, 1980).

If a multiple regression panel data model does not have any dummy variable, then the regression model presents a single continuous model for the whole set of individual-time observations. I would consider such a model to be inappropriate. On the other hand, a dummy variable model could also be the worst within its group of models with the same set of numerical and categorical independent variables, as illustrated in this book.

Referring to the various models indicated here, this book presents various models, either additive or interaction models, with the numerical time independent variable or the time and time-period dummy variables. Note that the numerical time variable has been used to present classical growth models, namely the geometric and exponential growth models (Agung, 2009a, 2011b). Furthermore, the time t in fact represents an environmental variable, which is invariant over individuals or research objects.
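For reference, the two classical growth models named here are usually written as follows. This is a sketch in standard textbook notation; the symbols Y_0 (initial level) and r (growth rate) are assumptions made for illustration and are not taken from the book's own equations.

```latex
\begin{align}
  Y_t &= Y_0 (1 + r)^t        && \text{(geometric growth)} \\
  Y_t &= Y_0 e^{r t}          && \text{(exponential growth)} \\
  \ln Y_t &= \ln Y_0 + r t    && \text{(log-linear form of the exponential model)}
\end{align}
```

In the log-linear form the time t enters as a numerical independent variable, so the growth rate r can be estimated as the slope of a regression of ln Y_t on t.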

The models presented in this book are in fact derived from my first two books on data analysis using EViews (Agung, 2009a, 2011). For this reason, I recommend that readers use the models in the first two books as the basic and main references for developing alternative or more advanced models based on panel data, because this book presents only some of the models.

Furthermore, special statistical results using the objects VAR and GLS are illustrated, which have not been presented in other books or in papers in the international journals. A manual stepwise selection method is introduced, aside from the application of the STEPLS estimation method provided by EViews. Even though STEPLS regressions have been commonly applied, this book proposes and introduces how to apply STEPLS using a multistage stepwise selection method, specifically for interaction models with numerical and categorical independent variables, such as continuous interaction models by groups of the research objects (firms or individuals) and time points or time periods.

From my own point of view, models based on panel data should be classified into three groups, namely (1) the group of models based on unstacked data, or the group of time-series models by states (firms or individuals); (2) the group of models based on stacked or pool data, especially for incomplete panel data; and (3) the group of models based on natural experimental or special structural balanced panel data. For this reason, this book contains 14 chapters, which are classified into three parts.

Part I presents the Time Series Data Analyses by States. In this part the panel data considered is unstacked data, where the units of the analysis are time observations. The sets or multidimensional exogenous, endogenous and environmental variables, respectively, for the state i can be presented using the symbols X_i,t = (X1_i, …, Xk_i, …)_t, Y_i,t = (Y1_i, …, Yg_i, …)_t, and Z_t = (Z1, …, Zj, …)_t, for i = 1, …, N; and t = 1, …, T.

Note that the scores of the environmental variables are constant for all states or individuals. All the time series models presented in Agung (2009a) are valid models for each state i. This part presents only the analyses specifically for the unstacked data with a small number of N. The models for a large N will be presented in Part II and Part III.

The four chapters contained in this part are as follows:

Chapter 1 presents multivariate data analyses based on a single time series by states, using various models, such as multivariate lagged-variable autoregressive growth models, namely MLVAR(p,q)_GM; seemingly causal models (SCMs) with trend or time-related effects; fixed-effects and random effects models; and VAR and VEC models. In addition, this chapter also presents piece-wise models, various models having environmental independent variables, TGARCH(a,b,c) and instrumental variables models.

Chapter 2 presents multivariate data analyses based on bivariate time series by states, as the extension of all models presented in Chapter 1. In addition, this chapter also presents simultaneous causal models.

Chapter 3 presents multivariate data analyses based on multivariate time series by states, as the extension of all models presented in Chapter 2. In addition, this chapter also presents special VAR models with an environmental multivariate, which have not been found in other books and papers.

Chapter 4 presents the application of various SCMs, either additive or interaction models, based on a single time series Y_i,t, bivariate time series (X_i,t, Y_i,t), trivariate time series (X1_i,t, X2_i,t, Y_i,t) or (X_i,t, Y1_i,t, Y2_i,t), and the application of SCMs as the alternative VAR models with the environmental multivariate presented in Chapter 3.

Part II presents Pool Panel Data Analyses. In this part the panel data considered is stacked data, where the units of the analysis are the individual-time or firm-time observations. So the sets or multidimensional exogenous, endogenous and environmental variables, respectively, for the firm i at the time t can be presented using the symbols X_it = (X1, …, Xk, …)_it, Y_it = (Y1, …, Yg, …)_it, and Z_t = (Z1, …, Zj, …)_t, for i = 1, …, N_t; and t = 1, …, T. Note that the symbol N_t is used to indicate that the models presented in this part should be valid for incomplete or unbalanced pool panel data as well as balanced pool panel data. However, special models for balanced panel data, treated as natural experimental data, will be presented in Part III.
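The difference between the stacked layout of this part and the unstacked layout of Part I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the firm labels, the values and the helper names unstack and counts_by_time are hypothetical, and the book itself works in EViews rather than Python.

```python
from collections import defaultdict

# Hypothetical stacked (pool) panel: one row per firm-time observation,
# given as (firm, t, y). Firm "C" is missing at t = 2, so N_t varies with t.
stacked = [
    ("A", 1, 10.0), ("A", 2, 11.0), ("A", 3, 12.5),
    ("B", 1, 20.0), ("B", 2, 19.0), ("B", 3, 21.0),
    ("C", 1, 5.0), ("C", 3, 6.0),
]


def unstack(rows):
    """Return {firm: {t: y}}, i.e. one time series per state (Part I layout)."""
    series = defaultdict(dict)
    for firm, t, y in rows:
        series[firm][t] = y
    return dict(series)


def counts_by_time(rows):
    """Return {t: N_t}, the number of firms observed at each time point."""
    n_t = defaultdict(int)
    for _firm, t, _y in rows:
        n_t[t] += 1
    return dict(n_t)


unstacked = unstack(stacked)
n_by_t = counts_by_time(stacked)

# The panel is balanced only if every firm is observed at every time point.
is_balanced = all(n == len(unstacked) for n in n_by_t.values())
```

Here n_by_t plays the role of N_t in the text: because it is allowed to differ across time points, the same stacked layout covers both balanced and unbalanced panels.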

The statistical methods and models applied can be derived directly from the models based on cross-section data presented in Agung (2011), by inserting the time dummy independent variables. More complex models should be considered or defined based on pool panel data with a large N and a large (or very large) T. With a large number of time-point observations, the time T can be used as a numerical independent variable, so a defined model could be a continuous or discontinuous model of T, because at least two time periods have to be considered, as presented in Agung (2009a).

This part contains seven chapters as follows:

Chapter 5 presents the preliminary evaluation analysis, the applications of the object “Descriptive Statistics and Tests” for multidimensional problems by times, and of the object “N-way Tabulation”, multi-factorial cell-proportion models, Kendall's tau, and multiple association between categorical variables.

Chapter 6 presents general choice models, specifically the binary choice models having categorical or numerical independent variables. Special findings and notes are presented based on alternative binary choice models having numerical independent variables.

Chapter 7 presents advanced general choice models as the extension of all models presented in Chapter 6. In addition, this chapter demonstrates data analysis based on multifactorial binary and ordered choice models with categorical or numerical independent variables, using the manual step-by-step modeling process for comparison with the STEPLS estimation method, as well as new unexpected stepwise polynomial regressions and general choice models with trend and time-related effects.

Chapter 8 presents the application of additive and interaction GLMs (univariate general linear models) by Group and Time, using the original measured variables or transformed variables, starting with simple models such as ANOVA and quantile models; bivariate correlation analysis and STEPLS regressions. In addition, this chapter also presents piece-wise autoregressive models by time-points with one, two and several numerical exogenous variables, the applications of the White and Newey–West options, polynomial effects models, general continuous linear models and ANCOVA models, including the worst ANCOVA model in a theoretical sense. Finally, this chapter also presents a discussion on the non-stationary problem.

Chapter 9 presents fixed-effects models and alternatives. The limitations or hidden assumptions of each fixed-effects model, such as the individual fixed-effects model, the time fixed-effects model, and the individual-time fixed-effects model, are discussed in detail. In addition, this chapter also presents extended fixed-effects models. Several fixed-effects models are selected from the Journal of Finance (2011) to present their specific characteristics to the readers. Note that the fixed-effects models are in fact ANCOVA models with specific hidden assumptions. For this reason, several alternative heterogeneous regression models are presented and recommended, such as the heterogeneous classical growth models by individuals or groups, piece-wise heterogeneous regressions, and heterogeneous regressions with trend or time-related effects by individuals or groups.

Chapter 10 presents special notes on selected problems, namely the impacts of misclassifying the research objects when classes are defined based on the whole set of firm-time observations, or of analyzing the pool panel data as if it were a random cross-section. For instance, a dummy of the return rate R_it, defined as DR = 1 if R_it is less than or equal to zero and DR = 0 otherwise, would be misleading. Take note that this dummy variable DR_it does not represent two disjoint sets of observed firms or individuals over time; it represents two disjoint sets of R_it scores, namely the sets {R_it ≤ 0} and {R_it > 0}. So some or many firms would have both negative and positive scores of R_it over time, based on all firm-time observations. In other words, the two sets of scores are not classifications of the firms. In the most extreme case, all firms may have both negative and positive observed scores over a long period of observation. The same applies to other dummy variables and to first-difference variables.
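The point about DR can be checked on a small made-up example. The sketch below assumes hypothetical return rates for three firms; the names returns, dr and firms_in_both are illustrative only and do not come from the book.

```python
# Hypothetical return rates R_it for three firms over four periods.
returns = {
    "A": [0.05, -0.02, 0.03, -0.01],
    "B": [0.01, 0.04, 0.02, 0.06],
    "C": [-0.03, 0.02, -0.04, 0.01],
}


def dr(r):
    """Dummy of the return rate: 1 if R_it <= 0, and 0 otherwise."""
    return 1 if r <= 0 else 0


# DR is defined per firm-time observation, not per firm.
dr_values = {firm: [dr(r) for r in rs] for firm, rs in returns.items()}

# Firms whose observations fall in both sets {R_it <= 0} and {R_it > 0}.
firms_in_both = sorted(
    firm for firm, ds in dr_values.items() if len(set(ds)) == 2
)
```

In this toy data two of the three firms appear in both sets, so DR splits the scores but cannot be read as a classification of the firms.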

On the other hand, the models with R_it as a numerical variable, or using a ratio variable, could also be misleading; this is demonstrated using scatter graphs with regression lines, kernel fit curves or nearest neighbor fit curves. In fact, Agung (2009a) presented unexpected relationships between pairs of time-series variables in the form of their scatter graphs, because the pairs of time-series variables were presented as cross-section variables. In addition, some models from the international journals are selected in order to discuss their limitations or hidden assumptions, and to present alternative or extended models.

Chapter 11 presents the application of various types of the Seemingly Causal Models (SCMs), wherein the most important and recommended models are the multivariate heterogeneous models by group and time or time period (TP). The illustrative multivariate linear-effect models, nonlinear-effect models, and bounded models by groups or times are presented with special notes and comments. We find that unexpected statistical results can be obtained because of outliers. For this reason, possible treatment of the outliers is presented.

Part III presents Natural Experimental Data Analysis. In this part the panel data considered is stacked by cross-section with N × T observations, where the units of the analysis are the individual-time or firm-time observations. So the sets or multidimensional exogenous, endogenous and environmental variables, respectively, for the firm i at the time t can be presented using the symbols X_it = (X1, …, Xk, …)_it, Y_it = (Y1, …, Yg, …)_it, and Z_t = (Z1, …, Zj, …)_t, for i = 1, …, N; and t = 1, …, T. In the natural experimental data analysis, environmental variables could be represented by the time or time period variable, say TP, such as before and after a critical time or event; before, during and after an economic crisis; and before, between and after two consecutive critical events. In addition, a classification or cell-factor, namely CF, can be defined as the treatment factor, with the response variable Y_it, and with covariates (cause or upstream variables, or predictors) given by the lags Y_it(−p) and X_it(−q), at least for p = q = 1 if the data is an annual, semi-annual or quarterly data set.

Take note that the cell-factor (CF) should be generated or defined as a reference or base group of the research objects (individuals or firms), which is invariant or constant over time, such as regions or states, the firm sectors, status of the business (public and nonpublic), and family or nonfamily business. The CF, or an invariant GROUP variable, can also be generated based on numerical variables, such as SIZE, ASSET, or LOAN of the firms at a certain time point, for instance at the time T = 1, using the median, a quantile or a percentile as alternative cutting points.
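A minimal sketch of generating such a time-invariant CF from a numerical variable follows. It assumes made-up SIZE values, a two-level CF and the median at the base time point as the cutting point; the function name make_cf and the data are hypothetical.

```python
from statistics import median

# Hypothetical firm sizes, stored as {firm: {t: SIZE}}.
size = {
    "A": {1: 100.0, 2: 400.0},
    "B": {1: 250.0, 2: 240.0},
    "C": {1: 300.0, 2: 90.0},
    "D": {1: 50.0, 2: 500.0},
}


def make_cf(size_by_firm, base_t=1):
    """Return {firm: CF}, with CF = 1 if SIZE at base_t exceeds the median.

    Only the base time point is used, so each firm keeps the same CF
    at every later time, however its SIZE changes afterwards.
    """
    base = {firm: s[base_t] for firm, s in size_by_firm.items()}
    cut = median(base.values())
    return {firm: (1 if value > cut else 0) for firm, value in base.items()}


cf = make_cf(size)
```

Firm "D" is the largest firm at t = 2 but stays in the CF = 0 group, because the grouping is fixed at T = 1; this is what makes CF a classification of the firms rather than of the firm-time scores.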

If a covariate, say Xk_it(−1), is expected to have different effects on a response variable, say Yg_it, between the groups generated by CF and TP, then the simplest recommended model involves the heterogeneous regression lines of Yg on Xk(−1) by CF and TP. In practice, a set of heterogeneous regressions could be reduced to homogeneous regressions (ANCOVA or fixed-effects models) when the covariate Xk(−1) is found to have insignificantly different effects on Yg between the groups generated by CF and TP, in a statistical sense. In a theoretical sense, however, the homogeneous regressions might not be appropriate, and some could be the worst ANCOVA models. For these reasons, this part presents special notes and comments on the limitations of the fixed-effects models, and demonstrates the worst possible ANCOVA model in a theoretical sense.

The same holds for the simplest heterogeneous regressions of Yg on Yg(−1), namely the LV(1) model of Yg_it by CF and TP. Furthermore, the models could easily be extended to LVAR(p,q) models with the exogenous variables X_it and X_it(−1), and the environmental variable Z_t.
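The idea of heterogeneous regression lines by CF and TP can be sketched as one simple least-squares line per (CF, TP) cell. The data, the function names ols_line and fit_by_cell, and the use of plain Python are assumptions for illustration only; this is not the book's EViews procedure.

```python
from collections import defaultdict


def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_cell(rows):
    """Fit a separate line of Yg on Xk(-1) within each (CF, TP) cell."""
    cells = defaultdict(lambda: ([], []))
    for cf, tp, x_lag, y in rows:
        cells[(cf, tp)][0].append(x_lag)
        cells[(cf, tp)][1].append(y)
    return {cell: ols_line(xs, ys) for cell, (xs, ys) in cells.items()}


# Hypothetical rows: (CF, TP, Xk lagged one period, Yg).
rows = [
    (0, 0, 1.0, 2.0), (0, 0, 2.0, 3.0), (0, 0, 3.0, 4.0),
    (0, 1, 1.0, 2.0), (0, 1, 2.0, 4.0), (0, 1, 3.0, 6.0),
    (1, 0, 1.0, 5.0), (1, 0, 2.0, 5.5), (1, 0, 3.0, 6.0),
    (1, 1, 1.0, 1.0), (1, 1, 2.0, 0.0), (1, 1, 3.0, -1.0),
]

fits = fit_by_cell(rows)
```

Each cell gets its own intercept and its own slope. An ANCOVA or fixed-effects model would instead force a common slope across the cells, which is the hidden assumption discussed in the text.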

All models based on true experimental data, presented in Agung (2011, Chapter 8), should be valid for balanced pool data, by using CF and TP as the environmental treatment factors.

This part contains three chapters as follows:

Chapter 12, at the first stage, presents how to develop special balanced pool data, and how to define and generate the reference or base group variable(s), namely the cell-factor (CF), which is invariant or constant over times or time periods. Then various types of heterogeneous regressions are presented, such as LV(p), AR(q), and LVAR(p,q), starting with p = q = 1, namely (1) the models by CF (groups) and times, (2) the models by CF with trend, and (3) the models by CF with time-related effects, with or without exogenous or environmental variables. More advanced models, such as bounded, polynomial and generalized LV(p) models, are also presented. For data analysis in EViews 6, using the function @Expand(CF,TP) to write the equation specification of the model is recommended. For a model with a large number of various types of independent variables, applying the manual stepwise selection method is recommended, which is demonstrated using the general linear LV(1) models, binary choice LV(1) models and ordered choice LV(1) models. Finally, this chapter also presents quantile regressions, and ANCOVA or fixed-effects models.

Chapter 13 presents the applications of various multivariate lagged-variable autoregressive models, namely LVAR(p,q) SCMs (Seemingly Causal Models), where each of the multiple regressions in a model can have a different set of independent variables, as the extension of all models presented in the previous chapter. In fact, all models presented in the previous chapter are valid models for each endogenous variable Yg or its transformed variable, namely Hg(Yg). Various illustrative SCMs based on the numerical variables (Y1,Y2), (Y1,Y2,Y3), (X1,Y1,Y2) and (X1,X2,Y1,Y2) are presented in the form of selected illustrative path diagrams, and then the corresponding linear-effects models can easily be defined, either as additive, two-way or three-way interaction models, with or without an environmental variable. Hence, analysis could be done based on the heterogeneous regressions by CF and T or TP, or the heterogeneous regressions by CF with trend and time-related effects.

For comparison, the application of the VAR and VEC models, and fixed-effects or MANCOVA models is presented. Finally, the Granger Causality Test based on the VAR model and additive SCMs is presented, with an extension for the interaction SCMs, called the Generalized Granger Causality Tests.

Chapter 14 presents the applications of the Generalized Least Squares (GLS) estimation method, especially for the Cross-Section Random Effects Model (CSREM) and Period Random Effects Model (PEREM), based on special structural balanced panel data. Since the models can easily be derived from the OLS regression models presented in previous chapters, this chapter presents only some illustrative examples, using the same models that have been presented in previous chapters. In addition, the applications of two-way effects models, such as the two-way random effects model (TWREM), two-way fixed-effects model (TWFEM), two-way random-fixed-effects model (TWRFEM), and two-way fixed-random effects model (TWFREM), are also presented.

I wish to express my gratitude to the Graduate School of Management, Faculty of Economics, University of Indonesia, and The Ary Suta Center, for providing a rich intellectual environment and facilities that were indispensable for writing this text. In the process of doing data analyses, I wish to thank Thomas Gareth, Senior Principal Economist, IHS EViews, as well as my colleagues, Ruslan Priyadi, PhD, Professor Nachrowi D. Nachrowi, PhD, Zaafri Ananto Husodo, PhD, and Bambang Hermanto, PhD, for their input on selected illustrative examples presented in this book. I also would like to thank Tridianto Subagio, a member of the computing staff at the Graduate School of Management, who has given great help whenever I had problems with software.

In the process of writing this book in English, I am indebted to my daughter, Ningsih Agung Chandra, and my son, Darma Putra, for their time in correcting my English. My daughter has a Bachelor of Science from the Department of Biostatistics, School of Public Health, the University of North Carolina at Chapel Hill, USA, and a Master's Degree in Communication Studies (MSi) from the London School of Public Relations, Jakarta (LSPR). Now, she is a senior lecturer and thesis coordinator of the graduate program at LSPR. In addition, she is also the PR and Communication Manager of the Macau Government Tourist Office (MGTO) Representative in Indonesia, and her profile can be found through Google by typing her complete name, Martingsih Agung Chandra. My son has an MBA from De La Salle University, Philippines, and a BSc in Management from Adamson University, Philippines. Now, he is the director of Pure Technology Indonesia, Jakarta.

Finally, I would like to thank the reviewers, editors and staff at John Wiley & Sons, Ltd for their work in getting this book to publication.

About the Author

With regard to the request from John Wiley & Sons, Ltd, “Please provide us with a brief biography including details that explain why you are the ideal person to write this book,” I present my background, experiences and findings in doing statistical data analysis.

I have a PhD in Biostatistics (1981) and a Master's degree in Mathematical Statistics (1977) from the University of North Carolina at Chapel Hill, NC, USA; a Master's degree in Mathematics from New Mexico State University, Las Cruces, NM, USA; a degree in Mathematical Education (1962) from Hasanuddin University, Makassar, Indonesia; and a certificate from “Kursus B-I/B-II Ilmu Pasti” (B-I/B-II Courses in Mathematics), Yogyakarta, a five-year non-degree program in advanced mathematics. So I would say that I have good background knowledge in theoretical as well as applied statistics. In my dissertation on biostatistics, I presented new findings, namely the Generalized Kendall's tau, the Generalized Pair Charts, and the Generalized Simon's Statistics, based on data censored to the right.

Based on my knowledge of mathematics, mathematical functions in particular, I can evaluate the limitations, hidden assumptions or unrealistic assumption(s) of all regression functions, such as the fixed-effects models, which are in fact ANCOVA models. For comparison, my book presents the best and the worst ANCOVA models.

Furthermore, based on my exercises and experiments in doing data analyses in various fields of study, such as finance, marketing, education and population studies, since 1981, including my time at the Population Research Center, Gadjah Mada University (1985–1987), and at the University of Indonesia (1987 until now), I have found unexpected or unpredictable statistical results based on alternative panel data models, compared to the panel data models which are commonly applied.

Part One
Panel Data as a Multivariate Time Series by States

Abstract

Part I, containing the first four chapters, considers unstacked panel data where the units of the analysis are time observations. So the sets or multidimensional exogenous, endogenous and environmental variables, respectively, for the state i can be presented using the symbols X_i,t = (X1_i, …, Xk_i, …)_t, Y_i,t = (Y1_i, …, Yg_i, …)_t, and Z_t = (Z1, …, Zj, …)_t, for i = 1, …, N; and t = 1, …, T. Note that the scores of the environmental variables are constant for all states or individuals. Using these symbols, panel data is considered as the data of multivariate time series by states (countries, regions, agencies, firms, industries, households or individuals).

Note that all of the time series models presented in Agung (2009a) can easily be applied to conduct the data analysis based on each state in the panel data, as can the general multivariate models by states and time periods presented in Section 3.7. This part presents just the specific analyses for unstacked data with a small N. The models for a large N will be presented in Part II and Part III.