diff --git a/.Rbuildignore b/.Rbuildignore
index 10b5a60..20cab0e 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -7,7 +7,5 @@
 ^Data Sets- R$
 ^R data sets for 5e$
 ^cran-comments.md$
-^Farnsworth-EconometricsInR.pdf$
-^Jeffrey_M._Wooldridge_Introductory_Econometrics_A_Modern_Approach__2012.pdf$
 ^appveyor\.yml$
 ^NEWS\.md$
diff --git a/DESCRIPTION b/DESCRIPTION
index 711b36e..41047ea 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -8,13 +8,12 @@
 Description: Economics students new to both Econometrics and R may find the tasks of learning both a bit daunting. However, if your text is "Introductory Econometrics: A Modern Approach" by Jeffrey M. Wooldridge, then you are in luck! The `wooldridge` data package aims to lighten the task by loading any
-    data set from the text with a single command. In addition, the package contains
+    data set from the text with a single command. The package contains
     documentation for each data set and all data has been efficiently compressed resulting in a total size that is 62.73% of its original. A wooldridge-vignette
-    recreates examples from every chapter of the text, offering a relevant
-    introduction to R's statistical model syntax. Note: Data sets are from the
-    5th edition (Wooldridge 2013, ISBN-13:978-1-111-53104-1), which is compatible
-    with all other editions.
+    provides examples from the text, offering a relevant introduction to R's
+    econometric modelling syntax. Note: Data sets are from the 5th edition
+    (Wooldridge 2013, ISBN-13:978-1-111-53104-1), and are compatible with other editions.
 Depends: R (>= 3.1.0)
 License: GPL-3
 Encoding: UTF-8
diff --git a/README.md b/README.md
index d7cb82b..e8afaca 100644
--- a/README.md
+++ b/README.md
@@ -3,32 +3,39 @@
 [![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/wooldridge)](https://cran.r-project.org/package=wooldridge) [![Travis-CI Build Status](https://travis-ci.org/JustinMShea/wooldridge.svg?branch=master)](https://travis-ci.org/JustinMShea/wooldridge) [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/JustinMShea/wooldRidge?branch=master&svg=true)](https://ci.appveyor.com/project/JustinMShea/wooldRidge)

+Economics students new to both Econometrics and R may find the tasks of learning both a bit daunting. However, if your text is **"Introductory Econometrics: A Modern Approach"** by Jeffrey M. Wooldridge, then you are in luck!

-Economics students new to both Econometrics and R may find the tasks of learning both a bit daunting. However, if your text is **"Introductory Econometrics: A Modern Approach"** by Jeffrey M. Wooldridge, then you are in luck! The `wooldridge` data package aims to lighten the task by loading any data set from the text with a single command.
+The `wooldridge` data package aims to lighten the task by loading any data set from the text with a single command. The package contains documentation for each data set and all data has been efficiently compressed resulting in a total size that is **62.73%** of its original size. Just install the package, load it, and call the data set you need to work with.

-In addition the package contains documentation for each data set and all data has been efficiently compressed resulting in a total size that is **62.73%** of its original size. Just install the package, load it, and call the data set you need to work with.
+_**But wait...there's more!**_ Act now (or at any time) and you will receive the :sparkles: [`wooldridge-vignette`](https://github.com/JustinMShea/wooldridge/tree/master/vignettes/wooldridge-vignette.pdf) :sparkles:. The vignette illustrates how to recreate examples provided in the text, offering a relevant introduction to getting started with R's econometric modelling syntax.

-_**But wait...there's more!**_ Contained in the package is also the [`wooldridge-vignette`](https://github.com/JustinMShea/wooldridge/tree/master/vignettes/wooldridge-vignette.pdf), which shows you how to recreate examples from every chapter of the text,
-offering a relevant introduction to R's statistical model syntax.

+While the course companion site also provides publicly available data sets for EViews, Excel, Minitab, and Stata commercial software products, **R** is an open-source option. Furthermore, taking the step to use R while building a foundation in Econometrics offers the curious student a gateway to advanced topics available in the R package ecosystem.

-
-**Note:** All data sets are from the 5th edition (Wooldridge 2013, `ISBN-13:978-1-111-53104-1`), which is compatible with most other editions.
+**Note:** All data sets are from the 5th edition (Wooldridge 2013, `ISBN-13: 978-1-111-53104-1`), which is compatible with most other editions.

## Installation

-If you don't already have `devtools` installed, try the `ghit` package, a lightweight github installer.
+Install directly from CRAN; the CRAN release depends on R version >= 3.4.0.
+
+```{r}
+install.packages("wooldridge")
+```
+
+For the development version (with the dependency relaxed to R version >= 3.1.0),
+install from GitHub.

```{r}
-install.packages("ghit")
+devtools::install_github("JustinMShea/wooldridge")
```

-Next, install 'wooldridge' package from my GitHub page.
+Or, to also build the vignette locally:

```{r}
-ghit::install_github("JustinMShea/wooldridge")
+devtools::install_github("JustinMShea/wooldridge", build_vignettes = TRUE)
```
+
## Example

Load the `wooldridge` package and use the `data()` function to load the desired set.
@@ -42,7 +49,7 @@ Check out the documentation on the variable column names and what they are.
?jtrain
```
-In addition, load the [`wooldridge-vignette`](https://github.com/JustinMShea/wooldridge/tree/master/vignettes/wooldridge-vignette.pdf) for a recreation of examples from the text.
+In addition, load [`wooldridge-vignette`](https://github.com/JustinMShea/wooldridge/tree/master/vignettes/wooldridge-vignette.pdf) for a recreation of examples from the text.

```{r}
vignette("wooldridge-vignette")
diff --git a/data-raw/final_roxy_build_delete.R b/data-raw/final_roxy_build_delete.R
index a7ff840..d5ef70a 100644
--- a/data-raw/final_roxy_build_delete.R
+++ b/data-raw/final_roxy_build_delete.R
@@ -6,17 +6,16 @@
# time to roxygenize those .R description files we wrote!
devtools::document()

-# Build vignette
-devtools::build_vignettes()
+# Build package
+devtools::build()

# delete Building vignette folder as it creates build warning.
unlink("inst/doc", recursive = TRUE) unlink("inst", recursive = TRUE) -# Render .pdf vignette +# Render .pdf and .html vignettes library(rmarkdown) -rmarkdown::render("vignettes/wooldridge-vignette.Rmd", pdf_document(toc=TRUE)) - +rmarkdown::render("vignettes/wooldridge-vignette.Rmd", "all") # build checks use_travis() diff --git a/vignettes/wooldridge-vignette.Rmd b/vignettes/wooldridge-vignette.Rmd index a2d50b2..d453f53 100644 --- a/vignettes/wooldridge-vignette.Rmd +++ b/vignettes/wooldridge-vignette.Rmd @@ -3,8 +3,11 @@ title: "wooldridge-vignette" author: "Justin M Shea" date: " " output: + rmarkdown::html_document: + toc: true pdf_document: - toc: true + toc: true + vignette: > %\VignetteIndexEntry{wooldridge-vignette} %\VignetteEngine{knitr::rmarkdown} @@ -15,9 +18,11 @@ vignette: > ## Introduction -This vignette contains examples from every chapter, which show you how to load data from the `wooldridge` data package, and run the appropriate model to match the results of the text examples. The syntax provided here should get you through the book. +This vignette contains examples of using R with _"Introductory Econometrics: A Modern Approach"_ by Jeffrey M. Wooldridge. Each example illustrates how to load data, run econometric models, and view the results with **R**. + +While the course companion site also provides publicly available data sets for E-views, Excel, MiniTab, and Stata commercial software products, **R** is an open source option. Furthermore, taking the step to use **R** while building a foundation in Econometrics, offers the curious Student a gateway to accessing advanced topics available in the greater package ecosystem. -Load the `wooldridge` package to access data in the manner specified in each example. +First, load the `wooldridge` package to access data in the manner specified in each example. ```{r, echo = TRUE, eval = TRUE, warning=FALSE} library(wooldridge) @@ -31,9 +36,6 @@ library(stargazer) **`Example 2.10:` A Log Wage Equation** -From the text: - -> " Using the wage1 data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the following relationship:" $$\widehat{log(wage)} = \beta_0 + \beta_1educ$$ @@ -60,10 +62,6 @@ stargazer(log_wage_model, single.row = TRUE, header = FALSE) **`Example 3.2:` Hourly Wage Equation** -From the text: - -> " Using the 526 observations on workers in 'wage1', we include $educ$(years of education), $exper$(years of labor market experience), and $tenure$(years with the current employer) in an equation explain log($wage$)." - $$\widehat{log(wage)} = \beta_0 + \beta_1educ + \beta_3exper + \beta_4tenure$$ Estimate the model regressing _education_, _experience_, and _tenure_ against _log(wage)_. @@ -83,11 +81,6 @@ stargazer(hourly_wage_model, single.row = TRUE, header = FALSE) **`Example 4.7` Effect of Job Training on Firm Scrap Rates** -From the text: - -> " The scrap rate for a manufacturing firm is the number of defective items - products that must be discarded - out of every 100 produced. Thus, for a given number of items produced, a decrease in the scrap rate reflects higher worker productivity." - -> "We can use the scrap rate to measure the effect of worker training on productivity. Using the data in jtrain, but only for the year 1987 and for non-unionized firms, we obtain the following estimated equation:" First, load the `jtrain` data set. 
```{r, echo = TRUE, eval = TRUE, warning=FALSE}
data("jtrain")
```
@@ -127,9 +120,6 @@ stargazer(linear_model, single.row = TRUE, header = FALSE)

**`Example 5.3:` Economic Model of Crime**

-From the text:
-
-> "We illustrate the Lagrange Multiplier $(LM)$ statistics test by using a slight extension of the crime model from example 3.5."

$$narr86 = \beta_0 + \beta_1pcnv + \beta_2avgsen + \beta_3tottime + \beta_4ptime86 + \beta_5qemp86 + \mu$$
@@ -152,9 +142,6 @@ Load the `crime1` data set containing arrests during the year 1986 and other inf
data(crime1)
```

-From the text:
-
-> "We use the $LM$ statistic to test the null hypothesis that $avgsen$ and $tottime$ have no effect on $narr86$ once other factors have been controlled for. First, estimate the restricted model by regressing $narr86$ on $pcnv, ptime86,$ and $qemp86$; the variables $avgsen$ and $tottime$ are excluded from this regression."

```{r, tidy = TRUE}
restricted_model <- lm(narr86 ~ pcnv + ptime86 + qemp86, data = crime1)
@@ -169,9 +156,7 @@ restricted_model_u <- restricted_model$residuals

Next, we run the regression of:

$$\tilde{\mu} = \beta_1pcnv + \beta_2avgsen + \beta_3tottime + \beta_4ptime86 + \beta_5qemp86$$

-From the text:
-> "As always, the order in which we list the independent variables is irrelevant. This second regression produces $R^2_{\mu}$, which turns out to be about 0.0015."

```{r, tidy = TRUE}
LM_u_model <- lm(restricted_model_u ~ pcnv + ptime86 + qemp86 + avgsen + tottime, data = crime1)

summary(LM_u_model)$r.square
```

-> "This may seem small, but we must multiple it by $n$ to get the $LM$ statistic:"
-
$$LM = 2,725(0.0015)$$

```{r}
LM_test <- nobs(LM_u_model) * 0.0015
LM_test
```

-> "The 10% critical value in a chi-square distribution with two degrees of freedom is about 4.61 (rounded to two decimal places)."

```{r}
qchisq(1 - 0.10, 2)
```

-> "Thus, we fail to reject the null hypothesis that $\beta_{avgsen} = 0$ and $\beta_{tottime} = 0$ at the 10% level."

The _p_-value is: $$P(X^2_{2} > 4.09) \approx 0.129$$
@@ -210,11 +190,6 @@ so we would reject the $H_0$ at the 15% level.

**`Example 6.1:` Effects of Pollution on Housing Prices, standardized.**

-From the text:
-
-> "We use the data $hrprice2$ to illustrate the use of beta coefficients. Recall that the key independent variable is $nox$, a measure of nitrogen oxide in the air over each community. One way to understand the size of the pollution effect-without getting into the science underling nitrogen oxide's effect on air quality-is to compute beta coefficients. The population equation is the level-level model:"

$$price = \beta_0 + \beta_1nox + \beta_2crime + \beta_3rooms + \beta_4dist + \beta_5stratio + \mu$$
@@ -230,7 +205,6 @@
$dist$: weighted distance of the community to 5 employment centers.

$stratio$: average student-teacher ratio of schools in the community.

-The beta coefficients are reported in the following equation (so each variable has been converted to its $z$-score):"

$$\widehat{zprice} = \beta_1znox + \beta_2zcrime + \beta_3zrooms + \beta_4zdist + \beta_5zstratio$$

First, load the `hrpice2` data.
@@ -250,6 +224,7 @@
stargazer(housing_standard, single.row = TRUE, header = FALSE)
```

+\newpage

**`Example 6.2:` Effects of Pollution on Housing Prices, Quadratic Interactive Term**
@@ -263,7 +238,7 @@ housing_interactive <- lm(lprice ~ lnox + log(dist) + rooms+I(rooms^2) + stratio

Lets compare the results with the model from `example 6.1`.

```{r, results = 'asis', warning=FALSE, message=FALSE, tidy=TRUE}
-stargazer(housing_standard, echo=FALSE, housing_interactive, single.row = TRUE, header = FALSE)
+stargazer(housing_standard, housing_interactive, single.row = TRUE, header = FALSE)
```

\newpage
@@ -296,25 +271,14 @@ housing_qualitative <- lm(lprice ~ llotsize + lsqrft + bdrms + colonial, data =
stargazer(housing_qualitative, single.row = TRUE, header = FALSE)
```

-Summary from the text:
-
-> "All the variables are self-explanatory except $colonial$, which is a binary variable equal to one if the house is of the colonial style. What does the coefficient on $colonial$ mean? For given levels of $lotsize$, $sqrt$, and $bdrms$, the difference in $\widehat{log(price)}$ between a house of colonial style and that of another style is 0.54. This means that colonial-style house is predicted to sell for about 5.4% more, holding other factors fixed."
-
\newpage

## Chapter 8: Heteroskedasticity

**`Example 8.9:` Determinants of Personal Computer Ownership**

-> "We use the data in $GPA1$ to estimate the probability of owning a computer. Let $PC$ denote a binary indicator equal to unity if the student owns a computer, and zero otherwise. The variable $hsGPA$ is high school GPA, $ACT$ is achievement test score, and $parcoll$ is a binary indicator equal to unity if at least one parent attended college."
-
-> "The equation estimated by OLS is:"

$$\widehat{PC} = \beta_0 + \beta_1hsGPA + \beta_2ACT + \beta_3parcoll + \beta_4colonial $$
-
-
Create a new variable combining the`fathcoll` and `mothcoll`, into `parcoll`. This new column indicates if either parent went to college.

```{r}
@@ -326,7 +290,6 @@ gpa1$parcoll <- as.integer(gpa1$fathcoll==1 | gpa1$mothcoll)

GPA_OLS <- lm(PC ~ hsGPA + ACT + parcoll, data = gpa1)
```

-> "Just as with example 8.8, there are no striking differences between the usual and robust standard errors. Nevertheless, we also estimate the model by Weighted Least Squares or $WLS$. Because all of the $OLS$ fitted values are inside the unit interval, no adjustments are needed"

First, calculate the weights and then pass them to the same linear model.

```{r}
weights <- GPA_OLS$fitted.values * (1-GPA_OLS$fitted.values)

GPA_WLS <- lm(PC ~ hsGPA + ACT + parcoll, data = gpa1, weights = 1/weights)
```

Compare the OLS and WLS model in the table below:
@@ -342,7 +305,6 @@
stargazer(GPA_OLS, GPA_WLS, single.row = TRUE, header = FALSE)
```

-> "There are no important differences in the OLS and WLS estimates. The only significant explanatory variable is $parcoll$, and in both cases we estimate that the probability of $PC$ ownership is about .22 higher if at least on parent attended college"

\newpage

## Chapter 9: More on Specification and Data Issues

**`Example 9.8:` R&D Intensity and Firm Size**

-> "Suppose the R&D expenditures as a percentage of sales, $rdintens$, are realted to $sales$ (in millions) and profits as a percentage of sales, $profmarg$:"

$$rdintens = \beta_0 + \beta_1sales + \beta_2profmarg + \mu$$

-> "The $OLS$ equation using data on 32 chemical companies in $rdchem$ is"
-
Load the data, run the model, and apply the `summary` diagnostics function to the model.

```{r}
data(rdchem)

all_rdchem <- lm(rdintens ~ sales + profmarg, data = rdchem)
```

-Neither $sales$ nor $profmarg$ is statistically significant at even the 10% level in this regression.
-
-Of the 32 firms, 31 have annual sales less than $20$ billion. One firm has annual sales of almost $40$ billions. Figure 9.1 shows how far this firm is from the rest of the sample.
+Notice the outlier on the far right of the plot.

```{r, tidy=TRUE}
plot_title <- "FIGURE 9.1: Scatterplot of R&D intensity against firm sales"
x_axis <- "firm sales (in millions of dollars)"
y_axis <- "R&D as a percentage of sales"

plot(rdintens ~ sales, pch = 21, bg = "lightblue", data = rdchem, main = plot_title, xlab = x_axis, ylab = y_axis)
```

-> "In terms of sales, this firm is over twice as large as every other firm, so it might be a good idea to estimate the model without it. When we do this, we obtain:"
-
```{r}
smallest_rdchem <- lm(rdintens ~ sales + profmarg, data = rdchem, subset = (sales < max(sales)))
```
@@ -396,8 +351,6 @@ stargazer(all_rdchem, smallest_rdchem, single.row = TRUE, header = FALSE)

**`Example 10.2:` Effects of Inflation and Deficits on Interest Rates**

-> "The data in $INTDEF$ data come from the 2004 Economic Report of the President (Tables B-73 and B-79) and span the years 1948 through 2003. The variable $i3$ is the three-month T-bill rate, $inf$ is the annual inflation rate based on the consumer price index (CPI), and $def$ is the federal budget deficit as a percentage of GDP. The estimated equation is:"
-
$$\widehat{i3} = \beta_0 + \beta_1inf_t + \beta_2def_t$$

```{r}
data("intdef")

tbill_model <- lm(i3 ~ inf + def, data = intdef)
```

```{r, results = 'asis', warning=FALSE, message=FALSE, tidy=TRUE}
stargazer(tbill_model, single.row = TRUE, header = FALSE)
```

-> "These estimates show that increases in inflation or the relative size of the deficit increase short-term interest rates, both of which are expected from basic economics. For example, a ceteris paribus one percentage point increase in the inflation rate increases i3 by .606 points. Both inf and def are very statistically significant, assuming, of course, that the CLM assumptions hold."

**`Example 10.11:` Seasonal Effects of Antidumping Filings**

```{r, tidy=TRUE}
data("barium")
barium_imports <- lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6, data = barium)
```

-> "Therefore, we should add seasonal dummy variables to make sure none of the important conclusions change. It could be that the months just before the suit was filed are months where imports are higher or lower, on average, than in other months."

```{r, tidy=TRUE}
barium_seasonal <- lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + feb + mar + apr + may + jun + jul + aug + sep + oct + nov + dec, data = barium)

barium_anova <- anova(barium_imports, barium_seasonal)
```
@@ -434,7 +385,6 @@
stargazer(barium_imports, barium_seasonal, single.row = TRUE, header = FALSE)

stargazer(barium_anova, single.row = TRUE, header = FALSE)
```

-> "When we add the 11 monthly dummy variables as in 10.41 and test their joint significance, we obtain $p-value = 5 .5852$, and so the seasonal dummies are jointly insignificant. In addition, nothing important changes in the estimates once statistical significance is taken into account. Krupp and Pollard (1996) actually used three dummy variables for the seasons (fall, spring, and summer, with winter as the base season), rather than a full set of monthly dummies; the outcome is essentially the same."

\newpage
@@ -442,24 +392,15 @@

**`Example 11.7:` Wages and Productivity**

-> "The variable $hrwage$ is average hourly wage in the U.S. economy, and $outphr$ is output per hour. One way to estimate the elasticity of hourly wage with respect to output per hour is to estimate the equation:"

$$\widehat{log(hrwage_t)} = \beta_0 + \beta_1log(outphr_t) + \beta_2t + \mu_t$$

-> "where the time trend is included because $log(hrwage)$ and $log(outphr)$ both display clear, upward, linear trends. Using the data in 'EARNS' for the years 1947 through 1987, we obtain:"

```{r}
data("earns")

wage_time <- lm(lhrwage ~ loutphr + t, data = earns)
```

-> "(We have reported the usual goodness-of-fit measures here; it would be better to report those based on the detrended dependent variable, as in Section 10.5.). The estimated elasticity seems too large: a 1% increase in productivity increases real wages by about 1.64%. Because the standard error is so small, the 95% confidence interval easily excludes a unit elasticity. U.S. workers would probably have trouble believing that their wages increase by more than 1.5% for every 1% increase in productivity."
-
-> "The regression results must be viewed with caution. Even after linearly detrending $log(hrwage)$, the first order autocorrelation is .967, and for detrended $log(outphr), \hat{p} = 0.945$. These suggest that both series have unit roots, so we reestimate the equation in first differences (and we no longer need a time trend):"

```{r}
wage_diff <- lm(diff(lhrwage) ~ diff(loutphr), data = earns)
```
@@ -469,7 +410,6 @@
stargazer(wage_time, wage_diff, single.row = TRUE, header = FALSE)
```

-> "Now, a 1% increase in productivity is estimated to increase real wages by about 0.81%, and the estimate is not statistically different from one. The adjusted $R^2$ shows that the growth in output explains about 35% of the growth in real wages."

\newpage
@@ -477,50 +417,28 @@

**`Example 12.4`: Prais-Winsten Estimation in the Event Study**

-> "Again using the data in BARIUM, we estimate the equation in Example 10.5 using iterated Prais-Winsten estimation."
-
-> "The coefficients that are statistically significant in the Prais-Winsten estimation do not differ by much from the OLS estimates [in particular, the coefficients on $log(chempi)$, $log(rtwex)$, and $afdec6$]. It is not surprising for statistically insignificant coefficients to change, perhaps markedly, across different estimation methods.
-
-First, run the linear model from example 10.5 and 10.11.
-
```{r, tidy=TRUE}
data("barium")
barium_model <- lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6, data = barium)
-```
-
-Then load the `prais` package and use the `prais.winsten` function to estimate the same model.
-
-```{r, tidy=TRUE}
+# Load the `prais` package, use the `prais.winsten` function to estimate.
library(prais)
barium_prais_winsten <- prais.winsten(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6, data = barium)
```
-
-
Print the names of both models to the console to compare the results of both.

```{r}
barium_model
barium_prais_winsten
```

-> "Notice how the standard errors in the second column are uniformly higher than the standard errors in column (1). This is common. The Prais-Winsten standard errors account for serial correlation; the $OLS$ standard errors do not. As we saw in Section 12.1, the OLS standard errors usually understate the actual sampling variation in the OLS estimates and should not be relied upon when significant serial correlation is present. Therefore, the effect on Chinese imports after the International Trade Commissions decision is now less statistically significant than we thought."
-
-> "Finally, an R-squared is reported he $PW$ estimation that is well below the R-squared for the $OLS$ estimation in this case. However, these R-squareds should not be compared. For $OLS$, the R-squared, as usual, is based on the regression with the untransformed dependent and independent variables. For $PW$, the R-squared comes from the final regression of the $transformed$ dendent variable on the transformed independent vari-ables. It is not clear what this $R^2$ actually measuring; nevertheless, it is traditionally reported."
-
\newpage

**`Example 12.8:` Heteroskedasticity and the Efficient Markets Hypothesis**

-> "In Example 11.4, we estimated the simple $AR(1)$ model:"

$$return_t = \beta_0 + \beta_1return_{t-1} + \mu_t$$

-> "The EMH states that $\beta_1 = 0$. When we tested this hypothesis using the data in 'NYSE', we obtained $t_b{1} = 1.55$ with $n = 689$.
+
```{r}
data("nyse")

return_AR1 <-lm(return ~ return_1, data = nyse)
```

-> "With such a large sample, this is not much evidence against the EMH. Although the EMH states that the expected return given past observable information should be constant, it says nothing about the conditional variance. In fact, the Breusch-Pagan test for heteroskedasticity entails regressing the squared $OLS$ residuals $\hat{\mu^2_t}$ on $return_{t-1}$""

$$\hat{\mu^2_t} = \beta_0 + \beta_1return_{t-1} + residual_t$$

-Calculated $\hat{\mu^2_t}$ by taking the residuals contained in the `return_AR` model object and store the results in the variable named `return_mu`. Then regress the `return_1` variable against the square of `return_mu`. Notice, we set data equal to the `return_AR` objects model matrix, which contains data free of leading missing values inherent to lagged variables.
+
```{r}
return_mu <- residuals(return_AR1)

mu2_hat_model <- lm(return_mu^2 ~ return_1, data = return_AR1$model)
```

```{r, results = 'asis', warning=FALSE, message=FALSE, tidy=TRUE}
stargazer(return_AR1, mu2_hat_model, single.row = TRUE, header = FALSE)
```

-> "The $t$ statistic on $return_{t-1}$ is about -5.5, indicating strong evidence of heteroskedasticity. Because the coeffict on $return_{t-1}$ is negative, we have the interesting finding that volatility in stock returns is lower the previous return was high, and vice versa. Therefore, we have found what is common in many financial studies: the expected value of stock returns does not depend on past returns, but the variance of returns does."
+
\newpage

**`Example 12.9:` ARCH in Stock Returns**

-> "In Example 12.8, we saw that there was heteroskedasticity in weekly stock returns. This heteroskedasticity is actually better characterized by the ARCH model in (12.50). If we compute the OLS residuals from (12.47), square these, and regress them on the lagged squared residual, we obtain:"

$$\hat{\mu^2_t} = \beta_0 + \hat{\mu^2_{t-1}} + residual_t$$
@@ -571,9 +483,6 @@
arch_model <- lm(mu2_hat ~ mu2_hat_1)

stargazer(arch_model, single.row = TRUE, header = FALSE)
```

-> "The t statistic on $\hat{\mu^2_{t-1}}$ (mu2_hat_1) is over nine, indicating strong ARCH. As we discussed earlier, a larger error at time $t-1$ implies a larger variance in stock returns today.
-
-> "It is important to see that, though the $squared$ $OLS$ residuals are autocorrelated, the $OLS$ residuals themselves are not (as is consistent with the EMH). Regressing on $\hat{\mu_t}$ and $\hat{\mu_{t-1}}$ gives $\hat{p} = 0.0014$ with $t_{\hat{p}} = 0.038$.

\newpage
@@ -581,18 +490,12 @@

**`Example 13.7:` Effect of Drunk Driving Laws on Traffic Fatalities**

-> "Many states in the United States have adopted different policies in an attempt to curb drunk driving. Two types of laws that we will study here are $open$ $container$ $laws$ -which make it illegal for passengers to have open containers of alcoholic beverages and $administrative$ $per$ $se$ $laws$ -which allow courts to suspend licenses after a driver is arrested for drunk driving but before the driver is convicted. One possible analysis is to use a single cross section of states to regress driving fatalities (or those related to drunk driving) on dummy variable indicators for whether each law is present. This is unlikely to work well because states decide, through legislative processes, whether they need such laws. Therefore, the presence of laws is likely to be related to the average drunk driving fatalities in recent years. A more convincing analysis uses panel data over a time period where some states adopted new laws (and some states may have repealed existing laws). The file TRAFFIC1 contains data for 1985 and 1990 for all 50 states and the District of Columbia. The dependent variable is the number of traffic deaths per 100 million miles driven (dthrte). In 1985, 19 states had open container laws while 22 states had such laws in 1990. In 1985, 21 states had per se laws; the number had grown to 29 by 1990. Using OLS after first differencing gives:"

$$\widehat{\Delta{dthrte}} = \beta_0 + \Delta{open} + \Delta{admin}$$

```{r}
data("traffic1")
+
DD_model <- lm(cdthrte ~ copen + cadmn, data = traffic1)
```

```{r, results = 'asis', warning=FALSE, message=FALSE, tidy=TRUE}
stargazer(DD_model, single.row = TRUE, header = FALSE)
```

-> "The estimates suggest that adopting an open container law lowered the traffic fatality rate by $0.42$, a nontrivial effect given that the average death rate in 1985 was 2.7 with a standard deviation of about 0.6. The estimate is statistically significant at the 5% level against a twosided alternative. The administrative per se law has a smaller effect, and its t statistic is only -1.29; but the estimate is the sign we expect. The intercept in this equation shows that traffic fatalities fell substantially for all states over the five-year period, whether or not there were any law changes. The states that adopted an open container law over this period saw a further drop, on average, in fatality rates."
-
-> "Other laws might also affect traffic fatalities, such as seat belt laws, motorcycle helmet laws, and maximum speed limits. In addition, we might want to control for age and gender distributions, as well as measures of how influential an organization such as Mothers Against Drunk Driving is in each state."
-
+\newpage

## Chapter 14: Advanced Panel Data Methods

**`Example 14.1:` Effect of Job Training on Firm Scrap Rates**

-> "We use the data for three years, 1987, 1988, and 1989, on the 54 firms that reported scrap rates in each year. No firms received grants prior to 1988; in 1988, 19 firms received grants; in 1989, 10 different firms received grants. Therefore, we must also allow for the possibility that the additional job training in 1988 made workers more productive in 1989. This is easily done by including a lagged value of the grant indicator. We also include year dummies for 1988 and 1989.
+In this section, we will estimate a linear panel model using the `plm` function in the
+`plm: Linear Models for Panel Data` package.

```{r, tidy=TRUE}
library(plm)
+
+data("jtrain")
+
scrap_panel <- plm(lscrap ~ d88 + d89 + grant + grant_1, data = jtrain,
                   index = c('fcode','year'), model = 'within', effect ='individual')
```

```{r, results = 'asis', warning=FALSE, message=FALSE, tidy=TRUE}
stargazer(scrap_panel, single.row = TRUE, header = FALSE)
```

-> "We have reported the results in a way that emphasizes the need to interpret the estimates in light of the unobserved effects model, (14.4). We are explicitly controlling for the unobserved, time-constant effects in $\alpha_i$. The time-demeaning allows us to estimate the $\beta_j$, but (14.5) is not the best equation for interpreting the estimates.
-
-> "Interestingly, the estimated lagged effect of the training grant is substantially larger than the contemporaneous effect: job training has an effect at least one year later. Because the dependent variable is in logarithmic form, obtaining a grant in 1988 is predicted to lower the firm scrap rate in 1989 by about 34.4% [$exp(-0.422)-1 \approx -0.344$]; the coefficient on $grant_1$ is significant at the 5% level against a twosided alternative. The coefficient $grant$ is significant at the 10% level, and the size of the coefficient is hardly trivial. Notice the $df$ is obtained as N(T-1) - k = 54(3-1)-4 = 104"
-
-> "The coefficient on $d89$ indicates that the scrap rate was substantially lower in 1989 than in the base year, 1987, even in the absence of job training grants. Thus, it is important to allow for these aggregate effects. If we omitted the year dummies, the secular increase in worker productivity would be attributed to the job training grants. The diagnostic results above shows that, even after controlling for aggregate trends in productivity, the job training grants had a large estimated effect."
-
-> "Finally, it is crucial to allow for the lagged effect in the model. If we omit $grant_1$, then we are assuming that the effect of job training does not last into the next year. The estimate on $grant$ when we drop $grant_1$ is -0.082 $t = -0.65$; this is much smaller and statistically insignificant."

+\newpage

## Chapter 15: Instrumental Variables Estimation and Two Stage Least Squares

**`Example 15.1:` Estimating the Return to Education for Married Women**

-> "We use the data on married working women in $mroz$ to estimate the return to education in the simple regression model"

$$log(wage) = \beta_0 + \beta_1educ + \mu$$

-> "For comparison, we first obtain the $OLS$ estimates:"
-
-
-
```{r, message=FALSE}
data("mroz")
wage_educ_model <- lm(lwage ~ educ, data = mroz)
```

-> "The estimate for $\beta_1$ implies an almost 11% return for another year of education."
-
-> "Next, we use father's education $fatheduc$ as an instrumental variable for $educ$. We have to maintain that $fatheduc$ is uncorrelated with $\mu$. The second requirement is that $educ$ and $fatheduc$ are correlated. We can check this very easily using a simple regression of $educ$ on $fatheduc$, using only the working women in the sample:"

$$\widehat{educ} = \beta_0 + \beta_1fatheduc$$
@@ -666,9 +548,6 @@
We run the typical linear model, but notice the use of the `subset` argument.

```{r}
fatheduc_model <- lm(educ ~ fatheduc, data = mroz, subset = (inlf==1))
```

-> "The $t$ statistic on $fatheduc$ is $9.42$, which indicates that $educ$ and $fatheduc$ have a statistically significant positive correlation. In fact, $fatheduc$ explains about 17% of the variation in $educ$ in the sample. Using fatheduc as an $IV$ for $educ$ gives:"
-
-
In this section, we will perform an **Instrumental-Variable Regression**, using the `ivreg` function in the `AER (Applied Econometrics with R)` package.

```{r, message=FALSE}
library("AER")
wage_educ_IV <- ivreg(lwage ~ educ | fatheduc, data = mroz)
```

```{r, results = 'asis', warning=FALSE, message=FALSE, tidy=TRUE}
stargazer(wage_educ_model, fatheduc_model, wage_educ_IV, single.row = TRUE, header = FALSE)
```

-> "The $IV$ estimate of the return to education is 5.9%, which is barely more than one half of the OLS. This suggests that the $OLS$ estimate is too high and is consistent with omitted ability bias. But we should remember that these are estimates from just one sample: we can never know whether $0.109$ is above the true return to education, or whether $0.059$ is closer to the true return to education. Further, the standard error of the $IV$ estimate is two and one-half times as large as the $OLS$ standard error this is expected, for the reasons we gave earlier. The 95% confidence interval for using $OLS$ is much tighter than that using the $IV$. In fact, the $IV$ confidence interval actuay contains the $OLS$ estimate. Therefore, although the differences between 15.15 and 15.17 are practically large, we cannot say whether the difference is statistically significant. We will show how to test this in Section 15.5."
-
\newpage

**`Example 15.2:` Estimating the Return to Education for Men**

-> "We now use $wage2$ data to estimate the return to education for men. We use the variable $sibs$, or number of siblings, as an instrument for $educ$. These are negatively correlated, as we can verify from a simple regression:"

$$\widehat{educ} = \beta_0 + sibs$$

```{r}
data("wage2")

educ_sibs_model <- lm(educ ~ sibs, data = wage2)
```

-> "This equation implies that every sibling is associated with, on average, about 0.23 less of a year of education. If we assume that sibs is uncorrelated with the error term in 15.14, then the IV estimator is consistent. Estimating equation 15.14 from example 15.1, using $sibs$ as an IV for $educ$ gives:"

$$\widehat{log(wage)} = \beta_0 + educ$$

In this section, we will perform an **Instrumental-Variable Regression**, using the `ivreg` function in the `AER (Applied Econometrics with R)` package.

```{r, message=FALSE}
library("AER")

educ_sibs_IV <- ivreg(lwage ~ educ | sibs, data = wage2)
```

```{r, results = 'asis', warning=FALSE, message=FALSE, tidy=TRUE}
stargazer(educ_sibs_model, educ_sibs_IV, wage_educ_IV, single.row = TRUE, header = FALSE)
```

-> "For comparison, the OLS estimate of $\beta_1$ is 0.059 with a standard error of 0.006. Unlike in the previous example, the $IV$ estimate is now much higher than the $OLS$ estimate. While we do not know whether the difference is statistically significant, this does not mesh with the omitted ability bias from $OLS$. It could be that $sibs$ is also correlated with ability: more siblings means, on average, less parental attention, which could result in lower ability. Another interpretation is that the $OLS$ estimator is biased toward zero because of measurement error in $educ$. This is not entirely convincing because, as we discussed in Section 9.3, educ is unlikely to satisfy the classical errors-in-variables model."

+\newpage

**`Example 15.5:` Return to Education for Working Women**

-> "We estimate equation 15.40 using the data in $mroz$. First, we test $H_0: \pi_3 = 0, \pi_4 = 0$ in 15.41 using an $F$ test. The result is $F = 55.40$, and $p-value = 0.0000$. As expected, $educ$ is partially correlated with parents education."
- -> "When we estimate 15.40 by 2SLS, we obtain, in equation form," $$\widehat{log(wage)} = \beta_0 + \beta_1educ + \beta_2exper + \beta_3exper^2$$ @@ -733,33 +604,18 @@ wage_educ_exper_IV <- ivreg(lwage ~ educ + exper + expersq | exper + expersq + m stargazer(wage_educ_exper_IV, single.row = TRUE, header = FALSE) ``` -> "The estimated return to education is about 6.1%, compared with an $OLS$ estimate of about 10.8%. Because of its relatively large standard error, the 2SLS estimate is barely statistically significant at the 5% level against a two-sided alternative." - +\newpage ## Chapter 16: Simultaneous Equations Models **`Example 16.4:` INFLATION AND OPENNESS** -> "Romer (1993) proposes theoretical models of inflation that imply that more "open" countries should have lower inflation rates. His empirical analysis explains average annual inflation rates (since 1973) in terms of the average share of imports in gross domestic product since 1973 - which is his measure of openness. In addition to estimating -the key equation by OLS, he uses instrumental variables. While Romer does not specify -both equations in a simultaneous system, he has in mind a two-equation system:" $$inf = \beta_{10} + \alpha_1open + \beta_{11}log(pcinc) + \mu_1$$ $$open = \beta_{20} + \alpha_2inf + \beta_{21}log(pcinc) + \beta_{22}log(land) + \mu_2$$ - -> "where $pcinc$ is 1980 per capita income, in U.S. dollars, assumed to be exogenous, and -$land$ is the land area of the country in square miles, also assumed to be exogenous. The first equation is the one of interest, with the hypothesis that $\alpha < 0$. More open -economies have lower inflation rates." - -> "The second equation reflects the fact that the degree of openness might depend on the average inflation rate, as well as other factors. The variable $log(pcinc)$ appears in both equations, but $log(land)$ is assumed to appear only in the second equation. -The idea is that, ceteris paribus, a smaller country is likely to be more open, so $\beta_{22} < 0$." - -> "Using the identification rule that was stated earlier, the first equation is identified, provided $\beta_{22} \ne 0$. The second equation is $not$ identified because it contains both exogenous variables. Be we are interested in the first equation. - **`Example 16.6:` INFLATION AND OPENNESS** -> "Before we estimate the first equation in 16.4 using the data in $openness$, we check to see whether $open$ has sufficient partial correlation with the proposed $IV$, $log(land)$. The reduced form regression is:" $$\widehat{open} = \beta_0 + \beta_{1}log(pcinc) + \beta_{2}log(land)$$ @@ -770,10 +626,6 @@ data("openness") open_model <-lm(open ~ lpcinc + lland, data = openness) ``` -> "The $t$ statistic on $log(land)$ is over nine in absolute value, which verifies Romer's assertion that smaller countries are more open. The fact that $log(pcinc)$ is so insignificant in this regression is irrelevant." - -> "Estimating the first equation using $log(land)$ as an $IV$ for $open$ gives:" - $$\widehat{inf} = \beta_0 + \beta_{1}open + \beta_{2}log(pcinc)$$ ```{r} @@ -786,21 +638,14 @@ inflation_IV <- ivreg(inf ~ open + lpcinc | lpcinc + lland, data = openness) stargazer(open_model, inflation_IV, single.row = TRUE, header = FALSE) ``` -> "The coefficient on open is statistically significant at about the 1% level against a one sided alternative of $\alpha_1 < 0$. 
The effect is economically important as well: for every percentage point increase in the import share of GDP, annual inflation is about 1/3 of a percentage point lower. For comparison, the OLS estimate is -0.215, $se = 0.095$."" \newpage + ## Chapter 17: Limited Dependent Variable Models and Sample Selection Corrections **`Example 17.3:` POISSON REGRESSION FOR NUMBER OF ARRESTS** -> "We now apply the Poisson regression model to the arrest data in $crime1$ data, used, -among other places, in Example 9.1. The dependent variable, $narr86$, is the number of -times a man is arrested during 1986. This variable is zero for 1,970 of the 2,725 men in the -sample, and only eight values of $narr86$ are greater than five. Thus, a Poisson regression -model is more appropriate than a linear regression model. The table below also presents the -results of OLS estimation of a linear regression model." - ```{r, tidy=TRUE, warning=FALSE} data("crime1") @@ -816,25 +661,14 @@ stargazer(econ_crime_model, econ_crim_poisson, single.row = TRUE, header = FALS ``` -> "The standard errors for OLS are the usual ones; we could certainly have made these -robust to heteroskedasticity. The standard errors for Poisson regression are the usual -maximum likelihood standard errors. Because $\hat{\sigma} = 1.232$, the standard errors for Poisson regression should be inflated by this factor (so each corrected standard error is about 23% higher). For example, a more reliable standard error for $tottime$ is $1.23(.015) \approx 0.0185$, which gives a $t$ statistic of about 1.3. The adjustment to the standard errors reduces the significance of all variables, but several of them are still very statistically significant." - -> "The OLS and Poisson coefficients are not directly comparable, and they have very -different meanings. For example, the coefficient on $pcnv$ implies that, if $\Delta pcnv = -.10$, the expected number of arrests falls by $0.013$ ($pcnv$ is the proportion of prior arrests that led to conviction). The Poisson coefficient implies that $\Delta pncv = 0.10$ reduces expected arrests by about 4% [0.402(.10) = 0.0402, and we multiply this by 100 to get the percentage effect]. As a policy matter, this suggests we can reduce overall arrests by about 4% if we can increase the probability of conviction by 0.1." - \newpage ## Chapter 18: Advanced Time Series Topics **`Example 18.8:` FORECASTING THE U.S. UNEMPLOYMENT RATE** -> "We use the $PHILLIPS$ data set, but only for the years 1948 through 1996, to forecast -the U.S. civilian unemployment rate for 1997. We use two models. The first is a simple -AR(1) model for $unem$:" $$\widehat{unemp_t} = \beta_0 + \beta_1unem_{t-1}$$ -> "In a second model, we add inflation with a lag of one year:" $$\widehat{unemp_t} = \beta_0 + \beta_1unem_{t-1} + \beta_2inf_{t-1}$$ @@ -850,13 +684,5 @@ unem_inf_VAR1 <- lm(unem ~ unem_1 + inf_1, data = phillips, subset = (year <= 19 stargazer(unem_AR1, unem_inf_VAR1, single.row = TRUE, header = FALSE) ``` -> "The lagged inflation rate is very significant in the second model $(t \approx 4.5)$, and the adjusted R-squared much higher than that from the first. Nevertheless, this does not necessarily mean that the second equation will produce a better forecast for 1997. All we can say so far is that, using the data up through 1996, a lag of inflation helps to explain variation in the unemployment rate." - -> "To obtain the forecasts for 1997, we need to know $unemployment$ and $inflation$ in 1996. These are 5.4 and 3.0, respectively. 
Therefore, the forecast of $unem_{1997}$ from the first equation is $1.572 + .732(5.4)$, or about $5.52$. The forecast from the second equation is $1.304 + 0.647(5.4) + 0.184(3.0)$, or about $5.35$. The actual civilian unemployment rate for 1997 was $4.9$, so both equations overpredict the actual rate. The second equation does provide a somewhat better forecast."
-
-> "We can easily obtain a 95% forecast interval. When we regress $unem_1$ on $(unem_{t-1} - 5.4)$ and $(inf_{t-1} - 3.0)$, we obtain $5.35$ as the intercept - which we already computed as the forecast - and $se({\hat{f_n}}) = 0.137$. Therefore, because $\hat{\sigma} = 0.883$, we have $se({\hat{e_{n+1}}}) = [(0.137)^2 + (0.883)^2]^{1/2} \approx 0.894$. The 95% forecast interval of $\hat{f_n} \pm 1.96*se(\hat{e_{n-1}})$ is $5.35 \pm 1.96(0.894)$, or about [3.6, 7.1]. This is a wide interval, and the realized 1997 value, $4.9$, is well within the interval. As expected, the standard error of $\mu_{n+1}$, which is .883, is a very large fraction of $se(\hat{e_{n-1}})$"
diff --git a/vignettes/wooldridge-vignette.html b/vignettes/wooldridge-vignette.html
new file mode 100644
index 0000000..12a532c
--- /dev/null
+++ b/vignettes/wooldridge-vignette.html
@@ -0,0 +1,559 @@
## Introduction
This vignette contains examples of using R with “Introductory Econometrics: A Modern Approach” by Jeffrey M. Wooldridge. Each example illustrates how to load data, run econometric models, and view the results with R.

While the course companion site also provides publicly available data sets for EViews, Excel, Minitab, and Stata commercial software products, R is an open-source option. Furthermore, taking the step to use R while building a foundation in econometrics offers the curious student a gateway to advanced topics available in the greater package ecosystem.
First, load the wooldridge package to access data in the manner specified in each example.

library(wooldridge)
## Chapter 2: The Simple Regression Model

**Example 2.10: A Log Wage Equation**

$$\widehat{log(wage)} = \beta_0 + \beta_1educ$$

First, load the wage1 data.

data(wage1)

Next, estimate a linear relationship between the log of wage and education.

log_wage_model <- lm(lwage ~ educ, data = wage1)

Finally, print the coefficients and $R^2$.

stargazer(log_wage_model, single.row = TRUE, header = FALSE)

## Chapter 3: Multiple Regression Analysis: Estimation

**Example 3.2: Hourly Wage Equation**

$$\widehat{log(wage)} = \beta_0 + \beta_1educ + \beta_2exper + \beta_3tenure$$

Estimate the model regressing education, experience, and tenure against log(wage).

hourly_wage_model <- lm(lwage ~ educ + exper + tenure, data = wage1)

Again, print the estimated model coefficients:

stargazer(hourly_wage_model, single.row = TRUE, header = FALSE)

## Chapter 4: Multiple Regression Analysis: Inference

**Example 4.7: Effect of Job Training on Firm Scrap Rates**

First, load the jtrain data set.

data("jtrain")

Next, create a logical index identifying which observations occur in 1987 and are non-union.

index <- jtrain$year == 1987 & jtrain$union == 0

Next, subset the jtrain data by the new index. This returns a data.frame of jtrain data of non-union firms for the year 1987.

jtrain_1987_nonunion <- jtrain[index,]

Now create the linear model regressing hrsemp (total hours of training divided by total employees trained), the log of annual sales, and the log of the number of employees, against the log of the scrap rate.

$$lscrap = \alpha + \beta_1 hrsemp + \beta_2 lsales + \beta_3 lemploy$$

linear_model <- lm(lscrap ~ hrsemp + lsales + lemploy, data = jtrain_1987_nonunion)

Finally, print the complete summary statistic diagnostics of the model.

stargazer(linear_model, single.row = TRUE, header = FALSE)

## Chapter 5: Multiple Regression Analysis: OLS Asymptotics

**Example 5.3: Economic Model of Crime**

$$narr86 = \beta_0 + \beta_1pcnv + \beta_2avgsen + \beta_3tottime + \beta_4ptime86 + \beta_5qemp86 + \mu$$

$narr86:$ number of times arrested, 1986.

$pcnv:$ proportion of prior arrests leading to convictions.

$avgsen:$ average sentence served, length in months.

$tottime:$ time in prison since reaching the age of 18, length in months.

$ptime86:$ months in prison during 1986.

$qemp86:$ quarters employed, 1986.

Load the crime1 data set containing arrests during the year 1986 and other information on 2,725 men born in either 1960 or 1961 in California. First, estimate the restricted model by regressing narr86 on pcnv, ptime86, and qemp86; the variables avgsen and tottime are excluded.

data(crime1)

restricted_model <- lm(narr86 ~ pcnv + ptime86 + qemp86, data = crime1)

We obtain the residuals $\tilde{\mu}$ from this regression, 2,725 of them.

restricted_model_u <- restricted_model$residuals

Next, we run the regression of:

$$\tilde{\mu} = \beta_1pcnv + \beta_2avgsen + \beta_3tottime + \beta_4ptime86 + \beta_5qemp86$$

This second regression produces $R^2_{\mu}$, which turns out to be about 0.0015.

LM_u_model <- lm(restricted_model_u ~ pcnv + ptime86 + qemp86 + avgsen + tottime,
    data = crime1)

summary(LM_u_model)$r.square

## [1] 0.001493846

This may seem small, but we must multiply it by $n$ to get the $LM$ statistic:

$$LM = 2,725(0.0015)$$

LM_test <- nobs(LM_u_model) * 0.0015
LM_test

## [1] 4.0875

The 10% critical value of a chi-square distribution with two degrees of freedom is about 4.61:

qchisq(1 - 0.10, 2)

## [1] 4.60517

Since 4.09 < 4.61, we fail to reject the $H_0$ at the 10% level. The p-value is $$P(X^2_{2} > 4.09) \approx 0.129,$$ so we would reject the $H_0$ only at the 15% level.

1-pchisq(LM_test, 2)

## [1] 0.129542
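As a cross-check, the LM statistic can also be computed from the unrounded $R^2$ stored in the model object rather than the rounded 0.0015 — a minimal sketch using only objects created above:

```{r}
# LM statistic from the exact auxiliary R-squared; compare with the rounded 4.0875 above
LM_exact <- nobs(LM_u_model) * summary(LM_u_model)$r.squared
LM_exact
```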

## Chapter 6: Multiple Regression: Further Issues

**Example 6.1: Effects of Pollution on Housing Prices, standardized.**

$$price = \beta_0 + \beta_1nox + \beta_2crime + \beta_3rooms + \beta_4dist + \beta_5stratio + \mu$$

$price$: median housing price.

$nox$: nitrogen oxide concentration; parts per million.

$crime$: number of reported crimes per capita.

$rooms$: average number of rooms in houses in the community.

$dist$: weighted distance of the community to 5 employment centers.

$stratio$: average student-teacher ratio of schools in the community.

$$\widehat{zprice} = \beta_1znox + \beta_2zcrime + \beta_3zrooms + \beta_4zdist + \beta_5zstratio$$

First, load the hprice2 data.

data(hprice2)

Next, estimate the coefficients with the usual lm regression model, but this time standardize each variable by wrapping it in R’s scale function:

housing_standard <- lm(scale(price) ~ 0 + scale(nox) + scale(crime) + scale(rooms) +
    scale(dist) + scale(stratio), data = hprice2)

stargazer(housing_standard, single.row = TRUE, header = FALSE)
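Because each beta coefficient is just the level-model coefficient rescaled by a ratio of standard deviations, the scale() fit can be sanity-checked against an unstandardized regression. A minimal sketch; housing_level is introduced here purely for illustration:

```{r}
# Level-level fit, then rescale the nox slope by sd(x)/sd(y);
# this should match the standardized (beta) coefficient from the scale() model
housing_level <- lm(price ~ nox + crime + rooms + dist + stratio, data = hprice2)
coef(housing_level)["nox"] * sd(hprice2$nox) / sd(hprice2$price)
```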
**Example 6.2: Effects of Pollution on Housing Prices, Quadratic Interactive Term**

We modify the housing model, adding a quadratic term in rooms:

$$log(price) = \beta_0 + \beta_1log(nox) + \beta_2log(dist) + \beta_3rooms + \beta_4rooms^2 + \beta_5stratio + \mu$$

housing_interactive <- lm(lprice ~ lnox + log(dist) + rooms + I(rooms^2) + stratio, data = hprice2)

Let's compare the results with the model from example 6.1.

stargazer(housing_standard, housing_interactive, single.row = TRUE, header = FALSE)

## Chapter 7: Multiple Regression Analysis with Qualitative Information

**Example 7.4: Housing Price Regression, Qualitative Binary Variable**

This time we use the hprice1 data.

data(hprice1)

Having just worked with hprice2, it may be helpful to view the documentation on this data set and read the variable names.

?hprice1

$$\widehat{log(price)} = \beta_0 + \beta_1log(lotsize) + \beta_2log(sqrft) + \beta_3bdrms + \beta_4colonial$$

Estimate the coefficients of the above linear model on the hprice1 data set.

housing_qualitative <- lm(lprice ~ llotsize + lsqrft + bdrms + colonial, data = hprice1)

stargazer(housing_qualitative, single.row = TRUE, header = FALSE)
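Since the outcome is log(price), the coefficient on colonial is only an approximate percentage effect; the exact premium implied by the estimate can be recovered with a quick check:

```{r}
# Exact percentage premium for a colonial-style house, holding other factors fixed;
# the text's approximation reads the coefficient of about 0.054 as roughly 5.4%
100 * (exp(coef(housing_qualitative)["colonial"]) - 1)
```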

## Chapter 8: Heteroskedasticity

**Example 8.9: Determinants of Personal Computer Ownership**

$$\widehat{PC} = \beta_0 + \beta_1hsGPA + \beta_2ACT + \beta_3parcoll$$

Create a new variable, parcoll, combining fathcoll and mothcoll. This new column indicates if either parent went to college.

data("gpa1")
gpa1$parcoll <- as.integer(gpa1$fathcoll==1 | gpa1$mothcoll)

GPA_OLS <- lm(PC ~ hsGPA + ACT + parcoll, data = gpa1)

Next, calculate the weights and then pass them to the same linear model to estimate it by Weighted Least Squares (WLS).

weights <- GPA_OLS$fitted.values * (1-GPA_OLS$fitted.values)

GPA_WLS <- lm(PC ~ hsGPA + ACT + parcoll, data = gpa1, weights = 1/weights)

Compare the OLS and WLS models in the table below:

stargazer(GPA_OLS, GPA_WLS, single.row = TRUE, header = FALSE)
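The text also compares the usual and heteroskedasticity-robust standard errors for the OLS fit. One way to compute the robust version — a sketch assuming the sandwich and lmtest packages (both installed as dependencies of AER, used later in this vignette) are available:

```{r}
library(lmtest)
library(sandwich)

# Heteroskedasticity-robust (HC1) standard errors for the OLS fit
coeftest(GPA_OLS, vcov = vcovHC(GPA_OLS, type = "HC1"))
```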

## Chapter 9: More on Specification and Data Issues

**Example 9.8: R&D Intensity and Firm Size**

$$rdintens = \beta_0 + \beta_1sales + \beta_2profmarg + \mu$$

Load the data and run the model.

data(rdchem)

all_rdchem <- lm(rdintens ~ sales + profmarg, data = rdchem)

Notice the outlier on the far right of the plot.

plot_title <- "FIGURE 9.1: Scatterplot of R&D intensity against firm sales"
x_axis <- "firm sales (in millions of dollars)"
y_axis <- "R&D as a percentage of sales"

plot(rdintens ~ sales, pch = 21, bg = "lightblue", data = rdchem, main = plot_title,
    xlab = x_axis, ylab = y_axis)

In terms of sales, the outlying firm is over twice as large as every other firm, so we estimate the model again without it.

smallest_rdchem <- lm(rdintens ~ sales + profmarg, data = rdchem,
                      subset = (sales < max(sales)))

The table below compares the results of both models side by side. By removing the outlier firm, $sales$ becomes a more significant determinant of R&D expenditures.

stargazer(all_rdchem, smallest_rdchem, single.row = TRUE, header = FALSE)

## Chapter 10: Basic Regression Analysis with Time Series Data

**Example 10.2: Effects of Inflation and Deficits on Interest Rates**

$$\widehat{i3} = \beta_0 + \beta_1inf_t + \beta_2def_t$$

data("intdef")

tbill_model <- lm(i3 ~ inf + def, data = intdef)

stargazer(tbill_model, single.row = TRUE, header = FALSE)

**Example 10.11: Seasonal Effects of Antidumping Filings**

In Example 10.5, we used monthly data (in the file BARIUM) that have not been seasonally adjusted.

data("barium")
barium_imports <- lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 +
    afdec6, data = barium)

Adding monthly dummy variables checks whether the important conclusions change once seasonality is accounted for.

barium_seasonal <- lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 +
    afdec6 + feb + mar + apr + may + jun + jul + aug + sep + oct + nov + dec,
    data = barium)

barium_anova <- anova(barium_imports, barium_seasonal)

stargazer(barium_imports, barium_seasonal, single.row = TRUE, header = FALSE)

stargazer(barium_anova, single.row = TRUE, header = FALSE)
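The text reports a joint p-value of about 0.5852 for the 11 seasonal dummies, so they are jointly insignificant; that p-value can be read directly off the anova comparison:

```{r}
# p-value of the joint F test comparing the models with and without seasonal dummies
barium_anova[2, "Pr(>F)"]
```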

## Chapter 11: Further Issues in Using OLS with Time Series Data

**Example 11.7: Wages and Productivity**

$$\widehat{log(hrwage_t)} = \beta_0 + \beta_1log(outphr_t) + \beta_2t + \mu_t$$

data("earns")

wage_time <- lm(lhrwage ~ loutphr + t, data = earns)

Because both series appear to have unit roots, we also estimate the equation in first differences, where a time trend is no longer needed.

wage_diff <- lm(diff(lhrwage) ~ diff(loutphr), data = earns)

stargazer(wage_time, wage_diff, single.row = TRUE, header = FALSE)
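The unit-root concern comes from the very high persistence the text reports: a first-order autocorrelation of about .967 for linearly detrended log(hrwage). A minimal sketch of that check:

```{r}
# First-order autocorrelation of the linearly detrended series; text reports about .967
detrended <- residuals(lm(lhrwage ~ t, data = earns))
cor(detrended[-1], detrended[-length(detrended)])
```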

## Chapter 12: Serial Correlation and Heteroskedasticity in Time Series Regressions

**Example 12.4: Prais-Winsten Estimation in the Event Study**

data("barium")
barium_model <- lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6,
    data = barium)
# Load the `prais` package, use the `prais.winsten` function to estimate.
library(prais)
barium_prais_winsten <- prais.winsten(lchnimp ~ lchempi + lgas + lrtwex + befile6 +
    affile6 + afdec6, data = barium)

Print both models to the console to compare their results.

barium_model

## 
## Call:
## lm(formula = lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + 
##     afdec6, data = barium)
## 
## Coefficients:
## (Intercept)      lchempi         lgas       lrtwex      befile6  
##   -17.80300      3.11719      0.19635      0.98302      0.05957  
##     affile6       afdec6  
##    -0.03241     -0.56524

barium_prais_winsten

## [[1]]
## 
## Call:
## lm(formula = fo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01146 -0.39152  0.06758  0.35063  1.35021 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## Intercept -37.07771   22.77830  -1.628   0.1061    
## lchempi     2.94095    0.63284   4.647 8.46e-06 ***
## lgas        1.04638    0.97734   1.071   0.2864    
## lrtwex      1.13279    0.50666   2.236   0.0272 *  
## befile6    -0.01648    0.31938  -0.052   0.9589    
## affile6    -0.03316    0.32181  -0.103   0.9181    
## afdec6     -0.57681    0.34199  -1.687   0.0942 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5733 on 124 degrees of freedom
## Multiple R-squared:  0.9841, Adjusted R-squared:  0.9832 
## F-statistic:  1096 on 7 and 124 DF,  p-value: < 2.2e-16
## 
## 
## [[2]]
##        Rho Rho.t.statistic Iterations
##  0.2932171        3.483363          8

Example 12.8: Heteroskedasticity and the Efficient Markets Hypothesis

+

\[return_t = \beta_0 + \beta_1return_{t-1} + \mu_t\]

+
data("nyse")
+ 
+return_AR1 <-lm(return ~ return_1, data = nyse)
+

\[\hat{\mu^2_t} = \beta_0 + \beta_1return_{t-1} + residual_t\]

+
return_mu <- residuals(return_AR1)
+
+mu2_hat_model <- lm(return_mu^2 ~ return_1, data = return_AR1$model)
+
stargazer(return_AR1, mu2_hat_model, single.row = TRUE, header = FALSE)
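This auxiliary regression is what lmtest's bptest() automates (in its studentized variant), so a one-line cross-check is available, assuming the lmtest package is installed:

```{r}
library(lmtest)

# Breusch-Pagan test of the AR(1) fit; regresses squared residuals on return_1 internally
bptest(return_AR1)
```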
**Example 12.9: ARCH in Stock Returns**

$$\hat{\mu}^2_t = \beta_0 + \beta_1\hat{\mu}^2_{t-1} + residual_t$$

We still have return_mu in the working environment, so we can use it to create $\hat{\mu}^2_t$ (mu2_hat) and $\hat{\mu}^2_{t-1}$ (mu2_hat_1). Notice the use of R's vector subsetting to perform the lag operation: dropping the first observation of return_mu and squaring the result gives mu2_hat, while dropping the last observation, via the negative index built from NROW(return_mu), gives mu2_hat_1. Both now contain $688$ observations, and we can run a standard linear model.

mu2_hat  <- return_mu[-1]^2

mu2_hat_1 <- return_mu[-NROW(return_mu)]^2

arch_model <- lm(mu2_hat ~ mu2_hat_1)

stargazer(arch_model, single.row = TRUE, header = FALSE)
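The text stresses that while the squared residuals are autocorrelated, the residuals themselves are not (it reports $\hat{\rho} = 0.0014$). A minimal sketch of that companion regression, reusing return_mu:

```{r}
# Regress the (unsquared) residuals on their own lag; text reports rho-hat = 0.0014
resid_ar1 <- lm(return_mu[-1] ~ return_mu[-NROW(return_mu)])
coef(resid_ar1)[2]
```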

## Chapter 13: Pooling Cross Sections across Time: Simple Panel Data Methods

**Example 13.7: Effect of Drunk Driving Laws on Traffic Fatalities**

$$\widehat{\Delta{dthrte}} = \beta_0 + \beta_1\Delta{open} + \beta_2\Delta{admin}$$

data("traffic1")

DD_model <- lm(cdthrte ~ copen + cadmn, data = traffic1)

stargazer(DD_model, single.row = TRUE, header = FALSE)

## Chapter 14: Advanced Panel Data Methods

**Example 14.1: Effect of Job Training on Firm Scrap Rates**

In this section, we estimate a linear panel model using the plm function in the plm: Linear Models for Panel Data package.

library(plm)

data("jtrain")

scrap_panel <- plm(lscrap ~ d88 + d89 + grant + grant_1, data = jtrain, index = c("fcode",
    "year"), model = "within", effect = "individual")

stargazer(scrap_panel, single.row = TRUE, header = FALSE)
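The text notes that dropping the lagged grant indicator shrinks the grant estimate to about -0.082 (t = -0.65), so the lag matters. A quick sketch of that comparison; scrap_nolag is illustrative:

```{r}
# Re-estimate the within model without grant_1 and inspect the grant coefficient
scrap_nolag <- plm(lscrap ~ d88 + d89 + grant, data = jtrain,
                   index = c("fcode", "year"), model = "within", effect = "individual")
coef(scrap_nolag)["grant"]
```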

## Chapter 15: Instrumental Variables Estimation and Two Stage Least Squares

**Example 15.1: Estimating the Return to Education for Married Women**

$$log(wage) = \beta_0 + \beta_1educ + \mu$$

For comparison, first obtain the OLS estimates:

data("mroz")
wage_educ_model <- lm(lwage ~ educ, data = mroz)

Next, use father's education as an instrument for educ, which requires that the two be correlated:

$$\widehat{educ} = \beta_0 + \beta_1fatheduc$$

We run the typical linear model, but notice the use of the subset argument. inlf is a binary variable in which a value of 1 means the woman is in the labor force. By subsetting the mroz data.frame to observations in which inlf==1, only working women will be in the sample.

fatheduc_model <- lm(educ ~ fatheduc, data = mroz, subset = (inlf==1))

In this section, we perform an instrumental-variable regression, using the ivreg function in the AER (Applied Econometrics with R) package.

library("AER")
wage_educ_IV <- ivreg(lwage ~ educ | fatheduc, data = mroz)

stargazer(wage_educ_model, fatheduc_model, wage_educ_IV, single.row = TRUE,
    header = FALSE)
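The text observes that the IV standard error is roughly two and a half times the OLS one, so the IV confidence interval even contains the OLS point estimate. A quick check, using a normal approximation for the IV fit:

```{r}
# 95% confidence intervals for the return to education: OLS vs. IV
confint(wage_educ_model, "educ")
confint(wage_educ_IV, "educ")
```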
**Example 15.2: Estimating the Return to Education for Men**

We now use the wage2 data to estimate the return to education for men, with the number of siblings, sibs, as an instrument for educ.

$$\widehat{educ} = \beta_0 + \beta_1sibs$$

data("wage2")

educ_sibs_model <- lm(educ ~ sibs, data = wage2)

$$\widehat{log(wage)} = \beta_0 + \beta_1educ$$

Again, we perform an instrumental-variable regression with the ivreg function from the AER package.

library("AER")

educ_sibs_IV <- ivreg(lwage ~ educ | sibs, data = wage2)

stargazer(educ_sibs_model, educ_sibs_IV, wage_educ_IV, single.row = TRUE, header = FALSE)
**Example 15.5: Return to Education for Working Women**

This time, estimate the model by two stage least squares, using both mother's and father's education as instruments for educ.

$$\widehat{log(wage)} = \beta_0 + \beta_1educ + \beta_2exper + \beta_3exper^2$$

data("mroz")
wage_educ_exper_IV <- ivreg(lwage ~ educ + exper + expersq | exper + expersq +
    motheduc + fatheduc, data = mroz)

stargazer(wage_educ_exper_IV, single.row = TRUE, header = FALSE)
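The text first verifies instrument relevance with an F test of the null that both parental-education coefficients are zero in the first stage, reporting F = 55.40. A sketch of that test, assuming the car package (a dependency of AER) and restricting to working women as the text does:

```{r}
library(car)

# First stage: educ on the exogenous regressors plus the two instruments
first_stage <- lm(educ ~ exper + expersq + motheduc + fatheduc,
                  data = mroz, subset = (inlf == 1))

# Joint F test of the excluded instruments; the text reports F = 55.40
linearHypothesis(first_stage, c("motheduc = 0", "fatheduc = 0"))
```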

## Chapter 16: Simultaneous Equations Models

**Example 16.4: INFLATION AND OPENNESS**

$$inf = \beta_{10} + \alpha_1open + \beta_{11}log(pcinc) + \mu_1$$ $$open = \beta_{20} + \alpha_2inf + \beta_{21}log(pcinc) + \beta_{22}log(land) + \mu_2$$

**Example 16.6: INFLATION AND OPENNESS**

First, check whether open has sufficient partial correlation with the proposed instrument, log(land), using the reduced form regression:

$$\widehat{open} = \beta_0 + \beta_{1}log(pcinc) + \beta_{2}log(land)$$

data("openness")

open_model <- lm(open ~ lpcinc + lland, data = openness)

Then estimate the first equation using log(land) as an instrument for open:

$$\widehat{inf} = \beta_0 + \beta_{1}open + \beta_{2}log(pcinc)$$

library(AER)

inflation_IV <- ivreg(inf ~ open + lpcinc | lpcinc + lland, data = openness)

stargazer(open_model, inflation_IV, single.row = TRUE, header = FALSE)

## Chapter 17: Limited Dependent Variable Models and Sample Selection Corrections

**Example 17.3: POISSON REGRESSION FOR NUMBER OF ARRESTS**

The dependent variable, narr86, is zero for 1,970 of the 2,725 men in the sample, so a Poisson regression model is more appropriate than a linear one. For comparison, we estimate both.

data("crime1")

formula <- (narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + inc86 + black +
    hispan + born60)

econ_crime_model <- lm(formula, data = crime1)

econ_crim_poisson <- glm(formula, data = crime1, family = poisson)

stargazer(econ_crime_model, econ_crim_poisson, single.row = TRUE, header = FALSE)
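The text inflates the Poisson standard errors by $\hat{\sigma} = 1.232$ to allow for over- or underdispersion. One hedged way to estimate that dispersion is a quasi-Poisson re-fit, whose Pearson-based dispersion estimate plays the role of $\hat{\sigma}^2$:

```{r}
# Quasi-Poisson fit: same coefficients, dispersion estimated from Pearson residuals
econ_crim_quasi <- glm(formula, data = crime1, family = quasipoisson)
summary(econ_crim_quasi)$dispersion
```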

## Chapter 18: Advanced Time Series Topics

**Example 18.8: FORECASTING THE U.S. UNEMPLOYMENT RATE**

We use the phillips data, for the years 1948 through 1996 only, to forecast the U.S. civilian unemployment rate for 1997. The first model is a simple AR(1) for unem:

$$\widehat{unemp_t} = \beta_0 + \beta_1unem_{t-1}$$

In a second model, we add inflation with a lag of one year:

$$\widehat{unemp_t} = \beta_0 + \beta_1unem_{t-1} + \beta_2inf_{t-1}$$

data("phillips")

unem_AR1 <- lm(unem ~ unem_1, data = phillips, subset = (year <= 1996))

unem_inf_VAR1 <- lm(unem ~ unem_1 + inf_1, data = phillips, subset = (year <= 1996))

stargazer(unem_AR1, unem_inf_VAR1, single.row = TRUE, header = FALSE)
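To produce the 1997 forecasts the text discusses, plug the 1996 values (unem = 5.4, inf = 3.0) into each fitted model — a minimal sketch:

```{r}
# One-step-ahead forecasts for 1997; the text reports about 5.52 and 5.35
newdata_1997 <- data.frame(unem_1 = 5.4, inf_1 = 3.0)
predict(unem_AR1, newdata = newdata_1997)
predict(unem_inf_VAR1, newdata = newdata_1997)
```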
diff --git a/vignettes/wooldridge-vignette.pdf b/vignettes/wooldridge-vignette.pdf
index ef1de63..3530523 100644
Binary files a/vignettes/wooldridge-vignette.pdf and b/vignettes/wooldridge-vignette.pdf differ