ISQS 5349, Regression Analysis, Spring 2016

Course Syllabus
Old midterms and finals
Class recordings


Supplemental books: 
Practical Regression and Anova using R, by Julian Faraway.
Probabilistic Modeling in Computer Science, by Norm S. Matloff of UC Davis, a free book licensed under creative commons. See Ch. 22 in particular on regression. (I took my first course in regression analysis from Dr. Matloff around 1979!)

Helpful R materials:
Winston Chang’s Cookbook for R, a free book licensed under creative commons.
From UCLA’s Statistical Consulting Group: http://www.ats.ucla.edu/stat/r/seminars/intro.htm
swirl teaches you R programming and data science interactively, at your own pace, and right in the R console!
A start for this class showing how to access data and do basic things
An overview of R for statistics – everything you want, in a nutshell

A list of useful R functions

http://www.cyclismo.org/tutorial/R/
http://www.rstudio.com/ide/docs/
http://ww2.coastal.edu/kingw/statistics/R-tutorials/
http://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch03.pdf (graphics, from a founder of R, Ross Ihaka)


Class Topics (Rough schedule – this will change depending on student presentations. Dr. Westfall will update the class regularly on schedule changes.)

Preparation – Read and study everything listed under this heading. There will be a quiz at the beginning of class on the day listed. Refer back to these documents repeatedly. Links within the links are recommended and may aid understanding, but are not required.

R code, homework, etc.

1. 1/21 Smoothing: scatterplots, LOESS smoothers, the classical regression model and its assumptions.

Readings and videos. The quiz will be on 1/26 and will count double.

 

Approximating functions (regression functions for us) by linear terms

 

The regression function as a conditional expectation (Section 1 only, but the rest is good too)

 

Approximating regression functions using LOWESS in R
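For a first look, here is a minimal LOWESS sketch in R, using the built-in cars data rather than a course data set (the span f is an arbitrary choice):

plot(dist ~ speed, data = cars)                 # scatterplot of built-in data
lines(lowess(cars$speed, cars$dist, f = 2/3),   # LOWESS smooth; f is the span
      col = "red", lwd = 2)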

 

Regression, populations, and processes

 

Read this summary of the assumptions of the regression model

 

Read the Wikipedia entry (all of it) about the R² statistic
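As a quick check on the definition R² = 1 - SSE/SSTO, a sketch with simulated data, compared against what lm() reports:

x <- rnorm(100); y <- 1 + 2*x + rnorm(100)   # simulated data
fit <- lm(y ~ x)
sse  <- sum(resid(fit)^2)                    # error sum of squares
ssto <- sum((y - mean(y))^2)                 # total sum of squares
c(byHand = 1 - sse/ssto, fromLm = summary(fit)$r.squared)   # identical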

 

Read the Wikipedia entry about the Gauss-Markov theorem, up to the words “The theorem now states that the OLS estimator is a BLUE.”

HW 1, due Thursday 1/28

 

Initial regression analyses using R; introduction to some data sets we’ll be using.

 

Models produce data.  How different do the data look when they are produced by the same model?  How different do the data look when they are produced by different models?

 

Estimating curvature using LOESS smoothing: Toluca, Peak Energy, Car Sales, and Product complexity examples.

 

Understanding how to interpret LOESS smooths by simulating data where the true mean function is known.
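A minimal version of that idea (simulated data, so the truth is known):

x <- runif(100, 0, 10)
y <- 2 + 0.5*x + rnorm(100)                # the true mean function is the line 2 + 0.5x
plot(x, y)
abline(2, 0.5, lwd = 2)                    # the truth
lines(lowess(x, y), col = "red", lty = 2)  # any wiggles in the smooth here are pure noise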

 


Why is probability needed in the regression model?

2. 1/26  Maximum likelihood and least squares. The Gauss-Markov theorem.

 

(Today’s quiz covers all the readings for 1/21 and 1/26, shown in the box above, and counts double.)

Least squares demo

Why do assumptions matter?

Illustrating the Gauss-Markov property, both good and bad.
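A sketch of the kind of simulation that makes the Gauss-Markov property concrete: OLS versus another linear unbiased estimator of the slope (here, the crude two-point slope):

set.seed(1)
x <- sort(runif(20, 0, 10))
b.ols <- b.crude <- numeric(5000)
for (i in 1:5000) {
  y <- 1 + 2*x + rnorm(20)                      # classical model; true slope = 2
  b.ols[i]   <- coef(lm(y ~ x))[2]              # OLS slope estimate
  b.crude[i] <- (y[20] - y[1])/(x[20] - x[1])   # another linear unbiased estimator
}
c(mean(b.ols), mean(b.crude))    # both near 2: unbiased
c(var(b.ols), var(b.crude))      # OLS has the smaller variance (the BLUE)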

 


Why is probability needed in the regression model?

3. 1/28 Exact Inferences in the classical parametric regression model: p-values, confidence and prediction intervals.

Read this discussion of a confidence interval for the slope, from Doug Stirling, Massey University in Palmerston North, New Zealand.

 

Read this document on interpreting p-values, by Dr. Westfall.

 

Read this document, “Why you should never say ‘Accept H0’,” written by Dr. Westfall

 

Read this discussion of confidence intervals for E(Y|X=x) versus Prediction interval for Y|X=x, from “Musings on Using and Misusing Statistics,” by Martha K. Smith, retired UT professor.

 

Read this document on “Prediction and Generalization,” written by Dr. Westfall.

 

Read this document on “Confidence intervals and significance tests as predictions,” written by Dr. Westfall.

HW 2, due Thursday 2/4

 

 

Confidence intervals for slope and intercepts.

 

Understanding the standard error: a simulation study.

 

Constructing frequentist confidence and prediction intervals; GPA/SAT example and Toluca example; Constructing the corresponding Bayesian intervals using the same examples.

Bayesian analysis using a transformation to address the obvious nonnormality of the GPA distribution
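For the frequentist intervals, the mechanics in R are just confint() and predict(); a sketch with simulated stand-in data (not the course GPA/SAT data):

x <- rnorm(50, 500, 100); y <- 1 + 0.004*x + rnorm(50, 0, 0.4)   # stand-in data
fit <- lm(y ~ x)
confint(fit)                                 # confidence intervals for the betas
new <- data.frame(x = 600)
predict(fit, new, interval = "confidence")   # CI for E(Y|X=600)
predict(fit, new, interval = "prediction")   # PI for a new Y when X=600 (wider)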

 

 


Why is probability needed in the regression model?

4. 2/2 Checking the assumptions of the classic model

Read the document, “How to check assumptions using data,” written by Dr. Westfall, and run the R code therein.

 

Read and run the code in the document, “Why do assumptions matter?” written by Dr. Westfall

 

Read the document, “Comments on Transformations,” written by Dr. Westfall.

HW 3, due 2/11. Write a report showing how to replicate all the analyses (all statistics, model equations, tables, and graphs) in this paper using R rather than Minitab. Include all code and outputs, as well as surrounding words, in a professional, clean, publication-quality document. Use a “tutorial” style of presentation, aimed at teaching a person (say, someone not in this class) how to do everything. (Note: it is not necessary to make the graphs appear in the “panel” display as in the paper, where all four are shown together. Instead, you can show them separately. Just be sure to label them clearly in your report, e.g., “Upper right graph of Figure 6.”)

 

Testing for curvature using quadratic regression: Toluca, Peak Energy, Car Sales, and Product Complexity examples

Statistical versus practical significance: A demonstration of the difference.

Estimating the relationship between mean absolute residual and predictor variable using LOESS smoothing, as well as testing for existence of heteroscedasticity: Toluca and Peak Energy examples.

Understanding how to interpret LOESS smooths of absolute residuals by simulating data where the true variance function is known.

Evaluating the normality assumption using q-q plots and Shapiro-Wilk hypothesis test: Toluca, Peak Energy, and Product Complexity examples.

Understanding how to interpret q-q plots by simulating data where the true error distribution is known.
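A minimal version of both checks (simulated data, so the error distribution is known to be normal):

x <- runif(60, 0, 10)
y <- 2 + 0.5*x + rnorm(60)              # errors truly normal here
fit <- lm(y ~ x)
qqnorm(resid(fit)); qqline(resid(fit))  # points near the line => normality plausible
shapiro.test(resid(fit))                # Shapiro-Wilk test of the residuals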

 

Why is probability needed in the regression model?

5. 2/4  Using transformations to achieve a more reasonable model. The multiple regression model.

Read this introduction to multiple regression analysis

 

Lance Car Sales example: Analysis of the model using the x^(-1) (i.e., 1/x) transformation.

Peak Energy Use example: Analysis of model using ln(y) transformation.

Box-Cox transformations
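A sketch of the Box-Cox machinery via MASS::boxcox(), which plots the profile log likelihood over the transformation parameter lambda (lambda = 0 corresponds to ln(y)):

library(MASS)                                # boxcox() lives here
x <- runif(100, 1, 10)
y <- exp(0.2 + 0.3*x + rnorm(100, 0, 0.3))   # data for which ln(y) is the right scale
bc <- boxcox(lm(y ~ x))                      # profile log likelihood over lambda
bc$x[which.max(bc$y)]                        # ML choice of lambda; near 0 here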


Why is probability needed in the regression model?

6. 2/9 The multiple regression model, added variable plots

 

Note on presentations: Good example PowerPoint student presentations can be found on last Fall’s ISQS 6348 page (multivariate analysis), in the left column. DO NOT use last year’s ISQS 5349 presentations as examples – they were trying to present as if at a conference, rather than teaching the material (which was my fault, because I did not give adequate guidance).

Read about added variable plots (also called partial regression plots)
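The construction is just residuals regressed on residuals; a sketch (car::avPlots() automates this, if you have the car package installed):

x1 <- rnorm(100); x2 <- 0.5*x1 + rnorm(100)
y  <- 1 + x1 + 2*x2 + rnorm(100)
e.y  <- resid(lm(y ~ x1))     # y with the effect of x1 removed
e.x2 <- resid(lm(x2 ~ x1))    # x2 with the effect of x1 removed
plot(e.x2, e.y)               # the added variable plot for x2
abline(lm(e.y ~ e.x2))        # this slope equals the multiple regression coefficient
coef(lm(y ~ x1 + x2))[3]      # compare: the same slope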

 

Read Sections 1-7 of this matrix algebra preparation material, courtesy A. Colin Cameron, UC Davis (whoo hoo! My alma mater).

 

(Not required, but if you find yourself needing additional self-study on matrix algebra, see the SOS math tutorials matrix0, matrix1, and matrix2, and Introduction to matrix algebra using R. See also Matrix and linear algebra tutorials, and the MIT OpenCourseWare offerings (a free online course with separate lectures on linear and matrix algebra topics).)

Multiple regression analysis of how computer time to run a job relates to Ram and processor speed.

R code for Sales vs. Int rate and Gas Price example: How curvature can be explained by a third variable.

Visualizing the Multiple Regression model- 3-D and partial plots using EXCEL.

Simulation study showing that simple (Xj,Y) scatterplots and other diagnostics are not completely adequate to judge the adequacy of the multiple regression model.

Why is probability needed in the regression model?

7. 2/11  Matrix form of model, estimates, standard errors, t intervals and tests.

Read these presentation slides by Carlos Carvalho, UT Austin. 

 

Read the document “Prediction as association and prediction as causation,” written by Dr. Westfall. The document shows that you cannot infer causation using the regression model.

HW 4, due Thursday 2/18

The multivariate normal distribution (from Wikipedia)

Information on covariance matrices, from Wikipedia

The various matrices and regression calculations shown in R code.
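For instance, a minimal sketch of the matrix formulas, checked against lm():

x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 1 + 2*x1 - x2 + rnorm(30)
X  <- cbind(1, x1, x2)                      # design matrix, intercept column first
bhat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^(-1) X'y
mse  <- sum((y - X %*% bhat)^2)/(30 - 3)    # estimate of the error variance
covb <- mse * solve(t(X) %*% X)             # estimated covariance matrix of bhat
cbind(bhat, se = sqrt(diag(covb)))          # matches coef(summary(lm(y ~ x1 + x2)))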

Why is probability needed in the regression model?

8. 2/16  Causality

Read “Causal Inference using Regression on the Treatment Variable”, by Andrew Gelman, Sections 9.1 and 9.2 only. (The rest is great, but not required. Note: Gelman is a major dude.)

The computer speed example will be good for discussing causality.

Why is probability needed in the regression model?

9. 2/18 Multicollinearity; the ANOVA table, the F test, and the R-squared statistic

Read these presentation slides, slides 1-6, by Alicia Carriquiry of Iowa State.

Note: There is a mistake on Slide 4. It should read “…where R²j is the coefficient of determination obtained by regressing the jth predictor on all other predictors.” (In particular, the VIF has nothing to do with the Y data.)
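The corrected definition translates directly into R; a sketch with made-up predictors:

x2 <- rnorm(100); x3 <- rnorm(100)
x1 <- x2 + x3 + 0.5*rnorm(100)               # x1 nearly collinear with x2 and x3
r2j <- summary(lm(x1 ~ x2 + x3))$r.squared   # regress predictor j on the others
1/(1 - r2j)                                  # the VIF for x1; no Y data involved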

 

Read this document on multicollinearity, by Dr. Westfall

File to illustrate problems with multicollinearity

 R code for diagnosis and interpretation of multicollinear variables; also indicates one of many potential solutions to the problem.

Full model - reduced model F test.
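In R this is just anova() applied to two nested fits; a minimal sketch:

x1 <- rnorm(80); x2 <- rnorm(80); x3 <- rnorm(80)
y  <- 1 + x1 + rnorm(80)           # x2 and x3 truly irrelevant here
full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1)
anova(reduced, full)               # F test of H0: the x2 and x3 coefficients are both 0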

Why is probability needed in the regression model?

10. 2/23 Interactions; the inclusion principle

Read this document.

Code to explain why Var(Y-hat) is not constant, even though Var(Y) is constant.

x = 2*rnorm(10) + 10                           # predictor values
y = 2 + .4*x + rnorm(10)                       # data from the true model E(Y|X=x) = 2 + .4x
plot(x,y, xlim = c(0,20), ylim = c(0,10))
abline(2,.4, lwd=3)                            # the true regression line
abline(v=mean(x), col="red", lwd=2)            # fitted lines pivot near mean(x)
abline(lsfit(x,y), lty=2)                      # least-squares fit to this sample
abline(lsfit(x,2 + .4*x + rnorm(10)), lty=2)   # fit to a fresh sample from the same model

3-d graphs of interaction and non-interaction surfaces

Examining interactions – an R demo
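A minimal sketch of fitting and reading an interaction (the x1:x2 coefficient is the change in the x1 slope per unit of x2):

x1 <- rnorm(200); x2 <- rbinom(200, 1, 0.5)      # continuous and binary predictors
y  <- 1 + 0.5*x1 + x2 + 1.5*x1*x2 + rnorm(200)   # the x1 slope differs by group
fit <- lm(y ~ x1 * x2)                           # expands to x1 + x2 + x1:x2
summary(fit)$coefficients                        # the x1:x2 row estimates the slope change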

Moderator example, from Karl Wuensch's web page http://core.ecu.edu/psyc/wuenschk/. The publication is here.

A "Hand-drawn" graph using Excel of the moderating effect.

File to illustrate problems with violating the inclusion principle

 

Why is probability needed in the regression model?

11. 2/25 Midterm 1. Solutions.

 

 

12. 3/1 Variable and model selection

Read about the variance/bias trade-off here.

Comments: (1) Right below the first graph: “characterized” should be “characterize”. (2) Two sentences later, g(x) is called an “estimator” of f(x), and this persists through the article. This is not the usual usage of the term “estimator,” because an estimator is a function of random data. If you put a “^” on top of g(x) you could call it an estimator. Better to call it a “candidate mean function” or “supposed model” or something like that. Later on, though, he refers to g(x) as a function of data, and this is indeed an estimator.

 


Read summary comments on variable selection, data snooping, and a strategy for variable selection, by Dr. Westfall



The Law of Total Variance
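For reference, the law states that Var(Y) = E[Var(Y|X)] + Var(E[Y|X]); the second term, the variance of the regression function, is the part the predictors explain.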

A simulation example: Predicting Hans’ graduate GPA from GRE scores and Undergrad GPA.

R code to demonstrate the danger of overfitting.

An R simulation to illustrate that including extraneous variables does not cause bias, but does inflate the variance.

An R simulation file to illustrate the variance/bias tradeoff, and show why you might prefer biased estimates in terms of estimating the mean value.

An R simulation to illustrate the variance/bias trade-off in terms of parameter estimation.  Fitting the wrong (reduced) model results in biased parameter estimates, but they are sometimes more accurate than the unbiased estimates obtained from fitting the correct model.
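A compressed sketch of that simulation idea: with correlated predictors and a small x2 effect, the reduced model’s biased slope estimate can have smaller mean squared error than the full model’s unbiased one:

set.seed(2)
b.full <- b.red <- numeric(2000)
for (i in 1:2000) {
  x1 <- rnorm(30); x2 <- 0.9*x1 + sqrt(0.19)*rnorm(30)  # correlated predictors
  y  <- 1 + x1 + 0.1*x2 + rnorm(30)                     # x2 effect is small
  b.full[i] <- coef(lm(y ~ x1 + x2))[2]                 # unbiased for 1
  b.red[i]  <- coef(lm(y ~ x1))[2]                      # biased, but less variable
}
c(mse.full = mean((b.full - 1)^2), mse.red = mean((b.red - 1)^2))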

R file for producing and comparing n-fold cross-validation statistics for different models
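A bare-bones n-fold (leave-one-out) cross-validation loop for comparing two candidate models:

x <- runif(50, 0, 10); y <- 2 + 0.5*x + rnorm(50)
d <- data.frame(x, y)
press1 <- press2 <- 0
for (i in 1:nrow(d)) {
  m1 <- lm(y ~ x, data = d[-i, ])            # candidate 1: straight line
  m2 <- lm(y ~ poly(x, 5), data = d[-i, ])   # candidate 2: quintic (overfit)
  press1 <- press1 + (d$y[i] - predict(m1, d[i, ]))^2
  press2 <- press2 + (d$y[i] - predict(m2, d[i, ]))^2
}
c(press1, press2)    # smaller out-of-sample error wins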

All subsets model selection for predicting doctors per capita using R.

 

Why is probability needed in the regression model?

13. 3/3 Variable and model selection.


Re-read the midterm 1 Solutions. The quiz will be about this material. Be prepared to re-interpret the answers in a context that is relevant to you, whether it be an accounting context, a pfp context, a finance context, an engineering context, or an ag econ context. By “context,” I mean you need to pick your own Y and X variables that are relevant to your field of study, and then answer in terms of those variables in the quiz. If you are having trouble making the connection, I will be happy to help.

 

Remember, “I don’t want to hear what I want to hear.” So don’t talk about Hans or simulation. Make your answers specific to the processes at work in your field of study, whether they be behavioral, economic, biological, physical, political, etc.

 

Pick an example that you yourself are interested in, and do not duplicate others’ examples. Pick something you will look at for your thesis or in your future employ.

 

This quiz will count double.

The Law of Total Variance

R code to demonstrate the danger of overfitting.

An R simulation to illustrate that including extraneous variables does not cause bias, but does inflate the variance.

An R simulation file to illustrate the variance/bias tradeoff, and show why you might prefer biased estimates in terms of estimating the mean value.

An R simulation to illustrate the variance/bias trade-off in terms of parameter estimation.  Fitting the wrong (reduced) model results in biased parameter estimates, but they are sometimes more accurate than the unbiased estimates obtained from fitting the correct model.

R file for producing and comparing n-fold cross-validation statistics for different models

All subsets model selection for predicting doctors per capita using R

Why is probability needed in the regression model?

14. 3/8 Dummy variables, ANOVA, ANCOVA, ANCOVA with interactions, graphical summaries.  Student presentations.  PowerPoint file, R code.

 

 

Read this document through slide 24. A note on notation: either you should subscript all of the variables (which means that you are referring to the model for row “i” in your data frame) or you should subscript none of the variables (which means you are modelling your process more generically).

ANOVA/ANCOVA, first file – comparing GPAs of Male and Female students (two-level ANOVA/ANCOVA).

ANOVA/ANCOVA, second file – comparing GPAs of students in different degree plans.
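Both analyses reduce to lm() with a factor; a sketch using made-up data (the variable names here are hypothetical, not the course files):

sex <- factor(rep(c("F", "M"), each = 50))     # hypothetical grouping variable
sat <- rnorm(100, 1100, 150)                   # hypothetical covariate
gpa <- 2 + 0.001*sat + 0.15*(sex == "F") + rnorm(100, 0, 0.3)
summary(lm(gpa ~ sex))        # two-level ANOVA: difference in mean GPA
summary(lm(gpa ~ sex + sat))  # ANCOVA: same comparison, adjusted for SAT
summary(lm(gpa ~ sex * sat))  # ANCOVA with interaction: do the slopes differ?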

Why is probability needed in the regression model?

15. 3/10 Heteroscedasticity: WLS, ML estimation, robust standard errors.  Student presentations PowerPoint,    R code.

Update: Modified readings. I really like Cosma Shalizi of Carnegie Mellon. Expect to see more from him. Read pages 1-9 of this document.

 

Comments:
1. A variable with an arrow on top refers to a vector (list) of values. Not sure why he wants to use it for some vectors and not others.  For example, on p. 1, “beta” is also a vector but there is no arrow on top.

2. The assumed functional form of the heteroscedasticity, 1 + x^2/2, is somewhat unusual in that it makes the variance decrease then increase as x ranges from its min to its max. This might make sense in some cases where the X variable has zero in its range, as it does in Cosma’s simulation, but for typical cases where the X and Y data are positive, the heteroscedasticity function is either monotonically increasing or monotonically decreasing, with no “down then up” behavior.

3. “The oracle of regression” is what I sometimes refer to as “Hans.”

 

 

Optional: Read this critique of robust standard errors. The prof summarizes: If you think you have a problem with heteroscedasticity, you probably have more serious problems and should not be using OLS anyway. Using robust standard errors with OLS is like putting a bright red bow on the head of an ugly pig lying in the stinking pig slop, and then saying that the pig now looks pretty.

 

 

First file to illustrate benefit of Weighted Least Squares – shows imprecise predictions of OLS in the presence of heteroscedasticity

Comparison of prediction limits: Homoscedastic vs. Heteroscedastic models – shows OLS prediction limits are incorrect in the presence of heteroscedasticity

Estimating the heteroscedastic variance of GE returns as a function of trading volume via maximum likelihood using R. (Try different variance functions and compare log likelihoods to see which fit better.)

Obtaining heteroscedasticity-consistent standard errors using R
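A minimal sketch with the sandwich and lmtest packages (assuming both are installed):

library(sandwich); library(lmtest)
x <- runif(100, 1, 10)
y <- 1 + 0.5*x + rnorm(100, 0, 0.3*x)             # error variance grows with x
fit <- lm(y ~ x)
coeftest(fit)                                     # usual OLS standard errors (wrong here)
coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))  # heteroscedasticity-consistent SEs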

Why is probability needed in the regression model?

16. 3/22 Outliers

 

 

Read Dr. Westfall’s comments on outliers.

 

Read Section 3 from this nice document by Anne Boomsma.  (You might find the other sections very helpful as well, but only Section 3 for now.)

Note:  The paper is mostly good with minor errors.  One rather major mistake, though, is on the top of page 13:

“The statistic σ̂ gives an estimate of the residual standard error, as it is called; in fact, sqrt(MSE) is an unbiased estimator of the standard deviation of e_i.”

Actually, sqrt(MSE) is biased low, by Jensen’s inequality. This issue is covered in the ISQS 5347 textbook.

This code will get you started if you want to run the examples yourself.

install.packages("faraway")   # only needed once
library(faraway)              # the package accompanying Faraway's book
data(savings)                 # savings-rate data from the faraway package
M1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=savings)   # fit the full model

Outlier analysis
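Continuing the savings example above, a few standard influence diagnostics (a sketch, not a complete analysis):

rstudent(M1)          # externally studentized residuals (outliers in Y)
hatvalues(M1)         # leverages (outliers in X space)
cooks.distance(M1)    # overall influence on the fitted coefficients
plot(M1, which = 5)   # residuals vs. leverage, with Cook's distance contours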

Why is probability needed in the regression model?

17. 3/24  Quantile regression and Winsorizing – Student Presentations.  PowerPoint, R code for quantile, data file, R code for Winsorizing

Read this introduction to quantile regression

 

 

Data on weekly salaries from the BLS, from 2002 to 2014. Note that the 0.10 and 0.90 quantiles have different slopes.

EXCEL spreadsheet to explain the quantile estimation method

EXCEL spreadsheet to explain the quantile estimation method in the regression case – The CAPM regression model

The CAPM model via quantile regression
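A sketch with the quantreg package, echoing the salary data where the 0.10 and 0.90 quantiles have different slopes (simulated stand-in data):

library(quantreg)
x <- runif(200, 0, 10)
y <- 1 + 0.5*x + rnorm(200, 0, 0.2 + 0.1*x)   # spread grows with x
fit <- rq(y ~ x, tau = c(0.1, 0.5, 0.9))      # 10th, 50th, and 90th percentile lines
coef(fit)                                     # note the differing slopes
plot(x, y)
for (j in 1:3) abline(coef = coef(fit)[, j], lty = j)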

Why is probability needed in the regression model?

18. 3/29:  Review up to now. Generalized Least Squares, Correlated errors, Time series

Read this document from John Fox

 

 

John Fox code; also some code to show how the AR and MA data-generating processes work.

Simulation studies of inefficiency and Type I error rates when using OLS and ML/GLS:  An example with auto-correlated data.
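A minimal sketch of GLS with AR(1) errors via the nlme package:

library(nlme)
e <- as.numeric(arima.sim(list(ar = 0.7), n = 100))   # AR(1) errors
d <- data.frame(tm = 1:100)
d$y <- 1 + 0.05*d$tm + e
ols <- lm(y ~ tm, data = d)                           # ignores the autocorrelation
fit <- gls(y ~ tm, data = d, correlation = corAR1(form = ~ tm))   # models it
summary(fit)$tTable                                   # compare SEs with summary(ols)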

Why is probability needed in the regression model?

19. 3/31 Continuation of Generalized Least Squares, Correlated errors, Time series

Read this document from John Fox (again)

 

John Fox code; also some code to show how the AR and MA data-generating processes work.

Simulation studies of inefficiency and Type I error rates when using OLS and ML/GLS:  An example with auto-correlated data.

Why is probability needed in the regression model?

20. 4/5 Intro to Mixed Effects Models. Nice R video

Read this tutorial on lmer by Bodo Winter.

 

Some code from Bodo Winter’s tutorial

A data analysis indicating the problem with correlated observations: Standard errors are clearly wrong.
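In the spirit of Winter’s tutorial, a minimal lmer sketch with a made-up subject grouping:

library(lme4)
subject <- factor(rep(1:20, each = 5))   # 20 subjects, 5 observations each
u <- rnorm(20)                           # subject-level random bumps
x <- rnorm(100)
y <- 1 + 0.5*x + u[as.integer(subject)] + rnorm(100, 0, 0.5)
fit <- lmer(y ~ x + (1 | subject))       # random intercept for each subject
summary(fit)                             # the SE for x respects the clustering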

 

Why is probability needed in the regression model?

21. 4/7 Panel data.  Student presentations. PowerPoint, R code, Data file, Notes.

Read chapter one of this great book by Jed Frees. (Actually the whole book is really good!)

 

Supplemental article. (Optional but excellent reading.) Shows that random effects models are better for panel data than fixed effects.

 

22. 4/12  Repeated measures and random effects,  Bayesian-style “shrinkage” estimates of random effects.

Read p. 199-207 of this article by Simon Jackman.

 

Supplemental (not required) reading (also by Jackman) explaining why biased “shrinkage” estimates are better than the standard unbiased estimates.

R code related to the quiz.

Simulation code to help understand the shrinkage estimates shown in Jackman’s papers.

Ranking of teaching in various majors at TTU using Bayesian-style random effects (“shrinkage”) estimates, with comparison to simple OLS fixed-effects estimates.

Why is probability needed in the regression model?

23. 4/14 Multilevel analysis

Required: Read the Preface and Sections 1, 2, 3 and 7 from this document.  Here is the R code from Section 7. There are a couple mistakes in their syntax, but I was able to replicate their results precisely once the code was cleaned up.

 

If you don’t like the “students within schools” example, just substitute something you do like, such as “employees within company” or “companies within NAICS category.”


Notes:
1. Model (2) might require an additional subscript “k” on both “y” and “e” if there are repeat observations on a patient/nurse combination.
2. The last index of the mu terms in H0 on page 5 should be J, not j. Also, the alternative should include “for at least one pair of means.”
3. They use the word “population” a lot. Make the appropriate substitution of “data generating process,” or put quotation marks around “population.”
4. Note that their model (5) on page 8 is exactly the same model that I used in the previous class to estimate the shrinkage estimates for MAJOR.
5. What they call “full maximum likelihood” (FML) is just ordinary ML as you have learned it.  In particular the “full” in FML has nothing to do with the “full model/restricted model” comparison that you have learned.
6. See my comments at the end of my R code from Section 7.

7. They are interpreting the standard deviations of the variance component estimates incorrectly. The R software gives the square root of the estimate of the variance, which is just the estimate of the standard deviation. It’s not the standard error of the estimate, which is the way that they are interpreting it.

 

Optional: The Bliese tutorial paper.  You will find a lot of this paper very useful as well, but it’s not required reading. Bliese wrote the R code and gives tutorials on the topic, so he is a great resource.

Why is probability needed in the regression model?

“Variance between means”

A multilevel regression of trends (random coefficients regression modelling) in the TTU grad data.

R code from Section 7 of the reading.

The lmer manual (the “bible”)

24. 4/19 More on Multilevel analysis

Read the Preface and Sections 1, 2, 3 and 7 from this document. (Again!)  In the quiz, I will ask you about their model development, and about fixed versus random.

 

25. 4/21 Finishing up mixed effects analyses - The Hausman “test” for fixed versus random effects



Read this critique of the over-used, misunderstood, and trained parrot-ish Hausman test for fixed versus random effects.

 

The most important sentence in this document appears on p. 403: "The Hausman test does not aid in evaluating this tradeoff."

 

Just found this paper (optional reading), which gives the simple fix to the problem of random effects being correlated with the X data. This makes the Hausman test even more useless than is indicated by the main reading.

 

Note also (optional) a similar solution from Andrew Gelman.

 

Take home message: Just like all tests for model assumptions, such as tests for normality, homoscedasticity, etc., the Hausman test for fixed versus random effects is not the way you should evaluate the model assumptions (fixed vs. random in this case).

Issues with correlation between random effects and the X variables – bias.

The Hausman test

How to perform the Hausman test using R.
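For reference, the mechanics are simple (a sketch assuming the plm package and a panel data frame d with columns id, year, y, and x):

library(plm)
# 'd' is an assumed panel data frame with columns id, year, y, x
fe <- plm(y ~ x, data = d, index = c("id", "year"), model = "within")   # fixed effects
re <- plm(y ~ x, data = d, index = c("id", "year"), model = "random")   # random effects
phtest(fe, re)    # the Hausman test; see the readings for why not to lean on it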

Code showing the simulation method from the reading.

 

Why is probability needed in the regression model?

26. 4/26 Binary regression models

Read sections 1 and 2 (up to page 31) of this fine document by John Fox.
 
Optional: this R tutorial.
Mistake: About the Wald test in R, they wrote, “Sigma supplies the variance covariance matrix of the error terms…” which should instead be “Sigma supplies the variance covariance matrix of the parameter estimates...”

Models produce data: Normal regression versus logistic regression

 

Maximum likelihood estimation:  Normal regression versus logistic regression.

 

Some logistic regression curves
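A sketch tying these together: simulate 0/1 data from a logistic model, then recover the curve with glm():

x <- runif(300, -3, 3)
p <- plogis(-0.5 + 1.2*x)              # true P(Y=1|x) on a logistic curve
y <- rbinom(300, 1, p)                 # the model producing the 0/1 data
fit <- glm(y ~ x, family = binomial)   # ML estimation of the logit model
coef(fit)                              # near (-0.5, 1.2)
curve(plogis(coef(fit)[1] + coef(fit)[2]*x), -3, 3)   # the fitted curve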

 

Code from the UCLA tutorial.

 

Code for in-class project: Bayesian logistic regression of “Trashball” – a variation of basketball.

 

Code for finding the p-value for the likelihood ratio test (assuming t1 is the reduced model and t2 is the full model):

a <- anova(t1, t2)                          # deviance table for the two nested fits
pval <- 1 - pchisq(a$Deviance[2], a$Df[2])  # the LR statistic and its df are in row 2

Why is probability needed in the regression model?

27. 4/28 Nominal response regression models. Student presentations. PowerPoint, Excel data file with ML calculations, R codes.

 

 

Read this article from the UCLA Institute for Digital Research and Education  (Thank you UCLA/IDRE for supplying such nice materials!  You guys rock!)

 

Optional reading: “Applying discrete choice models to predict Academy Award winners,” J. R. Statist. Soc. A (2008), 375-394, by Pardoe and Simonton.

 

Multinomial logistic regression using R. Summaries using EXCEL. 

Excel file showing ML Estimation for multinomial logistic regression.
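On the R side, nnet::multinom() fits the multinomial logit; a sketch with a made-up three-level response:

library(nnet)
x <- rnorm(300)
lp2 <- 0.5 + x; lp3 <- -0.5 + 2*x    # log odds of levels 2 and 3 vs. level 1
pr <- cbind(1, exp(lp2), exp(lp3)); pr <- pr/rowSums(pr)
yy <- apply(pr, 1, function(p) sample(1:3, 1, prob = p))
fit <- multinom(factor(yy) ~ x)      # multinomial logistic regression
summary(fit)                         # one coefficient row per non-baseline level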

 

Why is probability needed in the regression model?

28. 5/3 Poisson, negative binomial and other count data regression models.

Read about Poisson regression, through Section 3.2.

 

Read about Negative binomial regression, too.

 

Supplemental reading on count data models in R (not required but excellent)

Analysis of data on financial planners using R

ML estimation of financial planners data – Poisson vs. Normal via EXCEL

More on the latent variable formulation

Simulating and analyzing data from Poisson and Negative binomial regression models.
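A minimal version of that simulation: generate overdispersed counts, then compare the Poisson and negative binomial fits:

library(MASS)                          # glm.nb() and rnegbin() live here
x <- runif(200, 0, 2)
mu <- exp(0.5 + 0.8*x)
y <- rnegbin(200, mu = mu, theta = 2)  # counts more spread out than Poisson
pois <- glm(y ~ x, family = poisson)
nb   <- glm.nb(y ~ x)
c(AIC(pois), AIC(nb))                  # the negative binomial should fit better here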

Analysis of experimental data on wine sales using count data models.

Why is probability needed in the regression model?

29. 5/5 Survival analysis regression models

 

 

Read this summary by David Madigan.

 

Optional: this summary by Maarten Buis of Vrije Universiteit Amsterdam. Section 6, “Unobserved Heterogeneity,” may be skipped.

Proportional hazards regression follow-up using EXCEL, with comparison to lognormal regression.
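A minimal Cox proportional hazards sketch, using the survival package and its built-in lung data (not one of the course examples):

library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)   # built-in lung cancer data
summary(fit)    # the exp(coef) values are hazard ratios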

 

Why is probability needed in the regression model?

30. 5/10

 

 

Read through section 4.1.2 of this paper.

 

Optional reading:

Sample selection bias (Heckman model and method)
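If you want to experiment, the sampleSelection package implements Heckman’s approach; a sketch on its built-in Mroz87 labor supply data (the particular formulas here are illustrative only):

library(sampleSelection)
data(Mroz87)
h <- heckit(lfp ~ age + faminc + kids5 + educ,   # selection: in the labor force or not
            wage ~ educ + exper, data = Mroz87)  # outcome: wage, observed only if lfp = 1
summary(h)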

 

Future semesters – more time on endogeneity issues, instrumental variable regression

Why is probability needed in the regression model?

Final Exam  5/12, 4:30-7:30PM. Solutions.

 

Old finals and solutions are available in the old courses link. But every semester is different.

 

Review questions (these are largely from 1994 and 1996, but still relevant today; the questions have been updated a little to reflect current terminology and software).

 

Why is probability needed in the regression model?