ISQS 5349, Regression Analysis, Spring 2016
Course Syllabus
Old midterms and finals
Class recordings
Supplemental books:
Practical Regression and ANOVA using R, by Julian Faraway.
Probabilistic Modeling in Computer Science, by Norm S. Matloff of UC Davis, a free book licensed under Creative Commons. See Ch. 22 in particular on regression. (I took my first course in regression analysis from Dr. Matloff around 1979!)
Helpful R materials:
Winston Chang's Cookbook for R, a free book licensed under Creative Commons.
From UCLA's Statistical Consulting Group:
swirl teaches you R programming and data science interactively, at your own pace, and right in the R console!
http://www.ats.ucla.edu/stat/r/seminars/intro.htm
A start for this class showing how to access data and do basic things
An overview of R for statistics – everything you want, in a nutshell
A list of useful R functions
http://www.cyclismo.org/tutorial/R/
http://www.rstudio.com/ide/docs/
http://ww2.coastal.edu/kingw/statistics/Rtutorials/
http://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch03.pdf (graphics, from a founder of R, Ross Ihaka)
Class Topics (Rough schedule – this will change depending on student presentations. Dr. Westfall will update the class regularly on schedule changes.)
Preparation – Read and study everything in this column. There will be a quiz at the beginning of class on the day listed. Refer back to these documents repeatedly. Links within the links are recommended and may aid understanding, but are not required.
R code, homework, etc.
1. 1/21 Smoothing: scatterplots, LOESS smoothers, the classical regression model and its assumptions.
Readings and videos. The quiz will be on 1/26 and will count double.
Approximating functions (regression functions for us) by linear terms
The regression function as a conditional expectation (Section 1 only, but the rest is good too)
Approximating regression functions using LOWESS in R
Regression, populations, and processes
Read this summary of the assumptions of the regression model
Read the Wikipedia entry (all of it) about the R^2 statistic
Read the Wikipedia entry about the Gauss-Markov theorem, up to the words "The theorem now states that the OLS estimator is a BLUE."
Initial regression analyses using R; introduction to some data sets we'll be using.
Models produce data. How different do the data look when they are produced by the same model? How different do the data look when they are produced by different models?
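A minimal R sketch of the LOESS idea, using simulated data rather than the class data sets (the true regression function sin(x) is an assumption of the example, not from the course files):

```r
# Fit a LOESS smoother to noisy data and compare it to the true curve.
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)   # true regression function: sin(x)
fit <- loess(y ~ x, span = 0.5)      # span controls the width of the local window
plot(x, y)
lines(sort(x), predict(fit)[order(x)], col = "red")  # smoothed estimate
```

Smaller values of span track the data more closely; larger values give a smoother, more biased estimate.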
2. 1/26 Maximum likelihood and least squares. The Gauss-Markov theorem. (Today's quiz covers all the readings for 1/21 and 1/26, and counts double.)
The quiz for today's class will be over two days' worth of material (shown in the box above) and will count double.
Illustrating the Gauss-Markov property, both good and bad.
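A simulation sketch of the Gauss-Markov property (not the class file; the model and the alternative estimator are assumptions of the example). OLS is compared with another linear unbiased estimator of the slope, the endpoints-only estimator:

```r
# Compare the sampling variance of the OLS slope with that of another
# linear unbiased estimator: (y_n - y_1)/(x_n - x_1).
set.seed(1)
x <- 1:20
b.ols <- b.crude <- numeric(2000)
for (i in 1:2000) {
  y <- 2 + 0.5 * x + rnorm(20)
  b.ols[i]   <- coef(lm(y ~ x))[2]
  b.crude[i] <- (y[20] - y[1]) / (x[20] - x[1])  # unbiased, but inefficient
}
var(b.ols)    # smaller: OLS is BLUE under the classical assumptions
var(b.crude)
</imports>
```

Both estimators are unbiased, but the OLS slope has markedly smaller variance, which is exactly what the theorem promises.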
3. 1/28 Exact Inferences in the classical parametric
regression model: pvalues, confidence and prediction intervals. 
Read this
discussion of a confidence interval for the slope, from Doug Stirling,
Massey University in Palmerston North, New Zealand. Read this document
on interpreting pvalues, by Dr.
Westfall. Read this
document on “Why you should never say “Accept Ho,” written by Dr.
Westfall Read this
discussion of confidence intervals for E(YX=x) versus Prediction interval for YX=x, from “Musings on Using and Misusing
Statistics,” by Martha K. Smith, retired UT professor. Read this
document on “Prediction and Generalization,” written by Dr. Westfall. Read this
document on “Confidence intervals and significance tests as predictions,”
written by Dr. Westfall. 
Confidence
intervals for slope and intercepts. Understanding the
standard error: a simulation study. Constructing
frequentist confidence and prediction intervals; GPA/SAT example and Toluca
example; Constructing the corresponding
Bayesian intervals using the same examples. 
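A short R sketch of the distinction between the two intervals, using simulated data rather than the GPA/SAT or Toluca files:

```r
# Confidence interval for E(Y|X=5) vs. prediction interval for a new Y at X=5.
set.seed(1)
x <- runif(50, 1, 10)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
new <- data.frame(x = 5)
ci <- predict(fit, new, interval = "confidence")  # for the mean E(Y|X=5)
pi <- predict(fit, new, interval = "prediction")  # for a new observation
ci
pi   # always wider: it must also cover the new observation's error
```

The prediction interval is wider because it accounts for the variability of a future observation, not just the uncertainty in the estimated mean.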
4. 2/2 Checking the assumptions of the classical model
Read the document, "How to check assumptions using data," written by Dr. Westfall, and run the R code therein.
Read and run the code in the document, "Why do assumptions matter?" written by Dr. Westfall.
Read the document, "Comments on Transformations," written by Dr. Westfall.
HW 3, due 2/11. Write a report in which you show how to replicate all the analyses (all statistics, model equations, tables and graphs) in this paper using R rather than Minitab. Include all code and outputs, as well as surrounding words, in a professional, clean, publication-quality document. Use a "tutorial" style presentation, aimed at teaching a person (say, not in this class) how to do everything. (Note: it is not necessary to make the graphs appear in the "panel" display as in the paper, where they show all four together. Instead, you can show them all separately. Just be sure to label them clearly in your report, e.g., "Upper right graph of Figure 6.")
Testing for curvature using quadratic regression: Toluca, Peak Energy, Car Sales, and Product Complexity examples
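A minimal sketch of the curvature test using simulated data (not the Toluca or Peak Energy files; the quadratic true model is an assumption of the example):

```r
# Test for curvature by adding a quadratic term; the t-test on I(x^2)
# is the curvature test.
set.seed(1)
x <- runif(80, 0, 4)
y <- 1 + x + 0.5 * x^2 + rnorm(80)    # true model is genuinely curved
fit <- lm(y ~ x + I(x^2))
summary(fit)$coefficients["I(x^2)", ] # small p-value suggests curvature
```

If the quadratic term is not significant, that does not prove the relationship is linear; it only means the data do not clearly exhibit this particular form of curvature.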
5. 2/4 Using transformations to achieve a more reasonable model. The multiple regression model.
Read this introduction to multiple regression analysis.
Lance Car Sales example: analysis of the model using the x^{-1} transformation.
6. 2/9 The multiple regression model; added variable plots.
Note on presentations: Good example PowerPoint student presentations can be found on last Fall's ISQS 6348 page (multivariate analysis), in the left column. DO NOT use last year's ISQS 5349 presentations as examples – they were trying to present as if at a conference, rather than teaching the material (which was my fault because I did not give adequate guidance).
Read about added variable plots (also called partial regression plots).
Read Sections 1-7 of this matrix algebra preparation material, courtesy A. Colin Cameron, UC Davis (whoo hoo! My alma mater).
(Not required, but if you find yourself needing additional self-study on matrix algebra, see the SOS math tutorials, matrix0, matrix1, matrix2, and Introduction to matrix algebra using R. See also Matrix and linear algebra tutorials; see also the MIT Open Courseware (a free online course with separate lectures for separate linear and matrix algebra topics).)
Multiple regression analysis of how computer time to run a job relates to RAM and processor speed.
R code for the Sales vs. Interest Rate and Gas Price example: how curvature can be explained by a third variable.
Visualizing the multiple regression model: 3D and partial plots using EXCEL.
7. 2/11 Matrix form of the model: estimates, standard errors, t intervals and tests.
Read these presentation slides by Carlos Carvalho, UT Austin.
Read the document "Prediction as association and prediction as causation," written by Dr. Westfall. The document shows that you cannot infer causation using the regression model.
The multivariate normal distribution (from Wikipedia)
Information on covariance matrices, from Wikipedia
The various matrices and regression calculations shown in R code.
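A sketch of the matrix-form calculations on simulated data (not the class file): the least-squares estimate betahat = (X'X)^{-1} X'y and its standard errors from s^2 (X'X)^{-1}, checked against lm():

```r
# Hand-computed least squares in matrix form, verified against lm().
set.seed(1)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))             # design matrix with intercept column
y <- X %*% c(1, 2, -1) + rnorm(n)
betahat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
resid <- y - X %*% betahat
s2 <- sum(resid^2) / (n - 3)                  # MSE, with n - p degrees of freedom
se <- sqrt(diag(s2 * solve(t(X) %*% X)))      # standard errors of the estimates
cbind(betahat, se)
coef(lm(y ~ X[, 2] + X[, 3]))                 # agrees with the hand calculation
```

Seeing that solve(t(X) %*% X) %*% t(X) %*% y reproduces coef(lm(...)) exactly is a useful sanity check on the matrix formulas.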
8. 2/16 Causality
Read "Causal Inference using Regression on the Treatment Variable," by Andrew Gelman, Sections 9.1 and 9.2 only. (The rest is great, but not required. Note: Gelman is a major dude.)
The computer speed example will be good for discussing causality.
9. 2/18 Multicollinearity; the ANOVA table, the F test, and the R-squared statistic
Read these presentation slides, slides 1-6, by Alicia Carriquiry of Iowa State. Note: there is a mistake on Slide 4. It should read "…where R^2_j is the coefficient of determination obtained by regressing the jth predictor on all other predictors." (In particular, the VIF has nothing to do with the Y data.)
Read this document on multicollinearity, by Dr. Westfall.
File to illustrate problems with multicollinearity
R code for diagnosis and interpretation of multicollinear variables; also indicates one of many potential solutions to the problem.
Full model vs. reduced model F test.
Why is probability needed in the regression model?
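A sketch of the VIF calculation described in the corrected slide, using simulated collinear predictors (the data-generating setup is an assumption of the example): VIF_j = 1/(1 - R^2_j), where R^2_j comes from regressing predictor j on the other predictors; the Y data play no role.

```r
# Compute a VIF by hand: regress one predictor on the others.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.2)                 # nearly collinear with x1
x3 <- rnorm(100)
r2 <- summary(lm(x1 ~ x2 + x3))$r.squared       # R^2_j for predictor x1
vif_x1 <- 1 / (1 - r2)
vif_x1    # large values (say, above 10) signal serious multicollinearity
```

Note that no response variable appears anywhere in the calculation.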
10. 2/23 Interactions; the inclusion principle
Code to explain why Var(Yhat) is not constant, even though Var(Y) is constant: x = 2*rnorm(10) + 10
3D graphs of interaction and non-interaction surfaces
Examining interactions – an R demo
Moderator example, from Karl Wuensch's web page http://core.ecu.edu/psyc/wuenschk/. The publication is here. A "hand-drawn" graph using Excel of the moderating effect.
File to illustrate problems with violating the inclusion principle

11. 2/25 Midterm 1. Solutions. 


12. 3/1 Variable and model selection
Read about the variance/bias tradeoff here. Comments: (1) Right below the first graph: "characterized" should be "characterize." (2) Two sentences later, g(x) is called an "estimator" of f(x), and this persists through the article. This is not the usual usage of the term "estimator," because an estimator is a function of random data. If you put a "^" on top of g(x), you could call it an estimator. Better to call it a "candidate mean function" or "supposed model" or something like that. Later on, though, he refers to g(x) as a function of data, and this is indeed an estimator.

A simulation example: predicting Hans' graduate GPA from GRE scores and undergrad GPA.
R code to demonstrate the danger of overfitting.
R file for producing and comparing n-fold cross-validation statistics for different models
All-subsets model selection for predicting doctors per capita using R.
13. 3/3 Variable and model selection (continued).
Re-read the midterm 1 Solutions. The quiz will be about this material. Be prepared to reinterpret the answers in a context that is relevant to you, whether it be an accounting context, a PFP context, a finance context, an engineering context, or an ag econ context. By "context," I mean you need to pick your own Y and X variables that are relevant to your field of study, and then answer in terms of those variables in the quiz. If you are having trouble making the connection, I will be happy to help. Remember, "I don't want to hear what I want to hear." So don't talk about Hans or simulation. Make your answers specific to the processes at work in your field of study, whether they be behavioral, economic, biological, physical, political, etc. Pick an example that you yourself are interested in, and do not duplicate others' examples. Pick something you will look at for your thesis or in your future employment. This quiz will count double.
R code to demonstrate the danger of overfitting.
R file for producing and comparing n-fold cross-validation statistics for different models
All-subsets model selection for predicting doctors per capita using R
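A minimal n-fold (leave-one-out) cross-validation sketch on simulated data (not the doctors-per-capita file; the candidate models are assumptions of the example):

```r
# Leave-one-out cross-validation: refit without case i, predict case i,
# and sum the squared prediction errors (the PRESS statistic).
set.seed(1)
x <- runif(60, 0, 4)
y <- 1 + 2 * x + rnorm(60)    # true model is linear in x
dat <- data.frame(x, y)
press <- function(form, dat) {
  sum(sapply(1:nrow(dat), function(i) {
    fit <- lm(form, data = dat[-i, ])            # fit without case i
    (dat$y[i] - predict(fit, dat[i, ]))^2        # squared prediction error
  }))
}
press(y ~ 1, dat)          # intercept only: predicts poorly
press(y ~ x, dat)          # the true model: much smaller PRESS
press(y ~ x + I(x^2), dat) # overfit: typically no better than the true model
```

Because each case is predicted from a fit that never saw it, PRESS penalizes overfitting in a way that in-sample R-squared cannot.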
14. 3/8 Dummy variables, ANOVA, ANCOVA, ANCOVA with interactions, graphical summaries. Student presentations. PowerPoint file, R code.
Read this document through slide 24. Note on subscripts: either you should subscript all of the variables (which means that you are referring to the model for row "i" in your data frame) or you should subscript none of the variables (which means you are modelling your process more generically).
ANOVA/ANCOVA, first file – comparing GPAs of male and female students (two-level ANOVA/ANCOVA).
ANOVA/ANCOVA, second file – comparing GPAs of students in different degree plans.
15. 3/10 Heteroscedasticity: WLS, ML estimation, robust standard errors. Student presentations. PowerPoint, R code.
Update: modified readings. I really like Cosma Shalizi of Carnegie Mellon. Expect to see more from him. Read pages 1-9 from this document. Comments: (2) The assumed functional form of the heteroscedasticity, 1 + x^2/2, is somewhat unusual in that it makes the variance decrease then increase as x ranges from its min to its max. This might make sense in some cases where the X variable has zero in its range, as it does in Cosma's simulation, but for typical cases where the X and Y data are positive, the heteroscedasticity function is either monotonically increasing or monotonically decreasing, with no "down then up" behavior. (3) "The oracle of regression" is what I sometimes refer to as "Hans."
Optional: read this critique of robust standard errors. The prof summarizes: if you think you have a problem with heteroscedasticity, you probably have more serious problems and should not be using OLS anyway. Using robust standard errors with OLS is like putting a bright red bow on the head of an ugly pig lying in the stinking pig slop, and then saying that the pig now looks pretty.
First file to illustrate the benefit of Weighted Least Squares – shows imprecise predictions of OLS in the presence of heteroscedasticity
Comparison of prediction limits: homoscedastic vs. heteroscedastic models – shows OLS prediction limits are incorrect in the presence of heteroscedasticity
Estimating the heteroscedastic variance of GE returns as a function of trading volume via maximum likelihood using R. (Try different variance functions and compare log likelihoods to see which fit better.)
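A weighted least squares sketch on simulated data (not the course files; the variance function Var(e_i) proportional to x_i^2, hence weights 1/x^2, is an assumption of the example):

```r
# OLS vs. WLS when the error standard deviation grows with x.
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)     # error sd proportional to x
ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)     # downweight the noisy observations
summary(ols)$coefficients
summary(wls)$coefficients               # WLS typically gives a more precise slope
```

The weights should be inversely proportional to the error variances; in practice the variance function must itself be modeled and estimated, as in the ML example above.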
16. 3/22 Outliers
Read Dr. Westfall's comments on outliers.
Read Section 3 from this nice document by Anne Boomsma. (You might find the other sections very helpful as well, but only Section 3 for now.) Note: the paper is mostly good, with minor errors. One rather major mistake, though, is on the top of page 13: the statistic "sig" gives an estimate of the residual standard error, as it is called; in fact, sqrt(MSE) is not an unbiased estimator of the standard deviation of e_{i} (MSE is unbiased for the variance, but its square root is not unbiased for the standard deviation).
This code will get you started if you want to run the examples yourself: install.packages("faraway")

17. 3/24 Quantile regression and Winsorizing – Student presentations. PowerPoint, R code for quantile regression, data file, R code for Winsorizing
Read about quantile regression.
EXCEL spreadsheet to explain the quantile estimation method
18. 3/29 Review up to now. Generalized least squares, correlated errors, time series
Read this document from John Fox.
John Fox code; also some code to show how the AR and MA data-generating processes work.
Simulation studies of inefficiency and Type I error rates when using OLS and ML/GLS: an example with autocorrelated data.
19. 3/31 Continuation of generalized least squares, correlated errors, time series
Read this document from John Fox (again).
John Fox code; also some code to show how the AR and MA data-generating processes work.
Simulation studies of inefficiency and Type I error rates when using OLS and ML/GLS: an example with autocorrelated data.
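A base-R sketch (not the John Fox code; the AR(1) coefficient 0.8 and the simulation design are assumptions of the example) of how autocorrelated errors inflate the Type I error rate of the OLS slope t-test:

```r
# Simulate AR(1) errors under a true slope of zero and record how often
# the OLS t-test (nominal 5% level) falsely rejects.
set.seed(1)
x <- 1:50
reject <- replicate(2000, {
  e <- as.numeric(arima.sim(list(ar = 0.8), n = 50))  # AR(1) errors, phi = 0.8
  y <- 1 + 0 * x + e                                  # true slope is zero
  summary(lm(y ~ x))$coefficients["x", 4] < 0.05      # did the test reject?
})
mean(reject)   # far above the nominal 0.05
```

Positive autocorrelation makes the OLS standard errors too small, so the test rejects a true null far more than 5% of the time; GLS/ML methods that model the correlation restore the nominal error rate.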
20. 4/5 Intro to mixed effects models. Nice R video
Read this tutorial on lmer by Bodo Winter.
Some code from Bodo Winter's tutorial
21. 4/7 Panel data. Student presentations. PowerPoint, R code, data file, notes.
Read chapter one of this great book by Jed Frees. (Actually, the whole book is really good!)
Supplemental article (optional but excellent reading): shows that random effects models are better for panel data than fixed effects.

22. 4/12 Repeated measures and random effects; Bayesian-style "shrinkage" estimates of random effects.
Read pp. 199-207 of this article by Simon Jackman.
Simulation code to help understand the shrinkage estimates shown in Jackman's paper.
23. 4/14 Multilevel analysis
Required: read the Preface and Sections 1, 2, 3 and 7 from this document. Here is the R code from Section 7. There are a couple of mistakes in their syntax, but I was able to replicate their results precisely once the code was cleaned up. If you don't like the "students within schools" example, just substitute something you do like, such as "employees within company" or "companies within NAICS category." Note also: they are interpreting the standard deviations of the variance component estimates incorrectly. The R software gives the square root of the estimate of the variance, which is just the estimate of the standard deviation. It's not the standard error of the estimate, which is the way that they are interpreting it.
Optional: the Bliese tutorial paper. You will find a lot of this paper very useful as well, but it's not required reading. Bliese wrote the R code and gives tutorials on the topic, so he is a great resource.
Why is probability needed in the regression model?
A multilevel regression of trends (random coefficients regression modelling) in the TTU grad data.
R code from Section 7 of the reading.
24. 4/19 More on multilevel analysis
Read the Preface and Sections 1, 2, 3 and 7 from this document. (Again!) In the quiz, I will ask you about their model development, and about fixed versus random effects.

25. 4/21 Finishing up mixed effects analyses – the Hausman "test" for fixed versus random effects
The most important sentence in this document appears on p. 403: "The Hausman test does not aid in evaluating this tradeoff."
Just found this paper (optional reading), which gives the simple fix to the problem of random effects being correlated with the X data. This makes the Hausman test even more useless than is indicated by the main reading. Note also (optional) a similar solution from Andrew Gelman.
Take-home message: just like all tests for model assumptions, such as tests for normality, homoscedasticity, etc., the Hausman test for fixed versus random effects is not the way you should evaluate the model assumptions (fixed vs. random in this case).
Issues with correlation between random effects and the X variables – bias.
How to perform the Hausman test using R.
Code showing the simulation method from the reading.
26. 4/26 Binary regression models
Read Sections 1 and 2 (up to page 31) of this fine document by John Fox.
Models produce data: normal regression versus logistic regression
Maximum likelihood estimation: normal regression versus logistic regression.
Some logistic regression curves
Code for in-class project: Bayesian logistic regression of "Trashball" – a variation of basketball.
Code for finding the p-value for the likelihood ratio test.
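A minimal sketch of a logistic regression fit and the likelihood ratio test p-value, on simulated data (not the Trashball project; the true logistic curve is an assumption of the example):

```r
# Fit a logistic regression and compute the likelihood ratio test of
# H0: slope = 0 by comparing deviances of the null and fitted models.
set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(0.5 + 1.5 * x)))   # true success probability (logistic curve)
y <- rbinom(200, 1, p)
fit  <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)
lrt <- null$deviance - fit$deviance    # likelihood ratio statistic
pval <- pchisq(lrt, df = 1, lower.tail = FALSE)
pval                                    # p-value for H0: slope = 0
```

The deviance difference is twice the log-likelihood ratio, so it is compared to a chi-squared distribution with degrees of freedom equal to the number of restricted parameters (here, 1).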
27. 4/28 Nominal response regression models. Student presentations. PowerPoint, Excel data file with ML calculations, R code.
Read this article from the UCLA Institute for Digital Research and Education. (Thank you, UCLA/IDRE, for supplying such nice materials! You guys rock!)
Optional reading: "Applying discrete choice models to predict Academy Award winners," J. R. Statist. Soc. A (2008), 375-394, by Pardoe and Simonton.
Multinomial logistic regression using R. Summaries using EXCEL.
28. 5/3 Poisson, negative binomial and other count data regression models.
Read about Poisson regression, through Section 3.2. Read about negative binomial regression, too.
Supplemental reading on count data models in R (not required but excellent)
Analysis of data on financial planners using R
ML estimation of the financial planners data – Poisson vs. normal via EXCEL
More on the latent variable formulation
Simulating and analyzing data from Poisson and negative binomial regression models.
Analysis of experimental data on wine sales using count data models.
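A short sketch of Poisson regression on simulated counts (not the financial planners or wine files; the log-linear true mean is an assumption of the example):

```r
# Fit a Poisson regression with the canonical log link and check for
# overdispersion via the residual deviance.
set.seed(1)
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.5 + 1 * x))   # log link: E(Y|x) = exp(0.5 + x)
fit <- glm(y ~ x, family = poisson)
summary(fit)$coefficients
fit$deviance / fit$df.residual               # near 1 when the Poisson fits
```

When the deviance-to-degrees-of-freedom ratio is well above 1, the counts are overdispersed and a negative binomial model is usually the better choice.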
29. 5/5 Survival analysis regression models
Read this summary by David Madigan. Optional: this summary by Maarten Buis of Vrije Universiteit Amsterdam. Section 6, "Unobserved Heterogeneity," is optional (not required) reading.
Proportional hazards regression follow-up using Excel, with comparison to lognormal regression.
30. 5/10
Read through Section 4.1.2 of this paper.
Optional reading: sample selection bias (Heckman model and method)
Future semesters – more time on endogeneity issues, instrumental variable regression

Final Exam 5/12, 4:30-7:30 PM. Solutions.
Old finals and solutions are available in the old courses link. But every semester is different.
Review questions (these are largely from 1994 and 1996, but still relevant today). The questions have been updated a little to reflect current terminology and software.