BTRY 4030 – Fall 2018 – Homework 4 Q
Put Your Name and NetID Here
Due Tuesday, December 4, 2018
Instructions:
Create your homework solution file by editing the hw5-2018_q2.Rmd Rmarkdown file provided. Your
solution to this homework assignment should include the relevant R code and output (fit summaries, ANOVA
tables and computed statistics, as well as requested plots) in addition to written comments where requested.
Do not include output that is not relevant to the question. You should turn in a .pdf version of your compiled
code.
You may discuss the homework problems and computing issues with other students in the class. However, you must write up your homework solution on your own. In particular, do not share your homework RMarkdown file with other students.
Here we will illustrate the results from Question 1 with a real world data set. We will use the study of
mortality in 55 US cities as it is influenced by pollutants NOX (nitrogen oxides) and SO2 (sulfur dioxide),
while controlling for weather (PRECIP) and sociological variables (EDUC and NONWHITE) that appeared on
homework 4.
You can find the data in airpollution.csv on CMS.
a. Delete each of the first four observations in turn, fit a model with the remaining observations (i.e., each
model should be fit based on n - 1 observations) and use this to predict MORT in the left-out sample.
Verify that your answer in Question 1d returns the same error.
b. Using your identity in Question 1e, compute the cross-validation score for a model using all covariates.
c. Calculate the cross-validation score in the sequence of models obtained by starting from the intercept
and adding each column in the order given in the data (so every model should have one more covariate
than the previous one). Which model has the lowest score?
d. What happens if you add them in reverse order? Plot both sequences of scores versus the number of
covariates in the model.
e. An alternative (fairly classical) means of selecting models in linear regression is Mallows' $C_p$ score. This
can be expressed as
$$C_j = \frac{y^T (I - H_j) y}{y^T (I - H) y / (n - p - 1)} + 2j,$$
often also written as $SSE_j / \hat{\sigma}^2 + 2j$, where $SSE_j$ is the SSE for a model with $j$ covariates, and $\hat{\sigma}^2$ is
calculated from a model with all covariates.
Obtain $C_j$ for each of your models in part c. How does this compare with cross-validation?
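The mechanics of parts a through e can be sketched on simulated data. This is only an illustration under stated assumptions: the data are a simulated stand-in (not airpollution.csv), the covariate names x1 to x5 are hypothetical, Question 1d is assumed to be the usual leave-one-out identity e_i / (1 - h_ii), Question 1e is assumed to define the cross-validation score as the sum of squared leave-one-out errors, and the Cp score is taken as SSE_j / sigma-hat^2 + 2j.

```r
# Simulated stand-in data (NOT airpollution.csv); x1..x5 are hypothetical covariates.
set.seed(1)
n <- 55
covs <- paste0("x", 1:5)
dat <- as.data.frame(matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, covs)))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
p <- length(covs)

# Part a: refit without observation i and predict the left-out response.
loo_error <- function(i, data) {
  fit_i <- lm(y ~ ., data = data[-i, ])
  data$y[i] - predict(fit_i, newdata = data[i, , drop = FALSE])
}
explicit <- unname(sapply(1:4, loo_error, data = dat))

# Assumed form of the Question 1d identity: e_i / (1 - h_ii) from one full-data fit.
full <- lm(y ~ ., data = dat)
shortcut <- unname((resid(full) / (1 - hatvalues(full)))[1:4])
print(cbind(explicit, shortcut))

# Part b: assumed form of the Question 1e score (sum of squared leave-one-out errors).
cv_score <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)

# Parts c-e: CV and Cp scores for the nested sequence obtained by adding `vars` in order.
nested_scores <- function(vars, data) {
  k <- length(vars)
  fits <- lapply(0:k, function(j) {
    rhs <- if (j == 0) "1" else paste(vars[1:j], collapse = " + ")
    lm(as.formula(paste("y ~", rhs)), data = data)
  })
  sse <- sapply(fits, function(f) sum(resid(f)^2))
  sigma2_hat <- sse[k + 1] / (nrow(data) - k - 1)  # from the model with all covariates
  data.frame(j = 0:k,
             cv = sapply(fits, cv_score),
             cp = sse / sigma2_hat + 2 * (0:k))
}

fwd <- nested_scores(covs, dat)             # order given in the data
rev_scores <- nested_scores(rev(covs), dat) # reverse order
print(fwd)
print(rev_scores)
```

Both data frames carry a column j, so the two score sequences can be drawn against the number of covariates with plot() and lines(). The intercept-only and full models coincide in either ordering; only the intermediate models differ.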
Bonus: When there is no natural ordering of the covariates, one way to create one is to first choose the
covariate that produces the smallest SSE among all models with one covariate. Then, keeping that covariate in the
model, look for the best covariate to add to it. Continue this process until all covariates are in the model. If you do
this, what ordering do you get? What is your optimal model?
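The greedy ordering described above is a form of forward selection by SSE. A minimal sketch, assuming a simulated data frame whose response column is named y (the column names and the helper functions sse_of and forward_order are hypothetical, not part of the assignment files):

```r
# Simulated stand-in data; x3 has the strongest effect by construction.
set.seed(2)
n <- 55
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 3 * dat$x3 + dat$x1 + rnorm(n)

# SSE of the model y ~ vars (intercept only when vars is empty).
sse_of <- function(vars, data) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  sum(resid(lm(as.formula(paste("y ~", rhs)), data = data))^2)
}

# Greedy ordering: at each step add the covariate giving the smallest SSE.
forward_order <- function(data) {
  remaining <- setdiff(names(data), "y")
  chosen <- character(0)
  while (length(remaining) > 0) {
    sse <- sapply(remaining, function(v) sse_of(c(chosen, v), data))
    best <- remaining[which.min(sse)]
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, best)
  }
  chosen
}

ord <- forward_order(dat)
print(ord)
```

The resulting ordering can then be scored model by model with cross-validation or Cp to pick a stopping point, since SSE alone always decreases as covariates are added.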
Bonus: Simulate data from the model that you get before SO2 and NOX are entered (that is, fit a model
with just PRECIP, EDUC and NONWHITE and simulate data with the estimated coefficients and residual
variance). Carry out the model-selection step in part c for each of 100 simulations. How frequently does cross
validation choose the right model?
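One way to organise this simulation is sketched below. It is an illustration only: the data are simulated stand-ins, x1 to x3 play the role of PRECIP, EDUC and NONWHITE, x4 and x5 play the role of the pollutants, and the cross-validation score is again assumed to be the sum of squared leave-one-out errors computed from hat values.

```r
# Simulated stand-in data (NOT airpollution.csv); names are hypothetical.
set.seed(3)
n <- 55
covs <- paste0("x", 1:5)
dat <- as.data.frame(matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, covs)))
dat$y <- 1 + dat$x1 - dat$x2 + 0.8 * dat$x3 + rnorm(n)

# Assumed CV score: sum of squared leave-one-out errors.
cv_score <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)

# Number of covariates (0..5) in the CV-minimising model of the nested sequence.
cv_choice <- function(data) {
  scores <- sapply(0:length(covs), function(j) {
    rhs <- if (j == 0) "1" else paste(covs[1:j], collapse = " + ")
    cv_score(lm(as.formula(paste("y ~", rhs)), data = data))
  })
  which.min(scores) - 1
}

# Fit the reduced model, then simulate responses from its fitted values
# and estimated residual standard deviation.
reduced <- lm(y ~ x1 + x2 + x3, data = dat)
sigma_hat <- summary(reduced)$sigma

n_sim <- 100
chosen <- replicate(n_sim, {
  sim <- dat
  sim$y <- fitted(reduced) + rnorm(n, sd = sigma_hat)
  cv_choice(sim)
})

prop_correct <- mean(chosen == 3)  # share of simulations picking the 3-covariate model
print(table(chosen))
print(prop_correct)
```

The covariates are held fixed across simulations and only the response is regenerated, which matches simulating from the fitted reduced model.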