632 HW

Qianwen Shi
2019/4/

library(ISLR)
library(tree)
library(rpart.plot)

Warning: package ‘rpart.plot’ was built under R version 3.5.

Loading required package: rpart

library(rpart)
library(caret)

Warning: package ‘caret’ was built under R version 3.5.

Loading required package: lattice

Loading required package: ggplot2

library (randomForest)

randomForest 4.6-

Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

margin

Q1:
a.

set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats) / 2)
training = Carseats[train, ]
testing = Carseats[-train, ]

b.

reg_tree = tree(Sales ~ ., data = Carseats, subset = train)
summary(reg_tree)

Regression tree:

tree(formula = Sales ~ ., data = Carseats, subset = train)

Variables actually used in tree construction:

[1] “ShelveLoc” “Price” “Age” “Advertising” “Income”

[6] “CompPrice”

Number of terminal nodes: 18

Residual mean deviance: 2.36 = 429.5 / 182

Distribution of residuals:

Min. 1st Qu. Median Mean 3rd Qu. Max.

-4.2570 -1.0360 0.1024 0.0000 0.9301 3.

reg_model <- rpart(Sales ~ ., data = training, method = "anova")
rpart.plot(reg_model)

[rpart.plot output: regression tree for Sales with splits on ShelveLoc, Price, Age, CompPrice, Advertising, and US; each node shows the predicted Sales and the percentage of training observations.]

yhat = predict(reg_tree, newdata = testing)
mean((yhat - testing$Sales)^2)

[1] 4.148897

As we can see in the plot, the branch with Price >= 121 accounts for 82% of the observations; within it, 30% have Age >= 67 and 53% have Age >= 51, and the node with Price >= 113 contains 18%. The test MSE is 4.148897.
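If we want to double-check any of the node percentages in the plot, the proportions can be recomputed from the training data. This is only a sketch; the splits below are simply the ones printed near the top of the plot.

# Sketch: recompute node proportions for the first splits shown in the plot
mean(training$ShelveLoc %in% c("Bad", "Medium"))   # root split: ShelveLoc = Bad,Medium
mean(training$Price >= 121)                        # split on Price >= 121
mean(training$Price >= 121 & training$Age >= 67)   # ... further restricted by Age >= 67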

c.

trControl <- trainControl(method = "cv", number = 10)
crossvali_model <- train(Sales ~ ., data = training, method = "rpart",
                         trControl = trControl,
                         tuneGrid = expand.grid(cp = seq(0, 0.4, length.out = 30)))

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =

trainInfo, : There were missing values in resampled performance measures.

plot.train (crossvali_model)

[plot.train output: cross-validated RMSE versus the complexity parameter cp.]

New_tree = prune.tree(reg_tree, best = 8)
rpart.plot(crossvali_model$finalModel)

[rpart.plot output: cross-validated final tree with splits on ShelveLocGood, Price, and Age; each node shows the predicted Sales and the percentage of training observations.]

yhat = predict(New_tree, newdata = testing)
mean((yhat - testing$Sales)^2)

## [1] 5.09085

As we can see in the result, pruning the tree to 8 terminal nodes increases the test MSE to 5.09085.
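To see how sensitive the test MSE is to the chosen size, prune.tree could be swept over several sizes. This is just a sketch using the objects already defined above, not part of the required answer.

# Sketch: test MSE of the pruned regression tree for a range of sizes
sizes <- 2:18
mse_by_size <- sapply(sizes, function(b) {
  pruned <- prune.tree(reg_tree, best = b)
  mean((predict(pruned, newdata = testing) - testing$Sales)^2)
})
cbind(size = sizes, test_mse = round(mse_by_size, 3))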

d.

set.seed(1)
bagging = randomForest(Sales ~ ., data = training, mtry = 10, importance = TRUE)
yhat_bagging = predict(bagging, newdata = testing)
mean((yhat_bagging - testing$Sales)^2)

[1] 2.614642

importance (bagging)

%IncMSE IncNodePurity

CompPrice 16.4714051 126.

Income 4.0561872 78.

Advertising 16.2730251 122.

Population 0.7711188 62.

Price 54.5571815 512.

ShelveLoc 42.4486118 320.

Age 20.5369414 184.

Education 2.7755968 42.

Urban -2.3962157 8.

US 7.2258536 17.

varImpPlot (bagging)

[varImpPlot output for the bagging fit: two panels ranking the predictors by %IncMSE and IncNodePurity; Price and ShelveLoc are the top variables in both panels.]

As we can see in the plot, the most important variables are Price and ShelveLoc (the quality of the shelving location for the car seats at each site). The test MSE with the bagging approach is 2.614642.
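The same ranking can be read off numerically by sorting the importance matrix of the bagging fit; a small sketch:

# Order predictors by %IncMSE, largest first
imp <- importance(bagging)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]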

e.

set.seed(1)
random_forst = randomForest(Sales ~ ., data = training, mtry = 3, importance = TRUE)
yhat_random = predict(random_forst, newdata = testing)
mean((yhat_random - testing$Sales)^2)

[1] 3.237463

The test MSE is 3.237463, which is larger than the bagging test MSE, so in this case the random forest does not improve on bagging.
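If we wanted to check how the choice of mtry affects this conclusion, the forest could be refit over a few values. This is only a sketch, not part of the required answer.

# Sketch: test MSE as a function of mtry
mtry_vals <- c(2, 3, 5, 7, 10)
sapply(mtry_vals, function(m) {
  set.seed(1)
  fit <- randomForest(Sales ~ ., data = training, mtry = m)
  mean((predict(fit, newdata = testing) - testing$Sales)^2)
})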

Q2:
a.

set.seed(1)
train = sample(dim(OJ)[1], 800)
training_OJ = OJ[train, ]
testing_OJ = OJ[-train, ]

b.

OJ_tree = tree(Purchase ~ ., data = training_OJ)
summary(OJ_tree)

Classification tree:

tree(formula = Purchase ~ ., data = training_OJ)

Variables actually used in tree construction:

[1] “LoyalCH” “PriceDiff” “SpecialCH” “ListPriceDiff”

Number of terminal nodes: 8

Residual mean deviance: 0.7305 = 578.6 / 792

Misclassification error rate: 0.165 = 132 / 800

As we can see, the training error rate is 0.165 and the tree has 8 terminal nodes.
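The training error rate reported by summary() can also be recomputed directly from the fitted tree; a quick sketch:

# Misclassification rate on the training set; should match the 0.165 above
train_pred <- predict(OJ_tree, newdata = training_OJ, type = "class")
mean(train_pred != training_OJ$Purchase)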

c.

OJ_tree

node), split, n, deviance, yval, (yprob)

* denotes terminal node

1) root 800 1064.00 CH ( 0.61750 0.38250 )

2) LoyalCH < 0.508643 350 409.30 MM ( 0.27143 0.72857 )

4) LoyalCH < 0.264232 166 122.10 MM ( 0.12048 0.87952 )

8) LoyalCH < 0.0356415 57 10.07 MM ( 0.01754 0.98246 ) *

9) LoyalCH > 0.0356415 109 100.90 MM ( 0.17431 0.82569 ) *

5) LoyalCH > 0.264232 184 248.80 MM ( 0.40761 0.59239 )

10) PriceDiff < 0.195 83 91.66 MM ( 0.24096 0.75904 )

20) SpecialCH < 0.5 70 60.89 MM ( 0.15714 0.84286 ) *

21) SpecialCH > 0.5 13 16.05 CH ( 0.69231 0.30769 ) *

11) PriceDiff > 0.195 101 139.20 CH ( 0.54455 0.45545 ) *

3) LoyalCH > 0.508643 450 318.10 CH ( 0.88667 0.11333 )

6) LoyalCH < 0.764572 172 188.90 CH ( 0.76163 0.23837 )

12) ListPriceDiff < 0.235 70 95.61 CH ( 0.57143 0.42857 ) *

13) ListPriceDiff > 0.235 102 69.76 CH ( 0.89216 0.10784 ) *

7) LoyalCH > 0.764572 278 86.14 CH ( 0.96403 0.03597 ) *

Terminal nodes are marked with an asterisk. For example, node 9 has the split criterion LoyalCH > 0.0356415; this branch contains 109 observations with a deviance of 100.90 and a predicted class of MM. About 17.4% of the observations in this branch take the value CH and the remaining 82.6% take the value MM.

d.

rpart_modelOJ <- rpart(Purchase ~ ., data = training_OJ, method = "class",
                       control = rpart.control(cp = 0))
rpart.plot(rpart_modelOJ)

[rpart.plot output: unpruned classification tree for Purchase; the splits involve LoyalCH, ListPriceDiff, StoreID, SalePriceMM, WeekofPurchase, STORE, PriceDiff, PriceMM, SpecialCH, and DiscMM, and each node shows the predicted class, the probability of MM, and the percentage of training observations.]
As the plot above shows, given the relevant information about a particular customer, the tree can be used to predict which brand of orange juice that customer will buy.
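For instance, a single test-set customer (the choice of row is arbitrary) can be passed through the fitted tree to obtain both the predicted brand and the class probabilities; this is just an illustrative sketch.

# Predict the purchase for one arbitrary customer from the test set
one_customer <- testing_OJ[1, ]
predict(rpart_modelOJ, newdata = one_customer, type = "class")
predict(rpart_modelOJ, newdata = one_customer, type = "prob")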
e.
Prediction_tree = predict(rpart_modelOJ, newdata = testing_OJ, type = "class")
table(Prediction_tree, testing_OJ$Purchase)
##
## Prediction_tree CH MM
## CH 132 25
## MM 27 86
(132 + 86) / 270
## [1] 0.8074074
Thus about 81% of the test observations are classified correctly, and the test error rate is about 19%.
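Equivalently, the test error rate can be computed straight from the confusion matrix; a small sketch using the table above:

# Test error rate = 1 - accuracy
conf <- table(Prediction_tree, testing_OJ$Purchase)
1 - sum(diag(conf)) / sum(conf)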
f.
CV_OJ = cv.tree (OJ_tree, FUN = prune.misclass)
CV_OJ
## $size
## [1] 8 5 2 1

$dev

[1] 146 146 160 306

$k

[1] -Inf 0.000000 4.666667 160.

$method

[1] “misclass”

attr(,”class”)

[1] “prune” “tree.sequence”

g.

tree_size <- CV_OJ$size
deviance <- CV_OJ$dev
plot(tree_size, deviance / nrow(training_OJ), type = "b",
     xlab = "tree_size", ylab = "deviance")

[Plot: cross-validated deviance (misclassification count divided by the number of training observations) versus tree size.]

As we can see in the plot, the lowest cross-validated classification error rate occurs at tree size 5 (it ties with size 8, so size 5 is the smallest tree achieving it).
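The optimal size can also be extracted from the cv.tree object directly; when several sizes tie, taking the smallest gives the most parsimonious tree. A sketch:

# Smallest tree size achieving the minimum cross-validated deviance
best_size <- min(CV_OJ$size[CV_OJ$dev == min(CV_OJ$dev)])
best_size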

h.

pruned_tree <- rpart(Purchase ~ ., data = training_OJ, method = "class",
                     control = rpart.control(cp = 0.01))
rpart.plot(pruned_tree)

[rpart.plot output: pruned classification tree for Purchase with splits on LoyalCH, PriceDiff, and SpecialCH; each node shows the predicted class, the probability of MM, and the percentage of training observations.]
i.

Pred_tree = predict(pruned_tree, newdata = testing_OJ, type = "class")
table(Pred_tree, testing_OJ$Purchase)

Pred_tree CH MM

CH 147 47

MM 12 64

(147 + 64) / 270

[1] 0.7814815

Thus the fraction of test observations classified correctly is 0.8074074 for the unpruned tree from part d (computed in part e) and 0.7814815 for the pruned tree from part h (computed here in part i). The pruned tree therefore has a slightly higher test error rate, so pruning does not improve performance on the test set in this case.
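The comparison can be made explicit by computing the two test error rates side by side; a sketch using the rpart fits from parts d and h:

# Test error rates of the unpruned (part d) and pruned (part h) trees
unpruned_err <- mean(predict(rpart_modelOJ, testing_OJ, type = "class") != testing_OJ$Purchase)
pruned_err <- mean(predict(pruned_tree, testing_OJ, type = "class") != testing_OJ$Purchase)
c(unpruned = unpruned_err, pruned = pruned_err)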