632 HW

Qianwen Shi
2019/4/

library(ISLR)
library(tree)
library(rpart.plot)

Warning: package ‘rpart.plot’ was built under R version 3.5.

Loading required package: rpart

library(rpart)
library(caret)

Warning: package ‘caret’ was built under R version 3.5.

Loading required package: lattice

Loading required package: ggplot2

library (randomForest)

randomForest 4.6-

Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

margin

Q1:
a.

set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats) / 2)
training = Carseats[train, ]
testing = Carseats[-train, ]

b.

reg_tree = tree(Sales ~ ., data = Carseats, subset = train)
summary(reg_tree)

Regression tree:

tree(formula = Sales ~ ., data = Carseats, subset = train)

Variables actually used in tree construction:

[1] “ShelveLoc” “Price” “Age” “Advertising” “Income”

[6] “CompPrice”

Number of terminal nodes: 18

Residual mean deviance: 2.36 = 429.5 / 182

Distribution of residuals:

Min. 1st Qu. Median Mean 3rd Qu. Max.

-4.2570 -1.0360 0.1024 0.0000 0.9301 3.

reg_model <- rpart(Sales ~ ., data = training, method = "anova")
rpart.plot(reg_model)

[rpart.plot output: regression tree for Sales with splits on ShelveLoc, Price, Age, CompPrice, Advertising, and US; each node shows the predicted Sales and the percentage of training observations.]

yhat = predict(reg_tree, newdata = testing)
mean((yhat - testing$Sales)^2)

[1] 4.148897

As we can see in the plot, the branch with Price >= 121 accounts for 82% of the observations; within it, 30% have Age >= 67 and 53% have Age >= 51, and the node with Price >= 113 contains 18%. The test MSE is 4.148897.
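If we want to double-check any of the node percentages in the plot, the proportions can be recomputed from the training data. This is only a sketch; the splits below are simply the ones printed near the top of the plot.

# Sketch: recompute node proportions for the first splits shown in the plot
mean(training$ShelveLoc %in% c("Bad", "Medium"))   # root split: ShelveLoc = Bad,Medium
mean(training$Price >= 121)                        # split on Price >= 121
mean(training$Price >= 121 & training$Age >= 67)   # ... further restricted by Age >= 67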

c.

trControl <- trainControl(method = "cv", number = 10)
crossvali_model <- train(Sales ~ ., data = training, method = "rpart",
                         trControl = trControl,
                         tuneGrid = expand.grid(cp = seq(0, 0.4, length.out = 30)))

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =

trainInfo, : There were missing values in resampled performance measures.

plot.train (crossvali_model)

[plot.train output: cross-validated RMSE versus the complexity parameter cp.]

New_tree = prune.tree(reg_tree, best = 8)
rpart.plot(crossvali_model$finalModel)

[rpart.plot output: cross-validated final tree with splits on ShelveLocGood, Price, and Age; each node shows the predicted Sales and the percentage of training observations.]

yhat = predict(New_tree, newdata = testing)
mean((yhat - testing$Sales)^2)

## [1] 5.09085

As we can see in the result, pruning the tree to 8 terminal nodes increases the test MSE to 5.09085.
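To see how sensitive the test MSE is to the chosen size, prune.tree could be swept over several sizes. This is just a sketch using the objects already defined above, not part of the required answer.

# Sketch: test MSE of the pruned regression tree for a range of sizes
sizes <- 2:18
mse_by_size <- sapply(sizes, function(b) {
  pruned <- prune.tree(reg_tree, best = b)
  mean((predict(pruned, newdata = testing) - testing$Sales)^2)
})
cbind(size = sizes, test_mse = round(mse_by_size, 3))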

d.

set.seed(1)
bagging = randomForest(Sales ~ ., data = training, mtry = 10, importance = TRUE)
yhat_bagging = predict(bagging, newdata = testing)
mean((yhat_bagging - testing$Sales)^2)

[1] 2.614642

importance (bagging)

%IncMSE IncNodePurity

CompPrice 16.4714051 126.

Income 4.0561872 78.

Advertising 16.2730251 122.

Population 0.7711188 62.

Price 54.5571815 512.

ShelveLoc 42.4486118 320.

Age 20.5369414 184.

Education 2.7755968 42.

Urban -2.3962157 8.

US 7.2258536 17.

varImpPlot (bagging)

[varImpPlot output for the bagging fit: two panels ranking the predictors by %IncMSE and IncNodePurity; Price and ShelveLoc are the top variables in both panels.]

As we can see in the plot, the most important variables are Price and ShelveLoc (the quality of the shelving location for the car seats at each site). The test MSE with the bagging approach is 2.614642.
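The same ranking can be read off numerically by sorting the importance matrix of the bagging fit; a small sketch:

# Order predictors by %IncMSE, largest first
imp <- importance(bagging)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]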

e.

set.seed(1)
random_forst = randomForest(Sales ~ ., data = training, mtry = 3, importance = TRUE)
yhat_random = predict(random_forst, newdata = testing)
mean((yhat_random - testing$Sales)^2)

[1] 3.237463

The test MSE is 3.237463, which is larger than the bagging test MSE, so in this case the random forest does not improve on bagging.
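If we wanted to check how the choice of mtry affects this conclusion, the forest could be refit over a few values. This is only a sketch, not part of the required answer.

# Sketch: test MSE as a function of mtry
mtry_vals <- c(2, 3, 5, 7, 10)
sapply(mtry_vals, function(m) {
  set.seed(1)
  fit <- randomForest(Sales ~ ., data = training, mtry = m)
  mean((predict(fit, newdata = testing) - testing$Sales)^2)
})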

Q2:
a.

set.seed(1)
train = sample(dim(OJ)[1], 800)
training_OJ = OJ[train, ]
testing_OJ = OJ[-train, ]

b.

OJ_tree = tree(Purchase ~ ., data = training_OJ)
summary(OJ_tree)

Classification tree:

tree(formula = Purchase ~ ., data = training_OJ)

Variables actually used in tree construction:

[1] “LoyalCH” “PriceDiff” “SpecialCH” “ListPriceDiff”

Number of terminal nodes: 8

Residual mean deviance: 0.7305 = 578.6 / 792

Misclassification error rate: 0.165 = 132 / 800

As we can see, the training error rate is 0.165 and the tree has 8 terminal nodes.
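The training error rate reported by summary() can also be recomputed directly from the fitted tree; a quick sketch:

# Misclassification rate on the training set; should match the 0.165 above
train_pred <- predict(OJ_tree, newdata = training_OJ, type = "class")
mean(train_pred != training_OJ$Purchase)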

c.

OJ_tree

node), split, n, deviance, yval, (yprob)

* denotes terminal node

1) root 800 1064.00 CH ( 0.61750 0.38250 )

2) LoyalCH < 0.508643 350 409.30 MM ( 0.27143 0.72857 )

4) LoyalCH < 0.264232 166 122.10 MM ( 0.12048 0.87952 )

8) LoyalCH < 0.0356415 57 10.07 MM ( 0.01754 0.98246 ) *

9) LoyalCH > 0.0356415 109 100.90 MM ( 0.17431 0.82569 ) *

5) LoyalCH > 0.264232 184 248.80 MM ( 0.40761 0.59239 )

10) PriceDiff < 0.195 83 91.66 MM ( 0.24096 0.75904 )

20) SpecialCH < 0.5 70 60.89 MM ( 0.15714 0.84286 ) *

21) SpecialCH > 0.5 13 16.05 CH ( 0.69231 0.30769 ) *

11) PriceDiff > 0.195 101 139.20 CH ( 0.54455 0.45545 ) *

3) LoyalCH > 0.508643 450 318.10 CH ( 0.88667 0.11333 )

6) LoyalCH < 0.764572 172 188.90 CH ( 0.76163 0.23837 )

12) ListPriceDiff < 0.235 70 95.61 CH ( 0.57143 0.42857 ) *

13) ListPriceDiff > 0.235 102 69.76 CH ( 0.89216 0.10784 ) *

7) LoyalCH > 0.764572 278 86.14 CH ( 0.96403 0.03597 ) *

Terminal nodes are marked with an asterisk. For example, node 9 has the split criterion LoyalCH > 0.0356415; this branch contains 109 observations with a deviance of 100.90 and a predicted class of MM. About 17.4% of the observations in this branch take the value CH and the remaining 82.6% take the value MM.

d.

rpart_modelOJ <- rpart(Purchase ~ ., data = training_OJ, method = "class",
                       control = rpart.control(cp = 0))
rpart.plot(rpart_modelOJ)

[rpart.plot output: unpruned classification tree for Purchase; the splits involve LoyalCH, ListPriceDiff, StoreID, SalePriceMM, WeekofPurchase, STORE, PriceDiff, PriceMM, SpecialCH, and DiscMM, and each node shows the predicted class, the probability of MM, and the percentage of training observations.]
As the plot above shows, given the relevant information about a particular customer, the tree can be used to predict which brand of orange juice that customer will buy.
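For instance, a single test-set customer (the choice of row is arbitrary) can be passed through the fitted tree to obtain both the predicted brand and the class probabilities; this is just an illustrative sketch.

# Predict the purchase for one arbitrary customer from the test set
one_customer <- testing_OJ[1, ]
predict(rpart_modelOJ, newdata = one_customer, type = "class")
predict(rpart_modelOJ, newdata = one_customer, type = "prob")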
e.
Prediction_tree = predict(rpart_modelOJ, newdata = testing_OJ, type = "class")
table(Prediction_tree, testing_OJ$Purchase)
##
## Prediction_tree CH MM
## CH 132 25
## MM 27 86
(132 + 86) / 270
## [1] 0.8074074
Thus about 81% of the test observations are classified correctly, and the test error rate is about 19%.
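Equivalently, the test error rate can be computed straight from the confusion matrix; a small sketch using the table above:

# Test error rate = 1 - accuracy
conf <- table(Prediction_tree, testing_OJ$Purchase)
1 - sum(diag(conf)) / sum(conf)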
f.
CV_OJ = cv.tree (OJ_tree, FUN = prune.misclass)
CV_OJ
## $size
## [1] 8 5 2 1

$dev

[1] 146 146 160 306

$k

[1] -Inf 0.000000 4.666667 160.

$method

[1] “misclass”

attr(,”class”)

[1] “prune” “tree.sequence”

g.

tree_size <- CV_OJ$size
deviance <- CV_OJ$dev
plot(tree_size, deviance / nrow(training_OJ), type = "b",
     xlab = "tree_size", ylab = "deviance")

[Plot: cross-validated deviance (misclassification count divided by the number of training observations) versus tree size.]

As we can see in the plot, the lowest cross-validated classification error rate occurs at tree size 5 (it ties with size 8, so size 5 is the smallest tree achieving it).
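The optimal size can also be extracted from the cv.tree object directly; when several sizes tie, taking the smallest gives the most parsimonious tree. A sketch:

# Smallest tree size achieving the minimum cross-validated deviance
best_size <- min(CV_OJ$size[CV_OJ$dev == min(CV_OJ$dev)])
best_size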

h.

pruned_tree <- rpart(Purchase ~ ., data = training_OJ, method = "class",
                     control = rpart.control(cp = 0.01))
rpart.plot(pruned_tree)

[rpart.plot output: pruned classification tree for Purchase with splits on LoyalCH, PriceDiff, and SpecialCH; each node shows the predicted class, the probability of MM, and the percentage of training observations.]
i.

Pred_tree = predict(pruned_tree, newdata = testing_OJ, type = "class")
table(Pred_tree, testing_OJ$Purchase)

Pred_tree CH MM

CH 147 47

MM 12 64

(147 + 64) / 270

[1] 0.7814815

Thus the fraction of test observations classified correctly is 0.8074074 for the unpruned tree from part d (computed in part e) and 0.7814815 for the pruned tree from part h (computed here in part i). The pruned tree therefore has a slightly higher test error rate, so pruning does not improve performance on the test set in this case.
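The comparison can be made explicit by computing the two test error rates side by side; a sketch using the rpart fits from parts d and h:

# Test error rates of the unpruned (part d) and pruned (part h) trees
unpruned_err <- mean(predict(rpart_modelOJ, testing_OJ, type = "class") != testing_OJ$Purchase)
pruned_err <- mean(predict(pruned_tree, testing_OJ, type = "class") != testing_OJ$Purchase)
c(unpruned = unpruned_err, pruned = pruned_err)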