# Algorithm | Artificial | 代做mining | 代做Objective | Python | assignment – 1 Wine Quality Prediction with Decision Tree

### 1 Wine Quality Prediction with Decision Tree

Algorithm | Artificial | 代做mining | 代做Objective | Python | assignment – 该题目是一个常规的Decision Tree的练习题目代写, 涵盖了Algorithm/Artificial/mining/Objective/Python等方面, 这是值得参考的Decision Tree代写的题目

### 1 Wine Quality Prediction with Decision Tree

Wine experts evaluate the quality of wine based on sensory data. We could also collect the features of wine from Objective tests, thus the objective features could be used to predict the experts judgement, which is the quality rating of the wine. This could be formed as a supervised learning problem with the objective features as the data features and wine quality rating as the data labels. In this assignment, we provide objective features obtained from physicochemical statistics for each white wine sample and its corresponding rating provided by wine experts. You are expect to implementdecision tree learning (DTL), and use the training set to train your decision tree, then provide wine quality prediction on the test set. Wine quality rating is measured in the range of 0-9. In our dataset, we only keep the samples for quality ratings 5, 6 and 7. The 11 objective features are listed as follows [1]:

• facid: fixed acidity
• vacid: volatile acidity
• cacid: citric acid
• ressugar: residual sugar
• chlorides: chlorides
• fsdioxide: free sulfur dioxide
• tsdioxide: total sulfur dioxide
• density: density
• pH: pH
• sulphates: sulphates
• alcohol: alcohol

Explanation of the Data. train: The first 11 columns represent the 11 features and the 12th column is the wine quality. A sample is depicted as follows:

``````f_acid v_acid c_acid res_sugar chlorides fs_dioxide ts_dioxide density pH sulphates alcohol quality
8.10 0.270 0.41 1.45 0.033 11.0 63.0 0.99080 2.99 0.56 12.0 5
8.60 0.230 0.40 4.20 0.035 17.0 109.0 0.99470 3.14 0.53 9.7 5
7.90 0.180 0.37 1.20 0.040 16.0 75.0 0.99200 3.18 0.63 10.8 5
8.30 0.420 0.62 19.25 0.040 41.0 172.0 1.00020 2.98 0.67 9.7 5
6.50 0.310 0.14 7.50 0.044 34.0 133.0 0.99550 3.22 0.50 9.5 5
``````

test: Similar totrain, but without the 12th colum as they are the values your model will predict. A sample is depicted as follows:

``````f_acid v_acid c_acid res_sugar chlorides fs_dioxide ts_dioxide density pH sulphates alcohol
7.0 0.360 0.14 11.60 0.043 35.0 228.0 0.99770 3.13 0.51 8.
6.3 0.270 0.18 7.70 0.048 45.0 186.0 0.99620 3.23 0.47 9.
7.2 0.290 0.20 7.70 0.046 51.0 174.0 0.99582 3.16 0.52 9.
7.1 0.140 0.35 1.40 0.039 24.0 128.0 0.99212 2.97 0.68 10.
7.6 0.480 0.28 10.40 0.049 57.0 205.0 0.99748 3.24 0.45 9.
``````

#### 1.1 Decision Tree Learning

From the given training data, our goal is to learn a function that can predict the wine quality rating of a wine sample, based on the objective features. In this assignment, the predictor function will be constructed as a decision tree. Since the attributes (objective features) are continuous valued, you shall apply the DTL Algorithm for continuous data, as outlined in Algorithms 1 and 2. Once the tree is constructed, Algorithm 3 to predict the wine quality to a new wine sample.

Algorithm 1DTL(data,minleaf)

Require:datain the form ofNinput-output pairs{xi, yi}Ni=1, minleaf 1 1:if(Nminleaf) or (yi=yjfor alli, j) or (xi=xjfor alli, j)then 2: Create new leaf noden. 3: ifthere is a unique mode (most frequent value) in{yi}Ni=1then 4: n.labelmode in{yi}Ni= 5: else 6: n.labelunknown 7: end if 8: return n 9:end if 10:[attr, splitval]ChooseSplit(data)=Algorithm 2 11:Create new noden. 12:n.attrattr 13:n.splitvalsplitval 14:n.lef tDTL(datawithxi[attr]splitval, minleaf) 15:n.rightDTL(datawithxi[attr]> splitval, minleaf) 16:return n

Algorithm 2ChooseSplit(data)

Require:datain the form of N input-output pairs{xi, yi}Ni=1. 1:bestgain 0 2:foreachattrindatado 3: Sort the arrayx 1 [attr],x 2 [attr], …,xN[attr]. 4: fori= 1, 2 , …N 1 do 5: splitval 0 .5(xi[attr] +xi+1[attr]) 6: gainInformation gain of (attr, splitval) // See lecture slides. 7: ifgain > bestgainthen 8: bestattrattrandbestsplitvalsplitvalandbestgaingain 9: end if 10: end for 11:end for 12:return (bestattr, bestsplitval)

Algorithm 3PredictDTL(n, data) Require:Decision tree root noden,datain the form of attribute valuesx. 1:whilenis not a leaf nodedo 2: if x[n.attr]n.splitvalthen 3: nn.lef t 4: else 5: nn.right 6: end if 7:end while 8:return n.label

#### 1.2 Deliverable

Write your decision tree learning program in Python 3.6.9 in a file calledwinequality.py. Your program must be able to run as follows:

\$ python winequality.py [train] [test] [minleaf]

The inputs/options to the program are as follows:

• [train]specifies the path to a set of training data file.
• [test]specifies the path to a set of testing data file.
• [minleaf]is an integer greater than zero which specifies the second input param- eter to the DTL algorithm (Algorithm 1).

Given the inputs, your program must learn a decision tree (following the prescribed algorithms) using the training data, then predict the quality rating of each of the wine sample in the testing data. Your program must thenprint to standard output(i.e., the command prompt) the list of predicted wine quality ratings, vertically based on the order in which the testing cases appear in[test].

1.2.1 Python libraries

You are allowed to use the Python standard library to write your decision tree learning program (seehttps://docs.python.org/3/library/for the components that make up the Python v3.6.9 standard library). In addition to the standard library, you are allowed to use NumPy. Note that the marking program will not be able to run your program to completion if other third party libraries are used.

#### 1.4 Expected run time

Your program must be able to terminate within 300 seconds on the sample data given.

#### 1.5 Debugging Suggestions

Step-by-step debugging by checking intermediate values/results will help you to identify the problems of your code. This function is enabled by most of the Python IDE. If not in your case, you could also print the intermediate values out. The values that are worth checking when debugging are (but not limited to):

• bestsplitval,bestgain,bestattrwhen splitting;
• n.splitval,n.attr,n.label, when creating nodes and prediction

You could use sample data or create data in the same format for debugging.

#### 1.6 Assessment

I will compile and run your code on several test problems. If it passes all tests, you will get15%(undergrads) or12%(postgrads) of the overall course mark.For undergrad- uates, bonus marks of3%will be awarded if Section 2 is completed correctly. There will be no further manual inspection/grading of your program to award marks on the basis of coding style, commenting or amount of code written.

#### 1.7 Using other source code

You may not use other source code for this assignment. All submitted code must be your own work written from scratch. Only by writing the solution yourself will you fully understand the concept.

#### 1.8 Due date and late submission policy

This assignment is due by11:59pm Saturday 7 May. If your submission is late, the maximum mark you can obtain will be reduced by 25% per day (or part thereof) past the due date or any extension you are granted.

``````Continues next page for postgraduate section.
``````

### 2 Wine Quality Prediction with Random Forest

For postgraduate students, completing this section will give you the remaining3%of the assignment marks. Random forest is an ensemble method which combines predictions of multiple decision trees. In this task, you will extend your knowledge learnt from decision tree learning to random forest learning (RFL). The process for a simplified RFL givenNinput-output pairs is:

``````(1) Randomly select a set ofNsamples via bootstrap sampling (see explianation later).
This dataset is used for training a decision tree (i.e., the root node of the decision
tree).
``````
``````(2) Build a decision tree on the dataset from (1) and apply Algorithm 1.
``````
``````(3) Repeat (1) and (2) until reaching the maximum number of trees.
``````

This process is also shown in Algorithm 4. In random forest learning, a sample set is used to train a decision tree. That is to say, different trees in the forest could have different root data. For prediction, the random forest will choose the most voted label as its prediction. For the wine quality prediction task, you shall apply Algorithm 4 for random forest learning and apply Algorithm 5 to predict the wine quality for a new wine sample.

Bootstrap sampling. It is a random sampling with replacement (i.e., replace the sam- pled data back to the original dataset). For example, ifN= 5, so we have{x 1 ,x 2 , …,x 5 } (omityifor simplicity). To perform bootstrap sampling, we randomly select one data point from the five, lets sayx 3 , to be the first sample. Then we putx 3 back to the dataset and draw another sample from the five data points. This time, the sample could also bex 3 or another one. We repeat this process until we getN samples and form a sample set. In this way, the sample set could contain repeated data points. So the following cases are possible: {x 3 ,x 3 ,x 3 ,x 3 ,x 3 }, when all the data samples are the same. {x 1 ,x 3 ,x 4 ,x 2 ,x 5 }, when all the data samples are different. {x 3 ,x 1 ,x 3 ,x 2 ,x 1 }, when some samples are repeated. …

#### 2.1 Deliverables

Write your random forest program in Python 3.6.9 in a file calledwinequalityRFL.py. Your program must be able to run as follows:

\$ python winequalityRFL.py [train] [test] [minleaf] [n_trees] [random_seed]

The inputs/options to the program are as follows:

• [train]specifies the path to a set of training data file.

Algorithm 4RFL(data, minleaf, ntrees, randseed)

Require:datain the form ofNinput-output pairs{xi, yi}Ni=1, ntrees > 1 , minleaf1. 1:f orest[] 2:sampleindexesNntreesrandom integers with value in [0, N) generated byrandseed 3:count 0 4:forcount < ntreesdo 5: sampleddataNdata pairs selected byNindexes fromsampleindexessequentially 6: n=DTL(sampleddata,minleaf) =Algorithm 1 7: f orest.append(n) 8:end for 9:return f orest

Algorithm 5PredictRFL(f orest, data) Require:f orestis a list of tree roots,datain the form of attribute valuesx. 1:labels[] 2:for Each treenin thef orestdo 3: labelPredictDTL (n, data) =Algorithm 3 4: labels.append(n) 5:end for 6:return the most voted label inlabels

• [test]specifies the path to a set of testing data file.
• [minleaf]is an integer greater than zero which specifies the second input param- eter to the RFL algorithm (Algorithm 4).
• [n_trees]is an integer greater than one which specifies the third input parameter to the RFL algorithm (Algorithm 4).
• [random_seed]is the seed value generate random values. Please use import ran- dom package, use random.seed(randseed) to set seed value and random.randint(0, M) to create a random value in [0,M], repeat random.randint to get enough random values. Given the inputs, your program must learn a random forest (following the prescribed algorithms) using the training data, then predict the quality rating of each wine sample in the testing data. Your program must thenprint to standard output(i.e., the command prompt) the list of predicted wine quality ratings, vertically based on the order in which the testing cases appear in[test]. Submit your program in the same way as the submission for Sec. 1.For postgradu- ates, please submit your learning programs (winequality.pyandwinequalityRFL.py) toAssignment 2 – Postgraduates. The due date, late submission policy and code reuse policy are also the same as in Sec. 1.

#### 2.2 Expected run time

Your program must be able to terminate within 300 seconds on the sample data given.

#### 2.3 Debugging Suggestions

In addition to Sec. 1.5, another value worth checking when debugging is (but not limited to):

• thesampleindexes by setting random seed, the indexes should be the same each time you run the code

#### 2.4 Assessment

I will compile and run your code on a single test case. If it passes, you will get3%of the overall course mark.

``````The End
``````

### References

[1]Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decision support systems 47, 4 (2009), 547553.