
2 Programming assignment (80%)

2.1 Naive Bayes (40%)

Much of machine learning with text involves counting events, and Naive Bayes fits nicely into this framework: the classifier needs just a few counters. For this assignment we will perform document classification using Naive Bayes. First, let $y$ denote the label of a training document and $W = \{w_i\}_{i=1}^{L}$ the set of words (tokens) in the document. Here are the counters we need for prediction:

  • $\#(Y=y)$: for each label $y$, the number of training instances of that class
  • $\#(Y=\ast)$: here $\ast$ means anything, so this is just the total number of training instances
  • $\#(Y=y, W=w)$: the number of times token $w$ appears in documents with label $y$
  • $\#(Y=y, W=\ast)$: the total number of tokens in documents with label $y$.

One way to compute these counters is to convert all the documents into feature vectors $w \in \mathbb{Z}^{|V|}$, i.e. $|V|$-dimensional integer vectors, where $V$ is the dictionary, which provides indices for all words. Then $w_i$ is the number of appearances of the word with index $i$ in this document. Store all of these feature vectors together as a matrix, which in your case is saved as a .csv file (see Section 2.3 for details). Without smoothing, the final prediction is

$$\hat{y} = \arg\max_y \left( \prod_{i=1}^{L} \frac{\#(Y=y,\, W=w_i)}{\#(Y=y,\, W=\ast)} \right) \frac{\#(Y=y)}{\#(Y=\ast)} \qquad (1)$$

Also, to avoid numerical underflow, you may want to use the logarithm of the probability:

$$\hat{y} = \arg\max_y \left( \sum_{i=1}^{L} \log \frac{\#(Y=y,\, W=w_i)}{\#(Y=y,\, W=\ast)} \right) + \log \frac{\#(Y=y)}{\#(Y=\ast)} \qquad (2)$$

Note that your implementation will use smoothed probabilities. For example, at classification time you can use Laplace smoothing with $\alpha = 1$ as described here: https://en.wikipedia.org/wiki/Additive_smoothing.
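For concreteness, here is a minimal NumPy sketch of these counters and the smoothed prediction rule (2). It is only an illustration, not the required implementation: the names train_nb, predict_nb, and alpha are made up here, X is assumed to be the count matrix from Xtrain.csv, and the per-token sum in (2) is computed equivalently as a count-weighted dot product.

    import numpy as np

    def train_nb(X, y, alpha=1.0):
        """Collect the smoothed Naive Bayes counters from count matrix X and labels y."""
        classes = np.unique(y)
        V = X.shape[1]
        # log( #(Y=y) / #(Y=*) ): class priors
        log_prior = np.log(np.array([(y == c).sum() for c in classes]) / len(y))
        # #(Y=y, W=w): per-class token counts, one row per class
        word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
        # Laplace-smoothed log( #(Y=y, W=w_i) / #(Y=y, W=*) )
        log_cond = np.log((word_counts + alpha) /
                          (word_counts.sum(axis=1, keepdims=True) + alpha * V))
        return classes, log_prior, log_cond

    def predict_nb(X, classes, log_prior, log_cond):
        """Eq. (2): count-weighted sum of token log-probabilities plus log prior."""
        scores = X @ log_cond.T + log_prior   # shape: (num_docs, num_classes)
        return classes[np.argmax(scores, axis=1)]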

2.2 Voted Perceptron (40%)

The Perceptron algorithm is one of the classic algorithms of machine learning, in use since the early 1960s. It is an online learning algorithm that learns a linear threshold function which works as a classifier. We will apply the Voted Perceptron algorithm to the same document classification task in this assignment. Let $T = \{(y_i, x_i)\}_{i=1}^{m}$ be the training data, where $y_i$ is the label of the $i$-th document ($y_i \in \{-1, +1\}$ in a binary classification problem) and $x_i$ is the corresponding feature vector. The goal of the Perceptron algorithm is to find a vector $w$ defining a linear function such that $\forall i,\ y_i \langle w, x_i \rangle > 0$ ($\langle \cdot, \cdot \rangle$ is the inner product operation). For the Voted Perceptron, you need a list of weighted vectors $\{(w_k, c_k)\}_{k=1}^{K}$, where $w_k$ is a vector and $c_k$ is its weight. (Please refer to https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf for a clear description of the Voted Perceptron.)

Initialize $k = 1$, $c_1 = 0$, $w_1 = \mathbf{0}$, $t = 0$;
while $t \le T$ (number of rounds) do
  foreach training example $(y_i, x_i)$ do
    if $y_i \langle w_k, x_i \rangle \le 0$ then
      $w_{k+1} = w_k + y_i x_i$;
      $c_{k+1} = 1$;
      $k = k + 1$;
    else
      $c_k = c_k + 1$;
    end
  end
  $t = t + 1$;
end

Then you can use all these vectors and their weights for classification: the predicted label $\hat{y}$ for any unlabeled document with feature vector $x$ would be

$$\hat{y} = \operatorname{sign}\left( \sum_{k=1}^{K} c_k \operatorname{sign}(\langle x, w_k \rangle) \right) \qquad (3)$$
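The loop above translates almost line for line into NumPy. The sketch below is one possible reading, not a reference solution; train_voted_perceptron and predict_voted_perceptron are made-up names, X is the count matrix, and y is assumed to already be in $\{-1, +1\}$ (the 0/1 labels from Ytrain.csv would need to be remapped first).

    import numpy as np

    def train_voted_perceptron(X, y, T=10):
        """Voted Perceptron: returns all weight vectors w_k and their votes c_k."""
        w = [np.zeros(X.shape[1])]   # w_1 = 0
        c = [0]                      # c_1 = 0
        for _ in range(T):           # T rounds over the training data
            for xi, yi in zip(X, y):
                if yi * (w[-1] @ xi) <= 0:    # mistake: open a new weight vector
                    w.append(w[-1] + yi * xi)
                    c.append(1)
                else:                         # survived: current vector earns a vote
                    c[-1] += 1
        return np.array(w), np.array(c)

    def predict_voted_perceptron(X, w, c):
        """Eq. (3): vote-weighted sign of the per-vector predictions."""
        return np.sign(np.sign(X @ w.T) @ c)  # sum_k c_k * sign(<x, w_k>)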

2.3 Data Set

For this assignment, you should download the training data from https://www.dropbox.com/s/5nvdcezf85vpetj/hw2_data.zip?dl=0. It contains two csv files: Xtrain.csv and Ytrain.csv.

Xtrain.csv Each row is the feature vector of one document. The value in the $i$-th column is the number of occurrences of the word with ID $i$ in the corresponding document. For example, assume the first row in Xtrain.csv is: 0, 1, 2, 10, 0, 2, … This means the first word in the dictionary does not appear in this document, the second word in the dictionary appears once, the third word appears twice, and so on.

Ytrain.csv The csv file Ytrain.csv provides the binary labels for the corresponding documents in Xtrain.csv. The label for the document in the above example is stored in the first row of Ytrain.csv. There are two different kinds of news documents; your task is to detect onion documents (fake news) among them:

  • onion (fake news, labeled with 1), containing articles from the not-so-serious American magazine TheOnion.com
  • economist (normal news, labeled with 0), containing articles from the serious European magazine Economist.com
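For reference, one plausible way to load these files with NumPy (assumptions: both files sit in the working directory, and the $\pm 1$ remapping is only needed for the Voted Perceptron part):

    import numpy as np

    # Each row of Xtrain.csv is one document's word-count vector.
    X = np.loadtxt("Xtrain.csv", delimiter=",", dtype=int)
    # Ytrain.csv holds one 0/1 label per row, aligned with the rows of Xtrain.csv.
    y = np.loadtxt("Ytrain.csv", delimiter=",", dtype=int)

    y_pm = 2 * y - 1   # map {0, 1} -> {-1, +1} for the Voted Perceptron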

2.4 Implementation & Submission

For this assignment, we publish a CodaLab competition (https://competitions.codalab.org/competitions/20280?secret_key=a9ed97f1-66b4-4d71-8151-0ec85ccd7856), where you can submit your code and get evaluation scores that will be listed on a leaderboard. To submit during the four late days, please go to https://competitions.codalab.org/competitions/20279?secret_key=d99d8c12-2269-488d-a592-d48ab91991d1. If you are not familiar with CodaLab Competitions, check out here: participating in a competition. Remember to register a CodaLab Competitions account using your umail account, so that your username will be your UCSB NetID. After that, log into CodaLab Competitions and set up your team name (whatever nickname you like). To protect your privacy, only team names will be shown on the leaderboard and usernames will remain anonymous. After your submission finishes running, please choose to submit it to the leaderboard. Note that here the team name is equivalent to your nickname, and this is still an independent homework assignment.

You must implement the two algorithms according to the instructions above. You must avoid calling APIs from existing machine learning toolkits such as scikit-learn; using numpy is allowed. There are two phases, Naive Bayes and Voted Perceptron, representing the two parts of the assignment. You must submit one zip file per phase via the corresponding submission page. Your code must be written in Python.

Submission format The final submission format should be one zip file per phase:

naive_bayers.zip
voted_perceptron.zip

each containing:

  • run.py
  • report.pdf (see Sec. 3 for more details)
  • other Python scripts you wrote

Note that to create a valid submission, zip all the files with zip -r zipfilename * starting from this directory. DO NOT zip the directory itself, just its contents. (THIS IS VERY IMPORTANT!) Here is a sample run.py file: https://www.dropbox.com/s/a2ahv2qh7solhhn/run.py?dl=0. You must submit a run.py file in exactly the same format. For each phase, you have a total of 30 possible submissions.

Computational restrictions For both phases, the prescribed "time budget" is 300 seconds, including both the running time of your code and the evaluation time of the predicted results. In other words, the running time of your program must be less than 300 seconds.
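A simple way to check this locally is to time a full run of your script, e.g.:

    import time

    start = time.time()
    # ... load data, train, and write predictions ...
    print(f"total runtime: {time.time() - start:.1f} s")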

Figure 1: Precision, Recall and Accuracy

2.5 Evaluation Criteria

As described in Section 2.4, the prediction file generated by your code should be in exactly the same format as the file Ytrain.csv. We will then compare your predictions with the ground-truth labels and use a weighted combination of accuracy and F-measure to evaluate your results, so your final score for each submission is

$$\text{final\_score} = 50 \cdot \text{accuracy} + 50 \cdot \text{F\_measure}.$$

For those who are not familiar with the definitions of precision and F-measure, here is a simple explanation (see Figure 1; $a$, $b$, $c$, $d$ are the numbers of documents with each combination of true and predicted labels):

$$\text{precision} = \frac{d}{b+d}, \qquad \text{recall} = \frac{d}{c+d},$$
$$\text{F\_measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{accuracy} = \frac{a+d}{a+b+c+d}.$$

Even though your code is automatically evaluated on CodaLab Competitions, we will download your latest submissions and run a plagiarism-detection program on them. So please be honest.

3 Report (20%)

In addition to the code submission on CodaLab Competitions, you are also supposed to write a report and include it in the submission zip file for the second phase. The report should address the following question. First, use the last 10% of the training data as your test data. Then compare Naive Bayes and Voted Perceptron trained on several fractions of the remaining training data: pick 1%, 2%, 5%, 10%, 20% and 100% of the first 90% of the training data to train on, and compare the performance of Naive Bayes and Voted Perceptron on the test data. Plot the accuracy as a function of the fraction size (the x-axis should be the percentage of the remaining training data and the y-axis should be accuracy).
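A sketch of this experiment, assuming the loading snippet from Section 2.3 and the hypothetical train/predict helpers sketched in Sections 2.1 and 2.2:

    import numpy as np
    import matplotlib.pyplot as plt

    split = int(0.9 * len(X))
    X_tr, y_tr = X[:split], y[:split]     # first 90% available for training
    X_te, y_te = X[split:], y[split:]     # last 10% held out as test data

    fractions = [0.01, 0.02, 0.05, 0.10, 0.20, 1.00]
    nb_acc, vp_acc = [], []
    for f in fractions:
        n = int(f * len(X_tr))
        Xf, yf = X_tr[:n], y_tr[:n]
        # train both models on the fraction and record test accuracy
        nb_model = train_nb(Xf, yf)
        nb_acc.append(np.mean(predict_nb(X_te, *nb_model) == y_te))
        w, c = train_voted_perceptron(Xf, 2 * yf - 1)
        vp_acc.append(np.mean(predict_voted_perceptron(X_te, w, c) == 2 * y_te - 1))

    plt.plot([100 * f for f in fractions], nb_acc, marker="o", label="Naive Bayes")
    plt.plot([100 * f for f in fractions], vp_acc, marker="s", label="Voted Perceptron")
    plt.xlabel("Percent of remaining training data")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.savefig("learning_curve.png")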
