
A3_<first_name>_<last_name>.zip
|— A3_<first_name>_<last_name>
|   |– Ngrams
|   |   |— <my *.cpp, *.h, *.vcxproj>
|   |– Ngrams.sln
|   |– P1
|   |   |— <your *.cpp, *.h, *.vcxproj for problem #1>
|   |– P2
|   |   |— <your *.cpp, *.h, *.vcxproj for problem #2>
|   ...
|   |– PN
|   |   |— <your *.cpp, *.h, *.vcxproj for problem #N>

2 Code Provided

2.1 Intro

In this assignment, you will work with character and word (string) based language models. To efficiently store and count nGrams, you should use either hash tables or balanced search trees. The C++ standard library has a hash table named unordered_map. I provide you with code that illustrates efficient nGram processing using unordered_map, both for character (char) and word (string) models. If you wish, you can store nGrams using another data structure, but make sure it is an efficient implementation (such as a hash table or a balanced tree). I also provide code to read from files. You must use this code in order to make sure parsing is done in a consistent way. Lastly, I provide you with code to sample from a probability distribution (for random sentence generation, problem 3) and code to compute the edit distance between strings (for the spell checking application, problem 7). All code was compiled and tested under MS Visual Studio 2010.

2.2 Code Provided

2.2.1 fileRead.h, fileRead.cpp

Contains code for reading from files. You will need the following functions.

  • void read_tokens(const std::string & filename, std::vector<std::string> & tokens, bool eos): reads the file filename into a vector of string. If eos = true, it also reads the end of sentence marker as the special string EOS.
  • void read_tokens(const std::string & filename, std::vector<char> & tokens, bool latin_only): reads the file filename into a vector of char. If latin_only = false, it reads all characters from the file. If latin_only = true, it reads only Latin characters and converts upper case to lower case.
  • EOS = "": special marker for end of sentence.

2.2.2 test.cpp

Illustrates how to build a character model and a string model based on C++ unordered_map, which is a hash table. I will use the terms hash table and unordered_map interchangeably. For the word (string) language model, you should have typedef string T. For the character (char) language model, you should have typedef char T. In both cases, an nGram is represented as a vector<T>. This vector serves as a key into the hash table. I use int as the value associated with the key to give the count of how many times the nGram (vector<T>) occurred in the text data. When I insert a new nGram, I set its count to 1. Each time I read that nGram again from the text, I increment the count by 1.

One should be careful when using the unordered_map built-in function count(key). Despite being called count, it has only two return values: 0 if key is not in the hash table, and 1 if key is in the hash table. To see how often key occurs, that is, the value associated with key, use the square bracket operator. But be aware that the square bracket operator will insert a new entry into the hash table if an entry with the given key is not already in the unordered_map. To illustrate, suppose h is an unordered_map that currently does not have key "abc". If you use h["abc"], it will cause an entry with key "abc" to be inserted into h with a default value in the value field (which is of type int in our case). Thus, to check whether there is an entry with key "abc", use h.count("abc"). However, remember that if h.count("abc") returns 1, all it means is that an entry with key "abc" is in your hash table h. The actual count of how many times the nGram "abc" was read from the file is in the value field associated with key "abc". You access it with h["abc"]. At this point, accessing h["abc"] is safe, since you already know that key "abc" is in your hash table.
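In code, the counting pattern just described looks roughly like the following. This is a minimal sketch: I assume the typedef convention from test.cpp and that VectorHash (section 2.2.3) can be passed directly as the hash functor for vector keys; check the provided files for the exact template arguments.

    #include <string>
    #include <vector>
    #include <unordered_map>
    #include "VectorHash.h"

    typedef std::string T;   // word model; use "typedef char T" for the character model

    // Count every nGram of size n in the token stream. The nGram (a vector<T>)
    // is the key; the associated int value is its count so far.
    void countNgrams(const std::vector<T> & tokens, size_t n,
                     std::unordered_map<std::vector<T>, int, VectorHash> & h)
    {
        for (size_t i = 0; i + n <= tokens.size(); ++i) {
            std::vector<T> ngram(tokens.begin() + i, tokens.begin() + i + n);
            if (h.count(ngram) == 0)    // key absent: count() is 0 or 1, never more
                h[ngram] = 1;           // inserting sets the count to 1
            else
                h[ngram] += 1;          // key present: bump the stored count
        }
    }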

2.2.3 VectorHash.h

  • class VectorHash: class needed by unordered_map to construct a hash table with vector keys. You just need to include this file into your project.

2.2.4 utilsToStudents.h

  • int drawIndex(vector<double> &probs): samples an index from the probabilities given in the input vector. Checks that the input probabilities approximately add up to 1. Use this for the random sentence generation problem.
  • size_t uiLevenshteinDistance(const std::string &s1, const std::string &s2): computes the edit distance between two strings. Use it for the spell checker problem.
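A minimal usage sketch of both helpers, assuming utilsToStudents.h declares them exactly as listed above:

    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <string>
    #include <vector>
    #include "utilsToStudents.h"

    int main()
    {
        srand((unsigned)time(NULL));      // seed once; see the note in Problem 3

        // drawIndex: the probabilities must add up to (approximately) 1.
        std::vector<double> probs;
        probs.push_back(0.5);
        probs.push_back(0.3);
        probs.push_back(0.2);
        int idx = drawIndex(probs);       // 0, 1, or 2, drawn with those probabilities

        // uiLevenshteinDistance: "lke" -> "like" needs one insertion.
        size_t d = uiLevenshteinDistance("lke", "like");

        printf("drew index %d, distance %u\n", idx, (unsigned)d);
        return 0;
    }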

2.3 Samples of Input and Output

Folder InputOutput has sample input and output for this assignment.

2.4 Texts

Folder Texts contains text files and language files for this assignment.

Problem 1 (10 %)

This problem investigates the word distribution of the English language and whether it is true that knowing 100 words of English is enough to read about 50% of a text. Read string tokens without EOS markers.

(a) Write a program P1 which takes as command line arguments the name of the text file and a value of k. Your program should compute the counts of words in the input file and output the k most frequent words together with their counts, one per line, word and count separated by a comma. Your program should also compute what percentage of the words in the input file are among the k most frequent words. For example, to output the 4 most frequent words from file text.txt, your program is invoked as

P1 text.txt 4

The output format should be:
and, 4
all, 2
the, 2
their, 2
33.3333 %
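
A sketch of the core of P1 under these conventions. The read_tokens signature is taken from section 2.2.1; std::map (a balanced tree) is used for brevity, and ties among equally frequent words may print in a different order than the sample above:

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <map>
    #include <string>
    #include <vector>
    #include "fileRead.h"

    // P1 <textfile> <k>
    int main(int argc, char * argv[])
    {
        std::vector<std::string> tokens;
        read_tokens(argv[1], tokens, false);          // no EOS markers for this problem
        int k = atoi(argv[2]);

        std::map<std::string, int> counts;            // word -> count
        for (size_t i = 0; i < tokens.size(); ++i)
            counts[tokens[i]] += 1;

        std::vector<std::pair<int, std::string> > byCount;
        std::map<std::string, int>::iterator it;
        for (it = counts.begin(); it != counts.end(); ++it)
            byCount.push_back(std::make_pair(it->second, it->first));
        std::sort(byCount.rbegin(), byCount.rend());  // descending by count

        int covered = 0;                              // occurrences of the top k words
        for (int i = 0; i < k && i < (int)byCount.size(); ++i) {
            printf("%s, %d\n", byCount[i].second.c_str(), byCount[i].first);
            covered += byCount[i].first;
        }
        printf("%g %%\n", 100.0 * covered / tokens.size());
        return 0;
    }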

(b) How many words do you need to know to read about 50% of DostoevskyKaramazov.txt and DrSeuss.txt? In other words, what should k be set to for the most frequent words parameter so that the output of the above program is around 50% for these two texts?

Problem 2 (10 %)

This problem investigates language sparseness. Use the word language model. Do not read EOS markers for this problem.

(a) Write a program P2 that takes as command line arguments the names of two text files, the size n for the nGram, and a last parameter that indicates whether to print out common nGrams or not. The program should count and print to the standard output the percentage of nGrams in the second text file that do not occur in the first text file. If the last input parameter is 1, the program should print out the common nGrams between the two files, one per line. If the last parameter is 0, your program should not print out common nGrams.
For example, if we want to count 5-Grams without printing, the program should be executed
with:
P2 text1.txt text2.txt 5 0
The output format should be:
65.
If we want to count 6-Grams with the printing option, the program should be executed with:
P2 text1.txt text2.txt 6 1
The output format should be:
he thought much better of it
I can play piano in the
75.
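
One possible shape for the core of P2, as a sketch. It measures the percentage over nGram occurrences in the second file; h1 is the table of nGram counts built from the first file (as in the section 2.2.2 sketch). Note that a common nGram occurring several times in file 2 will print several times unless you deduplicate:

    #include <cstdio>
    #include <string>
    #include <vector>
    #include <unordered_map>
    #include "VectorHash.h"

    typedef std::vector<std::string> Ngram;
    typedef std::unordered_map<Ngram, int, VectorHash> Table;

    // tokens2: tokens of the second file; h1: nGram counts from the first file.
    double percentUnseen(const std::vector<std::string> & tokens2, size_t n,
                         const Table & h1, bool printCommon)
    {
        double unseen = 0, total = 0;
        for (size_t i = 0; i + n <= tokens2.size(); ++i) {
            Ngram g(tokens2.begin() + i, tokens2.begin() + i + n);
            total += 1;
            if (h1.count(g) == 0) {
                unseen += 1;                   // nGram never occurs in file 1
            } else if (printCommon) {          // print common nGrams, one per line
                for (size_t j = 0; j < n; ++j)
                    printf("%s%s", g[j].c_str(), j + 1 < n ? " " : "\n");
            }
        }
        return 100.0 * unseen / total;
    }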

(b) Take two parts of the same novel, DostoevskyPart1.txt (as the first input file) and DostoevskyPart2.txt (as the second input file). Use your program to compute and write down the percentages of zero count nGrams for n = 1, 2, 3, ..., 6. What is the value of n that gives no common nGrams between the two files? What is (are) the largest common nGram(s) for these files?

(c) Repeat part (b) for different writers, Dickens.txt (as first) and KafkaTrial.txt (as second).

(d) Repeat part (b) for the opposite writers, MarxEngelsManifest.txt (as first) and SmithWealthNations.txt (as second).

(e) Discuss the differences in results between parts (b), (c), and (d).

Problem 3 (15 %)

This problem is about random text generation according to the ML language model. Use the string model and read EOS markers.

Random Sentence Generation

Let vocabulary V be all unique words in the input text file. Let m denote the size of V. Let the vocabulary words be indexed with integers in the range from 0 to m-1. Thus, our vocabulary V = {v_0, v_1, ..., v_{m-1}}. You will generate a sequence of words w_1, w_2, ..., stopping sentence generation when the EOS marker is generated.

Case 1: If n = 1, then each word in a sentence is generated independently (no context). Compute P(v) for all words v in V, according to the ML unigram model. Store the computed probabilities in a vector<double> probs, where probs[i] = P(v_i), for i in {0, 1, ..., m-1}. To generate the first word, use the provided function int drawIndex(vector<double> &probs). The output of drawIndex() is the index of the word chosen. For example, if the output is 10, this means that v_10 is chosen, and you set the first word in the sentence, w_1, to v_10. Generate all other words in the same way, stopping when EOS is generated.

Case 2: If n > 1, then there is context. For the first word w_1, its context (previous word) is the EOS marker. Use ML to estimate P(v | EOS) for all v in V. Store these probabilities in the vector probs and generate the first word w_1 using drawIndex(probs). To generate the second word, use ML to estimate P(v | w_1) for all v in V, store these probabilities in vector probs, and generate the second word w_2 using drawIndex(probs). Continue this procedure until you generate the EOS marker.

Note that if n > 2, then as more context becomes available, you should use it. For example, if n = 3, then to generate the third (and fourth, and so on) word, you should use the two previously generated words for context. To be precise, to generate the kth word in the sentence, w_k, sample from P(v | w_{k-n+1}, ..., w_{k-1}), where w_{k-n+1}, ..., w_{k-1} are the n-1 previously generated words.

Make sure to initialize the random seed with something like srand(time(NULL)).
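
For Case 2, a single sampling step might look like the following sketch (word model; vocab lists all unique words plus EOS, and the nextWord name and two-table layout are my own choices, not part of the provided code):

    #include <string>
    #include <vector>
    #include <unordered_map>
    #include "VectorHash.h"
    #include "utilsToStudents.h"

    typedef std::vector<std::string> Ngram;
    typedef std::unordered_map<Ngram, int, VectorHash> Table;

    // One sampling step for n > 1: build the ML distribution P(v | context) and
    // draw from it. context holds the n-1 most recent words (padded with EOS at
    // the sentence start); hN has n-gram counts, hC has (n-1)-gram counts.
    std::string nextWord(const std::vector<std::string> & vocab,
                         const Ngram & context, Table & hN, Table & hC)
    {
        std::vector<double> probs(vocab.size(), 0.0);
        // Any context we generate was seen in training, so its count is nonzero.
        double denom = (double)hC[context];
        for (size_t i = 0; i < vocab.size(); ++i) {
            Ngram g = context;
            g.push_back(vocab[i]);
            if (hN.count(g))                   // test first so [] does not insert zeros
                probs[i] = hN[g] / denom;
        }
        return vocab[drawIndex(probs)];        // provided sampling routine
    }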
(a) Write a program P3 which takes as command line arguments the name of a text file and the size n of the nGram model. Your program should construct an ML (maximum likelihood) language model from the text file, and generate and print out a random sentence according to the procedure described in class and summarized above. For example, to generate a sentence using a trigram model learned from file text.txt, the program should be invoked with

P3 text.txt 3

(b) Run your program on KafkaTrial.txt with n = 1, 2, 3, 4, 6. Discuss your results, paying particular attention to the cases of n = 1 and n = 6. You might want to look inside KafkaTrial.txt to interpret your results for n = 6.

(c) Set n = 3 and generate a random sentence from MarxEngelsManifest.txt. Compare it with the results generated from KafkaTrial.txt.

(d) Just for fun: submit the funniest sentence your program generated with either n = 2 or n = 3 from any text file you wish. I will run a poll to select the funniest sentence.

Problem 4 (10 %)

This problem is about the Add-Delta language model. Use the word model and do not read EOS markers.

(a) Write a program

P4 textModel.txt sentence.txt n delta

that builds an Add-Delta model for nGrams of size n with parameter delta from the text in file textModel.txt. The program should read the sentence in the second file, sentence.txt, and output its log probability to the standard output.

For example, to model language from file textModel.txt and estimate the log probability of the sentence in file sentences.txt with a 5-Gram Add-Delta model with delta = 0.1, use:

P4 textModel.txt sentences.txt 5 0.1

The output should be formatted as

-55.

Set the vocabulary size to the number of unique words in file textModel.txt plus one. We add one to account for all the words that occur in file sentence.txt but do not occur in file textModel.txt. Intuitively, this maps all words that are not in our model file to the same word UNKNOWN. A sketch of the estimator follows the notes below.

Implementation notes:
  • Use the double data type for probabilities to avoid underflows.
  • To avoid integer overflows, use double instead of int.
  • Be careful if your count variables are integers. With integer division, a count of 2 divided by a count of 5 is 0, but the correct answer is 0.4. Cast integers to double before division.
  • The output of your program is a log probability (to avoid underflow), therefore your output should be a negative number, since log(x) <= 0 for x <= 1.
  • Make sure that your program works for the special case of unigrams (n = 1).
  • If delta = 0, then the Add-Delta model is equivalent to the ML model and some sentences will have probability 0. In this case, your program should not crash but output the most negative number in double precision, which is pre-defined in the compiler as -DBL_MAX.
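
For reference, the Add-Delta estimate described here is P(w_n | w_1 ... w_{n-1}) = (C(w_1 ... w_n) + delta) / (C(w_1 ... w_{n-1}) + delta * V), with the total token count replacing the context count when n = 1. A minimal sketch (the helper name and table layout are mine, not part of the provided code):

    #include <string>
    #include <vector>
    #include <unordered_map>
    #include "VectorHash.h"

    typedef std::vector<std::string> Ngram;
    typedef std::unordered_map<Ngram, int, VectorHash> Table;

    // Add-Delta estimate of P(last word of g | the first n-1 words of g).
    // hN: n-gram counts, hC: (n-1)-gram counts, both from textModel.txt;
    // total: number of tokens in textModel.txt (denominator when n = 1);
    // vocabSize: unique words in textModel.txt plus one (for UNKNOWN).
    double addDeltaProb(const Ngram & g, Table & hN, Table & hC,
                        double total, double vocabSize, double delta)
    {
        double num = (hN.count(g) ? hN[g] : 0) + delta;
        double den;
        if (g.size() == 1) {
            den = total + delta * vocabSize;       // unigram special case
        } else {
            Ngram ctx(g.begin(), g.end() - 1);
            den = (hC.count(ctx) ? hC[ctx] : 0) + delta * vocabSize;
        }
        return num / den;                          // can be 0 when delta == 0
    }

Sum log() of this quantity over all nGrams of the sentence; if any factor is 0 (possible when delta = 0), print -DBL_MAX as the notes require.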

(b) Run your program and report its output for the following cases:

  • P4 KafkaTrial.txt testFile.txt 1 1
  • P4 KafkaTrial.txt testFile.txt 2 1
  • P4 KafkaTrial.txt testFile.txt 2 0.
  • P4 KafkaTrial.txt testFile.txt 3 0.

Problem 5 (15%)

This problem is about Good-Turing language model. Use word model and do not readEOS
markers.
(a) Write a program
P5 textModel.txt sentence.txt n threshold
that builds Good-Turing model from text in filetextModel.txtforn-grams with parameter
threshold. The program should read the sentence in the second filesentence.txtand output
its log probability to the standard output. Recall that thethresholdfor Good-Turing is used
as follows. If an nGram has rater < threshold, use Good-Turing estimate of probability.
Ifr  thresholduse ML estimate of probabilities.
For example, to model language from filetextModel.txt, estimate log probability of sentence
in filesentence.txt, build a4-GramGood-Turing model withthreshold = 6, use:
P5 textModel.txt sentence.txt 4 6
The output should be formatted as
-55.
Just as in Problem 4, set vocabulary size to the number of unique words in filetextModel.txt
plus 1.
Implementaiton notes:
  • Do not forget to renormalize so that probabilities add up to 1.
  • If the user setsthresholdso high thatNr=0for somer < threshold, then some esti- mated GT probabilities will be 0. Before computing probabilities, loop overNrto check that they are not zero for allrthreshold. You have to do this separately for 1-grams, 2-grams, …, n-grams. If needed, reset threshold to a lower value so thatNr>0for allr < threshold.
  • Use GT probabilities for alln-Grams. Namely, if inputn= 3, you will need tri-grams, bi-grams, and unigrams to compute probability of a sentence. Use GT estimates for all of them.
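
A sketch of the adjusted count, using the standard Good-Turing form r* = (r + 1) * N_{r+1} / N_r below the threshold (the per-n N_r tables and the renormalization required by the first note are left out):

    #include <vector>

    // Good-Turing adjusted count r* for an nGram observed r times.
    // Nr[i] = number of distinct nGrams observed exactly i times
    // (Nr[0] = number of possible nGrams never observed).
    double gtCount(int r, const std::vector<double> & Nr, int threshold)
    {
        if (r >= threshold)
            return (double)r;                    // high rate: keep the ML count
        return (r + 1) * Nr[r + 1] / Nr[r];      // needs Nr[r] > 0; lower the
                                                 // threshold beforehand if not
    }

The GT probability of an nGram with rate r is then r*/N, where N is the total number of nGram occurrences, renormalized so that the probabilities add up to 1.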

(b) Run your program and report its output for the following cases:

  • P5 KafkaTrial.txt testFile.txt 1 1
  • P5 KafkaTrial.txt testFile.txt 2 5
  • P5 KafkaTrial.txt testFile.txt 3 5

Problem 6 (20 %)

In this problem we will use the Add-Delta language model to classify which human language (i.e. English, French, etc.) a given sentence is written in. Use the character based language model that reads all the characters, i.e. latin_only = false. Set the vocabulary size to 256.

Folder Languages contains training and test files for six languages. Each language file has a name corresponding to the language. Training files have names ending in 1.txt (english1.txt, danish1.txt, etc.), and test files have names ending in 2.txt (english2.txt, danish2.txt, etc.). Assume that all the text files are stored in the same directory where the program is run, so you do not have to specify their location.

Train the language models, separately for each language, on the training files, i.e. those ending in 1. Language classification is performed as follows. Given a sentence, compute its probability under the French, English, etc. language models and classify the sentence with the language giving the maximum probability. For this problem, a sentence is a sequence of characters of fixed length senLength, given as an input parameter. You need to parse an input file into consecutive chunks of characters of length senLength, and classify each chunk. If the last chunk is of length less than senLength, it should be omitted from classification.

Your program should output the total error rate (in percent) and the confusion matrix. The total percentage error rate over all languages is the overall number of misclassified sentences in all language files divided by the overall number of sentences in all language files, multiplied by 100 to express it as a percentage.

The confusion matrix is a 6 by 6 array, where C(i,j) is the number of sentences of language i that were classified as language j. That is, the diagonal holds the correct classifications, and the off-diagonal entries the wrong classifications. A sketch of the classification loop is given below.
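
Roughly, per test file (logProb is a hypothetical stand-in for your character-level Add-Delta scorer, built as in Problem 4; it is not part of the provided code):

    #include <vector>

    // Hypothetical: Add-Delta log probability of a chunk under language model lang.
    double logProb(const std::vector<char> & chunk, int lang);

    // Classify every full-length chunk of one test file whose true language is
    // trueLang, updating the confusion matrix C (rows: true, columns: predicted).
    void classifyFile(const std::vector<char> & text, size_t senLength,
                      int trueLang, int numLang, std::vector<std::vector<int> > & C)
    {
        // The loop condition drops a final chunk shorter than senLength.
        for (size_t s = 0; s + senLength <= text.size(); s += senLength) {
            std::vector<char> chunk(text.begin() + s, text.begin() + s + senLength);
            int best = 0;
            double bestLp = logProb(chunk, 0);
            for (int lang = 1; lang < numLang; ++lang) {
                double lp = logProb(chunk, lang);     // compare in log space
                if (lp > bestLp) { bestLp = lp; best = lang; }
            }
            C[trueLang][best] += 1;
        }
    }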
(a) Write program P6 for language identification. Your program is invoked with

P6 n delta senLength

where n is the size of the nGram, delta is the parameter for Add-Delta, and senLength is the sentence length.

The output of your program should be formatted as:
16.
134 3 0 0 0 1
24 341 1 0 0 0
38 2 85 0 0 0
23 1 0 213 0 0
26 9 1 3 221 0
77 2 0 0 0 50
where the first line is the percentage error rate, and the next six lines are the confusion matrix.

(b) Run your program and report the error rate on the following cases:

  • P6 1 0 50
  • P6 2 0 50
  • P6 3 0 50
(c) Run your program and report the error rate on the following cases:
  • P6 1 0.05 50
  • P6 2 0.05 50
  • P6 3 0.05 50

(d) Run your program and report the error rate on the following cases:

  • P6 3 0.05 50
  • P6 3 0.005 50
  • P6 3 0.0005 50
(e) Compare and discuss the performance across parts (b), (c), and (d).
(f) Explore and discuss how classification performance changes with the sentence length by running your program on the following cases:
  • P6 2 0.05 10
  • P6 2 0.05 50
  • P6 2 0.05 100

Problem 7 (20 %)

In this problem you will develop a simple spell-checker.

(a) Write a spelling program P7 that is invoked with:

P7 textTrain.txt textCheck.txt dictionary.txt n t delta model

The input parameters are as follows: the name of the text file for model training, the name of the text file to check the spelling of, the name of the text file with the dictionary, n for the nGram model, the threshold t for Good-Turing, delta for Add-Delta smoothing, and the model to use. As before, if model = 0 use Good-Turing, and if model = 1 use Add-Delta.

Use the word language model without reading EOS markers to read textTrain.txt and dictionary.txt. File textCheck.txt will contain several sentences (one per line) to check the spelling of. Check each sentence separately. It is convenient to read textCheck.txt with EOS markers to separate it into sentences. Output the result of checking each sentence on a new line. For example, if textCheck.txt has the sentences:

I lke to eat cereal in the morning.
Pla nicely!

The output should be formatted as:

i like to eat cereal in the morning
play nicely

We will assume that there is only one misspelled word per sentence. For each word w in the sentence, find all candidate words in the dictionary with edit distance 1 from w, using the function uiLevenshteinDistance(). Let C(w) be the set of such candidate words for w. We also include w in C(w). A sketch of the candidate search is given after this description.

As discussed in class, we will consider all possible sentences where one word w is replaced with a word from C(w). This means you should iterate over all words in the sentence, and for each word w iterate over all words in C(w), replacing w with a word from C(w) to generate the next possible sentence.

You will implement a simpler version than what was discussed in class. Instead of implementing the noisy channel model, you just select the sentence with the highest probability according to the language model. Print the best sentence to the standard output. Note that the highest probability sentence can be the unchanged input sentence.
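
The candidate search itself is short; a sketch (dict is the word list read from dictionary.txt, and scoring the generated sentences with your Problem 4/5 models is not shown):

    #include <string>
    #include <vector>
    #include "utilsToStudents.h"

    // C(w): all dictionary words at edit distance 1 from w, plus w itself.
    std::vector<std::string> candidates(const std::string & w,
                                        const std::vector<std::string> & dict)
    {
        std::vector<std::string> c;
        c.push_back(w);                        // the unchanged word stays a candidate
        for (size_t i = 0; i < dict.size(); ++i)
            if (uiLevenshteinDistance(w, dict[i]) == 1)
                c.push_back(dict[i]);
        return c;
    }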

(b) Report and discuss the results of your program for the following cases:

  • P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 1 1
  • P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 0.1 1
  • P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 0.01 1
(c) Report and discuss the results of your program for the following cases:
  • P7 trainHuge.txt textCheck.txt dictionary.txt 1 3 0.01 1
  • P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 0.01 1
  • P7 trainHuge.txt textCheck.txt dictionary.txt 3 3 0.01 1

(d) Report and discuss the results of your program for the following cases:

  • P7 trainHuge.txt textCheck.txt dictionary.txt 1 3 1 0
  • P7 trainHuge.txt textCheck.txt dictionary.txt 2 3 1 0
  • P7 trainHuge.txt textCheck.txt dictionary.txt 3 3 1 0

Extra Credit Problem 8 (20 %)

Implement a program P8 that improves your spelling correction program from Problem 7 in any way. You can build a better language model, implement the noisy channel model discussed in class, or implement a better edit distance between strings. Hand in your improved program, and report/discuss the cases where your new program does something better compared to the old program.
