csc401 – Neural Machine Translation
Computer Science 401, 11 February 2023, St. George Campus, University of Toronto
1 Overview
1.1 Canadian Hansards
The main corpus for this assignment comes from the official records (Hansards) of the 36th Canadian Parliament, including debates from both the House of Commons and the Senate. This corpus is available at /u/cs401/A2/data/Hansard/ and has been split into Training/ and Testing/ directories. This data set consists of pairs of corresponding files (.e is the English equivalent of the French .f) in which every line is a sentence. Here, sentence alignment has already been performed for you. That is, the n-th sentence in one file corresponds to the n-th sentence in its corresponding file (e.g., line n in fubar.e is aligned with line n in fubar.f). Note that this data only consists of sentence pairs; many-to-one, many-to-many, and one-to-many alignments are not included.
1.2 Seq2seq
We will be implementing a simple seq2seq model, without attention, with single-headed attention, and with multi-headed attention, based largely on the course material. You will train the models with teacher-forcing and decode using beam search. We will write it in PyTorch version 1.13 (https://pytorch.org/docs/1.13/) and Python version 3.10, which are the versions installed on the teach.cs servers. For those unfamiliar with PyTorch, we suggest you first read the PyTorch tutorial (https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).
1.3 Tensors and batches
PyTorch, like many deep learning frameworks, operates with tensors, which are multi-dimensional arrays. When you work in PyTorch, you will rarely if ever work with just one bitext pair at a time. You'll instead be working with multiple sequences in one tensor, organized along one dimension of the tensor, the batch dimension. This means that a pair of source and target tensors F and E actually corresponds to multiple sequences F = (F^{(b)}_{1:S^{(b)}})_{b in [1,B]}, E = (E^{(b)}_{1:T^{(b)}})_{b in [1,B]}. We work with batches instead of individual sequences because: a) backpropagating the average gradient over a batch tends to converge faster than single samples, and b) sample computations can be performed in parallel. For example, if we want to multiply source sequences F^{(b)} and F^{(b+1)} with an embedding matrix W, we can tell one CPU core to compute the result for F^{(b)} and another for F^{(b+1)}, halving the overall time it would take to multiply them independently. Learning to work with tensors can be difficult at first, but is integral to efficient computation. We suggest you read more about it in the NumPy docs (https://numpy.org/doc/stable/user/basics.broadcasting.html), which PyTorch borrows for tensors.
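As a toy illustration of batched computation (the shapes and names below are invented for the example, not taken from the assignment code), a single tensor operation can process every sequence in a batch at once:

import torch

B, S, V, d = 4, 7, 100, 16           # batch size, max source length, vocab size, embedding size
F = torch.randint(0, V, (S, B))      # B toy source sequences of length S, one sequence per column
W = torch.randn(V, d)                # a toy "embedding matrix"

# A single indexing operation embeds every token of every sequence at once,
# producing an (S, B, d) tensor; no per-sequence loop is needed.
x = W[F]
print(x.shape)                       # torch.Size([7, 4, 16])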
Copyright 2023 University of Toronto. All rights reserved.
1.4 Differences from the lectures
There are three changes to the seq2seq architectures that we make for this assignment. First, instead of the scaled dot-product attention score, score(u, v) = |u|^{-1/2} ⟨u, v⟩, we'll use the cosine similarity between vectors u and v:

score(u, v) = ⟨u, v⟩ / max(‖u‖_2 ‖v‖_2, ε)

where 0 < ε ≪ 1 ensures score(u, v) = 0 when u = 0 or v = 0. The second change relates to how we calculate the first hidden state for the decoder when we don't use attention. Recall that a bidirectional recurrent architecture processes its input in both directions separately: the forward direction processes (x_1, x_2, ..., x_S) whereas the backward direction processes (x_S, x_{S-1}, ..., x_1). The bidirectional hidden state concatenates the forward and backward hidden states for the same time step: h_t = [h_t^{forward}; h_t^{backward}]. This implies h_S has processed all the input in the forward direction, but only one input in the backward direction (and vice versa for h_1). To ensure the decoder gets access to all input from both directions, you should initialize the first decoder state as
h̃_1 = [h_S^{forward}; h_1^{backward}]
When you use attention, set h̃_1 = 0. Our final change isn't so much a change as a clarification. In multi-headed attention, recall we have N heads such that h̃_t^{(n)} = W̃^{(n)} h̃_t and h_s^{(n)} = W^{(n)} h_s. W^{(n)} and W̃^{(n)} need not be square matrices; the size of h̃_t^{(n)} need not be the size of h̃_t, nor the size of h_s^{(n)} the size of h_s. For this assignment, we will be setting |h̃_t^{(n)}| = |h̃_t| / N and |h_s^{(n)}| = |h_s| / N (you may assume N evenly divides the hidden state size).
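For concreteness, here is a minimal sketch of the cosine-similarity score on toy vectors (the function name, shapes, and eps value are assumptions for illustration only; the starter code's doc strings specify the actual interface):

import torch

def cosine_score(u, v, eps=1e-8):
    # score(u, v) = <u, v> / max(||u||_2 * ||v||_2, eps)
    # eps keeps the score at 0 when either vector is all zeros.
    denom = torch.clamp(u.norm(dim=-1) * v.norm(dim=-1), min=eps)
    return (u * v).sum(dim=-1) / denom

u = torch.tensor([1.0, 0.0])
v = torch.tensor([1.0, 1.0])
print(cosine_score(u, v))                # tensor(0.7071)
print(cosine_score(u, torch.zeros(2)))   # tensor(0.), thanks to eps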
2 Your tasks
2.1 Setup
You are expected to run your solutions on teach.cs. Download the starter code from MarkUs: you can download the starter files by clicking the Download button in the Starter Files section of the Assignment 2 page. The Hansards parallel text data are located in the /h/u1/cs401/A2/data/ directory on teach.cs. You don't need to copy the text data to your own teach.cs directory. Download the starter code from MarkUs into your working directory to get started. You should have 8 files: a2_abcs.py, a2_bleu_score.py, a2_encoder_decoder.py, a2_dataloader.py, a2_run.py, a2_training_and_testing.py, test_a2_bleu_score.py, and test_a2_encoder_decoder.py. You should take advantage of unit testing (see appendix A.5) and debug on a small version of the network (see appendix A.6). Feel free to use your favorite debugger or add breakpoint() at any location. The starter code also includes an archive named reports.zip. It contains the LaTeX templates for the reports. You can upload reports.zip to Overleaf to start editing them.
2.2 Calculating BLEU scores
Modify a2_bleu_score.py to be able to calculate BLEU scores on single reference and candidate strings. We will be using the definition of BLEU scores from the lecture slides:
BLEU = BP_C × (p_1 × p_2 × ... × p_n)^{(1/n)}
To do this, you will need to implement the functions grouper(...), n_gram_precision(...), brevity_penalty(...), and BLEU_score(...). Make sure to carefully follow the doc strings of each function. Do not re-implement functionality that is clearly performed by some other function.
Your functions will operate on sequences (e.g., lists) of tokens. These tokens could be the words themselves (strings) or an integer ID corresponding to the words. Your code should be agnostic to the type of token used, though you can assume that both the reference and candidate sequences will use tokens of the same type.
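To illustrate the formula only, here is a self-contained toy computation of BLEU over token lists. This sketch is not the required grouper/n_gram_precision/brevity_penalty/BLEU_score decomposition, and its simplistic n-gram matching is for demonstration; follow the doc strings for the real functions.

from math import exp

def toy_bleu(reference, candidate, n=2):
    """Toy BLEU on token lists, illustrating BP * (p_1 * ... * p_n)^(1/n)."""
    def ngrams(seq, k):
        return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

    precisions = []
    for k in range(1, n + 1):
        cand, ref = ngrams(candidate, k), ngrams(reference, k)
        matches = sum(1 for g in cand if g in ref)
        precisions.append(matches / max(len(cand), 1))

    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = len(reference) / max(len(candidate), 1)
    bp = 1.0 if brevity < 1 else exp(1 - brevity)

    prod = 1.0
    for p in precisions:
        prod *= p
    return bp * prod ** (1 / n)

print(toy_bleu("the cat sat on the mat".split(),
               "the cat is on the mat".split()))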
2.3 Building the encoder/decoder
You are expected to fill out a number of methods in a2_encoder_decoder.py. These methods belong to sub-classes of the abstract base classes in a2_abcs.py. The latter defines the abstract classes EncoderBase, DecoderBase, and EncoderDecoderBase, which implement much of the boilerplate code necessary to get a seq2seq model up and running. Though you are welcome to read and understand this code, it is not necessary to do so for this assignment. You will, however, need to read the doc strings in a2_abcs.py to understand what you're supposed to fill out in a2_encoder_decoder.py. Do not modify any of the code in a2_abcs.py. A high-level description of the requirements for a2_encoder_decoder.py follows here. More details can be found in the doc strings in a2_{abcs,encoder_decoder}.py.
2.3.1 Encoder
a2_encoder_decoder.Encoder will be the concrete implementation of all encoders you will use. The encoder is always a multi-layer neural network with a bidirectional recurrent architecture. The encoder gets a batch of source sequences as input and outputs the corresponding sequence of hidden states from the last recurrent layer.
Encoder.forward_pass defines the structure of the encoder. For every model in PyTorch, the forward function defines how the model will run, and the forward function of every encoder or decoder will first clean up your input data and then call forward_pass to actually define the model structure. You need to implement the forward_pass function that defines how your encoder will run.
Encoder.init_submodules(...) should be filled out to initialize a word embedding layer and a recurrent network architecture.
Encoder.get_all_rnn_inputs(...) accepts a batch of source sequences F^{(b)}_{1:S^{(b)}} and lengths S^{(b)} and outputs word embeddings for the sequences x^{(b)}_{1:S^{(b)}}.
Encoder.get_all_hidden_states(...) converts the word embeddings x^{(b)}_{1:S^{(b)}} into hidden states for the last layer of the RNN, h^{(b)}_{1:S^{(b)}} (note we're using (b) here for the batch index, not the layer index).
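For orientation only, a stripped-down bidirectional encoder built from standard PyTorch modules might look roughly like the sketch below. The class name and hyperparameters are invented, and your real implementation must follow the a2_abcs.py doc strings (configurable cell type, handling of sequence lengths, and so on).

import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical, simplified bidirectional encoder; not the assignment's class."""
    def __init__(self, vocab_size, pad_id, word_embedding_size=32, hidden_size=64):
        super().__init__()
        # padding_idx keeps the pad token's embedding fixed at zero (no gradient).
        self.embedding = nn.Embedding(vocab_size, word_embedding_size, padding_idx=pad_id)
        self.rnn = nn.GRU(word_embedding_size, hidden_size, bidirectional=True)

    def forward(self, F):               # F: (S, B) token ids
        x = self.embedding(F)           # (S, B, word_embedding_size)
        h, _ = self.rnn(x)              # (S, B, 2 * hidden_size), both directions concatenated
        # The real encoder must also use the sequence lengths so the RNN skips padding
        # (see section 2.3.6).
        return h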
2.3.2 Decoder without attention
a2_encoder_decoder.DecoderWithoutAttention will be the concrete implementation of the decoders that do not use attention (so-called transducer models). Method implementations should thus be tailored to not use attention. In order to feed the previous output into the decoder as input, the decoder can only process one step of input at a time and produce one output. Thus DecoderWithoutAttention is designed to process one slice of input at a time (though it will still be a batch of input for that given slice of time). The goal, then, is to take some target slice from the previous time step, E^{(b)}_{t-1}, and produce an un-normalized log-probability distribution over target words at time step t, called logits^{(b)}_t. Logits can be converted to a categorical distribution using a softmax:

P(y^{(b)}_t = i | ...) = exp(logits^{(b)}_{t,i}) / Σ_j exp(logits^{(b)}_{t,j})

DecoderWithoutAttention.forward_pass defines the structure of the network. Similar to what you did for your encoder, you need to assemble the model here.
DecoderWithoutAttention.init_submodules(...) should be filled out to initialize a word embedding layer, a recurrent cell, and a feed-forward layer to convert the hidden state to logits.
DecoderWithoutAttention.get_first_hidden_state(...) produces h̃^{(b)}_1 given the encoder hidden states h^{(b)}_{1:S^{(b)}}.
DecoderWithoutAttention.get_current_rnn_input(...) takes the previous target E^{(b)}_{t-1} (or the previous output y^{(b)}_{t-1} in testing) and outputs the word embedding x̃^{(b)}_t for the current step.
DecoderWithoutAttention.get_current_hidden_state(...) takes x̃^{(b)}_t and h̃^{(b)}_{t-1} and produces the current decoder hidden state h̃^{(b)}_t.
DecoderWithoutAttention.get_current_logits(...) takes h̃^{(b)}_t and produces logits^{(b)}_t.
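As a small illustration of the logits-to-probabilities relationship (toy numbers, not assignment code):

import torch

logits_t = torch.tensor([[2.0, 0.5, -1.0]])      # (B=1, vocab size 3), un-normalized
probs = torch.softmax(logits_t, dim=-1)          # normalized P(y_t = i | ...)
log_probs = torch.log_softmax(logits_t, dim=-1)  # what beam search accumulates
print(probs.sum(dim=-1))                         # tensor([1.])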
2.3.3 Decoder with (single-headed) attention
a2_encoder_decoder.DecoderWithAttention will be the concrete implementation of the decoders that use single-headed attention. It inherits from DecoderWithoutAttention to avoid re-implementing get_current_hidden_state(...) and get_current_logits(...). The remaining methods must be re-implemented, slightly modified for the attention context. Two new methods must be implemented for DecoderWithAttention.
DecoderWithAttention.get_attention_scores(...) takes in a decoder state h̃^{(b)}_t and all encoder hidden states h^{(b)}_{1:S^{(b)}} and produces attention scores for that decoder state but all encoder hidden states: a^{(b)}_{t,1:S^{(b)}}.
DecoderWithAttention.attend(...) takes in a decoder state h̃^{(b)}_t and all encoder hidden states h^{(b)}_{1:S^{(b)}} and produces the attention context vector c^{(b)}_t. Between get_attention_scores and attend, use get_attention_weights(...) to convert a^{(b)}_{t,1:S^{(b)}} to α^{(b)}_{t,1:S^{(b)}}; this conversion has been implemented for you.
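Conceptually, attend(...) computes a weighted sum of encoder states. A toy sketch on made-up shapes follows (the real shapes, masking of padded positions, and score function come from the doc strings and section 1.4):

import torch

S, B, H = 5, 3, 8
h = torch.randn(S, B, H)          # all encoder hidden states
a_t = torch.randn(S, B)           # toy attention scores, one per source position

alpha_t = torch.softmax(a_t, dim=0)             # weights over source positions
c_t = (alpha_t.unsqueeze(-1) * h).sum(dim=0)    # context vector, shape (B, H)
print(c_t.shape)                                # torch.Size([3, 8])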
2.3.4 Decoder with multi-head attention
a2_encoder_decoder.DecoderWithMultiHeadAttention implements a multi-headed variant of attention. It inherits from DecoderWithAttention. Two methods must be re-implemented for this variant.
DecoderWithMultiHeadAttention.init_submodules(...) should initialize new submodules for the matrices W, W̃, and Q.
DecoderWithMultiHeadAttention.attend(...) should split hidden states h̃^{(b)}_t into h̃^{(b,n)}_t and h^{(b)}_s into h^{(b,n)}_s, where b still indexes the batch number and n indexes the head. Then it should call super().attend(...) to do the attention, and combine the c^{(b,n)}_t of the N heads. We want you to do this without ever actually splitting any tensors! The key is to reshape the full hidden output into N chunks. If this is a little too tricky, try starting by writing the case where N = 1, i.e., when there is no need for splitting.
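The "no splitting" trick boils down to a reshape. A toy illustration with invented dimension names:

import torch

B, H, N = 3, 12, 4                 # batch, hidden size, number of heads (N divides H)
htilde_t = torch.randn(B, H)

# View the same storage as N heads of size H // N; no tensors are actually split.
heads = htilde_t.view(B, N, H // N)          # (B, N, H/N)

# Folding the head dimension into the batch dimension lets single-headed
# attention code run on all heads at once.
as_batch = heads.reshape(B * N, H // N)      # (B*N, H/N)
print(heads.shape, as_batch.shape)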
2.3.5 Putting it together: the Encoder/Decoder
a2_encoder_decoder.EncoderDecoder coordinates the encoder and decoder. Its behaviour depends on whether it is being used for training or testing. In training, it receives both F^{(b)}_{1:S^{(b)}} and E^{(b)}_{1:T^{(b)}} and outputs logits^{(b)}_{1:T^{(b)}}, un-normalized log-probabilities over y^{(b)}_{1:T^{(b)}}. In testing, it receives only F^{(b)}_{1:S^{(b)}} and outputs the K paths from beam search per batch element n: y^{(n,k)}_{1:T^{(n,k)}}.
EncoderDecoder.init_submodules(...) initializes the encoder and decoder.
EncoderDecoder.get_logits_for_teacher_forcing(...) provides you the encoder output h^{(b)}_{1:S^{(b)}} and the targets E^{(b)}_{1:T^{(b)}} and asks you to derive logits^{(b)}_{1:T^{(b)}} according to the MLE (teacher-forcing) objective.
EncoderDecoder.update_beam(...) asks you to handle one iteration of a simplified version of the beam search from the slides. While a proper beam search requires you to handle the set of finished paths f, update_beam doesn't need to. Letting (n, k) indicate the n-th batch element's k-th path:
∀ n, k, v:
    B̃^{(n,k,v)}_{t,0} ← h̃^{(n,k)}_{t+1}
    B̃^{(n,k,v)}_{t,1} ← [B^{(n,k)}_{t,1}, v]
    log P(B̃^{(n,k,v)}_t) ← log P(B^{(n,k)}_t) + log P(y_{t+1} = v | h̃^{(n,k)}_{t+1})
∀ n, k:
    B^{(n,k)}_{t+1} ← k-argmax over the candidates B̃^{(n,k',v)}_t of log P(B̃^{(n,k',v)}_t)
In short, extend the existing paths, then prune back to the beam width. A greedy update function, update_greedy, is provided for you in a2_abcs.py. You can use the option --greedy to switch to the greedy update. This option might be handy when you want to test the correctness of the rest of your assignment.
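To make the extend-then-prune idea concrete, here is a toy sketch for a single batch element using torch.topk. This is not the update_beam(...) signature; K, V, and the probabilities are made up.

import torch

K, V = 2, 5                                              # beam width, vocabulary size
logpb_tm1 = torch.log(torch.tensor([0.6, 0.4]))          # log-prob of each current path
logpy_t = torch.log_softmax(torch.randn(K, V), dim=-1)   # log P(y_t = v | path k)

# Extend: every (path, extension) pair gets a score, giving K * V candidates.
extended = (logpb_tm1.unsqueeze(-1) + logpy_t).flatten()

# Prune: keep only the K best candidates as the new beam.
logpb_t, flat_idx = extended.topk(K)
paths = flat_idx // V        # which old path each survivor extends
tokens = flat_idx % V        # which token it was extended with
print(paths, tokens, logpb_t)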
2.3.6 Padding
An important detail when dealing with sequences of input and output is how to handle sequence lengths. Individual sequences within a batch, F^{(b)} and E^{(b)}, can have unequal lengths (S^{(b)} ≠ S^{(b+1)}, T^{(b)} ≠ T^{(b+1)}), but we pad the shorter sequences to the right to match the longest sequence. This allows us to parallelize across multiple sequences, but it is important that whatever the network learns (i.e., the error signal) is not impacted by padding. We've mostly handled this for you in the functions we've implemented, with three exceptions: first, no word embedding should be learned for padding (which you'll have to guarantee); second, you'll have to ensure the bidirectional encoder doesn't process the padding; and third, the first hidden state of the decoder (without attention) should not be based on padded hidden states. You are given plenty of warning in the starter code when these three cases ought to be considered. The decoder uses the end-of-sequence symbol as padding, which is entirely handled in a2_training_and_testing.py.
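Two standard PyTorch tools relevant to these cases, shown on toy data (illustrative only; the exact usage in your solution should follow the warnings and doc strings in the starter code):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

pad_id = 0
emb = nn.Embedding(10, 4, padding_idx=pad_id)   # no embedding is learned for the pad token

F = torch.tensor([[5, 7], [6, 0], [2, 0]])      # (S=3, B=2); second sequence padded to length 3
F_lens = torch.tensor([3, 1])                   # true lengths before padding

x = emb(F)                                      # (3, 2, 4)
packed = pack_padded_sequence(x, F_lens, enforce_sorted=False)

rnn = nn.GRU(4, 6, bidirectional=True)
h_packed, _ = rnn(packed)                       # the RNN never sees the padded steps
h, _ = pad_packed_sequence(h_packed, total_length=3)
print(h.shape)                                  # torch.Size([3, 2, 12])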
2.4 The training and testing loops
After following the PyTorch tutorial, you should be familiar with how models are trained and tested in PyTorch. You are expected to implement the training and testing loops in a2_training_and_testing.py.
In a2_training_and_testing.compute_batch_total_bleu(...), you are given reference (from the dataset) and candidate (from the model) batches in the target language and asked to compute the total BLEU score over the batch. You will have to convert the PyTorch tensors in order to use a2_bleu_score.BLEU_score(...).
In a2_training_and_testing.compute_average_bleu_over_dataset(...), you are to follow the instructions in the doc string and use compute_batch_total_bleu(...) to determine the average BLEU score over a data set.
In a2_training_and_testing.train_for_epoch(...), once again follow the doc strings to iterate through a training data set and update model parameters using gradient descent.
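The inner update in train_for_epoch(...) follows the standard PyTorch pattern of forward pass, loss, backward pass, and optimizer step. A generic sketch is below; the function and argument names are placeholders, and the exact alignment of logits against targets (and the handling of padding) is specified in the doc strings.

import torch

def toy_training_step(model, optimizer, F, F_lens, E, pad_id):
    """Placeholder training step: forward, loss, backward, step."""
    optimizer.zero_grad()
    logits = model(F, F_lens, E)                  # teacher-forcing forward pass (assumed call)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1),                     # (T*B, vocab)
        E.flatten(),                              # (T*B,) target ids
        ignore_index=pad_id)                      # do not learn from padded positions
    loss.backward()
    optimizer.step()
    return loss.item()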
2.4.1 Visualizing and logging training
We will use Weights and Biases (W&B) and, optionally, Tensorboard to visualize and log model training. Go to the W&B site (https://wandb.ai/) and sign up, then create a new project space named csc401-w23-a2. We refer to your W&B username as $WB_USERNAME hereafter.
2.5 Training the models
Once you have completed the coding portion of the assignment, it is time to train your models. In order to do so in a reasonable amount of time, you'll have to train your models using a machine with a GPU. There are a few ways you can do this:
- At this point, you should be confident that your code is error-free. If not, please refer to appendix A.6 for how to debug your code with a smaller network. Ideally, you should only need to run the full task once.
- You can ssh to teach.cs and use srun to run your code on a GPU on the department's cluster. See more details below about how to use srun.
- A number of teaching labs in the Bahen building have GPUs (listed in https://www.teach.cs.toronto.edu/faq.html#ABOUT4), but you must log in at the physical machines to use them (as opposed to remote access).
- If you have access to your own GPU, you may run this code locally and report the results. However, any modifications you make to run the code locally must be reverted to work on teach before you submit!
Even on a GPU, the code can take upwards of 2 hours to complete in full. Be sure to plan accordingly!
You are going to interface with your models using the script a2_run.py. This script glues together the components you implemented previously. The only meaningful remaining code is in a2_dataloader.py, which converts the Hansard sentences into sequences of IDs. Suffice it to say that you do not need to know how either a2_run.py or a2_dataloader.py works, only how to use them (unless you are interested).
Run the following code block line-by-line from your working directory. In order, it:
- Builds maps between words and unique numerical identifiers for each language.
- Splits the training data into a portion to train on and a hold-out portion.
- Trains the encoder/decoder without attention and stores the model parameters.
- Trains the encoder/decoder with single-headed attention and stores the model parameters.
- Trains the encoder/decoder with multi-headed attention and stores the model parameters.
- Returns the average BLEU score of the encoder/decoder without attention on the test set.
- Returns the average BLEU score of the encoder/decoder with single-headed attention on the test set.
- Returns the average BLEU score of the encoder/decoder with multi-headed attention on the test set.
0. Prepare preprocess directory
mkdir data
1. Generate vocabularies
python3 a2_run.py vocab e data/english_vocab.txt
python3 a2_run.py vocab f data/french_vocab.txt
2. Split train and dev sets
python3 a2_run.py split
3. Train a model without attention
srun -p csc401 --gres gpu python3 a2_run.py train rnn_model.pt --cell-type rnn --viz-wandb $WB_USERNAME --device cuda
srun -p csc401 --gres gpu python3 a2_run.py train lstm_model.pt --cell-type lstm --viz-wandb $WB_USERNAME --device cuda
4. Train a model with attention
srun -p csc401 --gres gpu python3 a2_run.py train rnn_model_att.pt --with-attention --cell-type rnn --viz-wandb $WB_USERNAME --device cuda
srun -p csc401 --gres gpu python3 a2_run.py train lstm_model_att.pt --with-attention --cell-type lstm --viz-wandb $WB_USERNAME --device cuda
5. Train a model with multi-head attention
srun -p csc401 --gres gpu python3 a2_run.py train rnn_model_mhatt.pt --with-multihead-attention --cell-type rnn --viz-wandb $WB_USERNAME --device cuda
srun -p csc401 --gres gpu python3 a2_run.py train lstm_model_mhatt.pt --with-multihead-attention --cell-type lstm --viz-wandb $WB_USERNAME --device cuda
6. Test a trained model
srun -p csc401 --gres gpu python3 a2_run.py test {model_name}.pt --{attention-type} --device cuda
Steps 1 and 2 should not fail and need only be run once. Steps 3 onward depend on the correctness of your code.
The srun -p csc401 --gres gpu prefix is necessary to run on a GPU on teach. You do not need a GPU for the first two steps. You will not need srun -p csc401 --gres gpu when running steps 3-6 if you are running the training/testing locally. The srun prefix is only needed when running on the (remote) teach server, which uses SLURM (https://en.wikipedia.org/wiki/Slurm_Workload_Manager) to schedule processes on the department's cluster. Because all students in this class, or any class requiring GPUs, will be running their jobs on the cluster, please only run steps 3-6 after you have debugged your code. We discuss below (in appendix A.6) how you can train a smaller network that will only take a fraction of the time anyway.
In analysis.pdf (use the analysis.tex template provided), section 1 Training Results, provide the following:
- The printout after every epoch of the training loop for the models trained without attention, with single-headed attention, and with multi-headed attention (or you can provide the equivalent information as W&B screenshots). Clearly indicate which is which.
- The average BLEU score reported on the test set for each model. Again, clearly indicate which is which.
- A brief discussion of your findings. Was there a discrepancy between training and testing results? Why do you think that is? If one model did better than the others, why do you think that is?
2.6 Let's translate some sentences!
Ok, now we have the neural translation models. Let's actually use them to translate some sentences.
In EncoderDecoder.translate(...), you are given a raw input sentence. You need to tokenize the sentence, convert the tokens into ordinal IDs, feed the IDs into your encoder-decoder model, and, finally, convert the output of the model back into an actual sentence.
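In outline, translate(...) is a round trip through the vocabulary maps. A hypothetical sketch follows (word2id, id2word, decode_fn, and the special-token IDs are placeholder names, not the starter code's API):

def toy_translate(sentence, word2id, id2word, decode_fn, unk_id, eos_id):
    # 1. Tokenize and map words to ids, falling back to the unknown token.
    tokens = sentence.lower().split()
    ids = [word2id.get(tok, unk_id) for tok in tokens]

    # 2. Run the encoder-decoder (e.g., beam search) on the id sequence.
    out_ids = decode_fn(ids)

    # 3. Map output ids back to words, stopping at end-of-sequence.
    words = []
    for i in out_ids:
        if i == eos_id:
            break
        words.append(id2word.get(i, "<unk>"))
    return " ".join(words)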
You can load any one of your trained models with the following commands. Notice that you need to specify the decoder type.
python3 a2_run.py interact model_wo_att.pt
python3 a2_run.py interact model_w_att.pt --with-attention
python3 a2_run.py interact model_w_mhatt.pt --with-multihead-attention
An interactive Python prompt will start, and your trained encoder-decoder machine translation model will be loaded into the variable model. You can then use the model by interacting with it in the prompt as shown below.
Trained model from path YOUR_MODEL.pt loaded as the object model
Python 3.10.5 (main, Jun 29 2022, 16:51:27) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
model.translate("Nous sommes innovateurs et ouverts aux idees nouvelles.")
we welcome innovation and new ideas
>>> model.translate("Toronto est une ville du Canada.")
halifax is a national horse
Translate the following three sentences using your models (without attention, with attention, and with multi-head attention); see the note on accent marks below:
- Toronto est une ville du Canada.
- Les professeurs devraient bien traiter les assistants d'enseignement.
- Les etudiants de l'Universite de Toronto sont excellents.
In section 2 Translation Analysis of analysis.pdf, list all the translations. Then, describe the quality of those sentences. Can you observe any correlation with the models' BLEU scores? Include a brief discussion of your findings in analysis.pdf.
It is expected that your models may produce both highly accurate and poorly translated results. The grade you receive will not depend on the accuracy of the translations, but rather on the credibility and depth of your analysis of the models' behavior.
2.7 Bonus [up to 15 marks]
We will give bonus marks for innovative work going substantially beyond the minimal requirements. However, your overall mark for this assignment cannot exceed 100%. Submit your write-up in bonus.pdf.
You may decide to pursue any number of tasks of your own design related to this assignment, although you should consult with the instructor or the TA before embarking on such exploration. Certainly, the rest of the assignment takes higher priority. Some ideas:
- Perform substantial qualitative data analysis of the translation results. Can you recognise some interesting patterns in your machine translation problem? Conduct statistical analysis to verify your findings, and use those findings to discuss the connection between statistical modelling and language.
- Perform substantial data analysis of the error trends observed in each method you implement. This must go well beyond the basic discussion already included in the assignment.
- Explore the effects of using different attention mechanisms, and include attention visualizations of the different attention functions.
Note on accents: French accent marks (é, ç, à, ô, ...) are removed in the Canadian Hansards dataset and, therefore, you should remove them too. Use francaise, Universite, and etudiants instead of française, Université, and étudiants.
3 Submission requirements
This assignment is submitted electronically. Submit your assignment on MarkUs. Do not tar or compress your files, and do not place your files in subdirectories.
You should submit:
- The files a2_bleu_score.py, a2_encoder_decoder.py, and a2_training_and_testing.py that you filled out according to the assignment specifications. We will not accept a2_abcs.py, a2_dataloader.py, nor a2_run.py; your assignment must be compatible with the versions we provided you.
- Your (LaTeX) write-up on the experiment in analysis.pdf.
- If you are submitting a bonus, tell us what you've done by submitting a write-up in bonus.pdf. Please distribute bonus code amongst the above *.py files, being careful not to break functions, methods, and classes related to the assignment requirements.
You should not submit any additional files that you generated to train and test your models. For example, do not submit your model parameter files (*.pt) or vocab files (vocab.txt). Only submit the above files. Additional source files containing helper functions are not permitted.
A Suggestions
A.1 Check Piazza regularly
Updates to this assignment as well as additional assistance outside tutorials will be primarily distributed via Piazza (https://piazza.com/class/lccov4dd8ij2a0). It is your responsibility to check Piazza regularly for updates.
A.2 Run cluster code early and at irregular times
Because GPU resources are shared with your peers, your srun job may end up on a weaker machine (or even be postponed until resources are available) if too many students are training at once. To help balance resource usage over time, we recommend you finish this assignment as early as possible. You might find that your peers are more likely to run code at certain times of day. To check how many jobs are currently queued or running on our partition, run squeue -p csc401.
If you decide to run your models right before the assignment deadline, please be aware that we will be unable to request more resources or run your code sooner. We will not grant extensions for this reason.
A.3 Connection persistence: keep training after disconnecting
Training a model can take hours. If your internet connection is weak, you run the risk of losing your progress. You can use the Linux screen (https://linux.die.net/man/1/screen) or tmux (https://github.com/tmux/tmux/wiki) command to create a persistent shell that is only destroyed when you exit from it. This will allow training to continue even if you disconnect.
The most basic usage of screen is as follows. To start a new shell, run the command screen. You should see the same shell interface as before (i.e. the wolf: prompt). You can then detach from the session and later reattach to it; see the screen manual page linked above for the detach and reattach commands.
Tmux users can find the Tmux cheatsheet at https://tmuxcheatsheet.com/ helpful.
A.4 Using your own computer
If you want to do some or all of this assignment on your laptop or other computer, you will have to do the extra work of downloading and installing the requisite software and data. You will also need to configure the relevant paths and flags (e.g. --training-dir your/data/path/Training). You take on the risk that your computer might not be adequate for the task. You are strongly advised to upload regular backups of your work to teach.cs, so that if your machine fails or proves to be inadequate, you can immediately continue working on the assignment at teach.cs. When you have completed the assignment, you should try your programs out on teach.cs to make sure that they run correctly there. A submission that does not work on teach.cs will get zero marks.
That said, due to concerns of limited resources, we will allow you to report the results of training/testing your model on your local machine. The code must still conform to the teach.cs environment, but the contents of analysis.pdf can be based on your local environment.
A.5 Unit testing
We strongly recommend you test the methods and functions you've implemented prior to running the training loop. You can test actual output against expected output for complex methods like update_beam(...). The Python environment on teach has pytest installed (https://docs.pytest.org/en/5.1.2/). We have included some preliminary tests for BLEU_score(...) and update_beam(...). You can run your test suite on teach by calling
python3 -m pytest
While the test suite will execute initially, the provided tests will fail until you implement the necessary methods and functions. While passing these initial tests is a necessity for full marks, they are not sufficient on their own. Please be sure to add your own tests.
Unit testing is not a requirement, nor will you receive bonus marks for it.
A.6 Debugging task
Instead of re-running the entire task on a GPU when debugging your code, we recommend that you run a much smaller version of the task until you are confident that your code is error-free. Ideally, you should only need to run the full task once. The following commands may be used to set up such a task.
export OMP_NUM_THREADS=4 # avoids a libgomp error on teach
# create an input and output vocabulary of only 100 words
python3 a2_run.py vocab e data/english_vocab_tiny.txt --max-vocab 100
python3 a2_run.py vocab f data/french_vocab_tiny.txt --max-vocab 100
# only use the proceedings of 2 meetings, 3 for training and 1 for dev
python3 a2_run.py split --train-prefixes data/train_tiny.txt --dev-prefixes data/dev_tiny.txt --limit 2
# use far fewer parameters in your model
python3 a2_run.py train --tiny-preset model.pt --epochs 2 --word-embedding-size 21 --encoder-hidden-size 40 \
--batch-size 5 \
--cell-type rnn \
--beam-width 2
# Use with the flags --with-attention and --with-multihead-attention to test single-
# and multi-headed attention, respectively.
# The flag --tiny-preset is a shortcut to overwrite the vocab and prefix paths.
# It is equivalent to using all the following flags:
#   --train-prefixes data/train_tiny.txt \
#   --dev-prefixes data/dev_tiny.txt \
#   --english-vocab data/english_vocab_tiny.txt \
#   --french-vocab data/french_vocab_tiny.txt
We request that you first see whether running directly on the teach.cs server (wolf) is fast enough before attempting to use the cluster. Tested at low occupancy, each epoch took about 10-120 seconds (models with attention mechanisms are slower) with well-optimized code directly on wolf. At high occupancy, each epoch took around 4 minutes.
Note that your BLEU score will likely be high in the reduced-vocabulary condition, even using very few parameters, since your model will end up learning to output the out-of-vocabulary symbol. Do not report your findings on the toy task in analysis.pdf.
A.7 Beam search not finished warning
You might come across a warning like this during training:
a2_abcs.py:882: UserWarning: Beam search not finished by t=100. Halted
This just means your model failed to output an end-of-sequence token after t = 100.
This may not mean you've made a mistake. On certain machines in the cluster and with certain network configurations, even our solutions give this warning. However, it should not occur over all epochs. Your BLEU score shouldn't change much whether or not you get this warning, because it should only occur on a few sentences. If the warning keeps popping up or your BLEU scores are close to zero, you probably have an error.
A.8 Recurrent cell type
You'll notice that the code asks you to set up a different recurrent cell type depending on the setting of the attribute self.cell_type. This could be an LSTM or an RNN (the latter refers to the simple linear weighting h_t = σ(W[x_t, h_{t-1}] + b) that you saw in class).
The two cell types act very similarly, except the LSTM cell often requires you to carry around both a cell state and a hidden state. Pay careful attention to the documentation for when h_t, htilde_t, etc. might actually be a pair of the hidden state and cell state as opposed to just a hidden state. Sometimes no change is necessary to handle the LSTM cell. Other times you might have to repeat an operation on the elements individually.
Take advantage of the following pattern:
if self.cell_type == 'lstm':
    # do something
else:
    # do something else
Be sure to rerun training with different cell types (i.e. use the flag --cell-type) to ensure your code can handle the difference.
B Variable names and slides
We try to match the variable names in a2_abcs.py to those in the lecture slides on machine translation. The table below serves as a reference for converting between the two.
Note that the slides are 1-indexed, whereas code is 0-indexed. Also, all variables in the PyTorch code are batched, but the slides only look at one sequence at a time.
Variable | Slides | Notes
source_x | X_{1:S} | Source sequence.
source_x_lens | S | In the code, the maximum-length source sequence in the batch is said to have length S. The actual length of each sequence in the batch (before padding) is stored in F_lens.
x | x | Encoder RNN inputs.
h | h_{1:S} | Encoder hidden states. Always refers to the last encoder layer's hidden states, with both directions concatenated.
htilde_0 | h̃_1 | The first decoder hidden state.
source_x_tm1 | X_{t-1} | The target token at t-1 (previous).
xtilde_t | x̃_t | Decoder RNN input at time t (current).
htilde_t | h̃_t | Decoder hidden state at time t (current). For the LSTM architecture, this can be a pair with the cell state. Note in the beam search update, htilde_t also includes paths, i.e. h̃_t^{(1:K)}.
logits_t | log P(y_t | ...) + C | Un-normalized log-probabilities over the target vocabulary at time t (current). Pre-softmax.
target_y | Y_{1:T} | Target sequence.
logits | log P(y_t | ...)_{1:T} + C | Un-normalized log-probabilities over the target vocabulary across each time step. Pre-softmax.
beam_tm1_1 | B^{(1:K)}_{t-1,1} | All prefixes in the beam search at time t-1 (previous).
logpb_tm1 | log P(b^{(1:K)}_{t-1}) | The log-probabilities of the beam search prefixes up to time t-1 (previous).
logpy_t | log P(y_t | ...) | Valid (normalized) log-probabilities over the target vocabulary at time t (current). Post-softmax.
beam_t_0 | B^{(1:K)}_{t,0} | Decoder hidden states at time t (current). The difference between b^{(1:K)}_{t,0} and h̃^{(1:K)}_t is contextual: the latter points to the paths in the beam before the update, the former after the update.
beam_t_1 | B^{(1:K)}_{t,1} | All prefixes in the beam search at time t (current).
logpb_t | log P(b^{(1:K)}_t) | The log-probabilities of the beam search prefixes up to time t (current).
c_t | c_t | Context vector (for attention) at time t (current).
alpha_t | α_{t,1:S} | Attention weights over all source times at target time t (current).
a_t | a_{t,1:S} | Attention scores over all source times at target time t (current).