Assignment 8: Machine Translation
Jordan Boyd-Graber
Introduction
As always, check out the GitHub repository with the course homework templates:
git://github.com/ezubaric/cl1-hw.git
The code for this homework is in the hw8 directory. The goal of this homework is for you to build a machine translation scoring function based on IBM Model 1.
Data
The data are gzipped samples from Europarl and can be found in the GitHub directory alongside the code.
Also in the directory are two text files with lists of words. These can help you monitor the progress of the algorithm to see if the lexical translation probabilities look like what they should.
What to Do
You have been given a partial EM implementation of IBM Model 1, which translates foreign words (f) from English words (e). The maximization step is complete, but the expectation step is not, nor is the function that scores a complete translation pair. You need to fill in two functions.
Generating Counts (20 points)
The first function you need to fill in is sentence_counts, which should return an iterator over English and foreign word pairs along with their expected counts. The expected count is the expected number of times that word pair was translated in the sentence, given by the equation
c(f \mid e; \mathbf{e}, \mathbf{f}) = \sum_{a} p(a \mid \mathbf{e}, \mathbf{f}) \sum_{j=1}^{l_f} \delta(f, f_j)\, \delta(e, e_{a(j)}). \quad (1)
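Because Model 1 makes each foreign word's alignment independent, the sum over alignments in Equation 1 factorizes: the expected count for a pair (e, f) in one sentence is just t(f|e) normalized over all English words the foreign word could align to. A minimal sketch of that E-step (the signature and the (f, e)-keyed table here are illustrative assumptions, not necessarily the template's actual interface):

```python
def sentence_counts(eng, frn, t):
    """Yield ((e, f), expected_count) pairs for one sentence pair.

    eng: English tokens (should include the NULL word, e.g. None)
    frn: foreign tokens
    t:   dict mapping (f, e) -> current translation probability t(f|e)

    Sketch of the Model 1 E-step: the sum over alignments in
    Equation 1 factorizes per foreign word, so each expected count
    is t(f|e) divided by the normalizer over all English words.
    """
    for f in frn:
        # Normalizer: total probability mass for aligning f anywhere
        z = sum(t[(f, e)] for e in eng)
        for e in eng:
            yield (e, f), t[(f, e)] / z
```

Note that the expected counts for a single foreign word always sum to one, which is a useful sanity check while debugging.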
Scoring (10 points)
The second function you need to fill in is the noisy-channel scoring method translate_score, which returns the translation probability (given by Model 1) times the probability of the English output under the language model.
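In log space the noisy-channel score is the English LM log probability plus the Model 1 translation log probability. A sketch, under the assumption of a (f, e)-keyed translation table and a precomputed LM log probability (the real translate_score signature in the template will differ):

```python
import math

def translate_score(t, lm_logprob, eng, frn):
    """Noisy-channel score, in log space, of an (English, foreign) pair.

    t:          dict (f, e) -> t(f|e)
    lm_logprob: log probability of the English sentence under the LM
    eng, frn:   token lists (NULL word added to eng below)

    Model 1 translation probability, ignoring the constant epsilon
    term: p(f|e) = prod_j (1 / (l_e + 1)) * sum_i t(f_j | e_i).
    """
    eng = [None] + list(eng)          # add the NULL word
    log_trans = 0.0
    for f in frn:
        inner = sum(t.get((f, e), 0.0) for e in eng)
        log_trans += math.log(inner / len(eng))
    # Noisy channel: p(e) * p(f|e), summed in log space
    return lm_logprob + log_trans
```

Working in log space avoids underflow on longer sentences, which matters once you move past the toy data.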
Running the Model (10 points)
Run the model and produce the lexical translation table for the development words. Don’t leave this until the last moment, because this can take a while.
How to solve the Problem
Don’t start with the big data immediately; start with the small data. If you run a correct implementation on the toy data, you should get something like the output in Listing 1.
What to turn in
You will turn in your complete ibm_trans.py file and the word translations for the supplied test file (devwords.txt) after three (3) iterations of EM.
Extra Credit (5 points)
If you would like extra credit, add an additional function that computes the best alignment between a pair of sentences.
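Under Model 1 each foreign word's alignment is independent of the others, so the best (Viterbi) alignment is just a per-word argmax over English positions. A sketch of one way to do it (the helper name and interface are hypothetical, not part of the graded template):

```python
def best_alignment(t, eng, frn):
    """Return, for each foreign word, the index of the English word
    (0 = NULL) that maximizes t(f|e).

    Because Model 1 aligns each foreign word independently, a
    per-word argmax over English positions gives the best overall
    alignment; no dynamic programming is needed.
    """
    eng = [None] + list(eng)          # position 0 is the NULL word
    return [max(range(len(eng)), key=lambda i: t.get((f, eng[i]), 0.0))
            for f in frn]
```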
Listing 1: Successful output from running on toy data
Questions
1. My code is really slow; is it okay if I don’t run it on the entire dataset (by using the limit argument)?
2. I’m getting probabilities greater than 1.0!
The logprob function of nltk’s language model returns the negative log probability.
3. I can’t get my translation probabilities to match.
• You will need to add a None to the English sentence in the translate_score function. The assert is there to make sure the None hasn’t been added outside of the function.
• Make sure you compute the lm probabilities conditioning the first word on the empty string (’’).
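Concretely, that means chaining the bigram LM from an empty-string context. A sketch, assuming a generic lm_logprob(word, context) callable (the actual nltk model object in the template may expose this differently):

```python
def sentence_logprob(lm_logprob, sentence):
    """Sum bigram log probabilities over a sentence.

    The first word is conditioned on the empty string (''), and
    each later word on its predecessor.  `lm_logprob` is assumed
    to return a log probability -- a sketch, not nltk's exact API.
    """
    total = lm_logprob(sentence[0], '')
    for prev, word in zip(sentence, sentence[1:]):
        total += lm_logprob(word, prev)
    return total
```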



