== BASICS ==

This package contains code to perform word alignment using IBM Models 1
and 2 and the HMM model. The models can be trained with standard EM or
with constrained EM using agreement constraints and substochastic
constraints. The details are described in:

  Expectation Maximization and Posterior Constraints.
  Joao Graca, Kuzman Ganchev and Ben Taskar.
  Advances in Neural Information Processing Systems 20 (NIPS).
  MIT Press, 2008.

and also

  Better Alignments = Better Translation?
  Kuzman Ganchev, Joao Graca and Ben Taskar.
  Proceedings of the 46th Annual Meeting of the Association for
  Computational Linguistics. ACL 2008.

We release this code mainly so that other researchers can reproduce our
results. We encourage you to use and modify it.

== INSTALLING ==

The package comes precompiled, so once you have unpacked the tarball you
should be able to use it. If you change the code, you should just need
to run:

  ant

to recompile it. See http://ant.apache.org/ for details on how to get
and run ant.

== TRYING IT OUT ==

Included with the code is some sample data in the small_data/
sub-directory. To run the scripts below, you will need some auxiliary
programs (e.g. tee, find, tail) that are standard on *NIX systems. If
you don't have them, it should be easy to change the scripts or to copy
the command lines out of them.

When you are in the base directory of the package, run:

  ./scripts/saveModels.sh small_data/small_hansards.params

This will create an output/saved_alignment_models/ directory and save
the trained baseline and agreement HMMs there.

  ./scripts/computeErrorMetrics.sh small_data/small_hansards.params

This will compute alignment error rates for both models, three decoding
types, and both directions. It creates output/error_metrics/ and writes
its output there. At the end, the script prints a summary. Precision and
recall should be in the ranges 75-89 and 68-87, respectively.
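
For readers unfamiliar with the metrics in that summary, the sketch
below computes precision, recall and alignment error rate (AER) using
the standard definitions from Och and Ney (2003). It is not taken from
this package: the function name, the representation of links as
(source, target) index pairs, and the assumption that the scripts use
exactly these definitions are all illustrative.

```python
from typing import Set, Tuple

# A link is a (source position, target position) pair. Hypothetical
# representation chosen for this sketch, not the package's own format.
Link = Tuple[int, int]


def alignment_metrics(
    predicted: Set[Link], sure: Set[Link], possible: Set[Link]
) -> Tuple[float, float, float]:
    """Return (precision, recall, AER) for one set of predicted links.

    Standard definitions, with A = predicted, S = sure, P = possible:
      precision = |A & P| / |A|
      recall    = |A & S| / |S|
      AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    """
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")

    # By convention every sure link also counts as a possible link.
    possible = possible | sure

    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & possible)

    precision = a_and_p / len(predicted)
    recall = a_and_s / len(sure)
    aer = 1.0 - (a_and_s + a_and_p) / (len(predicted) + len(sure))
    return precision, recall, aer


if __name__ == "__main__":
    # Toy example: two correct sure links and one wrong link.
    predicted = {(0, 0), (1, 1), (2, 3)}
    sure = {(0, 0), (1, 1)}
    possible = {(2, 2)}
    p, r, aer = alignment_metrics(predicted, sure, possible)
    print(f"precision={p:.3f} recall={r:.3f} AER={aer:.3f}")
```

The values here are fractions in [0, 1]; multiply by 100 to compare
with the percentage ranges quoted above.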

  ./scripts/alignmentsForMoses.sh small_data/small_hansards.params

This will create output/alignments_for_moses/ and save the alignments in
a format usable by the Moses statistical machine translation system.

Finally, the prettyPrintAlignments script generates LaTeX visualizations
of the alignments and alignment posteriors.

  ./scripts/prettyPrintAlignments.sh small_data/small_hansards.params

This will create output/latex_alignments/ and save the alignments for
the test data in different sub-directories, one for each decoding scheme
we implement.

== RUNNING FOR REAL ==

The code expects a configuration file with absolute path names. The
first step is to edit small_data/small_hansards.params and change all
the relative path names to absolute ones. For a new corpus, you will
need to create a similar configuration file. If you do not have
hand-aligned data, omit the wa_* declarations from the file.

The provided scripts are intended as examples of scripts that we found
useful. You will probably want to edit them, and possibly write your
own.

== CAVEATS ==

The models are meant to be used in a transductive fashion, so it is
important to include the test and dev corpora at training time. The gold
alignments aren't necessary, but the raw text is.

This is research code, so it will probably never be really well
documented.

We've only tested this code on *NIX systems. Most of it should work with
little or no changes on Windows.
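
As an illustration of the transductive setup described in the caveats,
here is a minimal sketch that appends the dev and test raw text to the
training text. It assumes each corpus is stored as one
sentence-per-line file per language, which may not match what this
package expects; check the sample .params file first. The function
names and the file names in the usage note are hypothetical.

```python
from typing import Iterable, List, Tuple


def read_sentences(path: str) -> List[str]:
    """Read one sentence per line, dropping the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def build_transductive_corpus(
    pairs: Iterable[Tuple[str, str]], out_source: str, out_target: str
) -> int:
    """Concatenate parallel (source, target) file pairs, in order.

    Each pair must have the same number of lines so that the output
    files stay sentence-aligned. Returns the number of sentence pairs
    written.
    """
    total = 0
    with open(out_source, "w", encoding="utf-8") as out_s, \
            open(out_target, "w", encoding="utf-8") as out_t:
        for src_path, tgt_path in pairs:
            src = read_sentences(src_path)
            tgt = read_sentences(tgt_path)
            if len(src) != len(tgt):
                raise ValueError(
                    f"{src_path} has {len(src)} lines but "
                    f"{tgt_path} has {len(tgt)}"
                )
            for s, t in zip(src, tgt):
                out_s.write(s + "\n")
                out_t.write(t + "\n")
            total += len(src)
    return total
```

A call might look like
build_transductive_corpus([("train.en", "train.fr"), ("dev.en", "dev.fr"),
("test.en", "test.fr")], "all.en", "all.fr"), where every file name is
a placeholder. Only the raw text is combined; no gold alignments are
involved.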