Apache Joshua Tutorial for nlpgrid
Apache Joshua is a Syntax-based and Phrase-based Machine Translation Decoder. It is a large Java package, pieced together by Perl scripts, which encompasses the many steps necessary to translate sentences from a source language to a target language.
This tutorial is meant to quickly get new students up and running with Joshua on
nlpgrid, the computing resources available to NLP researchers at Penn.
Students and researchers with some experience should feel comfortable skipping sections on bash and working on a computing cluster.
The time and computational resources required to run Joshua on practical datasets make home use impractical, so I encourage people to get used to working on the grid.
nlpgrid general rules of thumb
Our computing grid is composed of nodes, or individual computers. When you log in over SSH,
you arrive at the login or “head” node. On this node, you can compile code, edit text files, and run quick scripts. Crucially, it is not for running experiments. This includes (especially) Apache Joshua.
We’ll learn a bit about how to run experiments efficiently on the grid, but until then, just be conscious that every other researcher shares the same login node, so treat it with care! Any programs that do a lot of I/O (reading or writing to disk) or take up a lot of main memory are likely to frustrate your co-researchers.
just enough bash
Apache Joshua is installed, configured, and used through a command line interface, usually via `bash`. The wonders of `bash` are never-ending, but if you’re just getting started, there are a few grounding principles to keep in mind:
- When using the command line, everything you do happens in some folder on the computer. The folder you’re in right now is called the “working directory”. Running `pwd` will show you what directory this is. Executing `ls` will show you the files in your working directory. Without explicit instruction otherwise, `bash` will assume all the files you try to edit or execute exist in the current directory, so this is crucial.
- Changing the working directory is as easy as `cd <target dir>`. To change to the directory just “above” your current directory, run `cd ..`.
- The symbol `~` is a short name for your “home” directory, which is exactly like the home folder you have on your local computer.
- Everything after a `#` on a line is a comment in bash.
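Putting these principles together, a quick `bash` session might look like this:

```shell
pwd        # print the current working directory
ls         # list the files in it
cd /tmp    # change the working directory to /tmp
pwd        # now prints /tmp
cd ~       # jump back to your home directory
# everything after a '#' is a comment, so this line does nothing
```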
installing silent dependencies
This note is just for those trying to use Apache Joshua on their home machines.
You’ll see us using commands like `git` and `wget`. These are essential tools, but they don’t come built into every OS.
- On Windows, give up and use a different OS.
- On MacOS, install `homebrew`, after which you’ll run commands like `brew install git`.
- On Linux, use your friendly neighborhood package manager.
A few such programs that you should install forthwith:
- `maven` – a Java build and dependency management system
- `ant` – another Java build and dependency management system. Yeah, I know.
- `git` – the most popular version control system
- `curl` – for basic communication with a server
- `wget` – for downloading files
Compiling and Installing Joshua
Let’s borrow heavily from the official getting started page.
Let’s start out from your home directory; running `cd ~` sends you there.
Now, get the code.
`~/joshua` is where your installation will live.

```
git clone https://github.com/apache/incubator-joshua joshua
```
Next, move into the directory you’ve just created – (1). After that, let bash know where we’ve installed Joshua – (2). Finally, configure bash so that the next time we run it, it’ll remember the installation – (3).
```
cd joshua                                  #(1)
export JOSHUA=$(pwd)                       #(2)
echo "export JOSHUA=$JOSHUA" >> ~/.bashrc  #(3)
```
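Steps (2) and (3) follow a common pattern: set a variable for the current shell, then append the same `export` line to `~/.bashrc` so future shells inherit it. Here is the pattern in miniature, with a throwaway file standing in for `~/.bashrc` (`DEMO_DIR` is a made-up variable for illustration):

```shell
rc=$(mktemp)                                # stand-in for ~/.bashrc
export DEMO_DIR=$(pwd)                      # like (2): visible to this shell
echo "export DEMO_DIR=$DEMO_DIR" >> "$rc"   # like (3): saved for future shells
grep "^export DEMO_DIR=" "$rc"              # confirm the line was written
rm -f "$rc"                                 # clean up the throwaway file
```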
Now we compile using maven, by running `mvn package` from inside the `joshua` directory.
You’ll see a bunch of scary-looking warnings and “libken.so missing” messages, but as long as the build reports success at the end, you’re in the clear.
Side note – if at this point something has broken (you see “BUILD FAILED” or similar), or if any other step of the process fails, it’s likely that your problem relates to some missing dependency or mis-configured setting. Do not give up. Contact a mailing list, or me. You can do this.
Finally, we download a few external dependencies.
Hopefully temporary file fixes
For reasons unknown to me, a couple of binaries seem to be out of place. For a quick fix, run the following:
Now we install our last major dependency, Hadoop, which is a framework for distributed computing.
Check out this release page, and make sure to download the binary version, not the source version.
I don’t suggest trying to compile it from source. Use version 2.7.3, which is what we download below.
Remember that you want to download hadoop to the grid, not your laptop.
So once you’ve clicked binary, you should be redirected to a site with a bunch of options to download hadoop.
At this point, we (1) download to the grid, (2) decompress it, and (3) let bash know where we’ve installed it.
```
wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz  #(1)
tar xzf hadoop-2.7.3.tar.gz                                                        #(2)
# (3) -- substitute the directory where you unpacked Hadoop for the path below
export PATH=$PATH:/nlp/users/johnhew/hadoop-2.7.3/bin
echo 'export PATH=$PATH:/nlp/users/johnhew/hadoop-2.7.3/bin' >> ~/.bashrc
```
The Joshua Tutorial
Preparing a directory for experiments
Now that we have Joshua set up, let’s do a little bit of work to set up a directory in which we’ll run our experiments.
Your `~` directory has a maximum capacity of 10GB. This is pretty tiny, and we’ll need more space.
`/scratch-shared/` (note the slash at the beginning) has much more space.
We’ll be writing the files made by Joshua to that directory.
Navigate to your home directory before running these:
```
mkdir -p /scratch-shared/users/<PENNKEY>/
ln -s /scratch-shared/users/<PENNKEY>/ joshua-expts
```
Now, when we write to `~/joshua-expts`, we’re actually writing to `/scratch-shared/users/<PENNKEY>/`.
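To see why this works, here is the symlink mechanic in miniature, using throwaway temp directories (the names are hypothetical):

```shell
target=$(mktemp -d)              # stand-in for /scratch-shared/users/<PENNKEY>/
ln -s "$target" demo-expts       # demo-expts now points at $target
echo "results" > demo-expts/out.txt   # write "through" the link...
cat "$target/out.txt"            # ...and the file lands in the real target
rm demo-expts && rm -r "$target" # clean up
```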
Now, change directory to `~/joshua-expts`.
From here, we’re walking through some of the official tutorial with a few modifications.
Joshua is trained, tuned, and tested on bilingual parallel text. In practice, this means a pair of text files – say, `corpus.en` and `corpus.ar` – where both files have the same number of lines, and each line of each file is a translation of the same line in the other language.
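For a feel of the format, here is a toy two-sentence parallel corpus (hypothetical file names and contents):

```shell
printf 'hello\nthank you\n' > corpus.en   # English side
printf 'hola\ngracias\n'    > corpus.es   # Spanish side
# both sides must have the same number of lines:
wc -l < corpus.en    # should print 2
wc -l < corpus.es    # should print 2
rm corpus.en corpus.es
```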
We’ll use a small corpus for our purposes to make sure Joshua is working correctly.
```
mkdir joshua-tutorial
cd joshua-tutorial
wget --no-check-certificate -O fisher-callhome-corpus.zip https://github.com/joshua-decoder/fisher-callhome-corpus/archive/master.zip
unzip fisher-callhome-corpus.zip
export FISHER=$PWD/fisher-callhome-corpus-master
echo "export FISHER=$PWD/fisher-callhome-corpus-master" >> ~/.bashrc
```
Now, we’ll make a directory for the “runs” of our experiment (`mkdir runs`). A run is an independent execution (training, tuning, testing) of Joshua that acts on the same test data as other runs in the same experiment. Grouping runs together helps us summarize the experiments we’ve run so far.
Example pipeline execution
Finally, we execute the Joshua pipeline on the data we’ve prepared. But not so fast. Don’t run on the head node. Instead, run the following:
```
qlogin
cd ~/joshua-expts/joshua-tutorial/runs
```
Now you’re working on a compute node, and we can run the following command:

```
$JOSHUA/bin/pipeline.pl \
    --rundir 1 \
    --readme "Baseline Hiero run" \
    --source es \
    --target en \
    --type hiero \
    --corpus $FISHER/corpus/asr/fisher_train \
    --tune $FISHER/corpus/asr/fisher_dev \
    --test $FISHER/corpus/asr/fisher_dev2 \
    --maxlen 11 \
    --maxlen-tune 11 \
    --maxlen-test 11 \
    --tuner-iterations 1 \
    --lm-order 3
```

This will take a while, and produce a lot of output. In short, the Joshua pipeline determines all of the steps that need to be accomplished to translate your text, checks to see whether each step has been done already, and runs it if it hasn’t. Thus, you’ll see a lot of `NOT FOUND` messages. This is intended. If Joshua crashes, it’ll write logs to a few different places. We’ll discuss these in depth later.
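That check-then-run caching behavior can be sketched in bash; `run_step` here is a hypothetical helper for illustration, not part of Joshua:

```shell
# each step declares its output file; if it already exists, skip the step
run_step () {
  local out="$1"; shift
  if [ -e "$out" ]; then
    echo "CACHED $out"
  else
    echo "NOT FOUND $out, running"
    "$@" > "$out"
  fi
}

out=$(mktemp -u)              # a fresh path that doesn't exist yet
run_step "$out" echo "model"  # first call: NOT FOUND, runs the step
run_step "$out" echo "model"  # second call: CACHED, step skipped
rm -f "$out"
```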
Breaking down the first run
Recall that Joshua is itself a “pipeline”. In NLP, a pipeline is a software system composed of multiple parts, wherein some input data is fed from one part to the next, undergoing computation at each step.
Machine Translation pipelines traditionally have many steps. For Joshua, the `pipeline.pl` command links these steps together so that you don’t have to.
If the script completes, it means you’ve done all of:
- Training a translation model
- Training a language model
- Tuning decoder weights for the two models
- Translating sentences Joshua didn’t see during training time. This is the test set.
- Reporting a measure of quality – how well did we translate the test set?
Even these macro steps are each broken down into multiple programmatic steps, which we’ll go into. For now, though, let’s look at the command we ran, step by step:
This invokes the pipeline. Recall that $JOSHUA is another name for the directory in which you installed Joshua.
$JOSHUA/bin/pipeline.pl
This gives Joshua a directory into which it’ll put all of the files for this run (described above).
--rundir 1
This will write a short note into the run directory so you know what you were trying to do when you come back to it later.
--readme "Baseline Hiero run"
These specify the file extensions Joshua should use when it’s looking for your train/tune/test corpora.
--source es --target en
These specify the paths to 3 disjoint parallel corpora. The training data is used to build the model, the tuning data is used to set parameters for the decoder, and the testing data is used to report accuracy.
--corpus $FISHER/corpus/asr/fisher_train --tune $FISHER/corpus/asr/fisher_dev --test $FISHER/corpus/asr/fisher_dev2
These parameters specify that only sentences of length 11 or shorter should be used in training/tuning/testing. Many sentences are longer than this. These parameters speed up the test run, but should be set to 80 for any actual experimental run.
--maxlen 11 --maxlen-tune 11 --maxlen-test 11
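To see what a length filter like this does, here is a hand-rolled version using `awk` (with a throwaway file name):

```shell
printf 'a b c\none two three four\n' > toy.txt
awk 'NF <= 3' toy.txt    # keep lines with at most 3 tokens; prints only: a b c
rm toy.txt
```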
This specifies that the machine learning algorithm Joshua uses to tune the weights in the decoder (the system that actually builds new translations) should only be run once (again, for speed). This should also be removed for any real experimental run.
--tuner-iterations 1
This specifies that the language model built by Joshua should work with sequences no greater than 3 words in length. If you don’t know what this means, don’t worry about it for now.
--lm-order 3
Dealing with error conditions
This deserves a whole tutorial of its own, but here are some general tips:
- `pipeline.pl` output is your friend, and you should use it as a starting point for debugging. Output it to a file instead of printing it to the terminal – for instance, by appending `> pipeline.log 2>&1` to the command above.
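Here is that redirection pattern in miniature, with a stand-in command and a hypothetical log name:

```shell
# capture a command's stdout AND stderr in one log file, then search it
{ echo "step 1 ok"; echo "step 2 FAILED" >&2; } > pipeline.log 2>&1
tail -n 2 pipeline.log          # inspect the end of the log
grep -c FAILED pipeline.log     # count lines mentioning failure; prints 1
rm pipeline.log
```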
- Once you’ve determined which part of the pipeline failed, look in the run directory for the folder that holds the intermediate files used. For example, if `GIZA++` fails, you want the `alignments` folder. Then look for log files that may give you more information about the failure. Turning back to our example, `GIZA++` writes logs to more than one place.
- Google around for errors, but don’t be afraid to contact the Joshua users list with your problems. Be concise, and state the exact error code.
Tips for running on nlpgrid
So, we ran the tutorial on `nlpgrid10`, the compute node that we arrived at when we ran `qlogin`.
For experiments with larger datasets, however, this is still bad practice.
Further, runs may take days, and we don’t want to have to babysit them.
We’ll thus be submitting our jobs to the grid.
All we have to do is write the execution command above to a file, and add the following lines to the top of the file:
```
#$ -o $PWD/runs/output.o
#$ -e $PWD/runs/output.e
#$ -l mem_free=25G
#$ -l ram_free=25G
#$ -pe parallel-onenode 10
#$ -V
#$ -cwd
#$ -S /bin/bash
```
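Putting this together, a complete job script might look like the sketch below. The file name `run-joshua.sh` is made up, the pipeline flags are from the tutorial run above, and the `parallel-onenode` parallel environment name is an assumption about this grid’s configuration; submit the script with `qsub run-joshua.sh`.

```shell
#!/bin/bash
#$ -o $PWD/runs/output.o
#$ -e $PWD/runs/output.e
#$ -l mem_free=25G
#$ -l ram_free=25G
#$ -pe parallel-onenode 10
#$ -V
#$ -cwd
#$ -S /bin/bash

# the same pipeline invocation we ran interactively, now as a batch job
$JOSHUA/bin/pipeline.pl \
    --rundir 1 \
    --readme "Baseline Hiero run" \
    --source es \
    --target en \
    --type hiero \
    --corpus $FISHER/corpus/asr/fisher_train \
    --tune $FISHER/corpus/asr/fisher_dev \
    --test $FISHER/corpus/asr/fisher_dev2
```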