Apache Joshua Tutorial for nlpgrid

Apache Joshua is a syntax-based and phrase-based machine translation decoder. It is a large Java package, pieced together by Perl scripts, which encompasses the many steps necessary to translate sentences from a source language to a target language.

This tutorial is meant to quickly get new students up and running with Joshua on nlpgrid, the computing resources available to NLP researchers at Penn. Students and researchers with some experience should feel comfortable skipping sections on bash and working on a computing cluster.

The time and computational resources required to run Joshua on practical datasets make home use impractical, so I encourage people to get used to working on nlpgrid.

nlpgrid general rules of thumb

Our computing grid is composed of nodes, or individual computers. When you log in as:

    pennkey@nlpgrid.seas.upenn.edu

you arrive at the login or “head” node. On this node, you can compile code, edit text files, and run quick scripts. Crucially, it is not for running experiments. This includes (especially) Apache Joshua.

We’ll learn a bit about how to run experiments efficiently on the grid, but until then, just be conscious that every other researcher has the same login node, so treat it with care! Any programs that have a lot of I/O (reading or writing to disk) or take up a lot of main memory are likely to frustrate your co-researchers.

just enough bash

Apache Joshua is installed, configured, and used through a command line interface, usually via bash. The wonders of bash are never-ending, but if you’re just getting started, there are a few grounding principles to keep in mind:

  • When using the command line, everything you do happens in some folder on the computer. The folder you’re in right now is called the “working directory”. Executing pwd will show you what directory this is. Executing ls will show you the files in your working directory. Without explicit instruction otherwise, bash will assume all the files you try to edit or execute exist in the current directory, so this is crucial.
  • Changing the working directory is as easy as cd <target dir>. To change to the directory just “above” your current directory, run cd .. (that is, cd followed by two dots).
  • The symbol ~ is a short name for your “home” directory, which is exactly like the home folder you have on your local computer.
  • Everything after a # on a single line is a comment in bash.
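
The commands above can be tried safely in a scratch folder. This is only a minimal sketch: the folder name playground and the file notes.txt are made up for illustration, and mktemp is used so that nothing in your real files is touched.

```shell
# Make a throwaway directory so nothing in your real files is touched.
scratch=$(mktemp -d)
cd "$scratch"

mkdir playground          # "playground" is a made-up name for this example
cd playground             # move into the new folder
pwd                       # prints the working directory, ending in /playground
touch notes.txt           # create an empty file
ls                        # lists notes.txt

cd ..                     # go one level "up"
pwd                       # back in the throwaway directory
echo ~                    # prints the path of your home directory
```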

installing silent dependencies

This note is just for those trying to use Apache Joshua on their home machines. You’ll see us using commands like curl and git. These are essential tools, but they don’t come built in to every OS.

  • On Windows, give up and use a different OS.
  • On MacOS, install homebrew, after which you’ll run commands like brew install git.
  • On Linux, use your friendly neighborhood package manager.

A few such programs that you should install forthwith:

  • maven – a Java build and dependency management system
  • ant – another Java build and dependency management system. Yeah, I know.
  • git – the most popular version control system
  • curl – for basic communication with a server
  • wget – for downloading files
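
To see which of these are already installed, a quick check along the following lines may help. It is a sketch that assumes each tool is invoked by the command name shown in the loop (Maven's command is mvn); it only reports what it finds and installs nothing.

```shell
# Report which of the tools listed above are already on your PATH.
missing=0
for tool in mvn ant git curl wget; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```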

Compiling and Installing Joshua

Let’s borrow heavily from the official getting started page.

Let’s start out from your home directory. The following command sends you there.

cd ~

Now, get the code. ~/joshua is where your installation will live.

git clone https://github.com/apache/incubator-joshua joshua

Next, move into the directory you’ve just created – (1). After that, let bash know where we’ve installed Joshua – (2). Finally, configure bash so that the next time we run it, it’ll remember the installation – (3).

cd joshua  #(1)
export JOSHUA=$(pwd) #(2)
echo "export JOSHUA=$JOSHUA" >> ~/.bashrc #(3)
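
If steps (2) and (3) feel opaque, the sketch below replays the same idea with stand-ins, so your real ~/.bashrc is left alone. The variable name DEMO_DIR and the temporary files are assumptions made purely for illustration.

```shell
rc=$(mktemp)               # stand-in for ~/.bashrc
install_dir=$(mktemp -d)   # stand-in for the Joshua checkout
cd "$install_dir"

# Double quotes expand $(pwd) right now, so the file stores the absolute path.
echo "export DEMO_DIR=$(pwd)" >> "$rc"
cat "$rc"

# A fresh shell that reads the file picks the value up.
seen=$(bash -c '. "$1"; echo "$DEMO_DIR"' _ "$rc")
echo "new shell sees: $seen"
```

This is why the installation is remembered the next time you log in: bash reads ~/.bashrc when an interactive shell starts and re-runs the export line.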

Now we compile using maven:

mvn package

You’ll see a bunch of scary-looking warnings and “libken.so missing” messages, but as long as it says “BUILD SUCCESS” at the end, you’re in the clear.

Side note – if something has broken at this point (you see “BUILD FAILURE” or similar), or if any other step of the process fails, the problem likely relates to a missing dependency or a misconfigured setting. Do not give up. Contact a mailing list, or me. You can do this.

Finally, we download a few external dependencies.

bash download-deps.sh

Hopefully temporary file fixes

For reasons unknown to me, a couple of binaries seem to be out of place. As a quick fix for now, run the following:

Installing Hadoop

Now we install our last major dependency, Hadoop, which is a framework for distributed computing. Check out this release page, and make sure to download the binary version, not the source version. I don’t suggest trying to compile it from source. Use version 2.7.3. Remember that you want to download Hadoop to the grid, not your laptop. Once you’ve clicked binary, you should be redirected to a site with a bunch of options to download Hadoop. At this point, we (1) download to the grid, (2) decompress it, and (3) let bash know where we’ve installed it.

    wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz #(1)
    tar xzf hadoop-2.7.3.tar.gz #(2)

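    # (3) replace /nlp/users/johnhew with the directory where you extracted Hadoop
    export PATH=$PATH:/nlp/users/johnhew/hadoop-2.7.3/bin
    # single quotes keep $PATH unexpanded, so ~/.bashrc extends PATH at login time
    echo 'export PATH=$PATH:/nlp/users/johnhew/hadoop-2.7.3/bin' >> ~/.bashrc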

The Joshua Tutorial

Preparing a directory for experiments

Now that we have Joshua set up, let’s do a little bit of work to set up a directory in which we’ll run our experiments.

Your ~ directory has a maximum capacity of 10GB. This is pretty tiny, and we’ll need more space. In contrast, /scratch-shared/ (note the slash at the beginning) has much more space. We’ll be writing the files made by Joshua to that directory.

Navigate to your home directory before running these:

mkdir -p /scratch-shared/users/<PENNKEY>/
ln -s  /scratch-shared/users/<PENNKEY>/ joshua-expts

Now, when we write to ~/joshua-expts, we’re actually writing to /scratch-shared.
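
If symbolic links are new to you, the following sketch shows the same mechanism with two throwaway directories. The names big_disk and fake_home are stand-ins invented for this example; they play the roles of /scratch-shared/users/<PENNKEY>/ and your home directory.

```shell
# Stand-ins for /scratch-shared/users/<PENNKEY>/ and your home directory.
big_disk=$(mktemp -d)
fake_home=$(mktemp -d)

# Create the link: joshua-expts inside fake_home points at big_disk.
ln -s "$big_disk" "$fake_home/joshua-expts"

# Write a file through the link...
echo "hello" > "$fake_home/joshua-expts/test.txt"

# ...and it shows up in the target directory.
ls "$big_disk"

# readlink shows where the link points.
readlink "$fake_home/joshua-expts"
```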

Now, change directory to joshua-expts. From here, we’re walking through some of the official tutorial with a few modifications.

Preparing data

Joshua is trained, tuned, and tested on bilingual parallel text. In practice, this means a pair of text files, corpus.en and corpus.ar, for example, where both files have the same number of lines, and each line of each file is a translation of the same line in the other language. We’ll use a small corpus for our purposes to make sure Joshua is working correctly.
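
Before training on your own data, it can be worth checking that the two sides really line up. The sketch below builds a toy corpus (the file names and sentences are invented for illustration) and compares line counts. Matching counts are necessary but do not prove that the lines are actual translations of each other.

```shell
# Build a toy parallel corpus; the sentences are made up for illustration.
corpus_dir=$(mktemp -d)
printf 'hola\nbuenos dias\nmuchas gracias\n' > "$corpus_dir/toy.es"
printf 'hello\ngood morning\nthank you very much\n' > "$corpus_dir/toy.en"

# Count lines on each side; tr strips padding that some wc versions add.
src_lines=$(wc -l < "$corpus_dir/toy.es" | tr -d ' ')
tgt_lines=$(wc -l < "$corpus_dir/toy.en" | tr -d ' ')

if [ "$src_lines" -eq "$tgt_lines" ]; then
  echo "OK: both sides have $src_lines lines"
else
  echo "MISMATCH: $src_lines source lines vs $tgt_lines target lines"
fi

# View the two files side by side, one sentence pair per line.
paste "$corpus_dir/toy.es" "$corpus_dir/toy.en"
```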

mkdir joshua-tutorial
cd joshua-tutorial
wget --no-check-certificate -O fisher-callhome-corpus.zip https://github.com/joshua-decoder/fisher-callhome-corpus/archive/master.zip
unzip fisher-callhome-corpus.zip
export FISHER=$PWD/fisher-callhome-corpus-master
echo "export FISHER=$PWD/fisher-callhome-corpus-master" >> ~/.bashrc

Now, we’ll make a directory for the “runs” of our experiment. A run is an independent execution (training, tuning, testing) of Joshua that acts on the same test data as other runs in the same experiment. Grouping runs together helps us to summarize the experiments we’ve run so far.

mkdir runs

Example pipeline execution

Finally, we execute the Joshua pipeline on the data we’ve prepared. But not so fast. Don’t run on the head node. Instead, run the following:

qlogin
cd ~/joshua-expts/joshua-tutorial/runs

Now you’re working on a compute node, and we can run the following command. This will take a while, and produce a lot of output. In short, the Joshua pipeline determines all of the steps that need to be accomplished to translate your text, checks to see if each step has been done already, and if it hasn’t, runs it. Thus, you’ll see a lot of NOT FOUND messages. This is intended. If Joshua crashes, it’ll write logs to a few different places. We’ll discuss these in depth later.

$JOSHUA/bin/pipeline.pl \
  --rundir 1 \
  --readme "Baseline Hiero run" \
  --source es \
  --target en \
  --type hiero \
  --corpus $FISHER/corpus/asr/fisher_train \
  --tune $FISHER/corpus/asr/fisher_dev \
  --test $FISHER/corpus/asr/fisher_dev2 \
  --maxlen 11 \
  --maxlen-tune 11 \
  --maxlen-test 11 \
  --tuner-iterations 1 \
  --lm-order 3

Breaking down the first pipeline.pl command

Recall that Joshua is itself a “pipeline”. In NLP, a pipeline is a software system composed of multiple parts, wherein some input data is fed from one part to the next, undergoing computation at each step. Machine translation pipelines traditionally have many steps. For Joshua, the pipeline.pl command links these steps together so that you don’t have to. If the script completes, it means you’ve done all of:

  • Training a translation model
  • Training a language model
  • Tuning decoder weights for the two models
  • Translating sentences Joshua didn’t see during training time. This is the test set.
  • Reporting a measure of quality – how well did we translate the test set?

Even these macro steps are each broken down into multiple programmatic steps, which we’ll go into. For now, though, let’s look at the command we ran, step by step:

  • This invokes the pipeline. Recall that $JOSHUA is another name for the directory in which you installed Joshua.

     $JOSHUA/bin/pipeline.pl 
    
  • This gives Joshua a directory into which it’ll put all of the files for this run (described above.)

     --rundir 1
    
  • This will write a short note into the run directory so you know what you were trying to do when you come back to it later.

     --readme "Baseline Hiero run"
    
  • These specify the file extensions Joshua should use when it’s looking for your train/tune/test corpora.

       --source es 
       --target en 
    
  • These specify the paths to 3 disjoint parallel corpora. The training data is used to build the model, the tuning data is used to set parameters for the decoder, and the testing data is used to report accuracy.

       --corpus $FISHER/corpus/asr/fisher_train 
       --tune $FISHER/corpus/asr/fisher_dev 
       --test $FISHER/corpus/asr/fisher_dev2 
    
  • These parameters specify that only sentences of length 11 or shorter should be used in training/tuning/testing. Many sentences are longer than this. These parameters speed up the test run, but should be set to 80 for any actual experimental run.

       --maxlen 11
       --maxlen-tune 11
       --maxlen-test 11
    
  • This specifies that the machine learning algorithm Joshua uses to tune the weights in the decoder (the system that actually builds new translations) should only be run once (again, for speed.) This should also be removed for any experimental run.

       --tuner-iterations 1
    
  • This specifies that the language model built by Joshua should work with sequences no greater than 3 words in length. If you don’t know what this means, don’t worry about it for now.

       --lm-order 3
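
To get a feel for what the length limits described above do, here is a rough sketch that counts how many lines of a file have at most 11 whitespace-separated tokens. The toy file is invented for this example and only one side of a corpus is checked. It does not show how Joshua itself applies --maxlen internally.

```shell
# Toy file: one sentence per line (contents invented for this example).
toy=$(mktemp)
printf 'a short sentence\n' > "$toy"
printf 'one two three four five six seven eight nine ten eleven\n' >> "$toy"
printf 'one two three four five six seven eight nine ten eleven twelve\n' >> "$toy"

maxlen=11
# awk splits each line on whitespace; NF is the number of tokens on the line.
kept=$(awk -v max="$maxlen" 'NF <= max' "$toy" | wc -l | tr -d ' ')
total=$(wc -l < "$toy" | tr -d ' ')
echo "$kept of $total sentences have at most $maxlen tokens"
```

Running the same count on your real training file gives a rough idea of how much data a given limit would leave you with.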
    

Dealing with error conditions

This deserves its own tutorial, but here are some general tips:

  1. The pipeline.pl output is your friend, and you should use it as a starting point for debugging. Output it to a file instead of printing it to the terminal. Thus, the above command mig
  2. Once you’ve determined which part of the pipeline failed, look in the run directory for the folder that holds the intermediate files used. For example, if GIZA++ fails, you want the alignments folder. Then look for log files that may give you more information about the failure. Turning back to our example, GIZA++ writes logs both to:

     alignments/run.log
     alignments/0/giza.log
    
  3. Google around for errors, but don’t be afraid to contact the Joshua users list with your problems. Be concise, and state the exact error code.
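
As a sketch of the first two tips, the snippet below captures both normal output and error output in one log file and then searches it. The function run_step is a hypothetical stand-in for a long-running command such as pipeline.pl, and the search words are only a starting guess at what a failure might look like.

```shell
# run_step is a hypothetical stand-in for a long-running command such as
# pipeline.pl; it writes one line to stdout and one to stderr.
run_step() {
  echo "step finished"
  echo "WARNING: something looked odd" >&2
}

log=$(mktemp)

# 2>&1 sends stderr to the same place as stdout; tee writes everything to
# the log file while still showing it on the terminal.
run_step 2>&1 | tee "$log"

# Later, search the log for trouble (case-insensitive, with line numbers).
grep -n -i -E 'error|warning|fail' "$log"
```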

Tips for running on nlpgrid

So, we ran the tutorial on nlpgrid10, the compute node that we arrived at when we ran qlogin. For experiments with larger datasets, however, this is still bad practice. Further, runs may take days, and we don’t want to have to babysit them. We’ll thus be submitting our jobs to the grid.

All we have to do is write the execution command above to a file, and add the following lines to the top of the file:

      #$ -o $PWD/runs/output.o
      #$ -e $PWD/runs/output.e
      #$ -l mem_free=25G
      #$ -l ram_free=25G
      #$ -pe parallel-onenode 10
      #$ -V
      #$ -cwd
      #$ -S /bin/bash
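
As a sketch of how such a file is put together, the snippet below writes a tiny job script and prints the command you would use to submit it. The file name job.sh and the echo placeholder are assumptions for illustration. In a real job the body would be the pipeline.pl command and the header would also carry the resource lines. qsub is the Grid Engine submission command. The snippet only prints it, so it is safe to run anywhere.

```shell
job_dir=$(mktemp -d)
job="$job_dir/job.sh"   # hypothetical file name

# A quoted heredoc ('EOF') writes the lines literally, without expanding $.
cat > "$job" <<'EOF'
#$ -V
#$ -cwd
#$ -S /bin/bash
echo "the pipeline.pl command would go here"
EOF

# Lines starting with "#$" are read by the scheduler; bash treats them as
# comments. This counts the directive lines in the file.
grep -c '^#\$' "$job"

# The script still runs as ordinary bash, which is handy for a quick check.
bash "$job"

echo "On the grid, submit with: qsub $job"
```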

CC-Attribution-ShareAlike 4.0
