Simple README Compiling ---------- 1. Set JAVA_HOME environtment variable to java runtime environment e.g. JAVA_HOME=/usr/java/jdk1.5/ export JAVA_HOME 2. type "build.sh" that should be all. Demo ----- run "bin/runGeneDemo.sh" should produce two files with the mask samle_output - check them out Running the tagger ------------------- ./bin/runTagger.sh type model_name text_file output_mask type = gene or var or mal (for gene or variation or malignancy) model_name = name of stored model text_file = input file (plain text or MEDLINE format - see example in data/testing/medline.txt) output_mask = will create output_mask.html and output_mask.txt (the output files) output_mask.txt - MEDLINE style file with a list of entities in the Metadata fields output_mask.html - HTML view with entities colored. There are pretrained gene, variation and malignancy models in the data directory. Training taggers ------------------ Training data format: w11 t11 w21 t21 w31 t31 ... wn1 tn1 w12 t12 w22 t22 ... The first column is the token (or word) in the sentence and the second the appropriate tag (i.e. using the BIO tag set). The columns are tab separated. Documents are separated by a space. See examples in data/training. bin/geneTrainer.sh training_data testing_data model_name e.g. bin/geneTrainer.sh data/training/gene/data/all_9-15-2004_train.txt null data/geneModel.crf if you do not have testing data, just put type "null" (as in the example) Training the variation and malignancy tagger is similar Training data included: For Genes, the best data is at: data/training/gene/data/gene_train_data For Variations, the best data is at: data/training/variation/variation_train_data For Malignancies, the best data is at: data/training/malignancy/malignancy_train_data Some data directories have a subdirectory called "other" that contains additional data. Testing on new data ------------------- Say you have a new annotated set of data (in the same format as the training data) and you want to measure performance. Then you can type: bin/geneTestSetEvaluator.sh testing_data model_name and it will output the performance of the model (i.e. precision/recall/f-measure). Questions --------- Email ryantm@cis.upenn.edu if there are questions. I will try to do my best to answer them.