== SIMetrix: Summary Input similarity Metrics == This tool is built to perform evaluation by comparing a generated summary with the source document for which it was produced. Since summaries are expected to be surrogates of the input, high similarity with the source would be indicative of good quality summaries. This package contains code to obtain various input-summary similarily metrics that were compared in our work described in the following papers. 1. Annie Louis and Ani Nenkova, Automatically Evaluating Content Selection in Summarization without Human Models, Proceedings of EMNLP 2009. 2. Annie Louis and Ani Nenkova, Predicting Summary Quality using Limited Human Input, Text Analysis Conference (TAC) 2009. 3. Annie Louis and Ani Nenkova, Summary Evaluation without Human Models, Poster, Text Analysis Conference (TAC) 2008. An information-theoretic measure, Jensen Shannon divergence between vocabulary distributions of the input and summary texts was found to produce the best predictions of summary quality. System scores produced by this metric obtain correlations with pyramid scores in the range of 0.89 (TAC 2008 data) and 0.74 (TAC 2009). More details about the performance of various features can be found in the papers above. == INSTALLING == Uncompress the file SIMetrix-v1.tar.gz by typing gunzip < SIMetrix-v1.tar.gz | tar xvf - The file ieval.jar contains the class files. The source of the java programs are in the directory SIMetrix-v1/src. Data for a sample evaluation is present in SIMetrix-v1/sampleEval Auxiliary files necessary for computing features are in SIMetrix-v1/data == SETTING UP AN EVALUATION == An evaluation job for the tool is specified by two files. 1 => mappings.txt This file is used to specify the input and summaries that need to be compared. Each line of the file must have the following format. Input_Name System_Id Path_To_Input_File(s) Path_To_Summary where Input_Name: specifies the id of the source document. System_Id: is the id of the system that produced the summary. Path_To_Input_Files: is the absolute path to the source document. When the source is multi-document, this entry can be path to a folder. Path_To_Summary: is the absolute path to the file containing the summary. See sampleEval/sampleMappings.txt for an example mappings file. 2 => config.txt This file contains the settings for various options. These options control preprocessing like stemming and stop word removal as well as which features must be computed. In our experiments, comparing stemmed vocabularies produced similarity scores that had higher agreement with manual evaluations. Porter stemmer is used to perform stemming. The features are divided into 4 classes which can be turned on/off during evaluation. a) divergence - Kullback Leibler and Jensen Shannon divergence b) frequencyFeatures - Unigram and multinomial summary probabilities under a frequency based generative model. c) cosineOverlap - Cosine similarity between input and summary d) topicWordFeatures - 3 topic signature based features. Fraction of input's topic words present in summary, proportion of summary tokens composed of topic words, cosine overlap of input's topic words and all summary words. See sampleEval/config.example for more details. == AUXILARY DATA == AUX1 => stoplist.txt A stopword list. The list included with the tool is a freely available list of frequent words assembled for the SMART information retrieval system at Cornell. AUX2 => backgroundCorpusFreqCounts For the topic signature based features an extra file must be specified in the config file which supplies occurrence counts for words computed from a large background corpus. Topic words are computed by comparing frequencies of words in a given text with that in this background corpus. A frequency count file is included with the tool - "data/bgFreqCounts.unstemmed.txt". It contains frequency information for all content words (no stopwords) from a subset of 5000 documents from the New York Times section of the GigaWord Corpus. This file can be replaced with counts from a different document collection if needed. The file must be in the following format. Each line has a word and its count separated by a space in between. The background corpus will be automatically stemmed and frequencies recomputed for stems when the performStemming=Y option is specified in the config file. AUX3 => backgroundIdfUnstemmed (or backgroundIdfStemmed) These options list the idf values to use while computing cosine overlap. These should be computed from a large corpus and formatted as follows. The first line in the file must contain the total number of documents in the background corpus. The rest of the file has a word and idf value (separated by space) on each line. When the stemming option is specified the idf values for stems needs to the supplied via the 'backgroundIdfStemmed' option. Else use the unstemmed option to supply the idf values for unstemmed words. With this tool we supply these files computed over a corpus of 5000 documents from the GigaWord Corpus. You may change the file to another one if necessary. == FORMAT OF INPUT AND SUMMARIES == The program is built to evaluate plain text input documents and their summaries. The documents must contain no HTML tags etc. but no further processing is required, eg. segmentation. Both single and multidocument inputs can be evaluated by the tool. A multidocument input can be specified by its folder name. Some example inputs and summaries are in sampleEval/sampleInputs and sampleEval/sampleSummaries. There are 3 multidocument inputs in sampleInputs which were gathered from the Newsblaster system's archives http://newsblaster.cs.columbia.edu/. Five 150 word summaries are available for each of these inputs in sampleSummaries. The systems are as follows. fb - A baseline system. The first sentences of each document starting from most recent are added upto the length limit. rb - Another baseline where the lead sentences only from the most recent document are added upto the 150 word limit. nb - Summary from newsblaster system me - Summary produced by the MEAD summarizer http://www.summarization.com/mead/ ts - A topic signature based summarizer. Ranks sentences based upon ratio of number of distinct topic words present to number of distinct content words in the sentence. (Conroy, Schlesinger, J. and O.Leary, Topic-focused multi-document summarization using an approximate oracle score, COLING 06) == HOW TO RUN == Add the jar file ieval.jar to the classpath. export CLASSPATH=$CLASSPATH:absolute_path_to_SIMetrix.jar In the directory containing the config and mappings files, type java -Xmx1000m InputBasedEvaluation mappings.txt config.txt To run the sample evaluation: export CLASSPATH=$CLASSPATH:abs_path_to_SIMetrix.jar cd into sampleEval directory. Type java -Xmx1000m InputBasedEvaluation sampleMappings.txt config.example == OUTPUT == NOTE: Higher divergence scores indicate poor quality summaries. For the other metrics higher scores indicate better summaries. The tool generates two files in the same directory as that containing the mappings file. mappings.txt.ieval.micro: contains the scores for each evaluation pair specified in the mappings.txt file. A line contains the inputname, system id and the values for all features specified in the config file. mappings.txt.ieval.macro: contains the scores at system level. This is obtained by averaging the micro scores for each system over all the different inputs. == FEATURE NAMES IN OUTPUT FILE == inputId - input name as in mappings file sysId - system id as in mappings file KLInputSummary - Kullback Leibler divergence between input and summary KLSummaryInput - Kullback Leibler divergence between summary and input. Since KL divergence is not symmetric, the features are computed both ways Input-Summary and Summary-Input. Both above features use smoothing. unsmoothedJSD - Jensen Shannon divergence between input and summary. No smoothing. smoothedJSD - A version with smoothing. cosineAllWords - Cosine similary between all words in the input and summary. percentTopicTokens- Proportion of tokens in the summary that are topic words of the input. fractionTopicWords- The fraction of topic words of the input that appear in the summary. topicWordOverlap - Cosine similarity using all words of the summary but only the topic words from the input. unigramProb - The probability of a word in the input is assumed to represent how likely it is for the word to be emitted into a summary. The likelihood of the summary content is computed under this assumption. This feature is the unigram probability of the summary. multinomialProb - Summary probability as above. Multinomial version. NOTE: Higher divergence scores indicate poor quality summaries. For the other metrics higher scores indicate better summaries.