CIS 120 Homework 5 - SpellCheck

Due Wednesday , October 28th, 2009 at 11:00am. There are five required problems in this homework.

Important notes:

Table of Contents

  1. Introduction
  2. Dictionary
  3. Corrections
  4. Document
  5. Reversible Maps
  6. Spell Checker
  7. Extra Credit

Introduction

All modern word processors, along with increasingly many other programs, have a built in spell check function that allows even the worst typist some level of efficiency. To complete this homework you will be coding a rudimentary spell check program based on a dictionary and list of corrections for misspelled words. Your program will ask the user to correct any misspelled words in a file, offering suggestions for common typos.

This homework is designed for you to gain practice with Java I/O (see java.io) and the Java Collections Framework (see java.util). The I/O package is very large with lots of different classes to choose from -- we recommend taking a look at the following classes:

Input
Output

As always, please post any questions, comments or clarifications to the Bulletin Board and make use of office hours.

We've zipped up the testing data and sources for the tests, and provide the usual jar file as well. The data/ directory in the zip should be saved to the directory you expect to run the tests from. For Eclipse, this is in the root directory of your project; for example, if your workspace is ~/workspace and the project is named hw5, then you should have ~/.workspace/hw5/data. Below are brief explanations of some of the files in the zip.


Problem 1: Building the Dictionary (15 points; file to submit: Dictionary.java)

Before we can actually program the spellchecker we need a class that reads in words from the dictionary, stores them in memory, and provides efficient access to them. Remember that the Java libraries have plenty of collection classes; choosing one of them is a good first step here. Make sure to look at the Dictionary javadocs to get some idea of how to approach this problem. Sun's documentation on file IO may also be helpful, especially FileReader and BufferedReader. None of the IOExceptions that may result from the execution of this part of the program need to be caught.

The most important method is isValid, which takes a word string and returns a boolean stating whether that word is a valid word according to our dictionary.

    //Test whether a word is valid (that is, present in the Dictionary).
    boolean isValid(java.lang.String word)

As this is a rudimentary program, we will make some simplifying assumptions.

Note: The unit tests use the smaller test dictionary and corrections file to speed up the testing process, but make sure to test your code with the full files.

Warning:If the Scanner class is part of your solution for Problem 1, it is likely that a number of Unicode-specific problems may be encountered. A number of these can be avoided by initializing scanner as new Scanner( new File(filename), "ISO-8859-1" ). Use of the Scanner class for this or other problems is not required.


Problem 2: Building the Database of Misspelled Words (15 points; file to submit: Corrections.java)

We also need a way to handle misspellings and map each misspelled word to its correct counterpart. The class Corrections contains a constructor that is responsible for parsing the file that contains suggested corrections and the method getMap that looks for a suggested spelling for an incorrectly spelled word.

The format of the input file is very simple: each line contains an incorrectly-spelled word, followed immediately by a comma, followed immediately by the correct spelling. If the file does not conform to this format, the constructor for the class should throw an IOException.

How to handle capitalization:

As always, the javadocs and source for the unit tests may help. Again, none of the IOExceptions that may result from the execution of this part of the program need to be caught.


Problem 3: Reading the Document (15 points; file to submit: Document.java)

Now that we have the dictionary and list of misspellings, we can now turn our attention to the document itself. The purpose of this class is to read the input file and split it up into tokens with the method getNextToken. There are two sorts of tokens: words and nonwords. Word tokens only contain word characters. Nonwords do not contain any word charachers. Following the assumptions made by the Dictionary class above, a word character is either a letter (see Character.isLetter) or an apostrophe.

For example, the text "They aren't brown." should produce 6 tokens:

    "They"
    " "    
    "aren't"
    " " 
    "brown"
    "."

The key method in Document is getNextToken:

    //return next token, or null if at the end of document
    public String getNextToken() throws java.io.IOException


Problem 4: Reversing a Map (15 points; file to submit: ReversibleMap.java)

There are times when working with Maps where a reverse-lookup is prudent. For example: the Corrections class in Problem 2 will tell us, for every misspelling, what the correct word is likely to be. But it does not give us an easy way to answer, "What are all the common misspellings of some given word?" To do so, we would have to reverse the map to use correctly-spelled words as keys and a Set of popular misspellings as values. (Why do we need a set?)

For example, consider a Map with the values:

    A --> 1
    B --> 2
    C --> 3
    D --> 1
    E --> 2
    F --> 3
The reverse of this map would be a map with
    1 --> {A, D}
    2 --> {B, E}
    3 --> {C, F}

To generalize this approach, we have created an interface IReversibleMap. Any Map which implements IReversibleMap must implement the reverse method, described in IReversibleMap as:

    //Takes the existing map of K to V elements and reverses it,
    //plotting for every V any K that leads to it in the original map.
    public Map<V, Set<K>>reverse();

Your task is to create the class ReversibleMap<K,V> which implements IReversibleMap. You can extend HashMap<K,V>; this will take care of most of the IReversibleMap interface. All that remains for you to do after extending HashMap is to implement the reverse method. The keySet function of Maps provides a handy way to iterate over all the keys of a map, so make sure to read the documentation for it.


Problem 5: Spell Checker (40 points; file to submit: SpellCheck.java)

Finally we need to put all of our classes together. This class should take in the filepaths of all text files (dictionary, misspellings, input and output documents) and run the entire spell check from a single instance of an object. The SpellCheck object will then create a Dictionary object from the dictionary filepath, a Corrections object from the misspellings filepath, a Document object from the input filepath, and a Writer object from the output filepaths.

Non-word tokens should be copied directly from the original Document into the newly corrected file. This includes numbers, symbols, and whitespace. Only word tokens should be checked for spelling and (possibly) replaced. We parameterize the InputStream for users to enter input to make automated testing (and grading!) possible. While IOExceptions don't need to be caught, you should handle invalid user input appropriately. For example, if you give a user 3 options and they enter an invalid selection, you should notify them and prompt for a selection again.

Pay attention to capitalization - a reminder: for checking the spelling of a word, capitalization should be ignored. However, a correction should map the new word as having the same first-letter capitalization as the original word, with the rest of the word lowercase. User provided replacement words should retain the capitalization the user provided.

Sample Command Line Interaction

  $ java SpellCheck data/misspellings.txt data/dictionary.txt data/Gettysburg.txt data/Gettysburg-out.txt

  The word: "Fuor" is not in the dictionary. Please enter the number
  corresponding with the appropriate action
  0: Ignore and continue
  1: Replace with another word
  2: Replace with "Four"
  //Entered 2


  The word: "thsi" is not in the dictionary. Please enter the number
  corresponding with the appropriate action
  0: Ignore and continue
  1: Replace with another word
  2: Replace with "this"
  //Entered 2

  The word: "Libert" is not in the dictionary. Please enter the number
  corresponding with the appropriate action
  0: Ignore and continue
  1: Replace with another word
  //Entered 1

  Please enter the new word:
  //Entered "Liberty"

  The word: "civl" is not in the dictionary. Please enter the number
  corresponding with the appropriate action
  0: Ignore and continue
  1: Replace with another word
  //Entered 1

  Please enter the new word:
  //Entered "civil"

  The word: "thta" is not in the dictionary. Please enter the number
  corresponding with the appropriate action
  0: Ignore and continue
  1: Replace with another word
  2: Replace with "that"
  //Entered 2

  The word: "honored" is not in the dictionary. Please enter the number
  corresponding with the appropriate action
  0: Ignore and continue
  1: Replace with another word
  //Entered 0

  The word: "higly" is not in the dictionary. Please enter the number
  corresponding with the appropriate action
  0: Ignore and continue
  1: Replace with another word
  //Entered 1

  Please enter the new word:
  //Entered "highly"

Document completed

To test it out yourself, you can either run from the command line or within Eclipse. The interaction above shows a sample command line. Within Eclipse, you must set the command line arguments in a slightly different way. Create a new launch configuration via Project -> Preferences -> Run/Debug Settings -> New... -> Java Application. While creating it, be sure to click on the "Arguments" tab, and enter the file names there; for example,

data/misspellings.txt data/dictionary.txt data/Gettysburg.txt data/Gettysburg-out.txt

Extra Credit (file to submit: AutomaticCorrections.java)

Having a list of common typos is only one way to come up with corrections. A less heuristic way to come up with corrections is to look for words that are "similar" to the invalid word. One way to measure similarity is using the Levenshtein edit distance, which is defined as the smallest number of edits needed to transform one word into another. Edits involve adding, deleting, or replacing a single character in the word.

For the extra credit, we ask you to implement the Correctable interface again, but this time, the getMap function should return a word that is within edit-distance one of the given invalid word. We will provide you the class' constructor with a dictionary file listing valid words.

It would be helpful (but is not required) to write a function which takes a single word and generates all of the words that are edit-distance one from that. For example, given "grass", it would generate words like

Note that if a given invalid word could be corrected ambiguously—that is, it is edit-distance one from multiple valid words—the getMap function should return null!

Some of the tests may require that you run in a UTF-8 locale. Practically speaking, the simplest way to achieve this is to run the tests with

LANG=en_US.utf8
in your environment. If you're using Eclipse, you can change this in the Environment tab of the run configuration editor, found at Run -> Run configurations....