CIS 120 Homework 5

Due Friday, March 7, 2008 at 5 pm. There are three required problems in this homework.

(credit: xkcd.com)

Introduction

All modern word processors, along with a growing number of other programs, have a built-in spell check function that allows even the worst typist some level of efficiency. For this homework you will write a rudimentary spell check program based on a dictionary and a list of misspelled words, both of which are provided. Because the program is rudimentary, we make some simplifying assumptions. We won't deal with apostrophes or capitalization; however, your spell checker will need to handle other types of punctuation and newlines. Hyphens are a special case, but you can assume that a hyphen is only used to separate two whole words, as in the Gettysburg Address below. If you have any other questions regarding punctuation, please feel free to ask them on the Bulletin Board. We break the spell checker down into three separate parts, each with its own interaction history, so you can debug and test at each stage of the homework.

This homework is designed to test both your knowledge of I/O and your ability to choose appropriate data structures for each step of the assignment. (Hint: it is possible, and probably most efficient, to not use arrays at all for this homework.) As always, please post any questions, comments, or requests for clarification to the Bulletin Board, and try to make use of office hours.

Final notes: Convert all strings to lowercase before comparison. Output capitalization should match input capitalization. Suggestions for replacement do not need to match the case of the misspelled word.

Here are some files you'll need:

A note on the files: dictionary.txt lists as correct some words that misspellings.txt offers corrections for; in other words, the two files aren't quite in sync. To make sure this doesn't become a problem, always check a word against dictionary.txt first, and check misspellings.txt only if the word is not there.
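The lookup order described above can be sketched roughly as follows. This is only an illustration: the class name LookupOrder, the method classify, and the string results it returns are hypothetical and not part of the required API, and the in-memory Set and Map stand in for whatever your Dictionary and Corrections classes use internally.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical helper showing the lookup order; not part of the assignment's API. */
final class LookupOrder {

    /**
     * Returns "valid" if the word is in the dictionary, "suggest:<word>" if only
     * the misspellings map knows it, and "unknown" otherwise.
     */
    static String classify(String word, Set<String> dictionary, Map<String, String> corrections) {
        String key = word.toLowerCase();          // compare in lowercase
        if (dictionary.contains(key)) {
            return "valid";                       // dictionary wins, even if a correction exists
        }
        String fix = corrections.get(key);        // consulted only after the dictionary misses
        return fix != null ? "suggest:" + fix : "unknown";
    }

    public static void main(String[] args) {
        // Made-up data; "colour" appears in both collections to mimic the out-of-sync files.
        Set<String> dictionary = Set.of("colour", "this");
        Map<String, String> corrections = Map.of("thsi", "this", "colour", "color");

        System.out.println(classify("Colour", dictionary, corrections));
        System.out.println(classify("thsi", dictionary, corrections));
        System.out.println(classify("libert", dictionary, corrections));
    }
}
```

Because the dictionary is consulted first, a word that appears in both files is treated as correct and never offered a replacement.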

Problem 1: Building the Dictionary and the Database of Misspelled Words (30 points; files to submit: Dictionary.java Corrections.java)

Before we can actually program the spell checker, we need a class that reads the words from the dictionary and stores them in memory in a way that allows very fast lookup. If you had to scan over 7 MB of data to see whether a given word is correct, and then repeat that 250-500 times (depending on the number of words in your document), you would have to take a rather long break each time you tested your code. We also need a way to read in the common misspellings file listed above and map each misspelled word to its correct counterpart. Make sure to look at the javadocs and the sample interactions below to get some idea of how to approach this problem. Any IOExceptions that result from running this program may be thrown and do not need to be caught.

HINT: While you test your dictionary class, you may wish to make a smaller input file initially to make debugging easier.

Sample Interactions (Dictionary.hist)

Welcome to DrJava.
> Dictionary dict = new Dictionary("dictionary.txt")
> dict.isValid("java")
true
> dict.isValid("cse")
true
> dict.isValid("color") //Note the Dictionary file we use contains the British spellings of words
false
> dict.isValid("colour")
true
> dict.isValid("computer")
true
> dict.isValid("sceince")
false
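One possible way to get fast lookups is a hash-based set. The sketch below is not the required Dictionary class: the name WordSetSketch and the helper ofText are made up for illustration, and it assumes the word list has one word per line (check the real dictionary.txt before relying on that). It reads from a Reader so it can be tried on a small in-memory list, in the spirit of the hint above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a set-backed word list; assumes one word per line. */
final class WordSetSketch {
    private final Set<String> words = new HashSet<>();

    WordSetSketch(Reader source) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase();   // store in lowercase
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
    }

    /** Convenience for experiments: builds the set from an in-memory string. */
    static WordSetSketch ofText(String text) {
        try {
            return new WordSetSketch(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Expected constant-time membership test, independent of the list's size. */
    boolean isValid(String word) {
        return words.contains(word.toLowerCase());
    }

    public static void main(String[] args) {
        WordSetSketch dict = WordSetSketch.ofText("java\ncolour\ncomputer\n");
        System.out.println(dict.isValid("colour"));
        System.out.println(dict.isValid("color"));
    }
}
```

To read a real file, a java.io.FileReader could be passed to the same constructor. The file is read once when the object is constructed, so each later isValid call avoids rescanning the data.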

Sample Interactions (Misspelling.hist)

Welcome to DrJava.
> Corrections c = new Corrections("misspellings.txt");
> c.getMap("thsi")
"this"
> c.getMap("autor")
"author"
> c.getMap("practial")
"practical"
> c.getMap("cmputer")
null
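The misspelling lookup maps naturally onto a hash map. The sketch below is an illustration only: the class name CorrectionMapSketch is made up, and the "wrong->right" line format is an assumption about misspellings.txt that you should verify against the actual file. As in the interactions above, a word with no known correction yields null.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; ASSUMES each line looks like "wrong->right". */
final class CorrectionMapSketch {
    private final Map<String, String> fixes = new HashMap<>();

    CorrectionMapSketch(Iterable<String> lines) {
        for (String line : lines) {
            int arrow = line.indexOf("->");
            if (arrow < 0) {
                continue;                          // skip lines not in the assumed format
            }
            String wrong = line.substring(0, arrow).trim().toLowerCase();
            String right = line.substring(arrow + 2).trim();
            if (!wrong.isEmpty() && !right.isEmpty()) {
                fixes.put(wrong, right);
            }
        }
    }

    /** Returns the stored correction, or null if the word is not in the map. */
    String getMap(String misspelled) {
        return fixes.get(misspelled.toLowerCase());
    }

    public static void main(String[] args) {
        CorrectionMapSketch c = new CorrectionMapSketch(List.of("thsi->this", "autor->author"));
        System.out.println(c.getMap("thsi"));
        System.out.println(c.getMap("cmputer"));
    }
}
```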

Problem 2: Reading the Document (30 points; file to submit: Document.java)

Now that we have the dictionary and the list of misspellings, we can turn our attention to the document itself. This class should be able both to read a token from a file (getNextToken) and to output a string to another file (outputString). It is important to realize that this assignment works in terms of tokens, each of which is either a word or a piece of punctuation/whitespace. In the last part (Problem 3, see below), all you'll need to do is check whether a given token is actually a word and, if so, check it against the dictionary. Again, all IOExceptions can be thrown and don't need to be caught.

In the interactions below, after you close the file, Gettysburg_new.txt should be a new text file with the string "seven" contained in it.

Sample Interactions (note that you may get different tokens than those below) (Document.hist)

Welcome to DrJava.
> Document d = new Document("Gettysburg.txt", "Gettysburg_new.txt")
> d.getNextToken()
""
> d.getNextToken()
" "
> d.getNextToken()
" "
> d.getNextToken()
" "
> d.getNextToken()
" "
> d.getNextToken()
"Four"
> d.getNextToken()
" "
> d.getNextToken()
"score"
> d.getNextToken()
" "
> d.getNextToken()
"and"
> d.getNextToken()
" "
> d.getNextToken()
"seven"
> d.outputString("seven")
> d.closeInput()
> d.closeOutput()
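One way to split input into word and non-word tokens is to read one character at a time with a single character of lookahead. The sketch below is only an illustration under stated assumptions: the class TokenReaderSketch and its tokenize helper are hypothetical, a word is taken to be a maximal run of letters, every other character becomes its own one-character token, and null signals end of input. Your tokens may legitimately differ, as noted above.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer sketch: letter runs are words, anything else is a 1-char token. */
final class TokenReaderSketch {
    private final PushbackReader in;

    TokenReaderSketch(Reader source) {
        in = new PushbackReader(source);           // allows pushing back one character
    }

    /** Returns the next token, or null at end of input. */
    String getNextToken() throws IOException {
        int c = in.read();
        if (c == -1) {
            return null;
        }
        if (!Character.isLetter(c)) {
            return String.valueOf((char) c);       // punctuation or whitespace
        }
        StringBuilder word = new StringBuilder();
        while (c != -1 && Character.isLetter(c)) {
            word.append((char) c);
            c = in.read();
        }
        if (c != -1) {
            in.unread(c);                          // first non-letter belongs to the next token
        }
        return word.toString();
    }

    /** Convenience for experiments: tokenizes an in-memory string. */
    static List<String> tokenize(String text) {
        try {
            TokenReaderSketch reader = new TokenReaderSketch(new StringReader(text));
            List<String> tokens = new ArrayList<>();
            for (String t = reader.getNextToken(); t != null; t = reader.getNextToken()) {
                tokens.add(t);
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Four score, well-known.\n"));
    }
}
```

Keeping whitespace and punctuation as tokens means the output document can be rebuilt by writing every token back out in order, replacing only the misspelled words.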

Problem 3: Spell Checker (40 points; file to submit: SpellCheck.java)

Finally, we need to put all of our classes together. This class should take in the filepaths of all text files (dictionary, misspellings, input and output documents) and run the entire spell check from a single instance of an object. The SpellCheck object creates a Dictionary object from the dictionary filepath, a Corrections object from the misspellings filepath, and a Document from the input and output filepaths. We pass System.in as the InputStream from which users enter input; we do not hard-code it because doing so would make it difficult to test your classes (see the interactions below for more details). While IOExceptions don't need to be caught, you should handle user input appropriately: if you give a user 3 options and they enter an invalid selection, you should notify them and prompt for a selection again.
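A re-prompting loop over an arbitrary InputStream might look like the sketch below. The class MenuPromptSketch, the method readChoice, and the wording of the error message are all hypothetical; the sketch assumes the user types one answer per line and that options are numbered from 0.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Hypothetical sketch of validating a numbered menu selection. */
final class MenuPromptSketch {

    /** Reads lines until one is an integer in [0, optionCount); re-prompts otherwise. */
    static int readChoice(Scanner in, PrintStream out, int optionCount) {
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            try {
                int choice = Integer.parseInt(line);
                if (choice >= 0 && choice < optionCount) {
                    return choice;
                }
            } catch (NumberFormatException e) {
                // not a number: fall through to the re-prompt below
            }
            out.println("Invalid selection. Please enter a number from 0 to "
                    + (optionCount - 1) + ".");
        }
        throw new IllegalStateException("input ended before a valid selection was made");
    }

    public static void main(String[] args) {
        // Scripted input stands in for System.in: one bad answer, then a good one.
        InputStream scripted =
                new ByteArrayInputStream("5\n2\n".getBytes(StandardCharsets.UTF_8));
        Scanner in = new Scanner(scripted);        // one Scanner per stream
        int choice = readChoice(in, System.out, 3);
        System.out.println("Chose option " + choice);
    }
}
```

Because the method never mentions System.in, the same code can be driven by the keyboard, a file, or an in-memory stream. Creating a single Scanner for the whole stream matters, since a Scanner may buffer input beyond the line it returns.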

Sample Interactions (SpellCheck.hist)

Welcome to DrJava.
> SpellCheck s = new SpellCheck("misspellings.txt","dictionary.txt")
> s.checkDocument("Gettysburg.txt", System.in, "Gettysburg_new.txt")

The word: "thsi" is not in the dictionary. Please enter the number corresponding with the appropriate action
0: Ignore and continue
1: Replace with another word
2: Replace with "this"
//Entered 2

The word: "libert" is not in the dictionary. Please enter the number corresponding with the appropriate action
0: Ignore and continue
1: Replace with another word
//Entered 1

Please enter the new word:
//Entered "liberty"

The word: "civl" is not in the dictionary. Please enter the number corresponding with the appropriate action
0: Ignore and continue
1: Replace with another word
//Entered 1

Please enter the new word:
//Entered "civil"

The word: "thta" is not in the dictionary. Please enter the number corresponding with the appropriate action
0: Ignore and continue
1: Replace with another word
2: Replace with "that"
//Entered 2

The word: "honored" is not in the dictionary. Please enter the number corresponding with the appropriate action
0: Ignore and continue
1: Replace with another word
//Entered 0

The word: "higly" is not in the dictionary. Please enter the number corresponding with the appropriate action
0: Ignore and continue
1: Replace with another word
//Entered 1

Please enter the new word:
//Entered "highly"

Document completed
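The overall flow of a run like the one above can be sketched as a loop over tokens. This is a rough illustration rather than the required SpellCheck design: the class CheckLoopSketch and the onMisspelled callback are hypothetical stand-ins for the user prompt, and a token is treated as a word if it starts with a letter.

```java
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of the main checking loop over already-split tokens. */
final class CheckLoopSketch {

    /**
     * Copies every token to the output unchanged, except words missing from the
     * dictionary, which are replaced by whatever onMisspelled returns.
     */
    static String check(List<String> tokens, Set<String> dictionary,
                        UnaryOperator<String> onMisspelled) {
        StringBuilder out = new StringBuilder();
        for (String token : tokens) {
            boolean isWord = !token.isEmpty() && Character.isLetter(token.charAt(0));
            if (isWord && !dictionary.contains(token.toLowerCase())) {
                out.append(onMisspelled.apply(token));   // e.g. the result of prompting the user
            } else {
                out.append(token);                       // original capitalization preserved
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Four", " ", "scroe", " ", "and", " ", "seven");
        Set<String> dictionary = Set.of("four", "score", "and", "seven");
        // The callback always answers "score" here; a real run would ask the user.
        System.out.println(check(tokens, dictionary, word -> "score"));
    }
}
```

Only the lookup is done in lowercase; valid tokens are written out exactly as read, which keeps the output capitalization the same as the input.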

Make sure that the interactions also work with an InputStream other than System.in, namely a FileInputStream. Given this sample_input.txt, these interactions should work:

> SpellCheck s = new SpellCheck("misspellings.txt","dictionary.txt")
> s.checkDocument("Gettysburg.txt", new java.io.FileInputStream("sample_input.txt"), "Gettysburg_new.txt")