CIS 120e: SpellChecker

All modern word processors, along with increasingly many other programs, have a built in spell check function that allows even the worst typist some level of efficiency. To complete this homework you will be coding a rudimentary spell check program based on a dictionary and list of corrections for misspelled words.

Your program will look at each word in a file, and if the word is not in the dictionary, ask the user to either a) leave it alone, b) change it to one or more suggested new words, or c) enter a correction of their own.

A few notes before we begin:

Recall: I/O and Collections

This homework is designed for you to gain practice with Java I/O (see java.io) and the Java Collections Framework (see java.util). The I/O package is very large with lots of different classes to choose from -- we recommend taking a look at the following classes:

Input Output

* Scanner is technically in java.util (not java.io). You will also find Lists, Maps and Sets helpful for this assignment. These collections resemble those we saw in OCaml.


Problem 0: A High Level Look

At a high level, a user (you) will run the program using SpellCheckerRunner. This creates the dictionary, correctors, opens the relevant files, and initializes a spell checker object which then does the actual spell checking process.

In addition to the provided files described above, we also provide several test classes (DictionaryTest, TokenScannerTest, etc.) and sample text files to run your code with.


Problem 1: Building the Dictionary

(File to submit: Dictionary.java)

Before we can actually program the spellchecker we need a class that reads in words from the dictionary, stores them in memory, and provides efficient access to them. Choosing the right data structure is important: if you have to check over 7 MB to see if a given word is correct and then need to repeat it 250-500 times (depending on the number of words in your document) you will have to take a rather long break each time you test your code.  Make sure to look at the javadocs and the unit test usage to get some idea of how to approach this problem. All IOExceptions that may result from the execution of this part of the program do not need to be caught.

As this is a rudimentary program, we will make some simplifying assumptions.

1.1 Add Test Cases

First open DictionaryTest.java. We only provide a few test cases. Using the given test cases as a template, look at the documentation about what the Dictionary class does, and add test cases for the following situations. By a test case, we mean that you should make the certain scenario happen, and verify that the response matches what you'd expect (from the Javadocs) using methods like assertEquals. Each bullet below should be its own test method:

Are there any other test situations which may occur? We've left out some scenarios from the list above. You should write additional test cases for any other situations you think of.

TIP: Don't forget about null! A null word is never in the dictionary.

1.2 Dictionary.java

Running the DictionaryTest will fail because although you've outlined the expected behavior, you haven't written the Dictionary class itself! Now create a file called Dictionary.java and implement the class.

You should run your tests, find what fails, and correct the code until your test cases pass. Note: When you attempt to run the unit test file, it may warn you: "Errors exist in required project(s) ... Proceed with launch?". You should hit "Proceed". The error is occuring because you haven't written the files for the next steps in the assignment, and the provided tests for those are looking for those missing files.

IMPORTANT: You should use BufferedReader and FileReader to read the file; do not use a Scanner. (An undocumented behavior of Scanners is that they stop processing if the file is very long; the dictionary.txt file we test with is long enough to trigger this behavior!)

Also: Always remember to close your input streams! If you find a test is not running, you may need to delete the file or exit the Java process in order to manually close the stream.

It is critical that your Dictionary works properly before moving on! Make sure that the method signatures match those in the Javadoc exactly.


Problem 2: Building Correctors

(File to submit: FileCorrector.java and SwapCorrector.java)

Suppose a word doesn't occur in a dictionary. How could we fix it?

Corrector is an abstract class where the getCorrections method is abstract. Any subclass of Corrector must implement this method, and thus is able to provide corrections for a given misspelled word. We will implement two different kinds of Correctors:

  1. FileCorrector - uses a reference text file containing common misspellings to provide suggestions for a misspelled word
  2. SwapCorrector - uses a Dictionary, and suggests any words from it that are a single "swap" from the given incorrect word. For example, it might suggest "this" if the incorrect word was "thsi".

2.1 Adding Tests, Implementing FileCorrector.java

First look closely at the documentation for FileCorrector, and add the following test cases (plus your own) to FileCorrectorTest.java:

When finished writing test cases, then write the FileCorrector class and ensure it passes your tests.

TIP: You may want to look at the indexOf and substring methods of String to help you extract the incorrect and correct words from any given line.

TIP: This tip applies to writing both File and SwapCorrector! It's totally like buy one get one free!

The getCorrections method requires that the suggested corrections have the same case as the given incorrect word. Thus, if "pease" was the incorrect word, we should suggest the set {"peas", "peace"}, but if either "PEASE" or "Pease" or "PeAsE" is the incorrect word, we should suggest back the set {"Peas", "Peace"}. (So, only first letter capitalization is relevant.)

To help with the capitalization correction, the abstract class Corrector has a helper method matchCase that FileCorrector and SwapCorrector can both use. Notice how putting this in the parent class allows us to avoid duplicating code in the two child classes.

A null word yields no suggestions for either type of Corrector.

2.2 Adding Tests, Implementing SwapCorrector.java

Repeat the above procedure for implementing SwapCorrector: look at the Javadocs and add necessary test cases, and then implement the class itself.

You might find that you can re-use some test cases from the FileCorrector, however, many will be different because the SwapCorrector executes different logic to generate suggestions.

TIP: Remember that you can create your own dictionary files to help with testing! For example, if a test case asks for corrections for the incorrect word "abcd", you may want to create your own dictionary file that contains "bacd", "acbd", etc as words and ensure your method correctly proposes the right swaps. Swaps are always between adjacent letters.

The results from calling getCorrections should only yield words that are precisely ONE swap away from the input word. Thus, the incorrect word itself should never appear in the result set (even if it happens to be in the dictionary). See our provided test case that exercises this situation. If it's not clear to you why this result occurs, ask a TA! :-)

All IOExceptions that may result from the execution of the Correctors do not need to be caught.

Read the Javadocs CLOSELY for the Correctors and make sure to test all of the possible edge cases! If you have questions about what is specified, ask for clarification on the list. We test some pretty subtle edge cases in the automatic grader :)


Problem 3: Reading Document Tokens

(File to submit: TokenScanner.java)

Now that we have the dictionary and two ways to correct words, we can now turn our attention to the document itself. The purpose of this class is to read the input file and split it up into tokens as an iterator.

There are two sorts of tokens: words and non-words. Word tokens only contain word characters. Nonwords do not contain any word characters. Following the assumptions made by the Dictionary class above, a word character is either a letter (see Character.isLetter) or an apostrophe.

For example, the text "They aren't brown, are they?" should produce the tokens:

"They" " " "aren't" " " "brown" ", " "are" " " "they" "?"

Green tokens are word tokens, yellow tokens are non-word tokens. Notice how we always alternate between the two types of tokens. We have provided you with a class called Token that stores (a) a true/false variable indicating whether it is a word token or not and (b) the token string itself. The TokenScanner will parse the document and return Tokens instead of Strings.

3.1 TokenScanner.java

You should now work on the class TokenScanner, which implements the interface Iterator<Token>. To fulfill the interface, a TokenScanner must provide the following:

As before, you should write unit tests for this class first, and then write TokenScanner as a means to pass those tests. Again, any IOExceptions can be thrown by this class and don't need to be caught.

TIP: You may want to look at the Histogram scanner written in class for an example of how to approach this problem.


Problem 4: Spell Checker

(File to submit: SpellChecker.java)

The end is near!

Finally, we need to put all of our classes together. A SpellChecker object takes in a Corrector and a Dictionary when constructed. When checkDocument is called, it should read tokens from the input document, spell-check any incorrect words, and write the output to the output file using a Writer object. Most of SpellChecker has already been written, except for the method that processes the document and does the checking.

4.1 checkDocument in SpellChecker.java

Complete the checkDocument method in SpellChecker.java, bearing in mind the following:

Sample Output/Interaction

Here is the output and interaction/functionality we expect to see when running the program:

// This uses a FileCorrector with misspellings.txt
$ java SpellCheckRunner Gettysburg.txt Gettysburg-out.txt dictionary.txt misspellings.txt

The word: "Fuor" is not in the dictionary. Please enter the 
number corresponding with the appropriate action:
0: Ignore and continue
1: Replace with another word
2: Replace with "Four"
3: Replace with "Furor"
// Entered 2

The word: "thsi" is not in the dictionary. Please enter the 
number corresponding with the appropriate action:
0: Ignore and continue
1: Replace with another word
2: Replace with "this"
// Entered 2

The word: "Libert" is not in the dictionary. Please enter the 
number corresponding with the appropriate action:
0: Ignore and continue
1: Replace with another word
// Entered 1
// Entered Liberty

The word: "civl" is not in the dictionary. Please enter the 
number corresponding with the appropriate action:
0: Ignore and continue
1: Replace with another word
// Entered 1
// Entered civil

The word: "thta" is not in the dictionary. Please enter the 
number corresponding with the appropriate action:
0: Ignore and continue
1: Replace with another word
2: Replace with "that"
// Entered that
Invalid input. Please try again!
// Entered 2

The word: "honored" is not in the dictionary. Please enter the 
number corresponding with the appropriate action:
0: Ignore and continue
1: Replace with another word
// Entered 0

The word: "higly" is not in the dictionary. Please enter the 
number corresponding with the appropriate action:
0: Ignore and continue
1: Replace with another word
// Entered 1
// Entered highly

Document completed
  

Your user display and options should match above. For example, do NOT number your options 1,2,3 instead of 0,1,2. We will test your code automatically with a file of user inputs.

Testing: Running in Eclipse

We provide an automated SpellCheckerTest, which uses a file as the "input" from the user. You should write more of these tests, and also run the program in Eclipse as follows:

  1. Go to Run > Run Configurations.
  2. Right click on Java Application and choose "New".
  3. Under "Main class", you should put SpellCheckerRunner.
  4. Switch from the "Main" tab to the "Arguments" tab.
  5. Under "Program arguments:", you can put the inputs to the runner. For example:
  6. You can give this configuration a name in the "Name:" bar at the top, so you can identify this from other runs.
  7. Hit "Apply" and "Run".
  8. The program will run in the "Console" tab, normally at the bottom below your code editors. (Your layout may be different.)

You can return to this configuration and change the arguments to run it with different input files, or to switch between different types of Correctors.

To edit a configuration, in the Run Configurations panel, select the named configuration under "Java Application" on the left side.

To re-run a configuration (after it has been set up), you can choose the configuration (as if you were editing it) and hit "Run". Alternatively, if you have recently run it, you can click the little arrow next to the green button at the top of Eclipse:

TIP: When you run the SpellChecker through Eclipse, you will need to refresh the Eclipse package in order to see the newly created output file.


Challenge Problem

If the drive to create the world's best spell checker has consumed you, you can implement a third type of corrector, Levenshtein.java. This kind of Corrector proposes words that are a Levenshtein distance of 1 away from the incorrect word. This is kudos-only and worth no points.