All modern word processors, along with increasingly many other programs, have a built in spell check function that allows even the worst typist some level of efficiency. In this homework you will implement a rudimentary spell check program based on a dictionary and a list of corrections for misspelled words.
For this project, you have only three free submissions. Subsequent submissions will cost you 5 points.
This means you won't be able to use the online testers to find your bugs! Writing your own tests will be critical to doing well.
An archive of files for this assignment are here: hw08.zip, but you should be set if you are using Codio.
Levenshtein
class
in the Javadocs.
This homework is designed for you to gain practice with Java I/O (see java.io) and the Java Collections Framework (see java.util). The I/O package is very large with lots of different classes to choose from -- we recommend taking a look at the following interfaces and classes:
Input |
---|
|
Output |
|
Collections |
*Scanner is technically in java.util (not java.io) because it implements
the Iterable
interface.
In order to create the spellchecker, you will implement the following classes:
TokenScanner
: parses a text file and breaks it up into tokens (words and nonwords)
Dictionary
: parses a text file and creates a collection of words in the dictionary
FileCorrector
and SwapCorrector
: given a misspelled word, suggests
possible corrections from a dictionary or collection of corrections.
SpellChecker
: links the TokenScanner, Dictionary and Corrector to interactively
spellcheck a document.
In addition to the provided files described above, we also provide several test classes (DictionaryTest, TokenScannerTest, etc.) and sample text files with which to test your code.
This diagram depicts how these files interact:TokenScanner.java
and
MyTests.java
for test casesSimilar to the Histogram demo from lecture, you will first start with a helper class for reading
the words in a file. The purpose of this class, which implements the
Iterator<String>
interface, is to read an input file and split it up into
strings, called tokens.
There are two sorts of tokens: words and non-words. Word tokens only contain word characters. Nonwords
do not contain any word characters. (A word character is either a letter (see
Character.isLetter) or an apostrophe.)
Given an input stream of characters, the TokenScanner
should
alternate between producing word and nonword tokens.
File Contents | Expected Tokens |
---|---|
They aren't brown, are they? | "They" " " "aren't" " " "brown" ", " "are" " " "they" "?" |
It's time 2 e-mail! |
"It's" " " "time" "\n2 " "e" "-" "mail" "!" |
A few things worth pointing out:
The class TokenScanner
, because it implements Iterator<String>
,
must provide the following methods (be sure to brush up on the javadocs):
hasNext()
- Tells the client whether any more tokens remain in the filenext()
- Returns the next token as a String
remove()
- You won't implement this functionality, so your remove method can
just contain:
throw new UnsupportedOperationException();
Read the TokenScanner javadocs, and then open up
TokenScannerTest.java
and examine the provided test cases.
Before you start your implementation, you should add more unit tests for this class to
MyTests.java
to exercise all of the intended behavior.
What tests should you add? Think about various situations that could occur for your
token scanner:
TokenScanner
, and you want to find out about such bugs before
you continue to the remainder of the assignment.
Next, implement TokenScanner
so that it
passes your tests. Note that the constructor is allowed to throw an
IOException
, but the next method should catch them and throw
NoSuchElementException
so that it conforms to the Iterator
signature.
TIP: You may want to look at the WordScanner written in class for an example of how to approach this problem. We strongly sugget you just use this as an example. Copying that code and trying to force it to work for this class will cause more headaches than help.
TIP: Remember to close your input streams in your test cases! If you find a test is not running, you may need to delete the file or exit the Java process in order to manually close the stream.
Dictionary.java
and MyTests.java
for
test casesBefore you can actually program the spellchecker you need a class that reads in words from the dictionary, stores them in memory, and provides efficient access to them. In this case, a dictionary is just a collection of 'legal' words, and does not store any definitions.
Begin by reading the Javadoc
for the Dictionary
class, to understand what it does.
Once you understand what the Dictionary class does, open DictionaryTest.java.
We only provide a few test cases in this file. Passing the given tests
doesn't mean your code works for all inputs! Using the given test
cases as a template, add test cases to MyTests.java
for the
following situations. By a
test case, we mean that you should make the certain scenario happen,
and check that the response matches what should happen, as
outlined in the Javadocs. Use methods like assertEquals,
assertTrue, assertFalse
to verify the correct behavior (Recall
that documentation for the assert methods can be found
here).
Each bullet below should be its own test method:
Are there any other test situations or corner cases which may occur? We've left out some scenarios from the list above. You should write additional test cases for any other situations you think of.
TIP: Don't forget about null!
Once you feel confident that your tests cover all aspects of the expected behavior, open up Dictionary.java. We have provided template methods, and you will have to fill in the implementation.
You should run your tests, find what fails, and correct the code until your test cases pass.
Choosing the right data structure to store the dictionary is
crucial. If the dictionary contains 1,000,000 words, you don't want to
be re-reading and searching through all of those words each
time isWord()
is called. The implementation of
the data structure is also important - different ways of storing data
make different operations efficient. Ask yourself:
ArrayList
class implements lists using resizeable arrays.
LinkedList
class implements lists using linked lists of nodes.
TIP: If your code takes more than two or three seconds to run all your Dictionary tests, it will take even longer to pass our tests, and this risks running into our timeout (note that if such a timeout happens, your tests will show as "failed".). In other words, you are using the wrong structure!
It is critical that your Dictionary works properly before moving on! Make sure that the method signatures match those in the Javadoc exactly.
FileCorrector.java
,
SwapCorrector.java
and MyTests.java
for
test cases.Suppose while reading a file we encounter a word that is not in a dictionary. How could we fix it?
Corrector
is an abstract class where the getCorrections
method is
abstract. An abstract class has some similarities to an interface; the biggest difference is
that abstract classes can have some implemented methods. Here matchCase
is
implemented, but getCorrections
is not. Any subclass of Corrector must implement
getCorrections
, and thus be able to provide corrections for a given misspelled
word. You will implement two different kinds of Correctors:
FileCorrector
- uses a reference text file containing common misspellings to
provide suggestions for a misspelled word.
SwapCorrector
- uses a Dictionary and suggests any words from it that are a
single "swap" from the given incorrect word. For example, it might suggest "this" if the
incorrect word was "thsi". A swap is two adjacent letters that are reversed in their
positions.
Read the documentation for FileCorrector, and then add
the following test cases (plus your own) to MyTests.java
:
The list above isn't comprehensive, so make sure to add your own additional test cases for any other behavior described in the Javadocs.
When you are finished adding test cases, write
the FileCorrector
class and ensure it passes your
tests. Think carefully about which data structure is the best suited
for storing the data read from the file.
TIP: You may want to look at
the indexOf
, substring
, and trim
methods
of
String
to help you extract the incorrect and correct words from any given
line.
TIP: This tip applies to writing
both FileCorrector
and SwapCorrector
.
The getCorrections
method requires that the suggested corrections have the
same case as the given incorrect word. Thus, if "pease" was the incorrect word, we should
suggest the set {"peas", "peace"}, but if either "PEASE" or "Pease" or "PeAsE" is the
incorrect word, we should suggest back the set {"Peas", "Peace"}. (Only first letter
capitalization is relevant.)
To help with the capitalization correction, the abstract class Corrector has a helper
method matchCase
that FileCorrector and SwapCorrector can both use. Notice how
putting this in the parent class allows us to avoid duplicating code in the two subclasses.
Repeat the above procedure for implementing SwapCorrector: look at the Javadocs and add necessary test cases
Here are some suggested test cases, but we've left out quite a few:
TIP: Remember that you can create your own dictionary files to help with testing! For example, if you make a test case that asks for corrections for the incorrect word "abcd", you may want to create your own dictionary file that contains "bacd", "acbd", etc and ensure your method doesn't miss any of the swaps!
Now, implement SwapCorrector.java
All IOExceptions that may result from the execution of the Correctors do not need to be caught.
Read the Javadocs closely for the Correctors and make sure to test all of the possible edge cases!
SpellChecker.java
and
MyTests.java
for test casesThe end is near!
You need to put all of the classes together. A SpellChecker object takes in a Corrector and a
Dictionary when constructed. When checkDocument
is called, reads tokens from the
input document, spell-checks any incorrect words, and writes the output to a file using a
Writer
object.
Most of SpellChecker has already been written, except for the method that processes the document and does the checking.
Click here to see a sample of the
output and interaction functionality we expect to see when running
the finished program. You should closely emulate this output in
your SpellChecker
class.
Complete the checkDocument
method in SpellChecker.java
so it matches
the functionality above. Bear in mind the following:
Read the Javadocs for the SpellChecker class first! :)
Non-word tokens should be copied directly from the input document into the newly corrected file. Only word tokens should be checked for spelling and (possibly) replaced.
If the user chooses option 1 ("replace with another word"), they can only enter single
word replacements. This is taken care of for you by the provided getNextString
helper method. You should use the submitted word verbatim; no need to correct the word
case.
If a word isn't in the dictionary, then we want to check if our Corrector has any suggestions for it. If more than one correction is available, then the possible corrections should be presented to the user in alphabetical order.
TIP: To alphabetically sort a string set, use the following:
// Suppose setOfCorrections is the Set<String> you got from a Corrector. List<String> sortedOptions = new LinkedList<String>(setOfCorrections); Collections.sort(sortedOptions); // sortedOptions is a sorted list, e.g. {"alpha", "beta"} // You can now use the List as you wish; see the List Javadocs. String first = sortedOptions.get(0); // "alpha" String second = sortedOptions.get(1); // "beta"
IOExceptions don't need to be caught, but you should handle invalid user input appropriately.
For example, if you give a user four options and they enter an invalid selection (such as 12 or "oaijeoijwef"), you should notify them and prompt for a selection again.
TIP:
We provide code to help you with this: see getNextInt
and
getNextString
in the partial SpellChecker implementation provided.
Your user display and options should match the sample output above. For example, do NOT number your options 1,2,3 instead of 0,1,2. We will test your code automatically with a file of user inputs.
The sample test case in SpellCheckerTest
uses a file as the "user" input so it
can run automatically.
The recommended way to run your spellchecker is from the command
line. See the README in Codio for more information. From the root directory of your
project in your workspace:
java -cp bin SpellCheckerRunner files/Gettysburg.txt files/Gettysburg-out.txt files/dictionary.txt files/misspellings.txt
TIP: If you are encountering an infinite loop when using theFox.txt, pay attention to how your token scanner handles end of file characters.
If the drive to create the world's best spell checker has consumed you, you can implement a
third type of corrector, Levenshtein.java
. This kind of Corrector proposes words
that are a Levenshtein distance
of 1 away from the incorrect word. This is kudos-only and worth no points.
As usual, you will be submitting on the
submissions page.
You should submit a zip file named hw08-(time)submit.zip
containing:
src/Dictionary.java src/FileCorrector.java src/SwapCorrector.java src/TokenScanner.java src/SpellChecker.java src/Levenshtein.java test/MyTests.java
These files should be organized similar to above (with a src and test directory). The easiest option is to use the Zip menu item in Codio. Do not include any of the other provided files, since doing so may cause your submission to fail to compile.
Follow these instructions to create and upload hw08-submit.zip:
TokenScanner
- 20%Dictionary
- 18%FileCorrector
- 18%SwapCorrector
- 16%SpellChecker
- 18%Levenshtein
- 0%Setting up in Eclipse: After downloading the homework files and unzipping them, open up Eclipse and follow these instructions:
Running the command line spellchecker in Eclipse: When you run the SpellChecker through Eclipse, you will need to refresh the Eclipse package in order to see the newly created output file.
SpellCheckerRunner
.You can return to this configuration and change the arguments to run it with different input files, or to switch between different types of Correctors.
To edit a configuration, in the Run Configurations panel, select the named configuration under "Java Application" on the left side.
To re-run a configuration (after it has been set up), you can choose the configuration (as if you were editing it) and hit "Run". Alternatively, if you have recently run it, you can click the little arrow next to the green button at the top of Eclipse:
Running the Spellchecker GUI in Eclipse: