CIS 120 Homework 4 - Text Processing

Due Friday, October 16th, 2009 at 11:00am.

Overview

Congratulations! You have gotten a job as a Junior Text Processing Specialist with Business Corp., Inc. (Hey, in this economy, you take what you can get.) Will you have what it takes to get that big promotion to Senior Text Processing Specialist? Let's find out...


Sections in this document:

  1. Overview
  2. Getting started
  3. Problem 1: Caesar shift
  4. Problem 2: Substring occurrences
  5. Problem 3: Word redaction
  6. Problem 4: Readers and Streams and Files, oh my!
  7. Extra credit

Getting started

First, download hw4.jar. Then create a new Java project in Eclipse and add hw4.jar to the classpath: right-click on the project and choose Properties --> Java Build Path --> Libraries --> Add External JARs.

In your Java project, you should create a class called TextProcessor; this is the file you will submit for this assignment.

Of course, you will also want to test your assignment. Starting this week, the tests we provide will be broken up into a number of simpler classes containing normal JUnit tests. Here are the tester files you'll need this week:


Problem 1 (25 points): Caesar shift

File to submit: TextProcessor.java

Your first task is to decrypt some old meeting minutes which were encrypted for some reason; no one remembers why (or what they said).

A Caesar cipher is a simple type of cipher named for Julius Caesar, who used it to communicate secretly with his generals. At least, it was secret back then; today, his puny cipher will be no match for your programming skills.

To encrypt some text using a Caesar cipher, each letter is shifted by a certain amount. For example, if using a shift of 3, the letter A becomes D, B becomes E, C becomes F, ... and so on. Letters at the end of the alphabet "wrap around", so, for example, with a shift of 3, X becomes A, Y becomes B, and Z becomes C.

Decrypting a message encoded with a Caesar cipher is easy: just apply another shift so the total shift is 26. For example, to decrypt a message encrypted with a shift of 3, we would apply another shift of 23.
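
The wrap-around arithmetic can be checked on alphabet positions alone. The sketch below is only an illustration: the class ShiftArithmetic and the helper shiftPosition are hypothetical names, and it works on positions 0 to 25 (A = 0, ..., Z = 25) rather than on characters, so it is not the caesar method itself.

```java
class ShiftArithmetic {
    // Hypothetical helper: moves an alphabet position (0 = A, ..., 25 = Z)
    // forward by a non-negative shift, wrapping around past Z.
    static int shiftPosition(int position, int shift) {
        return (position + shift) % 26;
    }

    public static void main(String[] args) {
        int x = 23;                                      // position of X
        int encrypted = shiftPosition(x, 3);             // 0, the position of A
        int decrypted = shiftPosition(encrypted, 23);    // 23, back at X
        System.out.println(encrypted + " " + decrypted); // prints "0 23"
    }
}
```

Shifting by 3 and then by 23 gives a total shift of 26, which lands every position back where it started.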

Your job for the first problem is to write a method

    public static String caesar(String input, int shift)

which applies a Caesar cipher shift to a given input string. For example, TextProcessor.caesar("The quick brown fox!", 7) should return "Aol xbpjr iyvdu mve!". Note that capital letters remain capital letters, lowercase remain lowercase, and non-letter characters (such as spaces or punctuation) remain unchanged. Note: We will only test your code with positive shift values. For more details, see the javadocs.

Hints

  1. Every char value has a corresponding int value, which can be obtained by casting. In fact, char values can simply be used as if they were of type int, and Java will automatically convert them. However, to convert in the other direction, one must use an explicit (char) cast. This is because int has a larger range than char, so Java cannot know whether an int to char conversion is safe; some information may be lost.
  2. You may find it useful to know that the int values corresponding to the characters 'A' through 'Z' are all consecutive and in increasing order, as are the values for 'a' through 'z'.
  3. Recall that the mod operator, written %, gives the remainder when one number is divided by another.
  4. Udg pc tmigp (cdi udg rgtsxi) rwpaatcvt, bpzt ndjg rpthpg btiwds ldgz lxiw ctvpixkt hwxuih ph ltaa. Lwn xhc'i xi tcixgtan higpxvwiudglpgs?
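
The first three hints can be tried out in isolation. This is a small sketch with hypothetical names (CharIntDemo, toCode, toChar); it only demonstrates the conversions and the % operator on non-negative numbers, and it deliberately leaves hint 4 for you to decode.

```java
class CharIntDemo {
    // Hypothetical helper: a char converts to int implicitly, with no cast.
    static int toCode(char c) {
        return c;
    }

    // Hypothetical helper: going from int back to char needs an explicit cast.
    static char toChar(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        System.out.println(toCode('A'));      // prints 65
        System.out.println(toChar('A' + 1));  // prints B
        System.out.println('e' - 'a');        // prints 4 (values are consecutive)
        System.out.println(29 % 26);          // prints 3 (remainder of 29 / 26)
    }
}
```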

Problem 2 (25 points): Substring occurrences

File to submit: TextProcessor.java

The CEO of Business Corp., Inc., Mr. I. M. Bizy, has a habit of using certain words (such as "synergistically", "leverage", and "actionable") far too often. In an attempt to synergistically leverage your core competencies to produce an actionable artifact alerting him to this fact, you have been tasked with analyzing some documents and labeling the occurrences of certain words with how many times those words have been used.

In particular, you should write the following method:

    public static String numberOccurrences(String input, String word)

which numbers the occurrences of word in input consecutively, appending the running count directly after each occurrence. For example,

TextProcessor.numberOccurrences("How much wood would a wood chuck chuck if a wood chuck could chuck wood?", "chuck")

should yield

"How much wood would a wood chuck1 chuck2 if a wood chuck3 could chuck4 wood?"

You may assume that you only need to count exact matches; for example, if the foregoing example contained "Chuck" it would not be counted, since "Chuck" and "chuck" are not identical. (However, see the extra credit.)

Hints

  1. You may find the String javadocs useful.
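
One method from the String javadocs worth a look is indexOf, which can search from a given starting index. The sketch below uses it to count exact matches; the class IndexOfDemo and the helper countOccurrences are hypothetical, and counting is only one part of what numberOccurrences has to do.

```java
class IndexOfDemo {
    // Hypothetical helper: counts exact (case-sensitive), non-overlapping
    // occurrences of word in input.
    static int countOccurrences(String input, String word) {
        if (word.isEmpty()) {
            return 0; // assumption: an empty word never counts as a match
        }
        int count = 0;
        int from = input.indexOf(word);          // -1 if there is no match
        while (from != -1) {
            count++;
            // Continue searching just past the match that was found.
            from = input.indexOf(word, from + word.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "How much wood would a wood chuck chuck if a wood chuck could chuck wood?";
        System.out.println(countOccurrences(text, "chuck")); // prints 4
        System.out.println(countOccurrences(text, "Chuck")); // prints 0
    }
}
```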

Problem 3 (25 points): Word redaction

File to submit: TextProcessor.java

You have now been tasked with preparing some company documents for public release. Some of the documents contain sensitive company secrets (such as the amount of coffee consumed by employees per day; hint: it is measured in hectoliters), so your job is to remove secrets from the documents first. Actually, distinguishing between secrets and non-secrets is too hard, so you should just remove all the words.

Specifically, write a method

    public static String redact(String input)

which replaces every word in the string input with a single asterisk. For the purposes of this assignment, a "word" consists of a consecutive sequence of letters or apostrophes; for example, "don't" is one word, whereas "don,t" is two words separated by a comma. (A real text processing application would be slightly more precise; see the extra credit.) Spaces and punctuation should be left unchanged.

For example, TextProcessor.redact("The quick, brown'd fox!") should result in "* *, * *!".

Hints

  1. The apostrophe character needs to be escaped: if you are writing the apostrophe as a character literal, you must put a backslash before the apostrophe to prevent Java from interpreting it incorrectly. For example:
        char c = '''; 
    is not valid Java code (why?). Instead, use the following:
        char c = '\'';  
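
As an illustration of the escaped literal in use, here is a minimal sketch of a test for "word" characters as defined above. The names ApostropheDemo and isWordChar are hypothetical, and using Character.isLetter is an assumption about what counts as a letter (it also accepts non-English letters).

```java
class ApostropheDemo {
    // Hypothetical helper: true if c is a letter or an apostrophe,
    // i.e. a character that can be part of a "word" in this assignment.
    static boolean isWordChar(char c) {
        return Character.isLetter(c) || c == '\'';
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('d'));  // prints true
        System.out.println(isWordChar('\'')); // prints true
        System.out.println(isWordChar(','));  // prints false
    }
}
```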

Problem 4 (25 points): Readers and Streams and Files, oh my!

File to submit: TextProcessor.java

Getting input from a String is kind of limiting: in fact, there are lots of other places you could get input, such as from a file, over the network, from a pigeon carrying a sheet of aluminum stamped with Braille... wouldn't it be nice if your code worked in all of these scenarios?

This final task has two parts.

  1. First, write a method

       public static String redactStream(Reader input) throws IOException

    which performs the same function as redact, but gets its input from a Reader instead of a String.

    Note: in order to use Reader and the other classes you'll need for this problem, you should add this to the top of TextProcessor.java:

      import java.io.*;

    (Although if you forget, it's not that big of a deal: Eclipse will offer to do it for you!)

  2. For the second part, you should write a method

       public static String redactFile(String fileName) throws IOException

    which takes as an argument the name of a file, and returns its redacted contents. (Hint: take a look at the FileReader class.) To test this code, you will need the tester files above, plus testDocument.txt, which must be in the same directory as your .java files.
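
If you have not used a Reader before, the following sketch shows the basic reading loop. The class ReaderDemo and the helper readAll are hypothetical; the helper just copies the characters unchanged, so it is a starting point for experimenting rather than a redactStream implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderDemo {
    // Hypothetical helper: reads every character from the Reader into a String.
    static String readAll(Reader input) throws IOException {
        StringBuilder result = new StringBuilder();
        int next = input.read();         // read() returns -1 at end of input
        while (next != -1) {
            result.append((char) next);  // read() returns an int, so cast to char
            next = input.read();
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader input = new StringReader("The quick, brown'd fox!");
        System.out.println(readAll(input)); // prints The quick, brown'd fox!
    }
}
```

Because FileReader is also a Reader, the same helper would accept a FileReader constructed from a file name, which is one reason to write code against Reader rather than String.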

Extra credit

File to submit: TextProcessor.java

Congratulations, you got that promotion to Senior Text Processing Specialist! But do you have what it takes to become a Senior Text Processing Consultant, with your very own desk (in the basement)?

EC 1: Inexact matching

Update your numberOccurrences method so that it also counts inexact matches that differ only in case. For example, numberOccurrences("The the THE ThE tHE springtime", "THe") should result in "The1 the2 THE3 ThE4 tHE5 springtime".
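
One possible building block here is String.regionMatches, which can compare part of a string against another string while ignoring case. The sketch below is only an illustration under that assumption; the class IgnoreCaseDemo and the helper matchesAt are hypothetical names, and other approaches work as well.

```java
class IgnoreCaseDemo {
    // Hypothetical helper: true if word occurs in input starting at index,
    // ignoring differences in case. Returns false if the region is out of range.
    static boolean matchesAt(String input, int index, String word) {
        return input.regionMatches(true, index, word, 0, word.length());
    }

    public static void main(String[] args) {
        String text = "The the THE ThE tHE springtime";
        System.out.println(matchesAt(text, 0, "THe")); // prints true
        System.out.println(matchesAt(text, 4, "THe")); // prints true
        System.out.println(matchesAt(text, 1, "THe")); // prints false
    }
}
```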

EC 2: Precise word matching

Update your redact method so that it handles apostrophes correctly. For the purposes of the assignment, you were told to assume that an apostrophe always counts as part of a word; but really, an apostrophe is only part of a word if it occurs in between two letters. Otherwise, it is punctuation. For example, redact("I don't 'think' so") should yield "* * '*' *": the first apostrophe is part of a word, since it occurs between n and t, but the second and third apostrophes are punctuation, and are therefore copied unchanged to the redacted output.