CIS 120 Homework 7

Due Wednesday, April 16 at 5 pm. There are two required problems in this homework

Important notes:

Introduction

Regular expressions are a specialized language for representing string-matching patterns. Regular expressions were invented by the mathematician Stephen Kleene, one of the pioneers that created the foundations of the theory of computation (with Gödel, Turing, Church, and Post). Kleene invented regular expressions to represent the sets of possible behaviors of the abstract finite computation devices called finite-state machines. In Kleene's original formulation, regular expressions were were built from individual letters with three operators: concatenation, representing one pattern followed by another; alternation (also called union)denoted by |, representing two alternative patterns; and closure (also called Kleene-*), denoted by *, to represent zero or more repetitions of a pattern. Closure is conventionally written after what is repeated. By convention, the empty string represents a special regular expression, the empty regular expression, which matches the empty string. For example, the regular expression a(bc|d)*e matches all strings that start with a, then have some number of bc or d (possibly none) and end with e. Some such strings include ae, abce, abcde, adbcde, abcbcdbce. In the 1970s, computer scientists at Bell Labs were working on the first software tools for text processing, and they adapted and generalized Kleene's idea in several software tools, starting with grep, for searching and editing text. Regular expressions have many practical uses, mainly in pattern matching, which has applications in everything from compilers to searching databases of released AOL user searches for social security numbers.

In this homework, you will construct an evaluator for regular expressions, that is, a program that determines whether a regular expression matches a string. The textbook way to do this is to first translate the regular expression into a finite-state machine and then apply the finite-state matching to the string. However, there's a more direct, elegant, but not so well-known alternative, the method of derivatives due to Janusz A. Brzozowski. This method is described in more detail here or here, although you don't need to read these to do the homework. The basic idea is that given a regular expression and the first character in the input string to match, you can compute a set of regular expressions, one of which must match the remaining string in order for the original RegExp to match the entire input string.

For the purposes of this homework, instead of working with actual regular expressions like a(bc|d)*e, we will work with a representation of the expressions as Java classes. In a complete regular expression evaluator, you would have to parse the string representation of the regular expression into the Java representation. That's not too difficult to do, but it requires a lot of tricky coding that is too much for this assignment.

You will need the following test file and javadocs for the assignment:

Problem 1: Regular Expression Representation (30 points; files to submit: Alt.java, Concat.java, Empty.java, Letter.java, RegExp.java, Star.java)

Recall that any regular expression is either one individual character, an empty expression, or comprised of other regular expressions. For example the Concat expression, "ab", can be decomposed into two subexpressions "a" and "b", each of which is a Letter expression. The Alt expression "a|(a|b*)", can be broken down into subexpressions "a" and "(a|b*)". Every regular expression can be represented in this way as a combination of instances of subclasses of the RegExp class (Alt, Concat, Empty, Letter, Star).

In order to later evaluate a regular expression on strings, we first need to design a set of classes designed to store a regular expression. In this first part of the assignment, you should attempt to code general outline of the classes described in the javadocs, including all of the getExp() methods. While we will not use the toString() methods to test this portion of the code, we strongly suggest you implement toString() methods to obtain results similar to the sample interactions below. We provide you with a set of tests in RegTest.java, and once you complete this part of the assignment, you should be able to instantiate each of the RegExp objects created in the class. It is important to note however, that while these samples provide a good starting point for testing your code, they are not complete and you should try to develop your own regular expressions to test your code.

Please note: The purpose of this problem is to allow you to develop a mechanism for representing a regular expression. You only will need to turn in the files required for Problems 1 and 2 once, but we will test them separately: first to see if they allow for a representation of regular expressions, and second for the matching functionality explained in the next problem. As a result, you will find that some methods are not needed in this first problem, so feel free to use a dummy return statement to test your code and implement the real methods later.

Sample Interactions (RegTest1.hist)

Welcome to DrJava.
> RegTest.testCase1()
a*
> RegTest.testCase2()
((a|b)c)*
> RegTest.testCase3()
(((ac)|(bc)*)c)
> RegTest.testCase4()
((a*|b*)|c)
> RegTest.testCase5()
(((ab)|(cd))*e)
> RegTest.testCase6()
((((a|b)|c)|d)e)*

Problem 2: Regular Expression Matching (70 points; file to submit: Alt.java, Concat.java, Empty.java, Letter.java, RegExp.java, Star.java)

Problem Overview

In this problem, you will be writing the methods that actually evaluate whether a regular expression matches a string. To understand better what is intended, here is what each class of expressions should match in terms of the arguments for the corresponding constructor. You will need these later to code the match method.

It might seem at first that the easiest way to define the match method that implements the above definitions would be to use a recursive definition among the various classes. For instance, one could think of implementing the match(s) call for Alt as computing simply e.match(s) || f.match(s). Unfortunately, this won't work, because it does not explore correctly all the possible ways of trying to match a string (The reasons for this are a bit subtle; you'll find the precise explanation in CIS 262).

To avoid the above problem, we work in a different way. Suppose that we are trying to match an expression e against a string s made of a character c followed by string t (that is, s.equals(c+t)). For each class of regular expression it is possible to give a simple rule that computes a set of regular expressions such that if one of them matches t, then e matches the whole of s. We list the rules below in detail. For example, if e is Letter(a), the result contains the empty expression Empty() if a == c, otherwise the result is just the empty set (Don't confuse the empty expression, which matches the empty string, with the empty set of expressions, which cannot match anything!). To define these rules we need the help of two auxiliary concepts: Skippable and Append.

Skippable

An expression is said to be skippable (skippable method) if the expression can match the empty string (and possibly other strings as well). For example, "(a*|b)" is skippable because one of its alternatives is a*, which can match zero repetitions of a. Therefore, since the empty string "" matches (a*|b), it is skippable.

Append

The other concept requires appending an expression f to an expression e (append method). Appending is just like creating Concat(e, f) except that if e is Empty, the result can be simplified to f, because Empty can only match the empty string and so the result of appending f to it matches the same strings as f. In other words, append simplifies the result of expression concatenation when the first expression is Empty. This append method will be the same for almost all of RegExp's descendants so it should be concretely defined in RegExp, but try to think when you may want to redefine it.

Implementing the advance() methods

The following rules define the result of advance(c) for each subclass of RegExp .

Note: The Union of a set can be defined as follows. Set A = {1, 2, 3} Set B = {3, 5, 7} A Union B = {1, 2, 3, 5, 7}

The following picture illustrates the operations of these rules with a regular expression (a*|ab)c* and an input string "abc". The first letter of the input 'a' is processed leaving two possible RegExps in the set {(a*c*), (bc*)}. The second letter 'b' is then advanced on each one of the sets, the result of the first RegExp is the empty set while the result of advancing on the second RegExp is "c*". By unioning the two we are left with the set {(c*)}. Finally we advance 'c' on the only RegExp in the last set and are left with the set {(c*)} again. Since we have exhausted our input, we then must look at all of the remaining RegExps in our last set. Since (c*) is Skippable we can conclude that the string "abc" is matched by the regular expression (a*|ab)c*

advance

Implementing the match() method

Finally you need to define the concrete match(String example) method in RegExp which actually checks to see if an expression matches a string (see above for an example). To do that, take the first character, advance the regular expression based on that character, take the set that results and advance each regular expression in the set by the second character. Repeat until you reach the end of the string. A string matches the regular expression if there is a skippable expression located in the final set of regular expressions resulting from the evaluation, since a skippable expression can match the empty string, which is all that remains at the end of the given string after advancing over all the string's characters.

A regular expression must encapsulate an entire string to be matched(eg. the regular expression "ab" does not match the string "abc" nor does the regular expression "abc" match the string "ab"). Also, if a given regular expression is not skippable, it can not match the empty string(eg. the regular expression "ab" does not match the empty string)

Sample Interactions (RegTest2.hist)

Welcome to DrJava.
> RegTest.testCase1()
a*
> RegTest.testCase1().match("a")
true
> RegTest.testCase1().match("aa")
true
> RegTest.testCase1().match("ba")
false
> RegTest.testCase1().match("ab")
false
> RegTest.testCase2()
((a|b)c)*
> RegTest.testCase2().match("ac")
true
> RegTest.testCase2().match("acbc")
true
> RegTest.testCase2().match("aabc")
false
> RegTest.testCase2().match("")
true
> RegTest.testCase3()
(((ac)|(bc)*)c)
> RegTest.testCase3().match("acc")
true
> RegTest.testCase3().match("acbcc")
false
> RegTest.testCase3().match("bcbcc")
true
> RegTest.testCase4()
((a*|b*)|c)
> RegTest.testCase4().match("aabaac")
false
> RegTest.testCase4().match("abaaca")
false
> RegTest.testCase5()
(((ab)|(cd))*e)
> RegTest.testCase5().match("abcde")
true
> RegTest.testCase5().match("abe")
true
> RegTest.testCase5().match("abcdeab")
false
> RegTest.testCase6()
((((a|b)|c)|d)e)*
> RegTest.testCase6().match("abed")
false
> RegTest.testCase6().match("abce")
false
> RegTest.testCase6().match("bede")
true