Due Wednesday, April 16 at 5 pm. There are two required problems in this homework
Important notes:
Regular expressions are a specialized language for representing string-matching
patterns. Regular expressions were invented by the mathematician Stephen
Kleene,
one of the pioneers that created the foundations of the theory of computation
(with Gödel, Turing, Church, and Post). Kleene invented regular expressions
to represent the sets of possible behaviors of the abstract finite computation
devices called finite-state machines. In Kleene's original formulation,
regular expressions were were built from individual letters with three operators:
concatenation, representing one pattern followed by another; alternation (also
called union)denoted by |, representing two alternative patterns;
and closure (also called Kleene-*), denoted by *, to
represent zero or more repetitions of a pattern. Closure is conventionally
written after what is repeated. By convention, the empty string represents
a special regular expression, the empty regular expression, which
matches the empty string. For example, the regular expression a(bc|d)*e matches
all strings that start with a, then have some number of bc or d (possibly
none) and end with e. Some such strings include ae, abce, abcde,
adbcde, abcbcdbce. In the 1970s, computer scientists
at Bell Labs were working on the first software tools for text processing, and
they adapted and generalized Kleene's idea in several software tools, starting
with grep, for
searching and editing text. Regular expressions have
many practical uses, mainly in pattern matching, which has applications in everything
from compilers to searching databases of released
AOL user searches for social security numbers.
In this homework, you will construct an evaluator for regular expressions,
that is, a program that determines whether a regular expression matches a string.
The textbook way to do this is to first translate the regular expression into
a finite-state machine and then apply the finite-state matching to the string.
However, there's a more direct, elegant, but not so well-known alternative,
the method of derivatives due to Janusz
A. Brzozowski. This method is described in more detail here or here,
although you don't need to read these to do the homework. The basic idea is that given
a regular expression and the first character in the input string to match, you can compute
a set of regular expressions, one of which must match the remaining string in order for the original RegExp
to match the entire input string.
For the purposes of this homework, instead of working with actual regular
expressions like a(bc|d)*e, we will work with a representation
of the expressions as Java classes. In a complete regular expression
evaluator, you would have to parse the string representation of the
regular expression into the Java representation. That's not too difficult to
do, but it requires a lot of tricky coding that is too much for this assignment.
You will need the following test file and javadocs for the assignment:
Alt.java, Concat.java, Empty.java, Letter.java, RegExp.java, Star.java)Recall that any regular expression is either one individual character, an
empty expression, or comprised of other regular expressions. For example the
Concat expression, "ab", can be decomposed into two
subexpressions "a" and "b", each of which is a Letter expression.
The Alt expression "a|(a|b*)",
can be broken down into subexpressions "a" and "(a|b*)".
Every regular expression can be represented in this way as a combination of instances
of subclasses of the RegExp class (Alt, Concat,
Empty, Letter, Star).
In order to later evaluate a regular expression on strings, we first need
to design a set of classes designed to store a regular expression. In this
first part of the assignment, you should attempt to code general outline of
the classes described in the javadocs, including all of the getExp() methods. While we will not use the toString() methods to test this portion of the code, we strongly suggest you implement
toString() methods to obtain results similar to the sample interactions below.
We provide you with a set of tests in RegTest.java, and once you complete this
part of the assignment, you should be able to instantiate each of the RegExp objects created in the class. It is important to note however, that while these
samples provide a good starting point for testing your code, they are not complete
and you should try to develop your own regular expressions to test your code.
Please note: The purpose of this problem is to allow you to develop a mechanism for representing a regular expression. You only will need to turn in the files required for Problems 1 and 2 once, but we will test them separately: first to see if they allow for a representation of regular expressions, and second for the matching functionality explained in the next problem. As a result, you will find that some methods are not needed in this first problem, so feel free to use a dummy return statement to test your code and implement the real methods later.
Welcome to DrJava. > RegTest.testCase1() a* > RegTest.testCase2() ((a|b)c)* > RegTest.testCase3() (((ac)|(bc)*)c) > RegTest.testCase4() ((a*|b*)|c) > RegTest.testCase5() (((ab)|(cd))*e) > RegTest.testCase6() ((((a|b)|c)|d)e)*
Alt.java, Concat.java, Empty.java, Letter.java, RegExp.java, Star.java)In this problem, you will be writing the methods that actually evaluate whether
a regular expression matches a string. To understand better what is intended,
here is what each class of expressions should match in terms of the arguments
for the corresponding constructor. You will need these later to code the match method.
Empty(): matches "" Letter(n): matches any character m (including punctuation) as long as n == m. Note that Letter is case sensitive Concat(e,f):
matches any string u that is the concatenation of two strings v and w such
that e matches v and f matches w Alt(e,f): matches
any string that e matches or f matches Star(e): matches any string which is
the concatenation of 0 or more strings each of which matches e It might seem at first that the easiest way to define the match method that
implements the above definitions would be to use a recursive definition among
the various classes. For instance, one could think of implementing the match(s) call
for Alt as computing simply e.match(s) || f.match(s).
Unfortunately, this won't work, because it does not explore correctly all the
possible ways of trying to match a string (The reasons for this are a bit subtle;
you'll find the precise explanation in CIS 262).
To avoid the above problem, we work in a different way. Suppose that we are
trying to match an expression e against a string s made
of a character c followed
by string t (that is, s.equals(c+t)).
For each class of regular expression it is possible to give a simple rule that
computes a set of regular expressions such that if one of them matches t,
then
e matches the whole of s. We list the rules below in
detail. For example, if
e is Letter(a), the result
contains the empty expression Empty() if a == c, otherwise the result is just
the empty set (Don't confuse the empty expression, which matches the empty
string, with the empty set of expressions, which cannot match anything!). To define these rules we need the help of two
auxiliary concepts: Skippable and Append.
An expression is said to be skippable (skippable method) if the expression
can match the empty string (and possibly other strings as well). For example,
"(a*|b)" is skippable because one of its alternatives
is a*, which can match zero repetitions of a. Therefore, since the empty string "" matches (a*|b),
it is skippable.
The other concept requires appending an
expression f to an expression e (append method).
Appending is just like creating Concat(e, f) except
that if e is Empty, the result can
be simplified to f, because Empty can only match the
empty string and so the result of appending f to it matches the
same strings as f. In other words, append simplifies
the result of expression concatenation when the first expression is Empty.
This append method will be the same for almost all of RegExp's
descendants so it should be concretely defined in RegExp, but try to think when you may want to redefine it.
advance() methodsThe following rules define the result of advance(c) for each
subclass of RegExp .
Empty: the empty set since Empty cannot match
any character.Letter: If the character in Letter is c,
return a Set containing just an Empty,
meaning that we are done. Otherwise return the empty set, since the expression
fails to match the input. Concat: Separate Concat into two cases. In the first case, let the first RegExp not
be skippable. All we need to do in this case is advance the first RegExp by the given character, append
the second RegExp to each member of the set, and return the set. The second case is a bit more complicated.
When the first RegExp is skippable, you need to advance the first RegExp by the given character and
append as above but then must also union the result with the set created by advancing the second RegExp by the given
character.Alt: Union the result of advancing the first RegExp with
the result of advancing the second RegExp.Star: Append this Star to each of the results of advancing the
RegExp and
return a set containing the modified RegExps.Note: The Union of a set can be defined as follows. Set A = {1, 2, 3} Set B = {3, 5, 7} A Union B = {1, 2, 3, 5, 7}
The following picture illustrates the operations of these rules with a regular expression (a*|ab)c* and an input string
"abc". The first letter of the input 'a' is processed leaving two possible RegExps in the set
{(a*c*), (bc*)}. The second letter 'b' is then advanced on each one of the sets, the result of the first
RegExp is the empty set while the result of advancing on the second RegExp is "c*".
By unioning the two we are left with the set {(c*)}. Finally we advance 'c' on the only RegExp in the last
set and are left with the set {(c*)} again. Since we have exhausted our input, we then must look at all of the remaining
RegExps in our last set. Since (c*) is Skippable we can conclude that the string "abc" is matched
by the regular expression (a*|ab)c*

match() methodFinally you need to define the concrete match(String
example) method in RegExp
which actually checks to see if an expression matches a string (see above for an example). To do that,
take the first character, advance the regular expression based on that
character, take the set that results and advance each regular expression in
the set by the second character. Repeat until you reach the end of the string.
A string matches the regular expression if there is a skippable expression
located in the final set of regular expressions resulting from the evaluation,
since a skippable expression can match the empty string, which is all that
remains at the end of the given string after advancing over all the string's
characters.
A regular expression must encapsulate an entire string to be matched(eg. the regular expression "ab" does not match the string "abc" nor does the regular expression "abc" match the string "ab"). Also, if a given regular expression is not skippable, it can not match the empty string(eg. the regular expression "ab" does not match the empty string)
Welcome to DrJava.
> RegTest.testCase1()
a*
> RegTest.testCase1().match("a")
true
> RegTest.testCase1().match("aa")
true
> RegTest.testCase1().match("ba")
false
> RegTest.testCase1().match("ab")
false
> RegTest.testCase2()
((a|b)c)*
> RegTest.testCase2().match("ac")
true
> RegTest.testCase2().match("acbc")
true
> RegTest.testCase2().match("aabc")
false
> RegTest.testCase2().match("")
true
> RegTest.testCase3()
(((ac)|(bc)*)c)
> RegTest.testCase3().match("acc")
true
> RegTest.testCase3().match("acbcc")
false
> RegTest.testCase3().match("bcbcc")
true
> RegTest.testCase4()
((a*|b*)|c)
> RegTest.testCase4().match("aabaac")
false
> RegTest.testCase4().match("abaaca")
false
> RegTest.testCase5()
(((ab)|(cd))*e)
> RegTest.testCase5().match("abcde")
true
> RegTest.testCase5().match("abe")
true
> RegTest.testCase5().match("abcdeab")
false
> RegTest.testCase6()
((((a|b)|c)|d)e)*
> RegTest.testCase6().match("abed")
false
> RegTest.testCase6().match("abce")
false
> RegTest.testCase6().match("bede")
true