Package org.cis1200

Class MarkovChain

java.lang.Object
org.cis1200.MarkovChain
All Implemented Interfaces:
Iterator<String>

public class MarkovChain extends Object implements Iterator<String>
A Markov Chain is a data structure that tracks the frequency with which one value follows another value in sequence. This project uses a MarkovChain to model tweets by gathering the frequency information from a Twitter feed. We can use the MarkovChain to generate "plausible" tweets by conducting a random walk through the chain according to the frequencies. Please see the homework instructions for more information on Markov Chains.

TRAINING:

An example: Suppose we train the MarkovChain on these two Strings that represent tweets: "a table" and "A banana? A banana!"

We first "clean up" the tweets and parse them into individual sentences to use as training data. This process removes punctuation and puts all words into lower case, yielding these three sentences (written using OCaml list notation):

[ ["a"; "table"]; ["a"; "banana"]; ["a"; "banana"] ]

The MarkovChain that results from this training data maps each observed string to a ProbabilityDistribution that is based on the recorded occurrences of bigrams (adjacent words) in the data:

- "a" maps to "table":1, "banana":2

- "table" maps to "<END>":1

- "banana" maps to "<END>":2

"a" is followed by "table" one time and "banana" twice, "table" is the end of a sentence once, and "banana" is the end of a sentence twice. NOTE: we use the string "<END>" to mark the end of a sentence. Because we remove all punctuation first, this string is not a valid word.

The MarkovChain also records a ProbabilityDistribution that contains the frequencies with which words start any sentence. In this case, that startWords data will just say that "a" started 3 sentences.
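For concreteness, here is a hedged sketch of how the example training data above could be fed to the chain. It uses only the provided MarkovChain() constructor and the train() method documented below; the helper class name is illustrative.

    import java.util.List;

    public class TrainingExample {
        public static void main(String[] args) {
            MarkovChain mc = new MarkovChain();
            // Each cleaned-up sentence is passed to train() as an Iterator<String>.
            mc.train(List.of("a", "table").iterator());
            mc.train(List.of("a", "banana").iterator());
            mc.train(List.of("a", "banana").iterator());
            // After training, the chain's distributions match the table above:
            //   "a"      -> "table":1, "banana":2
            //   "table"  -> "<END>":1
            //   "banana" -> "<END>":2
            // and startWords records that "a" started 3 sentences.
        }
    }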

GENERATING A TWEET:

Once we have trained the Markov model, we can use it to generate a tweet. Given a desired length of tweet (in characters), we repeatedly generate sentences until the tweet is long enough.

To generate a sentence, we treat the MarkovChain as an iterator that maintains state about the current word (i.e. the one that will be generated by next()).

- the reset() method picks (at random) one of the startWords to be the current word. We use reset() to start a new sentence.

- the next() method picks (at random) a successor of the current word according to the current word's probability distribution. That successor will be the new "current" word after the current one is returned by next().

In the example above, reset() sets the current word to "a" (the only choice offered by startWords). Then:

- next() yields "a" (the start word) with probability 3/3.

- next() yields "table" with probability 1/3 and "banana" with probability 2/3.

- The iterator is then finished (the current word will be "<END>"), since both "table" and "banana" appeared only at the ends of sentences.

The random choices are determined by a NumberGenerator.
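As a hedged illustration of the walk described above (not the official solution), one sentence can be generated by treating the trained chain as an Iterator<String>:

    import java.util.StringJoiner;

    public class GenerationExample {
        // Builds one sentence by walking the chain from a random start word
        // until hasNext() reports that the end marker has been reached.
        static String generateSentence(MarkovChain mc) {
            StringJoiner sentence = new StringJoiner(" ");
            mc.reset();                  // pick a start word at random
            while (mc.hasNext()) {
                sentence.add(mc.next()); // append successive words until "<END>"
            }
            return sentence.toString();
        }
    }

A tweet generator would call such a method repeatedly, concatenating sentences until the desired character length is reached.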

  • Field Summary

    Fields

    (package private) final Map<String,ProbabilityDistribution<String>> chain
        for each word, the probability distribution of the next word in a sentence

    (package private) static final String END_TOKEN
        end-of-sentence marker

    (package private) final ProbabilityDistribution<String> startWords
        probability distribution of the initial words in a sentence
  • Constructor Summary

    Constructors

    MarkovChain()
        No need to write any constructors. They are provided for you.

    MarkovChain(NumberGenerator ng)
        No need to write any constructors. They are provided for you.
  • Method Summary

    Methods

    (package private) void addBigram(String first, String second)
        Adds a bigram to the Markov Chain dictionary.

    void fixDistribution(List<String> words)
        Modifies all ProbabilityDistributions to output words in the order specified.

    void fixDistribution(List<String> words, boolean pickFirst)
        Modifies all ProbabilityDistributions to output words in the order specified.

    (package private) ProbabilityDistribution<String> get(String token)
        Returns the ProbabilityDistribution for a given token.

    boolean hasNext()
        Checks whether there is another word to retrieve from the Markov Chain based on the current word of our walk.

    String next()
        Returns the string set up by the preceding reset call, or a successor of the word returned by the previous call to next().

    void reset()
        Sets up the Iterator functionality with a random start word.

    void reset(String start)
        Given a starting String, sets up the Iterator functionality such that the Markov Chain will begin a walk at start.

    String toString()
        Use this method to print out a Markov Chain's words and probability distributions.

    void train(Iterator<String> sentence)
        Adds a sentence's training data to the MarkovChain frequency information.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

    Methods inherited from interface java.util.Iterator

    forEachRemaining, remove
  • Field Details

    • chain

      final Map<String,ProbabilityDistribution<String>> chain
      For each word, the probability distribution of the next word in a sentence.

    • END_TOKEN

      static final String END_TOKEN
      End-of-sentence marker; its value is the string "<END>".

    • startWords

      final ProbabilityDistribution<String> startWords
      Probability distribution of the initial words in a sentence.

  • Constructor Details

    • MarkovChain

      public MarkovChain()
      No need to write any constructors. They are provided for you.
    • MarkovChain

      public MarkovChain(NumberGenerator ng)
      No need to write any constructors. They are provided for you.
      Parameters:
      ng - A (non-null) NumberGenerator used to walk through the MarkovChain
  • Method Details

    • addBigram

      void addBigram(String first, String second)
      Adds a bigram to the Markov Chain dictionary. Note that the dictionary is a field called chain of type Map<String, ProbabilityDistribution<String>> .
      Parameters:
      first - The first word of the bigram (should not be null)
      second - The second word of the bigram (should not be null)
      Throws:
      IllegalArgumentException - when either parameter is null
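      A minimal sketch of the idea, not the official solution. It assumes ProbabilityDistribution has a no-argument constructor and a method, here called record(String), that increments the count for an outcome; the actual names in the provided class may differ.

      void addBigram(String first, String second) {
          if (first == null || second == null) {
              throw new IllegalArgumentException("bigram words must be non-null");
          }
          // Create a distribution for `first` if it has not been seen before,
          // then record one more occurrence of `second` following it.
          chain.computeIfAbsent(first, k -> new ProbabilityDistribution<>())
               .record(second);
      }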
    • train

      public void train(Iterator<String> sentence)
      Adds a sentence's training data to the MarkovChain frequency information.

      This method is meant to be called multiple times. Each call to this method should provide this method with an Iterator that represents one sentence. If we were to train a Markov Chain with four unique sentences, we would convert each sentence into an iterator and call train() four times, once on each of the iterators. Remember to update startWords accordingly.

      Look at the homework instructions for more details on bigrams. You should use addBigram() in this method.

      Once you reach the last word of a sentence, add a bigram of that word and the string "<END>" (we provide it as the variable END_TOKEN). This functions similarly to None in OCaml, letting us denote the absence of data without the issues that arise if we use null. Using this string will teach the Markov Chain how to end a sentence.

      Do nothing if the sentence is empty.

      Parameters:
      sentence - an iterator representing one sentence of training data
      Throws:
      IllegalArgumentException - when the sentence Iterator is null
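      A hedged sketch of the approach described above; it relies on addBigram() and END_TOKEN as documented, and assumes startWords has the same hypothetical record(String) method mentioned for addBigram().

      public void train(Iterator<String> sentence) {
          if (sentence == null) {
              throw new IllegalArgumentException("sentence iterator must be non-null");
          }
          if (!sentence.hasNext()) {
              return;                     // empty sentence: nothing to learn
          }
          String current = sentence.next();
          startWords.record(current);     // this word can start a sentence
          while (sentence.hasNext()) {
              String next = sentence.next();
              addBigram(current, next);   // record that `next` follows `current`
              current = next;
          }
          addBigram(current, END_TOKEN);  // mark the last word as a sentence end
      }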
    • get

      ProbabilityDistribution<String> get(String token)
      Returns the ProbabilityDistribution for a given token. Returns null if none exists.
      Parameters:
      token - the token for which the ProbabilityDistribution is sought
      Returns:
      a ProbabilityDistribution, or null
      Throws:
      IllegalArgumentException - when parameter is null.
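      Since the behavior is fully specified above, a minimal sketch is just a null-checked lookup in the chain map:

      ProbabilityDistribution<String> get(String token) {
          if (token == null) {
              throw new IllegalArgumentException("token must be non-null");
          }
          return chain.get(token);        // null if the token was never trained on
      }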
    • reset

      public void reset(String start)
      Given a starting String, sets up the Iterator functionality such that: (1) the Markov Chain will begin a walk at start. (2) the first call to next() made after calling reset(start) will return start.

      If start is "<END>", then hasNext() should return false. start need not actually be a part of the chain (but it should have no successor). In the edge case that you've reset a word that isn't present in the training data, the first call to next() will return the word, and a second call to next() will throw a NoSuchElementException. Likewise, the first call to hasNext() will return true, and a call to hasNext() after a call to next() will return false.

      Parameters:
      start - the element that will be the first word in a walk on the Markov Chain.
      Throws:
      IllegalArgumentException - when parameter is null.
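      A hedged usage sketch of the edge case described above, resetting to a word (here the hypothetical "zebra") that never appeared in the training data:

      MarkovChain mc = new MarkovChain();
      mc.reset("zebra");   // "zebra" is not part of the chain
      mc.hasNext();        // true
      mc.next();           // returns "zebra"
      mc.hasNext();        // false
      mc.next();           // throws NoSuchElementException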
    • reset

      public void reset()
      DO NOT EDIT THIS METHOD. WE COMPLETED IT FOR YOU.

      Sets up the Iterator functionality with a random start word such that the MarkovChain will now move along a walk beginning with that start word.

      The first call to next() after calling reset() will return the random start word selected by this call to reset().

    • hasNext

      public boolean hasNext()
      This method should check if there is another word to retrieve from the Markov Chain based on the current word of our walk. Remember that the end of a sentence is denoted by the token "<END>".

      Your solution should be very short.

      Specified by:
      hasNext in interface Iterator<String>
      Returns:
      true if next() will return a non-trivial String (i.e. it is a meaningful part of the sentence - see train(java.util.Iterator<java.lang.String>)) and false otherwise
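      One hedged way to realize this (the names are assumptions, not the provided skeleton): track the word that next() will return in an instance field, here called curWord, and compare it against END_TOKEN.

      public boolean hasNext() {
          // The walk continues as long as the upcoming word is a real word,
          // not the end-of-sentence marker.
          return curWord != null && !curWord.equals(END_TOKEN);
      }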
    • next

      public String next()
      If reset() or reset(String start) was called immediately before this, this method should return the string that was set up by that reset call. Otherwise, this method should return a successor picked from the probability distribution associated with the word returned by the previous call to next().
      Specified by:
      next in interface Iterator<String>
      Returns:
      the next word in the MarkovChain (chosen at random via the number generator if it is a successor)
      Throws:
      NoSuchElementException - if there are no more words on the walk through the chain.
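      A hedged sketch consistent with the description above; it assumes the same hypothetical curWord field, a field ng holding the NumberGenerator passed to the constructor, and a hypothetical pick(NumberGenerator) method on ProbabilityDistribution that selects an outcome at random.

      public String next() {
          if (!hasNext()) {
              throw new NoSuchElementException("no more words on this walk");
          }
          String result = curWord;
          ProbabilityDistribution<String> dist = chain.get(curWord);
          // A word with no recorded successors (e.g. one passed to reset() but
          // never seen in training) ends the walk immediately.
          curWord = (dist == null) ? END_TOKEN : dist.pick(ng);
          return result;
      }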
    • fixDistribution

      public void fixDistribution(List<String> words)
      Modifies all ProbabilityDistributions to output words in the order specified.
      Parameters:
      words - an ordered list of words that the distributions should generate
      Throws:
      IllegalArgumentException - when parameter is null or empty
    • fixDistribution

      public void fixDistribution(List<String> words, boolean pickFirst)
      Modifies all ProbabilityDistributions to output words in the order specified.
      Parameters:
      words - an ordered list of words that the distributions should generate
      pickFirst - whether to pick the first word in words from startWords
      Throws:
      IllegalArgumentException - when the parameter is null or empty, or when the first word in the list is not in startWords
    • toString

      public String toString()
      Use this method to print out a Markov Chain's words and probability distributions.
      Overrides:
      toString in class Object