Class TweetParser
Note: TweetParser's public methods are csvDataToTrainingData() and getPunctuation(). These are the only methods that other classes should call.
All the other methods provided are helper methods that build up the code you'll need to write those public methods. They have "package" (default, no modifier) visibility, which lets us write test cases for them as long as those test cases are in the same package.
-
Constructor Summary
Constructors
TweetParser()
-
Method Summary
Modifier and Type | Method | Description
(package private) static String | cleanWord(String word) | Do not modify this method.
 | csvDataToTrainingData(BufferedReader br, int tweetColumn) | Given a buffered reader and the column from which to extract the tweet data, computes a training set.
 | csvDataToTweets(BufferedReader br, int tweetColumn) | Given a buffered reader and the column that the tweets are in, uses extractColumn and a FileLineIterator to extract every tweet from the reader.
(package private) static String | extractColumn(String csvLine, int csvColumn) | Given a String that represents a CSV line extracted from a reader and an int that represents the column of the String that we want to extract from, returns the contents of that column from the String.
static char[] | getPunctuation() | Returns a copy, made with clone(), of the array of punctuation marks used by the parser.
 | parseAndCleanSentence(String sentence) | Splits a String representing a sentence into a sequence of words, filtering out any "bad" words from the sentence.
 | parseAndCleanTweet(String tweet) | Processes a tweet into a list of sentences, where each sentence is itself a (non-empty) list of cleaned words.
(package private) static String | removeURLs(String s) | Do not modify this method.
(package private) static String | replacePunctuation(String tweet) | Do not modify this method.
 | tweetSplit(String tweet) | Do not modify this method.
-
Constructor Details
-
TweetParser
public TweetParser()
-
-
Method Details
-
getPunctuation
public static char[] getPunctuation()
Returns a copy of the parser's punctuation array. The copy is made with clone(), so changes a caller makes to the returned array do not affect the parser's own array.
- Returns:
- an array containing the punctuation marks used by the parser.
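The defensive copy described above can be illustrated with a minimal sketch. The class name PunctuationSketch and the contents of the PUNCS array are assumptions made for illustration; the documentation does not list the actual punctuation marks.

```java
// Hypothetical sketch of returning a defensive copy of an array.
class PunctuationSketch {
    // Assumed punctuation set, for illustration only.
    private static final char[] PUNCS = {'.', '?', '!', ';'};

    static char[] getPunctuation() {
        // clone() on an array returns a new array with the same elements.
        return PUNCS.clone();
    }

    public static void main(String[] args) {
        char[] copy = getPunctuation();
        copy[0] = '#'; // changes only the caller's copy
        System.out.println(getPunctuation()[0]); // prints "."
    }
}
```

Because each call returns a fresh array, a caller that overwrites an element cannot change what later callers see.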
-
replacePunctuation
Do not modify this method.
Given a string, replaces all the punctuation with periods.
String.replace() returns a string in which every occurrence of one character is replaced with another character.
- Parameters:
- tweet - a String representing a tweet
- Returns:
- a String with all the punctuation replaced with periods
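One way the replacement could work is sketched below. This is not the provided implementation (which should not be modified); the class name and the PUNCS array are assumptions for illustration.

```java
// Hypothetical sketch: normalize every punctuation mark to a period.
class ReplacePunctuationSketch {
    // Assumed punctuation set, for illustration only.
    private static final char[] PUNCS = {'.', '?', '!', ';'};

    static String replacePunctuation(String tweet) {
        for (char c : PUNCS) {
            // String.replace(char, char) replaces every occurrence of c.
            tweet = tweet.replace(c, '.');
        }
        return tweet;
    }

    public static void main(String[] args) {
        System.out.println(replacePunctuation("Hi! How are you?")); // prints "Hi. How are you."
    }
}
```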
-
tweetSplit
Do not modify this method.
Given a tweet, splits the tweet into sentences (without end punctuation) and inserts each sentence into a list.
Use this as a helper function for parseAndCleanTweet().
String.trim() returns a string with leading and trailing whitespace removed. String.split() breaks a string apart at each match of the given regular expression and returns the pieces as a String array.
- Parameters:
- tweet - a String representing a tweet
- Returns:
- a List of Strings where each String is a (non-empty) sentence from the tweet
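The combination of replace(), split(), and trim() described above might look like the following sketch. It is an illustration under assumptions, not the provided method: the class name and the set of punctuation marks ('?', '!', ';', '.') are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting a tweet into non-empty sentences.
class TweetSplitSketch {
    static List<String> tweetSplit(String tweet) {
        List<String> sentences = new ArrayList<>();
        // Assumed punctuation set: turn every mark into a period first.
        String normalized = tweet.replace('?', '.').replace('!', '.').replace(';', '.');
        // The period is escaped because split() takes a regular expression.
        for (String piece : normalized.split("\\.")) {
            String sentence = piece.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence); // keep only non-empty sentences
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(tweetSplit("Hello world! How are you?")); // prints "[Hello world, How are you]"
    }
}
```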
-
extractColumn
Given a String that represents a CSV line extracted from a reader and an int that represents the column of the String that we want to extract from, returns the contents of that column from the String. Columns in the buffered reader are zero indexed.
You may find the String.split() method useful here. Your solution should be relatively short.
You may assume that the column contents themselves don't have any commas.
- Parameters:
- csvLine - a line extracted from a buffered reader
- csvColumn - the column of the CSV line whose contents ought to be returned
- Returns:
- the portion of csvLine corresponding to the column csvColumn. If csvLine is null or has no appropriate csvColumn, returns null
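A minimal sketch of the contract above, assuming a plain split on commas. The class name is hypothetical, and this is one possible approach rather than the expected solution.

```java
// Hypothetical sketch of extracting one zero-indexed CSV column.
class ExtractColumnSketch {
    static String extractColumn(String csvLine, int csvColumn) {
        if (csvLine == null || csvColumn < 0) {
            return null;
        }
        // Safe only because column contents are assumed to contain no commas.
        String[] columns = csvLine.split(",");
        if (csvColumn >= columns.length) {
            return null; // no such column on this line
        }
        return columns[csvColumn];
    }

    public static void main(String[] args) {
        System.out.println(extractColumn("0,hello there,2", 1)); // prints "hello there"
    }
}
```

Note that String.split(",") discards trailing empty strings, so in this sketch a line such as "a,b," has only two columns and column 2 yields null.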
-
csvDataToTweets
Given a buffered reader and the column that the tweets are in, uses extractColumn and a FileLineIterator to extract every tweet from the reader. (Recall that extractColumn returns null if there is no data at that column.) You should skip lines in the reader for which the tweetColumn is out of bounds.
- Parameters:
- br - a BufferedReader that represents tweets
- tweetColumn - the number of the column in the buffered reader that contains the tweet
- Returns:
- a List of tweet Strings, none of which are null (but that are not yet cleaned)
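The skip-on-null behavior can be sketched as follows. FileLineIterator is not shown in this documentation, so the sketch substitutes a plain BufferedReader.readLine() loop for it; that substitution, the class name, and the stand-in extractColumn are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect the tweet column from every line that has one.
class CsvToTweetsSketch {
    // Stand-in for the documented extractColumn helper.
    static String extractColumn(String csvLine, int csvColumn) {
        if (csvLine == null || csvColumn < 0) return null;
        String[] columns = csvLine.split(",");
        return csvColumn < columns.length ? columns[csvColumn] : null;
    }

    static List<String> csvDataToTweets(BufferedReader br, int tweetColumn) {
        List<String> tweets = new ArrayList<>();
        try {
            String line;
            // readLine() loop used here in place of FileLineIterator.
            while ((line = br.readLine()) != null) {
                String tweet = extractColumn(line, tweetColumn);
                if (tweet != null) {
                    tweets.add(tweet); // lines without the column are skipped
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tweets;
    }

    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(new StringReader("0,hello there\n1\n2,second tweet"));
        System.out.println(csvDataToTweets(br, 1)); // prints "[hello there, second tweet]"
    }
}
```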
-
cleanWord
Do not modify this method.
Cleans a word by removing leading and trailing whitespace and converting it to lower case. If the word matches the BAD_WORD_REGEX or is the empty String, returns null instead.
- Parameters:
- word - a (non-null) String to clean
- Returns:
- a trimmed, lowercase version of the word if it contains no illegal characters and is not empty, and null otherwise.
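The behavior described above might be sketched like this. The actual value of BAD_WORD_REGEX is not given in this documentation, so the pattern below (reject anything other than lowercase letters, digits, and apostrophes) is purely an assumption, as is the class name.

```java
// Hypothetical sketch of trimming, lowercasing, and rejecting "bad" words.
class CleanWordSketch {
    // Assumed pattern; the real BAD_WORD_REGEX may differ.
    static final String BAD_WORD_REGEX = ".*[^a-z0-9'].*";

    static String cleanWord(String word) {
        String cleaned = word.trim().toLowerCase();
        if (cleaned.isEmpty() || cleaned.matches(BAD_WORD_REGEX)) {
            return null; // empty or contains a disallowed character
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(cleanWord("  Hello ")); // prints "hello"
        System.out.println(cleanWord("@user"));    // prints "null"
    }
}
```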
-
parseAndCleanSentence
Splits a String representing a sentence into a sequence of words, filtering out any "bad" words from the sentence.
Hint: use the String split method and the cleanWord helper defined above. Split on a single space character, since words are delimited by spaces.
- Parameters:
- sentence - a (non-null) String representing one sentence with no end punctuation from a tweet
- Returns:
- a (non-null) list of clean words in the order they appear in the sentence. Any "bad" words are just dropped.
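The hint above can be sketched as a split followed by a filter. The cleanWord stand-in and its regular expression are assumptions (the real BAD_WORD_REGEX is not shown), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split on single spaces and keep only clean words.
class ParseSentenceSketch {
    // Stand-in for the documented cleanWord helper, with an assumed regex.
    static String cleanWord(String word) {
        String cleaned = word.trim().toLowerCase();
        return (cleaned.isEmpty() || cleaned.matches(".*[^a-z0-9'].*")) ? null : cleaned;
    }

    static List<String> parseAndCleanSentence(String sentence) {
        List<String> words = new ArrayList<>();
        for (String raw : sentence.split(" ")) {
            String cleaned = cleanWord(raw);
            if (cleaned != null) {
                words.add(cleaned); // "bad" and empty words are dropped
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(parseAndCleanSentence("Hello @bob how ARE you")); // prints "[hello, how, are, you]"
    }
}
```

Splitting on a single space means consecutive spaces produce empty strings, which the null check on cleanWord filters out.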
-
removeURLs
Do not modify this method.
Given a String, removes all substrings that look like a URL. Any word that begins with the character sequence 'http' is simply replaced with the empty string.
- Parameters:
- s - a String from which URL-like words should be removed
- Returns:
- s where each "URL-like" string has been deleted
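A rough sketch of this behavior, assuming a regular expression is used. The provided method's actual pattern is not shown in this documentation, so the regex and class name below are hypothetical.

```java
// Hypothetical sketch: delete every word that starts with "http".
class RemoveUrlsSketch {
    static String removeURLs(String s) {
        // \b anchors at the start of a word; \S* consumes the rest of it.
        return s.replaceAll("\\bhttp\\S*", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + removeURLs("abc http://example.com") + "]"); // prints "[abc ]"
    }
}
```

Only the URL-like word is deleted; the surrounding whitespace stays, so later steps still need to handle extra spaces.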
-
parseAndCleanTweet
Processes a tweet into a list of sentences, where each sentence is itself a (non-empty) list of cleaned words. Before breaking up the tweet into sentences, this method uses removeURLs to sanitize the tweet.
Hint: use removeURLs followed by tweetSplit and parseAndCleanSentence.
- Parameters:
- tweet - a String that will be split into sentences, each of which is cleaned as described above (assumed to be non-null)
- Returns:
- a (non-null) list of sentences, each of which is a (non-empty) sequence of clean words drawn from the tweet.
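The order of operations in the hint can be sketched as below. The three helpers are simplified stand-ins written for this example (their regular expressions and punctuation set are assumptions), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the removeURLs -> tweetSplit -> parseAndCleanSentence pipeline.
class ParseTweetSketch {
    // Simplified stand-ins for the documented helpers.
    static String removeURLs(String s) {
        return s.replaceAll("\\bhttp\\S*", "");
    }

    static List<String> tweetSplit(String tweet) {
        List<String> sentences = new ArrayList<>();
        for (String piece : tweet.split("[.?!;]")) { // assumed punctuation set
            if (!piece.trim().isEmpty()) sentences.add(piece.trim());
        }
        return sentences;
    }

    static List<String> parseAndCleanSentence(String sentence) {
        List<String> words = new ArrayList<>();
        for (String raw : sentence.split(" ")) {
            String w = raw.trim().toLowerCase();
            if (!w.isEmpty() && !w.matches(".*[^a-z0-9'].*")) words.add(w); // assumed "bad word" rule
        }
        return words;
    }

    static List<List<String>> parseAndCleanTweet(String tweet) {
        List<List<String>> sentences = new ArrayList<>();
        for (String sentence : tweetSplit(removeURLs(tweet))) {
            List<String> words = parseAndCleanSentence(sentence);
            if (!words.isEmpty()) {
                sentences.add(words); // sentences with no clean words are dropped
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // prints "[[check, out], [it, is, great]]"
        System.out.println(parseAndCleanTweet("Check http://t.co/abc out! It is GREAT. @@@"));
    }
}
```

The emptiness check matters because a sentence can be non-empty as text yet contain only "bad" words, as the final "@@@" does here.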
-
csvDataToTrainingData
Given a buffered reader and the column from which to extract the tweet data, computes a training set. The training set is a list of sentences, each of which is a list of words. The sentences have been cleaned up by removing URLs and non-word characters, putting all words into lower case, and stripping out punctuation. Note that empty sentences are not added to the final list of training data examples.
- Parameters:
- br - a BufferedReader that contains the tweets
- tweetColumn - the number of the column in the buffered reader that contains the tweet
- Returns:
- a list of training data examples
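As a rough end-to-end illustration, the training set can be built by flattening the sentences of every tweet into one list. Both helpers below are compact stand-ins written for this sketch (a readLine() loop instead of FileLineIterator, and assumed regular expressions), not the documented implementations.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: CSV lines -> tweets -> cleaned sentences -> training data.
class TrainingDataSketch {
    // Compact stand-in for csvDataToTweets.
    static List<String> csvDataToTweets(BufferedReader br, int tweetColumn) {
        List<String> tweets = new ArrayList<>();
        try {
            String line;
            while ((line = br.readLine()) != null) {
                String[] cols = line.split(",");
                if (tweetColumn >= 0 && tweetColumn < cols.length) tweets.add(cols[tweetColumn]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tweets;
    }

    // Compact stand-in for parseAndCleanTweet, with assumed regexes.
    static List<List<String>> parseAndCleanTweet(String tweet) {
        List<List<String>> sentences = new ArrayList<>();
        for (String sentence : tweet.replaceAll("\\bhttp\\S*", "").split("[.?!;]")) {
            List<String> words = new ArrayList<>();
            for (String raw : sentence.split(" ")) {
                String w = raw.trim().toLowerCase();
                if (!w.isEmpty() && !w.matches(".*[^a-z0-9'].*")) words.add(w);
            }
            if (!words.isEmpty()) sentences.add(words);
        }
        return sentences;
    }

    static List<List<String>> csvDataToTrainingData(BufferedReader br, int tweetColumn) {
        List<List<String>> trainingData = new ArrayList<>();
        for (String tweet : csvDataToTweets(br, tweetColumn)) {
            // Each tweet contributes zero or more non-empty sentences.
            trainingData.addAll(parseAndCleanTweet(tweet));
        }
        return trainingData;
    }

    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(new StringReader("0,Hello world! Bye.\nbad\n1,See http://x.y now"));
        // prints "[[hello, world], [bye], [see, now]]"
        System.out.println(csvDataToTrainingData(br, 1));
    }
}
```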
-