This assignment is due before 11:00PM on Wednesday, October 25, 2017. There are two parts to this homework assignment:
You’ll need to submit your solutions to both parts of the homework before the deadline.
Hash tables are an amazing data structure. Too many people go around blissfully using this magical all-operations-expected-O(1) data structure unaware of how it works. In this homework, you will implement a hash map that is similar to Java’s Map interface.
The term “trie” comes from information retrieval. It is most commonly pronounced “try”. A prefix trie is an ordered tree data structure, which stores string keys by storing characters in nodes. The “prefix” part of the name comes from the fact that common prefixes between keys share the same path from the root in the trie. For a naive trie, there is a fair amount of overhead since every character lives in its own node. One example of an optimization that addresses this object overhead issue is a PATRICIA trie, in which each node is compressed into its parent when appropriate in order to optimize space and memory usage, but we are not asking you to implement PATRICIA tries for this assignment.
The goals of this part of the assignment are:
We have put together several stub files, please click here to download.
After de-compression, you should find a folder named “src” that contains five Java files that you will need in this assignment.
You will fill in
TrieMap.java with your own Hash Table and Trie implementation respectively extending
AbstractTrieMap.java, which reduce the work for a few complex Jave
TestHarness.java, along with the
jamm.jar file and two text files outside the “src” folder, is used to monitor the running status of your code.
We have provided a test harness in
TestHarness.java that records execution duration and memory usage of both map implementations. The harness tests two datasets,
dictionary.txt (350,000 words) and
phonenumbers.txt (8,000 phone numbers, all of the form (215)-898-xxxx, encoded as letters and thus starting with “cbfiji”). You might need to move these files to the root of your project (not in the src folder) to avoid FileNotFoundExceptions. Before you do run the harness, you should take a look at each dataset and form a hypothesis on the CPU times and memory usages of each implementation on each dataset; the results might surprise you!
There are a couple steps to set this up.
TestHarness.javain your project and go to “Run As -> Run Configurations”. In Java Application, under the “arguments” tab, add the string “-javaagent:jamm.jar” to the “VM arguments” field, then Apply and Run.
TestHarness.java. It will report the CPU time and memory usage of your implementation! If it run successfully, and give zeros for your memory usage (since you haven’t implement any methods), you are good to go! (You will use this part again in Quantative Analysis).
Note: If you are using OS X, you may receive the following error in the console: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0 102.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0 102.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined. This is fine, as the test harness will still run as long as everything else is set up properly.
Recall the method of chaining, discussed in lecture and in recitation. A hash table first hashes an Object’s hash code into a bucket index appropriate for the length of the backing array. Each bucket is a linked list of map entries that can be added to, modified in place, or removed from. Note that when you add an element to a bucket, you should add the element to the front of the bucket (i.e., the head of the linked list).
HashMap.java, we have provided a fair bit of skeleton code for you, namely the constructors, the table buckets, and the hashing methods. You need to worry about the following eight method stubs:
Each method stub contains further instructions. It is critical that you read both the Javadoc comments and the implementation comments for each method. They contain necessary information on both the external and internal behavior of these methods.
It is also critical that you understand the provided code and methods. You will want to pay special attention to the
loadFactor variables, the
hash(int h, int length) method, and the
Entry inner class. Your solution will explicitly invoke these entities.
The trickiest part of this implementation will be managing pointers when resizing. It might help to draw out some examples before you begin coding this part! Additionally, you should make sure you are correctly handling null keys and null values(For example, null keys should be hashed to index zero). Be careful to explicitly handle and test those, and don’t be afraid to repeat some code if you want to explicitly isolate the null cases.
This should be a straightforward “standard” trie implementation, as described in the lecture notes.
TrieMap.java, we have provided a skeleton for you in the form of a nested class
Node and a few helper methods, and we have left you to worry about the following method stubs:
Each method stub contains further instructions. It is critical that you read both the Javadoc specification (best done by using Eclipse to read the spec) and the implementation comments for each method. They contain necessary information on behavior of these methods, hints on how to go about implementing them, and important explanations of differences from the
The trickiest part of this implementation will be correct removal of keys. You might find it helpful to work out examples of each method on paper before going forward with your implementation.
Note that unlike in your
HashMap implementation, your
TrieMap should not support null keys or values. The empty string is still valid as a key, however.
entryIterator() method to return the entries in lexicographic order with respect to the keys. You must write this as a true lazy iterator—an implementation that simply dumps all the elements into a collection and retrieves an iterator from the collection will be awarded no points. The iterator only stores enough state to do its job. Constraints:
If this last constraint about space usage is too difficult, you can ignore that constraint for half kudos.
Please answer the following 5 questions in the written part of the homework (Q1 below):
After finish your implementation of the HashMap and TrieMap, click run on the TestHarness.java. Note: this will take a few minutes to run on your computer. We recommend that you take a brief walk outdoors in the meantime, or sword-fight a friend. Now please copy and paste the output of
TestHarness.java as an answer to this question, and place it inside of a verbatim environment.
dictionary.txt, which implementation had a better running time? Which implementation had better space usage? What about for
phonenumbers.txt? Was this what you were expecting? Why or why not?
It was mentioned in lecture that tries are more space-efficient than hash tables due to the fact that they compress common prefixes. Did your implementation reflect this for dictionary.txt? What about for
phonenumbers.txt? If so, why? If not, what could you potentially do to improve the memory consumption of your TrieMap?
Based on your answer to 3, does Big-Oh notation tell us anything about the actual running time or space usage of an algorithm on a data set? Why or why not? What might an implication of this be for software development?
Add a call to
initChildren() to the end of the Node constructor in
TrieMap.java, then re-run the test harness, and consider the results for
dictionary.txt. The difference here is that now we initialize the children array as soon as the node is constructed, instead of waiting until we actually add a child (the “lazy” way). How much memory, both absolutely and relatively, does the lazy initialization save us? (Give actual numbers.) Would you say the lazy initialization is a worthwhile optimization?
You must provide the actual program output for question 1, and actual numbers for question 5. If you do not, you will receive no credit for this section.
On this assignment in particular, focus on the behaviors that are specific to one implementation or the other. Especially consider edge cases, exceptions, and null keys and values. As always, use multiple methods instead of cramming a bunch of asserts into a single test method, and be sure to demonstrate that you’ve considered “bad” inputs, exceptions, etc.
When writing unit tests for your HashMap, you may find it useful to force collisions between Objects that you are inserting in the map. Since your implementation will require the use of Java’s hashCode() method, you will have to override this method when testing your chaining.
Suppose you want to force a collision between two objects in your unit test. You can do the following:
An alternative, more clean approach would be to create a class where you can set the hashCode at construction:
Either of these approaches works for testing your chaining.
Files to submit:
We will supply the other files included in the download. Your may not call library functions except those in those in java.lang, java.util, and algs4.jar.
The goals of this assignment are to test your understanding of the material covered in sections 3.4 and 5.2 of the textbook, and the lecture and recitation materials. You should read the textbook chapters before doing this part of the assignment.
Written homeworks must be typeset in LaTeX and submitted in PDF format. Please insert a page break between each question, so that your answer to each question starts on a new page in your PDF document.
Please answer the 5 questions above about your implementation.
The insert and search operations on a hash table run in constant time on average. However, the worst case runtime for these operations is O(N). Describe a way that you could maliciously construct keys in order to force the hash table into O(N) runtime for your keys. This could cause a denial of service attack if the hash table were being used in a web server like Apache’s Tomcat. In your description please include the following details (more details are fine):
When might we want to use a hash table instead of a ternary search trie?
What would happen if we had an R-way trie where we keep our next nodes in a linked list? How would this affect memory performance? Timing? What if we used an LLRB? A hash table?
Consider modular hashing for string keys with R = 256 and M = 255. Show that this is a bad choice because any permutation of letters within a string hashes to the same value.
In class we used R = 31: