This assignment is due before 11:00PM on Wednesday, October 25, 2017. There are two parts to this homework assignment:

• A programming assignment on Hashing and Tries. This assignment also requires you to submit unit tests along with your implementation.
• A set of written questions about Hash Tables.

You’ll need to submit your solutions to both parts of the homework before the deadline.

Collaboration policy.

• You must do this assignment by yourself.
• You must never give or expose your solutions to an assignment to anyone who is taking the course. For example, you may not place your solutions in a public location (such as a website, a public code repository, or a printout left in a lab). If you leave your computer unattended, be sure to protect it with a password.
• You must never view someone else’s solutions to a programming assignment (or variant of an assignment). For example, you may not download solutions to a Coursera version of the assignment from the web. You may not use Coursera’s autograder to check the correctness of your solution.
• All solutions are checked with plagrism detection software. Any assignment that is flagged by the software will be automatically referred to the Office of Student Conduct, which will adjudicate whether the course collaboration policy was violated. The first violation will result in your overall course grade being decreased by one letter grade. A second violation will result in an F in the class.

# Programming Assignment 5: Hashing and Tries

Hash tables are an amazing data structure. Too many people go around blissfully using this magical all-operations-expected-O(1) data structure unaware of how it works. In this homework, you will implement a hash map that is similar to Java’s Map interface.

The term “trie” comes from information retrieval. It is most commonly pronounced “try”. A prefix trie is an ordered tree data structure, which stores string keys by storing characters in nodes. The “prefix” part of the name comes from the fact that common prefixes between keys share the same path from the root in the trie. For a naive trie, there is a fair amount of overhead since every character lives in its own node. One example of an optimization that addresses this object overhead issue is a PATRICIA trie, in which each node is compressed into its parent when appropriate in order to optimize space and memory usage, but we are not asking you to implement PATRICIA tries for this assignment.

The goals of this part of the assignment are:

• Understand the underlying implementation of Hash Tables and Tries.
• Implement those two extremely efficient and useful data structures by yourself.

## Getting started

After de-compression, you should find a folder named “src” that contains five Java files that you will need in this assignment.

• BaseAbstractMap.java
• AbstractTrieMap.java
• HashMap.java
• TrieMap.java
• TestHarness.java

You will fill in HashMap.java and TrieMap.java with your own Hash Table and Trie implementation respectively extending BaseAbstractMap.java and AbstractTrieMap.java, which reduce the work for a few complex Jave Map methods.

TestHarness.java, along with the jamm.jar file and two text files outside the “src” folder, is used to monitor the running status of your code.

### Test harness setup

We have provided a test harness in TestHarness.java that records execution duration and memory usage of both map implementations. The harness tests two datasets, dictionary.txt (350,000 words) and phonenumbers.txt (8,000 phone numbers, all of the form (215)-898-xxxx, encoded as letters and thus starting with “cbfiji”). You might need to move these files to the root of your project (not in the src folder) to avoid FileNotFoundExceptions. Before you do run the harness, you should take a look at each dataset and form a hypothesis on the CPU times and memory usages of each implementation on each dataset; the results might surprise you!

There are a couple steps to set this up.

1. Right click on your project and go to “Build Path -> Configure Build Path -> Libraries -> Add JARs”. Select jamm.jar from the file selector drop down and hit Okay.
2. Right-click on TestHarness.java in your project and go to “Run As -> Run Configurations”. In Java Application, under the “arguments” tab, add the string “-javaagent:jamm.jar” to the “VM arguments” field, then Apply and Run.
3. Now, run TestHarness.java. It will report the CPU time and memory usage of your implementation! If it run successfully, and give zeros for your memory usage (since you haven’t implement any methods), you are good to go! (You will use this part again in Quantative Analysis).

Note: If you are using OS X, you may receive the following error in the console: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0 102.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0 102.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined. This is fine, as the test harness will still run as long as everything else is set up properly.

## HashMap

Recall the method of chaining, discussed in lecture and in recitation. A hash table first hashes an Object’s hash code into a bucket index appropriate for the length of the backing array. Each bucket is a linked list of map entries that can be added to, modified in place, or removed from. Note that when you add an element to a bucket, you should add the element to the front of the bucket (i.e., the head of the linked list).

In HashMap.java, we have provided a fair bit of skeleton code for you, namely the constructors, the table buckets, and the hashing methods. You need to worry about the following eight method stubs:

Each method stub contains further instructions. It is critical that you read both the Javadoc comments and the implementation comments for each method. They contain necessary information on both the external and internal behavior of these methods.

It is also critical that you understand the provided code and methods. You will want to pay special attention to the threshold and loadFactor variables, the hash(int h, int length) method, and the Entry inner class. Your solution will explicitly invoke these entities.

The trickiest part of this implementation will be managing pointers when resizing. It might help to draw out some examples before you begin coding this part! Additionally, you should make sure you are correctly handling null keys and null values(For example, null keys should be hashed to index zero). Be careful to explicitly handle and test those, and don’t be afraid to repeat some code if you want to explicitly isolate the null cases.

## TrieMap

This should be a straightforward “standard” trie implementation, as described in the lecture notes.

In TrieMap.java, we have provided a skeleton for you in the form of a nested class Node and a few helper methods, and we have left you to worry about the following method stubs:

Each method stub contains further instructions. It is critical that you read both the Javadoc specification (best done by using Eclipse to read the spec) and the implementation comments for each method. They contain necessary information on behavior of these methods, hints on how to go about implementing them, and important explanations of differences from the HashMap implementation.

The trickiest part of this implementation will be correct removal of keys. You might find it helpful to work out examples of each method on paper before going forward with your implementation.

Note that unlike in your HashMap implementation, your TrieMap should not support null keys or values. The empty string is still valid as a key, however.

### Extra credit: Lazy iterator

Implement the entryIterator() method to return the entries in lexicographic order with respect to the keys. You must write this as a true lazy iterator—an implementation that simply dumps all the elements into a collection and retrieves an iterator from the collection will be awarded no points. The iterator only stores enough state to do its job. Constraints:

• The running time must be linear in the number of elements in the trie.
• The space usage must be proportional to the height of the trie.

If this last constraint about space usage is too difficult, you can ignore that constraint for half kudos.

## Quantitative Analysis

Please answer the following 5 questions in the written part of the homework (Q1 below):

1. After finish your implementation of the HashMap and TrieMap, click run on the TestHarness.java. Note: this will take a few minutes to run on your computer. We recommend that you take a brief walk outdoors in the meantime, or sword-fight a friend. Now please copy and paste the output of TestHarness.java as an answer to this question, and place it inside of a verbatim environment.

2. For dictionary.txt, which implementation had a better running time? Which implementation had better space usage? What about for phonenumbers.txt? Was this what you were expecting? Why or why not?

3. It was mentioned in lecture that tries are more space-efficient than hash tables due to the fact that they compress common prefixes. Did your implementation reflect this for dictionary.txt? What about for phonenumbers.txt? If so, why? If not, what could you potentially do to improve the memory consumption of your TrieMap?

4. Based on your answer to 3, does Big-Oh notation tell us anything about the actual running time or space usage of an algorithm on a data set? Why or why not? What might an implication of this be for software development?

5. Add a call to initChildren() to the end of the Node constructor in TrieMap.java, then re-run the test harness, and consider the results for TrieMap and dictionary.txt. The difference here is that now we initialize the children array as soon as the node is constructed, instead of waiting until we actually add a child (the “lazy” way). How much memory, both absolutely and relatively, does the lazy initialization save us? (Give actual numbers.) Would you say the lazy initialization is a worthwhile optimization?

You must provide the actual program output for question 1, and actual numbers for question 5. If you do not, you will receive no credit for this section.

## Style & Tests

On this assignment in particular, focus on the behaviors that are specific to one implementation or the other. Especially consider edge cases, exceptions, and null keys and values. As always, use multiple methods instead of cramming a bunch of asserts into a single test method, and be sure to demonstrate that you’ve considered “bad” inputs, exceptions, etc.

### Mocking hashCode()

When writing unit tests for your HashMap, you may find it useful to force collisions between Objects that you are inserting in the map. Since your implementation will require the use of Java’s hashCode() method, you will have to override this method when testing your chaining.

Suppose you want to force a collision between two objects in your unit test. You can do the following:

An alternative, more clean approach would be to create a class where you can set the hashCode at construction:

Either of these approaches works for testing your chaining.

## Deliverables

Files to submit: HashMap.java, TrieMap.java and HashMapTest.java, TrieMapTest.java.

We will supply the other files included in the download. Your may not call library functions except those in those in java.lang, java.util, and algs4.jar.

# Written Assignment 5: Hash Tables

The goals of this assignment are to test your understanding of the material covered in sections 3.4 and 5.2 of the textbook, and the lecture and recitation materials. You should read the textbook chapters before doing this part of the assignment.

Written homeworks must be typeset in LaTeX and submitted in PDF format. Please insert a page break between each question, so that your answer to each question starts on a new page in your PDF document.

## Q2. Hacking a Hash Table

The insert and search operations on a hash table run in constant time on average. However, the worst case runtime for these operations is O(N). Describe a way that you could maliciously construct keys in order to force the hash table into O(N) runtime for your keys. This could cause a denial of service attack if the hash table were being used in a web server like Apache’s Tomcat. In your description please include the following details (more details are fine):

• What assumption does the constant time runtime rely on? What do your maliciously constructed keys do to that assumption?
• What information would you need to know in order to effectively hack the symbol table?
• Would your attack hurt the performance of a separate chaining implementation of a hash table, or a linear probing implementation, or both? Why?
• What alternate symbol table data structure could the designer replace the hash table with in order to thwart your attack? Why would that work? Would it cause different average runtime behavior for non-malicious calls to the new symbol table’s insert and search operations?

## Q3. Hash tables versus TSTs

When might we want to use a hash table instead of a ternary search trie?

## Q4. R-way trie with a linked list

What would happen if we had an R-way trie where we keep our next nodes in a linked list? How would this affect memory performance? Timing? What if we used an LLRB? A hash table?

## Q5. Making a hash of strings

Consider modular hashing for string keys with R = 256 and M = 255. Show that this is a bad choice because any permutation of letters within a string hashes to the same value.

In class we used R = 31: