CIS 542 - Spring 2012

Lab Assignment #2

January 26, 2011

Due: February 2, 4:30pm

Introduction
Does the choice of data structure in an application really make a difference to its performance? Sure, we always talk about something being O(n log n) or whatever in the worst case, but does it really matter in practice? In this assignment, you will measure the performance of a simple application, then change the data structure and algorithm it uses, and lastly consider the choice of programming language and implementation of the data structure.

The first three parts of this assignment involve Java programming and are quite simple; you should have no problems with those. The second parts (particularly the last one) involve C programming and are substantially more complex. Do not wait until the last minute to attempt to finish those two parts! You may work with one other person on this assignment.

Please note that, for each part, you will submit your source code but also you will submit a writeup in which you answer the questions posed in each section below. Each student should do their own writeup, even if they are working with someone else. Also, please follow the naming conventions suggested in the assignment, as it will help the TAs with grading.


Before you begin
Download CompareDocs.java, which is a program that reads two text files (specified by command-line arguments), parses them word-by-word and keeps track of the distinct words in ArrayLists (one per text file), and then compares the elements of the ArrayList one at a time to calculate the number of words that they have in common.

Also download the text of Tolstoy's "War & Peace" and "Anna Karenina" (retrieved from Project Gutenberg). Then compile and run the program using those two files to make sure it's working correctly. It make take a while to run, so be patient! The output should indicate that they have 9307 words in common. Note: if you get an output other than 9307 words, there may be an issue with the Scanner class in your Java library; you can use this version, which uses BufferedReader instead.

Now, measure how long it takes to execute the CompareDocs program for those inputs. Depending on which OS you're using and how brave you're feeling, you may want to use a system utility like "time" or trusty old System.currentTimeMillis. We will look at microbenchmarking tools like Caliper at a later time.

Be sure to measure the entire time it takes to run the program, including reading the text files and looking through the data structures. Which technique you use to measure performance is up to you, but make sure you use the same one for all parts of this assignment, and think about any threats to validity that may be introduced by your choice. Also be sure to run experiments on the same machine for all parts of the assignment (duh), and try to limit the impact of outside influences, such as other processes running on the same host. In your writeup, please indicate which machine you used for your experiments.


Part 1 (5 points)
Modify the "compareFiles" method in CompareDocs.java and create a new program (named CompareDocs1.java) in which you use a java.util.LinkedList to hold the words in each document instead of an ArrayList. Do not change the algorithm by which elements in the lists are compared, or the manner in which the return value is calculated; just change "ArrayList" to "LinkedList" on lines 15 and 16 (easy, right?).

Now run the program using LinkedLists. You will notice that it takes a lot longer to run using the LinkedList solution compared to the ArrayList solution. Why is that happening? Discuss your findings in your writeup.

 

Part 2 (5 points)
Modify the "compareFiles" method in CompareDocs1.java and create a new program (named CompareDocs2.java) in which you still use a LinkedList but change the mechanism by which elements in the lists are compared (lines 46-52) to use the LinkedList "contains" method, rather than iterating over the elements one-by-one and checking for equality.

Now measure how long it takes to run the program using LinkedLists and the "contains" method. Comparing this implementation to the one from Part 1, which is faster? Why do you think that might be?

 

Part 3 (10 points)
Modify the "compareFiles" method in CompareDocs.java and create a new program (named CompareDocs3.java) in which you use a java.util.HashSet to hold the words in each document instead of an ArrayList. You will also need to make changes to the code in which elements in the lists are compared to match the HashSet API.

Now measure how long it takes to run the program using HashSets. Comparing the HashSet solution to the ArrayList and LinkedList solutions, which is fastest? Why do you think that might be?

Does changing the initial capacity in your HashSet affect your results? Explain.

 

Part 4 (10 points)
Java virtual machines are pretty fast, but are they as fast as a "native" implementation? In this part of the assignment, you will implement the "CompareDocs3" program in C, measure its performance when reading the two large text files, and compare your results with those from your Java program from Part 3.

Create a file called comparedocs.c, in which the "main" function does the following:

Your program should still report that the two files have 9307 words in common.

Now time how fast your C implementation runs. Note that, depending on which technique you were using to time your program, you may have to change how you measure performance so that it's a fair comparison between the Java and C programs.

Were you able to improve the performance compared to the Java program? Why or why not? Does the capacity of the hashtable (the number of "buckets") affect the performance? Explain.

 

Part 5 (20 points)
The hashtable implementation that we've provided you does not automatically resize itself like the Java one does. This means that if you set the capacity to a value that's too small, each bucket will get very large, and you'll lose the ability to find values in time closed to O(1).

Modify the hashtable implementation from Part 4 and create a new file (named hashtable-resize.c) as follows:

You should not need to change your implementation of comparedocs.c from Part 4 for this part of the assignment. You should only be changing the hashtable implementation.

Upon executing your new program, make sure it still reports that the files have 9307 words in common.

Now compare the performance of this new hashtable implementation to that of Part 4. Does the time saved by iterating over smaller linked lists make up for the time it takes to resize the hashtable? What value of MAXLOAD seems to give you the best performance?


Academic Honesty
As in the previous lab assignment, you may work with one other student on this assignment (even when you are outside the lab). However, you may not discuss or share solutions with any other students, nor should you be receiving any help from outside sources, including students not taking this course or online resources (except for the hashtable implementation provided above). If you run into problems, please ask a member of the teaching staff for help. Failure to abide by the academic honesty guidelines will result in a grade of 0 for this assignment.

Note that each student must do his/her own writeup; copying your partner's writeup verbatim and submitting it as your own will be considered academic dishonesty.


Submission
You should submit the assignment before 4:30pm on Thursday, February 2. After that, late submissions will be subject to a 10% per day deduction.

Make sure your code runs on eniac! The TAs will grade this assignment (and all others) on the eniac machines, so make sure it is possible to compile and run your code there, even if you did your experiments on a different platform.

All students, regardless of whether they are working alone or with someone else, must submit their solutions individually. For this lab assignment, you should submit the .java files for Parts 1-3, and any .c or .h files for Parts 4 and 5 (including the ones we provided to you); you should also include a Makefile for Parts 4 and 5. Your writeup should consist of a single plain-text or PDF file in which you answer the questions posed in each part, and indicate the name of the student with whom you worked, if applicable (note that you may work with a maximum of one other student). Again, please be sure that you do your own writeup; you can discuss your findings with your partner, but each of you should write your own document analyzing your results.

Put all your files into a single zip or tar file, and then follow the general instructions on the Course Overview page. Failure to properly follow the submission instructions could result in a delay of grading your assignment and/or a lateness penalty, so please be sure to do it correctly!


Updated: Thu, Jan 26, 10:27am