There are two parts to this assignment, with two separate due dates.

• A set of written questions about sorting algorithms. This part is due before 11:00PM on Wednesday, September 27, 2017.
• A programming assignment to use sorting algorithms to analyze data about scientific publications. This part is due before 11:00PM on Saturday, September 30, 2017.

Collaboration policy.

• You must do this assignment by yourself.
• You must never give or expose your solutions to an assignment to anyone who is taking the course. For example, you may not place your solutions in a public location (such as a website, a public code repository, or a printout left in a lab). If you leave your computer unattended, be sure to protect it with a password.
• You must never view someone else’s solutions to a programming assignment (or variant of an assignment). For example, you may not download solutions to a Coursera version of the assignment from the web.
• All solutions are checked with plagiarism detection software. Any assignment that is flagged by the software will be automatically referred to the Office of Student Conduct, which will adjudicate whether the course collaboration policy was violated. The first violation will result in your overall course grade being decreased by one letter grade. A second violation will result in an F in the class.

# Programming Assignment 3: Sorting large data sets

In this assignment, we will use sorting algorithms to analyze a large data set containing records of publications from the Associate for Computational Linguistics. The data set is called the ACL Anthology Network (AAN for short). The AAN contains more than 20,000 research papers that were published in more than 300 scientific venues by more than 17,000 authors, with 125,000 citations between the papers.

We have preprocessed the data set for you into 5-tuples containing the following information in tab separated files.

<citation count, paper_title, year, author, institution>

For example:

4   Experiments On Sentence Boundary Detection  2000    Stevenson, Mark     Reuters Ltd., London UK
4   Experiments On Sentence Boundary Detection  2000    Gaizauskas, Robert J.   University of Sussex, East Sussex UK
0   DP: A Detector For Presuppositions In Survey Questions  2000    Wiemer-Hastings, Katja  University of Memphis, Memphis TN
0   DP: A Detector For Presuppositions In Survey Questions  2000    Wiemer-Hastings, Peter  University of Edinburgh, Edinburgh UK
0   DP: A Detector For Presuppositions In Survey Questions  2000    Rajah, Sonya    University of Memphis, Memphis TN


You will sort these tuples in order to answer questions like

• Which papers are the most cited in the data set?
• How many papers are published each year?
• How many unique authors are in the data set?
• How many unique institutions are in the data set?
• What institutions have the most authorsv* What are the top publishing authors at each institution?
• What is the median number of citations in the data set?
• What is your professor’s H-Index?

The essential thing for answering those questions is being able to sort a set of AAN records by different fields of the 5-tuples. And that is the focus of this assignment.

The goals of this part of the assignment are:

• Understand the Java Comparable and Comparator
• Understand the properties of different sorting algorithms, including running times and stability,
• Apply sorting algorithms to enable analysis of large amounts of data.

## AAN Records

You should represent each of the AAN 5-tuple record illustrated above by a class implementing the following API:

### Comparable vs. Comparator

As we discussed in class, implementing Comparable allows us to define one way to compare between the user-defined class instances. However, in many cases, it may not be flexible enough if we need to sort instances of the same class using different criteria given different scenarios. That is when we should turn to Comparator.

In your AANRecord, you should implement both and try to understand the differences between them while writing the code.

For the Comparable part, we ask you to override compareTo method following the ordering given in the comments above.

For the Comparator part, we ask you to write a separate one for each single data field to allow sorting according to values of only that field. In your AANRecord class, all data fields are of either String or int type. You should follow the convention of comparing those two types of data in Java when overriding the compare method of the Comparator interface. You may find the java.lang.String.compareTo method useful.

Notice that more complex and flexible criteria can be designed and implemented with Comparator although we do not ask you to do so. It can work for any reasonable rules, for example, you could also write a Comparator to sort everything first by the institution and then by the citation count, i.e. using the combination of two fields. That being said, Comparator does not need to be constrained to use only the institution or to use only the citation count in comparison. You are encouraged to think of reasonable ways to compare the instances and try them out by yourself even though we will only test for the single field based ones.

### Reading and parsing the input file

As mentioned before, the AAN records are stored in tab separated files formatted as the 5-tuples. Your loadRecords method should read and parse those files.

A sample ANN record file can be found here: sample_records.tsv. We will use similar files for testing.

You can use any internal Java library to read input from files on the disk. You can also use In from the algs4.jar.

To parse the input file, you may find the java.lang.String.split method useful.

### Sorting

In this assignment, you may use the textbook’s implementation of sorting algorithms. You should think carefully about which sorting method you select, and whether it is appropriate for the different use cases. Is it stable? Is it efficient?

Notice that for your sort function, if the fieldComparator is pased in a null value, you should use the Comparable implementation. Otherwise, the corresponding Comparator implementation should be used. We will only test the Comparable and Comparator as specified in AANRecord API, but you are free to try others by yourself.

### Merge k sorted lists

In many applications, data instances are not stored in the same place. Instead, they are split into several parts where each part is in sorted order (for example, a set of files that are individually sorted). Merging k sorted lists together can be done with running time O(N log k) instead of O(N log N), where N is the total number of overall records. Your mergeSortedLists method should implement such an O(N log k) algorithm.

Your algorithm should be very similar to the merging process in the mergesort. For this part of the assignment you should use a priority queue as the basis of your implementation. You can assume the comparisons are done based on your Comparable implementation.

Hints: You may find IndexMinPQ from algs4.jar very helpful. An alternative way is to use either Java’s PriorityQueue or the textbook’s MinPQ. Then you need to wrap around your AANRecord with another class so that additional information needed during merging could be included. The wrapper class itself may need to be Comparable to be passed into a priority queue. You can borrow your compareTo method in AANRecord to implement it.

### Exceptions

Your code show throw an IllegalArgumentException when the String or List fields are passed in null values, the year of a record is less than 1962, or the citation count of any record is less than 0.

Your code show throw an IllegalArgumentException when the file specified by filename does not exist. You can assume that all input files are of the valid 5-tuple format.

## Analyzing the Data

You should implement the following API to analyze the AAN records.

### Analyzing records

You should implement above methods with appropriate sorting criteria, which could be one of the Comparable or Comparator defined in the previous section, or you can write new ones outside the AANRecord class whenever necessary.

### De-duplicating

No records from the file will be exactly the same, but as you may notice, they could share some fields. You should de-duplicate records reasonably to let your RecordAnalyzer correctly print or return results.

#### What defines a paper?

Three fields (citationCount, paperTitle, year) together define a paper. It means that if any pair of records share those three fields, they are considered as refering to the SAME paper. The author name and the institution name may vary.

Examples:

0   Perspectives On Parsing Issues  1981    Riesbeck, Christopher K.    Oregon State University, Corvallis OR
1   Perspectives On Parsing Issues  1981    Robinson, Jane J.   RAND Corporation, Santa Monica CA


Two records above have different papers because the citation counts are different.

0   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    Feiner, Steven K.   Columbia University, New York NY
0   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    McKeown, Kathleen R.    Columbia University, New York NY
2   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    Feiner, Steven K.   Columbia University, New York NY
2   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    McKeown, Kathleen R.    Columbia University, New York NY


The first two records refer to the same paper, and the last two records refer to the same paper. But the first two and the last have different papers because the citation counts are different.

#### What defines an author?

Two fields (authorName, institutionName) together define an author. So if any pair of records share those two fields, they are considered as refering to the SAME author.

Aist, Gregory   Florida Institute for Human and Machine Cognition, Pensacola FL
Aist, Gregory   NASA Ames Research Center, Moffett Field CA


Two authors above are different because they are associated with different institution names.

### Exceptions

Your code show throw an IllegalArgumentException when the List field is passed in null values or topK <= 0.

## Deliverables

Files to submit: AANRecord.java, RecordAnalyzer.java

Files to submit: AANRecordTest.java, RecordAnalyzerTest.java (plus any AAN record file you use in your own unit tests).

# Written Assignment 2: Sorting

The goals of this assignment are to test your understanding of the material covered in sections 1.3 and 1.4 of the textbook, and the lecture and recitation materials. You should read the textbook chapters before doing this part of the assignment.

Written homeworks must be typeset in LaTeX and submitted in PDF format. Please insert a page break between each question, so that your answer to each question starts on a new page in your PDF document.

## Q1. Comparing playing cards

We would like to sort playing cards from a deck. Associated with each card is a denomination (1 to 13) and a suit (CLUBS < DIAMONDS < HEARTS < SPADES). A card c1 is considered less than a card c2 if either of the following is true:

• the suit of c1 is less than the suit of c2, or
• c1 and c2 are of the same suit, but the denomination of c1 is less than the denomination of c2. The Card class is implemented in Java as follows. Complete the compareTo() function, implementing the ordering described on the previous page, and assuming that the argument is not null.

## Q2. Sorting playing cards

We would like to sort playing cards from a deck. Associated with each card is a denomination (1 to 13) and a suit (CLUBS < DIAMONDS < HEARTS < SPADES). A card c1 is considered less than a card c2 if either of the following is true:

• the suit of c1 is less than the suit of c2, or
• c1 and c2 are of the same suit, but the denomination of c1 is less than the denomination of c2. Suppose that the variable cards is an array of cards. We could sort it, using your compareTo function, with a call to MergeX.sort(cards). Which of the following code fragments would produce an equivalent final result? Circle all equivalent code fragments.

• Option 1:
• Option 2:
• Option 3:
• Option 4:
• Option 5:
• Option 6:
• Option 7:
• Option 8:

## Q3. Backing off to insertion sort

One strategy to speeding up quicksort is to stop quicksort when it reaches subarrays of size 10 or smaller. After quicksort completes, we run insertion sort on the entire array to make sure that everything is in order. Prove that applying insertion sort to the entire array is a linear time step, despite insertion sort’s worst case N^2 performance.

## Q4. Detecting a duplicate

Given k sorted arrays containing N keys in total, design an algorithm that determine whether there is any key that appears more than once. Your algorithm should use extra space at most proportional to k. For full credit, it should run in time at most proportional to N log k in the worst case; for partial credit, time proportional to Nk.

What is the order of growth of the worst case running time of your algorithm as a function of N and k? Briefly justify your answer.

## Q5. Picking the right sorting algorithm for the job

Choose appropriate sorting algorithm for each type of input described below, and give a one sentence explanation why it is the best choice. For the sorting algorithms, you can pick from among Mergesort, insertion sort, quick sort, selection sort, and 3-way quicksort. Please pick all that apply. An algorithm can be chosen multiple times for different questions.

• I want to sort the list of CIS 121 students so that they grouped first by recitation (in ascending order), and their names appear alphabetically within each recitation. Which algorithm(s) should I use?
• NASA is processing noisy satellite data. Occasionally some packets arrive in a different sequence than when they were sent by the satellite. They want to sort the packets by their timestamp instead of their arrival time.
• You are performing sorting on an embedded device with limited memory.
• The Social Security Administration wants to sort all Americans by their birth year.