Skip to main content

There are two parts to this assignment, with two separate due dates.

Collaboration policy.

Programming Assignment 3: Sorting large data sets

In this assignment, we will use sorting algorithms to analyze a large data set containing records of publications from the Associate for Computational Linguistics. The data set is called the ACL Anthology Network (AAN for short). The AAN contains more than 20,000 research papers that were published in more than 300 scientific venues by more than 17,000 authors, with 125,000 citations between the papers.

We have preprocessed the data set for you into 5-tuples containing the following information in tab separated files.

<citation count, paper_title, year, author, institution>

For example:

4   Experiments On Sentence Boundary Detection  2000    Stevenson, Mark     Reuters Ltd., London UK
4   Experiments On Sentence Boundary Detection  2000    Gaizauskas, Robert J.   University of Sussex, East Sussex UK
0   DP: A Detector For Presuppositions In Survey Questions  2000    Wiemer-Hastings, Katja  University of Memphis, Memphis TN
0   DP: A Detector For Presuppositions In Survey Questions  2000    Wiemer-Hastings, Peter  University of Edinburgh, Edinburgh UK
0   DP: A Detector For Presuppositions In Survey Questions  2000    Rajah, Sonya    University of Memphis, Memphis TN

You will sort these tuples in order to answer questions like

The essential thing for answering those questions is being able to sort a set of AAN records by different fields of the 5-tuples. And that is the focus of this assignment.

The goals of this part of the assignment are:

AAN Records

You should represent each of the AAN 5-tuple record illustrated above by a class implementing the following API:

public class AANRecord implements Comparable<AANRecord> {
    public static final Comparator<AANRecord> PAPER_TITLE_ORDER 
    public static final Comparator<AANRecord> YEAR_ORDER 
    public static final Comparator<AANRecord> AUTHOR_ORDER 
    public static final Comparator<AANRecord> INSTITUTION_ORDER 
    public static final Comparator<AANRecord> CITATION_COUNT_ORDER 

    public AANRecord(
        String paperTitle, 
        int year, 
        String author, 
        String institution, 
        int citationCount)

    // Compares records with the following ordering:
    // the papers should be first grouped by year 
    // (for example, all papers from the 2000 appearing together) with
    // earlier years appearing first, then by citationCount (where the 
    // most cited papers for a year appear first), then by paperTitle 
    // (for cases where there are ties in the citation count for a year).
    @Override
    public int compareTo(AANRecord that)

    public String getPaperTitle()
    public int getYear() 
    public String getAuthor()
    public String getInstitution() 
    public int getCitationCount() 
    
    // Allows sorting according to a single data field in this class while 
    // ignoring the others fields.
    private static class PaperTitleOrder implements Comparator<AANRecord> 
    private static class YearOrder implements Comparator<AANRecord> 
    private static class AuthorOrder implements Comparator<AANRecord> 
    private static class InstitutionOrder implements Comparator<AANRecord> 
    // The CitationCountOrder will return the lowest count first. In cases 
    // where you want the highest count first, you can always use reverse() 
    // or reverseOrder() from Java's Comparator interface.
    private static class CitationCountOrder implements Comparator<AANRecord> 

    // Reads and parses a file of AAN records. Returns a list of AANRecords 
    // in the same order as listed in the file.
    public static List<AANRecord> loadRecords(String filename)

    // Sorts a list of AAN records, return them in the sorted order specified
    // by the fieldCompartor.
    public static List<AANRecord> sort(
        List<AANRecord> records, 
        Comparator<AANRecord> fieldComparator)

    // Merges a set of sorted lists, in O(N log k) time.
    public static List<AANRecord> mergeSortedLists(
        List<List<AANRecord>> listOfLists)
}

Comparable vs. Comparator

As we discussed in class, implementing Comparable allows us to define one way to compare between the user-defined class instances. However, in many cases, it may not be flexible enough if we need to sort instances of the same class using different criteria given different scenarios. That is when we should turn to Comparator.

In your AANRecord, you should implement both and try to understand the differences between them while writing the code.

For the Comparable part, we ask you to override compareTo method following the ordering given in the comments above.

For the Comparator part, we ask you to write a separate one for each single data field to allow sorting according to values of only that field. In your AANRecord class, all data fields are of either String or int type. You should follow the convention of comparing those two types of data in Java when overriding the compare method of the Comparator interface. You may find the java.lang.String.compareTo method useful.

Notice that more complex and flexible criteria can be designed and implemented with Comparator although we do not ask you to do so. It can work for any reasonable rules, for example, you could also write a Comparator to sort everything first by the institution and then by the citation count, i.e. using the combination of two fields. That being said, Comparator does not need to be constrained to use only the institution or to use only the citation count in comparison. You are encouraged to think of reasonable ways to compare the instances and try them out by yourself even though we will only test for the single field based ones.

Reading and parsing the input file

As mentioned before, the AAN records are stored in tab separated files formatted as the 5-tuples. Your loadRecords method should read and parse those files.

A sample ANN record file can be found here: sample_records.tsv. We will use similar files for testing.

You can use any internal Java library to read input from files on the disk. You can also use In from the algs4.jar.

To parse the input file, you may find the java.lang.String.split method useful.

Sorting

In this assignment, you may use the textbook’s implementation of sorting algorithms. You should think carefully about which sorting method you select, and whether it is appropriate for the different use cases. Is it stable? Is it efficient?

Notice that for your sort function, if the fieldComparator is pased in a null value, you should use the Comparable implementation. Otherwise, the corresponding Comparator implementation should be used. We will only test the Comparable and Comparator as specified in AANRecord API, but you are free to try others by yourself.

Merge k sorted lists

In many applications, data instances are not stored in the same place. Instead, they are split into several parts where each part is in sorted order (for example, a set of files that are individually sorted). Merging k sorted lists together can be done with running time O(N log k) instead of O(N log N), where N is the total number of overall records. Your mergeSortedLists method should implement such an O(N log k) algorithm.

Your algorithm should be very similar to the merging process in the mergesort. For this part of the assignment you should use a priority queue as the basis of your implementation. You can assume the comparisons are done based on your Comparable implementation.

Hints: You may find IndexMinPQ from algs4.jar very helpful. An alternative way is to use either Java’s PriorityQueue or the textbook’s MinPQ. Then you need to wrap around your AANRecord with another class so that additional information needed during merging could be included. The wrapper class itself may need to be Comparable to be passed into a priority queue. You can borrow your compareTo method in AANRecord to implement it.

Exceptions

Your code show throw an IllegalArgumentException when the String or List fields are passed in null values, the year of a record is less than 1962, or the citation count of any record is less than 0.

Your code show throw an IllegalArgumentException when the file specified by filename does not exist. You can assume that all input files are of the valid 5-tuple format.

Analyzing the Data

You should implement the following API to analyze the AAN records.

public class RecordAnalyzer {
    public RecordAnalyzer(List<AANRecord> records)

    /**
     * Prints the topK most cited papers in the data set in the exact format of
     * citationCount,paperTitle,year 
     * with one paper per line, without the same paper (with exact same values in 
     * those three fields) repeatedly printed. 
     * 
     * The printed lines should be in sorted order: first by citation count from 
     * large to small, then by paperTitle in lexicographic order, finally by year 
     * from early to late.
     * 
     * @param topK the top K lines to be printed. Print all distinct papers when 
     *        topK >= number of distinct papers.
     * @throws IllegalArgumentException when topK <= 0.
     */
    public void printMostCited(int topK)

    /**
     * Prints the number of distinct papers each year in the exact format of
     * year,numPapers
     * with one year per line, without the same paper repeatedly counted or a 
     * year repeatedly printed out.
     * 
     * The printed lines should be in sorted order by year from early to late.
     */
    public void printNumPapersPerYear()

    /**
     * Prints a list of all distinct authors in the exact format of 
     * authorName,institutionName
     * with one author per line, without the same author (with exact same values 
     * in those two fields) repeatedly printed.
     * 
     * The printed lines should be in sorted order: first by authorName then by 
     * institutionName, both in lexicographic order.
     */
    public void printUniqueAuthors()

    /**
     * Prints the number of distinct authors per institution in the exact format:
     * numAuthors,institutionName
     * with one institution per line, without the same author repeatedly counted 
     * or a institution repeatedly printed out.
     * 
     * The printed lines should be in sorted order by institutionName in 
     * lexicographic order.
     */
    public void printNumAuthorsPerInstitution()

    /**
     * Prints the total number of citation for each author in the exact format:
     * institutionName,authorName,totalCitationsForAuthor
     * with one author per line, without the same paper repeatedly counted per 
     * author or the same author repeatedly printed out. 
     * 
     * The printed lines should be in sorted order: first by institutionName 
     * in lexicographic order, then by authorName in lexicographic order.
     */
    public void printAuthorCitationCounts()

    /**
     * De-duplicates the records so that each paper appears only once, then
     * return the median number of citation counts for all distinct papers.
     */
    public double getMedianCitations()
}

Analyzing records

You should implement above methods with appropriate sorting criteria, which could be one of the Comparable or Comparator defined in the previous section, or you can write new ones outside the AANRecord class whenever necessary.

De-duplicating

No records from the file will be exactly the same, but as you may notice, they could share some fields. You should de-duplicate records reasonably to let your RecordAnalyzer correctly print or return results.

What defines a paper?

Three fields (citationCount, paperTitle, year) together define a paper. It means that if any pair of records share those three fields, they are considered as refering to the SAME paper. The author name and the institution name may vary.

Examples:

0   Perspectives On Parsing Issues  1981    Riesbeck, Christopher K.    Oregon State University, Corvallis OR
1   Perspectives On Parsing Issues  1981    Robinson, Jane J.   RAND Corporation, Santa Monica CA

Two records above have different papers because the citation counts are different.

0   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    Feiner, Steven K.   Columbia University, New York NY
0   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    McKeown, Kathleen R.    Columbia University, New York NY
2   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    Feiner, Steven K.   Columbia University, New York NY
2   Interactive Multimedia Explanation For Equipment Maintenance And Repair 1990    McKeown, Kathleen R.    Columbia University, New York NY

The first two records refer to the same paper, and the last two records refer to the same paper. But the first two and the last have different papers because the citation counts are different.

What defines an author?

Two fields (authorName, institutionName) together define an author. So if any pair of records share those two fields, they are considered as refering to the SAME author.

Aist, Gregory   Florida Institute for Human and Machine Cognition, Pensacola FL
Aist, Gregory   NASA Ames Research Center, Moffett Field CA

Two authors above are different because they are associated with different institution names.

Exceptions

Your code show throw an IllegalArgumentException when the List field is passed in null values or topK <= 0.

Deliverables

Files to submit: AANRecord.java, RecordAnalyzer.java

Files to submit: AANRecordTest.java, RecordAnalyzerTest.java (plus any AAN record file you use in your own unit tests).

Written Assignment 2: Sorting

The goals of this assignment are to test your understanding of the material covered in sections 1.3 and 1.4 of the textbook, and the lecture and recitation materials. You should read the textbook chapters before doing this part of the assignment.

Written homeworks must be typeset in LaTeX and submitted in PDF format. Please insert a page break between each question, so that your answer to each question starts on a new page in your PDF document.

Q1. Comparing playing cards

We would like to sort playing cards from a deck. Associated with each card is a denomination (1 to 13) and a suit (CLUBS < DIAMONDS < HEARTS < SPADES). A card c1 is considered less than a card c2 if either of the following is true:

  public class Card implements Comparable {
    // Comparators by suit and by denomination
    public static final Comparator SUIT_ORDER = new SuitOrder();
    public static final Comparator DENOM_ORDER = new DenomOrder();
    // Suit of the card (CLUBS = 1, DIAMONDS = 2, HEARTS = 3, SPADES = 4)
    private final int suit;

    // Denomination of the card
    private final int denom;
    public Card(int suit, int denom) {
      if (suit < 1 || suit > 4)
        throw new IllegalArgumentException("Invalid suit");
      if (denom < 1 || denom > 13)
        throw new IllegalArgumentException("Invalid denomination");
      this.suit = suit;
      this.denom = denom;
    }
    // COMPLETE THE FOLLOWING FUNCTION
    public int compareTo(Card that) {

    }

    // Compare cards according to the suit only
    private static class SuitOrder implements Comparator {
      // Implementation not shown
    }
    // Compare cards according to the denomination only
    private static class DenomOrder implements Comparator {
      // Implementation not shown
    }
}

Q2. Sorting playing cards

We would like to sort playing cards from a deck. Associated with each card is a denomination (1 to 13) and a suit (CLUBS < DIAMONDS < HEARTS < SPADES). A card c1 is considered less than a card c2 if either of the following is true:

    MergeX.sort(cards, Card.SUIT_ORDER);
    MergeX.sort(cards, Card.DENOM_ORDER);
    MergeX.sort(cards, Card.DENOM_ORDER);
    MergeX.sort(cards, Card.SUIT_ORDER);
    MergeX.sort(cards);
    MergeX.sort(cards, Card.SUIT_ORDER);
    MergeX.sort(cards, Card.DENOM_ORDER);
    MergeX.sort(cards);
    Quick.sort(cards, Card.SUIT_ORDER);
    Quick.sort(cards, Card.DENOM_ORDER);
    Quick.sort(cards, Card.DENOM_ORDER);
    Quick.sort(cards, Card.SUIT_ORDER);
    MergeX.sort(cards);
    Quick.sort(cards, Card.SUIT_ORDER);
    Quick.sort(cards, Card.DENOM_ORDER);
    MergeX.sort(cards);

Q3. Backing off to insertion sort

One strategy to speeding up quicksort is to stop quicksort when it reaches subarrays of size 10 or smaller. After quicksort completes, we run insertion sort on the entire array to make sure that everything is in order. Prove that applying insertion sort to the entire array is a linear time step, despite insertion sort’s worst case N^2 performance.

Q4. Detecting a duplicate

Given k sorted arrays containing N keys in total, design an algorithm that determine whether there is any key that appears more than once. Your algorithm should use extra space at most proportional to k. For full credit, it should run in time at most proportional to N log k in the worst case; for partial credit, time proportional to Nk.

Give a crisp and concise English description of your algorithm. Your answer will be graded on correctness, efficiency, and clarity.

What is the order of growth of the worst case running time of your algorithm as a function of N and k? Briefly justify your answer.

Q5. Picking the right sorting algorithm for the job

Choose appropriate sorting algorithm for each type of input described below, and give a one sentence explanation why it is the best choice. For the sorting algorithms, you can pick from among Mergesort, insertion sort, quick sort, selection sort, and 3-way quicksort. Please pick all that apply. An algorithm can be chosen multiple times for different questions.