CIS 120

Homework 2: Computing Human Evolution

To get started: Visit the CIS 120 course in Codio and select the “hw02: Computing Human Evolution” project. The only files you need to modify are dna.ml and dnaTest.ml. Once you are finished, be sure create a zipfile, download your zip to your local disk and then submit your homework.

The files for this homework are already available in the Codio project, but you can download a fresh copy if needed using this link: hw02.zip.


FAQ

What should I read for help on this homework assignment?

Chapters 5, 6, and 7 of the lecture notes. Consider reviewing the lecture slides, too.

Problem 2

Can I just hard-code the list for question_most_like_human?

You can just hard-code the list – you don’t need to write code.

Does hamming_distance need to account for the differences between a parent and its child’s child?

No. You only ever care about direct parent-child hamming distances.

Problem 3

Can we write helper functions for acids_of_helix?

You’re allowed to write helper functions—you’ll probably need one.

Do we have to account for multiple amino acid chains in the same helix?

You only need to find one (the first) amino acid chain in the helix.

If an end acid is at the beginning of the list, should the rest of the codons be read?

If the sequence that forms an END codon appears before the start codon ([T; A; C]), then you should keep looking for a start codon.

Can we assume that acids_of_helix takes a helix that’s at least a length of 3?

Don’t make that assumption. Remember that you can ignore acids that occur before (or without) the sequence [T ; A ; C].

Problem 4

My tree is outputted as a mirror image of the expected tree. Is that okay?

Yes. For these trees the left and right subtrees are interchangeable.

How do I check whether a value is a certain type in OCaml?

You should use a pattern match to match the type. If statements (such as if lt = Leaf or if lt = Node) won’t work with checks like that.

Why am I getting an error saying “type tree has no Empty constructor”?

Take a closer look at the type defined in the homework. It’s different than the one used in the lectures on binary trees.

Problem 7

My functions passed the test cases for all the earlier steps. However, my final print of the trees is wrong.

Go back and make sure the test cases for all your functions are actually exhaustive. It sounds like you missed a case.

General

Testing

How many tests should I write?

We expect that you will learn to exhaustively test your code, for in later homeworks, you will be responsible for unit testing your code without relying on any provided test cases.

There isn’t a specific answer on how many test cases to write. Generally the rule of thumb is “enough to cover all possible cases of input”, so this may mean testing a range of variants (e.g., a case with null/missing/no elements, a single element, a small set of elements, and a large set of elements) and any edge cases that you come up with when reading the problem (e.g., particular errors or behavior that aren’t immediately related to “standard” cases).

Basically, only stop writing unit tests once you reach the point where you’re reasonably confident that your implementation is correct! Where exactly that point lies is definitely not something that’s immediately obvious, but that’s why we’re having you practice—once you become familiar with how to unit-test effectively, you’ll be able to judge when your tests are sufficient.

Can I write tests that check that certain inputs should fail?

Yes. See run_failing_test in assert. Check out the HW01 FAQ.

Style

How can I get a good style grade?

Your TA should give you style feedback from the first homework at least two days before this homework is due. You should also look at the style guide to see what we’re looking for.

If I want to resubmit with better style, can I?

If you still have extra submissions remaining, you are highly encouraged to resubmit your code with better style. The last submission you make will be the one that is graded for style.

What do we do if a line of code crosses the 80 character limit?

The short answer is break the line into multiple shorter lines.

The long answer is, first, try to avoid having long lines at all. If an expression is too long, consider using some variables to do preprocessing.

Helper Functions & Comments

Do we need to write comments for every function?

You only need to write comments for helper functions that you add, to clarify what they are used for.

How should comments be written?

Comments should preferably be at the top of the function it describes. Keep it very short! You only need to provide an overview of the function.

The only time you should add comments inside your function is if it is some really cryptic calculation that warrants explanation; otherwise, it often detracts from general readability.

Can we use the same helper function for multiple functions?

Absolutely—feel free to use your helper functions however you can to help you!

Compiling/Running

The “Run Project” Terminal says a function is unimplemented even though I just wrote it.

This is usually a sign that your code is not compiling. The executable will only update when there are no errors in your code keeping your code from compiling. Find the red underlines, fix the errors, save the file, then run the executable again and see if this fixes the problem.

Grading

  • There are 95 points possible on the autograded part of this assignment (including style), plus 5 points for your test cases that will be added during a later phase of manual grading. The functions for which we will specifically grade your tests are as follows:
    • hamming_distance
    • decreasing_similarity
    • acids_of_helix
    • helix_of_tree
    • unlabel_tree
    • parent_child_hamming
  • Although we will grade only the tests for the problems defined above, you must be sure to always exhaustively test all functions in your program.
  • You may submit without penalty up to THREE times.
  • Each extra submission costs you 5 points.

Testing

If you’d like to write trees to use in your tests, put them in dnaTest.ml. When writing trees of depth greater than 1 for tests, please indent all children by 4 spaces, separate children by line breaks, and place parentheses similar to the following tree:

LNode (
    LNode (
        LLeaf h1,
        h2,
        LLeaf h3
    ),
    h4,
    LLeaf h5
)

Background

Biologists use evolutionary trees (Figure A below) to show species evolving from ancestor species. To practice using enumerations and recursion over lists and trees, we will write a program that automatically generates hypothesis trees like the one in Figure A.


Figure A: Evolutionary Tree for Apes


Our program input will be real samples of DNA, the genetic code that describes how to build organisms (Figure B). DNA has two complementary helices, each a sequence of the nucleotides adenine, thymine, guanine, and cytosine. Adenine always appears opposite thymine; same for guanine and cytosine.


Figure B: Modeling the DNA Double Helix

Diagram Model

The assignment file “dna.ml” contains the DNA sequences for 8 ape species. This is real data from the Entrez Nucleotide database. You will use this data to:

  1. Generate all possible evolutionary trees with the ape DNA at the leaves. Internal nodes in these trees correspond to ancestor species.
  2. Estimate each tree’s complexity.
  3. Choose the simplest tree as a candidate.
  4. Compare your candidate with the tree in Figure A.

For example, for the four “present-day” DNA helices “GCAT”, “TCGT”, “TAGA”, and “GAGA”, one possible evolutionary tree is shown in Figure C below. This constitutes an evolutionary tree for the given helices, since these helices are at the leaves of the tree, while the internal nodes have as labels helices that are in some sense as close as possible to the labels of their children. Specific details on constructing evolutionary trees are provided in the homework file.


Figure C: Example of a Completely Labeled Tree


Browse these links for more background information:


We derive our images from Wikimedia sources available under Creative Commons licenses.