**CIS 1920 Homework 1: Tries**
**Due January 27, 2023 11:59 pm EST**
(#) Learning objectives - Understand loops in Python - Familiarize yourself with `string` and `dict` data types - Understand pass by assignment in functions and mutable objects (#) Starter files - [`hw1.py`](hw1.py) (#) What are tries? A trie -- pronounced "try" -- is a tree-based data structure for storing strings, with support for efficient pattern and prefix matching. Formally, given: - an alphabet $\mathcal{A}$ (set of valid characters, such as the English alphabet, or $\{\text{A,C,D,G}\}$ for DNA) - a set of strings $\mathcal{S}$ *such that no string is the prefix of another* A trie is an ordered tree whose nodes (except the root node) are labeled with characters from $\mathcal{A}$. The tree has $|\mathcal{S}|$ (size of $\mathcal{S}$) leaves, one for every string. The tree is constructed such that the concatenation of characters from root to leaf forms a string in $\mathcal{S}$. Given $\mathcal{A}$ as the English alphabet and $\mathcal{S} = \{\text{bet, bid, bit, set, step}\}$, the corresponding trie is: ************************************************************************* * .-. * | | * +-+ * / \ * / \ * / \ * / \ * / \ * .+. .+. * | b | | s | * +-+ +-+ * / \ / \ * / \ / \ * .+. .+. .+. .+. * | e | | i | | e | | t | * +-+ +-+ +-+ +-+ * / / \ \ \ * / / \ \ \ * .+. .+. .+. .+. .+. * | t | | d | | t | | t | | e | * +-+ +-+ +-+ +-+ +++ * bet bid bit set | * .+. * | p | * '-' * step ************************************************************************* The primary application of tries is for string matching. String matching is a problem encountered in many fields of computer science, ranging from computational biology with DNA manipulation to NLP with sentence completion. Broadly speaking, given a collection of strings, we want to perform a "search" on the collection. What exactly does this mean? In some cases, we are interested in pattern matching: is a given string $X$ in the collection? In others, we are interested in prefix matching: find all strings that have a given substring $X$ as a prefix. (#) Tries in Python In this homework, you will implement the basic functionality of a trie in Python, leveraging some of Python's data structures. The tree for the trie must be implemented as a nested Python `dict` (short for "dictionary"). The `dict` functions similarly to a Java HashMap, except the keys and values can be of varying data types within the same `dict` (unlike Java). For more information about the `dict`, see [the official docs](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict). Each key of the dictionary will be a string of length 1 representing a node, and each value will be another dictionary, whose keys correspond to the children of the key node. For the example trie above, the corresponding dictionary implementation would be: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python linenumbers trie = { "b": { "e": { "t": {} }, "i": { "d": {}, "t": {} } }, "s": { "e": { "t": {} }, "t": { "e": { "p": {} } } } } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Leaf nodes are represented by an empty dictionary. Also note that since the root node does not have a key, we use the outer-most dictionary itself as the root of the tree. (##) Implementation Details You will write the following functions: - `add_word(trie, word)`: Adds a word to the given trie dictionary representation. - `create_trie(word_list)`: Returns a dictionary trie built from the given list of words. - `in_trie(trie, word)`: Returns a `bool` indicating whether the given word is present within the trie. - `list_matches(trie, prefix)`: Returns a `list` of all words within the trie that matches the given prefix. The key to implementing all of these functions is a basic tree search algorithm. Pseudo-code of a sample iterative tree search is described below: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python ... # start the search at the root of the trie cur_node = trie for ... # iterate over characters c in word/prefix if ... # perform logic at cur_node # move to the next node in the trie cur_node = cur_node[c] ... ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ (##) Other Requirements !!! WARNING No libraries may be imported for this homework. (#) Some words on code style An observation made by Guido van Rossum, the creator of Python, is that code is more often read than it is executed. One of the mantras of the Python language is that "readbility matters." With this in mind, we are looking for readable, consistent code when grading you on style. For some guidance, we encourage you to begin reading over the [official Python style guide](https://pep8.org), known as PEP8, when you get the chance. (##) Code style "close reads" For homeworks this semester, we will select function(s) that will be read closely for style and efficiency. Important stylistic choices that will be considered in the close reading are the descriptiveness and format of variable names, and comments that describe the code that is written. You can see the conventions that you should follow in the [official Python style guide](https://pep8.org), as mentioned above. Note that we don't expect you to meticulously follow every aspect of the official style guide, but provide it as guidance for your reference. The overall goal with the close reads is to provide more fine-grained feedback as opposed to just running a style checker like `pycodestyle` across the entire assignment, which is also useful but less personalized. !!! INFO For this assignment, we will be making a close read of `list_matches()` and any helper functions you might write for it. Since this is the first homework we will grade accordingly (leniently 🙂), and will provide feedback in the Gradescope rubric so that you can continue to improve your coding skills. (#) Rubric | Section | Points | |---------|--------| Name, PennKey, and hours filled in | 0.5 All functions are implemented | 0.5 `add_word()`| 2 `create_trie()`| 2 `in_trie()`| 2 `list_matches()`| 2 `list_matches()` code style | 1 **Total** | 10 (#) Testing your code and submission You can test your code in the `main()` method that we included to check the behavior of your implementation on your local machine. Your `main()` method will not be graded and is purely for you to use as a sandbox. You will upload your `hw1.py` code to [**Gradescope**](https://www.gradescope.com/courses/477992) for submission. We encourage you to work iteratively, implementing functions one at a time to verify their correctness before moving on to the next function. To facilitate this, you are welcome to submit to Gradescope to verify your code against the autograder as many times as you would like before the due date without penalty. Please keep in mind that any submission made **after the due date** will be considered late and will either be counted towards your alloted late days or penalized accordingly.