CIS 120 Homework 9: Search Engine

Due Tuesday, April 29, 5 pm. 100 points.

Introduction

For this assignment you will be writing a simple Internet search engine. Don't let it scare you. Thanks to the magic of Python, this project can be written in fewer than 100 lines of code. More verbose solutions are OK too, of course, but if you find it's taking you much more than 150 lines of code, you're probably missing something that would make your life easier.

Here's an overview of a sample run to give you an idea of how it all should work. The search engine has two modules, crawl.py to find and index pages, and search.py to find pages that match a search term. We'll come back and discuss each module in more detail. Suppose you type the following shell command (what you type is in blue, everything else is in black; note also that these blue commands are from a terminal prompt, not the Python interpreter)

> python crawl.py http://www.somesite.com/somefolder/somepage.htm 2 index.txt

The crawl module will start by reading http://www.somesite.com/somefolder/somepage.htm. It will build an index of the contents of the page, so that later we can look up a search term and quickly find out how many times it occurs on the page. It will also look through the page and find all of its links. This called crawling or spidering the site. For each link, it will go out, read the page, build its index, and find that page's links. It will find each of those pages' links too, and go out and index them, but it will stop there, because we specified in the third command line argument to only follow links to a depth of 2. So in the end, it will have page content information for the original page, all of its links, and all of its links' links. Finally, it will save the index information for all the pages to file index.txt, in a format that will be discussed below.

> python search.py index.txt cookie monster

Given a previously constructed index, this command it will find pages that contain at least one of the terms "cookie" and "monster" (not necessarily next to each other). It will rank the pages by the total number of occurrences of the search terms in the page and print the ranked list as follows:

   http://www.somesite.com/somefolder/otherpage.htm 15
http://www.somesite.com/somefolder/somepage.htm 8
http://www.somesite.com/somefolder/thirdpage.htm 3
 

1. Reading Web pages

To read a web page, crawl.py should import the urllib2 package. You can then get an object representing the content of the page with the Pyton statement "page = urllib2.urlopen(url)", and you can iterate through its lines of text with a for loop like "for line in page:"

All URLs must be fully specified, like the ones in our example. "www.somesite.com" isn't enough -- it must be "http://www.somesite.com". This is only a problem for you when providing the root URL for the engine. We will assume that the other webpages' links are all fully specified (though in a real search engine we'd want to be able to handle incomplete or relative URLs).

2. Extracting links from Web pages

To extract links from the webpages, look for text that matches the pattern href="urlGoesHere" . You must see the string href=" , and grab all the text from that point to the next quotation mark. Be careful when writing a regular expression to match it, because matching is normally greedy. That is, if you saw the string

href="This is a URL" But this is not href="And this is"

you might do a greedy match, and get everything from the first quote mark up through the last quote mark as your URL -- this would be

This is a URL" But this is not href="And this is

We don't want that, so you must override the greedy behavior of the regular expression engine. The star (*) will match as much text as it possibly can while still matching the expression (this is the greedy approach), but the star-question-mark (*?) will match the minimal amount of text it can. You will need the *? to make your regular expression work.

3. Indexing a page

When building the page's index, you should simply split the page's contents by whitespace. Don't worry about trimming punctuation or correcting upper/lower case -- for this assignment, the strings "cookie", "Cookie", and "Cookie." will be considered three different words. Searching for "cookie" should not match "Cookie" or "cookie."

Because there could be multiple links to the same page, you should make sure that you don't enter multiple times into the index a page you got through via a link.

4. Saving the index

The external representation of the index should be a file one line

word page count

for each word found in the crawl and each page (a URL) containing the word, with count the number of occurrences of the word in the page.

5. Searching

The search.py module takes the name of the index file and the sequence of search terms as arguments. It first reads the index and stores it into an appropriate data structure. It then finds and sorts the pages containing the search terms, and prints their URLs and total search term counts in decreasing order of the total count of occurrences of the search terms in the pages. Pages that contain no search terms should not be printed. There will be at least one search term in the command line, and no upper limit to the number of search terms.

6. Reading command line arguments

To read argument passed in on the command line, your Python file must import the sys package. The arguments may then be retrieved, as strings, by calling sys.argv[0], sys.argv[1], and so on. Be careful -- sys.argv[0] is the program name itself, so you probably want to start at sys.argv[1].

7. Overall structure

This project may seem like a bit much at first, but once you start to break it down you'll find that the pieces can be completed very easily and succinctly. Try starting with a function that takes a string, and returns a dictionary showing for each word, how many times the string contains that word. This should be easy -- we did it in lab. Test your function, then write another one that takes a string, and returns a list of all the links in it. Test it, then write a recursive function that takes a URL, builds the dictionary of terms, stores this information in some central variable, and then calls itself on each of the documents' links. Make sure to include the condition for when the recursion stops (in this case, when you've reached the maximum link depth). At this point, you're almost done with crawl.py. As for search.py, it should construct an appropriate map from the index file, collect information on pages containing the search terms from the index, have only to sort the responsivle URLs by the total number of occurrences of search terms.

8. Testing and submission

You can test your modules on our small set of test pages. Here's what we get when crawling those pages and searching the resulting index:

>
python crawl.py
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc1.html
2 index.txt
> python search.py index.txt
to
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc3.html 5
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc1.html 2
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc2.html 1

> python search.py index.txt
it
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc2.html 1
> python search.py index.txt to the
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc3.html 6 http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc2.html 4 http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc1.html 4

You can also test on a set of pages that you control. But please do not test it on other sites, because a buggy crawler can hit a server will too many requests, and your IP address (or your whole domain!) could be blacklisted. In addition, crawlers are supposed to obey the robots.txt protocol. If you don't, you are deemed to be a bad guy and blacklisting could occur. When debugging, don't set the link depth too high -- it'll slow your program down significantly, and also create heavy network load. Please keep it in the 0-3 range.

You should submit code in two files crawl.py and search.py. We will run it on several test sites to evaluate it.

Extra credit

Improve your search engine as follows:

  1. Support relative URLs in your link following.
  2. Allow the crawler to crawl pages to arbitrary depth until no more than a given number of distinct URLs are found. This is specified by passing a negative second argument to crawl.py. For example, a second argument of -100 requests that the crawler keep following links until it runs out of distinct URLs or 100 of them are found. There are several simple ways of doing this, one of which is based on the graph search code presented in class.