Due Tuesday, April 29, 5 pm. 100 points.
For this assignment you will be writing a simple Internet search engine. Don't let it scare you. Thanks to the magic of Python, this project can be written in fewer than 100 lines of code. More verbose solutions are OK too, of course, but if you find it's taking you much more than 150 lines of code, you're probably missing something that would make your life easier.
Here's an
overview of a sample run to give you an idea of how it all should
work.
The search engine has two modules, crawl.py to find and
index pages, and search.py to find pages that match a search
term.
We'll come back and discuss each module in more detail. Suppose
you type
the following shell command (what you type is in blue,
everything else is in black; note also that
these blue commands are from a terminal prompt, not the Python
interpreter)
> python crawl.py
http://www.somesite.com/somefolder/somepage.htm
2 index.txt
The crawl module will start by reading
http://www.somesite.com/somefolder/somepage.htm.
It will
build an index of the contents of the page, so that later we can
look up
a search term and quickly find out how many times it
occurs on the page.
It will also look through the page and find all of its
links. This
called crawling or spidering the site. For each
link,
it will go out, read the page, build its index, and find that page's
links. It will find each of those pages' links too, and go out and index
them,
but it will stop there, because we specified in the third command
line argument
to only follow links to a depth of 2. So in the end, it
will have page content
information for the original page, all of its
links, and all of its links'
links. Finally, it will save the index
information for all the pages to file
index.txt, in a
format that will be discussed below.
> python search.py index.txt cookie monster
Given a previously constructed index, this command it will find pages that contain at least one of the terms "cookie" and "monster" (not necessarily next to each other). It will rank the pages by the total number of occurrences of the search terms in the page and print the ranked list as follows:
http://www.somesite.com/somefolder/otherpage.htm 15 http://www.somesite.com/somefolder/somepage.htm 8 http://www.somesite.com/somefolder/thirdpage.htm 3
To read a web page, crawl.py should import
the urllib2 package.
You can then get an object
representing the content of the page with the Pyton
statement
"page
= urllib2.urlopen(url)", and you can
iterate through its lines
of text with a for loop like "for
line in page:"
All URLs must be fully specified, like
the ones in our example. "www.somesite.com" isn't
enough -- it must be "http://www.somesite.com". This
is only a problem
for you when providing the root URL for the engine. We
will assume that the
other webpages' links are all fully specified
(though in a real search engine
we'd want to be able to handle
incomplete or relative URLs).
To extract links from the webpages, look for text that matches
the pattern
href="urlGoesHere" . You must see the
string href=" , and grab
all the text from that point
to the next quotation mark. Be careful when writing
a regular expression
to match it, because matching is normally greedy. That
is, if you saw
the string
href="This is a URL" But this is not
href="And this is"
you might do a greedy match, and get everything from the first quote mark up through the last quote mark as your URL -- this would be
This is a URL" But
this is not href="And this is
We don't want that, so you must override the greedy behavior of the regular expression engine. The star (*) will match as much text as it possibly can while still matching the expression (this is the greedy approach), but the star-question-mark (*?) will match the minimal amount of text it can. You will need the *? to make your regular expression work.
When building the page's index, you should simply
split the page's contents
by whitespace. Don't worry about trimming
punctuation or correcting upper/lower
case -- for this assignment, the
strings "cookie", "Cookie",
and "Cookie." will be considered three different
words. Searching
for "cookie" should not match
"Cookie" or
"cookie."
Because there could be multiple links to the same page, you should make sure that you don't enter multiple times into the index a page you got through via a link.
The external representation of the index should be a file one line
word page count
for each word found in the crawl and each page (a URL) containing the word, with count the number of occurrences of the word in the page.
The search.py module takes the name of the
index file and the
sequence of search terms as arguments. It first reads
the index and stores
it into an appropriate data structure. It then
finds and sorts the pages containing
the search terms, and prints their
URLs and total search term counts in decreasing
order of the total count
of occurrences of the search terms in the pages. Pages
that contain no
search terms should not be printed. There will be at least
one search
term in the command line, and no upper limit to the number of search
terms.
To read
argument passed in on the command line, your Python file must import
the
sys package. The arguments may then be retrieved, as strings,
by calling
sys.argv[0], sys.argv[1], and so
on. Be careful -- sys.argv[0] is the program
name itself,
so you probably want to start at sys.argv[1].
This project may seem like a bit much at first,
but once you start to break
it down you'll find that the pieces can be
completed very easily and succinctly.
Try starting with a function that
takes a string, and returns a dictionary
showing for each word, how many
times the string contains that word. This should
be easy -- we did it in
lab. Test your function, then write another one that
takes a string, and
returns a list of all the links in it. Test it, then write
a recursive
function that takes a URL, builds the dictionary of terms, stores
this
information in some central variable, and then calls itself on each of
the documents' links. Make sure to include the condition for when the
recursion
stops (in this case, when you've reached the maximum link
depth). At this point,
you're almost done with crawl.py. As
for search.py, it should
construct an appropriate map from the index
file, collect information
on pages containing the search terms from the
index, have only to sort the
responsivle URLs by the total number of
occurrences of search terms.
You can test your modules on our small set of test pages. Here's what we get when crawling those pages and searching the resulting index:
> python crawl.py http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc1.html 2 index.txt > python search.py index.txt to http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc3.html 5 http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc1.html 2 http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc2.html 1 > python search.py index.txt it
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc2.html 1
> python search.py index.txt to the
http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc3.html 6 http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc2.html 4 http://www.seas.upenn.edu/~cis1xx/projects/Python/SearchEngine/pages/doc1.html 4
You can also test on a set of pages that you control. But
please do not test it on other sites, because a buggy crawler can hit a server
will too many requests, and your IP address (or your whole domain!) could
be blacklisted. In addition, crawlers are supposed to obey the robots.txt protocol. If you don't, you are deemed to be a bad guy and blacklisting could
occur. When
debugging, don't set the link depth too high -- it'll slow your program down
significantly, and also create heavy network load. Please keep it in the
0-3 range.
You should submit code in two files crawl.py and search.py. We
will run it on several test sites to evaluate it.
Improve your search engine as follows:
crawl.py. For example, a second argument of -100
requests that the crawler keep following links until it runs out of distinct
URLs or 100 of them are found. There are several simple ways of doing this,
one of which is based on the graph search code presented in class.