
Need help doing this project.

This project is to construct a search engine for a static corpus, namely the Enron e-mail collection. Due to disk space limitations on our small Amazon machines, however, I will give you a subset of the complete e-mail dump.

You should create the index in memory, but store the e-mail files on disk. Set up my Web server configuration so that the files are visible with a URI structure that mirrors the disk structure. For example, I want the URI /enron/motley-m/inbox/16 for the file

enron_mail_20110402/maildir/motley-m/inbox/16.

Note the removal of the '.' at the end of the filename in the URI. The easiest way to do this is simply to remove it from each filename on disk.
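As a rough sketch of the mapping described above, the following Python shows one way to turn a disk path into the requested URI and to strip the trailing '.' from filenames on disk. The names `MAILDIR_ROOT`, `URI_PREFIX`, `path_to_uri` and `strip_trailing_dots` are made up for illustration; only the directory layout and the example URI come from the description.

```python
import os

# Both constants are taken from the example in the description.
MAILDIR_ROOT = "enron_mail_20110402/maildir"
URI_PREFIX = "/enron"


def path_to_uri(path):
    """Map a file under MAILDIR_ROOT to its URI, dropping a trailing '.'."""
    rel = os.path.relpath(path, MAILDIR_ROOT).replace(os.sep, "/")
    if rel.endswith("."):
        rel = rel[:-1]
    return URI_PREFIX + "/" + rel


def strip_trailing_dots(root):
    """Rename every file 'N.' to 'N' under root; return how many were renamed."""
    renamed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".") and len(name) > 1:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, name[:-1]))
                renamed += 1
    return renamed
```

Once the files are renamed, the Web server only needs to serve the maildir directory under the /enron prefix; how that is configured depends on the server in use.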

We are reusing the functionality from my previous project that loaded the MySQL database with text files. In this case, we are using e-mail files. You are free to use any of the Python code we built on page TFIDF.

Assume a free-text search, not a Boolean AND or OR of the terms. In other words, documents may come up that do not include all of the search terms.

We will, in principle, be using a cosine similarity between query and documents except that it will look like a simple scoring by "sum of weights" because our queries are short: we will use 0 or 1 as the weight for query terms. So for query "burn California burn" the resulting vector weights are 1 for terms burn and California. There would be an implicit zero in all other positions.
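To make the "sum of weights" idea concrete, here is a minimal sketch. It assumes each document is represented as a dict of term weights (for example TF-IDF values); lowercasing and whitespace tokenization are assumptions, and `query_terms` and `score` are illustrative names.

```python
def query_terms(query):
    """Distinct lowercased terms: a repeated word still gets weight 1."""
    return set(query.lower().split())


def score(query, doc_weights):
    """Sum the document's weights for the terms that occur in the query.

    With 0/1 query weights, the dot product in the cosine formula
    reduces to this sum; terms missing from the document contribute 0.
    """
    return sum(doc_weights.get(term, 0.0) for term in query_terms(query))


# "burn California burn" yields weight 1 for "burn" and "california" only.
doc = {"burn": 0.5, "california": 0.25, "phone": 0.9}
print(score("burn California burn", doc))  # 0.75
```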

You can use either a tree (a binary search tree or a B-tree) or a hash table to create the index, and I recommend variable-length arrays for the postings lists. You will also need an array that maps document ID to filename (or URI).
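One possible shape for these structures, sketched with a Python dict as the hash table and Python lists as the variable-length postings arrays. The TF-IDF variant used here (term frequency times log of N over document frequency) and the whitespace tokenizer are assumptions, not requirements from the description; `build_index` is an illustrative name.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Build an in-memory inverted index.

    docs: list of (filename, text) pairs.
    Returns (index, doc_names) where index maps term -> postings list of
    (doc_id, weight) and doc_names[doc_id] is the filename (or URI).
    """
    doc_names = []
    term_counts = []
    doc_freq = defaultdict(int)
    for filename, text in docs:
        doc_names.append(filename)          # doc ID is the list position
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        term_counts.append(counts)
        for term in counts:
            doc_freq[term] += 1

    n_docs = len(docs)
    index = defaultdict(list)
    for doc_id, counts in enumerate(term_counts):
        total = sum(counts.values())
        for term, count in counts.items():
            weight = (count / total) * math.log(n_docs / doc_freq[term])
            index[term].append((doc_id, weight))
    return dict(index), doc_names
```

Because documents are processed in order, each postings list ends up sorted by document ID, which is convenient if postings are merged later.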

You should make search queries as efficient as possible and return the most relevant e-mails at the top of the list. That means using a heap as a priority queue to get the top K search results. Do not simply sort the search results by score. Create a heap from the results and then get the top results in the proper order. See Python module heapq.
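A small sketch of the heap approach with heapq. Since heapq implements a min-heap, scores are negated so that the best document is popped first. The input format (a dict from document ID to score) and the name `top_k` are assumptions.

```python
import heapq


def top_k(scores, k):
    """Return the k best (doc_id, score) pairs, best first.

    scores: dict mapping doc_id -> score.
    heapify is O(n) and each pop is O(log n), so this costs
    O(n + k log n) rather than the O(n log n) of a full sort.
    """
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)
    results = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        results.append((doc_id, -neg_score))
    return results


print(top_k({0: 0.2, 1: 0.9, 2: 0.5, 3: 0.1}, 2))  # [(1, 0.9), (2, 0.5)]
```

The standard library's `heapq.nlargest` gives the same ordering for the top K in a single call, if an explicit heap is not required.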

Persist your index to disk with cPickle and reload it upon server startup. Have a separate Python program create the index before you launch your server.
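A minimal sketch of saving and reloading the index. It is written for Python 3, where the C-accelerated pickler is part of the standard `pickle` module; on Python 2 the equivalent would be `import cPickle as pickle`. The function names and the choice to store the index and the document-name array together as one tuple are assumptions.

```python
import pickle


def save_index(index, doc_names, path):
    """Run by the separate indexing program before the server starts."""
    with open(path, "wb") as f:
        pickle.dump((index, doc_names), f, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Run once at server startup; returns (index, doc_names)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```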

If you use heuristics to increase the speed of your queries, you must document this on a separate sheet so I don't miss it as a comment in your code. Heuristics can alter the order or set of documents that appear in the results in exchange for increased speed.

Example

Here is email file "16." within Motley's directory:

~/USF/CS680/data/enron_mail_20110402/maildir/motley-m/inbox $ cat 16.

Message-ID: <[url removed, login to view]@thyme>
Date: Fri, 15 Mar 2002 12:50:27 -0800 (PST)
From:
To:
Subject: phone number for Malowney
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Thompson, Virginia </O=ENRON/OU=NA/CN=RECIPIENTS/CN=VTHOMPSO>
X-To: Motley, Matt </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mmotley>
X-cc:
X-bcc:
X-Folder: \ExMerge - Motley, Matt\Inbox
X-Origin: MOTLEY-M
X-FileName: matt motley [url removed, login to view]

Matt-

I talked to John Malowney today (Friday, March 15) and he asked if I would
send you an e-mail and ask you to call him on Tuesday (March 19) (503)
833-4526.

Thanks,

Virginia

A search result for this e-mail should look like:

phone number for Malowney

From:

To:

Matt- I talked to John Malowney today (Friday, March 15) and he asked if I would send you...

Strip out newlines, collapse runs of spaces, and display the first, say, 80 characters of the e-mail message.
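The result fields could be extracted roughly as follows, using the standard `email` module for the headers and a regular expression to flatten whitespace. This is a sketch that assumes plain-text, non-multipart messages like the example above; `snippet` and `summarize` are illustrative names.

```python
import re
from email import message_from_string


def snippet(body, length=80):
    """Collapse all whitespace runs to one space and keep the first `length` chars."""
    flat = re.sub(r"\s+", " ", body).strip()
    return flat[:length] + ("..." if len(flat) > length else "")


def summarize(raw):
    """Return the fields needed to render one search result."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    if not isinstance(body, str):   # multipart message: not handled in this sketch
        body = ""
    return {
        "subject": msg["Subject"] or "",
        "from": msg["From"] or "",
        "to": msg["To"] or "",
        "snippet": snippet(body),
    }
```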

Deliverables

You must provide a URI that returns search results:

/search?q=searchterms

Show the top 20 results in the browser, with a "next page" link that moves to the next 20 results, and so on.
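Paging can be sketched as a slice over the ranked results plus a link to the following page. The `page` query parameter is hypothetical (the description only specifies `q`), as is the name `paginate`.

```python
from urllib.parse import urlencode

PAGE_SIZE = 20  # results per page, as requested


def paginate(results, query, page=0):
    """Return (results for this page, URI of the next page or None).

    results: ranked list, best first. page: zero-based page number.
    """
    start = page * PAGE_SIZE
    page_results = results[start:start + PAGE_SIZE]
    next_link = None
    if start + PAGE_SIZE < len(results):
        next_link = "/search?" + urlencode({"q": query, "page": page + 1})
    return page_results, next_link
```

With a heap-based top-K, serving page p only requires the best (p + 1) * 20 results, so K can grow with the page number instead of ranking every match.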

The results in the browser should look something like Google results, where the link has the Subject line extracted from the e-mail, the from address, the to address, and

Skills: Amazon Web Services, Data Mining, MySQL, Python, Software Architecture


About the Employer:
( 0 reviews ) United States

Project ID: #1267828

1 freelancer is bidding on average $50 for this job

Reimelt

Hello babasy, I'm not a developer here on GAF, and I don't know how to send e-mails here ;-) but if you send me that stripped-down Enron corpus, I could try to throw it into my search engine that already exists and we ca...

$50 USD / hour