Crawl a provided set of websites for email addresses

Status: IN PROGRESS
Bids: 27
Avg Bid (USD): $403
Project Budget (USD): $250 - $750

Project Description:
You will receive a large CSV file (approx. 1.2 million rows) of names of professors at American universities. The URL of each professor's university is listed as well.

Your job will be to write software that can crawl each website and look for pages on which the professor's name appears, and extract email addresses from there. The goal is to obtain one or more email addresses for each professor.
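A minimal sketch of the "page on which the professor's name appears" check, in Python. It assumes names arrive in the "Last, First" form shown in the sample rows, while pages may spell them "First Last". The function names name_variants and page_mentions are hypothetical, and real pages will need more variants (middle initials, titles, nicknames).

```python
import re

def name_variants(csv_name):
    """Turn 'Khuri, Sami' into the spellings a page is likely to use."""
    last, _, first = (part.strip() for part in csv_name.partition(","))
    if not first:
        # No comma in the input: use the name as given.
        return [last]
    return [f"{first} {last}", f"{last}, {first}"]

def page_mentions(text, csv_name):
    """Return True if any spelling of the name occurs in the page text."""
    # Collapse runs of whitespace so line breaks inside a name do not matter.
    normalized = re.sub(r"\s+", " ", text).lower()
    return any(v.lower() in normalized for v in name_variants(csv_name))
```

For example, page_mentions("Dr. Sami  Khuri, Professor", "Khuri, Sami") returns True even though the page uses a different word order and spacing.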

Since it's impossible to determine simply from the name and the URL which email address corresponds to the professor, one potential approach is to retrieve multiple pages on which the name appears and on which at least one email address appears as well (using a regex). Then, rank the email addresses based on how frequently they appear. The address that appears most often is likely to be the correct one. Example:

page 1: John Smith, j.smith(at)mit.edu
page 2: John Smith, j.smith(at)mit.edu
page 3: John Smith, j.stewart(at)mit.edu
page 4: John Smith, s.colbert(at)mit.edu

From this example it is pretty clear that j.smith(at)mit.edu is likely to be the correct address, since it appears on two of the four pages.
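The ranking idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not a finished extractor: the regex is a simplified pattern (not a full RFC 5322 matcher), it accepts both "@" and the "(at)" obfuscation used in the example, and the names EMAIL_RE, extract_emails and rank_emails are hypothetical.

```python
import re
from collections import Counter

# Simplified pattern: local part, then "@" or "(at)", then a dotted domain.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)(?:@|\(at\))([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)",
    re.IGNORECASE,
)

def extract_emails(text):
    """Return the set of distinct addresses found in one page, lowercased."""
    return {f"{local}@{domain}".lower() for local, domain in EMAIL_RE.findall(text)}

def rank_emails(pages):
    """Count on how many pages each address appears, most frequent first."""
    counts = Counter()
    for text in pages:
        # A set per page means an address repeated on one page counts once.
        counts.update(extract_emails(text))
    return counts.most_common()

pages = [
    "John Smith, j.smith(at)mit.edu",
    "John Smith, j.smith(at)mit.edu",
    "John Smith, j.stewart(at)mit.edu",
    "John Smith, s.colbert(at)mit.edu",
]
ranked = rank_emails(pages)
```

With the four example pages, the first entry of ranked is j.smith@mit.edu with a count of 2, and the other two addresses follow with a count of 1 each.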

The output of your software, provided in CSV or other database-readable format, should contain the professor ID (from the input file) and one or more email addresses, each with a rank. Each row should also contain the URL of the page where the address was found.
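One possible shape for that output file, sketched with Python's csv module. The column names (ProfessorID, Rank, Email, SourceURL) and the write_results function are assumptions for illustration; the posting only specifies which pieces of information each row must carry.

```python
import csv
import io

def write_results(results, out):
    """Write one row per (professor, email), ranked 1, 2, 3, ...

    results maps a professor ID to a list of (email, source_url) pairs
    that is already sorted from most to least likely.
    """
    writer = csv.writer(out)
    writer.writerow(["ProfessorID", "Rank", "Email", "SourceURL"])
    for professor_id, candidates in results.items():
        for rank, (email, source_url) in enumerate(candidates, start=1):
            writer.writerow([professor_id, rank, email, source_url])

# Example with made-up candidates for professor ID 2 from the sample rows.
buffer = io.StringIO()
write_results(
    {2: [("first@example.edu", "http://www.example.edu/a"),
         ("second@example.edu", "http://www.example.edu/b")]},
    buffer,
)
lines = buffer.getvalue().splitlines()
```

Writing to any file-like object keeps the function usable with a real file (open it with newline="") as well as with an in-memory buffer for testing.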

Here are a few sample rows from the input file:

ID | Name | Department | InstitutionID | InstitutionName | State | Location | URL
1 | Obaid, Evelyn | Computer Science | 881 | Obaid, Evelyn | CA | San Jose, CA | http://www.sjsu.edu/
2 | Khuri, Sami | Computer Science | 881 | Khuri, Sami | CA | San Jose, CA | http://www.sjsu.edu/
3 | Beeson, Michael | Computer Science | 881 | Beeson, Michael | CA | San Jose, CA | http://www.sjsu.edu/
15 | Kubelka, Richard | Mathematics | 881 | Kubelka, Richard | CA | San Jose, CA | http://www.sjsu.edu/
18 | Lin, Ty | Computer Science | 881 | Lin, Ty | CA | San Jose, CA | http://www.sjsu.edu/
29 | Key, Scott | Philosophy | 145 | Key, Scott | CA | Riverside, CA | http://www.calbaptist.edu/
45 | Lash, Jamie | Foundations | 1230 | Lash, Jamie | TX | Dallas, TX | http://www.dbu.edu/
47 | Swain, John | Physics | 696 | Swain, John | MA | Boston, MA | http://www.northeastern.edu/neuhome/index.php
48 | Signorielli, Nancy | Communication | 1094 | Signorielli, Nancy | DE | Newark, DE | http://www.udel.edu/
57 | Frederick, Joan | English | 457 | Frederick, Joan | VA | Harrisonburg, VA | http://www.jmu.edu/

To save you time, one possibility is to query Google using their API for pages that contain the name of each professor and are on the domain provided. Example (this is from the second row above):

Query: "Khuri, Sami site:www.sjsu.edu"
https://encrypted.google.com/search?q=Khuri%2C+Sami+site%3Awww.sjsu.edu

As you can see the first result in this case is actually a very good page to collect the email from:
http://www.sjsu.edu/people/sami.khuri/expert/
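The query string and search URL shown above can be rebuilt from an input row with Python's standard library. This sketch only reproduces the example URL from this posting; the function names build_query and search_url are hypothetical, and an actual API call would use its own endpoint and credentials, which are not shown here.

```python
from urllib.parse import quote_plus, urlsplit

def build_query(name, university_url):
    """Combine the professor's name with a site: filter on the URL's host."""
    host = urlsplit(university_url).netloc
    return f"{name} site:{host}"

def search_url(query):
    """URL-encode the query in the same form as the example link."""
    return "https://encrypted.google.com/search?q=" + quote_plus(query)

query = build_query("Khuri, Sami", "http://www.sjsu.edu/")
url = search_url(query)
```

Taking only the host part of the URL also handles rows such as http://www.northeastern.edu/neuhome/index.php, where the listed URL points at a specific page rather than the site root.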

Generally speaking, the first 10-20 results are very likely to contain the correct address.

Once again, the deliverable of this project is a text (CSV or TSV) file containing one or more email addresses for each professor, ranked by probability of being correct.

The project must be delivered in at most 1 month.

Skills required:
C# Programming, C++ Programming, Java, Python, SQL
About the employer:
Verified


$320 in 6 days
$250 in 10 days
$400 in 7 days
$250 in 5 days
$700 in 15 days
$380 in 10 days
$750 in 14 days
$750 in 30 days
$400 in 7 days
$500 in 20 days