You will receive a large CSV file (approximately 1.2 million rows) of names of professors at American universities. The URL of each professor's university is listed as well.
Your job will be to write software that can crawl each website and look for pages on which the professor's name appears, and extract email addresses from there. The goal is to obtain one or more email addresses for each professor.
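As a rough illustration of the extraction step, here is a minimal Python sketch. It assumes the page has already been fetched and reduced to plain text. The names EMAIL_RE, name_variants and extract_emails are hypothetical, and the pattern is deliberately simple: it only handles plain addresses and the "(at)" spelling, not every obfuscation found on real pages.

```python
import re

# Simplified pattern (an assumption, not a full RFC 5322 matcher):
# local part, then "@" or "(at)", then a dotted domain.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+(?:@|\(at\))[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",
    re.IGNORECASE,
)


def name_variants(name):
    """Return the name as given plus 'First Last' when it is 'Last, First'."""
    variants = [name.strip()]
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        variants.append(f"{first} {last}")
    return variants


def extract_emails(page_text, name):
    """Return normalized addresses from a page, or [] if the name is absent."""
    lowered = page_text.lower()
    if not any(variant.lower() in lowered for variant in name_variants(name)):
        return []
    emails = []
    for match in EMAIL_RE.findall(page_text):
        # Normalize the "(at)" spelling to "@" and lowercase the result.
        emails.append(re.sub(r"\(at\)", "@", match, flags=re.IGNORECASE).lower())
    return emails
```

Pages that do not mention the professor at all yield an empty list, so they contribute nothing to later counting.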
Since it's impossible to determine simply from the name and the URL which email address corresponds to the professor, one potential approach is to retrieve multiple pages on which the name appears and on which at least one email address appears as well (using a regex). Then, rank the email addresses based on how frequently they appear. The address that appears most often is likely to be the correct one. Example:
page 1: John Smith, j.smith(at)mit.edu
page 2: John Smith, j.smith(at)mit.edu
page 3: John Smith, j.stewart(at)mit.edu
page 4: John Smith, s.colbert(at)mit.edu
From this example it is pretty clear that j.smith(at)mit.edu is likely to be the correct address, since it is the only one that appears on more than one page.
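The frequency ranking described above could be sketched in Python as follows. This is only one possible approach. It assumes the addresses for each page have already been extracted and written in the usual @ form, and the function name rank_emails, the tuple layout, and the alphabetical tie-break are all choices made for this illustration.

```python
from collections import Counter


def rank_emails(pages):
    """Rank addresses by the number of pages they appear on.

    pages: iterable of (url, [addresses found on that page]).
    Returns a list of (rank, email, page_count, first_url), best first.
    """
    counts = Counter()
    first_url = {}
    for url, emails in pages:
        for email in set(emails):  # count each address once per page
            counts[email] += 1
            first_url.setdefault(email, url)
    # Most frequent first; ties are broken alphabetically (an arbitrary choice).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, email, n, first_url[email])
        for rank, (email, n) in enumerate(ordered, start=1)
    ]


# The four pages from the example above.
pages = [
    ("page 1", ["j.smith@mit.edu"]),
    ("page 2", ["j.smith@mit.edu"]),
    ("page 3", ["j.stewart@mit.edu"]),
    ("page 4", ["s.colbert@mit.edu"]),
]
ranked = rank_emails(pages)
print(ranked[0])  # (1, 'j.smith@mit.edu', 2, 'page 1')
```

Counting each address once per page keeps a single page that repeats an address many times from dominating the ranking.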
The output of your software, provided in CSV or other database-readable format, should contain the professor ID (from the input file) and one or more email addresses, each with a rank. Each row should also contain the URL of the page where the address was found.
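To make the expected output concrete, here is a small Python sketch of one possible layout, with one row per professor ID and address. The column names in FIELDS and the sample row are made up for illustration; the actual column names and order are up to you, provided the ID, address, rank and source URL are all present.

```python
import csv
import io

# Hypothetical column names; only the four pieces of information are required.
FIELDS = ["professor_id", "rank", "email", "source_url"]


def write_results(rows, out):
    """Write result rows (dicts keyed by FIELDS) as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Illustrative row only; the ID and URL are placeholders, not real data.
buffer = io.StringIO()
write_results(
    [{"professor_id": 999, "rank": 1, "email": "j.smith@mit.edu",
      "source_url": "http://www.example.edu/page1"}],
    buffer,
)
print(buffer.getvalue())
```

Using the csv module rather than joining strings by hand keeps the file readable by a database loader even when a URL contains a comma.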
Here are a few sample rows from the input file:
ID | Name | Department | InstitutionID | InstitutionName | State | Location | URL
1 | Obaid, Evelyn | Computer Science | 881 | Obaid, Evelyn | CA | San Jose, CA | http://www.sjsu.edu/
2 | Khuri, Sami | Computer Science | 881 | Khuri, Sami | CA | San Jose, CA | http://www.sjsu.edu/
3 | Beeson, Michael | Computer Science | 881 | Beeson, Michael | CA | San Jose, CA | http://www.sjsu.edu/
15 | Kubelka, Richard | Mathematics | 881 | Kubelka, Richard | CA | San Jose, CA | http://www.sjsu.edu/
18 | Lin, Ty | Computer Science | 881 | Lin, Ty | CA | San Jose, CA | http://www.sjsu.edu/
29 | Key, Scott | Philosophy | 145 | Key, Scott | CA | Riverside, CA | http://www.calbaptist.edu/
45 | Lash, Jamie | Foundations | 1230 | Lash, Jamie | TX | Dallas, TX | http://www.dbu.edu/
47 | Swain, John | Physics | 696 | Swain, John | MA | Boston, MA | http://www.northeastern.edu/neuhome/index.php
48 | Signorielli, Nancy | Communication | 1094 | Signorielli, Nancy | DE | Newark, DE | http://www.udel.edu/
57 | Frederick, Joan | English | 457 | Frederick, Joan | VA | Harrisonburg, VA | http://www.jmu.edu/
To save you time, one possibility is to query Google using their API for pages that contain the name of each professor and are on the domain provided. Example (this is from the second row above):
Query: "Khuri, Sami site:www.sjsu.edu"
In this case, the first result is actually a very good page to collect the email address from.
Generally speaking, the first 10-20 results are very likely to contain the correct address.
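The query string shown above can be built from an input row with a few lines of Python. This sketch only constructs the string; how it is sent to a search API is left open. The function name build_query is hypothetical.

```python
from urllib.parse import urlparse


def build_query(name, url):
    """Build a 'Name site:host' query from a professor name and university URL."""
    host = urlparse(url).netloc  # e.g. "www.sjsu.edu" for "http://www.sjsu.edu/"
    return f"{name} site:{host}"


print(build_query("Khuri, Sami", "http://www.sjsu.edu/"))
# Khuri, Sami site:www.sjsu.edu
```

Taking only the host part means a URL with a path, such as http://www.northeastern.edu/neuhome/index.php, still produces a query restricted to the whole www.northeastern.edu site.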
Once again, the deliverable of this project is a text (CSV or TSV) file containing one or more email addresses for each professor, ranked by probability of being correct.
The project must be delivered in at most 1 month.