Completed

Crawl a provided set of websites for email addresses

This project was successfully completed by randombits for $250 USD in 7 days.

Project Budget: $250 - $750 USD
Completed In: 7 days
Total Bids: 27
Project Description

You will receive a large CSV file (approx [url removed, login to view] rows) of names of professors at American universities. For each professor the URL of the university is listed as well.

Your job will be to write software that can crawl each website and look for pages on which the professor's name appears, and extract email addresses from there. The goal is to obtain one or more email addresses for each professor.
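As a rough illustration of the "name appears on the page" check, here is a minimal sketch. It assumes the names arrive in the "Last, First" form shown in the sample rows, and that a case-insensitive match on either "First Last" or "Last, First" is good enough. The helper names name_variants and page_mentions are made up for this example.

```python
import re

def name_variants(csv_name):
    """Turn a 'Last, First' CSV name into spellings likely to appear on a page."""
    last, _, first = (part.strip() for part in csv_name.partition(","))
    if not first:
        # No comma in the input: use the name as given.
        return [last]
    return [f"{first} {last}", f"{last}, {first}"]

def page_mentions(text, csv_name):
    """Return True if any spelling of the name occurs in the page text."""
    # Collapse runs of whitespace so line breaks inside a name do not matter.
    normalized = re.sub(r"\s+", " ", text).lower()
    return any(v.lower() in normalized for v in name_variants(csv_name))
```

A real crawler would probably also need to handle middle initials and titles, which this sketch ignores.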

Since it's impossible to determine simply from the name and the URL which email address corresponds to the professor, one potential approach is to retrieve multiple pages on which the name appears and on which at least one email address appears as well (using a regex). Then, rank the email addresses based on how frequently they appear. The address that appears most often is likely to be the correct one. Example:

page 1: John Smith, [url removed, login to view](at)[url removed, login to view]
page 2: John Smith, [url removed, login to view](at)[url removed, login to view]
page 3: John Smith, [url removed, login to view](at)[url removed, login to view]
page 4: John Smith, [url removed, login to view](at)[url removed, login to view]

From this example it is pretty clear that the address appearing on the most pages is likely to be the correct one.
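The frequency-ranking idea could be sketched roughly as follows. This is only one possible reading of the approach: the regex is a simple approximation of an email address (not a full validator), each address is counted once per page, and the function name rank_emails and the example.edu addresses in the usage note are invented for illustration.

```python
import re
from collections import Counter

# Simplified email pattern; an assumption, not a complete address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def rank_emails(pages):
    """Rank addresses by the number of pages they appear on.

    pages: iterable of (url, text) for pages that mention the professor.
    Returns a list of (rank, email, first_url_where_seen), best first.
    """
    counts = Counter()
    first_seen = {}
    for url, text in pages:
        # A set ensures each address is counted once per page.
        for email in {m.lower() for m in EMAIL_RE.findall(text)}:
            counts[email] += 1
            first_seen.setdefault(email, url)
    # Most pages first; ties broken alphabetically for stable output.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, email, first_seen[email])
            for rank, (email, _) in enumerate(ordered, start=1)]
```

For instance, if jsmith@example.edu shows up on three pages and webmaster@example.edu on one, the first gets rank 1 and the second rank 2.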

The output of your software, provided in CSV or other database-readable format, should contain the professor ID (from the input file) and one or more email addresses, each with a rank. Each row should also contain the URL of the page where the address was found.
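To make the expected output a bit more concrete, here is a small sketch of writing such rows with Python's csv module. The column names (ProfessorID, Email, Rank, SourceURL) are only a suggestion and are not part of the input file. The function name write_results is likewise hypothetical.

```python
import csv
import io

def write_results(rows, out):
    """Write (professor_id, email, rank, source_url) tuples as CSV to a file-like object."""
    writer = csv.writer(out)
    # Header names are an assumption; any database-readable naming would do.
    writer.writerow(["ProfessorID", "Email", "Rank", "SourceURL"])
    for professor_id, email, rank, source_url in rows:
        writer.writerow([professor_id, email, rank, source_url])

# Example usage with an in-memory buffer and made-up data.
buffer = io.StringIO()
write_results([(2, "someone@example.edu", 1, "http://www.example.edu/page")], buffer)
```

A professor with several candidate addresses would simply get several rows, one per address, each with its own rank and source URL.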

Here are a few sample rows from the input file:

ID  Name                Department        InstitutionID  InstitutionName     State  Location          URL
1   Obaid, Evelyn       Computer Science  881            Obaid, Evelyn       CA     San Jose, CA      [url removed, login to view]
2   Khuri, Sami         Computer Science  881            Khuri, Sami         CA     San Jose, CA      [url removed, login to view]
3   Beeson, Michael     Computer Science  881            Beeson, Michael     CA     San Jose, CA      [url removed, login to view]
15  Kubelka, Richard    Mathematics       881            Kubelka, Richard    CA     San Jose, CA      [url removed, login to view]
18  Lin, Ty             Computer Science  881            Lin, Ty             CA     San Jose, CA      [url removed, login to view]
29  Key, Scott          Philosophy        145            Key, Scott          CA     Riverside, CA     [url removed, login to view]
45  Lash, Jamie         Foundations       1230           Lash, Jamie         TX     Dallas, TX        [url removed, login to view]
47  Swain, John         Physics           696            Swain, John         MA     Boston, MA        [url removed, login to view]
48  Signorielli, Nancy  Communication     1094           Signorielli, Nancy  DE     Newark, DE        [url removed, login to view]
57  Frederick, Joan     English           457            Frederick, Joan     VA     Harrisonburg, VA  [url removed, login to view]

To save you time, one possibility is to use Google's API to search for pages that contain the name of each professor and are on the domain provided. Example (this is from the second row above):

Query: "Khuri, Sami site:[url removed, login to view]"
[url removed, login to view]+Sami+site%[url removed, login to view]

As you can see the first result in this case is actually a very good page to collect the email from:
[url removed, login to view]

Generally speaking, the first 10-20 results are very likely to contain the correct address.
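Building the search query for each row might look something like the sketch below. It only constructs the query string and its URL-encoded form. It does not call any search API, since the exact endpoint and parameters depend on which API is used. The function name build_query and the www.example.edu domain are placeholders, not values from the input file.

```python
from urllib.parse import urlparse, quote_plus

def build_query(name, university_url):
    """Build a 'Name site:host' search query from a CSV row's name and URL."""
    # Fall back to the raw value if the URL has no scheme and cannot be parsed.
    host = urlparse(university_url).hostname or university_url
    return f"{name} site:{host}"

query = build_query("Khuri, Sami", "http://www.example.edu/")
encoded = quote_plus(query)  # form suitable for a URL query parameter
```

Here query is "Khuri, Sami site:www.example.edu" and encoded is "Khuri%2C+Sami+site%3Awww.example.edu".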

Once again, the deliverable of this project is a text (CSV or TSV) file containing one or more email addresses for each professor, ranked by probability of being correct.

The project must be delivered in at most 1 month.
