Crawl a provided set of websites for email addresses

Closed

Description

You will receive a large CSV file (approx [url removed, login to view] rows) of names of professors at American universities. For each professor the URL of the university is listed as well.

Your job will be to write software that can crawl each website and look for pages on which the professor's name appears, and extract email addresses from there. The goal is to obtain one or more email addresses for each professor.
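
One detail worth handling early: the input file lists names as "Last, First", while pages often print "First Last". Below is a minimal sketch of a name check, assuming plain case-insensitive substring matching is good enough. The function names `name_variants` and `page_mentions` are hypothetical, and middle initials or titles are not handled.

```python
import re


def name_variants(csv_name):
    """Return lowercase spellings of a 'Last, First' name to look for on a page."""
    last, _, first = (part.strip().lower() for part in csv_name.partition(","))
    if not first:
        # No comma in the input: fall back to the name as given.
        return {last}
    return {f"{last}, {first}", f"{first} {last}"}


def page_mentions(page_text, csv_name):
    """True if any spelling of the name occurs in the page text."""
    # Collapse runs of whitespace so names split across lines still match.
    normalized = re.sub(r"\s+", " ", page_text).lower()
    return any(variant in normalized for variant in name_variants(csv_name))


if __name__ == "__main__":
    print(page_mentions("Prof. Sami\n  Khuri, Computer Science", "Khuri, Sami"))
```

Pages that fail this check can be skipped before any email extraction is attempted.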

Since it's impossible to determine simply from the name and the URL which email address corresponds to the professor, one potential approach is to retrieve multiple pages on which the name appears and on which at least one email address appears as well (using a regex). Then, rank the email addresses based on how frequently they appear. The address that appears most often is likely to be the correct one. Example:

page 1: John Smith, [url removed, login to view](at)[url removed, login to view]

page 2: John Smith, [url removed, login to view](at)[url removed, login to view]

page 3: John Smith, [url removed, login to view](at)[url removed, login to view]

page 4: John Smith, [url removed, login to view](at)[url removed, login to view]

From this example it is pretty clear that the address appearing on the most pages is likely to be the correct one.
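
The frequency ranking described above could be sketched roughly as follows. This is an illustration under assumptions, not a finished crawler: the regex is deliberately simple, it also accepts the "(at)" spelling seen in the example, each address is counted once per page, and the names `EMAIL_RE`, `extract_emails` and `rank_emails` plus the example.edu addresses are made up for the demonstration.

```python
import re
from collections import Counter

# Local part, then "@" or "(at)", then a dotted domain. Simplified on purpose.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)(?:@|\s?\(at\)\s?)([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)",
    re.IGNORECASE,
)


def extract_emails(page_text):
    """Return the set of normalized (lowercase, '@'-joined) addresses on a page."""
    return {f"{local}@{domain}".lower() for local, domain in EMAIL_RE.findall(page_text)}


def rank_emails(pages):
    """pages maps URL -> page text. Returns (email, page_count, first_url) tuples,
    most frequently seen address first."""
    counts = Counter()
    first_seen = {}
    for url, text in pages.items():
        for email in extract_emails(text):
            counts[email] += 1
            first_seen.setdefault(email, url)
    return [(email, n, first_seen[email]) for email, n in counts.most_common()]


if __name__ == "__main__":
    pages = {
        "http://example.edu/p1": "John Smith, jsmith(at)example.edu",
        "http://example.edu/p2": "John Smith, JSmith@example.edu",
        "http://example.edu/p3": "John Smith, webmaster(at)example.edu",
        "http://example.edu/p4": "John Smith, jsmith(at)example.edu.",
    }
    for email, count, url in rank_emails(pages):
        print(email, count, url)
```

Counting pages rather than raw occurrences keeps a single page that repeats a shared address (for example a department contact in a footer) from dominating the ranking.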

The output of your software, provided in CSV or other database-readable format, should contain the professor ID (from the input file) and one or more email addresses, each with a rank. Each row should also contain the URL of the page where the address was found.
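
As an illustration of one possible output layout, the sketch below writes one row per address with its rank and source page. The column names (`professor_id`, `rank`, `email`, `source_url`) and the function name `write_results` are assumptions; the description only requires that the ID, the address, its rank and the page URL be present.

```python
import csv
import io


def write_results(rows, out_file):
    """rows: iterable of (professor_id, [(email, source_url), ...]),
    with each professor's addresses already sorted best first."""
    writer = csv.writer(out_file)
    writer.writerow(["professor_id", "rank", "email", "source_url"])
    for professor_id, ranked in rows:
        # Rank 1 is the address judged most likely to be correct.
        for rank, (email, url) in enumerate(ranked, start=1):
            writer.writerow([professor_id, rank, email, url])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_results(
        [(2, [("a@example.edu", "http://example.edu/x"),
              ("b@example.edu", "http://example.edu/y")])],
        buffer,
    )
    print(buffer.getvalue())
```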

Here are a few sample rows from the input file:

ID | Name | Department | InstitutionID | InstitutionName | State | Location | URL
1 | Obaid, Evelyn | Computer Science | 881 | Obaid, Evelyn | CA | San Jose, CA | [url removed, login to view]
2 | Khuri, Sami | Computer Science | 881 | Khuri, Sami | CA | San Jose, CA | [url removed, login to view]
3 | Beeson, Michael | Computer Science | 881 | Beeson, Michael | CA | San Jose, CA | [url removed, login to view]
15 | Kubelka, Richard | Mathematics | 881 | Kubelka, Richard | CA | San Jose, CA | [url removed, login to view]
18 | Lin, Ty | Computer Science | 881 | Lin, Ty | CA | San Jose, CA | [url removed, login to view]
29 | Key, Scott | Philosophy | 145 | Key, Scott | CA | Riverside, CA | [url removed, login to view]
45 | Lash, Jamie | Foundations | 1230 | Lash, Jamie | TX | Dallas, TX | [url removed, login to view]
47 | Swain, John | Physics | 696 | Swain, John | MA | Boston, MA | [url removed, login to view]
48 | Signorielli, Nancy | Communication | 1094 | Signorielli, Nancy | DE | Newark, DE | [url removed, login to view]
57 | Frederick, Joan | English | 457 | Frederick, Joan | VA | Harrisonburg, VA | [url removed, login to view]

To save you time, one possibility is to query Google using their API for pages that contain the name of each professor and are on the domain provided. Example (this is from the second row above):

Query: "Khuri, Sami site:[url removed, login to view]"

[url removed, login to view]+Sami+site%[url removed, login to view]
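
Building the query string for each input row might look like the sketch below. It only constructs and URL-encodes the text of the query; it does not call any search API, since the exact API and its parameters are not specified here. The helper name `build_site_query`, the choice to strip a leading "www." and the example.edu host are assumptions.

```python
from urllib.parse import quote_plus, urlparse


def build_site_query(professor_name, university_url):
    """Return a '<name> site:<host>' query for one row of the input file."""
    # Accept URLs with or without a scheme.
    url = university_url if "//" in university_url else "//" + university_url
    host = urlparse(url).netloc
    # Assumption: dropping "www." lets the query also cover department subdomains.
    if host.startswith("www."):
        host = host[len("www."):]
    return f"{professor_name} site:{host}"


if __name__ == "__main__":
    query = build_site_query("Khuri, Sami", "http://www.example.edu/")
    print(query)
    print(quote_plus(query))  # form suitable for a URL query parameter
```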

As you can see the first result in this case is actually a very good page to collect the email from:

[url removed, login to view]

Generally speaking, the first 10-20 results are very likely to contain the correct address.

Once again, the deliverable of this project is a text (CSV or TSV) file containing one or more email addresses for each professor, ranked by probability of being correct.

The project must be delivered in at most 1 month.

Skills: C# Programming, C++ Programming, Java, Python, SQL

Project ID: #4330935

Awarded to:

randombits

I have extensive experience with scrapers and parsers, so this project won't be a problem for me. Please see PM for a question and more info.

$250 USD in 7 days
(0 Reviews)
0.0

27 freelancers are bidding on average $403 for this job

srinichal

I can deliver the project

$320 USD in 6 days
(76 Reviews)
6.8
diepbp

I am confident to handle your project. Please check your inbox for details, thank you

$250 USD in 10 days
(120 Reviews)
6.2
Alexod

I can do it

$400 USD in 7 days
(19 Reviews)
6.0
SigmaVisual

I can help in your project, please check PMB and our ratings/reviews to get idea of our experience. Please let me know if you have any queries.

$250 USD in 5 days
(23 Reviews)
6.0
sandeep25101982

Hi .Net/C#/ASP expert here. Please check PM for details. Thanks

$700 USD in 15 days
(21 Reviews)
5.8
proteamspb

Hi, our team specializes in web crawler. Please see PM for details.

$380 USD in 10 days
(10 Reviews)
5.6
qspsolutions

Details in PMB.

$750 USD in 14 days
(8 Reviews)
5.1
phpXpertbd

I worked on many similar projects, I have big experience in data mining projects. I can finish this task in short time, with the best quality.

$750 USD in 30 days
(2 Reviews)
4.5
medsoftngo

I've done many similar projects, actually I already have a module to start with, it will crawl every university website from the csv looking for the name and a pattern of an email, it will look for the left side of the ...

$400 USD in 7 days
(7 Reviews)
4.5
DenialWang

10+ years' hands-on programming experiences, can manage this work in C# and Java, may provide previous sample codes if required. Thanks/Denial

$500 USD in 20 days
(4 Reviews)
3.9
NirvanaWebDev

I've read your project specs fully and carefully. They are very well written. I can definitely code this scraper for you; it's my specialty ;). I will send you a message with my proposed approach. Also, my bid is very ...

$250 USD in 10 days
(8 Reviews)
3.5
vinsoneric

Hi, Good Day!!! Having read the project description, I am willing to work on this. I have extensive experience in web crawling in any language. Thanks

$250 USD in 7 days
(7 Reviews)
3.4
freelancerj2ee

Please check the PMB for detail.

$250 USD in 7 days
(4 Reviews)
3.3
sureshvv

Will make the script run on the amazon cloud to parallelize it and be able to extract the results quickly.

$500 USD in 10 days
(1 Review)
2.8
greggfletcher

Hello, please refer to your INBOX. Thank You. Best Regards.

$250 USD in 7 days
(3 Reviews)
2.2
nikhil08

Hi, Please check Inbox for details.

$250 USD in 30 days
(1 Review)
1.0
bhoir

Hi, I have done similar web scraping tasks before, and I can provide a solution that delivers accurate data in less time if given the opportunity. Looking forward to hearing from you.

$250 USD in 3 days
(0 Reviews)
0.0
goelvivek

Hi, Best quality work will be provided.

$250 USD in 30 days
(0 Reviews)
0.0
regeya

Looks to be a simple task. Looking forward to hearing from you!

$550 USD in 10 days
(0 Reviews)
0.0
freedevs

Hi ! I can do this program for you. Please refer pm.

$720 USD in 40 days
(1 Review)
0.0