101593 Search Engine Url Scraper

Status: IN PROGRESS
Bids: 0
Avg Bid (USD): N/A
Project Budget (USD): N/A

Project Description:
I'm looking for a multithreaded search engine URL scraper that reads a text file containing up to 10,000 different search terms and retrieves the 1,000 URL results from each search. The search engines it needs to scrape are google.com, yahoo.com, and msn.com. For google.com, it should get around Google's blocking for too many requests by simply switching between different datacenters. A good datacenter list is here:

http://www.vaughns-1-pagers.com/internet/google-data-centers.htm

Here is an example list of Yahoo datacenters:

http://www.vaughns-1-pagers.com/internet/google-data-centers.htm

Msn.com does not block for too many requests, so there should be no need to switch datacenters for it.
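To make the datacenter-switching requirement concrete, here is a minimal sketch of one way it could work: a thread pool runs the searches, and each search moves to the next host in a shared rotation whenever the current one blocks. The names `HostRotator`, `Blocked`, `search_with_rotation`, and `scrape_terms` are hypothetical, and the `fetch(host, term)` callable that performs the actual request and parses the result URLs is assumed to be supplied separately.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class Blocked(Exception):
    """Raised by a fetch function when a host refuses the request (assumed signal)."""


class HostRotator:
    """Thread-safe round-robin over a list of datacenter hosts."""

    def __init__(self, hosts):
        self._hosts = list(hosts)
        if not self._hosts:
            raise ValueError("at least one host is required")
        self._cycle = itertools.cycle(self._hosts)
        self._lock = threading.Lock()

    def __len__(self):
        return len(self._hosts)

    def next_host(self):
        # The lock keeps worker threads from advancing the cycle at the same time.
        with self._lock:
            return next(self._cycle)


def search_with_rotation(term, rotator, fetch):
    """Make up to len(rotator) attempts, moving to the next host on each block."""
    for _ in range(len(rotator)):
        host = rotator.next_host()
        try:
            return fetch(host, term)
        except Blocked:
            continue  # this host is blocking us; try the next one in rotation
    return []  # every attempt was blocked


def scrape_terms(terms, rotator, fetch, workers=8):
    """Run all searches on a thread pool and return {term: [urls]}."""
    terms = list(terms)  # materialize so the terms can be paired with results
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: search_with_rotation(t, rotator, fetch), terms)
        return dict(zip(terms, results))
```

For msn.com, the same code could be used with a single-host rotator, since no switching is expected to be needed there.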

Also, the program needs to be able to extract trackback URLs from the search results. An example would be a search like this: http://www.google.com/search?hl=en&q=%22mt-tb.cgi%2F4%22&btnG=Google+Search

It would scrape the trackback URLs shown in the search result descriptions, such as http://www.DoubleBarreledOpinions.com/blog/mt-tb.cgi/4 . It also needs to scrape other trackback URL formats, such as "tb.php/634" or "tb&sl=", and many more. It needs an option to add new trackback formats to scrape from the search descriptions.
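As an illustration of the configurable trackback formats, the following sketch builds one regular expression from a list of marker strings and pulls matching URL-like tokens out of a description. The function names and the `DEFAULT_FORMATS` tuple are hypothetical. The markers are generalized from the examples above (for instance "tb.php/" instead of "tb.php/634"), which is an assumption about what a "format" means here.

```python
import re

# Starting markers generalized from the formats named above; extend as needed.
DEFAULT_FORMATS = ("mt-tb.cgi/", "tb.php/", "tb&sl=")

# Characters that cannot be part of a URL token in a plain-text description.
_TOKEN = r"[^\s\"'<>]*"


def build_trackback_regex(formats):
    """Match a URL-like token that contains any of the given marker strings."""
    markers = "|".join(re.escape(f) for f in formats)
    return re.compile(r"(?:https?://)?" + _TOKEN + "(?:" + markers + ")" + _TOKEN,
                      re.IGNORECASE)


def extract_trackbacks(description, formats=DEFAULT_FORMATS):
    """Return every trackback-looking URL found in one search description."""
    pattern = build_trackback_regex(formats)
    # Trailing punctuation usually belongs to the sentence, not the URL.
    return [m.group(0).rstrip(".,;)") for m in pattern.finditer(description)]
```

A new format would then be added by passing a longer list, for example `extract_trackbacks(text, formats=DEFAULT_FORMATS + ("trackback/",))`.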

The program needs to be able to scrape up to 1,000,000 URL records and remove any duplicates.
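Duplicate removal at this scale can be sketched with an in-memory set, which should handle a million URLs comfortably on ordinary hardware. The helper names below are hypothetical. Treating URLs that differ only in the case of the scheme or host as duplicates is an assumption, not something the description specifies.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase the scheme and host; leave path, query, and fragment untouched."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))


def unique_urls(urls):
    """Yield each distinct URL once, in first-seen order, skipping blanks."""
    seen = set()
    for url in urls:
        key = normalize(url)
        if key and key not in seen:
            seen.add(key)
            yield key
```

Because `unique_urls` is a generator, results can be written to the output file as they arrive rather than being held in a second list.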

Skills required:
Anything Goes, C Programming, Java, Javascript, Perl, PHP
About the employer:
Verified