I'm looking for a multithreaded search engine URL scraper that reads a text file of up to 10,000 search terms and retrieves the 1,000 result URLs for each search. The search engines it needs to scrape are google.com, yahoo.com, and msn.com. When Google blocks it for making too many requests, it should get around the block by simply switching to a different Google datacenter. A good datacenter list is here.
Here is an example list of Yahoo datacenters:
Msn.com does not block for too many requests, so there should be no need to switch datacenters for it.
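As a rough illustration of the threading and datacenter-switching behavior described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Blocked` exception, the `HostRotator` class, the `fetch(host, term)` callable (which would perform the real HTTP request and parse the results), and the worker count are hypothetical names, not part of any existing tool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Blocked(Exception):
    """Hypothetical signal: raised by fetch() when a host refuses more requests."""


class HostRotator:
    """Thread-safe pointer into a list of datacenter hosts."""

    def __init__(self, hosts):
        self._hosts = list(hosts)
        self._index = 0
        self._lock = threading.Lock()

    def current(self):
        with self._lock:
            return self._hosts[self._index]

    def rotate(self, failed_host):
        # Advance only if no other thread has already moved past the failed host.
        with self._lock:
            if self._hosts[self._index] == failed_host:
                self._index = (self._index + 1) % len(self._hosts)
            return self._hosts[self._index]


def search_term(term, rotator, fetch, max_attempts):
    """Try one search term, switching hosts each time the current one blocks."""
    host = rotator.current()
    for _ in range(max_attempts):
        try:
            return term, fetch(host, term)
        except Blocked:
            host = rotator.rotate(host)
    return term, []  # every attempt was blocked


def run_searches(terms, hosts, fetch, workers=8):
    """Run all terms on a thread pool; returns {term: [urls]}."""
    rotator = HostRotator(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(search_term, term, rotator, fetch, len(hosts))
            for term in terms
        ]
        return dict(future.result() for future in futures)
```

For an engine that does not block, the same function could be called with a single-entry host list, in which case no rotation ever happens.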
Also, the program needs to be able to extract trackback URLs from the search results. For example, given a search like this: http://www.google.com/search?hl=en&q=%22mt-tb.cgi%2F4%22&btnG=Google+Search
it would scrape the trackback URLs that appear in the search result descriptions, such as http://www.DoubleBarreledOpinions.com/blog/mt-tb.cgi/4 . It would need to scrape other trackback URL formats as well, such as "tb.php/634" or "tb&sl=", and many more. It needs to have an option to add new trackback formats to scrape from the search descriptions.
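One possible way to handle user-extendable trackback formats is to build a regular expression from a list of format markers. The sketch below is only an assumption about how that might look: `DEFAULT_FORMATS`, `build_pattern`, and `extract_trackbacks` are hypothetical names, and the markers are taken from the examples in this description with the numeric ID left off.

```python
import re

# Hypothetical starting list; more markers can be appended by the user.
DEFAULT_FORMATS = ["mt-tb.cgi/", "tb.php/", "tb&sl="]


def build_pattern(formats):
    """Compile a regex matching http(s) URLs that contain any format marker."""
    alternatives = "|".join(re.escape(marker) for marker in formats)
    return re.compile(
        r"https?://[^\s\"'<>]*?(?:" + alternatives + r")[^\s\"'<>]*",
        re.IGNORECASE,
    )


def extract_trackbacks(text, formats=DEFAULT_FORMATS):
    """Return trackback URLs found in text, in order, without repeats."""
    pattern = build_pattern(formats)
    seen = set()
    results = []
    for match in pattern.finditer(text):
        # Drop punctuation that commonly trails a URL in running text.
        url = match.group(0).rstrip(".,;)")
        if url not in seen:
            seen.add(url)
            results.append(url)
    return results
```

Adding a format is then a matter of passing a longer list, for example `extract_trackbacks(text, DEFAULT_FORMATS + ["trackback/"])`.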
The program needs to be able to scrape up to 1,000,000 URL records and remove any duplicates.
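Duplicate removal at this scale can be sketched with an in-memory set, which should comfortably hold a million URL strings on a typical machine. The normalization rule used here (trim whitespace, lowercase the scheme and host, leave the path untouched) is an assumption; the description does not say what counts as a duplicate. `normalize` and `dedupe_urls` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Comparison key: host lowercased, path and query left as-is."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


def dedupe_urls(urls):
    """Yield each distinct URL once, keeping the first form encountered."""
    seen = set()
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            yield url
```

Because it is a generator, it can be fed directly from a file object and written out line by line without holding the full output list in memory.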