Search Engine URL Scraper

In Progress

I'm looking to get a multithreaded search engine URL scraper that reads a text file with up to 10k different search terms and retrieves the first 1,000 URL results for each search. The search engines it needs to scrape are [url removed, login to view], [url removed, login to view], and msn.com. For [url removed, login to view], it can bypass Google's blocking for too many requests by switching between different datacenters. A good datacenter list is here:

[url removed, login to view]

Here is an example list of Yahoo datacenters:

[url removed, login to view]

[url removed, login to view] does not block for too many requests, so there should be no need to switch datacenters with them.
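A minimal sketch of the two requirements above: worker threads pull search terms from a queue, and a thread-safe round-robin switches to a different datacenter host whenever one starts blocking. The datacenter IPs, the `/search?q=` URL shape, and the 403/503 block codes are all assumptions for illustration, not taken from the actual datacenter lists.

```python
# Hypothetical sketch of a multithreaded scraper with datacenter rotation.
# The hosts, URL format, and block-detection codes below are assumptions.
import itertools
import queue
import threading
import urllib.error
import urllib.parse
import urllib.request

# Placeholder hosts -- the real IPs would come from the datacenter lists above.
DATACENTERS = ["74.125.0.1", "74.125.0.2", "74.125.0.3"]
_dc_cycle = itertools.cycle(DATACENTERS)
_dc_lock = threading.Lock()

def next_datacenter():
    """Round-robin to the next datacenter host (thread-safe)."""
    with _dc_lock:
        return next(_dc_cycle)

def fetch_results(term, start=0):
    """Fetch one page of results for a term; if a datacenter blocks us
    (assumed here to surface as HTTP 403/503), retry against the next one."""
    for _ in range(len(DATACENTERS)):
        host = next_datacenter()
        url = "http://%s/search?%s" % (
            host, urllib.parse.urlencode({"q": term, "start": start}))
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in (403, 503):  # not a block -- re-raise
                raise
    return None  # every datacenter blocked this request

def worker(terms, results):
    """Drain the shared term queue until it is empty."""
    while True:
        try:
            term = terms.get_nowait()
        except queue.Empty:
            return
        results.append((term, fetch_results(term)))

def run(term_file, num_threads=10):
    """Load search terms from a text file and fan them out across threads."""
    terms = queue.Queue()
    with open(term_file) as fh:
        for line in fh:
            if line.strip():
                terms.put(line.strip())
    results = []
    threads = [threading.Thread(target=worker, args=(terms, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The queue-per-term design means any number of threads can share the 10k-term file safely, and the rotation lives in one place (`next_datacenter`), so swapping in a real datacenter list is a one-line change.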

Also, the program needs to be able to extract trackback URLs from the searches. An example would be a search like this: [url removed, login to view]+Search

It would scrape the trackback URLs shown in the search descriptions, such as [url removed, login to view]. It would also need to scrape different trackback URL formats, such as "[url removed, login to view]" or "tb&sl=", and many more. It needs to have an option to add different trackback formats to scrape from the search descriptions.
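The configurable-format requirement could be handled by keeping the trackback markers as a user-editable list of regex fragments and matching them against every URL found in a result description. The patterns below are illustrative ("tb&sl=" comes from the posting; the rest are assumptions):

```python
# Sketch of configurable trackback-URL extraction from search descriptions.
import re

# User-editable list of trackback markers; new formats can be appended at
# runtime without touching the extraction code. Only "tb&sl=" is from the
# posting; the other entries are illustrative assumptions.
TRACKBACK_PATTERNS = [
    r"trackback",
    r"tb&sl=",
    r"wp-trackback\.php",
]

def add_pattern(pattern):
    """Register a new trackback format, as the posting requires."""
    TRACKBACK_PATTERNS.append(pattern)

def extract_trackbacks(description, patterns=None):
    """Return every URL in the text that matches one of the trackback markers."""
    patterns = TRACKBACK_PATTERNS if patterns is None else patterns
    urls = re.findall(r"https?://[^\s\"'<>]+", description)
    combined = re.compile("|".join(patterns), re.IGNORECASE)
    return [u for u in urls if combined.search(u)]
```

Usage: `extract_trackbacks('see http://example.com/trackback/123 and http://example.com/page')` keeps only the first URL, and `add_pattern(r"comment-reply")` makes a new format match from then on.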

The program needs to be able to scrape up to 1,000,000 URL records and remove any duplicates.
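At this scale an in-memory set is enough for the dedup step: a million URLs fit comfortably in RAM, and first-seen order is preserved. Treating URLs as duplicates only after stripping whitespace is an assumption; stricter normalization (e.g. lowercasing the hostname) could be added.

```python
# Minimal sketch of the duplicate-removal step for up to ~1,000,000 URLs.
def dedupe_urls(urls):
    """Return the URLs with duplicates removed, keeping first-seen order.
    Whitespace-stripping as the only normalization is an assumption."""
    seen = set()
    unique = []
    for url in urls:
        key = url.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```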

Skills: Anything Goes, C Programming, Java, Javascript, Perl, PHP


Project ID: #1847760