
101593 Search Engine Url Scraper

This project is now closed with a project budget of N/A.

Project Description

I'm looking for a multithreaded search engine URL scraper that reads a text file of up to 10,000 different search terms and retrieves the 1,000 URL results for each search. The search engines it needs to scrape are [url removed, login to view], [url removed, login to view], and msn.com. For [url removed, login to view], it can get around Google's blocking for too many requests by simply switching between different datacenters. A good datacenter list is here:

[url removed, login to view]

Here is an example list of Yahoo datacenters:

[url removed, login to view]

[url removed, login to view] does not block for too many requests, so there should be no need to switch datacenters with it.
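To make the multithreading and datacenter-switching requirements concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a finished design: the names `Blocked`, `load_terms`, `fetch_with_rotation`, and `scrape_all` are hypothetical, and the actual HTTP request and result parsing are left to a `fetch(host, term, max_results)` callable supplied by the developer, which is assumed to raise `Blocked` when a datacenter refuses the request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

RESULTS_PER_TERM = 1000   # URLs wanted per search term (from the description)
MAX_TERMS = 10_000        # upper bound on search terms in the input file


class Blocked(Exception):
    """Hypothetical signal: the datacenter refused the request."""


def load_terms(lines: Iterable[str], limit: int = MAX_TERMS) -> List[str]:
    """Return up to `limit` non-empty, stripped search terms."""
    terms = [line.strip() for line in lines if line.strip()]
    return terms[:limit]


def fetch_with_rotation(
    term: str,
    hosts: List[str],
    fetch: Callable[[str, str, int], List[str]],
) -> List[str]:
    """Try each datacenter in turn; move on to the next when one blocks."""
    for host in hosts:
        try:
            return fetch(host, term, RESULTS_PER_TERM)
        except Blocked:
            continue  # this datacenter is blocking, switch to the next one
    raise Blocked(f"every datacenter blocked the search for {term!r}")


def scrape_all(
    terms: List[str],
    hosts: List[str],
    fetch: Callable[[str, str, int], List[str]],
    workers: int = 8,
) -> Dict[str, List[str]]:
    """Run one search per term on a thread pool and collect the URLs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: fetch_with_rotation(t, hosts, fetch), terms)
        return dict(zip(terms, results))
```

In use, the terms would come from something like `load_terms(open("terms.txt"))` (the file name is an assumption), and an engine that never blocks can simply be given a one-item host list.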

Also, the program needs to be able to extract trackback URLs from the search results. An example would be a search like this: [url removed, login to view]+Search

It would scrape the trackback URLs seen in the search descriptions, such as [url removed, login to view]. It would also need to scrape different trackback URL formats, such as "[url removed, login to view]" or "tb&sl=", and many more. It needs an option to add different trackback formats to scrape from search descriptions.
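One possible shape for the configurable trackback extraction is sketched below in Python. It is only an illustration: the class name `TrackbackExtractor` is hypothetical, the only format taken from this description is `tb&sl=`, and the `/trackback/` marker and the example.com-style URLs in the usage note are assumptions. The approach shown (find every URL in the description text, keep the ones containing a known marker) is a simple substring match, not necessarily how every trackback format would need to be handled.

```python
import re
from typing import Iterable, List

# Loose URL matcher for plain-text search descriptions.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


class TrackbackExtractor:
    """Keep URLs that contain any of a user-editable list of markers."""

    def __init__(self, markers: Iterable[str]):
        self.markers: List[str] = list(markers)

    def add_marker(self, marker: str) -> None:
        """Register an additional trackback format at runtime."""
        if marker not in self.markers:
            self.markers.append(marker)

    def extract(self, description: str) -> List[str]:
        """Return trackback-looking URLs in the order they appear."""
        found = []
        for raw in URL_RE.findall(description):
            url = raw.rstrip(".,;)")  # drop sentence punctuation stuck to the URL
            if any(marker in url for marker in self.markers):
                found.append(url)
        return found
```

For example, `TrackbackExtractor(["tb&sl="])` would pick `http://blog.example/index.php?tb&sl=42` out of a description, and calling `add_marker("/trackback/")` afterwards would make it also accept URLs in that second format. The marker list could just as well be loaded from a text file so new formats can be added without changing the program.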

The program needs to be able to scrape up to 1,000,000 URL records and remove any duplicates.
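Duplicate removal at this scale can be done in memory with a set. The Python sketch below is one hedged way to do it: the function names `normalize` and `dedupe` are hypothetical, and treating URLs that differ only in the letter case of the scheme or host name as duplicates is an assumption on my part, since the description does not say what counts as a duplicate.

```python
from typing import Iterable, Iterator
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Comparison key: trimmed URL with lowercase scheme and host."""
    parts = urlsplit(url.strip())
    return parts._replace(
        scheme=parts.scheme.lower(), netloc=parts.netloc.lower()
    ).geturl()


def dedupe(urls: Iterable[str]) -> Iterator[str]:
    """Yield each distinct URL once, keeping first-seen order."""
    seen = set()
    for url in urls:
        key = normalize(url)
        if key and key not in seen:
            seen.add(key)
            yield url.strip()
```

Because `dedupe` is a generator, results can be written to the output file as they arrive; only the set of keys is held in memory.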
