Project ID:
293178
Project Type:
Fixed
Budget:
$30-$250 USD
Project Description:
I need a crawler that identifies phrases in the HTML of websites, for example "google analytics". There will be about 10 phrases in total, and I want the phrase list to be an input that I can control. I want to be able to control the depth of the crawl, i.e., how many levels "deep" the crawler goes into a website (e.g., home page --> about us --> management would be 3 levels deep). I also want to be able to control the total number of pages crawled per site, e.g., stop after 100 pages have been crawled.
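To make the depth and page limits concrete, here is a minimal Python sketch of a per-site crawl. It is an illustration under stated assumptions, not a required design: the names `crawl_site`, `LinkParser`, and `fetch` are hypothetical, matching is assumed to be case-insensitive, only links on the same host are followed, the start page counts as level 1, and failed fetches count toward the page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, phrases, fetch, max_depth=3, max_pages=100):
    """Breadth-first crawl of one site.

    fetch(url) is supplied by the caller and returns the page HTML as a
    string, or None on failure. Returns {phrase: [urls containing it]}.
    """
    site = urlparse(start_url).netloc
    hits = {p: [] for p in phrases}
    seen = {start_url}
    queue = deque([(start_url, 1)])  # the start page is level 1
    pages = 0
    while queue and pages < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        pages += 1  # assumption: failed fetches also count toward the cap
        if html is None:
            continue
        lowered = html.lower()
        for p in phrases:
            if p.lower() in lowered:  # assumption: case-insensitive match
                hits[p].append(url)
        if depth >= max_depth:
            continue  # do not follow links below the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return hits
```

Passing `fetch` in as a parameter keeps the limits testable without network access; a real run would supply a function that downloads the page with a timeout.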
The crawler also needs to be able to crawl 20,000 sites in about a week. The winning bidder therefore needs to be able to build a "fast" crawler, e.g., one that uses multi-threading. I will also need to be able to upload the URLs of the websites I want to crawl.
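For a sense of scale: 20,000 sites at up to 100 pages each is at most 2,000,000 pages, and a week has 604,800 seconds, so the worst case is roughly 3.3 pages per second sustained. The Python sketch below shows that arithmetic, a simple reader for an uploaded URL list, and a thread pool that crawls sites in parallel. All names (`required_pages_per_second`, `load_urls`, `crawl_all`, `crawl_one`) are hypothetical, as are the one-URL-per-line file format and the default of 32 workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def required_pages_per_second(sites, pages_per_site, days):
    """Worst-case sustained fetch rate needed to finish on time."""
    return sites * pages_per_site / (days * 24 * 60 * 60)


def load_urls(lines):
    """Parse an uploaded URL list.

    Assumed format: one URL per line; blank lines and lines starting
    with '#' are skipped.
    """
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def crawl_all(urls, crawl_one, workers=32):
    """Run crawl_one(url) for every site on a thread pool.

    Returns {url: result}. If a site raises, its exception is stored as
    the result so that one bad site does not stop the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(crawl_one, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Threads fit here because the work is dominated by waiting on network responses; the worker count would need tuning against available bandwidth and the politeness expected by the target sites.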
Finally, this crawler needs to be completed in a couple of days.
This crawler should be straightforward for a skilled programmer.
Skills required:
.NET,
PHP,
Python