I need a crawler to identify phrases in the HTML of websites, for example "google analytics".
There will be about 5 phrases in total, and I want them to be an input that I can control. I also want to control the depth of the crawl, i.e., how many levels "deep" the crawler goes into the website (e.g., home page --> about us --> management would be 3 layers deep).
Also, I want to be able to control the total number of pages crawled per site, e.g., cut off the search after 100 pages crawled.
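The three controls above (phrase list, crawl depth, page cap per site) could be sketched roughly as follows. This is only an illustration, not a finished crawler: the names `crawl_site`, `LinkParser`, and `fetch` are hypothetical, matching is a plain case-insensitive substring search on the raw HTML, and the home page is counted as depth 1 to match the "3 layers" example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page as text; assumes UTF-8 and a 10 s timeout."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_site(start_url, phrases, max_depth=3, max_pages=100, fetch=fetch):
    """Breadth-first crawl of one site, limited by depth and page count.

    Returns ({phrase: [urls where it was found]}, pages_crawled).
    """
    host = urlparse(start_url).netloc
    found = {p: [] for p in phrases}
    seen = {start_url}
    queue = deque([(start_url, 1)])  # assumption: home page is depth 1
    pages = 0
    while queue and pages < max_pages:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages += 1
        lower = html.lower()
        for p in phrases:
            if p.lower() in lower:
                found[p].append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            # stay on the same host and never queue a page twice
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found, pages
```

A call such as `crawl_site("http://example.com/", ["google analytics"], max_depth=3, max_pages=100)` would then stop at whichever limit is hit first. Passing `fetch` as a parameter is just a convenience so the function can be tried without network access.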
The crawler also needs to be able to crawl 20,000 sites in about a week. Therefore, the winning bidder needs to be able to build a "fast" crawler, e.g., one that uses multi-threading. I will also need to be able to upload the URLs of the websites I want to crawl.
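As a rough sizing check: if every site reached the example cap of 100 pages, 20,000 sites would mean at most 2,000,000 pages in 604,800 seconds (7 days), or about 3.3 pages per second sustained. One way to reach that kind of throughput is a thread pool that crawls many sites at once. The sketch below is an assumption-laden outline, not a definitive design: `load_urls`, `crawl_many`, the one-URL-per-line upload file, and the default of 32 workers are all hypothetical, and `crawl_one` stands for whatever per-site crawl function is used.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_urls(path):
    """Read the uploaded URL list; assumes one URL per line."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def crawl_many(urls, crawl_one, workers=32):
    """Run crawl_one(url) for every URL using a pool of worker threads.

    Returns {url: result}; if a crawl raised, the exception is stored
    as the result so one bad site does not stop the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(crawl_one, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Threads suit this job because the work is dominated by waiting on network responses; the worker count would need tuning against bandwidth and politeness toward the crawled sites.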
Finally, this crawler needs to be completed in a couple of days.
This is something that was already asked for a couple of months ago by somebody else, but I need it as well now.
6 freelancers are bidding on average $177 for this job
I can do this in PHP. It would be a multi-threaded script, so to speak: PHP doesn't natively support multi-threading, but there are some tricks to implement it. I have similar experience.