Project ID:
443534
Project Type:
Fixed
Budget:
$30-$250 USD
Project Description:
I am looking for a developer to create a custom crawler/spider capable of continuously crawling 1000-2000 sites per week.
1. Search 2000 sites.
2. Set the crawl frequency for each site.
3. Option to search a whole site or only selected folders of a site.
4. Option to add a username and password for a site where cookies, user authentication, or form submission is required.
5. Search URLs and parameters to be managed in an external SQL database.
6. Collect and store the content, metadata, and search info for each URL. Also record whether each URL is new or changed since the previous crawl.
7. Display results in tree structure for each site crawled.
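As a rough illustration of requirements 5 and 6, here is a minimal sketch of how "new or changed since previous crawl" could be tracked in a SQL database. It assumes SQLite as a stand-in for the external SQL db and a SHA-256 hash of the page body as the change signal; the table name `pages` and the helpers `open_store` and `record_page` are hypothetical, not part of any existing system.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the page store. SQLite is an assumption here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " content_hash TEXT NOT NULL,"
        " last_crawled TEXT NOT NULL)"
    )
    return conn


def record_page(conn, url, content, crawled_at):
    """Store a fetched page body (bytes) and return 'new', 'changed', or 'unchanged'."""
    digest = hashlib.sha256(content).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        status = "new"          # URL not seen in any previous crawl
    elif row[0] != digest:
        status = "changed"      # body differs from the stored hash
    else:
        status = "unchanged"
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, content_hash, last_crawled)"
        " VALUES (?, ?, ?)",
        (url, digest, crawled_at),
    )
    conn.commit()
    return status
```

Hashing the raw body is the simplest choice but will flag pages as changed when only a timestamp or advert rotates; a real crawler would probably hash a cleaned-up version of the content instead.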
Here are some sites for starters:
http://www.cms-cmck.com/
http://www.blakedawson.com/
http://www.law-now.com/
http://www.cliffordchance.com/
http://www.algoodbody.ie/
http://www.fsa.gov.uk/
http://fsahandbook.info/
http://www.addleshawgoddard.com/
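For the "whole site or selected folders" option, a scope check along the following lines could decide whether a discovered link should be crawled. This is only a sketch: the function name `in_scope` and the folder name `/news/` used below are illustrative assumptions, not actual paths on the sites listed.

```python
from urllib.parse import urlsplit


def in_scope(url, site_root, folders=None):
    """Return True if url belongs to site_root and, when folders are given,
    lies inside one of them. With no folders, the whole site is in scope."""
    target = urlsplit(url)
    root = urlsplit(site_root)
    if target.netloc.lower() != root.netloc.lower():
        return False            # different host: never in scope
    if not folders:
        return True             # whole-site crawl
    path = target.path or "/"
    for folder in folders:
        name = folder.strip("/")
        if not name:
            return True         # "/" means the whole site
        prefix = "/" + name + "/"
        # Match the folder itself ("/news") or anything below it ("/news/...").
        if path == "/" + name or path.startswith(prefix):
            return True
    return False
```

For example, `in_scope("http://www.law-now.com/news/item.html", "http://www.law-now.com/", ["/news/"])` accepts the link, while the same call with a URL under a different folder rejects it.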
For each source/folder crawled, I need results like these:
i. List the URLs that have been crawled in that folder, with the date/time crawled
ii. List number of crawled URLs
iii. List number of retrieval errors
iv. List number of excluded URLs
v. List number of new URLs (since last crawl)
vi. List number of changed URLs (since last crawl)
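The per-folder figures above could be computed roughly as follows. This sketch assumes each crawl attempt is stored as a record with a status of "new", "changed", "unchanged", "error", or "excluded"; the `CrawlRecord` type and `summarize` function are hypothetical names, and counting only successfully fetched pages as "crawled" is an assumption the posting does not settle.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CrawlRecord:
    url: str
    crawled_at: str   # date/time of the crawl attempt, e.g. ISO 8601
    status: str       # "new", "changed", "unchanged", "error", or "excluded"


def summarize(records):
    """Build the per-folder counts (items ii to vi) from a list of records."""
    counts = Counter(r.status for r in records)
    return {
        # Assumption: "crawled" means successfully fetched pages only.
        "crawled": counts["new"] + counts["changed"] + counts["unchanged"],
        "retrieval_errors": counts["error"],
        "excluded": counts["excluded"],
        "new": counts["new"],
        "changed": counts["changed"],
    }
```

Item i (the URL list with date/time crawled) would simply be the `url` and `crawled_at` fields of the same records.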
---
I would need a demo crawler first to make sure you're capable. If you're not interested in showing a demo, please don't bother bidding.
It will be a long-term project.
Skills required:
.NET,
Java,
Perl,
PHP,
Python