We are looking for freelancers who specialize in website crawling. We are working on several projects that require full crawls of websites such as http://www.parlament.ch. For large websites we typically define several subsites that serve as better starting points for the crawler. The result should be the complete text contained in the website; text in PDF files and in HTML tables must also be crawled and included in the result. Once the crawler is correctly set up for a specific site, we typically expect periodic crawling of the site's contents (e.g. once a week).
We are looking for someone who is experienced in this data-gathering process and can manage all steps: set up the crawler, improve it, manage document content updates, and transfer the data to our server.
Crawling can be done with Apache Nutch or with other crawling software that the specialist recommends.
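To illustrate the kind of output we expect, here is a minimal sketch (Python standard library only, sample HTML invented for illustration) of extracting plain text from a page, including table cell contents while skipping script and style blocks. A production setup would of course use a full crawler such as Apache Nutch plus a PDF text extractor; this only shows the target result format.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, including table cell contents,
    while skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

# Hypothetical page fragment, standing in for a crawled document.
page = """<html><body>
<h1>Session 2024</h1>
<table><tr><td>Motion</td><td>Accepted</td></tr></table>
<script>var x = 1;</script>
</body></html>"""

extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.chunks)
print(text)  # -> Session 2024 Motion Accepted
```

The crawler's deliverable for each page would be text like this, stored alongside the source URL so that periodic re-crawls can detect and update changed documents.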