I'm looking for someone to code a solid website scraper/crawler. I've already coded a version, but it is not as good as I need it to be, so I need help creating a new, better version from scratch.
In short, I need to be able to manage (create/edit/delete) scraping tasks through a robust, flexible and advanced UI. The scraping script needs to check for due tasks at regular intervals (ideally as a daemon service on my Ubuntu VPS rather than a cron job), with the scraped data inserted into a MySQL database. The sites in question are generally news sites relating to games and tech; the key data is headline, intro and/or full content, date published, author and URL to the full story (similar to what an RSS feed would provide, but these sites do not have RSS feeds).
Beyond the use of PHP, jQuery and Ajax, I expect you to use something like SimpleHTMLdom (which I used; if you prefer another library, that can be discussed) and DataTables for all types of tables (alternatively some Bootstrap tables).
Also note that I use a theme called Metronic – Admin Dashboard for my general UI design. I can provide a default template and a link in that regard.
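To illustrate the fields I mean, here is a rough sketch only. It uses PHP's built-in DOMDocument and DOMXPath rather than SimpleHTMLdom so that it runs without extra libraries. The sample markup, class names, base URL and function names (nodeText, scrapeStories) are all invented for illustration; real sites will differ and each task would need its own selectors.

```php
<?php
// Sample markup is invented for illustration; real sites will differ.
$sampleHtml = <<<HTML
<html><body>
<article class="story">
  <h2 class="headline"><a href="/news/example-story">Example headline</a></h2>
  <p class="intro">Short intro text.</p>
  <span class="author">Jane Doe</span>
  <time datetime="2015-03-01T10:00:00Z">1 March 2015</time>
</article>
</body></html>
HTML;

// Hypothetical helper: trimmed text of the first node matching $query, or null.
function nodeText(DOMXPath $xpath, $query, DOMNode $context)
{
    $node = $xpath->query($query, $context)->item(0);
    return $node ? trim($node->textContent) : null;
}

// Hypothetical extractor: one array per story with the key fields.
function scrapeStories($html, $baseUrl)
{
    // Collect parser warnings internally; real-world HTML is rarely clean.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $stories = [];
    foreach ($xpath->query('//article[contains(@class, "story")]') as $article) {
        $link = $xpath->query('.//h2/a', $article)->item(0);
        $time = $xpath->query('.//time', $article)->item(0);
        $stories[] = [
            'headline'  => nodeText($xpath, './/h2', $article),
            'intro'     => nodeText($xpath, './/p[@class="intro"]', $article),
            'author'    => nodeText($xpath, './/*[@class="author"]', $article),
            'published' => $time ? $time->getAttribute('datetime') : null,
            // Naive join; assumes the href is a root-relative path.
            'url'       => $link ? $baseUrl . $link->getAttribute('href') : null,
        ];
    }
    return $stories;
}

$stories = scrapeStories($sampleHtml, 'http://example.com');
print_r($stories);
```

In the real system the selectors would not be hard-coded like this; they would come from the task settings entered through the UI.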
Features that will be required:
Advanced create/edit/delete tasks UI, so that as far as possible everything needed to get a page scraped for data can be set up through the UI.
Smart way to manage multiple page scrapes from the same website, e.g. when news, reviews and features cannot all be fetched from a single page.
List of tasks with relevant status; search, filter, sort and manage options
Update daemon that can run as a background process on an Ubuntu 14.04 VPS. This manages all the tasks, fetching data according to each task's settings and interval criteria.
Error handling; able to recover from failed fetches and interruptions, re-schedule tasks etc., with logging of what is going on and of any errors that occurred.
Error management; a warning system that flags tasks that might have issues, e.g. we're no longer getting a headline or an author because the site changed its code.
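As a rough sketch of the scheduling, recovery and warning behaviour described in the last three points (not a specification): the task array keys (interval, next_run, failures, expected, warnings), the function names, and the one-minute base for the retry delay are all assumptions made for illustration. The fetch step is passed in as a callable so the sketch runs without a database or network.

```php
<?php
// List the expected fields that came back empty for one scraped item.
function missingFields(array $item, array $expected)
{
    $missing = [];
    foreach ($expected as $field) {
        if (!isset($item[$field]) || trim((string) $item[$field]) === '') {
            $missing[] = $field;
        }
    }
    return $missing;
}

// Normal interval after success; doubling delay (capped at the interval) after failure.
function reschedule(array $task, $success, $now)
{
    if ($success) {
        $task['failures'] = 0;
        $task['next_run'] = $now + $task['interval'];
    } else {
        $task['failures']++;
        $delay = min(60 * pow(2, $task['failures']), $task['interval']);
        $task['next_run'] = $now + $delay;
    }
    return $task;
}

// Run every task whose next_run has passed; record warnings instead of stopping.
function runDueTasks(array $tasks, $fetch, $now)
{
    foreach ($tasks as $i => $task) {
        if ($task['next_run'] > $now) {
            continue; // not due yet
        }
        $task['warnings'] = [];
        try {
            foreach (call_user_func($fetch, $task) as $item) {
                $missing = missingFields($item, $task['expected']);
                if ($missing) {
                    $task['warnings'][] = 'missing: ' . implode(', ', $missing);
                }
            }
            $tasks[$i] = reschedule($task, true, $now);
        } catch (Exception $e) {
            $task['warnings'][] = 'fetch failed: ' . $e->getMessage();
            $tasks[$i] = reschedule($task, false, $now);
        }
    }
    return $tasks;
}

// Example: one due task whose fetch returns an item without an author.
$tasks = [[
    'interval' => 900, 'next_run' => 0, 'failures' => 0,
    'expected' => ['headline', 'author'],
]];
$fetch = function (array $task) {
    return [['headline' => 'Example headline', 'author' => '']];
};
$tasks = runDueTasks($tasks, $fetch, 1000);
print_r($tasks);
```

A daemon could call something like runDueTasks in a loop with a short sleep between passes, loading and saving the tasks from MySQL and running under a process supervisor such as Upstart on Ubuntu 14.04. The warnings would then feed the task list in the UI.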
Happy to answer any further questions, just ask.
Timeline/deadlines; while I would have loved to have this done yesterday, do let me know an estimate of how much time you believe will be required to complete the project. A high level of English is also required. Offers that fail to provide this information will not be considered.
See attached images for a view of my current system.
Updated with two missing attachments that were intended to be included.