1) I have already developed a desktop-based application in Scrapy/Python that is hard-coded to crawl three separate sites (using three "spiders") and pulls out product details such as Product ID, Title, Price, Vendor and Stock Position. At present, these details are used to generate .sql files that must be uploaded to the web server to update the Products Table in the database.
2) The current requirement is to develop a server version of the scraper. The expected features are as follows:
a) The Products Table in the server database is to be populated automatically by the scraper. The required fields are Product ID, Title, Price, Vendor, Stock Position, Payment Options and Delivery Time.
b) Easy extensibility (with some Python coding) to add more sites in the future.
c) To meet the above, the scraper is to be implemented as two modules: the "Scraper Module" and the "Parameter Module" (one possible split is sketched after this list).
f) The scraped URLs (referred to by the primary URL) are to be saved in a database table with a "processed" flag, so that they can be skipped if scraping has to be resumed after an interruption (see the bookkeeping sketch after this list).
g) Primary URLs are also to be saved with the date of the last successful scrape, to enable scheduling of periodic re-scraping.
h) During each scrape, only those fields that have changed since the last scrape are to be extracted, and the product's existing row is to be updated accordingly. New products are to be inserted as new rows in the Products Table (see the pipeline sketch after this list).
j) Performance must be adequate to scrape all the target sites and generate the Products database.
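To illustrate requirement c), here is a minimal sketch of one way the two modules could relate: each Parameter Module exposes the start URLs and field selectors for a single site, and a generic Scraper Module consumes them. The module name site_params, the SITE dict and every selector below are assumptions for illustration only, not part of the requirement.

```python
# site_params/example_vendor.py -- hypothetical Parameter Module for one site.
# Field names follow the Products Table; all URLs and selectors are placeholders.
SITE = {
    "name": "example_vendor",
    "start_urls": ["https://www.example-vendor.com/catalogue"],
    "product_link_css": "a.product-link::attr(href)",
    "fields": {
        "product_id": "span.sku::text",
        "title": "h1.product-title::text",
        "price": "span.price::text",
        "vendor": "span.vendor::text",
        "stock_position": "div.stock::text",
        "payment_options": "div.payment-options::text",
        "delivery_time": "div.delivery::text",
    },
}

# scraper/generic_spider.py -- hypothetical generic Scraper Module.
import scrapy
from site_params.example_vendor import SITE  # one Parameter Module per site

class GenericProductSpider(scrapy.Spider):
    name = SITE["name"]
    start_urls = SITE["start_urls"]

    def parse(self, response):
        # Follow every product link found on the listing page.
        for href in response.css(SITE["product_link_css"]).getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        # Extract each configured field with its site-specific selector.
        item = {name: response.css(sel).get(default="").strip()
                for name, sel in SITE["fields"].items()}
        item["url"] = response.url
        yield item
```

Adding a new site would then mean writing one more Parameter Module, which is the "easy extensibility with some Python coding" asked for in b).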
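For requirements f) and g), a minimal bookkeeping sketch, assuming SQLite and illustrative table/column names (scraped_urls with a processed flag, primary_urls with a last_scraped date); the actual database engine and schema are for the developer to propose.

```python
# url_store.py -- hypothetical URL bookkeeping for resume (f) and scheduling (g).
import sqlite3
from datetime import date

def init(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS scraped_urls (
            url TEXT PRIMARY KEY,
            primary_url TEXT NOT NULL,
            processed INTEGER NOT NULL DEFAULT 0   -- skip on resume when 1
        );
        CREATE TABLE IF NOT EXISTS primary_urls (
            url TEXT PRIMARY KEY,
            last_scraped TEXT                      -- date of last successful scrape
        );
    """)

def should_skip(conn, url: str) -> bool:
    # True if this URL was already processed before an interruption.
    row = conn.execute(
        "SELECT processed FROM scraped_urls WHERE url = ?", (url,)).fetchone()
    return bool(row and row[0])

def mark_processed(conn, url: str, primary_url: str) -> None:
    # Record a scraped URL as done so a resumed run skips it.
    conn.execute(
        "INSERT INTO scraped_urls (url, primary_url, processed) VALUES (?, ?, 1) "
        "ON CONFLICT(url) DO UPDATE SET processed = 1", (url, primary_url))
    conn.commit()

def mark_primary_done(conn, primary_url: str) -> None:
    # Stamp the primary URL with today's date for periodic re-scrape scheduling.
    conn.execute(
        "INSERT INTO primary_urls (url, last_scraped) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET last_scraped = excluded.last_scraped",
        (primary_url, date.today().isoformat()))
    conn.commit()
```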
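For requirement h), one possible approach is a Scrapy item pipeline that compares the scraped item against the existing row and issues an UPDATE limited to the changed columns, or an INSERT for a new product. Column names follow the field list in a); the use of SQLite, the table name products and the class name are assumptions.

```python
# pipelines.py -- hypothetical update/insert pipeline for requirement h).
import sqlite3

FIELDS = ["title", "price", "vendor", "stock_position",
          "payment_options", "delivery_time"]

class ProductUpsertPipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("products.db")  # assumed local SQLite DB

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        row = self.conn.execute(
            "SELECT {} FROM products WHERE product_id = ?".format(", ".join(FIELDS)),
            (item["product_id"],)).fetchone()
        if row is None:
            # New product: insert a complete row.
            self.conn.execute(
                "INSERT INTO products (product_id, {}) VALUES (?, {})".format(
                    ", ".join(FIELDS), ", ".join("?" * len(FIELDS))),
                (item["product_id"], *[item.get(f) for f in FIELDS]))
        else:
            # Existing product: update only the columns whose values changed.
            changed = {f: item.get(f) for f, old in zip(FIELDS, row)
                       if item.get(f) != old}
            if changed:
                assignments = ", ".join(f"{f} = ?" for f in changed)
                self.conn.execute(
                    f"UPDATE products SET {assignments} WHERE product_id = ?",
                    (*changed.values(), item["product_id"]))
        self.conn.commit()
        return item
```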
Budget: USD 200 to USD 300