We are looking for a web scraper to be built in PHP (not as a Web Site Project) with a MySQL database. Rather than using regular expressions, the scraper should parse the pages using HTML Agility Pack.
The project needs to be started right away and be finished in about (time line or date).
The basic program flow of the scraper should be as follows:
- Read a list of starting URLs from the database
- Scan the page for product information
- Information to scrape: product name, description, retail price, sale price, brand, product URL, image URL, in stock,
sizes, colors, SKU, scrape date, expiration date (if applicable)
- Insert the information into a database table
- Go to subsequent pages, scanning each one and inserting into the database, until all of them are scraped
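
To make the flow above concrete, here is a minimal sketch in Python. It is for illustration only and is not the requested deliverable. It uses an in-memory SQLite database and the standard-library HTML parser as stand-ins for the real database and parsing library. The table names, column names, page markup (a div with class "product" and a link with rel="next") and the fetch callback are all assumptions made up for this example.

```python
import sqlite3
from html.parser import HTMLParser


class ProductPageParser(HTMLParser):
    """Collects products and the next-page link from one page.

    Assumes hypothetical markup: <div class="product" data-name data-price>
    for each product and <a rel="next" href> for pagination.
    """

    def __init__(self):
        super().__init__()
        self.products = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div" and a.get("class") == "product":
            self.products.append((a.get("data-name"), a.get("data-price")))
        elif tag == "a" and a.get("rel") == "next":
            self.next_url = a.get("href")


def scrape(conn, fetch):
    """Walk every starting URL and its subsequent pages, storing products."""
    start_urls = [row[0] for row in conn.execute("SELECT url FROM start_urls")]
    seen = set()  # guards against pagination loops
    for url in start_urls:
        while url and url not in seen:
            seen.add(url)
            parser = ProductPageParser()
            parser.feed(fetch(url))
            conn.executemany(
                "INSERT INTO products (name, price, product_url) VALUES (?, ?, ?)",
                [(name, price, url) for name, price in parser.products],
            )
            url = parser.next_url  # None on the last page ends the loop
    conn.commit()


def demo():
    """Run the flow against two canned pages instead of the network."""
    pages = {
        "http://shop.example/p1": '<div class="product" data-name="Shoe" '
        'data-price="49.00"></div><a rel="next" href="http://shop.example/p2">next</a>',
        "http://shop.example/p2": '<div class="product" data-name="Hat" '
        'data-price="9.99"></div>',
    }
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE start_urls (url TEXT)")
    conn.execute("CREATE TABLE products (name TEXT, price TEXT, product_url TEXT)")
    conn.execute("INSERT INTO start_urls VALUES ('http://shop.example/p1')")
    scrape(conn, lambda u: pages[u])
    return conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
```

Passing the page fetcher in as a callback keeps the loop testable without network access. A real run would swap in an HTTP client and the provided table schemas.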
The web-based program needs to have the following features:
- Be able to scrape just one product on a given page (product detail page) or scrape a series of products
on a page and then all subsequent pages.
- Example of a page with many products to scrape, plus other pages to drill down to and scrape:
[url removed, login to view]
- The program needs to be able to run on a schedule and also on-demand.
- Insert gathered data into an MS SQL database. We will provide the table schemas.
- Scraper should not insert duplicate items, but if price/size/color has changed it should add the item as a
new entry while keeping a reference to the original item it duplicates. These updates should be flagged
so we know they are new changes.
- Scraper should be able to detect "bad" data or page layout changes so we know to update the scraper.
- Scraper needs to be an asynchronous, multithreaded application. Because many sites and pages are
being scraped, we need to be able to see the progress while it is running. And because many page hits
will be required, pages need to be fetched in parallel threads.
- Scraper should be able to run behind a proxy server if necessary
- Every site we scrape will need to have its own "template" which lets the scraper know how to find the
data to extract. This is where HTML Agility Pack will be used. If it's easier to do this using
regular expressions then that can be used.
- We should be able to easily create new "templates" for other pages we want to scrape in the future. And
the scraper should be smart enough to know when a template doesn't match the given site it's scraping.
- Along with the scraping templates we need a way to specify how the scraper can go to the next page
and all following pages until they are all scraped. We must be able to specify this for each website.
- Provide a function with the following signature that will be able to figure out the domain being scraped,
pick the appropriate "template" to use and also know how to get to subsequent pages. This is assuming
we have a predefined list of templates to use when the project is finished.
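
As an illustration of the template requirements above, the following Python sketch shows one possible shape for per-site templates: a lookup by domain, a check for whether a page still matches the template, and a rule for reaching the next page. The function name and signature, the Template fields, the marker strings, the "page" query parameter and the example domain are all hypothetical. In particular, this is not the signature referred to in the last item.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


@dataclass(frozen=True)
class Template:
    """Hypothetical per-site template: what a page must contain, how to page."""
    domain: str
    required_markers: tuple  # substrings expected in every matching page
    page_param: str          # query parameter holding the page number

    def matches(self, html: str) -> bool:
        """False suggests a layout change or the wrong template for this site."""
        return all(marker in html for marker in self.required_markers)

    def next_page_url(self, url: str) -> str:
        """Increment the page number in the query string (page 1 if absent)."""
        parts = urlparse(url)
        query = parse_qs(parts.query)
        current = int(query.get(self.page_param, ["1"])[0])
        query[self.page_param] = [str(current + 1)]
        return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


# Predefined list of templates, keyed by domain (example entry is made up).
TEMPLATES = {
    "shop.example": Template(
        "shop.example", ('class="product"', 'class="price"'), "page"
    ),
}


def template_for(url: str) -> Optional[Template]:
    """Work out the domain being scraped and pick its template, if any."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return TEMPLATES.get(host)
```

A page-number parameter is only one way a site can paginate. Sites that use "next" links or infinite scroll would need a different rule stored in their template.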
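
The duplicate-handling requirement above (skip exact duplicates, but store a changed price/size/color as a new flagged entry that references the original) could be sketched like this in Python. The table layout is invented for the example, since the real schemas are to be provided. Identifying a product by its SKU is also an assumption.

```python
import sqlite3

# Fields whose change should produce a new, flagged entry.
TRACKED = ("price", "sizes", "colors")


def make_db():
    """Create an in-memory table with a hypothetical layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT, price TEXT, "
        "sizes TEXT, colors TEXT, original_id INTEGER, is_change INTEGER)"
    )
    return conn


def store_product(conn, item):
    """Store a scraped item (a dict); return 'new', 'changed' or 'duplicate'."""
    latest = conn.execute(
        "SELECT id, original_id, price, sizes, colors FROM products "
        "WHERE sku = ? ORDER BY id DESC LIMIT 1",
        (item["sku"],),
    ).fetchone()
    values = tuple(item[field] for field in TRACKED)
    if latest is None:
        original_id, status = None, "new"
    elif tuple(latest[2:]) == values:
        return "duplicate"  # nothing tracked has changed, so insert nothing
    else:
        # Point at the first entry for this SKU, not just the previous one.
        original_id = latest[1] or latest[0]
        status = "changed"
    conn.execute(
        "INSERT INTO products (sku, price, sizes, colors, original_id, is_change) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (item["sku"], *values, original_id, int(status == "changed")),
    )
    return status
```

Comparing against the most recent entry for a SKU means a price that changes and later changes back is recorded as two separate updates.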