In Progress

269747 PHP Script Web Scraper

We are looking for a web scraper to be built as a PHP script (not a website project) backed by a MySQL database. Rather than using regular expressions, the scraper should parse the pages using the HTML Agility Pack.

The project needs to be started right away and finished in about (time line or date).

The basic program flow of the scraper should be as follows:

- Read a list of starting URLs from the database

- Scan the page for product information

- Info to scrape: product name, description, retail price, sale price, brand, product URL, image URL, in stock, sizes, colors, SKU, scrape date, expiration date (if applicable)

- Insert the information into a database table

- Go to subsequent pages to scan and insert to database until they are all scraped
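The flow above could be sketched roughly as follows. This is an illustration in Python rather than the PHP deliverable, and every name in it (`crawl`, `fetch`, `extract_products`, `next_page_url`, `store`) is hypothetical. The database read and write steps are passed in as plain callables because the table schemas are not part of this posting.

```python
from typing import Callable, Iterable, List, Optional


def crawl(
    start_urls: Iterable[str],
    fetch: Callable[[str], str],
    extract_products: Callable[[str, str], List[dict]],
    next_page_url: Callable[[str, str], Optional[str]],
    store: Callable[[dict], None],
    max_pages: int = 1000,
) -> int:
    """Walk each starting URL and its following pages; return pages scraped.

    All five callables are assumptions made for this sketch:
      fetch(url)                  -> page HTML
      extract_products(url, html) -> list of product dicts
      next_page_url(url, html)    -> URL of the next page, or None
      store(product)              -> insert one product into the database
    """
    seen = set()   # guards against pagination loops
    pages = 0
    for url in start_urls:
        current: Optional[str] = url
        while current is not None and current not in seen and pages < max_pages:
            seen.add(current)
            html = fetch(current)
            for product in extract_products(current, html):
                store(product)
            pages += 1
            current = next_page_url(current, html)
    return pages
```

Keeping the fetch, extract, and store steps separate from the loop means the same loop can serve a single product detail page (where `next_page_url` always returns `None`) and a paginated listing.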

The web-based program needs to have the following features:

- Be able to scrape just one product on a given page (product detail page) or scrape a series of products on a page and then all subsequent pages.

- Example of a page with many products to scrape, with further pages to drill down to and scrape: [url removed]

- The program needs to be able to run on a schedule and also on-demand.

- Insert the gathered data into the MySQL database. We will provide the table schemas.

- The scraper should not insert duplicate items, but if the price, size, or color of an item has changed it should add a new entry that keeps a reference to the original item it duplicates. These updates should be flagged so we know they are new changes.

- Scraper should be able to detect "bad" data or page layout changes so we know to update the scraper.

- The scraper needs to be an asynchronous, multithreaded application. Since many sites and pages are being scraped, we need to be able to see the progress while it is running, and since many page hits will be required, it needs to be multithreaded.

- Scraper should be able to run behind a proxy server if necessary

- Every site we scrape will need to have its own “template” which lets the scraper know how to find the data to extract. This is where the HTML Agility Pack will be used. If it's easier to do this using regular expressions, then those can be used instead.

- We should be able to easily create new “templates” for other pages we want to scrape in the future. The scraper should also be smart enough to know when a template doesn't match the site it's scraping.

- Along with the scraping templates, we need a way to specify how the scraper can go to the next page and all following pages until they are all scraped. We must be able to specify this for each website.

- Provide a function with the following signature that will be able to figure out the domain being scraped, pick the appropriate “template” to use and also know how to get to subsequent pages. This is assuming we have a predefined list of templates to use when the project is finished.
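The posting does not include the signature it refers to, so the following is only one possible shape for per-site templates and the lookup function, sketched in Python rather than PHP. The names `SiteTemplate`, `TEMPLATES`, `select_template`, and `template_matches`, the `example.com` entry, and the HTML patterns are all hypothetical. Regular expressions are used here only to keep the sketch dependency-free; a real template could hold parser queries instead.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urljoin, urlparse


@dataclass
class SiteTemplate:
    """Per-site extraction rules (hypothetical structure)."""
    field_patterns: Dict[str, str]           # field name -> regex with one capture group
    next_page_pattern: Optional[str] = None  # regex capturing the next-page href

    def extract(self, html: str) -> Dict[str, Optional[str]]:
        """Return each field's captured text, or None where nothing matched."""
        result: Dict[str, Optional[str]] = {}
        for name, pattern in self.field_patterns.items():
            m = re.search(pattern, html, re.S)
            result[name] = m.group(1).strip() if m else None
        return result

    def next_page(self, page_url: str, html: str) -> Optional[str]:
        """Resolve the next-page link against the current URL, if present."""
        if self.next_page_pattern is None:
            return None
        m = re.search(self.next_page_pattern, html)
        return urljoin(page_url, m.group(1)) if m else None


# Predefined list of templates, keyed by domain (illustrative entry only).
TEMPLATES: Dict[str, SiteTemplate] = {
    "example.com": SiteTemplate(
        field_patterns={
            "name": r'<h1 class="product-name">(.*?)</h1>',
            "price": r'<span class="price">(.*?)</span>',
        },
        next_page_pattern=r'<a class="next" href="([^"]+)"',
    ),
}


def select_template(url: str, templates: Dict[str, SiteTemplate] = TEMPLATES) -> Optional[SiteTemplate]:
    """Work out the domain of `url` and return its template, or None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return templates.get(host)


def template_matches(template: SiteTemplate, html: str, required=("name", "price")) -> bool:
    """Treat a page as a layout mismatch if any required field is missing."""
    data = template.extract(html)
    return all(data.get(field) for field in required)
```

A check like `template_matches` is one simple way to notice a page layout change: if required fields stop matching, the page can be logged for review instead of being inserted as bad data.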

Skills: Anything Goes, MySQL, PHP


About the Employer:
(0 reviews) Pasig

Project ID: #2016028