269747 PHP Script Web Scraper

This project is now closed with a project budget of N/A.

Project Description

We are looking for a web scraper to be built in PHP (not a WebSite Project) with a MySQL database. Rather than using regular expressions, the scraper should parse the pages using the HTML Agility Pack. [url removed, login to view]

The project needs to start right away and be finished in about (time line or date).

The basic program flow of the scraper should be as follows:

- Read a list of starting URLs from the database

- Scan the page for product information

- Info to scrape: product name, description, retail price, sale price, brand, product URL, image URL, in stock, sizes, colors, SKU, scrape date, expiration date (if applicable)

- Insert the information into a database table

- Go to subsequent pages to scan and insert to database until they are all scraped
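
The flow above could be sketched roughly as follows. This is only an illustration in Python, not the requested PHP deliverable: the `start_urls` and `products` tables, the `<h2 class="product">` and `<a rel="next">` markup, and the `fetch` callable are all assumptions made for the example, and SQLite stands in for the real database.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProductPageParser(HTMLParser):
    """Collects product names and the next-page link from one listing page.

    Assumed (hypothetical) markup: <h2 class="product">NAME</h2> per product
    and <a rel="next" href="..."> for pagination.
    """

    def __init__(self):
        super().__init__()
        self.products = []
        self.next_href = None
        self._in_product = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "product":
            self._in_product = True
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_product = False

    def handle_data(self, data):
        if self._in_product and data.strip():
            self.products.append(data.strip())


def run_scrape(conn, fetch):
    """Read start URLs, scan each page, insert products, follow next pages.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    start_urls = [row[0] for row in conn.execute("SELECT url FROM start_urls")]
    seen = set()  # guards against pagination loops
    for url in start_urls:
        while url and url not in seen:
            seen.add(url)
            parser = ProductPageParser()
            parser.feed(fetch(url))
            conn.executemany(
                "INSERT INTO products (name, product_url) VALUES (?, ?)",
                [(name, url) for name in parser.products],
            )
            # Resolve a relative next-page link against the current URL.
            url = urljoin(url, parser.next_href) if parser.next_href else None
    conn.commit()
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded, so the same flow can run against a proxy, a thread pool, or canned test pages.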

The web based program needs to have the following features:

- Be able to scrape just one product on a given page (product detail page) or scrape a series of products on a page and then all subsequent pages.

- Example of a page with many products to scrape, plus further pages to drill down into and scrape: [url removed, login to view]

- The program needs to be able to run on a schedule and also on-demand.

- Insert gathered data into an MS SQL database. We will provide the table schemas.

- The scraper should not insert duplicate items, but if the price, size, or color has changed it should add the item as a new entry while keeping a reference to the original item it duplicates. These new entries should be flagged somehow so we know they are new changes.

- Scraper should be able to detect "bad" data or page layout changes so we know to update the scraper.

- The scraper needs to be an asynchronous, multithreaded application. Because many sites and pages are being scraped, we need to be able to see the progress while it is running, and because many page requests will be required, the work must be spread across multiple threads.

- Scraper should be able to run behind a proxy server if necessary

- Every site we scrape will need to have its own “template” which lets the scraper know how to find the data to extract. This is where the HTML Agility Pack will be used. If it's easier to do this using regular expressions, then those can be used instead.

- We should be able to easily create new “templates” for other pages we want to scrape in the future. The scraper should also be smart enough to know when a template doesn't match the given site it's scraping.

- Along with the scraping templates we need a way to specify how the scraper can go to the next page and all following pages until they are all scraped. We must be able to specify this for each website.

- Provide a function with the following signature that will be able to figure out the domain being scraped, pick the appropriate “template” to use and also know how to get to subsequent pages. This is assuming we have a predefined list of templates to use when the project is finished.
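
One possible shape for the template requirements above, sketched in Python for illustration only (the project itself asks for PHP, and the required function signature is not given in this posting). Everything named here is hypothetical: the `SiteTemplate` fields, the `shop.example` domain, the patterns, and the `template_for` and `scrape_page` functions. The sketch uses regular expressions, which the posting allows as an alternative to the HTML Agility Pack.

```python
import re
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse


@dataclass(frozen=True)
class SiteTemplate:
    """Per-site extraction rules. All patterns below are hypothetical."""
    name_pattern: str       # one group capturing the product name
    price_pattern: str      # one group capturing the price
    next_page_pattern: str  # one group capturing the next-page href


# Hypothetical registry keyed by domain; adding a site means adding an entry.
TEMPLATES = {
    "shop.example": SiteTemplate(
        name_pattern=r'<h1 class="title">([^<]+)</h1>',
        price_pattern=r'<span class="price">\$([0-9.]+)</span>',
        next_page_pattern=r'<a class="next" href="([^"]+)"',
    ),
}


class TemplateMismatch(Exception):
    """Raised when a template cannot find its required fields on a page."""


def template_for(url):
    """Work out the domain of `url` and return the template registered for it."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    try:
        return TEMPLATES[host]
    except KeyError:
        raise LookupError(f"no template registered for {host!r}") from None


def scrape_page(url, html):
    """Return (product, next_url); next_url is None on the last page."""
    template = template_for(url)
    name = re.search(template.name_pattern, html)
    price = re.search(template.price_pattern, html)
    if name is None or price is None:
        # Likely a layout change: signal it instead of storing bad data.
        raise TemplateMismatch(f"template did not match {url}")
    nxt = re.search(template.next_page_pattern, html)
    next_url = urljoin(url, nxt.group(1)) if nxt else None
    product = {"name": name.group(1).strip(), "price": float(price.group(1))}
    return product, next_url
```

Treating a missing required field as an error, rather than storing an empty value, is one simple way to notice that a site's layout has changed and its template needs updating.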
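
The duplicate-handling requirement in the list above (no duplicate rows, but a new flagged entry when price, size, or color changes) might look something like this. It is a minimal sketch in Python under stated assumptions: the real table schemas are to be provided by the employer, so the `products` columns, the use of SKU as the item key, and the `store_product` function are all invented for the example, and SQLite stands in for the production database.

```python
import sqlite3

# Hypothetical schema. original_id points at the first row stored for the
# item; is_change = 1 marks rows that were added because something changed.
SCHEMA = """
CREATE TABLE products (
    id          INTEGER PRIMARY KEY,
    sku         TEXT NOT NULL,
    price       TEXT,
    sizes       TEXT,
    colors      TEXT,
    original_id INTEGER REFERENCES products(id),
    is_change   INTEGER NOT NULL DEFAULT 0
)
"""


def store_product(conn, sku, price, sizes, colors):
    """Insert a row unless the latest row for this SKU is identical.

    Returns the new row id, or None when the item was skipped as a duplicate.
    """
    latest = conn.execute(
        "SELECT id, price, sizes, colors, original_id FROM products "
        "WHERE sku = ? ORDER BY id DESC LIMIT 1",
        (sku,),
    ).fetchone()
    if latest is None:
        # First time this item is seen: plain insert, not flagged.
        cur = conn.execute(
            "INSERT INTO products (sku, price, sizes, colors) VALUES (?, ?, ?, ?)",
            (sku, price, sizes, colors),
        )
        return cur.lastrowid
    row_id, old_price, old_sizes, old_colors, original_id = latest
    if (old_price, old_sizes, old_colors) == (price, sizes, colors):
        return None  # nothing changed, so do not insert a duplicate
    # Something changed: add a flagged row that references the original item.
    cur = conn.execute(
        "INSERT INTO products (sku, price, sizes, colors, original_id, is_change) "
        "VALUES (?, ?, ?, ?, ?, 1)",
        (sku, price, sizes, colors, original_id or row_id),
    )
    return cur.lastrowid
```

Comparing against only the most recent row for an item means a price that changes and later changes back is recorded as two separate flagged entries, which keeps the full history.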
