Closed

Python/Scrapy Web Scraper

This project was awarded to M5L2764K for $100 USD.

Project Budget: N/A
Total Bids: 6
Project Description

I am looking to move from a desktop scraper to a server-hosted version. I do *not* want a script hard-coded to crawl only one site, but a crawler that can store and process multiple templates for different sites. For sites with a significant amount of content rendered in JavaScript, I will need the option to use Scrapy with Selenium.

For an overview of Scrapy, visit [url removed, login to view]

There are two parts to this project: the crawl templates and the scraper itself.

The crawl template is essentially a form whose values are stored in the database. Those form values define variables in the crawler code: the start URL, the rules for following links, what data to extract, and whether to use a standard crawl or a Scrapy + Selenium crawl (for sites where large portions of the scrapable content are rendered in JavaScript).
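As a rough illustration of what a stored crawl template might look like, here is a minimal sketch using SQLite from the Python standard library. The table name `crawl_templates`, the `CrawlTemplate` fields, and the helper functions are all hypothetical; the actual database and form layout are left to the developer.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class CrawlTemplate:
    """One saved form: everything the crawler needs for a single site."""
    name: str
    start_url: str
    allow_patterns: list = field(default_factory=list)  # regexes for links to follow
    fields: dict = field(default_factory=dict)          # output field -> selector
    use_selenium: bool = False                          # True for JavaScript-heavy sites


# Hypothetical schema; list and dict values are stored as JSON text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_templates (
    name           TEXT PRIMARY KEY,
    start_url      TEXT NOT NULL,
    allow_patterns TEXT NOT NULL,
    fields         TEXT NOT NULL,
    use_selenium   INTEGER NOT NULL DEFAULT 0
)
"""


def save_template(conn, template):
    """Insert the template, or overwrite an existing one with the same name."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO crawl_templates VALUES (?, ?, ?, ?, ?)",
        (
            template.name,
            template.start_url,
            json.dumps(template.allow_patterns),
            json.dumps(template.fields),
            int(template.use_selenium),
        ),
    )
    conn.commit()


def load_template(conn, name):
    """Read a template back by name; raises KeyError if it does not exist."""
    conn.execute(SCHEMA)
    row = conn.execute(
        "SELECT name, start_url, allow_patterns, fields, use_selenium "
        "FROM crawl_templates WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return CrawlTemplate(
        row[0], row[1], json.loads(row[2]), json.loads(row[3]), bool(row[4])
    )
```

Storing the link rules and extraction selectors as JSON keeps the table flat, so a new site only needs a new row rather than new code.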

The scraper runs based on the crawl template, loading either the default spider code or the Selenium spider code, and using the database-stored variables for the start URL, restrictions, etc.
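The selection step described above could be sketched roughly as follows. This is plain Python with no Scrapy import; the spider names `default_spider` and `selenium_spider`, the template keys, and the helper functions are assumptions made for illustration only.

```python
import re

# Hypothetical names for the two spider implementations.
DEFAULT_SPIDER = "default_spider"
SELENIUM_SPIDER = "selenium_spider"


def spider_name_for(template):
    """Pick the spider based on the template's use_selenium flag."""
    return SELENIUM_SPIDER if template["use_selenium"] else DEFAULT_SPIDER


def should_follow(url, template):
    """Return True if the URL matches any of the template's link rules."""
    return any(re.search(pattern, url) for pattern in template["allow_patterns"])


def build_run_config(template):
    """Collect everything a crawl run needs from one stored template."""
    return {
        "spider": spider_name_for(template),
        "start_urls": [template["start_url"]],
        "allow": list(template["allow_patterns"]),
        "fields": dict(template["fields"]),
    }
```

In a Scrapy project, a config like this would most likely be passed to the chosen spider as arguments when the crawl is started, with the `allow` patterns feeding its link-following rules. How that wiring is done is up to the implementation.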

I would like to store the scraped URLs in a database, so that the crawler can skip pages that have already been scraped. Updates and changes to information for pages in the database would be handled separately.
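One possible shape for the scraped-URL store is sketched below, again using SQLite from the standard library. The `scraped_urls` table and the `SeenUrlStore` class are hypothetical names, and URLs are compared as exact strings here; any normalization (trailing slashes, query parameters) would be an extra step.

```python
import sqlite3


class SeenUrlStore:
    """Remembers which URLs have already been scraped."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS scraped_urls ("
            "url TEXT PRIMARY KEY, "
            "scraped_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )

    def is_seen(self, url):
        """Return True if the URL is already in the database."""
        row = self.conn.execute(
            "SELECT 1 FROM scraped_urls WHERE url = ?", (url,)
        ).fetchone()
        return row is not None

    def mark_seen(self, url):
        """Record a URL; recording the same URL twice is harmless."""
        self.conn.execute(
            "INSERT OR IGNORE INTO scraped_urls (url) VALUES (?)", (url,)
        )
        self.conn.commit()

    def filter_new(self, urls):
        """Keep only the URLs that have not been scraped yet, in order."""
        return [url for url in urls if not self.is_seen(url)]
```

The crawler would call `filter_new` (or `is_seen`) before requesting a page and `mark_seen` once the page has been processed. Because updates to existing pages are handled separately, this store never needs to re-admit a URL.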
