In Progress

Python/Scrapy Web Scraper

I am looking to move from a desktop scraper to a server-hosted version. I do *not* want a script hard-coded to crawl only one site, but a crawler that can store and process multiple templates for different sites. For sites where a significant amount of content is rendered by JavaScript, I will need the option to use Scrapy with Selenium.

For an overview of Scrapy, see the project's official documentation.

There are two parts to this project: the crawl templates and the scraper itself.

The crawl template is essentially a form whose values are stored in the database and define variables in the crawler code. The stored values would define the start URL, the rules for following links, what data to extract, and whether to use a standard crawl or a Scrapy + Selenium crawl (for sites where large portions of the scrapable content are rendered by JavaScript).
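To make the template idea concrete, here is a minimal sketch of how one stored template might be represented in Python. The class name `CrawlTemplate` and every field name are assumptions for illustration, not an agreed schema, and serializing to JSON is only one possible way to store the values in a database column.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CrawlTemplate:
    """One stored crawl template. All field names are illustrative assumptions."""
    name: str
    start_url: str
    allow_patterns: list = field(default_factory=list)  # regexes for links to follow
    deny_patterns: list = field(default_factory=list)   # regexes for links to skip
    extract_fields: dict = field(default_factory=dict)  # output field -> XPath expression
    use_selenium: bool = False                          # True for JavaScript-heavy sites

    def to_json(self) -> str:
        # Serialize the form values so they fit in a single text column.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "CrawlTemplate":
        # Rebuild the template from the stored text.
        return cls(**json.loads(raw))


template = CrawlTemplate(
    name="example-site",
    start_url="http://example.com/",
    allow_patterns=[r"/items/"],
    extract_fields={"title": "//h1/text()"},
    use_selenium=True,
)
restored = CrawlTemplate.from_json(template.to_json())
```

A separate table with one column per setting would work just as well; the JSON form simply keeps the sketch short.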

The scraper runs based on the crawl template, loading either the default spider code or the Selenium spider code, and using the database-stored variables for the start URL, restrictions, etc.
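The selection step described here could look roughly like the sketch below. It uses only the standard library and treats a template as a plain dictionary. The spider names `"default"` and `"selenium"`, the dictionary keys, and the helper names are all hypothetical. The link rule (deny patterns win, an empty allow list permits everything) is a design choice for this sketch, picked to resemble the allow/deny behavior of Scrapy's link extractors.

```python
import re

DEFAULT_SPIDER = "default"    # hypothetical name of the standard spider
SELENIUM_SPIDER = "selenium"  # hypothetical name of the Scrapy + Selenium spider


def choose_spider(template: dict) -> str:
    """Pick which spider code to run, based on the stored template flag."""
    return SELENIUM_SPIDER if template.get("use_selenium") else DEFAULT_SPIDER


def should_follow(url: str, template: dict) -> bool:
    """Apply the template's link rules: deny first, then allow."""
    if any(re.search(p, url) for p in template.get("deny_patterns", [])):
        return False
    allow = template.get("allow_patterns", [])
    # An empty allow list means "follow everything not denied".
    return not allow or any(re.search(p, url) for p in allow)


template = {
    "start_url": "http://example.com/",
    "allow_patterns": [r"/items/"],
    "deny_patterns": [r"/items/private/"],
    "use_selenium": True,
}
spider_name = choose_spider(template)
```

In a real Scrapy project the chosen name and the template values would presumably be passed to the spider as arguments when the crawl is started; that wiring is left out here.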

I would like to store the scraped URLs in a database, so that the crawler can skip pages that have already been scraped. Updates and changes to information for pages in the database would be handled separately.
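As a rough illustration of the skip-already-scraped requirement, the sketch below keeps a table of scraped URLs and checks it before a page is fetched. It uses the standard library's `sqlite3` as a stand-in so the example runs on its own; the project itself lists MySQL. The class name `SeenUrlStore`, the table name, and the column names are assumptions.

```python
import hashlib
import sqlite3


class SeenUrlStore:
    """Tracks scraped URLs so the crawler can skip them (illustrative sketch)."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS scraped_urls ("
            "url_hash TEXT PRIMARY KEY, url TEXT NOT NULL)"
        )

    @staticmethod
    def _key(url: str) -> str:
        # A fixed-length hash keeps the primary key short even for long URLs.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def is_seen(self, url: str) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM scraped_urls WHERE url_hash = ?", (self._key(url),)
        ).fetchone()
        return row is not None

    def mark_seen(self, url: str) -> None:
        # INSERT OR IGNORE makes repeated calls for the same URL harmless.
        self.conn.execute(
            "INSERT OR IGNORE INTO scraped_urls (url_hash, url) VALUES (?, ?)",
            (self._key(url), url),
        )
        self.conn.commit()


store = SeenUrlStore(sqlite3.connect(":memory:"))
if not store.is_seen("http://example.com/items/1"):
    store.mark_seen("http://example.com/items/1")
```

With MySQL the same idea applies, though the SQL differs slightly: common Python MySQL drivers use `%s` placeholders, and the equivalent statement is `INSERT IGNORE`.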

Skills: Data Mining, JavaScript, MySQL, Python, Web Scraping


About the Employer:
(11 reviews) Las Vegas, United States

Project ID: #4181547