Web Scraper using Python, Scrapy, MySQL & JSON

  • Status Closed
  • Budget N/A
  • Total Bids 24

Project Description

I would like a web scraper that:

1. Retrieves a seed list of URIs from a MySQL database

2. Using multiple threads (Twisted framework) and Scrapy, scrapes each page for links (1 level deep only)

3. Validates each link to ensure it is a full URL

4. Gets the response from the scraped URL (i.e. redirect, OK, not found)

4a. If there is no response, tries a DNS lookup

5. Saves the root address and response results, then imports them into a MySQL table (this can be batched through a JSON file if required)
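Steps 3 to 5 above could be sketched roughly as follows. This is only an illustration using the Python standard library, not Scrapy or Twisted, and every name in it (LinkCollector, is_full_url, classify_status, dns_resolves, to_json_batch) is hypothetical. It assumes "full URL" means an absolute http or https address, and that the JSON batch holds one root/response pair per record.

```python
import json
import socket
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def is_full_url(url):
    """Step 3: accept only absolute http(s) URLs that name a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def extract_links(html, base_url):
    """Return the full URLs linked from one page (1 level deep)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [link for link in collector.links if is_full_url(link)]


def classify_status(code):
    """Step 4: map an HTTP status code to a coarse response label."""
    if 200 <= code < 300:
        return "OK"
    if 300 <= code < 400:
        return "redirect"
    if code == 404:
        return "not found"
    return "other"


def dns_resolves(host):
    """Step 4a: fall back to a DNS lookup when no HTTP response arrived."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False


def to_json_batch(results):
    """Step 5: serialize (root, label) pairs for a later bulk import."""
    return json.dumps([{"root": root, "response": label} for root, label in results])
```

In a Scrapy-based build, the spider and its link extractor would take the place of LinkCollector, and the JSON batch would feed whatever MySQL import routine is chosen.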

As this is a proof of concept, it doesn't need to be built with Django unless doing so does not affect the price. It can be launched from a Linux console.

The most important part of this project is that the scraping is made efficient by using multiple threads and by eliminating duplicate URLs in step 4, so that no link is sent a request more than once.
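One possible way to eliminate duplicates before step 4 is to normalize each URL and keep a set of the ones already seen. The sketch below is a minimal illustration with hypothetical names (normalize, unique_urls). It assumes two URLs count as the same when they differ only in scheme or host letter case, a missing trailing slash on the root, or a fragment.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case the scheme and host, default the path to "/", drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def unique_urls(urls):
    """Yield each normalized URL once, in first-seen order."""
    seen = set()
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            yield key
```

A plain set is enough when all callbacks run in a single Twisted reactor thread; if real worker threads share it, it would need a lock. Scrapy's scheduler also filters duplicate requests by default, so a helper like this would mainly matter for links collected outside that path.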

This project has the potential for additional development if the right developer is found.

Note: Well-commented code is expected.
