Closed

Scrapy - Python

This project was awarded to nitelfreelance for $150 USD.

Project Budget
$10 - $30 USD
Total Bids
17
Project Description

Hi,

I'm looking for someone to build a simple spider using Scrapy and Python.

All I'm looking for is a spider that crawls multiple sites using the CrawlSpider option. The spider will look for all the links and email addresses on each site and store them in an array associated with that site.

Example output, using the CSV export option, would be as follows:

domain, links, email address
[url removed, login to view], [link1, link2, link3, link4], [email{at}[url removed, login to view],email2{at}[url removed, login to view],email3{at}[url removed, login to view]]
[url removed, login to view], [link1, link2, link3, link4], [email{at}[url removed, login to view],email2{at}[url removed, login to view],email3{at}[url removed, login to view]]

The spider should get the start URLs from an external text file and also use those domain names as the only allowed domains to crawl.
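As a minimal sketch (the filename and one-URL-per-line layout are assumptions), the start-URL file could be read and turned into the allowed-domains list like this:

```python
from urllib.parse import urlparse

def load_start_urls(path):
    # One URL per line; blank lines are skipped.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def allowed_domains(urls):
    # Restrict the crawl to the domains of the start URLs themselves.
    return sorted({urlparse(u).netloc for u in urls})
```

Both lists would then be assigned to the spider's `start_urls` and `allowed_domains` attributes.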

The arrays should only store unique values, i.e. if email{at}[url removed, login to view] is captured twice, only one copy of it is stored in the array.
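Uniqueness falls out naturally if the captured addresses are kept in a set. A sketch, where the regex is a deliberately simplified email pattern rather than a full RFC 5322 matcher:

```python
import re

# Simplified address pattern -- good enough for harvesting, not full RFC 5322.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def collect_emails(text, seen=None):
    # Add every address found in `text` to `seen`; the set drops duplicates.
    seen = set() if seen is None else seen
    seen.update(EMAIL_RE.findall(text))
    return seen
```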

The spider should allow us to ignore URLs containing certain keywords that we can specify somewhere within the script, e.g. if we tell it to ignore "blog" it will not crawl [url removed, login to view] or [url removed, login to view]
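In Scrapy this is what the `deny` argument of `LinkExtractor` is for; as a plain sketch of the same check (the keyword list here is illustrative and would be filled in by whoever runs the script):

```python
IGNORE_KEYWORDS = ["blog"]  # illustrative -- specified by the operator

def should_follow(url, keywords=IGNORE_KEYWORDS):
    # False for any URL whose text contains one of the ignore keywords.
    return not any(keyword in url for keyword in keywords)
```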

Finally, the script should let us set a maximum number of pages to crawl per site. For example, if we set it to 30, it would request at most 30 URLs for that site.
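Scrapy's built-in `CLOSESPIDER_PAGECOUNT` setting caps pages globally rather than per site, so a per-domain counter along these lines would likely be needed (a sketch; the class name and default limit are illustrative):

```python
from urllib.parse import urlparse

class PageBudget:
    """Track pages fetched per domain and enforce a per-site maximum."""

    def __init__(self, max_pages=30):
        self.max_pages = max_pages
        self.counts = {}

    def allow(self, url):
        # Count this URL against its domain; refuse once the cap is reached.
        domain = urlparse(url).netloc
        if self.counts.get(domain, 0) >= self.max_pages:
            return False
        self.counts[domain] = self.counts.get(domain, 0) + 1
        return True
```

Inside the spider, `allow()` would be consulted before yielding each new request; running the spider with `-o output.csv` then writes the collected items in CSV form.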
