Closed

Web scraper in Python with Scrapy (scrapy.org) for Google

This project received 5 bids from talented freelancers with an average bid price of $190 USD.

Project Budget: N/A
Total Bids: 5
Project Description

I need to scrape Google search results using Python with Scrapy ([url removed, login to view]).

My problem is that Google blocks automated scraping.

I need help configuring the scraper (for example, by increasing the download delay) and/or setting up an anonymous proxy (such as Tor+Privoxy) so that I can scrape Google search results without being blocked.
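As a starting point for the "increase scraping delay" idea, here is a minimal sketch of a Scrapy settings fragment that slows the crawl down. The setting names are standard Scrapy settings; the specific values are assumptions chosen for illustration, and nothing here guarantees that Google will stop blocking the requests.

```python
# settings.py (sketch): throttle the crawl so requests do not arrive in bursts.
# All numeric values below are illustrative assumptions, not recommendations.

DOWNLOAD_DELAY = 10               # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # wait a random 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit the same domain in parallel

# AutoThrottle adjusts the delay dynamically based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 10     # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60       # upper bound on the delay in seconds

COOKIES_ENABLED = False           # do not carry a session cookie across requests
```

With `RANDOMIZE_DOWNLOAD_DELAY` enabled, the effective wait for the values above would fall between 5 and 15 seconds per request.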

What I have so far:

1) Simple Google parser:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if [url removed, login to view]('[url removed, login to view]'):
            for url in [url removed, login to view]('//div[@id="ires"]/ol/li//h3[@class="r"]/a/@href').extract():
                ... # Here parse google links
            for url in [url removed, login to view]('//a[@id="pnnext"]/@href').extract():
                url = "https://" + [url removed, login to view]('/')[2] + url
                yield Request(url)

Without any proxy, this simple parser is recognized as an automated scraper and blocked.
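The next-page step in the parser appears to turn the relative `pnnext` link into an absolute URL by prepending the scheme and host of the current page. The exact expression is elided in the post, so the following is only a sketch of one standard-library way to get the same result, assuming the extracted href is relative to the page it came from. The helper name `absolute_url` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolute_url(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL.

    urljoin handles root-relative paths ("/search?..."), page-relative
    paths, and hrefs that are already absolute.
    """
    return urljoin(page_url, href)


# Hypothetical example: a root-relative "next page" link on a results page.
next_page = absolute_url(
    "https://www.google.com/search?q=scrapy",
    "/search?q=scrapy&start=10",
)
print(next_page)
```

Resolving against the full page URL avoids hard-coding the scheme and splitting the URL by hand, and it keeps working if an href is already absolute.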

2) I installed Tor+Privoxy and added this middleware class:

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            [url removed, login to view]['proxy'] = "http://localhost:8118"

configured in the settings:

    DOWNLOADER_MIDDLEWARES = {
        '[url removed, login to view]': 110,
        '[url removed, login to view]': 100,
    }
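For reference, the proxy setup described above can be sketched as a self-contained middleware plus a settings fragment. This assumes the elided line in the class sets `request.meta['proxy']`, which is the key Scrapy reads to pick the proxy for a request. The module path `myproject.middlewares` and the order value 350 are hypothetical placeholders, not the values from the post.

```python
class ProxyMiddleware:
    """Downloader middleware sketch: send every request through a local HTTP proxy."""

    # Privoxy's default listen address, the same value used in the post.
    PROXY_URL = "http://localhost:8118"

    def process_request(self, request, spider):
        # Scrapy takes the proxy for a request from request.meta["proxy"].
        request.meta["proxy"] = self.PROXY_URL
        # Returning None lets Scrapy continue handling the request normally.
        return None


# settings.py fragment. "myproject.middlewares" is a hypothetical module path,
# and 350 is an assumed order value; lower numbers run earlier in process_request.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
}
```

The middleware only tags requests with a proxy address; whether HTTPS traffic actually passes through Tor+Privoxy still depends on the proxy and on the Scrapy version in use.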

However, Scrapy does not seem to work with Tor+Privoxy on HTTPS pages. Over plain HTTP, the Scrapy+Tor+Privoxy combination works, but Google now serves results only over HTTPS.

So what I actually need is a sample project with a detailed proxy configuration (Tor/Privoxy or something else) that shows how to avoid being blocked by Google for automated scraping.
