Closed

Web scraper in Python with Scrapy (scrapy.org) for Google

This project received 5 bids from talented freelancers with an average bid price of $190 USD.

Project Budget: N/A
Total Bids: 5
Project Description

I need to scrape Google search results, using Python with Scrapy ([url removed, login to view]).
My problem is that Google blocks automated scraping.
I need help configuring the scraper (for example, by increasing the download delay) and/or an anonymous proxy (such as Tor+Privoxy) so that I can scrape Google search results without being blocked.
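To make the "increase scraping delay" idea concrete, here is a minimal sketch of throttling options for a Scrapy settings.py. The setting names are standard Scrapy settings; the values are illustrative assumptions only, and nothing here is known to be enough to avoid Google's blocking.

```python
# settings.py (sketch): slow the crawl down; all values are illustrative guesses
DOWNLOAD_DELAY = 10                 # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait a random 0.5x to 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # only one request in flight per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 10       # initial AutoThrottle delay in seconds
AUTOTHROTTLE_MAX_DELAY = 120        # upper bound for the adaptive delay
```

Longer, randomized delays only make the traffic look less mechanical; whether that helps in this case would have to be tested.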
What I have so far:

1) Simple Google parser:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    if [url removed, login to view]('[url removed, login to view]'):
        for url in [url removed, login to view]('//div[@id="ires"]/ol/li//h3[@class="r"]/a/@href').extract():
            ... # Here parse Google links
        for url in [url removed, login to view]('//a[@id="pnnext"]/@href').extract():
            url = "https://" + [url removed, login to view]('/')[2] + url
            yield Request(url)

This simple parser, without any proxy, gets recognized as an automated scraper and blocked.
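The parser above builds the next-page URL by concatenating "https://", a host part, and the relative href. As a small sketch of the same step, assuming the redacted expression takes the host from the current page's URL, the standard library can resolve the link directly. The function name `next_page_url` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def next_page_url(page_url: str, href: str) -> str:
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(page_url, href)


# Example with made-up URLs: a relative "next" link found on a results page
print(next_page_url("https://www.google.com/search?q=scrapy",
                    "/search?q=scrapy&start=10"))
# prints https://www.google.com/search?q=scrapy&start=10
```

Inside a Scrapy callback, recent Scrapy versions offer `response.urljoin(href)` for the same purpose. This only tidies the URL handling and does not affect whether the scraper gets blocked.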

2) I installed Tor+Privoxy and added this middleware class:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        [url removed, login to view]['proxy'] = "http://localhost:8118"

It is enabled in the settings:

DOWNLOADER_MIDDLEWARES = {
    '[url removed, login to view]': 110,
    '[url removed, login to view]': 100,
}
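For reference, here is a self-contained sketch of what the middleware above appears to do. It assumes the redacted expression is `request.meta`, the dict in which Scrapy's built-in HttpProxyMiddleware looks for a `proxy` key. The `SimpleNamespace` object is only a stand-in for a Scrapy request so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace

PRIVOXY_URL = "http://localhost:8118"  # Privoxy address taken from the post


class ProxyMiddleware:
    """Downloader middleware sketch: route every request through Privoxy."""

    def process_request(self, request, spider):
        # Assumption: the redacted attribute is request.meta; the built-in
        # HttpProxyMiddleware reads the 'proxy' key from it.
        request.meta["proxy"] = PRIVOXY_URL
        return None  # None tells Scrapy to continue processing the request


# Stand-in for a Scrapy request, used only to exercise the method
fake_request = SimpleNamespace(meta={})
ProxyMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta)  # prints {'proxy': 'http://localhost:8118'}
```

With the priorities shown in the settings (100 for the custom class, 110 for the second entry), the custom middleware's `process_request` runs first, so the proxy key is already set when the later middleware sees the request. This sketch does not address the HTTPS problem described next.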

However, Scrapy does not seem to work with Tor+Privoxy on HTTPS pages. Over plain HTTP the Scrapy+Tor+Privoxy setup works, but Google now only works over HTTPS.

So what I actually need is a sample project with a detailed proxy configuration (Tor/Privoxy or an alternative) that shows how to avoid being blocked by Google for automated scraping.
