Web Crawl from Internet Archive

This project was successfully completed by debaphp for $155 USD in 3 days.

Project Budget
$30 - $250 USD
Completed In
3 days
Project Description

I'd like to gather some data for an academic project to study the electronic book market.

The Internet Archive (Wayback Machine) crawled websites that are of interest to me during the relevant period, and I'd like your help to:

(1) Crawl the Internet Archive and save the HTML pages of interest.

(2) Extract the relevant fields from the HTML into a comma-separated file ready for data analysis packages.

Task 1: Crawl

The webpages of interest are product pages for books or e-book reader devices in the following period, venue, and category:

Time period:

January 2010 - May 2010 (one capture a day, if available)

Venue:

Amazon, Barnes & Noble

Category:

Physical books and Kindle/Nook books (not textbooks, newspapers, etc.).

The devices themselves: Kindle and Nook.

Books listed as bestsellers, award winners, editor's picks, best books, book club selections, etc.

We can discuss whether it's easier to get all books or just the popular books.
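As a rough illustration of the crawl step, the sketch below builds a query against the Wayback Machine CDX server for one product page and turns the returned rows into snapshot URLs. It assumes the public CDX endpoint at web.archive.org, uses `collapse=timestamp:8` to keep at most one capture per day, and uses the `id_` URL modifier to request the archived page without the Wayback toolbar. The function names and the example page URL are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, start="20100101", end="20100531"):
    """Return a CDX query URL listing at most one capture per day."""
    params = {
        "url": page_url,
        "from": start,
        "to": end,
        "output": "json",
        "filter": "statuscode:200",
        # Collapse on the first 8 timestamp digits (YYYYMMDD): one row per day.
        "collapse": "timestamp:8",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_rows):
    """Turn CDX JSON rows (the first row is the header) into snapshot URLs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts = header.index("timestamp")
    orig = header.index("original")
    # "id_" asks the Wayback Machine for the page as originally captured.
    return [f"https://web.archive.org/web/{row[ts]}id_/{row[orig]}" for row in rows]
```

A crawler would fetch the query URL, parse the JSON body, pass it to `snapshot_urls`, and then download each snapshot with a polite delay between requests. The list of product-page URLs to query (for example, those linked from archived bestseller lists) still has to be assembled separately.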

Task 2: Extract

Fields of interest: title, author, publisher, number of reviews, rating, list price, discount price, price of other formats, whether listed as a bestseller, sales rank, ISBN, category.
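One possible shape for the output file is sketched below: a fixed column order covering the fields listed above, written with Python's `csv` module so that commas inside titles are quoted correctly. The column names are hypothetical, and the parsing of each saved HTML page into a record is not shown, since it depends on the page layout of the period.

```python
import csv
import io

# Hypothetical column names, one per field of interest.
FIELDS = [
    "title", "author", "publisher", "num_reviews", "rating",
    "list_price", "discount_price", "other_format_prices",
    "is_bestseller", "sales_rank", "isbn", "category",
]


def records_to_csv(records):
    """Serialize a list of dicts to CSV text with a header row.

    Missing fields are written as empty cells; unknown keys are ignored.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=FIELDS,
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Leaving missing values blank, rather than dropping the row, keeps one row per captured page, which makes gaps in the archive visible during analysis.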


(1) Small sample: I'd prefer to have a small sample by May 14th.

Amazon only: one day in mid-March, one day in mid-April, and one day in mid-May 2010.

(2) Negotiable, but preferably completed before June 5th.

(3) Possible future projects to extract 2005-2013 if initial run goes well.
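For the small sample in item (1), the archive may not hold a capture on the exact day wanted, so one option is to take the capture nearest to each target day. The sketch below assumes that "mid" means the 15th of the month and that capture timestamps use the 14-digit Wayback format (YYYYMMDDhhmmss); both the function name and the target dates are illustrative.

```python
from datetime import datetime

# Assumed reading of "mid-March, mid-April, mid-May 2010".
SAMPLE_DAYS = ["2010-03-15", "2010-04-15", "2010-05-15"]


def closest_capture(timestamps, target_day):
    """Return the 14-digit timestamp nearest to target_day (YYYY-MM-DD).

    Returns None when there are no captures to choose from.
    """
    if not timestamps:
        return None
    target = datetime.strptime(target_day, "%Y-%m-%d")

    def distance(ts):
        return abs(datetime.strptime(ts, "%Y%m%d%H%M%S") - target)

    return min(timestamps, key=distance)
```

Calling `closest_capture` once per entry in `SAMPLE_DAYS` gives up to three captures per product page for the sample.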
