Python script for CommonCrawl

Status: CANCELLED
Bids: 7
Avg Bid (EUR): 297
Project Budget (EUR): €30 - €250

Project Description:
Write a Python script that downloads web crawl data (ARC format) from the CommonCrawl.org project.

The Python script must accept at least three arguments: the AWS private (secret) key, the AWS public (access) key, and the file extension to look for in the links.

Example usage:

$ python commoncrawl.py secret public pdf
http://www.idg.com/sfsdf.pdf
.. and so on..
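
A minimal sketch of the argument handling, assuming the argument order from the example above (private key, public key, extension). The names `parse_args`, `aws_secret`, `aws_public` and `extension` are illustrative and not part of the project description.

```python
import argparse


def parse_args(argv=None):
    """Parse the three required positional arguments."""
    parser = argparse.ArgumentParser(
        description="Print CommonCrawl links that have a given file extension."
    )
    parser.add_argument("aws_secret", help="AWS private (secret) key")
    parser.add_argument("aws_public", help="AWS public (access) key")
    parser.add_argument("extension", help="file extension to look for, e.g. pdf")
    args = parser.parse_args(argv)
    # Normalise so that "pdf", ".pdf" and "PDF" are treated the same way.
    args.extension = args.extension.lower().lstrip(".")
    return args


if __name__ == "__main__":
    print(parse_args())
```

Passing `argv=None` makes argparse read `sys.argv`, so the same function serves both the command line and tests.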

The script outputs the links that have the given file extension. It must also keep state recording which ARC file it is currently processing.
The script must use the S3 Requester Pays option and parse crawl data from the 2012 dataset.
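
One possible way to keep state is a small JSON file that records the ARC file currently being processed. This is only a sketch: the file layout, the `current_arc` field and the function names are assumptions, not requirements from the description.

```python
import json
import os


def save_state(path, arc_key):
    """Record the ARC file currently being processed."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump({"current_arc": arc_key}, fh)
    # Replace the old state in one step so a crash does not leave a
    # half-written state file behind.
    os.replace(tmp_path, path)


def load_state(path):
    """Return the recorded ARC file, or None if no state exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh).get("current_arc")
    except FileNotFoundError:
        return None
```

On restart, the script could call `load_state` and skip ahead to the recorded ARC file instead of starting from the first one.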

Example file: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
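
A hedged sketch of a Requester Pays download for a file like the one above. It assumes the third-party `boto3` library, which the description does not name. The helper names `split_s3_url` and `download_arc` are illustrative.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URL: %r" % url)
    return parsed.netloc, parsed.path.lstrip("/")


def download_arc(url, aws_public, aws_secret, dest_path):
    """Download one ARC file, with the requester paying for the transfer."""
    # Assumption: boto3 is installed. It is imported here so that
    # split_s3_url stays usable without it.
    import boto3

    bucket, key = split_s3_url(url)
    s3 = boto3.client(
        "s3",
        aws_access_key_id=aws_public,
        aws_secret_access_key=aws_secret,
    )
    # RequestPayer="requester" acknowledges that the caller is billed.
    s3.download_file(bucket, key, dest_path, ExtraArgs={"RequestPayer": "requester"})
    return dest_path
```

With Requester Pays, the transfer is billed to the AWS account whose keys are passed on the command line.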


About the ARC-file format: http://archive.org/web/researcher/ArcFileFormat.php

Example flow:

1. Download segment list
2. Download first ARC-file in segment and uncompress
3. Parse ARC-file to find links with the user-selected extension
4. Print link/url
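
Steps 2 to 4 could be sketched roughly as follows. This assumes ARC version 1 records (a space-separated header line whose last field is the body length in bytes, followed by the body) and treats any absolute http(s) URL found in a record as a "link". Both are interpretations, and all function names are illustrative.

```python
import gzip
import re
from urllib.parse import urlparse

# Assumption: a "link" is any absolute http(s) URL appearing in a record body.
URL_RE = re.compile(rb"https?://[^\s\"'<>]+")


def iter_arc_records(stream):
    """Yield (url, body) for each record in an uncompressed ARC v1 stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # blank line separating two records
        fields = header.decode("latin-1").split()
        length = int(fields[-1])  # last header field is the body length
        yield fields[0], stream.read(length)


def has_extension(url, extension):
    """True if the path part of the URL ends with '.<extension>'."""
    try:
        path = urlparse(url).path
    except ValueError:
        return False  # malformed URL picked up by the regex
    return path.lower().endswith("." + extension.lower())


def find_links(stream, extension):
    """Yield record URLs and in-body URLs that have the given extension."""
    for url, body in iter_arc_records(stream):
        candidates = [url] + [m.decode("latin-1") for m in URL_RE.findall(body)]
        for link in candidates:
            if has_extension(link, extension):
                yield link


def print_links(arc_gz_path, extension):
    """Uncompress an .arc.gz file on the fly and print matching links."""
    with gzip.open(arc_gz_path, "rb") as stream:
        for link in find_links(stream, extension):
            print(link)
```

Checking the extension against the URL path rather than the whole string keeps a query string such as `?file=x.pdf` from producing false matches.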

Skills required:
Amazon Web Services, Data Mining, Data Processing, Python, Web Scraping
Project posted by:
jonas02 Sweden
Verified