Project Description:
Write a Python script that downloads web crawl data (ARC format) from the Common Crawl project (CommonCrawl.org).
The script must accept at least three arguments: the AWS private (secret) key, the AWS public (access) key, and the file extension of the links to extract.
Example usage:
$ python commoncrawl.py secret public pdf
http://www.idg.com/sfsdf.pdf
.. and so on..
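
A minimal sketch of how the three arguments from the example usage could be read with the standard argparse module. The argument names (aws_secret, aws_public, extension) and the normalisation of the extension are assumptions for illustration, not part of the brief:

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional arguments shown in the example usage."""
    parser = argparse.ArgumentParser(
        prog="commoncrawl.py",
        description="Print links with a given file extension found in Common Crawl ARC files.",
    )
    # Order follows the example: private (secret) key first, then public key.
    parser.add_argument("aws_secret", help="AWS private (secret) key")
    parser.add_argument("aws_public", help="AWS public (access) key")
    parser.add_argument("extension", help="file extension to look for, e.g. pdf")
    args = parser.parse_args(argv)
    # Assumption: accept "pdf", ".pdf" and "PDF" as the same extension.
    args.extension = args.extension.lower().lstrip(".")
    return args


if __name__ == "__main__":
    options = parse_args()
    print("Looking for links ending in ." + options.extension)
```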
The script outputs the links that contain the given file extension. It must also keep track of which ARC file it is currently processing.
The script must use the S3 requester-pays option and parse crawl data from the 2012 dataset.
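
One possible way to keep that state is a small JSON file that is rewritten whenever the script moves on to the next ARC file. This is only a sketch: the file name commoncrawl_state.json, the layout of the state dictionary, and the helper names are all hypothetical.

```python
import json
import os

STATE_FILE = "commoncrawl_state.json"  # hypothetical file name


def load_state(path=STATE_FILE):
    """Return the saved state, or a fresh one if no state file exists yet."""
    if not os.path.exists(path):
        return {"current": None, "done": []}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def save_state(state, path=STATE_FILE):
    """Write the state to a temporary file first, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def pending_files(all_files, state):
    """Return the ARC files that have not been fully processed, in order."""
    done = set(state["done"])
    return [name for name in all_files if name not in done]
```

A run would set state["current"] before it starts an ARC file and append the name to state["done"] when it finishes, calling save_state each time, so an interrupted run can resume with the file it was working on.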
Example file: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
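
A hedged sketch of a requester-pays download for a file such as the one above. It assumes the third-party boto3 library, which the brief does not name; the helper names split_s3_url and download_arc are hypothetical. Whether this bucket and key are still available is not verified here.

```python
import shutil
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URL: %r" % url)
    return parsed.netloc, parsed.path.lstrip("/")


def download_arc(url, access_key, secret_key, dest_path):
    """Download one object to dest_path, with the requester paying for the transfer."""
    import boto3  # assumption: boto3 is installed; imported here so the helper above works without it

    bucket, key = split_s3_url(url)
    client = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    # RequestPayer="requester" acknowledges that the caller is billed for the request.
    response = client.get_object(Bucket=bucket, Key=key, RequestPayer="requester")
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(response["Body"], out)
```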
About the ARC-file format: http://archive.org/web/researcher/ArcFileFormat.php
Example flow:
1. Download the segment list
2. Download the first ARC file in the segment and uncompress it
3. Parse the ARC file to find links with the user-selected extension
4. Print each link/URL
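
Steps 2 to 4 of the flow could be sketched roughly as follows, using only the standard library. The sketch assumes version 1 ARC records, where each record starts with a space-separated header line whose first field is the URL and whose last field is the body length in bytes. The regular expression is a simple stand-in for a real HTML parser, and the function names are hypothetical.

```python
import gzip
import re
from urllib.parse import urljoin, urlparse

# Assumption: links of interest appear as quoted href/src attribute values.
LINK_RE = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def iter_arc_records(stream):
    """Yield (url, body_bytes) for each record in an uncompressed ARC stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # blank separator line between records
        fields = header.split(b" ")
        url = fields[0].decode("utf-8", errors="replace")
        length = int(fields[-1])  # last header field is the body length
        yield url, stream.read(length)


def find_links(page_url, body, extension):
    """Yield absolute links in body whose path ends with the extension."""
    suffix = "." + extension.lower().lstrip(".")
    text = body.decode("utf-8", errors="replace")
    for match in LINK_RE.finditer(text):
        link = urljoin(page_url, match.group(1).strip())
        if urlparse(link).path.lower().endswith(suffix):
            yield link


def extract_links(compressed_stream, extension):
    """Uncompress a .arc.gz stream and yield every matching link."""
    with gzip.GzipFile(fileobj=compressed_stream) as arc:
        for url, body in iter_arc_records(arc):
            for link in find_links(url, body, extension):
                yield link
```

With a downloaded file, usage might look like: open the file in binary mode, pass it to extract_links together with the extension, and print each link it yields.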