Write a Python script that downloads web crawl data (in the ARC format) from the CommonCrawl.org project.
The script must take at least three arguments: the AWS secret key, the AWS access key, and the file extension to extract from the links.
$ python [url removed, login to view] secret public pdf
[url removed, login to view]
... and so on.
The script's output will be the links that match the given file extension. The script must also keep state about which ARC file it is currently processing.
The script must use the S3 Requester Pays option and parse the crawl data from the 2012 dataset.
Example file: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/[url removed, login to view]
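The Requester Pays download could be sketched with boto3 (an assumption; the script could equally use another S3 client). `build_request` and `download_arc` are hypothetical helper names:

```python
def build_request(key):
    """S3 GET parameters for a Requester Pays bucket; the caller's
    AWS account is billed for the transfer."""
    return {
        "Bucket": "aws-publicdatasets",  # bucket from the example path above
        "Key": key,
        "RequestPayer": "requester",
    }

def download_arc(key, access_key, secret_key, dest):
    """Fetch one ARC file from S3 with the user-supplied credentials."""
    import boto3  # imported lazily so build_request is usable without boto3
    s3 = boto3.client("s3",
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key)
    with open(dest, "wb") as f:
        f.write(s3.get_object(**build_request(key))["Body"].read())
```

Without `RequestPayer="requester"` the GET request against a Requester Pays bucket is rejected with a 403 error.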
About the ARC-file format: [url removed, login to view]
1. Download the segment list
2. Download the first ARC file in the segment and uncompress it
3. Parse the ARC file to find links with the user-selected extension
4. Print each link/URL
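Step 3 above could be implemented with a simple regular expression over each record's payload. `extract_links` is a hypothetical helper for this sketch; a production script might prefer a real HTML parser:

```python
import re

def extract_links(html, extension):
    """Return href/src targets in `html` that end in the given file
    extension (optionally followed by a query string or fragment)."""
    pattern = re.compile(
        r'(?:href|src)=["\']([^"\']+\.%s)(?:[?#][^"\']*)?["\']'
        % re.escape(extension),
        re.IGNORECASE)
    return pattern.findall(html)
```

The script would run this over every record payload and print the matches, e.g. `extract_links(payload.decode("utf-8", "replace"), sys.argv[3])`.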