Python script for CommonCrawl
This project received 7 bids from freelancers, with an average bid price of €297 EUR.
Write a Python script that downloads web crawl data (ARC format) from the Common Crawl project (CommonCrawl.org).
The script must accept at least three arguments: the AWS private (secret) key, the AWS public (access) key, and the file extension to look for in the extracted links.
$ python [url removed, login to view] secret public pdf
[url removed, login to view]
.. and so on..
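The command-line interface described above could be sketched as follows. This is only an illustration: the argument names `aws_secret`, `aws_public` and `extension` are assumptions, and the order (secret, public, extension) is taken from the example invocation.

```python
import argparse


def build_parser():
    """Build a parser for the three required positional arguments."""
    parser = argparse.ArgumentParser(
        description="Print links with a given file extension from CommonCrawl ARC files."
    )
    # Names below are hypothetical; the brief only fixes the order.
    parser.add_argument("aws_secret", help="AWS private (secret) key")
    parser.add_argument("aws_public", help="AWS public (access) key")
    parser.add_argument("extension", help="file extension to look for, e.g. pdf")
    return parser


def parse_args(argv=None):
    """Parse argv and normalise the extension ('.PDF' becomes 'pdf')."""
    args = build_parser().parse_args(argv)
    args.extension = args.extension.lower().lstrip(".")
    return args
```

Normalising the extension up front means the later link filter only has to compare against one lowercase form.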
The script outputs the links that contain the file extension. It must also keep track of which ARC file it is currently processing.
The script must use the requester-pays S3 option and parse crawler data from the 2012 dataset.
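One possible way to keep that state is a small JSON file that records the ARC file currently being processed. The file layout and the names `load_state`, `save_state` and `current_arc` are assumptions for illustration; the brief does not prescribe a storage format.

```python
import json
import os


def load_state(path):
    """Return the ARC file recorded in the state file, or None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh).get("current_arc")


def save_state(path, arc_key):
    """Record the ARC file being processed.

    The data is written to a temporary file first and then renamed,
    so an interrupted run does not leave a half-written state file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump({"current_arc": arc_key}, fh)
    os.replace(tmp_path, path)
```

On restart, the script could call `load_state` and skip ahead to the recorded ARC file instead of starting from the first one.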
Example file: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/[url removed, login to view]
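A requester-pays download could look roughly like the sketch below. It assumes the third-party `boto3` library, which the brief does not name; the helper names are hypothetical, and the bucket and key are passed in by the caller because the example path above is truncated.

```python
def make_client(aws_secret, aws_public):
    """Create an S3 client from the two keys given on the command line."""
    import boto3  # third-party dependency, an assumption of this sketch

    return boto3.client(
        "s3",
        aws_access_key_id=aws_public,
        aws_secret_access_key=aws_secret,
    )


def download_requester_pays(client, bucket, key, dest_path):
    """Download one object, declaring that the requester pays for the transfer."""
    client.download_file(
        bucket,
        key,
        dest_path,
        ExtraArgs={"RequestPayer": "requester"},
    )
    return dest_path
```

A call might then be `download_requester_pays(client, "aws-publicdatasets", key, "current.arc.gz")`, where `key` is the full object key of an ARC file taken from the segment list.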
About the ARC-file format: [url removed, login to view]
1. Download the segment list
2. Download the first ARC file in the segment and uncompress it
3. Parse the ARC file to find links with the user-selected extension
4. Print each link/URL
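Steps 2 to 4 can be sketched as follows. This is a simplified illustration, not a complete ARC reader: it assumes each record starts with a space-separated header line whose first field is the URL and whose last field is the body length in bytes, and it finds links with a plain `href` regular expression. All function names are hypothetical.

```python
import gzip
import re
from urllib.parse import urlparse

# Matches href="..." or href='...' in raw page bytes (a deliberately simple pattern).
HREF_RE = re.compile(rb'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def open_arc(path):
    """Open a gzip-compressed ARC file as an uncompressed binary stream."""
    return gzip.open(path, "rb")


def iter_arc_records(stream):
    """Yield (url, body) for each record in an uncompressed ARC stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # blank line separating two records
        fields = header.split(b" ")
        length = int(fields[-1])  # last header field: body length in bytes
        yield fields[0].decode("latin-1"), stream.read(length)


def links_with_extension(body, extension):
    """Return links in body whose URL path ends with '.<extension>'."""
    suffix = "." + extension.lower()
    links = []
    for match in HREF_RE.finditer(body):
        link = match.group(1).decode("latin-1")
        if urlparse(link).path.lower().endswith(suffix):
            links.append(link)
    return links


def print_links(stream, extension):
    """Print every matching link found in the records of the stream."""
    for _url, body in iter_arc_records(stream):
        for link in links_with_extension(body, extension):
            print(link)
```

Comparing against the URL path rather than the whole URL keeps links such as `report.pdf?download=1` while ignoring a `pdf` that only appears in a query string.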