Closed

Python script for CommonCrawl

This project received 7 bids from talented freelancers with an average bid price of €297 EUR.

Employer working
Project Budget: N/A
Total Bids: 7
Project Description

Write a Python script that downloads web crawl data (ARC format) from the CommonCrawl.org project.

The Python script must accept at least three arguments: the AWS private (secret) key, the AWS public (access) key, and the file extension to look for in the links.

Example usage:

$ python [url removed, login to view] secret public pdf
[url removed, login to view]
.. and so on..
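
The command-line interface described above could be sketched as follows. This is only an illustration: the function name `parse_args` and the argument names `aws_secret`, `aws_public` and `extension` are assumptions, and the argument order is taken from the example usage (secret first, then public, then the extension).

```python
import argparse


def parse_args(argv=None):
    """Parse the three required positional arguments.

    The names below are hypothetical; the brief only fixes their order.
    """
    parser = argparse.ArgumentParser(
        description="Print links with a given file extension from CommonCrawl ARC files."
    )
    parser.add_argument("aws_secret", help="AWS private (secret) key")
    parser.add_argument("aws_public", help="AWS public (access) key")
    parser.add_argument("extension", help="file extension to look for, e.g. pdf")
    args = parser.parse_args(argv)
    # Normalise so that "pdf", ".pdf" and "PDF" are treated the same way.
    args.extension = args.extension.lower().lstrip(".")
    return args
```

Calling `parse_args(["secret", "public", "pdf"])` mirrors the example invocation; passing `None` makes argparse read the real command line.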

The script outputs the links that have the given file extension. It must also keep track of which ARC file it is currently processing.
The script must use the requester-pays S3 option and parse crawler data from the 2012 dataset.
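
One possible way to meet the state-keeping requirement is to record the key of the ARC file being processed in a small JSON file, so an interrupted run can resume. The brief does not say how state should be stored, so the file layout and the helper names `load_state`, `save_state` and `pending_files` are assumptions.

```python
import json
import os


def load_state(path):
    """Return the saved state, or an empty state if none exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"current_arc": None}


def save_state(path, arc_key):
    """Record the ARC file being processed, replacing the old state atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"current_arc": arc_key}, handle)
    os.replace(tmp_path, path)


def pending_files(arc_keys, state):
    """Return the ARC files still to process, starting at the saved one."""
    current = state.get("current_arc")
    if current in arc_keys:
        return arc_keys[arc_keys.index(current):]
    return list(arc_keys)
```

Writing to a temporary file and renaming it means a crash during the write cannot leave a half-written state file behind.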

Example file: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/[url removed, login to view]
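
A requester-pays download of such a file might look like the sketch below. It assumes the third-party `boto3` library, which the brief does not name, and the helper names `split_s3_url` and `download_requester_pays` are hypothetical. The file name is elided in the example above, so the caller has to supply the full `s3://` URL.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URL: %r" % url)
    return parsed.netloc, parsed.path.lstrip("/")


def download_requester_pays(url, dest_path, aws_public, aws_secret):
    """Download one object and accept the transfer cost as the requester."""
    import boto3  # assumption: boto3 is installed; imported lazily on purpose

    bucket, key = split_s3_url(url)
    s3 = boto3.client(
        "s3",
        aws_access_key_id=aws_public,
        aws_secret_access_key=aws_secret,
    )
    # RequestPayer="requester" is what marks the request as requester-pays.
    s3.download_file(bucket, key, dest_path, ExtraArgs={"RequestPayer": "requester"})
```

Without the `RequestPayer` argument, S3 rejects requests to a requester-pays bucket with an access-denied error.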


About the ARC-file format: [url removed, login to view]

Example flow:

1. Download the segment list
2. Download the first ARC file in the segment and decompress it
3. Parse the ARC file to find links with the user-selected extension
4. Print each matching link/URL
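
Steps 2 to 4 of the flow could be sketched roughly as below. The sketch assumes ARC version 1 records, where each record starts with a header line whose first field is the URL and whose last field is the body length in bytes. It also assumes that "links" means `href` values found in the page bodies, and finds them with a simple regular expression rather than a full HTML parser. All function names are hypothetical.

```python
import gzip
import re
from urllib.parse import urlparse

# Rough href matcher; a real HTML parser would be more robust.
HREF_RE = re.compile(rb'href\s*=\s*["\']([^"\'#\s]+)', re.IGNORECASE)


def iter_arc_records(stream):
    """Yield (url, body) for each record in a decompressed ARC v1 stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # blank separator line between records
        fields = header.split()
        length = int(fields[-1])  # last header field: body length in bytes
        yield fields[0].decode("latin-1"), stream.read(length)


def links_with_extension(body, extension):
    """Yield href values in body whose URL path ends with the extension."""
    suffix = "." + extension.lower().lstrip(".")
    for match in HREF_RE.finditer(body):
        link = match.group(1).decode("latin-1")
        if urlparse(link).path.lower().endswith(suffix):
            yield link


def scan_arc_file(path, extension):
    """Decompress a downloaded .arc.gz file and print every matching link."""
    with gzip.open(path, "rb") as stream:
        for _url, body in iter_arc_records(stream):
            for link in links_with_extension(body, extension):
                print(link)
```

Relative links are printed as found. Resolving them against the record URL with `urllib.parse.urljoin` would be a natural extension if absolute URLs are wanted.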
