Closed

Scrape text from pdf (to csv)

This project received 23 bids from talented freelancers with an average bid price of $147 USD.

Project Budget
$30 - $250 USD
Total Bids
23
Project Description

Data need to be extracted from a 'searchable' PDF (one that contains a text layer).

Some things to know -
There are libraries in Python and other languages that make it very simple to extract text.
Once the text is extracted, 'keywords' can be used as triggers to harvest the data. Town names, for instance, are always followed by a hyphen. Similarly, the colon ":" and phrases like 'Basic Service' can be used to locate other data. See the sample PDF.
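
As a rough illustration of the hyphen trigger, here is a minimal Python sketch. It assumes the text has already been pulled out of the PDF with a library such as pypdf or pdfminer.six, and that each town entry begins a line with the town name in capitals followed by a hyphen. The sample text, the capitalization rule, and the name `split_towns` are assumptions made for illustration; the real layout of the sample PDF may differ.

```python
import re

# Hypothetical extracted text; not taken from the actual sample PDF.
SAMPLE_TEXT = """\
ABBEVILLE - Example Cable Co., 1 Main St.
Basic Service
Subscribers: 120.
ADDISON - Another Cable Co., 2 Oak Ave.
Basic Service
Subscribers: 300.
"""

# Assumption: a town name starts a line, is written in capitals, and is
# followed by a hyphen (or an en/em dash, which PDF extraction can produce).
TOWN_START = re.compile(r"^([A-Z][A-Z .']+?)\s*[-\u2013\u2014]\s*", re.MULTILINE)


def split_towns(text):
    """Return a list of (town, block) pairs; block is the text after the hyphen."""
    matches = list(TOWN_START.finditer(text))
    records = []
    for i, m in enumerate(matches):
        # A town's block runs until the next town heading (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        records.append((m.group(1).strip(), text[m.end():end].strip()))
    return records


for town, block in split_towns(SAMPLE_TEXT):
    print(town, "|", block.splitlines()[0])
```

Requiring capitals keeps a phrase like 'Pay-Per-View' from being mistaken for a town heading. If town names in the PDF are in mixed case, the pattern would need a different anchor.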

Details about the data -

The data are about cable systems in various US towns. The information for each town starts with the name of the cable company, its address, and any other details, and then goes on to describe the various packages that the cable company offers.

We are interested in getting information about the channels, and a few other characteristics, of the various cable packages, for instance 'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Internet Service'.

Not all towns will have all these packages. For instance, Abbeville just has 'Basic Service' while Addison has 'Basic Service', 'Expanded Basic Service' and 'Pay Service 1'.
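
One possible way to cut the text of a single town into its packages is to search for the service names themselves. The sketch below assumes the names appear in the text exactly as written in this description, with the same capitalization; `split_services` and the sample block are hypothetical.

```python
import re

# Service names from the project description (Pay Service 1 through 10).
SERVICES = (
    ["Basic Service", "Expanded Basic Service", "Pay-Per-View", "Internet Service"]
    + [f"Pay Service {n}" for n in range(1, 11)]
)

# Longest names first, so that "Pay Service 10" is not cut short as "Pay Service 1".
SERVICE_RE = re.compile(
    "|".join(re.escape(s) for s in sorted(SERVICES, key=len, reverse=True))
)


def split_services(block):
    """Return (company_info, {service_name: service_text}) for one town's text."""
    matches = list(SERVICE_RE.finditer(block))
    if not matches:
        return block.strip(), {}
    # Everything before the first service heading is treated as company info.
    company = block[:matches[0].start()].strip()
    services = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(block)
        services[m.group(0)] = block[m.end():end].strip()
    return company, services


# Hypothetical town text, for illustration only.
block = (
    "Example Cable Co., 1 Main St.\n"
    "Basic Service\nSubscribers: 300.\n"
    "Expanded Basic Service\nSubscribers: 250.\n"
    "Pay Service 10\nFee: $9.95."
)
company, services = split_services(block)
print(company)
print(list(services))
```

A service that is absent from a town simply does not appear in the returned dictionary. If a service name is repeated inside running text, this sketch would treat it as a new heading and overwrite the earlier entry, so the real data should be checked for that case.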

Each of these 'services' has further attributes (again, not all attributes will be present every time) - subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician
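
The colon trigger could be applied to the text of one service roughly as follows. This is only a sketch: it assumes every attribute is written as "Label: value", that the label matches one of the names listed above (ignoring case), and that a value runs until the next known label. The function name `parse_attributes` and the sample text are made up for illustration.

```python
import re

ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking", "channel capacity",
    "equipment", "addressable homes", "program guide", "chief technician",
]

# Longest labels first; \b keeps "fee" from matching inside a longer word.
_labels = sorted(ATTRIBUTES, key=len, reverse=True)
ATTR_RE = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in _labels) + r")\s*:\s*",
    re.IGNORECASE,
)


def parse_attributes(service_text):
    """Return {attribute: value} for the labels found in one service's text."""
    matches = list(ATTR_RE.finditer(service_text))
    values = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(service_text)
        # Collapse line breaks and repeated spaces inside the value.
        value = " ".join(service_text[m.end():end].split())
        # Keep the first occurrence if a label appears more than once.
        values.setdefault(m.group(1).lower(), value)
    return values


# Hypothetical service text, for illustration only.
text = (
    "Subscribers: 1,200. Programming (received off-air): WAAA (1) City; "
    "WBBB (2) City.\nFee: $12.50 monthly. State manager: Jane Doe."
)
values = parse_attributes(text)
print(values)
```

Attributes that are not present are simply missing from the result. A label split across two lines in the extracted text would not be recognized by this pattern.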

Output file:

We want the data in a CSV file. Each row will represent one town. The first column will hold the information about the cable company. The following columns will hold the data for each service.

More on that -

For each service:
'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Pay Service 9', 'Pay Service 10', 'Internet Service'

Create columns corresponding to each of the attributes:
subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician

So final column names would be something like -
basic [url removed, login to view], basic [url removed, login to view] units, basic [url removed, login to view] ....[url removed, login to view],....

Each column will carry its corresponding information. If a service is missing for a town, leave all of its attribute columns blank. If an attribute is missing within a service, leave that column blank.
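
The output layout could be produced with Python's standard csv module along these lines. The column naming scheme ("service.attribute" in lowercase), the extra "town" column, and the helper names are assumptions for illustration, not part of the specification above.

```python
import csv
import io

SERVICES = (
    ["Basic Service", "Expanded Basic Service", "Pay Service 1", "Pay-Per-View"]
    + [f"Pay Service {n}" for n in range(2, 11)]
    + ["Internet Service"]
)
ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking", "channel capacity",
    "equipment", "addressable homes", "program guide", "chief technician",
]


def column_names():
    """Company info first, then one column per (service, attribute) pair."""
    # Assumption: "town" is added as a second column to identify the row.
    return ["company", "town"] + [
        f"{s.lower()}.{a}" for s in SERVICES for a in ATTRIBUTES
    ]


def town_row(town, company, services):
    """Build one CSV row; services maps service name -> {attribute: value}."""
    row = {"company": company, "town": town}
    for s in SERVICES:
        attrs = services.get(s, {})  # missing service -> every column blank
        for a in ATTRIBUTES:
            row[f"{s.lower()}.{a}"] = attrs.get(a, "")  # missing attribute -> blank
    return row


def write_csv(rows, fileobj):
    writer = csv.DictWriter(fileobj, fieldnames=column_names())
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical row: a town that only has 'Basic Service' with one attribute.
buf = io.StringIO()
write_csv(
    [town_row("Abbeville", "Example Cable Co.",
              {"Basic Service": {"subscribers": "120"}})],
    buf,
)
print(buf.getvalue().splitlines()[1][:40])
```

When writing to a real file rather than a `StringIO`, the csv documentation recommends opening it with `newline=""`.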

A sample of the PDF is attached.
