Scrape text from pdf (to csv)

Cancelled

Data need to be extracted from a 'searchable' pdf.

Some things to know -

Libraries in Python and other languages make it very simple to extract text from a searchable pdf.

Once the text is extracted, one can use some 'keywords' as triggers to harvest the data. Town names, for instance, are always followed by a hyphen. Similarly, a colon ":" and phrases like 'basic service' can be used to locate other data. See the sample pdf.
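The hyphen trigger described above could be sketched roughly as follows. This is only an illustration: it assumes the text has already been extracted from the pdf into a string, and the sample text, the upper-case town names, and the helper name `split_towns` are all invented here, not taken from the sample pdf.

```python
import re

# Hypothetical extracted text; the layout of the real pdf may differ.
SAMPLE_TEXT = """\
ABBEVILLE - Example Cable Co., 1 Main St.
Basic Service
Subscribers: 120. Fee: $20.00 monthly.
ADDISON - Other Cable Inc., 2 Oak Ave.
Basic Service
Subscribers: 300.
Expanded Basic Service
Subscribers: 250.
"""

# Assumption: a town record starts on a line whose upper-case name
# is followed by a hyphen, with the company details after it.
TOWN_LINE = re.compile(r"^(?P<town>[A-Z][A-Z .'&]*?)\s*-\s*(?P<rest>.*)$")


def split_towns(text):
    """Group the lines of the extracted text into one record per town."""
    records = []
    for line in text.splitlines():
        match = TOWN_LINE.match(line)
        if match:
            # A new town begins; keep the remainder of the line (company info).
            records.append({"town": match["town"].strip(), "lines": [match["rest"]]})
        elif records:
            # Any other line belongs to the most recent town.
            records[-1]["lines"].append(line)
    return records


records = split_towns(SAMPLE_TEXT)
```

Service names such as 'Pay-Per-View' are not mistaken for towns here only because the pattern requires an all-upper-case name; if the real pdf uses mixed-case town names, the pattern would need a different rule.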

Details about the data -

The data are about cable systems in various US towns. The information for each town starts with the name of the cable company, its address and any other general information, and then goes on to describe the various packages that the cable company offers.

We are interested in getting information on the channels, and a few other characteristics, of the various cable packages, for instance 'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Internet Service'.

Not all towns will have all these packages. For instance, Abbeville just has 'Basic Service' while Addison has 'Basic Service', 'Expanded Basic Service' and 'Pay Service 1'.

Each of these 'services' has further attributes (again, not all attributes will be present all the time) - subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician
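One possible way to pick up the services and their attributes from a town's lines is sketched below. It rests on two assumptions that the sample pdf may not satisfy: each service name sits on a line of its own, and each attribute appears as "Label: value" with the value running up to the next known label. The sample lines and the function names `parse_attributes` and `parse_services` are hypothetical.

```python
import re

SERVICES = ["Basic Service", "Expanded Basic Service", "Pay-Per-View",
            "Internet Service"] + [f"Pay Service {n}" for n in range(1, 11)]

ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking", "channel capacity",
    "equipment", "addressable homes", "program guide", "chief technician",
]

# Longest labels first, so "state manager" is tried before "manager".
ATTR_RE = re.compile(
    r"\b(?P<label>"
    + "|".join(re.escape(a) for a in sorted(ATTRIBUTES, key=len, reverse=True))
    + r")\s*:",
    re.IGNORECASE,
)


def parse_attributes(block):
    """Map each attribute label found in the block to the text that follows it."""
    matches = list(ATTR_RE.finditer(block))
    values = {}
    for i, m in enumerate(matches):
        # The value ends where the next label starts (or at the end of the block).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(block)
        values[m["label"].lower()] = block[m.end():end].strip(" .;\n")
    return values


def parse_services(lines):
    """Split a town's lines into services, then parse each service's attributes."""
    by_name = {s.lower(): s for s in SERVICES}
    blocks, current = {}, None
    for line in lines:
        key = line.strip().lower()
        if key in by_name:
            current = by_name[key]      # a service heading starts a new block
            blocks[current] = []
        elif current is not None:
            blocks[current].append(line)
    return {name: parse_attributes(" ".join(text)) for name, text in blocks.items()}


# Hypothetical lines for one town; the first line is company information.
ADDISON_LINES = [
    "Other Cable Inc., 2 Oak Ave.",
    "Basic Service",
    "Subscribers: 300. Programming (via satellite): CNN; ESPN.",
    "Fee: $20.00 monthly.",
    "Expanded Basic Service",
    "Subscribers: 250.",
]
services = parse_services(ADDISON_LINES)
```

Services and attributes that do not occur simply do not appear in the result, which matches the note that not every town has every package or attribute.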

Output file:

We want the data in a csv. Each row will represent one town. The first column will hold the information about the cable company. The columns after that will hold the data for each service.

More on that -

For each service:

'Basic Service', 'Expanded Basic Service', 'Pay Service 1', 'Pay-Per-View', 'Pay Service 2', 'Pay Service 3', 'Pay Service 4', 'Pay Service 5', 'Pay Service 6', 'Pay Service 7', 'Pay Service 8', 'Pay Service 9', 'Pay Service 10', 'Internet Service'

Create columns corresponding to each of the attributes:

subscribers, pay units, programming (received off-air), programming (via satellite), miles of plant, state manager, manager, ownership, fee, current originations, local advertising, city fee, tv market ranking, channel capacity, equipment, addressable homes, program guide, chief technician

So final column names would be something like -

basic [url removed, login to view], basic [url removed, login to view] units, basic [url removed, login to view] ....[url removed, login to view],....

Each column will carry its corresponding information. If a service is missing for a town, leave all of its attribute columns blank. If an attribute is missing within a service, leave that column blank.
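A minimal sketch of the output step, using Python's standard csv module, might look like the following. The "service.attribute" column naming, the extra "town" column, and the helper names are assumptions made for illustration; the exact column names wanted are not recoverable from the text above. Missing services and attributes come out blank because `csv.DictWriter` fills absent keys with `restval`.

```python
import csv
import io

SERVICES = ["Basic Service", "Expanded Basic Service", "Pay-Per-View",
            "Internet Service"] + [f"Pay Service {n}" for n in range(1, 11)]

ATTRIBUTES = [
    "subscribers", "pay units", "programming (received off-air)",
    "programming (via satellite)", "miles of plant", "state manager",
    "manager", "ownership", "fee", "current originations",
    "local advertising", "city fee", "tv market ranking", "channel capacity",
    "equipment", "addressable homes", "program guide", "chief technician",
]


def column_names():
    """Company column first, then one column per (service, attribute) pair."""
    # Assumption: a "town" column is added so each row can be identified.
    return ["company", "town"] + [
        f"{s.lower()}.{a}" for s in SERVICES for a in ATTRIBUTES
    ]


def town_to_row(company, town, services):
    """Flatten {service: {attribute: value}} into a single csv row dict."""
    row = {"company": company, "town": town}
    for service, attrs in services.items():
        for attr, value in attrs.items():
            row[f"{service.lower()}.{attr}"] = value
    return row


def write_csv(rows, handle):
    """Write the rows; any column a row lacks is left blank via restval."""
    writer = csv.DictWriter(handle, fieldnames=column_names(), restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical parsed data for two towns.
rows = [
    town_to_row("Example Cable Co.", "Abbeville",
                {"Basic Service": {"subscribers": "120"}}),
    town_to_row("Other Cable Inc.", "Addison",
                {"Basic Service": {"subscribers": "300"},
                 "Expanded Basic Service": {"subscribers": "250", "fee": "$10.00"}}),
]
buffer = io.StringIO()
write_csv(rows, buffer)
```

To write a real file instead of an in-memory buffer, the handle would be opened with `open(path, "w", newline="")`, as the csv module documentation recommends.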

A sample of the pdf is attached alongside.

Skills: PDF, PHP, Web Scraping

Project ID: #4533554

22 freelancers are bidding on average $146 for this job

SigmaVisual

I can help in your project, please check PMB and our ratings/reviews to get idea of our experience. Please let me know if you have any queries.

$231 USD in 5 days
(242 Reviews)
7.8
zeke

Available to start immediately and finish as soon as possible.

$206 USD in 2 days
(150 Reviews)
6.8
tzo

Can help you on this. Have some prior experience with pdf parsing.

$158 USD in 3 days
(190 Reviews)
6.2
samitXI

Please check your inbox...Thanks

$185 USD in 3 days
(46 Reviews)
5.9
AlGordo

Experienced with scraping of data.

$100 USD in 3 days
(21 Reviews)
5.6
pablotorres

i can do it

$155 USD in 30 days
(56 Reviews)
5.3
esafeguard

Hi. I'm a PHP programmer with experience in text parsing projects. Please provide a sample of the searchable file. Regards.

$150 USD in 3 days
(6 Reviews)
4.1
ideadezigner

Hello Sir, Please check your private mail box

$206 USD in 1 day
(9 Reviews)
3.7
thetidevw

Hi, I can do this for you, but using PHP along with XPDF/pdftotext.

$157 USD in 3 days
(12 Reviews)
3.6
iautomation

You're right, Python would be great for this. I would handle the data by first straightening the pages vertically, then splitting the columns...then feeding the columns through OCR like one long string of data. Certain...

$155 USD in 3 days
(18 Reviews)
3.6
jhliuster

PM for more details, ready to start,thanks.

$126 USD in 1 day
(4 Reviews)
3.0
hemi

I did in depth analysis of this project. Please see more details on private message. Thanks

$333 USD in 10 days
(4 Reviews)
2.7
samic

Ready to work on it.

$144 USD in 3 days
(5 Reviews)
2.5
suriyant

I have strong experience in Python plus data extract from PDF. I can do it.

$126 USD in 3 days
(2 Reviews)
2.6
HoneyITSolution

Cool! Let's start this job and get it done. Thank you

$111 USD in 5 days
(2 Reviews)
2.2
huuban

I have just completed a project involving a .csv file (with 17 columns and over 14000 rows). So I think I can help you to do it.

$111 USD in 3 days
(1 Review)
1.4
okolobasii

All is clear. About 2 years ago I did the same kind of tasks. I'm not a professional now, but I need a quick job and I have no other occupation at the moment, so I can do it in 1-2 days.

$55 USD in 3 days
(0 Reviews)
0.0
rubanajjar

I can extract u the data to a csv file and will send a sample when u need, how many files are they,?thx

$45 USD in 2 days
(0 Reviews)
0.0
mikecrosa

I can do it without using python , what is the deadline?

$144 USD in 2 days
(0 Reviews)
0.0
sanjoydam

Hello, I am very interested about your project and I am ready to start now. Check the PMB please.

$100 USD in 3 days
(0 Reviews)
0.0