Data Collection from a newspaper and Congress Talks

Status: IN PROGRESS
Bids: 27
Avg Bid (USD): $171
Project Budget (USD): $30 - $250

Project Description:
Hello all:
I am a researcher at a university and I need someone who is experienced in crawling and data collection to help me with the following:

1. Crawl a newspaper website (I will provide the site) and scrape (1) the text of articles that appeared on the site in the past 1-2 years and (2) the user-written comments on each article.

2. Collect the information that is available on the news website for each user.

3. From the congress database (open to public) collect the congressional speech texts from the past 1-2 years.


If we can manage the above, I have follow-up projects that we could potentially work on together, provided we mutually agree. I am looking for someone I can work with long term if I am happy with the work.

Additional Project Description:
12/10/2011 at 10:20 SGT
SPECIFIC PROJECT DETAILS




Here is what I need in detail:

1. Go to NY Times

http://www.nytimes.com/

2. From the most popular list, find the 10 articles that are most viewed for that day:

http://www.nytimes.com/most-popular?src=hp1-0-M

3. For each link, collect the data
- Date
- Author Name
- Text of the Article itself
- Headline
- Comments for the Article (Commenter Name and Comment Text)

Repeat this for the past 365 days (1 year)
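
The steps above could be organized roughly as follows. This is only a sketch in Python (one of the listed skills): the `ArticleRecord` and `Comment` names, the `fetch_top_articles` hook, and the stubbed data are all hypothetical, and no real NYTimes page structure or API is assumed. The actual fetching and parsing would have to be written against the live site.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, Iterator, List


@dataclass
class Comment:
    commenter_name: str
    text: str


@dataclass
class ArticleRecord:
    day: date            # day the article was on the most-viewed list
    headline: str
    author: str
    text: str
    comments: List[Comment] = field(default_factory=list)


def past_days(end: date, n_days: int = 365) -> Iterator[date]:
    """Yield the n_days dates ending at `end` (inclusive), oldest first."""
    for offset in range(n_days - 1, -1, -1):
        yield end - timedelta(days=offset)


def collect(end: date,
            fetch_top_articles: Callable[[date], List[ArticleRecord]],
            n_days: int = 365,
            per_day: int = 10) -> List[ArticleRecord]:
    """Run the daily loop; fetch_top_articles is a hypothetical scraper hook."""
    records: List[ArticleRecord] = []
    for day in past_days(end, n_days):
        # Keep at most `per_day` articles for each day.
        records.extend(fetch_top_articles(day)[:per_day])
    return records


def fake_fetch(day: date) -> List[ArticleRecord]:
    """Stand-in for a real scraper: returns 12 made-up articles per day."""
    return [ArticleRecord(day, f"Headline {i}", "Some Author", "Body text")
            for i in range(12)]


if __name__ == "__main__":
    # The end date here is an arbitrary example value.
    rows = collect(date(2011, 10, 12), fake_fetch)
    print(len(rows))
```

Passing the fetcher in as a parameter keeps the date loop testable without network access; a real fetcher would replace `fake_fetch`.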


4. Then, for each commenter in the overall list whose name is clickable, collect the number of previous comments, the date of the earliest comment, and the number of people following and followed by this person. For example:

http://timespeople.nytimes.com/view/user/32473504/activities.html
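
Since the same commenter can appear under many articles, it may help to deduplicate profile links before visiting them. A minimal sketch, assuming every profile URL follows the pattern of the single example above (`/view/user/<numeric id>/...`); the helper names are hypothetical.

```python
import re
from typing import Iterable, List, Optional

# Pattern inferred from the one example URL in the posting (an assumption).
_USER_ID = re.compile(r"/view/user/(\d+)/")


def extract_user_id(profile_url: str) -> Optional[str]:
    """Return the numeric user id from a profile URL, or None if absent."""
    match = _USER_ID.search(profile_url)
    return match.group(1) if match else None


def unique_user_ids(profile_urls: Iterable[Optional[str]]) -> List[str]:
    """Deduplicate commenters by id, keeping first-seen order.

    None entries stand for commenters whose names were not clickable.
    """
    seen = set()
    ordered: List[str] = []
    for url in profile_urls:
        user_id = extract_user_id(url) if url else None
        if user_id and user_id not in seen:
            seen.add(user_id)
            ordered.append(user_id)
    return ordered
```

Each id returned by `unique_user_ids` would then be visited once to read the comment count, earliest comment date, and follower numbers.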

For about half of the articles there will be no comments, and for another half the commenters' links will not be clickable, so the actual final dataset is likely to be smaller.
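
As a rough back-of-the-envelope check on the expected size, the figures above can be combined as follows. The 50% share is the posting's own rough guess, the function name is hypothetical, and the article count is an upper bound because the same article may stay on the most-viewed list for more than one day.

```python
def estimate_article_counts(days: int = 365,
                            per_day: int = 10,
                            share_with_comments: float = 0.5):
    """Return (upper bound on articles, rough number of articles with comments)."""
    articles = days * per_day
    with_comments = int(articles * share_with_comments)
    return articles, with_comments


if __name__ == "__main__":
    total, commented = estimate_article_counts()
    print(total, commented)
```

The number of user profiles cannot be estimated the same way without knowing how many commenters each article has.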

Can you conduct this data collection for the past year?

Let me know if you think this is feasible, before I agree.

Thanks.

Skills required:
Python, Smarty PHP, Social Engine, Social Networking, Web Scraping
About the employer:
Verified