
Scrape Reddit

This project is now closed with a project budget of N/A.

Project Description

Scrape Reddit in Java/Perl, dump into SQL tables. See the detailed description below.

## Deliverables

**The task is to scrape popular content from [url removed, login to view]:**

1. If you go to <[url removed, login to view]> you can see a list of the popular Reddit topics sorted by subscriber count.



2. We want to scrape the following topics.




a. ALL the topics from the top 50 EXCEPT the following:

announcements

blog

askreddit

Iama

[[url removed, login to view]][1]

bestof

sex

minecraft

doesanybodyelse

trees

skyrim

explainlikeimfive

truereddit




b. IN ADDITION, we want to scrape the following topics that are outside the top 50:

gadgets

LifeProTips

wikipedia

environment

cooking

history

art

games

philosophy

photography

sports

math

health

seduction

psychology




c. Also, from the [[url removed, login to view]][1] website we want the following feeds:

all

random




That's 54 topics in all.
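
The selection in 2 can be sketched as a small filter over the top-50 list. This is a minimal sketch, not a definitive implementation: the class name `TopicSelector` and its methods are hypothetical, the top-50 list must be supplied by the caller (it changes over time), and one exclusion is redacted in the list above, so it has to be passed in by hand along with the known ones.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: builds the final topic list from steps 2a, 2b and 2c.
final class TopicSelector {
    // The twelve exclusions from 2a that are readable in the posting.
    // The thirteenth is redacted there and is NOT guessed here.
    static final List<String> KNOWN_EXCLUSIONS = List.of(
        "announcements", "blog", "askreddit", "Iama", "bestof", "sex",
        "minecraft", "doesanybodyelse", "trees", "skyrim",
        "explainlikeimfive", "truereddit");

    // Topics from 2b, outside the top 50.
    static final List<String> EXTRAS = List.of(
        "gadgets", "LifeProTips", "wikipedia", "environment", "cooking",
        "history", "art", "games", "philosophy", "photography", "sports",
        "math", "health", "seduction", "psychology");

    // Feeds from 2c.
    static final List<String> FEEDS = List.of("all", "random");

    // Returns top50 minus the exclusions, then EXTRAS, then FEEDS.
    // Names are compared case-insensitively and duplicates are dropped.
    static List<String> select(List<String> top50, Collection<String> excluded) {
        Set<String> skip = new HashSet<>();
        for (String name : excluded) {
            skip.add(name.toLowerCase(Locale.ROOT));
        }
        Map<String, String> chosen = new LinkedHashMap<>(); // lowercased -> original
        for (String name : top50) {
            String key = name.toLowerCase(Locale.ROOT);
            if (!skip.contains(key)) {
                chosen.putIfAbsent(key, name);
            }
        }
        for (String name : EXTRAS) {
            chosen.putIfAbsent(name.toLowerCase(Locale.ROOT), name);
        }
        for (String name : FEEDS) {
            chosen.putIfAbsent(name.toLowerCase(Locale.ROOT), name);
        }
        return new ArrayList<>(chosen.values());
    }
}
```

With a 50-entry list and all 13 exclusions supplied, this gives 50 - 13 = 37 topics, plus 15 extras and 2 feeds, which matches the 54 stated above.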







3. For all the topics/feeds mentioned in 2, we want 1000 URLs each from the "top scoring" and "links from this month" category. For example, for the topic "funny", this would be 1000 URLs from this feed: [[url removed, login to view]funny/top/?sort=top&t=month][2]

(If you can't find 1000 URLs in the last month, we may have to fall back to the top links from the year.)
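
Collecting 1000 URLs per topic means paging through the feed. The sketch below shows one way the page URLs might be built, under several assumptions that are not in the posting: the host `https://www.reddit.com` (redacted above), the `.json` form of the listing, a page size of 100 via `limit`, and an `after` cursor taken from the previous page. The class `ListingUrls` is hypothetical; check the API documentation before relying on any of these parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds paged URLs for a topic's "top" listing.
final class ListingUrls {
    static final String BASE = "https://www.reddit.com"; // assumption: host is redacted in the posting
    static final int PAGE_SIZE = 100;                    // assumption: largest page the listing returns
    static final int TARGET = 1000;                      // URLs wanted per topic (step 3)

    // period is "month", or "year" for the fallback mentioned in step 3.
    // after is the cursor from the previous page, or null for the first page.
    static String pageUrl(String topic, String period, String after) {
        StringBuilder sb = new StringBuilder(BASE)
            .append("/r/").append(topic)
            .append("/top.json?sort=top&t=").append(period)
            .append("&limit=").append(PAGE_SIZE);
        if (after != null && !after.isEmpty()) {
            sb.append("&after=").append(URLEncoder.encode(after, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Number of pages needed to reach TARGET, rounding up.
    static int pagesNeeded() {
        return (TARGET + PAGE_SIZE - 1) / PAGE_SIZE;
    }
}
```

A crawler would call `pageUrl(topic, "month", null)` first, then keep passing the returned cursor until it has 1000 URLs or the listing runs out, at which point it could retry with `"year"`.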




4. Store your results in a MySQL table with the following schema.

<id>, <url>, <topic name>, <score count>, <comment count>, <date submitted if available>

ALL the URLs from all the topics are stored in this single table. This populated table is the main deliverable, along with your scripts. The table will be populated on the Amazon machine you bring up (as per 5 below), and you will copy the populated table to our server (our server details will be provided closer to task completion). The table will have 54 topics × 1000 URLs = 54,000 rows in all.
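
As an illustration of the single-table layout, here is a minimal JDBC sketch. The table name `reddit_links`, the column names, and the column types are all assumptions derived from the schema line above, not something the posting specifies. The submission date is nullable because it is only stored "if available". A MySQL JDBC driver would be needed at runtime to obtain the `Connection`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.sql.Types;
import java.time.Instant;

// Hypothetical helper: DDL and insert logic for the single results table.
final class LinkTable {
    // Assumed names and types; adjust to whatever the client confirms.
    static final String CREATE_SQL =
        "CREATE TABLE IF NOT EXISTS reddit_links ("
        + " id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,"
        + " url VARCHAR(2048) NOT NULL,"
        + " topic_name VARCHAR(64) NOT NULL,"
        + " score_count INT NOT NULL,"
        + " comment_count INT NOT NULL,"
        + " date_submitted DATETIME NULL"
        + ")";

    static final String INSERT_SQL =
        "INSERT INTO reddit_links"
        + " (url, topic_name, score_count, comment_count, date_submitted)"
        + " VALUES (?, ?, ?, ?, ?)";

    // Inserts one row; submitted may be null when the date is unavailable.
    static void insert(Connection conn, String url, String topic,
                       int score, int comments, Instant submitted) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, url);
            ps.setString(2, topic);
            ps.setInt(3, score);
            ps.setInt(4, comments);
            if (submitted == null) {
                ps.setNull(5, Types.TIMESTAMP);
            } else {
                ps.setTimestamp(5, Timestamp.from(submitted));
            }
            ps.executeUpdate();
        }
    }

    // Row count expected once every topic has its full quota.
    static int expectedRows(int topics, int urlsPerTopic) {
        return topics * urlsPerTopic;
    }
}
```

For 54,000 rows, batching the inserts (`addBatch` / `executeBatch`) would likely be faster than one statement per row, but the single-row form keeps the sketch short.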




5. Your table and scripts should reside and run on the Amazon machine(s) you bring up for this task. Ping Nick (cc'ed) with what you need and he'll give you our Amazon account details to bring up the machines.




6. You may have to deal with rate limiting or throttling by Reddit, so be prepared for this. You may need to use multiple machines for the crawl/screen scrape, and you may need multithreading to make your scripts finish in time.
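
One way to prepare for throttling is to share a single request spacer across all worker threads and to back off when a request is rejected. The sketch below is a minimal version of that idea; the class `Throttle`, the gap between requests, and the backoff parameters are all hypothetical choices, and the actual limits have to come from Reddit's documentation.

```java
// Hypothetical helper: spaces out requests across threads and computes retry delays.
final class Throttle {
    private final long minGapMillis;   // minimum time between two request starts
    private long nextAllowedMillis = 0L;

    Throttle(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    // Reserves the next free slot and returns how long the caller must wait (ms).
    // Synchronized so that concurrent workers never receive the same slot.
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextAllowedMillis);
        nextAllowedMillis = start + minGapMillis;
        return start - nowMillis;
    }

    // Blocks the calling thread until its reserved slot arrives.
    void acquire() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    // Exponential backoff for retry number attempt (0-based), capped at capMillis.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, capMillis);
    }
}
```

Each worker would call `acquire()` before every HTTP request and, on a throttling response, sleep for `backoffMillis(attempt, ...)` before retrying. If several machines are used, each one needs its own budget, since this spacer only coordinates threads within one process.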




7. We want your script to finish running in less than a day.
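
A rough estimate suggests the one-day limit is comfortable even on a single throttled connection. The sketch below works through the arithmetic; the page sizes (100 per page via an API, 25 per page when screen scraping) and the pace of one request every 2 seconds are assumptions for illustration, not figures from the posting.

```java
// Hypothetical helper: back-of-the-envelope crawl time estimate.
final class CrawlBudget {
    // Total page requests: topics times pages per topic (rounded up).
    static long requests(int topics, int urlsPerTopic, int pageSize) {
        int pagesPerTopic = (urlsPerTopic + pageSize - 1) / pageSize;
        return (long) topics * pagesPerTopic;
    }

    // Total wall-clock seconds for a single sequential, throttled client.
    static long seconds(long requests, double secondsPerRequest) {
        return (long) Math.ceil(requests * secondsPerRequest);
    }
}
```

Under these assumptions, 54 topics at 100 URLs per page is 540 requests, or about 18 minutes at 2 seconds each. At 25 URLs per page it is 2160 requests, or about 72 minutes. Retries and slow responses add to this, but neither case is close to a day.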




8. You may need to do more research on the API, but here is Reddit's documentation: [[url removed, login to view]reddit/wiki/API][3]. Figure out the most efficient way to do this: use the API if it exists, or screen scrape if not.




9. Use Java/Perl (preferred) on a Linux machine (preferred). Let us know if you HAVE to use something else and we'll evaluate.
