Scrape Reddit in Java/Perl, dump into SQL tables. See the detailed description below.
**The task is to scrape popular content from [url removed, login to view]:**
1. If you go to <[url removed, login to view]> you can see a list of the popular Reddit topics sorted by subscriber count.
2. We want to scrape the following topics.
a. ALL the topics from the top 50 EXCEPT the following:
[[url removed, login to view]]
b. IN ADDITION, we want to scrape the following topics that are outside the top 50:
c. Also, from the [[url removed, login to view]] website we want the following feeds:
That's 54 topics in all.
3. For all the topics/feeds mentioned in 2, what we want is 1000 URLs each from the feed
sorted by "top scoring" and filtered to "links from this month". For example, for the topic "funny",
this would be 1000 URLs from this feed: [[url removed, login to view]funny/top/?sort=top&t=month]
(If you don't find 1000 URLs in the last month, we may have to go to top in the year.)
4. Store your results in a MySQL table with the following schema:
<id>, <url>, <topic name>, <score count>, <comment count>, <date submitted if available>
So ALL the URLs from all the topics would be stored in this single table. This populated table
is the main deliverable along with your scripts. The table will be populated on the Amazon machine
you bring up (as per 5 below), and you will copy the populated table to our server (server details will
be provided closer to task completion). So this table will have 54 topics x 1000 URLs = 54,000 rows in all.
5. Your table and scripts should reside and run on the Amazon machine(s) you bring up for this task.
Ping Nick (cc'ed) regarding what you want and he'll give you our Amazon account details
to bring up the machines.
6. You may have to deal with rate limiting or throttling by Reddit, so be prepared for this.
You may need multiple machines for the crawl/screen scrape, and you may need
multithreading to make your scripts finish in time.
7. We want your script to finish running in less than a day.
8. You may need to do more research, but here is Reddit's API documentation:
[[url removed, login to view]reddit/wiki/API] Figure out the most efficient way to do this: use the API
if it exists, or screen scrape if not.
9. Use Java/Perl (preferred) on a Linux machine (preferred). Let us know if you HAVE to use something else
and we'll evaluate.
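
For step 3, here is a rough sketch (not a required design) of how the paged feed URLs for one topic could be built in Java. It assumes the feed can be requested as JSON by adding a `.json` suffix, that each page returns at most 100 links via a `limit` parameter, and that the next page is requested with an `after` cursor taken from the previous response. Please verify all of this against the API documentation in step 8. The class name `TopListingUrls` and the `listingBase` parameter are made up for illustration; `listingBase` stands for whatever prefix precedes `funny/top/` in the feed URL above.

```java
/** Sketch: builds paged "top" feed URLs for one topic (see step 3). */
final class TopListingUrls {
    // Assumption: at most 100 links are returned per page.
    static final int PAGE_SIZE = 100;

    private TopListingUrls() {}

    /**
     * @param listingBase hypothetical prefix that precedes "<topic>/top" in the feed URL
     * @param topic       topic name, e.g. "funny"
     * @param period      "month", or "year" as the fallback mentioned in step 3
     * @param after       cursor from the previous page, or null for the first page
     */
    static String pageUrl(String listingBase, String topic, String period, String after) {
        StringBuilder url = new StringBuilder(listingBase)
                .append(topic)
                .append("/top.json?sort=top&t=").append(period)   // ".json" suffix is an assumption
                .append("&limit=").append(PAGE_SIZE);
        if (after != null) {
            url.append("&after=").append(after);
        }
        return url.toString();
    }

    /** Number of page requests needed to collect the wanted number of URLs. */
    static int pagesNeeded(int wanted) {
        return (wanted + PAGE_SIZE - 1) / PAGE_SIZE;   // ceiling division
    }
}
```

Under these assumptions, 1000 URLs per topic means 10 page requests per topic. If the month feed runs out before 1000, the same method can be called again with `period` set to `"year"`.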
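
For step 4, the single table could look roughly like the sketch below. The table name `reddit_links`, the column names, and the column types are our assumptions for illustration; only the six fields themselves come from the schema in step 4. The date column is nullable because the date is only stored "if available".

```java
/** Sketch: DDL and insert statement for the single results table (see step 4). */
final class LinkTableSql {
    private LinkTableSql() {}

    // Hypothetical table and column names; types are a guess, adjust as needed.
    static final String CREATE_TABLE =
          "CREATE TABLE IF NOT EXISTS reddit_links ("
        + " id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,"
        + " url VARCHAR(2048) NOT NULL,"
        + " topic_name VARCHAR(64) NOT NULL,"
        + " score_count INT NOT NULL,"
        + " comment_count INT NOT NULL,"
        + " date_submitted DATETIME NULL"
        + ")";

    // Parameterized insert, intended for use with a JDBC PreparedStatement;
    // id is omitted because it is auto-generated.
    static final String INSERT_ROW =
          "INSERT INTO reddit_links"
        + " (url, topic_name, score_count, comment_count, date_submitted)"
        + " VALUES (?, ?, ?, ?, ?)";

    /** Expected row count when every topic yields its full quota. */
    static int expectedRows(int topics, int urlsPerTopic) {
        return topics * urlsPerTopic;
    }
}
```

`LinkTableSql.expectedRows(54, 1000)` gives the 54,000 rows mentioned in step 4, which is a handy sanity check to run against `SELECT COUNT(*)` before copying the table to our server.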
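
For steps 6 and 7, one simple way to stay under a rate limit while still using several threads is to share a single throttle that spaces requests a fixed interval apart. The sketch below is only an illustration: the class name `RequestThrottle` is made up, and the right interval depends on whatever limits Reddit actually enforces, which we have not specified here.

```java
/** Sketch: spaces requests at least a fixed interval apart across all threads. */
final class RequestThrottle {
    private final long minIntervalMillis;
    private long nextFreeMillis = 0L;   // earliest time the next request may start

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Reserves the next request slot.
     *
     * @param nowMillis current time in milliseconds
     * @return how long the caller should wait before sending its request
     */
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextFreeMillis);
        nextFreeMillis = start + minIntervalMillis;
        return start - nowMillis;
    }

    /** Blocks the calling thread until its reserved slot arrives. */
    void acquire() throws InterruptedException {
        long waitMillis = reserve(System.currentTimeMillis());
        if (waitMillis > 0) {
            Thread.sleep(waitMillis);
        }
    }
}
```

Each worker thread (for example, tasks submitted to a `java.util.concurrent.ExecutorService`) would call `acquire()` on the shared throttle before every fetch. As a rough estimate, if each topic needs 10 page requests, then 54 topics need 540 requests; at an assumed one request every 2 seconds that is 1,080 seconds, or about 18 minutes, well inside the one-day limit in step 7.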