
Project Description:
Scrape Reddit in Java/Perl, dump into SQL tables. See the detailed description below.
## Deliverables
**The task is to scrape popular content from Reddit.com:**

1. If you go to <http://redditlist.com/> you can see a list of the popular reddit topics sorted by subscriber count.
2. We want to scrape the following topics:
a. ALL the topics from the top 50 EXCEPT the following:
announcements
blog
askreddit
Iama
reddit.com
bestof
sex
minecraft
doesanybodyelse
trees
skyrim
explainlikeimfive
truereddit
b. IN ADDITION, we want to scrape the following topics that are outside the top 50:
gadgets
LifeProTips
wikipedia
environment
cooking
history
art
games
philosophy
photography
sports
math
health
seduction
psychology
c. Also, from the reddit.com website itself, we want the following feeds:
all
random
That's 54 topics in all (37 from the top 50, 15 additional topics, and 2 feeds).
3. For all the topics/feeds mentioned in 2, we want 1000 URLs each from the
"top scoring" and "links from this month" category. For example, for the topic "funny",
this would be 1000 URLs from this feed: <http://www.reddit.com/r/funny/top/?sort=top&t=month>
(If you don't find 1000 URLs in the last month, we may have to go to the top of the year.)
4. Store your results in a MySQL table with the following schema:
<id>, <url>, <topic name>, <score count>, <comment count>, <date submitted if available>
ALL the URLs from all the topics will be stored in this single table. This populated table
is the main deliverable, along with your scripts. The table will be populated on the Amazon machine
you bring up (as per 5 below), and you will copy the populated table to our server (server details will
be provided closer to task completion). The table will have 54 topics × 1000 URLs = 54,000 rows in all.
5. Your table and scripts should reside and run on the Amazon machine(s) you bring up for this task.
Ping Nick (cc'ed) with what you need and he'll give you our Amazon account details
so you can bring up the machines.
6. You may have to deal with rate limiting or throttling by reddit, so be prepared for this.
You may need to use multiple machines for the crawl/screen scrape, and
multithreading to make your scripts finish in time.
7. We want your script to finish running in less than a day.
8. You may need to do more research, but here is reddit's API documentation:
<https://github.com/reddit/reddit/wiki/API>
Figure out the most efficient way to do this: use the API if it exists; if not, screen scrape.
9. Use Java/Perl (preferred) on a Linux machine (preferred). Let us know if you HAVE to use something else
and we'll evaluate.
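
For step 2, the topic selection (top 50 minus the exclusions, plus the extra topics and the two feeds) can be sketched as below. This is only an illustration: the class name `TopicList` and the method `build` are hypothetical, and the top-50 list itself is assumed to be supplied by the caller after reading redditlist.com, since it changes over time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: builds the final topic list described in step 2.
final class TopicList {
    // Step 2a: topics to drop from the top 50.
    static final List<String> EXCLUDED = Arrays.asList(
        "announcements", "blog", "askreddit", "Iama", "reddit.com", "bestof", "sex",
        "minecraft", "doesanybodyelse", "trees", "skyrim", "explainlikeimfive", "truereddit");

    // Step 2b: topics to add from outside the top 50.
    static final List<String> EXTRA = Arrays.asList(
        "gadgets", "LifeProTips", "wikipedia", "environment", "cooking", "history", "art",
        "games", "philosophy", "photography", "sports", "math", "health", "seduction",
        "psychology");

    // Step 2c: site-wide feeds.
    static final List<String> FEEDS = Arrays.asList("all", "random");

    // top50 is assumed to hold the 50 names currently shown on redditlist.com.
    static List<String> build(List<String> top50) {
        Set<String> skip = new HashSet<>();
        for (String name : EXCLUDED) {
            skip.add(name.toLowerCase(Locale.ROOT));
        }

        List<String> candidates = new ArrayList<>(top50);
        candidates.addAll(EXTRA);
        candidates.addAll(FEEDS);

        // Compare names case-insensitively and keep each topic only once.
        Set<String> seen = new HashSet<>();
        List<String> topics = new ArrayList<>();
        for (String name : candidates) {
            String key = name.toLowerCase(Locale.ROOT);
            if (!skip.contains(key) && seen.add(key)) {
                topics.add(name);
            }
        }
        return topics;
    }
}
```

If all 13 excluded names appear in the supplied top 50 and none of the extra topics does, `TopicList.build(top50)` returns 37 + 15 + 2 = 54 entries.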
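
For step 3, one possible way to page through a topic's "top this month" feed is reddit's JSON listing form of the same URL. The sketch below only builds the page URLs. It assumes the listing accepts a `.json` suffix, a `limit` of at most 100 items per page, and an `after` cursor taken from the previous page's response; check these against the API documentation from step 8 before relying on them. The names `ListingUrl`, `top`, and `pagesNeeded` are hypothetical.

```java
// Hypothetical helper: builds listing URLs for step 3.
final class ListingUrl {
    // Assumed maximum number of items reddit returns per listing page.
    static final int PAGE_SIZE = 100;

    // period is "month" normally, or "year" for the fallback mentioned in step 3.
    // after is the cursor returned by the previous page, or null for the first page.
    static String top(String topic, String period, String after) {
        StringBuilder url = new StringBuilder("http://www.reddit.com/r/")
            .append(topic)
            .append("/top/.json?sort=top&t=").append(period)
            .append("&limit=").append(PAGE_SIZE);
        if (after != null && !after.isEmpty()) {
            url.append("&after=").append(after);
        }
        return url.toString();
    }

    // Number of pages needed to collect the wanted number of URLs (rounds up).
    static int pagesNeeded(int wanted) {
        return (wanted + PAGE_SIZE - 1) / PAGE_SIZE;
    }
}
```

Under these assumptions, 1000 URLs per topic means `ListingUrl.pagesNeeded(1000)` = 10 requests per topic, or 540 requests for all 54 topics.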
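
For step 4, the schema could be expressed through JDBC roughly as follows. The table name `reddit_links`, the column names, and the column sizes are assumptions chosen for illustration; only the six fields themselves come from the schema in step 4. The date column is nullable because the submission date is stored only "if available".

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.sql.Types;

// Hypothetical helper: DDL and insert statement for the single results table.
final class LinkTable {
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS reddit_links ("
        + " id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,"
        + " url VARCHAR(2048) NOT NULL,"
        + " topic_name VARCHAR(64) NOT NULL,"
        + " score_count INT NOT NULL,"
        + " comment_count INT NOT NULL,"
        + " date_submitted DATETIME NULL"
        + ")";

    // id is omitted because MySQL assigns it automatically.
    static final String INSERT =
        "INSERT INTO reddit_links"
        + " (url, topic_name, score_count, comment_count, date_submitted)"
        + " VALUES (?, ?, ?, ?, ?)";

    // Fills one row; createdEpochSeconds may be null when no date is available.
    static void bind(PreparedStatement ps, String url, String topic, int score,
                     int comments, Long createdEpochSeconds) throws SQLException {
        ps.setString(1, url);
        ps.setString(2, topic);
        ps.setInt(3, score);
        ps.setInt(4, comments);
        if (createdEpochSeconds == null) {
            ps.setNull(5, Types.TIMESTAMP);
        } else {
            // Converted using the JVM's default time zone.
            ps.setTimestamp(5, new Timestamp(createdEpochSeconds * 1000L));
        }
    }
}
```

With a MySQL JDBC driver on the classpath, `CREATE_TABLE` would be run once, then `bind` followed by `addBatch()` for each scraped link, giving 54 × 1000 = 54,000 rows when every topic yields its full 1000 URLs.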
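
For steps 6 and 7, a shared throttle is one simple way to keep several worker threads under a request-rate limit. The sketch below spaces requests by a fixed minimum interval. The class name `Throttle` is hypothetical, and the 2-second interval used in the estimate is an assumption about what reddit tolerates, not a confirmed figure.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: lets callers proceed no faster than one per interval.
final class Throttle {
    private final long minIntervalNanos;
    private long nextSlot = System.nanoTime();

    Throttle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    // Blocks until this caller's slot arrives. Safe to share between threads.
    void acquire() {
        long waitNanos;
        synchronized (this) {
            long now = System.nanoTime();
            if (nextSlot - now < 0) {
                nextSlot = now; // the reserved slot has already passed
            }
            waitNanos = nextSlot - now;
            nextSlot += minIntervalNanos; // reserve the following slot
        }
        if (waitNanos > 0) {
            try {
                TimeUnit.NANOSECONDS.sleep(waitNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while throttled", e);
            }
        }
    }

    // Rough lower bound on crawl time, in seconds, for a single throttle.
    static long estimateSeconds(int topics, int pagesPerTopic, long intervalMillis) {
        return (long) topics * pagesPerTopic * intervalMillis / 1000;
    }
}
```

Each worker would call `acquire()` before every HTTP request. Assuming 10 pages per topic and one request every 2 seconds, `Throttle.estimateSeconds(54, 10, 2000)` gives 1080 seconds, about 18 minutes, which leaves a wide margin under the one-day limit in step 7 even with retries.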