I am looking to develop a platform similar to [url removed, login to view], but for a different industry. I have a collection of approximately 1,200 blogs that can be used to seed it.
The site would behave much like Techmeme in that it would:
1) scrape websites/blogs for data
2) collect and index the data
3) cluster articles/posts that relate to each other, either algorithmically or by way of machine learning
4) present the data in real time in a structured and easy to navigate way
5) provide a backend that would allow an administrator/user to do some "editorializing", such as tagging one article/post in a cluster as the top story. The backend also needs to manage all aspects of the website: sponsorships, users, database updates, adding new URLs to the scraper, etc.
6) provide a means to organize and promote sponsorships throughout the site.
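To make the clustering requirement in step 3 more concrete, here is a minimal sketch of one way related posts could be grouped. It is only an illustration in plain Python, not the Mahout-based approach mentioned in this posting: it assumes TF-IDF weighting, cosine similarity, and a greedy single-pass grouping against each cluster's first post. All names (`cluster_posts`, `SAMPLE_POSTS`, `STOP_WORDS`) and the 0.3 threshold are hypothetical and would need tuning on real data.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "for", "on", "is", "its", "at"}

# Hypothetical sample headlines standing in for scraped posts.
SAMPLE_POSTS = [
    "Apple unveils new iPhone with faster chip",
    "New iPhone from Apple gets a faster chip and better camera",
    "Senate votes on the budget bill",
    "Budget bill passes Senate vote",
]


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens, minus stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a small positive weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_posts(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each post joins the first existing cluster whose seed (first member) is at
    least `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `docs`.
    """
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    print(cluster_posts(SAMPLE_POSTS))  # → [[0, 1], [2, 3]]
```

A single-pass scheme like this is cheap enough to run as new posts arrive, which suits the real-time presentation in step 4, but its output depends on arrival order. A production system would more likely recluster periodically with a more robust method.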
Based on my research, this project could be accomplished using a combination of Apache Nutch, Solr, Hadoop, and Mahout.
This will likely be deployed on a platform like Amazon Web Services (AWS).
Type of Website: News Media / Informational Content
Other Skills: hadoop, mahout, nutch, solr, lucene, java
11 freelancers are bidding an average of $4,436 for this job
I have very good knowledge of Hadoop and of Amazon EC2 setup and optimization, and I also have very good experience building community websites, from database design to front-end design,