Mahout Analytics

Status: In progress
Bids: 3
Avg Bid (USD): $583
Project Budget (USD): $250 - $750

Project Description:
Use Apache Mahout to create a set of similarity ratings for webpages from a weblog.
1. Create script for parsing weblogs
1.1 There are six weblog files (one per month).
1.2 Transform log information into Mahout input file (userid, itemid).
1.2.1 Create only one Mahout input file resulting from any number of weblog files in a directory
1.2.2 Treat each user session as a separate user. End a session after 30 minutes of inactivity.
1.2.3 There are about 240 unique ids (cids), each identified explicitly in the URL [example: cid=1040].
1.2.4 Not all URLs contain a cid; ignore those that do not.
1.2.5 We need to remove spiders from the data. I immediately see cases of the Googlebot and the Baidu Spider.
1.2.6 After removing the obvious bots in 1.2.5, we may also need to remove sessions that view too many pages (3 standard deviations or more above the mean) as outliers or unidentified bots.
2. Run Mahout, creating outputs based on the following approaches:
2.1 PearsonCorrelationSimilarity
2.2 SlopeOneRecommender
2.2.1 Unweighted
2.2.2 Count weight
2.2.3 Standard deviation weight
2.3 SVDRecommender
2.4 KnnItemBasedRecommender
2.5 TreeClusteringRecommender
2.6 GenericItemBasedRecommender
3. Create outputs
3.1 Outputs will consist of item-item similarity scores using different algorithms
3.1.1 Column format will be cid1, cid2, score
3.1.2 Data ordered by cid1 ascending, cid2 ascending
3.1.3 Stored in comma-delimited .csv files, one per analysis method
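
To make steps 1.2.2 through 1.2.6 concrete, here is a minimal Python sketch. It assumes the logs have already been parsed into (visitor, timestamp, url, user_agent) tuples; the raw log layout is not described in this posting, so that parsing step is left out. The function names, the substring match on the user agent, counting every hit toward session size, and the sequential integer session ids are all assumptions for illustration, not requirements.

```python
import statistics
from datetime import datetime, timedelta
from urllib.parse import urlparse, parse_qs

SESSION_GAP = timedelta(minutes=30)           # 1.2.2: inactivity limit
BOT_MARKERS = ("googlebot", "baiduspider")    # 1.2.5: assumed user-agent substrings


def is_bot(user_agent):
    """True if the user agent contains one of the known spider markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def extract_cid(url):
    """Return the cid in the URL query string as an int, or None (1.2.3, 1.2.4)."""
    values = parse_qs(urlparse(url).query).get("cid")
    if values and values[0].isdigit():
        return int(values[0])
    return None


def sessionize(hits):
    """Group non-bot hits into sessions, one list of URLs per session.

    hits: iterable of (visitor, timestamp, url, user_agent); this tuple
    layout is an assumption about the output of the log parser.
    """
    sessions = []
    last_seen = {}  # visitor -> (time of last hit, index into sessions)
    for visitor, ts, url, agent in sorted(hits, key=lambda h: (h[0], h[1])):
        if is_bot(agent):
            continue
        prev = last_seen.get(visitor)
        if prev is None or ts - prev[0] > SESSION_GAP:
            sessions.append([])
            idx = len(sessions) - 1
        else:
            idx = prev[1]
        sessions[idx].append(url)
        last_seen[visitor] = (ts, idx)
    return sessions


def drop_outliers(sessions, n_sd=3.0):
    """Drop sessions whose hit count is n_sd or more SDs above the mean (1.2.6)."""
    sizes = [len(s) for s in sessions]
    if len(sizes) < 2:
        return sessions
    sd = statistics.stdev(sizes)
    if sd == 0:
        return sessions
    cutoff = statistics.mean(sizes) + n_sd * sd
    return [s for s in sessions if len(s) < cutoff]


def mahout_rows(sessions):
    """Return (session_id, cid) pairs; each session acts as one user (1.2.2)."""
    rows = []
    for session_id, urls in enumerate(sessions, start=1):
        cids = sorted({c for c in map(extract_cid, urls) if c is not None})
        rows.extend((session_id, cid) for cid in cids)
    return rows
```

Writing each pair from mahout_rows(drop_outliers(sessionize(hits))) as one "userid,itemid" line would give the single input file described in 1.2.1. Whether session size should count all hits or only hits with a cid is not stated above, so that choice would need to be confirmed.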

Additional Project Description:
11/23/2012 at 7:24 KST
Clarifications:
1. Winner must provide the software that does this so that the system can be run again as data is updated (implicit and perhaps obvious, but just to be sure).
2. Winner must provide short documentation on how to use the system.

Small extension:
1. The system also examines pages with mid= in addition to cid= in the URL. The user specifies whether to use only pages with mid=, only pages with cid=, or both.
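
As an illustration of this extension, a small Python sketch of the id selection follows. The mode names ("cid", "mid", "both") and the function name are hypothetical. The posting does not say how a cid and a mid with the same number should be told apart when both are used, so the sketch returns the parameter name alongside each id and leaves that mapping open.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical mode names mapping to the URL parameters to examine.
MODES = {"cid": ("cid",), "mid": ("mid",), "both": ("cid", "mid")}


def extract_ids(url, mode="both"):
    """Return (parameter, id) pairs found in the URL for the selected mode."""
    query = parse_qs(urlparse(url).query)
    found = []
    for name in MODES[mode]:
        for value in query.get(name, []):
            if value.isdigit():
                found.append((name, int(value)))
    return found
```

A URL with neither parameter yields an empty list, which matches the rule in 1.2.4 that such pages are ignored.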

Skills required:
Algorithm, Apache, Java, Linux, Machine Learning
Additional Files: file6sample.txt
About the employer:
Verified


$750 in 10 days
$750 in 10 days
$250 in 7 days