
Mahout Analytics

Use Apache Mahout to create a set of similarity ratings for webpages from a weblog.

1. Create script for parsing weblogs

1.1 There are six weblog files (one per month).

1.2 Transform the log information into a Mahout input file with one (userid, itemid) pair per line.

1.2.1 Create a single Mahout input file from any number of weblog files in a directory.

1.2.2 Treat each user session as a separate user. A session ends after 30 minutes of inactivity.

1.2.3 There are about 240 unique ids (cids), each identified explicitly in the URL [example: cid=1040]. The cid is the itemid.

1.2.4 Not all URLs contain a cid. URLs without one are ignored.

1.2.5 We need to remove spiders from the data. I immediately see cases of the Googlebot and the Baidu Spider.

1.2.6 After removing the obvious bots in 1.2.5, we may need to find sessions that view too many pages (3 standard deviations or more above the mean) and remove them as outliers or unidentified bots.
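
The steps in 1.2 can be sketched roughly as follows. This is only an illustration under several assumptions that the posting does not confirm: the logs are in Apache combined log format, a visitor is identified by IP address plus user agent, bots are recognised by user-agent keywords, and every request in a session counts as a page view for the outlier test. All function names are hypothetical. Session ids are emitted as integers because Mahout's file-based data model expects numeric user and item ids.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta
from pathlib import Path
from statistics import mean, pstdev

# ASSUMPTION: Apache combined log format; adjust the pattern to the real logs.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"\S+ (?P<url>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
CID_RE = re.compile(r'[?&]cid=(\d+)')
# ASSUMPTION: bots announce themselves in the user agent (1.2.5).
BOT_RE = re.compile(r'googlebot|baiduspider|bot|spider|crawl', re.IGNORECASE)
SESSION_GAP = timedelta(minutes=30)  # inactivity limit from 1.2.2


def read_logs(directory):
    """Yield lines from every file in the directory (1.2.1)."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open(errors="replace") as fh:
                yield from fh


def parse_line(line):
    """Return (visitor, timestamp, cid or None), or None for bots and bad lines."""
    m = LINE_RE.match(line)
    if m is None or BOT_RE.search(m.group("agent")):
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    cid = CID_RE.search(m.group("url"))
    visitor = (m.group("ip"), m.group("agent"))
    return visitor, ts, int(cid.group(1)) if cid else None


def build_sessions(lines):
    """Split each visitor's hits into sessions at gaps longer than 30 minutes."""
    hits = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            visitor, ts, cid = parsed
            hits[visitor].append((ts, cid))
    sessions = []
    for visitor in sorted(hits):
        last = None
        for ts, cid in sorted(hits[visitor], key=lambda h: h[0]):
            if last is None or ts - last > SESSION_GAP:
                sessions.append({"hits": 0, "cids": set()})
            sessions[-1]["hits"] += 1
            if cid is not None:  # URLs without a cid are not emitted (1.2.4)
                sessions[-1]["cids"].add(cid)
            last = ts
    return sessions


def drop_outliers(sessions, z=3.0):
    """Drop sessions whose hit count is z standard deviations or more above the mean (1.2.6)."""
    counts = [s["hits"] for s in sessions]
    if len(counts) < 2 or pstdev(counts) == 0:
        return list(sessions)
    limit = mean(counts) + z * pstdev(counts)
    return [s for s in sessions if s["hits"] < limit]


def to_mahout_rows(sessions):
    """Return 'userid,itemid' lines, numbering the sessions 1, 2, 3, ..."""
    rows, sid = [], 0
    for s in sessions:
        if not s["cids"]:
            continue  # a session with no cid pages contributes nothing
        sid += 1
        rows.extend(f"{sid},{cid}" for cid in sorted(s["cids"]))
    return rows
```

A possible call chain is `to_mahout_rows(drop_outliers(build_sessions(read_logs("logs/"))))`, where `logs/` is a placeholder directory name. Requests without a cid still count toward session timing here, so a visitor who browses non-cid pages is not split into two sessions. Whether that matches the intended behaviour is a design choice to confirm.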

2. Run Mahout creating outputs based on the following approaches:

2.1 PearsonCorrelationSimilarity

2.2 SlopeOneRecommender

2.2.1 Unweighted

2.2.2 Count weight

2.2.3 Standard deviation weight

2.3 SVDRecommender

2.4 KnnItemBasedRecommender

2.5 TreeClusteringRecommender

2.6 GenericItemBasedRecommender

3. Create outputs

3.1 Outputs will consist of item-item similarity scores using different algorithms

3.1.1 Column format will be cid1, cid2, score

3.1.2 Data ordered by cid1 ascending, cid2 ascending

3.1.3 Stored in comma-delimited .csv files, one per analysis method
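
The output format in 3.1 could be produced along these lines. This is a minimal sketch: `write_similarities` is a hypothetical helper, the scores are assumed to arrive as (cid1, cid2, score) tuples from whichever analysis method was run, and writing a header row is an assumption the posting does not state.

```python
import csv


def write_similarities(path, scores):
    """Write (cid1, cid2, score) rows sorted by cid1, then cid2, to a CSV file."""
    rows = sorted(scores, key=lambda r: (r[0], r[1]))
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["cid1", "cid2", "score"])  # ASSUMPTION: header wanted
        writer.writerows(rows)
    return rows
```

Calling it once per analysis method with a different file name, for example `write_similarities("pearson.csv", pearson_scores)`, gives one file per method as 3.1.3 asks. Both names in that call are placeholders.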

Skills: Algorithm, Apache, Java, Linux, Machine Learning

Project ID: #2791243
