Use Apache Mahout to create a set of similarity ratings for webpages from a weblog.
1. Create script for parsing weblogs
1.1 There are six weblog files (one per month).
1.2 Transform the log information into a Mahout input file of (userid, itemid) pairs.
1.2.1 Create a single Mahout input file from all the weblog files in a directory, however many there are.
1.2.2 Treat each user session as a separate user. End a session after 30 minutes of inactivity.
1.2.3 There are about 240 unique ids (cids); each value appears explicitly in the URL (example: cid=1040).
1.2.4 Not all URLs contain a cid; URLs without one are ignored.
1.2.5 We need to remove spiders from the data. The Googlebot and the Baidu Spider are immediately visible in the logs.
1.2.6 After removing the obvious bots in 1.2.5, we may also need to find sessions that view too many pages (3 standard deviations or more above the mean) and remove them as outliers or unidentified bots.
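
The parsing rules in 1.2.2 through 1.2.6 can be sketched roughly as follows. This is a minimal sketch, not a finished script: it assumes the log lines have already been parsed into (visitor, timestamp, url, user_agent) tuples, that a visitor key such as the client IP is available, and that bots can be recognised from the user agent string. The function names, the BOT_PATTERN expression and the record layout are all hypothetical; the real log format is not described here.

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean, pstdev
from urllib.parse import urlparse, parse_qs

SESSION_GAP = timedelta(minutes=30)  # 1.2.2: inactivity limit
# 1.2.5: assumed user-agent pattern; extend it as more spiders turn up.
BOT_PATTERN = re.compile(r"googlebot|baiduspider|bot|spider|crawl", re.I)


def extract_cid(url):
    """Return the cid query parameter as an int, or None if absent (1.2.4)."""
    values = parse_qs(urlparse(url).query).get("cid")
    if values and values[0].isdigit():
        return int(values[0])
    return None


def sessionize(records):
    """Map (visitor, timestamp, url, user_agent) records to (session_id, cid) pairs.

    Each session gets its own id, so it acts as a separate Mahout user.
    Ids may have gaps, because sessions with no cid pages produce no pairs.
    """
    pairs = []
    last_seen = {}   # visitor -> timestamp of that visitor's previous hit
    session_of = {}  # visitor -> current session id
    next_id = 1
    for visitor, ts, url, agent in sorted(records, key=lambda r: r[1]):
        if BOT_PATTERN.search(agent):
            continue  # 1.2.5: drop obvious spiders
        previous = last_seen.get(visitor)
        if previous is None or ts - previous > SESSION_GAP:
            session_of[visitor] = next_id  # start a new session
            next_id += 1
        last_seen[visitor] = ts  # any hit counts as activity
        cid = extract_cid(url)
        if cid is not None:
            pairs.append((session_of[visitor], cid))
    return pairs


def drop_outlier_sessions(pairs, n_sd=3.0):
    """Remove sessions whose page count is n_sd or more SDs above the mean (1.2.6)."""
    counts = Counter(session for session, _ in pairs)
    sizes = list(counts.values())
    spread = pstdev(sizes) if len(sizes) > 1 else 0.0
    if spread == 0:
        return list(pairs)  # all sessions the same size: nothing to remove
    limit = mean(sizes) + n_sd * spread
    return [(s, c) for s, c in pairs if counts[s] < limit]
```

The resulting pairs can then be written one per line as userid,itemid to form the single Mahout input file from 1.2.1. Whether a hit without a cid should keep a session alive is a design choice; this sketch treats it as activity.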
2. Run Mahout to create outputs based on the following approaches:
2.3.2 Count weight
2.3.3 Standard deviation weight
3. Create outputs
3.1 Outputs will consist of item-item similarity scores using different algorithms
3.1.1 Column format will be cid1, cid2, score
3.1.2 Data ordered by cid1 ascending, cid2 ascending
3.1.3 Stored in comma-delimited .csv files, one per analysis method
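
The output layout in 3.1.1 through 3.1.3 could be produced along these lines. This is only a sketch: it assumes the similarity scores from each Mahout run are already available as (cid1, cid2, score) tuples, and the function name write_similarity_csv is hypothetical. No header row is written, which is also an assumption, since the spec only names the column order.

```python
import csv


def write_similarity_csv(scores, handle):
    """Write (cid1, cid2, score) rows sorted by cid1, then cid2, as CSV."""
    rows = sorted(scores, key=lambda row: (row[0], row[1]))
    writer = csv.writer(handle)  # comma-delimited by default
    writer.writerows(rows)
    return rows
```

To get one file per analysis method, call the function once per method with a different file opened via open(path, "w", newline="").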