Use Apache Mahout to create a set of similarity ratings for webpages from a weblog.
1. Create script for parsing weblogs
1.1 There are six weblog files (one per month).
1.2 Transform log information into Mahout input file (userid, itemid).
1.2.1 Produce a single Mahout input file from any number of weblog files in a directory.
1.2.2 Treat each user session as a separate user; end a session after 30 minutes of inactivity.
1.2.3 There are about 240 unique ids (cids), each identified explicitly in the URL (example: cid=1040).
1.2.4 Not all URLs contain cid numbers; those without one are ignored.
1.2.5 We need to remove spiders from the data. I immediately see cases of the Googlebot and the Baidu Spider.
1.2.6 After removing the obvious bots in 1.2.5, we may need to flag sessions that view too many pages (3 standard deviations above the mean or more) and remove them as outliers or unidentified bots.
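The parsing and sessionization steps above could be sketched roughly as follows. Everything here is an assumption to be checked against the real data: the logs are taken to be in Apache combined format with a `.log` extension, sessions are keyed by client IP, and all function names (`parse_logs`, `sessionize`, `write_mahout_input`) are hypothetical.

```python
import csv
import re
from datetime import datetime, timedelta
from pathlib import Path

# Assumed log layout: Apache combined format. The field positions and the
# timestamp format below must be adjusted to match the actual weblogs.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
CID = re.compile(r'[?&]cid=(\d+)')
BOT_MARKERS = ("googlebot", "baiduspider")   # obvious bots per 1.2.5
SESSION_GAP = timedelta(minutes=30)          # inactivity cap per 1.2.2

def parse_logs(log_dir):
    """Yield (ip, timestamp, cid) for every request whose URL carries a cid."""
    for path in sorted(Path(log_dir).glob("*.log")):
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                m = LOG_LINE.match(line)
                if not m:
                    continue
                if any(b in m["agent"].lower() for b in BOT_MARKERS):
                    continue                 # drop obvious spiders (1.2.5)
                cid = CID.search(m["url"])
                if not cid:
                    continue                 # URLs without a cid are ignored (1.2.4)
                ts = datetime.strptime(m["ts"].split()[0], "%d/%b/%Y:%H:%M:%S")
                yield m["ip"], ts, int(cid.group(1))

def sessionize(events):
    """Assign a fresh session id whenever an ip is idle for > 30 minutes."""
    last_seen, session_of, next_id = {}, {}, 0
    rows = []
    for ip, ts, cid in sorted(events, key=lambda e: (e[0], e[1])):
        if ip not in last_seen or ts - last_seen[ip] > SESSION_GAP:
            session_of[ip] = next_id
            next_id += 1
        last_seen[ip] = ts
        rows.append((session_of[ip], cid))
    return rows

def write_mahout_input(rows, out_path):
    """Write the (userid, itemid) pairs Mahout expects (1.2)."""
    with open(out_path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)
```

The outlier pass in 1.2.6 would run between `sessionize` and `write_mahout_input`, dropping sessions whose page count exceeds the mean by three standard deviations.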
2. Run Mahout, creating outputs based on the following approaches:
2.1 Count weight
2.2 Standard deviation weight
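One way to drive this step is through Mahout's command-line ItemSimilarityJob. Mapping the count-weight and standard-deviation-weight approaches onto Mahout similarity classes is left to the implementer; the driver name (`itemsimilarity`) and the `--input`/`--output`/`--similarityClassname` options follow the classic Mahout 0.x CLI and should be verified against the installed version. The sketch below only assembles the command line; `mahout_command` and the method keys are hypothetical names.

```python
# Candidate Mahout similarity classes (names from Mahout 0.x; verify locally).
SIMILARITIES = {
    "cooccurrence": "SIMILARITY_COOCCURRENCE",
    "loglikelihood": "SIMILARITY_LOGLIKELIHOOD",
    "cosine": "SIMILARITY_COSINE",
}

def mahout_command(input_path, output_dir, method):
    """Build the argument list for one ItemSimilarityJob run."""
    return [
        "mahout", "itemsimilarity",
        "--input", input_path,
        "--output", output_dir,
        "--similarityClassname", SIMILARITIES[method],
    ]
```

Each analysis method would get its own run (and its own output directory), which keeps the per-method CSV requirement in section 3 straightforward.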
3. Create outputs
3.1 Outputs will consist of item-item similarity scores using different algorithms
3.1.1 Column format will be cid1, cid2, score
3.1.2 Data ordered by cid1 ascending, cid2 ascending
3.1.3 Stored in comma-delimited .csv files, one per analysis method
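The output requirements in 3.1 could be met by a small post-processing step. This assumes Mahout's usual text output of tab-separated `itemA itemB score` lines in part files; the function name `mahout_to_csv` is hypothetical.

```python
import csv

def mahout_to_csv(part_paths, csv_path):
    """Merge Mahout part files into one CSV ordered by cid1 then cid2 (3.1.2),
    with a cid1,cid2,score header (3.1.1). Assumes each input line is
    'itemA<TAB>itemB<TAB>score', the usual ItemSimilarityJob text output."""
    rows = []
    for path in part_paths:
        with open(path) as fh:
            for line in fh:
                a, b, score = line.split("\t")
                rows.append((int(a), int(b), float(score)))
    rows.sort(key=lambda r: (r[0], r[1]))
    with open(csv_path, "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["cid1", "cid2", "score"])
        w.writerows(rows)
```

Running this once per analysis method yields the one-file-per-method layout in 3.1.3.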
Additional Project Description:
11/23/2012 at 7:24 KST
1. Winner must provide the software that does this so that the system can be run again as data is updated (implicit and perhaps obvious, but just to be sure).
2. Winner must provide short documentation on how to use the system.
3. The system also examines pages with mid= in addition to cid= in the URL. The user specifies whether to use only pages with mid=, only pages with cid=, or both.
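The cid=/mid= selection could be handled by building the id-extraction pattern from the user's choice. This sketch assumes mid= values look like cid= values (numeric, e.g. mid=2010); `id_pattern` and the mode keys are hypothetical names.

```python
import re

def id_pattern(mode):
    """Return a compiled pattern for the user-selected id type:
    'cid', 'mid', or 'both'. Assumes numeric id values in the URL."""
    keys = {"cid": "cid", "mid": "mid", "both": "cid|mid"}[mode]
    return re.compile(r'[?&](?:%s)=(\d+)' % keys)
```

The parsing script would then use `id_pattern(mode)` wherever it currently matches cid= alone, so one flag switches the whole pipeline between the three modes.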