This project was awarded to tsendee for $750 USD.
Project Budget: $250-$750 USD
Use Apache Mahout to create a set of similarity ratings for webpages from a weblog.
1. Create script for parsing weblogs
1.1 There are six weblog files (one per month).
1.2 Transform log information into Mahout input file (userid, itemid).
1.2.1 Create only one Mahout input file resulting from any number of weblog files in a directory
1.2.2 Treat each user session as a separate user. End a session after 30 minutes of inactivity.
1.2.3 There are about 240 unique ids (cids), each identified explicitly in the URL [example: cid=1040]
1.2.4 Not all URLs contain a cid number; URLs without one are ignored.
1.2.5 Remove spiders from the data. Googlebot and the Baidu Spider are immediately visible in the logs.
1.2.6 After removing the obvious bots in 1.2.5, we may also need to remove sessions that view too many pages (3 standard deviations or more above the mean) as outliers or unidentified bots.
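As an illustration of steps 1.2.2 through 1.2.6, here is a minimal Python sketch. It assumes the log lines have already been parsed into (visitor, timestamp, url, user_agent) tuples, since the weblog format is not given here. The function names, the bot pattern, and the choice to count cid-bearing hits as "pages" are all assumptions for illustration, not part of the spec.

```python
import re
import statistics
from collections import defaultdict
from datetime import datetime, timedelta
from urllib.parse import urlparse, parse_qs

SESSION_GAP = timedelta(minutes=30)  # 1.2.2: inactivity cap
# Assumed pattern; it matches Googlebot and Baiduspider but may be too broad.
BOT_PATTERN = re.compile(r"bot|spider|crawl", re.IGNORECASE)


def extract_cid(url):
    """Return the cid query value as an int, or None if absent (1.2.3, 1.2.4)."""
    values = parse_qs(urlparse(url).query).get("cid")
    if values and values[0].isdigit():
        return int(values[0])
    return None


def is_bot(user_agent):
    """Flag obvious spiders by user agent (1.2.5)."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def sessionize(records):
    """Group hits into sessions and return {session_id: [cid, ...]}.

    All non-bot hits count toward the inactivity gap, but only hits
    that carry a cid are kept in the output.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, url, agent in records:
        if not is_bot(agent):
            by_visitor[visitor].append((ts, extract_cid(url)))

    sessions = {}
    session_id = 0
    for hits in by_visitor.values():
        hits.sort(key=lambda h: h[0])
        last_ts = None
        for ts, cid in hits:
            if last_ts is None or ts - last_ts > SESSION_GAP:
                session_id += 1  # each session becomes its own "user"
            last_ts = ts
            if cid is not None:
                sessions.setdefault(session_id, []).append(cid)
    return sessions


def drop_outliers(sessions, n_sd=3.0):
    """Drop sessions whose hit count is n_sd std devs or more above the mean (1.2.6)."""
    counts = [len(cids) for cids in sessions.values()]
    if len(counts) < 2:
        return dict(sessions)
    sd = statistics.pstdev(counts)
    if sd == 0:
        return dict(sessions)
    limit = statistics.mean(counts) + n_sd * sd
    return {sid: cids for sid, cids in sessions.items() if len(cids) < limit}


def mahout_lines(sessions):
    """Render de-duplicated "userid,itemid" lines for the Mahout input file (1.2)."""
    return [
        f"{sid},{cid}"
        for sid in sorted(sessions)
        for cid in sorted(set(sessions[sid]))
    ]
```

Feeding the records from every weblog file in the directory through `sessionize` in one pass would yield the single input file asked for in 1.2.1.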
2. Run Mahout creating outputs based on the following approaches:
2.3.2 Count weight
2.3.3 Standard deviation weight
3. Create outputs
3.1 Outputs will consist of item-item similarity scores using different algorithms
3.1.1 Column format will be cid1, cid2, score
3.1.2 Data ordered by cid1 ascending, cid2 ascending
3.1.3 Stored in comma-delimited .csv files, one per analysis method
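The output format in 3.1.1 through 3.1.3 could be produced along these lines. This is a sketch only: it assumes the similarity scores are already available as (cid1, cid2, score) tuples, and the function name and the header row are assumptions rather than stated requirements.

```python
import csv


def write_similarity_csv(rows, out):
    """Write (cid1, cid2, score) rows to a file-like object as CSV,
    ordered by cid1 ascending, then cid2 ascending."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["cid1", "cid2", "score"])  # header row is an assumption
    for cid1, cid2, score in sorted(rows, key=lambda r: (r[0], r[1])):
        writer.writerow([cid1, cid2, score])
```

Calling it once per analysis method, each time with a different output file, gives one .csv per method.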