Project Description:
BUDGET: 300$
YOU HAVE 36H to do the job. Do not bid if you're not up to the challenge.
Source target: http://forums.unfiction.com/forums/
Features required:
1- Extract *all* threads from a discussion folder
PARAMETERS:
- Folder path (ex: http://forums.unfiction.com/forums/index.php?f=10 )
- Number of pages (ex: 2 pages to crawl, or * for all pages)
- Thread Filter - Regex to include or exclude threads by title. (Ex: with include "trailhead", you only extract data for threads with the word trailhead in the title)
OUTPUT IN CSV:
- Thread ID (ex: t=31495)
- Thread Title (ex: [Trailhead] Behind The Yellow Curtain BTYC Ep5)
- Author ID (ex: u=7899)
- Replies (ex: 1241)
- Views (ex: 123421)
- Last Post Date (ex: 2011/12/30)
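
A minimal sketch of the Thread Filter and CSV output described above, in Python. It assumes the crawler has already produced one dict per thread; the function names, the column names, and the `viewtopic.php` link used in the usage note are illustrative assumptions, not part of the spec.

```python
import csv
import io
import re
from urllib.parse import urlparse, parse_qs

# Assumed column names for the six output fields listed above.
FIELDS = ["thread_id", "title", "author_id", "replies", "views", "last_post_date"]


def thread_id_from_url(url):
    """Return the thread ID in 't=NNN' form from a thread link, or None."""
    values = parse_qs(urlparse(url).query).get("t")
    return f"t={values[0]}" if values else None


def keep_thread(title, pattern, exclude=False):
    """Thread Filter: case-insensitive regex search on the thread title."""
    matched = re.search(pattern, title, flags=re.IGNORECASE) is not None
    return not matched if exclude else matched


def threads_to_csv(threads, pattern=None, exclude=False):
    """Return CSV text for the threads that pass the filter (all if no pattern)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in threads:
        if pattern is None or keep_thread(row["title"], pattern, exclude):
            writer.writerow(row)
    return out.getvalue()
```

For example, `thread_id_from_url("http://forums.unfiction.com/forums/viewtopic.php?t=31495")` gives `"t=31495"` (the `viewtopic.php` path is assumed here), and `threads_to_csv(rows, pattern="trailhead")` keeps only the rows whose title contains "trailhead".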
2- Batch Extract thread stats
PARAMETERS:
- Load a list of thread IDs from a CSV file. (Ex: t=31495, t=24481, etc.)
OUTPUT IN CSV:
- Thread ID (ex: t=31495)
- Number of posts in thread
- Number of unique authors in thread
- First Post Date (ex: 2011/12/30)
- Last Post Date (ex: 2011/12/30)
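
The batch stats above could be computed roughly as follows once the posts of a thread have been scraped. This is only a sketch: it assumes each post is available as an `(author_id, "YYYY/MM/DD HH:MM")` pair, and the function and key names are made up for illustration.

```python
import csv
import io
from datetime import datetime


def load_thread_ids(csv_text):
    """Read thread IDs such as 't=31495, t=24481' from CSV text."""
    ids = []
    for row in csv.reader(io.StringIO(csv_text)):
        ids.extend(cell.strip() for cell in row if cell.strip())
    return ids


def thread_stats(thread_id, posts):
    """Aggregate one thread. posts is a list of (author_id, date string) pairs."""
    # Assumed timestamp format, matching the 'Post Date & time' example.
    dates = [datetime.strptime(stamp, "%Y/%m/%d %H:%M") for _, stamp in posts]
    authors = {author for author, _ in posts}
    return {
        "thread_id": thread_id,
        "post_count": len(dates),
        "unique_authors": len(authors),
        # Empty threads get blank dates instead of raising an error.
        "first_post_date": min(dates).strftime("%Y/%m/%d") if dates else "",
        "last_post_date": max(dates).strftime("%Y/%m/%d") if dates else "",
    }
```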
3- Deep extract thread stats
PARAMETERS:
- Unique Thread ID. (Ex: t=31495)
- Boolean (yes/no) - Strict word count. (Exclude "Quote" content and signatures from the word count and from spoiler / href tag detection)
OUTPUT IN CSV:
- Post ID
- Post Date & time (ex: 2011/12/30 23:09)
- Author ID
- Author Name
- Word count
- Spoiler tag present (true/false)
- Video tag present (true/false)
- URL href present (true/false)
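
One possible reading of the strict word count, sketched in Python. It assumes the post body has already been reduced to BBCode-style text with `[quote]`, `[spoiler]`, `[video]`/`[youtube]` and `[url]` tags, and that the signature follows a line of underscores; the real forum markup may differ, so the patterns below are placeholders to adapt. Nested quotes are not handled.

```python
import re

# All patterns below are assumptions about the markup, not taken from the forum.
QUOTE_RE = re.compile(r"\[quote[^\]]*\].*?\[/quote\]", re.IGNORECASE | re.DOTALL)
SIGNATURE_RE = re.compile(r"\n_{5,}\n.*\Z", re.DOTALL)
SPOILER_RE = re.compile(r"\[spoiler[^\]]*\]", re.IGNORECASE)
VIDEO_RE = re.compile(r"\[(?:video|youtube)[^\]]*\]", re.IGNORECASE)
URL_RE = re.compile(r"\[url[^\]]*\]|https?://", re.IGNORECASE)
TAG_RE = re.compile(r"\[/?[a-z]+[^\]]*\]", re.IGNORECASE)


def analyze_post(body, strict=False):
    """Return the word count and tag flags for one post body."""
    text = body
    if strict:
        # Strict mode: drop quoted content and the signature before counting
        # words and before looking for spoiler / video / URL tags.
        text = QUOTE_RE.sub(" ", text)
        text = SIGNATURE_RE.sub("", text)
    words = TAG_RE.sub(" ", text).split()
    return {
        "word_count": len(words),
        "spoiler": bool(SPOILER_RE.search(text)),
        "video": bool(VIDEO_RE.search(text)),
        "url": bool(URL_RE.search(text)),
    }
```

With `strict=True`, a spoiler tag that appears only inside quoted text is reported as `False`, which matches the "exclude Quote content from tag detection" requirement as understood here.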
4- Batch Extract users stats
PARAMETERS:
- Load a list of user IDs from a CSV file. (Ex: u=7899)
OUTPUT IN CSV:
- Joined Date
- Total Posts
- Posts per day
- Location
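
If the profile page does not show posts per day directly, it could be derived from the joined date and the total posts. A small sketch, assuming the joined date is available in the same `YYYY/MM/DD` format as the other dates; the function name is hypothetical.

```python
from datetime import date, datetime


def posts_per_day(total_posts, joined, today=None):
    """Average posts per day since joining, rounded to two decimals."""
    # joined is assumed to be a 'YYYY/MM/DD' string.
    joined_date = datetime.strptime(joined, "%Y/%m/%d").date()
    today = today or date.today()
    # Count at least one day so a same-day signup does not divide by zero.
    days = max((today - joined_date).days, 1)
    return round(total_posts / days, 2)
```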
NICE TO HAVE:
- Throttle requests per second (so I don't have any impact on the website while extracting the stats)
- Automatically crawl everything and extract all data into an Access database with 4 tables and joins to store all data.
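
The request throttle could look something like this: a minimal sketch that enforces a minimum delay between consecutive requests. The class name and its parameters are illustrative, and the clock and sleep functions are injectable only to make it easy to test.

```python
import time


class Throttle:
    """Allow at most max_per_second calls by sleeping between them."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each page fetch, with for example `Throttle(max_per_second=1)`, keeps the crawler to roughly one request per second.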