Create a blog list based on Technorati

AWARDED
Bids
4
Avg Bid (USD)
$164
Project Budget (USD)
$30 - $250

Project Description:
The job is to create a list of all blogs (~1.2 million) contained in the Technorati blog directory:

http://technorati.com/blogs/directory/

Each website must be categorised into the correct Technorati categories. Scrape the contact page URL and, where one is available, the email address.

The data will contain the following fields:

1) blog homepage URL
2) Technorati top level category eg: "entertainment"
3) Technorati 2nd level category eg: "celeb"
4) Technorati "Auth" score eg: 937
5) Contact us page URL (scraped via search engine or site crawl?)
6) Contact email address, if available
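
A minimal sketch of the email step, written in Python for brevity although the posting lists PHP. It assumes the contact page's HTML has already been downloaded. The name find_email and the regular expression are illustrative and not part of the brief; the pattern is deliberately loose and is not a full address validator.

```python
import re

# Deliberately loose pattern: good enough to spot candidate addresses,
# not a full RFC 5322 validator. (Assumption, not from the posting.)
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)


def find_email(page_html: str) -> str:
    """Return the first email-like string in the page, lower-cased, or ""."""
    match = EMAIL_RE.search(page_html)
    return match.group(0).lower() if match else ""


# Hypothetical page fragment used only for illustration.
print(find_email('<a href="mailto:Editor@Example.com">Contact</a>'))
```

Returning an empty string when nothing is found keeps the EMAIL column blank for sites without a visible address.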

The data must use the following column headings:

HOMEPAGE | TOP_LEVEL | 2ND_LEVEL | AUTH | CONTACT | EMAIL
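
One way to produce this layout is sketched below in Python, using the standard csv module with a pipe delimiter. The posting shows the headings separated by " | " and does not say whether the padding spaces are wanted in the file, so a bare "|" is an assumption here. The function name write_rows and the sample row are hypothetical.

```python
import csv
import io

# Column headings exactly as given in the posting.
COLUMNS = ["HOMEPAGE", "TOP_LEVEL", "2ND_LEVEL", "AUTH", "CONTACT", "EMAIL"]


def write_rows(rows, out):
    """Write a header line plus one pipe-delimited line per row dict."""
    writer = csv.DictWriter(
        out, fieldnames=COLUMNS, delimiter="|", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Keys missing from a row (e.g. EMAIL) are written as empty fields.
        writer.writerow(row)


sample = {
    "HOMEPAGE": "http://example.com",
    "TOP_LEVEL": "entertainment",
    "2ND_LEVEL": "celeb",
    "AUTH": "937",
    "CONTACT": "http://example.com/contact",
}
buffer = io.StringIO()
write_rows([sample], buffer)
print(buffer.getvalue())
```

The csv module quotes any field that itself contains a pipe, so an unusual URL does not shift the columns.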

Each row must contain the homepage, top level category, 2nd level category, auth score and contact URL. We accept that not every website has a contact page or a visible email address, but with a well-chosen search engine query the scraped data should be reasonably well populated and error free.
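
A completeness check along these lines could be run before delivery. This is a Python sketch under assumptions: rows are dicts keyed by the column headings, and missing_fields is a hypothetical helper. The posting does not settle whether a blank CONTACT should reject a row, so the sketch only reports gaps and leaves that decision to the caller.

```python
# Columns the posting says each row must contain; EMAIL is optional.
REQUIRED = ("HOMEPAGE", "TOP_LEVEL", "2ND_LEVEL", "AUTH", "CONTACT")


def missing_fields(row: dict) -> list:
    """Return the required columns that are absent or blank in this row."""
    missing = []
    for column in REQUIRED:
        value = row.get(column)
        if value is None or not str(value).strip():
            missing.append(column)
    return missing


# Hypothetical row with no contact page found.
row = {
    "HOMEPAGE": "http://example.com",
    "TOP_LEVEL": "entertainment",
    "2ND_LEVEL": "celeb",
    "AUTH": "937",
    "CONTACT": "",
}
print(missing_fields(row))
```

Counting how often each column appears in the results gives a quick measure of how well populated the data is.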

All data must be lower case. URLs must include the http:// prefix.
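
These two formatting rules can be applied in a single pass, sketched here in Python. The helper names are hypothetical. Two assumptions are made: a URL that already starts with https:// is left on that scheme, and the HOMEPAGE and CONTACT columns are the ones holding URLs. Lower-casing the path follows the posting literally, although paths are case-sensitive on some servers.

```python
URL_COLUMNS = ("HOMEPAGE", "CONTACT")  # assumed to be the URL-bearing columns


def normalise_url(url: str) -> str:
    """Lower-case a URL and prepend http:// when no scheme is present."""
    url = url.strip().lower()
    if not url:
        return ""  # leave blanks blank rather than emitting a bare "http://"
    if url.startswith(("http://", "https://")):
        return url  # assumption: an existing https:// scheme is acceptable
    return "http://" + url


def normalise_row(row: dict) -> dict:
    """Lower-case every value and fix up the URL columns."""
    clean = {key: str(value).strip().lower() for key, value in row.items()}
    for column in URL_COLUMNS:
        if clean.get(column):
            clean[column] = normalise_url(clean[column])
    return clean


print(normalise_row({"HOMEPAGE": "WWW.Example.com", "TOP_LEVEL": "Entertainment"}))
```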

Skills required:
Data Mining, PHP, Web Scraping
About the employer:
Verified