Mass Data Extraction Project - over 15 million pages

This project received 13 bids from talented freelancers with an average bid price of $3423 USD.

Project Description

We need a very experienced data extraction expert with "underground" skills to extract the entire content of a well-known social network that contains over 10 million profiles. This includes all profiles, all publicly available profile data, networks of friends, communities, guestbook messages, pictures, and any other information that a user has made publicly available in their profile, including their entire network of friends. You must provide all of this information in a MySQL database file and also be able to create an exact, navigable clone of the social network.

Note that this social network is not Myspace, and its pages contain no multimedia, video, or personalized HTML.

You should be very familiar with the top social networks, understand their structures well, and understand why they currently prevent spiders from crawling their data in the usual way.

Successful completion of this project will require an in-depth understanding of the social network and a strategy for dealing with IP detection and blocking, so that terabytes of data can be extracted without being blocked. You may need to set up a massive network of computers with rotating IP addresses to complete this.

The deliverables for this project are all of the extracted data in a predefined database format and a web-based clone of the entire social network site. We will provide all the hard drive space necessary to store this. We will need a requirements list from you covering all hardware, software, and other tools necessary to complete this job.

Because of the size of this job and the closed nature of this site, which prevents traditional spiders from working, this job will require some very creative thinking and knowledge.

You will need to build custom scripts and programs to complete this job.

Finally, after the initial extraction of all current content is complete, we need the capability to repeat this process in the future using the same scripts and strategy.

Please demonstrate that you clearly understand the issues involved in completing this. Demonstrate that you understand the challenges of logging in to the social network, extracting data, dealing with IP issues, and all other challenges that will come up.

This is a massive job that calls for a well-coordinated, well-planned effort and some "underground" skills and resources.

In summary, the final deliverable should be an exact, navigable clone of the social network that lets a visitor browse through all profiles and view pictures, networks of friends, communities, etc.

To a casual observer, it should appear as if they were navigating through an exact clone of the social network site. All of this information will be served from a database containing all 10,000,000+ records of the publicly available information we can extract.

Please take a look at the top social networks (hi5, Orkut, Facebook, and Myspace) and provide your insights into the differences between them and how you would approach data extraction on each. Your ability to discuss the differences between these sites intelligently will clearly demonstrate your ability to pull off this project.

Please be very creative in your thinking and in the resources that could be made available to complete this job.

Finally, we expect several follow-on jobs and potentially long-term ongoing work as a result of this project. So please take the time to demonstrate your ability to complete this, and also let us know what limitations might exist in completing this job.
